Bloom filter

A bloom filter is a probabilistic data structure that can be used to determine very quickly which data has already occurred in a data stream and which is occurring for the first time. For this purpose, a “fingerprint” of the data record read is left in a single-line hash table using a suitable number of hash functions .

Developed in 1970 by Burton H. Bloom to check spelling and to break words at the end of a line, Bloom filters are now often used in database administration and for routing in networks . In contrast to comparable algorithms , bloom filters require very little storage space . However, the following peculiarities are also of crucial importance for applicability: Key values that have been recorded once in the hash table remain there. Furthermore, false positive results are possible, i. H. what the filter accepts was most likely contained in the key values; however, what he rejects was definitely not included.

Working principle

An example of a bloom filter for the set { x , y , z }. The colored arrows indicate the position in the bit array of the bloom filter. The element w is not contained in the set { x , y , z } because it includes positions whose value is 0. In this picture m = 18 and k = 3.

A bloom filter consists of an m -digit bit array (which is initially filled with zeros) and k different hash functions with a value range from 0 to m -1. It is particularly important that the values supplied by the hash functions are evenly distributed ; cryptographic properties are not required. For this reason, simple and very fast hash procedures (e.g. MurmurHash or FNV ) are often used.

Initially, the Bloom filter is assigned its vocabulary . For this purpose, for the value to be stored by means of the k hash functions k calculated hash values, each of the determined k hash values representing a position in the bit array. Each of these positions is now set to 1, regardless of whether this position was previously a 1 or a 0.

If it is now to be checked whether any value is contained in the vocabulary, its k hash values are again determined. If one of the k digits of the bit array of the filter contains a 0 where the hash of the value to be checked has a 1, then it is certain that the value is not contained in the filter. If there is a 1 in each of the k places in the filter where the value to be checked also has a 1, it is very likely that the value was saved in the filter. However, a Bloom filter can never say with absolute certainty that a certain value is really stored, as there is a certain risk due to collisions of incorrectly classifying values as belonging ( false-positive classification ). However, this risk can be reduced by a suitable choice of k , m , and the capacity of the filter (i.e. the maximum number of values to be stored).

Since a certain position in the bit array can have been set to 1 by adding different values, it is not possible with conventional bloom filters to delete values that have been added afterwards. So-called counting bloom filters provide a remedy here .

Hash functions with a constant amount of images and a variable source amount are generally not bijective , which is why it is impossible to recover the values they contain using classic bloom filters. It can only be determined with a certain (possibly high) probability whether or not a given word is contained. This is a major disadvantage in many areas of application, since z. B. also the search with wildcards becomes impossible.

Bloom filters can be useful when sensitive data is to be stored. For example, the method can be used to reliably rule out during a search that a person who has just been checked is being searched for, without having to keep personal data in plain text.

example

Assume that the filter is 10 bits long ( m = 10 ) and has 3 hash functions ( k = 3 ). The vocabulary consists of the words "der", "die" and "das".

At first the filter is empty:

0 0 0 0 0  0 0 0 0 0

The words are now added to the filter one after the other. The results of the 3 hash functions are determined for each of the words.

"of the": Assuming that the hash functions return the values 8, 2 and 4 for the word "der". The positions 8, 2 and 4 in the array are set to 1. The array now looks like this:

0 1 0 1 0  0 0 1 0 0

"the": Assuming that the hash functions return the values 7, 4 and 1 for the word “die”. The positions 7, 4 and 1 in the array are set to 1. The array now looks like this:

1 1 0 1 0  0 1 1 0 0

"the": Assume that the hash functions return the values 9, 2 and 8 for the word "das". Positions 9, 2 and 8 in the array are set to 1, so the array looks like this:

1 1 0 1 0  0 1 1 1 0

We now want to check the existence of different words in the filter.

"who": For the word "who" it is assumed that the calculation of the hash functions results in the values 10, 1 and 3. Positions 10 and 3 in the array contain a 0, so the word is not in the filter, which is correct.
"the": Since the hash function delivers the same result for the same inputs, the same values result for the word "that" as before when inserting, namely 9, 2 and 8. All these positions are set to 1 in the array, so the word could be in the filter stored, which is also the case.
"she": For the word “she” it is assumed that the calculation of the hash functions results in the values 9, 1, 4. Since all of these positions in the array are set to 1, the word could be stored in the filter. However, since the word "she" was not added to the filter above, it is a false positive result.

It should be noted that this is a deliberately constructed extreme example which does not yet allow any conclusions to be drawn about the quality of the process.

variants

Numerous variants and extensions of bloom filters have been proposed in the scientific literature. The article Network applications of bloom filters: A survey by Broder and Mitzenmacher offers a good overview . These extensions include, for example, counting bloom filters , which allow elements to be removed at the expense of more space. Other bloom filter variants attempt to extend the advantages of the bloom filter to stream data (e.g. stable bloom filter ), while further variants offer options for probabilistic multiplicity estimates of inserted elements (e.g. bloomier filter ).

In the area of databases, buffered bloom filters are also used in order to be able to determine frequently accessed storage locations , especially in the case of solid state discs ( SSD ).

Applications

The proxy server Squid uses bloom filters to match cache entries.

The NoSQL databases Bigtable , Apache HBase and Apache Cassandra use Bloom filters to avoid expensive hard disk access when querying non-existent key values.

The Google Chrome web browser maintains a bloom filter with the signatures of dangerous websites and checks when a URL is entered whether it is included in this filter.
The cryptocurrency Bitcoin uses Bloomfilter to verify transactions in lightweight clients (SPV)

literature

Burton H. Bloom: Space / Time Trade-offs in Hash Coding with Allowable Errors . In: Communications of the ACM . tape 13 , no. 7 July 1970, ISSN 0001-0782 , pp. 422-426 ( PDF ).
Michael Mitzenmacher, Eli Upfal: Probability and Computing: Randomized Algorithms and Probabilistic Analysis . Cambridge University Press, 2005, ISBN 0-521-83540-2 ( Google Books [accessed May 21, 2013]).

Individual evidence

↑ Andrei Broder, Michael Mitzenmacher: Network applications of bloom filters: A survey . In: Internet Mathematics . 2002, p. 636-646 ( psu.edu [accessed May 21, 2013]).
↑ Michael Mitzenmacher: Compressed bloom filters . In: IEEE / ACM Transactions on Networking . 2002, p. 604-612 , doi : 10.1109 / TNET.2002.803864 .
↑ Fan Deng, Davood Rafiei: Approximately detecting duplicates for streaming data using stable bloom filters . In: Proceedings of the 2006 ACM SIGMOD international conference on Management of data . 2006, p. 25-36 , doi : 10.1145 / 1142473.1142477 .
↑ Bernard Chazelle , Joe Kilian, Ronitt Rubinfeld, Ayellet Tal: The Bloomier filter: an efficient data structure for static support lookup tables . In: Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms . 2004, p. 30-39 ( acm.org [accessed May 21, 2013]).
↑ Mustafa Canim, George A. Mihaila, Bishwaranjan Bhattacharjee: Buffered Bloom Filters on Solid State Storage . In: First International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (ADMS) . 2010 ( vldb.org [PDF; accessed April 23, 2019]).
↑ SquidFaq / CacheDigests. Retrieved May 21, 2013 .
↑ Alex Yakunin: Nice Bloom filter application. (No longer available online.) Archived from the original on October 27, 2010 ; Retrieved on May 21, 2013 (via Bloom Filtering for Safe Browsing in Google Chrome).

Web links

Orestes Bloomfilter - Open-source Java implementation of Bloomfilter and Counting Bloomfilter with extensive distribution, persistence and hashing options.
Libhashish - Bloomfilter library for C and C ++
A Garden Variety of Bloom filters - Comprehensive, English-language explanation of various Bloom filter variants
Table with false positive rates for various configurations

[1] Andrei Broder, Michael Mitzenmacher: Network applications of bloom filters: A survey . In: Internet Mathematics . 2002, p. 636-646 ( psu.edu [accessed May 21, 2013]).

[2] Michael Mitzenmacher: Compressed bloom filters . In: IEEE / ACM Transactions on Networking . 2002, p. 604-612 , doi : 10.1109 / TNET.2002.803864 .

[3] Fan Deng, Davood Rafiei: Approximately detecting duplicates for streaming data using stable bloom filters . In: Proceedings of the 2006 ACM SIGMOD international conference on Management of data . 2006, p. 25-36 , doi : 10.1145 / 1142473.1142477 .

[4] Bernard Chazelle , Joe Kilian, Ronitt Rubinfeld, Ayellet Tal: The Bloomier filter: an efficient data structure for static support lookup tables . In: Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms . 2004, p. 30-39 ( acm.org [accessed May 21, 2013]).

[5] Mustafa Canim, George A. Mihaila, Bishwaranjan Bhattacharjee: Buffered Bloom Filters on Solid State Storage . In: First International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (ADMS) . 2010 ( vldb.org [PDF; accessed April 23, 2019]).

[6] SquidFaq / CacheDigests. Retrieved May 21, 2013 .

[7] Alex Yakunin: Nice Bloom filter application. (No longer available online.) Archived from the original on October 27, 2010 ; Retrieved on May 21, 2013 (via Bloom Filtering for Safe Browsing in Google Chrome).