Bloom Filters

What are bloom filters?

Bloom filters are used all of the time. Ever wonder why data structures such as hashmaps and hashsets maintain an O(1) search time? The fast search time is enabled by bloom filters.

Bloom filters provide an easy way to determine whether an item is contained in the set. Bloom filters make use of hash functions to get this done.

Bloom Filters:

Creating the Bloom Filter

A bloom filter is made up of a bunch of bits (0 or 1). 1 represents that that item is present, and 0 means the item isn’t present. To build the bloom filter, every value will be passed into a hash function. A hash function will take anything and turn it into a number. The bit with the same number as the value outputted from the hash function will be switched on.

 

The hash function must be deterministic, meaning that if the same input is passed through, the function will always return the same thing.

When a value is requested, it will be passed through the hash function. The bit with the same value as the hash function will be checked. If the bit is on, then it is present. For Python users, the mmh3 hash function will work well for bloom filters. It also provides functionality, which will be useful later.

Hash Collisions:

It is possible that a hash collision can occur, causing an issue. A hash collision is when two different inputs both produce the same result from the hash function. This can cause confusion. As adding one value will also add another unwanted value by accident because they are both controlled by the same bit,

 

This is counteracted by having multiple hash functions. A value will be passed through all the hash functions, and all of the bits will be flipped on. When the item is queried, it will return multiple values from the hash functions. Then, the program will check if all the lights are flipped.

As mentioned, the mmh3 hash function also has different seeds. If different values are given as seeds, the hash functions will produce different values.

Deletions:

Deleting the bloom filter is a little more complicated. It is not as simple as just switching the bits that a certain item hashes to. There might be multiple different items based on that bit. So, an accompanying data structure like a hashtable might be needed to maintain how many items rely on that bit. For every value that the item hashes, lower the values in the hashtable by 1. If no value relies on the bit, then flipping the bit is now allowed without any issues.