Implementing a bloom filter in elixir

Arpan Ghoshal
6 min readNov 19, 2020

--

What are bloom filters?

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.

False-positive matches are possible, but false negatives are not — in other words, a query returns either “possibly in set” or “definitely not in set.”

Well, that's all they are.

Implementation

Now, let's dig a little bit into the implementation details of a bloom filter.

Well, an empty Bloom filter is just a bit array of m bits, all set to 0.

There must also be k different hash functions defined, each of these hash functions must produce a hash value in the range (0 to m), that is the hash value must be an index of our bit array of size m.

So we have…

m = size of the bit array

k = number of hash functions

k << m

Typically, k is a constant, much smaller than m, which is proportional to the number of elements to be added; the precise choice of k and the constant of proportionality of m are determined by the intended false positive rate of the filter.

Note: You can’t remove a data item from a filter because you might zero a bit that was also set by another data item and so mark it as not being in the filter as well.

Adding an element into a bloom filter

To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.

Query the bloom filter

To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions.

If any of the bits at these positions is 0, the element is definitely not in the set

If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive.

When should I use it and how does it work?

Well, there can be lots of use cases, like say you have very large datasets that typically don’t fit in memory and you want to check if it contains given elements.

Sequentially searching such a big dataset can be very expensive both in terms of memory and in terms of CPU.

Well, if you are okay with some amount of false positives then bloom filters can be the perfect candidate to solve such a problem.

How does it work?

In order to have a better intuition of how bloom filters work, we can think that each value that has to be stored in the bloom filter should produce a unique set of “k” indexes(array locations) when passed through our “k” hash functions.

Now when we want to check if a value is present in the bloom filter or not, we again pass it through the same “k “hash functions which give us “k” indexes(array positions), we now check if the bit in all these array positions are set to one or not.

If all the bits are set there is a high probability that the element is present in the bloom filter.

Using this technique we can say whether a large data set contains an element in constant time as opposed to performing a full search which might be expensive.

A picture is worth a thousand words…

Add and query a bloom filter

Here, S is a bloom filter. We are storing the values “x” and “y” in the bloom filter. In order to do so, we pass them through our 3(k=3) hash functions g, f, and h. They give us three array indexes which we set to 1(if already not set).

When querying to check if a value “x” is present in the bloom filter we will use a similar strategy as I explained above.

Elixir implementation

Well, here you go….

We can easily implement a bloom filter in elixir, there are already some good preexisting implementations that you can use like this one.

However, if you want to have your own boom filter implementation and store them in the database for later use keep reading…

Here, is an implementation of bloom filters in elixir…

Breakdown of the code

This code deals with binaries in elixir, so you might need to brush up on your elixir binaries before getting into the code.

Here, the get_mask/1 the function takes an integer value that is to be inserted into the bloom filter and creates an empty bloom filter of the size “m” by creating an elixir binary of “m” bits like…

mask = <<0::size(@m)>>

Then we pass the given integer element through our “k” hash functions.

Here, we use an implementation of the Murmur3 hash function, but remember we need “k” different hash function.

Instead of having to find “k” different hash function we seed the same hash functions using k different seed values so we have “k” hash functions now.

The hash functions might produce an array index beyond our array(bitstring) size so we take the remainder after dividing the value produced by our hash function by “m”(our array/bitstring/bloom filter ) size like…

element        
|> Murmur.hash_x86_32(elem(@seed, i))
|> rem(@m)

The set_bit/2 function is used to set the nth bit inside our elixir binary and the bit_set?/2function checks if a particular bit is set or not.

This code uses pattern matching in elixir on the binaries to achieve this.

For example…

<<prefix::size(pos), val::size(1), rest::bits>> = bit_string# New bit string with the (pos)th bit set<<prefix::size(pos), 1::size(1), rest::bits>

Suppose to set the 55th bit in this code we pattern match out the prefix(the first 54 bits), then we set our 55th bit, we keep the rest of the bits as it is(keep in mind indexing starts from 0).

The member?/2 function just uses the bit_set?/2 function to check if a given element is present in the bloom filter or not.

That's it, that's all you need to implement a bloom filter in elixir 🎉.

Storing to the database

Now, if you need to store your bloom filter in the database, keep reading.

Using the above approach we now have a bitstring in elixir which can be stored in an efficient manner in the database.

In PostgreSQL, we have ways for storying such bitstrings(array of bits).

In elixir create a migration defining a column having type varbit which will store our bloom filter like…

defmodule MyApp.Repo.Migrations.CreateBloomFilter do
use Ecto.Migration
def change do
create table(:bloom_filter) do
add :filter, :varbit
end
end
end

Now, fetching a bloom filter from the database is easy but how do we update it.

Well, say you want to store 100 elements inside the bloom filter, using the code I gave before we can achieve this easily without having to run many queries.

First, we create a bitmask for each of our 100 elements by calling the get_mask/1 function that we defined earlier for each of our 100 elements.

Now we do a logical OR operation on the 100 bitmasks we got to get a single bitmask with all the information for our 100 elements.

The elixir code for doing this might look like…

Enum.reduce(elements, BloomFilter.new(), fn element, mask ->
element
|> BloomFilter.get_mask()
|> BloomFilter.bor(mask)
end)

Here, we are using the bor/2 function that we defined earlier to logically OR two given bitmasks.

Finally, we can update the bloom filter stored in the database by doing a logical OR of the bitmask we have got with the bitstring stored in the database, this will update our bloom filter in the database and add all the 100 elements into it 🎉.

In elixir, the query for this might look like…

from(
bf in BloomFilter,
update: [set: [filter: fragment("filter | ?", ^mask)]]
)
|> Repo.update_all([])

We use the | operator directly on the database for ORing the bits.

That's all, thanks for going through this blog and I hope it helps you with our adventures exploring the mysteries of bloom filters.

Good day 😊

--

--

Arpan Ghoshal

I like to build new stuff and I have passion for coding