Let’s implement a Bloom Filter(onatm.dev) |
Let’s implement a Bloom Filter(onatm.dev) |
> 6 days after I started to write this post, it is proved that the math behind false positive probability was wrong for 50 years
This isn't true.
The "Bloom math is wrong!" clamor started at least 12 years ago with [1], which brazenly cites Bloom, then proceeds to give an analysis of Knuth's subtly different filter from TAOCP instead.
There are issues with [1], but its exact false positive formula was eventually proven correct. Its argument is incompatible with Bloom's original filter, though, and Knuth explicitly states that the false positive formula he gives is approximate.
In other words, nothing was proved wrong. The analysis doesn't apply to Bloom and doesn't contradict Knuth.
Now, [1] may be the first paper to give the exact false positive formula for Knuth's filter (and Bloom's was given in 1979 by [2]), so it's not entirely without merit. But in practice none of this matters anyway. At useful sizes the approximations are accurate. Knuth's is a lower, Bloom's an upper bound, and the two become functionally equivalent very quickly.
1. https://cglab.ca/~morin/publications/ds/bloom-submitted.pdf
2. Algorithm-A in https://sci-hub.tw/10.1109/proc.1979.11543
But looking up the 1M elements directly in a BTree will usually lead to 1-2 cache misses and depending on the data element size, may even fit in the L3 cache. Then, what is this bloom filter saving me from?
That being said, in the case of just 1M elements, a Bloom filter won't buy you much. For a bloom filter to be truly effective, you need all of the following:
1. Many queries come back negative; that is, you spend a lot of time searching for keys that are not in the DB. This is not the typical case. When the typical web app searches for a _key_ the key is typically in the DB. (as opposed to "find all sales on November 7th" or whatever, which Bloom filters cannot assist with)
2. The actual query is disproportionately expensive. Unless a Bloom filter can save you a disk/network call, it's generally not worth the extra overhead.
3. You rarely/never remove items, and your data grows slowly.
Many problems that can be solved with a Bloom filter can also be solved by simply buying more RAM. Many problems that can be solved with a Bloom filter can also be solved by using a faster lookup. (everybody overlooks hash tables. I don't know why.) Bloom filters are _not_ a general purpose algorithm. You really need to grok your data before pulling the trigger on a Bloom filter. (don't forget to benchmark)
It is probably worse than you think since a hash table can also replace the Bloom filter in its "false positives allowed" filtration role. Not sure if that's what you meant.
If all you save is the "fingerprint" (or a B-bit truncated hash code of the keys) then you get a very similar probabilistic data structure. Further, if you put these small integer fingerprints in an open addressed hash table with linear probing then you get a (very likely) single-cache miss test for membership. This alternate structure was, in fact, also analyzed by Burton Bloom (in the very same paper he introduced his famous filter). He found fingerprint filters to be slower only on very unrealistic memory systems with neither cache nor "word fetch" structure where access to every bit had unit cost no matter its location. If you can grab a big batch of bits in unit cost the analysis changes a lot. Modern systems grab 512..1024 bits at the same cost as 1...
This fingerprint filter alternative does take 2..4x more space depending on scale and allowed false positive rates, but that space is organized better in terms of caches. Most would choose to spend that space to save k-1 cache misses of time in today's era of 60-120 ns memory and multi GHz CPUs. It is slightly easier to implement a large bitvector than an array of B-bit numbers, depending on B. { It's pretty trivial for B=8*m, though. ;-) } It's kind of crazy that hash tables for this purpose fell into "also ran/obscure" status...
My immediate thought here was that this wasn't very space efficient as it requires over 9 bits per item?
I'd have to dig more into this though to confirm or deny my intuition!
Edit: this page agrees with the blog post and shows some interesting graphs https://hur.st/bloomfilter/?n=1000000&p=0.01&m=&k=
If you are looking for a key value then chances are it does exist in the DB so the bloom filter won't save you anything as you are then hitting slower storage to read the item that the key represents anyway.
Bloom filters are suited for situations where you are looking for larger non-keyed data such as URLs, where there is a fairly good chance that what you are looking for isn't in the data yet so the "no false negatives" property is important for saving further accesses (which is why cache arrangements is where you'll often find these structures used).
The similar technique but in a reverse I've seen in a factory myself, when 5-7 closest people at the gate queue pressed their buttons, and a controller collected their outstanding passes from the other side and then checked against their faces very quickly. That greatly saved time in a queue.
Bikeshedding a little, but that probably shouldn't be an array. hashers[0] and hashers[1] aren't semantically equivalent and the only usage that array gets is just getting element by index. Something like `startHasher` and `stepHasher` would probably better convey intent.
Bloom filters are just chaining hash tables, without the chain.
Has the state of the art moved on recently, or is it still a case of fighting TikZ?