Block structured Bloom filter #690
base: main
Conversation
Force-pushed from c59568d to 23971a2
Force-pushed from e232688 to 1ff45af
A test failure on 1ff45af: cabal run bloomfilter-tests -- -p prop_calc_size_fpr_bits --quickcheck-replay="(SMGen 18024522305972736904 8705882024453529731,95)"
Data.BloomFilter
Classic
calculations
prop_calc_size_fpr_bits: FAIL
*** Failed! Falsified (after 1 test and 11 shrinks):
BitsPerEntry 2.336869198112799
NumEntries 1000
0.3306894755756382 /= 0.3307128865771576 and not within (abs) tolerance of 1.0e-6
Use --quickcheck-replay="(SMGen 18024522305972736904 8705882024453529731,95)" to reproduce.
1 out of 1 tests failed (0.00s)

One solution would be to increase the tolerance.
I haven't finished my review yet, but I'm posting these PR comments here just so that I don't lose them between now and when I continue the review. The comments are also not polished; they're mostly draft notes, so no need to look at or resolve them, @dcoutts. I'll curate the comments as part of my next leg of reviewing.
EDIT: I've deleted most comments in favour of re-adding them in the next review
Looking very good! I have a bunch of comments / requests for clarification, but nothing that should hold back the PR, I think.
There are also a number of comments that I left on only one of the bloom filter types, but some of those probably also apply to the other type of bloom filter.
 mkBloomFromAlloc :: Hashable a => RunBloomFilterAlloc -> BloomMaker a
-mkBloomFromAlloc alloc xs = runST $ do
-    mb <- newMBloom n alloc
-    mapM_ (MBloom.insert mb) xs
-    Bloom.unsafeFreeze mb
+mkBloomFromAlloc alloc = Bloom.fromList policy
   where
-    n = LSMT.NumEntries $ length xs
+    policy = case alloc of
+      RunAllocFixed bits -> Bloom.policyForBits (fromIntegral bits)
+      RunAllocRequestFPR fpr -> Bloom.policyForFPR fpr
We're duplicating logic that is also in RunAcc.new, so it can get out of sync.
1be79d3 is a little hard to review by just looking at the diff because it moves, removes, and changes code at the same time. It might be nicer to move code in one commit, and make changes in another
Sorry.
-- | Query a mutable Bloom filter for membership. If the value is
-- present, return @True@. If the value is not present, there is
-- /still/ some possibility that @True@ will be returned.
elem :: Hashable a => a -> MBloom s a -> ST s Bool
elem elt mb = elemHashes (makeHashes elt) mb

elemHashes :: forall s a. CheapHashes a -> MBloom s a -> ST s Bool
elemHashes !ch MBloom { numBits = m, numHashes = k, bitArray = v } =
    go 0
  where
    go :: Int -> ST s Bool
    go !i | i >= k = return True
          | otherwise = do let !idx' = evalHashes ch i
                           let !idx = idx' `rem` fromIntegral m
                           b <- V.unsafeRead v idx
                           if b
                             then go (i + 1)
                             else return False
These functions disappear -- is the idea that one should now unsafeFreeze >> elem?
Without these functions, we're making it harder for use cases where someone wants to simultaneously query and insert into a bloom filter
I suppose we could keep them (for other use cases, as you say), but then we need a good name for them, since I'm squashing the API into one flat namespace, not separating mutable and immutable.
Got any good names? :-)
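For illustration, a minimal sketch of the freeze-then-query workaround mentioned above, assuming the flattened API keeps Bloom.unsafeFreeze and an immutable Bloom.elem; the helper name here is hypothetical, not part of the PR.

-- Hypothetical helper: query an MBloom by freezing it first.
elemMutable :: Hashable a => a -> MBloom s a -> ST s Bool
elemMutable x mb = do
    b <- Bloom.unsafeFreeze mb   -- shares rather than copies the bit array (hence "unsafe")
    pure (Bloom.elem x b)        -- query via the immutable interface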
-- | The version of the format used by 'serialise' and 'deserialise'. The
-- format number will change when there is an incompatible change in the
-- library, such that deserialising and using the filter will not work.
-- This can include more than just changes to the serialised format, for
-- example changes to hash functions or how the hash is mapped to bits.
--
-- Note that the format produced does not include this version. Version
-- checking is the responsibility of the user of the library.
--
formatVersion :: Int
formatVersion = 1000
Ideally we'd have tests to detect breakage in the serialisation formats. Maybe we can add a TODO?
-- | The version of the format used by 'serialise' and 'deserialise'. The
-- format number will change when there is an incompatible change in the
-- library, such that deserialising and using the filter will not work.
-- This can include more than just changes to the serialised format, for
-- example changes to hash functions or how the hash is mapped to bits.
--
-- Note that the format produced does not include this version. Version
-- checking is the responsibility of the user of the library.
--
formatVersion :: Int
formatVersion = 1000
We should probably also mention that the format versions for classic and blocked do not overlap
Indeed we should.
-- Note that while x can cover the full Word64 range, since the result is
-- less than n, and since n was an Int then the result fits an Int too.
It's not immediately clear to me why the result is less than n
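For context, a hedged sketch of the multiply-shift range reduction that this comment appears to describe. The helper name is hypothetical, and the real code presumably uses a native 64x64 -> 128-bit multiply rather than Natural.

import Data.Bits (shiftR)
import Data.Word (Word64)
import Numeric.Natural (Natural)

-- Map a hash x uniformly into [0, n) without division (illustrative only).
fastRange :: Int -> Word64 -> Int
fastRange n x =
    fromIntegral (((fromIntegral x :: Natural) * fromIntegral n) `shiftR` 64)
-- Why the result is less than n: x < 2^64, so x * n < 2^64 * n,
-- and therefore (x * n) `div` 2^64 < n. Since n is an Int, the result fits an Int.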
, testGroup "equality"
  [ testProperty "doesn't care about leftover bits a" $
      BI.Bloom 1 48 (BV64.BV64 (VP.singleton 0xffff_0000_1234_5678)) ===
      BI.Bloom 1 48 (BV64.BV64 (VP.singleton 0xeeee_0000_1234_5678))

  , testProperty "doesn't care about leftover bits b" $
      BI.Bloom 1 49 (BV64.BV64 (VP.singleton 0xffff_0000_1234_5678)) =/=
      BI.Bloom 1 49 (BV64.BV64 (VP.singleton 0xeeee_0000_1234_5678))
  ]
]
This test is now trivial because we're using the Eq instance from PrimArray?
Yes exactly.
 ) where

 import Control.Exception (assert)
 import Control.Monad.ST
-import Data.BloomFilter (Bloom)
+import Data.BloomFilter (Bloom, Hashable)
This should be using the blocked bloom filter now if we're using the blocked one in the core library
Yes. Good catch. Only Hashable is common between the two impls.
{-
Regression, FPR independent, bits dependent:
Fit {fitParams = V3 8.035531421107756e-2 1.653017726702572 0.5343568065075601, fitErrors = V3 7.602655075308541e-4 8.422591688796256e-3 2.0396917012822195e-2, fitNDF = 996, fitWSSR = 18.362899348627252}
Fit {fitParams = V3 8.079418894776325e-2 1.6462569292513933 0.5550062950289885, fitErrors = V3 7.713375250014809e-4 8.542261871094414e-3 2.0678969159415226e-2, fitNDF = 996, fitWSSR = 19.00125036371992}
-}
Were these left in intentionally or accidentally?
Yes :-)
They're the output of two runs of the fpr-calc program to generate a regression fit. It's not strictly necessary that they be recorded here, but alternatively it might be nice to add a proper comment about how the magic numbers were obtained (including saved output) so people can reproduce it or change it in future.
It'd need 3e-5 here for this one. The calculations get particularly approximate around 2 bits or less, so another approach would be to adjust the tolerance so that it's greater at the low end only.
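A sketch of that bits-dependent tolerance idea, assuming the property compares a calculated FPR against a reference value; the cut-off and tolerance values here are illustrative, not tuned.

-- Hypothetical helper: widen the tolerance only where the calculation is rough.
fprTolerance :: Double -> Double
fprTolerance bitsPerEntry
  | bitsPerEntry <= 3 = 1.0e-4   -- calculations are notably approximate around 2 bits or less
  | otherwise         = 1.0e-6   -- the current tolerance elsewhere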
Force-pushed from 75758cb to 75ef51e
For testing the bloomfilter lib in isolation, rather than the use in the lsm-tree lib.
We don't need multiple schemes.
rather than Data.Array.Byte. It seems to be compatible with a wider range of ghc & lib versions this way.
merge Data.BloomFilter.Mutable.Internal into Data.BloomFilter.Mutable
and export a helpful construction function.
The spell example was a test suite but does not really test anything. It really is an example, not a test. Remove the Words example since it was being used as a benchmark, but we now have better benchmarks.
Calculate the optimal number of bits and hashes directly rather than via an optimisation algorithm.
Remove the prime-based approach.
It was used to calculate the table of primes that we no longer use.
This improves code clarity. Previously we would pass the number of bits and hashes separately. This makes it clearer what we're supplying by using a distinct type, and lets us directly pass the result of the size calc functions.
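For illustration, a sketch of what such a distinct size type could look like; the field names are assumptions, not necessarily those used in the PR.

-- Hypothetical shape of the size type described above.
data BloomSize = BloomSize
    { sizeBits   :: !Int   -- ^ total number of bits in the filter
    , sizeHashes :: !Int   -- ^ number of hash functions (bit probes per element)
    }
  deriving (Eq, Show)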
and rename the (M)Bloom field names. Also change internal rep of numBits to use Int rather than Word64. We already use Int in BloomSize and we only support up to 2^48 bits anyway on 64bit machines. Switch to record rather than positional style for (M)Bloom so we can more easily change representation. Temporarily disable the bloom-query-fast mechanism. The implementation of BloomFilterQuery2 will need significant updates.
At the moment, the bloomfilter sub-library does not itself support serialisation, and this is handled by the lsm-tree lib directly. It does so by being overly familiar with the internal representation of the bloomfilter lib. We still want the file format to be determined by the lsm-tree library, but we would like a better abstraction boundary between the bloomfilter and lsm-tree libraries. So we introduce functions to convert the bloom filters to/from a lower level representation as a BloomSize and a (Mutable)ByteArray for the bits.
This provides more detail, and makes it easier to convert exceptions from the fs-api into this exception type (since it contains a FsErrorPath).
Note that we now need the bloom bit array to always be pinned, because we can do I/O directly into that buffer.
…r.Classic in preparation for adding blocked bloom filters. Though we'll do a bit of module consolidation first.
in preparation for removing the Easy module. By making the normal API easy to use, we don't need a special Easy version of it. We have made it easy by allowing the size to be specified via either (sizeForFPR fpr n) or (sizeForBits b n). Originally the low-level API needed to be told the number of bits and number of hashes, which is indeed tricky, and the Easy API wrapped that.
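A hedged usage sketch of the two size-calculation entry points named above; the exact argument and result types are assumptions.

-- Assumed signatures, for illustration only:
--   sizeForFPR  :: Double -> NumEntries -> BloomSize   -- target false positive rate
--   sizeForBits :: Double -> NumEntries -> BloomSize   -- target bits per entry
exampleSizes :: (BloomSize, BloomSize)
exampleSizes =
    ( sizeForFPR  1.0e-2 (NumEntries 10000)   -- aim for a 1% FPR
    , sizeForBits 10     (NumEntries 10000)   -- aim for 10 bits per entry
    )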
It's ok to allow zero, and the zero case was being covered by tests for RunAcc that use newMBloom. The previous patch changed newMBloom from using its own custom code for determining the bits and hashes to using the Calc module ones, including policyForBits. So we could either adjust the RunAcc test to avoid zero, or just allow zero. We allow it. In the conversion from policy to size we enforce a minimum size of 1 bit. So we can allow less than 1 bit per key.
Force-pushed from 75ef51e to d403b5a
Sync the Classic version with the Blocked version. This version uses a double nested loop to keep the filter as a constant for the inner loop. This produces mildly better code with recent GHC versions that do join points properly.
We provide a low level API for bloom filter inserts and elem operations. These allow for sharing a single hash calculation across many filters, and potentially allows for prefetching. Now instead of a CheapHashes type and operations in the Hash module, each of the two implementations (Classic and Blocked) provide a Hashes type and constructor. The classic one uses the (renamed) CheapHashes, while the Blocked one uses its own scheme. Also tidy up the code for the Blocked Hashes functions, to make the pattern clearer. Follow the pattern of a PRNG.
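For illustration, a sketch of how the hash-sharing API described above might be used to query many filters for one key; the constructor name and argument order are assumptions.

-- Compute the element's Hashes once and reuse them across all filters.
queryMany :: Hashable a => a -> [Bloom a] -> [Bool]
queryMany x filters =
    let hs = Bloom.hashes x               -- hypothetical Hashes constructor
    in  map (Bloom.elemHashes hs) filters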
This is a blocked bloom filter where the bit array is split into 64-byte blocks. Instead of probing bits randomly across the whole bit array (which needs one cache line load per bit probe), it first selects one block and then selects a number of bits within the block. So the same number of bit probes are done overall, but they're all clustered within one block. The blocks are cache line sized and aligned to cache line boundaries. Thus lookups and inserts only cost a single cache line load (or a load and store for an insert).
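A minimal sketch (not the PR's actual code) of the probing scheme described above: one block is chosen from the hash, and then all k bit probes land inside that 512-bit block. The helper name and the bit-selection mixing are toy stand-ins for the PRNG-style scheme mentioned earlier.

import Data.Bits (shiftR)
import Data.Word (Word64)

-- Illustrative only: absolute bit indices for k probes, all within one 64-byte block.
blockedProbes :: Int -> Int -> Word64 -> [Int]
blockedProbes numBlocks k h =
    [ blockBase + bitInBlock i | i <- [0 .. k - 1] ]
  where
    blockIx   = fromIntegral (h `rem` fromIntegral numBlocks)  -- pick one block (numBlocks > 0)
    blockBase = blockIx * 512                                  -- 64 bytes = 512 bits
    -- toy mixing: multiply by an odd constant and keep the top 9 bits (0..511)
    bitInBlock i = fromIntegral ((h * (2 * fromIntegral i + 1)) `shiftR` 55)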
The program produces output for gnuplot, and the gnuplot script plots the calculated and actual FPR vs bits-per-key for both the classic and the blocked implementations. We use 2..20 bits for classic, as that gets us down to below an FPR of 1e-4, while for the blocked one we need to use 2..24 bits to get to approximately the same FPR. This graph output (checked in) is also a great test of the FPR to bits code for both implementations, and of the FPR of each impl in general.
and make the example program a tad more elegant
In the Classic implementation. The Blocked one uses this already. The classic algorithm is the remainder after division. But division is slow, especially for 64bit numbers. There's a faster method based on multiplication. This changes the format version, since it changes the mapping from element hash to the bit selected.
So fewer places depend on the exact BloomFilter implementation. In particular several places just need to talk about the type without using any operations. This will make switching implementation easier.
Force-pushed from d403b5a to 4ce0308
Co-authored-by: Joris Dral <[email protected]>
Long patch series to:
The result is better performance.