Block structured Bloom filter #690
base: main
Conversation
Force-pushed from c59568d to 23971a2
Force-pushed from e232688 to 1ff45af
A test failure on 1ff45af: cabal run bloomfilter-tests -- -p prop_calc_size_fpr_bits --quickcheck-replay="(SMGen 18024522305972736904 8705882024453529731,95)"
Data.BloomFilter
Classic
calculations
prop_calc_size_fpr_bits: FAIL
*** Failed! Falsified (after 1 test and 11 shrinks):
BitsPerEntry 2.336869198112799
NumEntries 1000
0.3306894755756382 /= 0.3307128865771576 and not within (abs) tolerance of 1.0e-6
Use --quickcheck-replay="(SMGen 18024522305972736904 8705882024453529731,95)" to reproduce.
1 out of 1 tests failed (0.00s)

One solution would be to increase the tolerance.
I haven't finished my review yet, but I'm posting these PR comments here just so that I don't lose them between now and when I continue the review. The comments are also not polished; they're mostly draft notes, so no need to look at or resolve them, @dcoutts. I'll curate the comments as part of my next leg of reviewing.
EDIT: I've deleted most comments in favour of re-adding them in the next review
Looking very good! I have a bunch of comments / requests for clarification, but nothing that should hold back the PR, I think.
There are also a number of comments that I left on only one of the bloom filter types, but some of those probably also apply to the other type of bloom filter.
 mkBloomFromAlloc :: Hashable a => RunBloomFilterAlloc -> BloomMaker a
-mkBloomFromAlloc alloc xs = runST $ do
-    mb <- newMBloom n alloc
-    mapM_ (MBloom.insert mb) xs
-    Bloom.unsafeFreeze mb
+mkBloomFromAlloc alloc = Bloom.fromList policy
   where
-    n = LSMT.NumEntries $ length xs
+    policy = case alloc of
+      RunAllocFixed bits -> Bloom.policyForBits (fromIntegral bits)
+      RunAllocRequestFPR fpr -> Bloom.policyForFPR fpr
We're duplicating logic that is also in RunAcc.new, so it can get out of sync.
1be79d3 is a little hard to review by just looking at the diff because it moves, removes, and changes code at the same time. It might be nicer to move code in one commit, and make changes in another
Sorry.
-- | Query a mutable Bloom filter for membership. If the value is
-- present, return @True@. If the value is not present, there is
-- /still/ some possibility that @True@ will be returned.
elem :: Hashable a => a -> MBloom s a -> ST s Bool
elem elt mb = elemHashes (makeHashes elt) mb

elemHashes :: forall s a. CheapHashes a -> MBloom s a -> ST s Bool
elemHashes !ch MBloom { numBits = m, numHashes = k, bitArray = v } =
    go 0
  where
    go :: Int -> ST s Bool
    go !i | i >= k = return True
          | otherwise = do let !idx' = evalHashes ch i
                           let !idx = idx' `rem` fromIntegral m
                           b <- V.unsafeRead v idx
                           if b
                             then go (i + 1)
                             else return False
These functions disappear -- is the idea that one should now unsafeFreeze >> elem?
Without these functions, we're making it harder for use cases where someone wants to simultaneously query and insert into a bloom filter
I suppose we could keep them (for other use cases, as you say), but then we need a good name for them, since I'm squashing the API into one flat namespace, not separating mutable and immutable.
Got any good names? :-)
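For illustration, a minimal sketch of the freeze-then-query workaround mentioned above, assuming the flattened API keeps Bloom.unsafeFreeze and an immutable Bloom.elem; the helper name here is hypothetical, not part of the PR.

-- Hypothetical helper: query an MBloom by freezing it first.
elemMutable :: Hashable a => a -> MBloom s a -> ST s Bool
elemMutable x mb = do
    b <- Bloom.unsafeFreeze mb   -- shares rather than copies the bit array (hence "unsafe")
    pure (Bloom.elem x b)        -- query via the immutable interface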
-- | The version of the format used by 'serialise' and 'deserialise'. The
-- format number will change when there is an incompatible change in the
-- library, such that deserialising and using the filter will not work.
-- This can include more than just changes to the serialised format, for
-- example changes to hash functions or how the hash is mapped to bits.
--
-- Note that the format produced does not include this version. Version
-- checking is the responsibility of the user of the library.
--
formatVersion :: Int
formatVersion = 1000
Ideally we'd have tests to detect breakage in the serialisation formats. Maybe we can add a TODO?
-- | The version of the format used by 'serialise' and 'deserialise'. The
-- format number will change when there is an incompatible change in the
-- library, such that deserialising and using the filter will not work.
-- This can include more than just changes to the serialised format, for
-- example changes to hash functions or how the hash is mapped to bits.
--
-- Note that the format produced does not include this version. Version
-- checking is the responsibility of the user of the library.
--
formatVersion :: Int
formatVersion = 1000
We should probably also mention that the format versions for classic and blocked do not overlap
Indeed we should.
-- Note that while x can cover the full Word64 range, since the result is
-- less than n, and since n was an Int then the result fits an Int too.
It's not immediately clear to me why the result is less than n
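For context, a hedged sketch of the multiply-shift range reduction that this comment appears to describe. The helper name is hypothetical, and the real code presumably uses a native 64x64 -> 128-bit multiply rather than Natural.

import Data.Bits (shiftR)
import Data.Word (Word64)
import Numeric.Natural (Natural)

-- Map a hash x uniformly into [0, n) without division (illustrative only).
fastRange :: Int -> Word64 -> Int
fastRange n x =
    fromIntegral (((fromIntegral x :: Natural) * fromIntegral n) `shiftR` 64)
-- Why the result is less than n: x < 2^64, so x * n < 2^64 * n,
-- and therefore (x * n) `div` 2^64 < n. Since n is an Int, the result fits an Int.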
, testGroup "equality"
  [ testProperty "doesn't care about leftover bits a" $
      BI.Bloom 1 48 (BV64.BV64 (VP.singleton 0xffff_0000_1234_5678)) ===
      BI.Bloom 1 48 (BV64.BV64 (VP.singleton 0xeeee_0000_1234_5678))

  , testProperty "doesn't care about leftover bits b" $
      BI.Bloom 1 49 (BV64.BV64 (VP.singleton 0xffff_0000_1234_5678)) =/=
      BI.Bloom 1 49 (BV64.BV64 (VP.singleton 0xeeee_0000_1234_5678))
  ]
]
This test is now trivial because we're using the Eq instance from PrimArray?
Yes exactly.
 ) where

 import Control.Exception (assert)
 import Control.Monad.ST
-import Data.BloomFilter (Bloom)
+import Data.BloomFilter (Bloom, Hashable)
This should be using the blocked bloom filter now if we're using the blocked one in the core library
Yes. Good catch. Only Hashable is common between the two impls.
{-
Regression, FPR independent, bits dependent:
Fit {fitParams = V3 8.035531421107756e-2 1.653017726702572 0.5343568065075601, fitErrors = V3 7.602655075308541e-4 8.422591688796256e-3 2.0396917012822195e-2, fitNDF = 996, fitWSSR = 18.362899348627252}
Fit {fitParams = V3 8.079418894776325e-2 1.6462569292513933 0.5550062950289885, fitErrors = V3 7.713375250014809e-4 8.542261871094414e-3 2.0678969159415226e-2, fitNDF = 996, fitWSSR = 19.00125036371992}
-}
Were these left in intentionally or accidentally?
Yes :-)
They're the output of two runs of the fpr-calc program to generate a regression fit. It's not strictly necessary that they be recorded here, but alternatively it might be nice to add a proper comment about how the magic numbers were obtained (including saved output) so people can reproduce it or change it in future.
It'd need 3e-5 here for this one. The calculations get particularly approximate around 2 bits or less, so another approach would be to adjust the tolerance so that it's greater at the low end only.
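A sketch of that bits-dependent tolerance idea, assuming the property compares a calculated FPR against a reference value; the cut-off and tolerance values here are illustrative, not tuned.

-- Hypothetical helper: widen the tolerance only where the calculation is rough.
fprTolerance :: Double -> Double
fprTolerance bitsPerEntry
  | bitsPerEntry <= 3 = 1.0e-4   -- calculations are notably approximate around 2 bits or less
  | otherwise         = 1.0e-6   -- the current tolerance elsewhere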
Force-pushed from 75758cb to 75ef51e
For testing the bloomfilter lib in isolation, rather than the use in the lsm-tree lib.
We don't need multiple schemes.
rather than Data.Array.Byte. It seems to be compatible with a wider range of ghc & lib versions this way.
merge Data.BloomFilter.Mutable.Internal into Data.BloomFilter.Mutable
and export a helpful construction function.
The spell example was a test suite but does not really test anything. It really is an example, not a test. Remove the Words example since it was being used as a benchmark, but we now have better benchmarks.
Calculate the optimal number of bits and hashes directly rather than via an optimisation algorithm.
Remove the prime-based approach.
It was used to calculate the table of primes that we no longer use.
This improves code clarity. Previously we would pass the number of bits and hashes separately. This makes it clearer what we're supplying by using a distinct type, and lets us directly pass the result of the size calc functions.
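For illustration, a sketch of what such a distinct size type could look like; the field names are assumptions, not necessarily those used in the PR.

-- Hypothetical shape of the size type described above.
data BloomSize = BloomSize
    { sizeBits   :: !Int   -- ^ total number of bits in the filter
    , sizeHashes :: !Int   -- ^ number of hash functions (bit probes per element)
    }
  deriving (Eq, Show)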
and rename the (M)Bloom field names. Also change internal rep of numBits to use Int rather than Word64. We already use Int in BloomSize and we only support up to 2^48 bits anyway on 64bit machines. Switch to record rather than positional style for (M)Bloom so we can more easily change representation. Temporarily disable the bloom-query-fast mechanism. The implementation of BloomFilterQuery2 will need significant updates.
At the moment, the bloomfilter sub-library does not itself support serialisation, and this is handled by the lsm-tree lib directly. It does so by being overly familiar with the internal representation of the bloomfilter lib. We still want the file format to be determined by the lsm-tree library, but we would like a better abstraction boundary between the bloomfilter and lsm-tree libraries. So we introduce functions to convert the bloom filters to/from a lower level representation as a BloomSize and a (Mutable)ByteArray for the bits.
This provides more detail, and makes it easier to convert exceptions from the fs-api into this exception type (since it contains a FsErrorPath).
Note that we now need the bloom bit array to always be pinned, because we can do I/O directly into that buffer.
…r.Classic in preparation for adding blocked bloom filters. Though we'll do a bit of module consolidation first.
in preparation for removing the Easy module. By making the normal API easy to use, we don't need a special Easy version of it. We have made it easy by allowing the size to be specified via either (sizeForFPR fpr n) or (sizeForBits b n). Originally the low-level API needed to be told the number of bits and number of hashes, which is indeed tricky, and the Easy API wrapped that.
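A hedged usage sketch of the two size-calculation entry points named above; the exact argument and result types are assumptions.

-- Assumed signatures, for illustration only:
--   sizeForFPR  :: Double -> NumEntries -> BloomSize   -- target false positive rate
--   sizeForBits :: Double -> NumEntries -> BloomSize   -- target bits per entry
exampleSizes :: (BloomSize, BloomSize)
exampleSizes =
    ( sizeForFPR  1.0e-2 (NumEntries 10000)   -- aim for a 1% FPR
    , sizeForBits 10     (NumEntries 10000)   -- aim for 10 bits per entry
    )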
It's ok to allow zero, and the zero case was being covered by tests for RunAcc that use newMBloom. The previous patch changed newMBloom from using its own custom code for determining the bits and hashes to using the Calc module ones, including policyForBits. So we could either adjust the RunAcc test to avoid zero, or just allow zero. We allow it. In the conversion from policy to size we enforce a minimum size of 1 bit. So we can allow less than 1 bit per key.
Force-pushed from 75ef51e to d403b5a
Sync the Classic version with the Blocked version. This version uses a double nested loop to keep the filter as a constant for the inner loop. This produces mildly better code with recent GHC versions that do join points properly.
We provide a low level API for bloom filter inserts and elem operations. These allow for sharing a single hash calculation across many filters, and potentially allows for prefetching. Now instead of a CheapHashes type and operations in the Hash module, each of the two implementations (Classic and Blocked) provide a Hashes type and constructor. The classic one uses the (renamed) CheapHashes, while the Blocked one uses its own scheme. Also tidy up the code for the Blocked Hashes functions, to make the pattern clearer. Follow the pattern of a PRNG.
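For illustration, a sketch of how the hash-sharing API described above might be used to query many filters for one key; the constructor name and argument order are assumptions.

-- Compute the element's Hashes once and reuse them across all filters.
queryMany :: Hashable a => a -> [Bloom a] -> [Bool]
queryMany x filters =
    let hs = Bloom.hashes x               -- hypothetical Hashes constructor
    in  map (Bloom.elemHashes hs) filters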
This is a blocked bloom filter where the bit array is split into 64-byte blocks. Instead of probing bits randomly across the whole bit array (which needs one cache line load per bit probe), it first selects one block and then selects a number of bits within the block. So the same number of bit probes are done overall, but they're all clustered within one block. The blocks are cache line sized and aligned to cache line boundaries. Thus lookups and inserts only cost a single cache line load (or a load and store for an insert).
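A minimal sketch (not the PR's actual code) of the probing scheme described above: one block is chosen from the hash, and then all k bit probes land inside that 512-bit block. The helper name and the bit-selection mixing are toy stand-ins for the PRNG-style scheme mentioned earlier.

import Data.Bits (shiftR)
import Data.Word (Word64)

-- Illustrative only: absolute bit indices for k probes, all within one 64-byte block.
blockedProbes :: Int -> Int -> Word64 -> [Int]
blockedProbes numBlocks k h =
    [ blockBase + bitInBlock i | i <- [0 .. k - 1] ]
  where
    blockIx   = fromIntegral (h `rem` fromIntegral numBlocks)  -- pick one block (numBlocks > 0)
    blockBase = blockIx * 512                                  -- 64 bytes = 512 bits
    -- toy mixing: multiply by an odd constant and keep the top 9 bits (0..511)
    bitInBlock i = fromIntegral ((h * (2 * fromIntegral i + 1)) `shiftR` 55)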
The program produces output for gnuplot, and the gnuplot script plots the calculated and actual FPR vs bits-per-key for both the classic and the blocked implementations. We use 2..20 bits for classic, as that gets us down to below an FPR of 1e-4, while for the blocked one we need to use 2..24 bits to get to approximately the same FPR. This graph output (checked in) is also a great test of the FPR to bits code for both implementations, and of the FPR of each impl in general.
and make the example program a tad more elegant
In the Classic implementation. The Blocked one uses this already. The classic algorithm is the remainder after division. But division is slow, especially for 64bit numbers. There's a faster method based on multiplication. This changes the format version, since it changes the mapping from element hash to the bit selected.
So fewer places depend on the exact BloomFilter implementation. In particular several places just need to talk about the type without using any operations. This will make switching implementation easier.
Force-pushed from d403b5a to 4ce0308
Co-authored-by: Joris Dral <[email protected]>
Long patch series to:
The result is better performance.