Block structured Bloom filter #690
Open
dcoutts wants to merge 43 commits into main from dcoutts/bloomfilter-blocked
c08ee6d bloomfilter: Add a simple construction benchmark (dcoutts)
02d8311 bloomfilter: removes Hashes, specialise to CheapHashes scheme (dcoutts)
c53370d bloomfilter: use ByteArray type from primitive package (dcoutts)
dbff02a bloomfilter: combine a couple modules into one (dcoutts)
a5fe945 bloomfilter: Remove pointless exported functions (dcoutts)
98add8e bloomfilter: misc minor cleanups of the tests (dcoutts)
c1eea06 bloomfilter: change the example spell program into an executable (jorisdral)
458ba2e bloomfilter: Add new size calculation code (dcoutts)
e4a9333 bloomfilter: add tests for new size calculation functions (dcoutts)
7438e05 bloomfilter: change Easy module to use new size calculations (dcoutts)
d30bc58 bloomfilter: remove primes helper program (dcoutts)
630995b bloomfilter: remove old calc functions (dcoutts)
1cff8e7 bloomfilter: use new BloomSize type for filter construction functions (dcoutts)
b6a4675 bloomfilter: change length to size returning BloomSize (dcoutts)
14784e9 bloomfilter: add (de)serialise functions, for better abstraction (dcoutts)
3c83476 convert bloomFilterToLBS to use new Bloom.serialise (dcoutts)
305bff5 Switch FsPath to FsErrorPath in FileCorruptedError exception type (dcoutts)
1927d00 bloomfilter: fix showing counterexamples in prop_verifyFPR (dcoutts)
4aa4349 convert bloomFilterFromSBS to use new Bloom.deserialise (dcoutts)
cfcffa0 bloomfilter: Move most Data.BloomFilter modules under Data.BloomFilte… (dcoutts)
8074c4b bloomfilter: improve naming in Calc functions (dcoutts)
3c5feae bloomfilter: allow 0 bits in policyForBits (dcoutts)
02bf170 bloomfilter: remove last uses of internal modules (dcoutts)
6c25424 bloomfilter: use a mildly better version of unfoldr (dcoutts)
7435cc7 bloomfilter: establish a common API for hash-based insert and elem (dcoutts)
f6a7188 bloomfilter: Add new Data.BloomFilter.Blocked implementation (dcoutts)
df0fb16 bloomfilter: generalise tests to cover the Blocked implementation (dcoutts)
c83b359 bloomfilter: extend benchmark to blocked implementation (dcoutts)
f91a7da bloomfilter: add bloomfilter-fpr-calc and gnuplot script (dcoutts)
52cdac3 bloomfilter: add operation (?) = flip elem (dcoutts)
d808621 bloomfilter: export a formatVersion number (dcoutts)
685e3d2 Use Bloom.filterVersion number in the lsm-tree serialisation code (dcoutts)
0181539 bloomfilter: switch range reduction from division to multiplication (dcoutts)
cca509b Re-export (M)Bloom via D.LSMTree.I.BloomFilter to reduce coupling (dcoutts)
b977a02 Switch lsm-tree to use the Blocked bloom filter implementation (dcoutts)
4ce0308 bloomfilter: enable the same warnings as other packages (dcoutts)
b4231ae Update bloomfilter/src/Data/BloomFilter/Classic/BitArray.hs (dcoutts)
dbd10c8 Update bloomfilter/src/Data/BloomFilter/Blocked/BitArray.hs (dcoutts)
1f445ac Apply suggestions from code review (dcoutts)
7965a1f Apply suggestions from code review (dcoutts)
c8c1f91 Apply suggestions from code review (dcoutts)
91e9bbc Apply suggestions from code review (dcoutts)
9714096 Apply suggestions from code review (dcoutts)
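Commit 0181539 switches range reduction from division to multiplication. The diff for that commit is not shown here, so the following is only a minimal sketch of the general technique, assuming 32-bit hashes and a hypothetical pair of helper names (`reduceByDivision`, `reduceByMultiplication`); it is not the code from the PR. Both functions map a hash into the range `[0, n)`, but the second replaces the remainder operation with a widening multiply and a shift.

```haskell
import Data.Bits (shiftR)
import Data.Word (Word32, Word64)

-- | Classic range reduction: the remainder after dividing by @n@.
-- Requires @n > 0@. (Hypothetical helper, for comparison only.)
reduceByDivision :: Word32 -> Word32 -> Word32
reduceByDivision h n = h `rem` n

-- | Multiplicative range reduction: treat @h@ as a fraction of 2^32,
-- scale it by @n@, and keep the integer part, i.e. (h * n) / 2^32.
-- The product is computed in 64 bits so it cannot overflow.
-- (Hypothetical helper, not the PR's implementation.)
reduceByMultiplication :: Word32 -> Word32 -> Word32
reduceByMultiplication h n =
    fromIntegral ((fromIntegral h * fromIntegral n :: Word64) `shiftR` 32)
```

The two functions generally return different indices for the same hash, so they are not interchangeable for an already serialised filter; the point of the sketch is only that the multiplicative form also stays within `[0, n)` while avoiding a division.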
@@ -9,9 +9,9 @@ module Bench.Database.LSMTree.Internal.BloomFilter (
   ) where
 
 import Criterion.Main
+import qualified Data.Bifoldable as BiFold
 import Data.BloomFilter (Bloom)
 import qualified Data.BloomFilter as Bloom
-import qualified Data.BloomFilter.Easy as Bloom.Easy
 import Data.BloomFilter.Hash (Hashable)
 import qualified Data.Foldable as Fold
 import Data.Map.Strict (Map)
@@ -38,8 +38,11 @@ benchmarks = bgroup "Bench.Database.LSMTree.Internal.BloomFilter" [
         ]
     , env (constructionEnv 2_500_000) $ \ m ->
       bgroup "construction" [
-          bench "easyList 0.1" $ whnf (constructBloom Bloom.Easy.easyList 0.1) m
-        , bench "easyList 0.9" $ whnf (constructBloom Bloom.Easy.easyList 0.9) m
+          bench "FPR = 0.1" $
+            whnf (constructBloom 0.1) m
+
+        , bench "FPR = 0.9" $
+            whnf (constructBloom 0.9) m
         ]
     ]
 
@@ -57,7 +60,9 @@ elemEnv fpr nbloom nelemsPositive nelemsNegative = do
           $ uniformWithoutReplacement @UTxOKey stdgen (nbloom + nelemsNegative)
       ys2 = sampleUniformWithReplacement @UTxOKey stdgen' nelemsPositive xs
   zs <- generate $ shuffle (ys1 ++ ys2)
-  pure (Bloom.Easy.easyList fpr (fmap serialiseKey xs), fmap serialiseKey zs)
+  pure ( Bloom.fromList (Bloom.policyForFPR fpr) (fmap serialiseKey xs)
+       , fmap serialiseKey zs
+       )
 
 -- | Used for benchmarking 'Bloom.elem'.
 elems :: Hashable a => Bloom a -> [a] -> ()
@@ -74,8 +79,11 @@ constructionEnv n = do
 
 -- | Used for benchmarking the construction of bloom filters from write buffers.
 constructBloom ::
-     (Double -> [SerialisedKey] -> Bloom SerialisedKey)
-  -> Double
+     Double
   -> Map SerialisedKey SerialisedKey
   -> Bloom SerialisedKey
-constructBloom mkBloom fpr m = mkBloom fpr (Map.keys m)
+constructBloom fpr m =
+    -- For faster construction, avoid going via lists and use Bloom.create,
+    -- traversing the map inserting the keys
+    Bloom.create (Bloom.sizeForFPR fpr (Map.size m)) $ \b ->
+      BiFold.bifoldMap (\k -> Bloom.insert b k) (\_v -> pure ()) m
Comment on lines 80 to +89
Today I learned that
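The new `constructBloom` above traverses the map with `bifoldMap`, running one monadic action per key and a no-op per value, instead of first materialising `Map.keys m`. The sketch below illustrates that traversal pattern in isolation, using only `base` and `containers`; it assumes a `containers` version that provides the `Bifoldable` instance for `Map`, and it substitutes a hypothetical `STRef` accumulator for the mutable Bloom filter. It says nothing about which variant is faster; that is a question for the benchmark.

```haskell
import Control.Monad.ST (runST)
import Data.Bifoldable (bifoldMap)
import qualified Data.Map.Strict as Map
import Data.STRef (modifySTRef', newSTRef, readSTRef)

-- | Sum the keys of a map by running one ST action per key, the same
-- shape as inserting each key into a mutable structure. Values are
-- visited too, but their action does nothing.
-- (Hypothetical helper; the accumulator stands in for the filter.)
sumKeysBifold :: Map.Map Int v -> Int
sumKeysBifold m = runST $ do
    acc <- newSTRef 0
    -- The Monoid instance of 'ST s ()' sequences the per-key actions.
    bifoldMap (\k -> modifySTRef' acc (+ k)) (\_v -> pure ()) m
    readSTRef acc

-- | The list-based alternative: build the key list, then fold it.
sumKeysViaList :: Map.Map Int v -> Int
sumKeysViaList = sum . Map.keys
```

Both functions return the same result for any map; they differ only in whether an intermediate list of keys is involved.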
@@ -0,0 +1,58 @@
+module Main where
+
+import qualified Data.BloomFilter.Blocked as B.Blocked
+import qualified Data.BloomFilter.Classic as B.Classic
+import Data.BloomFilter.Hash (Hashable (..), hash64)
+
+import Data.Word (Word64)
+import System.Random
+
+import Criterion.Main
+
+main :: IO ()
+main =
+    defaultMain [
+      bgroup "Data.BloomFilter.Classic" [
+        env newStdGen $ \g0 ->
+        bench "construct m=1e6 fpr=1%" $
+          whnf (constructBloom_classic 1_000_000 0.01) g0
+
+      , env newStdGen $ \g0 ->
+        bench "construct m=1e6 fpr=0.1%" $
+          whnf (constructBloom_classic 1_000_000 0.001) g0
+
+      , env newStdGen $ \g0 ->
+        bench "construct m=1e7 fpr=0.1%" $
+          whnf (constructBloom_classic 10_000_000 0.001) g0
+      ]
+    , bgroup "Data.BloomFilter.Blocked" [
+        env newStdGen $ \g0 ->
+        bench "construct m=1e6 fpr=1%" $
+          whnf (constructBloom_blocked 1_000_000 0.01) g0
+
+      , env newStdGen $ \g0 ->
+        bench "construct m=1e6 fpr=0.1%" $
+          whnf (constructBloom_blocked 1_000_000 0.001) g0
+
+      , env newStdGen $ \g0 ->
+        bench "construct m=1e7 fpr=0.1%" $
+          whnf (constructBloom_blocked 10_000_000 0.001) g0
+      ]
+    ]
+
+constructBloom_classic :: Int -> Double -> StdGen -> B.Classic.Bloom Word64
+constructBloom_classic n fpr g0 =
+    B.Classic.unfold (B.Classic.sizeForFPR fpr n) (nextElement n) (g0, 0)
+
+constructBloom_blocked :: Int -> Double -> StdGen -> B.Blocked.Bloom Word64
+constructBloom_blocked n fpr g0 =
+    B.Blocked.unfold (B.Blocked.sizeForFPR fpr n) (nextElement n) (g0, 0)
+
+{-# INLINE nextElement #-}
+nextElement :: Int -> (StdGen, Int) -> Maybe (Word64, (StdGen, Int))
+nextElement !n (!g, !i)
+  | i >= n    = Nothing
+  | otherwise = Just (x, (g', i+1))
+  where
+    (!x, !g') = uniform g
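The benchmark above feeds `unfold` with a step function that threads a generator and a counter, stopping after `n` elements. As a rough illustration of that pattern without the `random` or bloomfilter packages, the sketch below drives `Data.List.unfoldr` with a step of the same shape. The names `lcgNext`, `nextElementSketch` and `elements` are hypothetical, and the toy linear congruential generator is only a stand-in for `uniform`; it is not suitable for real sampling.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (unfoldr)
import Data.Word (Word64)

-- | One step of a toy linear congruential generator (Knuth's MMIX
-- constants), wrapping modulo 2^64. A stand-in for 'uniform'.
lcgNext :: Word64 -> Word64
lcgNext s = s * 6364136223846793005 + 1442695040888963407

-- | Same shape as 'nextElement' in the benchmark: yield an element and
-- the updated state until the counter reaches @n@.
nextElementSketch :: Int -> (Word64, Int) -> Maybe (Word64, (Word64, Int))
nextElementSketch !n (!s, !i)
  | i >= n    = Nothing
  | otherwise = Just (s', (s', i + 1))
  where
    !s' = lcgNext s

-- | The first @n@ pseudo-random elements from the given seed.
-- 'unfoldr' here plays the role that 'unfold' plays for the filter.
elements :: Int -> Word64 -> [Word64]
elements n seed = unfoldr (nextElementSketch n) (seed, 0)
```

Because the state is an explicit value rather than hidden mutable state, the same seed always produces the same sequence, which keeps repeated benchmark runs comparable.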
This file was deleted.
@@ -1,22 +1,16 @@
 {-# LANGUAGE BangPatterns #-}
 module Main (main) where
 
-import Control.Exception (IOException, catch)
 import Control.Monad (forM_, when)
-import Data.Char (isLetter, toLower)
 import System.Environment (getArgs)
 
-import Data.BloomFilter.Easy (easyList, notElem)
-import Prelude hiding (notElem)
+import qualified Data.BloomFilter as B
 
 main :: IO ()
 main = do
     files <- getArgs
-    dictionary <- readFile "/usr/share/dict/words" `catchIO` \_ -> return "yes no"
-    let !bloom = easyList 0.01 (words dictionary)
-    forM_ files $ \file -> do
-        ws <- words <$> readFile file
-        forM_ ws $ \w -> when (w `notElem` bloom) $ putStrLn w
-
-catchIO :: IO a -> (IOException -> IO a) -> IO a
-catchIO = catch
+    dictionary <- readFile "/usr/share/dict/words"
+    let !bloom = B.fromList (B.policyForFPR 0.01) (words dictionary)
+    forM_ files $ \file ->
+          putStrLn . unlines . filter (`B.notElem` bloom) . words
+      =<< readFile file
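The spell example asks for a filter with a 1% false positive rate via `B.policyForFPR 0.01`. How that function sizes the filter is not shown in this diff, so the following is only a worked sketch of the textbook formulas for a classic Bloom filter: bits per element `-ln p / (ln 2)^2` and hash count `(bits per element) * ln 2`. The function names are hypothetical, and the package's own calculation (in particular for the blocked variant) may well differ.

```haskell
-- | Textbook bits-per-element for a classic Bloom filter with target
-- false positive rate @p@ (0 < p < 1). Hypothetical helper.
bitsPerElement :: Double -> Double
bitsPerElement p = negate (log p) / (log 2 * log 2)

-- | Textbook number of hash functions for the same target, rounded to
-- the nearest integer and never less than one. Hypothetical helper.
numHashes :: Double -> Int
numHashes p = max 1 (round (bitsPerElement p * log 2))

-- | Total number of bits for @n@ elements at target rate @p@,
-- rounded up.
filterBits :: Double -> Int -> Int
filterBits p n = ceiling (bitsPerElement p * fromIntegral n)
```

For `p = 0.01` this gives roughly 9.6 bits per element and 7 hash functions, which is the usual rule of thumb for a 1% filter.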
Did you check if it's faster/better going through `bifoldMap`?