Scalable Bloom Filter implemented in Ruby.
Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. For more detail: en.wikipedia.org/wiki/Bloom_filter
Instead of using k different hash functions, this implementation seeds the CRC32 hash with k different initial values (0, 1, …, k-1). This may or may not give you a good distribution, it all depends on the data.
require 'bloomfilter' # M (size of bit array) # K (number of hash functions) # R (random seed) 100000000, k=4, random seed=1 # M, K, R bf = BloomFilter.new(10, 2, 1) bf.insert("test") bf.include?("test") => true bf.include?("test2") => false # Hash with a bloom filter! bf["test2"] = "bar" bf["test2"] => "bar" bf["test3"] => nil bf.stats Number of filter bits (m): 10 Number of filter elements (n): 2 Number of filter hashes (k) : 2 Predicted false positive rate = 10.87%
Performance of the Bloom filter depends on a number of variables:
-
size of the bit array
-
number of hash functions
To figure out the values for these parameters, refer to:
http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/
Tatsuya Mori <[email protected]> (Original: http://vald.x0.com/sb/)