commit cd33273
Showing 15 changed files with 2,050 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
# Auto detect text files and perform LF normalization | ||
* text=auto |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
nimcache | ||
nimcache/* | ||
tests/test | ||
benches/bench | ||
benches/bench_arch_end | ||
bloom | ||
*.html | ||
*.css | ||
.DS_Store | ||
src/.DS_Store |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
The MIT License (MIT) | ||
|
||
Copyright (c) 2013 Nick Greenfield | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy of | ||
this software and associated documentation files (the "Software"), to deal in | ||
the Software without restriction, including without limitation the rights to | ||
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of | ||
the Software, and to permit persons to whom the Software is furnished to do so, | ||
subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS | ||
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR | ||
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER | ||
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN | ||
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,122 @@ | ||
# nim-bloom | ||
***NOTE: THIS IMPLEMENTATION IS NOT PEER-REVIEWED YET. PLEASE USE WITH CAUTION.*** | ||
|
||
A high-performance Bloom filter implementation in Nim offering standard and custom hash function options with different performance characteristics and false positive rates. | ||
|
||
## Features | ||
|
||
- Fast string element insertion and lookup | ||
- Configurable error rates | ||
- Choice between standard Nim hash and custom MurmurHash3 (128-bit or 32-bit) | ||
- Tunable for speed-focused or accuracy-focused use cases | ||
- Comprehensive test suite and benchmarks | ||
|
||
## Usage | ||
|
||
Basic usage (defaults to MurmurHash3_128): | ||
```nim | ||
import bloom2 | ||
# Initialize with default hash (MurmurHash3_128) | ||
var bf = initializeBloomFilter(capacity = 10000, errorRate = 0.01) | ||
# Or explicitly specify hash type | ||
var bf32 = initializeBloomFilter( | ||
capacity = 10000, | ||
errorRate = 0.01, | ||
hashType = htMurmur32 # Use 32-bit implementation | ||
) | ||
# Basic operations | ||
bf.insert("test") | ||
assert bf.lookup("test") | ||
``` | ||
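The README does not show how `capacity` and `errorRate` translate into a filter size. As a rough guide, the textbook Bloom filter formulas are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below applies them; `optimalParams` is a hypothetical helper, and whether this library sizes its filter exactly this way is an assumption.

```nim
import std/math

proc optimalParams(capacity: int, errorRate: float): tuple[bits, hashes: int] =
  ## Hypothetical helper (not part of this library's API): textbook sizing
  ## for a Bloom filter holding `capacity` items at a target `errorRate`.
  let
    n = capacity.float
    # Number of bits: m = -n * ln(p) / (ln 2)^2, rounded up.
    m = ceil(-n * ln(errorRate) / (ln(2.0) * ln(2.0)))
    # Number of hash functions: k = (m / n) * ln 2, rounded to nearest.
    k = round(m / n * ln(2.0))
  (bits: int(m), hashes: max(1, int(k)))

when isMainModule:
  let p = optimalParams(10_000, 0.01)
  echo "bits: ", p.bits, ", hash functions: ", p.hashes
```

For the values used in the usage example (10,000 items at a 1% error rate) this works out to roughly 96 kilobits, or about 12 KB, and 7 hash functions.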
|
||
## Hash Function Selection | ||
|
||
1. Use MurmurHash3_128 (default) when: | ||
- You need the best balance of performance and accuracy | ||
- Memory isn't severely constrained | ||
- Working with large datasets | ||
- False positive rates are important | ||
|
||
2. Use MurmurHash3_32 when: | ||
- Running on 32-bit systems | ||
- Memory is constrained | ||
- Working with smaller datasets | ||
- The string-concatenation overhead of computing the second hash (slower insertion and lookup) is acceptable | ||
|
||
3. Use NimHash when: | ||
- Consistency with Nim's hashing is important | ||
- Working with smaller datasets where performance is less critical | ||
- You want to benefit from future improvements to Nim's hash functions or faster implementations of them | ||
|
||
Nim's Hash Implementation: | ||
- Default (no flags): Uses FarmHash implementation | ||
- With `-d:nimStringHash2`: Uses Nim's MurmurHash3_32 implementation | ||
- This library lets you choose the hash explicitly regardless of compilation flags; its MurmurHash3_32 is faster than Nim's because it calls a native C implementation directly | ||
|
||
## Performance Characteristics | ||
### For 1M items - Random Strings | ||
``` | ||
Insertion Speed: | ||
MurmurHash3_128: ~6.8M ops/sec | ||
MurmurHash3_32: ~5.9M ops/sec | ||
FarmHash: ~2.1M ops/sec | ||
False Positive Rates: | ||
MurmurHash3_128: ~0.84% | ||
MurmurHash3_32: ~0.83% | ||
FarmHash: ~0.82% | ||
``` | ||
|
||
These measurements show MurmurHash3_128's balanced performance profile: the highest insertion speed of the three, with a false positive rate within about 0.02 percentage points of the others. | ||
|
||
Performance will vary based on: | ||
- Choice of hash function | ||
- Hardware specifications | ||
- Data size and memory access patterns (inside vs outside cache) | ||
- Compiler optimizations | ||
|
||
For detailed benchmarks across different data patterns and sizes, see [benches](benches/). | ||
|
||
## Implementation Details | ||
|
||
### Double Hashing Technique | ||
This implementation uses the Kirsch-Mitzenmacher method to generate k hash values from two initial hashes. The details vary by hash type: | ||
|
||
1. MurmurHash3_128: | ||
```nim | ||
h(i) = abs((hash1 + i * hash2) mod m) | ||
``` | ||
- Uses both 64-bit hashes from 128-bit output | ||
- Natural double-hash implementation | ||
|
||
2. MurmurHash3_32: | ||
```nim | ||
let baseHash = murmurHash32(item, 0'u32) | ||
let secondHash = murmurHash32(item & " b", 0'u32) | ||
``` | ||
- Uses string concatenation by default for the second hash | ||
- Bit rotation for the second hash is much faster than string concatenation and provides sufficient randomness for some use cases, but results in a higher FP rate | ||
- Choose between bit rotation and string concatenation according to your use case. | ||
|
||
3. Nim's Hash: | ||
```nim | ||
let | ||
hashA = abs(hash(item)) mod maxValue | ||
hashB = abs(hash(item & " b")) mod maxValue | ||
h(i) = abs((hashA + i * hashB)) mod maxValue | ||
``` | ||
- Based on FarmHash, or on Nim's MurmurHash if the compilation flag is passed | ||
- Uses string concatenation by default. | ||
- Lower FP rate than bit rotation, at the cost of higher insertion and lookup times. | ||
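To make the index formula concrete, here is a minimal, self-contained sketch of Kirsch-Mitzenmacher double hashing. It is not this library's code: `kmIndices` is a hypothetical helper, Nim's built-in `hash` stands in for the two base hashes, and the unsigned arithmetic is a design choice of the sketch (it sidesteps `abs`, which can overflow on the most negative integer).

```nim
import std/hashes

proc kmIndices(item: string, k, m: int): seq[int] =
  ## Hypothetical helper: derive k bit positions in 0 ..< m from two base
  ## hashes using h(i) = (h1 + i * h2) mod m.
  let
    # Reinterpret the signed Hash values as unsigned so addition and
    # multiplication wrap around instead of raising on overflow.
    h1 = uint64(cast[uint](hash(item)))
    # Second hash via string concatenation, as in the README's NimHash variant.
    h2 = uint64(cast[uint](hash(item & " b")))
  result = newSeq[int](k)
  for i in 0 ..< k:
    result[i] = int((h1 + uint64(i) * h2) mod uint64(m))

when isMainModule:
  # Seven probe positions in a 95,851-bit filter for one item.
  echo kmIndices("test", 7, 95_851)
```

Only two hash computations are needed per item no matter how large k is, which is the main appeal of the method.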
|
||
*Tip:* Bit rotation amounts are configurable as well. Odd amounts, ideally prime, tend to mix better: for example 7, 11, 13, 17 for 32-bit; 21, 23, 27, 33 for 64-bit (of the 64-bit values, only 23 is prime). Smaller rotations provide less mixing but are faster than larger ones. | ||
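The bit-rotation alternative can be sketched as follows. This is an illustration under assumptions, not the library's implementation: `rotl32` and `rotationIndices` are hypothetical names, the first hash is Nim's built-in `hash` truncated to 32 bits, and the default rotation of 13 is simply one of the values suggested in the tip.

```nim
import std/hashes

func rotl32(x: uint32, r: int): uint32 =
  ## Rotate a 32-bit value left by r bits (r must be in 1..31).
  assert r in 1 .. 31
  (x shl r) or (x shr (32 - r))

proc rotationIndices(item: string, k, m: int, rot = 13): seq[int] =
  ## Hypothetical helper: the second hash is a rotation of the first, so the
  ## item is hashed only once and no concatenated string is allocated.
  let
    h1 = uint32(cast[uint](hash(item)) and 0xFFFF_FFFF'u)
    h2 = rotl32(h1, rot)
  result = newSeq[int](k)
  for i in 0 ..< k:
    # Widen to 64 bits so h1 + i * h2 does not wrap before the modulo.
    result[i] = int((uint64(h1) + uint64(i) * uint64(h2)) mod uint64(m))

when isMainModule:
  echo rotationIndices("test", 7, 95_851)
```

Because the second hash is fully determined by the first, two items whose first hashes collide share every probe position, which is consistent with the higher false positive rate noted above.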
|
||
## Testing | ||
|
||
Run the test suite: | ||
```bash | ||
nimble test | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,123 @@ | ||
import times, random, strutils | ||
include bloom | ||
|
||
type | ||
DataPattern = enum | ||
dpRandom, # Random strings | ||
dpSequential, # Sequential numbers | ||
dpFixed, # Fixed length strings | ||
dpLong, # Long strings | ||
dpSpecial # Strings with special characters | ||
|
||
type | ||
BenchmarkResult = tuple[ | ||
insertTime: float, | ||
lookupTime: float, | ||
falsePositives: int | ||
] | ||
|
||
proc generateBenchData(pattern: DataPattern, size: int, isLookupData: bool = false): seq[string] = | ||
result = newSeq[string](size) | ||
let offset = if isLookupData: size * 2 else: 0 # Ensure lookup data is well separated | ||
|
||
case pattern: | ||
of dpRandom: | ||
for i in 0..<size: | ||
var s = "" | ||
for j in 0..rand(5..15): | ||
s.add(chr(rand(ord('a')..ord('z')))) | ||
result[i] = s | ||
of dpSequential: | ||
for i in 0..<size: | ||
result[i] = $(i + offset) # Add offset for lookup data | ||
of dpFixed: | ||
for i in 0..<size: | ||
result[i] = "fixed" & align($(i + offset), 10, '0') | ||
of dpLong: | ||
for i in 0..<size: | ||
result[i] = repeat("x", 100) & $(i + offset) | ||
of dpSpecial: | ||
for i in 0..<size: | ||
result[i] = "test@" & $(i + offset) & "#$%^&*" & $rand(1000) | ||
|
||
proc benchmarkHashType(hashType: HashType, size: int, errorRate: float, | ||
data: seq[string], lookupData: seq[string]): BenchmarkResult = | ||
# Initialize Bloom filter and run benchmark for given hash type | ||
var bf = initializeBloomFilter(size, errorRate, hashType = hashType) | ||
|
||
# Measure insert time | ||
let startInsert = cpuTime() | ||
for item in data: | ||
bf.insert(item) | ||
let insertTime = cpuTime() - startInsert | ||
|
||
# Measure lookup time and count false positives | ||
var falsePositives = 0 | ||
let startLookup = cpuTime() | ||
for item in lookupData: | ||
if bf.lookup(item): falsePositives.inc | ||
let lookupTime = cpuTime() - startLookup | ||
|
||
result = (insertTime, lookupTime, falsePositives) | ||
|
||
proc printResults(hashName: string, result: BenchmarkResult, | ||
dataSize: int, lookupDataSize: int) = | ||
echo "\n", hashName, " Results:" | ||
echo " Insert time: ", result.insertTime, "s (", dataSize.float/result.insertTime, " ops/sec)" | ||
echo " Lookup time: ", result.lookupTime, "s (", lookupDataSize.float/result.lookupTime, " ops/sec)" | ||
echo " False positives: ", result.falsePositives, " (", | ||
result.falsePositives.float / lookupDataSize.float * 100, "%)" | ||
|
||
proc runBenchmark(size: int, errorRate: float, pattern: DataPattern, name: string) = | ||
echo "\n=== Benchmark: ", name, " ===" | ||
echo "Size: ", size, " items" | ||
echo "Pattern: ", pattern | ||
|
||
# Generate test data | ||
let data = generateBenchData(pattern, size, false) | ||
let lookupData = generateBenchData(pattern, size div 2, true) | ||
|
||
# Run benchmarks for each hash type | ||
let nimHashResult = benchmarkHashType(htNimHash, size, errorRate, data, lookupData) | ||
let murmur128Result = benchmarkHashType(htMurmur128, size, errorRate, data, lookupData) | ||
let murmur32Result = benchmarkHashType(htMurmur32, size, errorRate, data, lookupData) | ||
|
||
# Print individual results | ||
printResults("Nim's Hash (Farm Hash)", nimHashResult, size, lookupData.len) | ||
printResults("MurmurHash3_128", murmur128Result, size, lookupData.len) | ||
printResults("MurmurHash3_32", murmur32Result, size, lookupData.len) | ||
|
||
# Print comparisons | ||
echo "\nComparison (higher means better/faster):" | ||
echo " Insert Speed:" | ||
echo " Murmur128 vs NimHash: ", nimHashResult.insertTime/murmur128Result.insertTime, "x faster" | ||
echo " Murmur32 vs NimHash: ", nimHashResult.insertTime/murmur32Result.insertTime, "x faster" | ||
echo " Murmur128 vs Murmur32: ", murmur32Result.insertTime/murmur128Result.insertTime, "x faster" | ||
|
||
echo " Lookup Speed:" | ||
echo " Murmur128 vs NimHash: ", nimHashResult.lookupTime/murmur128Result.lookupTime, "x faster" | ||
echo " Murmur32 vs NimHash: ", nimHashResult.lookupTime/murmur32Result.lookupTime, "x faster" | ||
echo " Murmur128 vs Murmur32: ", murmur32Result.lookupTime/murmur128Result.lookupTime, "x faster" | ||
|
||
echo " False Positive Rates:" | ||
let fpRateNimHash = nimHashResult.falsePositives.float / lookupData.len.float | ||
let fpRateMurmur128 = murmur128Result.falsePositives.float / lookupData.len.float | ||
let fpRateMurmur32 = murmur32Result.falsePositives.float / lookupData.len.float | ||
|
||
echo " Murmur128 vs NimHash: ", fpRateNimHash/fpRateMurmur128, "x better" | ||
echo " Murmur32 vs NimHash: ", fpRateNimHash/fpRateMurmur32, "x better" | ||
echo " Murmur128 vs Murmur32: ", fpRateMurmur32/fpRateMurmur128, "x better" | ||
|
||
when isMainModule: | ||
const errorRate = 0.01 | ||
|
||
# Test each pattern | ||
for pattern in [dpRandom, dpSequential, dpFixed, dpLong, dpSpecial]: | ||
# Small dataset | ||
runBenchmark(10_000, errorRate, pattern, "Small " & $pattern) | ||
|
||
# Medium dataset | ||
runBenchmark(100_000, errorRate, pattern, "Medium " & $pattern) | ||
|
||
# Large dataset | ||
runBenchmark(1_000_000, errorRate, pattern, "Large " & $pattern) |