Initial commit
shash256 committed Dec 9, 2024
0 parents commit cd33273
Showing 15 changed files with 2,050 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .gitattributes
@@ -0,0 +1,2 @@
# Auto detect text files and perform LF normalization
* text=auto
10 changes: 10 additions & 0 deletions .gitignore
@@ -0,0 +1,10 @@
nimcache
nimcache/*
tests/test
benches/bench
benches/bench_arch_end
bloom
*.html
*.css
.DS_Store
src/.DS_Store
20 changes: 20 additions & 0 deletions LICENSE
@@ -0,0 +1,20 @@
The MIT License (MIT)

Copyright (c) 2013 Nick Greenfield

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
122 changes: 122 additions & 0 deletions README.md
@@ -0,0 +1,122 @@
# nim-bloom
***NOTE: THIS IMPLEMENTATION IS NOT PEER-REVIEWED YET. PLEASE USE WITH CAUTION.***

A high-performance Bloom filter implementation in Nim offering standard and custom hash function options with different performance characteristics and false positive rates.

## Features

- Fast string element insertion and lookup
- Configurable error rates
- Choice between standard Nim hash and custom MurmurHash3 (128-bit or 32-bit)
- Tunable trade-off between speed and accuracy to suit different use cases
- Comprehensive test suite and benchmarks

## Usage

Basic usage (defaults to MurmurHash3_128):
```nim
import bloom2
# Initialize with default hash (MurmurHash3_128)
var bf = initializeBloomFilter(capacity = 10000, errorRate = 0.01)
# Or explicitly specify hash type
var bf32 = initializeBloomFilter(
  capacity = 10000,
  errorRate = 0.01,
  hashType = htMurmur32 # Use 32-bit implementation
)
# Basic operations
bf.insert("test")
assert bf.lookup("test")
```
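
How `capacity` and `errorRate` translate into a bit-array size and a number of hash functions is not spelled out here. A common choice, which this sketch assumes rather than confirms for this library, is the textbook sizing `m = -n * ln(p) / (ln 2)^2` bits and `k = (m / n) * ln 2` hash functions. `optimalParams` is a hypothetical helper used only for illustration and is not part of the package API:

```nim
import std/math

# Hypothetical helper (not part of this package): textbook Bloom filter sizing.
proc optimalParams(capacity: int, errorRate: float): tuple[mBits: int, kHashes: int] =
  # Number of bits: m = -n * ln(p) / (ln 2)^2, rounded up.
  let m = ceil(-capacity.float * ln(errorRate) / (ln(2.0) * ln(2.0)))
  # Number of hash functions: k = (m / n) * ln 2, rounded to the nearest integer.
  let k = round(m / capacity.float * ln(2.0))
  result = (mBits: m.int, kHashes: max(1, k.int))

let params = optimalParams(10000, 0.01)
echo "bits: ", params.mBits, ", hash functions: ", params.kHashes
```

Under these assumptions, `capacity = 10000` with `errorRate = 0.01` needs about 95,851 bits (roughly 12 KB) and 7 hash functions. The library may round or align these values differently.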

## Hash Function Selection

1. Use MurmurHash3_128 (default) when:
   - You need the best balance of performance and accuracy
   - Memory isn't severely constrained
   - You are working with large datasets
   - False positive rates are important

2. Use MurmurHash3_32 when:
   - You are running on 32-bit systems
   - Memory is constrained
   - You are working with smaller datasets
   - The string-concatenation overhead of the second hash, which slows insertion and lookup, is acceptable

3. Use NimHash when:
   - Consistency with Nim's hashing is important
   - You are working with smaller datasets where performance is less critical
   - You want to benefit from better or faster hash implementations that Nim may provide in the future

Nim's Hash Implementation:
- Default (no flags): Uses FarmHash implementation
- With `-d:nimStringHash2`: Uses Nim's MurmurHash3_32 implementation
- Our implementation allows an explicit choice regardless of compilation flags. Our MurmurHash3_32 performs better because it calls a native C implementation directly.

## Performance Characteristics
### For 1M items - Random Strings
```
Insertion Speed:
  MurmurHash3_128: ~6.8M ops/sec
  MurmurHash3_32:  ~5.9M ops/sec
  FarmHash:        ~2.1M ops/sec
False Positive Rates:
  MurmurHash3_128: ~0.84%
  MurmurHash3_32:  ~0.83%
  FarmHash:        ~0.82%
```

These measurements show MurmurHash3_128's balanced performance profile: it has the highest insertion speed, and its false positive rate is within about 0.02 percentage points of the other two.

Performance will vary based on:
- Choice of hash function
- Hardware specifications
- Data size and memory access patterns (inside vs outside cache)
- Compiler optimizations

For detailed benchmarks across different data patterns and sizes, see [benches](benches/).
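
For context on the false positive rates above, the classical approximation `(1 - e^(-k*n/m))^k` gives the rate to expect from a filter with `m` bits and `k` hash functions after `n` insertions. The sketch below evaluates it for 1M items; the values `m = 9_585_059` and `k = 7` assume textbook sizing for a 1% target and are not taken from this library, and `expectedFpRate` is a hypothetical name:

```nim
import std/math

# Hypothetical helper: classical Bloom filter false positive approximation.
proc expectedFpRate(n, m, k: int): float =
  # Probability that a given bit is set after n insertions with k hashes each.
  let fillRatio = 1.0 - exp(-(k.float * n.float) / m.float)
  # A false positive requires all k probed bits to be set.
  result = pow(fillRatio, k.float)

# m and k are assumed values for a 1% target at 1M items, not library output.
echo expectedFpRate(n = 1_000_000, m = 9_585_059, k = 7)
```

This evaluates to roughly 1%, so measured rates near 0.8% are in the expected range. The exact figure depends on how the filter rounds its bit count and hash count.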

## Implementation Details

### Double Hashing Technique
This implementation uses the Kirsch-Mitzenmacher method to generate k hash values from two initial hashes. The details vary by hash type:

1. MurmurHash3_128:
```nim
h(i) = abs((hash1 + i * hash2) mod m)
```
- Uses both 64-bit hashes from 128-bit output
- Natural double-hash implementation

2. MurmurHash3_32:
```nim
let baseHash = murmurHash32(item, 0'u32)
let secondHash = murmurHash32(item & " b", 0'u32)
```
- Uses string concatenation by default for the second hash
- Bit rotation for the second hash provides sufficient randomness in some use cases and is much faster than string concatenation, but results in a higher FP rate
- Choose between bit rotation and string concatenation according to your use case.

3. Nim's Hash:
```nim
let
  hashA = abs(hash(item)) mod maxValue
  hashB = abs(hash(item & " b")) mod maxValue
h(i) = abs((hashA + i * hashB)) mod maxValue
```
- Based on FarmHash, or on Nim's MurmurHash if the compilation flag is passed
- Uses string concatenation by default.
- Lower FP rate than bit rotation but comes at the cost of higher insertion and lookup times.
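
The three variants above share one scheme: two base hashes are combined into k bit positions. The following is a minimal sketch of that combination, not the library's actual code. It assumes Nim's standard `hash` from `std/hashes` and the `" b"` suffix shown above. The proc name `bitPositions` is hypothetical, and unsigned arithmetic is swapped in for `abs(...)` so that sums wrap around instead of overflowing:

```nim
import std/hashes

# Hypothetical helper: derive k bit positions in 0 ..< m from two base hashes.
proc bitPositions(item: string, k, m: int): seq[int] =
  doAssert k > 0 and m > 0
  # Reinterpret the signed hashes as unsigned so arithmetic wraps around.
  let h1 = uint64(cast[uint](hash(item)))
  let h2 = uint64(cast[uint](hash(item & " b")))
  result = newSeq[int](k)
  for i in 0 ..< k:
    # Kirsch-Mitzenmacher combination: h(i) = (h1 + i * h2) mod m
    result[i] = int((h1 + uint64(i) * h2) mod uint64(m))

echo bitPositions("test", 7, 95851)
```

Inserting an item would set the bits at these positions, and a lookup would check that all of them are set. The exact positions depend on which hash Nim is compiled with.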

*Tip:* Bit rotation amounts can be made configurable as well. Odd rotation amounts are suggested for better mixing: 7, 11, 13, 17 for 32-bit; 21, 23, 27, 33 for 64-bit (not all of these are prime: 21, 27, and 33 are composite). Smaller rotations provide less mixing but are faster than larger rotations.
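
As an illustration of the bit-rotation alternative, the sketch below derives a second 32-bit hash by rotating the first one instead of hashing a concatenated string. This is an assumption about how such a rotation could be applied, not the library's code; `rotl32` and `secondHashByRotation` are hypothetical names, and the default amount of 13 is simply one of the 32-bit values listed in the tip:

```nim
# Hypothetical helper: rotate the 32 bits of x left by r positions.
proc rotl32(x: uint32, r: int): uint32 =
  doAssert r in 1 .. 31
  # Bits shifted out on the left re-enter on the right.
  result = (x shl r) or (x shr (32 - r))

# Hypothetical helper: derive the second hash from the first without rehashing.
proc secondHashByRotation(baseHash: uint32, r: int = 13): uint32 =
  result = rotl32(baseHash, r)

echo secondHashByRotation(0xDEADBEEF'u32)
```

Because the second hash is a fixed function of the first, the two are fully correlated, which is consistent with the higher FP rate noted for bit rotation.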

## Testing

Run the test suite:
```bash
nimble test
```
123 changes: 123 additions & 0 deletions benches/bench.nim
@@ -0,0 +1,123 @@
import times, random, strutils
include bloom

type
  DataPattern = enum
    dpRandom,      # Random strings
    dpSequential,  # Sequential numbers
    dpFixed,       # Fixed length strings
    dpLong,        # Long strings
    dpSpecial      # Strings with special characters

type
  BenchmarkResult = tuple[
    insertTime: float,
    lookupTime: float,
    falsePositives: int
  ]

proc generateBenchData(pattern: DataPattern, size: int, isLookupData: bool = false): seq[string] =
  result = newSeq[string](size)
  let offset = if isLookupData: size * 2 else: 0 # Ensure lookup data is well separated

  case pattern:
  of dpRandom:
    for i in 0..<size:
      var s = ""
      for j in 0..rand(5..15):
        s.add(chr(rand(ord('a')..ord('z'))))
      result[i] = s
  of dpSequential:
    for i in 0..<size:
      result[i] = $(i + offset) # Add offset for lookup data
  of dpFixed:
    for i in 0..<size:
      result[i] = "fixed" & align($(i + offset), 10, '0')
  of dpLong:
    for i in 0..<size:
      result[i] = repeat("x", 100) & $(i + offset)
  of dpSpecial:
    for i in 0..<size:
      result[i] = "test@" & $(i + offset) & "#$%^&*" & $rand(1000)

proc benchmarkHashType(hashType: HashType, size: int, errorRate: float,
                       data: seq[string], lookupData: seq[string]): BenchmarkResult =
  # Initialize Bloom filter and run benchmark for given hash type
  var bf = initializeBloomFilter(size, errorRate, hashType = hashType)

  # Measure insert time
  let startInsert = cpuTime()
  for item in data:
    bf.insert(item)
  let insertTime = cpuTime() - startInsert

  # Measure lookup time and count false positives
  var falsePositives = 0
  let startLookup = cpuTime()
  for item in lookupData:
    if bf.lookup(item): falsePositives.inc
  let lookupTime = cpuTime() - startLookup

  result = (insertTime, lookupTime, falsePositives)

proc printResults(hashName: string, result: BenchmarkResult,
                  dataSize: int, lookupDataSize: int) =
  echo "\n", hashName, " Results:"
  echo " Insert time: ", result.insertTime, "s (", dataSize.float/result.insertTime, " ops/sec)"
  echo " Lookup time: ", result.lookupTime, "s (", lookupDataSize.float/result.lookupTime, " ops/sec)"
  echo " False positives: ", result.falsePositives, " (",
    result.falsePositives.float / lookupDataSize.float * 100, "%)"

proc runBenchmark(size: int, errorRate: float, pattern: DataPattern, name: string) =
  echo "\n=== Benchmark: ", name, " ==="
  echo "Size: ", size, " items"
  echo "Pattern: ", pattern

  # Generate test data
  let data = generateBenchData(pattern, size, false)
  let lookupData = generateBenchData(pattern, size div 2, true)

  # Run benchmarks for each hash type
  let nimHashResult = benchmarkHashType(htNimHash, size, errorRate, data, lookupData)
  let murmur128Result = benchmarkHashType(htMurmur128, size, errorRate, data, lookupData)
  let murmur32Result = benchmarkHashType(htMurmur32, size, errorRate, data, lookupData)

  # Print individual results
  printResults("Nim's Hash (Farm Hash)", nimHashResult, size, lookupData.len)
  printResults("MurmurHash3_128", murmur128Result, size, lookupData.len)
  printResults("MurmurHash3_32", murmur32Result, size, lookupData.len)

  # Print comparisons
  echo "\nComparison (higher means better/faster):"
  echo " Insert Speed:"
  echo " Murmur128 vs NimHash: ", nimHashResult.insertTime/murmur128Result.insertTime, "x faster"
  echo " Murmur32 vs NimHash: ", nimHashResult.insertTime/murmur32Result.insertTime, "x faster"
  echo " Murmur128 vs Murmur32: ", murmur32Result.insertTime/murmur128Result.insertTime, "x faster"

  echo " Lookup Speed:"
  echo " Murmur128 vs NimHash: ", nimHashResult.lookupTime/murmur128Result.lookupTime, "x faster"
  echo " Murmur32 vs NimHash: ", nimHashResult.lookupTime/murmur32Result.lookupTime, "x faster"
  echo " Murmur128 vs Murmur32: ", murmur32Result.lookupTime/murmur128Result.lookupTime, "x faster"

  echo " False Positive Rates:"
  let fpRateNimHash = nimHashResult.falsePositives.float / lookupData.len.float
  let fpRateMurmur128 = murmur128Result.falsePositives.float / lookupData.len.float
  let fpRateMurmur32 = murmur32Result.falsePositives.float / lookupData.len.float

  echo " Murmur128 vs NimHash: ", fpRateNimHash/fpRateMurmur128, "x better"
  echo " Murmur32 vs NimHash: ", fpRateNimHash/fpRateMurmur32, "x better"
  echo " Murmur128 vs Murmur32: ", fpRateMurmur32/fpRateMurmur128, "x better"

when isMainModule:
  const errorRate = 0.01

  # Test each pattern
  for pattern in [dpRandom, dpSequential, dpFixed, dpLong, dpSpecial]:
    # Small dataset
    runBenchmark(10_000, errorRate, pattern, "Small " & $pattern)

    # Medium dataset
    runBenchmark(100_000, errorRate, pattern, "Medium " & $pattern)

    # Large dataset
    runBenchmark(1_000_000, errorRate, pattern, "Large " & $pattern)