Initial commit

waku-org · Dec 9, 2024 · commit cd33273

Showing 15 changed files with 2,050 additions and 0 deletions.
diff --git a/.gitattributes b/.gitattributes
@@ -0,0 +1,2 @@
+# Auto detect text files and perform LF normalization
+* text=auto
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,10 @@
+nimcache
+nimcache/*
+tests/test
+benches/bench
+benches/bench_arch_end
+bloom
+*.html
+*.css
+.DS_Store
+src/.DS_Store
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,20 @@
+The MIT License (MIT)
+
+Copyright (c) 2013 Nick Greenfield
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of
+this software and associated documentation files (the "Software"), to deal in
+the Software without restriction, including without limitation the rights to
+use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
+the Software, and to permit persons to whom the Software is furnished to do so,
+subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
+FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
+IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,122 @@
+# nim-bloom
+***NOTE: THIS IMPLEMENTATION IS NOT PEER-REVIEWED YET. PLEASE USE WITH CAUTION.***
+
+A high-performance Bloom filter implementation in Nim offering standard and custom hash function options with different performance characteristics and false positive rates.
+
+## Features
+
+- Fast string element insertion and lookup
+- Configurable error rates
+- Choice between standard Nim hash and custom MurmurHash3 (128-bit or 32-bit)
+- Tunable for different trade-offs between speed and accuracy
+- Comprehensive test suite and benchmarks
+
+## Usage
+
+Basic usage (defaults to MurmurHash3_128):
+```nim
+import bloom2
+
+# Initialize with default hash (MurmurHash3_128)
+var bf = initializeBloomFilter(capacity = 10000, errorRate = 0.01)
+
+# Or explicitly specify hash type
+var bf32 = initializeBloomFilter(
+  capacity = 10000, 
+  errorRate = 0.01,
+  hashType = htMurmur32  # Use 32-bit implementation
+)
+
+# Basic operations
+bf.insert("test")
+assert bf.lookup("test")
+```
+
+## Hash Function Selection
+
+1. Use MurmurHash3_128 (default) when:
+    - You need the best balance of performance and accuracy
+    - Memory isn't severely constrained
+    - Working with large datasets
+    - False positive rates are important
+
+2. Use MurmurHash3_32 when:
+    - Running on 32-bit systems
+    - Memory is constrained
+    - Working with smaller datasets
+    - The string-concatenation overhead of the second hash (slower insertion and lookup) is acceptable
+
+3. Use NimHash when:
+    - Consistency with Nim's hashing is important
+    - Working with smaller datasets where performance is less critical
+    - You want to benefit from future improvements to Nim's hash functions or faster implementations of them
+
+Nim's Hash Implementation:
+  - Default (no flags): Uses FarmHash implementation
+  - With `-d:nimStringHash2`: Uses Nim's MurmurHash3_32 implementation
+  - Our implementation allows an explicit choice regardless of compilation flags, and our MurmurHash3_32 performs better because it directly uses a native C implementation
+
+## Performance Characteristics
+### For 1M items - Random Strings
+```
+Insertion Speed:
+MurmurHash3_128: ~6.8M ops/sec
+MurmurHash3_32:  ~5.9M ops/sec
+FarmHash:        ~2.1M ops/sec
+
+False Positive Rates:
+MurmurHash3_128: ~0.84%
+MurmurHash3_32:  ~0.83%
+FarmHash:        ~0.82%
+```
+
+These measurements show MurmurHash3_128's balanced performance profile: the best insertion speed with a false positive rate comparable to the other two.
+
+Performance will vary based on:
+- Choice of hash function
+- Hardware specifications
+- Data size and memory access patterns (inside vs outside cache)
+- Compiler optimizations
+
+For detailed benchmarks across different data patterns and sizes, see [benches](benches/).
+
+## Implementation Details
+
+### Double Hashing Technique
+This implementation uses the Kirsch-Mitzenmacher method to generate k hash values from two initial hashes. The details vary by hash type:
+
+1. MurmurHash3_128:
+```nim
+h(i) = abs((hash1 + i * hash2) mod m)
+```
+- Uses both 64-bit hashes from 128-bit output
+- Natural double-hash implementation
+
+2. MurmurHash3_32:
+```nim
+let baseHash = murmurHash32(item, 0'u32)
+let secondHash = murmurHash32(item & " b", 0'u32)
+```
+- Uses string concatenation by default for the second hash
+- Bit rotation for the second hash provides sufficient randomness in some use cases and is much faster than string concatenation, but results in a higher FP rate
+- Choose between bit rotation and string concatenation according to your use case.
+
+3. Nim's Hash:
+```nim
+  let
+    hashA = abs(hash(item)) mod maxValue
+    hashB = abs(hash(item & " b")) mod maxValue
+  h(i) = abs(hashA + i * hashB) mod maxValue
+```
+- Based on FarmHash, or on Nim's MurmurHash if the compilation flag is passed
+- Uses string concatenation by default.
+- Lower FP rate than bit rotation, at the cost of higher insertion and lookup times.
+
+*Tip:* Bit rotation values can be made configurable as well. Use odd, preferably prime, rotation amounts for better mixing, e.g. 7, 11, 13, 17 for 32-bit; 21, 23, 27, 33 for 64-bit. Smaller rotations provide less mixing but are faster than larger rotations.
+
+## Testing
+
+Run the test suite:
+```bash
+nimble test
+```
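
The double-hashing scheme described in the README above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code from this commit: `baseHash`, `secondHashConcat`, `secondHashRotate`, and `probePositions` are hypothetical names, Nim's built-in `hash` stands in for MurmurHash3, and the rotation amount 21 is simply one of the values listed in the README's tip. Unsigned arithmetic is used so that overflow wraps around and no `abs` call is needed.

```nim
import std/[hashes, bitops]

proc baseHash(item: string): uint64 =
  # Reinterpret Nim's signed Hash as unsigned so later arithmetic wraps.
  uint64(cast[uint](hash(item)))

proc secondHashConcat(item: string): uint64 =
  # Second hash from a suffixed copy of the item (costs a string allocation).
  baseHash(item & " b")

proc secondHashRotate(h1: uint64): uint64 =
  # Cheaper alternative: derive the second hash by rotating the first.
  rotateLeftBits(h1, 21)

proc probePositions(item: string, k, m: int, useRotation = false): seq[int] =
  ## Kirsch-Mitzenmacher: position i is (h1 + i * h2) mod m.
  let h1 = baseHash(item)
  let h2 = if useRotation: secondHashRotate(h1) else: secondHashConcat(item)
  result = newSeq[int](k)
  for i in 0 ..< k:
    result[i] = int((h1 + uint64(i) * h2) mod uint64(m))

when isMainModule:
  echo probePositions("test", 7, 95_851)
  echo probePositions("test", 7, 95_851, useRotation = true)
```

Position 0 is the same in both variants because it depends only on the first hash; the two variants diverge from i = 1 onwards, which is where the quality of the second hash starts to affect the false positive rate.
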
diff --git a/benches/bench.nim b/benches/bench.nim
@@ -0,0 +1,123 @@
+import times, random, strutils
+include bloom
+
+type
+  DataPattern = enum
+    dpRandom,      # Random strings
+    dpSequential,  # Sequential numbers
+    dpFixed,       # Fixed length strings
+    dpLong,        # Long strings
+    dpSpecial      # Strings with special characters
+
+type 
+  BenchmarkResult = tuple[
+    insertTime: float,
+    lookupTime: float, 
+    falsePositives: int
+  ]
+
+proc generateBenchData(pattern: DataPattern, size: int, isLookupData: bool = false): seq[string] =
+  result = newSeq[string](size)
+  let offset = if isLookupData: size * 2 else: 0  # Ensure lookup data is well separated
+
+  case pattern:
+  of dpRandom:
+    for i in 0..<size:
+      var s = ""
+      for j in 0..rand(5..15):
+        s.add(chr(rand(ord('a')..ord('z'))))
+      result[i] = s
+  of dpSequential:
+    for i in 0..<size:
+      result[i] = $(i + offset)  # Add offset for lookup data
+  of dpFixed:
+    for i in 0..<size:
+      result[i] = "fixed" & align($(i + offset), 10, '0')
+  of dpLong:
+    for i in 0..<size:
+      result[i] = repeat("x", 100) & $(i + offset)
+  of dpSpecial:
+    for i in 0..<size:
+      result[i] = "test@" & $(i + offset) & "#$%^&*" & $rand(1000)
+
+proc benchmarkHashType(hashType: HashType, size: int, errorRate: float, 
+                      data: seq[string], lookupData: seq[string]): BenchmarkResult =
+  # Initialize Bloom filter and run benchmark for given hash type
+  var bf = initializeBloomFilter(size, errorRate, hashType = hashType)
+
+  # Measure insert time
+  let startInsert = cpuTime()
+  for item in data:
+    bf.insert(item)
+  let insertTime = cpuTime() - startInsert
+
+  # Measure lookup time and count false positives  
+  var falsePositives = 0
+  let startLookup = cpuTime()
+  for item in lookupData:
+    if bf.lookup(item): falsePositives.inc
+  let lookupTime = cpuTime() - startLookup
+
+  result = (insertTime, lookupTime, falsePositives)
+
+proc printResults(hashName: string, result: BenchmarkResult, 
+                 dataSize: int, lookupDataSize: int) =
+  echo "\n", hashName, " Results:"
+  echo "  Insert time: ", result.insertTime, "s (", dataSize.float/result.insertTime, " ops/sec)"
+  echo "  Lookup time: ", result.lookupTime, "s (", lookupDataSize.float/result.lookupTime, " ops/sec)"
+  echo "  False positives: ", result.falsePositives, " (", 
+       result.falsePositives.float / lookupDataSize.float * 100, "%)"
+
+proc runBenchmark(size: int, errorRate: float, pattern: DataPattern, name: string) =
+  echo "\n=== Benchmark: ", name, " ==="
+  echo "Size: ", size, " items"
+  echo "Pattern: ", pattern
+
+  # Generate test data
+  let data = generateBenchData(pattern, size, false)
+  let lookupData = generateBenchData(pattern, size div 2, true)
+
+  # Run benchmarks for each hash type
+  let nimHashResult = benchmarkHashType(htNimHash, size, errorRate, data, lookupData)
+  let murmur128Result = benchmarkHashType(htMurmur128, size, errorRate, data, lookupData)
+  let murmur32Result = benchmarkHashType(htMurmur32, size, errorRate, data, lookupData)
+
+  # Print individual results
+  printResults("Nim's Hash (Farm Hash)", nimHashResult, size, lookupData.len)
+  printResults("MurmurHash3_128", murmur128Result, size, lookupData.len)
+  printResults("MurmurHash3_32", murmur32Result, size, lookupData.len)
+
+  # Print comparisons
+  echo "\nComparison (higher means better/faster):"
+  echo "  Insert Speed:"
+  echo "    Murmur128 vs NimHash: ", nimHashResult.insertTime/murmur128Result.insertTime, "x faster"
+  echo "    Murmur32 vs NimHash: ", nimHashResult.insertTime/murmur32Result.insertTime, "x faster"
+  echo "    Murmur128 vs Murmur32: ", murmur32Result.insertTime/murmur128Result.insertTime, "x faster"
+
+  echo "  Lookup Speed:"
+  echo "    Murmur128 vs NimHash: ", nimHashResult.lookupTime/murmur128Result.lookupTime, "x faster"
+  echo "    Murmur32 vs NimHash: ", nimHashResult.lookupTime/murmur32Result.lookupTime, "x faster"
+  echo "    Murmur128 vs Murmur32: ", murmur32Result.lookupTime/murmur128Result.lookupTime, "x faster"
+
+  echo "  False Positive Rates:"
+  let fpRateNimHash = nimHashResult.falsePositives.float / lookupData.len.float
+  let fpRateMurmur128 = murmur128Result.falsePositives.float / lookupData.len.float
+  let fpRateMurmur32 = murmur32Result.falsePositives.float / lookupData.len.float
+
+  echo "    Murmur128 vs NimHash: ", fpRateNimHash/fpRateMurmur128, "x better"
+  echo "    Murmur32 vs NimHash: ", fpRateNimHash/fpRateMurmur32, "x better"
+  echo "    Murmur128 vs Murmur32: ", fpRateMurmur32/fpRateMurmur128, "x better"
+
+when isMainModule:
+  const errorRate = 0.01
+
+  # Test each pattern
+  for pattern in [dpRandom, dpSequential, dpFixed, dpLong, dpSpecial]:
+    # Small dataset
+    runBenchmark(10_000, errorRate, pattern, "Small " & $pattern)
+
+    # Medium dataset
+    runBenchmark(100_000, errorRate, pattern, "Medium " & $pattern)
+
+    # Large dataset
+    runBenchmark(1_000_000, errorRate, pattern, "Large " & $pattern)
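
Every benchmark run above requests `errorRate = 0.01`. For orientation, the textbook Bloom filter sizing for n items and target false positive rate p is m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below evaluates those formulas; `textbookSizing` is a hypothetical helper, and whether `initializeBloomFilter` sizes its bit array exactly this way is not shown in this part of the commit.

```nim
import std/math

proc textbookSizing(capacity: int, errorRate: float): tuple[mBits: int, k: int] =
  ## Textbook sizing: bits per item = -ln(p) / (ln 2)^2, k = bits per item * ln 2.
  let bitsPerItem = -ln(errorRate) / (ln(2.0) * ln(2.0))
  result.mBits = int(ceil(capacity.float * bitsPerItem))
  # Round to the nearest whole number of hash functions, at least one.
  result.k = max(1, int(round(bitsPerItem * ln(2.0))))

when isMainModule:
  # Same dataset sizes and error rate as the benchmark's main block.
  for n in [10_000, 100_000, 1_000_000]:
    let s = textbookSizing(n, 0.01)
    echo n, " items: ", s.mBits, " bits, k = ", s.k
```

At a 1% target this works out to a little under 10 bits per item and 7 hash functions, which gives a rough baseline for judging the measured false positive rates.
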