dev-hash/benches/hybrid.rs (65 additions, 0 deletions)
@@ -0,0 +1,65 @@
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use data_resource::ResourceId;
use rand::prelude::*;
use std::path::Path;

use dev_hash::Hybrid;

// Add files to benchmark here
const FILE_PATHS: [&str; 2] =
["../test-assets/lena.jpg", "../test-assets/test.pdf"];
// Modify time limit here
const BENCHMARK_TIME_LIMIT: std::time::Duration =
std::time::Duration::from_secs(20);

fn generate_random_data(size: usize) -> Vec<u8> {
    let mut rng = rand::thread_rng();
    (0..size).map(|_| rng.gen()).collect()
}

/// Benchmarks the performance of resource ID creation
/// from file paths and random data.
///
/// - Measures the time taken to create a resource ID from file paths.
/// - Measures the time taken to create a resource ID from random data.
fn bench_resource_id_creation(c: &mut Criterion) {
    let mut group = c.benchmark_group("hybrid_resource_id_creation");
    group.measurement_time(BENCHMARK_TIME_LIMIT);

    // Benchmarks for computing from file paths
    for file_path in FILE_PATHS.iter() {
        assert!(
            Path::new(file_path).is_file(),
            "The file: {} does not exist or is not a file",
            file_path
        );

        let id = format!("compute_from_path:{}", file_path);
        group.bench_function(id, move |b| {
            b.iter(|| {
                <Hybrid as ResourceId>::from_path(black_box(file_path))
                    .expect("from_path returned an error")
            });
        });
    }

    // Benchmarks for computing from random data
    let inputs = [("small", 1024), ("medium", 65536), ("large", 1048576)];

    for (name, size) in inputs.iter() {
        let input_data = generate_random_data(*size);

        let id = format!("compute_from_bytes:{}", name);
        group.bench_function(id, move |b| {
            b.iter(|| {
                <Hybrid as ResourceId>::from_bytes(black_box(&input_data))
                    .expect("from_bytes returned an error")
            });
        });
    }

    group.finish();
}

criterion_group!(benches, bench_resource_id_creation);
criterion_main!(benches);
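The benchmark can be run through Cargo's Criterion integration, assuming the target is registered as `hybrid` (the file name) with `harness = false` in dev-hash's Cargo.toml:

cargo bench -p dev-hash --bench hybrid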
dev-hash/src/hybrid.rs (136 additions, 0 deletions)
@@ -0,0 +1,136 @@
use std::{
    fs,
    io::{BufRead, BufReader},
    path::Path,
};

use blake3::Hasher as Blake3Hasher;
use core::{fmt::Display, str::FromStr};
use hex::encode;
use serde::{Deserialize, Serialize};

use data_error::Result;
use data_resource::ResourceId;

use std::hash::{Hash, Hasher};

const FNV_OFFSET_BASIS: u64 = 0xcbf29ce484222325;
const FNV_PRIME: u64 = 0x100000001b3;

fn fnv_hash_bytes(bytes: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for &byte in bytes.iter() {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

fn fnv_hash_path<P: AsRef<Path>>(path: P) -> u64 {
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    path.as_ref().hash(&mut hasher);
    let hash = hasher.finish();
    fnv_hash_bytes(hash.to_le_bytes().as_slice())
}

#[derive(
    Debug, Clone, PartialEq, Eq, Ord, PartialOrd, Hash, Serialize, Deserialize,
)]
pub struct Hybrid(pub String);

impl FromStr for Hybrid {
    type Err = hex::FromHexError;

    fn from_str(s: &str) -> core::result::Result<Self, Self::Err> {
        Ok(Hybrid(s.to_string()))
    }
}

impl Display for Hybrid {
    fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
        write!(f, "{}", self.0)
    }
}

const THRESHOLD: u64 = 1024 * 1024 * 1024;
Member: A wild idea: is it difficult to make this constant a type parameter, so that the same type could be instantiated with different thresholds? It would also be really great to have benchmarks of the optimized "skip-chunks" hash function for different sizes. The goal of such benchmarks is not only to see the speed improvement, but also to see the collision ratio.

Collaborator (Author): Nope, it is not difficult; I just haven't done it yet, because I wanted to keep the implementation as similar as possible to the other implementations (Blake3 and CRC32) in this PoC.

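As a rough illustration of the type-parameter idea (not part of this PR), the threshold could become a const generic. `ParamHybrid` and its `from_bytes` helper below are hypothetical names, and the sketch reuses `Blake3Hasher`, `encode`, and `fnv_hash_bytes` from this file:

// Hypothetical sketch: the threshold as a const type parameter, so that
// different thresholds become different types and can be benchmarked
// side by side (speed as well as collision ratio).
pub struct ParamHybrid<const THRESHOLD: u64>(pub String);

impl<const THRESHOLD: u64> ParamHybrid<THRESHOLD> {
    pub fn from_bytes(bytes: &[u8]) -> Self {
        let size = bytes.len() as u64;
        if size < THRESHOLD {
            let mut hasher = Blake3Hasher::new();
            hasher.update(bytes);
            ParamHybrid(encode(hasher.finalize().as_bytes()))
        } else {
            ParamHybrid(format!("{}_{}", size, fnv_hash_bytes(bytes)))
        }
    }
}

// Different thresholds, e.g. for comparative benchmarks:
// type Hybrid1MiB = ParamHybrid<{ 1024 * 1024 }>;
// type Hybrid1GiB = ParamHybrid<{ 1024 * 1024 * 1024 }>;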

impl ResourceId for Hybrid {
    fn from_path<P: AsRef<Path>>(file_path: P) -> Result<Self> {
        let size = fs::metadata(file_path.as_ref())?.len();

        if size < THRESHOLD {
            // Use Blake3 for small files
            log::debug!(
                "Computing BLAKE3 hash for file: {:?}",
                file_path.as_ref()
            );

            // Stream the file through the hasher in newline-delimited chunks
            let file = fs::File::open(file_path)?;
            let mut reader = BufReader::new(file);
            let mut hasher = Blake3Hasher::new();
            let mut buffer = Vec::new();
            loop {
                let bytes_read = reader.read_until(b'\n', &mut buffer)?;
                if bytes_read == 0 {
                    break;
                }
                hasher.update(&buffer);
                buffer.clear();
            }
            let hash = hasher.finalize();
            Ok(Hybrid(encode(hash.as_bytes())))
        } else {
            // Use FNV hashing for large files; note that this hashes the
            // file path and size, not the file contents
            log::debug!(
                "Computing simple hash for file: {:?}",
                file_path.as_ref()
            );

            let hash = fnv_hash_path(file_path);
            Ok(Hybrid(format!("{}_{}", size, hash)))
        }
    }

    fn from_bytes(bytes: &[u8]) -> Result<Self> {
        let size = bytes.len() as u64;
        if size < THRESHOLD {
            // Use Blake3 for small files
            log::debug!("Computing BLAKE3 hash for bytes");

            let mut hasher = Blake3Hasher::new();
            hasher.update(bytes);
            let hash = hasher.finalize();
            Ok(Hybrid(encode(hash.as_bytes())))
        } else {
            // Use FNV hashing for large files
            log::debug!("Computing simple hash for bytes");

            let hash = fnv_hash_bytes(bytes);
            Ok(Hybrid(format!("{}_{}", size, hash)))
Member (comment on lines +96 to +109):

  1. The original idea is the opposite: use Blake3 for small and medium files, and use a faster function for large files, where the size of the contents is large enough to keep the collision ratio low.
  2. FNV hashing can be added separately as a dedicated hash function, same as the "skip-chunk" hash function.
  3. A wild idea: can we parameterize this hybrid hash function with other hash functions? Then we could compose two "dedicated" hash functions into a threshold-based hash function.

Collaborator (Author):

  1. Yes, any file with a size below the THRESHOLD is already hashed with Blake3.
  2. 100%.
  3. Yes, totally. I'm not sure whether there are higher-priority things to do first, but we could even create a fully parameterized implementation that allows an indefinite number of (hash function, threshold) pairs; I've done something similar in JavaScript once. [See the sketch after this impl block.]

        }
    }
}

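Following up on point 3 of the thread above, a rough sketch of composing two existing hash types behind a size threshold (not part of this PR). `ThresholdId` and `id_for_path` are hypothetical names, the `Display` bounds are an assumption about how the composed identifier would be rendered, and the sketch reuses the `fs`, `Path`, `Result`, and `ResourceId` imports already at the top of this file:

// Hypothetical sketch: a threshold-based composition of two ResourceId
// implementations, e.g. Blake3 below the limit and a faster hash above it.
use std::{fmt::Display, marker::PhantomData};

pub struct ThresholdId<Small, Large, const LIMIT: u64> {
    _small: PhantomData<Small>,
    _large: PhantomData<Large>,
}

impl<Small, Large, const LIMIT: u64> ThresholdId<Small, Large, LIMIT>
where
    Small: ResourceId + Display,
    Large: ResourceId + Display,
{
    // Pick the hash function by file size, mirroring Hybrid::from_path above.
    pub fn id_for_path<P: AsRef<Path>>(path: P) -> Result<String> {
        let size = fs::metadata(path.as_ref())?.len();
        if size < LIMIT {
            Ok(Small::from_path(path)?.to_string())
        } else {
            Ok(Large::from_path(path)?.to_string())
        }
    }
}

// E.g., with the crate's other hash types:
// ThresholdId::<Blake3, Crc32, { 1024 * 1024 * 1024 }>::id_for_path(path)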
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn sanity_check() {
        let file_path = Path::new("../test-assets/lena.jpg");
        let id = Hybrid::from_path(file_path)
            .expect("Failed to compute resource identifier");
        assert_eq!(
            id,
            Hybrid("172b4bf148e858b13dde0fc6613413bcb7552e5c4e5c45195ac6c80f20eb5ff5".to_string())
        );

        let raw_bytes = fs::read(file_path).expect("Failed to read file");
        let id = <Hybrid as ResourceId>::from_bytes(&raw_bytes)
            .expect("Failed to compute resource identifier");
        assert_eq!(
            id,
            Hybrid("172b4bf148e858b13dde0fc6613413bcb7552e5c4e5c45195ac6c80f20eb5ff5".to_string())
        );
    }
}
dev-hash/src/lib.rs (3 additions, 0 deletions)
@@ -1,5 +1,8 @@
mod blake3;
mod crc32;

mod hybrid;

pub use blake3::Blake3;
pub use crc32::Crc32;
pub use hybrid::Hybrid;
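A minimal usage sketch for the new export (not part of the diff), assuming a consumer crate that depends on dev-hash and data-resource and that the test asset path exists:

use data_resource::ResourceId;
use dev_hash::Hybrid;

fn main() {
    // Files below the 1 GiB threshold get a BLAKE3 content hash; larger
    // files fall back to the size-prefixed FNV identifier.
    let id = Hybrid::from_path("../test-assets/lena.jpg")
        .expect("failed to compute resource identifier");
    println!("hybrid id: {}", id);
}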