Hi,
Thank you for writing this library.
I wanted to share a project I built with it: pl_series_hash, a plugin for the dataframe library Polars that hashes series.
I wanted a very fast per-series hash function so that I can cache summary stats for another project.
This was my first project in Rust, so I'm still learning.
Here is the Rust file:
https://github.com/paddymul/pl_series_hash/blob/main/src/expressions.rs
and the Python test:
https://github.com/paddymul/pl_series_hash/blob/main/tests/test_pl_series_hash.py
I have tests that verify the hashing, but I'd still like a sanity check that I'm hashing series properly.
I'm worried about hash collisions from a poor implementation. Here is my approach:
For each series, I first write out a type identifier.
For each element in a series, I add its bytes. For strings, I also write a STRING_SEPERATOR of 128u16, which isn't a valid UTF-8 symbol and shouldn't ever appear.
For NaNs/nulls, I write out NAN_SEPERATOR (129u16), which is also an invalid Unicode character.
Next, I write out the array position as bytes (u64).
All of this is then hashed.
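
To make the approach above easier to review, here is a minimal sketch of it for a string series. It is not the actual code from expressions.rs: it uses the standard library's `DefaultHasher` rather than this library, the `TYPE_ID_STRING` value and the `hash_string_series` function name are made up for illustration, and it assumes the array position is written right after the null marker (my reading of the steps above).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Separator values as described above (identifier spelling kept as in the project).
const STRING_SEPERATOR: u16 = 128;
const NAN_SEPERATOR: u16 = 129;
// Hypothetical type identifier for string series; the real value may differ.
const TYPE_ID_STRING: u64 = 1;

/// Hashes a string series where `None` stands for a null entry.
fn hash_string_series(values: &[Option<&str>]) -> u64 {
    let mut hasher = DefaultHasher::new();
    // 1. Type identifier, so series of different dtypes hash differently.
    hasher.write_u64(TYPE_ID_STRING);
    for (position, value) in values.iter().enumerate() {
        match value {
            Some(s) => {
                // 2. Element bytes, then the separator that marks the string's end.
                hasher.write(s.as_bytes());
                hasher.write_u16(STRING_SEPERATOR);
            }
            None => {
                // 3. Null marker, then the array position as a u64
                //    (assumption: the position is only written for nulls).
                hasher.write_u16(NAN_SEPERATOR);
                hasher.write_u64(position as u64);
            }
        }
    }
    // 4. Everything written so far is hashed into a single u64.
    hasher.finish()
}
```

With this layout, `["ab", "c"]` and `["a", "bc"]` feed different byte streams to the hasher because the separator falls in a different place, and moving a null to another position changes the stream as well.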