I wrote a polars plugin with twox-hash #114

@paddymul

Hi,

Thank you for writing this library.

I wanted to share a project I built with it: pl_series_hash, a plugin for the dataframe library Polars that hashes series.

I wanted to have a very fast hash function per series so that I can cache summary stats for another project.

This was my first project in Rust, so I'm still learning.

Here is the Rust file:
https://github.com/paddymul/pl_series_hash/blob/main/src/expressions.rs
and the Python test:
https://github.com/paddymul/pl_series_hash/blob/main/tests/test_pl_series_hash.py

I have tests that verify this, but I'd still like to get a sanity check that I'm hashing series properly.

I'm worried about hash collisions from a poor implementation. Here is my approach:

For each series I first write out a type identifier.

For each element in a series I add its bytes. For strings I also write a STRING_SEPERATOR of 128u16, which isn't a valid UTF-8 symbol and shouldn't ever appear.
For NaNs/nulls I write out NAN_SEPERATOR (129u16), also an invalid Unicode character.

Next I write out the array position in bytes (u64)

All of this is then hashed.
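
To make the steps above concrete, here is a minimal sketch of the scheme. It is not the code from expressions.rs. Assumptions: the TYPE_ID_* values and the function names are hypothetical, bytes are written little-endian, the position is written after every element (one reading of the description), and std's DefaultHasher stands in for the twox-hash hasher so the snippet builds without external crates. Any type implementing std::hash::Hasher could be swapped in.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Separator values as described in the post.
const STRING_SEPERATOR: u16 = 128;
const NAN_SEPERATOR: u16 = 129;

// Hypothetical type identifiers, chosen only for this sketch.
const TYPE_ID_I64: u64 = 1;
const TYPE_ID_STR: u64 = 2;

/// Hash a nullable i64 series: type id, then (value or null marker, position) per element.
fn hash_i64_series(values: &[Option<i64>]) -> u64 {
    // Stand-in hasher (assumption); replace with a twox-hash hasher in real use.
    let mut hasher = DefaultHasher::new();
    hasher.write(&TYPE_ID_I64.to_le_bytes());
    for (pos, value) in values.iter().enumerate() {
        match value {
            Some(v) => hasher.write(&v.to_le_bytes()),
            None => hasher.write(&NAN_SEPERATOR.to_le_bytes()),
        }
        // The array position as a u64, written after each element.
        hasher.write(&(pos as u64).to_le_bytes());
    }
    hasher.finish()
}

/// Hash a nullable string series: string bytes are followed by STRING_SEPERATOR.
fn hash_str_series(values: &[Option<&str>]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(&TYPE_ID_STR.to_le_bytes());
    for (pos, value) in values.iter().enumerate() {
        match value {
            Some(s) => {
                hasher.write(s.as_bytes());
                hasher.write(&STRING_SEPERATOR.to_le_bytes());
            }
            None => hasher.write(&NAN_SEPERATOR.to_le_bytes()),
        }
        hasher.write(&(pos as u64).to_le_bytes());
    }
    hasher.finish()
}
```

In this sketch, the separator and position keep a series like ["ab", "c"] from feeding the hasher the same byte stream as ["a", "bc"], and the type identifier keeps an all-null integer series distinct from an all-null string series.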
