
Conversation


@jukofyork (Collaborator) commented Oct 24, 2025

This PR replaces the very slow FNV-1a hash (which processes one byte at a time) with XXHash64, which processes blocks of 4×8 bytes (four 64-bit words) in a loop that the compiler should be able to unroll, consuming the bulk of the data in 32-byte stripes:

// one 32-byte stripe: each of the four 64-bit lanes absorbs one word,
// then is rotated left by 31 bits and multiplied by prime1
const uint64_t* block = reinterpret_cast<const uint64_t*>(p);
for (int i = 0; i < 4; ++i) {
    uint64_t v = state[i] + block[i] * prime2;
    state[i] = ((v << 31) | (v >> (64 - 31))) * prime1; // rotl64(v, 31) * prime1
}
p += 32;
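
For reference, here is a minimal self-contained sketch of the complete XXH64 round structure this stripe loop sits inside (lane convergence, tail handling, and the final avalanche). It is not the PR's actual diff: the xxh64_sketch and rotl64 names are mine, the seed defaults to 0, and it carries the same endianness/alignment assumptions noted below:

#include <cstddef>
#include <cstdint>

static inline uint64_t rotl64(uint64_t x, int r) {
    return (x << r) | (x >> (64 - r));
}

static uint64_t xxh64_sketch(const uint8_t* p, size_t len, uint64_t seed = 0) {
    const uint64_t prime1 = 0x9E3779B185EBCA87ULL;
    const uint64_t prime2 = 0xC2B2AE3D27D4EB4FULL;
    const uint64_t prime3 = 0x165667B19E3779F9ULL;
    const uint64_t prime4 = 0x85EBCA77C2B2AE63ULL;
    const uint64_t prime5 = 0x27D4EB2F165667C5ULL;

    const uint8_t* end = p + len;
    uint64_t h;

    if (len >= 32) {
        uint64_t state[4] = { seed + prime1 + prime2, seed + prime2, seed, seed - prime1 };
        do { // the 32-byte stripe loop shown above
            const uint64_t* block = reinterpret_cast<const uint64_t*>(p);
            for (int i = 0; i < 4; ++i) {
                uint64_t v = state[i] + block[i] * prime2;
                state[i] = rotl64(v, 31) * prime1;
            }
            p += 32;
        } while (p + 32 <= end);
        // converge the four lanes into one 64-bit value
        h = rotl64(state[0], 1) + rotl64(state[1], 7) + rotl64(state[2], 12) + rotl64(state[3], 18);
        for (int i = 0; i < 4; ++i) {
            h = (h ^ (rotl64(state[i] * prime2, 31) * prime1)) * prime1 + prime4;
        }
    } else {
        h = seed + prime5; // short inputs never touch the stripe loop
    }
    h += static_cast<uint64_t>(len);

    // tail: leftover 8-byte words, then an optional 4-byte word, then single bytes
    for (; p + 8 <= end; p += 8) {
        uint64_t k = *reinterpret_cast<const uint64_t*>(p);
        h = rotl64(h ^ (rotl64(k * prime2, 31) * prime1), 27) * prime1 + prime4;
    }
    if (p + 4 <= end) {
        h = rotl64(h ^ (*reinterpret_cast<const uint32_t*>(p) * prime1), 23) * prime2 + prime3;
        p += 4;
    }
    for (; p < end; ++p) {
        h = rotl64(h ^ (*p * prime5), 11) * prime1;
    }

    // final avalanche mixes the high bits back down
    h ^= h >> 33;  h *= prime2;
    h ^= h >> 29;  h *= prime3;
    h ^= h >> 32;
    return h;
}

The four independent lanes are what let the compiler and CPU overlap the multiplies, which is likely why the manual unroll below didn't buy anything.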

NOTE: I did try manually unrolling the loop:

Spoiler
const uint64_t* w      = reinterpret_cast<const uint64_t*>(p);
const uint64_t* wLimit = reinterpret_cast<const uint64_t*>(end - 32); // last start for a full 32-byte stripe
for (; w <= wLimit; w += 4) {
    // unrolled lanes
    uint64_t v0 = state[0] + w[0] * prime2;
    state[0] = ((v0 << 31) | (v0 >> (64 - 31))) * prime1;

    uint64_t v1 = state[1] + w[1] * prime2;
    state[1] = ((v1 << 31) | (v1 >> (64 - 31))) * prime1;

    uint64_t v2 = state[2] + w[2] * prime2;
    state[2] = ((v2 << 31) | (v2 >> (64 - 31))) * prime1;

    uint64_t v3 = state[3] + w[3] * prime2;
    state[3] = ((v3 << 31) | (v3 >> (64 - 31))) * prime1;
}
// advance the byte pointer past the processed stripes (32 bytes per iteration)
p = reinterpret_cast<const uint8_t*>(w);

but it didn't seem to help.


I haven't done any formal benchmarking, but it is clearly many times faster in practice, and in line with the reported benchmarks:

[image: reported XXHash64 benchmark figures]

A few notes:

  • It assumes all hosts use the same endianness.
  • It assumes the input uint8_t* data is aligned to 8-byte boundaries (a portable memcpy-based load is sketched after this list).
  • It was only really possible to figure out using this blog post, as the original author's code is near impenetrable...
  • Any tensors hashed with the old FNV-1a will still live on until the .cache/llama.cpp/rpc/ folder is cleared, but otherwise I don't expect any problems from 64-bit FNV-1a <--> XXHash64 collisions (i.e. the chance of one is minuscule...).
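
Since the first two notes are assumptions baked into the fast path, here is a hedged sketch (mine, not part of the PR) of the usual portable alternative: a memcpy-based load, which avoids the undefined behaviour of a misaligned reinterpret_cast and normalizes big-endian hosts (the byte-swap intrinsic shown is GCC/Clang-specific):

#include <cstdint>
#include <cstring>

static inline uint64_t read_u64_le(const uint8_t* p) {
    uint64_t v;
    std::memcpy(&v, p, sizeof(v)); // alignment-safe; compiles to a single load on x86-64
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    v = __builtin_bswap64(v);      // normalize lanes to little-endian
#endif
    return v;
}

Whether the extra portability is worth it here depends on whether RPC peers can ever disagree on endianness or hand over unaligned buffers.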

NOTE: I haven't had time yet to compare against the reference implementation, so I'm leaving this as a draft for now.
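
For when that comparison happens, a hypothetical self-test (assuming the xxh64_sketch helper from the sketch above, plus the reference xxHash library from https://github.com/Cyan4973/xxHash) could look like:

#include <xxhash.h>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<uint8_t> buf(1 << 20);
    for (size_t i = 0; i < buf.size(); ++i) {
        buf[i] = static_cast<uint8_t>(i * 2654435761u); // arbitrary filler
    }
    // cover the <32-byte path, the stripe boundary, and all tail cases
    for (size_t len : {size_t(0), size_t(1), size_t(7), size_t(31),
                       size_t(32), size_t(33), size_t(1000), buf.size()}) {
        assert(xxh64_sketch(buf.data(), len) == XXH64(buf.data(), len, 0));
    }
    std::printf("all lengths match the reference XXH64\n");
    return 0;
}

Building with something like g++ -O2 test.cpp -lxxhash and sweeping lengths around the 32-byte stripe boundary should exercise every code path.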

@rgerganov can you give this a try and see what you think?

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Oct 24, 2025
@rgerganov (Collaborator) commented:

Thanks for the patch, I will benchmark this on several RPC setups and get back to you.
