Add fused delta(for(bitpacking)) decode kernel (unstable)#140
Add fused delta(for(bitpacking)) decode kernel (unstable)#140joseph-isaacs wants to merge 3 commits into
Conversation
Introduce `Delta::unfor_undelta_pack` (and a runtime-width `unchecked_unfor_undelta_pack` dispatcher) that decodes a delta-of-frame-of-reference-of-bitpacking stack in a single pass: for each lane the W-bit values are unpacked, the FoR `reference` is wrapping-added, and the result is accumulated against the running per-lane `base`. This fuses what were previously three passes (unpack, unfor, undelta) and two intermediate buffers into one. Covered by a new `test_unfor_undelta` round-trip that checks both the const-generic kernel and the runtime-width dispatcher against an independent unfor+undelta reference. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The fused delta(for(bitpacking)) kernel and its runtime-width dispatcher are monomorphized across every (type x bit-width), which is a meaningful `.text` cost. Put both trait methods, their impls, and the round-trip test behind a new default-off `unstable` feature so the code (and its code-size) is opt-in until the API stabilizes. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
|
|
Merging this PR will degrade performance by 23.79%
|
| Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|
| ❌ | unfor_pack_16_from_3_stack |
1.1 µs | 1.6 µs | -30.51% |
| ❌ | unchecked_unfor_pack_16_from_3_stack |
1.3 µs | 1.7 µs | -26.68% |
| ❌ | unpack_then_add_reference_16_from_3_stack |
2.1 µs | 2.6 µs | -19.25% |
| ❌ | unchecked_unpack_then_add_reference_16_from_3_stack |
2.3 µs | 2.8 µs | -17.98% |
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing claude/delta-bitpacking-fastlanes-V6mTZ (267717c) with develop (075d0dd)
The `[patch.crates-io]` previously pointed at a sibling `../fastlanes` checkout, which does not exist in CI and broke workspace resolution for every job. Point it at the pushed fastlanes branch (spiraldb/fastlanes#140) so the workspace resolves and both default and all-features builds compile. To be replaced by a published fastlanes version bump once that PR merges. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Name the feature after what it gates (the fused delta(for(bitpacking)) decode kernel) rather than the generic `unstable`. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Reference the exact fastlanes revision (spiraldb/fastlanes#140) instead of the branch for reproducibility. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Summary
Adds a fused decode kernel for a
delta(for(bitpacking))stack, gated behind a new default-offdelta_for_bitpackingfeature.Delta::unfor_undelta_pack::<LANES, W, B>decodes in a single pass over theW-bit packed buffer: for each lane it unpacks, wrapping-adds the frame-of-reference (inverting FoR), then accumulates against the running per-lane base (inverting delta). This fuses what were three passes —unpack→unfor→undelta— and removes two intermediate buffers. A runtime-width dispatcherunchecked_unfor_undelta_packselects the compile-timeW(same pattern asFoR::unchecked_unfor_pack). It reuses the existingunpack!macro, so it inherits the transposed-iteration ordering that makes delta fusion correct.Feature flag
Both trait methods, their macro impls, and the test live behind
[features] delta_for_bitpacking = [](off by default), because the kernel is monomorphized across every(type × bit-width)and that has a real.textcost (below). Downstream (Vortex) turns it on via its ownunstable_encodingsfeature.Tests
test_unfor_undeltaround-trips both the const-generic kernel and the runtime-width dispatcher against an independentundelta-of-FoR-decoded reference.cargo test --features delta_for_bitpacking,cargo clippy --features delta_for_bitpacking --all-targets, and the default (feature-off)cargo clippy+cargo fmt --checkare all clean. Real CI here (Build / MSRV / Benchmark) is green.Code-size analysis
Release
libfastlanesrlib (nm --print-size, summingt/Tsymbols):.textdelta_for_bitpackingoff (default)delta_for_bitpackingon128 = 124 width-specialized
unfor_undelta_pack(9 u8 + 17 u16 + 33 u32 + 65 u64) + 4unchecked_dispatchers. Comparable to the existingunfor_packfamily (128 / ~224 KiB). Behind a default-off feature, it costs nothing unless enabled.Performance — is the kernel optimal? (asm + A/B, measured locally)
The kernel is at parity with the shipped kernels. Microbench (best-of-30, same chunk), u32 W=11:
unfor_undelta_pack0.156 vsundelta_pack0.153 vsunfor_pack0.152 ns/elem — within noise. The fused FoR-add + undelta-add fold into the existingvpadddchain for free.Wider SIMD does not help — it regresses normal widths:
Asm confirms the AVX2 build is clean (163 insns vs SSE2's 195, full
%ymm, zero cross-lane shuffles) yet ~10% slower — it's shift/mask/add port-throughput + AVX frequency-bound, not codegen quality. Only the degenerate W=1 (trivial unpack → pure stores) benefits. So a hand-written intrinsic kernel can't beat LLVM's SSE2 autovectorization here. The real win is the fusion, not kernel micro-optimization (see the Vortex PR for end-to-end numbers).🤖 Generated with Claude Code