Add fused delta(for(bitpacking)) decode kernel (unstable) #140

Draft
joseph-isaacs wants to merge 3 commits into develop from claude/delta-bitpacking-fastlanes-V6mTZ

@joseph-isaacs joseph-isaacs commented Jun 2, 2026

Member

Summary

Adds a fused decode kernel for a delta(for(bitpacking)) stack, gated behind a new default-off delta_for_bitpacking feature.

Delta::unfor_undelta_pack::<LANES, W, B> decodes in a single pass over the W-bit packed buffer: for each lane it unpacks, wrapping-adds the frame-of-reference (inverting FoR), then accumulates against the running per-lane base (inverting delta). This fuses what were three passes (unpack, unfor, undelta) and removes two intermediate buffers. A runtime-width dispatcher, unchecked_unfor_undelta_pack, selects the compile-time W (same pattern as FoR::unchecked_unfor_pack). It reuses the existing unpack! macro, so it inherits the transposed-iteration ordering that makes delta fusion correct.
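To make the fusion concrete, here is a minimal scalar sketch of the idea. It is not the crate's kernel: it assumes a single lane and a plain sequential LSB-first bit layout rather than the FastLanes transposed layout, and every function name below (`pack`, `unpack_at`, `encode`, `unfor_undelta_fused`, `unfor_undelta_three_pass`, `unfor_undelta_dyn`) is hypothetical. The dispatcher only instantiates a few widths for illustration.

```rust
/// Mask selecting the low `w` bits of a u32 (w in 0..=32).
fn mask(w: u32) -> u32 {
    if w == 32 { u32::MAX } else { (1u32 << w) - 1 }
}

/// Pack `values` at W bits each, LSB-first, into u32 words.
/// Assumption: plain sequential layout, not the FastLanes transposed layout.
fn pack<const W: u32>(values: &[u32]) -> Vec<u32> {
    if W == 0 {
        return Vec::new();
    }
    let total_bits = values.len() * W as usize;
    let mut out = vec![0u32; (total_bits + 31) / 32];
    for (i, &v) in values.iter().enumerate() {
        let bit = i * W as usize;
        let (word, off) = (bit / 32, bit % 32);
        let shifted = ((v & mask(W)) as u64) << off;
        out[word] |= shifted as u32;
        if off + W as usize > 32 {
            // The value straddles a word boundary; spill the high bits.
            out[word + 1] |= (shifted >> 32) as u32;
        }
    }
    out
}

/// Read the i-th W-bit value back out of `packed`.
fn unpack_at<const W: u32>(packed: &[u32], i: usize) -> u32 {
    if W == 0 {
        return 0;
    }
    let bit = i * W as usize;
    let (word, off) = (bit / 32, bit % 32);
    let lo = packed[word] as u64;
    let hi = if off + W as usize > 32 { packed[word + 1] as u64 } else { 0 };
    ((((hi << 32) | lo) >> off) as u32) & mask(W)
}

/// Encoder used only to build test input: delta against `base`, subtract the
/// FoR `reference`, then bit-pack. Residuals are assumed to fit in W bits.
fn encode<const W: u32>(values: &[u32], reference: u32, base: u32) -> Vec<u32> {
    let mut prev = base;
    let residuals: Vec<u32> = values
        .iter()
        .map(|&v| {
            let r = v.wrapping_sub(prev).wrapping_sub(reference);
            prev = v;
            r
        })
        .collect();
    pack::<W>(&residuals)
}

/// Fused decode: unpack, add the reference, and accumulate in one pass.
fn unfor_undelta_fused<const W: u32>(packed: &[u32], n: usize, reference: u32, base: u32) -> Vec<u32> {
    let mut acc = base;
    (0..n)
        .map(|i| {
            let delta = unpack_at::<W>(packed, i).wrapping_add(reference);
            acc = acc.wrapping_add(delta);
            acc
        })
        .collect()
}

/// Unfused reference: three passes with two intermediate buffers.
fn unfor_undelta_three_pass<const W: u32>(packed: &[u32], n: usize, reference: u32, base: u32) -> Vec<u32> {
    let unpacked: Vec<u32> = (0..n).map(|i| unpack_at::<W>(packed, i)).collect();
    let deltas: Vec<u32> = unpacked.iter().map(|&v| v.wrapping_add(reference)).collect();
    let mut acc = base;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}

/// Runtime-width dispatcher: maps a runtime `width` to a compile-time W.
/// Only a few widths are instantiated in this sketch.
fn unfor_undelta_dyn(width: u32, packed: &[u32], n: usize, reference: u32, base: u32) -> Vec<u32> {
    match width {
        3 => unfor_undelta_fused::<3>(packed, n, reference, base),
        11 => unfor_undelta_fused::<11>(packed, n, reference, base),
        17 => unfor_undelta_fused::<17>(packed, n, reference, base),
        _ => panic!("width {width} is not instantiated in this sketch"),
    }
}
```

In this sketch the three-pass version allocates the `unpacked` and `deltas` vectors that the fused version never materializes, which is the saving the description refers to. Encoding a short sequence with `encode::<11>` and decoding it with either function or with `unfor_undelta_dyn(11, ...)` should return the original values.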

Feature flag

Both trait methods, their macro impls, and the test live behind [features] delta_for_bitpacking = [] (off by default), because the kernel is monomorphized across every (type × bit-width) and that has a real .text cost (below). Downstream (Vortex) turns it on via its own unstable_encodings feature.
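As a rough illustration of the gating pattern, the sketch below puts a trait method and its impl behind `#[cfg(feature = "delta_for_bitpacking")]`. The trait `Decode`, the type `Scalar`, and the helper `fused_kernel_compiled` are hypothetical stand-ins, not the crate's actual items; only the feature name comes from this PR.

```rust
/// Hypothetical trait standing in for the crate's decode traits.
pub trait Decode {
    /// Always compiled: add the frame-of-reference back in place.
    fn unfor(values: &mut [u32], reference: u32);

    /// Only compiled when the feature is enabled, so a default build
    /// carries no code for it.
    #[cfg(feature = "delta_for_bitpacking")]
    fn unfor_undelta(values: &mut [u32], reference: u32, base: u32);
}

pub struct Scalar;

impl Decode for Scalar {
    fn unfor(values: &mut [u32], reference: u32) {
        for v in values.iter_mut() {
            *v = v.wrapping_add(reference);
        }
    }

    #[cfg(feature = "delta_for_bitpacking")]
    fn unfor_undelta(values: &mut [u32], reference: u32, base: u32) {
        let mut acc = base;
        for v in values.iter_mut() {
            acc = acc.wrapping_add(v.wrapping_add(reference));
            *v = acc;
        }
    }
}

/// Reports whether the gated method was compiled into this build.
pub const fn fused_kernel_compiled() -> bool {
    cfg!(feature = "delta_for_bitpacking")
}
```

With the feature off, the gated method does not exist at all, so callers must guard their call sites with the same `cfg` attribute.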

Tests

test_unfor_undelta round-trips both the const-generic kernel and the runtime-width dispatcher against an independent undelta-of-FoR-decoded reference. cargo test --features delta_for_bitpacking, cargo clippy --features delta_for_bitpacking --all-targets, and the default (feature-off) cargo clippy + cargo fmt --check are all clean. Real CI here (Build / MSRV / Benchmark) is green.

Code-size analysis

Release libfastlanes rlib (nm --print-size, summing t/T symbols):

| feature | new symbols | new .text |
|---|---|---|
| delta_for_bitpacking off (default) | 0 | 0 B |
| delta_for_bitpacking on | 128 | ~254 KiB |

128 = 124 width-specialized unfor_undelta_pack (9 u8 + 17 u16 + 33 u32 + 65 u64) + 4 unchecked_ dispatchers. Comparable to the existing unfor_pack family (128 / ~224 KiB). Behind a default-off feature, it costs nothing unless enabled.
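The symbol count can be checked with a few lines of arithmetic. The sketch below assumes one instantiation per packed width W in 0..=T for a T-bit type (which matches the 9/17/33/65 split quoted above) plus one dispatcher per type; the function names are hypothetical.

```rust
/// Width-specialized kernels for one integer type with `bits` bits:
/// one per packed width W in 0..=bits (an assumption consistent with
/// the 9 / 17 / 33 / 65 counts in the description).
fn kernels_for(bits: u32) -> u32 {
    bits + 1
}

/// Returns (width-specialized kernels, kernels plus one dispatcher per type).
fn total_symbols() -> (u32, u32) {
    let type_bits = [8u32, 16, 32, 64];
    let kernels: u32 = type_bits.iter().map(|&b| kernels_for(b)).sum();
    let dispatchers = type_bits.len() as u32;
    (kernels, kernels + dispatchers)
}
```

`total_symbols()` evaluates to `(124, 128)`, matching the breakdown above.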

Performance — is the kernel optimal? (asm + A/B, measured locally)

The kernel is at parity with the shipped kernels. Microbench (best-of-30, same chunk), u32 W=11: unfor_undelta_pack 0.156 vs undelta_pack 0.153 vs unfor_pack 0.152 ns/elem — within noise. The fused FoR-add + undelta-add fold into the existing vpaddd chain for free.

Wider SIMD does not help — it regresses normal widths:

| ns/elem | SSE2 | AVX2 | AVX-512 |
|---|---|---|---|
| fused u32 W=11 | 0.156 | 0.172 | 0.172 |
| fused u64 W=17 | 0.374 | 0.415 | 0.416 |
| fused u32 W=1 | 0.157 | 0.124 | 0.119 |

Asm confirms the AVX2 build is clean (163 insns vs SSE2's 195, full %ymm, zero cross-lane shuffles) yet ~10% slower at the normal widths: the kernel is bound by shift/mask/add port throughput and AVX frequency, not by codegen quality. Only the degenerate W=1 case (trivial unpack, pure stores) benefits from wider vectors. So a hand-written intrinsic kernel can't beat LLVM's SSE2 autovectorization here. The real win is the fusion, not kernel micro-optimization (see the Vortex PR for end-to-end numbers).

🤖 Generated with Claude Code

claude added 2 commits June 2, 2026 18:06
Introduce `Delta::unfor_undelta_pack` (and a runtime-width
`unchecked_unfor_undelta_pack` dispatcher) that decodes a
delta-of-frame-of-reference-of-bitpacking stack in a single pass: for each
lane the W-bit values are unpacked, the FoR `reference` is wrapping-added, and
the result is accumulated against the running per-lane `base`. This fuses what
were previously three passes (unpack, unfor, undelta) and two intermediate
buffers into one.

Covered by a new `test_unfor_undelta` round-trip that checks both the
const-generic kernel and the runtime-width dispatcher against an independent
unfor+undelta reference.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

The fused delta(for(bitpacking)) kernel and its runtime-width dispatcher are
monomorphized across every (type x bit-width), which is a meaningful `.text`
cost. Put both trait methods, their impls, and the round-trip test behind a new
default-off `unstable` feature so the code (and its code-size) is opt-in until
the API stabilizes.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@codspeed-hq

codspeed-hq Bot commented Jun 2, 2026


Merging this PR will degrade performance by 23.79%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

❌ 4 regressed benchmarks
✅ 143 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|
| unfor_pack_16_from_3_stack | 1.1 µs | 1.6 µs | -30.51% |
| unchecked_unfor_pack_16_from_3_stack | 1.3 µs | 1.7 µs | -26.68% |
| unpack_then_add_reference_16_from_3_stack | 2.1 µs | 2.6 µs | -19.25% |
| unchecked_unpack_then_add_reference_16_from_3_stack | 2.3 µs | 2.8 µs | -17.98% |

Comparing claude/delta-bitpacking-fastlanes-V6mTZ (267717c) with develop (075d0dd)

joseph-isaacs pushed a commit to vortex-data/vortex that referenced this pull request Jun 2, 2026
The `[patch.crates-io]` previously pointed at a sibling `../fastlanes` checkout,
which does not exist in CI and broke workspace resolution for every job. Point
it at the pushed fastlanes branch (spiraldb/fastlanes#140) so the workspace
resolves and both default and all-features builds compile. To be replaced by a
published fastlanes version bump once that PR merges.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Name the feature after what it gates (the fused delta(for(bitpacking)) decode
kernel) rather than the generic `unstable`.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
joseph-isaacs pushed a commit to vortex-data/vortex that referenced this pull request Jun 2, 2026
Reference the exact fastlanes revision (spiraldb/fastlanes#140) instead of the
branch for reproducibility.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>