Add fused delta(for(bitpacking)) decode kernel (unstable) by joseph-isaacs · Pull Request #140 · spiraldb/fastlanes

joseph-isaacs · 2026-06-02T18:28:01Z

Summary

Adds a fused decode kernel for a delta(for(bitpacking)) stack, gated behind a new default-off delta_for_bitpacking feature.

Delta::unfor_undelta_pack::<LANES, W, B> decodes in a single pass over the W-bit packed buffer: for each lane it unpacks, wrapping-adds the frame-of-reference (inverting FoR), then accumulates against the running per-lane base (inverting delta). This fuses what were three passes — unpack → unfor → undelta — and removes two intermediate buffers. A runtime-width dispatcher unchecked_unfor_undelta_pack selects the compile-time W (same pattern as FoR::unchecked_unfor_pack). It reuses the existing unpack! macro, so it inherits the transposed-iteration ordering that makes delta fusion correct.
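The single-pass idea above can be sketched in scalar Rust. This is a minimal illustration under stated assumptions, not the kernel from this PR: it uses a toy contiguous LSB-first bit layout instead of the FastLanes transposed layout, assigns element `i` to lane `i % LANES`, and every function name (`pack`, `unpack_one`, `encode`, `decode_three_pass`, `decode_fused`) is hypothetical.

```rust
/// Mask for the low `W` bits (computed in u64 so W = 32 does not overflow).
fn mask<const W: usize>() -> u32 {
    ((1u64 << W) - 1) as u32
}

/// Toy packer: W-bit values stored contiguously, LSB-first, in u32 words.
/// Assumption: this is NOT the FastLanes transposed layout. Requires 1 <= W <= 32.
fn pack<const W: usize>(values: &[u32]) -> Vec<u32> {
    let mut out = vec![0u32; (values.len() * W + 31) / 32];
    for (i, &v) in values.iter().enumerate() {
        let (word, off) = ((i * W) / 32, (i * W) % 32);
        out[word] |= v << off;
        if off + W > 32 {
            out[word + 1] |= v >> (32 - off); // value straddles two words
        }
    }
    out
}

/// Extract the i-th W-bit value from the toy layout.
fn unpack_one<const W: usize>(packed: &[u32], i: usize) -> u32 {
    let (word, off) = ((i * W) / 32, (i * W) % 32);
    let mut v = packed[word] >> off;
    if off + W > 32 {
        v |= packed[word + 1] << (32 - off);
    }
    v & mask::<W>()
}

/// Hypothetical encoder: per-lane delta, then frame-of-reference, then bitpack.
fn encode<const LANES: usize, const W: usize>(values: &[u32], base: [u32; LANES]) -> (Vec<u32>, u32) {
    let mut prev = base;
    let deltas: Vec<u32> = values.iter().enumerate().map(|(i, &v)| {
        let d = v.wrapping_sub(prev[i % LANES]);
        prev[i % LANES] = v;
        d
    }).collect();
    let reference = deltas.iter().copied().min().unwrap_or(0);
    let residuals: Vec<u32> = deltas.iter().map(|d| d - reference).collect();
    assert!(residuals.iter().all(|&r| r <= mask::<W>()), "residual does not fit in W bits");
    (pack::<W>(&residuals), reference)
}

/// Three passes with two intermediate buffers: unpack, then unfor, then undelta.
fn decode_three_pass<const LANES: usize, const W: usize>(
    packed: &[u32], n: usize, reference: u32, base: [u32; LANES],
) -> Vec<u32> {
    let unpacked: Vec<u32> = (0..n).map(|i| unpack_one::<W>(packed, i)).collect();
    let deltas: Vec<u32> = unpacked.iter().map(|r| r.wrapping_add(reference)).collect();
    let mut acc = base;
    deltas.iter().enumerate().map(|(i, d)| {
        acc[i % LANES] = acc[i % LANES].wrapping_add(*d);
        acc[i % LANES]
    }).collect()
}

/// One pass, no intermediate buffers: unpack, wrapping-add the reference,
/// then accumulate against the running per-lane base.
fn decode_fused<const LANES: usize, const W: usize>(
    packed: &[u32], n: usize, reference: u32, base: [u32; LANES],
) -> Vec<u32> {
    let mut acc = base;
    let mut out = Vec::with_capacity(n);
    for i in 0..n {
        let lane = i % LANES;
        let delta = unpack_one::<W>(packed, i).wrapping_add(reference);
        acc[lane] = acc[lane].wrapping_add(delta);
        out.push(acc[lane]);
    }
    out
}
```

Both decoders should produce the same output for any input produced by `encode`; the fused version simply never materializes the unpacked or FoR-decoded buffers.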

Feature flag

Both trait methods, their macro impls, and the test live behind [features] delta_for_bitpacking = [] (off by default), because the kernel is monomorphized across every (type × bit-width) and that has a real .text cost (below). Downstream (Vortex) turns it on via its own unstable_encodings feature.
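As a rough illustration of this kind of gating, the sketch below shows how `#[cfg(feature = "...")]` removes a trait method and its impl from the build when the feature is off. The trait `Decode`, the type `Scalar`, and the helper `fused_kernel_enabled` are hypothetical stand-ins, not the real fastlanes `Delta` trait or its macro-generated impls.

```rust
/// Hypothetical stand-in for a decode trait; not the real fastlanes `Delta`.
pub trait Decode {
    /// Always compiled.
    fn undelta(base: u32, deltas: &[u32]) -> Vec<u32>;

    /// Only compiled when the crate is built with
    /// `--features delta_for_bitpacking`; otherwise this method does not exist
    /// and contributes no code.
    #[cfg(feature = "delta_for_bitpacking")]
    fn unfor_undelta(base: u32, reference: u32, residuals: &[u32]) -> Vec<u32>;
}

pub struct Scalar;

impl Decode for Scalar {
    fn undelta(base: u32, deltas: &[u32]) -> Vec<u32> {
        let mut acc = base;
        deltas.iter().map(|d| { acc = acc.wrapping_add(*d); acc }).collect()
    }

    #[cfg(feature = "delta_for_bitpacking")]
    fn unfor_undelta(base: u32, reference: u32, residuals: &[u32]) -> Vec<u32> {
        let mut acc = base;
        residuals
            .iter()
            .map(|r| { acc = acc.wrapping_add(r.wrapping_add(reference)); acc })
            .collect()
    }
}

/// True only when the feature was enabled at compile time.
pub const fn fused_kernel_enabled() -> bool {
    cfg!(feature = "delta_for_bitpacking")
}
```

Because the gated items are removed before monomorphization, a default build pays no `.text` cost for them.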

Tests

test_unfor_undelta round-trips both the const-generic kernel and the runtime-width dispatcher against an independent undelta-of-FoR-decoded reference. cargo test --features delta_for_bitpacking, cargo clippy --features delta_for_bitpacking --all-targets, and the default (feature-off) cargo clippy + cargo fmt --check are all clean. Real CI here (Build / MSRV / Benchmark) is green.

Code-size analysis

Release libfastlanes rlib (nm --print-size, summing t/T symbols):

| Feature | New symbols | New `.text` |
|---|---|---|
| `delta_for_bitpacking` off (default) | 0 | 0 B |
| `delta_for_bitpacking` on | 128 | ~254 KiB |

128 = 124 width-specialized unfor_undelta_pack (9 u8 + 17 u16 + 33 u32 + 65 u64) + 4 unchecked_ dispatchers. Comparable to the existing unfor_pack family (128 / ~224 KiB). Behind a default-off feature, it costs nothing unless enabled.
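The symbol count can be reproduced with a few lines of arithmetic. The sketch assumes one kernel instantiation per bit-width `W` in `0..=T::BITS` for each element type, which is an inference from the per-type counts (9, 17, 33, 65) rather than something read from the source; `kernels_for` and `symbol_count` are hypothetical helper names.

```rust
/// Kernels for one element type, assuming one instantiation per bit-width
/// W in 0..=type_bits (inferred from the breakdown, not confirmed in source).
fn kernels_for(type_bits: u32) -> u32 {
    type_bits + 1
}

/// Returns (width-specialized kernels, kernels plus one dispatcher per type).
fn symbol_count() -> (u32, u32) {
    let types = [u8::BITS, u16::BITS, u32::BITS, u64::BITS];
    let kernels: u32 = types.iter().map(|&bits| kernels_for(bits)).sum();
    let dispatchers = types.len() as u32;
    (kernels, kernels + dispatchers)
}
```

Under that assumption, `symbol_count()` yields 124 kernels and 128 symbols in total, matching the figures above.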

Performance — is the kernel optimal? (asm + A/B, measured locally)

The kernel is at parity with the shipped kernels. Microbench (best-of-30, same chunk), u32 W=11: unfor_undelta_pack 0.156 vs undelta_pack 0.153 vs unfor_pack 0.152 ns/elem — within noise. The fused FoR-add + undelta-add fold into the existing vpaddd chain for free.
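For readers who want to reproduce a best-of-N measurement of this kind, a minimal timing harness might look like the sketch below. It is an assumption about the general approach, not the harness used for the numbers above; `best_of_ns_per_elem` and the toy `undelta` workload are hypothetical, and a serious microbenchmark would repeat the kernel many times per run to get below timer resolution.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Run `f` `runs` times and return the fastest run, in nanoseconds per element.
fn best_of_ns_per_elem<F: FnMut()>(runs: usize, elems: usize, mut f: F) -> f64 {
    let mut best = f64::INFINITY;
    for _ in 0..runs {
        let start = Instant::now();
        f();
        let ns = start.elapsed().as_nanos() as f64;
        best = best.min(ns / elems as f64);
    }
    best
}

/// Toy workload: single-lane undelta (running wrapping sum) into `out`.
fn undelta(base: u32, deltas: &[u32], out: &mut [u32]) {
    let mut acc = base;
    for (o, d) in out.iter_mut().zip(deltas) {
        acc = acc.wrapping_add(*d);
        *o = acc;
    }
}

/// Example measurement: best of 30 runs over the same 1024-element chunk.
fn measure_undelta() -> f64 {
    let deltas = vec![1u32; 1024];
    let mut out = vec![0u32; 1024];
    best_of_ns_per_elem(30, deltas.len(), || {
        // black_box keeps the optimizer from hoisting or deleting the work.
        undelta(0, black_box(&deltas), &mut out);
        black_box(&out);
    })
}
```

Taking the minimum over runs filters out scheduling noise, which is why differences of a few thousandths of a nanosecond per element are best read as parity.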

Wider SIMD does not help — it regresses normal widths:

| Kernel (ns/elem) | SSE2 | AVX2 | AVX-512 |
|---|---|---|---|
| fused u32 W=11 | 0.156 | 0.172 | 0.172 |
| fused u64 W=17 | 0.374 | 0.415 | 0.416 |
| fused u32 W=1 | 0.157 | 0.124 | 0.119 |

The asm confirms the AVX2 build is clean (163 instructions vs. 195 for SSE2, full %ymm usage, zero cross-lane shuffles), yet it is ~10% slower. The kernel is bound by shift/mask/add port throughput and AVX frequency scaling, not by codegen quality. Only the degenerate W=1 case (trivial unpack → pure stores) benefits from wider vectors. A hand-written intrinsic kernel is therefore unlikely to beat LLVM's SSE2 autovectorization here. The real win is the fusion, not kernel micro-optimization (see the Vortex PR for end-to-end numbers).

🤖 Generated with Claude Code

Introduce `Delta::unfor_undelta_pack` (and a runtime-width `unchecked_unfor_undelta_pack` dispatcher) that decodes a delta-of-frame-of-reference-of-bitpacking stack in a single pass: for each lane the W-bit values are unpacked, the FoR `reference` is wrapping-added, and the result is accumulated against the running per-lane `base`. This fuses what were previously three passes (unpack, unfor, undelta) and two intermediate buffers into one. Covered by a new `test_unfor_undelta` round-trip that checks both the const-generic kernel and the runtime-width dispatcher against an independent unfor+undelta reference. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

The fused delta(for(bitpacking)) kernel and its runtime-width dispatcher are monomorphized across every (type x bit-width), which is a meaningful `.text` cost. Put both trait methods, their impls, and the round-trip test behind a new default-off `unstable` feature so the code (and its code-size) is opt-in until the API stabilizes. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

CLAassistant · 2026-06-02T18:28:09Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

codspeed-hq · 2026-06-02T18:30:31Z

Merging this PR will degrade performance by 23.79%

⚠️ Different runtime environments detected: some benchmarks with significant performance changes were compared across different runtime environments, which may affect the accuracy of the results.


❌ 4 regressed benchmarks
✅ 143 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Status | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|
| ❌ | `unfor_pack_16_from_3_stack` | 1.1 µs | 1.6 µs | -30.51% |
| ❌ | `unchecked_unfor_pack_16_from_3_stack` | 1.3 µs | 1.7 µs | -26.68% |
| ❌ | `unpack_then_add_reference_16_from_3_stack` | 2.1 µs | 2.6 µs | -19.25% |
| ❌ | `unchecked_unpack_then_add_reference_16_from_3_stack` | 2.3 µs | 2.8 µs | -17.98% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

Comparing `claude/delta-bitpacking-fastlanes-V6mTZ` (267717c) with `develop` (075d0dd)

The `[patch.crates-io]` previously pointed at a sibling `../fastlanes` checkout, which does not exist in CI and broke workspace resolution for every job. Point it at the pushed fastlanes branch (spiraldb/fastlanes#140) so the workspace resolves and both default and all-features builds compile. To be replaced by a published fastlanes version bump once that PR merges. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Name the feature after what it gates (the fused delta(for(bitpacking)) decode kernel) rather than the generic `unstable`. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Reference the exact fastlanes revision (spiraldb/fastlanes#140) instead of the branch for reproducibility. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

claude added 2 commits June 2, 2026 18:06

joseph-isaacs mentioned this pull request Jun 2, 2026

Fused delta(for(bitpacking)) decode (unstable_encodings) vortex-data/vortex#8224

Closed

Rename feature unstable -> delta_for_bitpacking

267717c


joseph-isaacs wants to merge 3 commits into `develop` from `claude/delta-bitpacking-fastlanes-V6mTZ`
