Fused delta(for(bitpacking)) decode (unstable_encodings) #8224

Closed
joseph-isaacs wants to merge 7 commits into develop from claude/delta-bitpacking-fastlanes-V6mTZ


joseph-isaacs commented Jun 2, 2026


Summary

Wires the new fastlanes::Delta::unfor_undelta_pack fused kernel into delta decompression, behind a new default-off unstable_encodings feature.

When a DeltaArray's deltas child is a FoR array (unsigned reference) wrapping a BitPacked array stored as full, zero-offset chunks with no patches, delta_decompress takes a fully fused fast path (try_fused_for_bitpacking -> decompress_fused): each chunk is unpacked, FoR-decoded, and un-delta'd in a single pass before untransposing. Every other shape (signed reference, patches, sliced bit-packing) falls back to the existing generic path unchanged.
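The eligibility rule above can be sketched as a shape check on the encoding tree. This is a minimal sketch, not the Vortex implementation: `Encoded`, `FusedPlan`, and `try_fused` are hypothetical names, and the three boolean/offset fields stand in for whatever metadata the real arrays expose.

```rust
/// Hypothetical, simplified model of an encoding tree (not Vortex's real array types).
#[derive(Debug, Clone, PartialEq)]
pub enum Encoded {
    /// Bit-packed unsigned values.
    BitPacked {
        bit_width: u8,
        /// Offset of the first logical value inside the first chunk (non-zero after slicing).
        offset: usize,
        /// True when the data consists only of complete chunks.
        full_chunks: bool,
        /// True when exception values ("patches") are stored on the side.
        has_patches: bool,
    },
    /// Frame-of-reference wrapping a child encoding.
    For {
        reference_is_unsigned: bool,
        child: Box<Encoded>,
    },
    /// Any other encoding.
    Other,
}

/// What a fused kernel would need to know (assumed for this sketch).
#[derive(Debug, PartialEq)]
pub struct FusedPlan {
    pub bit_width: u8,
}

/// Returns `Some` only for FoR(unsigned) over BitPacked(full, zero-offset, no patches).
/// Every other shape returns `None`, which the caller treats as "use the generic path".
pub fn try_fused(deltas: &Encoded) -> Option<FusedPlan> {
    let Encoded::For { reference_is_unsigned: true, child } = deltas else {
        return None;
    };
    match &**child {
        Encoded::BitPacked {
            bit_width,
            offset: 0,
            full_chunks: true,
            has_patches: false,
        } => Some(FusedPlan { bit_width: *bit_width }),
        _ => None,
    }
}
```

Returning `Option` keeps the fallback explicit: a `None` never changes results, it only routes the array to the generic decode.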

Note

Depends on spiraldb/fastlanes#140 (the delta_for_bitpacking kernel). This branch carries a temporary [patch.crates-io] pinning fastlanes to rev 267717cd72e8b6f0ed0e5321ae3fc785fa433058. It must be replaced by a published fastlanes version bump before merge. Until then, Rust publish dry-run and Rust build (all-features) are expected to be red, because crates.io fastlanes 0.5.0 has no delta_for_bitpacking feature. This is a standard stacked cross-repo PR: merge and release fastlanes first.

Feature flag

  • vortex-fastlanes: new unstable_encodings = ["fastlanes/delta_for_bitpacking"]. The fused path, its imports, the round-trip test, and the bench are all #[cfg(feature = "unstable_encodings")].
  • vortex-btrblocks's existing unstable_encodings feature propagates vortex-fastlanes/unstable_encodings.

With the feature off (default) the kernel and fast path are compiled out — no behavior or code-size change on the default build.
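As an illustration of the gating described in this section, the following sketch shows how a fast path can be compiled out entirely when a Cargo feature is off. The function names and bodies are hypothetical stand-ins; only the feature name `unstable_encodings` comes from this PR.

```rust
/// Generic path, always compiled: prefix sum over already-decoded deltas.
pub fn decode_generic(base: u32, deltas: &[u32]) -> Vec<u32> {
    let mut acc = base;
    deltas
        .iter()
        .map(|d| {
            acc = acc.wrapping_add(*d);
            acc
        })
        .collect()
}

/// Stand-in for the fused path. It exists in the binary only when the feature is on.
#[cfg(feature = "unstable_encodings")]
fn try_fused(base: u32, deltas: &[u32]) -> Option<Vec<u32>> {
    // A real fast path would inspect the encoding shape and may return None.
    Some(decode_generic(base, deltas))
}

/// Entry point: tries the fused path when it is compiled in, else falls back.
pub fn decode(base: u32, deltas: &[u32]) -> Vec<u32> {
    // The whole block disappears from the default build, so there is no
    // runtime branch and no extra code when the feature is off.
    #[cfg(feature = "unstable_encodings")]
    {
        if let Some(out) = try_fused(base, deltas) {
            return out;
        }
    }
    decode_generic(base, deltas)
}
```

The `#[cfg]` attribute sits on a block rather than directly on the `if`, because attributes are not accepted on `if` expressions.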

Tests

fused_for_bitpacking_roundtrip builds the stack from non-strictly-increasing u32/u64 columns, asserts the fused path is actually taken (not a silent fallback), and round-trips. cargo test -p vortex-fastlanes --lib delta:: (61 tests) passes; cargo clippy --all-targets --all-features, the default lib build, and nightly fmt --check are clean. The compat suite passes 35/35.
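The shape of such a test, asserting both the route and the round-trip, might look like the sketch below. It uses a toy delta + frame-of-reference encoding over plain vectors; `DecodePath`, `DeltaFor`, `encode`, and `decode` are hypothetical and do not correspond to Vortex types.

```rust
/// Which decode route produced the output (hypothetical, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DecodePath {
    Fused,
    Generic,
}

/// Toy delta + frame-of-reference encoding of a column.
#[derive(Debug, Clone)]
pub struct DeltaFor {
    pub len: usize,
    pub base: u64,           // first value of the column
    pub reference: u64,      // minimum delta, subtracted from every delta
    pub residuals: Vec<u64>, // delta - reference
    pub has_patches: bool,   // stands in for "shape not eligible for fusion"
}

pub fn encode(values: &[u64]) -> DeltaFor {
    let base = values.first().copied().unwrap_or(0);
    let deltas: Vec<u64> = values.windows(2).map(|w| w[1].wrapping_sub(w[0])).collect();
    let reference = deltas.iter().copied().min().unwrap_or(0);
    let residuals = deltas.iter().map(|d| d - reference).collect();
    DeltaFor { len: values.len(), base, reference, residuals, has_patches: false }
}

/// Decodes and reports which route ran, so a test can detect a silent fallback.
pub fn decode(enc: &DeltaFor) -> (Vec<u64>, DecodePath) {
    let path = if enc.has_patches { DecodePath::Generic } else { DecodePath::Fused };
    let mut out = Vec::with_capacity(enc.len);
    if enc.len == 0 {
        return (out, path);
    }
    let mut acc = enc.base;
    out.push(acc);
    match path {
        DecodePath::Fused => {
            // Single pass: undo FoR and delta together.
            for r in &enc.residuals {
                acc = acc.wrapping_add(r.wrapping_add(enc.reference));
                out.push(acc);
            }
        }
        DecodePath::Generic => {
            // Two passes: materialize the deltas, then prefix-sum them.
            let deltas: Vec<u64> =
                enc.residuals.iter().map(|r| r.wrapping_add(enc.reference)).collect();
            for d in deltas {
                acc = acc.wrapping_add(d);
                out.push(acc);
            }
        }
    }
    (out, path)
}
```

A test over a column with repeats, such as `[3, 3, 5, 5, 5, 9]`, would assert `path == DecodePath::Fused` before comparing the decoded values, so that a regression in the eligibility check fails loudly instead of passing via the fallback.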

Performance — fused vs the real current Vortex decode

benches/delta_for_bitpack.rs A/Bs the real decode entry points on the same array: fused = delta_decompress (fast path) vs current = delta_decompress_generic (the pre-fusion path Vortex uses today). Cold each iteration, fastest time:

| case | current Vortex | fused | speedup |
| --- | --- | --- | --- |
| u32, 64 Ki | 146 µs | 32.0 µs | 4.6× |
| u32, 1 Mi | 3.34 ms | 600 µs | 5.6× |
| u64, 64 Ki | 81.9 µs | 44.3 µs | 1.85× |
| u64, 1 Mi | 6.88 ms | 1.00 ms | 6.9× |
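For readers unfamiliar with the "cold each iteration, fastest time" methodology, here is a std-only sketch of the idea. It assumes "cold" means every iteration starts from a freshly built input with nothing cached from the previous run. The actual bench uses divan, and `fastest_cold` is a hypothetical helper.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `decode` `iters` times and returns the fastest observed wall time.
/// `setup` rebuilds the input before every iteration and is not timed, so no
/// decoded intermediate can be reused between runs.
/// Returns `Duration::MAX` when `iters` is 0.
pub fn fastest_cold<I, O>(
    iters: usize,
    mut setup: impl FnMut() -> I,
    mut decode: impl FnMut(I) -> O,
) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..iters {
        let input = setup(); // fresh input, outside the timed region
        let start = Instant::now();
        let out = decode(input);
        let elapsed = start.elapsed();
        black_box(out); // keep the optimizer from discarding the work
        best = best.min(elapsed);
    }
    best
}
```

Both arms of an A/B comparison would be passed through the same helper with the same `setup`, so that neither side benefits from state left behind by the other.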

The win is eliminating the intermediate FoR-decoded PrimitiveArray materialization (+ its validity mask + a second allocation/pass), not the kernel itself: the kernel is ~0.16 ns/elem while current spends ~3.3 ns/elem, i.e. ~95% of the current path is array machinery.
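The difference between the two routes can be sketched in scalar code. This is a simplified model under stated assumptions: values are packed sequentially, LSB first, into 64-bit words, whereas the real kernel works on FastLanes' transposed chunk layout and is not reproduced here. All function names are hypothetical.

```rust
/// Reads the `i`-th `width`-bit value from LSB-first packed 64-bit words.
/// Supports widths 1..=63 (sequential layout, not the FastLanes layout).
fn unpack_at(words: &[u64], width: u32, i: usize) -> u64 {
    debug_assert!((1..=63).contains(&width));
    let bit = i * width as usize;
    let (word, shift) = (bit / 64, (bit % 64) as u32);
    let mut v = words[word] >> shift;
    if shift + width > 64 {
        // The value straddles a word boundary: take the high bits from the next word.
        v |= words[word + 1] << (64 - shift);
    }
    v & ((1u64 << width) - 1)
}

/// Staged decode: each step materializes a full intermediate buffer.
pub fn decode_staged(words: &[u64], width: u32, n: usize, reference: u64, base: u64) -> Vec<u64> {
    // Pass 1: bit-unpack into a buffer.
    let unpacked: Vec<u64> = (0..n).map(|i| unpack_at(words, width, i)).collect();
    // Pass 2: FoR decode into a second buffer (the materialization being avoided).
    let deltas: Vec<u64> = unpacked.iter().map(|r| r.wrapping_add(reference)).collect();
    // Pass 3: undo the delta encoding with a running sum.
    let mut acc = base;
    deltas
        .iter()
        .map(|d| {
            acc = acc.wrapping_add(*d);
            acc
        })
        .collect()
}

/// Fused decode: unpack, add the reference, and accumulate in one pass,
/// allocating only the output.
pub fn decode_fused(words: &[u64], width: u32, n: usize, reference: u64, base: u64) -> Vec<u64> {
    let mut acc = base;
    (0..n)
        .map(|i| {
            acc = acc.wrapping_add(unpack_at(words, width, i).wrapping_add(reference));
            acc
        })
        .collect()
}
```

Both functions return the same values; the fused one simply never writes the unpacked or FoR-decoded data back to memory.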

Is the kernel itself optimal? (asm)

Yes — measured locally. The fused kernel is at parity with the shipped unfor_pack/undelta_pack (within ~3%), and wider SIMD regresses realistic widths (AVX2/AVX-512 ~10% slower than SSE2; asm is clean %ymm, zero shuffles — it's port-throughput/frequency-bound, not codegen). Details in spiraldb/fastlanes#140.

Code-size analysis

The kernel is monomorphized per (type × bit-width). Release libfastlanes rlib:

| unstable_encodings | new symbols | new .text |
| --- | --- | --- |
| off (default) | 0 | 0 B |
| on | 128 | ~254 KiB |

Fully opt-in via the feature.
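To illustrate why per-width monomorphization costs code size, the sketch below uses a const generic for the bit width and a runtime dispatch over it. It is a hypothetical miniature, not the fastlanes kernel: only four widths and a single integer type are shown.

```rust
/// One copy of this function is compiled for every `W` that is instantiated,
/// which lets the shift and mask become compile-time constants.
fn unpack_one<const W: u32>(word: u64, i: u32) -> u64 {
    (word >> (i * W)) & ((1u64 << W) - 1)
}

/// Runtime dispatch to the matching instantiation. Each arm forces a separate
/// symbol into the binary, so code size grows with the number of widths
/// (and, in a real kernel, with the number of element types as well).
pub fn unpack_dyn(width: u32, word: u64, i: u32) -> Option<u64> {
    Some(match width {
        1 => unpack_one::<1>(word, i),
        2 => unpack_one::<2>(word, i),
        4 => unpack_one::<4>(word, i),
        8 => unpack_one::<8>(word, i),
        _ => return None, // width not instantiated in this sketch
    })
}
```

Gating the dispatch behind a feature removes every instantiation at once, which is consistent with the 0 B figure for the default build.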

🤖 Generated with Claude Code

claude added 2 commits June 2, 2026 18:07
Wire the new `fastlanes::Delta::unfor_undelta_pack` kernel into delta
decompression. When a DeltaArray's `deltas` child is a FoR array (unsigned
reference) wrapping a BitPacked array stored as full, zero-offset chunks with
no patches, `delta_decompress` now takes a fully fused fast path
(`try_fused_for_bitpacking` -> `decompress_fused`) that unpacks, applies the
frame-of-reference, and inverts the delta encoding in a single pass per chunk
before untransposing. All other shapes fall back to the existing generic path.

A round-trip test builds the stack from non-strictly-increasing (monotone
non-decreasing) u32/u64 columns and asserts the fused path is actually taken.

The `delta_for_bitpack` divan bench compares the fused decode against an
unfused baseline (materialize the FoR(bitpacked) deltas, then generic delta
decode). On non-decreasing columns the fused path is ~1.3-2.0x faster, with the
gap widening at larger sizes and for u64.

A local-dev `[patch.crates-io]` points fastlanes at the sibling checkout that
carries the kernel; it would be replaced by a published version bump.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Put the fused decode fast path (`try_fused_for_bitpacking` /
`decompress_fused`), its imports, the round-trip test, and the bench behind a
new `unstable_encodings` feature on vortex-fastlanes that enables
`fastlanes/unstable`. With the feature off (the default) the kernel is compiled
out entirely, so there is no `.text` cost; vortex-btrblocks' existing
`unstable_encodings` feature now propagates it.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
joseph-isaacs added the changelog/performance label (A performance improvement) Jun 2, 2026, with Claude
claude added 2 commits June 2, 2026 18:33
The `[patch.crates-io]` previously pointed at a sibling `../fastlanes` checkout,
which does not exist in CI and broke workspace resolution for every job. Point
it at the pushed fastlanes branch (spiraldb/fastlanes#140) so the workspace
resolves and both default and all-features builds compile. To be replaced by a
published fastlanes version bump once that PR merges.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Split the combined `use` statements into one item per line and regroup, matching
the repo's nightly rustfmt config (imports_granularity = "Item",
group_imports = "StdExternalCrate"). No functional change.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
codspeed-hq Bot commented Jun 2, 2026


Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

✅ 1275 untouched benchmarks
🆕 8 new benchmarks

Performance Changes

| | Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- | --- |
| 🆕 | Simulation | current_u64[65536] | N/A | 654 µs | N/A |
| 🆕 | Simulation | fused_u32[65536] | N/A | 235.3 µs | N/A |
| 🆕 | Simulation | fused_u64[65536] | N/A | 378 µs | N/A |
| 🆕 | Simulation | fused_u64[1048576] | N/A | 5.7 ms | N/A |
| 🆕 | Simulation | fused_u32[1048576] | N/A | 3.5 ms | N/A |
| 🆕 | Simulation | current_u32[1048576] | N/A | 5.6 ms | N/A |
| 🆕 | Simulation | current_u64[1048576] | N/A | 13.5 ms | N/A |
| 🆕 | Simulation | current_u32[65536] | N/A | 379.2 µs | N/A |

Comparing claude/delta-bitpacking-fastlanes-V6mTZ (1565f71) with develop (81046d7)


claude added 3 commits June 2, 2026 20:05
Point `unstable_encodings` at `fastlanes/delta_for_bitpacking` and bump the
patched fastlanes git revision accordingly.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Expose `delta_decompress` / `delta_decompress_generic` under the `_test-harness`
feature and rewrite the bench so both arms call the real decode entry points on
the identical delta(for(bitpacking)) array: `fused` (the unfor_undelta_pack fast
path) vs `current` (the pre-fusion generic decode). The previous baseline reused
a cached intermediate and understated the gap; the cold-vs-cold comparison shows
~4.6x (u32 64Ki) to ~6.9x (u64 1Mi), dominated by avoiding the intermediate
FoR-decoded PrimitiveArray materialization rather than kernel speed.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Reference the exact fastlanes revision (spiraldb/fastlanes#140) instead of the
branch for reproducibility.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@github-actions


This PR has been marked as stale because it has been open for 14 days with no activity. Please comment or remove the stale label if you wish to keep it active; otherwise it will be closed in 7 days.

github-actions Bot added the stale label (This PR is stale and will be auto-closed soon) Jun 18, 2026
@github-actions


This PR was closed because it has been inactive for 7 days since being marked as stale.

github-actions Bot closed this Jun 26, 2026