fastlanes: bit-packed compare-constant fast path + bitpack_constant kernel by joseph-isaacs · Pull Request #8013 · vortex-data/vortex

joseph-isaacs · 2026-05-18T17:04:24Z

Summary

Stacked on #8012. Speeds up the bitpack_compare bench from the parent PR with two complementary optimizations driven by the same observation — a bit-packed lane holds values in [0, 2^bit_width - 1], so a constant outside that range can be answered analytically without touching the packed buffer.

Compare-constant fast path (`compute/compare.rs`)

Register a CompareKernel for BitPacked that short-circuits when the RHS constant c is outside [0, 2^bit_width - 1]. For each operator the answer is a constant boolean modulo patches and validity:

Operator	Outside range result
`Eq` / `NotEq`	`false` / `true` everywhere
`Lt` / `Lte` / `Gt` / `Gte`	the same `true` or `false` everywhere, fixed by which side of the range `c` falls on

Detecting the range is an O(1) i128 check via the new BitPackedData::value_fits_bit_width helper. With no patches and no nulls the kernel returns a ConstantArray<bool> (also O(1)); otherwise it allocates a BitBuffer, fills it with the constant result, and overlays the per-position outcome at each patch index. In-range constants fall through to the canonical decompress + Arrow compare path; tests exercise both fall-throughs.
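A minimal sketch of the out-of-range decision described above, ignoring patches and validity. The `Operator` enum, the free-function form of `value_fits_bit_width`, and `out_of_range_result` are hypothetical stand-ins for illustration; the real helper lives on `BitPackedData` and its signature is not shown in this PR description. The comparison is read as `packed_value op c`.

```rust
/// The six comparison operators named in the table (illustrative enum).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Operator {
    Eq,
    NotEq,
    Lt,
    Lte,
    Gt,
    Gte,
}

/// Hypothetical stand-in for the range check: is `c` in [0, 2^bit_width - 1]?
/// Widening to i128 lets one check cover every signed and unsigned 64-bit constant.
fn value_fits_bit_width(c: i128, bit_width: u32) -> bool {
    debug_assert!(bit_width <= 64);
    c >= 0 && c < (1i128 << bit_width)
}

/// Returns `Some(answer)` when every packed value compares the same way
/// against `c`, and `None` when `c` is in range and the slow path is needed.
fn out_of_range_result(op: Operator, c: i128, bit_width: u32) -> Option<bool> {
    if value_fits_bit_width(c, bit_width) {
        return None;
    }
    // Out of range means either below zero or above the largest packable value.
    let below = c < 0;
    Some(match op {
        Operator::Eq => false,
        Operator::NotEq => true,
        // value < c and value <= c hold only when c sits above the range.
        Operator::Lt | Operator::Lte => !below,
        // value > c and value >= c hold only when c sits below the range.
        Operator::Gt | Operator::Gte => below,
    })
}
```

With `bit_width = 4` the packable range is `[0, 15]`, so `c = 16` (the `1 << BW` constant the bench uses) answers `Eq` with `false` and `Lt` with `true` without reading the buffer.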

`bitpack_constant` analytical encoder (`array/bitpack_compress.rs`)

Add a constant-only pack kernel that builds the FastLanes bit pattern for a [constant; len] input without calling BitPacking::pack. For constant input every lane produces the same bit_width output words, so we compute those words analytically: bit j of output word k (0 ≤ k < bit_width, 0 ≤ j < T_bits, where T_bits is the bit size of the element type) is bit (k * T_bits + j) mod bit_width of c. We then memset each word LANES times into a stack chunk template and memcpy the template into every full chunk. The standard packer is only invoked for the partial tail (zero-padded past len). bitpack_encode_constant wraps the buffer up as a BitPackedArray. A bitwise-equivalence rstest covers byte-identity with BitPacking::pack across lengths, widths, and constants.
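The per-lane formula can be checked in isolation. The sketch below assumes `u32` elements (so `T_bits = 32`) and models a single lane as a plain LSB-first bit stream of 32 values; the function names are hypothetical and this is not the PR's kernel, only an illustration of why the analytical words match what sequential packing of a constant would produce.

```rust
const T_BITS: usize = 32;

/// Analytical form: bit `j` of output word `k` is bit
/// `(k * T_BITS + j) % bit_width` of the constant `c`.
fn constant_lane_words(c: u32, bit_width: usize) -> Vec<u32> {
    assert!((1..=T_BITS).contains(&bit_width));
    (0..bit_width)
        .map(|k| {
            let mut word = 0u32;
            for j in 0..T_BITS {
                let src = (k * T_BITS + j) % bit_width;
                word |= ((c >> src) & 1) << j;
            }
            word
        })
        .collect()
}

/// Reference: write 32 copies of `c`, `bit_width` bits each, one after
/// another into the lane's bit stream (assumed layout for one lane).
fn naive_lane_words(c: u32, bit_width: usize) -> Vec<u32> {
    assert!((1..=T_BITS).contains(&bit_width));
    let mut words = vec![0u32; bit_width];
    for row in 0..T_BITS {
        for b in 0..bit_width {
            let pos = row * bit_width + b;
            words[pos / T_BITS] |= ((c >> b) & 1) << (pos % T_BITS);
        }
    }
    words
}
```

For example, `c = 1` at `bit_width = 2` yields two words of alternating bits, since every 2-bit slot holds `01`. Because all lanes see the same constant, the same `bit_width` words can be replicated across lanes, which is what makes the memset and memcpy template approach possible.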

Benches

bitpack_compare (added in #8012, "bench: bit-packed compare-constant baseline") now exercises the fast path on this branch; at bit_width ∈ {4, 16}, len ∈ {1024, 65536} it runs in ~1.4–1.5 µs vs 8–125 µs for the decompress + Arrow baseline.
New bitpack_constant bench compares the analytical kernel against the full bitpack_encode pipeline on uniform-constant input; at 64 K u32 elements the analytical kernel is roughly 23–62× faster.

Plan doc (`docs/inrange_compare_plan.md`)

Documents the follow-up plan to accelerate in-range ordering comparisons: compare the packed array against the packed constant via SWAR less-than per supported bit width (Routes A/B/C, including Knuth broadword with rotation tables for widths that straddle word boundaries), derive the four ordering operators from one Lt primitive, and benchmark against the canonical SIMD baseline before landing.
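For readers unfamiliar with SWAR comparisons, here is a minimal sketch of a per-field unsigned less-than on a `u64`. It is a generic textbook-style formulation, not the plan doc's Routes A/B/C, and it assumes fields that tile the word exactly (for instance eight 8-bit fields); `swar_lt` and the mask constants are names introduced here for illustration.

```rust
/// Mask with the top bit of every 8-bit field set.
const H8: u64 = 0x8080_8080_8080_8080;
/// Mask with the top bit of every 4-bit field set.
const H4: u64 = 0x8888_8888_8888_8888;

/// Per-field unsigned `x < y`. `h` has exactly the top bit of each field set.
/// The result has a field's top bit set iff that field of `x` is less than
/// the same field of `y`; all other bits are zero.
fn swar_lt(x: u64, y: u64, h: u64) -> u64 {
    let low = !h;
    // Force each field's top bit in the minuend so the subtraction of the low
    // parts never borrows across fields. The top bit of `d` then survives
    // exactly when x_low >= y_low.
    let d = ((x & low) | h).wrapping_sub(y & low);
    // x < y iff (x_top = 0 and y_top = 1), or the top bits match and x_low < y_low.
    ((!x & y) | (!(x ^ y) & !d)) & h
}
```

The other three ordering operators follow from this one primitive: `Gt` swaps the arguments, and `Gte` / `Lte` are the field-wise complements of `Lt` / `Gt` under the mask `h`.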

Test plan

cargo nextest run -p vortex-fastlanes --all-features → 265/265 pass locally
cargo check -p vortex-fastlanes --benches --all-features
cargo bench -p vortex-fastlanes --bench bitpack_compare shows the fast-path speedup vs the baseline from #8012 ("bench: bit-packed compare-constant baseline")
cargo bench -p vortex-fastlanes --bench bitpack_constant shows the analytical encoder speedup
./scripts/public-api.sh agrees with the committed lock file
cargo clippy --all-targets --all-features

Supersedes #8011 (split into bench + speedup).

🤖 Generated with Claude Code

Add `bitpack_compare` divan bench in vortex-fastlanes that compares a binary `Operator::Eq` / `Operator::Lt` with an out-of-range constant on a `BitPackedData` array against an explicit "decompress, then Arrow compare" baseline that materialises the unpacked `PrimitiveArray` first.

The constant is chosen as `1 << BW`, i.e. just past the packable range, so a future kernel that recognises out-of-range constants can short-circuit it. Today both arms decompress; the benchmark establishes a baseline for that upcoming optimization to land against.

Sized small (`len ∈ {1024, 65536}`, `bit_width ∈ {4, 16}`, Eq + Lt) so it finishes quickly. Run with `cargo bench -p vortex-fastlanes --bench bitpack_compare`.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

codspeed-hq · 2026-05-18T17:12:57Z

Merging this PR will improve performance by 59.84%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 9 improved benchmarks
❌ 5 regressed benchmarks
✅ 1261 untouched benchmarks
🆕 16 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
❌	Simulation	`baseline_lt[16, 65536]`	217.6 µs	303 µs	-28.19%
❌	Simulation	`baseline_eq[4, 65536]`	185 µs	243 µs	-23.86%
❌	Simulation	`baseline_lt[4, 65536]`	200.8 µs	258 µs	-22.17%
❌	Simulation	`baseline_eq[16, 65536]`	229.9 µs	287.8 µs	-20.13%
❌	Simulation	`chunked_varbinview_canonical_into[(1000, 10)]`	162.1 µs	198.2 µs	-18.2%
⚡	Simulation	`fast_eq_out_of_range[16, 65536]`	225.1 µs	36.2 µs	×6.2
⚡	Simulation	`fast_lt_out_of_range[16, 65536]`	207.8 µs	36.5 µs	×5.7
⚡	Simulation	`fast_lt_out_of_range[4, 65536]`	190.7 µs	35.8 µs	×5.3
⚡	Simulation	`fast_eq_out_of_range[4, 65536]`	174.6 µs	36.1 µs	×4.8
⚡	Simulation	`fast_lt_out_of_range[4, 1024]`	37.9 µs	26.5 µs	+42.93%
⚡	Simulation	`baseline_lt[4, 1024]`	78.9 µs	63.2 µs	+24.81%
⚡	Simulation	`fast_lt_out_of_range[16, 1024]`	31.8 µs	26.6 µs	+19.76%
⚡	Simulation	`fast_eq_out_of_range[16, 1024]`	32.1 µs	27.7 µs	+15.71%
⚡	Simulation	`fast_eq_out_of_range[4, 1024]`	31.4 µs	27.7 µs	+13.17%
🆕	Simulation	`full_encode[16, 65536]`	N/A	358 µs	N/A
🆕	Simulation	`in_range_eq[4, 1024]`	N/A	30.8 µs	N/A
🆕	Simulation	`in_range_eq[16, 1024]`	N/A	31 µs	N/A
🆕	Simulation	`in_range_eq_baseline[4, 1024]`	N/A	83.5 µs	N/A
🆕	Simulation	`full_encode[4, 1024]`	N/A	19.1 µs	N/A
🆕	Simulation	`in_range_eq[16, 65536]`	N/A	156.4 µs	N/A
...	...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

Comparing claude/bitpack-compare-speedup-KGPS3 (940b84d) with develop (7de53da)¹

¹ No successful run was found on develop (41af74d) during the generation of this report, so 7de53da was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

…pare-speedup-KGPS3 Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> # Conflicts: # encodings/fastlanes/Cargo.toml

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

github-actions · 2026-06-02T02:17:33Z

This PR has been marked as stale because it has been open for 14 days with no activity. Please comment or remove the stale label if you wish to keep it active, otherwise it will be closed in 7 days

…pare-speedup-KGPS3 Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> # Conflicts: # encodings/fastlanes/public-api.lock # encodings/fastlanes/src/bitpacking/compute/compare.rs # encodings/fastlanes/src/bitpacking/vtable/kernels.rs

The streaming in-range compare kernel that landed on develop (#8015) already covers the in-range path, so the planning note is no longer needed.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

The in-range compare path previously unpacked each FastLanes block to a full primitive scratch buffer and then ran a per-element predicate. At scale that was actually slower than plain decompress + Arrow compare (~78us vs ~63us for 64K u32 elements at bit_width 4/16).

Switch the in-range path to the FastLanes fused unpack-and-compare kernel (BitPackingCompare::unchecked_unpack_cmp), which unpacks each value in-register and compares it on the spot, folding the result straight into the output bit buffer without materialising the unpacked primitive or a per-element scratch. Patches are applied afterwards by overwriting the result bit at each patched index. Sliced initial/trailing partial chunks keep the plain unpack-then-compare fallback.

In-range eq at 64K now runs ~62us (down from ~78us, ~20% faster) and beats the decompress baseline at every size. The streaming stream_predicate is retained for the between kernel.

Adds in_range_eq / in_range_eq_baseline benches to bitpack_compare.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
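To illustrate the shape of a fused unpack-and-compare, here is a scalar sketch under simplifying assumptions: values are packed LSB-first into `u64` words in plain sequential order (not the FastLanes transposed layout), only equality is handled, and patches are modelled as `(index, value)` pairs. All function names are hypothetical; this is not `BitPackingCompare::unchecked_unpack_cmp`.

```rust
/// Helper for the sketch: pack `values` (each < 2^bit_width) sequentially.
fn pack_sequential(values: &[u64], bit_width: u32) -> Vec<u64> {
    assert!((1..=64).contains(&bit_width));
    let mut out = vec![0u64; (values.len() * bit_width as usize).div_ceil(64)];
    for (i, &v) in values.iter().enumerate() {
        let bit = i * bit_width as usize;
        let (word, off) = (bit / 64, (bit % 64) as u32);
        out[word] |= v << off;
        if off + bit_width > 64 {
            // The value straddles a word boundary; spill the high bits.
            out[word + 1] |= v >> (64 - off);
        }
    }
    out
}

/// Extract each value in a register, compare it with `c`, and fold the
/// outcome directly into the result bitmask. No unpacked array is built.
fn unpack_eq_fused(packed: &[u64], bit_width: u32, len: usize, c: u64) -> Vec<u64> {
    assert!((1..=64).contains(&bit_width));
    let mask = if bit_width == 64 { u64::MAX } else { (1u64 << bit_width) - 1 };
    let mut out = vec![0u64; len.div_ceil(64)];
    for i in 0..len {
        let bit = i * bit_width as usize;
        let (word, off) = (bit / 64, (bit % 64) as u32);
        let mut v = packed[word] >> off;
        if off + bit_width > 64 {
            v |= packed[word + 1] << (64 - off);
        }
        out[i / 64] |= (((v & mask) == c) as u64) << (i % 64);
    }
    out
}

/// Overwrite the result bit at each patched index with the patch's own outcome.
fn apply_patches(out: &mut [u64], patches: &[(usize, u64)], c: u64) {
    for &(idx, value) in patches {
        let bit = 1u64 << (idx % 64);
        if value == c {
            out[idx / 64] |= bit;
        } else {
            out[idx / 64] &= !bit;
        }
    }
}
```

The point of the fusion is memory traffic: the only buffers touched are the packed input and the one-bit-per-element output, so the intermediate `len * size_of::<T>()` scratch disappears.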

Profiling the in-range compare kernel (samply + addr2line on a tight 64K-element eq loop) showed the work splits into an SSE2 fused unpack-compare (~23%) and, dominating it, the scalar bool-to-bitmask fold in vortex_buffer::collect_bool_word / pack_bools_into_words (~60%). The decompress+Arrow baseline avoids the latter because Arrow emits the result bitmask with a SIMD movemask.

Add two runtime-dispatched AVX2 fast paths (no global target-cpu change; the default x86-64 build only reaches SSE2):

- Fused unpack-and-compare: re-expand the exported fastlanes "unpack" macro inside target_feature(enable = "avx2") per-width kernels (byte-identical work to fastlanes' unpack_cmp, 256-bit codegen). Per-width kernels are inline(never) so the width dispatcher keeps a small stack frame.
- Bool-to-bitmask packing: a vpmovmskb packer for the hot full-chunk (64-bit-aligned) case, replacing 64 scalar shift-ORs per word.

Both detect AVX2 via is_x86_feature_detected and fall back to the stock scalar/SSE2 paths on other hosts/targets.

Local bitpack_compare bench (default build), in_range_eq at 64K u32:

bit_width 4: 62us -> 13.3us
bit_width 16: 62us -> 14.9us

i.e. ~3.9x faster than the decompress baseline (~51us) and ~5.9x faster than the original streaming kernel (~78us). CodSpeed CI already builds benches with target-feature=+avx2, so the movemask packer also speeds the CI numbers while the unpack fast path matches the already-AVX2 stock kernel there.

Adds paste + seq-macro deps (needed to expand the fastlanes macros) and unit tests for the bool packer.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
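A rough sketch of the bool-to-bitmask idea with runtime dispatch, assuming the bools arrive as 64 bytes (one byte per element, non-zero meaning true). The function names are hypothetical and this is not the code in vortex_buffer; it only shows how `vpmovmskb` (`_mm256_movemask_epi8`) can replace the 64 scalar shift-ORs, with a scalar fallback on hosts without AVX2.

```rust
/// Scalar reference: one shift-OR per element.
fn pack64_scalar(bools: &[u8; 64]) -> u64 {
    bools
        .iter()
        .enumerate()
        .fold(0u64, |acc, (i, &b)| acc | (((b != 0) as u64) << i))
}

/// AVX2 path: two 32-byte loads, one movemask each.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn pack64_avx2(bools: &[u8; 64]) -> u64 {
    use std::arch::x86_64::*;
    unsafe {
        let zero = _mm256_setzero_si256();
        let lo = _mm256_loadu_si256(bools.as_ptr() as *const __m256i);
        let hi = _mm256_loadu_si256(bools.as_ptr().add(32) as *const __m256i);
        // cmpeq sets 0xFF in every byte that is zero; movemask gathers the top
        // bit of each byte, so inverting yields one bit per non-zero byte.
        let lo_zero = _mm256_movemask_epi8(_mm256_cmpeq_epi8(lo, zero)) as u32;
        let hi_zero = _mm256_movemask_epi8(_mm256_cmpeq_epi8(hi, zero)) as u32;
        (!lo_zero as u64) | ((!hi_zero as u64) << 32)
    }
}

/// Runtime-dispatched entry point: AVX2 when the host has it, scalar otherwise.
fn pack64(bools: &[u8; 64]) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was confirmed at runtime just above.
            return unsafe { pack64_avx2(bools) };
        }
    }
    pack64_scalar(bools)
}
```

Dispatching at runtime keeps the default build at its baseline target features while still using 256-bit instructions where available, which matches the "no global target-cpu change" constraint stated in the commit message.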

github-actions · 2026-06-19T02:23:01Z

This PR has been marked as stale because it has been open for 14 days with no activity. Please comment or remove the stale label if you wish to keep it active, otherwise it will be closed in 7 days

github-actions · 2026-06-29T02:17:17Z

This PR was closed because it has been inactive for 7 days since being marked as stale.

joseph-isaacs added 2 commits May 18, 2026 17:53

joseph-isaacs mentioned this pull request May 18, 2026

Fast-path comparison and constant encoding for bit-packed arrays #8011

Closed

Base automatically changed from claude/bitpack-compare-bench-KGPS3 to develop May 18, 2026 17:26

joseph-isaacs added 2 commits May 18, 2026 18:28

Merge remote-tracking branch 'origin/develop' into claude/bitpack-com…

64284d2

u

3b1b8cf

github-actions Bot added the stale label ("This PR is stale and will be auto-closed soon") Jun 2, 2026

joseph-isaacs added 2 commits June 2, 2026 09:53

fastlanes: drop obsolete in-range compare plan doc

0303ca6

joseph-isaacs marked this pull request as draft June 2, 2026 10:13

joseph-isaacs removed the stale label ("This PR is stale and will be auto-closed soon") Jun 2, 2026

github-actions Bot added the stale label ("This PR is stale and will be auto-closed soon") Jun 19, 2026

github-actions Bot closed this Jun 29, 2026

joseph-isaacs wants to merge 8 commits into develop from claude/bitpack-compare-speedup-KGPS3
