
fastlanes: bit-packed compare-constant fast path + bitpack_constant kernel#8013

Closed
joseph-isaacs wants to merge 8 commits into develop from claude/bitpack-compare-speedup-KGPS3


@joseph-isaacs

Contributor

Summary

Stacked on #8012. Speeds up the bitpack_compare bench from the parent PR with two complementary optimizations driven by the same observation — a bit-packed lane holds values in [0, 2^bit_width - 1], so a constant outside that range can be answered analytically without touching the packed buffer.

Compare-constant fast path (compute/compare.rs)

Register a CompareKernel for BitPacked that short-circuits when the RHS constant c is outside [0, 2^bit_width - 1]. For each operator the answer is a constant boolean modulo patches and validity:

| Operator | Outside-range result |
| --- | --- |
| Eq / NotEq | false / true everywhere |
| Lt / Lte / Gt / Gte | constant once c is on either side of the range |

Detecting the range is an O(1) i128 check via the new BitPackedData::value_fits_bit_width helper. With no patches and no nulls the kernel returns a ConstantArray<bool> (also O(1)); otherwise it allocates a BitBuffer, fills it with the constant result, and overlays the per-position outcome at each patch index. In-range constants fall through to the canonical decompress + Arrow compare path; tests exercise both fall-throughs.
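The range check and the per-operator constant result described above can be sketched as follows. This is a minimal illustration, not the kernel from the PR: the `Operator` enum and `out_of_range_result` are hypothetical names, and the free function `value_fits_bit_width` only stands in for the `BitPackedData::value_fits_bit_width` helper, whose real signature is not shown here. Patches and validity are ignored.

```rust
// Hypothetical stand-in for the comparison operator type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Operator {
    Eq,
    NotEq,
    Lt,
    Lte,
    Gt,
    Gte,
}

/// True when `c` lies in `[0, 2^bit_width - 1]` (assumed semantics of the helper).
fn value_fits_bit_width(c: i128, bit_width: u32) -> bool {
    debug_assert!(bit_width <= 64);
    c >= 0 && c < (1i128 << bit_width)
}

/// `Some(result)` when `value <op> c` is the same for every packed value,
/// `None` when `c` is in range and the data has to be inspected.
fn out_of_range_result(op: Operator, c: i128, bit_width: u32) -> Option<bool> {
    if value_fits_bit_width(c, bit_width) {
        return None;
    }
    // Out of range means either below zero or above the largest packable value.
    let below = c < 0;
    Some(match op {
        Operator::Eq => false,
        Operator::NotEq => true,
        // Every packed value is less than `c` exactly when `c` is above the range.
        Operator::Lt | Operator::Lte => !below,
        // Every packed value is greater than `c` exactly when `c` is below the range.
        Operator::Gt | Operator::Gte => below,
    })
}
```

For example, with `bit_width = 4` the constant `16` yields `Some(false)` for `Eq` and `Some(true)` for `Lt`, while `15` yields `None` and would take the in-range path.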

bitpack_constant analytical encoder (array/bitpack_compress.rs)

Add a constant-only pack kernel that builds the FastLanes bit pattern for a [constant; len] input without calling BitPacking::pack. For constant input every lane produces the same bit_width output words; we compute those words analytically — each output word's j-th bit is bit (k * T_bits + j) mod bit_width of c — then memset each word LANES times into a stack chunk template and memcpy the template into every full chunk. The standard packer is only invoked for the partial tail (zero-padded past len). bitpack_encode_constant wraps the buffer up as a BitPackedArray. A bitwise-equivalence rstest covers byte-identity with BitPacking::pack across lengths, widths, and constants.
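The bit formula above can be checked in isolation with a small sketch. It assumes `T = u32` (so `T_bits = 32`) and models one lane as a plain LSB-first bit stream of 32 copies of `c`; both function names are hypothetical and neither calls `BitPacking::pack`. The lane replication and chunk template steps are not shown.

```rust
/// Words one lane produces when every input value equals `c`:
/// bit `j` of word `k` is bit `(k * T_bits + j) mod bit_width` of `c`.
fn constant_lane_words(c: u32, bit_width: u32) -> Vec<u32> {
    assert!((1..=32).contains(&bit_width));
    const T_BITS: u32 = 32;
    (0..bit_width)
        .map(|k| {
            let mut word = 0u32;
            for j in 0..T_BITS {
                let src = (k * T_BITS + j) % bit_width;
                word |= ((c >> src) & 1) << j;
            }
            word
        })
        .collect()
}

/// Reference: write 32 copies of the low `bit_width` bits of `c`
/// back to back, LSB first, and read the stream out as u32 words.
fn naive_lane_words(c: u32, bit_width: u32) -> Vec<u32> {
    assert!((1..=32).contains(&bit_width));
    let mut words = vec![0u32; bit_width as usize];
    for i in 0..32u32 {
        for b in 0..bit_width {
            let pos = i * bit_width + b;
            words[(pos / 32) as usize] |= ((c >> b) & 1) << (pos % 32);
        }
    }
    words
}
```

Because every value in the lane is identical, the order in which FastLanes assigns rows to a lane does not affect the result, which is why the closed form depends only on the bit position.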

Benches

  • bitpack_compare (added in bench: bit-packed compare-constant baseline #8012) on this branch now exercises the fast path; at bit_width ∈ {4, 16}, len ∈ {1024, 65536} it runs in ~1.4–1.5 µs vs 8–125 µs for the decompress + Arrow baseline.
  • New bitpack_constant bench compares the analytical kernel against the full bitpack_encode pipeline on uniform-constant input; at 64 K u32 elements the analytical kernel is roughly 23–62× faster.

Plan doc (docs/inrange_compare_plan.md)

Documents the follow-up plan to accelerate in-range ordering comparisons: compare the packed array against the packed constant via SWAR less-than per supported bit width (Routes A/B/C, including Knuth broadword with rotation tables for widths that straddle word boundaries), derive the four ordering operators from one Lt primitive, and benchmark against the canonical SIMD baseline before landing.
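As a rough illustration of the SWAR less-than idea, the sketch below compares all `width`-bit fields of two `u64` words at once and derives the other ordering operators from that one primitive. It is an assumption-laden sketch, not any of the plan doc's Routes A/B/C: it only handles widths that divide 64 evenly, so fields never straddle a word boundary, and all function names are hypothetical.

```rust
/// Mask with the top bit of every `width`-bit field set.
fn high_bits(width: u32) -> u64 {
    assert!(width >= 1 && 64 % width == 0);
    let mut h = 0u64;
    let mut pos = width - 1;
    while pos < 64 {
        h |= 1u64 << pos;
        pos += width;
    }
    h
}

/// Per-field unsigned `x < y`; the top bit of each field is set where true.
fn swar_lt(x: u64, y: u64, width: u32) -> u64 {
    let h = high_bits(width);
    // Per field: (x_low + 2^(width-1)) - y_low. This never borrows across
    // fields, and its top bit is set exactly when x_low >= y_low.
    let d = (x | h).wrapping_sub(y & !h);
    // x < y when the top bits are (0, 1), or when they match and x_low < y_low.
    ((!x & y) | (!(x ^ y) & !d)) & h
}

/// The remaining ordering operators, derived from `swar_lt`.
fn swar_gt(x: u64, y: u64, width: u32) -> u64 {
    swar_lt(y, x, width)
}

fn swar_gte(x: u64, y: u64, width: u32) -> u64 {
    !swar_lt(x, y, width) & high_bits(width)
}

fn swar_lte(x: u64, y: u64, width: u32) -> u64 {
    !swar_lt(y, x, width) & high_bits(width)
}
```

Widths that do not divide the word size need the extra handling the plan doc mentions and are out of scope for this sketch.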

Test plan

  • cargo nextest run -p vortex-fastlanes --all-features → 265/265 pass locally
  • cargo check -p vortex-fastlanes --benches --all-features
  • cargo bench -p vortex-fastlanes --bench bitpack_compare shows the fast-path speedup vs the baseline from bench: bit-packed compare-constant baseline #8012
  • cargo bench -p vortex-fastlanes --bench bitpack_constant shows the analytical encoder speedup
  • ./scripts/public-api.sh agrees with the committed lock file
  • cargo clippy --all-targets --all-features

Supersedes #8011 (split into bench + speedup).

🤖 Generated with Claude Code

Add `bitpack_compare` divan bench in vortex-fastlanes that pits a binary
`Operator::Eq` / `Operator::Lt` against an out-of-range constant on a
`BitPackedData` array against an explicit "decompress, then Arrow compare"
baseline that materialises the unpacked `PrimitiveArray` first.

The constant is chosen as `1 << BW`, i.e. just past the packable range, so a
future kernel that recognises out-of-range constants can short-circuit it.
Today both arms decompress; the benchmark establishes a baseline for that
upcoming optimization to land against. Sized small (`len ∈ {1024, 65536}`,
`bit_width ∈ {4, 16}`, Eq + Lt) so it finishes quickly.

Run with `cargo bench -p vortex-fastlanes --bench bitpack_compare`.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…ernel

Speeds up the `bitpack_compare` bench from the parent commit with two
independent optimizations driven by the same observation — a bit-packed lane
holds values in `[0, 2^bit_width - 1]`, so a constant outside that range can
be answered analytically without touching the packed buffer.

**Compare-constant fast path (`compute/compare.rs`)**

Register a `CompareKernel` for `BitPacked` that short-circuits when the RHS
constant `c` is outside `[0, 2^bit_width - 1]`. For each operator the answer
is a constant boolean modulo patches and validity:

  Eq/NotEq           - false / true everywhere
  Lt/Lte/Gt/Gte      - constant once `c` is on either side of the range

Detecting the range is an `O(1)` `i128` check via the new
`BitPackedData::value_fits_bit_width` helper. With no patches and no nulls the
kernel returns a `ConstantArray<bool>` (also `O(1)`); otherwise it allocates a
`BitBuffer`, fills it with the constant result, and overlays the per-position
outcome at each patch index. In-range constants fall through to the canonical
decompress + Arrow compare path; tests exercise both fall-throughs.

**`bitpack_constant` analytical encoder (`array/bitpack_compress.rs`)**

Add a constant-only pack kernel that builds the FastLanes bit pattern for a
`[constant; len]` input without calling `BitPacking::pack`. For constant input
every lane produces the same `bit_width` output words; we compute those words
analytically - each output word's `j`-th bit is bit `(k * T_bits + j) mod
bit_width` of `c` - then `memset` each word `LANES` times into a stack chunk
template and `memcpy` the template into every full chunk. The standard packer
is only invoked for the partial tail (zero-padded past `len`).
`bitpack_encode_constant` wraps the buffer up as a `BitPackedArray`. A
bitwise-equivalence rstest covers byte-identity with `BitPacking::pack` across
lengths, widths, and constants.

**Benches**

* `bitpack_compare` (added in the parent commit) on this branch now exercises
  the fast path; at `bit_width ∈ {4, 16}`, `len ∈ {1024, 65536}` it runs in
  ~1.4-1.5 µs vs 8-125 µs for the decompress + Arrow baseline.
* New `bitpack_constant` bench compares the analytical kernel against the
  full `bitpack_encode` pipeline on uniform-constant input; at 64 K u32
  elements the analytical kernel is roughly 23-62x faster.

**Plan doc (`docs/inrange_compare_plan.md`)**

Document the follow-up plan to accelerate *in-range* ordering comparisons:
compare the packed array against the packed constant via SWAR less-than per
supported bit width (Routes A/B/C, including Knuth broadword with rotation
tables for widths that straddle word boundaries), derive the four ordering
operators from one `Lt` primitive, and benchmark against the canonical SIMD
baseline before landing.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@codspeed-hq

codspeed-hq Bot commented May 18, 2026


Merging this PR will improve performance by 59.84%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 9 improved benchmarks
❌ 5 regressed benchmarks
✅ 1261 untouched benchmarks
🆕 16 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| Simulation | baseline_lt[16, 65536] | 217.6 µs | 303 µs | -28.19% |
| Simulation | baseline_eq[4, 65536] | 185 µs | 243 µs | -23.86% |
| Simulation | baseline_lt[4, 65536] | 200.8 µs | 258 µs | -22.17% |
| Simulation | baseline_eq[16, 65536] | 229.9 µs | 287.8 µs | -20.13% |
| Simulation | chunked_varbinview_canonical_into[(1000, 10)] | 162.1 µs | 198.2 µs | -18.2% |
| Simulation | fast_eq_out_of_range[16, 65536] | 225.1 µs | 36.2 µs | ×6.2 |
| Simulation | fast_lt_out_of_range[16, 65536] | 207.8 µs | 36.5 µs | ×5.7 |
| Simulation | fast_lt_out_of_range[4, 65536] | 190.7 µs | 35.8 µs | ×5.3 |
| Simulation | fast_eq_out_of_range[4, 65536] | 174.6 µs | 36.1 µs | ×4.8 |
| Simulation | fast_lt_out_of_range[4, 1024] | 37.9 µs | 26.5 µs | +42.93% |
| Simulation | baseline_lt[4, 1024] | 78.9 µs | 63.2 µs | +24.81% |
| Simulation | fast_lt_out_of_range[16, 1024] | 31.8 µs | 26.6 µs | +19.76% |
| Simulation | fast_eq_out_of_range[16, 1024] | 32.1 µs | 27.7 µs | +15.71% |
| Simulation | fast_eq_out_of_range[4, 1024] | 31.4 µs | 27.7 µs | +13.17% |
| 🆕 Simulation | full_encode[16, 65536] | N/A | 358 µs | N/A |
| 🆕 Simulation | in_range_eq[4, 1024] | N/A | 30.8 µs | N/A |
| 🆕 Simulation | in_range_eq[16, 1024] | N/A | 31 µs | N/A |
| 🆕 Simulation | in_range_eq_baseline[4, 1024] | N/A | 83.5 µs | N/A |
| 🆕 Simulation | full_encode[4, 1024] | N/A | 19.1 µs | N/A |
| 🆕 Simulation | in_range_eq[16, 65536] | N/A | 156.4 µs | N/A |
| ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/bitpack-compare-speedup-KGPS3 (940b84d) with develop (7de53da) [1]


Footnotes

  1. No successful run was found on develop (41af74d) during the generation of this report, so 7de53da was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

Base automatically changed from claude/bitpack-compare-bench-KGPS3 to develop May 18, 2026 17:26
…pare-speedup-KGPS3

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

# Conflicts:
#	encodings/fastlanes/Cargo.toml
u
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@github-actions

github-actions Bot commented Jun 2, 2026

Contributor

This PR has been marked as stale because it has been open for 14 days with no activity. Please comment or remove the stale label if you wish to keep it active, otherwise it will be closed in 7 days

@github-actions github-actions Bot added the stale This PR is stale and will be auto-closed soon label Jun 2, 2026
…pare-speedup-KGPS3

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

# Conflicts:
#	encodings/fastlanes/public-api.lock
#	encodings/fastlanes/src/bitpacking/compute/compare.rs
#	encodings/fastlanes/src/bitpacking/vtable/kernels.rs
The streaming in-range compare kernel landed on develop (#8015) already
covers the in-range path, so the planning note is no longer needed.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs marked this pull request as draft June 2, 2026 10:13
The in-range compare path previously unpacked each FastLanes block to a
full primitive scratch buffer and then ran a per-element predicate. At
scale that was actually slower than plain decompress + Arrow compare
(~78us vs ~63us for 64K u32 elements at bit_width 4/16).

Switch the in-range path to the FastLanes fused unpack-and-compare kernel
(BitPackingCompare::unchecked_unpack_cmp), which unpacks each value
in-register and compares it on the spot, folding the result straight into
the output bit buffer without materialising the unpacked primitive or a
per-element scratch. Patches are applied afterwards by overwriting the
result bit at each patched index. Sliced initial/trailing partial chunks
keep the plain unpack-then-compare fallback.

In-range eq at 64K now runs ~62us (down from ~78us, ~20% faster) and beats
the decompress baseline at every size. The streaming stream_predicate is
retained for the between kernel.

Adds in_range_eq / in_range_eq_baseline benches to bitpack_compare.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs removed the stale This PR is stale and will be auto-closed soon label Jun 2, 2026
Profiling the in-range compare kernel (samply + addr2line on a tight
64K-element eq loop) showed the work splits into an SSE2 fused unpack-compare
(~23%) and, dominating it, the scalar bool-to-bitmask fold in
vortex_buffer::collect_bool_word / pack_bools_into_words (~60%). The
decompress+Arrow baseline avoids the latter because Arrow emits the result
bitmask with a SIMD movemask.

Add two runtime-dispatched AVX2 fast paths (no global target-cpu change; the
default x86-64 build only reaches SSE2):

- Fused unpack-and-compare: re-expand the exported fastlanes "unpack" macro
  inside target_feature(enable = "avx2") per-width kernels (byte-identical
  work to fastlanes' unpack_cmp, 256-bit codegen). Per-width kernels are
  inline(never) so the width dispatcher keeps a small stack frame.
- Bool-to-bitmask packing: a vpmovmskb packer for the hot full-chunk
  (64-bit-aligned) case, replacing 64 scalar shift-ORs per word.

Both detect AVX2 via is_x86_feature_detected and fall back to the stock
scalar/SSE2 paths on other hosts/targets.
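A minimal sketch of the runtime-dispatched bool-to-bitmask packing is shown below. It assumes the comparison results arrive as one byte per element (zero or non-zero) and that the input length is a multiple of 64; the function names are hypothetical and this is not the packer from `vortex_buffer`. The AVX2 path uses `_mm256_cmpeq_epi8` plus `_mm256_movemask_epi8` (the `vpmovmskb` instruction) and falls back to the scalar loop when AVX2 is not detected or the target is not x86-64.

```rust
/// Scalar reference: bit `i` of word `w` is set when `bools[64 * w + i] != 0`.
fn pack_bools_scalar(bools: &[u8], out: &mut [u64]) {
    for (word, chunk) in out.iter_mut().zip(bools.chunks_exact(64)) {
        let mut acc = 0u64;
        for (i, &b) in chunk.iter().enumerate() {
            acc |= u64::from(b != 0) << i;
        }
        *word = acc;
    }
}

/// AVX2 path: two 32-byte movemasks build one 64-bit result word.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn pack_bools_avx2(bools: &[u8], out: &mut [u64]) {
    use std::arch::x86_64::*;
    unsafe {
        let zero = _mm256_setzero_si256();
        for (word, chunk) in out.iter_mut().zip(bools.chunks_exact(64)) {
            let lo = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
            let hi = _mm256_loadu_si256(chunk.as_ptr().add(32) as *const __m256i);
            // movemask collects the top bit of each byte; bytes equal to zero
            // compare to 0xFF, so the inverted mask marks the non-zero bytes.
            let lo_zero = _mm256_movemask_epi8(_mm256_cmpeq_epi8(lo, zero)) as u32;
            let hi_zero = _mm256_movemask_epi8(_mm256_cmpeq_epi8(hi, zero)) as u32;
            *word = u64::from(!lo_zero) | (u64::from(!hi_zero) << 32);
        }
    }
}

/// Dispatcher: picks the AVX2 path at runtime, otherwise the scalar loop.
fn pack_bools(bools: &[u8], out: &mut [u64]) {
    assert_eq!(bools.len(), out.len() * 64);
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just confirmed on this CPU.
            unsafe { pack_bools_avx2(bools, out) };
            return;
        }
    }
    pack_bools_scalar(bools, out);
}
```

Detecting the feature at runtime keeps the default x86-64 build portable while still letting AVX2-capable hosts take the wider path.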

Local bitpack_compare bench (default build) in_range_eq at 64K u32:
  bit_width 4:  62us -> 13.3us
  bit_width 16: 62us -> 14.9us
i.e. ~3.9x faster than the decompress baseline (~51us) and ~5.9x faster than
the original streaming kernel (~78us). CodSpeed CI already builds benches with
target-feature=+avx2, so the movemask packer also speeds the CI numbers while
the unpack fast path matches the already-AVX2 stock kernel there.

Adds paste + seq-macro deps (needed to expand the fastlanes macros) and unit
tests for the bool packer.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@github-actions

Contributor

This PR has been marked as stale because it has been open for 14 days with no activity. Please comment or remove the stale label if you wish to keep it active, otherwise it will be closed in 7 days

@github-actions github-actions Bot added the stale This PR is stale and will be auto-closed soon label Jun 19, 2026
@github-actions

Contributor

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions Bot closed this Jun 29, 2026