[experimental] Unify VBMI untranspose onto one vpermi2b kernel for all widths by joseph-isaacs · Pull Request #146 · spiraldb/fastlanes

joseph-isaacs · 2026-06-03T16:26:54Z

⚠️ Disclaimer — experimental, NOT validated on real AVX-512VBMI hardware

This PR changes the u64 VBMI untranspose kernel. It could not be executed on real avx512vbmi hardware in the dev/CI environment available to me:

- the host CPU lacks avx512vbmi (so the has_vbmi()-gated tests skip);
- qemu-user here cannot execute AVX-512 in user mode (TCG faults on even basic AVX-512F; XSAVE/ZMM state is unsupported in this build);
- Intel SDE was not reachable (download blocked).
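For context on why the gated tests skip rather than fail, here is a minimal sketch of a runtime feature gate. The body of `has_vbmi()` below is an assumption (the PR does not show it), and `run_if_vbmi` is a hypothetical helper added only for illustration.

```rust
/// Hypothetical stand-in for the crate's `has_vbmi()`: true only when the
/// running CPU reports every AVX-512 subset a VBMI kernel would rely on.
#[cfg(target_arch = "x86_64")]
fn has_vbmi() -> bool {
    std::arch::is_x86_feature_detected!("avx512f")
        && std::arch::is_x86_feature_detected!("avx512bw")
        && std::arch::is_x86_feature_detected!("avx512vbmi")
}

/// On every other architecture the VBMI path is never available.
#[cfg(not(target_arch = "x86_64"))]
fn has_vbmi() -> bool {
    false
}

/// Hypothetical helper: run `f` only when VBMI is present, otherwise skip.
/// A test built on this silently does nothing on hosts without the feature.
fn run_if_vbmi<R>(f: impl FnOnce() -> R) -> Option<R> {
    if has_vbmi() {
        Some(f())
    } else {
        None
    }
}
```

On a host without avx512vbmi, `run_if_vbmi` returns `None`, so a test written this way passes without ever exercising the kernel.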

So correctness here rests on cross-compilation + structural equivalence, not execution — see "Validation" below. Do not merge until it has been run on a real avx512vbmi machine (the existing test_vbmi_untranspose_all_widths_match_baseline test will exercise it there). Stacked on #145; review/merge that first.

What changed

Stacked on top of #145. That PR intentionally kept the u64 VBMI path byte-identical to the original kernel (vpermb gather within a 64-byte half + a scatter the compiler vectorizes), behind a compile-time T::T == 64 branch, because VBMI couldn't be benchmarked.

This PR removes that special case (and the _lt64 helper) so that all widths, including u64, flow through the single width-generic kernel:

vpermi2b gather (16 groups → group-major)  →  8x8 bit-transpose (zmm)  →  vpermi2b scatter

Only the per-width group_perm gather/scatter tables differ. For u64 each group's 8 bytes stay within one 64-byte half, so the second vpermi2b source is unused — the two-source form just keeps one code path.
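To make the "second source is unused" point concrete, here is a scalar model of a two-source byte permute with vpermi2b semantics. It is an illustrative sketch, not the PR's kernel: `permute2_bytes` and `uses_only_first_source` are hypothetical names, and the index table in the usage note is made up.

```rust
/// Scalar model of a 512-bit two-source byte permute (vpermi2b semantics).
/// For each output byte, bits 5:0 of the index pick a byte position and
/// bit 6 picks the source (0 = `a`, 1 = `b`); bit 7 is ignored.
fn permute2_bytes(a: &[u8; 64], idx: &[u8; 64], b: &[u8; 64]) -> [u8; 64] {
    let mut out = [0u8; 64];
    for (o, &i) in out.iter_mut().zip(idx.iter()) {
        let pos = (i & 0x3f) as usize;
        *o = if i & 0x40 == 0 { a[pos] } else { b[pos] };
    }
    out
}

/// True when no index has bit 6 set, i.e. the permute never reads `b`.
fn uses_only_first_source(idx: &[u8; 64]) -> bool {
    idx.iter().all(|&i| i & 0x40 == 0)
}
```

With an index table whose entries are all below 64, the result does not depend on `b` at all, which is the situation described for u64: the two-source form is kept only so every width shares one code path.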

Net diff: −40 lines in x86.rs, one kernel instead of two.

Validation (no real-hardware execution)

- Cross-compiles cleanly, including with RUSTFLAGS="-C target-feature=+avx512f,+avx512bw,+avx512vbmi" (forces VBMI codegen). Confirmed via --emit asm that u64 now emits a vpermi2b gather + vpermi2b scatter (no vpermb, no scalar scatter).
- Structural equivalence: the kernel is identical to the narrow-width VBMI path already in #145 ("Add width-generic x86 BMI2/VBMI untranspose for u8/u16/u32"), and uses the same GATHER_64/SCATTER_64 tables that the NEON untranspose_bits_neon::<u64> all-widths test executes and validates on aarch64 CI. The scalar/BMI2 all-widths tests cover the same tables on x86.
- cargo test --features std (155) + clippy are clean on x86_64 (std + no_std) and aarch64.

Why it's worth considering

One uniform kernel (matches the NEON structure), no compile-time width branch, ~40 fewer lines. Permute-op count is unchanged vs the #145 u64 path (4 permutes either way); the gather becomes a two-source vpermi2b instead of a single-source vpermb. I expect the two to be throughput-equivalent on Ice Lake / Zen 4, but that is unmeasured and must be confirmed on hardware before merging.

https://claude.ai/code/session_01ATBvsrFw3eAPgcZnrqpMPS

Generated by Claude Code

The BMI2 and VBMI `untranspose_bits` kernels previously only implemented the u64 (16-lane) mask transpose; u8/u16/u32 fell back to scalar. This generalizes both to all element widths, matching the NEON kernel, and routes every width through them in the dispatcher.

- BMI2: `untranspose_bits_bmi2::<T>` gathers each of the 16 byte-groups at the width's stride and uses PDEP to perform the 8x8 bit transpose. For u64 this monomorphizes to byte-identical asm as before (verified), so u64 perf is unchanged; u8/u16/u32 are ~1.5-2x faster than scalar.
- VBMI: narrow widths use a uniform vpermi2b gather / 8x8 transpose / vpermi2b scatter kernel (`untranspose_bits_vbmi_lt64`) spanning both 64-byte halves. The u64 path is preserved exactly (byte-identical asm).
- The per-width gather/scatter permutation tables are hoisted from the NEON module into `bit_transpose::mod` and shared by NEON and VBMI.
- Tests now cover all widths against the baseline for both kernels.

https://claude.ai/code/session_01ATBvsrFw3eAPgcZnrqpMPS
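As a reference point for the "8x8 bit transpose" step that both kernels perform on each 8-byte group, here is a portable scalar sketch. It uses the classic delta-swap formulation, not the PDEP-based method the commit describes, and the bit layout (bit `r*8 + c` is row `r`, column `c`) is an assumption made for illustration.

```rust
/// Transpose an 8x8 bit matrix packed into a u64, where bit `r * 8 + c`
/// holds row `r`, column `c`. Uses three delta swaps: 1x1 elements within
/// 2x2 blocks, then 2x2 blocks within 4x4 blocks, then 4x4 blocks.
fn transpose8x8(mut x: u64) -> u64 {
    // Swap bit p with bit p + 7 for every p selected by the mask.
    let t = (x ^ (x >> 7)) & 0x00AA_00AA_00AA_00AA;
    x ^= t ^ (t << 7);
    // Swap 2x2 blocks across the diagonal of each 4x4 block.
    let t = (x ^ (x >> 14)) & 0x0000_CCCC_0000_CCCC;
    x ^= t ^ (t << 14);
    // Swap the two off-diagonal 4x4 blocks.
    let t = (x ^ (x >> 28)) & 0x0000_0000_F0F0_F0F0;
    x ^= t ^ (t << 28);
    x
}
```

A scalar routine like this is handy as a test oracle for a single group: a full row (`0xFF`) becomes a full column (`0x0101_0101_0101_0101`), and applying the function twice returns the input.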

Collapse the per-item `#[cfg(any(target_arch = "x86_64", target_arch = "aarch64"))]` attributes on the shared gather/scatter tables into a single `group_perm` submodule gated once. NEON and VBMI reference `group_perm::group_tables`; the index-building const fns and statics are now private to the module. https://claude.ai/code/session_01ATBvsrFw3eAPgcZnrqpMPS
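The shape of that refactor can be sketched as follows. Only the `group_perm` and `group_tables` names come from the commit message; the 128-byte table size, the identity contents, and the `group_tables` signature are placeholders assumed for illustration.

```rust
// Gate the whole module once instead of repeating the attribute on every
// const fn and static inside it.
#[cfg(any(target_arch = "x86_64", target_arch = "aarch64"))]
mod group_perm {
    /// Placeholder index builder (identity permutation). Private: callers
    /// outside the module cannot reach it.
    const fn identity() -> [u8; 128] {
        let mut table = [0u8; 128];
        let mut i = 0;
        while i < 128 {
            table[i] = i as u8;
            i += 1;
        }
        table
    }

    // Private statics, built at compile time. Real tables would differ
    // per element width; these are stand-ins.
    static GATHER: [u8; 128] = identity();
    static SCATTER: [u8; 128] = identity();

    /// The single entry point shared by the SIMD backends (assumed signature).
    pub fn group_tables() -> (&'static [u8; 128], &'static [u8; 128]) {
        (&GATHER, &SCATTER)
    }
}
```

Gating at the module level means a new table or helper added later inherits the architecture restriction automatically, and keeping the builders private leaves `group_tables` as the only surface the kernels depend on.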

Drop the `T::T == 64` special-case (and the `_lt64` helper) that PR #145 kept to leave the u64 VBMI asm byte-identical, so every width now flows through one `vpermi2b` gather / 8x8 transpose / `vpermi2b` scatter kernel selected only by the per-width gather/scatter tables. For u64 a group's 8 bytes stay within one 64-byte half, so the second permute source is unused, but the two-source form keeps a single code path.

EXPERIMENTAL: this changes the u64 VBMI kernel and could not be executed on real AVX-512VBMI hardware in this environment (host lacks the feature; qemu-user here cannot run AVX-512; Intel SDE unavailable). Validated by cross-compilation and by structural equivalence to the width-generic NEON and narrow-width VBMI kernels, which share the same gather/scatter tables.

https://claude.ai/code/session_01ATBvsrFw3eAPgcZnrqpMPS

codspeed-hq · 2026-06-03T16:28:50Z

Merging this PR will not alter performance

✅ 158 untouched benchmarks
⏩ 123 skipped benchmarks¹

Comparing claude/dreamy-goodall-vbmi-uniform (ec0597e) with claude/dreamy-goodall-3vpHT (6c10ea7)

¹ 123 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

joseph-isaacs added 3 commits June 3, 2026 15:53

Base automatically changed from claude/dreamy-goodall-3vpHT to develop June 4, 2026 12:51

joseph-isaacs wants to merge 3 commits into develop from claude/dreamy-goodall-vbmi-uniform
