Skip to content

[experimental] Unify VBMI untranspose onto one vpermi2b kernel for all widths#146

Draft
joseph-isaacs wants to merge 3 commits into
developfrom
claude/dreamy-goodall-vbmi-uniform
Draft

[experimental] Unify VBMI untranspose onto one vpermi2b kernel for all widths#146
joseph-isaacs wants to merge 3 commits into
developfrom
claude/dreamy-goodall-vbmi-uniform

Conversation

@joseph-isaacs

Copy link
Copy Markdown
Member

⚠️ Disclaimer — experimental, NOT validated on real AVX-512VBMI hardware

This PR changes the u64 VBMI untranspose kernel. It could not be executed on real avx512vbmi hardware in the dev/CI environment available to me:

  • the host CPU lacks avx512vbmi (so the has_vbmi()-gated tests skip);
  • qemu-user here cannot execute AVX-512 in user mode (TCG faults on even basic AVX-512F — XSAVE/ZMM state unsupported in this build);
  • Intel SDE was not reachable (download blocked).

So correctness here rests on cross-compilation + structural equivalence, not execution — see "Validation" below. Do not merge until it has been run on a real avx512vbmi machine (the existing test_vbmi_untranspose_all_widths_match_baseline test will exercise it there). Stacked on #145; review/merge that first.

What changed

Stacked on top of #145. That PR intentionally kept the u64 VBMI path byte-identical to the original kernel (vpermb gather within a 64-byte half + a scatter the compiler vectorizes), behind a compile-time T::T == 64 branch, because VBMI couldn't be benchmarked.

This PR removes that special-case (and the _lt64 helper) so all widths — including u64 — flow through the single width-generic kernel:

vpermi2b gather (16 groups → group-major)  →  8x8 bit-transpose (zmm)  →  vpermi2b scatter

Only the per-width group_perm gather/scatter tables differ. For u64 each group's 8 bytes stay within one 64-byte half, so the second vpermi2b source is unused — the two-source form just keeps one code path.

Net diff: −40 lines in x86.rs, one kernel instead of two.

Validation (no real-hardware execution)

  • Cross-compiles cleanly, including RUSTFLAGS="-C target-feature=+avx512f,+avx512bw,+avx512vbmi" (forces VBMI codegen). Confirmed via --emit asm that u64 now emits vpermi2b gather + vpermi2b scatter (no vpermb, no scalar scatter).
  • Structural equivalence: the kernel is identical to the narrow-width VBMI path already in Add width-generic x86 BMI2/VBMI untranspose for u8/u16/u32 #145, and uses the same GATHER_64/SCATTER_64 tables that the NEON untranspose_bits_neon::<u64> all-widths test executes and validates on aarch64 CI. The scalar/BMI2 all-widths tests cover the same tables on x86.
  • cargo test --features std (155) + clippy clean on x86_64 (std + no_std) and aarch64.

Why it's worth considering

One uniform kernel (matches the NEON structure), no compile-time width branch, ~40 fewer lines. Permute-op count is unchanged vs the #145 u64 path (4 permutes either way); the gather becomes two-source vpermi2b instead of single-source vpermb, which is throughput-equivalent on Ice Lake / Zen 4 — but that should be confirmed on hardware before merging.

https://claude.ai/code/session_01ATBvsrFw3eAPgcZnrqpMPS


Generated by Claude Code

The BMI2 and VBMI `untranspose_bits` kernels previously only implemented
the u64 (16-lane) mask transpose; u8/u16/u32 fell back to scalar. This
generalizes both to all element widths, matching the NEON kernel, and
routes every width through them in the dispatcher.

- BMI2: `untranspose_bits_bmi2::<T>` gathers each of the 16 byte-groups at
  the width's stride and uses PDEP to perform the 8x8 bit transpose. For
  u64 this monomorphizes to byte-identical asm as before (verified), so
  u64 perf is unchanged; u8/u16/u32 are ~1.5-2x faster than scalar.
- VBMI: narrow widths use a uniform vpermi2b gather / 8x8 transpose /
  vpermi2b scatter kernel (`untranspose_bits_vbmi_lt64`) spanning both
  64-byte halves. The u64 path is preserved exactly (byte-identical asm).
- The per-width gather/scatter permutation tables are hoisted from the
  NEON module into `bit_transpose::mod` and shared by NEON and VBMI.
- Tests now cover all widths against the baseline for both kernels.

https://claude.ai/code/session_01ATBvsrFw3eAPgcZnrqpMPS
Collapse the per-item `#[cfg(any(target_arch = "x86_64", target_arch =
"aarch64"))]` attributes on the shared gather/scatter tables into a single
`group_perm` submodule gated once. NEON and VBMI reference
`group_perm::group_tables`; the index-building const fns and statics are now
private to the module.

https://claude.ai/code/session_01ATBvsrFw3eAPgcZnrqpMPS
Drop the `T::T == 64` special-case (and the `_lt64` helper) that PR #145
kept to leave the u64 VBMI asm byte-identical, so every width now flows
through one `vpermi2b` gather / 8x8 transpose / `vpermi2b` scatter kernel
selected only by the per-width gather/scatter tables. For u64 a group's
8 bytes stay within one 64-byte half, so the second permute source is
unused, but the two-source form keeps a single code path.

EXPERIMENTAL: this changes the u64 VBMI kernel and could not be executed
on real AVX-512VBMI hardware in this environment (host lacks the feature;
qemu-user here cannot run AVX-512; Intel SDE unavailable). Validated by
cross-compilation and by structural equivalence to the width-generic NEON
and narrow-width VBMI kernels, which share the same gather/scatter tables.

https://claude.ai/code/session_01ATBvsrFw3eAPgcZnrqpMPS
@codspeed-hq

codspeed-hq Bot commented Jun 3, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

✅ 158 untouched benchmarks
⏩ 123 skipped benchmarks1


Comparing claude/dreamy-goodall-vbmi-uniform (ec0597e) with claude/dreamy-goodall-3vpHT (6c10ea7)

Open in CodSpeed

Footnotes

  1. 123 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Base automatically changed from claude/dreamy-goodall-3vpHT to develop June 4, 2026 12:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant