[experimental] Unify VBMI untranspose onto one vpermi2b kernel for all widths#146
Draft
joseph-isaacs wants to merge 3 commits into
Draft
[experimental] Unify VBMI untranspose onto one vpermi2b kernel for all widths#146joseph-isaacs wants to merge 3 commits into
joseph-isaacs wants to merge 3 commits into
Conversation
The BMI2 and VBMI `untranspose_bits` kernels previously only implemented the u64 (16-lane) mask transpose; u8/u16/u32 fell back to scalar. This generalizes both to all element widths, matching the NEON kernel, and routes every width through them in the dispatcher. - BMI2: `untranspose_bits_bmi2::<T>` gathers each of the 16 byte-groups at the width's stride and uses PDEP to perform the 8x8 bit transpose. For u64 this monomorphizes to byte-identical asm as before (verified), so u64 perf is unchanged; u8/u16/u32 are ~1.5-2x faster than scalar. - VBMI: narrow widths use a uniform vpermi2b gather / 8x8 transpose / vpermi2b scatter kernel (`untranspose_bits_vbmi_lt64`) spanning both 64-byte halves. The u64 path is preserved exactly (byte-identical asm). - The per-width gather/scatter permutation tables are hoisted from the NEON module into `bit_transpose::mod` and shared by NEON and VBMI. - Tests now cover all widths against the baseline for both kernels. https://claude.ai/code/session_01ATBvsrFw3eAPgcZnrqpMPS
Collapse the per-item `#[cfg(any(target_arch = "x86_64", target_arch = "aarch64"))]` attributes on the shared gather/scatter tables into a single `group_perm` submodule gated once. NEON and VBMI reference `group_perm::group_tables`; the index-building const fns and statics are now private to the module. https://claude.ai/code/session_01ATBvsrFw3eAPgcZnrqpMPS
Drop the `T::T == 64` special-case (and the `_lt64` helper) that PR #145 kept to leave the u64 VBMI asm byte-identical, so every width now flows through one `vpermi2b` gather / 8x8 transpose / `vpermi2b` scatter kernel selected only by the per-width gather/scatter tables. For u64 a group's 8 bytes stay within one 64-byte half, so the second permute source is unused, but the two-source form keeps a single code path. EXPERIMENTAL: this changes the u64 VBMI kernel and could not be executed on real AVX-512VBMI hardware in this environment (host lacks the feature; qemu-user here cannot run AVX-512; Intel SDE unavailable). Validated by cross-compilation and by structural equivalence to the width-generic NEON and narrow-width VBMI kernels, which share the same gather/scatter tables. https://claude.ai/code/session_01ATBvsrFw3eAPgcZnrqpMPS
Merging this PR will not alter performance
Comparing Footnotes
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR changes the u64 VBMI untranspose kernel. It could not be executed on real
avx512vbmihardware in the dev/CI environment available to me:avx512vbmi(so thehas_vbmi()-gated tests skip);qemu-userhere cannot execute AVX-512 in user mode (TCG faults on even basic AVX-512F — XSAVE/ZMM state unsupported in this build);So correctness here rests on cross-compilation + structural equivalence, not execution — see "Validation" below. Do not merge until it has been run on a real
avx512vbmimachine (the existingtest_vbmi_untranspose_all_widths_match_baselinetest will exercise it there). Stacked on #145; review/merge that first.What changed
Stacked on top of #145. That PR intentionally kept the
u64VBMI path byte-identical to the original kernel (vpermbgather within a 64-byte half + a scatter the compiler vectorizes), behind a compile-timeT::T == 64branch, because VBMI couldn't be benchmarked.This PR removes that special-case (and the
_lt64helper) so all widths — includingu64— flow through the single width-generic kernel:Only the per-width
group_permgather/scatter tables differ. Foru64each group's 8 bytes stay within one 64-byte half, so the secondvpermi2bsource is unused — the two-source form just keeps one code path.Net diff: −40 lines in
x86.rs, one kernel instead of two.Validation (no real-hardware execution)
RUSTFLAGS="-C target-feature=+avx512f,+avx512bw,+avx512vbmi"(forces VBMI codegen). Confirmed via--emit asmthatu64now emitsvpermi2bgather +vpermi2bscatter (novpermb, no scalar scatter).GATHER_64/SCATTER_64tables that the NEONuntranspose_bits_neon::<u64>all-widths test executes and validates on aarch64 CI. The scalar/BMI2 all-widths tests cover the same tables on x86.cargo test --features std(155) + clippy clean on x86_64 (std + no_std) and aarch64.Why it's worth considering
One uniform kernel (matches the NEON structure), no compile-time width branch, ~40 fewer lines. Permute-op count is unchanged vs the #145 u64 path (4 permutes either way); the gather becomes two-source
vpermi2binstead of single-sourcevpermb, which is throughput-equivalent on Ice Lake / Zen 4 — but that should be confirmed on hardware before merging.https://claude.ai/code/session_01ATBvsrFw3eAPgcZnrqpMPS
Generated by Claude Code