Refactor bit-transpose benchmarks with macro and add walltime CI #144
joseph-isaacs wants to merge 8 commits into
Generate the bit-transpose benchmarks from a single `bench_feature!` macro that emits a transpose/untranspose pair per feature tier (scalar, BMI2, AVX-512 VBMI, NEON, and the dispatch entry point), each runtime-guarded so the suite "just works" locally and only runs what the host supports. Add a `bench-bit-transpose-walltime` CI job on a dedicated runs-on x86 instance that measures every tier in CodSpeed walltime mode. This is necessary because CodSpeed's hosted macro runners are ARM64 and the existing simulation job runs under Valgrind, neither of which can execute the x86 BMI2 / AVX-512 VBMI paths.
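The guarded-pair pattern described above can be sketched without external crates. This is a minimal illustration, not the PR's macro: it uses plain `macro_rules!` with both function names passed explicitly instead of generated identifiers, returns `Option` instead of registering Divan benchmarks, and `transpose_scalar`, `has_bmi2`, and the 64x64 `u64` block shape are hypothetical stand-ins.

```rust
/// Naive 64x64 bit-matrix transpose: bit `c` of row `r` moves to bit `r` of row `c`.
/// Hypothetical stand-in for the real kernels, which are not shown in this PR page.
fn transpose_scalar(input: &[u64; 64]) -> [u64; 64] {
    let mut out = [0u64; 64];
    for r in 0..64 {
        for c in 0..64 {
            out[c] |= ((input[r] >> c) & 1) << r;
        }
    }
    out
}

/// Runtime guard for the BMI2 tier (assumed helper name).
#[cfg(target_arch = "x86_64")]
fn has_bmi2() -> bool {
    std::arch::is_x86_feature_detected!("bmi2")
}

/// On other architectures the BMI2 tier is never available.
#[cfg(not(target_arch = "x86_64"))]
fn has_bmi2() -> bool {
    false
}

/// Emits a transpose/untranspose pair that returns `None` when the guard is false,
/// mirroring "skip the benchmark when the host lacks the feature".
macro_rules! bench_feature {
    ($transpose:ident, $untranspose:ident, guard = $guard:expr, kernel = $kernel:path) => {
        fn $transpose(input: &[u64; 64]) -> Option<[u64; 64]> {
            if !$guard {
                return None;
            }
            Some($kernel(input))
        }

        // A square bit-matrix transpose is its own inverse, so this sketch
        // reuses the same kernel for the untranspose direction.
        fn $untranspose(input: &[u64; 64]) -> Option<[u64; 64]> {
            if !$guard {
                return None;
            }
            Some($kernel(input))
        }
    };
}

// Scalar tier: always available.
bench_feature!(scalar_transpose, scalar_untranspose, guard = true, kernel = transpose_scalar);

// BMI2 tier: guarded at runtime. The scalar kernel stands in for a real BMI2 kernel here.
bench_feature!(bmi2_transpose, bmi2_untranspose, guard = has_bmi2(), kernel = transpose_scalar);
```

On a host without BMI2, `bmi2_transpose` returns `None` instead of faulting, which is the behavior the runtime guards are meant to provide.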
Merging this PR will improve performance by 38.93%

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Simulation | for_pack_16_to_3_stack | 1.8 µs | 1.3 µs | +43.67% |
| ⚡ | Simulation | unpack_16_from_3_stack | 2 µs | 1.4 µs | +41.77% |
| ⚡ | Simulation | unchecked_unpack_16_from_3_stack | 2.1 µs | 1.5 µs | +37.65% |
| ⚡ | Simulation | pack_16_to_3_stack | 2.4 µs | 1.8 µs | +32.89% |
| 🆕 | WallTime | scalar_transpose | N/A | 55.3 µs | N/A |
| 🆕 | WallTime | scalar_untranspose[u8] | N/A | 45.3 µs | N/A |
| 🆕 | WallTime | dispatch_untranspose[u16] | N/A | 51 µs | N/A |
| 🆕 | WallTime | dispatch_untranspose[u8] | N/A | 45.8 µs | N/A |
| 🆕 | WallTime | dispatch_untranspose[u64] | N/A | 5.6 µs | N/A |
| 🆕 | WallTime | dispatch_transpose | N/A | 6 µs | N/A |
| 🆕 | WallTime | dispatch_untranspose[u32] | N/A | 61.8 µs | N/A |
| 🆕 | WallTime | scalar_untranspose[u16] | N/A | 51.3 µs | N/A |
| 🆕 | WallTime | scalar_untranspose[u64] | N/A | 60.6 µs | N/A |
| 🆕 | WallTime | scalar_untranspose[u32] | N/A | 62 µs | N/A |
| 🆕 | WallTime | vbmi_transpose | N/A | 4.7 µs | N/A |
| 🆕 | WallTime | vbmi_untranspose | N/A | 4.9 µs | N/A |
| 🆕 | WallTime | bmi2_untranspose | N/A | 33.8 µs | N/A |
| 🆕 | WallTime | bmi2_transpose | N/A | 33.7 µs | N/A |
Comparing claude/blissful-mccarthy-6wV2r (fd689c3) with develop (938100e)
Footnote: 134 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.
Replace the standalone walltime job with a measurement-mode matrix on the existing `bench-codspeed` job: `simulation` keeps running every suite under Valgrind, while `walltime` runs the bit-transpose Intel feature tiers on the runs-on x86 instance. Removes the duplicated job/steps.
Make the runner `family` a per-matrix variable so the walltime entry uses c8i (Intel Xeon 6 / Granite Rapids), the newest compute-optimized instance, while simulation stays on c6id since Valgrind instruction counting is machine-independent.
Restore individual `#[divan::bench]` functions per feature tier instead of generating them from a macro. The only shared complexity left in a macro is the runtime feature-gate plus unsafe wrapping (`gated_bench!`, single arm); every benchmark still calls the shared `bench_blocks` driver.
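A single-arm macro that combines the runtime feature check with the `unsafe` call might look like the following. The body of `gated_bench!` is not shown on this page, so this is an assumed shape; `gather_lsbs_bmi2` and `gather_lsbs` are hypothetical names, and the kernel is a single `_pext_u64` call rather than a full transpose.

```rust
/// Single-arm sketch: run `$kernel` only if `$feature` is detected at runtime,
/// wrapping the call in the `unsafe` block that a `#[target_feature]` function needs.
#[cfg(target_arch = "x86_64")]
macro_rules! gated_bench {
    ($feature:tt, $kernel:path, $($arg:expr),* $(,)?) => {{
        if std::arch::is_x86_feature_detected!($feature) {
            // SAFETY: the CPU feature the kernel requires was detected just above.
            Some(unsafe { $kernel($($arg),*) })
        } else {
            None
        }
    }};
}

/// Hypothetical BMI2 kernel: gathers the least-significant bit of each byte
/// of `x` into the low 8 bits of the result.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn gather_lsbs_bmi2(x: u64) -> u64 {
    std::arch::x86_64::_pext_u64(x, 0x0101_0101_0101_0101)
}

/// Returns `None` when the host lacks BMI2.
#[cfg(target_arch = "x86_64")]
fn gather_lsbs(x: u64) -> Option<u64> {
    gated_bench!("bmi2", gather_lsbs_bmi2, x)
}

/// Non-x86 hosts never have BMI2, so the gated call is always skipped.
#[cfg(not(target_arch = "x86_64"))]
fn gather_lsbs(_x: u64) -> Option<u64> {
    None
}
```

The feature name is matched as a `tt` fragment so the string literal is forwarded unchanged to `is_x86_feature_detected!`, which matches on specific literal tokens.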
Replace the runtime feature guards with compile-time `#[cfg(target_feature)]` gates, mutually exclusive across baseline / bmi2 / avx512vbmi. Combined with a three-entry codspeed matrix (simulation + walltime-bmi2 + walltime-avx512), each built with its own `-C target-feature` flag on c8i, every benchmark compiles — and therefore runs — on exactly one runner. Also integrates develop (#141): generic-over-T untranspose benches.
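The mutually exclusive compile-time gates can be illustrated as below. The exact predicates used in the PR are not shown, so these `cfg` expressions are an assumption: they are chosen so that exactly one definition of the hypothetical `active_tier` function survives for any combination of `-C target-feature` flags.

```rust
// AVX-512 VBMI tier: takes priority whenever the feature is enabled at compile time.
#[cfg(target_feature = "avx512vbmi")]
fn active_tier() -> &'static str {
    "avx512vbmi"
}

// BMI2 tier: compiled only when BMI2 is enabled and AVX-512 VBMI is not.
#[cfg(all(target_feature = "bmi2", not(target_feature = "avx512vbmi")))]
fn active_tier() -> &'static str {
    "bmi2"
}

// Baseline tier: compiled only when neither feature is enabled.
#[cfg(not(any(target_feature = "bmi2", target_feature = "avx512vbmi")))]
fn active_tier() -> &'static str {
    "baseline"
}
```

Because the three predicates partition every possible build configuration, a build with, say, `-C target-feature=+bmi2` contains only the `bmi2` definition, which is what lets each benchmark land on exactly one runner.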
Introduce a dev-only `bench-macros` proc-macro crate providing `#[bench(baseline|bmi2|avx512)]`, which expands to the mutually-exclusive `#[cfg(target_feature = …)]` gate for that Intel tier. It's excluded from the fastlanes workspace and never published (path dev-dependency only). Rework the codspeed matrix into four c8i runners: a simulation job for every suite except `bit_transpose`, and one walltime runner per tier (baseline / bmi2 / avx512) built with that tier's `-C target-feature`. Combined with the `#[bench]` gates, each `bit_transpose` benchmark compiles, and runs, on exactly one runner.
Summary
This PR refactors the bit-transpose benchmark suite to reduce code duplication and adds a dedicated CI job for measuring per-tier performance on x86 hardware with all CPU feature tiers available.
Key Changes
- Introduced `bench_feature!` macro: Consolidates the repetitive pattern of generating paired transpose/untranspose benchmarks for each CPU feature tier. The macro:
  - generates `<feature>_transpose` and `<feature>_untranspose` benchmark functions
  - accepts `guard = <expr>` to skip benchmarks when CPU features are unavailable
  - uses `paste::paste!` for identifier generation
- Refactored benchmark definitions: Replaced 16 individual benchmark functions with 5 macro invocations:
  - `scalar` and `dispatch` (always available)
  - `bmi2` and `vbmi` (x86_64 only, with runtime guards)
  - `neon` (aarch64 only, always available)
- Added walltime CI job: New `bench-bit-transpose-walltime` workflow that:
  - uses `-C target-feature=+avx2` to ensure baseline feature availability

Implementation Details

The macro approach enables "just works" local benchmarking: `cargo bench` exercises whatever the host CPU supports, while CI on feature-complete hardware measures all tiers. Runtime guards prevent benchmark failures on unsupported CPUs without requiring per-feature conditional compilation wiring.

https://claude.ai/code/session_018byoMez1xPQTscpobqk1ST