Refactor bit-transpose benchmarks with macro and add walltime CI#144

Draft
joseph-isaacs wants to merge 8 commits into develop from claude/blissful-mccarthy-6wV2r

@joseph-isaacs

Summary

This PR refactors the bit-transpose benchmark suite to reduce code duplication and adds a dedicated CI job for measuring per-tier performance on x86 hardware with all CPU feature tiers available.

Key Changes

  • Introduced bench_feature! macro: Consolidates the repetitive pattern of generating paired transpose/untranspose benchmarks for each CPU feature tier. The macro:

    • Generates <feature>_transpose and <feature>_untranspose benchmark functions
    • Supports optional runtime guards via guard = <expr> to skip benchmarks when CPU features are unavailable
    • Uses paste::paste! for identifier generation
  • Refactored benchmark definitions: Replaced 16 individual benchmark functions with 5 macro invocations:

    • scalar and dispatch (always available)
    • bmi2 and vbmi (x86_64 only, with runtime guards)
    • neon (aarch64 only, always available)
  • Added walltime CI job: New bench-bit-transpose-walltime workflow that:

    • Runs on dedicated x86 hardware (Ice Lake c6id.8xlarge) with all CPU feature tiers
    • Complements the existing CodSpeed simulation job which cannot execute AVX-512 VBMI
    • Measures actual wall-clock performance for each tier in CI
    • Uses -C target-feature=+avx2 to ensure baseline feature availability
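The macro pattern described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: the real macro emits `#[divan::bench]` functions and derives both names with `paste::paste!`, whereas this dependency-free sketch takes both names explicitly and returns `Option` to signal a skipped tier. `transpose_8x8` and `unsupported_tier` are hypothetical stand-ins.

```rust
/// Toy kernel (hypothetical stand-in): transposes a u64 viewed as an 8x8 bit
/// matrix. It is its own inverse, so it also serves as the "untranspose".
fn transpose_8x8(x: u64) -> u64 {
    let mut out = 0u64;
    for row in 0..8 {
        for col in 0..8 {
            let bit = (x >> (row * 8 + col)) & 1;
            out |= bit << (col * 8 + row);
        }
    }
    out
}

/// Hypothetical runtime guard that always reports the tier as unavailable.
fn unsupported_tier() -> bool {
    false
}

/// Generates a transpose/untranspose pair for one tier.
/// The optional `guard = <expr>` makes both functions return `None`
/// (that is, skip) when the expression evaluates to false.
macro_rules! bench_feature {
    // No guard given: forward to the guarded arm with an always-true guard.
    ($t_name:ident, $u_name:ident, $t_impl:path, $u_impl:path) => {
        bench_feature!($t_name, $u_name, $t_impl, $u_impl, guard = true);
    };
    ($t_name:ident, $u_name:ident, $t_impl:path, $u_impl:path, guard = $guard:expr) => {
        fn $t_name(input: u64) -> Option<u64> {
            if !$guard {
                return None;
            }
            Some($t_impl(input))
        }
        fn $u_name(input: u64) -> Option<u64> {
            if !$guard {
                return None;
            }
            Some($u_impl(input))
        }
    };
}

// One invocation per tier replaces two hand-written functions.
bench_feature!(scalar_transpose, scalar_untranspose, transpose_8x8, transpose_8x8);
bench_feature!(
    skipped_transpose,
    skipped_untranspose,
    transpose_8x8,
    transpose_8x8,
    guard = unsupported_tier()
);

fn main() {
    println!("{:?}", scalar_transpose(0x02));
    println!("{:?}", scalar_untranspose(0x0100));
    println!("{:?}", skipped_transpose(0x02));
    println!("{:?}", skipped_untranspose(0x02));
}
```

Passing the guard as an expression keeps the check at the call site of each generated function, so an unsupported tier is skipped rather than failing.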

Implementation Details

The macro approach makes local benchmarking "just work": cargo bench exercises whatever the host CPU supports, while CI on feature-complete hardware measures all tiers. Runtime guards skip benchmarks on unsupported CPUs instead of letting them fail, without requiring per-feature conditional-compilation wiring.
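As a rough sketch of what "whatever the host CPU supports" means in practice, the standard library's runtime detection can report which tiers a machine could run. The function name `available_tiers` and the tier labels are illustrative assumptions; only `is_x86_feature_detected!` and its `"bmi2"` / `"avx512vbmi"` feature names come from the standard library.

```rust
/// Lists the benchmark tiers the current host could execute.
/// Tier labels mirror the PR description; the function itself is hypothetical.
fn available_tiers() -> Vec<&'static str> {
    // The scalar and dispatch tiers need no special CPU features.
    let mut tiers = vec!["scalar", "dispatch"];

    // x86_64 tiers are detected at runtime.
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("bmi2") {
            tiers.push("bmi2");
        }
        if std::arch::is_x86_feature_detected!("avx512vbmi") {
            tiers.push("vbmi");
        }
    }

    // The PR description treats NEON as always available on aarch64.
    #[cfg(target_arch = "aarch64")]
    {
        tiers.push("neon");
    }

    tiers
}

fn main() {
    println!("tiers this host can run: {:?}", available_tiers());
}
```

On a machine without AVX-512 VBMI the `vbmi` entry is simply absent, which is the behavior the runtime guards rely on.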

https://claude.ai/code/session_018byoMez1xPQTscpobqk1ST

Generate the bit-transpose benchmarks from a single `bench_feature!` macro
that emits a transpose/untranspose pair per feature tier (scalar, BMI2,
AVX-512 VBMI, NEON, and the dispatch entry point), each runtime-guarded so
the suite "just works" locally and only runs what the host supports.

Add a `bench-bit-transpose-walltime` CI job on a dedicated runs-on x86
instance that measures every tier in CodSpeed walltime mode. This is
necessary because CodSpeed's hosted macro runners are ARM64 and the
existing simulation job runs under Valgrind, neither of which can execute
the x86 BMI2 / AVX-512 VBMI paths.

@codspeed-hq

codspeed-hq Bot commented Jun 3, 2026

Merging this PR will improve performance by 38.93%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

⚡ 4 improved benchmarks
✅ 139 untouched benchmarks
🆕 14 new benchmarks
⏩ 134 skipped benchmarks [1]

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | for_pack_16_to_3_stack | 1.8 µs | 1.3 µs | +43.67% |
| Simulation | unpack_16_from_3_stack | 2 µs | 1.4 µs | +41.77% |
| Simulation | unchecked_unpack_16_from_3_stack | 2.1 µs | 1.5 µs | +37.65% |
| Simulation | pack_16_to_3_stack | 2.4 µs | 1.8 µs | +32.89% |
| 🆕 WallTime | scalar_transpose | N/A | 55.3 µs | N/A |
| 🆕 WallTime | scalar_untranspose[u8] | N/A | 45.3 µs | N/A |
| 🆕 WallTime | dispatch_untranspose[u16] | N/A | 51 µs | N/A |
| 🆕 WallTime | dispatch_untranspose[u8] | N/A | 45.8 µs | N/A |
| 🆕 WallTime | dispatch_untranspose[u64] | N/A | 5.6 µs | N/A |
| 🆕 WallTime | dispatch_transpose | N/A | 6 µs | N/A |
| 🆕 WallTime | dispatch_untranspose[u32] | N/A | 61.8 µs | N/A |
| 🆕 WallTime | scalar_untranspose[u16] | N/A | 51.3 µs | N/A |
| 🆕 WallTime | scalar_untranspose[u64] | N/A | 60.6 µs | N/A |
| 🆕 WallTime | scalar_untranspose[u32] | N/A | 62 µs | N/A |
| 🆕 WallTime | vbmi_transpose | N/A | 4.7 µs | N/A |
| 🆕 WallTime | vbmi_untranspose | N/A | 4.9 µs | N/A |
| 🆕 WallTime | bmi2_untranspose | N/A | 33.8 µs | N/A |
| 🆕 WallTime | bmi2_transpose | N/A | 33.7 µs | N/A |

Comparing claude/blissful-mccarthy-6wV2r (fd689c3) with develop (938100e)

Footnotes

  1. 134 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them in CodSpeed to remove them from the performance reports.

claude added 3 commits June 3, 2026 13:48
Replace the standalone walltime job with a measurement-mode matrix on the
existing `bench-codspeed` job: `simulation` keeps running every suite under
Valgrind, while `walltime` runs the bit-transpose Intel feature tiers on the
runs-on x86 instance. Removes the duplicated job/steps.

Make the runner `family` a per-matrix variable so the walltime entry uses
c8i (Intel Xeon 6 / Granite Rapids), the newest compute-optimized instance,
while simulation stays on c6id since Valgrind instruction counting is
machine-independent.

Restore individual `#[divan::bench]` functions per feature tier instead of
generating them from a macro. The only shared complexity left in a macro is
the runtime feature-gate plus unsafe wrapping (`gated_bench!`, single arm);
every benchmark still calls the shared `bench_blocks` driver.
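The "runtime feature-gate plus unsafe wrapping" mentioned in the last commit message can be illustrated with a small sketch. It is written as a plain function rather than the `gated_bench!` macro, whose body is not shown in this PR. `pext` is used only because it is a well-known BMI2 instruction; which intrinsics the real kernels use is not stated here, and `pext_bmi2`, `pext_scalar`, and `try_pext` are hypothetical names.

```rust
/// BMI2 bit-extract, compiled with the feature enabled for this function only.
/// Calling it is unsafe because the CPU must actually support BMI2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn pext_bmi2(value: u64, mask: u64) -> u64 {
    std::arch::x86_64::_pext_u64(value, mask)
}

/// Portable reference: gathers the bits of `value` selected by `mask`
/// into the low bits of the result.
fn pext_scalar(value: u64, mut mask: u64) -> u64 {
    let mut out = 0u64;
    let mut out_bit = 0;
    while mask != 0 {
        let lowest = mask & mask.wrapping_neg(); // isolate lowest set mask bit
        if value & lowest != 0 {
            out |= 1 << out_bit;
        }
        out_bit += 1;
        mask &= mask - 1; // clear that mask bit
    }
    out
}

/// Runtime gate plus unsafe wrapping: returns `None` when BMI2 is unavailable.
fn try_pext(value: u64, mask: u64) -> Option<u64> {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("bmi2") {
            // SAFETY: BMI2 support was confirmed at runtime on the line above.
            return Some(unsafe { pext_bmi2(value, mask) });
        }
    }
    #[cfg(not(target_arch = "x86_64"))]
    let _ = (value, mask); // parameters are unused on other architectures
    None
}

fn main() {
    let (value, mask) = (0b1011_0110, 0b1111_0000);
    println!("scalar: {:#06b}", pext_scalar(value, mask));
    match try_pext(value, mask) {
        Some(v) => println!("bmi2:   {:#06b}", v),
        None => println!("bmi2:   skipped (feature not detected)"),
    }
}
```

Keeping the detection and the `unsafe` call in one place means each benchmark body stays safe code and only one spot needs a safety argument.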
@joseph-isaacs joseph-isaacs marked this pull request as draft June 3, 2026 15:24
claude added 4 commits June 3, 2026 16:46
Replace the runtime feature guards with compile-time `#[cfg(target_feature)]`
gates, mutually exclusive across baseline / bmi2 / avx512vbmi. Combined with a
three-entry codspeed matrix (simulation + walltime-bmi2 + walltime-avx512),
each built with its own `-C target-feature` flag on c8i, every benchmark
compiles, and therefore runs, on exactly one runner.

Also integrates develop (#141): generic-over-T untranspose benches.

Introduce a dev-only `bench-macros` proc-macro crate providing
`#[bench(baseline|bmi2|avx512)]`, which expands to the mutually-exclusive
`#[cfg(target_feature = …)]` gate for that Intel tier. It's excluded from the
fastlanes workspace and never published (path dev-dependency only).

Rework the codspeed matrix into four c8i runners: a simulation job for every
suite except bit_transpose, and one walltime runner per tier (baseline / bmi2 /
avx512) built with that tier's -C target-feature. Combined with the #[bench]
gates, each bit_transpose benchmark compiles, and runs, on exactly one runner.
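One way to make compile-time gates mutually exclusive across the three tiers is sketched below. The exact `cfg` expressions the `#[bench(...)]` attribute expands to are not shown in this PR, so the priority order here (avx512 over bmi2 over baseline) and the `ACTIVE_TIER` constant are assumptions for illustration.

```rust
// Exactly one of these three definitions survives compilation, depending on
// the `-C target-feature` flags the binary was built with.

/// Built with AVX-512 VBMI enabled: the avx512 tier wins.
#[cfg(target_feature = "avx512vbmi")]
const ACTIVE_TIER: &str = "avx512";

/// Built with BMI2 but without AVX-512 VBMI.
#[cfg(all(target_feature = "bmi2", not(target_feature = "avx512vbmi")))]
const ACTIVE_TIER: &str = "bmi2";

/// Built with neither feature enabled.
#[cfg(not(any(target_feature = "bmi2", target_feature = "avx512vbmi")))]
const ACTIVE_TIER: &str = "baseline";

fn main() {
    // A duplicate-definition or missing-definition compile error would mean
    // the gates overlap or leave a gap; compiling at all shows they do not.
    println!("this binary was compiled for the {ACTIVE_TIER} tier");
}
```

Because the three conditions partition every flag combination, a benchmark tagged with one tier exists in exactly one of the per-tier builds, which matches the "compiles, and runs, on exactly one runner" property described above.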