Refactor bit-transpose benchmarks with macro and add walltime CI by joseph-isaacs · Pull Request #144 · spiraldb/fastlanes

joseph-isaacs · 2026-06-03T13:43:07Z

Summary

This PR refactors the bit-transpose benchmark suite to reduce code duplication and adds a dedicated CI job for measuring per-tier performance on x86 hardware with all CPU feature tiers available.

Key Changes

- Introduced `bench_feature!` macro: Consolidates the repetitive pattern of generating paired transpose/untranspose benchmarks for each CPU feature tier. The macro:
  - Generates `<feature>_transpose` and `<feature>_untranspose` benchmark functions
  - Supports optional runtime guards via `guard = <expr>` to skip benchmarks when CPU features are unavailable
  - Uses `paste::paste!` for identifier generation
- Refactored benchmark definitions: Replaced 16 individual benchmark functions with 5 macro invocations:
  - `scalar` and `dispatch` (always available)
  - `bmi2` and `vbmi` (x86_64 only, with runtime guards)
  - `neon` (aarch64 only, always available)
- Added walltime CI job: New `bench-bit-transpose-walltime` workflow that:
  - Runs on dedicated x86 hardware (Ice Lake c6id.8xlarge) with all CPU feature tiers
  - Complements the existing CodSpeed simulation job, which cannot execute AVX-512 VBMI
  - Measures actual wall-clock performance for each tier in CI
  - Uses `-C target-feature=+avx2` to ensure baseline feature availability
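
The macro pattern described above might look roughly like the following std-only sketch. It is not the PR's macro: both function names are passed explicitly instead of being built with `paste::paste!`, the generated functions return `Option` rather than registering a divan benchmark, and `toy_transpose` is a hypothetical 8x8 bit-matrix kernel used only so the example runs on its own.

```rust
/// Toy 8x8 bit-matrix transpose: bit j of row i moves to bit i of row j.
/// A hypothetical stand-in for the real kernels, which are not reproduced here.
fn toy_transpose(rows: [u8; 8]) -> [u8; 8] {
    let mut out = [0u8; 8];
    for i in 0..8 {
        for j in 0..8 {
            let bit = (rows[i] >> j) & 1;
            out[j] |= bit << i;
        }
    }
    out
}

/// Emits a transpose/untranspose pair. Both names are passed explicitly here;
/// the PR derives them from one feature name with `paste::paste!`.
macro_rules! bench_feature {
    // No guard given: the tier is always available.
    ($transpose:ident, $untranspose:ident) => {
        bench_feature!($transpose, $untranspose, guard = true);
    };
    // `guard = <expr>` is evaluated at run time; `None` means "skipped".
    ($transpose:ident, $untranspose:ident, guard = $guard:expr) => {
        fn $transpose(rows: [u8; 8]) -> Option<[u8; 8]> {
            if $guard { Some(toy_transpose(rows)) } else { None }
        }
        // The toy transpose is its own inverse, so untranspose reuses it.
        fn $untranspose(rows: [u8; 8]) -> Option<[u8; 8]> {
            if $guard { Some(toy_transpose(rows)) } else { None }
        }
    };
}

bench_feature!(scalar_transpose, scalar_untranspose);
// A guard that is always false, standing in for a missing CPU feature.
bench_feature!(absent_transpose, absent_untranspose, guard = false);

fn main() {
    let rows = [0xFF, 0, 0, 0, 0, 0, 0, 0];
    let transposed = scalar_transpose(rows).expect("scalar tier always runs");
    let restored = scalar_untranspose(transposed).expect("scalar tier always runs");
    assert_eq!(restored, rows);
    assert!(absent_transpose(rows).is_none() && absent_untranspose(rows).is_none());
    println!("round trip ok: {transposed:?}");
}
```

The two-arm shape keeps always-available tiers free of guard boilerplate, since the unguarded arm simply forwards to the guarded one with `guard = true`.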

Implementation Details

The macro approach lets local benchmarking "just work": `cargo bench` exercises whatever the host CPU supports, while CI on feature-complete hardware measures all tiers. Runtime guards prevent benchmark failures on unsupported CPUs without requiring per-feature conditional-compilation wiring.
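
As a rough illustration of the runtime-guard idea, the sketch below checks CPU support with the standard `is_x86_feature_detected!` macro before running a tier. The helper names `tier_available` and `run_guarded` are hypothetical and do not come from the PR; only the tier names are taken from the description above.

```rust
/// True when the host CPU can execute the named tier.
/// `scalar` and `dispatch` need no special instructions.
#[cfg(target_arch = "x86_64")]
fn tier_available(tier: &str) -> bool {
    match tier {
        "scalar" | "dispatch" => true,
        "bmi2" => std::is_x86_feature_detected!("bmi2"),
        "vbmi" => std::is_x86_feature_detected!("avx512vbmi"),
        _ => false,
    }
}

/// Off x86_64 the Intel tiers never run; NEON is treated as always
/// present on aarch64, matching the PR description.
#[cfg(not(target_arch = "x86_64"))]
fn tier_available(tier: &str) -> bool {
    match tier {
        "scalar" | "dispatch" => true,
        "neon" => cfg!(target_arch = "aarch64"),
        _ => false,
    }
}

/// Runs `body` only when `guard` holds, so a skipped benchmark reports
/// `None` instead of faulting on an unsupported instruction.
fn run_guarded<T>(guard: bool, body: impl FnOnce() -> T) -> Option<T> {
    if guard { Some(body()) } else { None }
}

fn main() {
    for tier in ["scalar", "dispatch", "bmi2", "vbmi", "neon"] {
        // The closure is a placeholder for a real benchmark body.
        match run_guarded(tier_available(tier), || tier.len()) {
            Some(_) => println!("{tier}: ran"),
            None => println!("{tier}: skipped (unsupported on this host)"),
        }
    }
}
```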

https://claude.ai/code/session_018byoMez1xPQTscpobqk1ST

Generate the bit-transpose benchmarks from a single `bench_feature!` macro that emits a transpose/untranspose pair per feature tier (scalar, BMI2, AVX-512 VBMI, NEON, and the dispatch entry point), each runtime-guarded so the suite "just works" locally and only runs what the host supports. Add a `bench-bit-transpose-walltime` CI job on a dedicated runs-on x86 instance that measures every tier in CodSpeed walltime mode. This is necessary because CodSpeed's hosted macro runners are ARM64 and the existing simulation job runs under Valgrind, neither of which can execute the x86 BMI2 / AVX-512 VBMI paths.

CLAassistant · 2026-06-03T13:43:17Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

codspeed-hq · 2026-06-03T13:44:58Z

Merging this PR will improve performance by 38.93%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments, which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 4 improved benchmarks
✅ 139 untouched benchmarks
🆕 14 new benchmarks
⏩ 134 skipped benchmarks¹

Performance Changes

| | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Simulation | `for_pack_16_to_3_stack` | 1.8 µs | 1.3 µs | +43.67% |
| ⚡ | Simulation | `unpack_16_from_3_stack` | 2 µs | 1.4 µs | +41.77% |
| ⚡ | Simulation | `unchecked_unpack_16_from_3_stack` | 2.1 µs | 1.5 µs | +37.65% |
| ⚡ | Simulation | `pack_16_to_3_stack` | 2.4 µs | 1.8 µs | +32.89% |
| 🆕 | WallTime | `scalar_transpose` | N/A | 55.3 µs | N/A |
| 🆕 | WallTime | `scalar_untranspose[u8]` | N/A | 45.3 µs | N/A |
| 🆕 | WallTime | `dispatch_untranspose[u16]` | N/A | 51 µs | N/A |
| 🆕 | WallTime | `dispatch_untranspose[u8]` | N/A | 45.8 µs | N/A |
| 🆕 | WallTime | `dispatch_untranspose[u64]` | N/A | 5.6 µs | N/A |
| 🆕 | WallTime | `dispatch_transpose` | N/A | 6 µs | N/A |
| 🆕 | WallTime | `dispatch_untranspose[u32]` | N/A | 61.8 µs | N/A |
| 🆕 | WallTime | `scalar_untranspose[u16]` | N/A | 51.3 µs | N/A |
| 🆕 | WallTime | `scalar_untranspose[u64]` | N/A | 60.6 µs | N/A |
| 🆕 | WallTime | `scalar_untranspose[u32]` | N/A | 62 µs | N/A |
| 🆕 | WallTime | `vbmi_transpose` | N/A | 4.7 µs | N/A |
| 🆕 | WallTime | `vbmi_untranspose` | N/A | 4.9 µs | N/A |
| 🆕 | WallTime | `bmi2_untranspose` | N/A | 33.8 µs | N/A |
| 🆕 | WallTime | `bmi2_transpose` | N/A | 33.7 µs | N/A |

Tip

Curious why this is faster? Comment `@codspeedbot explain why this is faster` on this PR, or directly use the CodSpeed MCP with your agent.

Comparing `claude/blissful-mccarthy-6wV2r` (fd689c3) with `develop` (938100e)

¹ 134 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them in CodSpeed to remove them from the performance reports.

Replace the standalone walltime job with a measurement-mode matrix on the existing `bench-codspeed` job: `simulation` keeps running every suite under Valgrind, while `walltime` runs the bit-transpose Intel feature tiers on the runs-on x86 instance. Removes the duplicated job/steps.

Make the runner `family` a per-matrix variable so the walltime entry uses c8i (Intel Xeon 6 / Granite Rapids), the newest compute-optimized instance, while simulation stays on c6id since Valgrind instruction counting is machine-independent.

Restore individual `#[divan::bench]` functions per feature tier instead of generating them from a macro. The only shared complexity left in a macro is the runtime feature-gate plus unsafe wrapping (`gated_bench!`, single arm); every benchmark still calls the shared `bench_blocks` driver.
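
A single-arm macro combining the runtime feature gate with the `unsafe` call might look something like the sketch below. This is an assumption about the shape of `gated_bench!`, not its actual definition: `pext_bmi2`, `pext_scalar`, and `bmi2_bench` are hypothetical names, the result is returned as an `Option` rather than fed to divan, and the shared `bench_blocks` driver is not modeled.

```rust
/// Portable reference: gathers the bits of `value` selected by `mask`
/// into the low bits of the result (what BMI2 `pext` computes).
fn pext_scalar(value: u64, mut mask: u64) -> u64 {
    let mut out = 0u64;
    let mut next_bit = 0u32;
    while mask != 0 {
        let lowest = mask & mask.wrapping_neg(); // isolate lowest set mask bit
        if value & lowest != 0 {
            out |= 1u64 << next_bit;
        }
        next_bit += 1;
        mask &= mask - 1; // clear that bit
    }
    out
}

/// Hypothetical BMI2 kernel; callers must ensure the CPU supports BMI2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
#[allow(unused_unsafe)]
unsafe fn pext_bmi2(value: u64, mask: u64) -> u64 {
    unsafe { std::arch::x86_64::_pext_u64(value, mask) }
}

/// Single-arm macro: runtime feature gate plus the `unsafe` call wrapper.
#[cfg(target_arch = "x86_64")]
macro_rules! gated_bench {
    ($feature:tt, $kernel:ident($($arg:expr),*)) => {
        if std::is_x86_feature_detected!($feature) {
            // SAFETY: the check above confirmed the feature on this CPU.
            Some(unsafe { $kernel($($arg),*) })
        } else {
            None
        }
    };
}

#[cfg(target_arch = "x86_64")]
fn bmi2_bench(value: u64, mask: u64) -> Option<u64> {
    gated_bench!("bmi2", pext_bmi2(value, mask))
}

#[cfg(not(target_arch = "x86_64"))]
fn bmi2_bench(_value: u64, _mask: u64) -> Option<u64> {
    None // the BMI2 tier does not exist off x86_64
}

fn main() {
    let (value, mask) = (0b1011_0110, 0b1111_0000);
    match bmi2_bench(value, mask) {
        Some(got) => assert_eq!(got, pext_scalar(value, mask)),
        None => println!("bmi2 tier skipped on this host"),
    }
}
```

Keeping only the gate and the `unsafe` block in the macro leaves each benchmark as an ordinary, individually readable function, which appears to be the motivation stated in the commit message.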

…carthy-6wV2r

Replace the runtime feature guards with compile-time `#[cfg(target_feature)]` gates, mutually exclusive across baseline / bmi2 / avx512vbmi. Combined with a three-entry codspeed matrix (simulation + walltime-bmi2 + walltime-avx512), each built with its own `-C target-feature` flag on c8i, every benchmark compiles — and therefore runs — on exactly one runner. Also integrates develop (#141): generic-over-T untranspose benches.
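
To make the "mutually exclusive" property concrete, here is a minimal sketch of three compile-time gates of which exactly one survives in any given build. The precedence chosen here (AVX-512 VBMI over BMI2 over baseline) and the `COMPILED_TIER` constant are assumptions for illustration; the PR's exact predicates are not shown on this page.

```rust
// Exactly one of these three definitions is compiled into the binary,
// depending on the target features enabled at build time.
#[cfg(target_feature = "avx512vbmi")]
const COMPILED_TIER: &str = "avx512vbmi";

#[cfg(all(target_feature = "bmi2", not(target_feature = "avx512vbmi")))]
const COMPILED_TIER: &str = "bmi2";

#[cfg(not(any(target_feature = "bmi2", target_feature = "avx512vbmi")))]
const COMPILED_TIER: &str = "baseline";

fn main() {
    // If two gates overlapped, the duplicate constant would fail to compile;
    // if none matched, the name would be undefined.
    println!("this binary contains the {COMPILED_TIER} tier");
}
```

Built with default flags this would normally select `baseline`; passing a flag such as `-C target-feature=+bmi2` through `RUSTFLAGS` selects the `bmi2` definition instead, which is how each matrix entry can end up running a different set of benchmarks.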

Introduce a dev-only `bench-macros` proc-macro crate providing `#[bench(baseline|bmi2|avx512)]`, which expands to the mutually-exclusive `#[cfg(target_feature = …)]` gate for that Intel tier. It's excluded from the fastlanes workspace and never published (path dev-dependency only). Rework the codspeed matrix into four c8i runners: a simulation job for every suite except bit_transpose, and one walltime runner per tier (baseline / bmi2 / avx512) built with that tier's -C target-feature. Combined with the #[bench] gates, each bit_transpose benchmark compiles — and runs — on exactly one runner.
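
A proc-macro crate cannot be shown as a single runnable file, so the sketch below swaps in a declarative `macro_rules!` analogue of the `#[bench(baseline|bmi2|avx512)]` idea: a tier name is mapped to its `#[cfg(target_feature = …)]` gate and attached to the item. The predicates, the macro name `bench!`, and the function `tier_bench` are assumptions made for illustration, not the contents of `bench-macros`.

```rust
/// Declarative stand-in for an attribute such as `#[bench(bmi2)]`:
/// each arm attaches the cfg gate for one tier to the given item.
macro_rules! bench {
    (baseline, $item:item) => {
        #[cfg(not(any(target_feature = "bmi2", target_feature = "avx512vbmi")))]
        $item
    };
    (bmi2, $item:item) => {
        #[cfg(all(target_feature = "bmi2", not(target_feature = "avx512vbmi")))]
        $item
    };
    (avx512, $item:item) => {
        #[cfg(target_feature = "avx512vbmi")]
        $item
    };
}

// All three share one name on purpose: the program only compiles if the
// gates are mutually exclusive, so exactly one definition remains.
bench!(baseline, fn tier_bench() -> &'static str { "baseline" });
bench!(bmi2, fn tier_bench() -> &'static str { "bmi2" });
bench!(avx512, fn tier_bench() -> &'static str { "avx512" });

fn main() {
    println!("compiled tier benchmark: {}", tier_bench());
}
```

Centralizing the predicates in one place means a benchmark only names its tier, and a change to the gating rules touches a single definition.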

claude added 3 commits June 3, 2026 13:48

joseph-isaacs marked this pull request as draft June 3, 2026 15:24

claude added 4 commits June 3, 2026 16:46

Run all codspeed benches on c8i

882df4b

Merge remote-tracking branch 'origin/develop' into claude/blissful-mc…

5603724

…carthy-6wV2r

joseph-isaacs wants to merge 8 commits into develop from claude/blissful-mccarthy-6wV2r

3 participants
