pclmulqdq intrinsics don't inline well across target_feature changes anymore #139029

Open
iximeow opened this issue Mar 27, 2025 · 6 comments
Labels
A-target-feature Area: Enabling/disabling target features like AVX, Neon, etc. C-bug Category: This is a bug. I-prioritize Issue: Indicates that prioritization has been requested for this issue. I-slow Issue: Problems and improvements with respect to performance of generated code. O-x86_64 Target: x86-64 processors (like x86_64-*) (also known as amd64 and x64) regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. S-has-bisection Status: a bisection has been found for this issue T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@iximeow
Contributor

iximeow commented Mar 27, 2025

i'd noticed the pclmulqdq intrinsics in crc32fast were notable in a perf report of a benchmark last night. somewhat shockingly, there were functions whose body was pclmulqdq xmm0, xmm1, 17; ret and pclmulqdq xmm0, xmm1, 0; ret, complete with constraining callers' choice of xmm registers! after a bit of digging it seems to be a regression in nightly.

the specific regression i'd started at can be reproduced with cargo bench in https://github.com/srijs/rust-crc32fast .
cargo +1.85.1 bench produces

running 4 tests
test bench_kilobyte_baseline    ... bench:         130 ns/iter (+/- 2) = 7876 MB/s
test bench_kilobyte_specialized ... bench:          48 ns/iter (+/- 0) = 21333 MB/s
test bench_megabyte_baseline    ... bench:     137,017 ns/iter (+/- 247) = 7652 MB/s
test bench_megabyte_specialized ... bench:      48,153 ns/iter (+/- 51) = 21775 MB/s

whereas
cargo +nightly bench produces

running 4 tests
test bench_kilobyte_baseline    ... bench:         130 ns/iter (+/- 1) = 7876 MB/s
test bench_kilobyte_specialized ... bench:         156 ns/iter (+/- 0) = 6564 MB/s
test bench_megabyte_baseline    ... bench:     137,229 ns/iter (+/- 393) = 7641 MB/s
test bench_megabyte_specialized ... bench:     145,632 ns/iter (+/- 377) = 7200 MB/s

after looking at perf a bit i believe this is representative: https://rust.godbolt.org/z/8dxcE4vo1 . i'm including everything there in this issue as well.

Code

I tried this code:

use std::arch::x86_64 as arch;

#[target_feature(enable = "pclmulqdq", enable = "sse2", enable = "sse4.1")]
#[no_mangle]
pub unsafe fn reduce128_caller(a: arch::__m128i, b: arch::__m128i, keys: arch::__m128i) -> arch::__m128i {
    reduce128(a, b, keys)
}

unsafe fn reduce128(a: arch::__m128i, b: arch::__m128i, keys: arch::__m128i) -> arch::__m128i {
    let t1 = arch::_mm_clmulepi64_si128(a, keys, 0x00);
    let t2 = arch::_mm_clmulepi64_si128(a, keys, 0x11);
    arch::_mm_xor_si128(arch::_mm_xor_si128(b, t1), t2)
}

I expected to see this happen (with -C opt-level=3):

reduce128_caller:
        mov     rax, rdi
        movdqa  xmm0, xmmword ptr [rsi]
        movdqa  xmm1, xmmword ptr [rcx]
        movdqa  xmm2, xmm0
        pclmulqdq       xmm2, xmm1, 0
        pxor    xmm2, xmmword ptr [rdx]
        pclmulqdq       xmm0, xmm1, 17
        pxor    xmm2, xmm0
        movdqa  xmmword ptr [rdi], xmm2
        ret

Instead, this happened (also -C opt-level=3):

core::core_arch::x86::pclmulqdq::_mm_clmulepi64_si128::h9ea1fa421d47acc5:
        pclmulqdq       xmm0, xmm1, 17
        ret

core::core_arch::x86::pclmulqdq::_mm_clmulepi64_si128::heb2402630e2a6f04:
        pclmulqdq       xmm0, xmm1, 0
        ret

reduce128_caller:
        jmp     example::reduce128::h8b5076bc8edc1d53

example::reduce128::h8b5076bc8edc1d53:
        sub     rsp, 72
        movaps  xmmword ptr [rsp + 16], xmm2
        movaps  xmmword ptr [rsp + 48], xmm1
        movaps  xmmword ptr [rsp], xmm0
        movaps  xmm1, xmm2
        call    core::core_arch::x86::pclmulqdq::_mm_clmulepi64_si128::heb2402630e2a6f04
        movaps  xmmword ptr [rsp + 32], xmm0
        movaps  xmm0, xmmword ptr [rsp]
        movaps  xmm1, xmmword ptr [rsp + 16]
        call    core::core_arch::x86::pclmulqdq::_mm_clmulepi64_si128::h9ea1fa421d47acc5
        movaps  xmm1, xmmword ptr [rsp + 32]
        xorps   xmm1, xmmword ptr [rsp + 48]
        xorps   xmm0, xmm1
        add     rsp, 72
        ret

Version it worked on

1.85.1, 1.31.0, and a half dozen in between.

additionally, beta (rust version 1.86.0-beta.7 (7824ede 2025-03-22)) seems good.

nightly with -C opt-level=3 -C target-feature=+pclmul still does great.

Version with regression

in the above godbolt link, i see --version in the rustc nightly tab provides rustc 1.87.0-nightly (a2e63569f 2025-03-26). this is consistent with how i first saw this locally:

rustc +nightly --version --verbose:

rustc +nightly --version --verbose
rustc 1.87.0-nightly (a2e63569f 2025-03-26)
binary: rustc
commit-hash: a2e63569fd6702ac5dd027a80a9fdaadce73adae
commit-date: 2025-03-26
host: x86_64-unknown-linux-gnu
release: 1.87.0-nightly
LLVM version: 20.1.1

Related improvement along the way

adding the same target_feature block on the inner function sees nightly produce somewhat better-than-baseline code: https://rust.godbolt.org/z/sGrYedeaP

use std::arch::x86_64 as arch;

#[target_feature(enable = "pclmulqdq", enable = "sse2", enable = "sse4.1")]
#[no_mangle]
pub unsafe fn reduce128_caller(a: arch::__m128i, b: arch::__m128i, keys: arch::__m128i) -> arch::__m128i {
    reduce128(a, b, keys)
}

#[target_feature(enable = "pclmulqdq", enable = "sse2", enable = "sse4.1")]
unsafe fn reduce128(a: arch::__m128i, b: arch::__m128i, keys: arch::__m128i) -> arch::__m128i {
    let t1 = arch::_mm_clmulepi64_si128(a, keys, 0x00);
    let t2 = arch::_mm_clmulepi64_si128(a, keys, 0x11);
    arch::_mm_xor_si128(arch::_mm_xor_si128(b, t1), t2)
}

with rustc +nightly -C opt-level=3 yields:

reduce128_caller:
        movdqa  xmm3, xmm0
        pclmulqdq       xmm3, xmm2, 0
        pclmulqdq       xmm0, xmm2, 17
        pxor    xmm3, xmm1
        pxor    xmm0, xmm3
        ret

whereas before the codegen was identical regardless of the target_feature attribute on the inner function. so at least in some cases there is a modest improvement?
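as a minimal sketch of the pattern that godbolt link exercises (every helper between the entry point and the intrinsic carries the same target_feature set), here is a small carry-less multiply with runtime dispatch. the names clmul_soft, clmul_lo, clmul_hw and clmul are hypothetical and are not taken from crc32fast; this only illustrates the attribute placement and is not a claim about the root cause of the regression.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64 as arch;

/// Portable carry-less multiply of two 64-bit values.
/// Serves as the fallback and as a reference for the hardware path.
pub fn clmul_soft(a: u64, b: u64) -> u128 {
    let mut acc = 0u128;
    for i in 0..64 {
        if (b >> i) & 1 == 1 {
            acc ^= (a as u128) << i;
        }
    }
    acc
}

/// Hypothetical inner helper. It is annotated with the same features as its
/// caller, mirroring the "same target_feature block on the inner function"
/// variant from the issue.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "pclmulqdq", enable = "sse2")]
unsafe fn clmul_lo(a: arch::__m128i, b: arch::__m128i) -> arch::__m128i {
    // imm8 = 0x00 multiplies the low 64-bit lanes of both operands.
    arch::_mm_clmulepi64_si128(a, b, 0x00)
}

/// Hypothetical feature-gated entry point for the hardware path.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "pclmulqdq", enable = "sse2")]
unsafe fn clmul_hw(a: u64, b: u64) -> u128 {
    // _mm_set_epi64x takes (high, low); the operands go in the low lanes.
    let va = arch::_mm_set_epi64x(0, a as i64);
    let vb = arch::_mm_set_epi64x(0, b as i64);
    let r = clmul_lo(va, vb);
    // __m128i and u128 are both 16 bytes; x86_64 is little-endian, so the
    // low lane becomes the low 64 bits of the u128.
    std::mem::transmute::<arch::__m128i, u128>(r)
}

/// Safe wrapper: uses pclmulqdq when the CPU reports it, else the fallback.
pub fn clmul(a: u64, b: u64) -> u128 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("pclmulqdq") {
            // SAFETY: the required CPU feature was just detected at runtime.
            return unsafe { clmul_hw(a, b) };
        }
    }
    clmul_soft(a, b)
}
```

clmul(3, 5) multiplies (x + 1) by (x^2 + 1) over GF(2), giving x^3 + x^2 + x + 1, i.e. 15, on either path. whether clmul_lo ends up inlined on a given toolchain still has to be checked in the emitted assembly.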

@rustbot modify labels: +regression-from-stable-to-nightly -regression-untriaged

@iximeow iximeow added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Mar 27, 2025
@rustbot rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. and removed regression-untriaged Untriaged performance or correctness regression. labels Mar 27, 2025
@iximeow iximeow changed the title pclmulqdq intrinsics don't inline well in the face of target_feature anymore pclmulqdq intrinsics don't inline well across target_feature changes Mar 27, 2025
@iximeow iximeow changed the title pclmulqdq intrinsics don't inline well across target_feature changes pclmulqdq intrinsics don't inline well across target_feature changes anymore Mar 27, 2025
@Noratrieb
Member

Thank you for the report! It would be useful to bisect the regression to the specific PR using cargo-bisect-rustc to make it easier to figure out what happened and ping the relevant people.
A script that dumps the assembly should be sufficient for the repro check for it.
@rustbot label E-needs-bisection

@rustbot rustbot added the E-needs-bisection Call for participation: This issue needs bisection: https://github.com/rust-lang/cargo-bisect-rustc label Mar 27, 2025
@moxian
Contributor

moxian commented Mar 27, 2025

@rustbot label: -E-needs-bisection +S-has-bisection +A-target-feature +T-compiler +I-slow +O-x86_64

@rustbot rustbot added A-target-feature Area: Enabling/disabling target features like AVX, Neon, etc. S-has-bisection Status: a bisection has been found for this issue T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. I-slow Issue: Problems and improvements with respect to performance of generated code. O-x86_64 Target: x86-64 processors (like x86_64-*) (also known as amd64 and x64) and removed E-needs-bisection Call for participation: This issue needs bisection: https://github.com/rust-lang/cargo-bisect-rustc labels Mar 27, 2025
@iximeow
Contributor Author

iximeow commented Mar 27, 2025

i also arrived at that change with cargo-bisect-rustc (intensely handy tool!!)

(notes on how i got there)
~/target_feature_regression$ cat code.rs
use std::arch::x86_64 as arch;

#[target_feature(enable = "pclmulqdq", enable = "sse2", enable = "sse4.1")]
#[no_mangle]
pub unsafe fn reduce128_caller(a: arch::__m128i, b: arch::__m128i, keys: arch::__m128i) -> arch::__m128i {
    reduce128(a, b, keys)
}

unsafe fn reduce128(a: arch::__m128i, b: arch::__m128i, keys: arch::__m128i) -> arch::__m128i {
    let t1 = arch::_mm_clmulepi64_si128(a, keys, 0x00);
    let t2 = arch::_mm_clmulepi64_si128(a, keys, 0x11);
    arch::_mm_xor_si128(arch::_mm_xor_si128(b, t1), t2)
}
~/target_feature_regression$ cat script.sh
#!/bin/sh

RESULTS=$(rustc -C opt-level=3 --emit asm=- --crate-type cdylib code.rs | grep _mm_clmulepi64_si)

if [ -z "$RESULTS" ]; then
  exit 0
else
  exit 1
fi

ends up with...

********************************************************************************
Regression in 17c1c329a5512d718b67ef6797538b154016cd34
********************************************************************************

Attempting to search unrolled perf builds
ERROR: couldn't find perf build comment
==================================================================================
= Please file this regression report on the rust-lang/rust GitHub repository     =
=        New issue: https://github.com/rust-lang/rust/issues/new                 =
=     Known issues: https://github.com/rust-lang/rust/issues                     =
= Copy and paste the text below into the issue report thread.  Thanks!           =
==================================================================================

searched nightlies: from nightly-2025-01-01 to nightly-2025-03-27
regressed nightly: nightly-2025-02-20
searched commit range: https://github.com/rust-lang/rust/compare/827a0d638dabc9a22c56f9c37a557568f86ac76c...f280acf4c743806abbbbcfe65050ac52ec4bdec0
regressed commit: https://github.com/rust-lang/rust/commit/17c1c329a5512d718b67ef6797538b154016cd34
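for reference, the pass/fail rule that script.sh applies can be sketched in Rust as well. this is only an illustration of the check (any mention of an out-of-line _mm_clmulepi64_si symbol in the assembly counts as regressed); the function names has_outlined_clmul and bisect_exit_code are hypothetical, and the sketch takes the assembly as a string rather than invoking rustc itself.

```rust
/// Returns true when the assembly text mentions an out-of-line copy of the
/// intrinsic, mirroring `grep _mm_clmulepi64_si` in script.sh.
pub fn has_outlined_clmul(asm: &str) -> bool {
    asm.lines().any(|line| line.contains("_mm_clmulepi64_si"))
}

/// Maps the check to the exit status convention used by script.sh:
/// 0 when the intrinsic was inlined (good), 1 when it was not (regressed).
pub fn bisect_exit_code(asm: &str) -> i32 {
    if has_outlined_clmul(asm) {
        1
    } else {
        0
    }
}
```

fed the two listings from the original report, the expected output maps to 0 and the regressed output (which contains call core::core_arch::x86::pclmulqdq::_mm_clmulepi64_si128::...) maps to 1.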

@Noratrieb
Member

I suspect that this is because of https://github.com/rust-lang/rust/pull/135408/files#diff-f5ccef4931d8d453a6d2cadcd1fda96f59b96dfdc73fd0543db257cb4a362020R851, but I'm not entirely sure why. @RalfJung do you have an idea here?

@Noratrieb Noratrieb removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Mar 28, 2025
@RalfJung
Member

That PR changed the ABI so SSE vectors are passed by-val rather than by-ptr. I would have expected that to make optimizations easier, not harder. But sadly I know very little about the later parts of the backend, so I am clueless here.

Cc @nikic

@workingjubilee
Member

wild speculation: we were relying on some opt via mem2reg
