Description
I recently looked into the performance of the `sha2` crate, specifically for performing many consecutive SHA512 calculations on modern x64 processors which do not yet have the brand-new SHA512 instructions mentioned in #634.
As documented in RustCrypto/asm-hashes#83 and RustCrypto/asm-hashes#82, the now-deprecated `asm` feature of `sha2` `0.10.x` is slower than the native AVX2-enabled Rust code with intrinsics. Upon closer inspection, this makes sense since the chosen `asm` code doesn't use AVX or other newer CPU technologies at all.
By comparison, other implementations such as `libgcrypt`'s, which have specially optimized `asm` code like sha512-avx2-bmi2-amd64.S, are roughly 25% faster for SHA512 than the `sha2` crate in quick benchmarks.
- Tested on AMD Zen3 `Ryzen 5950X` under Linux
- `RUSTFLAGS='-C target-cpu=native' cargo +nightly bench -p sha2` has `test sha512_10000 [...] = 894 MB/s`
- `libgcrypt` `tests/bench-slope --repetitions 10000` shows 1084 MiB/s
- The benchmark harnesses may not be fully comparable and have different units; this is just some quick testing to get the relevant ballpark numbers (!)
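
Since the two numbers above use different units, here is a minimal sketch of how they can be put on the same scale. It assumes that `MB/s` in the `cargo bench` output means 10^6 bytes per second and that `MiB/s` in the `bench-slope` output means 2^20 bytes per second; the helper names are made up for illustration. Under those assumptions the gap works out to roughly 27%, which is in the same ballpark as the estimate above.

```rust
/// Convert a throughput from MiB/s (2^20 bytes) to MB/s (10^6 bytes).
/// Hypothetical helper, not part of any crate.
fn mib_to_mb_per_s(mib_per_s: f64) -> f64 {
    mib_per_s * 1_048_576.0 / 1_000_000.0
}

/// Relative speedup of `fast` over `slow`, in percent.
fn speedup_percent(fast_mb_s: f64, slow_mb_s: f64) -> f64 {
    (fast_mb_s / slow_mb_s - 1.0) * 100.0
}

fn main() {
    // Figures taken from the quick benchmarks quoted above.
    let sha2_mb_s = 894.0;
    let libgcrypt_mb_s = mib_to_mb_per_s(1084.0);

    println!("libgcrypt: {:.0} MB/s", libgcrypt_mb_s);
    println!(
        "libgcrypt vs sha2: {:.0}% faster",
        speedup_percent(libgcrypt_mb_s, sha2_mb_s)
    );
}
```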
Another well-known project with this optimization level is the Linux kernel, see arch/x86/crypto/sha512-avx2-asm.S.
Based on observations made as part of RustCrypto/asm-hashes#83, a potential explanation for this is that the current native optimized Rust code in `sha2/src/sha512/x86_avx2.rs` uses `AVX2`, but not `BMI2`. For the assembler implementations, the `BMI2` instruction `RORX` made a significant performance difference. Also, the terminology is a bit fuzzy here. Since BMI2 seems to be present on all common processors that have AVX2, it's sometimes mentioned as belonging to AVX2, but it is technically separate, see Wikipedia.
The `bmi2` target feature has been around for a while, since rust-lang/rust#30462. I'm not an expert on Rust intrinsics, but the `RORX` instruction seems to be missing from the instructions currently implemented in core_arch/src/x86_64/bmi2.rs for `core::arch::x86_64`?
If the instruction itself isn't available, that may be a major roadblock to using it in `sha2` for SHA512. I'm not sure of the exact backstory here, but gnzlbg/bitintr#2 seems to hint that `RORX` and other similar instructions have been unavailable since 2017, so it doesn't look like a regression.
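
For experimenting with this, here is a minimal sketch of one of the SHA-512 rotate-heavy functions written with the portable `u64::rotate_right`, plus a variant compiled with the `bmi2` target feature enabled and selected at runtime. The assumption, which I have not verified, is that the compiler backend may pick `RORX` for such constant rotates when BMI2 is enabled; this would need to be checked in the generated assembly. The function names are made up for illustration and are not taken from `sha2`.

```rust
/// SHA-512 "big sigma 1" as defined in FIPS 180-4:
/// ROTR^14(x) ^ ROTR^18(x) ^ ROTR^41(x).
#[inline]
fn big_sigma1(x: u64) -> u64 {
    x.rotate_right(14) ^ x.rotate_right(18) ^ x.rotate_right(41)
}

/// Same computation, but compiled with BMI2 enabled so that the backend
/// is allowed to use BMI2 instructions (whether it emits RORX is an
/// assumption to verify by inspecting the assembly).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn big_sigma1_bmi2(x: u64) -> u64 {
    big_sigma1(x)
}

/// Hypothetical dispatcher: use the BMI2 build if the CPU supports it,
/// otherwise fall back to the portable version.
fn big_sigma1_dispatch(x: u64) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("bmi2") {
            // SAFETY: BMI2 support was just verified at runtime.
            return unsafe { big_sigma1_bmi2(x) };
        }
    }
    big_sigma1(x)
}

fn main() {
    let x = 0x0123_4567_89ab_cdef_u64;
    // Both paths must agree; only the generated instructions may differ.
    assert_eq!(big_sigma1_dispatch(x), big_sigma1(x));
    println!("{:#018x}", big_sigma1_dispatch(x));
}
```

Building with `--emit asm` (or using a disassembler) on the `big_sigma1_bmi2` symbol would show whether `rorx` actually appears.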
To summarize, I suspect that once there is support for this particular BMI2 CPU instruction, it may be possible to squeeze additional SHA512 performance out of existing CPUs.
Notably, this does not rely on the more recent AVX512 instruction set or the `VSHA512` instructions. It also probably won't be relevant for SHA1/SHA256, where faster mechanisms are commonly available and in use by `sha2` on most modern CPUs.