Potential speed improvements for SHA512 via BMI2 instructions #640

I recently looked into the performance of the `sha2` crate, specifically for many consecutive SHA512 calculations on modern x64 processors that do not yet have the brand-new SHA512 instructions mentioned in #634.

As documented in https://github.com/RustCrypto/asm-hashes/issues/83 and https://github.com/RustCrypto/asm-hashes/issues/82, the now-deprecated `asm` feature of `sha2` `0.10.x` is slower than the native Rust implementation that uses AVX2 intrinsics. On closer inspection, this makes sense, since the chosen `asm` code doesn't use AVX or other newer CPU extensions at all.

Other implementations with specially optimized assembly, such as `libgcrypt` with its `sha512-avx2-bmi2-amd64.S`, are roughly 25% faster for SHA512 than the `sha2` crate in quick benchmarks:
* Tested on AMD Zen3 `Ryzen 5950X` under Linux
* `RUSTFLAGS='-C target-cpu=native' cargo +nightly bench -p sha2` has `test sha512_10000 [...] = ` **894 MB/s**
* `libgcrypt` `tests/bench-slope --repetitions 10000` shows **1084 MiB/s**
* The benchmark harnesses may not be fully comparable and report different units (MB/s vs. MiB/s); this is just some quick testing to get ballpark numbers (!)
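
To put the two numbers above on the same scale, here is a small sketch of the unit conversion. It assumes that `cargo bench` reports decimal megabytes (10^6 bytes) and that `bench-slope` reports mebibytes (2^20 bytes), which I have not verified for either harness. The function names are made up for illustration.

```rust
/// Bytes in one mebibyte (MiB) and in one decimal megabyte (MB).
const BYTES_PER_MIB: f64 = 1_048_576.0;
const BYTES_PER_MB: f64 = 1_000_000.0;

/// Convert a throughput given in MiB/s into MB/s.
pub fn mib_to_mb_per_s(mib_per_s: f64) -> f64 {
    mib_per_s * BYTES_PER_MIB / BYTES_PER_MB
}

/// Relative speedup of `fast` over `slow` as a fraction
/// (0.25 means "25% faster"). Both inputs must use the same unit.
pub fn relative_speedup(fast: f64, slow: f64) -> f64 {
    fast / slow - 1.0
}
```

Under those assumptions, `mib_to_mb_per_s(1084.0)` is about 1137 MB/s, and `relative_speedup(mib_to_mb_per_s(1084.0), 894.0)` is about 0.27, so the gap stays in the same ballpark as the rough 25% estimate.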

Another well-known project with this optimization level is the Linux kernel, see [arch/x86/crypto/sha512-avx2-asm.S](https://github.com/torvalds/linux/blob/b46c89c08f4146e7987fc355941a93b12e2c03ef/arch/x86/crypto/sha512-avx2-asm.S). 
 
Based on observations made as part of https://github.com/RustCrypto/asm-hashes/issues/83, a potential explanation is that the current optimized Rust code in `sha2/src/sha512/x86_avx2.rs` uses `AVX2`, but not `BMI2`. For the assembler implementations, the `BMI2` instruction `RORX` made a significant performance difference. The terminology is also a bit fuzzy here: since BMI2 seems to be present on all common processors that have AVX2, it is sometimes described as part of AVX2, but it is technically a separate extension, see [Wikipedia](https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set#BMI2_(Bit_Manipulation_Instruction_Set_2)).
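
For context on why a rotate instruction matters so much here: each SHA512 round evaluates two functions that consist only of 64-bit right rotations and XORs. The sketch below writes them in plain Rust with the rotation amounts from FIPS 180-4. The function names are mine and this is not the code from `sha2`. Whether the compiler lowers `rotate_right` to `RORX` when built with `-C target-feature=+bmi2` or `-C target-cpu=native` is an assumption that would have to be checked in the generated assembly.

```rust
/// SHA-512 "big sigma 0" from FIPS 180-4:
/// ROTR^28(x) XOR ROTR^34(x) XOR ROTR^39(x).
#[inline]
pub fn big_sigma0(x: u64) -> u64 {
    x.rotate_right(28) ^ x.rotate_right(34) ^ x.rotate_right(39)
}

/// SHA-512 "big sigma 1" from FIPS 180-4:
/// ROTR^14(x) XOR ROTR^18(x) XOR ROTR^41(x).
#[inline]
pub fn big_sigma1(x: u64) -> u64 {
    x.rotate_right(14) ^ x.rotate_right(18) ^ x.rotate_right(41)
}
```

`RORX` is attractive for this pattern because it writes the rotated value to a separate destination register and does not touch the flags, so the three rotations of the same input need no extra register copies.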


The `bmi2` target feature has been available for a while, since https://github.com/rust-lang/rust/issues/30462. I'm not an expert on Rust intrinsics, but the `RORX` instruction seems to be missing from the intrinsics that `core::arch::x86_64` implements in the current [core_arch/src/x86_64/bmi2.rs](https://github.com/rust-lang/stdarch/blob/cf10e913390d1f55c7a5a82e84e7f8bfd41d90f7/crates/core_arch/src/x86_64/bmi2.rs)?
If the instruction itself isn't available, that may be a major roadblock to using it in `sha2` for SHA512. I'm not sure of the exact backstory here, but https://github.com/gnzlbg/bitintr/issues/2 seems to hint that `RORX` and similar instructions have been unavailable since 2017, so it doesn't look like a regression.
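
Independent of the intrinsic question, detecting the feature at runtime is possible today. Below is a minimal std-only sketch using the `is_x86_feature_detected!` macro. The function name is hypothetical, and this is not necessarily how `sha2` performs its own feature detection.

```rust
/// Returns true if the running CPU reports both AVX2 and BMI2.
/// Hypothetical helper for illustration only.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn avx2_and_bmi2_available() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
        && std::arch::is_x86_feature_detected!("bmi2")
}

/// Fallback for non-x86 targets, where neither feature exists.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn avx2_and_bmi2_available() -> bool {
    false
}
```

A BMI2-specific code path could be gated on such a check, falling back to the existing AVX2 path otherwise.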
 
To summarize, I suspect that once there is support for this particular BMI2 CPU instruction, it may be possible to squeeze additional SHA512 performance out of existing CPUs.
Notably, this does not rely on the more recent AVX512 instruction set or the new `VSHA512` instructions. It also probably won't be relevant for SHA1/SHA256, where faster mechanisms are commonly available and already used by `sha2` on most modern CPUs.
