Description
I tried this code:
pub fn func(a: u8, b: u64) -> u8 {
    let c = mask8x8::from_bitmask(a).to_int().cast();
    let d = c | u8x8::from_array(b.to_le_bytes());
    d.horizontal_and()
}
I expected to see this happen: vectorized mask8x8::from_bitmask, for example like this Rust code:
u8x8::splat(0).lanes_ne(u8x8::splat(a) & u8x8::from_array([1, 2, 4, 8, 16, 32, 64, 128]))
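For readers without a nightly toolchain, the lane-wise result this expression is meant to produce can be modeled in stable Rust. This is only a scalar reference sketch, not the proposed implementation: `expand_bitmask` and `func_model` are hypothetical names, and the mapping of bit i to lane i is taken from the constant vector `[1, 2, 4, ..., 128]` in the expression above.

```rust
/// Scalar model of the expected mask expansion: lane `i` becomes 0xFF when
/// bit `i` of `a` is set and 0x00 otherwise. (Hypothetical helper.)
fn expand_bitmask(a: u8) -> [u8; 8] {
    // One distinct bit per lane, mirroring the constant vector in the issue.
    const BITS: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
    let mut lanes = [0u8; 8];
    for i in 0..8 {
        lanes[i] = if a & BITS[i] != 0 { 0xFF } else { 0x00 };
    }
    lanes
}

/// Scalar model of `func` from the report: OR the expanded mask with the
/// little-endian bytes of `b`, then AND all lanes together. (Hypothetical helper.)
fn func_model(a: u8, b: u64) -> u8 {
    let c = expand_bitmask(a);
    let bytes = b.to_le_bytes();
    let mut acc = 0xFFu8;
    for i in 0..8 {
        acc &= c[i] | bytes[i];
    }
    acc
}
```

With this model, `func_model(0x0F, 0xAAAA_AAAA_0000_0000)` forces the low four lanes to 0xFF through the mask and takes the high four lanes from `b`, so the AND reduction yields 0xAA.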
(On x86 with an appropriate target-cpu, using PDEP may be the best approach.)
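A rough sketch of what a PDEP-based expansion could look like on stable Rust follows. It assumes the eight lanes are packed into a `u64` in little-endian order, and it is not what the compiler or the library currently emits. The function names are hypothetical, and the SWAR fallback is an additional technique, not mentioned in the report, included so that the sketch also runs on CPUs without BMI2.

```rust
/// Portable SWAR fallback (hypothetical helper): broadcast `a` to every byte,
/// keep bit `i` in byte `i`, then turn each nonzero byte into 0xFF.
fn expand_swar(a: u8) -> u64 {
    let spread = (a as u64) * 0x0101_0101_0101_0101 & 0x8040_2010_0804_0201;
    // Each byte holds at most one bit, so adding 0x7F sets the byte's top bit
    // exactly when the byte is nonzero, without carrying into the next byte.
    let ones = ((spread + 0x7F7F_7F7F_7F7F_7F7F) & 0x8080_8080_8080_8080) >> 7;
    ones * 0xFF
}

/// PDEP variant (hypothetical helper): deposit bit `i` of `a` at bit `8 * i`,
/// then widen each 0/1 byte to 0x00/0xFF.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn expand_pdep(a: u8) -> u64 {
    use std::arch::x86_64::_pdep_u64;
    _pdep_u64(a as u64, 0x0101_0101_0101_0101) * 0xFF
}

/// Picks PDEP when the running CPU supports BMI2, else the SWAR fallback.
fn expand(a: u8) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("bmi2") {
            // Safe to call: BMI2 support was just checked at runtime.
            return unsafe { expand_pdep(a) };
        }
    }
    expand_swar(a)
}
```

Calling `expand(a).to_le_bytes()` gives the eight lanes as a byte array, which could then be loaded into a `u8x8`. Whether this beats the vectorized compare depends on the target; PDEP is known to be slow on some older AMD microarchitectures.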
Instead, this happened on x86: each bit is extracted individually with movl, shrb, and andb into its own general-purpose register and then inserted into the vector with vpinsrb or pinsrw (depending on target-cpu). After that, the bits are expanded to 0x00 or 0xff using vectorized code. Scalar bit extraction needs more instructions and more runtime than vectorized code. It may also increase register pressure in more complex functions.
Meta
rustc --version --verbose:
rustc 1.61.0-nightly (f103b2969 2022-03-12)
binary: rustc
commit-hash: f103b2969b0088953873dc1ac92eb3387c753596
commit-date: 2022-03-12
host: x86_64-unknown-linux-gnu
release: 1.61.0-nightly
LLVM version: 14.0.0