mask8x8::from_bitmask falls back to scalar code #264

Open
@lukaslihotzki

I tried this code:

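// Requires a nightly toolchain with #![feature(portable_simd)] at the crate root.
use std::simd::{mask8x8, u8x8};

pub fn func(a: u8, b: u64) -> u8 {
    let c = mask8x8::from_bitmask(a).to_int().cast();
    let d = c | u8x8::from_array(b.to_le_bytes());
    d.horizontal_and()
}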

I expected to see this happen: vectorized mask8x8::from_bitmask, for example like this Rust code:

u8x8::splat(0).lanes_ne(u8x8::splat(a) & u8x8::from_array([1, 2, 4, 8, 16, 32, 64, 128]))

(On x86 with an appropriate target-cpu, using PDEP may be the best approach.)
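For illustration, the PDEP idea could be sketched roughly as follows in stable Rust. This is only a sketch under assumptions, not what the library does: the function names `expand_bits`, `expand_bits_pdep`, and `expand_bits_swar` are hypothetical, the result is returned as a `u64` whose little-endian byte i corresponds to lane i, and a plain integer (SWAR) fallback is included for CPUs without BMI2.

```rust
/// Portable fallback (hypothetical helper): expands each bit of `a` into a
/// full byte, 0x00 or 0xFF, with bit i landing in little-endian byte i.
pub fn expand_bits_swar(a: u8) -> u64 {
    // Replicate `a` into all eight bytes.
    let replicated = (a as u64).wrapping_mul(0x0101_0101_0101_0101);
    // Byte i keeps only bit i of `a`.
    let isolated = replicated & 0x8040_2010_0804_0201;
    // Adding 0x7F sets the top bit of every nonzero byte without carrying
    // into the next byte; shift that top bit down to bit 0.
    let ones = ((isolated + 0x7F7F_7F7F_7F7F_7F7F) & 0x8080_8080_8080_8080) >> 7;
    // Turn 0x01 into 0xFF in every selected byte.
    ones * 0xFF
}

/// BMI2 variant (hypothetical helper): PDEP deposits bit i of `a` at the
/// lowest bit of byte i, and the multiplication widens 0x01 to 0xFF.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn expand_bits_pdep(a: u8) -> u64 {
    use std::arch::x86_64::_pdep_u64;
    _pdep_u64(a as u64, 0x0101_0101_0101_0101) * 0xFF
}

/// Picks the PDEP path when the CPU reports BMI2, otherwise the fallback.
pub fn expand_bits(a: u8) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("bmi2") {
            // SAFETY: BMI2 support was just checked at run time.
            return unsafe { expand_bits_pdep(a) };
        }
    }
    expand_bits_swar(a)
}
```

The expanded value can then be moved into a vector register in one step, for example via `u8x8::from_array(expand_bits(a).to_le_bytes())`.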

Instead, this happened on x86: each bit is extracted individually with movl, shrb, and andb into its own general-purpose register, and is then inserted into the vector with vpinsrb or pinsrw (depending on target-cpu). Only after that are the bits expanded to 0x00 or 0xff using vectorized code. The scalar bit extraction needs more instructions and more run time than a fully vectorized version, and it may increase register pressure in more complex functions.
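To make the two strategies concrete, here is a plain-array model in stable Rust. It is a sketch for checking the expected results only: it uses `[u8; 8]` in place of `u8x8`, and the names `BIT_TABLE`, `expand_per_bit`, `expand_lane_wise`, and `func_model` are hypothetical. It says nothing about the instructions the compiler actually emits.

```rust
/// Bit weights for lanes 0..8, the same constants as in the expected expression.
const BIT_TABLE: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];

/// Models the scalar strategy: shift and mask each bit out on its own,
/// then widen every 0/1 value to 0x00/0xFF.
pub fn expand_per_bit(a: u8) -> [u8; 8] {
    let mut lanes = [0u8; 8];
    for (i, lane) in lanes.iter_mut().enumerate() {
        let bit = (a >> i) & 1;
        // Two's-complement negation maps 1 to 0xFF and 0 to 0x00.
        *lane = bit.wrapping_neg();
    }
    lanes
}

/// Models the lane-wise strategy: broadcast `a`, AND with the bit table,
/// and compare every lane against zero.
pub fn expand_lane_wise(a: u8) -> [u8; 8] {
    let splat = [a; 8];
    std::array::from_fn(|i| if splat[i] & BIT_TABLE[i] != 0 { 0xFF } else { 0x00 })
}

/// Scalar reference for `func`: OR the expanded mask with the bytes of `b`,
/// then AND all eight lanes together.
pub fn func_model(a: u8, b: u64) -> u8 {
    let c = expand_lane_wise(a);
    let bytes = b.to_le_bytes();
    (0..8).fold(0xFF, |acc, i| acc & (c[i] | bytes[i]))
}
```

Both expansion functions return the same lanes for every input, so the difference reported here is purely one of code generation, not of results.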

Meta

rustc --version --verbose:

rustc 1.61.0-nightly (f103b2969 2022-03-12)
binary: rustc
commit-hash: f103b2969b0088953873dc1ac92eb3387c753596
commit-date: 2022-03-12
host: x86_64-unknown-linux-gnu
release: 1.61.0-nightly
LLVM version: 14.0.0

Labels

C-bug (Category: Bug), I-scalarize (Impact: code that should be vectorized, isn't)
