x86 avx2 vpor is first done on calculation-heavy operands #131588

Open
@ImpleLee

See the code and the compilation result at https://godbolt.org/z/Kchh341vW. The code ORs several operands together inside a loop (the vpor instructions in the output); some of these operands are cheap to compute, while others are not. Compilation flags: -O3 -std=c++2b -march=skylake.

#include <experimental/simd>
#include <cstdint>
namespace stdx = std::experimental;

template <class T, std::size_t N>
using simd_of = stdx::simd<T, stdx::simd_abi::deduce_t<T, N>>;

using data_t = simd_of<std::uint64_t, 4>;

data_t f(data_t a, data_t b) {
    while (true) {
        data_t result = a;
        result |= (a << 1) & std::uint64_t(0x802008020080200);
        result |= a >> 1;
        result |= a >> 10;
        data_t temp = a << 50;
        result |= data_t([=](auto i) {
            if constexpr (i + 1 >= 4) return 0;
            else return temp[i + 1];
        });
        result &= b;
        if (all_of((result & ~a) == 0)) return a;
        a = result;
    }
}

The assembly of the loop is as follows.

.LBB0_1:
        vmovdqa %ymm4, %ymm3
        vpaddq  %ymm4, %ymm4, %ymm4
        vpand   %ymm1, %ymm4, %ymm4
        vpsrlq  $1, %ymm3, %ymm5
        vpsrlq  $10, %ymm3, %ymm6
        vpor    %ymm6, %ymm5, %ymm5
        vpsllq  $50, %ymm3, %ymm6
        vpermq  $249, %ymm6, %ymm6 # latency 3 on skylake
        vpblendd        $192, %ymm2, %ymm6, %ymm6
        vpor    %ymm6, %ymm5, %ymm5 # ymm6 is heavy to calculate, but or'ed first
        vpor    %ymm3, %ymm5, %ymm5 # ymm3 and ymm4 are cheap to calculate, but or'ed later
        vpor    %ymm4, %ymm5, %ymm4
        vpand   %ymm0, %ymm4, %ymm4
        vptest  %ymm4, %ymm3
        jae     .LBB0_1

The critical path of this loop is vmovdqa -> vpsllq $50 -> vpermq -> vpblendd -> vpor -> vpor -> vpor -> vpand. If ymm6 were or'ed last instead, the other two vpor's would not need to be on the critical path.
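
To put a rough number on the difference, here is a minimal sketch of a dependency-depth model for the two OR orders. It is not a measurement. Only the vpermq latency of 3 comes from the listing above; the 1-cycle latency for every other instruction is an assumption, the register copy is ignored, and the names `op`, `Depths` and `chain_depths` are made up for this illustration.

```cpp
#include <algorithm>

// Assumed latencies in cycles. Only the vpermq figure comes from the listing;
// the 1-cycle value for the other instructions is an assumption.
constexpr int kSimple = 1;  // vpaddq, vpand, vpor, vpsrlq, vpsllq, vpblendd
constexpr int kPermq = 3;   // "latency 3 on skylake"

// Cycle at which a result is ready, given when its inputs are ready.
// Loop-invariant inputs (ymm0, ymm1, ymm2) are treated as ready at cycle 0.
constexpr int op(int latency, int in0, int in1 = 0) {
    return std::max(in0, in1) + latency;
}

struct Depths {
    int emitted;    // OR order as in the listing above
    int reordered;  // same operands, heavy one or'ed last
};

constexpr Depths chain_depths() {
    const int a = 0;  // ymm3: previous result, ready at cycle 0 (copy ignored)

    // vpaddq, then vpand with the mask -> ymm4
    const int shl1 = op(kSimple, op(kSimple, a));
    // vpsrlq $1 and vpsrlq $10, then vpor -> ymm5
    const int shr = op(kSimple, op(kSimple, a), op(kSimple, a));
    // vpsllq $50, vpermq, vpblendd -> ymm6
    const int heavy = op(kSimple, op(kPermq, op(kSimple, a)));

    // Emitted order: ((shr | heavy) | a) | shl1, then vpand with ymm0.
    int e = op(kSimple, shr, heavy);
    e = op(kSimple, e, a);
    e = op(kSimple, e, shl1);
    e = op(kSimple, e);

    // Reordered: ((shr | a) | shl1) | heavy, then vpand with ymm0.
    int r = op(kSimple, shr, a);
    r = op(kSimple, r, shl1);
    r = op(kSimple, r, heavy);
    r = op(kSimple, r);

    return Depths{e, r};
}

// Under this model the reordered chain is strictly shorter.
static_assert(chain_depths().reordered < chain_depths().emitted,
              "or'ing the heavy operand last shortens the chain");
```

With these assumed latencies, `chain_depths()` gives 9 cycles per iteration for the emitted order and 7 for the reordered one, so two of the three vpor's drop off the loop-carried chain.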
