Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x86 avx2 vpor is first done on calculation-heavy operands #131588

Open
ImpleLee opened this issue Mar 17, 2025 · 3 comments
Open

x86 avx2 vpor is first done on calculation-heavy operands #131588

ImpleLee opened this issue Mar 17, 2025 · 3 comments

Comments

@ImpleLee
Copy link

See the code and the compilation result at https://godbolt.org/z/Kchh341vW . This code calculates vpor of several operands in the loop, where some operands are relatively cheap to calculate, while some are not. Compilation flags: -O3 -std=c++2b -march=skylake.

#include <experimental/simd>
#include <cstdint>
namespace stdx = std::experimental;

template <class T, std::size_t N>
using simd_of = stdx::simd<T, stdx::simd_abi::deduce_t<T, N>>;

using data_t = simd_of<std::uint64_t, 4>;

data_t f(data_t a, data_t b) {
    while (true) {
        data_t result = a;
        result |= (a << 1) & std::uint64_t(0x802008020080200);
        result |= a >> 1;
        result |= a >> 10;
        data_t temp = a << 50;
        result |= data_t([=](auto i) {
            if constexpr (i + 1 >= 4) return 0;
            else return temp[i + 1];
        });
        result &= b;
        if (all_of((result & ~a) == 0)) return a;
        a = result;
    }
}

The assembly of the loop is as follows.

.LBB0_1:
        vmovdqa %ymm4, %ymm3
        vpaddq  %ymm4, %ymm4, %ymm4
        vpand   %ymm1, %ymm4, %ymm4
        vpsrlq  $1, %ymm3, %ymm5
        vpsrlq  $10, %ymm3, %ymm6
        vpor    %ymm6, %ymm5, %ymm5
        vpsllq  $50, %ymm3, %ymm6
        vpermq  $249, %ymm6, %ymm6 # latency 3 on skylake
        vpblendd        $192, %ymm2, %ymm6, %ymm6
        vpor    %ymm6, %ymm5, %ymm5 # ymm6 is heavy to calculate, but or'ed first
        vpor    %ymm3, %ymm5, %ymm5 # ymm3 and ymm4 are cheap to calculate, but or'ed later
        vpor    %ymm4, %ymm5, %ymm4
        vpand   %ymm0, %ymm4, %ymm4
        vptest  %ymm4, %ymm3
        jae     .LBB0_1

The critical path of this loop is vpmov-> vpsll $50 -> vperm -> vpblend -> vpor -> vpor -> vpor -> vpand, but if ymm6 is vpor'ed later, the other two vpor's does not need to be on the critical path.

@llvmbot
Copy link
Member

llvmbot commented Mar 17, 2025

@llvm/issue-subscribers-backend-x86

Author: Imple Lee (ImpleLee)

See the code and the compilation result at https://godbolt.org/z/Kchh341vW . This code calculates vpor of several operands in the loop, where some operands are relatively cheap to calculate, while some are not. Compilation flags: `-O3 -std=c++2b -march=skylake`.
#include &lt;experimental/simd&gt;
#include &lt;cstdint&gt;
namespace stdx = std::experimental;

template &lt;class T, std::size_t N&gt;
using simd_of = stdx::simd&lt;T, stdx::simd_abi::deduce_t&lt;T, N&gt;&gt;;

using data_t = simd_of&lt;std::uint64_t, 4&gt;;

data_t f(data_t a, data_t b) {
    while (true) {
        data_t result = a;
        result |= (a &lt;&lt; 1) &amp; std::uint64_t(0x802008020080200);
        result |= a &gt;&gt; 1;
        result |= a &gt;&gt; 10;
        data_t temp = a &lt;&lt; 50;
        result |= data_t([=](auto i) {
            if constexpr (i + 1 &gt;= 4) return 0;
            else return temp[i + 1];
        });
        result &amp;= b;
        if (all_of((result &amp; ~a) == 0)) return a;
        a = result;
    }
}

The assembly of the loop is as follows.

.LBB0_1:
        vmovdqa %ymm4, %ymm3
        vpaddq  %ymm4, %ymm4, %ymm4
        vpand   %ymm1, %ymm4, %ymm4
        vpsrlq  $1, %ymm3, %ymm5
        vpsrlq  $10, %ymm3, %ymm6
        vpor    %ymm6, %ymm5, %ymm5
        vpsllq  $50, %ymm3, %ymm6
        vpermq  $249, %ymm6, %ymm6 # latency 3 on skylake
        vpblendd        $192, %ymm2, %ymm6, %ymm6
        vpor    %ymm6, %ymm5, %ymm5 # ymm6 is heavy to calculate, but or'ed first
        vpor    %ymm3, %ymm5, %ymm5 # ymm3 and ymm4 are cheap to calculate, but or'ed later
        vpor    %ymm4, %ymm5, %ymm4
        vpand   %ymm0, %ymm4, %ymm4
        vptest  %ymm4, %ymm3
        jae     .LBB0_1

The critical path of this loop is vpmov-&gt; vpsll $50 -&gt; vperm -&gt; vpblend -&gt; vpor -&gt; vpor -&gt; vpor -&gt; vpand, but if ymm6 is vpor'ed later, the other two vpor's does not need to be on the critical path.

@RKSimon
Copy link
Collaborator

RKSimon commented Mar 17, 2025

I'm not sure what facilities we have for cost driven reassociation like this - we could either try to do it in the middle end and possibly end up fighting canonicalization or in the backend and have to maybe post-ra scheduling (or a similar pass) to be able to recognize re-associatable machine instructions.

@ImpleLee
Copy link
Author

ImpleLee commented Mar 17, 2025

Maybe in the backend but before register allocation? (I am not familiar with llvm.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants