-
Notifications
You must be signed in to change notification settings - Fork 13.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
x86 avx2 vpor is first done on calculation-heavy operands #131588
Comments
@llvm/issue-subscribers-backend-x86 Author: Imple Lee (ImpleLee)
See the code and the compilation result at https://godbolt.org/z/Kchh341vW . This code calculates vpor of several operands in the loop, where some operands are relatively cheap to calculate, while some are not. Compilation flags: `-O3 -std=c++2b -march=skylake`.
#include <experimental/simd>
#include <cstdint>
namespace stdx = std::experimental;
template <class T, std::size_t N>
using simd_of = stdx::simd<T, stdx::simd_abi::deduce_t<T, N>>;
using data_t = simd_of<std::uint64_t, 4>;
data_t f(data_t a, data_t b) {
while (true) {
data_t result = a;
result |= (a << 1) & std::uint64_t(0x802008020080200);
result |= a >> 1;
result |= a >> 10;
data_t temp = a << 50;
result |= data_t([=](auto i) {
if constexpr (i + 1 >= 4) return 0;
else return temp[i + 1];
});
result &= b;
if (all_of((result & ~a) == 0)) return a;
a = result;
}
} The assembly of the loop is as follows. .LBB0_1:
vmovdqa %ymm4, %ymm3
vpaddq %ymm4, %ymm4, %ymm4
vpand %ymm1, %ymm4, %ymm4
vpsrlq $1, %ymm3, %ymm5
vpsrlq $10, %ymm3, %ymm6
vpor %ymm6, %ymm5, %ymm5
vpsllq $50, %ymm3, %ymm6
vpermq $249, %ymm6, %ymm6 # latency 3 on skylake
vpblendd $192, %ymm2, %ymm6, %ymm6
vpor %ymm6, %ymm5, %ymm5 # ymm6 is heavy to calculate, but or'ed first
vpor %ymm3, %ymm5, %ymm5 # ymm3 and ymm4 are cheap to calculate, but or'ed later
vpor %ymm4, %ymm5, %ymm4
vpand %ymm0, %ymm4, %ymm4
vptest %ymm4, %ymm3
jae .LBB0_1 The critical path of this loop is |
I'm not sure what facilities we have for cost driven reassociation like this - we could either try to do it in the middle end and possibly end up fighting canonicalization or in the backend and have to maybe post-ra scheduling (or a similar pass) to be able to recognize re-associatable machine instructions. |
Maybe in the backend but before register allocation? (I am not familiar with llvm.) |
See the code and the compilation result at https://godbolt.org/z/Kchh341vW . This code calculates vpor of several operands in the loop, where some operands are relatively cheap to calculate, while some are not. Compilation flags:
-O3 -std=c++2b -march=skylake
.The assembly of the loop is as follows.
The critical path of this loop is
vpmov-> vpsll $50 -> vperm -> vpblend -> vpor -> vpor -> vpor -> vpand
, but if ymm6 is vpor'ed later, the other two vpor's does not need to be on the critical path.The text was updated successfully, but these errors were encountered: