-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AMDGPU] (x or y) xor -1 -> x nor y #130264
base: main
Are you sure you want to change the base?
Conversation
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums. |
@llvm/pr-subscribers-backend-amdgpu Author: Ana Mihajlovic (mihajlovicana) ChangesAdded pattern so s_nor is selected for ((i1 x or i1 y) xor -1) instead of s_or and s_xor Full diff: https://github.com/llvm/llvm-project/pull/130264.diff 3 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SOPInstructions.td b/llvm/lib/Target/AMDGPU/SOPInstructions.td
index 5e62ceac281b8..eb994d9e5687a 100644
--- a/llvm/lib/Target/AMDGPU/SOPInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SOPInstructions.td
@@ -1919,6 +1919,20 @@ def : ScalarNot2Pat<S_ORN2_B32, or, v2i16>;
def : ScalarNot2Pat<S_ORN2_B64, or, v4i16>;
def : ScalarNot2Pat<S_ORN2_B64, or, v2i32>;
+let WaveSizePredicate = isWave32 in {
+def : GCNPat<
+ (i1 (not (or_oneuse i1:$src0, i1:$src1))),
+ (S_NOR_B32 i1:$src0, i1:$src1)
+>;
+}
+
+let WaveSizePredicate = isWave64 in {
+def : GCNPat<
+ (i1 (not (or_oneuse i1:$src0, i1:$src1))),
+ (S_NOR_B64 i1:$src0, i1:$src1)
+>;
+}
+
//===----------------------------------------------------------------------===//
// Target-specific instruction encodings.
//===----------------------------------------------------------------------===//
diff --git a/llvm/test/CodeGen/AMDGPU/isel-nor-32.ll b/llvm/test/CodeGen/AMDGPU/isel-nor-32.ll
new file mode 100644
index 0000000000000..983a335568cc1
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/isel-nor-32.ll
@@ -0,0 +1,68 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
+; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 -mattr="+wavefrontsize32,-wavefrontsize64" -o - < %s | FileCheck %s
+; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 -mattr="+wavefrontsize32,-wavefrontsize64" -o - < %s | FileCheck %s
+
+define amdgpu_ps void @divergent_i1_phi_if_else(ptr addrspace(1) %out, i32 %tid, i32 %a, i32 %b, i32 %c, i32 %d) {
+; CHECK-LABEL: divergent_i1_phi_if_else:
+; CHECK: ; %bb.0: ; %entry
+; CHECK-NEXT: v_cmp_le_u32_e64 s0, v3, v4
+; CHECK-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT: s_mov_b32 s2, s0
+; CHECK-NEXT: s_and_saveexec_b32 s1, s0
+; CHECK-NEXT: ; %bb.1: ; %C
+; CHECK-NEXT: v_cmp_eq_u32_e32 vcc_lo, v3, v5
+; CHECK-NEXT: s_and_not1_b32 s2, s0, exec_lo
+; CHECK-NEXT: s_and_b32 s3, vcc_lo, exec_lo
+; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT: s_or_b32 s2, s2, s3
+; CHECK-NEXT: ; %bb.2: ; %MergeCF
+; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s1
+; CHECK-NEXT: s_nor_b32 s1, s0, s2
+; CHECK-NEXT: ; implicit-def: $sgpr0
+; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; CHECK-NEXT: s_and_saveexec_b32 s2, s1
+; CHECK-NEXT: s_xor_b32 s1, exec_lo, s2
+; CHECK-NEXT: ; %bb.3: ; %B
+; CHECK-NEXT: v_cmp_gt_u32_e64 s0, 2, v2
+; CHECK-NEXT: ; implicit-def: $vgpr2
+; CHECK-NEXT: ; %bb.4: ; %Flow
+; CHECK-NEXT: s_and_not1_saveexec_b32 s1, s1
+; CHECK-NEXT: ; %bb.5: ; %A
+; CHECK-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v2
+; CHECK-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; CHECK-NEXT: s_and_not1_b32 s0, s0, exec_lo
+; CHECK-NEXT: s_and_b32 s2, vcc_lo, exec_lo
+; CHECK-NEXT: s_or_b32 s0, s0, s2
+; CHECK-NEXT: ; %bb.6: ; %exit
+; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s1
+; CHECK-NEXT: v_cndmask_b32_e64 v2, 2, 1, s0
+; CHECK-NEXT: global_store_b32 v[0:1], v2, off
+; CHECK-NEXT: s_endpgm
+entry:
+ %x = icmp ule i32 %a, %b
+ br i1 %x, label %C, label %MergeCF
+
+C:
+ %y = icmp eq i32 %a, %c
+ br label %MergeCF
+
+MergeCF:
+ %z = phi i1 [ %x, %entry ], [ %y, %C ]
+ %w = icmp ule i32 %a, %b
+ %cmp = or i1 %w, %z
+ br i1 %cmp, label %A, label %B
+
+A:
+ %val_A = icmp uge i32 %tid, 1
+ br label %exit
+
+B:
+ %val_B = icmp ult i32 %tid, 2
+ br label %exit
+
+exit:
+ %phi = phi i1 [ %val_A, %A ], [ %val_B, %B ]
+ %sel = select i1 %phi, i32 1, i32 2
+ store i32 %sel, ptr addrspace(1) %out
+ ret void
+}
diff --git a/llvm/test/CodeGen/AMDGPU/isel-nor-64.ll b/llvm/test/CodeGen/AMDGPU/isel-nor-64.ll
new file mode 100644
index 0000000000000..88e0a05556b78
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/isel-nor-64.ll
@@ -0,0 +1,68 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
+; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 -mattr="-wavefrontsize32,+wavefrontsize64" -o - < %s | FileCheck %s
+; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 -mattr="-wavefrontsize32,+wavefrontsize64" -o - < %s | FileCheck %s
+
+define amdgpu_ps void @divergent_i1_phi_if_else(ptr addrspace(1) %out, i32 %tid, i32 %a, i32 %b, i32 %c, i32 %d) {
+; CHECK-LABEL: divergent_i1_phi_if_else:
+; CHECK: ; %bb.0: ; %entry
+; CHECK-NEXT: v_cmp_le_u32_e64 s[0:1], v3, v4
+; CHECK-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT: s_mov_b64 s[4:5], s[0:1]
+; CHECK-NEXT: s_and_saveexec_b64 s[2:3], s[0:1]
+; CHECK-NEXT: ; %bb.1: ; %C
+; CHECK-NEXT: v_cmp_eq_u32_e32 vcc, v3, v5
+; CHECK-NEXT: s_and_not1_b64 s[4:5], s[0:1], exec
+; CHECK-NEXT: s_and_b64 s[6:7], vcc, exec
+; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT: s_or_b64 s[4:5], s[4:5], s[6:7]
+; CHECK-NEXT: ; %bb.2: ; %MergeCF
+; CHECK-NEXT: s_or_b64 exec, exec, s[2:3]
+; CHECK-NEXT: s_nor_b64 s[2:3], s[0:1], s[4:5]
+; CHECK-NEXT: ; implicit-def: $sgpr0_sgpr1
+; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; CHECK-NEXT: s_and_saveexec_b64 s[4:5], s[2:3]
+; CHECK-NEXT: s_xor_b64 s[2:3], exec, s[4:5]
+; CHECK-NEXT: ; %bb.3: ; %B
+; CHECK-NEXT: v_cmp_gt_u32_e64 s[0:1], 2, v2
+; CHECK-NEXT: ; implicit-def: $vgpr2
+; CHECK-NEXT: ; %bb.4: ; %Flow
+; CHECK-NEXT: s_and_not1_saveexec_b64 s[2:3], s[2:3]
+; CHECK-NEXT: ; %bb.5: ; %A
+; CHECK-NEXT: v_cmp_ne_u32_e32 vcc, 0, v2
+; CHECK-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; CHECK-NEXT: s_and_not1_b64 s[0:1], s[0:1], exec
+; CHECK-NEXT: s_and_b64 s[4:5], vcc, exec
+; CHECK-NEXT: s_or_b64 s[0:1], s[0:1], s[4:5]
+; CHECK-NEXT: ; %bb.6: ; %exit
+; CHECK-NEXT: s_or_b64 exec, exec, s[2:3]
+; CHECK-NEXT: v_cndmask_b32_e64 v2, 2, 1, s[0:1]
+; CHECK-NEXT: global_store_b32 v[0:1], v2, off
+; CHECK-NEXT: s_endpgm
+entry:
+ %x = icmp ule i32 %a, %b
+ br i1 %x, label %C, label %MergeCF
+
+C:
+ %y = icmp eq i32 %a, %c
+ br label %MergeCF
+
+MergeCF:
+ %z = phi i1 [ %x, %entry ], [ %y, %C ]
+ %w = icmp ule i32 %a, %b
+ %cmp = or i1 %w, %z
+ br i1 %cmp, label %A, label %B
+
+A:
+ %val_A = icmp uge i32 %tid, 1
+ br label %exit
+
+B:
+ %val_B = icmp ult i32 %tid, 2
+ br label %exit
+
+exit:
+ %phi = phi i1 [ %val_A, %A ], [ %val_B, %B ]
+ %sel = select i1 %phi, i32 1, i32 2
+ store i32 %sel, ptr addrspace(1) %out
+ ret void
+}
|
; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 -mattr="+wavefrontsize32,-wavefrontsize64" -o - < %s | FileCheck %s | ||
; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 -mattr="+wavefrontsize32,-wavefrontsize64" -o - < %s | FileCheck %s | ||
|
||
define amdgpu_ps void @divergent_i1_phi_if_else(ptr addrspace(1) %out, i32 %tid, i32 %a, i32 %b, i32 %c, i32 %d) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't need control flow in this test. Should cover all the SGPR and VGPR input cases, and need negative tests for the multi-use case
; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 -mattr="+wavefrontsize32,-wavefrontsize64" -o - < %s | FileCheck %s | ||
; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 -mattr="+wavefrontsize32,-wavefrontsize64" -o - < %s | FileCheck %s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need the -wavefrontsize64 cases. Also this isn't testing the wave64 case. Plus add -global-isel run lines, I would hope this pattern works without issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
works for global-isel
@@ -0,0 +1,68 @@ | |||
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3 | |||
; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 -mattr="-wavefrontsize32,+wavefrontsize64" -o - < %s | FileCheck %s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wave32 and wave64 can be tested in the same file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remodel the testing style after something like test/CodeGen/AMDGPU/GlobalISel/xnor.ll?
Notably:
wave32 and wave64 should be the same file. There is no IR difference between them.
The tested type combinations should be in the name. The wavesize is only significant to the output, not the input.
Explicit tests for the negative one use case, with a name to reflect that
Avoid memory instructions, they are not relevant to the pattern and add a lot of extra noise to the output
ret i32 %z | ||
} | ||
|
||
define amdgpu_ps i32 @negative_test_w32(i32 %x, i32 %y) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Negative why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
((x or y) xor -1 -> x nor y) not needed in this case
The problem I have here is that if I am using ballot I have to have separate tests. I tried using cmp instead of it but the problem is that SelectionDAG optimizes IR and I get this :
So I can't really test if this patch is working with this. |
; GFX12W32-NEXT: v_cndmask_b32_e64 v0, 0, 1, s0 | ||
; GFX12W32-NEXT: s_delay_alu instid0(VALU_DEP_1) | ||
; GFX12W32-NEXT: v_cmp_ne_u32_e64 s0, 0, v0 | ||
; GFX12W32-NEXT: s_wait_alu 0xf1ff |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the check lines are redundant. Could consider to use --check-prefixes=CHECK,GFX...
and then hopefully the update script could tell them apart, though most of the time it doesn't work well. Worth a shot.
; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 -mattr="+wavefrontsize32,-wavefrontsize64" -o - < %s | FileCheck -check-prefix=GFX12W32 %s | ||
; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 -mattr="+wavefrontsize32,-wavefrontsize64" -o - < %s | FileCheck -check-prefixes=GFX11W32 %s | ||
|
||
define amdgpu_ps i32 @test_w32(i32 %x, i32 %y) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
%x and %y should be inreg, inverse.ballot.i32 arguments are uniform i32/i64
Title should clarify that this is pattern for divergent i1/lane mask specifically, pattern for uniform i32/i64 already exists. Also constant should not be |
Ballot is not core to the pattern here, the ballot tests should be separate. But also the wave64 ballot should work on wave32 |
Looking at this with Ana offline, it seems that inverse.ballot.i64 doesn't work on wave32 yet. That should be fixed first. As for the rest, I think the use of ballot/inverse.ballot here is very reasonable. Our calling conventions generally pass i1 as 32-bit values, and so without ballot/inverse.ballot the boolean logic tends to get folded into the v_cndmask/v_cmp that is generated as part of the calling convention lowering, which makes it difficult to test this particular pattern. Ballot/inverse.ballot is the most natural way to express what we want to do here. It certainly makes for a cleaner test than the test with the branches (which I believe was a distillation of the real world code where we found this isel weakness). |
We're working on fixing that to pass as a lane mask for callable functions (this won't change for entries) |
That's good. How quickly will that be ready? |
#132770 <- since this was merged I am now able to put ballot tests for both wave32 and wave64 in the same file |
Added pattern so s_nor is selected for ((i1 x or i1 y) xor -1) instead of s_or and s_xor . This patch is for i1 divergent. The ballot in the test is added for the retrieval of lanemask. The control flow is needed because the combiner can't pass through phi instructions.