Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AMDGPU] (x or y) xor -1 -> x nor y #130264

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

mihajlovicana
Copy link
Contributor

@mihajlovicana mihajlovicana commented Mar 7, 2025

Added pattern so s_nor is selected for ((i1 x or i1 y) xor -1) instead of s_or and s_xor . This patch is for i1 divergent. The ballot in the test is added for the retrieval of lanemask. The control flow is needed because the combiner can't pass through phi instructions.

Copy link

github-actions bot commented Mar 7, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Copy link
Member

llvmbot commented Mar 7, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Ana Mihajlovic (mihajlovicana)

Changes

Added pattern so s_nor is selected for ((i1 x or i1 y) xor -1) instead of s_or and s_xor


Full diff: https://github.com/llvm/llvm-project/pull/130264.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SOPInstructions.td (+14)
  • (added) llvm/test/CodeGen/AMDGPU/isel-nor-32.ll (+68)
  • (added) llvm/test/CodeGen/AMDGPU/isel-nor-64.ll (+68)
diff --git a/llvm/lib/Target/AMDGPU/SOPInstructions.td b/llvm/lib/Target/AMDGPU/SOPInstructions.td
index 5e62ceac281b8..eb994d9e5687a 100644
--- a/llvm/lib/Target/AMDGPU/SOPInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SOPInstructions.td
@@ -1919,6 +1919,20 @@ def : ScalarNot2Pat<S_ORN2_B32, or, v2i16>;
 def : ScalarNot2Pat<S_ORN2_B64, or, v4i16>;
 def : ScalarNot2Pat<S_ORN2_B64, or, v2i32>;
 
+let WaveSizePredicate = isWave32 in {
+def : GCNPat<
+  (i1 (not (or_oneuse i1:$src0, i1:$src1))),
+  (S_NOR_B32 i1:$src0, i1:$src1)
+>;
+}
+
+let WaveSizePredicate = isWave64 in {
+def : GCNPat<
+  (i1 (not (or_oneuse i1:$src0, i1:$src1))),
+  (S_NOR_B64 i1:$src0, i1:$src1)
+>;
+}
+
 //===----------------------------------------------------------------------===//
 // Target-specific instruction encodings.
 //===----------------------------------------------------------------------===//
diff --git a/llvm/test/CodeGen/AMDGPU/isel-nor-32.ll b/llvm/test/CodeGen/AMDGPU/isel-nor-32.ll
new file mode 100644
index 0000000000000..983a335568cc1
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/isel-nor-32.ll
@@ -0,0 +1,68 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
+; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 -mattr="+wavefrontsize32,-wavefrontsize64" -o - < %s | FileCheck %s
+; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 -mattr="+wavefrontsize32,-wavefrontsize64" -o - < %s | FileCheck %s
+
+define amdgpu_ps void @divergent_i1_phi_if_else(ptr addrspace(1) %out, i32  %tid, i32  %a, i32  %b, i32  %c, i32  %d) {
+; CHECK-LABEL: divergent_i1_phi_if_else:
+; CHECK:       ; %bb.0: ; %entry
+; CHECK-NEXT:    v_cmp_le_u32_e64 s0, v3, v4
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT:    s_mov_b32 s2, s0
+; CHECK-NEXT:    s_and_saveexec_b32 s1, s0
+; CHECK-NEXT:  ; %bb.1: ; %C
+; CHECK-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v3, v5
+; CHECK-NEXT:    s_and_not1_b32 s2, s0, exec_lo
+; CHECK-NEXT:    s_and_b32 s3, vcc_lo, exec_lo
+; CHECK-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT:    s_or_b32 s2, s2, s3
+; CHECK-NEXT:  ; %bb.2: ; %MergeCF
+; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s1
+; CHECK-NEXT:    s_nor_b32 s1, s0, s2
+; CHECK-NEXT:    ; implicit-def: $sgpr0
+; CHECK-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; CHECK-NEXT:    s_and_saveexec_b32 s2, s1
+; CHECK-NEXT:    s_xor_b32 s1, exec_lo, s2
+; CHECK-NEXT:  ; %bb.3: ; %B
+; CHECK-NEXT:    v_cmp_gt_u32_e64 s0, 2, v2
+; CHECK-NEXT:    ; implicit-def: $vgpr2
+; CHECK-NEXT:  ; %bb.4: ; %Flow
+; CHECK-NEXT:    s_and_not1_saveexec_b32 s1, s1
+; CHECK-NEXT:  ; %bb.5: ; %A
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v2
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; CHECK-NEXT:    s_and_not1_b32 s0, s0, exec_lo
+; CHECK-NEXT:    s_and_b32 s2, vcc_lo, exec_lo
+; CHECK-NEXT:    s_or_b32 s0, s0, s2
+; CHECK-NEXT:  ; %bb.6: ; %exit
+; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s1
+; CHECK-NEXT:    v_cndmask_b32_e64 v2, 2, 1, s0
+; CHECK-NEXT:    global_store_b32 v[0:1], v2, off
+; CHECK-NEXT:    s_endpgm
+entry:
+  %x = icmp ule i32 %a, %b
+  br i1 %x, label %C, label %MergeCF
+
+C:
+  %y = icmp eq i32 %a, %c
+  br label %MergeCF
+
+MergeCF:
+  %z = phi i1 [ %x, %entry ], [ %y, %C ]
+  %w = icmp ule i32 %a, %b
+  %cmp = or i1 %w, %z
+  br i1 %cmp, label %A, label %B
+
+A:
+  %val_A = icmp uge i32 %tid, 1
+  br label %exit
+
+B:
+  %val_B = icmp ult i32 %tid, 2
+  br label %exit
+
+exit:
+  %phi = phi i1 [ %val_A, %A ], [ %val_B, %B ]
+  %sel = select i1 %phi, i32 1, i32 2
+  store i32 %sel, ptr addrspace(1) %out
+  ret void
+}
diff --git a/llvm/test/CodeGen/AMDGPU/isel-nor-64.ll b/llvm/test/CodeGen/AMDGPU/isel-nor-64.ll
new file mode 100644
index 0000000000000..88e0a05556b78
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/isel-nor-64.ll
@@ -0,0 +1,68 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
+; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 -mattr="-wavefrontsize32,+wavefrontsize64" -o - < %s | FileCheck %s
+; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 -mattr="-wavefrontsize32,+wavefrontsize64" -o - < %s | FileCheck %s
+
+define amdgpu_ps void @divergent_i1_phi_if_else(ptr addrspace(1) %out, i32  %tid, i32  %a, i32  %b, i32  %c, i32  %d) {
+; CHECK-LABEL: divergent_i1_phi_if_else:
+; CHECK:       ; %bb.0: ; %entry
+; CHECK-NEXT:    v_cmp_le_u32_e64 s[0:1], v3, v4
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT:    s_mov_b64 s[4:5], s[0:1]
+; CHECK-NEXT:    s_and_saveexec_b64 s[2:3], s[0:1]
+; CHECK-NEXT:  ; %bb.1: ; %C
+; CHECK-NEXT:    v_cmp_eq_u32_e32 vcc, v3, v5
+; CHECK-NEXT:    s_and_not1_b64 s[4:5], s[0:1], exec
+; CHECK-NEXT:    s_and_b64 s[6:7], vcc, exec
+; CHECK-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT:    s_or_b64 s[4:5], s[4:5], s[6:7]
+; CHECK-NEXT:  ; %bb.2: ; %MergeCF
+; CHECK-NEXT:    s_or_b64 exec, exec, s[2:3]
+; CHECK-NEXT:    s_nor_b64 s[2:3], s[0:1], s[4:5]
+; CHECK-NEXT:    ; implicit-def: $sgpr0_sgpr1
+; CHECK-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; CHECK-NEXT:    s_and_saveexec_b64 s[4:5], s[2:3]
+; CHECK-NEXT:    s_xor_b64 s[2:3], exec, s[4:5]
+; CHECK-NEXT:  ; %bb.3: ; %B
+; CHECK-NEXT:    v_cmp_gt_u32_e64 s[0:1], 2, v2
+; CHECK-NEXT:    ; implicit-def: $vgpr2
+; CHECK-NEXT:  ; %bb.4: ; %Flow
+; CHECK-NEXT:    s_and_not1_saveexec_b64 s[2:3], s[2:3]
+; CHECK-NEXT:  ; %bb.5: ; %A
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v2
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; CHECK-NEXT:    s_and_not1_b64 s[0:1], s[0:1], exec
+; CHECK-NEXT:    s_and_b64 s[4:5], vcc, exec
+; CHECK-NEXT:    s_or_b64 s[0:1], s[0:1], s[4:5]
+; CHECK-NEXT:  ; %bb.6: ; %exit
+; CHECK-NEXT:    s_or_b64 exec, exec, s[2:3]
+; CHECK-NEXT:    v_cndmask_b32_e64 v2, 2, 1, s[0:1]
+; CHECK-NEXT:    global_store_b32 v[0:1], v2, off
+; CHECK-NEXT:    s_endpgm
+entry:
+  %x = icmp ule i32 %a, %b
+  br i1 %x, label %C, label %MergeCF
+
+C:
+  %y = icmp eq i32 %a, %c
+  br label %MergeCF
+
+MergeCF:
+  %z = phi i1 [ %x, %entry ], [ %y, %C ]
+  %w = icmp ule i32 %a, %b
+  %cmp = or i1 %w, %z
+  br i1 %cmp, label %A, label %B
+
+A:
+  %val_A = icmp uge i32 %tid, 1
+  br label %exit
+
+B:
+  %val_B = icmp ult i32 %tid, 2
+  br label %exit
+
+exit:
+  %phi = phi i1 [ %val_A, %A ], [ %val_B, %B ]
+  %sel = select i1 %phi, i32 1, i32 2
+  store i32 %sel, ptr addrspace(1) %out
+  ret void
+}

; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 -mattr="+wavefrontsize32,-wavefrontsize64" -o - < %s | FileCheck %s
; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 -mattr="+wavefrontsize32,-wavefrontsize64" -o - < %s | FileCheck %s

define amdgpu_ps void @divergent_i1_phi_if_else(ptr addrspace(1) %out, i32 %tid, i32 %a, i32 %b, i32 %c, i32 %d) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't need control flow in this test. Should cover all the SGPR and VGPR input cases, and need negative tests for the multi-use case

Comment on lines 2 to 3
; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 -mattr="+wavefrontsize32,-wavefrontsize64" -o - < %s | FileCheck %s
; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 -mattr="+wavefrontsize32,-wavefrontsize64" -o - < %s | FileCheck %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need the -wavefrontsize64 cases. Also this isn't testing the wave64 case. Plus add -global-isel run lines, I would hope this pattern works without issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works for global-isel

@@ -0,0 +1,68 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 -mattr="-wavefrontsize32,+wavefrontsize64" -o - < %s | FileCheck %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wave32 and wave64 can be tested in the same file

Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remodel the testing style after something like test/CodeGen/AMDGPU/GlobalISel/xnor.ll?

Notably:
wave32 and wave64 should be the same file. There is no IR difference between them.

The tested type combinations should be in the name. The wavesize is only significant to the output, not the input.

Explicit tests for the negative one use case, with a name to reflect that

Avoid memory instructions, they are not relevant to the pattern and add a lot of extra noise to the output

ret i32 %z
}

define amdgpu_ps i32 @negative_test_w32(i32 %x, i32 %y) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Negative why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

((x or y) xor -1 -> x nor y) not needed in this case

@mihajlovicana
Copy link
Contributor Author

mihajlovicana commented Mar 11, 2025

Can you remodel the testing style after something like test/CodeGen/AMDGPU/GlobalISel/xnor.ll?

Notably: wave32 and wave64 should be the same file. There is no IR difference between them.

The tested type combinations should be in the name. The wavesize is only significant to the output, not the input.

Explicit tests for the negative one use case, with a name to reflect that

Avoid memory instructions, they are not relevant to the pattern and add a lot of extra noise to the output

The problem I have here is that if I am using ballot I have to have separate tests. I tried using cmp instead of it but the problem is that SelectionDAG optimizes IR and I get this :

`define amdgpu_ps void @foo(i32 %x, i32 %y, ptr addrspace(1) %out, i32 %a, i32 %b) {
; W64-LABEL: foo:
; W64:       ; %bb.0:
; W64-NEXT:    v_cmp_gt_u32_e32 vcc, v0, v4
; W64-NEXT:    v_cmp_ne_u32_e64 s[0:1], v1, v5
; W64-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
; W64-NEXT:    s_and_b64 s[0:1], s[0:1], vcc
; W64-NEXT:    v_cndmask_b32_e64 v0, 0, 1, s[0:1]
; W64-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; W64-NEXT:    v_cmp_ne_u32_e64 s[0:1], 0, v0
; W64-NEXT:    s_wait_alu 0xf1ff
; W64-NEXT:    v_mov_b32_e32 v0, s0
; W64-NEXT:    s_delay_alu instid0(VALU_DEP_2)
; W64-NEXT:    v_mov_b32_e32 v1, s1
; W64-NEXT:    global_store_b64 v[2:3], v[0:1], off
; W64-NEXT:    s_endpgm
  %y.b = icmp ule i32 %x, %a
  %x.b = icmp eq i32 %y, %b
  %t = or i1 %x.b, %y.b
  %t.1 = xor i1 %t, -1
  %z = call i64 @llvm.amdgcn.ballot.i64(i1 %t.1)
  store i64 %z, ptr addrspace(1) %out
  ret void
}`

So I can't really test if this patch is working with this.
I added store in the last case to avoid illegal vgpr to sgpr copy.

; GFX12W32-NEXT: v_cndmask_b32_e64 v0, 0, 1, s0
; GFX12W32-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12W32-NEXT: v_cmp_ne_u32_e64 s0, 0, v0
; GFX12W32-NEXT: s_wait_alu 0xf1ff
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the check lines are redundant. Could consider to use --check-prefixes=CHECK,GFX... and then hopefully the update script could tell them apart, though most of the time it doesn't work well. Worth a shot.

; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 -mattr="+wavefrontsize32,-wavefrontsize64" -o - < %s | FileCheck -check-prefix=GFX12W32 %s
; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 -mattr="+wavefrontsize32,-wavefrontsize64" -o - < %s | FileCheck -check-prefixes=GFX11W32 %s

define amdgpu_ps i32 @test_w32(i32 %x, i32 %y) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%x and %y should be inreg, inverse.ballot.i32 arguments are uniform i32/i64

@petar-avramovic
Copy link
Collaborator

Title should clarify that this is pattern for divergent i1/lane mask specifically, pattern for uniform i32/i64 already exists. Also constant should not be -1 but true instead since it its type is i1?

@arsenm
Copy link
Contributor

arsenm commented Mar 12, 2025

The problem I have here is that if I am using ballot

Ballot is not core to the pattern here, the ballot tests should be separate. But also the wave64 ballot should work on wave32

@nhaehnle
Copy link
Collaborator

nhaehnle commented Mar 17, 2025

The problem I have here is that if I am using ballot

Ballot is not core to the pattern here, the ballot tests should be separate. But also the wave64 ballot should work on wave32

Looking at this with Ana offline, it seems that inverse.ballot.i64 doesn't work on wave32 yet. That should be fixed first.

As for the rest, I think the use of ballot/inverse.ballot here is very reasonable. Our calling conventions generally pass i1 as 32-bit values, and so without ballot/inverse.ballot the boolean logic tends to get folded into the v_cndmask/v_cmp that is generated as part of the calling convention lowering, which makes it difficult to test this particular pattern. Ballot/inverse.ballot is the most natural way to express what we want to do here. It certainly makes for a cleaner test than the test with the branches (which I believe was a distillation of the real world code where we found this isel weakness).

@arsenm
Copy link
Contributor

arsenm commented Mar 18, 2025

As for the rest, I think the use of ballot/inverse.ballot here is very reasonable. Our calling conventions generally pass i1 as 32-bit values,

We're working on fixing that to pass as a lane mask for callable functions (this won't change for entries)

@nhaehnle
Copy link
Collaborator

As for the rest, I think the use of ballot/inverse.ballot here is very reasonable. Our calling conventions generally pass i1 as 32-bit values,

We're working on fixing that to pass as a lane mask for callable functions (this won't change for entries)

That's good. How quickly will that be ready?

@mihajlovicana
Copy link
Contributor Author

#132770 <- since this was merged I am now able to put ballot tests for both wave32 and wave64 in the same file

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants