[RISCV] Don't use EVL/Mask for vid when lowering vp.reverse #123048
Conversation
vp.reverse intrinsics are emitted by the loop vectorizer when EVL tail folding is enabled, and currently end up generating code like this:

.LBB0_1:                              # %loop
                                      # =>This Inner Loop Header: Depth=1
        sub      a3, a2, a1
        slli     a4, a1, 3
        vsetvli  a3, a3, e64, m1, ta, ma
        add      a4, a0, a4
        vle64.v  v8, (a4)
        addi     a5, a3, -1
        vid.v    v9
        vrsub.vx v9, v9, a5
        vrgather.vv v10, v8, v9
        add      a1, a1, a3
        vse64.v  v10, (a4)
        bltu     a1, a2, .LBB0_1

The vid.v needed for the indices is recomputed on every loop iteration, but because its AVL is set to the EVL computed by get.vector.length inside the loop, it isn't hoisted out. This patch changes the AVL to VLMAX so the vid.v can be made loop invariant:

        vsetvli  a3, zero, e64, m1, ta, ma
        vid.v    v8
.LBB0_1:                              # %loop
                                      # =>This Inner Loop Header: Depth=1
        sub      a3, a2, a1
        slli     a4, a1, 3
        vsetvli  a3, a3, e64, m1, ta, ma
        add      a4, a0, a4
        vle64.v  v9, (a4)
        addi     a5, a3, -1
        vrsub.vx v10, v8, a5
        vrgather.vv v11, v9, v10
        add      a1, a1, a3
        vse64.v  v11, (a4)
        bltu     a1, a2, .LBB0_1

Now that we have RISCVVLOptimizer, it shouldn't increase the number of vsetvlis for straight-line code. This also removes the mask, which isn't needed, in case it too prevents hoisting.
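For context, an EVL tail-folded loop that lowers to code along these lines might look roughly like the IR below. This is a hand-written sketch, not a test from the patch: the function name, the in-place reverse, and the nxv1f64 element type are illustrative only.

  define void @reverse_in_place(ptr %p, i64 %n) {
  entry:
    br label %loop

  loop:
    %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
    %remaining = sub i64 %n, %i
    %evl = call i32 @llvm.experimental.get.vector.length.i64(i64 %remaining, i32 1, i1 true)
    %addr = getelementptr inbounds double, ptr %p, i64 %i
    %v = call <vscale x 1 x double> @llvm.vp.load.nxv1f64.p0(ptr %addr, <vscale x 1 x i1> splat (i1 true), i32 %evl)
    ; this is the intrinsic handled by lowerVPReverseExperimental
    %rev = call <vscale x 1 x double> @llvm.experimental.vp.reverse.nxv1f64(<vscale x 1 x double> %v, <vscale x 1 x i1> splat (i1 true), i32 %evl)
    call void @llvm.vp.store.nxv1f64.p0(<vscale x 1 x double> %rev, ptr %addr, <vscale x 1 x i1> splat (i1 true), i32 %evl)
    %evl.ext = zext i32 %evl to i64
    %i.next = add i64 %i, %evl.ext
    %done = icmp uge i64 %i.next, %n
    br i1 %done, label %exit, label %loop

  exit:
    ret void
  }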
@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

vp.reverse intrinsics are emitted by the loop vectorizer when EVL tail folding is enabled, and currently end up generating code like this:

The vid.v needed for the indices is recomputed on every loop iteration, but because its AVL is set to the EVL computed by get.vector.length inside the loop, it isn't hoisted out. This patch changes the AVL to VLMAX so the vid.v can be made loop invariant:

Now that we have RISCVVLOptimizer, it shouldn't increase the number of vsetvlis for straight-line code. This also removes the mask, which isn't needed, in case it too prevents hoisting.

Patch is 33.90 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/123048.diff

8 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index b25cb128bce9fb..32a94f3dee22c4 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -12430,7 +12430,11 @@ RISCVTargetLowering::lowerVPReverseExperimental(SDValue Op,
GatherOpc = RISCVISD::VRGATHEREI16_VV_VL;
}
- SDValue VID = DAG.getNode(RISCVISD::VID_VL, DL, IndicesVT, Mask, EVL);
+ // Don't use EVL or Mask for vid so it can be hoisted out of loops.
+ auto [TrueMask, VLMAX] =
+ getDefaultScalableVLOps(IndicesVT, DL, DAG, Subtarget);
+ SDValue VID = DAG.getNode(RISCVISD::VID_VL, DL, IndicesVT, TrueMask, VLMAX);
+
SDValue VecLen =
DAG.getNode(ISD::SUB, DL, XLenVT, EVL, DAG.getConstant(1, DL, XLenVT));
SDValue VecLenSplat = DAG.getNode(RISCVISD::VMV_V_X_VL, DL, IndicesVT,
diff --git a/llvm/test/CodeGen/RISCV/rvv/vp-reverse-float-fixed-vectors.ll b/llvm/test/CodeGen/RISCV/rvv/vp-reverse-float-fixed-vectors.ll
index 136f6e7bc99909..887edafe9c88a8 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vp-reverse-float-fixed-vectors.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vp-reverse-float-fixed-vectors.ll
@@ -5,10 +5,10 @@
define <2 x double> @test_vp_reverse_v2f64_masked(<2 x double> %src, <2 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_v2f64_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vid.v v9, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT: vid.v v9
+; CHECK-NEXT: vrsub.vx v10, v9, a1, v0.t
; CHECK-NEXT: vrgather.vv v9, v8, v10, v0.t
; CHECK-NEXT: vmv.v.v v8, v9
; CHECK-NEXT: ret
@@ -34,10 +34,10 @@ define <2 x double> @test_vp_reverse_v2f64(<2 x double> %src, i32 zeroext %evl)
define <4 x float> @test_vp_reverse_v4f32_masked(<4 x float> %src, <4 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_v4f32_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e32, m1, ta, ma
-; CHECK-NEXT: vid.v v9, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT: vid.v v9
+; CHECK-NEXT: vrsub.vx v10, v9, a1, v0.t
; CHECK-NEXT: vrgather.vv v9, v8, v10, v0.t
; CHECK-NEXT: vmv.v.v v8, v9
; CHECK-NEXT: ret
diff --git a/llvm/test/CodeGen/RISCV/rvv/vp-reverse-float.ll b/llvm/test/CodeGen/RISCV/rvv/vp-reverse-float.ll
index b235990ab5dd09..194eb222be01f5 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vp-reverse-float.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vp-reverse-float.ll
@@ -4,10 +4,10 @@
define <vscale x 1 x double> @test_vp_reverse_nxv1f64_masked(<vscale x 1 x double> %src, <vscale x 1 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv1f64_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vid.v v9, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT: vid.v v9
+; CHECK-NEXT: vrsub.vx v10, v9, a1, v0.t
; CHECK-NEXT: vrgather.vv v9, v8, v10, v0.t
; CHECK-NEXT: vmv.v.v v8, v9
; CHECK-NEXT: ret
@@ -33,10 +33,10 @@ define <vscale x 1 x double> @test_vp_reverse_nxv1f64(<vscale x 1 x double> %src
define <vscale x 2 x float> @test_vp_reverse_nxv2f32_masked(<vscale x 2 x float> %src, <vscale x 2 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv2f32_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e32, m1, ta, ma
-; CHECK-NEXT: vid.v v9, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT: vid.v v9
+; CHECK-NEXT: vrsub.vx v10, v9, a1, v0.t
; CHECK-NEXT: vrgather.vv v9, v8, v10, v0.t
; CHECK-NEXT: vmv.v.v v8, v9
; CHECK-NEXT: ret
@@ -62,10 +62,10 @@ define <vscale x 2 x float> @test_vp_reverse_nxv2f32(<vscale x 2 x float> %src,
define <vscale x 2 x double> @test_vp_reverse_nxv2f64_masked(<vscale x 2 x double> %src, <vscale x 2 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv2f64_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e64, m2, ta, ma
-; CHECK-NEXT: vid.v v10, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v12, v10, a0, v0.t
+; CHECK-NEXT: vid.v v10
+; CHECK-NEXT: vrsub.vx v12, v10, a1, v0.t
; CHECK-NEXT: vrgather.vv v10, v8, v12, v0.t
; CHECK-NEXT: vmv.v.v v8, v10
; CHECK-NEXT: ret
@@ -91,10 +91,10 @@ define <vscale x 2 x double> @test_vp_reverse_nxv2f64(<vscale x 2 x double> %src
define <vscale x 4 x float> @test_vp_reverse_nxv4f32_masked(<vscale x 4 x float> %src, <vscale x 4 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv4f32_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e32, m2, ta, ma
-; CHECK-NEXT: vid.v v10, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v12, v10, a0, v0.t
+; CHECK-NEXT: vid.v v10
+; CHECK-NEXT: vrsub.vx v12, v10, a1, v0.t
; CHECK-NEXT: vrgather.vv v10, v8, v12, v0.t
; CHECK-NEXT: vmv.v.v v8, v10
; CHECK-NEXT: ret
@@ -120,10 +120,10 @@ define <vscale x 4 x float> @test_vp_reverse_nxv4f32(<vscale x 4 x float> %src,
define <vscale x 4 x double> @test_vp_reverse_nxv4f64_masked(<vscale x 4 x double> %src, <vscale x 4 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv4f64_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e64, m4, ta, ma
-; CHECK-NEXT: vid.v v12, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v16, v12, a0, v0.t
+; CHECK-NEXT: vid.v v12
+; CHECK-NEXT: vrsub.vx v16, v12, a1, v0.t
; CHECK-NEXT: vrgather.vv v12, v8, v16, v0.t
; CHECK-NEXT: vmv.v.v v8, v12
; CHECK-NEXT: ret
@@ -149,10 +149,10 @@ define <vscale x 4 x double> @test_vp_reverse_nxv4f64(<vscale x 4 x double> %src
define <vscale x 8 x float> @test_vp_reverse_nxv8f32_masked(<vscale x 8 x float> %src, <vscale x 8 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv8f32_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e32, m4, ta, ma
-; CHECK-NEXT: vid.v v12, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v16, v12, a0, v0.t
+; CHECK-NEXT: vid.v v12
+; CHECK-NEXT: vrsub.vx v16, v12, a1, v0.t
; CHECK-NEXT: vrgather.vv v12, v8, v16, v0.t
; CHECK-NEXT: vmv.v.v v8, v12
; CHECK-NEXT: ret
@@ -178,10 +178,10 @@ define <vscale x 8 x float> @test_vp_reverse_nxv8f32(<vscale x 8 x float> %src,
define <vscale x 8 x double> @test_vp_reverse_nxv8f64_masked(<vscale x 8 x double> %src, <vscale x 8 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv8f64_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e64, m8, ta, ma
-; CHECK-NEXT: vid.v v16, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v24, v16, a0, v0.t
+; CHECK-NEXT: vid.v v16
+; CHECK-NEXT: vrsub.vx v24, v16, a1, v0.t
; CHECK-NEXT: vrgather.vv v16, v8, v24, v0.t
; CHECK-NEXT: vmv.v.v v8, v16
; CHECK-NEXT: ret
@@ -207,10 +207,10 @@ define <vscale x 8 x double> @test_vp_reverse_nxv8f64(<vscale x 8 x double> %src
define <vscale x 16 x float> @test_vp_reverse_nxv16f32_masked(<vscale x 16 x float> %src, <vscale x 16 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv16f32_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e32, m8, ta, ma
-; CHECK-NEXT: vid.v v16, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v24, v16, a0, v0.t
+; CHECK-NEXT: vid.v v16
+; CHECK-NEXT: vrsub.vx v24, v16, a1, v0.t
; CHECK-NEXT: vrgather.vv v16, v8, v24, v0.t
; CHECK-NEXT: vmv.v.v v8, v16
; CHECK-NEXT: ret
diff --git a/llvm/test/CodeGen/RISCV/rvv/vp-reverse-int-fixed-vectors.ll b/llvm/test/CodeGen/RISCV/rvv/vp-reverse-int-fixed-vectors.ll
index 27f16f0285e12f..33fa3539ade937 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vp-reverse-int-fixed-vectors.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vp-reverse-int-fixed-vectors.ll
@@ -5,10 +5,10 @@
define <2 x i64> @test_vp_reverse_v2i64_masked(<2 x i64> %src, <2 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_v2i64_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vid.v v9, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT: vid.v v9
+; CHECK-NEXT: vrsub.vx v10, v9, a1, v0.t
; CHECK-NEXT: vrgather.vv v9, v8, v10, v0.t
; CHECK-NEXT: vmv.v.v v8, v9
; CHECK-NEXT: ret
@@ -34,10 +34,10 @@ define <2 x i64> @test_vp_reverse_v2i64(<2 x i64> %src, i32 zeroext %evl) {
define <4 x i32> @test_vp_reverse_v4i32_masked(<4 x i32> %src, <4 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_v4i32_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e32, m1, ta, ma
-; CHECK-NEXT: vid.v v9, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT: vid.v v9
+; CHECK-NEXT: vrsub.vx v10, v9, a1, v0.t
; CHECK-NEXT: vrgather.vv v9, v8, v10, v0.t
; CHECK-NEXT: vmv.v.v v8, v9
; CHECK-NEXT: ret
@@ -63,10 +63,10 @@ define <4 x i32> @test_vp_reverse_v4i32(<4 x i32> %src, i32 zeroext %evl) {
define <8 x i16> @test_vp_reverse_v8i16_masked(<8 x i16> %src, <8 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_v8i16_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e16, m1, ta, ma
-; CHECK-NEXT: vid.v v9, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT: vid.v v9
+; CHECK-NEXT: vrsub.vx v10, v9, a1, v0.t
; CHECK-NEXT: vrgather.vv v9, v8, v10, v0.t
; CHECK-NEXT: vmv.v.v v8, v9
; CHECK-NEXT: ret
@@ -92,10 +92,10 @@ define <8 x i16> @test_vp_reverse_v8i16(<8 x i16> %src, i32 zeroext %evl) {
define <16 x i8> @test_vp_reverse_v16i8_masked(<16 x i8> %src, <16 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_v16i8_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e16, m2, ta, ma
-; CHECK-NEXT: vid.v v10, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v10, v10, a0, v0.t
+; CHECK-NEXT: vid.v v10
+; CHECK-NEXT: vrsub.vx v10, v10, a1, v0.t
; CHECK-NEXT: vsetvli zero, zero, e8, m1, ta, ma
; CHECK-NEXT: vrgatherei16.vv v9, v8, v10, v0.t
; CHECK-NEXT: vmv.v.v v8, v9
diff --git a/llvm/test/CodeGen/RISCV/rvv/vp-reverse-int.ll b/llvm/test/CodeGen/RISCV/rvv/vp-reverse-int.ll
index 507f5154cf1aca..ab37e5f27bcefe 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vp-reverse-int.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vp-reverse-int.ll
@@ -4,10 +4,10 @@
define <vscale x 1 x i64> @test_vp_reverse_nxv1i64_masked(<vscale x 1 x i64> %src, <vscale x 1 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv1i64_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vid.v v9, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT: vid.v v9
+; CHECK-NEXT: vrsub.vx v10, v9, a1, v0.t
; CHECK-NEXT: vrgather.vv v9, v8, v10, v0.t
; CHECK-NEXT: vmv.v.v v8, v9
; CHECK-NEXT: ret
@@ -33,10 +33,10 @@ define <vscale x 1 x i64> @test_vp_reverse_nxv1i64(<vscale x 1 x i64> %src, i32
define <vscale x 2 x i32> @test_vp_reverse_nxv2i32_masked(<vscale x 2 x i32> %src, <vscale x 2 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv2i32_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e32, m1, ta, ma
-; CHECK-NEXT: vid.v v9, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT: vid.v v9
+; CHECK-NEXT: vrsub.vx v10, v9, a1, v0.t
; CHECK-NEXT: vrgather.vv v9, v8, v10, v0.t
; CHECK-NEXT: vmv.v.v v8, v9
; CHECK-NEXT: ret
@@ -62,10 +62,10 @@ define <vscale x 2 x i32> @test_vp_reverse_nxv2i32(<vscale x 2 x i32> %src, i32
define <vscale x 4 x i16> @test_vp_reverse_nxv4i16_masked(<vscale x 4 x i16> %src, <vscale x 4 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv4i16_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e16, m1, ta, ma
-; CHECK-NEXT: vid.v v9, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT: vid.v v9
+; CHECK-NEXT: vrsub.vx v10, v9, a1, v0.t
; CHECK-NEXT: vrgather.vv v9, v8, v10, v0.t
; CHECK-NEXT: vmv.v.v v8, v9
; CHECK-NEXT: ret
@@ -91,10 +91,10 @@ define <vscale x 4 x i16> @test_vp_reverse_nxv4i16(<vscale x 4 x i16> %src, i32
define <vscale x 8 x i8> @test_vp_reverse_nxv8i8_masked(<vscale x 8 x i8> %src, <vscale x 8 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv8i8_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e16, m2, ta, ma
-; CHECK-NEXT: vid.v v10, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v10, v10, a0, v0.t
+; CHECK-NEXT: vid.v v10
+; CHECK-NEXT: vrsub.vx v10, v10, a1, v0.t
; CHECK-NEXT: vsetvli zero, zero, e8, m1, ta, ma
; CHECK-NEXT: vrgatherei16.vv v9, v8, v10, v0.t
; CHECK-NEXT: vmv.v.v v8, v9
@@ -122,10 +122,10 @@ define <vscale x 8 x i8> @test_vp_reverse_nxv8i8(<vscale x 8 x i8> %src, i32 zer
define <vscale x 2 x i64> @test_vp_reverse_nxv2i64_masked(<vscale x 2 x i64> %src, <vscale x 2 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv2i64_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e64, m2, ta, ma
-; CHECK-NEXT: vid.v v10, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v12, v10, a0, v0.t
+; CHECK-NEXT: vid.v v10
+; CHECK-NEXT: vrsub.vx v12, v10, a1, v0.t
; CHECK-NEXT: vrgather.vv v10, v8, v12, v0.t
; CHECK-NEXT: vmv.v.v v8, v10
; CHECK-NEXT: ret
@@ -151,10 +151,10 @@ define <vscale x 2 x i64> @test_vp_reverse_nxv2i64(<vscale x 2 x i64> %src, i32
define <vscale x 4 x i32> @test_vp_reverse_nxv4i32_masked(<vscale x 4 x i32> %src, <vscale x 4 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv4i32_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e32, m2, ta, ma
-; CHECK-NEXT: vid.v v10, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v12, v10, a0, v0.t
+; CHECK-NEXT: vid.v v10
+; CHECK-NEXT: vrsub.vx v12, v10, a1, v0.t
; CHECK-NEXT: vrgather.vv v10, v8, v12, v0.t
; CHECK-NEXT: vmv.v.v v8, v10
; CHECK-NEXT: ret
@@ -180,10 +180,10 @@ define <vscale x 4 x i32> @test_vp_reverse_nxv4i32(<vscale x 4 x i32> %src, i32
define <vscale x 8 x i16> @test_vp_reverse_nxv8i16_masked(<vscale x 8 x i16> %src, <vscale x 8 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv8i16_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e16, m2, ta, ma
-; CHECK-NEXT: vid.v v10, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v12, v10, a0, v0.t
+; CHECK-NEXT: vid.v v10
+; CHECK-NEXT: vrsub.vx v12, v10, a1, v0.t
; CHECK-NEXT: vrgather.vv v10, v8, v12, v0.t
; CHECK-NEXT: vmv.v.v v8, v10
; CHECK-NEXT: ret
@@ -209,10 +209,10 @@ define <vscale x 8 x i16> @test_vp_reverse_nxv8i16(<vscale x 8 x i16> %src, i32
define <vscale x 16 x i8> @test_vp_reverse_nxv16i8_masked(<vscale x 16 x i8> %src, <vscale x 16 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv16i8_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e16, m4, ta, ma
-; CHECK-NEXT: vid.v v12, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v12, v12, a0, v0.t
+; CHECK-NEXT: vid.v v12
+; CHECK-NEXT: vrsub.vx v12, v12, a1, v0.t
; CHECK-NEXT: vsetvli zero, zero, e8, m2, ta, ma
; CHECK-NEXT: vrgatherei16.vv v10, v8, v12, v0.t
; CHECK-NEXT: vmv.v.v v8, v10
@@ -240,10 +240,10 @@ define <vscale x 16 x i8> @test_vp_reverse_nxv16i8(<vscale x 16 x i8> %src, i32
define <vscale x 4 x i64> @test_vp_reverse_nxv4i64_masked(<vscale x 4 x i64> %src, <vscale x 4 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv4i64_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e64, m4, ta, ma
-; CHECK-NEXT: vid.v v12, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v16, v12, a0, v0.t
+; CHECK-NEXT: vid.v v12
+; CHECK-NEXT: vrsub.vx v16, v12, a1, v0.t
; CHECK-NEXT: vrgather.vv v12, v8, v16, v0.t
; CHECK-NEXT: vmv.v.v v8, v12
; CHECK-NEXT: ret
@@ -269,10 +269,10 @@ define <vscale x 4 x i64> @test_vp_reverse_nxv4i64(<vscale x 4 x i64> %src, i32
define <vscale x 8 x i32> @test_vp_reverse_nxv8i32_masked(<vscale x 8 x i32> %src, <vscale x 8 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv8i32_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e32, m4, ta, ma
-; CHECK-NEXT: vid.v v12, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v16, v12, a0, v0.t
+; CHECK-NEXT: vid.v v12
+; CHECK-NEXT: vrsub.vx v16, v12, a1, v0.t
; CHECK-NEXT: vrgather.vv v12, v8, v16, v0.t
; CHECK-NEXT: vmv.v.v v8, v12
; CHECK-NEXT: ret
@@ -298,10 +298,10 @@ define <vscale x 8 x i32> @test_vp_reverse_nxv8i32(<vscale x 8 x i32> %src, i32
define <vscale x 16 x i16> @test_vp_reverse_nxv16i16_masked(<vscale x 16 x i16> %src, <vscale x 16 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv16i16_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e16, m4, ta, ma
-; CHECK-NEXT: vid.v v12, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v16, v12, a0, v0.t
+; CHECK-NEXT: vid.v v12
+; CHECK-NEXT: vrsub.vx v16, v12, a1, v0.t
; CHECK-NEXT: vrgather.vv v12, v8, v16, v0.t
; CHECK-NEXT: vmv.v.v v8, v12
; CHECK-NEXT: ret
@@ -327,10 +327,10 @@ define <vscale x 16 x i16> @test_vp_reverse_nxv16i16(<vscale x 16 x i16> %src, i
define <vscale x 32 x i8> @test_vp_reverse_nxv32i8_masked(<vscale x 32 x i8> %src, <vscale x 32 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv32i8_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e16, m8, ta, ma
-; CHECK-NEXT: vid.v v16, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v16, v16, a0, v0.t
+; CHECK-NEXT: vid.v v16
+; CHECK-NEXT: vrsub.vx v16, v16, a1, v0.t
; CHECK-NEXT: vsetvli zero, zero, e8, m4, ta, ma
; CHECK-NEXT: vrgatherei16.vv v12, v8, v16, v0.t
; CHECK-NEXT: vmv.v.v v8, v12
@@ -358,10 +358,10 @@ define <vscale x 32 x i8> @test_vp_reverse_nxv32i8(<vscale x 32 x i8> %src, i32
define <vscale x 8 x i64> @test_vp_reverse_nxv8i64_masked(<vscale x 8 x i64> %src, <vscale x 8 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv8i64_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e64, m8, ta, ma
-; CHECK-NEXT: vid.v v16, v0.t
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vrsub.vx v24, v16, a0, v0.t
+; CHECK-NEXT: vid.v v16
+; CHECK-NEXT: vrsub.vx v24, v16, a1, v0.t
; CHECK-NEXT: vrgather.vv v16, v8, v24, v0.t
; CHECK-NEXT: vmv.v.v v8, v16
; CHECK-NEXT: ret
@@ -387,10 +387,10 @@ define <vscale x 8 x i64> @test_vp_reverse_nxv8i64(<vscale x 8 x i64> %src, i32
define <vscale x 16 x i32> @test_vp_reverse_nxv16i32_masked(<vscale x 16 x i32> %src, <vscale x 16 x i1> %mask, i32 zeroext %evl) {
; CHECK-LABEL: test_vp_reverse_nxv16i32_masked:
; CHECK: # %bb.0:
+; CHECK-NEXT: addi a1, a0, -1
; CHECK-NEXT: vsetvli zero, a0, e32, m8, ta, ma
-; CHECK-NEXT: vid.v v16, v0....
[truncated]
I handled the same issue before, but my solution was to generate strided loads/stores with a negative stride so that we don't need to reverse the mask or the loaded/stored data.
; CHECK-NEXT: vrsub.vx v10, v10, a0, v0.t
; CHECK-NEXT: vsetvli zero, zero, e8, mf8, ta, ma
; CHECK-NEXT: vmv.v.i v11, 0
; CHECK-NEXT: vmv1r.v v0, v9 |
Regressions?
Yeah, it looks like the scheduler is now shuffling things about because the vid has one less use. And I think it's especially bad for vxi1 vectors because we need to constantly copy V0. Can we put this down to noise? Alternatively we could rewrite these tests so the mask vector comes first in v0.
As a side note, we don't even strictly need to apply the mask for this operation, or for most non-trapping VP ops either. Then we would get rid of a lot of these mask copies.
The scheduler is supposed to be register-pressure aware; does this hint that we have a problem with the way we model costs for the v0 register?
Yeah, hopefully there will be fewer of them if we can either teach the loop vectorizer to emit strided intrinsics or land #122244, but I guess we might still need it for general reverse gathers? E.g.

  for (int i = n; i >= 0; i--)
    x[y[i]] = 0;
This sorta feels like solving the problem in the wrong place. If it's unprofitable for the vectorizer to emit the vp.reverse, why not just change the vectorizer to emit a vp.reverse at vlmax or, better, an unpredicated vector.reverse?
The vectorizer can't emit a vlmax reverse. The reverse needs to put the element at EVL-1 into element 0 and element 0 into EVL-1. So the vrsub needs to do EVL-vid. It can't do VLMAX-vid.
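To spell that out with a small made-up example (assume VLMAX = 8 and EVL = 4, with active values a..d):

  ; source register:                      [a, b, c, d, x, x, x, x]   (x = tail element)
  ; EVL-relative indices, (EVL-1)-vid:    [3, 2, 1, 0, ...]
  ;   vrgather result:                    [d, c, b, a, ...]          <- what vp.reverse must produce
  ; VLMAX-relative indices, (VLMAX-1)-vid: [7, 6, 5, 4, 3, 2, 1, 0]
  ;   vrgather result:                    [x, x, x, x, d, c, b, a]   <- active lanes end up in the wrong place

So only the vid itself can be computed at VLMAX (which is what this patch does); the subtract still has to use EVL-1.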
I have patches in my downstream for DAG combines that rewrite

  vp.reverse(vp.load(ADDR, MASK)) -> vp.strided.load(ADDR, -1, MASK)
  vp.store(vp.reverse(VAL), ADDR, MASK) -> vp.strided.store(VAL, NEW_ADDR, -1, MASK)

We also have InstCombine patches to aggressively eliminate redundant vp.reverses. We've seen the vectorizer do things like load, reverse, add, reverse, store.
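For reference, a rough IR-level sketch of the first of those combines, simplified to an all-true mask (the real combine would also have to handle rebasing the pointer and reversing a non-trivial mask, and this is not the downstream patch itself). Note that the vp.strided.load stride is in bytes, so it is -4 for i32 rather than -1:

  ; before: load of the first %evl elements, then a vp.reverse
  %v   = call <vscale x 2 x i32> @llvm.vp.load.nxv2i32.p0(ptr %p, <vscale x 2 x i1> splat (i1 true), i32 %evl)
  %rev = call <vscale x 2 x i32> @llvm.experimental.vp.reverse.nxv2i32(<vscale x 2 x i32> %v, <vscale x 2 x i1> splat (i1 true), i32 %evl)

  ; after: a single negative-stride load whose base is the last active element
  %evl.m1  = add i32 %evl, -1
  %last    = getelementptr inbounds i32, ptr %p, i32 %evl.m1
  %rev.new = call <vscale x 2 x i32> @llvm.experimental.vp.strided.load.nxv2i32.p0.i64(ptr %last, i64 -4, <vscale x 2 x i1> splat (i1 true), i32 %evl)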
We should probably also teach the loop vectorizer to emit a strided load/store for a reversed VPWidenLoadEVLRecipe directly. That way we can remove the reverse shuffle from its cost.
Closing since #123608 avoids the need for this |