[RISCV] Don't use EVL/Mask for vid when lowering vp.reverse #123048

Closed
wants to merge 2 commits

Conversation

@lukel97 (Contributor) commented Jan 15, 2025

vp.reverse intrinsics are emitted by the loop vectorizer when EVL tail folding is enabled, and currently end up generating code like this:

.LBB0_1: # %loop
  # =>This Inner Loop Header: Depth=1
  sub a3, a2, a1
  slli a4, a1, 3
  vsetvli a3, a3, e64, m1, ta, ma
  add a4, a0, a4
  vle64.v v8, (a4)
  addi a5, a3, -1
  vid.v v9
  vrsub.vx v9, v9, a5
  vrgather.vv v10, v8, v9
  add a1, a1, a3
  vse64.v v10, (a4)
  bltu a1, a2, .LBB0_1

The vid.v needed for the indices is recomputed on every loop iteration: because its AVL is the EVL returned by get.vector.length inside the loop, it can't be hoisted out.

This patch changes the AVL to VLMAX so the vid.v can be made loop invariant:

  vsetvli a3, zero, e64, m1, ta, ma
  vid.v v8
.LBB0_1: # %loop
  # =>This Inner Loop Header: Depth=1
  sub a3, a2, a1
  slli a4, a1, 3
  vsetvli a3, a3, e64, m1, ta, ma
  add a4, a0, a4
  vle64.v v9, (a4)
  addi a5, a3, -1
  vrsub.vx v10, v8, a5
  vrgather.vv v11, v9, v10
  add a1, a1, a3
  vse64.v v11, (a4)
  bltu a1, a2, .LBB0_1

Now that we have RISCVVLOptimizer, this shouldn't increase the number of vsetvlis for straight-line code.

This also removes the mask, which isn't needed, in case it too prevents hoisting.
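
For context, here's a minimal sketch of the kind of EVL tail-folded loop IR that lowers to the assembly above (hypothetical IR reversing each chunk in place; it is not taken from the patch or its tests):

declare i32 @llvm.experimental.get.vector.length.i64(i64, i32 immarg, i1 immarg)
declare <vscale x 1 x i64> @llvm.vp.load.nxv1i64.p0(ptr, <vscale x 1 x i1>, i32)
declare <vscale x 1 x i64> @llvm.experimental.vp.reverse.nxv1i64(<vscale x 1 x i64>, <vscale x 1 x i1>, i32)
declare void @llvm.vp.store.nxv1i64.p0(<vscale x 1 x i64>, ptr, <vscale x 1 x i1>, i32)

define void @reverse_chunks(ptr %x, i64 %n) {
entry:
  ; All-true mask: the VP ops here are only limited by the EVL.
  %head = insertelement <vscale x 1 x i1> poison, i1 true, i32 0
  %alltrue = shufflevector <vscale x 1 x i1> %head, <vscale x 1 x i1> poison, <vscale x 1 x i32> zeroinitializer
  br label %loop

loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %remaining = sub i64 %n, %iv
  ; EVL for this iteration; loop-variant, so anything whose AVL is tied to it stays in the loop.
  %evl = call i32 @llvm.experimental.get.vector.length.i64(i64 %remaining, i32 1, i1 true)
  %gep = getelementptr i64, ptr %x, i64 %iv
  %v = call <vscale x 1 x i64> @llvm.vp.load.nxv1i64.p0(ptr %gep, <vscale x 1 x i1> %alltrue, i32 %evl)
  ; Reverses the first %evl elements; this is what lowers to vid + vrsub + vrgather.
  %rev = call <vscale x 1 x i64> @llvm.experimental.vp.reverse.nxv1i64(<vscale x 1 x i64> %v, <vscale x 1 x i1> %alltrue, i32 %evl)
  call void @llvm.vp.store.nxv1i64.p0(<vscale x 1 x i64> %rev, ptr %gep, <vscale x 1 x i1> %alltrue, i32 %evl)
  %evl.ext = zext i32 %evl to i64
  %iv.next = add i64 %iv, %evl.ext
  %done = icmp uge i64 %iv.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret void
}

The vid.v comes from lowering the llvm.experimental.vp.reverse call; previously its AVL was the %evl from get.vector.length, which is loop-variant.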

@llvmbot (Member) commented Jan 15, 2025

@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

Patch is 33.90 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/123048.diff

8 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+5-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vp-reverse-float-fixed-vectors.ll (+6-6)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vp-reverse-float.ll (+24-24)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vp-reverse-int-fixed-vectors.ll (+12-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vp-reverse-int.ll (+45-45)
  • (added) llvm/test/CodeGen/RISCV/rvv/vp-reverse-loop.ll (+50)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vp-reverse-mask-fixed-vectors.ll (+32-24)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vp-reverse-mask.ll (+48-36)
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index b25cb128bce9fb..32a94f3dee22c4 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -12430,7 +12430,11 @@ RISCVTargetLowering::lowerVPReverseExperimental(SDValue Op,
     GatherOpc = RISCVISD::VRGATHEREI16_VV_VL;
   }
 
-  SDValue VID = DAG.getNode(RISCVISD::VID_VL, DL, IndicesVT, Mask, EVL);
+  // Don't use EVL or Mask for vid so it can be hoisted out of loops.
+  auto [TrueMask, VLMAX] =
+      getDefaultScalableVLOps(IndicesVT, DL, DAG, Subtarget);
+  SDValue VID = DAG.getNode(RISCVISD::VID_VL, DL, IndicesVT, TrueMask, VLMAX);
+
   SDValue VecLen =
       DAG.getNode(ISD::SUB, DL, XLenVT, EVL, DAG.getConstant(1, DL, XLenVT));
   SDValue VecLenSplat = DAG.getNode(RISCVISD::VMV_V_X_VL, DL, IndicesVT,
diff --git a/llvm/test/CodeGen/RISCV/rvv/vp-reverse-float-fixed-vectors.ll b/llvm/test/CodeGen/RISCV/rvv/vp-reverse-float-fixed-vectors.ll
index 136f6e7bc99909..887edafe9c88a8 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vp-reverse-float-fixed-vectors.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vp-reverse-float-fixed-vectors.ll
@@ -5,10 +5,10 @@
 define <2 x double> @test_vp_reverse_v2f64_masked(<2 x double> %src, <2 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_v2f64_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vid.v v9, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT:    vid.v v9
+; CHECK-NEXT:    vrsub.vx v10, v9, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v9, v8, v10, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v9
 ; CHECK-NEXT:    ret
@@ -34,10 +34,10 @@ define <2 x double> @test_vp_reverse_v2f64(<2 x double> %src, i32 zeroext %evl)
 define <4 x float> @test_vp_reverse_v4f32_masked(<4 x float> %src, <4 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_v4f32_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e32, m1, ta, ma
-; CHECK-NEXT:    vid.v v9, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT:    vid.v v9
+; CHECK-NEXT:    vrsub.vx v10, v9, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v9, v8, v10, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v9
 ; CHECK-NEXT:    ret
diff --git a/llvm/test/CodeGen/RISCV/rvv/vp-reverse-float.ll b/llvm/test/CodeGen/RISCV/rvv/vp-reverse-float.ll
index b235990ab5dd09..194eb222be01f5 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vp-reverse-float.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vp-reverse-float.ll
@@ -4,10 +4,10 @@
 define <vscale x 1 x double> @test_vp_reverse_nxv1f64_masked(<vscale x 1 x double> %src, <vscale x 1 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv1f64_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vid.v v9, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT:    vid.v v9
+; CHECK-NEXT:    vrsub.vx v10, v9, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v9, v8, v10, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v9
 ; CHECK-NEXT:    ret
@@ -33,10 +33,10 @@ define <vscale x 1 x double> @test_vp_reverse_nxv1f64(<vscale x 1 x double> %src
 define <vscale x 2 x float> @test_vp_reverse_nxv2f32_masked(<vscale x 2 x float> %src, <vscale x 2 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv2f32_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e32, m1, ta, ma
-; CHECK-NEXT:    vid.v v9, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT:    vid.v v9
+; CHECK-NEXT:    vrsub.vx v10, v9, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v9, v8, v10, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v9
 ; CHECK-NEXT:    ret
@@ -62,10 +62,10 @@ define <vscale x 2 x float> @test_vp_reverse_nxv2f32(<vscale x 2 x float> %src,
 define <vscale x 2 x double> @test_vp_reverse_nxv2f64_masked(<vscale x 2 x double> %src, <vscale x 2 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv2f64_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m2, ta, ma
-; CHECK-NEXT:    vid.v v10, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v12, v10, a0, v0.t
+; CHECK-NEXT:    vid.v v10
+; CHECK-NEXT:    vrsub.vx v12, v10, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v10, v8, v12, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v10
 ; CHECK-NEXT:    ret
@@ -91,10 +91,10 @@ define <vscale x 2 x double> @test_vp_reverse_nxv2f64(<vscale x 2 x double> %src
 define <vscale x 4 x float> @test_vp_reverse_nxv4f32_masked(<vscale x 4 x float> %src, <vscale x 4 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv4f32_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e32, m2, ta, ma
-; CHECK-NEXT:    vid.v v10, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v12, v10, a0, v0.t
+; CHECK-NEXT:    vid.v v10
+; CHECK-NEXT:    vrsub.vx v12, v10, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v10, v8, v12, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v10
 ; CHECK-NEXT:    ret
@@ -120,10 +120,10 @@ define <vscale x 4 x float> @test_vp_reverse_nxv4f32(<vscale x 4 x float> %src,
 define <vscale x 4 x double> @test_vp_reverse_nxv4f64_masked(<vscale x 4 x double> %src, <vscale x 4 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv4f64_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m4, ta, ma
-; CHECK-NEXT:    vid.v v12, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v16, v12, a0, v0.t
+; CHECK-NEXT:    vid.v v12
+; CHECK-NEXT:    vrsub.vx v16, v12, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v12, v8, v16, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v12
 ; CHECK-NEXT:    ret
@@ -149,10 +149,10 @@ define <vscale x 4 x double> @test_vp_reverse_nxv4f64(<vscale x 4 x double> %src
 define <vscale x 8 x float> @test_vp_reverse_nxv8f32_masked(<vscale x 8 x float> %src, <vscale x 8 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv8f32_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e32, m4, ta, ma
-; CHECK-NEXT:    vid.v v12, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v16, v12, a0, v0.t
+; CHECK-NEXT:    vid.v v12
+; CHECK-NEXT:    vrsub.vx v16, v12, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v12, v8, v16, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v12
 ; CHECK-NEXT:    ret
@@ -178,10 +178,10 @@ define <vscale x 8 x float> @test_vp_reverse_nxv8f32(<vscale x 8 x float> %src,
 define <vscale x 8 x double> @test_vp_reverse_nxv8f64_masked(<vscale x 8 x double> %src, <vscale x 8 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv8f64_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m8, ta, ma
-; CHECK-NEXT:    vid.v v16, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v24, v16, a0, v0.t
+; CHECK-NEXT:    vid.v v16
+; CHECK-NEXT:    vrsub.vx v24, v16, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v16, v8, v24, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v16
 ; CHECK-NEXT:    ret
@@ -207,10 +207,10 @@ define <vscale x 8 x double> @test_vp_reverse_nxv8f64(<vscale x 8 x double> %src
 define <vscale x 16 x float> @test_vp_reverse_nxv16f32_masked(<vscale x 16 x float> %src, <vscale x 16 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv16f32_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e32, m8, ta, ma
-; CHECK-NEXT:    vid.v v16, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v24, v16, a0, v0.t
+; CHECK-NEXT:    vid.v v16
+; CHECK-NEXT:    vrsub.vx v24, v16, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v16, v8, v24, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v16
 ; CHECK-NEXT:    ret
diff --git a/llvm/test/CodeGen/RISCV/rvv/vp-reverse-int-fixed-vectors.ll b/llvm/test/CodeGen/RISCV/rvv/vp-reverse-int-fixed-vectors.ll
index 27f16f0285e12f..33fa3539ade937 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vp-reverse-int-fixed-vectors.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vp-reverse-int-fixed-vectors.ll
@@ -5,10 +5,10 @@
 define <2 x i64> @test_vp_reverse_v2i64_masked(<2 x i64> %src, <2 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_v2i64_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vid.v v9, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT:    vid.v v9
+; CHECK-NEXT:    vrsub.vx v10, v9, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v9, v8, v10, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v9
 ; CHECK-NEXT:    ret
@@ -34,10 +34,10 @@ define <2 x i64> @test_vp_reverse_v2i64(<2 x i64> %src, i32 zeroext %evl) {
 define <4 x i32> @test_vp_reverse_v4i32_masked(<4 x i32> %src, <4 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_v4i32_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e32, m1, ta, ma
-; CHECK-NEXT:    vid.v v9, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT:    vid.v v9
+; CHECK-NEXT:    vrsub.vx v10, v9, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v9, v8, v10, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v9
 ; CHECK-NEXT:    ret
@@ -63,10 +63,10 @@ define <4 x i32> @test_vp_reverse_v4i32(<4 x i32> %src, i32 zeroext %evl) {
 define <8 x i16> @test_vp_reverse_v8i16_masked(<8 x i16> %src, <8 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_v8i16_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e16, m1, ta, ma
-; CHECK-NEXT:    vid.v v9, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT:    vid.v v9
+; CHECK-NEXT:    vrsub.vx v10, v9, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v9, v8, v10, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v9
 ; CHECK-NEXT:    ret
@@ -92,10 +92,10 @@ define <8 x i16> @test_vp_reverse_v8i16(<8 x i16> %src, i32 zeroext %evl) {
 define <16 x i8> @test_vp_reverse_v16i8_masked(<16 x i8> %src, <16 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_v16i8_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e16, m2, ta, ma
-; CHECK-NEXT:    vid.v v10, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v10, v10, a0, v0.t
+; CHECK-NEXT:    vid.v v10
+; CHECK-NEXT:    vrsub.vx v10, v10, a1, v0.t
 ; CHECK-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
 ; CHECK-NEXT:    vrgatherei16.vv v9, v8, v10, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v9
diff --git a/llvm/test/CodeGen/RISCV/rvv/vp-reverse-int.ll b/llvm/test/CodeGen/RISCV/rvv/vp-reverse-int.ll
index 507f5154cf1aca..ab37e5f27bcefe 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vp-reverse-int.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vp-reverse-int.ll
@@ -4,10 +4,10 @@
 define <vscale x 1 x i64> @test_vp_reverse_nxv1i64_masked(<vscale x 1 x i64> %src, <vscale x 1 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv1i64_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vid.v v9, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT:    vid.v v9
+; CHECK-NEXT:    vrsub.vx v10, v9, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v9, v8, v10, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v9
 ; CHECK-NEXT:    ret
@@ -33,10 +33,10 @@ define <vscale x 1 x i64> @test_vp_reverse_nxv1i64(<vscale x 1 x i64> %src, i32
 define <vscale x 2 x i32> @test_vp_reverse_nxv2i32_masked(<vscale x 2 x i32> %src, <vscale x 2 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv2i32_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e32, m1, ta, ma
-; CHECK-NEXT:    vid.v v9, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT:    vid.v v9
+; CHECK-NEXT:    vrsub.vx v10, v9, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v9, v8, v10, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v9
 ; CHECK-NEXT:    ret
@@ -62,10 +62,10 @@ define <vscale x 2 x i32> @test_vp_reverse_nxv2i32(<vscale x 2 x i32> %src, i32
 define <vscale x 4 x i16> @test_vp_reverse_nxv4i16_masked(<vscale x 4 x i16> %src, <vscale x 4 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv4i16_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e16, m1, ta, ma
-; CHECK-NEXT:    vid.v v9, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v10, v9, a0, v0.t
+; CHECK-NEXT:    vid.v v9
+; CHECK-NEXT:    vrsub.vx v10, v9, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v9, v8, v10, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v9
 ; CHECK-NEXT:    ret
@@ -91,10 +91,10 @@ define <vscale x 4 x i16> @test_vp_reverse_nxv4i16(<vscale x 4 x i16> %src, i32
 define <vscale x 8 x i8> @test_vp_reverse_nxv8i8_masked(<vscale x 8 x i8> %src, <vscale x 8 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv8i8_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e16, m2, ta, ma
-; CHECK-NEXT:    vid.v v10, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v10, v10, a0, v0.t
+; CHECK-NEXT:    vid.v v10
+; CHECK-NEXT:    vrsub.vx v10, v10, a1, v0.t
 ; CHECK-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
 ; CHECK-NEXT:    vrgatherei16.vv v9, v8, v10, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v9
@@ -122,10 +122,10 @@ define <vscale x 8 x i8> @test_vp_reverse_nxv8i8(<vscale x 8 x i8> %src, i32 zer
 define <vscale x 2 x i64> @test_vp_reverse_nxv2i64_masked(<vscale x 2 x i64> %src, <vscale x 2 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv2i64_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m2, ta, ma
-; CHECK-NEXT:    vid.v v10, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v12, v10, a0, v0.t
+; CHECK-NEXT:    vid.v v10
+; CHECK-NEXT:    vrsub.vx v12, v10, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v10, v8, v12, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v10
 ; CHECK-NEXT:    ret
@@ -151,10 +151,10 @@ define <vscale x 2 x i64> @test_vp_reverse_nxv2i64(<vscale x 2 x i64> %src, i32
 define <vscale x 4 x i32> @test_vp_reverse_nxv4i32_masked(<vscale x 4 x i32> %src, <vscale x 4 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv4i32_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e32, m2, ta, ma
-; CHECK-NEXT:    vid.v v10, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v12, v10, a0, v0.t
+; CHECK-NEXT:    vid.v v10
+; CHECK-NEXT:    vrsub.vx v12, v10, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v10, v8, v12, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v10
 ; CHECK-NEXT:    ret
@@ -180,10 +180,10 @@ define <vscale x 4 x i32> @test_vp_reverse_nxv4i32(<vscale x 4 x i32> %src, i32
 define <vscale x 8 x i16> @test_vp_reverse_nxv8i16_masked(<vscale x 8 x i16> %src, <vscale x 8 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv8i16_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e16, m2, ta, ma
-; CHECK-NEXT:    vid.v v10, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v12, v10, a0, v0.t
+; CHECK-NEXT:    vid.v v10
+; CHECK-NEXT:    vrsub.vx v12, v10, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v10, v8, v12, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v10
 ; CHECK-NEXT:    ret
@@ -209,10 +209,10 @@ define <vscale x 8 x i16> @test_vp_reverse_nxv8i16(<vscale x 8 x i16> %src, i32
 define <vscale x 16 x i8> @test_vp_reverse_nxv16i8_masked(<vscale x 16 x i8> %src, <vscale x 16 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv16i8_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e16, m4, ta, ma
-; CHECK-NEXT:    vid.v v12, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v12, v12, a0, v0.t
+; CHECK-NEXT:    vid.v v12
+; CHECK-NEXT:    vrsub.vx v12, v12, a1, v0.t
 ; CHECK-NEXT:    vsetvli zero, zero, e8, m2, ta, ma
 ; CHECK-NEXT:    vrgatherei16.vv v10, v8, v12, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v10
@@ -240,10 +240,10 @@ define <vscale x 16 x i8> @test_vp_reverse_nxv16i8(<vscale x 16 x i8> %src, i32
 define <vscale x 4 x i64> @test_vp_reverse_nxv4i64_masked(<vscale x 4 x i64> %src, <vscale x 4 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv4i64_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m4, ta, ma
-; CHECK-NEXT:    vid.v v12, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v16, v12, a0, v0.t
+; CHECK-NEXT:    vid.v v12
+; CHECK-NEXT:    vrsub.vx v16, v12, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v12, v8, v16, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v12
 ; CHECK-NEXT:    ret
@@ -269,10 +269,10 @@ define <vscale x 4 x i64> @test_vp_reverse_nxv4i64(<vscale x 4 x i64> %src, i32
 define <vscale x 8 x i32> @test_vp_reverse_nxv8i32_masked(<vscale x 8 x i32> %src, <vscale x 8 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv8i32_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e32, m4, ta, ma
-; CHECK-NEXT:    vid.v v12, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v16, v12, a0, v0.t
+; CHECK-NEXT:    vid.v v12
+; CHECK-NEXT:    vrsub.vx v16, v12, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v12, v8, v16, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v12
 ; CHECK-NEXT:    ret
@@ -298,10 +298,10 @@ define <vscale x 8 x i32> @test_vp_reverse_nxv8i32(<vscale x 8 x i32> %src, i32
 define <vscale x 16 x i16> @test_vp_reverse_nxv16i16_masked(<vscale x 16 x i16> %src, <vscale x 16 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv16i16_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e16, m4, ta, ma
-; CHECK-NEXT:    vid.v v12, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v16, v12, a0, v0.t
+; CHECK-NEXT:    vid.v v12
+; CHECK-NEXT:    vrsub.vx v16, v12, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v12, v8, v16, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v12
 ; CHECK-NEXT:    ret
@@ -327,10 +327,10 @@ define <vscale x 16 x i16> @test_vp_reverse_nxv16i16(<vscale x 16 x i16> %src, i
 define <vscale x 32 x i8> @test_vp_reverse_nxv32i8_masked(<vscale x 32 x i8> %src, <vscale x 32 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv32i8_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e16, m8, ta, ma
-; CHECK-NEXT:    vid.v v16, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v16, v16, a0, v0.t
+; CHECK-NEXT:    vid.v v16
+; CHECK-NEXT:    vrsub.vx v16, v16, a1, v0.t
 ; CHECK-NEXT:    vsetvli zero, zero, e8, m4, ta, ma
 ; CHECK-NEXT:    vrgatherei16.vv v12, v8, v16, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v12
@@ -358,10 +358,10 @@ define <vscale x 32 x i8> @test_vp_reverse_nxv32i8(<vscale x 32 x i8> %src, i32
 define <vscale x 8 x i64> @test_vp_reverse_nxv8i64_masked(<vscale x 8 x i64> %src, <vscale x 8 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv8i64_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m8, ta, ma
-; CHECK-NEXT:    vid.v v16, v0.t
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vrsub.vx v24, v16, a0, v0.t
+; CHECK-NEXT:    vid.v v16
+; CHECK-NEXT:    vrsub.vx v24, v16, a1, v0.t
 ; CHECK-NEXT:    vrgather.vv v16, v8, v24, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v16
 ; CHECK-NEXT:    ret
@@ -387,10 +387,10 @@ define <vscale x 8 x i64> @test_vp_reverse_nxv8i64(<vscale x 8 x i64> %src, i32
 define <vscale x 16 x i32> @test_vp_reverse_nxv16i32_masked(<vscale x 16 x i32> %src, <vscale x 16 x i1> %mask, i32 zeroext %evl) {
 ; CHECK-LABEL: test_vp_reverse_nxv16i32_masked:
 ; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi a1, a0, -1
 ; CHECK-NEXT:    vsetvli zero, a0, e32, m8, ta, ma
-; CHECK-NEXT:    vid.v v16, v0....
[truncated]

@wangpc-pp (Contributor) left a comment

I handled the same issue before but my solution was generating strided load/store with negative stride so that we don't need to reverse the mask and load/store.

; CHECK-NEXT: vrsub.vx v10, v10, a0, v0.t
; CHECK-NEXT: vsetvli zero, zero, e8, mf8, ta, ma
; CHECK-NEXT: vmv.v.i v11, 0
; CHECK-NEXT: vmv1r.v v0, v9
Contributor

Regressions?

Contributor Author

Yeah, it looks like the scheduler is now shuffling things around because the vid has one fewer use. And I think it's especially bad for vxi1 vectors because we need to constantly copy v0. Can we put this down to noise? Alternatively, we could rewrite these tests so the mask vector comes first in v0.

As a side note, we don't strictly need to apply the mask for this operation, or for most other non-trapping VP ops either; dropping it would get rid of a lot of these mask copies.

Collaborator

The scheduler is supposed to be register-pressure aware; does this hint that we have a problem with the way we model costs for the v0 register?

@lukel97 (Contributor, Author) commented Jan 15, 2025

I handled the same issue before but my solution was generating strided load/store with negative stride so that we don't need to reverse the mask and load/store.

Yeah, hopefully there will be fewer of them once we either teach the loop vectorizer to emit strided intrinsics or land #122244, but I guess we might still need this for general reverse gathers? E.g.

for (int i = n; i >= 0; i--)
  x[y[i]] = 0;

@preames (Collaborator) commented Jan 15, 2025

This sorta feels like solving the problem in the wrong place. If it's unprofitable for the vectorizer to emit the vp.reverse, why not just change the vectorizer to emit a vp.reverse at VLMAX, or better, an unpredicated vector.reverse?

@topperc (Collaborator) commented Jan 15, 2025

This sorta feels like solving the problem in the wrong place. If it's unprofitable for the vectorizer to emit the vp.reverse, why not just change the vectorizer to emit a vp.reverse at VLMAX, or better, an unpredicated vector.reverse?

The vectorizer can't emit a VLMAX reverse. The reverse needs to put the element at EVL-1 into element 0 and element 0 into element EVL-1, so the vrsub needs to compute (EVL-1) - vid; it can't compute (VLMAX-1) - vid.
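
(For reference, the element mapping here is

  result[i] = src[(EVL - 1) - i],   0 <= i < EVL

so the gather indices are (EVL - 1) - vid. The 0, 1, 2, ... sequence from vid does not depend on EVL; only the vrsub and the gather do, which is what lets the vid be computed at VLMAX.)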

@topperc (Collaborator) commented Jan 15, 2025

I have patches in my downstream for DAG combines for vp.reverse(vp.load(ADDR, MASK)) -> vp.strided.load(ADDR, -1, MASK) and vp.store(vp.reverse(VAL), ADDR, MASK) -> vp.strided.store(VAL, NEW_ADDR, -1, MASK).

We also have InstCombine patches to aggressively eliminate redundant vp.reverses. We've seen the vectorizer do things like load, reverse, add, reverse, store.
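
For readers unfamiliar with the strided intrinsics, here is a rough IR sketch of the load combine described above (hypothetical function names, not the downstream patch itself; it assumes the mask is all-true, since otherwise the mask would also need reversing, and vp.strided.load takes its stride in bytes, hence -8 for i64 elements):

declare <vscale x 1 x i64> @llvm.vp.load.nxv1i64.p0(ptr, <vscale x 1 x i1>, i32)
declare <vscale x 1 x i64> @llvm.experimental.vp.reverse.nxv1i64(<vscale x 1 x i64>, <vscale x 1 x i1>, i32)
declare <vscale x 1 x i64> @llvm.experimental.vp.strided.load.nxv1i64.p0.i64(ptr, i64, <vscale x 1 x i1>, i32)

; Before: vp.reverse(vp.load(ADDR, MASK)).
define <vscale x 1 x i64> @reverse_of_load(ptr %p, <vscale x 1 x i1> %m, i32 zeroext %evl) {
  %v = call <vscale x 1 x i64> @llvm.vp.load.nxv1i64.p0(ptr %p, <vscale x 1 x i1> %m, i32 %evl)
  %r = call <vscale x 1 x i64> @llvm.experimental.vp.reverse.nxv1i64(<vscale x 1 x i64> %v, <vscale x 1 x i1> %m, i32 %evl)
  ret <vscale x 1 x i64> %r
}

; After: a single negative-stride load, with the base rebased to element EVL-1.
define <vscale x 1 x i64> @reverse_of_load_strided(ptr %p, <vscale x 1 x i1> %m, i32 zeroext %evl) {
  %last.idx = add i32 %evl, -1
  %last.idx.ext = zext i32 %last.idx to i64
  %last = getelementptr i64, ptr %p, i64 %last.idx.ext
  %r = call <vscale x 1 x i64> @llvm.experimental.vp.strided.load.nxv1i64.p0.i64(ptr %last, i64 -8, <vscale x 1 x i1> %m, i32 %evl)
  ret <vscale x 1 x i64> %r
}

The store combine is the mirror image, with the rebased pointer playing the role of NEW_ADDR.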

@lukel97 (Contributor, Author) commented Jan 16, 2025

I have patches in my downstream for DAG combines for vp.reverse(vp.load(ADDR, MASK)) -> vp.strided.load(ADDR, -1, MASK) and vp.store(vp.reverse(VAL), ADDR, MASK) -> vp.strided.store(VAL, NEW_ADDR, -1, MASK)

We should probably also teach the loop vectorizer to emit a strided load/store for a reversed VPWidenLoadEVLRecipe directly. That way we can remove the reverse shuffle from its cost.

@lukel97 (Contributor, Author) commented Jan 20, 2025

Closing, since #123608 avoids the need for this.

@lukel97 closed this on Jan 20, 2025