
Conversation

@jacobhinkle (Collaborator) commented Jun 6, 2023

This PR normalizes the inputs to slice in order to mimic the semantics of numpy/PyTorch slicing. For an axis with extent ext, if we receive a slice of (start, stop, step), we normalize it to (norm_start, norm_stop, step), where

norm_start = max(0, start < 0 ? start + ext : start);
norm_stop = max(norm_start, min(ext, stop < 0 ? stop + ext : stop));
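
As a standalone illustration only (this is a host-side sketch, not the actual nvFuser implementation, which builds these as symbolic IR expressions; the helper name normalizeSlice is hypothetical), the normalization could be written as:

#include <algorithm>
#include <cstdint>
#include <utility>

// Normalize (start, stop) for an axis of extent ext, mimicking numpy/PyTorch
// slice semantics. A step of 1 is assumed. Negative indices wrap around once;
// start is then clamped below at 0, and stop is clamped from above by ext and
// from below by norm_start, so the slice never has negative length.
std::pair<int64_t, int64_t> normalizeSlice(int64_t start, int64_t stop, int64_t ext) {
  int64_t norm_start = std::max<int64_t>(0, start < 0 ? start + ext : start);
  int64_t norm_stop =
      std::max(norm_start, std::min(ext, stop < 0 ? stop + ext : stop));
  return {norm_start, norm_stop};
}

For example, normalizeSlice(-3, 100, 9) yields {6, 9}: the negative start wraps around to 6, the out-of-range stop is clipped to the extent 9, and the resulting axis has extent 3.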

Specific changes in this PR:

  • Form the above expressions in the slice op.
  • Add shmoo tests that exercise a variety of scenarios with both constant and input-provided slice ranges.

The simple Fusion in the input-range test prints as follows:

Inputs:
  T0_g[ iS0{9} ], float
  i3, nvfuser_index_t
  i4, nvfuser_index_t
Outputs:
  T1_g[ ?S2{( ( ( -( fmax(0, ( where(( i3 < 0 ), ( i3 + 9 ), i3) )) ) ) + 9 ) + ( ( fmax(( fmax(0, ( where(( i3 < 0 ), ( i3 + 9 ), i3) )) ), ( fmin(9, ( where(( i4 < 0 ), ( i4 + 9 ), i4) )) )) ) - 9 ) )}rf ], float

%kernel_math {
b7 = i3 < 0;
i5 = i3 + 9;
i9 = where(b7, i5, i3);
i11 = fmax(0, i9);
b15 = i4 < 0;
i13 = i4 + 9;
i17 = where(b15, i13, i4);
i19 = fmin(9, i17);
i21 = fmax(i11, i19);
T1_g[ ?S2{( ( ( -( fmax(0, ( where(( i3 < 0 ), ( i3 + 9 ), i3) )) ) ) + 9 ) + ( ( fmax(( fmax(0, ( where(( i3 < 0 ), ( i3 + 9 ), i3) )) ), ( fmin(9, ( where(( i4 < 0 ), ( i4 + 9 ), i4) )) )) ) - 9 ) )}rf ]
   = slice( T0_g[ iS0{9} ], { {i11, i21, 1} } )
}

T0_g[ iS0{9} ]
 root domain : (iS0{9})
 contiguity: f
 leaf domain : (iS0{9})
T1_g[ ?S2{( ( ( -( fmax(0, ( where(( i3 < 0 ), ( i3 + 9 ), i3) )) ) ) + 9 ) + ( ( fmax(( fmax(0, ( where(( i3 < 0 ), ( i3 + 9 ), i3) )) ), ( fmin(9, ( where(( i4 < 0 ), ( i4 + 9 ), i4) )) )) ) - 9 ) )}rf ]
 root domain : (iS1{9}rf)
  Resize: iS1{9}rf by ( -( fmax(0, ( where(( i3 < 0 ), ( i3 + 9 ), i3) )) ) ) and ( ( fmax(( fmax(0, ( where(( i3 < 0 ), ( i3 + 9 ), i3) )) ), ( fmin(9, ( where(( i4 < 0 ), ( i4 + 9 ), i4) )) )) ) - 9 ) -> ?S2{( ( ( -( fmax(0, ( where(( i3 < 0 ), ( i3 + 9 ), i3) )) ) ) + 9 ) + ( ( fmax(( fmax(0, ( where(( i3 < 0 ), ( i3 + 9 ), i3) )) ), ( fmin(9, ( where(( i4 < 0 ), ( i4 + 9 ), i4) )) )) ) - 9 ) )}rf
 rfactor domain : (?S2{( ( ( -( fmax(0, ( where(( i3 < 0 ), ( i3 + 9 ), i3) )) ) ) + 9 ) + ( ( fmax(( fmax(0, ( where(( i3 < 0 ), ( i3 + 9 ), i3) )) ), ( fmin(9, ( where(( i4 < 0 ), ( i4 + 9 ), i4) )) )) ) - 9 ) )}rf)
 contiguity: t
 leaf domain : (?S2{( ( ( -( fmax(0, ( where(( i3 < 0 ), ( i3 + 9 ), i3) )) ) ) + 9 ) + ( ( fmax(( fmax(0, ( where(( i3 < 0 ), ( i3 + 9 ), i3) )) ), ( fmin(9, ( where(( i4 < 0 ), ( i4 + 9 ), i4) )) )) ) - 9 ) )}rf)

resulting in the following CUDA kernel:

__global__ void kernel1(Tensor<float, 1, 1> T0, nvfuser_index_t i0, nvfuser_index_t i1, Tensor<float, 1, 1> T1) {
  nvfuser_index_t i2;
  i2 = i0 + 9;
  bool b3;
  b3 = i0 < 0;
  nvfuser_index_t i4;
  i4 = b3 ? i2 : i0;
  nvfuser_index_t i5;
  i5 = max(0, i4);
  nvfuser_index_t i6;
  i6 = i1 + 9;
  bool b7;
  b7 = i1 < 0;
  nvfuser_index_t i8;
  i8 = b7 ? i6 : i1;
  nvfuser_index_t i9;
  i9 = min(9, i8);
  nvfuser_index_t i10;
  i10 = max(i5, i9);
  nvfuser_index_t i11;
  i11 = (-i5) + i10;
  nvfuser_index_t i12;
  i12 = i5 * T0.alloc_stride[0];
  #pragma unroll 1
  for(nvfuser_index_t i13 = 0; i13 < i11; ++i13) {
    T1[i13]
       = T0[(i12 + (T0.alloc_stride[0] * i13))];
  }
}
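
In this kernel, i5 and i10 correspond to norm_start and norm_stop above, and i11 = i10 - i5 is the extent of the sliced output domain.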

This PR does NOT simplify these expressions for non-constant inputs. This can be done at concretization, which will be left for a follow-up PR.

Stacked on #892 and #895.

Fixes #439. Fixes #52.

@jacobhinkle (Collaborator, Author)

!build

@jacobhinkle jacobhinkle marked this pull request as ready for review June 6, 2023 23:35
@jacobhinkle jacobhinkle requested a review from naoyam June 6, 2023 23:35
This currently fails at lowering due to infinite recursion in
nvfuser::prove::lessEqual when trying to simplify index expressions for
index hoisting.
@jacobhinkle (Collaborator, Author)

Closing in favor of #511.

@zasdfgbnm zasdfgbnm deleted the slice_clip branch June 22, 2023 15:50
@jacobhinkle jacobhinkle restored the slice_clip branch September 11, 2023 14:18
@jacobhinkle jacobhinkle reopened this Sep 11, 2023
@jacobhinkle jacobhinkle marked this pull request as draft September 11, 2023 16:29

const int64_t slice_offset = 4;
- const std::vector<int64_t> shape({1024 * 1024});
+ const std::vector<int64_t> shape({1024L * 1024L});
Collaborator Author

Silencing clang-tidy

Comment on lines -898 to +899
- std::cout << "Fusion IR after pre-segmenter optimization passes:"
-           << std::endl;
+ debug() << "Fusion IR after pre-segmenter optimization passes:"
+         << std::endl;
Collaborator Author

Unrelated to this PR; I just noticed the wrong ostream being used in this debug dump.

@jacobhinkle jacobhinkle marked this pull request as ready for review September 26, 2023 13:09
@jacobhinkle (Collaborator, Author)

!build

@jacobhinkle (Collaborator, Author)

!build


// Test slice with a variety of constant ranges
TEST_F(NVFuserTest, FusionResizeSliceConstantShmoo_CUDA) {
for (auto [start, stop] : std::vector<std::pair<int64_t, int64_t>>(
Collaborator

Should we use the same set of slices as FusionResizeSliceInputShmoo_CUDA?

Collaborator Author

Yes, that's good now. The reason I didn't originally is just that it slows down the test a lot, since we need to recompile for each slice.

fe.compileFusion(&fusion);

auto t0 = at::randn(shape, options);
for (auto [start, stop] : std::vector<std::pair<int64_t, int64_t>>(
Collaborator

Should we pull this set of slices out of the test and reuse it for all three tests?

Collaborator Author

Done

@naoyam (Collaborator) left a comment

LGTM



Development

Successfully merging this pull request may close these issues.

Slice should clip to end of tensor
slice corner case: Index range at runtime that is beyond a tensor does not return a zero-element tensor

4 participants