Open
Description
When the shape of a linalg operation is smaller than or equal to the minimal tile size (32), the operation is left untouched. This is a problem because our GPU pipeline expects a for-loop (which later describes the launch grid) after the IterativeTilingAndFusion pass. If there is no loop, the pipeline breaks.
For stability, I would expect such operations to be wrapped in a single-iteration for-loop so that the pipeline works even in these corner cases. For example, given this input:
func.func @linalg_matmul(%arg0: tensor<32x32xf16>, %arg1: tensor<32x32xf16>,
                         %arg2: tensor<32x32xf16>) -> tensor<32x32xf16> {
  %0 = linalg.matmul ins(%arg0, %arg1 : tensor<32x32xf16>, tensor<32x32xf16>)
                     outs(%arg2 : tensor<32x32xf16>) -> tensor<32x32xf16>
  return %0 : tensor<32x32xf16>
}
// Expected output (a tiling for loop consisting of one iteration):
func.func @linalg_matmul(%arg0: tensor<32x32xf16>, %arg1: tensor<32x32xf16>, %arg2: tensor<32x32xf16>) -> tensor<32x32xf16> {
  %0 = scf.forall (%arg3, %arg4) = (0, 0) to (32, 32) step (32, 32) shared_outs(%arg5 = %arg2) -> (tensor<32x32xf16>) {
    %extracted_slice = tensor.extract_slice %arg0[%arg3, 0] [32, 32] [1, 1] : tensor<32x32xf16> to tensor<32x32xf16>
    %extracted_slice_0 = tensor.extract_slice %arg1[0, %arg4] [32, 32] [1, 1] : tensor<32x32xf16> to tensor<32x32xf16>
    %extracted_slice_1 = tensor.extract_slice %arg5[%arg3, %arg4] [32, 32] [1, 1] : tensor<32x32xf16> to tensor<32x32xf16>
    %1 = linalg.matmul ins(%extracted_slice, %extracted_slice_0 : tensor<32x32xf16>, tensor<32x32xf16>) outs(%extracted_slice_1 : tensor<32x32xf16>) -> tensor<32x32xf16>
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %1 into %arg5[%arg3, %arg4] [32, 32] [1, 1] : tensor<32x32xf16> into tensor<32x32xf16>
    }
  }
  return %0 : tensor<32x32xf16>
}
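The requested behavior can be sketched as a small Python model of the loop-bound computation. This is only an illustration, not the actual IterativeTilingAndFusion implementation: the names `tiling_loop_bounds` and `trip_counts` are hypothetical, and clamping the step to the dimension size for shapes below 32 is an assumption about what the pass could do.

```python
MIN_TILE_SIZE = 32  # minimal tile size mentioned above


def tiling_loop_bounds(shape, tile_size=MIN_TILE_SIZE):
    """Return one (lower, upper, step) triple per dimension.

    Assumption: the step is clamped to the dimension size, so a shape that is
    smaller than or equal to the tile size still yields a loop with exactly
    one iteration instead of no loop at all.
    """
    bounds = []
    for dim in shape:
        if dim <= 0:
            raise ValueError("dimensions must be positive")
        step = min(tile_size, dim)
        bounds.append((0, dim, step))
    return bounds


def trip_counts(bounds):
    """Number of iterations each (lower, upper, step) loop executes."""
    return [len(range(lower, upper, step)) for lower, upper, step in bounds]
```

With this model, the 32x32 matmul above gets bounds `(0, 32, 32)` in both dimensions, which matches the `scf.forall` in the expected output: one iteration per dimension, so the launch grid can still be derived from the loop.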
P.S. This is not critical, since real-life workloads are unlikely to contain ops with such small shapes.