
[codegen][gpu] Use HWFC filter layout for SDXL convolution Ops #19701

Closed
jerryyin opened this issue Jan 14, 2025 · 9 comments
Assignees
Labels
codegen/rocm ROCm code generation compiler backend (HIP/HSA)

Comments

@jerryyin
Member

Context

For NHWC convolution, we want the filter to use the hwfc layout, making the convolution equivalent to a linalg.conv_2d_nhwc_hwfc op. As of right now there is no such named op, so it will need a linalg.generic representation. The reason we want this layout is so that we can load nice contiguous vectors along the inner reduction dimension (the C dimension).
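As an illustration (plain NumPy, not IREE code), the hwfc convolution is just the hwcf convolution with the filter's last two dims swapped, so the reduced C dim becomes innermost:

```python
import numpy as np

# Plain-NumPy sketch (not IREE code) of the conv semantics discussed here.
# conv_2d_nhwc_hwcf: filter laid out as (kh, kw, C, F).
def conv_nhwc_hwcf(x, w):
    N, Hp, Wp, C = x.shape           # x: (N, H + kh - 1, W + kw - 1, C)
    kh, kw, _, F = w.shape           # w: (kh, kw, C, F)
    H, W = Hp - kh + 1, Wp - kw + 1
    out = np.zeros((N, H, W, F))
    for i in range(kh):              # unit stride, no dilation
        for j in range(kw):
            out += np.einsum('nhwc,cf->nhwf', x[:, i:i+H, j:j+W, :], w[i, j])
    return out

# The proposed hwfc form: filter as (kh, kw, F, C), i.e. the hwcf filter
# with its last two dims swapped, putting the reduced C dim innermost.
def conv_nhwc_hwfc(x, w_hwfc):
    return conv_nhwc_hwcf(x, np.transpose(w_hwfc, (0, 1, 3, 2)))
```

Transposing the hwcf filter to hwfc and feeding it to `conv_nhwc_hwfc` reproduces the original result, which is the invariant the new conversion has to preserve.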

Implementation

The conversion that needs to be supported is nhwc_hwcf -> nhwc_hwfc. The other nhwc convolution op (nhwc_fhwc) doesn't need to be supported for now, since the sdxl-scripts repo <alibaba_fp16> branch dumped results include only conv_2d_nhwc_hwcf ops (taking configured_compiled_unet_run_forward$async_dispatch_151.mlir as an example).

The implementation should mimic the existing pass ConvertConvFilterToChannelsLast.cpp, but be simpler, since only a single input needs to be considered now.

Output

  • Manually write a linalg.conv_2d_nhwc_hwfc op in linalg.generic form
  • Pass along the output to @nirvedhmeshram as his input for further img2col pipeline processing

Thanks to @Max191 for assisting with the scope of the task.

@jerryyin jerryyin self-assigned this Jan 14, 2025
@jerryyin jerryyin added the codegen/rocm ROCm code generation compiler backend (HIP/HSA) label Jan 14, 2025
@jerryyin jerryyin moved this to In Progress in CPU Codegen Tracker Jan 14, 2025
@nirvedhmeshram
Contributor

nirvedhmeshram commented Jan 14, 2025

Just to give an update on the possible output of this task (and input for the im2col pipeline): I used this conv op

func.func @conv(%arg0: tensor<2x130x130x16xf16>, %arg1: tensor<3x3x16x320xf16>, 
%arg2: tensor<2x128x128x320xf32>)
    -> tensor<2x128x128x320xf32> {
  %conv0 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>}
             ins(%arg0, %arg1 : tensor<2x130x130x16xf16>, tensor<3x3x16x320xf16>)
             outs(%arg2 : tensor<2x128x128x320xf32>) -> tensor<2x128x128x320xf32>
  return %conv0 : tensor<2x128x128x320xf32>
}

and used

iree-opt  --pass-pipeline="builtin.module(func.func(iree-codegen-gpu-generalize-named-ops))" test_conv.mlir

(with minor changes to the pass)
and got

#map = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d1 + d4, d2 + d5, d6)>
// (N,H,W,F,kh,kw,C) -> (N, H + kh, W + kw, C)
#map1 = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d4, d5, d6, d3)>
// (N,H,W,F,kh,kw,C) -> (kh, kw, C, F)
#map2 = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2, d3)>
// (N,H,W,F,kh,kw,C) -> (N, H, W, F)
module {
  func.func @conv(%arg0: tensor<2x130x130x16xf16>, %arg1: tensor<3x3x16x320xf16>, 
%arg2: tensor<2x128x128x320xf32>) -> tensor<2x128x128x320xf32> {
    %0 = linalg.generic {indexing_maps = [#map, #map1, #map2], 
iterator_types = ["parallel", "parallel", "parallel", "parallel", "reduction", "reduction", "reduction"]} 
ins(%arg0, %arg1 : tensor<2x130x130x16xf16>, tensor<3x3x16x320xf16>) 
outs(%arg2 : tensor<2x128x128x320xf32>) {
    ^bb0(%in: f16, %in_0: f16, %out: f32):
      %1 = arith.extf %in : f16 to f32
      %2 = arith.extf %in_0 : f16 to f32
      %3 = arith.mulf %1, %2 : f32
      %4 = arith.addf %out, %3 : f32
      linalg.yield %4 : f32
    } -> tensor<2x128x128x320xf32>
    return %0 : tensor<2x128x128x320xf32>
  }
}

Then manually modified it for the desired layout to

#map = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d1 + d4, d2 + d5, d6)>
#map1 = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d4, d5, d3, d6)>
#map2 = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2, d3)>
module {
  func.func @conv(%arg0: tensor<2x130x130x16xf16>, %arg1: tensor<3x3x320x16xf16>, 
%arg2: tensor<2x128x128x320xf32>) -> tensor<2x128x128x320xf32> {
    %0 = linalg.generic {indexing_maps = [#map, #map1, #map2], 
iterator_types = ["parallel", "parallel", "parallel", "parallel", "reduction", "reduction", "reduction"]} 
ins(%arg0, %arg1 : tensor<2x130x130x16xf16>, tensor<3x3x320x16xf16>) 
outs(%arg2 : tensor<2x128x128x320xf32>) {
    ^bb0(%in: f16, %in_0: f16, %out: f32):
      %1 = arith.extf %in : f16 to f32
      %2 = arith.extf %in_0 : f16 to f32
      %3 = arith.mulf %1, %2 : f32
      %4 = arith.addf %out, %3 : f32
      linalg.yield %4 : f32
    } -> tensor<2x128x128x320xf32>
    return %0 : tensor<2x128x128x320xf32>
  }
}

IR1 goes to IGEMM tileandfuse, while IR2 and IR3 currently go to vectordistribute, which also targets intrinsics for them. They all give ~0.106 ms in benchmarking, but I am assuming we plan to do something better for IR3 in IGEMM tileandfuse.

@jerryyin
Member Author

Thanks for leaving the details regarding the hwfc filter layout, Nirvedh! I was planning to do the same and then saw your comment.

The changes in IR3 look reasonable: basically swapping the filter's last two dimensions, as well as the affine map, (d4, d5, d6, d3) -> (d4, d5, d3, d6).
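A quick NumPy spot-check (illustration only, with arbitrary indices) that the dim swap plus the indexing-map swap read the same filter element:

```python
import numpy as np

# Spot-check that transposing the filter's last two dims together with the
# indexing-map change (d4, d5, d6, d3) -> (d4, d5, d3, d6) accesses the
# same scalar, so the linalg.generic computes the same result.
w_hwcf = np.arange(3 * 3 * 16 * 320).reshape(3, 3, 16, 320)  # (kh, kw, C, F)
w_hwfc = np.transpose(w_hwcf, (0, 1, 3, 2))                  # (kh, kw, F, C)

d3, d4, d5, d6 = 7, 1, 2, 5   # arbitrary (F, kh, kw, C) iteration indices
assert w_hwcf[d4, d5, d6, d3] == w_hwfc[d4, d5, d3, d6]
```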

@MaheshRavishankar
Contributor

MaheshRavishankar commented Jan 15, 2025

Do we want to do hwfc or hwcf? cc @qedawkins

@qedawkins
Contributor

hwfc or fhwc so that the reduced dimensions are inner most (a la matmul_transpose_b). hwcf is what we do today.
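A NumPy sketch (illustration, not IREE code) of why either of those layouts helps: with a contiguous filter buffer, hwcf puts F at unit stride, while hwfc and fhwc both put the reduced C dim at unit stride:

```python
import numpy as np

# Sketch (not IREE code): with a contiguous filter buffer, hwcf keeps the
# reduced C dim strided, while hwfc and fhwc both make C unit-stride,
# which is what enables contiguous vector loads along the reduction.
kh, kw, C, F = 3, 3, 16, 320
elt = np.dtype(np.float16).itemsize

w_hwcf = np.zeros((kh, kw, C, F), dtype=np.float16)
w_hwfc = np.ascontiguousarray(w_hwcf.transpose(0, 1, 3, 2))  # (kh, kw, F, C)
w_fhwc = np.ascontiguousarray(w_hwcf.transpose(3, 0, 1, 2))  # (F, kh, kw, C)

assert w_hwcf.strides[2] == F * elt  # C is strided in hwcf (bad for vector loads)
assert w_hwfc.strides[3] == elt      # C is unit-stride in hwfc
assert w_fhwc.strides[3] == elt      # C is unit-stride in fhwc
```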

@jerryyin
Member Author

jerryyin commented Jan 15, 2025

Talked with @Max191 about this; he indicated we should support both hwfc and fhwc and experiment.

For hwfc we don't have a named op, so I'm populating a linalg.generic as @nirvedhmeshram indicated above.
fhwc is more convenient: I can potentially reuse the pre-existing named ops for the conversion, without having to populate a linalg.generic.

The plan as of now is to get one flavor implemented, test it with SDXL, and make sure the transpose can be const-folded, then implement and check the other layout.

@MaheshRavishankar
Contributor

Could we count this as done?

@jerryyin
Member Author

In terms of implementation, the pass is done. But I'd like to use this issue to also track:

With the two PRs above, the pass will be exposed end to end as we expected.

@Max191
Contributor

Max191 commented Feb 25, 2025

We may also still want to investigate the HWFC layout a bit more. We landed on the FHWC layout because we found no notable performance difference (so we went with the layout that had connected reduction dims), but our measurements were crude, and I think we need to investigate further. We can start with an iree-kernel-benchmark run, but even that may not tell the full story because there could be other differences in codegen related to having separated HW and C dims in the filter.

I think once we land the patches that Jerry listed above, we can close this issue, but we should open another issue tracking the performance comparison for the 2 layout options.

@jerryyin
Member Author
