
[codegen][gpu] Use HWFC filter layout for SDXL convolution Ops #19701

Closed
jerryyin opened this issue Jan 14, 2025 · 9 comments
Assignees
Labels
codegen/rocm ROCm code generation compiler backend (HIP/HSA)

Comments

@jerryyin
Member

Context

For NHWC convolution, we want the filter to use the hwfc layout, making the convolution equivalent to a linalg.conv_2d_nhwc_hwfc op. As of right now there is no such named op, so it will need a linalg.generic representation. The reason we want this layout is so that we can load nice contiguous vectors along the inner reduction dimension (the C dimension).
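As an illustration (plain NumPy, not IREE code), the hwfc convolution is just the hwcf convolution with the filter's last two dims swapped, so the reduced C dim becomes innermost:

```python
import numpy as np

# Plain-NumPy sketch (not IREE code) of the conv semantics discussed here.
# conv_2d_nhwc_hwcf: filter laid out as (kh, kw, C, F).
def conv_nhwc_hwcf(x, w):
    N, Hp, Wp, C = x.shape           # x: (N, H + kh - 1, W + kw - 1, C)
    kh, kw, _, F = w.shape           # w: (kh, kw, C, F)
    H, W = Hp - kh + 1, Wp - kw + 1
    out = np.zeros((N, H, W, F))
    for i in range(kh):              # unit stride, no dilation
        for j in range(kw):
            out += np.einsum('nhwc,cf->nhwf', x[:, i:i+H, j:j+W, :], w[i, j])
    return out

# The proposed hwfc form: filter as (kh, kw, F, C), i.e. the hwcf filter
# with its last two dims swapped, putting the reduced C dim innermost.
def conv_nhwc_hwfc(x, w_hwfc):
    return conv_nhwc_hwcf(x, np.transpose(w_hwfc, (0, 1, 3, 2)))
```

Transposing the hwcf filter to hwfc and feeding it to `conv_nhwc_hwfc` reproduces the original result, which is the invariant the new conversion has to preserve.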

Implementation

The conversion that needs to be supported is nhwc_hwcf -> nhwc_hwfc. The other nhwc convolution op (nhwc_fhwc) doesn't need to be supported for now, since the sdxl-scripts repo <alibaba_fp16> branch dumped results include only conv_2d_nhwc_hwcf ops (taking configured_compiled_unet_run_forward$async_dispatch_151.mlir as an example).

The implementation should mimic the existing pass ConvertConvFilterToChannelsLast.cpp, but be simpler, since only a single input needs to be considered now.

Output

  • Manually write a linalg.conv_2d_nhwc_hwfc op in linalg.generic form
  • Pass along the output to @nirvedhmeshram as his input for further img2col pipeline processing

Thanks to @Max191 for assisting with the scope of the task.

@jerryyin jerryyin self-assigned this Jan 14, 2025
@jerryyin jerryyin added the codegen/rocm ROCm code generation compiler backend (HIP/HSA) label Jan 14, 2025
@jerryyin jerryyin moved this to In Progress in CPU Codegen Tracker Jan 14, 2025
@nirvedhmeshram
Contributor

nirvedhmeshram commented Jan 14, 2025

Just to give an update on the possible output of this task (and input for the im2col pipeline): I used this conv op

func.func @conv(%arg0: tensor<2x130x130x16xf16>, %arg1: tensor<3x3x16x320xf16>, 
%arg2: tensor<2x128x128x320xf32>)
    -> tensor<2x128x128x320xf32> {
  %conv0 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>}
             ins(%arg0, %arg1 : tensor<2x130x130x16xf16>, tensor<3x3x16x320xf16>)
             outs(%arg2 : tensor<2x128x128x320xf32>) -> tensor<2x128x128x320xf32>
  return %conv0 : tensor<2x128x128x320xf32>
}

and used

iree-opt  --pass-pipeline="builtin.module(func.func(iree-codegen-gpu-generalize-named-ops))" test_conv.mlir

(with minor changes to the pass)
and got

#map = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d1 + d4, d2 + d5, d6)>
// (N,H,W,F,kh,kw,C) -> (N, H + kh, W + kw, C)
#map1 = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d4, d5, d6, d3)>
// (N,H,W,F,kh,kw,C) -> (kh, kw, C, F)
#map2 = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2, d3)>
// (N,H,W,F,kh,kw,C) -> (N, H, W, F)
module {
  func.func @conv(%arg0: tensor<2x130x130x16xf16>, %arg1: tensor<3x3x16x320xf16>, 
%arg2: tensor<2x128x128x320xf32>) -> tensor<2x128x128x320xf32> {
    %0 = linalg.generic {indexing_maps = [#map, #map1, #map2], 
iterator_types = ["parallel", "parallel", "parallel", "parallel", "reduction", "reduction", "reduction"]} 
ins(%arg0, %arg1 : tensor<2x130x130x16xf16>, tensor<3x3x16x320xf16>) 
outs(%arg2 : tensor<2x128x128x320xf32>) {
    ^bb0(%in: f16, %in_0: f16, %out: f32):
      %1 = arith.extf %in : f16 to f32
      %2 = arith.extf %in_0 : f16 to f32
      %3 = arith.mulf %1, %2 : f32
      %4 = arith.addf %out, %3 : f32
      linalg.yield %4 : f32
    } -> tensor<2x128x128x320xf32>
    return %0 : tensor<2x128x128x320xf32>
  }
}

Then manually modified it for the desired layout to

#map = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d1 + d4, d2 + d5, d6)>
#map1 = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d4, d5, d3, d6)>
#map2 = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2, d3)>
module {
  func.func @conv(%arg0: tensor<2x130x130x16xf16>, %arg1: tensor<3x3x320x16xf16>, 
%arg2: tensor<2x128x128x320xf32>) -> tensor<2x128x128x320xf32> {
    %0 = linalg.generic {indexing_maps = [#map, #map1, #map2], 
iterator_types = ["parallel", "parallel", "parallel", "parallel", "reduction", "reduction", "reduction"]} 
ins(%arg0, %arg1 : tensor<2x130x130x16xf16>, tensor<3x3x320x16xf16>) 
outs(%arg2 : tensor<2x128x128x320xf32>) {
    ^bb0(%in: f16, %in_0: f16, %out: f32):
      %1 = arith.extf %in : f16 to f32
      %2 = arith.extf %in_0 : f16 to f32
      %3 = arith.mulf %1, %2 : f32
      %4 = arith.addf %out, %3 : f32
      linalg.yield %4 : f32
    } -> tensor<2x128x128x320xf32>
    return %0 : tensor<2x128x128x320xf32>
  }
}

IR1 goes to IGEMM tileandfuse, while IR2 and IR3 currently go to vectordistribute, which also targets intrinsics for them. They all give ~0.106 ms in benchmarking, but I am assuming we plan to do something better for IR3 in IGEMM tileandfuse.

@jerryyin
Member Author

Thanks for leaving the details regarding the hwfc filter layout, Nirvedh! I was planning to do the same and then saw your comment.

The changes in IR3 look reasonable: basically swapping the filter's last two dimensions, as well as the affine map, (d4, d5, d6, d3) -> (d4, d5, d3, d6).
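A quick NumPy spot-check (illustration only, with arbitrary indices) that the dim swap plus the indexing-map swap read the same filter element:

```python
import numpy as np

# Spot-check that transposing the filter's last two dims together with the
# indexing-map change (d4, d5, d6, d3) -> (d4, d5, d3, d6) accesses the
# same scalar, so the linalg.generic computes the same result.
w_hwcf = np.arange(3 * 3 * 16 * 320).reshape(3, 3, 16, 320)  # (kh, kw, C, F)
w_hwfc = np.transpose(w_hwcf, (0, 1, 3, 2))                  # (kh, kw, F, C)

d3, d4, d5, d6 = 7, 1, 2, 5   # arbitrary (F, kh, kw, C) iteration indices
assert w_hwcf[d4, d5, d6, d3] == w_hwfc[d4, d5, d3, d6]
```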

@MaheshRavishankar
Contributor

MaheshRavishankar commented Jan 15, 2025

Do we want to do hwfc or hwcf? cc @qedawkins

@qedawkins
Contributor

hwfc or fhwc so that the reduced dimensions are inner most (a la matmul_transpose_b). hwcf is what we do today.
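A NumPy sketch (illustration, not IREE code) of why either of those layouts helps: with a contiguous filter buffer, hwcf puts F at unit stride, while hwfc and fhwc both put the reduced C dim at unit stride:

```python
import numpy as np

# Sketch (not IREE code): with a contiguous filter buffer, hwcf keeps the
# reduced C dim strided, while hwfc and fhwc both make C unit-stride,
# which is what enables contiguous vector loads along the reduction.
kh, kw, C, F = 3, 3, 16, 320
elt = np.dtype(np.float16).itemsize

w_hwcf = np.zeros((kh, kw, C, F), dtype=np.float16)
w_hwfc = np.ascontiguousarray(w_hwcf.transpose(0, 1, 3, 2))  # (kh, kw, F, C)
w_fhwc = np.ascontiguousarray(w_hwcf.transpose(3, 0, 1, 2))  # (F, kh, kw, C)

assert w_hwcf.strides[2] == F * elt  # C is strided in hwcf (bad for vector loads)
assert w_hwfc.strides[3] == elt      # C is unit-stride in hwfc
assert w_fhwc.strides[3] == elt      # C is unit-stride in fhwc
```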

@jerryyin
Member Author

jerryyin commented Jan 15, 2025

Talked with @Max191 about this; he indicated we should support both hwfc and fhwc and experiment.

For hwfc we don't have a named op, so I'm populating a linalg.generic as @nirvedhmeshram indicated above.
fhwc is more convenient: I can potentially reuse the pre-existing named ops for the conversion, without having to populate a linalg.generic.

The plan as of now is to get one flavor implemented, test it with SDXL, and make sure the transpose can be const-folded, then implement and check the other layout.

@MaheshRavishankar
Contributor

Could we count this as done?

@jerryyin
Member Author

In terms of implementation, the pass is done. But I'd like to use this issue to also track:

With the two PRs above, the pass will be exposed end to end as we expected.

@Max191
Contributor

Max191 commented Feb 25, 2025

We may also still want to investigate the HWFC layout a bit more. We landed on the FHWC layout because we found no notable performance difference (so we went with the layout that had connected reduction dims), but our measurements were crude, and I think we need to investigate further. We can start with an iree-kernel-benchmark run, but even that may not tell the full story because there could be other differences in codegen related to having separated HW and C dims in the filter.

I think once we land the patches that Jerry listed above, we can close this issue, but we should open another issue tracking the performance comparison for the 2 layout options.

@jerryyin
Member Author
