[CK_TILE] Add universal gemm mem skip A/B LDS pipelines #2056

Open · wants to merge 11 commits into develop
Conversation

@jakpiase (Contributor) commented Apr 5, 2025

Proposed changes

[CK_TILE] Add universal gemm mem skip A/B LDS pipelines for tall and skinny gemms.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which helps the maintainers understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

@aosewski (Collaborator) left a comment:

Good start! But I think we should work more on reusing existing code. Additionally, I think the skip-A/B-LDS functionality should be controlled (turned on/off) from the policy. It would be better to have a single pipeline (mem, in this case) that could be configured by the policy to use LDS for either input, both, or neither.
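
A minimal sketch (all names assumed, not actual ck_tile API) of what such a single policy-driven pipeline could look like: the policy carries compile-time SkipALds/SkipBLds flags and the pipeline branches with if constexpr, so only one load path is ever instantiated. load_tile_direct and load_tile_via_lds are placeholders for the existing load helpers.

template <typename Policy>
struct UniversalGemmMemPipeline
{
    template <typename ADramWindow, typename SmemPtr>
    CK_TILE_DEVICE auto LoadA(const ADramWindow& a_window, SmemPtr p_smem) const
    {
        if constexpr(Policy::SkipALds)
        {
            // DRAM -> VGPR directly, bypassing LDS.
            return load_tile_direct(a_window);
        }
        else
        {
            // Classic path: DRAM -> LDS -> VGPR.
            return load_tile_via_lds(a_window, p_smem);
        }
    }
    // LoadB would follow the same pattern, keyed on Policy::SkipBLds.
};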

Comment on lines 156 to 158
template <typename ADramBlockWindowTmp, typename ALdsLoadTileDistr>
CK_TILE_DEVICE constexpr auto
GetADramWindowSkipLds(const ADramBlockWindowTmp& a_dram_block_window_tmp,

Collaborator:

Can you please get rid of this tmp suffix throughout this file? :)

namespace ck_tile {

template <typename Derived>
struct UniversalGemmSkipBLdsBasePolicy

Collaborator:

Can't you actually reuse this class here? You don't have to use all of its functionality.
Maybe you could even refactor it out to a separate file?

auto b_lds_block = Base::GetBLdsTensorView(p_smem);

// Tile distribution for load from lds
constexpr auto a_lds_load_tile_distr = decltype(make_static_tile_distribution(

Collaborator:

The variable name here is misleading, since you skip A LDS.


// LDS write 0
// TODO add a colmajor support
static_assert(is_a_col_major == false, "AColMajor not supported yet!");

Collaborator:

What's the problem with col-major? You can reuse the available logic for reading DRAM->VGPR and then transpose the tile if it's in col-major.
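
Illustrative only: the col-major case could reuse the existing row-major DRAM->VGPR load and then transpose the per-thread fragment in registers afterwards. The fragment shape and names below are assumptions, not the actual tile types.

template <typename T, int Rows, int Cols>
CK_TILE_DEVICE void transpose_in_registers(const T (&in)[Rows][Cols],
                                           T (&out)[Cols][Rows])
{
    for(int r = 0; r < Rows; ++r)
        for(int c = 0; c < Cols; ++c)
            out[c][r] = in[r][c]; // simple register-level transpose
}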

using ALdsTile = decltype(make_static_distributed_tensor<ComputeDataType>(ALdsTileDistr));
using BLdsTile = decltype(make_static_distributed_tensor<ComputeDataType>(BLdsTileDistr));

ALdsTile a_warp_tile_;

Collaborator:

In this case you actually don't even need this.

"The ADataType and BDataType as defined in "
"traits should be the same as correspoinding block window data type!");

a_warp_tile_.get_thread_buffer() = a_block_tensor.get_thread_buffer();

Collaborator:

You can just use a_block_tensor.

@aosewski (Collaborator):

And don't forget to update CHANGELOG.md.

@jakpiase requested a review from a team as a code owner May 5, 2025 18:10
@@ -16,6 +16,7 @@ Documentation for Composable Kernel available at [https://rocm.docs.amd.com/proj
* Added GEMM pipeline for microscaling (MX) data types
* Added support for FP16 2:4 structured sparsity to universal GEMM.
* Added support for Split K for grouped convolution backward data.
* Added support for skipping LDS to universal GEMM

Contributor:

Is this supposed to be "to"? Because this sounds like the LDS is being skipped to go straight to universal GEMM, which doesn't sound quite right.

Is it maybe supposed to be "for" or "in" or "with"?

As in support's been added for skipping LDS when using universal GEMM?

Collaborator:

Could be "for" or "in".

@aosewski (Collaborator) left a comment:

This is a step in the right direction, but I feel there's still a lot we can improve. I wonder about having just a single block gemm version named something like BlockUniversalGemmAxBxCr. From the pipeline problem you could get the information about skipping A/B LDS and leverage it in the local prefetching of A/B. Other than that, the core parts of the code would remain the same.
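
A hedged sketch of that suggestion: one BlockUniversalGemmAxBxCr whose local prefetch consults (assumed) SkipALds/SkipBLds flags from the pipeline problem. prefetch_from_dram and prefetch_from_lds are placeholders, not existing ck_tile functions.

template <typename Problem>
struct BlockUniversalGemmAxBxCr
{
    template <typename AWindow>
    CK_TILE_DEVICE void LocalPrefetchA(const AWindow& a_window)
    {
        if constexpr(Problem::Traits::SkipALds)
            prefetch_from_dram(a_window); // A stays in registers
        else
            prefetch_from_lds(a_window);  // A staged through LDS
        // B follows the same pattern; the MFMA hot loop stays identical.
    }
};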


Comment on lines 58 to 61
static constexpr bool TransposeC = TransposeC_;
static constexpr bool SkipALds = SkipALds_;
static constexpr bool SkipBLds = SkipBLds_;
static constexpr bool UseStructuredSparsity = UseStructuredSparsity_;

Collaborator:

Could you please add docs to all those members?

Comment on lines 29 to 32
static constexpr bool TransposeC = false;
static constexpr bool SkipALds = false;
static constexpr bool SkipBLds = false;
static constexpr bool UseStructuredSparsity = false;

Collaborator:

Could you please add docs to all those members?

Comment on lines +103 to +104
constexpr bool SkipALds = false;
constexpr bool SkipBLds = false;

Collaborator:

This should rather be parameterized in the tests, or you should create a separate test suite for that.
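
One way to parameterize the flags instead of hard-coding them: gtest typed tests over small config structs, so every flag combination gets its own test instantiation. The config structs and the harness call are invented for illustration.

#include <gtest/gtest.h>

template <bool SkipA, bool SkipB>
struct LdsConfig
{
    static constexpr bool SkipALds = SkipA;
    static constexpr bool SkipBLds = SkipB;
};

using LdsConfigs = ::testing::Types<LdsConfig<false, false>,
                                    LdsConfig<true, false>,
                                    LdsConfig<false, true>,
                                    LdsConfig<true, true>>;

template <typename Config>
class UniversalGemmLdsTest : public ::testing::Test
{
};
TYPED_TEST_SUITE(UniversalGemmLdsTest, LdsConfigs);

TYPED_TEST(UniversalGemmLdsTest, CoversEveryLdsCombination)
{
    // Pass the compile-time flags into the existing GEMM harness here,
    // e.g. RunGemm<TypeParam::SkipALds, TypeParam::SkipBLds>() (hypothetical).
    SUCCEED();
}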

return make_tile_window(a_dram_block_window.get_bottom_tensor_view(),
make_tuple(YPerTile{}, XPerTile{}),
a_dram_block_window.get_window_origin(),
ALdsLoadTileDistr{});

Collaborator:

This might not necessarily be the optimal solution. I imagine that regardless of whether we use LDS, we should read global memory with an optimal vectorized access pattern. Then, if needed, you would just adapt the data layout in registers.
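
A minimal HIP-flavoured sketch of that idea: always issue the widest vectorized global load, then fix up the layout in registers afterwards. The permutation below is purely illustrative, and src is assumed to be 16-byte aligned.

__device__ void load_then_adapt(const float* __restrict__ src, float (&dst)[4])
{
    const float4 v = *reinterpret_cast<const float4*>(src); // one 128-bit load
    // Adapt to the compute layout in registers (example permutation).
    dst[0] = v.x;
    dst[1] = v.z;
    dst[2] = v.y;
    dst[3] = v.w;
}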

}

// C = A * B
template <typename ARegBlockTensor, typename BSmemBlockWindow>

Collaborator:

Suggested change:
- template <typename ARegBlockTensor, typename BSmemBlockWindow>
+ template <typename ARegBlockTensor, typename BRegBlockWindow>

// C = A * B
template <typename ARegBlockTensor, typename BSmemBlockWindow>
CK_TILE_DEVICE auto operator()(const ARegBlockTensor& a_block_tensor,
const BSmemBlockWindow& b_block_window)

Collaborator:

Suggested change:
- const BSmemBlockWindow& b_block_window)
+ const BRegBlockWindow& b_block_window)

"traits should be the same as correspoinding block window data type!");

// hot loop:
static_for<0, GemmTraits::KIterPerWarp, 1>{}([&](auto kIter) {

Collaborator:

Suggested change:
- static_for<0, GemmTraits::KIterPerWarp, 1>{}([&](auto kIter) {
+ static_for<0, KIterPerWarp, 1>{}([&](auto kIter) {

};

template <typename GemmTraits>
struct BlockGemmImpl<GemmPipelineScheduler::Intrawave, GemmTraits>

Collaborator:

Looks like this one is entirely the same as the Default one now, so you can just derive from it.
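
Roughly what deriving could look like (names taken from the snippet above; the Default scheduler specialization is assumed to exist):

template <typename GemmTraits>
struct BlockGemmImpl<GemmPipelineScheduler::Intrawave, GemmTraits>
    : BlockGemmImpl<GemmPipelineScheduler::Default, GemmTraits>
{
    using Base = BlockGemmImpl<GemmPipelineScheduler::Default, GemmTraits>;
    using Base::operator(); // reuse the Default hot loop unchanged
};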


if constexpr(std::is_same_v<BDataType, pk_int4_t>)
{
load_interleaved_pk_type(b_warp_tile_, b_block_window);

Collaborator:

It looks like we have support for B preshuffle with the packed int4 data type only when we load B to LDS... I think we should be able to support this in all situations.
