
[WIP][Kernels] Contiguous Group GeMM #1036


Open · wants to merge 26 commits into main

Conversation

@lessw2020 (Contributor) commented Mar 31, 2025

This PR adds a 'contiguous' grouped GEMM with dynamic input and dynamic expert support. It is similar in spirit to DeepSeek's contiguous grouped GEMM, where inputs must be aligned to a multiple of group_size_m along the token dimension and padded if they do not meet that alignment.
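To make that layout concrete, here is a minimal sketch of the alignment step, assuming a flat token tensor plus a per-token expert index from the router; the helper name `pad_to_contiguous_blocks` and the zero-padding policy are illustrative assumptions, not the PR's actual code:

```python
import torch

def pad_to_contiguous_blocks(tokens, expert_indices, num_experts, group_size_m=128):
    """Sort tokens by expert, then zero-pad each expert's slice up to a
    multiple of group_size_m so every block holds exactly one expert.
    Illustrative sketch only -- name and padding policy are assumptions."""
    order = torch.argsort(expert_indices, stable=True)
    tokens, expert_indices = tokens[order], expert_indices[order]

    blocks, block_experts = [], []
    for e in range(num_experts):
        chunk = tokens[expert_indices == e]
        if chunk.shape[0] == 0:
            continue
        pad = (-chunk.shape[0]) % group_size_m  # rows needed to reach alignment
        if pad:
            chunk = torch.cat([chunk, chunk.new_zeros(pad, chunk.shape[1])])
        blocks.append(chunk)
        block_experts.append(torch.full((chunk.shape[0],), e,
                                        dtype=expert_indices.dtype,
                                        device=expert_indices.device))
    return torch.cat(blocks), torch.cat(block_experts)
```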

1 - Forward and backward passes are both working. See cg_forward.py and cg_backward.py:

Forward:

```
Performance Results:
  Dimensions: 32x1024x4096 -> 7168
  Triton:  5.17 ms (372.40 TFLOPS)
  PyTorch: 15.85 ms (121.38 TFLOPS)
  Speedup: 3.07x

Paper table format:
8    256    128    4096    7168    372 TFLOPS    5 ms    3.1x

Overall test result: All tests passed!
```

```
Verifying backward pass correctness...
Outputs match: True
Input gradients match: True
Weight gradients match: True

All gradients match! Running performance benchmark...
PyTorch backward time: 1.24 ms
Triton backward time: 0.25 ms
Speedup: 4.99x
```
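For context on what those three checks cover: under the one-expert-per-block layout, the backward for each block reduces to two matmuls. A hedged PyTorch illustration (the function name is assumed; the actual Triton kernel lives in cg_backward.py):

```python
import torch

def reference_cg_backward(inputs, expert_weights, expert_indices, grad_out,
                          group_size_m=128):
    """Per-block gradients: dX = dY @ W_e, and dW_e accumulates dY.T @ X
    over that expert's blocks. Illustrative reference, not the kernel."""
    grad_inputs = torch.zeros_like(inputs)
    grad_weights = torch.zeros_like(expert_weights)
    for start in range(0, inputs.shape[0], group_size_m):
        end = start + group_size_m
        e = int(expert_indices[start])  # the whole block shares one expert
        grad_inputs[start:end] = grad_out[start:end] @ expert_weights[e]
        grad_weights[e] += grad_out[start:end].T @ inputs[start:end]
    return grad_inputs, grad_weights
```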

2 - demo.py and full_moe_e2e put all the pieces together into an end-to-end implementation.
3 - cg_reference.py provides the PyTorch reference implementation used for equivalence checks; a sketch of the idea follows.
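As a sketch of what that reference equivalence can look like under the contiguous-block layout (the function name here is illustrative; cg_reference.py is the authoritative version):

```python
import torch

def reference_cg_forward(inputs, expert_weights, expert_indices, group_size_m=128):
    """One matmul per contiguous block; each block is owned by a single expert."""
    M_total, _ = inputs.shape
    _, N, _ = expert_weights.shape  # expert_weights: [num_experts, N, K]
    out = inputs.new_empty(M_total, N)
    for start in range(0, M_total, group_size_m):
        end = start + group_size_m
        e = int(expert_indices[start])  # block-level expert assignment
        out[start:end] = inputs[start:end] @ expert_weights[e].T
    return out
```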

Usage:
Input tokens must be arranged one expert per block, where a block is group_size_m tokens; no mixing of experts within a block is allowed. The forward pass is documented as follows, with a usage sketch after the docstring:

```
Contiguous grouped GEMM forward pass for MoE.
All tokens mapped to the same expert must be in contiguous blocks of size group_size_m.

Args:
    inputs: Input tensor of shape [M_total, K]
    expert_weights: Expert weight tensor of shape [num_experts, N, K]
    expert_indices: Indices tensor of shape [M_total] mapping each token to its expert
    group_size_m: Size of contiguous token blocks for each expert (default: 128)

Returns:
    Output tensor of shape [M_total, N]
```