
[WIP][Kernels] Contiguous Group GeMM #1036


Open · wants to merge 26 commits into main

Conversation

@lessw2020 (Contributor) commented Mar 31, 2025

This PR adds a 'contiguous' grouped GEMM with dynamic input and dynamic expert support. It is similar in spirit to DeepSeek's contiguous grouped GEMM, where inputs must be aligned to a multiple of group_size_m along the token dimension and padded if they do not meet that alignment.
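To make that layout concrete, here is a minimal sketch of the alignment step, assuming a flat token tensor plus a per-token expert index from the router; the helper name `pad_to_contiguous_blocks` and the zero-padding policy are illustrative assumptions, not the PR's actual code:

```python
import torch

def pad_to_contiguous_blocks(tokens, expert_indices, num_experts, group_size_m=128):
    """Sort tokens by expert, then zero-pad each expert's slice up to a
    multiple of group_size_m so every block holds exactly one expert.
    Illustrative sketch only -- name and padding policy are assumptions."""
    order = torch.argsort(expert_indices, stable=True)
    tokens, expert_indices = tokens[order], expert_indices[order]

    blocks, block_experts = [], []
    for e in range(num_experts):
        chunk = tokens[expert_indices == e]
        if chunk.shape[0] == 0:
            continue
        pad = (-chunk.shape[0]) % group_size_m  # rows needed to reach alignment
        if pad:
            chunk = torch.cat([chunk, chunk.new_zeros(pad, chunk.shape[1])])
        blocks.append(chunk)
        block_experts.append(torch.full((chunk.shape[0],), e,
                                        dtype=expert_indices.dtype,
                                        device=expert_indices.device))
    return torch.cat(blocks), torch.cat(block_experts)
```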

1 - Forward and backward passes are both working. See cg_forward.py and cg_backward.py:

Forward:

```
Performance Results:
  Dimensions: 32x1024x4096 -> 7168
  Triton:  5.17 ms (372.40 TFLOPS)
  PyTorch: 15.85 ms (121.38 TFLOPS)
  Speedup: 3.07x

Paper table format:
8    256    128    4096    7168    372 TFLOPS    5 ms    3.1x

Overall test result: All tests passed!
```

```
Verifying backward pass correctness...
Outputs match: True
Input gradients match: True
Weight gradients match: True

All gradients match! Running performance benchmark...
PyTorch backward time: 1.24 ms
Triton backward time: 0.25 ms
Speedup: 4.99x
```
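For context on what those three checks cover: under the one-expert-per-block layout, the backward for each block reduces to two matmuls. A hedged PyTorch illustration (the function name is assumed; the actual Triton kernel lives in cg_backward.py):

```python
import torch

def reference_cg_backward(inputs, expert_weights, expert_indices, grad_out,
                          group_size_m=128):
    """Per-block gradients: dX = dY @ W_e, and dW_e accumulates dY.T @ X
    over that expert's blocks. Illustrative reference, not the kernel."""
    grad_inputs = torch.zeros_like(inputs)
    grad_weights = torch.zeros_like(expert_weights)
    for start in range(0, inputs.shape[0], group_size_m):
        end = start + group_size_m
        e = int(expert_indices[start])  # the whole block shares one expert
        grad_inputs[start:end] = grad_out[start:end] @ expert_weights[e]
        grad_weights[e] += grad_out[start:end].T @ inputs[start:end]
    return grad_inputs, grad_weights
```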

2 - demo.py and full_moe_e2e put all the pieces together into an end-to-end implementation.
3 - cg_reference.py provides the PyTorch reference implementation used for equivalence checks; a sketch of the idea follows.
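As a sketch of what that reference equivalence can look like under the contiguous-block layout (the function name here is illustrative; cg_reference.py is the authoritative version):

```python
import torch

def reference_cg_forward(inputs, expert_weights, expert_indices, group_size_m=128):
    """One matmul per contiguous block; each block is owned by a single expert."""
    M_total, _ = inputs.shape
    _, N, _ = expert_weights.shape  # expert_weights: [num_experts, N, K]
    out = inputs.new_empty(M_total, N)
    for start in range(0, M_total, group_size_m):
        end = start + group_size_m
        e = int(expert_indices[start])  # block-level expert assignment
        out[start:end] = inputs[start:end] @ expert_weights[e].T
    return out
```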

Usage:
Input tokens must be arranged one expert per block, where a block is group_size_m tokens; no mixing of experts within a block is allowed. The forward pass is documented as follows, with a usage sketch after the docstring:

```
Contiguous grouped GEMM forward pass for MoE.
All tokens mapped to the same expert must be in contiguous blocks of size group_size_m.

Args:
    inputs: Input tensor of shape [M_total, K]
    expert_weights: Expert weight tensor of shape [num_experts, N, K]
    expert_indices: Indices tensor of shape [M_total] mapping each token to its expert
    group_size_m: Size of contiguous token blocks for each expert (default: 128)

Returns:
    Output tensor of shape [M_total, N]
```