How can we integrate the DeepGEMM FP8 GEMM implementation in TE's block-wise scaling? #1509

Open
BolongLin opened this issue Feb 26, 2025 · 2 comments

Comments

@BolongLin
Hi, how can we use other CuTe/CUTLASS operators in TE? For example, the GEMM from DeepGEMM, a library designed for FP8 GEMMs with fine-grained scaling, as proposed in DeepSeek-V3.

@yaox12 (Collaborator) commented Feb 28, 2025

We will not add DeepGEMM to TE because it lacks the wgrad GEMM (1x128 by 1x128) needed for backpropagation, and its JIT mechanism adds non-negligible overhead during training.

We're landing a DeepSeek-V3-like FP8 recipe (1x128 scaling for activations and 128x128 for weights) in TE, and we will use the block-wise GEMM from cuBLAS (to be released in CUDA 12.9). It offers performance comparable to DeepGEMM, supports both 1D2D (1x128 by 128x128) and 1D1D (1x128 by 1x128) scaling, and avoids the JIT overhead.
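
For reference, here is a minimal PyTorch sketch of the block-wise scaling recipe described above (1x128 groups for activations, 128x128 blocks for weights), with a dequantize-then-matmul reference GEMM. The function names, clamping choices, and layout are illustrative assumptions, not TE's or cuBLAS's actual API; a real kernel would apply the per-block scales inside the FP8 tensor-core accumulation instead of dequantizing up front.

```python
# Illustrative sketch of DeepSeek-V3-style block-wise FP8 scaling.
# Names and helpers here are hypothetical, not TE's API.
import torch

FP8_MAX = 448.0  # max representable magnitude of float8_e4m3fn

def quantize_act_1x128(x: torch.Tensor, group: int = 128):
    """Quantize a [M, K] activation with one scale per 1x128 group along K."""
    M, K = x.shape
    xg = x.view(M, K // group, group)
    scales = xg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4) / FP8_MAX
    q = (xg / scales).to(torch.float8_e4m3fn)
    return q.view(M, K), scales.squeeze(-1)          # [M, K], [M, K//group]

def quantize_wgt_128x128(w: torch.Tensor, block: int = 128):
    """Quantize a [N, K] weight with one scale per 128x128 block."""
    N, K = w.shape
    wb = w.view(N // block, block, K // block, block)
    scales = wb.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-4) / FP8_MAX
    q = (wb / scales).to(torch.float8_e4m3fn)
    return q.view(N, K), scales.view(N // block, K // block)

def gemm_blockwise_ref(xq, x_s, wq, w_s, group: int = 128):
    """Reference GEMM: dequantize with the block scales, then matmul.
    Assumes the activation group size equals the weight block size (128)."""
    M, K = xq.shape
    N = wq.shape[0]
    x = xq.to(torch.float32).view(M, K // group, group) * x_s.unsqueeze(-1)
    w = (wq.to(torch.float32).view(N // group, group, K // group, group)
         * w_s.view(N // group, 1, K // group, 1))
    return x.view(M, K) @ w.view(N, K).t()           # [M, N] in high precision

# Usage: quantize once, then run the block-wise GEMM.
x = torch.randn(256, 1024)
w = torch.randn(512, 1024)
y = gemm_blockwise_ref(*quantize_act_1x128(x), *quantize_wgt_128x128(w))
```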

@zhujian19891203
Hi, how can we use other CuTe/CUTLASS operators in TE? For example, the GEMM from DeepGEMM, a library designed for FP8 GEMMs with fine-grained scaling, as proposed in DeepSeek-V3.

Link: deepseek-ai/DeepGEMM#10 (comment)
