How can we integrate the DeepGEMM FP8 GEMM implementation in TE's block-wise scaling? #1509

Open
BolongLin opened this issue Feb 26, 2025 · 2 comments

Comments

@BolongLin
Hi, how can we use other CuTe/CUTLASS operators in TE? For example, the GEMM from DeepGEMM, a library designed for FP8 GEMMs with fine-grained scaling, as proposed in DeepSeek-V3.

@yaox12 (Collaborator) commented Feb 28, 2025

We will not add DeepGEMM to TE because it lacks the wgrad GEMM (1x128 by 1x128) needed for backpropagation, and its JIT mechanism adds non-negligible overhead during training.

We're landing a DeepSeek-V3-like FP8 recipe (1x128 scaling for activations and 128x128 for weights) in TE, and we will use the block-wise GEMM from cuBLAS (to be released in CUDA 12.9). It offers performance comparable to DeepGEMM, supports both 1D2D (1x128 by 128x128) and 1D1D (1x128 by 1x128) scaling, and avoids the JIT overhead.
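
For reference, here is a minimal PyTorch sketch of the block-wise scaling recipe described above (1x128 groups for activations, 128x128 blocks for weights), with a dequantize-then-matmul reference GEMM. The function names, clamping choices, and layout are illustrative assumptions, not TE's or cuBLAS's actual API; a real kernel would apply the per-block scales inside the FP8 tensor-core accumulation instead of dequantizing up front.

```python
# Illustrative sketch of DeepSeek-V3-style block-wise FP8 scaling.
# Names and helpers here are hypothetical, not TE's API.
import torch

FP8_MAX = 448.0  # max representable magnitude of float8_e4m3fn

def quantize_act_1x128(x: torch.Tensor, group: int = 128):
    """Quantize a [M, K] activation with one scale per 1x128 group along K."""
    M, K = x.shape
    xg = x.view(M, K // group, group)
    scales = xg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4) / FP8_MAX
    q = (xg / scales).to(torch.float8_e4m3fn)
    return q.view(M, K), scales.squeeze(-1)          # [M, K], [M, K//group]

def quantize_wgt_128x128(w: torch.Tensor, block: int = 128):
    """Quantize a [N, K] weight with one scale per 128x128 block."""
    N, K = w.shape
    wb = w.view(N // block, block, K // block, block)
    scales = wb.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-4) / FP8_MAX
    q = (wb / scales).to(torch.float8_e4m3fn)
    return q.view(N, K), scales.view(N // block, K // block)

def gemm_blockwise_ref(xq, x_s, wq, w_s, group: int = 128):
    """Reference GEMM: dequantize with the block scales, then matmul.
    Assumes the activation group size equals the weight block size (128)."""
    M, K = xq.shape
    N = wq.shape[0]
    x = xq.to(torch.float32).view(M, K // group, group) * x_s.unsqueeze(-1)
    w = (wq.to(torch.float32).view(N // group, group, K // group, group)
         * w_s.view(N // group, 1, K // group, 1))
    return x.view(M, K) @ w.view(N, K).t()           # [M, N] in high precision

# Usage: quantize once, then run the block-wise GEMM.
x = torch.randn(256, 1024)
w = torch.randn(512, 1024)
y = gemm_blockwise_ref(*quantize_act_1x128(x), *quantize_wgt_128x128(w))
```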

@zhujian19891203
Hi, how can we use other CuTe/CUTLASS operators in TE? For example, the GEMM from DeepGEMM, a library designed for FP8 GEMMs with fine-grained scaling, as proposed in DeepSeek-V3.

Link: deepseek-ai/DeepGEMM#10 (comment)
