## Summary
`cutlass_fused_moe` produces incorrect output when FP8 weights are used with `hidden_size >= 512`. The BF16-only path works correctly at every tested dimension. The existing `test_moe_fp8` only exercises `hidden_size=128`, which masks the issue.
## Test Matrix
All tests use E=2, top_k=2, I=128, batch_size=1:
| Input | Weight | H=128 | H=256 | H=512 | H=2048 |
|-------|--------|-------|-------|-------|--------|
| BF16  | BF16   | ✅ | ✅ | ✅ | ✅ (0.37% diff) |
| FP8   | FP8    | ✅ | ✅ | ❌ | ❌ (11% mismatch, max rel 234x) |
| BF16  | FP8    | —  | —  | ❌ | ❌ (crash / all-zero) |
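The mismatch figures in the table (percent mismatched elements, greatest abs/rel diff) follow the style of `torch.testing.assert_close` reports. For reference, a sketch of how such stats can be computed; `mismatch_stats` is a hypothetical helper for illustration, not the actual test harness:

```python
import numpy as np

def mismatch_stats(out, ref, rtol=1e-2, atol=1e-2):
    # Fraction of elements outside the rtol/atol envelope, plus the
    # worst-case absolute and relative differences, mirroring the
    # style of torch.testing.assert_close failure reports.
    abs_diff = np.abs(out - ref)
    bad = abs_diff > atol + rtol * np.abs(ref)
    rel = abs_diff / np.maximum(np.abs(ref), 1e-12)
    return bad.mean() * 100.0, abs_diff.max(), rel.max()

pct, max_abs, max_rel = mismatch_stats(np.array([1.0, 2.0]),
                                       np.array([1.0, 4.0]))
# 50% of elements mismatched; greatest abs diff 2.0; greatest rel diff 0.5
```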
## Reproduction
Using FlashInfer v0.6.8rc1 on H200 (SM90):
```python
from tests.moe.test_trtllm_cutlass_fused_moe import test_moe_fp8
import torch

# Existing test — PASS
test_moe_fp8(1, 128, 2, 2, 128, torch.float16, torch.float8_e4m3fn)

# Same test with a realistic hidden_size — FAIL
test_moe_fp8(1, 512, 2, 2, 128, torch.float16, torch.float8_e4m3fn)
# → Mismatched elements: 11.1%, greatest abs diff: 1.45, greatest rel diff: 234x

test_moe_fp8(1, 2048, 2, 2, 128, torch.float16, torch.float8_e4m3fn)
# → FAIL
```
The failure is independent of `num_experts` (tested E=2 through E=128; all fail for `H >= 512`).
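For readers unfamiliar with what the test compares the kernel against, here is a minimal dense reference for a top-k MoE layer. This is an illustrative sketch, not FlashInfer's actual test reference: the name `moe_reference`, the weight layouts, and the ReLU activation (standing in for the kernel's real gated activation) are all assumptions.

```python
import numpy as np

def moe_reference(x, w1, w2, router_logits, top_k):
    # Dense top-k MoE forward pass (illustrative sketch only).
    # Shapes: x [T, H], w1 [E, I, H], w2 [E, H, I], router_logits [T, E].
    logits = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)        # softmax over experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-top_k:]           # indices of top-k experts
        weights = probs[t, top] / probs[t, top].sum() # renormalize over top-k
        for w, e in zip(weights, top):
            h = np.maximum(x[t] @ w1[e].T, 0.0)       # [I], ReLU for simplicity
            out[t] += w * (h @ w2[e].T)               # [H], weighted combine
    return out
```

Regardless of quantization, the fused kernel's output should stay within tolerance of a reference like this for every `hidden_size`, which is what the FP8 path fails to do above 512.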
## Impact
All production FP8 MoE models have hidden_size >= 2048:
- Qwen3-30B-A3B: H=2048
- MiniMax M2.5: H=3072
- Qwen3-235B: H=4096
- DeepSeek V3: H=7168
## Notes
- Affects the `cutlass_backend` (`cutlass_fused_moe`), not the `trtllm_backend` (`trtllm_fp8_block_scale_moe`).
- A prior `trtllm_fp8_block_scale_moe` accuracy issue was fixed by "Refactor the routing part" (#2803), which is a different code path.
## Environment
- FlashInfer: v0.6.8rc1 (e843df97)
- GPU: H200 (SM90)
- CUDA: 12.8