cutlass_fused_moe FP8 incorrect for hidden_size >= 512 (per-tensor dequant path) #3068

@kzjeef

Description

Summary

cutlass_fused_moe produces incorrect output when FP8 weights are used with hidden_size >= 512. The BF16-only path works correctly at all tested dimensions. The existing test_moe_fp8 covers only hidden_size=128, which masks the issue.
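For reference, the math the per-tensor dequant path is supposed to compute is simple: quantize activations and weights each with a single scale, matmul, then multiply both scales back in. Below is a pure-Python sketch (simulated e4m3 rounding; all helper names are hypothetical illustrations, not FlashInfer code). In this reference the error stays at a few percent regardless of hidden_size, which suggests the kernel's scale handling, not FP8 precision itself, is at fault:

```python
# Pure-Python reference for the per-tensor FP8 dequant path (illustrative
# sketch only -- fp8_e4m3_round / fp8_linear_per_tensor are hypothetical).
import math
import random

E4M3_MAX = 448.0  # largest finite float8_e4m3fn magnitude

def fp8_e4m3_round(x: float) -> float:
    """Round x to the nearest float8_e4m3fn value (saturating at +-448)."""
    if x == 0.0:
        return 0.0
    s = math.copysign(1.0, x)
    a = min(abs(x), E4M3_MAX)
    e = max(math.floor(math.log2(a)), -6)  # -6: smallest normal exponent
    step = 2.0 ** (e - 3)                  # 3 mantissa bits => 8 steps/octave
    return s * round(a / step) * step

def quantize_per_tensor(mat):
    """Per-tensor quantization: one scale for the whole matrix."""
    amax = max(abs(v) for row in mat for v in row)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    q = [[fp8_e4m3_round(v / scale) for v in row] for row in mat]
    return q, scale

def matmul(a, b):
    cols = len(b[0])
    return [[sum(ai[k] * b[k][j] for k in range(len(b))) for j in range(cols)]
            for ai in a]

def fp8_linear_per_tensor(x, w):
    """Per-tensor dequant path: y = (x_q @ w_q) * scale_x * scale_w."""
    xq, sx = quantize_per_tensor(x)
    wq, sw = quantize_per_tensor(w)
    return [[v * sx * sw for v in row] for row in matmul(xq, wq)]

random.seed(0)
rels = []
for hidden in (128, 512, 2048):
    x = [[random.gauss(0, 1) for _ in range(hidden)]]             # one token
    w = [[random.gauss(0, 1) for _ in range(8)] for _ in range(hidden)]
    ref = matmul(x, w)[0]
    got = fp8_linear_per_tensor(x, w)[0]
    rel = max(abs(r - g) for r, g in zip(ref, got)) / max(abs(v) for v in ref)
    rels.append(rel)
    print(f"H={hidden:5d}  max error relative to peak output: {rel:.4f}")
```

The error does not blow up with H in this reference, unlike the 234x relative error observed with the kernel.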

Test Matrix

All tests use E=2, top_k=2, I=128, batch_size=1:

| Input | Weight | Result |
|-------|--------|--------|
| BF16 | BF16 | ✅ all H (max 0.37% diff) |
| FP8 | FP8 | ✅ H=128 · ❌ H ≥ 512 (11% mismatched elements, max rel diff 234x) |
| BF16 | FP8 | ❌ crash / all-zero output |

Reproduction

Using FlashInfer v0.6.8rc1 on H200 (SM90):

from tests.moe.test_trtllm_cutlass_fused_moe import test_moe_fp8
import torch

# Positional args (inferred from the test matrix above):
# batch_size, hidden_size, num_experts, top_k, intermediate_size,
# activation dtype, weight dtype

# Existing test — PASS
test_moe_fp8(1, 128, 2, 2, 128, torch.float16, torch.float8_e4m3fn)

# Same test with realistic hidden_size — FAIL
test_moe_fp8(1, 512, 2, 2, 128, torch.float16, torch.float8_e4m3fn)
# → Mismatched elements: 11.1%, greatest abs diff: 1.45, greatest rel diff: 234x

test_moe_fp8(1, 2048, 2, 2, 128, torch.float16, torch.float8_e4m3fn)
# → FAIL

The failure is independent of num_experts (tested E=2 through E=128, all fail for H>=512).
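A sweep along these lines can map the boundary (positional args copied from the repro above; the hidden_size values swept and the import guard, which makes the snippet a no-op where FlashInfer/CUDA is unavailable, are my additions):

```python
# Hypothetical sweep to locate the failing hidden_size boundary.
# Requires a CUDA build of FlashInfer; degrades to a no-op otherwise.
try:
    import torch
    from tests.moe.test_trtllm_cutlass_fused_moe import test_moe_fp8
    available = True
except ImportError:
    available = False

results = {}
if available:
    for h in (128, 256, 512, 1024, 2048):
        try:
            # Same positional args as the repro above, varying hidden_size.
            test_moe_fp8(1, h, 2, 2, 128, torch.float16, torch.float8_e4m3fn)
            results[h] = "PASS"
        except AssertionError:
            results[h] = "FAIL"

for h, verdict in sorted(results.items()):
    print(f"hidden_size={h}: {verdict}")
```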

Impact

All production FP8 MoE models have hidden_size >= 2048:

  • Qwen3-30B-A3B: H=2048
  • MiniMax M2.5: H=3072
  • Qwen3-235B: H=4096
  • DeepSeek V3: H=7168

Notes

Environment

  • FlashInfer: v0.6.8rc1 (e843df97)
  • GPU: H200 (SM90)
  • CUDA: 12.8
