cutlass_fused_moe FP8 incorrect for hidden_size >= 512 (per-tensor dequant path) #3068

@kzjeef

Description

Summary

cutlass_fused_moe produces incorrect output when FP8 weights are used with hidden_size >= 512. The BF16-only path works correctly at all tested dimensions. The existing test_moe_fp8 covers only hidden_size=128, which masks the issue.
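For reference, the math the per-tensor dequant path is supposed to compute is simple: quantize activations and weights each with a single scale, matmul, then multiply both scales back in. Below is a pure-Python sketch (simulated e4m3 rounding; all helper names are hypothetical illustrations, not FlashInfer code). In this reference the error stays at a few percent regardless of hidden_size, which suggests the kernel's scale handling, not FP8 precision itself, is at fault:

```python
# Pure-Python reference for the per-tensor FP8 dequant path (illustrative
# sketch only -- fp8_e4m3_round / fp8_linear_per_tensor are hypothetical).
import math
import random

E4M3_MAX = 448.0  # largest finite float8_e4m3fn magnitude

def fp8_e4m3_round(x: float) -> float:
    """Round x to the nearest float8_e4m3fn value (saturating at +-448)."""
    if x == 0.0:
        return 0.0
    s = math.copysign(1.0, x)
    a = min(abs(x), E4M3_MAX)
    e = max(math.floor(math.log2(a)), -6)  # -6: smallest normal exponent
    step = 2.0 ** (e - 3)                  # 3 mantissa bits => 8 steps/octave
    return s * round(a / step) * step

def quantize_per_tensor(mat):
    """Per-tensor quantization: one scale for the whole matrix."""
    amax = max(abs(v) for row in mat for v in row)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    q = [[fp8_e4m3_round(v / scale) for v in row] for row in mat]
    return q, scale

def matmul(a, b):
    cols = len(b[0])
    return [[sum(ai[k] * b[k][j] for k in range(len(b))) for j in range(cols)]
            for ai in a]

def fp8_linear_per_tensor(x, w):
    """Per-tensor dequant path: y = (x_q @ w_q) * scale_x * scale_w."""
    xq, sx = quantize_per_tensor(x)
    wq, sw = quantize_per_tensor(w)
    return [[v * sx * sw for v in row] for row in matmul(xq, wq)]

random.seed(0)
rels = []
for hidden in (128, 512, 2048):
    x = [[random.gauss(0, 1) for _ in range(hidden)]]             # one token
    w = [[random.gauss(0, 1) for _ in range(8)] for _ in range(hidden)]
    ref = matmul(x, w)[0]
    got = fp8_linear_per_tensor(x, w)[0]
    rel = max(abs(r - g) for r, g in zip(ref, got)) / max(abs(v) for v in ref)
    rels.append(rel)
    print(f"H={hidden:5d}  max error relative to peak output: {rel:.4f}")
```

The error does not blow up with H in this reference, unlike the 234x relative error observed with the kernel.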

Test Matrix

All tests use E=2, top_k=2, I=128, batch_size=1:

| Input | Weight | Result |
|-------|--------|--------|
| BF16 | BF16 | ✅ all H (max 0.37% diff) |
| FP8 | FP8 | ✅ H=128 · ❌ H ≥ 512 (11% mismatched elements, max rel diff 234x) |
| BF16 | FP8 | ❌ crash / all-zero output |

Reproduction

Using FlashInfer v0.6.8rc1 on H200 (SM90):

from tests.moe.test_trtllm_cutlass_fused_moe import test_moe_fp8
import torch

# Positional args (inferred from the test matrix above):
# batch_size, hidden_size, num_experts, top_k, intermediate_size,
# activation dtype, weight dtype

# Existing test — PASS
test_moe_fp8(1, 128, 2, 2, 128, torch.float16, torch.float8_e4m3fn)

# Same test with realistic hidden_size — FAIL
test_moe_fp8(1, 512, 2, 2, 128, torch.float16, torch.float8_e4m3fn)
# → Mismatched elements: 11.1%, greatest abs diff: 1.45, greatest rel diff: 234x

test_moe_fp8(1, 2048, 2, 2, 128, torch.float16, torch.float8_e4m3fn)
# → FAIL

The failure is independent of num_experts (tested E=2 through E=128, all fail for H>=512).
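A sweep along these lines can map the boundary (positional args copied from the repro above; the hidden_size values swept and the import guard, which makes the snippet a no-op where FlashInfer/CUDA is unavailable, are my additions):

```python
# Hypothetical sweep to locate the failing hidden_size boundary.
# Requires a CUDA build of FlashInfer; degrades to a no-op otherwise.
try:
    import torch
    from tests.moe.test_trtllm_cutlass_fused_moe import test_moe_fp8
    available = True
except ImportError:
    available = False

results = {}
if available:
    for h in (128, 256, 512, 1024, 2048):
        try:
            # Same positional args as the repro above, varying hidden_size.
            test_moe_fp8(1, h, 2, 2, 128, torch.float16, torch.float8_e4m3fn)
            results[h] = "PASS"
        except AssertionError:
            results[h] = "FAIL"

for h, verdict in sorted(results.items()):
    print(f"hidden_size={h}: {verdict}")
```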

Impact

All production FP8 MoE models have hidden_size >= 2048:

  • Qwen3-30B-A3B: H=2048
  • MiniMax M2.5: H=3072
  • Qwen3-235B: H=4096
  • DeepSeek V3: H=7168

Notes

Environment

  • FlashInfer: v0.6.8rc1 (e843df97)
  • GPU: H200 (SM90)
  • CUDA: 12.8
