Summary
SGLang v0.5.12-cu130 regresses DeepGemm on NVIDIA B300 (Blackwell Ultra, sm_103). Every GLM-5-FP8 inference run on this image crashes during CUDA graph capture with CUDA_ERROR_ILLEGAL_ADDRESS, originating in DeepGemm's TMA descriptor initialization for the shared-experts FP8 GEMM. This blocks the image bump in #1421 and would likely block similar B300 SGLang PRs.
Upstream report
Surfaced by
PR "Update glm5-fp8-b300-sglang and -mtp SGLang image to v0.5.12-cu130"
Failing GitHub Action runs
Reproduction
Use the glm5-fp8-b300-sglang recipe, but swap the image to v0.5.12-cu130. The server flags below are taken from benchmarks/single_node/glm5_fp8_b300.sh:
docker run --gpus all --shm-size=32g --rm \
-v $HF_HUB_CACHE:/root/.cache/huggingface \
lmsysorg/sglang:v0.5.12-cu130 \
bash -c "
pip install --no-deps 'transformers==5.2.0' 'huggingface-hub==1.4.1' && \
export SGL_ENABLE_JIT_DEEPGEMM=1 && \
python3 -m sglang.launch_server \
--model-path=zai-org/GLM-5-FP8 \
--host=0.0.0.0 --port=8888 --trust-remote-code \
--tensor-parallel-size=8 \
--data-parallel-size 1 --expert-parallel-size 1 \
--tool-call-parser glm47 --reasoning-parser glm45 \
--kv-cache-dtype fp8_e4m3 --quantization fp8 \
--attention-backend nsa \
--nsa-decode-backend trtllm --nsa-prefill-backend trtllm \
--moe-runner-backend flashinfer_trtllm \
--cuda-graph-max-bs 128 --max-running-requests 128 \
--mem-fraction-static 0.85 \
--chunked-prefill-size 32768 --max-prefill-tokens 32768 \
--enable-flashinfer-allreduce-fusion --disable-radix-cache \
--stream-interval 30 \
--model-loader-extra-config '{\"enable_multithread_load\": true}'
"
Diagnosis
- Crash site (all TP ranks, simultaneous): deep_gemm.fp8_fp4_gemm_nt → _C.fp8_fp4_gemm_nt() → /deepgemm/csrc/apis/../jit_kernels/impls/runtime_utils.hpp:143 → CUDA_ERROR_ILLEGAL_ADDRESS (error 700)
- Call path: cuda_graph_runner.capture() → deepseek_v2.forward() → MoE layer forward_normal_dual_stream → _forward_shared_experts → shared_experts.gate_up_proj → fp8_kernel.deep_gemm_fp8_fp8_bf16_nt → deep_gemm_wrapper.gemm_nt_f8f8bf16 → deep_gemm.fp8_gemm_nt → native fp8_fp4_gemm_nt → CUDA_ERROR_ILLEGAL_ADDRESS in TMA descriptor runtime utils
- Working baseline: lmsysorg/sglang:v0.5.11-cu130 runs cleanly on the same recipe and hardware.
The bundled DeepGemm in v0.5.12-cu130 has a regression in its TMA-descriptor init path on Blackwell. runtime_utils.hpp:143 is the TMA descriptor validation/creation site.
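To tell this failure apart from unrelated startup errors, the crash signature above can be checked mechanically. The following is a minimal sketch, assuming the server output was saved to a log file. The function name has_deepgemm_tma_crash and the log path are hypothetical; the two matched strings are the ones quoted in this issue.

```shell
# Hypothetical helper (not part of SGLang or DeepGemm).
# Succeeds (exit 0) only if the log contains BOTH the CUDA error name
# and the DeepGemm source file reported in this issue.
has_deepgemm_tma_crash() {
  # $1: path to a file holding the captured server output
  grep -qF 'CUDA_ERROR_ILLEGAL_ADDRESS' "$1" &&
    grep -qF 'runtime_utils.hpp' "$1"
}
```

One possible use is to append `2>&1 | tee server.log` to the docker command from the Reproduction section, then run `has_deepgemm_tma_crash server.log && echo "matches this issue"`.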
Workarounds (pick one to unblock #1421)
- Pin SGLang to lmsysorg/sglang:v0.5.11-cu130 until an upstream DeepGemm/Blackwell TMA fix lands.
- Bypass DeepGemm: pass --fp8-gemm-runner-backend cutlass to the SGLang server launch, which uses the CUTLASS FP8 path instead of DeepGemm.
- Disable CUDA graphs: pass --disable-cuda-graph. This carries a large performance hit and is only useful as a smoke test.
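If the benchmark script needs to support both image versions, the CUTLASS workaround can be applied only to the affected tag. This is a rough sketch under that assumption: the function name extra_server_flags and the idea of keying on the image tag are illustrative and do not come from the existing script. Only v0.5.12-cu130 is listed because it is the only tag this issue reports as failing.

```shell
# Hypothetical helper: prints extra sglang.launch_server flags for an image tag.
# Assumption: only the tag named in this issue needs the DeepGemm bypass.
extra_server_flags() {
  case "$1" in
    v0.5.12-cu130)
      # Route FP8 GEMMs through CUTLASS instead of DeepGemm.
      printf '%s\n' '--fp8-gemm-runner-backend cutlass'
      ;;
    *)
      # Other tags (e.g. the working v0.5.11-cu130 baseline): no extra flags.
      ;;
  esac
}
```

The output is meant to be appended to the launch command with an unquoted substitution, for example `python3 -m sglang.launch_server ... $(extra_server_flags "$IMAGE_TAG")`, so that the flag and its value split into two arguments. IMAGE_TAG is likewise a placeholder variable.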
Suggested actions
- Add --fp8-gemm-runner-backend cutlass to benchmarks/single_node/glm5_fp8_b300.sh so the PR can land.
🤖 Filed by Claude Code