[Klaud Cold] sglang v0.5.12 DeepGemm regression on B300: CUDA_ERROR_ILLEGAL_ADDRESS in fp8_fp4_gemm_nt TMA descriptor init #1463

@functionstackx

Summary

SGLang v0.5.12-cu130 regresses DeepGemm on NVIDIA B300 (Blackwell Ultra, sm_103). Every GLM-5-FP8 inference run on this image crashes during CUDA graph capture with CUDA_ERROR_ILLEGAL_ADDRESS, raised from DeepGemm's TMA descriptor initialization for the shared-experts FP8 GEMM. This blocks the version bump in #1421 and will likely block other B300 SGLang PRs that move to this image.

Upstream report

Surfaced by

Failing GitHub Action runs

Reproduction

Use the glm5-fp8-b300-sglang recipe but swap the image to v0.5.12-cu130. Server flags pulled from benchmarks/single_node/glm5_fp8_b300.sh:

docker run --gpus all --shm-size=32g --rm \
  -v $HF_HUB_CACHE:/root/.cache/huggingface \
  lmsysorg/sglang:v0.5.12-cu130 \
  bash -c "
    pip install --no-deps 'transformers==5.2.0' 'huggingface-hub==1.4.1' && \
    export SGL_ENABLE_JIT_DEEPGEMM=1 && \
    python3 -m sglang.launch_server \
      --model-path=zai-org/GLM-5-FP8 \
      --host=0.0.0.0 --port=8888 --trust-remote-code \
      --tensor-parallel-size=8 \
      --data-parallel-size 1 --expert-parallel-size 1 \
      --tool-call-parser glm47 --reasoning-parser glm45 \
      --kv-cache-dtype fp8_e4m3 --quantization fp8 \
      --attention-backend nsa \
      --nsa-decode-backend trtllm --nsa-prefill-backend trtllm \
      --moe-runner-backend flashinfer_trtllm \
      --cuda-graph-max-bs 128 --max-running-requests 128 \
      --mem-fraction-static 0.85 \
      --chunked-prefill-size 32768 --max-prefill-tokens 32768 \
      --enable-flashinfer-allreduce-fusion --disable-radix-cache \
      --stream-interval 30 \
      --model-loader-extra-config '{\"enable_multithread_load\": true}'
  "

Diagnosis

  • Crash site (all TP ranks, simultaneous): deep_gemm.fp8_fp4_gemm_nt → _C.fp8_fp4_gemm_nt() → /deepgemm/csrc/apis/../jit_kernels/impls/runtime_utils.hpp:143 → CUDA_ERROR_ILLEGAL_ADDRESS (error 700)
  • Call path: cuda_graph_runner.capture() → deepseek_v2.forward() → MoE layer forward_normal_dual_stream → _forward_shared_experts → shared_experts.gate_up_proj → fp8_kernel.deep_gemm_fp8_fp8_bf16_nt → deep_gemm_wrapper.gemm_nt_f8f8bf16 → deep_gemm.fp8_gemm_nt → native fp8_fp4_gemm_nt → CUDA_ERROR_ILLEGAL_ADDRESS in TMA descriptor runtime utils
  • Working baseline: lmsysorg/sglang:v0.5.11-cu130 runs cleanly on the same recipe / hardware.

The bundled DeepGemm in v0.5.12-cu130 has a regression in its TMA-descriptor init path on Blackwell. runtime_utils.hpp:143 is the TMA descriptor validation/creation site.
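To triage other failing runs against this signature, a small log scanner can help. The sketch below is an assumption about how one might do it, not part of SGLang or DeepGemm: the function name find_tma_crashes and the window parameter are hypothetical, and only the two signature strings (the error name and the runtime_utils.hpp location) come from the diagnosis above.

```python
from __future__ import annotations

import re
from typing import Iterable

# Signature strings taken from the crash site described in this issue.
ERROR_NAME = "CUDA_ERROR_ILLEGAL_ADDRESS"
CRASH_SITE = re.compile(r"runtime_utils\.hpp:\d+")


def find_tma_crashes(lines: Iterable[str], window: int = 5) -> list[int]:
    """Return 1-based log line numbers that mention runtime_utils.hpp
    and have the CUDA error name within `window` lines (hypothetical helper)."""
    rows = list(lines)
    # Indices of every line that carries the CUDA error name.
    error_rows = [i for i, row in enumerate(rows) if ERROR_NAME in row]
    hits = []
    for i, row in enumerate(rows):
        near_error = any(abs(i - j) <= window for j in error_rows)
        if CRASH_SITE.search(row) and near_error:
            hits.append(i + 1)
    return hits
```

Typical use would be find_tma_crashes(open(path)) on a saved server log. A non-empty result suggests the run hit the same failure rather than an unrelated crash, though the exact log layout is not confirmed by this issue.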

Workarounds (pick one to unblock #1421)

  1. Pin sglang to lmsysorg/sglang:v0.5.11-cu130 until upstream DeepGemm/Blackwell TMA fix lands.
  2. Bypass DeepGemm: pass --fp8-gemm-runner-backend cutlass to the SGLang server launch — uses CUTLASS FP8 path instead of DeepGemm.
  3. Disable CUDA graphs: --disable-cuda-graph — large perf hit, only useful as a smoke test.
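The three workarounds differ only in the image tag or in one extra server flag, so a benchmark harness could switch between them with a small helper. This is a minimal sketch under stated assumptions: apply_workaround and the labels "pin", "cutlass", and "no-cuda-graph" are hypothetical names chosen for illustration, while the image tags and flags are the ones listed above.

```python
from __future__ import annotations

# Image tags as named in this issue.
BROKEN_IMAGE = "lmsysorg/sglang:v0.5.12-cu130"
PINNED_IMAGE = "lmsysorg/sglang:v0.5.11-cu130"


def apply_workaround(
    image: str, server_args: list[str], workaround: str
) -> tuple[str, list[str]]:
    """Return (image, server_args) adjusted for one workaround (hypothetical helper).

    The input list is copied, never mutated.
    """
    args = list(server_args)
    if workaround == "pin":
        # Workaround 1: fall back to the last known-good image.
        return PINNED_IMAGE, args
    if workaround == "cutlass":
        # Workaround 2: route FP8 GEMMs through CUTLASS instead of DeepGemm.
        return image, args + ["--fp8-gemm-runner-backend", "cutlass"]
    if workaround == "no-cuda-graph":
        # Workaround 3: skip CUDA graph capture (smoke test only).
        return image, args + ["--disable-cuda-graph"]
    raise ValueError(f"unknown workaround: {workaround!r}")
```

For example, apply_workaround(BROKEN_IMAGE, base_args, "cutlass") keeps the v0.5.12 image and appends the CUTLASS backend flag, which leaves the rest of the recipe untouched.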

Suggested actions

🤖 Filed by Claude Code
