[Bug] Hy3-preview cuda-graph crash on AMD MI300X/MI355X due to AITER custom all-reduce stream invalidation #23580

@andyluo7

Description


Summary

After applying the file overlays from PR #23533 (tencent/Hy3-preview model support) to a fresh rocm/sgl-dev:v0.5.10.post1-rocm720-mi30x-20260423 container and launching the standard SGLang server with the parameters recommended by the model card, the first decode CUDA-graph replay fails with `HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016`, killing all 8 TP scheduler subprocesses.

The same model + same PR works correctly in eager mode (--disable-cuda-graph), which confirms the model code itself is correct on AMD. The crash is specific to CUDA-graph replay.

Root cause

When --debug-cuda-graph is enabled, the underlying error becomes visible:

```
Exception: Capture cuda graph failed: HIP error: operation failed due to a previous error during capture
Search for `hipErrorStreamCaptureInvalidated' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__HIPRT__TYPES.html for more information.
```

`hipErrorStreamCaptureInvalidated` means a kernel was launched on a stream that was not part of the active HIP graph capture, which invalidates the entire capture. The replayed graph then dispatches kernels with broken state, causing the HSA exception during decode.

I bisected the source: this happens because AiterCustomAllreduce (the default custom all-reduce implementation on AMD when SGLANG_USE_AITER_AR=true, which is the default) launches some operation on an internal stream during the captured forward pass.
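The failure mode can be illustrated with a toy model of stream-capture semantics (pure Python, not the real HIP/ROCm API; the class, stream, and kernel names below are invented for illustration):

```python
class CaptureInvalidated(Exception):
    """Stands in for hipErrorStreamCaptureInvalidated."""

class GraphCapture:
    """Toy model of HIP graph capture semantics; not the real HIP API."""
    def __init__(self, stream):
        self.captured_streams = {stream}  # streams participating in the capture
        self.valid = True

    def fork(self, parent, child):
        # A side stream only joins the capture if it is forked from (and later
        # joined back to) a captured stream via events.
        if parent in self.captured_streams:
            self.captured_streams.add(child)

    def launch(self, stream, kernel):
        # Launching on a stream outside the capture invalidates the whole
        # capture, mirroring hipErrorStreamCaptureInvalidated.
        if stream not in self.captured_streams:
            self.valid = False
            raise CaptureInvalidated(f"{kernel} launched on uncaptured stream {stream}")

capture = GraphCapture("main")
capture.launch("main", "rmsnorm")  # fine: stream is part of the capture
try:
    capture.launch("aiter_internal", "all_reduce")  # rogue internal stream
except CaptureInvalidated:
    pass
assert capture.valid is False  # replaying this graph dispatches broken state
```

In this model, AITER's internal stream is never `fork()`ed into the capture, so its first launch during the captured forward pass poisons the whole graph.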

Reproducer

  1. Container: rocm/sgl-dev:v0.5.10.post1-rocm720-mi30x-20260423 (or the mi35x variant)
  2. Apply file overlays from PR #23533 (support Hy3 preview): hunyuan_v3.py, hunyuan_v3_nextn.py, model_config.py, topk.py, fp8.py, serving_chat.py, function_call_parser.py, hunyuan_detector.py, parser/reasoning_parser.py, server_args.py, utils/common.py, quantization/fp8_utils.py, jit_kernel/grouped_topk.py
  3. Upgrade transformers: pip install -U "transformers>=5.6.0"
  4. Launch:

```
python3 -m sglang.launch_server \
  --model tencent/Hy3-preview --tp 8 \
  --tool-call-parser hunyuan --reasoning-parser hunyuan \
  --served-model-name hy3-preview --port 30000 \
  --mem-fraction-static 0.85
```

  5. Send any inference request. The first decode batch crashes.

Workaround

Set SGLANG_USE_AITER_AR=0 in the environment. This makes dispatch_custom_allreduce() (in sglang/srt/distributed/device_communicators/custom_all_reduce.py) select SGLang's own CustomAllreduce instead of AiterCustomAllreduce. AITER's other fast paths (attention, MoE, RMSNorm, fused_qk_norm) remain enabled, so performance is preserved (or slightly improved).

```
SGLANG_USE_AITER_AR=0 python3 -m sglang.launch_server ...
```

With this single env var, CUDA-graph capture and replay work end-to-end on both MI300X and MI355X.

Validation

| Hardware | Workload | Without fix | With SGLANG_USE_AITER_AR=0 |
|---|---|---|---|
| MI300X TP=8 | Single long (512 tok) | ❌ crash | ✅ 34.9 tok/s |
| MI300X TP=8 | c=8, 32 reqs | ❌ crash | ✅ 250.7 tok/s |
| MI355X TP=8 | Single long (512 tok) | ❌ crash | ✅ 39.6 tok/s |
| MI355X TP=8 | c=8, 32 reqs | ❌ crash | ✅ 295.7 tok/s |

Suggested fix

Either of:

  1. Auto-detect on HIP: in dispatch_custom_allreduce(), when _is_hip and SGLang's monolithic CUDA-graph capture is enabled (i.e., not --enforce-piecewise-cuda-graph), prefer SGLang's CustomAllreduce over AiterCustomAllreduce. Document SGLANG_USE_AITER_AR=1 as an explicit opt-in for users who know their workload doesn't trigger the issue.

  2. Fix AiterCustomAllreduce upstream: identify and fix the rogue stream launch inside AITER's HIP graph capture path (likely inside one of the dispatchFusedAllReduceRMSNorm* or related kernels in csrc/include/custom_all_reduce.cuh). This is a more permanent fix but requires AITER changes.

I'd be happy to put up a PR for option 1.
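For option 1, the selection logic could look roughly like this. This is a sketch only: the real dispatch_custom_allreduce() in sglang/srt/distributed/device_communicators/custom_all_reduce.py has a different signature, and the `is_hip` / `piecewise_cuda_graph` parameters and placeholder classes here are assumptions for illustration:

```python
import os

# Placeholder stand-ins; the real classes live in
# sglang/srt/distributed/device_communicators/custom_all_reduce.py.
class CustomAllreduce: ...
class AiterCustomAllreduce: ...

def dispatch_custom_allreduce(is_hip: bool, piecewise_cuda_graph: bool):
    """Sketch of option 1: on HIP with monolithic CUDA-graph capture,
    default to SGLang's own CustomAllreduce; SGLANG_USE_AITER_AR=1
    remains an explicit opt-in."""
    env = os.environ.get("SGLANG_USE_AITER_AR")
    if env is not None:
        # An explicit user setting always wins.
        use_aiter = env.lower() in ("1", "true")
    elif is_hip and not piecewise_cuda_graph:
        # Monolithic capture is where AITER's rogue-stream launch
        # invalidates the graph, so prefer the safe path by default.
        use_aiter = False
    else:
        # Current default: AITER all-reduce on AMD.
        use_aiter = is_hip
    return AiterCustomAllreduce if use_aiter else CustomAllreduce
```

With no env var set, HIP + monolithic capture selects CustomAllreduce; setting SGLANG_USE_AITER_AR=1 restores AITER's all-reduce for users whose workloads don't hit the capture invalidation.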

References

  • PR #23533 (support Hy3 preview; tencent/Hy3-preview), which this issue blocks for AMD users
  • AITER source: aiter/dist/device_communicators/custom_all_reduce.py and csrc/include/custom_all_reduce.cuh (get_buffer_RD etc.)
