Bug: Hy3-preview cuda-graph crash on AMD MI300X/MI355X due to AITER custom all-reduce stream capture invalidation
Summary
After applying the file overlays from PR #23533 (tencent/Hy3-preview model support) to a fresh rocm/sgl-dev:v0.5.10.post1-rocm720-mi30x-20260423 container and launching the standard SGLang server with the parameters recommended by the model card, the first decode CUDA-graph replay fails with `HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016`, killing all 8 TP scheduler subprocesses.
The same model + same PR works correctly in eager mode (--disable-cuda-graph), which confirms the model code itself is correct on AMD. The crash is specific to CUDA-graph replay.
Root cause
When --debug-cuda-graph is enabled, the underlying error becomes visible:
```
Exception: Capture cuda graph failed: HIP error: operation failed due to a previous error during capture
Search for `hipErrorStreamCaptureInvalidated' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__HIPRT__TYPES.html for more information.
```
hipErrorStreamCaptureInvalidated means a kernel was launched on a stream that was not part of the active HIP graph capture, invalidating the capture. The replayed graph then dispatches kernels with broken state, causing the HSA exception during decode.
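The capture invariant can be illustrated with a toy Python model (the `Stream`/`GraphCapture` classes below are hypothetical illustrations, not the HIP or SGLang API): any kernel launched on a stream that is not part of the capture invalidates the whole capture, and a graph built from an invalidated capture cannot be replayed safely.

```python
class CaptureInvalidated(Exception):
    """Stands in for hipErrorStreamCaptureInvalidated."""

class Stream:
    def __init__(self, name):
        self.name = name

class GraphCapture:
    """Toy model of HIP/CUDA stream capture: only streams that were
    explicitly joined into the capture may receive kernel launches."""
    def __init__(self, capture_stream):
        self.captured = {capture_stream}
        self.valid = True

    def fork(self, stream):
        # In real HIP this happens via event record/wait between streams.
        self.captured.add(stream)

    def launch(self, stream, kernel):
        if stream not in self.captured:
            # One stray launch poisons the entire capture.
            self.valid = False
            raise CaptureInvalidated(f"{kernel} launched on {stream.name}")

main = Stream("main")
internal = Stream("aiter-internal")  # hypothetical side stream
cap = GraphCapture(main)
cap.launch(main, "rmsnorm")          # fine: stream is part of the capture
try:
    cap.launch(internal, "all_reduce")  # AITER-style side-stream launch
except CaptureInvalidated:
    pass
assert not cap.valid  # replaying such a graph gives undefined behavior
```

This is the pattern the bisect below points at: a library launching work on its own internal stream mid-capture, rather than on (or forked from) the capture stream.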
I bisected the source: AiterCustomAllreduce (the default custom all-reduce implementation on AMD when SGLANG_USE_AITER_AR=true, which is the default) launches at least one operation on an internal stream that is not part of the capture during the captured forward pass, which is exactly what invalidates it.
Reproducer
1. Start from container rocm/sgl-dev:v0.5.10.post1-rocm720-mi30x-20260423 (or the mi35x variant).
2. Apply the file overlays from PR #23533 (hunyuan_v3.py, hunyuan_v3_nextn.py, model_config.py, topk.py, fp8.py, serving_chat.py, function_call_parser.py, hunyuan_detector.py, parser/reasoning_parser.py, server_args.py, utils/common.py, quantization/fp8_utils.py, jit_kernel/grouped_topk.py).
3. pip install -U "transformers>=5.6.0"
4. Launch the SGLang server with the parameters recommended by the model card.
5. Send any inference request. The first decode batch crashes.
Workaround
Set SGLANG_USE_AITER_AR=0 in the environment. This makes dispatch_custom_allreduce() (in sglang/srt/distributed/device_communicators/custom_all_reduce.py) select SGLang's own CustomAllreduce instead of AiterCustomAllreduce. AITER's other fast paths (attention, MoE, RMSNorm, fused_qk_norm) remain enabled, so performance is preserved (or slightly improved).
With this single env var, CUDA-graph works end-to-end on both MI300X and MI355X.
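A simplified sketch of the selection logic described above (the function and class names follow the report; the bodies are hypothetical stand-ins, not SGLang's actual implementation):

```python
import os

class CustomAllreduce:
    """Stand-in for SGLang's own custom all-reduce."""

class AiterCustomAllreduce:
    """Stand-in for the AITER-backed custom all-reduce."""

def dispatch_custom_allreduce():
    # Per the report, SGLANG_USE_AITER_AR defaults to enabled on AMD;
    # setting it to 0 falls back to SGLang's own implementation.
    use_aiter = os.environ.get("SGLANG_USE_AITER_AR", "1") not in ("0", "false")
    return AiterCustomAllreduce if use_aiter else CustomAllreduce

os.environ["SGLANG_USE_AITER_AR"] = "0"  # the workaround
assert dispatch_custom_allreduce() is CustomAllreduce
```

Because the flag only affects this one dispatch point, AITER's other fast paths are untouched, which is why the workaround is performance-neutral.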
Validation
| Hardware | Workload | Without fix | With SGLANG_USE_AITER_AR=0 |
|---|---|---|---|
| MI300X TP=8 | Single long (512 tok) | ❌ crash | ✅ 34.9 tok/s |
| MI300X TP=8 | c=8, 32 reqs | ❌ crash | ✅ 250.7 tok/s |
| MI355X TP=8 | Single long (512 tok) | ❌ crash | ✅ 39.6 tok/s |
| MI355X TP=8 | c=8, 32 reqs | ❌ crash | ✅ 295.7 tok/s |
Suggested fix
Either of:
Auto-detect on HIP: in dispatch_custom_allreduce(), when _is_hip and SGLang's monolithic CUDA-graph capture is enabled (i.e., not --enforce-piecewise-cuda-graph), prefer SGLang's CustomAllreduce over AiterCustomAllreduce. Document SGLANG_USE_AITER_AR=1 as opt-in for users who know their workload doesn't trigger the issue.
Fix AiterCustomAllreduce upstream: identify and fix the rogue stream launch inside AITER's HIP graph capture path (likely inside one of the dispatchFusedAllReduceRMSNorm* or related kernels in csrc/include/custom_all_reduce.cuh). This is a more permanent fix but requires AITER changes.
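Option 1 could look roughly like this (a sketch only: the signature, flag plumbing, and class bodies are hypothetical, and the real dispatch_custom_allreduce takes different arguments):

```python
import os

class CustomAllreduce:
    """Stand-in for SGLang's own custom all-reduce."""

class AiterCustomAllreduce:
    """Stand-in for the AITER-backed custom all-reduce."""

def dispatch_custom_allreduce(is_hip: bool, piecewise_cuda_graph: bool):
    # On HIP with monolithic (non-piecewise) graph capture, prefer
    # SGLang's CustomAllreduce to avoid the capture invalidation,
    # unless the user explicitly opts back in.
    opt_in = os.environ.get("SGLANG_USE_AITER_AR") == "1"
    if is_hip and not piecewise_cuda_graph and not opt_in:
        return CustomAllreduce
    use_aiter = os.environ.get("SGLANG_USE_AITER_AR", "1") not in ("0", "false")
    return AiterCustomAllreduce if use_aiter else CustomAllreduce

# Monolithic capture on HIP: safe default kicks in
assert dispatch_custom_allreduce(True, False) is CustomAllreduce
# Explicit opt-in restores AITER for users who know their workload is safe
os.environ["SGLANG_USE_AITER_AR"] = "1"
assert dispatch_custom_allreduce(True, False) is AiterCustomAllreduce
```

Non-HIP platforms and piecewise capture keep today's behavior, so the change is a no-op everywhere except the broken configuration.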
I'd be happy to put up a PR for option 1.
References
- aiter/dist/device_communicators/custom_all_reduce.py and csrc/include/custom_all_reduce.cuh (get_buffer_RD etc.)