Bug: Hy3-preview cuda-graph crash on AMD MI300X/MI355X due to AITER custom all-reduce stream capture invalidation
Summary
After applying the file overlays from PR #23533 (tencent/Hy3-preview model support) to a fresh rocm/sgl-dev:v0.5.10.post1-rocm720-mi30x-20260423 container and launching the standard SGLang server with the parameters recommended by the model card, the first decode CUDA-graph replay fails with `HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016`, killing all 8 TP scheduler subprocesses.
The same model + same PR works correctly in eager mode (--disable-cuda-graph), which confirms the model code itself is correct on AMD. The crash is specific to CUDA-graph replay.
Root cause
When --debug-cuda-graph is enabled, the underlying error becomes visible:
```
Exception: Capture cuda graph failed: HIP error: operation failed due to a previous error during capture
Search for `hipErrorStreamCaptureInvalidated' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__HIPRT__TYPES.html for more information.
```
hipErrorStreamCaptureInvalidated means a kernel was launched on a stream that was not part of the active HIP graph capture, invalidating the capture. The replayed graph then dispatches kernels with broken state, causing the HSA exception during decode.
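The capture invariant can be illustrated with a toy Python model (the `Stream`/`GraphCapture` classes below are hypothetical illustrations, not the HIP or SGLang API): any kernel launched on a stream that is not part of the capture invalidates the whole capture, and a graph built from an invalidated capture cannot be replayed safely.

```python
class CaptureInvalidated(Exception):
    """Stands in for hipErrorStreamCaptureInvalidated."""

class Stream:
    def __init__(self, name):
        self.name = name

class GraphCapture:
    """Toy model of HIP/CUDA stream capture: only streams that were
    explicitly joined into the capture may receive kernel launches."""
    def __init__(self, capture_stream):
        self.captured = {capture_stream}
        self.valid = True

    def fork(self, stream):
        # In real HIP this happens via event record/wait between streams.
        self.captured.add(stream)

    def launch(self, stream, kernel):
        if stream not in self.captured:
            # One stray launch poisons the entire capture.
            self.valid = False
            raise CaptureInvalidated(f"{kernel} launched on {stream.name}")

main = Stream("main")
internal = Stream("aiter-internal")  # hypothetical side stream
cap = GraphCapture(main)
cap.launch(main, "rmsnorm")          # fine: stream is part of the capture
try:
    cap.launch(internal, "all_reduce")  # AITER-style side-stream launch
except CaptureInvalidated:
    pass
assert not cap.valid  # replaying such a graph gives undefined behavior
```

This is the pattern the bisect below points at: a library launching work on its own internal stream mid-capture, rather than on (or forked from) the capture stream.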
I bisected the source: AiterCustomAllreduce (the default custom all-reduce implementation on AMD when SGLANG_USE_AITER_AR=true, which is the default) launches at least one operation on an internal stream that is not part of the capture during the captured forward pass, which is exactly what invalidates it.
Reproducer
1. Start from container rocm/sgl-dev:v0.5.10.post1-rocm720-mi30x-20260423 (or the mi35x variant).
2. Apply the file overlays from PR #23533 (hunyuan_v3.py, hunyuan_v3_nextn.py, model_config.py, topk.py, fp8.py, serving_chat.py, function_call_parser.py, hunyuan_detector.py, parser/reasoning_parser.py, server_args.py, utils/common.py, quantization/fp8_utils.py, jit_kernel/grouped_topk.py).
3. pip install -U "transformers>=5.6.0"
4. Launch the SGLang server with the parameters recommended by the model card.
5. Send any inference request. The first decode batch crashes.
Workaround
Set SGLANG_USE_AITER_AR=0 in the environment. This makes dispatch_custom_allreduce() (in sglang/srt/distributed/device_communicators/custom_all_reduce.py) select SGLang's own CustomAllreduce instead of AiterCustomAllreduce. AITER's other fast paths (attention, MoE, RMSNorm, fused_qk_norm) remain enabled, so performance is preserved (or slightly improved).
With this single env var, CUDA-graph works end-to-end on both MI300X and MI355X.
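A simplified sketch of the selection logic described above (the function and class names follow the report; the bodies are hypothetical stand-ins, not SGLang's actual implementation):

```python
import os

class CustomAllreduce:
    """Stand-in for SGLang's own custom all-reduce."""

class AiterCustomAllreduce:
    """Stand-in for the AITER-backed custom all-reduce."""

def dispatch_custom_allreduce():
    # Per the report, SGLANG_USE_AITER_AR defaults to enabled on AMD;
    # setting it to 0 falls back to SGLang's own implementation.
    use_aiter = os.environ.get("SGLANG_USE_AITER_AR", "1") not in ("0", "false")
    return AiterCustomAllreduce if use_aiter else CustomAllreduce

os.environ["SGLANG_USE_AITER_AR"] = "0"  # the workaround
assert dispatch_custom_allreduce() is CustomAllreduce
```

Because the flag only affects this one dispatch point, AITER's other fast paths are untouched, which is why the workaround is performance-neutral.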
Validation
| Hardware | Workload | Without fix | With SGLANG_USE_AITER_AR=0 |
|---|---|---|---|
| MI300X TP=8 | Single long (512 tok) | ❌ crash | ✅ 34.9 tok/s |
| MI300X TP=8 | c=8, 32 reqs | ❌ crash | ✅ 250.7 tok/s |
| MI355X TP=8 | Single long (512 tok) | ❌ crash | ✅ 39.6 tok/s |
| MI355X TP=8 | c=8, 32 reqs | ❌ crash | ✅ 295.7 tok/s |
Suggested fix
Either of:
Auto-detect on HIP: in dispatch_custom_allreduce(), when _is_hip and SGLang's monolithic CUDA-graph capture is enabled (i.e., not --enforce-piecewise-cuda-graph), prefer SGLang's CustomAllreduce over AiterCustomAllreduce. Document SGLANG_USE_AITER_AR=1 as opt-in for users who know their workload doesn't trigger the issue.
Fix AiterCustomAllreduce upstream: identify and fix the rogue stream launch inside AITER's HIP graph capture path (likely inside one of the dispatchFusedAllReduceRMSNorm* or related kernels in csrc/include/custom_all_reduce.cuh). This is a more permanent fix but requires AITER changes.
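Option 1 could look roughly like this (a sketch only: the signature, flag plumbing, and class bodies are hypothetical, and the real dispatch_custom_allreduce takes different arguments):

```python
import os

class CustomAllreduce:
    """Stand-in for SGLang's own custom all-reduce."""

class AiterCustomAllreduce:
    """Stand-in for the AITER-backed custom all-reduce."""

def dispatch_custom_allreduce(is_hip: bool, piecewise_cuda_graph: bool):
    # On HIP with monolithic (non-piecewise) graph capture, prefer
    # SGLang's CustomAllreduce to avoid the capture invalidation,
    # unless the user explicitly opts back in.
    opt_in = os.environ.get("SGLANG_USE_AITER_AR") == "1"
    if is_hip and not piecewise_cuda_graph and not opt_in:
        return CustomAllreduce
    use_aiter = os.environ.get("SGLANG_USE_AITER_AR", "1") not in ("0", "false")
    return AiterCustomAllreduce if use_aiter else CustomAllreduce

# Monolithic capture on HIP: safe default kicks in
assert dispatch_custom_allreduce(True, False) is CustomAllreduce
# Explicit opt-in restores AITER for users who know their workload is safe
os.environ["SGLANG_USE_AITER_AR"] = "1"
assert dispatch_custom_allreduce(True, False) is AiterCustomAllreduce
```

Non-HIP platforms and piecewise capture keep today's behavior, so the change is a no-op everywhere except the broken configuration.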
I'd be happy to put up a PR for option 1.
References
- aiter/dist/device_communicators/custom_all_reduce.py and csrc/include/custom_all_reduce.cuh (get_buffer_RD etc.)