[Bug] PD disaggregation decode worker crashes (CUDA IndexKernel “index out of bounds”) with Qwen/Qwen3.6-27B using NIXL backend (H100, sglang 0.5.10.post1) #23574

@Bihan

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

When running PD-disaggregated inference (router plus separate prefill/decode workers) with Qwen/Qwen3.6-27B, the system initially works: workers register, /chat/completions succeeds, and tokens are generated. After some time, the decode server crashes with a CUDA device-side assert (full traceback below).

Decode/prefill/router logs are attached.

decode-qwen.log
prefill-qwen.log
router-qwen.log

Reproduction

  1. Start router on a CPU machine:
python -m sglang_router.launch_router \
  --host 0.0.0.0 \
  --port 8000 \
  --pd-disaggregation \
  --prefill-policy cache_aware
  2. Start decode worker:
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-27B \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend nixl \
  --host 0.0.0.0 \
  --port 8000 \
  --log-level debug
  3. Start prefill worker:
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-27B \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend nixl \
  --host 0.0.0.0 \
  --port 8000 \
  --disaggregation-bootstrap-port 8998 \
  --log-level debug
  4. Register the workers with the router via its REST API:

Decode

curl -X POST http://localhost:8000/workers \
  -H "Content-Type: application/json" \
  -d '{
    "url": "http://<decode-machine>:8000",
    "worker_type": "decode"
  }'

Prefill

curl -X POST http://localhost:8000/workers \
  -H "Content-Type: application/json" \
  -d '{
    "url": "http://<prefill-machine>:8000",
    "worker_type": "prefill",
    "bootstrap_port": 8998
  }'
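The two registration payloads above differ only in `worker_type` and the optional `bootstrap_port`. A minimal helper that builds them (hypothetical, for illustration only; the field names are taken from the curl examples, and the URLs are placeholders):

```python
import json

def worker_payload(url, worker_type, bootstrap_port=None):
    """Build the JSON body for POST /workers, matching the curl examples above.
    bootstrap_port is only supplied when registering a prefill worker."""
    payload = {"url": url, "worker_type": worker_type}
    if bootstrap_port is not None:
        payload["bootstrap_port"] = bootstrap_port
    return payload

decode = worker_payload("http://<decode-machine>:8000", "decode")
prefill = worker_payload("http://<prefill-machine>:8000", "prefill", bootstrap_port=8998)
print(json.dumps(decode))
print(json.dumps(prefill))
```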

Send a chat completion request to the router (works initially).

After running for a while, even with the system left idle and no chat completion requests being sent, the decode worker eventually crashes with the error below:

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:111: operator(): block: [0,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
[2026-04-23 10:44:55] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3616, in run_scheduler_process
    scheduler.run_event_loop()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1300, in run_event_loop
    dispatch_event_loop(self)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3513, in dispatch_event_loop
    scheduler.event_loop_normal_disagg_decode()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 1147, in event_loop_normal_disagg_decode
    self.process_decode_queue()
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 1316, in process_decode_queue
    req_conns, _ = self.disagg_decode_prealloc_queue.pop_preallocated()
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 725, in pop_preallocated
    kv_indices = kv_indices_full.cpu().numpy()
torch.AcceleratorError: CUDA error: device-side assert triggered
...
[2026-04-23 10:44:55] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
Killed
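Note that CUDA device-side asserts surface at the next host synchronization point, so the `.cpu()` call in `pop_preallocated` is where the error is reported, not necessarily where the out-of-range index was produced; rerunning with `CUDA_LAUNCH_BLOCKING=1` usually localizes the faulting op. The failing assert checks `-sizes[i] <= index && index < sizes[i]`. A host-side Python analogue of that same bounds check (a hypothetical helper for illustration, not sglang code):

```python
def kv_indices_in_bounds(kv_indices, pool_size):
    """Host-side analogue of the CUDA IndexKernel assert: every index into a
    KV pool of `pool_size` slots must satisfy -pool_size <= i < pool_size."""
    return all(-pool_size <= i < pool_size for i in kv_indices)

print(kv_indices_in_bounds([0, 3, 7], 8))  # prints True: all indices valid
print(kv_indices_in_bounds([0, 9], 8))     # prints False: 9 is out of bounds for 8 slots
```

If the crash is caused by a stale or corrupted `kv_indices_full` entry in the decode prealloc queue, a check of this shape would fail on the host before the tensor is ever used to index GPU memory.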

Environment

Python: 3.12.3 (main, Mar 3 2026, 12:15:18) [GCC 13.3.0]
CUDA available: True
GPU 0: NVIDIA H100 80GB HBM3
GPU 0 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 580.126.09
PyTorch: 2.9.1+cu129
sglang: 0.5.10.post1
sglang-kernel: 0.4.1
flashinfer_python: 0.6.7.post3
flashinfer_cubin: 0.6.7.post3
flashinfer_jit_cache: 0.6.7.post3+cu129
triton: 3.5.1
transformers: 5.3.0
torchao: 0.9.0
numpy: 2.3.5
aiohttp: 3.13.5
fastapi: 0.135.3
huggingface_hub: 1.9.2
interegular: 0.3.3
modelscope: 1.35.3
orjson: 3.11.8
outlines: 0.1.11
packaging: 26.0
psutil: 7.2.2
pydantic: 2.12.5
python-multipart: 0.0.24
pyzmq: 27.1.0
uvicorn: 0.44.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.32
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.92.0
litellm: Module Not Found
torchcodec: 0.9.1
NVIDIA Topology:
        GPU0    CPU Affinity    NUMA Affinity    GPU NUMA ID
GPU0    X       0-15            0                N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 1024
