[Bug] PD disaggregation decode worker crashes (CUDA IndexKernel “index out of bounds”) with Qwen/Qwen3.6-27B using NIXL backend (H100, sglang 0.5.10.post1) #23574

@Bihan

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

When running PD-disaggregated inference (router plus separate prefill/decode workers) with Qwen/Qwen3.6-27B, the system initially works: workers register, /chat/completions succeeds, and tokens are generated. After some time, the decode server crashes with a CUDA device-side assert (full traceback below).

Decode/prefill/router logs are attached.

decode-qwen.log
prefill-qwen.log
router-qwen.log

Reproduction

  1. Start router on a CPU machine:
python -m sglang_router.launch_router \
  --host 0.0.0.0 \
  --port 8000 \
  --pd-disaggregation \
  --prefill-policy cache_aware
  2. Start decode worker:
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-27B \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend nixl \
  --host 0.0.0.0 \
  --port 8000 \
  --log-level debug
  3. Start prefill worker:
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-27B \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend nixl \
  --host 0.0.0.0 \
  --port 8000 \
  --disaggregation-bootstrap-port 8998 \
  --log-level debug
  4. Register the workers with the router via its REST API:

Decode

curl -X POST http://localhost:8000/workers \
  -H "Content-Type: application/json" \
  -d '{
    "url": "http://<decode-machine>:8000",
    "worker_type": "decode"
  }'

Prefill

curl -X POST http://localhost:8000/workers \
  -H "Content-Type: application/json" \
  -d '{
    "url": "http://<prefill-machine>:8000",
    "worker_type": "prefill",
    "bootstrap_port": 8998
  }'
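The two registration payloads above differ only in `worker_type` and the optional `bootstrap_port`. A minimal helper that builds them (hypothetical, for illustration only; the field names are taken from the curl examples, and the URLs are placeholders):

```python
import json

def worker_payload(url, worker_type, bootstrap_port=None):
    """Build the JSON body for POST /workers, matching the curl examples above.
    bootstrap_port is only supplied when registering a prefill worker."""
    payload = {"url": url, "worker_type": worker_type}
    if bootstrap_port is not None:
        payload["bootstrap_port"] = bootstrap_port
    return payload

decode = worker_payload("http://<decode-machine>:8000", "decode")
prefill = worker_payload("http://<prefill-machine>:8000", "prefill", bootstrap_port=8998)
print(json.dumps(decode))
print(json.dumps(prefill))
```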

Send a chat completion request to the router (works initially).

After running for a while, even with the system left idle and no chat completion requests being sent, the decode worker eventually crashes with the error below:

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:111: operator(): block: [0,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
[2026-04-23 10:44:55] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3616, in run_scheduler_process
    scheduler.run_event_loop()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1300, in run_event_loop
    dispatch_event_loop(self)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3513, in dispatch_event_loop
    scheduler.event_loop_normal_disagg_decode()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 1147, in event_loop_normal_disagg_decode
    self.process_decode_queue()
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 1316, in process_decode_queue
    req_conns, _ = self.disagg_decode_prealloc_queue.pop_preallocated()
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 725, in pop_preallocated
    kv_indices = kv_indices_full.cpu().numpy()
torch.AcceleratorError: CUDA error: device-side assert triggered
...
[2026-04-23 10:44:55] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
Killed
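Note that CUDA device-side asserts surface at the next host synchronization point, so the `.cpu()` call in `pop_preallocated` is where the error is reported, not necessarily where the out-of-range index was produced; rerunning with `CUDA_LAUNCH_BLOCKING=1` usually localizes the faulting op. The failing assert checks `-sizes[i] <= index && index < sizes[i]`. A host-side Python analogue of that same bounds check (a hypothetical helper for illustration, not sglang code):

```python
def kv_indices_in_bounds(kv_indices, pool_size):
    """Host-side analogue of the CUDA IndexKernel assert: every index into a
    KV pool of `pool_size` slots must satisfy -pool_size <= i < pool_size."""
    return all(-pool_size <= i < pool_size for i in kv_indices)

print(kv_indices_in_bounds([0, 3, 7], 8))  # prints True: all indices valid
print(kv_indices_in_bounds([0, 9], 8))     # prints False: 9 is out of bounds for 8 slots
```

If the crash is caused by a stale or corrupted `kv_indices_full` entry in the decode prealloc queue, a check of this shape would fail on the host before the tensor is ever used to index GPU memory.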

Environment

Python: 3.12.3 (main, Mar 3 2026, 12:15:18) [GCC 13.3.0]
CUDA available: True
GPU 0: NVIDIA H100 80GB HBM3
GPU 0 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 580.126.09
PyTorch: 2.9.1+cu129
sglang: 0.5.10.post1
sglang-kernel: 0.4.1
flashinfer_python: 0.6.7.post3
flashinfer_cubin: 0.6.7.post3
flashinfer_jit_cache: 0.6.7.post3+cu129
triton: 3.5.1
transformers: 5.3.0
torchao: 0.9.0
numpy: 2.3.5
aiohttp: 3.13.5
fastapi: 0.135.3
huggingface_hub: 1.9.2
interegular: 0.3.3
modelscope: 1.35.3
orjson: 3.11.8
outlines: 0.1.11
packaging: 26.0
psutil: 7.2.2
pydantic: 2.12.5
python-multipart: 0.0.24
pyzmq: 27.1.0
uvicorn: 0.44.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.32
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.92.0
litellm: Module Not Found
torchcodec: 0.9.1
NVIDIA Topology:
        GPU0    CPU Affinity    NUMA Affinity    GPU NUMA ID
GPU0    X       0-15            0                N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 1024
