Checklist
Describe the bug
When running PD disaggregated inference (router + separate prefill/decode workers) with Qwen/Qwen3.6-27B, the system initially works: workers register, /chat/completions succeeds, and tokens are generated. After some time, the decode server crashes with a CUDA device-side assert (`index out of bounds` in IndexKernel.cu; full traceback in the Reproduction section below).
Decode/prefill/router logs are attached.
decode-qwen.log
prefill-qwen.log
router-qwen.log
Reproduction
- Start router on a CPU machine:

      python -m sglang_router.launch_router \
        --host 0.0.0.0 \
        --port 8000 \
        --pd-disaggregation \
        --prefill-policy cache_aware
- Start decode worker:

      python -m sglang.launch_server \
        --model-path Qwen/Qwen3.6-27B \
        --disaggregation-mode decode \
        --disaggregation-transfer-backend nixl \
        --host 0.0.0.0 \
        --port 8000 \
        --log-level debug
- Start prefill worker:

      python -m sglang.launch_server \
        --model-path Qwen/Qwen3.6-27B \
        --disaggregation-mode prefill \
        --disaggregation-transfer-backend nixl \
        --host 0.0.0.0 \
        --port 8000 \
        --disaggregation-bootstrap-port 8998 \
        --log-level debug
- Register the workers with the router via its REST API:

  Decode:

      curl -X POST http://localhost:8000/workers \
        -H "Content-Type: application/json" \
        -d '{
              "url": "http://<decode-machine>:8000",
              "worker_type": "decode"
            }'

  Prefill:

      curl -X POST http://localhost:8000/workers \
        -H "Content-Type: application/json" \
        -d '{
              "url": "http://<prefill-machine>:8000",
              "worker_type": "prefill",
              "bootstrap_port": 8998
            }'
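For scripted setups, the two registration calls above can be sketched in Python. This is a minimal sketch based on the curl calls in this report: the `/workers` endpoint, field names, and placeholder hostnames are taken from them, and `registration_payload` is a hypothetical helper name. It only builds the JSON bodies; actually POSTing them requires a running router.

```python
import json

def registration_payload(url, worker_type, bootstrap_port=None):
    """Build the JSON body for the router's /workers registration endpoint.

    Mirrors the curl examples above: prefill workers additionally pass
    the bootstrap port used by the disaggregation transfer backend.
    """
    body = {"url": url, "worker_type": worker_type}
    if bootstrap_port is not None:
        body["bootstrap_port"] = bootstrap_port
    return json.dumps(body)

# Payloads matching the two curl calls (placeholder hostnames kept as-is):
print(registration_payload("http://<decode-machine>:8000", "decode"))
print(registration_payload("http://<prefill-machine>:8000", "prefill", 8998))
```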
- Send a chat completion request to the router (this works initially).
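For completeness, a sketch of the request that initially succeeds. The OpenAI-compatible `/v1/chat/completions` path is an assumption (the report only names `/chat/completions`); the model name matches the launch commands above. `build_request` is a hypothetical helper that only constructs the HTTP request; sending it needs a running router.

```python
import json
from urllib import request

def build_request(router_url="http://localhost:8000"):
    """Construct a chat completion request for the router (not sent here)."""
    payload = {
        "model": "Qwen/Qwen3.6-27B",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32,
    }
    return request.Request(
        f"{router_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request()
print(req.get_method(), req.full_url)
```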
- After running for a while, the decode worker eventually crashes with the error below, even when the system is left idle with no further chat completion requests:
      /pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:111: operator(): block: [0,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
      [2026-04-23 10:44:55] Scheduler hit an exception: Traceback (most recent call last):
        File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3616, in run_scheduler_process
          scheduler.run_event_loop()
        File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1300, in run_event_loop
          dispatch_event_loop(self)
        File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3513, in dispatch_event_loop
          scheduler.event_loop_normal_disagg_decode()
        File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
          return func(*args, **kwargs)
        File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 1147, in event_loop_normal_disagg_decode
          self.process_decode_queue()
        File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 1316, in process_decode_queue
          req_conns, _ = self.disagg_decode_prealloc_queue.pop_preallocated()
        File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 725, in pop_preallocated
          kv_indices = kv_indices_full.cpu().numpy()
      torch.AcceleratorError: CUDA error: device-side assert triggered
      ...
      [2026-04-23 10:44:55] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
      Killed
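A note on where the assert surfaces (my interpretation, not confirmed by the logs): CUDA kernels launch asynchronously, so a device-side assert raised by an earlier indexing kernel is typically only reported at the next host synchronization point, which here is the `.cpu()` copy in `pop_preallocated`. Running with `CUDA_LAUNCH_BLOCKING=1` should pinpoint the actual faulting kernel. The failed check itself is PyTorch's standard indexing bounds test from the assertion message, which in plain Python terms is:

```python
def index_in_bounds(index, size):
    """Mirror of the CUDA IndexKernel bounds check from the assertion:
    `-sizes[i] <= index && index < sizes[i]` (negative indices wrap)."""
    return -size <= index < size

# Valid indices for a dimension of size 4 are -4..3 inclusive:
print([i for i in range(-6, 6) if index_in_bounds(i, 4)])
# -> [-4, -3, -2, -1, 0, 1, 2, 3]
```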
Environment
Python: 3.12.3 (main, Mar 3 2026, 12:15:18) [GCC 13.3.0]
CUDA available: True
GPU 0: NVIDIA H100 80GB HBM3
GPU 0 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 580.126.09
PyTorch: 2.9.1+cu129
sglang: 0.5.10.post1
sglang-kernel: 0.4.1
flashinfer_python: 0.6.7.post3
flashinfer_cubin: 0.6.7.post3
flashinfer_jit_cache: 0.6.7.post3+cu129
triton: 3.5.1
transformers: 5.3.0
torchao: 0.9.0
numpy: 2.3.5
aiohttp: 3.13.5
fastapi: 0.135.3
huggingface_hub: 1.9.2
interegular: 0.3.3
modelscope: 1.35.3
orjson: 3.11.8
outlines: 0.1.11
packaging: 26.0
psutil: 7.2.2
pydantic: 2.12.5
python-multipart: 0.0.24
pyzmq: 27.1.0
uvicorn: 0.44.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.32
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.92.0
litellm: Module Not Found
torchcodec: 0.9.1
NVIDIA Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-15 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Hypervisor vendor: KVM
ulimit soft: 1024