Unable to capture GPU stream via start_profile on P node in separated PD deployment (1P2D), while D node works fine #19763

ysay-d · 2026-03-03T08:56:25Z

I am encountering an issue with the profiling functionality (start_profile) in a separated Prefill-Decode (PD) deployment architecture. Specifically, I am unable to capture GPU stream traces on the Prefill (P) node, whereas the Decode (D) nodes work as expected.

SGLang Version: 0.5.9
Launch Command (P Node):
CUDA_VISIBLE_DEVICES=4 sglang serve --model-path /mnt/Qwen3-32B/ --port 36666 --disaggregation-mode prefill --disaggregation-ib-device mlx5_bond_4
Launch Command (D Node):
CUDA_VISIBLE_DEVICES=7 sglang serve --model-path /mnt/Qwen3-32B/ --port 36667 --disaggregation-mode decode --disaggregation-ib-device mlx5_bond_4

ysay-d · 2026-03-03T09:33:30Z

I resolved this issue by extending the profiling duration. This suggests the root cause was that the CUPTI buffer did not fill up within the shorter time window, so the collected GPU stream data was never flushed and returned. With a longer capture time, the buffer reaches its threshold and the trace is committed successfully on the P node.
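
For anyone who wants to script the longer capture window, here is a minimal sketch. It assumes the server accepts POST requests with a JSON body on `/start_profile` and on a matching `/stop_profile` endpoint; check the endpoint names and accepted payload fields against your SGLang version. The helper names `build_profile_request` and `profile_for`, the host, and the duration are hypothetical and chosen only for illustration.

```python
import json
import time
import urllib.request


def build_profile_request(base_url, action, payload=None):
    """Build a POST request for a profiling endpoint.

    `action` is the endpoint name, e.g. "start_profile" or "stop_profile"
    (assumed endpoint names; verify them for your server version).
    """
    body = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def profile_for(base_url, seconds, payload=None,
                send=urllib.request.urlopen, sleep=time.sleep):
    """Start profiling, keep the capture window open for `seconds`, then stop.

    `send` and `sleep` are injectable so the flow can be exercised
    without a running server.
    """
    send(build_profile_request(base_url, "start_profile", payload))
    try:
        # Keep the window open long enough for GPU activity to be recorded.
        sleep(seconds)
    finally:
        # Always try to stop, even if the wait is interrupted.
        send(build_profile_request(base_url, "stop_profile"))
```

As a usage example, `profile_for("http://127.0.0.1:36666", seconds=60)` would target the P node port from the launch command above, assuming the server runs on the local host. The 60 seconds is an arbitrary illustrative value, not a measured threshold; send prefill traffic while the window is open so there is GPU work to record.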
