Replies: 1 comment
I resolved this issue by extending the profiling duration. This suggests the root cause was that the CUPTI buffer did not fill up within the shorter time window, so the collected GPU stream data was never flushed and returned. Increasing the capture time lets the buffer reach its threshold and commit the trace on the P node.
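For anyone hitting the same thing, here is a minimal sketch of requesting a longer capture on both nodes. It assumes the servers expose a `/start_profile` HTTP endpoint that accepts a JSON body with `output_dir`, `num_steps`, and `activities`; those field names, the localhost addresses, and the output directories are assumptions to check against your SGLang version. Only the ports come from the launch commands in this thread.

```python
import json
import urllib.request

# Ports are taken from the launch commands in this thread. The host, the
# endpoint path, and the JSON field names are assumptions, not confirmed API.
PREFILL_URL = "http://127.0.0.1:36666"
DECODE_URL = "http://127.0.0.1:36667"


def build_profile_request(base_url, output_dir, num_steps):
    """Build (but do not send) a POST request that starts a profiling run."""
    payload = {
        "output_dir": output_dir,      # hypothetical directory for trace files
        "num_steps": num_steps,        # larger value means a longer capture
        "activities": ["CPU", "GPU"],  # assumed activity names
    }
    return urllib.request.Request(
        url=f"{base_url}/start_profile",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_profile(base_url, output_dir, num_steps, timeout=30.0):
    """Send the request and return the HTTP status code."""
    request = build_profile_request(base_url, output_dir, num_steps)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Example usage against running servers (not executed here):
#   start_profile(PREFILL_URL, "/tmp/profile_prefill", num_steps=200)
#   start_profile(DECODE_URL, "/tmp/profile_decode", num_steps=200)
```

Keep sending traffic while the capture runs. How many steps are enough for the P node to flush depends on the workload, so treat `200` as a placeholder rather than a recommended value.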
I am encountering an issue with the profiling functionality (start_profile) in a disaggregated Prefill-Decode (PD) deployment. Specifically, I am unable to capture GPU stream traces on the Prefill (P) node, whereas the Decode (D) node works as expected. The two servers are launched as follows:
CUDA_VISIBLE_DEVICES=4 sglang serve --model-path /mnt/Qwen3-32B/ --port 36666 --disaggregation-mode prefill --disaggregation-ib-device mlx5_bond_4
CUDA_VISIBLE_DEVICES=7 sglang serve --model-path /mnt/Qwen3-32B/ --port 36667 --disaggregation-mode decode --disaggregation-ib-device mlx5_bond_4