[Do Not Merge] P2pNcclConnector PD disaggregation#327
Open
AAbouzeid wants to merge 7 commits intoovg-project:mainfrom
Open
[Do Not Merge] P2pNcclConnector PD disaggregation#327AAbouzeid wants to merge 7 commits intoovg-project:mainfrom
AAbouzeid wants to merge 7 commits intoovg-project:mainfrom
Conversation
Collaborator
|
Thanks @AAbouzeid! Since it’s not an issue on our side, I think we can leave it for now and add support later once vLLM has better support for it. |
Collaborator
|
@AAbouzeid feel free to try the Nixl connector in vLLM which is better maintained :) |
Collaborator
Yeah we've already supported in #313 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
P2P NCCL Debug PR Description
Summary
This PR adds debug-only instrumentation and experiment scripts for investigating a
P2pNcclConnectorhang in PD disaggregation.No production fix is included. This is intentionally observational: the goal is to
identify where the hang occurs and whether it is caused by kvcached or upstream vLLM
P2P NCCL behavior.
What Changed
P2pNcclConnector.build_connector_metaP2pNcclConnector.save_kv_layerP2pNcclConnector.start_load_kvP2pNcclEngine.send_tensorP2pNcclEngine.recv_tensorexperiments/10_p2p_nccl_debug.shexperiments/11_vllm_p2p_nccl_direct.shKey Finding
The hang is reproducible with kvcached disabled and also with direct upstream vLLM
P2P NCCL. It is not kvcached-specific.
The strongest signal is that the hang disappears when vLLM request ID randomization
is disabled:
Experiment Matrix
logs_p2p_debug/20260509_123315logs_p2p_debug/20260509_124428logs_vllm_p2p_direct/20260509_125509logs_vllm_p2p_direct/20260509_130208Root Cause Hypothesis
P2pNcclConnectorkeys transferred tensors as:However, prefill and decode run as separate vLLM engines. With request ID
randomization enabled, each engine derives a different internal
request.request_idfrom the same externalX-Request-Id.Observed example from the instrumented run:
Producer sends:
Consumer waits for:
The embedded external transfer prefix matches:
but the randomized internal suffix differs, so decode waits forever for a tensor key
that producer never sends.
Interpretation
The stall is not at:
The instrumented kvcached run showed producer successfully sending layer tensors and
decode waiting on a missing tensor key. The direct upstream vLLM runs confirm the
same behavior without kvcached in the path.
Scope
This PR does not fix the issue.
It intentionally avoids:
.contiguous()changesNext Step
The likely fix is for
P2pNcclConnectorto key P2P tensor transfers by a stabletransfer request ID, derived from the external embedded request ID, rather than the
per-engine randomized internal vLLM request ID.
Logs attached below
p2p_nccl_request_id_evidence_20260509.tar.gz