
[Do Not Merge] P2pNcclConnector PD disaggregation #327

Open
AAbouzeid wants to merge 7 commits into ovg-project:main from
AAbouzeid:pd-disagg-P2pNcclConnector-1

Conversation


AAbouzeid commented May 9, 2026

P2P NCCL Debug PR Description

Summary

This PR adds debug-only instrumentation and experiment scripts for investigating a
P2pNcclConnector hang in PD disaggregation.

No production fix is included. This is intentionally observational: the goal is to
identify where the hang occurs and whether it is caused by kvcached or upstream vLLM
P2P NCCL behavior.

What Changed

  • Added P2P NCCL debug logging around:
    • P2pNcclConnector.build_connector_meta
    • P2pNcclConnector.save_kv_layer
    • P2pNcclConnector.start_load_kv
    • P2pNcclEngine.send_tensor
    • P2pNcclEngine.recv_tensor
    • ZMQ/NCCL send/recv paths
  • Added a kvcached P2P debug harness:
    • experiments/10_p2p_nccl_debug.sh
  • Added a direct upstream vLLM P2P NCCL validation harness:
    • experiments/11_vllm_p2p_nccl_direct.sh

Key Finding

The hang is reproducible with kvcached disabled and also with direct upstream vLLM
P2P NCCL. It is not kvcached-specific.

The strongest signal is that the hang disappears when vLLM request ID randomization
is disabled:

VLLM_DISABLE_REQUEST_ID_RANDOMIZATION=1
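
In the passing runs, the flag was exported before launching the engines. A minimal sketch of the toggle (assumed usage; the actual harness invocation lives in experiments/10_p2p_nccl_debug.sh and is not reproduced here):

```shell
# Assumed usage: export the flag in the environment of BOTH the prefill
# and decode vLLM engines; setting it on only one side would still leave
# the internal request IDs mismatched.
export VLLM_DISABLE_REQUEST_ID_RANDOMIZATION=1
echo "randomization disabled: ${VLLM_DISABLE_REQUEST_ID_RANDOMIZATION}"
```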

Experiment Matrix

Run  Setup                                Request ID Randomization  Result  Log Dir
1    kvcached harness, kvcached disabled  on                        hangs   logs_p2p_debug/20260509_123315
2    kvcached harness, kvcached disabled  off                       passes  logs_p2p_debug/20260509_124428
3    direct upstream vLLM P2P NCCL        on                        hangs   logs_vllm_p2p_direct/20260509_125509
4    direct upstream vLLM P2P NCCL        off                       passes  logs_vllm_p2p_direct/20260509_130208

Root Cause Hypothesis

P2pNcclConnector keys transferred tensors as:

request_id#layer_name

However, prefill and decode run as separate vLLM engines. With request ID
randomization enabled, each engine derives a different internal
request.request_id from the same external X-Request-Id.

Observed example from the instrumented run:

Producer sends:

...-b5ce4b84#model.layers.0.self_attn.attn

Consumer waits for:

...-a08a4814#model.layers.0.self_attn.attn

The embedded external transfer prefix matches:

___prefill_addr_<addr>___decode_addr_<addr>_<uuid>

but the randomized internal suffix differs, so decode waits forever for a tensor key
that the producer never sends.
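
The mechanism can be sketched in a few lines of Python (the helper names below are illustrative stand-ins, not actual vLLM internals):

```python
import uuid

def internal_request_id(external_id: str) -> str:
    # Per-engine randomization: each engine appends its own fresh
    # random suffix to the same external request ID.
    return f"{external_id}-{uuid.uuid4().hex[:8]}"

def tensor_key(request_id: str, layer_name: str) -> str:
    # P2pNcclConnector keys transfers as request_id#layer_name.
    return f"{request_id}#{layer_name}"

external = "___prefill_addr_A___decode_addr_B_1234"
layer = "model.layers.0.self_attn.attn"

producer_key = tensor_key(internal_request_id(external), layer)  # prefill engine
consumer_key = tensor_key(internal_request_id(external), layer)  # decode engine

# The shared external prefix matches, but the random suffixes differ,
# so the consumer blocks on a key the producer never sends.
assert producer_key != consumer_key
assert producer_key.startswith(external)
```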

Interpretation

The stall is not at:

  • proxy registration
  • ZMQ connection setup
  • ZMQ ack
  • NCCL init
  • NCCL send

The instrumented kvcached run showed the producer successfully sending layer tensors
while decode waited on a missing tensor key. The direct upstream vLLM runs confirm the
same behavior without kvcached in the path.

Scope

This PR does not fix the issue.

It intentionally avoids:

  • tensor layout changes
  • .contiguous() changes
  • block count rewrites
  • timeouts
  • exceptions
  • production behavior changes

Next Step

The likely fix is for P2pNcclConnector to key P2P tensor transfers by a stable
transfer request ID, derived from the external embedded request ID, rather than the
per-engine randomized internal vLLM request ID.
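
One hedged sketch of such a keying scheme, assuming the randomized part is the trailing 8-hex-character suffix seen in the observed keys (the regex and helper are illustrative, not a proposed patch):

```python
import re

# Observed internal IDs differ only in a trailing "-xxxxxxxx" hex suffix
# (e.g. -b5ce4b84 vs -a08a4814). Stripping that suffix recovers the
# stable embedded transfer prefix shared by both engines. The 8-hex
# assumption comes from the two observed keys, not from vLLM source.
RANDOM_SUFFIX = re.compile(r"-[0-9a-f]{8}$")

def stable_tensor_key(internal_request_id: str, layer_name: str) -> str:
    stable_id = RANDOM_SUFFIX.sub("", internal_request_id)
    return f"{stable_id}#{layer_name}"

layer = "model.layers.0.self_attn.attn"
prefix = "___prefill_addr_A___decode_addr_B_1234"

# Prefill and decode now derive the same key from their differently
# randomized internal request IDs.
assert stable_tensor_key(f"{prefix}-b5ce4b84", layer) == \
       stable_tensor_key(f"{prefix}-a08a4814", layer)
```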

Logs attached below

p2p_nccl_request_id_evidence_20260509.tar.gz

@AAbouzeid AAbouzeid changed the title debugging [Do Not Merge] Fix kvcached + P2pNcclConnector PD disaggregation May 9, 2026
@AAbouzeid AAbouzeid changed the title [Do Not Merge] Fix kvcached + P2pNcclConnector PD disaggregation [Do Not Merge] P2pNcclConnector PD disaggregation May 9, 2026
cui36 (Collaborator) commented May 9, 2026

Thanks @AAbouzeid! Since it’s not an issue on our side, I think we can leave it for now and add support later once vLLM has better support for it.

ivanium (Collaborator) commented May 9, 2026

@AAbouzeid feel free to try the Nixl connector in vLLM which is better maintained :)

cui36 (Collaborator) commented May 9, 2026

> @AAbouzeid feel free to try the Nixl connector in vLLM which is better maintained :)

Yeah, we've already supported it in #313.
