
[Do Not Merge] P2pNcclConnector PD disaggregation #327

Open
AAbouzeid wants to merge 7 commits into ovg-project:main from
AAbouzeid:pd-disagg-P2pNcclConnector-1

Conversation


AAbouzeid commented May 9, 2026

P2P NCCL Debug PR Description

Summary

This PR adds debug-only instrumentation and experiment scripts for investigating a
P2pNcclConnector hang in PD disaggregation.

No production fix is included. This is intentionally observational: the goal is to
identify where the hang occurs and whether it is caused by kvcached or upstream vLLM
P2P NCCL behavior.

What Changed

  • Added P2P NCCL debug logging around:
    • P2pNcclConnector.build_connector_meta
    • P2pNcclConnector.save_kv_layer
    • P2pNcclConnector.start_load_kv
    • P2pNcclEngine.send_tensor
    • P2pNcclEngine.recv_tensor
    • ZMQ/NCCL send/recv paths
  • Added a kvcached P2P debug harness:
    • experiments/10_p2p_nccl_debug.sh
  • Added a direct upstream vLLM P2P NCCL validation harness:
    • experiments/11_vllm_p2p_nccl_direct.sh

Key Finding

The hang is reproducible with kvcached disabled and also with direct upstream vLLM
P2P NCCL. It is not kvcached-specific.

The strongest signal is that the hang disappears when vLLM request ID randomization
is disabled:

VLLM_DISABLE_REQUEST_ID_RANDOMIZATION=1
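
In the passing runs, the flag was exported before launching the engines. A minimal sketch of the toggle (assumed usage; the actual harness invocation lives in experiments/10_p2p_nccl_debug.sh and is not reproduced here):

```shell
# Assumed usage: export the flag in the environment of BOTH the prefill
# and decode vLLM engines; setting it on only one side would still leave
# the internal request IDs mismatched.
export VLLM_DISABLE_REQUEST_ID_RANDOMIZATION=1
echo "randomization disabled: ${VLLM_DISABLE_REQUEST_ID_RANDOMIZATION}"
```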

Experiment Matrix

Run  Setup                                Request ID Randomization  Result  Log Dir
1    kvcached harness, kvcached disabled  on                        hangs   logs_p2p_debug/20260509_123315
2    kvcached harness, kvcached disabled  off                       passes  logs_p2p_debug/20260509_124428
3    direct upstream vLLM P2P NCCL        on                        hangs   logs_vllm_p2p_direct/20260509_125509
4    direct upstream vLLM P2P NCCL        off                       passes  logs_vllm_p2p_direct/20260509_130208

Root Cause Hypothesis

P2pNcclConnector keys transferred tensors as:

request_id#layer_name

However, prefill and decode run as separate vLLM engines. With request ID
randomization enabled, each engine derives a different internal
request.request_id from the same external X-Request-Id.

Observed example from the instrumented run:

Producer sends:

...-b5ce4b84#model.layers.0.self_attn.attn

Consumer waits for:

...-a08a4814#model.layers.0.self_attn.attn

The embedded external transfer prefix matches:

___prefill_addr_<addr>___decode_addr_<addr>_<uuid>

but the randomized internal suffix differs, so decode waits forever for a tensor key
that the producer never sends.
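
The mechanism can be sketched in a few lines of Python (the helper names below are illustrative stand-ins, not actual vLLM internals):

```python
import uuid

def internal_request_id(external_id: str) -> str:
    # Per-engine randomization: each engine appends its own fresh
    # random suffix to the same external request ID.
    return f"{external_id}-{uuid.uuid4().hex[:8]}"

def tensor_key(request_id: str, layer_name: str) -> str:
    # P2pNcclConnector keys transfers as request_id#layer_name.
    return f"{request_id}#{layer_name}"

external = "___prefill_addr_A___decode_addr_B_1234"
layer = "model.layers.0.self_attn.attn"

producer_key = tensor_key(internal_request_id(external), layer)  # prefill engine
consumer_key = tensor_key(internal_request_id(external), layer)  # decode engine

# The shared external prefix matches, but the random suffixes differ,
# so the consumer blocks on a key the producer never sends.
assert producer_key != consumer_key
assert producer_key.startswith(external)
```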

Interpretation

The stall is not at:

  • proxy registration
  • ZMQ connection setup
  • ZMQ ack
  • NCCL init
  • NCCL send

The instrumented kvcached run showed the producer successfully sending layer tensors
while decode waited on a missing tensor key. The direct upstream vLLM runs confirm the
same behavior without kvcached in the path.

Scope

This PR does not fix the issue.

It intentionally avoids:

  • tensor layout changes
  • .contiguous() changes
  • block count rewrites
  • timeouts
  • exceptions
  • production behavior changes

Next Step

The likely fix is for P2pNcclConnector to key P2P tensor transfers by a stable
transfer request ID, derived from the external embedded request ID, rather than the
per-engine randomized internal vLLM request ID.
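
One hedged sketch of such a keying scheme, assuming the randomized part is the trailing 8-hex-character suffix seen in the observed keys (the regex and helper are illustrative, not a proposed patch):

```python
import re

# Observed internal IDs differ only in a trailing "-xxxxxxxx" hex suffix
# (e.g. -b5ce4b84 vs -a08a4814). Stripping that suffix recovers the
# stable embedded transfer prefix shared by both engines. The 8-hex
# assumption comes from the two observed keys, not from vLLM source.
RANDOM_SUFFIX = re.compile(r"-[0-9a-f]{8}$")

def stable_tensor_key(internal_request_id: str, layer_name: str) -> str:
    stable_id = RANDOM_SUFFIX.sub("", internal_request_id)
    return f"{stable_id}#{layer_name}"

layer = "model.layers.0.self_attn.attn"
prefix = "___prefill_addr_A___decode_addr_B_1234"

# Prefill and decode now derive the same key from their differently
# randomized internal request IDs.
assert stable_tensor_key(f"{prefix}-b5ce4b84", layer) == \
       stable_tensor_key(f"{prefix}-a08a4814", layer)
```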

Logs attached below

p2p_nccl_request_id_evidence_20260509.tar.gz

@AAbouzeid AAbouzeid changed the title debugging [Do Not Merge] Fix kvcached + P2pNcclConnector PD disaggregation May 9, 2026
@AAbouzeid AAbouzeid changed the title [Do Not Merge] Fix kvcached + P2pNcclConnector PD disaggregation [Do Not Merge] P2pNcclConnector PD disaggregation May 9, 2026
cui36 (Collaborator) commented May 9, 2026

Thanks @AAbouzeid! Since it’s not an issue on our side, I think we can leave it for now and add support later once vLLM has better support for it.

ivanium (Collaborator) commented May 9, 2026

@AAbouzeid feel free to try the Nixl connector in vLLM which is better maintained :)

cui36 (Collaborator) commented May 9, 2026

> @AAbouzeid feel free to try the Nixl connector in vLLM which is better maintained :)

Yeah, we've already supported it in #313.
