kvcached incompatible with PD disaggregation (NixlConnector) #311

@AAbouzeid

Description

kvcached crashes during initialization when used with vLLM's PD (prefill-decode) disaggregation via NixlConnector. Two bugs were identified. A third concern (UCX + VMM memory compatibility for the actual transfer) appears to be addressed in recent UCX versions but remains untested.

PD disaggregation works correctly without kvcached using the same setup.

Related: #302

Environment

Component      Version / Detail
Machine        2x A100 80GB SXM (NVLink, NV12)
CUDA           12.8
Python         3.12.3
vLLM           0.19.0
kvcached       0.1.5
KV connector   NixlConnector (UCX transports: cuda_ipc, cuda_copy, tcp)
Model          Qwen/Qwen2.5-1.5B-Instruct

Steps to reproduce

Start prefill and decode instances with kvcached + NixlConnector:

Prefill (GPU 0):

ENABLE_KVCACHED=true KVCACHED_AUTOPATCH=1 \
UCX_TLS=cuda_ipc,cuda_copy,tcp \
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen2.5-1.5B-Instruct \
  --host 0.0.0.0 --port 8100 --max-model-len 1024 --gpu-memory-utilization 0.8 \
  --kv-transfer-config \
  '{"kv_connector":"NixlConnector","kv_role":"kv_producer","kv_buffer_device":"cuda"}'

Decode (GPU 1):

ENABLE_KVCACHED=true KVCACHED_AUTOPATCH=1 \
UCX_TLS=cuda_ipc,cuda_copy,tcp \
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen2.5-1.5B-Instruct \
  --host 0.0.0.0 --port 8200 --max-model-len 1024 --gpu-memory-utilization 0.8 \
  --kv-transfer-config \
  '{"kv_connector":"NixlConnector","kv_role":"kv_consumer","kv_buffer_device":"cuda"}'

Proxy (port 9100):

python3 disagg_prefill_proxy_server.py \
  --prefill-url http://localhost:8100 \
  --decode-url http://localhost:8200 \
  --port 9100

(from vllm/benchmarks/disagg_benchmarks/disagg_prefill_proxy_server.py)

Test request:

curl http://localhost:9100/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","prompt":"The capital of France is","max_tokens":30}'

Without kvcached: all four components work, responses are correct.

With kvcached: both prefill and decode crash during KV cache initialization before serving any requests.

Bug 1: Block count assertion mismatch

Error

AssertionError in nixl_connector.py:1667, register_kv_caches():
  "All kv cache tensors must have the same number of blocks"

Crash path:
  gpu_worker.py:536            → initialize_from_config()
  nixl_connector.py:452        → register_kv_caches()
  nixl_connector.py:1667       → assert cache.shape[0] == num_blocks  ← FAILS

Root cause

NixlConnector sets self.num_blocks = kv_cache_config.num_blocks (line 1098) and
asserts every KV cache tensor's shape[0] matches this value.

kvcached's alloc_kv_cache() (interfaces.py:159) computes num_blocks_per_layer
from total GPU memory and returns tensors with that larger shape (interfaces.py:171):

gpu_mem_bytes_per_layer_k_or_v = gpu_mem_bytes // num_layers // num_k_or_v
num_blocks_per_layer = gpu_mem_bytes_per_layer_k_or_v // block_mem_bytes

Value            Source                            Amount
cache.shape[0]   kvcached (virtual address space)  187,136
self.num_blocks  vLLM profiler                     512

Both sides are correct by design:

  • kvcached over-provisions virtual blocks for elastic resize - physical pages are
    mapped/unmapped on demand via CUDA VMM (cuMemAddressReserve + cuMemMap).
  • NixlConnector expects the exact physical block count for RDMA descriptor
    pre-registration.

Why relaxing the assertion should be safe

NixlConnector pre-creates per-block transfer descriptors (block ID = array index).
During a transfer, only blocks assigned by the scheduler are copied - and the
scheduler only assigns blocks that kvcached has physically mapped. Descriptors for
unmapped blocks (backed by the shared zero page) exist but are never referenced in
transfers. Registering 187K descriptors instead of 512 wastes some memory (~252MB
for descriptor metadata) but is functionally safe.
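
A minimal sketch of the relaxed check, assuming paraphrased names (kv_cache_tensors stands in for the connector's collection of KV cache tensors); this is not the actual vLLM code:

for cache in kv_cache_tensors:
    # Original check (nixl_connector.py:1667) required exact equality:
    #   assert cache.shape[0] == self.num_blocks, \
    #       "All kv cache tensors must have the same number of blocks"
    # Relaxed check: accept kvcached's over-provisioned virtual tensors as long as
    # they cover at least the blocks the scheduler can hand out.
    assert cache.shape[0] >= self.num_blocks, (
        "KV cache tensors must cover at least the profiled number of blocks")

# Build per-block transfer descriptors from the tensor's own first dimension, so that
# descriptor indices stay valid for every block kvcached can ever map.
self.num_blocks = kv_cache_tensors[0].shape[0]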

Bug 2: KV cache layout incompatibility

This bug is hidden behind Bug 1 - it only appears after patching the assertion.

Error

RuntimeError: set_stride is not allowed on a Tensor created from .data or .detach().

Crash path:
  gpu_worker.py:381            → determine_available_memory()
  gpu_model_runner.py:5815     → _init_minimal_kv_cache_for_profiling()
  gpu_model_runner.py:6718     → initialize_kv_cache_tensors()
                               → set_stride CRASH

Root cause

NixlConnector forces the HND (Head-Num-Dim) KV cache layout: its
get_required_kvcache_layout() returns "HND" for better transfer performance. vLLM then
tries to permute the KV cache tensors from NHD to HND, which ends up calling
set_stride().

kvcached creates its tensors via torch::from_blob() over CUDA virtual memory
(cuMemAddressReserve). PyTorch does not allow set_stride() on from_blob tensors
(non-owning data pointer), and kvcached explicitly supports only the NHD layout
(interfaces.py:111-112).
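
For reference, a minimal illustration of the two layouts as I understand them (the dimensions below are made up, not Qwen2.5-1.5B's actual ones):

import torch

# NHD: (num_blocks, block_size, num_kv_heads, head_dim) - the layout kvcached supports.
# HND: (num_blocks, num_kv_heads, block_size, head_dim) - the layout NixlConnector requests.
nhd = torch.empty(512, 16, 2, 128)
hnd = nhd.permute(0, 2, 1, 3)
print(nhd.shape)  # torch.Size([512, 16, 2, 128])
print(hnd.shape)  # torch.Size([512, 2, 16, 128])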

UCX + VMM memory compatibility

Even after fixing both bugs above, the actual GPU-to-GPU KV cache transfer has not
been tested with kvcached's VMM-allocated memory.

UCX's cuda_ipc transport historically used cuIpcGetMemHandle(), which does not support
cuMemCreate-allocated memory.

However, recent UCX versions have added VMM support for both transports:

  • cuda_copy: VMM detection added June 2024 (UCT/CUDA: Detect fabric vmm memory openucx/ucx#9867)
  • cuda_ipc: VMM handle type (UCT_CUDA_IPC_KEY_HANDLE_TYPE_VMM) using
    cuMemExportToShareableHandle with fabric handles instead of legacy
    cuIpcGetMemHandle (in UCX master, src/uct/cuda/cuda_ipc/cuda_ipc_md.c)

Whether the UCX version bundled with NIXL includes these changes is unknown. We have
not been able to get past the initialization crashes to attempt an actual transfer.

Other connectors tested

Connector           Block count assertion?               Compatible?    Tested?
NixlConnector       Yes (shape[0] == num_blocks)         No - crashes   Yes
P2pNcclConnector    No (num_block = kv_cache.shape[0])   Unknown        Baseline only (NCCL flaky)
LMCacheConnectorV1  No (stores tensors as-is)            Unknown        Baseline setup failed (complex multi-component config)

What we tried

We implemented a NixlConnectorPatch in kvcached's monkey-patch integration layer:

  1. Bug 2 fix (working): Override get_required_kvcache_layout() to return None
    (use the NHD default) when kvcached is active (sketched after this list). Verified:
    the set_stride crash no longer occurs.

  2. Bug 1 fix (in progress): Patch register_kv_caches() to adjust self.num_blocks
    to match the tensor's shape[0]. The challenge is that the internal method
    _sync_block_size_with_kernel() runs inside register_kv_caches() and can overwrite
    the patched value. Still debugging the patch mechanism.
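
A rough sketch of the layout override from item 1; the import path and the kvcached-activation check are simplified assumptions here, not the exact code on the branch:

import os
from vllm.distributed.kv_transfer.kv_connector.v1 import nixl_connector

_orig_layout = nixl_connector.NixlConnector.get_required_kvcache_layout

def _patched_layout(cls, vllm_config):
    # With kvcached active, fall back to vLLM's default (NHD) layout instead of HND,
    # since kvcached's from_blob tensors cannot be re-strided.
    if os.getenv("ENABLE_KVCACHED", "").lower() in ("1", "true"):
        return None
    return _orig_layout(vllm_config)

nixl_connector.NixlConnector.get_required_kvcache_layout = classmethod(_patched_layout)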

Branch: https://github.com/AAbouzeid/kvcached/tree/fix/pd-disagg-nixl-connector

Suggested fix directions

  1. Recommended - relax assertion in NixlConnector (vLLM side): Change line 1667
    from assert cache.shape[0] == num_blocks to assert cache.shape[0] >= num_blocks.
    Use cache.shape[0] for descriptor creation instead of kv_cache_config.num_blocks.
    This is the smallest change and is safe because transfers only touch
    scheduler-assigned (physically backed) blocks.

  2. Alternative - fix tensor shape in kvcached (interfaces.py:171): Return tensor
    views with shape[0] matching vLLM's requested_num_blocks instead of the full
    virtual capacity (see the sketch after this list). The underlying virtual
    reservation remains large. Caveat: elastic resize creates block IDs beyond
    requested_num_blocks, which would be out of bounds on the smaller tensor.

  3. Layout - add HND support to kvcached or make it optional in NixlConnector:
    Either support set_stride() on kvcached's VMM tensors, or make NixlConnector's
    HND layout preference overridable when the KV cache allocator doesn't support it.
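
A minimal sketch of direction 2; full_kv_tensor and requested_num_blocks are placeholder names for illustration, not kvcached's actual identifiers:

import torch

def narrowed_kv_view(full_kv_tensor: torch.Tensor, requested_num_blocks: int) -> torch.Tensor:
    # Hand vLLM a view whose first dimension matches the profiled block count, while
    # the full virtual reservation stays alive behind it.
    return full_kv_tensor.narrow(0, 0, requested_num_blocks)

# Caveat noted above: elastic resize can produce block IDs beyond requested_num_blocks,
# which would index out of bounds on this smaller view.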

Happy to work on a fix and test it.
