kvcached Roadmap
Active workstreams, roughly in priority order. Each item links to the relevant issues/PRs; pick one up or file a new issue if something you care about isn't listed.
Production stability
Central kvcached use case; current behavior is a production stability blocker.
Serving stack compatibility
PD (prefill / decode) disaggregation support — #302, #311
Without this, kvcached is locked out of PD-based serving stacks. vLLM exposes PD via several
`KVConnector` implementations, each with its own assumptions about KV layout and block count; we have to validate kvcached against each pattern (`set_stride` crash + `num_blocks` assertion). Pending review + merge; the UCX-over-VMM transfer path still needs an end-to-end correctness run (see `examples/`).

KV cache CPU offloading in vLLM — PR #269
Major capacity multiplier for memory-bound deployments.
`--kv-offloading-size` error (#267)

Cross-version vLLM compatibility — ongoing
vLLM ships every few weeks; each break delays user upgrades. Most recent fix: PR #305 (block-pool integration).
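One common way to tolerate several upstream releases from a single package is to gate each integration patch on the installed vLLM version. A minimal sketch of that pattern follows; the function and patch names (`pick_block_pool_patch`, `block_pool_v1/v2`) are hypothetical, not actual kvcached symbols, and the version threshold is a placeholder.

```python
# Hypothetical sketch: choose which integration patch to apply based on the
# installed vLLM version, so one kvcached release spans several vLLM releases.

def parse_version(v: str) -> tuple[int, ...]:
    """Keep only the numeric dotted prefix ("0.8.1.post1" -> (0, 8, 1))."""
    parts = []
    for p in v.split("."):
        if p.isdigit():
            parts.append(int(p))
        else:
            break
    return tuple(parts)

def pick_block_pool_patch(vllm_version: str) -> str:
    # Placeholder threshold: pretend the block-pool refactor (cf. PR #305)
    # landed in 0.8.0. Tuple comparison handles differing lengths correctly.
    if parse_version(vllm_version) >= (0, 8, 0):
        return "block_pool_v2"
    return "block_pool_v1"
```

In practice the real version could be read at import time via `importlib.metadata.version("vllm")`, keeping the per-version branching in one place instead of scattered across the integration.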
Maintain a patched vLLM fork in the org repo
Carry all kvcached patches in a single org-owned vLLM branch, kept rebased against upstream. Goal: one-line install that gives users vLLM + every kvcached integration patch in sync, instead of asking them to apply patches by hand against whatever vLLM tag they happen to be on.
Performance
Layout overhead study: contiguous vs non-contiguous — see `layout-comparison.md`

Recent e2e numbers on attention-only show non-contig ≈ vanilla vLLM, while contig (today's default) carries the 30–50% overhead also seen in #299 / PR #319. Goal: confirm + generalise so we can flip the default.
`bench serve`, 3 seeds; `contig=false`; drop the `HYBRID_LINEAR` `ValueError`, deprecate the env var.

Detailed performance characterisation across configs
Per-config sweep so users can predict overhead before deploying, and so we have a stable reference for regression tracking. Axes: model size, num_layers, page_size, num_kv_buffers, batch / request rate, layout flag, single-instance vs multi-instance. Output: a reproducible harness + published table / dashboard.
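The cross-product over those axes is easy to enumerate reproducibly. A minimal sketch of the harness's config grid, assuming the axis names above; the value lists are placeholders, not the real sweep ranges:

```python
# Hypothetical sweep grid for the characterisation harness. Axis names follow
# the roadmap text; every value below is an illustrative placeholder.
from itertools import product

AXES = {
    "model_size": ["7B", "13B"],
    "num_layers": [32, 40],
    "page_size": [16, 64],
    "num_kv_buffers": [2, 4],
    "request_rate": [1, 8],      # stands in for the batch / request-rate axis
    "layout": ["contig", "non-contig"],
    "num_instances": [1, 2],     # single-instance vs multi-instance
}

def config_grid(axes=AXES):
    """Yield one config dict per point in the cross-product of all axes."""
    keys = list(axes)
    for values in product(*(axes[k] for k in keys)):
        yield dict(zip(keys, values))
```

Emitting each config as a dict (or a row keyed by these names) gives the stable reference table the item asks for, and makes regressions diffable run-to-run.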
Packaging / maintenance
Removes the per-torch-version rebuild; long-term maintenance + redistribution win.
Hardware backend expansion
All tracks exploratory / PoC stage.
`amd-support-init`, `amd-benchmark`

Docs / examples
Project website — landing page + docs (scope TBD)
Currently only a README; a proper site would help adoption and give a stable link for talks and papers.
MuxServe++ reproduction / model support — #231