Roadmap/Feature requested/TODOs---Start Here for New Contributors #273

@cui36

kvcached Roadmap

Active workstreams, roughly in priority order. Each item links to the relevant issues/PRs; pick one up or file a new issue if something you care about isn't listed.

Production stability

  • Fix KV pool exhaustion crash under multi-instance load (#262)
    Central kvcached use case; current behavior is a production stability blocker.

Serving stack compatibility

  • PD (prefill / decode) disaggregation support (#302, #311)
    Without this, kvcached is locked out of PD-based serving stacks. vLLM exposes PD via several KVConnector implementations, each with its own assumptions about KV layout and block count; we have to validate kvcached against each pattern.

    • NixlConnector (UCX / RDMA): PR #313 patches the two known incompatibilities (HND set_stride crash + num_blocks assertion). Pending review + merge; UCX-over-VMM transfer path still needs an end-to-end correctness run.
    • P2pNcclConnector (NCCL P2P): untested with kvcached. No block-count assertion in the connector itself, but NCCL-on-VMM compatibility unverified. Need a smoke test + fix list.
    • LMCacheConnectorV1: untested; LMCache pulls full tensors so layout/over-provision behavior may differ. Need a smoke test.
    • MooncakeConnector / SharedStorageConnector / MultiConnector: out of scope for now; revisit once the three above are green.
    • Add an example demonstrating PD + kvcached (per-connector minimal repro under examples/)
    • Clarify the support matrix on #302 once each connector lands
  • KV cache CPU offloading in vLLM (PR #269)
    Major capacity multiplier for memory-bound deployments.

    • Review + perf sanity check on PR #269
    • Fix vllm-0.16.0 + --kv-offloading-size error (#267)
  • Cross-version vLLM compatibility — ongoing
    vLLM ships every few weeks; each break delays user upgrades. Most recent fix: PR #305 (block-pool integration).

  • Maintain a patched vLLM fork in the org repo
    Carry all kvcached patches in a single org-owned vLLM branch, kept rebased against upstream. Goal: one-line install that gives users vLLM + every kvcached integration patch in sync, instead of asking them to apply patches by hand against whatever vLLM tag they happen to be on.

    • Decide branch / tag strategy (per-vLLM-minor branch vs rolling)
    • Set up CI that re-runs kvcached integration tests on each upstream sync
    • Document the install path in the README

Performance

  • Layout overhead study: contiguous vs non-contiguous — see layout-comparison.md
    Recent e2e numbers on attention-only show non-contig ≈ vanilla vLLM while contig (today's default) carries the 30–50% overhead also seen in #299 / PR #319. Goal: confirm + generalise so we can flip the default.

    • (1) Pin attention-only: clean A/B/C (vanilla / contig=true / contig=false), Qwen3-0.6B, bench serve, 3 seeds
    • (2) Generalise to hybrid: same harness on smallest Jamba that fits, contig=false vs vanilla
    • (3) Decide: if both hold → flip default to contig=false, drop HYBRID_LINEAR ValueError, deprecate the env var
  • Detailed performance characterisation across configs
    Per-config sweep so users can predict overhead before deploying, and so we have a stable reference for regression tracking. Axes: model size, num_layers, page_size, num_kv_buffers, batch / request rate, layout flag, single-instance vs multi-instance. Output: a reproducible harness + published table / dashboard.
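As a starting point for the harness, here is a minimal sketch of the two pieces both studies share: enumerating sweep points and reducing per-seed throughput to a single overhead figure. Everything in it is an assumption for illustration. The axis names and values are placeholders rather than real kvcached settings, and `sweep_configs` and `relative_overhead` are hypothetical helpers that do not exist in the repo.

```python
from itertools import product
from statistics import mean

# Hypothetical axes: placeholder values only, not measured or recommended settings.
AXES = {
    "layout": ["vanilla", "contig=true", "contig=false"],
    "instances": ["single", "multi"],
    "seed": [0, 1, 2],
}


def sweep_configs(axes):
    """Yield one dict per point in the Cartesian product of the axes."""
    names = list(axes)
    for values in product(*(axes[name] for name in names)):
        yield dict(zip(names, values))


def relative_overhead(baseline_runs, candidate_runs):
    """Fractional throughput loss of the candidate versus the baseline.

    Both arguments are per-seed throughput samples (higher is better).
    Returns 0.0 when the means match and 0.3 for a 30% slowdown.
    """
    base = mean(baseline_runs)
    if base <= 0:
        raise ValueError("baseline throughput must be positive")
    return 1.0 - mean(candidate_runs) / base
```

Averaging over seeds before taking the ratio keeps one noisy run from dominating the result. A real harness would probably also report the spread across seeds so regressions can be told apart from noise.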

Packaging / maintenance

  • Port C++ extension to libtorch stable ABI (#306, PR #308)
    Removes the per-torch-version rebuild; long-term maintenance + redistribution win.

Hardware backend expansion

All tracks exploratory / PoC stage.

  • AMD support (PR #248; branches amd-support-init, amd-benchmark)
  • MLX support (Apple Silicon) — draft PR #298
  • Arm64 support (#225)

Docs / examples

  • Project website — landing page + docs (scope TBD)
    Currently only a README; a proper site would help adoption and give a stable link for talks and papers.

  • MuxServe++ reproduction / model support (#231)
