Clean-room research code for KV-cache compression. The repository contains:
- a real TurboAngle reference codec
- a clean-room TurboQuant reference with a second-rotation residual correction stage
- an adaptive best-of-both hybrid policy
- an experimental fused hybrid codec
- a prototype compressed-KV OpenAI-compatible runtime
This repository is structured to stand on its own as a publishable research project.
kvcompress studies how to shrink LLM KV-cache memory while keeping the quality and runtime tradeoffs measurable. It includes reference codecs, benchmark scripts, GPU-oriented kernel work, and a runtime prototype instead of only paper notes.
| Path | Purpose |
|---|---|
| `kvcompress/python/` | Reference codecs, metrics, policies, and hybrid logic |
| `kvcompress/packing/` | Bit-packing helpers |
| `kvcompress/ops/` | Triton kernel entry points for compressed-KV primitives |
| `configs/` | Layerwise MixedKV presets |
| `bench/` | Executable benchmark and ablation scripts |
| `docs/` | Specs, benchmark docs, checked-in results, and runtime notes |
| `server/` | Prototype compressed-KV runtime |
| `tests/` | Unit and runtime smoke tests |
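To give a flavor of what the packing layer does, here is a minimal 4-bit pack/unpack sketch. This is an independent illustration of nibble packing, not the actual `kvcompress/packing/` API; function names here are invented for the example.

```python
import numpy as np

def pack_nibbles(codes):
    """Pack an even-length array of 4-bit codes (0..15) into bytes,
    low nibble first, halving storage versus one byte per code."""
    codes = np.asarray(codes, dtype=np.uint8)
    return (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)

def unpack_nibbles(packed):
    """Invert pack_nibbles: recover the original 4-bit codes."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out
```

Real codec paths add scale/offset metadata alongside the packed codes; this sketch only shows the bit layout.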
- Docs index: `docs/README.md`
- Algorithm notes: `docs/spec/IMPLEMENTATION_SPEC.md`
- Stack status: `docs/spec/STACK_NOTES.md`
- Benchmark protocol and reproducibility: `docs/benchmarks/BENCHMARK_PROTOCOL.md`
- Triton runbook: `docs/benchmarks/TRITON_KERNEL_RUNBOOK.md`
- Runtime note: `docs/runtime/OPENAI_COMPAT_COMPRESSED_KV.md`
From the repository root:

```shell
pip install -e ".[dev]"
python -m pytest tests -v
python -m kvcompress.examples.roundtrip_demo
python bench/compare_codecs.py --batch 256 --head-dim 128
python bench/sweep_turboangle.py --preset early_boost_mistral_like
python bench/sweep_turboquant.py --use-residual-qjl
```

For CUDA/Triton validation on supported NVIDIA hardware:

```shell
pip install -e ".[dev,triton]"
python -m pytest tests/test_triton_ops.py -v
python bench/benchmark_triton_ops.py --batch 4096 --head-dim 128 --n-angles 256
```

The project is organized around four connected layers:
- codecs in `kvcompress/python/`
- GPU-oriented ops in `kvcompress/ops/`
- runtime integration in `server/`
- measurement and reporting through `bench/` and `docs/results/`
The overview image above was generated with paperbanana and is meant to make that split easy to scan.
- TurboAngle: implemented as a measurable reference with orthonormal FWHT, pairwise polar quantization, multiple norm modes, and layerwise MixedKV policies.
- TurboQuant-style: kept as a baseline for ablations.
- TurboQuant: implemented as a clean-room reference with random rotation, scalar/codebook quantization, norm handling, and a second-rotation residual correction stage.
- Hybrid policy: implemented and benchmarked with a synthetic calibration proxy; under the current matched-bitrate synthetic setup it selects TurboQuant everywhere, which should not be mistaken for a full paper-calibrated adaptive policy.
- Fused hybrid: implemented and benchmarked as an experimental ablation; it has not yet beaten the adaptive policy on the main regime.
- Runtime: implemented as an honest compressed-KV cache service plus local Gemma-family model-path integration and optional OpenAI-compatible proxying, with decoded shadow buffers to avoid repeated full-prefix Python decode work.
- CUDA/Triton: real GPU-only kernels now exist for FWHT/sign and TurboAngle pair encode/decode, with codec integration behind `TurboAngleParams(use_triton_ops=True)`.
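The TurboAngle pipeline's core idea (rotate, then quantize coordinate pairs in polar form) can be sketched in a few lines of NumPy. This is a minimal illustration assuming a uniform angle grid and unquantized radii, not the repository's actual codec, which also quantizes norms and supports multiple norm modes:

```python
import numpy as np

def fwht_orthonormal(x):
    """Orthonormal Fast Walsh-Hadamard Transform (length must be a power
    of two). Scaled by 1/sqrt(n), the transform is its own inverse."""
    x = np.asarray(x, dtype=np.float64).copy()
    n = x.size
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

def polar_pair_encode(v, n_angles=256):
    """Quantize consecutive coordinate pairs to (angle bin, radius)."""
    pairs = v.reshape(-1, 2)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])
    r = np.hypot(pairs[:, 0], pairs[:, 1])
    idx = np.round((theta + np.pi) / (2 * np.pi) * n_angles).astype(int) % n_angles
    return idx, r

def polar_pair_decode(idx, r, n_angles=256):
    """Reconstruct pairs from quantized angles and stored radii."""
    theta = idx * (2 * np.pi / n_angles) - np.pi
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1).ravel()

# Roundtrip: rotate, encode pairs, decode, rotate back (FWHT is self-inverse).
rng = np.random.default_rng(0)
v = rng.standard_normal(128)
idx, r = polar_pair_encode(fwht_orthonormal(v))
rec = fwht_orthonormal(polar_pair_decode(idx, r))
rel_err = np.linalg.norm(rec - v) / np.linalg.norm(v)
```

With 256 angle bins the per-pair angular error is at most pi/256 radians, so the roundtrip relative error stays near one percent; the real codec trades this against the bits spent on norms.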
- `/v1/cache/sessions`, `/append`, `/attention`, and `/tokens/{pos}` are the real compressed-KV APIs in this package.
- `/v1/chat/completions` works through either a configured upstream model server or the local Gemma-family runner when enabled.
- The package no longer fabricates chat completions or fake tool calls when no upstream exists.
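The "decoded shadow buffer" idea mentioned above can be sketched compactly: the compressed entries stay the source of truth, while a decoded array is extended incrementally on append, so serving attention never re-decodes the full prefix in Python. The class name and the float16 stand-in codec below are illustrative assumptions, not the server's actual implementation:

```python
import numpy as np

class ShadowKVCache:
    """Compressed store plus an incrementally maintained decoded copy."""

    def __init__(self, head_dim):
        self.compressed = []                                # source of truth
        self.shadow = np.empty((0, head_dim), np.float32)   # decoded prefix

    def append(self, k):
        blob = k.astype(np.float16)          # stand-in "codec" for the sketch
        self.compressed.append(blob)
        decoded = blob.astype(np.float32)[None, :]
        # Decode only the new entry; the existing prefix is never re-decoded.
        self.shadow = np.concatenate([self.shadow, decoded])

    def keys(self):
        return self.shadow                   # O(1) access for attention
```

Without the shadow buffer, every attention call would pay a full-prefix decode; with it, decode cost is amortized to one entry per append.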
- Baseline codec comparison: `docs/results/CURRENT_RESULTS.md`
- TurboAngle MixedKV sweep: `docs/results/TURBOANGLE_MIXEDKV_RESULTS.md`
- TurboQuant residual results: `docs/results/TURBOQUANT_RESULTS.md`
- Adaptive hybrid policy results: `docs/results/HYBRID_POLICY_RESULTS.md`
- Fused hybrid ablation: `docs/results/FUSED_HYBRID_RESULTS.md`
- Runtime prototype status: `docs/results/RUNTIME_PROTOTYPE_RESULTS.md`
- Gemma-family model-path benchmark: `docs/results/GEMMA2_MODEL_PATH_RESULTS.md`
- Triton kernel validation guide: `docs/benchmarks/TRITON_KERNEL_RUNBOOK.md`
- Tune and validate the new Triton kernels on RTX hardware; the current development machine cannot execute them.
- Broaden the Triton path beyond the current FWHT/sign and pairwise TurboAngle kernels.
- Add real model-quality measurements such as perplexity, long-context retrieval, and throughput.
- Run publishable RTX-tier experiments, not only the current local CPU/GTX-class compatibility checks.
Apache-2.0 (see LICENSE).
