Clean-room research code for KV-cache compression. The repository contains:
- a real TurboAngle reference codec
- a clean-room TurboQuant reference with a second-rotation residual correction stage
- an adaptive best-of-both hybrid policy
- an experimental fused hybrid codec
- a prototype compressed-KV OpenAI-compatible runtime
This repository is structured to stand on its own as a publishable research project.
kvcompress studies how to shrink LLM KV-cache memory while keeping the quality and runtime tradeoffs measurable. It includes reference codecs, benchmark scripts, GPU-oriented kernel work, and a runtime prototype instead of only paper notes.
| Path | Purpose |
|---|---|
| `kvcompress/python/` | Reference codecs, metrics, policies, and hybrid logic |
| `kvcompress/packing/` | Bit-packing helpers |
| `kvcompress/ops/` | Triton kernel entry points for compressed-KV primitives |
| `configs/` | Layerwise MixedKV presets |
| `bench/` | Executable benchmark and ablation scripts |
| `docs/` | Specs, benchmark docs, checked-in results, and runtime notes |
| `server/` | Prototype compressed-KV runtime |
| `tests/` | Unit and runtime smoke tests |
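To give a flavor of what the packing layer does, here is a minimal 4-bit pack/unpack sketch. This is an independent illustration of nibble packing, not the actual `kvcompress/packing/` API; function names here are invented for the example.

```python
import numpy as np

def pack_nibbles(codes):
    """Pack an even-length array of 4-bit codes (0..15) into bytes,
    low nibble first, halving storage versus one byte per code."""
    codes = np.asarray(codes, dtype=np.uint8)
    return (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)

def unpack_nibbles(packed):
    """Invert pack_nibbles: recover the original 4-bit codes."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out
```

Real codec paths add scale/offset metadata alongside the packed codes; this sketch only shows the bit layout.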
- Docs index: `docs/README.md`
- Algorithm notes: `docs/spec/IMPLEMENTATION_SPEC.md`
- Stack status: `docs/spec/STACK_NOTES.md`
- Benchmark protocol and reproducibility: `docs/benchmarks/BENCHMARK_PROTOCOL.md`
- Triton runbook: `docs/benchmarks/TRITON_KERNEL_RUNBOOK.md`
- Runtime note: `docs/runtime/OPENAI_COMPAT_COMPRESSED_KV.md`
From the repository root:

```shell
pip install -e ".[dev]"
python -m pytest tests -v
python -m kvcompress.examples.roundtrip_demo
python bench/compare_codecs.py --batch 256 --head-dim 128
python bench/sweep_turboangle.py --preset early_boost_mistral_like
python bench/sweep_turboquant.py --use-residual-qjl
```

For CUDA/Triton validation on supported NVIDIA hardware:

```shell
pip install -e ".[dev,triton]"
python -m pytest tests/test_triton_ops.py -v
python bench/benchmark_triton_ops.py --batch 4096 --head-dim 128 --n-angles 256
```

The project is organized around four connected layers:
- codecs in `kvcompress/python/`
- GPU-oriented ops in `kvcompress/ops/`
- runtime integration in `server/`
- measurement and reporting through `bench/` and `docs/results/`
The overview image above was generated with paperbanana and is meant to make that split easy to scan.
- TurboAngle: implemented as a measurable reference with orthonormal FWHT, pairwise polar quantization, multiple norm modes, and layerwise MixedKV policies.
- TurboQuant-style: kept as a baseline for ablations.
- TurboQuant: implemented as a clean-room reference with random rotation, scalar/codebook quantization, norm handling, and a second-rotation residual correction stage.
- Hybrid policy: implemented and benchmarked with a synthetic calibration proxy; under the current matched-bitrate synthetic setup it selects TurboQuant everywhere, which should not be mistaken for a full paper-calibrated adaptive policy.
- Fused hybrid: implemented and benchmarked as an experimental ablation; it has not yet beaten the adaptive policy on the main regime.
- Runtime: implemented as an honest compressed-KV cache service plus local Gemma-family model-path integration and optional OpenAI-compatible proxying, with decoded shadow buffers to avoid repeated full-prefix Python decode work.
- CUDA/Triton: real GPU-only kernels now exist for FWHT/sign and TurboAngle pair encode/decode, with codec integration behind `TurboAngleParams(use_triton_ops=True)`.
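The TurboAngle pipeline's core idea (rotate, then quantize coordinate pairs in polar form) can be sketched in a few lines of NumPy. This is a minimal illustration assuming a uniform angle grid and unquantized radii, not the repository's actual codec, which also quantizes norms and supports multiple norm modes:

```python
import numpy as np

def fwht_orthonormal(x):
    """Orthonormal Fast Walsh-Hadamard Transform (length must be a power
    of two). Scaled by 1/sqrt(n), the transform is its own inverse."""
    x = np.asarray(x, dtype=np.float64).copy()
    n = x.size
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

def polar_pair_encode(v, n_angles=256):
    """Quantize consecutive coordinate pairs to (angle bin, radius)."""
    pairs = v.reshape(-1, 2)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])
    r = np.hypot(pairs[:, 0], pairs[:, 1])
    idx = np.round((theta + np.pi) / (2 * np.pi) * n_angles).astype(int) % n_angles
    return idx, r

def polar_pair_decode(idx, r, n_angles=256):
    """Reconstruct pairs from quantized angles and stored radii."""
    theta = idx * (2 * np.pi / n_angles) - np.pi
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1).ravel()

# Roundtrip: rotate, encode pairs, decode, rotate back (FWHT is self-inverse).
rng = np.random.default_rng(0)
v = rng.standard_normal(128)
idx, r = polar_pair_encode(fwht_orthonormal(v))
rec = fwht_orthonormal(polar_pair_decode(idx, r))
rel_err = np.linalg.norm(rec - v) / np.linalg.norm(v)
```

With 256 angle bins the per-pair angular error is at most pi/256 radians, so the roundtrip relative error stays near one percent; the real codec trades this against the bits spent on norms.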
- `/v1/cache/sessions`, `/append`, `/attention`, and `/tokens/{pos}` are the real compressed-KV APIs in this package.
- `/v1/chat/completions` works through either a configured upstream model server or the local Gemma-family runner when enabled.
- The package no longer fabricates chat completions or fake tool calls when no upstream exists.
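The "decoded shadow buffer" idea mentioned above can be sketched compactly: the compressed entries stay the source of truth, while a decoded array is extended incrementally on append, so serving attention never re-decodes the full prefix in Python. The class name and the float16 stand-in codec below are illustrative assumptions, not the server's actual implementation:

```python
import numpy as np

class ShadowKVCache:
    """Compressed store plus an incrementally maintained decoded copy."""

    def __init__(self, head_dim):
        self.compressed = []                                # source of truth
        self.shadow = np.empty((0, head_dim), np.float32)   # decoded prefix

    def append(self, k):
        blob = k.astype(np.float16)          # stand-in "codec" for the sketch
        self.compressed.append(blob)
        decoded = blob.astype(np.float32)[None, :]
        # Decode only the new entry; the existing prefix is never re-decoded.
        self.shadow = np.concatenate([self.shadow, decoded])

    def keys(self):
        return self.shadow                   # O(1) access for attention
```

Without the shadow buffer, every attention call would pay a full-prefix decode; with it, decode cost is amortized to one entry per append.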
- Baseline codec comparison: `docs/results/CURRENT_RESULTS.md`
- TurboAngle MixedKV sweep: `docs/results/TURBOANGLE_MIXEDKV_RESULTS.md`
- TurboQuant residual results: `docs/results/TURBOQUANT_RESULTS.md`
- Adaptive hybrid policy results: `docs/results/HYBRID_POLICY_RESULTS.md`
- Fused hybrid ablation: `docs/results/FUSED_HYBRID_RESULTS.md`
- Runtime prototype status: `docs/results/RUNTIME_PROTOTYPE_RESULTS.md`
- Gemma-family model-path benchmark: `docs/results/GEMMA2_MODEL_PATH_RESULTS.md`
- Triton kernel validation guide: `docs/benchmarks/TRITON_KERNEL_RUNBOOK.md`
- Tune and validate the new Triton kernels on RTX hardware; the current development machine cannot execute them.
- Broaden the Triton path beyond the current FWHT/sign and pairwise TurboAngle kernels.
- Add real model-quality measurements such as perplexity, long-context retrieval, and throughput.
- Run publishable RTX-tier experiments, not only the current local CPU/GTX-class compatibility checks.
Apache-2.0 (see LICENSE).
