kvcompress

CI License: Apache-2.0 Python 3.10+

Clean-room research code for KV-cache compression. The repository contains:

  • a real TurboAngle reference codec
  • a clean-room TurboQuant reference with a second-rotation residual correction stage
  • an adaptive best-of-both hybrid policy
  • an experimental fused hybrid codec
  • a prototype compressed-KV OpenAI-compatible runtime

This repository is structured to stand on its own as a publishable research project.

kvcompress overview

kvcompress studies how to shrink LLM KV-cache memory while keeping the quality and runtime tradeoffs measurable. It ships reference codecs, benchmark scripts, GPU-oriented kernel work, and a runtime prototype rather than paper notes alone.

Layout

Path                 Purpose
kvcompress/python/   Reference codecs, metrics, policies, and hybrid logic
kvcompress/packing/  Bit-packing helpers
kvcompress/ops/      Triton kernel entry points for compressed-KV primitives
configs/             Layerwise MixedKV presets
bench/               Executable benchmark and ablation scripts
docs/                Specs, benchmark docs, checked-in results, and runtime notes
server/              Prototype compressed-KV runtime
tests/               Unit and runtime smoke tests

Docs

Quick Start

From the repository root:

pip install -e ".[dev]"
python -m pytest tests -v
python -m kvcompress.examples.roundtrip_demo
python bench/compare_codecs.py --batch 256 --head-dim 128
python bench/sweep_turboangle.py --preset early_boost_mistral_like
python bench/sweep_turboquant.py --use-residual-qjl
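The roundtrip demo exercises encode/decode on synthetic tensors. As a rough illustration of what such a roundtrip measures, here is a generic per-row int8 quantization sketch; it is not the repository's TurboAngle/TurboQuant codecs, and the function names are placeholders:

```python
import numpy as np

def quantize_int8(kv: np.ndarray):
    """Symmetric per-row int8 quantization with float32 scales (illustrative only)."""
    scale = np.abs(kv).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale).astype(np.float32)
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Fake K cache: (tokens, head_dim), mirroring the --batch/--head-dim shapes above.
k = np.random.default_rng(0).standard_normal((256, 128)).astype(np.float32)
q, s = quantize_int8(k)
rec = dequantize_int8(q, s)
# Rounding error is bounded by half a quantization step per element.
assert np.abs(rec - k).max() <= 0.5 * s.max() + 1e-6
```

A real codec comparison would report bits per element and reconstruction error side by side, which is what bench/compare_codecs.py is for.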

For CUDA/Triton validation on supported NVIDIA hardware:

pip install -e ".[dev,triton]"
python -m pytest tests/test_triton_ops.py -v
python bench/benchmark_triton_ops.py --batch 4096 --head-dim 128 --n-angles 256
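The --n-angles flag controls how finely pair angles are quantized. A minimal NumPy sketch of pairwise polar quantization follows; it is a simplified stand-in for the repository's actual TurboAngle encode/decode (which also applies an FWHT rotation and norm handling), and the function names are invented for illustration:

```python
import numpy as np

def polar_encode(v: np.ndarray, n_angles: int = 256):
    """Quantize consecutive pairs (v[2i], v[2i+1]) to an angle index plus a radius."""
    pairs = v.reshape(-1, 2)
    radius = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])  # angle in [-pi, pi)
    idx = np.round((theta + np.pi) / (2 * np.pi) * n_angles).astype(np.int64) % n_angles
    return idx, radius

def polar_decode(idx: np.ndarray, radius: np.ndarray, n_angles: int = 256) -> np.ndarray:
    theta = idx.astype(np.float64) * (2 * np.pi / n_angles) - np.pi
    return np.stack([radius * np.cos(theta), radius * np.sin(theta)], axis=1).reshape(-1)

v = np.random.default_rng(0).standard_normal(128)
rec = polar_decode(*polar_encode(v))
# Angle error is at most pi/n_angles, so the relative L2 error stays small.
assert np.linalg.norm(rec - v) <= np.linalg.norm(v) * (np.pi / 256) + 1e-9
```

With 256 angle bins the worst-case angle error is pi/256, so each pair reconstructs to within about 1.2% of its radius.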

Architecture

The project is organized around four connected layers:

  • codecs in kvcompress/python/
  • GPU-oriented ops in kvcompress/ops/
  • runtime integration in server/
  • measurement and reporting through bench/ and docs/results/

The overview image above was generated with paperbanana and is meant to make that split easy to scan.

Current Status

  • TurboAngle: implemented as a measurable reference with orthonormal FWHT, pairwise polar quantization, multiple norm modes, and layerwise MixedKV policies.
  • TurboQuant-style: kept as a baseline for ablations.
  • TurboQuant: implemented as a clean-room reference with random rotation, scalar/codebook quantization, norm handling, and a second-rotation residual correction stage.
  • Hybrid policy: implemented and benchmarked with a synthetic calibration proxy. Under the current matched-bitrate synthetic setup it selects TurboQuant everywhere, so it should not be mistaken for a fully paper-calibrated adaptive policy.
  • Fused hybrid: implemented and benchmarked as an experimental ablation; it has not yet beaten the adaptive policy on the main regime.
  • Runtime: implemented as an honest compressed-KV cache service, with local Gemma-family model-path integration, optional OpenAI-compatible proxying, and decoded shadow buffers that avoid repeated full-prefix Python decode work.
  • CUDA/Triton: real GPU-only kernels now exist for FWHT/sign and TurboAngle pair encode/decode, with codec integration behind TurboAngleParams(use_triton_ops=True).
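The status list mentions an orthonormal FWHT. For reference, a plain NumPy version of that transform can be sketched as follows; this is a readable illustration, not the Triton kernel in kvcompress/ops/:

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two.

    With the 1/sqrt(n) normalization the transform is its own inverse.
    """
    x = np.array(x, dtype=np.float64)
    n = x.size
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        # Butterfly stage: combine elements h apart within blocks of size 2h.
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

x = np.random.default_rng(0).standard_normal(128)
assert np.allclose(fwht(fwht(x)), x)                           # self-inverse
assert np.isclose(np.linalg.norm(fwht(x)), np.linalg.norm(x))  # norm-preserving
```

Because the orthonormal FWHT preserves norms and mixes coordinates, applying it before quantization spreads energy evenly across dimensions, which is why rotation-based codecs use it as a preprocessing step.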

Runtime Behavior

  • /v1/cache/sessions, /append, /attention, and /tokens/{pos} are the real compressed-KV APIs in this package.
  • /v1/chat/completions works through either a configured upstream model server or the local Gemma-family runner when enabled.
  • The package no longer fabricates chat completions or fake tool calls when no upstream exists.

Results

What Still Needs Improvement

  • Tune and validate the new Triton kernels on RTX hardware; the current development machine cannot execute them.
  • Broaden the Triton path beyond the current FWHT/sign and pairwise TurboAngle kernels.
  • Add real model-quality measurements such as perplexity, long-context retrieval, and throughput.
  • Run publishable RTX-tier experiments, not only the current local CPU/GTX-class compatibility checks.

License

Apache-2.0 (see LICENSE).
