48 changes: 35 additions & 13 deletions README.md
@@ -6,34 +6,40 @@

# turboquant-vllm

TurboQuant KV cache compression as a drop-in vLLM plugin. **3.76x KV cache compression with asymmetric K/V support, validated across 8 models.**
Reference implementation for TurboQuant KV cache compression in HuggingFace `DynamicCache`, with verification tooling for model compatibility and an optional vLLM plugin bridge. **3.76x KV cache compression with asymmetric K/V support, validated across 8 models.**

> Implements Google's [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) — the first KV cache quantization method with provably near-optimal distortion rates.

## Install
> Native vLLM TurboQuant is converging upstream in [vllm-project/vllm#38479](https://github.com/vllm-project/vllm/pull/38479). This repo is the HuggingFace/reference path: research workflows, architecture validation, and incubation of upstreamable ideas.

```bash
pip install turboquant-vllm[vllm]
```
## When to Use This Repo

Use `turboquant-vllm` when you want to:

- compress KV cache in HuggingFace `transformers` via `DynamicCache`
- validate whether TurboQuant will work on a model before deeper integration
- experiment on multimodal, heterogeneous, sliding-window, or shared-KV architectures
- prototype ideas that may later move upstream into native vLLM

Or with [uv](https://docs.astral.sh/uv/):
Use upstream native vLLM TurboQuant when you want production-oriented vLLM serving on a supported path.

## Install

```bash
uv add turboquant-vllm --extra vllm
pip install turboquant-vllm
```

## Quick Start (vLLM)

The TQ4 attention backend registers automatically via vLLM's plugin system:
Optional vLLM plugin extras:

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct --attention-backend CUSTOM
pip install turboquant-vllm[vllm]
```

No code changes required. The plugin compresses KV cache pages to 68 bytes/token/head (vs 256 bytes FP16). For asymmetric K/V compression:
Or with [uv](https://docs.astral.sh/uv/) (choose one command based on your workflow):

```bash
TQ4_K_BITS=4 TQ4_V_BITS=3 vllm serve meta-llama/Llama-3.1-8B-Instruct --attention-backend CUSTOM
uv add turboquant-vllm # HuggingFace/reference workflow
uv add turboquant-vllm --extra vllm # Include optional vLLM plugin extras
```

## Quick Start (HuggingFace)
@@ -49,6 +55,22 @@ compressed = CompressedDynamicCache(cache, head_dim=128, k_bits=4, v_bits=3)
# Compression happens transparently on every cache.update()
```

## Optional vLLM Plugin Bridge

If you specifically need the out-of-tree vLLM plugin path from this repo:

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct --attention-backend CUSTOM
```

For asymmetric K/V compression:

```bash
TQ4_K_BITS=4 TQ4_V_BITS=3 vllm serve meta-llama/Llama-3.1-8B-Instruct --attention-backend CUSTOM
```

This path is still supported, but it is no longer the primary project direction. For native vLLM TurboQuant, prefer the upstream in-tree path as it matures.

## Compression Quality

Per-layer minimum cosine similarity on real model activations (128-token prefill, RTX 4090):
32 changes: 23 additions & 9 deletions docs/index.md
@@ -1,22 +1,29 @@
# Project Documentation Index

> **Status (2026-04-04):** Reference implementation for HuggingFace transformers DynamicCache. For native vLLM TurboQuant, see [vllm-project/vllm#38479](https://github.com/vllm-project/vllm/pull/38479). This project complements that PR — HF transformers workflows here, production vLLM serving there.
> **Status (2026-04-04):** Reference implementation for HuggingFace transformers DynamicCache. For native vLLM TurboQuant, see [vllm-project/vllm#38479](https://github.com/vllm-project/vllm/pull/38479). This project complements that PR — HF transformers workflows, verification, and architecture research live here; production-oriented native vLLM serving belongs upstream.

## Project Overview

- **Type:** Python library (HuggingFace transformers DynamicCache patch)
- **Type:** Python library for HuggingFace DynamicCache compression, verification, and architecture research
- **Primary Language:** Python 3.12+
- **Architecture:** Layered library with strict DAG dependency flow

## Choose the Right Path

- **Use `turboquant-vllm`** when you want HuggingFace cache compression, model validation, multimodal experiments, or architecture/policy research
- **Use upstream native vLLM TurboQuant** when you want the in-tree serving path in vLLM
- **Use the plugin path here** only when you specifically need the out-of-tree bridge (`--attention-backend CUSTOM`)

Comment on lines +11 to +16

Copilot AI Apr 10, 2026

This PR is described as repositioning the docs landing/index, but the rendered MkDocs site uses docs/site as docs_dir and its landing page is docs/site/index.md (see mkdocs.yml). Updates here in docs/index.md won’t affect the site landing page, so the public docs may still present the old plugin-first messaging unless the MkDocs index is updated too or this file is explicitly linked from the site.

Owner Author

Fixed. I updated docs/site/index.md, which is the MkDocs landing page configured by mkdocs.yml, so the public docs now lead with the HuggingFace/reference positioning instead of the old plugin-first messaging.

## Quick Reference

- **Package:** `turboquant-vllm` (pip-installable, src-layout)
- **Tech Stack:** PyTorch + Triton + HuggingFace transformers + scipy
- **Build System:** uv (uv_build backend)
- **Entry Point:** `src/turboquant_vllm/__init__.py` (8 public exports)
- **CLI:** `python -m turboquant_vllm.benchmark`
- **vLLM Plugin:** Auto-registered via `vllm.general_plugins` entry point
- **vLLM Usage:** `vllm serve <model> --attention-backend CUSTOM`
- **Verification CLI:** `python -m turboquant_vllm.verify`
- **Benchmark CLI:** `python -m turboquant_vllm.benchmark`
- **Optional vLLM Plugin:** Auto-registered via `vllm.general_plugins` entry point
- **Optional vLLM Usage:** `vllm serve <model> --attention-backend CUSTOM`
- **Architecture Pattern:** Layered library (lloyd_max -> quantizer -> compressors -> kv_cache)

## Generated Documentation
@@ -52,11 +59,11 @@
## Getting Started

```bash
# Install from PyPI
pip install turboquant-vllm[vllm]
# Install from PyPI for HuggingFace/reference workflows
pip install turboquant-vllm

# Use with vLLM (no code changes)
vllm serve allenai/Molmo2-4B --attention-backend CUSTOM
# Verify whether a model is a good TQ candidate
python -m turboquant_vllm.verify --model allenai/Molmo2-4B --bits 4

# Or use with HuggingFace directly
from turboquant_vllm import CompressedDynamicCache
@@ -65,3 +72,10 @@ from transformers import DynamicCache
cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)
```

Optional plugin bridge for vLLM:

```bash
pip install turboquant-vllm[vllm]
vllm serve allenai/Molmo2-4B --attention-backend CUSTOM
```
41 changes: 31 additions & 10 deletions docs/site/index.md
@@ -1,34 +1,41 @@
# turboquant-vllm

TurboQuant KV cache compression as a drop-in vLLM plugin. **3.76x compression, near-identical output quality, one CLI flag to enable.**
Reference implementation for TurboQuant KV cache compression in HuggingFace `DynamicCache`, with verification tooling for model compatibility and an optional vLLM plugin bridge. **3.76x compression, near-identical output quality, and a clear split between reference workflows here and native vLLM serving upstream.**

> First open-source [TurboQuant](https://arxiv.org/abs/2504.19874) implementation (ICLR 2026) — paper to working vLLM plugin in 72 hours.
> Implements Google's [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026), the first KV cache quantization method with provably near-optimal distortion rates.

> Native vLLM TurboQuant is converging upstream in [vllm-project/vllm#38479](https://github.com/vllm-project/vllm/pull/38479). This repo is the HuggingFace/reference path: research workflows, verification, and architecture validation.

## Choose the Right Path

- Use `turboquant-vllm` for HuggingFace cache compression, model validation, multimodal experiments, and architecture research.
- Use upstream native vLLM TurboQuant when you want the in-tree serving path in vLLM.
- Use the plugin path here only when you specifically need the out-of-tree bridge (`--attention-backend CUSTOM`).

## Install

=== "pip"

```bash
pip install turboquant-vllm[vllm]
pip install turboquant-vllm
```

=== "uv"

```bash
uv add turboquant-vllm --extra vllm
uv add turboquant-vllm
```

## Quick Start

### vLLM (zero code changes)
Optional vLLM plugin extras:

```bash
vllm serve allenai/Molmo2-8B --attention-backend CUSTOM
pip install turboquant-vllm[vllm]
uv add turboquant-vllm --extra vllm
```

The TQ4 attention backend registers automatically via vLLM's plugin system. KV cache pages are compressed to 68 bytes/token/head (vs 256 bytes FP16).
## Quick Start

### HuggingFace
### HuggingFace (primary workflow)

```python
from transformers import DynamicCache
@@ -41,6 +48,20 @@ compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)
# Compression happens transparently on every cache.update()
```
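The README's HuggingFace quick start also shows asymmetric compression via `k_bits`/`v_bits`. A minimal sketch of the same call on this page's example — the constructor arguments mirror the README; model loading and generation are omitted:

```python
from transformers import DynamicCache
from turboquant_vllm import CompressedDynamicCache

# Asymmetric K/V compression: keys at 4 bits, values at 3 bits,
# using the k_bits/v_bits arguments shown in the README example.
cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, k_bits=4, v_bits=3)
```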

### Verify a model before deeper integration

```bash
python -m turboquant_vllm.verify --model allenai/Molmo2-4B --bits 4
```

### Optional vLLM plugin bridge

```bash
vllm serve allenai/Molmo2-8B --attention-backend CUSTOM
```

The TQ4 attention backend registers automatically via vLLM's plugin system. KV cache pages are compressed to 68 bytes/token/head (vs 256 bytes FP16).
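As a back-of-the-envelope check on the headline ratio (a sketch only — the exact 68-byte page layout is an assumption, not taken from the plugin source): with `head_dim=128`, FP16 stores 128 × 2 = 256 bytes per token per head, while 4-bit codes plus a few bytes of per-head quantization metadata land at roughly 68 bytes, which gives the ~3.76x figure.

```python
# Rough check of the 68 vs 256 bytes/token/head claim.
# Assumption: 68 bytes ≈ 128 dims × 4 bits of codes (64 bytes) + 4 bytes of
# quantization metadata; the real page layout may differ.
head_dim = 128
fp16_bytes = head_dim * 2          # 256 bytes/token/head in FP16
tq4_bytes = head_dim * 4 // 8 + 4  # 64 bytes of 4-bit codes + 4 bytes metadata = 68
print(fp16_bytes, tq4_bytes, round(fp16_bytes / tq4_bytes, 2))  # 256 68 3.76
```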

## Benchmark Results

Molmo2-4B (bfloat16, 36 layers) on RTX 4090 — 11K visual tokens from 2fps video + 256 generation tokens:
4 changes: 3 additions & 1 deletion docs/site/usage/container.md
@@ -1,6 +1,8 @@
# Container Deployment

turboquant-vllm ships a `Containerfile` that bakes the plugin into the official vLLM image. Build once, deploy anywhere — no runtime pip installs.
This page covers containerizing the optional vLLM plugin bridge from `turboquant-vllm`. Use it when you specifically need the repo's out-of-tree CUSTOM backend in a containerized environment.

For the primary project workflow, prefer the HuggingFace/reference path. For native vLLM TurboQuant serving, prefer the upstream in-tree path as it matures.

## Build the Image

2 changes: 1 addition & 1 deletion docs/site/usage/huggingface.md
@@ -1,6 +1,6 @@
# HuggingFace Integration

Use turboquant-vllm directly with HuggingFace's `DynamicCache` for research, benchmarking, or non-vLLM inference.
This is the primary workflow for `turboquant-vllm`: use HuggingFace's `DynamicCache` path for research, benchmarking, model validation, and architecture experimentation.

## Install

6 changes: 4 additions & 2 deletions docs/site/usage/vllm.md
@@ -1,6 +1,8 @@
# vLLM Plugin

turboquant-vllm registers as a custom attention backend via vLLM's plugin system. No code changes are needed — just install and pass a CLI flag.
This page documents the optional out-of-tree vLLM plugin bridge in `turboquant-vllm`. It is useful when you specifically want the repo's CUSTOM-backend path, but it is not the primary long-term direction of the project.

For native in-tree vLLM TurboQuant, prefer the upstream path as it matures. Use `turboquant-vllm` primarily for HuggingFace workflows, verification, and architecture research.

## Install

@@ -38,7 +40,7 @@ On each attention step, the backend:

## Configuration

The plugin uses sensible defaults. No additional configuration is needed beyond `--attention-backend CUSTOM`.
The plugin uses sensible defaults. No additional configuration is needed beyond `--attention-backend CUSTOM`, but treat this as a bridge path rather than the default recommendation for new users.

| vLLM Flag | Recommended | Notes |
|-----------|-------------|-------|