48 changes: 35 additions & 13 deletions README.md
@@ -6,34 +6,40 @@

# turboquant-vllm

TurboQuant KV cache compression as a drop-in vLLM plugin. **3.76x KV cache compression with asymmetric K/V support, validated across 8 models.**
Reference implementation for TurboQuant KV cache compression in HuggingFace `DynamicCache`, with verification tooling for model compatibility and an optional vLLM plugin bridge. **3.76x KV cache compression with asymmetric K/V support, validated across 8 models.**

> Implements Google's [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) — the first KV cache quantization method with provably near-optimal distortion rates.

## Install
> Native vLLM TurboQuant is converging upstream in [vllm-project/vllm#38479](https://github.com/vllm-project/vllm/pull/38479). This repo is the HuggingFace/reference path: research workflows, architecture validation, and incubation of upstreamable ideas.

```bash
pip install turboquant-vllm[vllm]
```
## When to Use This Repo

Use `turboquant-vllm` when you want to:

- compress KV cache in HuggingFace `transformers` via `DynamicCache`
- validate whether TurboQuant will work on a model before deeper integration
- experiment on multimodal, heterogeneous, sliding-window, or shared-KV architectures
- prototype ideas that may later move upstream into native vLLM

Or with [uv](https://docs.astral.sh/uv/):
Use upstream native vLLM TurboQuant when you want production-oriented vLLM serving on a supported path.

## Install

```bash
uv add turboquant-vllm --extra vllm
pip install turboquant-vllm
```

## Quick Start (vLLM)

The TQ4 attention backend registers automatically via vLLM's plugin system:
Optional vLLM plugin extras:

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct --attention-backend CUSTOM
pip install turboquant-vllm[vllm]
```

No code changes required. The plugin compresses KV cache pages to 68 bytes/token/head (vs 256 bytes FP16). For asymmetric K/V compression:
Or with [uv](https://docs.astral.sh/uv/) (choose one command based on your workflow):

```bash
TQ4_K_BITS=4 TQ4_V_BITS=3 vllm serve meta-llama/Llama-3.1-8B-Instruct --attention-backend CUSTOM
uv add turboquant-vllm # HuggingFace/reference workflow
uv add turboquant-vllm --extra vllm # Include optional vLLM plugin extras
```

## Quick Start (HuggingFace)
@@ -49,6 +55,22 @@ compressed = CompressedDynamicCache(cache, head_dim=128, k_bits=4, v_bits=3)
# Compression happens transparently on every cache.update()
```

## Optional vLLM Plugin Bridge

If you specifically need the out-of-tree vLLM plugin path from this repo:

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct --attention-backend CUSTOM
```

For asymmetric K/V compression:

```bash
TQ4_K_BITS=4 TQ4_V_BITS=3 vllm serve meta-llama/Llama-3.1-8B-Instruct --attention-backend CUSTOM
```

This path is still supported, but it is no longer the primary project direction. For native vLLM TurboQuant, prefer the upstream in-tree path as it matures.

## Compression Quality

Per-layer minimum cosine similarity on real model activations (128-token prefill, RTX 4090):
32 changes: 23 additions & 9 deletions docs/index.md
@@ -1,22 +1,29 @@
# Project Documentation Index

> **Status (2026-04-04):** Reference implementation for HuggingFace transformers DynamicCache. For native vLLM TurboQuant, see [vllm-project/vllm#38479](https://github.com/vllm-project/vllm/pull/38479). This project complements that PR — HF transformers workflows here, production vLLM serving there.
> **Status (2026-04-04):** Reference implementation for HuggingFace transformers DynamicCache. For native vLLM TurboQuant, see [vllm-project/vllm#38479](https://github.com/vllm-project/vllm/pull/38479). This project complements that PR — HF transformers workflows, verification, and architecture research live here; production-oriented native vLLM serving belongs upstream.

## Project Overview

- **Type:** Python library (HuggingFace transformers DynamicCache patch)
- **Type:** Python library for HuggingFace DynamicCache compression, verification, and architecture research
- **Primary Language:** Python 3.12+
- **Architecture:** Layered library with strict DAG dependency flow

## Choose the Right Path

- **Use `turboquant-vllm`** when you want HuggingFace cache compression, model validation, multimodal experiments, or architecture/policy research
- **Use upstream native vLLM TurboQuant** when you want the in-tree serving path in vLLM
- **Use the plugin path here** only when you specifically need the out-of-tree bridge (`--attention-backend CUSTOM`)

Comment on lines +11 to +16

Copilot AI Apr 10, 2026

This PR is described as repositioning the docs landing/index, but the rendered MkDocs site uses docs/site as docs_dir and its landing page is docs/site/index.md (see mkdocs.yml). Updates here in docs/index.md won’t affect the site landing page, so the public docs may still present the old plugin-first messaging unless the MkDocs index is updated too or this file is explicitly linked from the site.

Owner Author

Fixed. I updated docs/site/index.md, which is the MkDocs landing page configured by mkdocs.yml, so the public docs now lead with the HuggingFace/reference positioning instead of the old plugin-first messaging.

## Quick Reference

- **Package:** `turboquant-vllm` (pip-installable, src-layout)
- **Tech Stack:** PyTorch + Triton + HuggingFace transformers + scipy
- **Build System:** uv (uv_build backend)
- **Entry Point:** `src/turboquant_vllm/__init__.py` (8 public exports)
- **CLI:** `python -m turboquant_vllm.benchmark`
- **vLLM Plugin:** Auto-registered via `vllm.general_plugins` entry point
- **vLLM Usage:** `vllm serve <model> --attention-backend CUSTOM`
- **Verification CLI:** `python -m turboquant_vllm.verify`
- **Benchmark CLI:** `python -m turboquant_vllm.benchmark`
- **Optional vLLM Plugin:** Auto-registered via `vllm.general_plugins` entry point
- **Optional vLLM Usage:** `vllm serve <model> --attention-backend CUSTOM`
- **Architecture Pattern:** Layered library (lloyd_max -> quantizer -> compressors -> kv_cache)

## Generated Documentation
@@ -52,11 +59,11 @@
## Getting Started

```bash
# Install from PyPI
pip install turboquant-vllm[vllm]
# Install from PyPI for HuggingFace/reference workflows
pip install turboquant-vllm

# Use with vLLM (no code changes)
vllm serve allenai/Molmo2-4B --attention-backend CUSTOM
# Verify whether a model is a good TQ candidate
python -m turboquant_vllm.verify --model allenai/Molmo2-4B --bits 4

# Or use with HuggingFace directly
from turboquant_vllm import CompressedDynamicCache
@@ -65,3 +72,10 @@ from transformers import DynamicCache
cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)
```

Optional plugin bridge for vLLM:

```bash
pip install turboquant-vllm[vllm]
vllm serve allenai/Molmo2-4B --attention-backend CUSTOM
```
41 changes: 31 additions & 10 deletions docs/site/index.md
@@ -1,34 +1,41 @@
# turboquant-vllm

TurboQuant KV cache compression as a drop-in vLLM plugin. **3.76x compression, near-identical output quality, one CLI flag to enable.**
Reference implementation for TurboQuant KV cache compression in HuggingFace `DynamicCache`, with verification tooling for model compatibility and an optional vLLM plugin bridge. **3.76x compression, near-identical output quality, and a clear split between reference workflows here and native vLLM serving upstream.**

> First open-source [TurboQuant](https://arxiv.org/abs/2504.19874) implementation (ICLR 2026) — paper to working vLLM plugin in 72 hours.
> Implements Google's [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026), the first KV cache quantization method with provably near-optimal distortion rates.

> Native vLLM TurboQuant is converging upstream in [vllm-project/vllm#38479](https://github.com/vllm-project/vllm/pull/38479). This repo is the HuggingFace/reference path: research workflows, verification, and architecture validation.

## Choose the Right Path

- Use `turboquant-vllm` for HuggingFace cache compression, model validation, multimodal experiments, and architecture research.
- Use upstream native vLLM TurboQuant when you want the in-tree serving path in vLLM.
- Use the plugin path here only when you specifically need the out-of-tree bridge (`--attention-backend CUSTOM`).

## Install

=== "pip"

```bash
pip install turboquant-vllm[vllm]
pip install turboquant-vllm
```

=== "uv"

```bash
uv add turboquant-vllm --extra vllm
uv add turboquant-vllm
```

## Quick Start

### vLLM (zero code changes)
Optional vLLM plugin extras:

```bash
vllm serve allenai/Molmo2-8B --attention-backend CUSTOM
pip install turboquant-vllm[vllm]
uv add turboquant-vllm --extra vllm
```

The TQ4 attention backend registers automatically via vLLM's plugin system. KV cache pages are compressed to 68 bytes/token/head (vs 256 bytes FP16).
## Quick Start

### HuggingFace
### HuggingFace (primary workflow)

```python
from transformers import DynamicCache
@@ -41,6 +48,20 @@ compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)
# Compression happens transparently on every cache.update()
```
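The README's HuggingFace quick start also shows asymmetric compression via `k_bits`/`v_bits`. A minimal sketch of the same call on this page's example — the constructor arguments mirror the README; model loading and generation are omitted:

```python
from transformers import DynamicCache
from turboquant_vllm import CompressedDynamicCache

# Asymmetric K/V compression: keys at 4 bits, values at 3 bits,
# using the k_bits/v_bits arguments shown in the README example.
cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, k_bits=4, v_bits=3)
```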

### Verify a model before deeper integration

```bash
python -m turboquant_vllm.verify --model allenai/Molmo2-4B --bits 4
```

### Optional vLLM plugin bridge

```bash
vllm serve allenai/Molmo2-8B --attention-backend CUSTOM
```

The TQ4 attention backend registers automatically via vLLM's plugin system. KV cache pages are compressed to 68 bytes/token/head (vs 256 bytes FP16).
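As a back-of-the-envelope check on the headline ratio (a sketch only — the exact 68-byte page layout is an assumption, not taken from the plugin source): with `head_dim=128`, FP16 stores 128 × 2 = 256 bytes per token per head, while 4-bit codes plus a few bytes of per-head quantization metadata land at roughly 68 bytes, which gives the ~3.76x figure.

```python
# Rough check of the 68 vs 256 bytes/token/head claim.
# Assumption: 68 bytes ≈ 128 dims × 4 bits of codes (64 bytes) + 4 bytes of
# quantization metadata; the real page layout may differ.
head_dim = 128
fp16_bytes = head_dim * 2          # 256 bytes/token/head in FP16
tq4_bytes = head_dim * 4 // 8 + 4  # 64 bytes of 4-bit codes + 4 bytes metadata = 68
print(fp16_bytes, tq4_bytes, round(fp16_bytes / tq4_bytes, 2))  # 256 68 3.76
```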

## Benchmark Results

Molmo2-4B (bfloat16, 36 layers) on RTX 4090 — 11K visual tokens from 2fps video + 256 generation tokens:
4 changes: 3 additions & 1 deletion docs/site/usage/container.md
@@ -1,6 +1,8 @@
# Container Deployment

turboquant-vllm ships a `Containerfile` that bakes the plugin into the official vLLM image. Build once, deploy anywhere — no runtime pip installs.
This page covers containerizing the optional vLLM plugin bridge from `turboquant-vllm`. Use it when you specifically need the repo's out-of-tree CUSTOM backend in a containerized environment.

For the primary project workflow, prefer the HuggingFace/reference path. For native vLLM TurboQuant serving, prefer the upstream in-tree path as it matures.

## Build the Image

2 changes: 1 addition & 1 deletion docs/site/usage/huggingface.md
@@ -1,6 +1,6 @@
# HuggingFace Integration

Use turboquant-vllm directly with HuggingFace's `DynamicCache` for research, benchmarking, or non-vLLM inference.
This is the primary workflow for `turboquant-vllm`: use HuggingFace's `DynamicCache` path for research, benchmarking, model validation, and architecture experimentation.

## Install

6 changes: 4 additions & 2 deletions docs/site/usage/vllm.md
@@ -1,6 +1,8 @@
# vLLM Plugin

turboquant-vllm registers as a custom attention backend via vLLM's plugin system. No code changes are needed — just install and pass a CLI flag.
This page documents the optional out-of-tree vLLM plugin bridge in `turboquant-vllm`. It is useful when you specifically want the repo's CUSTOM-backend path, but it is not the primary long-term direction of the project.

For native in-tree vLLM TurboQuant, prefer the upstream path as it matures. Use `turboquant-vllm` primarily for HuggingFace workflows, verification, and architecture research.

## Install

@@ -38,7 +40,7 @@ On each attention step, the backend:

## Configuration

The plugin uses sensible defaults. No additional configuration is needed beyond `--attention-backend CUSTOM`.
The plugin uses sensible defaults. No additional configuration is needed beyond `--attention-backend CUSTOM`, but treat this as a bridge path rather than the default recommendation for new users.

| vLLM Flag | Recommended | Notes |
|-----------|-------------|-------|