docs(readme): reposition project around HF reference workflows #91
README.md:

````diff
@@ -1,22 +1,29 @@
 # Project Documentation Index
 
-> **Status (2026-04-04):** Reference implementation for HuggingFace transformers DynamicCache. For native vLLM TurboQuant, see [vllm-project/vllm#38479](https://github.com/vllm-project/vllm/pull/38479). This project complements that PR — HF transformers workflows here, production vLLM serving there.
+> **Status (2026-04-04):** Reference implementation for HuggingFace transformers DynamicCache. For native vLLM TurboQuant, see [vllm-project/vllm#38479](https://github.com/vllm-project/vllm/pull/38479). This project complements that PR — HF transformers workflows, verification, and architecture research live here; production-oriented native vLLM serving belongs upstream.
 
 ## Project Overview
 
-- **Type:** Python library (HuggingFace transformers DynamicCache patch)
+- **Type:** Python library for HuggingFace DynamicCache compression, verification, and architecture research
 - **Primary Language:** Python 3.12+
 - **Architecture:** Layered library with strict DAG dependency flow
 
+## Choose the Right Path
+
+- **Use `turboquant-vllm`** when you want HuggingFace cache compression, model validation, multimodal experiments, or architecture/policy research
+- **Use upstream native vLLM TurboQuant** when you want the in-tree serving path in vLLM
+- **Use the plugin path here** only when you specifically need the out-of-tree bridge (`--attention-backend CUSTOM`)
+
 ## Quick Reference
 
 - **Package:** `turboquant-vllm` (pip-installable, src-layout)
 - **Tech Stack:** PyTorch + Triton + HuggingFace transformers + scipy
 - **Build System:** uv (uv_build backend)
 - **Entry Point:** `src/turboquant_vllm/__init__.py` (8 public exports)
-- **CLI:** `python -m turboquant_vllm.benchmark`
-- **vLLM Plugin:** Auto-registered via `vllm.general_plugins` entry point
-- **vLLM Usage:** `vllm serve <model> --attention-backend CUSTOM`
+- **Verification CLI:** `python -m turboquant_vllm.verify`
+- **Benchmark CLI:** `python -m turboquant_vllm.benchmark`
+- **Optional vLLM Plugin:** Auto-registered via `vllm.general_plugins` entry point
+- **Optional vLLM Usage:** `vllm serve <model> --attention-backend CUSTOM`
 - **Architecture Pattern:** Layered library (lloyd_max -> quantizer -> compressors -> kv_cache)
 
 ## Generated Documentation
@@ -52,11 +59,11 @@
 ## Getting Started
 
 ```bash
-# Install from PyPI
-pip install turboquant-vllm[vllm]
+# Install from PyPI for HuggingFace/reference workflows
+pip install turboquant-vllm
 
-# Use with vLLM (no code changes)
-vllm serve allenai/Molmo2-4B --attention-backend CUSTOM
+# Verify whether a model is a good TQ candidate
+python -m turboquant_vllm.verify --model allenai/Molmo2-4B --bits 4
 
 # Or use with HuggingFace directly
 from turboquant_vllm import CompressedDynamicCache
@@ -65,3 +72,10 @@ from transformers import DynamicCache
 cache = DynamicCache()
 compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)
 ```
+
+Optional plugin bridge for vLLM:
+
+```bash
+pip install turboquant-vllm[vllm]
+vllm serve allenai/Molmo2-4B --attention-backend CUSTOM
+```
````
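To make the new Getting Started path concrete, here is a minimal sketch of the HuggingFace workflow the diff describes. Only the `CompressedDynamicCache(cache, head_dim=..., bits=4)` constructor and the imports come from the README; the model name, prompt, derived `head_dim`, and the assumption that the wrapped cache can be passed to `generate()` as `past_key_values` are illustrative additions, not claims about the library.

```python
# Hedged sketch: end-to-end HF usage under the assumptions stated above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache
from turboquant_vllm import CompressedDynamicCache

model_id = "Qwen/Qwen2.5-0.5B-Instruct"   # illustrative small model, not from the PR
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

# head_dim must match the model's attention head size; derived here rather than
# hard-coding the README's example value of 128.
head_dim = model.config.hidden_size // model.config.num_attention_heads
cache = CompressedDynamicCache(DynamicCache(), head_dim=head_dim, bits=4)

prompt = "TurboQuant compresses the KV cache by"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Assumption: the wrapper behaves like any transformers Cache object, so KV
# entries are compressed as they are appended during decoding.
output = model.generate(**inputs, max_new_tokens=32, past_key_values=cache, use_cache=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```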
**Review comment:** The uv install snippet currently shows two `uv add` commands back-to-back (base + `--extra vllm`), which reads like both should be run. This is redundant/confusing; it should be presented as alternative commands (either the base install or the install with the `vllm` extra) so users don't add the dependency twice or wonder which one is correct.
**Reply:** Fixed. README now labels the two `uv` commands as alternatives and explains which workflow each command is for, so users do not read them as sequential steps.
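For readers who cannot see the README hunk in question, the intent of that fix is roughly the following: present the two `uv` commands as mutually exclusive alternatives rather than sequential steps. The exact wording in the README may differ; this is a sketch of the pattern, and `uv add --extra` is used here in its standard sense of enabling an extra on the dependency being added.

```bash
# Alternative A: HuggingFace / reference workflows only
uv add turboquant-vllm

# Alternative B: instead, include the optional vLLM plugin bridge
uv add turboquant-vllm --extra vllm
```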