
turboquant

Here are 95 public repositories matching this topic...

Self-hosted, auto-clustering AI agent OS for low-cost consumer hardware — the computer you already have, an Orange Pi or Raspberry Pi, a Mac, and so on. Desktop shell, app store, agent deployment, distributed compute cluster. Memory by taOSmd.

  • Updated Apr 29, 2026
  • Python

Production-grade runtime patches for vLLM (45+ patches) — Qwen3.6-35B-A3B-FP8 hybrid GDN+MoE on NVIDIA Ampere (SM 80-86). 127 tok/s MTP free-form, 99 tok/s suffix tool-call (max 175). TurboQuant k8v4 KV cache, 256K context verified to 252K. P67 multi-query kernel + Suffix Decoding + adaptive ngram K. Zero source modifications.

  • Updated Apr 27, 2026
  • Python

Unified KV cache compression for LLM inference — TurboQuant, IsoQuant, PlanarQuant, TriAttention. 10 methods, GPU-validated, with a multi-GPU planner. Compress the KV cache 5-80x to run bigger models, longer contexts, and more agents on your GPU.

  • Updated Apr 26, 2026
  • Python
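To make the trade-off concrete: KV-cache compression stores the attention keys and values in a low-bit format and dequantizes them on read. The sketch below is a minimal per-token symmetric int8 scheme in NumPy — it is not any of the repo's methods (TurboQuant, IsoQuant, PlanarQuant, TriAttention are more sophisticated), only an illustration of why compressing the cache frees memory for longer contexts.

```python
# Minimal sketch of KV-cache quantization: symmetric int8 with one scale
# per token row. Hypothetical helper names; not the repo's actual API.
import numpy as np

def quantize_kv(kv: np.ndarray):
    """Quantize a float32 KV tensor (tokens, head_dim) to int8 + per-row scales."""
    scale = np.abs(kv).max(axis=-1, keepdims=True) / 127.0  # per-token scale
    scale = np.where(scale == 0, 1.0, scale)                # guard empty rows
    q = np.round(kv / scale).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

kv = np.random.randn(128, 64).astype(np.float32)   # toy cache: 128 tokens
q, scale = quantize_kv(kv)
recon = dequantize_kv(q, scale)

ratio = kv.nbytes / (q.nbytes + scale.nbytes)      # close to 4x for fp32 -> int8
err = np.abs(kv - recon).max()                     # bounded by half a scale step
```

Real systems push far beyond 4x by going to 2-4 bits with smarter codebooks, which is where the 5-80x range in the description comes from.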

Near-optimal vector quantization from Google's ICLR 2026 paper — 95% recall, 5x compression, zero preprocessing; a pure-Python FAISS replacement.

  • Updated Mar 28, 2026
  • Python
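The recall-vs-compression trade-off this repo targets can be demonstrated with an intentionally simple baseline: quantize a vector database to int8, then check how often nearest-neighbor results on the compressed vectors match exact search. This is a hedged sketch, not the paper's algorithm.

```python
# Illustration only: 8-bit scalar quantization as a stand-in for the paper's
# near-optimal quantizer, measuring recall@10 against exact brute-force search.
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 32)).astype(np.float32)       # database vectors
queries = rng.standard_normal((20, 32)).astype(np.float32)

# Quantize the database: one scale per dimension, 4x smaller than fp32.
scale = np.abs(db).max(axis=0) / 127.0
codes = np.round(db / scale).astype(np.int8)
recon = codes.astype(np.float32) * scale                      # dequantized view

def topk(q: np.ndarray, vectors: np.ndarray, k: int = 10) -> set:
    """Exact k-nearest-neighbor IDs by squared L2 distance."""
    d = ((vectors - q) ** 2).sum(axis=1)
    return set(np.argsort(d)[:k])

# Recall@10: overlap between exact results and results on quantized vectors.
recalls = [len(topk(q, db) & topk(q, recon)) / 10 for q in queries]
recall_at_10 = float(np.mean(recalls))
```

A naive int8 quantizer already keeps recall high at 4x compression on easy synthetic data; the paper's contribution is holding recall near-optimal at higher compression without preprocessing passes over the data.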

Native Windows build of vLLM 0.19.1 — no WSL, no Docker. Pre-built wheels + 34-file Windows patch + Multi-TurboQuant KV cache compression (6 methods, 2x cache capacity). PyTorch 2.10 + CUDA 12.6 + Triton + Flash-Attention 2.

  • Updated Apr 26, 2026
  • Python
