
turboquant

Here are 95 public repositories matching this topic...

Self-hosted, auto-clustering AI agent OS for low-cost consumer hardware — the computer you already have, an Orange Pi or Raspberry Pi, a Mac, and so on. Desktop shell, app store, agent deployment, distributed compute cluster. Memory by taOSmd.

  • Updated Apr 29, 2026
  • Python

Production-grade runtime patches for vLLM (45+ patches) — Qwen3.6-35B-A3B-FP8 hybrid GDN+MoE on NVIDIA Ampere (SM 80-86). 127 tok/s MTP free-form, 99 tok/s suffix tool-call (max 175). TurboQuant k8v4 KV cache, 256K context verified to 252K. P67 multi-query kernel + Suffix Decoding + adaptive ngram K. Zero source modifications.

  • Updated Apr 27, 2026
  • Python

Unified KV cache compression for LLM inference — TurboQuant, IsoQuant, PlanarQuant, TriAttention. 10 methods, GPU-validated, with a multi-GPU planner. Compress the KV cache 5-80x to run bigger models, longer contexts, and more agents on your GPU.

  • Updated Apr 26, 2026
  • Python
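To make the trade-off concrete: KV-cache compression stores the attention keys and values in a low-bit format and dequantizes them on read. The sketch below is a minimal per-token symmetric int8 scheme in NumPy — it is not any of the repo's methods (TurboQuant, IsoQuant, PlanarQuant, TriAttention are more sophisticated), only an illustration of why compressing the cache frees memory for longer contexts.

```python
# Minimal sketch of KV-cache quantization: symmetric int8 with one scale
# per token row. Hypothetical helper names; not the repo's actual API.
import numpy as np

def quantize_kv(kv: np.ndarray):
    """Quantize a float32 KV tensor (tokens, head_dim) to int8 + per-row scales."""
    scale = np.abs(kv).max(axis=-1, keepdims=True) / 127.0  # per-token scale
    scale = np.where(scale == 0, 1.0, scale)                # guard empty rows
    q = np.round(kv / scale).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

kv = np.random.randn(128, 64).astype(np.float32)   # toy cache: 128 tokens
q, scale = quantize_kv(kv)
recon = dequantize_kv(q, scale)

ratio = kv.nbytes / (q.nbytes + scale.nbytes)      # close to 4x for fp32 -> int8
err = np.abs(kv - recon).max()                     # bounded by half a scale step
```

Real systems push far beyond 4x by going to 2-4 bits with smarter codebooks, which is where the 5-80x range in the description comes from.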

Near-optimal vector quantization from Google's ICLR 2026 paper — 95% recall, 5x compression, zero preprocessing; a pure-Python FAISS replacement.

  • Updated Mar 28, 2026
  • Python
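The recall-vs-compression trade-off this repo targets can be demonstrated with an intentionally simple baseline: quantize a vector database to int8, then check how often nearest-neighbor results on the compressed vectors match exact search. This is a hedged sketch, not the paper's algorithm.

```python
# Illustration only: 8-bit scalar quantization as a stand-in for the paper's
# near-optimal quantizer, measuring recall@10 against exact brute-force search.
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 32)).astype(np.float32)       # database vectors
queries = rng.standard_normal((20, 32)).astype(np.float32)

# Quantize the database: one scale per dimension, 4x smaller than fp32.
scale = np.abs(db).max(axis=0) / 127.0
codes = np.round(db / scale).astype(np.int8)
recon = codes.astype(np.float32) * scale                      # dequantized view

def topk(q: np.ndarray, vectors: np.ndarray, k: int = 10) -> set:
    """Exact k-nearest-neighbor IDs by squared L2 distance."""
    d = ((vectors - q) ** 2).sum(axis=1)
    return set(np.argsort(d)[:k])

# Recall@10: overlap between exact results and results on quantized vectors.
recalls = [len(topk(q, db) & topk(q, recon)) / 10 for q in queries]
recall_at_10 = float(np.mean(recalls))
```

A naive int8 quantizer already keeps recall high at 4x compression on easy synthetic data; the paper's contribution is holding recall near-optimal at higher compression without preprocessing passes over the data.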

Native Windows build of vLLM 0.19.1 — no WSL, no Docker. Pre-built wheels + 34-file Windows patch + Multi-TurboQuant KV cache compression (6 methods, 2x cache capacity). PyTorch 2.10 + CUDA 12.6 + Triton + Flash-Attention 2.

  • Updated Apr 26, 2026
  • Python
