feat: INT8 KV cache quantization (~48% memory reduction) by dzhengAP · Pull Request #184 · GeeeekExplorer/nano-vllm

dzhengAP · 2026-03-09T18:25:23Z

Motivation

On an 8GB GPU, the KV cache is the primary memory bottleneck for serving
multiple concurrent sequences. This PR adds INT8 quantization of the KV cache,
reducing its memory footprint by ~48% so that ~2× as many sequences can be
batched within the same cache budget.

Design

- Quantization happens at store time inside a Triton kernel, with no extra copy or Python overhead on the hot path
- Dequantization happens at attention time, just before flash_attn_with_kvcache
- Per-(token, head) symmetric INT8: scale = max(|x|) / 127, stored as FP32
- Prefill path is unaffected (Q/K/V stay in BF16/FP16 in HBM)
- kv_quant=False by default, so the change is fully backwards-compatible
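
The per-(token, head) scheme above can be illustrated with a plain-Python reference of the arithmetic for a single head vector. This is only a sketch, not the PR's Triton kernel: the function names `quantize_int8` and `dequantize_int8` are hypothetical, and the handling of an all-zero vector is an assumption, since the description does not say what the kernel does in that case.

```python
def quantize_int8(x):
    """Symmetric INT8 quantization of one (token, head) vector.

    Returns (q, scale), where q holds integers in [-127, 127] and
    scale = max(|x|) / 127 as stated in the design notes.
    """
    amax = max(abs(v) for v in x)
    # Assumption: an all-zero vector gets scale 1.0 to avoid dividing by zero.
    scale = amax / 127.0 if amax > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric INT8 range.
    q = [max(-127, min(127, round(v / scale))) for v in x]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate floating-point values from INT8 codes."""
    return [qi * scale for qi in q]


head_vec = [0.5, -1.27, 0.003, 1.0]  # made-up values for illustration
q, scale = quantize_int8(head_vec)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(head_vec, restored))
```

Because each value is rounded to the nearest multiple of `scale`, the round-trip error per element is at most `scale / 2`, and the element with the largest magnitude always maps to ±127.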

Memory savings

FP16: 14000.0MB → INT8: 7218.8MB (48.4% savings)

(measured at 500 blocks × 256 block_size × 28 layers × 8 kv_heads × 128 head_dim)
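
The figures above can be reproduced with a few lines of arithmetic. The sketch below is not the PR's `estimate_memory_savings()` utility (its signature is not shown here); `estimate_kv_cache_mb` is a hypothetical helper. It assumes that both K and V are counted, that FP16 uses 2 bytes per element, that INT8 uses 1 byte per element plus one 4-byte FP32 scale per (token, head), and that "MB" means 2^20 bytes.

```python
def estimate_kv_cache_mb(num_blocks, block_size, num_layers, num_kv_heads, head_dim):
    """Return (fp16_mb, int8_mb, savings_fraction) for a paged KV cache."""
    # One scale per (token, head), for K and V separately (assumed).
    token_heads = num_blocks * block_size * num_layers * num_kv_heads * 2
    elements = token_heads * head_dim
    mb = 1024 * 1024  # assumption: "MB" in the PR means 2**20 bytes

    fp16_mb = elements * 2 / mb                      # 2 bytes per element
    int8_mb = (elements * 1 + token_heads * 4) / mb  # 1 byte + FP32 scales
    return fp16_mb, int8_mb, 1 - int8_mb / fp16_mb


fp16_mb, int8_mb, savings = estimate_kv_cache_mb(500, 256, 28, 8, 128)
print(f"FP16: {fp16_mb:.1f}MB -> INT8: {int8_mb:.1f}MB ({savings:.1%} savings)")
```

Under these assumptions the savings fraction reduces to 1 - (head_dim + 4) / (2 × head_dim), which is why the result is slightly below 50%: the FP32 scales add 4 bytes for every head_dim INT8 values.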

Usage

from nanovllm import LLM

llm = LLM("/path/to/model", kv_quant=True)  # everything else unchanged

Files changed

- nanovllm/layers/kv_quant.py: new; Triton store + dequant kernels + memory estimator
- nanovllm/layers/attention.py: wire in INT8 path alongside existing FP16 path
- nanovllm/engine/model_runner.py: allocate INT8 cache + FP32 scale tensors
- nanovllm/config.py: add kv_quant: bool = False

- Add kv_quant.py with two Triton kernels:
  - store_kvcache_int8_kernel: quantize K/V to INT8 at store time
  - dequant_kvcache_kernel: dequantize to FP16 at decode time
- Per-(token, head) symmetric INT8 with FP32 scale
- Config.kv_quant=False by default (fully backwards-compatible)
- ~48% KV cache memory reduction vs BF16 baseline
- Allows ~2x more concurrent sequences on same GPU budget
- estimate_memory_savings() utility for pre-run estimation

Usage: LLM(model_path, kv_quant=True)

dzhengAP wants to merge 1 commit into GeeeekExplorer:main from dzhengAP:feature/int8-kv-cache
