feat: INT8 KV cache quantization (~48% memory reduction) #184

Open
dzhengAP wants to merge 1 commit into GeeeekExplorer:main from dzhengAP:feature/int8-kv-cache

dzhengAP commented Mar 9, 2026

Motivation

On an 8GB GPU, the KV cache is the primary memory bottleneck for serving
multiple concurrent sequences. This PR adds INT8 quantization of the KV cache,
reducing its memory footprint by ~48% and allowing roughly 1.9× as many
sequences to be batched in the same cache budget.

Design

  • Quantization happens at store time inside a Triton kernel — no extra
    copy or Python overhead on the hot path
  • Dequantization happens at attention time, just before flash_attn_with_kvcache
  • Per-(token, head) symmetric INT8: scale = max(|x|) / 127, stored as FP32
  • Prefill path is unaffected (Q/K/V stay in BF16/FP16 in HBM)
  • kv_quant=False by default — fully backwards-compatible
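
The per-(token, head) scheme in the list above can be illustrated with a plain-Python reference. This is only a sketch of the arithmetic, not the PR's Triton kernels: the names `quantize_head` and `dequantize_head` are hypothetical, and the all-zero guard and the clamp to [-127, 127] are assumptions that the description does not spell out.

```python
def quantize_head(x):
    """Quantize one (token, head) vector of floats to symmetric INT8.

    Returns (q, scale) with every code in [-127, 127] and
    scale = max(|x|) / 127, as in the Design list.
    """
    amax = max(abs(v) for v in x)
    # Assumption: fall back to scale 1.0 for an all-zero vector
    # so the division below is always defined.
    scale = amax / 127.0 if amax > 0.0 else 1.0
    # Assumption: clamp after rounding to stay inside the symmetric range.
    q = [max(-127, min(127, round(v / scale))) for v in x]
    return q, scale


def dequantize_head(q, scale):
    """Recover approximate floats from INT8 codes and their scale."""
    return [code * scale for code in q]


# Toy cache slot: 2 tokens x 2 KV heads x head_dim 4.
# Each head vector is quantized separately and gets its own scale.
kv = [
    [[0.5, -1.27, 0.0, 1.0], [0.02, 0.01, -0.04, 0.03]],
    [[3.0, -6.35, 1.5, 0.0], [0.0, 0.0, 0.0, 0.0]],
]
quantized = [[quantize_head(head) for head in token] for token in kv]
restored = [[dequantize_head(q, s) for q, s in token] for token in quantized]
```

Because every head vector carries its own scale, a head with small activations (the second head of the first token here) keeps its resolution even when a neighbouring head has much larger values. The round-trip error per element is at most half a scale step.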

Memory savings

FP16: 14000.0MB → INT8: 7218.8MB (48.4% savings)

(measured at 500 blocks × 256 block_size × 28 layers × 8 kv_heads × 128 head_dim;
K and V caches combined, 1 MB = 2^20 bytes; the INT8 figure includes 218.8 MB of FP32 scales)
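
The figures above can be reproduced with a few lines of arithmetic. The sketch below assumes K and V are both counted, one FP32 scale is stored per (token, head) vector, and MB means 2^20 bytes. `kv_cache_mib` is a hypothetical helper written for this illustration; it is not the PR's `estimate_memory_savings()`, whose signature is not shown here.

```python
def kv_cache_mib(num_blocks, block_size, num_layers, num_kv_heads, head_dim, int8):
    """Size of the K and V caches combined, in units of 2**20 bytes."""
    # Number of (token, head) vectors across all layers, for K or V alone.
    vectors = num_blocks * block_size * num_layers * num_kv_heads
    if int8:
        bytes_per_vector = head_dim * 1 + 4  # INT8 values plus one FP32 scale
    else:
        bytes_per_vector = head_dim * 2      # 16-bit values (FP16 or BF16)
    return 2 * vectors * bytes_per_vector / 2**20  # factor 2: K and V


shape = dict(num_blocks=500, block_size=256, num_layers=28, num_kv_heads=8, head_dim=128)
fp16_mib = kv_cache_mib(**shape, int8=False)
int8_mib = kv_cache_mib(**shape, int8=True)
savings = 1.0 - int8_mib / fp16_mib
capacity_gain = fp16_mib / int8_mib  # tokens that fit in the same cache budget

print(f"FP16: {fp16_mib:.1f}MB -> INT8: {int8_mib:.1f}MB ({savings:.1%} savings)")
print(f"Capacity gain at a fixed cache budget: {capacity_gain:.2f}x")
```

The savings fall slightly short of 50% because each 128-byte INT8 vector carries a 4-byte scale, so the per-vector cost is 132 bytes against 256 bytes for the 16-bit layout. With a larger `head_dim` the scale overhead shrinks in proportion.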

Usage

llm = LLM("/path/to/model", kv_quant=True)  # everything else unchanged

Files changed

  • nanovllm/layers/kv_quant.py — new: Triton store + dequant kernels + memory estimator
  • nanovllm/layers/attention.py — wire in INT8 path alongside existing FP16 path
  • nanovllm/engine/model_runner.py — allocate INT8 cache + FP32 scale tensors
  • nanovllm/config.py — add kv_quant: bool = False

- Add kv_quant.py with two Triton kernels:
  - store_kvcache_int8_kernel: quantize K/V to INT8 at store time
  - dequant_kvcache_kernel: dequantize to FP16 at decode time
- Per-(token, head) symmetric INT8 with FP32 scale
- Config.kv_quant=False by default (fully backwards-compatible)
- ~48% KV cache memory reduction vs the 16-bit (FP16/BF16) baseline
- Allows ~1.9x as many concurrent sequences on the same GPU budget
- estimate_memory_savings() utility for pre-run estimation

Usage: LLM(model_path, kv_quant=True)