KV cache Quantization. #17

longcuxit · 2026-01-21T17:10:32Z

longcuxit
Jan 21, 2026

I'm running a 6-bit 30B model with the --continuous-batching flag, and RAM usage grows rapidly. Does anyone know why?

How do I enable KV cache quantization? If it isn't supported yet, when will the server get a flag for it?

Configuration: M3 Max (128 GB), Kilo Code, GLM-4.7-Flash 6-bit

waybarrios · 2026-02-11T01:52:00Z

waybarrios
Feb 11, 2026
Maintainer

Hi! Thanks for reporting this. Both issues are related and I'll address them:

1. RAM usage increasing rapidly with `--continuous-batching`

This was a known issue: the prefix cache was storing too many KV cache entries in full precision (fp16), and for a 30B model each entry can be hundreds of MB. Recent updates address this:

Memory-aware cache (now the default): Automatically detects available RAM and evicts entries based on memory pressure rather than entry count.

You can tune it with these flags:

# Limit cache to specific MB (e.g., 8GB out of your 128GB)
--cache-memory-mb 8192

# Or set as a fraction of available RAM (default is 20%)
--cache-memory-percent 0.15

# Disable prefix cache entirely if you don't need it
--disable-prefix-cache

Try running with `--cache-memory-mb 8192` or `--disable-prefix-cache` to see whether that resolves the RAM growth.
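To illustrate the difference between evicting by memory pressure and evicting by entry count, here is a minimal sketch of a byte-budgeted LRU cache. The class `ByteBudgetLRU`, its methods, and the choice of LRU ordering are hypothetical and only for illustration; this is not the vllm-mlx implementation.

```python
from collections import OrderedDict


class ByteBudgetLRU:
    """Hypothetical LRU cache that evicts by total stored bytes, not entry count."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> (value, size_in_bytes)

    def put(self, key, value, size_bytes: int) -> None:
        # Drop any existing entry first so its size is not counted twice.
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)[1]
        # An entry larger than the whole budget is never stored.
        if size_bytes > self.max_bytes:
            return
        self._entries[key] = (value, size_bytes)
        self.used_bytes += size_bytes
        # Evict least recently used entries until usage is back under budget.
        while self.used_bytes > self.max_bytes:
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]


cache = ByteBudgetLRU(max_bytes=1000)
cache.put("prompt-a", "kv-a", 400)
cache.put("prompt-b", "kv-b", 400)
cache.get("prompt-a")               # touch a, so b becomes the oldest entry
cache.put("prompt-c", "kv-c", 400)  # 1200 > 1000, so b is evicted
```

With a count-based limit, a few very long prompts could still consume many GB; a byte budget bounds total usage regardless of how large individual entries are.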

2. KV Cache Quantization

KV cache quantization is not currently exposed as a server flag. However, mlx-lm does have a QuantizedKVCache class (8-bit quantization with a configurable group size) that could significantly reduce KV cache memory, roughly halving it compared to fp16.

I've opened a feature request to track adding a --kv-cache-quantization flag that would use mlx_lm.models.cache.QuantizedKVCache instead of the default KVCache. This would be especially beneficial for large models like yours (30B 6-bit on 128 GB).
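A back-of-the-envelope check of the "roughly halving" figure, as a sketch only. It assumes group-wise affine quantization where every group of `group_size` elements also stores one fp16 scale and one fp16 bias (32 bits of overhead per group); the function name and the default group size of 64 are assumptions made for illustration.

```python
def quantized_fraction(bits: int, group_size: int = 64, meta_bits: int = 32) -> float:
    """Size of a group-quantized tensor relative to fp16.

    meta_bits is the assumed per-group overhead: one fp16 scale plus one fp16 bias.
    """
    effective_bits = bits + meta_bits / group_size
    return effective_bits / 16  # fp16 uses 16 bits per element


for bits in (8, 4):
    frac = quantized_fraction(bits)
    print(f"{bits}-bit: {frac:.1%} of fp16, {1 - frac:.1%} saved")
```

Under these assumptions the per-group metadata makes the savings slightly smaller than the nominal 50% (8-bit) and 75% (4-bit); a larger group size shrinks that overhead at the cost of coarser quantization.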

Recommended immediate workaround

vllm-mlx serve <model> --continuous-batching --cache-memory-mb 8192

This caps the prefix cache at 8GB and should prevent the unbounded RAM growth. Let me know if you still see issues after trying this.


janhilgard · 2026-02-11T15:24:00Z

janhilgard
Feb 11, 2026
Collaborator

I've implemented this feature in PR #67. It adds the --kv-cache-bits {4,8} and --kv-cache-group-size flags for KV cache quantization in the prefix cache.

How it works:

- KV cache entries are quantized before being stored in the prefix cache and dequantized on fetch.
- BatchGenerator always operates on full-precision KV states, so inference quality is unaffected.
- 4-bit gives ~75% memory savings; 8-bit gives ~50%.
- Uses mlx-lm's built-in QuantizedKVCache / mx.dequantize().
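The quantize-on-store, dequantize-on-fetch flow described above can be sketched in plain Python. This is a simplified stand-in operating on flat lists of floats, not the code from the PR: `quantize`, `dequantize`, and `QuantizedPrefixStore` are hypothetical names, and the real implementation works on MLX arrays.

```python
def quantize(values, bits=8, group_size=64):
    """Group-wise affine quantization: each group keeps integer codes plus (scale, bias)."""
    levels = (1 << bits) - 1
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant groups
        codes = [round((v - lo) / scale) for v in chunk]
        groups.append((codes, scale, lo))
    return groups


def dequantize(groups):
    """Rebuild approximate float values from the codes and per-group (scale, bias)."""
    return [code * scale + bias for codes, scale, bias in groups for code in codes]


class QuantizedPrefixStore:
    """Hypothetical store: compact at rest, full precision for the caller."""

    def __init__(self, bits=8, group_size=64):
        self.bits, self.group_size = bits, group_size
        self._store = {}

    def put(self, prefix_key, kv_values):
        self._store[prefix_key] = quantize(kv_values, self.bits, self.group_size)

    def fetch(self, prefix_key):
        return dequantize(self._store[prefix_key])


kv = [i / 100 for i in range(-128, 128)]  # stand-in for a flattened KV tensor
store = QuantizedPrefixStore(bits=8, group_size=64)
store.put("system-prompt", kv)
restored = store.fetch("system-prompt")
max_error = max(abs(a - b) for a, b in zip(kv, restored))
```

The round trip is lossy: restored values differ from the originals by at most about half a quantization step per group, which is why 8-bit is the more conservative choice.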

Usage:

# 8-bit (recommended — good savings with minimal overhead)
vllm-mlx serve model --kv-cache-bits 8

# 4-bit (aggressive savings)
vllm-mlx serve model --kv-cache-bits 4

This is complementary to --cache-memory-mb: use the two together to set a memory budget and fit more entries within that budget via quantization.
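To see how the two settings interact, here is a rough estimate of how many cached prompts fit in an 8192 MB budget. The model shape below is hypothetical (it is NOT the real GLM-4.7-Flash configuration), the budget is treated as MiB, and the quantized byte sizes assume a per-group fp16 scale and bias at group size 64.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    # Keys and values are both cached, hence the factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Hypothetical model shape and prompt length, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
PROMPT_TOKENS = 4096
BUDGET_BYTES = 8192 * 1024 * 1024  # --cache-memory-mb 8192, read as MiB


def entries_that_fit(bytes_per_element: float) -> int:
    entry_bytes = PROMPT_TOKENS * kv_bytes_per_token(
        LAYERS, KV_HEADS, HEAD_DIM, bytes_per_element
    )
    return int(BUDGET_BYTES // entry_bytes)


# fp16 = 2 bytes; quantized = bits/8 plus assumed 32 bits of metadata per 64 elements.
for label, bpe in (("fp16", 2.0), ("8-bit", 8.5 / 8), ("4-bit", 4.5 / 8)):
    entry_mib = PROMPT_TOKENS * kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bpe) / 2**20
    print(f"{label}: {entry_mib:.0f} MiB per entry, {entries_that_fit(bpe)} entries in budget")
```

With these assumed numbers a single 4096-token fp16 entry is 768 MiB, so the same budget holds roughly two to four times as many entries once they are quantized.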

