Replies: 2 comments
-
Hi! Thanks for reporting this. Both issues are related and I'll address them:
1. RAM usage increasing rapidly with
-
I've implemented this feature in PR #67. It adds
How it works:
Usage:
    # 8-bit (recommended: good savings with minimal overhead)
    vllm-mlx serve model --kv-cache-bits 8
    # 4-bit (aggressive savings)
    vllm-mlx serve model --kv-cache-bits 4
This is complementary to
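For a rough sense of what the bit width changes, here is a back-of-the-envelope estimate of KV cache size. This is only a sketch under stated assumptions: the layer count, KV head count, head dimension, and token count below are hypothetical placeholders (not the real GLM-4.7-Flash configuration), the unquantized baseline is assumed to be 16-bit, and the per-group scales and biases that a quantized cache also stores are ignored, so real savings will be somewhat smaller.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bits):
    """Approximate KV cache size in bytes.

    Counts one key and one value vector per token, per layer, per KV head.
    Ignores quantization metadata (scales/biases), so it is a lower bound
    for the quantized cases.
    """
    elements = 2 * num_layers * num_kv_heads * head_dim * num_tokens
    return elements * bits // 8


# Hypothetical model dimensions, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
# Total tokens held in the cache (summed over all active sequences).
TOKENS = 32_768

for bits in (16, 8, 4):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, TOKENS, bits) / 2**30
    print(f"{bits:>2}-bit cache: {gib:.2f} GiB")
# 16-bit cache: 6.00 GiB
#  8-bit cache: 3.00 GiB
#  4-bit cache: 1.50 GiB
```

The cache grows linearly with the number of cached tokens, so when many requests are in flight at once the total is likely to climb accordingly; halving the bit width roughly halves that growth.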
-
I'm running a 6-bit 30B model with the --continuous-batching flag, but RAM usage is increasing rapidly. Does anyone know why?
How do I enable KV cache quantization, or when will the server support this flag?
Configuration: M3 Max 128 GB, Kilo Code, GLM-4.7-Flash 6-bit