Exploring Q-Filters for KV Cache Compression (32x Compression with FlashAttention) #14378
MillionthOdin16 started this conversation in Ideas
Replies: 1 comment
-
Hi @MillionthOdin16! I'm the main author of the paper. Thanks for your interest in our method.
-
The paper "Q-Filters: Leveraging Query-Key Geometry for Efficient Key-Value Cache Compression" introduces a novel training-free method to compress the KV cache in transformer models. It leverages the geometric properties of Query (Q) and Key (K) vectors to filter out less critical KV pairs, achieving 32× compression with 99% accuracy on long-context tasks (e.g., needle-in-a-haystack) and 65% lower perplexity drops compared to Streaming-LLM.
Key Takeaways for vLLM:
- Training-Free & FlashAttention-Compatible
- Performance Gains
- Synergy with vLLM Features
Discussion Points:

Thoughts? Would love to hear from the community. Let's discuss! 🚀
Edit: adding the author's summary: https://twitter.com/nthngdy/status/1897301390470603245