Exploring Q-Filters for KV Cache Compression (32x Compression with FlashAttention) #14378
MillionthOdin16 started this conversation in Ideas
Replies: 1 comment
-
Hi @MillionthOdin16! I'm the main author of the paper. Thanks for your interest in our method.
-
The paper "Q-Filters: Leveraging Query-Key Geometry for Efficient Key-Value Cache Compression" introduces a novel training-free method to compress the KV cache in transformer models. It leverages the geometric properties of Query (Q) and Key (K) vectors to filter out less critical KV pairs, achieving 32× compression with 99% accuracy on long-context tasks (e.g., needle-in-a-haystack) and 65% lower perplexity drops compared to Streaming-LLM.
Key Takeaways for vLLM:
- Training-Free & FlashAttention-Compatible
- Performance Gains
- Synergy with vLLM Features
Discussion Points:

Thoughts? Would love to hear from the community. Let's discuss! 🚀
Edit: adding the author's summary: https://twitter.com/nthngdy/status/1897301390470603245