Abstract: Autoregressive language models rely on a Key-Value (KV) cache, which avoids re-computing past hidden states during generation and thus speeds up decoding. As model sizes and context lengths grow, the KV cache becomes a significant memory bottleneck, which calls for compression methods that limit its size during generation. In this paper, we discover surprising properties of Query (Q) and Key (K) vectors that allow us to efficiently approximate attention scores without computing the attention maps. We propose Q-Filters, a training-free KV cache compression method that filters out less crucial Key-Value pairs based on a single context-agnostic projection. Unlike many alternatives, Q-Filters is compatible with FlashAttention, as it does not require direct access to attention weights. Experimental results in long-context settings demonstrate that Q-Filters is competitive with attention-based compression methods such as SnapKV on retrieval tasks while consistently outperforming efficient compression schemes such as Streaming-LLM in generation setups. Notably, Q-Filters achieves 99% accuracy on the needle-in-a-haystack task at a 32x compression level while reducing the generation perplexity drop by up to 65% compared to Streaming-LLM.
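In short, Q-Filters scores each cached Key with a single pre-computed, context-agnostic filter vector per attention head and evicts the lowest-scoring Key-Value pairs, so attention maps never need to be materialized. The snippet below is a minimal illustrative sketch of this idea, not the library implementation: the prune_kv helper and its signature are hypothetical, and the filter is assumed to be a unit vector computed offline for the head.

import torch

def prune_kv(keys, values, q_filter, budget):
    # keys, values: (seq_len, head_dim); q_filter: (head_dim,) unit vector
    # computed offline for this head. Keys with a weak projection onto the
    # filter are assumed to receive little attention and are evicted.
    scores = keys @ q_filter  # one score per cached Key
    keep = scores.topk(min(budget, keys.shape[0])).indices.sort().values
    return keys[keep], values[keep]  # compressed KV pairs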
- Install required libraries in a virtual environment:
python -m virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
- Configure the Hugging Face environment variables:
export HF_DATASETS_CACHE=<path_to_hf_cache>
export HF_HOME=<path_to_hf_cache>
export HF_TOKEN=<hf_token>
Here is an example of how to use Q-Filters in a generation setup:
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
from src.hf_cache import QFiltersCache
model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
# Load the model in bfloat16, sharded across the available devices
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    low_cpu_mem_usage=True,
    torch_dtype="bfloat16"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
streamer = TextStreamer(tokenizer)
question = """What is the probability of two integers selected at random having a greatest common divisor of 1."""
input_text = f"<|User|>{question}<|Assistant|><think>\n"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
# Compressed KV cache: model_name is used to retrieve the pre-computed Q-Filters,
# while window_length and max_length bound the number of cached Key-Value pairs.
past_key_values = QFiltersCache(
    window_length=64,
    max_length=128,
    model_name=model_name
)
# Generate with the compressed cache; the streamer prints tokens as they are produced
out = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.5,
    max_new_tokens=4096,
    past_key_values=past_key_values,
    streamer=streamer
)
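The TextStreamer prints tokens as they are generated; the full completion can also be recovered from the returned token ids, for instance:

print(tokenizer.decode(out[0], skip_special_tokens=True))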
- Verify that the target model does not already have pre-computed Q-Filters.
- Use the make_filters.py script to generate the filters (a conceptual sketch of the underlying computation is given after this list). For instance:
python make_filters.py \
    --model_name deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --model_cls Qwen2ForCausalLM \
    --max_seq_len 2048 \
    --num_sequences 10 \
    --num_svd_samples 3000 \
    --dataset_name PatrickHaller/fineweb-1B \
    --save_mode disk \
    --save_dir ../filters
# To push the filters to the Hugging Face Hub instead, use --save_mode hub
# (or hub+disk) and pass a user id with --hf_user_id, e.g. --hf_user_id nthngdy
- For Q-Filters saved on disk, you can upload them to the Hugging Face Hub later with:
huggingface-cli upload <path_to_hf_repo> <path_to_local_qfilters> .
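Conceptually, each Q-Filter is the dominant direction of a head's query vectors, estimated offline by running the model on a few sample sequences and applying an SVD to the collected query activations (hence the --num_sequences and --num_svd_samples arguments above). The snippet below is a minimal sketch of that computation under these assumptions; it is not the make_filters.py implementation, and compute_qfilter is a hypothetical helper.

import torch

def compute_qfilter(query_states: torch.Tensor) -> torch.Tensor:
    # query_states: (num_samples, head_dim) query activations for one head,
    # collected offline on a few sample sequences.
    _, _, vh = torch.linalg.svd(query_states, full_matrices=False)
    u1 = vh[0]  # principal right singular vector of the queries
    # The SVD only defines this direction up to a sign: orient the filter so
    # that it points towards the average query of the head.
    if torch.dot(u1, query_states.mean(dim=0)) < 0:
        u1 = -u1
    return u1  # unit-norm Q-Filter of shape (head_dim,)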
@misc{godey2025qfiltersleveragingqkgeometry,
  title={Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression},
  author={Nathan Godey and Alessio Devoto and Yu Zhao and Simone Scardapane and Pasquale Minervini and Éric de la Clergerie and Benoît Sagot},
  year={2025},
  eprint={2503.02812},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.02812},
}