UNCOMP: Can Information Compression Uncover Sparsity? A Compressor Design from an Uncertainty-Aware Perspective
| Date | Update |
|---|---|
| 2025-12-28 | Added Qwen model support (uncomp/qwen_model.py) |
| 2025-12-28 | Added document-level machine translation benchmark (doclevel-MT-benchmark/) |
| 2025-12-28 | Added WMT evaluation scripts (scripts/scripts_wmt/) |
This repository contains the official implementation of UNCOMP, an uncertainty-aware KV cache compression framework for long-context LLMs. It leverages truncated matrix entropy to reveal sparsity, cutting the cache to 4.74% of its original size and boosting throughput by 6×, with minimal performance loss.
Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Min Yang, Lingpeng Kong, Ngai Wong
📄 Paper Link
- Uncertainty-Aware Compression: Uses truncated matrix entropy to detect low-information regions.
- Two-Stage Framework: Jointly compresses hidden states and KV cache, accelerating both prefill and decoding.
- Near-Lossless Accuracy: Outperforms or matches full KV cache in benchmarks like Needle-in-a-Haystack, even at 9.38% compression ratio.
- Extreme Compression: Reduces KV cache size to 4.74% of the original. Maintains high accuracy even at aggressive ratios (e.g., pruning multiple heads).
- Efficiency: Achieves 6× throughput improvement and 6% prefill speedup.
- Training-free: No costly fine-tuning required.
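To make the core signal concrete, here is a minimal sketch of how a truncated matrix entropy could be computed from a layer's hidden states. The function name and the NumPy formulation are illustrative assumptions, not the repository's actual implementation:

```python
import numpy as np

def truncated_matrix_entropy(hidden: np.ndarray, k: int = 32) -> float:
    """Entropy of the top-k normalized singular values of a hidden-state matrix.

    hidden: (seq_len, hidden_dim) hidden states of one layer (or head).
    k: truncation rank; only the leading k singular values are kept.
    A low value means the states concentrate in a few directions,
    i.e. the region carries little information and is safe to compress.
    """
    s = np.linalg.svd(hidden, compute_uv=False)  # singular values, descending
    s = s[:k]                                    # truncate the spectrum
    p = s / s.sum()                              # normalize to a distribution
    return float(-(p * np.log(p + 1e-12)).sum())
```

Heads or layers whose entropy falls below a threshold would then be candidates for a smaller KV cache budget.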
- Python 3.9+
- PyTorch 2.6.0
- CUDA compatible GPU(s)
Setup
- Clone the repository:
git clone https://github.com/your-username/UNComp.git
cd UNComp
- Install dependencies:
pip install -r requirements.txt

Key Dependencies:
- torch==2.6.0 - PyTorch for deep learning
- transformers==4.39.2 - Hugging Face transformers library
- accelerate==1.0.1 - For multi-GPU training and inference
- datasets==3.3.1 - For dataset loading and processing
- numpy, pandas - Data manipulation
- sentencepiece==0.2.0 - For tokenization
bash ./scripts/scripts_longBench/eval.sh \
--max_capacity_prompts 512 \
--attn_implementation eager \
--source_path ./results/ \
--model_path meta-llama/Llama-2-7b-chat-hf \
--eval_batch_size 1 \
--method uncomp \
--name ./output \
--gpu_id 0 \
--fp16 1 \
--seed 43 \
--logger_pattern info \
  --port 1236
Supported models:
- LLaMA family models
- Mistral models
- Qwen models (NEW)
- LongBench: narrativeqa, qasper, multifieldqa_en, hotpotqa, 2wikimqa, musique, gov_report, qmsum, multi_news, trec, triviaqa, samsum, passage_count, passage_retrieval_en, lcc, repobench-p
- InfiniteBench: En.Sum, En.QA, En.MC, En.Dia, Zh.QA, Code.Debug, Code.Run, Math.Calc, Math.Find, Retrieve.PassKey, Retrieve.Number, Retrieve.KV
- Needle in a Haystack Task: A simple 'needle in a haystack' analysis to test in-context retrieval ability of long context LLMs.
- Standard Benchmarks: GSM8K
- Doc-level MT Benchmark: English-German translation benchmark with document-level context (see doclevel-MT-benchmark/)
python uncomp/stage_division.py  # two groups
bash ./scripts/scripts_longBench/eval.sh --max_capacity_prompts 512 --attn_implementation eager --source_path ./results/ --model_path meta-llama/Llama-2-7b-chat-hf --eval_batch_size 1 --method head_type_search_2 --name ./output --gpu_id 0 --fp16 1 --seed 43 --logger_pattern info --port 1236
Evaluate on LongBench datasets using the provided scripts:
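The head grouping performed by the head_type_search_* methods can be pictured with a small sketch. The function below is a hypothetical illustration (not the code in search/), assuming heads are ranked by an entropy score and split into equal-sized groups:

```python
import numpy as np

def group_heads_by_entropy(head_entropy, n_groups: int = 2) -> np.ndarray:
    """Assign each attention head to one of n_groups by entropy rank.

    Heads with lower entropy (more compressible) land in earlier groups,
    which can then be given smaller KV cache budgets.
    Returns one group id per head.
    """
    order = np.argsort(head_entropy)              # head indices, ascending entropy
    groups = np.empty(len(order), dtype=int)
    for gid, chunk in enumerate(np.array_split(order, n_groups)):
        groups[chunk] = gid
    return groups
```

For example, `group_heads_by_entropy([0.9, 0.1, 0.5, 0.7], n_groups=2)` places the two lowest-entropy heads (indices 1 and 2) in group 0 and the rest in group 1.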
# Generation
### Multi-GPU evaluation
bash ./scripts/scripts_longBench/eval.sh --max_capacity_prompts 512 --attn_implementation eager --source_path ./results/ --model_path meta-llama/Llama-2-7b-chat-hf --eval_batch_size 1 --method uncomp --name ./output --gpu_id multi_0 --fp16 1 --seed 43 --logger_pattern info --port 1236
### Single GPU evaluation
bash ./scripts/scripts_longBench/eval.sh --max_capacity_prompts 512 --attn_implementation eager --source_path ./results/ --model_path meta-llama/Llama-2-7b-chat-hf --eval_batch_size 1 --method uncomp --name ./output --gpu_id 0 --fp16 1 --seed 43 --logger_pattern info --port 1236
# Evaluation
bash ./scripts/scripts_longBench/metrics.sh --results_dir ./results/results_long_bench/llama-2-7b-chat-hf_512/ --switch True --new_method uncomp
Evaluate on InfiniteBench datasets using the provided scripts:
# Generation
### Multi-GPU evaluation
bash ./scripts/scripts_InfiniteBench/eval.sh --max_capacity_prompts 512 --attn_implementation eager --source_path ./results/ --model_path meta-llama/Llama-2-7b-chat-hf --eval_batch_size 1 --method uncomp --name ./output --gpu_id multi_0 --fp16 1 --seed 43 --logger_pattern info --port 1236
### Single GPU evaluation
bash ./scripts/scripts_InfiniteBench/eval.sh --max_capacity_prompts 512 --attn_implementation eager --source_path ./results/ --model_path meta-llama/Llama-2-7b-chat-hf --eval_batch_size 1 --method uncomp --name ./output --gpu_id 0 --fp16 1 --seed 43 --logger_pattern info --port 1236
# Evaluation
bash ./scripts/scripts_InfiniteBench/metrics.sh --results_dir ./results/results_Inifite_bench/llama-2-7b-chat-hf_512/ --switch True --new_method uncomp
Evaluate on document-level English-German translation using the WMT benchmark:
# Run WMT translation evaluation
bash ./scripts/scripts_wmt/eval.sh \
--max_capacity_prompts 512 \
--attn_implementation eager \
--source_path ./results/ \
--model_path meta-llama/Llama-2-7b-chat-hf \
--eval_batch_size 1 \
--method uncomp \
--name ./output \
--gpu_id 0 \
--fp16 1 \
--seed 43 \
--logger_pattern info \
  --port 1236
The document-level MT benchmark dataset is located in doclevel-MT-benchmark/ with the following structure:
- dev/: Development set
- test/: Test set
- unshuffle.py: Utility script for data processing
Hyperparameter selection:
- method:
- head_type_search_2: The heads are divided into two groups.
- head_type_search_4: The heads are divided into four groups.
- head_type_search_8: The heads are divided into eight groups.
- head_type_search_32: The heads are divided into thirty-two groups.
- uncomp: The heads are divided into two groups.
- uncomp_stage: The heads are divided into two groups, and the layers are additionally divided into groups (stage division).
- uncomp_groupn: The heads are divided into n groups.
- other methods: snapkv/pyramidkv/fullkv/streamingllm/h2o/chai.
- max_capacity_prompts:
- 512/128: The average number of tokens retained per head.
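Since max_capacity_prompts is an average per-head budget, unequal budgets across groups must still average out to it. Below is a hypothetical sketch of such an allocation; the actual rule in the repository may differ:

```python
import numpy as np

def allocate_budgets(group_ids, mean_budget: int = 512, step: int = 128):
    """Spread per-head cache budgets around a fixed mean.

    group_ids: one group id per head; lower ids = more compressible heads.
    Each head's budget is the mean shifted by `step` per group rank, so
    equal-sized groups preserve the overall average exactly.
    """
    gids = np.asarray(group_ids)
    n_groups = gids.max() + 1
    offsets = (gids - (n_groups - 1) / 2) * step  # symmetric around zero
    return (mean_budget + offsets).astype(int)
```

With two equal-sized groups and the defaults, compressible heads keep 448 tokens and the rest keep 576, while the average stays at 512.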
- [ ] Code Organization: Organizing and cleaning up the codebase for better usability
- [ ] Qwen Support: Adding full support for the Qwen model family
- [ ] Baselines: Adding full support for evaluating baselines
- [ ] SGLang Integration: Adding support for the SGLang inference engine for improved performance
- [ ] Documentation: Expanding documentation with more detailed examples
- [ ] Quantization Support: Adding support for model quantization (INT8/INT4) to reduce memory usage and accelerate inference
- [ ] Benchmarks: Adding more comprehensive benchmark results
- [x] Multi-GPU Inference Support
- [x] Batch Inference Support
- [x] AMD GPU Support
UNComp/
├── eval_*.py             # Evaluation
├── run_*.py              # Run codes
├── metrics.py            # Evaluation metrics
├── uncomp/
│   ├── utils/            # Helper tools
│   ├── cache_revise.py   # Adaptive code for grouping the head part
│   ├── download.py       # Download the datasets
│   ├── llama_model.py    # LLaMA model code
│   ├── mistral_model.py  # Mistral model code
│   ├── qwen_model.py     # Qwen model code
│   ├── monkeypatch.py    # Replaces certain sections of transformers
│   ├── stage_division.py # Layer grouping
│   └── uncomp_utils.py   # Core implementation
├── scripts/              # Bash scripts and configs
├── search/               # Head grouping search
├── data/                 # Datasets
├── results/              # Generated text results
└── requirements.txt      # Dependencies
UNComp achieves significant improvements in long-context tasks:
- Reduces the KV cache size to 4.74% of the original.
- Achieves a 6% prefill speedup.
- Improves throughput by 6.4×.
We welcome contributions! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
If you find this work useful, please cite our paper:
@article{xiong2025uncomp,
  title={UNCOMP: Can Information Compression Uncover Sparsity? A Compressor Design from an Uncertainty-Aware Perspective},
  author={Xiong, Jing and Shen, Jianghan and Ye, Fanghua and Tao, Chaofan and Wan, Zhongwei and Lu, Jianqiao and Wu, Xun and Zheng, Chuanyang and Guo, Zhijiang and Yang, Min and Kong, Lingpeng and Wong, Ngai},
  journal={arXiv preprint arXiv:2410.03090},
  year={2025}
}
For questions and support, please open an issue in this repository or contact the authors.
Note: This implementation will be fully released soon. Stay tuned for updates!