¹Southeast University, ²King's College London, ³The Alan Turing Institute
SCOPE is a simple yet effective framework for tackling the KV cache bottleneck of large language models (LLMs) in long-context generation. While existing methods focus almost exclusively on the prefill phase, SCOPE performs stage-level KV cache compression, handling the prefill and decoding phases separately, which is essential for reasoning tasks with long outputs.

SCOPE is especially useful for LLM applications that require efficient, scalable generation of long outputs.
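To make the stage-level idea concrete, below is a minimal sketch of budgeted KV eviction with separate prefill and decoding budgets. All names and numbers here are ours for illustration; they are not the repository's API.

```python
import torch

def evict_kv(scores, keep_budget, recent_size):
    """Keep the `recent_size` most recent KV entries plus the highest-scoring
    ("heavy hitter") entries among the older ones, up to `keep_budget` total."""
    seq_len = scores.shape[-1]
    if seq_len <= keep_budget:
        return torch.arange(seq_len)
    recent = torch.arange(seq_len - recent_size, seq_len)
    heavy = scores[: seq_len - recent_size].topk(keep_budget - recent_size).indices
    return torch.cat([heavy.sort().values, recent])

# Stage-level compression: one budget for the prompt's KV cache (prefill)
# and a separate budget for the KV produced during decoding.
prefill_scores = torch.rand(4096)  # accumulated attention over prompt tokens
decode_scores = torch.rand(2048)   # accumulated attention over generated tokens
kept_prefill = evict_kv(prefill_scores, keep_budget=1024, recent_size=128)
kept_decode = evict_kv(decode_scores, keep_budget=512, recent_size=128)
print(kept_prefill.numel(), kept_decode.numel())  # 1024 512
```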
*Figures: comparison of the three paradigms (left); overview of the three decoding strategies (right).*
- Excessive compression during the prefill phase, which requires the full context, impairs comprehension of the reasoning task.
- The heavy hitters deviate over time in reasoning tasks with long outputs.
*Figures: excessive compression (left); deviation of heavy hitters (right).*
We provide a notebook, `vis_topk_index_attn.ipynb`, to reproduce the deviation-of-heavy-hitters result (1× A100 80GB GPU).
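For intuition, this deviation can be quantified as the overlap between the top-k attended positions at different decoding steps. The sketch below is our illustration, not the notebook's exact code.

```python
import torch

def topk_jaccard(attn_a, attn_b, k=64):
    """Jaccard overlap between the top-k attended positions of two attention
    distributions; low overlap means the heavy hitters have deviated."""
    top_a = set(attn_a.topk(k).indices.tolist())
    top_b = set(attn_b.topk(k).indices.tolist())
    return len(top_a & top_b) / len(top_a | top_b)

# e.g. compare attention over the prompt at an early vs. a late decoding step
early_step = torch.rand(2048)
late_step = torch.rand(2048)
print(f"top-64 overlap: {topk_jaccard(early_step, late_step):.2f}")
```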
Attention heatmaps for layer 13 of a simplified GSM8K+ sample in LongGenBench:
We provide a notebook, `vis_attn_map.ipynb`, to reproduce the visualization result (1× A100 80GB GPU). Attention maps for the different layers will be stored in `./attention_map`.
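If you only need a quick look without the notebook, attention maps can also be dumped directly with Hugging Face Transformers. The model name and layer index below are placeholders, and `output_attentions=True` requires the `eager` attention implementation.

```python
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto",
    attn_implementation="eager",  # required for output_attentions
)

inputs = tok("Natalia sold clips to 48 of her friends ...", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions holds one (batch, heads, seq, seq) tensor per layer
layer13 = out.attentions[13].squeeze(0).mean(0)  # head-averaged map, layer 13
os.makedirs("./attention_map", exist_ok=True)
torch.save(layer13.cpu(), "./attention_map/layer13.pt")
```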
```
torch==2.4.0
transformers==4.44.2
flash_attn==2.5.8
```
```bash
conda create -n SCOPE python=3.10  # pick a Python version compatible with the pins above
conda activate SCOPE
pip install -r requirements.txt
```
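An optional sanity check that the pinned versions imported correctly:

```python
import torch, transformers, flash_attn

print(torch.__version__)          # expect 2.4.0
print(transformers.__version__)   # expect 4.44.2
print(flash_attn.__version__)     # expect 2.5.8
print(torch.cuda.is_available())  # flash-attn requires a CUDA GPU
```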
Our dataset construction method is based on the original LongGenBench repository. We provide scripts for building the LongGenBench dataset as follows:
- **LongGenBench-4K**

  | Dataset | Script |
  |---------|--------|
  | GSM8K+  | `create_gsm8k_30.sh` |
  | MMLU+   | `create_mmlu_30.sh`  |
  | CSQA+   | `create_csqa_40.sh`  |

- **LongGenBench-8K**

  | Dataset | Script |
  |---------|--------|
  | GSM8K++ | `create_gsm8k_60.sh` |
  | MMLU++  | `create_mmlu_60.sh`  |
  | CSQA++  | `create_csqa_80.sh`  |
- **Example Usage**

  To generate the GSM8K+ dataset, run:

  ```bash
  bash scripts/scripts_longgenbench/create_gsm8k_30.sh
  ```
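For intuition, the script packs K questions into a single long prompt in the LongGenBench style. The sketch below is our simplified reconstruction (using the Hugging Face `gsm8k` field names); the repository's actual prompt template may differ.

```python
from datasets import load_dataset

K = 30  # GSM8K+ packs 30 questions per prompt (GSM8K++ uses 60)
gsm8k = load_dataset("gsm8k", "main", split="test")
questions = [gsm8k[i]["question"] for i in range(K)]

# Concatenate the K questions so the model must answer all of them
# in one long generation.
prompt = "\n\n".join(f"Question {i + 1}: {q}" for i, q in enumerate(questions))
prompt += "\n\nAnswer all questions in order, labeling each answer."
print(prompt[:300])
```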
```bash
export CUDA_VISIBLE_DEVICES=$1
method=$2                   # supports ALLKV, PyramidKV, PyramidInfer, SnapKV, H2O, StreamingLLM
max_capacity_prompts=$3
attn_implementation=$4      # supports "flash_attention_2", "sdpa", "eager"
source_path=$5
model_path=$6
decoding_metric=$7          # None or h2o for the H2O baseline; slide, adaptive, or discontinuous for SCOPE
decoding_window_size=$8
save_dir=$9                 # path to the results directory
K=${10}                     # number of packed questions: 30 or 60 (note ${10}, not $10, which bash reads as ${1}0)
T=${11}                     # number of examples to run (--max_num_examples)
decoding_recent_size=${12}  # assumed 12th argument; the original script referenced this variable without assigning it

python3 run_longgenbench.py \
    --method ${method} \
    --model_path ${model_path} \
    --max_capacity_prompts ${max_capacity_prompts} \
    --attn_implementation ${attn_implementation} \
    --save_dir ${save_dir} \
    --use_cache True \
    --K ${K} \
    --decoding_window_size ${decoding_window_size} \
    --decoding_recent_size ${decoding_recent_size} \
    --decoding_metric ${decoding_metric} \
    --max_num_examples ${T}
```
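As a rough picture of what `decoding_metric=slide` does: the prefill selection stays fixed while the decoding-phase KV is capped by a constant-size sliding window. The function below is our illustration, not the repository's implementation.

```python
def slide_keep_indices(num_prefill_kept, num_decoded, window_size):
    """Cache positions to keep after a decoding step: the frozen prefill
    selection plus a sliding window over the most recent decoded tokens."""
    prefill = list(range(num_prefill_kept))
    start = num_prefill_kept + max(0, num_decoded - window_size)
    decoded = list(range(start, num_prefill_kept + num_decoded))
    return prefill + decoded

print(len(slide_keep_indices(1024, 5000, window_size=512)))  # 1536 = 1024 + 512
```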
```bash
results_dir=$1

python3 eval_gen.py \
    --results_dir ${results_dir}
```
The run scripts (bash files) for these experiments are located in the `scripts/scripts_longgenbench` folder, and the experimental results can be found in `results_longgenbench_4K` and `results_longgenbench_8K`.
Plug-in experiment results for LLaMA3.1-8B-Instruct on the GSM8K+ task from LongGenBench-4K.
The run scripts (bash files) for these experiments are located in the `scripts/scripts_longgenbench` folder, and the experimental results can be found in `results_longgenbench_gsm8k_plug_in`.
- fix the offset bug
- improve the README (expand documentation, add examples, and ensure clarity)
- reorganize the code for a better user experience
```bibtex
@article{wu2024scope,
  title={SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation},
  author={Wu, Jialong and Wang, Zhenglin and Zhang, Linhai and Lai, Yilong and He, Yulan and Zhou, Deyu},
  journal={arXiv preprint arXiv:2412.13649},
  year={2024}
}
```
- Thanks to SnapKV and PyramidKV (KVCache-Factory) for providing open-source code that supported the expansion of this project. 🎁
- Special thanks to LOOK-M for the beautifully designed README template, which we referenced. 🎨
- Shoutout to @Lueci4er on GitHub for valuable suggestions on code details, which we adopted. 🛠️