SCOPE: Optimizing KV Cache Compression in Long-context Generation

¹Southeast University, ²King’s College London, ³The Alan Turing Institute

If you find our project helpful, please give us a star ⭐ on GitHub to stay updated.

Overview

SCOPE is a simple yet effective framework designed to tackle the KV cache bottleneck in large language models (LLMs) during long-context generation. While existing methods focus primarily on the prefill phase, SCOPE introduces stage-level KV cache compression, which handles the prefill and decoding phases separately, an essential improvement for long-output reasoning tasks.

SCOPE is especially useful for LLM applications that require efficient, scalable generation with long outputs.
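
Conceptually, stage-level compression gives the prompt (prefill) and the generated tokens (decoding) separate budgets and eviction rules, so aggressive prompt compression does not degrade the cache entries needed later in a long generation. The sketch below illustrates the idea in Python under assumed [batch, heads, seq_len, head_dim] KV shapes; it is our reading of the approach, not the repository's implementation, and the function names are invented.

import torch

def compress_prefill_kv(keys, values, importance, budget):
    # Keep the `budget` prompt positions with the highest importance scores
    # (e.g. attention mass accumulated over an observation window, as in
    # SnapKV/H2O-style scoring). importance: [batch, heads, seq_len].
    k = min(budget, importance.size(-1))
    idx = importance.topk(k, dim=-1).indices.sort(dim=-1).values  # keep positional order
    idx = idx.unsqueeze(-1).expand(-1, -1, -1, keys.size(-1))
    return keys.gather(2, idx), values.gather(2, idx)

def compress_decoding_slide(keys, values, prefill_len, window):
    # "Slide" sketch: preserve the (already compressed) prefill cache and keep
    # only the most recent `window` entries produced during decoding.
    pk, dk = keys[:, :, :prefill_len], keys[:, :, prefill_len:]
    pv, dv = values[:, :, :prefill_len], values[:, :, prefill_len:]
    return (torch.cat([pk, dk[:, :, -window:]], dim=2),
            torch.cat([pv, dv[:, :, -window:]], dim=2))

The two functions separate the stages: the prompt is compressed once by importance, while decoded tokens are managed by their own sliding window, so neither budget starves the other.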


Figure: Comparison of three paradigms.

Figure: Overview of three decoding strategies.

Key Observations

  • Excessive compression during the prefill phase, which requires the specific full context, impairs comprehension of the reasoning task.

  • Deviation of heavy hitters occurs in reasoning tasks with long outputs.


Figure: Excessive compression.

Figure: Deviation of heavy hitters.

We provide a notebook, vis_topk_index_attn.ipynb, to reproduce the deviation-of-heavy-hitters result (1× A100 80GB GPU).
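
One way to quantify the deviation is the overlap between the top-k attended positions at two different decoding steps: a low overlap means the heavy hitters selected early no longer match the positions attended later. A minimal sketch, not the notebook's exact code:

import torch

def topk_overlap(attn_a, attn_b, k=64):
    # attn_a, attn_b: [seq_len] attention weights of one head at two decode
    # steps, restricted to positions that exist at both steps.
    top_a = set(attn_a.topk(k).indices.tolist())
    top_b = set(attn_b.topk(k).indices.tolist())
    return len(top_a & top_b) / len(top_a | top_b)  # Jaccard overlap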

Visualization

Attention heatmaps for layer 13 of a simplified GSM8K+ sample in LongGenBench:


We provide a notebook, vis_attn_map.ipynb, to reproduce the visualization (1× A100 80GB GPU). Attention maps for the different layers are stored in ./attention_map.
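
A hedged sketch of how such maps can be dumped with Hugging Face transformers (the notebook may differ; the model name, sample prompt, and file layout below are illustrative):

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # example model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="eager",  # flash-attention kernels do not return weights
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tok("Question 1: ...", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one [batch, heads, seq, seq] tensor per layer
os.makedirs("./attention_map", exist_ok=True)
layer13 = out.attentions[13].float().mean(dim=1)[0]  # head-averaged map
torch.save(layer13.cpu(), "./attention_map/layer_13.pt")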

Requirements

torch==2.4.0
transformers==4.44.2
flash_attn==2.5.8

Environment Setup

conda create -n SCOPE
conda activate SCOPE
pip install -r requirements.txt

LongGenBench

Dataset Construction

Our dataset construction method is based on the original LongGenBench repository. We provide the following scripts for building the LongGenBench datasets (a conceptual sketch of the packing idea follows the list):

  • LongGenBench-4K

    Dataset   Script
    GSM8K+    create_gsm8k_30.sh
    MMLU+     create_mmlu_30.sh
    CSQA+     create_csqa_40.sh

  • LongGenBench-8K

    Dataset   Script
    GSM8K++   create_gsm8k_60.sh
    MMLU++    create_mmlu_60.sh
    CSQA++    create_csqa_80.sh
  • Example Usage

    To generate the GSM8K+ dataset, run:

    bash scripts/scripts_longgenbench/create_gsm8k_30.sh
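
Under the hood, the construction packs K single questions into one prompt so the model must produce one long, multi-answer generation. A rough sketch of the packing idea (field names and instruction phrasing are assumptions; see the original LongGenBench repository for the real format):

from datasets import load_dataset

K = 30  # GSM8K+ packs 30 questions per prompt; GSM8K++ uses 60
gsm8k = load_dataset("openai/gsm8k", "main", split="test")
questions = [ex["question"] for ex in gsm8k.select(range(K))]

prompt = "\n\n".join(f"Question {i + 1}: {q}" for i, q in enumerate(questions))
prompt += "\n\nAnswer all of the questions above, in order, one at a time."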

Inference in LongGenBench

export CUDA_VISIBLE_DEVICES=$1

method=$2                  # Supported: ALLKV, PyramidKV, PyramidInfer, SnapKV, H2O, StreamingLLM
max_capacity_prompts=$3
attn_implementation=$4     # Supported: "flash_attention_2", "sdpa", "eager"
source_path=$5
model_path=$6
decoding_metric=$7         # H2O: None or h2o; SCOPE: slide, adaptive, or discontinuous
decoding_window_size=$8
save_dir=$9                # path to the results directory
K=${10}                    # 30 or 60; note ${10}, since bash parses $10 as ${1}0
T=${11}                    # number of examples
decoding_recent_size=${12} # assumed 12th argument: referenced below but never set in the original snippet

python3 run_longgenbench.py \
    --method ${method} \
    --model_path ${model_path} \
    --max_capacity_prompts ${max_capacity_prompts} \
    --attn_implementation ${attn_implementation} \
    --save_dir ${save_dir} \
    --use_cache True \
    --K ${K} \
    --decoding_window_size ${decoding_window_size} \
    --decoding_recent_size ${decoding_recent_size} \
    --decoding_metric ${decoding_metric} \
    --max_num_examples ${T}
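
For example, assuming the block above is saved as scripts/scripts_longgenbench/longgenbench.sh (the script name and every argument value below are illustrative, not taken from the repository):

bash scripts/scripts_longgenbench/longgenbench.sh 0 H2O 2048 flash_attention_2 ./data ./models/Llama-3.1-8B-Instruct slide 512 ./results 30 100 256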

Eval Acc

results_dir=$1

python3 eval_gen.py \
    --results_dir ${results_dir}
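
For example, to score one of the released result folders:

python3 eval_gen.py --results_dir ./results_longgenbench_4K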

Performance in LongGenBench (Llama3.1-8B-Instruct)


The run scripts (bash files) for these experiments are located in the scripts/scripts_longgenbench folder, and the experimental results can be found in results_longgenbench_4K and results_longgenbench_8K.

Performance on the GSM8K+ task from LongGenBench-4K (Llama3.1-8B-Instruct)

Plug-in experiment results for Llama3.1-8B-Instruct on the GSM8K+ task from LongGenBench-4K.


The run scripts (bash files) for these experiments are located in the scripts/scripts_longgenbench folder, and the experimental results can be found in results_longgenbench_gsm8k_plug_in.

TODO

  • Fix the offset bug
  • Improve the README (expand documentation, add examples, and ensure clarity)
  • Reorganize the code for a better user experience

Citation

If you find our work valuable, we would appreciate your citation: 🎈

@article{wu2024scope,
  title={SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation},
  author={Wu, Jialong and Wang, Zhenglin and Zhang, Linhai and Lai, Yilong and He, Yulan and Zhou, Deyu},
  journal={arXiv preprint arXiv:2412.13649},
  year={2024}
}

Acknowledgements

  • Thanks to SnapKV and PyramidKV (KVCache-Factory) for providing open-source code that supported the expansion of this project. 🎁

  • Special thanks to LOOK-M for the beautifully designed README template, which we referenced. 🎨

  • Shoutout to @Lueci4er on GitHub for valuable suggestions on code details, which we adopted. 🛠️

The code is still being organized. 🚧
