feat: Add int8 KV cache compression with head-major layout and async pipelining#229

Open
naalo2 wants to merge 11 commits into GeeeekExplorer:main from naalo2:main
naalo2 commented May 8, 2026

Closes #228

Summary

Add int8 KV cache compression with head-major memory layout and async stream pipelining to hide KV store latency behind attention computation.

Key Changes

  • config.py: Add quant flag and group_num configuration fields
  • context.py / sequence.py: Update parameter passing interfaces and methods
  • attention.py: Add quant computation branch and quant_store operator
  • model_runner.py: Add quant KV cache initialization branch and CUDA graph capture under quant path
  • tools/quant_attn_kvhead_based.py: Implement concrete int8 quantization compute operators
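
For reviewers unfamiliar with grouped int8 quantization, here is a minimal pure-Python sketch of the general technique. It is not the code in tools/quant_attn_kvhead_based.py. It assumes `group_num` is the number of scale groups per head vector and that quantization is symmetric with one scale per group; the function names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float], group_num: int) -> Tuple[List[int], List[float]]:
    """Symmetric per-group int8 quantization of one head vector.

    Assumption: `group_num` splits the vector into equal-sized groups,
    each with its own scale. The PR may define it differently.
    """
    if group_num <= 0 or len(values) % group_num != 0:
        raise ValueError("len(values) must be divisible by a positive group_num")
    group_size = len(values) // group_num
    quantized: List[int] = []
    scales: List[float] = []
    for g in range(group_num):
        group = values[g * group_size:(g + 1) * group_size]
        amax = max(abs(v) for v in group)
        # Map the largest magnitude in the group to 127; avoid dividing by zero.
        scale = amax / 127.0 if amax > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer and clamp to the symmetric int8 range.
        quantized.extend(max(-127, min(127, round(v / scale))) for v in group)
    return quantized, scales


def dequantize_int8(quantized: List[int], scales: List[float]) -> List[float]:
    """Recover approximate floats by multiplying each value by its group's scale."""
    group_size = len(quantized) // len(scales)
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]


if __name__ == "__main__":
    q, s = quantize_int8([0.5, -1.0, 2.0, 4.0], group_num=2)
    print(q, s, dequantize_int8(q, s))
```

More groups mean more scales to store but a tighter error bound per group, since each scale only has to cover the range of its own slice.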

Benefits

  • Reduced KV cache memory footprint via int8 quantization
  • Improved GPU memory access efficiency via head-major coalesced layout
  • Better throughput via async pipelining overlapping KV transfers with computation
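
To make the first two benefits concrete, the sketch below estimates KV cache size with and without int8 storage and compares flat offsets for a head-major versus a token-major block. Everything here is an assumption for illustration: the helper names, the per-block shape `[num_kv_heads, block_size, head_dim]` for "head-major", the fp32 scale size, and the example model dimensions are not taken from the PR.

```python
def kv_cache_bytes(num_layers: int, num_blocks: int, block_size: int,
                   num_kv_heads: int, head_dim: int, bytes_per_elem: int,
                   group_num: int = 0, scale_bytes: int = 0) -> int:
    """Total bytes for K and V caches, plus per-group scales if quantized."""
    vectors = 2 * num_layers * num_blocks * block_size * num_kv_heads  # K and V
    data = vectors * head_dim * bytes_per_elem
    scales = vectors * group_num * scale_bytes  # zero for the unquantized case
    return data + scales


def head_major_offset(head: int, token: int, dim: int,
                      block_size: int, head_dim: int) -> int:
    """Flat offset in a block laid out as [num_kv_heads, block_size, head_dim]."""
    return (head * block_size + token) * head_dim + dim


def token_major_offset(head: int, token: int, dim: int,
                       num_kv_heads: int, head_dim: int) -> int:
    """Flat offset in a block laid out as [block_size, num_kv_heads, head_dim]."""
    return (token * num_kv_heads + head) * head_dim + dim


if __name__ == "__main__":
    shape = dict(num_layers=28, num_blocks=1000, block_size=256,
                 num_kv_heads=8, head_dim=128)  # hypothetical model shape
    fp16 = kv_cache_bytes(**shape, bytes_per_elem=2)
    int8 = kv_cache_bytes(**shape, bytes_per_elem=1, group_num=4, scale_bytes=4)
    print(f"int8 / fp16 size ratio: {int8 / fp16}")
    # Stride between consecutive tokens of the same head in each layout.
    print(head_major_offset(0, 1, 0, 256, 128) - head_major_offset(0, 0, 0, 256, 128))
    print(token_major_offset(0, 1, 0, 8, 128) - token_major_offset(0, 0, 0, 8, 128))
```

Under these assumptions, consecutive tokens of one head sit `head_dim` elements apart in the head-major layout, so a per-head read walks contiguous memory, while the token-major layout strides by `num_kv_heads * head_dim`. The scale overhead keeps the int8 footprint above half of fp16.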

Development

Successfully merging this pull request may close these issues.

[change] Int8 KV Cache + Async Pipeline + Head-major reordering for 22% Throughput Boost