Recently, KV cache compression has emerged as a critical optimization technique for large language models (LLMs). The KV cache exhibits strong temporal and spatial locality, similar to time-series data: adjacent tokens and attention heads often share redundant patterns. Given these characteristics, can this algorithm (or approach) be adapted effectively to KV cache compression while maintaining efficient GPU execution?
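To make the locality argument concrete, here is a minimal sketch (not your algorithm, just an illustration of the kind of redundancy I mean): because adjacent tokens' K/V entries change slowly, delta-encoding along the token axis followed by low-bit quantization of the residuals preserves most of the signal. The function names and the uniform int8 quantization scheme are my own assumptions for illustration.

```python
import numpy as np

def delta_compress_kv(kv: np.ndarray, bits: int = 8):
    """Illustrative only: exploit temporal locality by delta-encoding
    adjacent-token KV entries, then uniformly quantizing the residuals.
    kv has shape (seq_len, num_heads, head_dim)."""
    base = kv[0].astype(np.float32)              # first token kept exactly
    deltas = np.diff(kv.astype(np.float32), axis=0)  # adjacent-token residuals (small)
    scale = float(np.abs(deltas).max()) / (2 ** (bits - 1) - 1)
    if scale == 0.0:
        scale = 1.0                              # all-constant cache edge case
    q = np.round(deltas / scale).astype(np.int8)
    return base, q, scale

def delta_decompress_kv(base: np.ndarray, q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct the cache by cumulatively summing dequantized deltas."""
    deltas = q.astype(np.float32) * scale
    rest = base[None] + np.cumsum(deltas, axis=0)
    return np.concatenate([base[None], rest], axis=0)

# Smoothly-varying synthetic cache: deltas are tiny, so int8 residuals suffice.
rng = np.random.default_rng(0)
kv = np.cumsum(rng.normal(0.0, 0.01, (16, 4, 8)).astype(np.float32), axis=0)
base, q, scale = delta_compress_kv(kv)
rec = delta_decompress_kv(base, q, scale)
```

The question then is whether this style of exploiting adjacent-token redundancy matches what your algorithm does, and whether the sequential cumsum-style dependency can still be executed efficiently on GPU.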