LLM inference servers reuse cached key-value computations across requests by storing shared prompt prefixes in a radix-tree cache. This project measures how much performance existing eviction policies (LRU, LFU, FIFO, etc.) leave on the table compared to an offline oracle (tc_belady) under a fixed-block radix-tree model, using real ShareGPT conversation traces.
The current interactive report is hosted here:
https://superattention.github.io/llm-prefix-cache-analysis/report_site
This figure shows token hit rate across cache budgets for each policy, plus the relative gap between tc_belady and the best non-oracle policy at each budget. It supports the benchmark-level observation that the online policies leave measurable headroom under this tree-constrained oracle, especially at mid-sized budgets. PDF: gap_curve.pdf
This figure summarizes the shadow-oracle disagreement records for lru at the selected peak-gap cache size. Group A records are blocks LRU evicted but Belady kept; group B records are blocks Belady evicted but LRU kept. The group summaries indicate that B blocks are often recently touched, deep in the tree, and never reused, which is evidence for the "false-warm deep tail" failure mode in this trace. PDF: mechanism_scatter.pdf
src/— cache model, eviction policies, simulator, metrics, oracle, mechanism analysis, and plotting codescripts/— command-line pipeline stagestests/— unit tests for trace construction, simulation, paging, mechanism analysis, plotting, and report-site contentcache/— default local output directory for generated traces, simulation results, and mechanism-analysis pickles (not required to exist before running the pipeline)paper/figures/— default output directory for generated plots fromscripts/04_plot.pybenchmark_results_llama_alt.md— current factual benchmark note
Eviction policies implemented: lru, lfu, fifo, mru, filo, slru, priority, random, depth_lru, and the tree-constrained offline oracle tc_belady.
- Dataset source: cached local snapshot of
anon8231489123/ShareGPT_Vicuna_unfiltered - Tokenizer:
NousResearch/Meta-Llama-3-8B-Alternate-Tokenizer - Conversations:
300 - Accesses:
978 - Ordering: interleaved, random with seed
0 - Cache model: fixed-size block radix tree, capacity measured in blocks
- Default reported page size:
1token per block
Trace: 300 ShareGPT conversations, 978 accesses, 811,050 requested tokens, 319,862 unique block prefixes.
Token hit rate (showing tc_belady, lru, depth_lru, and random as representative oracle, best, near-best, and baseline):
| strategy | 1271 blocks | 20160 blocks | 319862 blocks |
|---|---|---|---|
tc_belady |
0.0969 | 0.4339 | 0.6056 |
lru |
0.0192 | 0.1850 | 0.6056 |
depth_lru |
0.0201 | 0.1860 | 0.6056 |
random |
0.0175 | 0.1472 | 0.6056 |
Peak relative gap for lru vs tc_belady: 80.2% at 1271 blocks. At the 20160-token budget, the absolute lru gap is 0.2488 token-hit-rate points. Full results are in benchmark_results_llama_alt.md and the generated CSV/Markdown sidecars.
Generated binary artifacts are written wherever the script --output argument points. The documented pipeline writes:
- Trace pickle:
cache/sharegpt-trace-300-llama-alt.pkl - Main benchmark results:
cache/sharegpt-results-300-llama-alt-benchmark.pkl - Mechanism analysis:
cache/sharegpt-mechanism-300-llama-alt.pkl - Generated figures:
paper/figures/gap_curve.pdfandpaper/figures/mechanism_scatter.pdf - Simulation sidecars:
cache/sharegpt-results-300-llama-alt-benchmark.csvandcache/sharegpt-results-300-llama-alt-benchmark.md - Mechanism sidecars:
cache/sharegpt-mechanism-300-llama-alt-summary.csvandcache/sharegpt-mechanism-300-llama-alt-summary.md
The reported numbers should remain the same when rerunning from the same cache/sharegpt-trace-300-llama-alt.pkl. Rebuilding the trace from Hugging Face can change if the upstream dataset file or tokenizer revision changes, so the trace pickle is the reproducibility boundary for the simulation and mechanism-analysis stages.
Requires Python 3.10+. Install dependencies:
pip install -r requirements.txtPrepare a tokenized trace:
python scripts/01_prepare_trace.py \
--limit 300 \
--order interleaved \
--seed 0 \
--dataset anon8231489123/ShareGPT_Vicuna_unfiltered \
--tokenizer NousResearch/Meta-Llama-3-8B-Alternate-Tokenizer \
--output cache/sharegpt-trace-300-llama-alt.pklNote:
anon8231489123/ShareGPT_Vicuna_unfilteredcontains mixed file types and multiple JSON files, sodatasets.load_dataset()cannot infer it directly. The trace-preparation helper falls back to the known benchmark fileShareGPT_V3_unfiltered_cleaned_split.json. On a fresh machine this may download a large Hugging Face file. For exact reproduction, prefer reusing the generated trace pickle or pass a local JSON snapshot:python scripts/01_prepare_trace.py \ --limit 300 \ --order interleaved \ --seed 0 \ --dataset /path/to/ShareGPT_V3_unfiltered_cleaned_split.json \ --tokenizer NousResearch/Meta-Llama-3-8B-Alternate-Tokenizer \ --output cache/sharegpt-trace-300-llama-alt.pkl
Run the main simulation sweep:
python scripts/02_simulate.py \
--trace cache/sharegpt-trace-300-llama-alt.pkl \
--output cache/sharegpt-results-300-llama-alt-benchmark.pkl \
--page-size 1 \
--strategies lru lfu fifo mru filo slru priority random tc_belady \
--sizes 319 1271 5061 20160 80302 319862Run mechanism analysis at the peak-gap cache size:
python scripts/03_mechanism.py \
--results cache/sharegpt-results-300-llama-alt-benchmark.pkl \
--trace cache/sharegpt-trace-300-llama-alt.pkl \
--output cache/sharegpt-mechanism-300-llama-alt.pkl \
--baseline lru \
--page-size 1Generate paper figures from result artifacts:
python scripts/04_plot.py \
--results cache/sharegpt-results-300-llama-alt-benchmark.pkl \
--mechanism cache/sharegpt-mechanism-300-llama-alt.pkl \
--figures-dir paper/figures \
--strategy lruThe simulator uses page/block-aligned prefix-cache accounting, which is aligned with SGLang's page_size model: a page groups tokens into a KV-cache block, prefix reuse requires complete pages, and page_size = 1 gives token-level matching. The simplifications are elsewhere:
- no production KV allocator state
- no lock/reference-state modeling
- no host/device cache hierarchy
- no extra-key namespaces
- no Eagle/bigram mode
The goal is a clean, analyzable eviction study under a page-aligned radix model. tc_belady should be read as an upper bound under the same leaf-only candidate constraint, not as an unconstrained global optimum for production prefix caching.
pytestRun a focused test file while iterating:
pytest tests/test_simulate.pyRun an individual test by node id:
pytest tests/test_simulate.py::test_run_suite_reports_both_metrics_for_each_cache_size
