Official implementation of ATTS: Asynchronous Test-Time Scaling via Conformal Prediction.
ATTS achieves up to 56.7x speedup and 4.14x throughput improvement in test-time scaling while maintaining statistical guarantees through conformal prediction.
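The statistical guarantee comes from split conformal prediction, which reduces to a quantile rule on held-out calibration scores. A minimal illustrative sketch (not the ATTS implementation; the score values here are made up):

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """With n calibration scores, the ceil((n + 1) * (1 - alpha))-th
    smallest score upper-bounds a fresh exchangeable test score with
    probability at least 1 - alpha."""
    n = len(cal_scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(cal_scores)[k - 1]

# Toy calibration scores (e.g. per-sample perplexities)
cal = [0.2, 0.5, 0.9, 1.3, 1.7, 2.0, 2.4, 3.1, 3.8]
tau = conformal_threshold(cal, alpha=0.2)
print(tau)  # 3.1: test scores <= tau are accepted
```

Accepting exactly the test points whose score falls at or below `tau` then yields at least 80% marginal coverage in this toy example.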
ATTS requires SGLang 0.4.3.post4 and sgl-kernel 0.0.3.post6. Since sgl-kernel is not on PyPI, we provide a pre-built wheel in third_party/sgl_kernel.zip. Use the following steps (from the repo root).
1. Prerequisites

- Python 3.11
- uv (recommended) or pip
- CUDA Toolkit (with `nvcc` in `PATH` if you build from source)
- PyTorch with CUDA support (installed via `requirements.txt`)
```bash
# Optional: ensure CUDA is in PATH
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

2. Create virtual environment and activate
```bash
uv venv .sglang --python 3.11
source .sglang/bin/activate
```

3. Install sgl-kernel from bundled wheel
Unzip the pre-built sgl-kernel wheel and copy it into the venv's site-packages (paths assume you are in the repo root from step 2, then `cd third_party` here):
```bash
cd third_party
unzip sgl_kernel.zip
cp -r 0.0.3.post6-cp39-abi3-manylinux2014_x86_64/sgl_kernel ../.sglang/lib/python3.11/site-packages/
cp -r 0.0.3.post6-cp39-abi3-manylinux2014_x86_64/sgl_kernel-0.0.3.post6.dist-info ../.sglang/lib/python3.11/site-packages/
rm -rf 0.0.3.post6-cp39-abi3-manylinux2014_x86_64/
```

4. Install remaining dependencies (SGLang and extras)
Still in `third_party/`:

```bash
uv pip install -r requirements.txt
cd ..
```

This installs SGLang 0.4.3.post4 from the bundled source (third_party/sglang-0.4.3.post4/python) and all other dependencies.
5. Pre-compile FlashInfer kernels (recommended for H100/H200)
On first run, FlashInfer JIT-compiles CUDA kernels which can take minutes or hang. Pre-compile them once:
```bash
bash scripts/precompile_kernels.sh
```

Note: If the `nvcc` version (e.g. 12.9) differs from PyTorch's bundled CUDA runtime (e.g. 12.4), this script automatically patches `libcudart.so.12` to avoid `undefined symbol` errors. See docs/FLASHINFER_WARMUP.md for details.
6. Verify installation
```bash
python -c "import sglang; print('sglang version:', sglang.__version__); from sglang import Engine; print('OK')"
```

Expected output: `sglang version: 0.4.3.post4` and `OK`.
Start the SGLang servers for inference (run from repo root):
```bash
bash scripts/launch_sglang_servers.sh
```

Default configuration:

- Small Model: `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` (port 40000)
- Eval Model: `Qwen/QwQ-32B` (port 40001)
To customize models or GPUs, edit the configuration variables in scripts/launch_sglang_servers.sh:
```bash
SMALL_MODEL="your-model"
EVAL_MODEL="your-eval-model"
SMALL_MODEL_DEVICE="0"
EVAL_MODEL_DEVICES="1,2"
```

To stop the servers:

```bash
kill $(cat small_model.pid) $(cat eval_model.pid)
```

ATTS supports two conformal prediction coverage modes. Choose the one that fits your use case:
| Coverage Mode | Description | PPL Calibration Script | Async Evaluation Script |
|---|---|---|---|
| Marginal Coverage | Calibrates a single threshold across the entire dataset. Guarantees that the overall (marginal) coverage rate meets the target level. | scripts/suite_conformal.sh | scripts/suite_async.sh |
| Conditional Coverage | Calibrates a per-question threshold. Provides stronger, question-level coverage guarantees (conditional on each input). | scripts/suite_conformal_per_question.sh | scripts/suite_async_per_question.sh |
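The practical difference between the two modes can be sketched in plain Python (illustrative only; the `quantile` helper, toy `ppls` array, and `alpha` are assumptions, not the repo's code):

```python
import math

def quantile(scores, alpha):
    """Conformal quantile of a list of nonconformity scores."""
    n = len(scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(scores)[k - 1]

# Hypothetical PPL arrays: rows = questions, columns = sampled generations
ppls = [
    [1.1, 1.4, 2.0, 2.2],  # question 0
    [3.0, 3.5, 4.1, 4.4],  # question 1
]
alpha = 0.25

# Marginal: one threshold pooled over the whole dataset
marginal_tau = quantile([p for row in ppls for p in row], alpha)

# Conditional: one threshold per question
per_question_tau = [quantile(row, alpha) for row in ppls]
print(marginal_tau, per_question_tau)  # 4.1 [2.2, 4.4]
```

An easy question (low PPLs) gets a tighter per-question threshold than the pooled one, which is what buys the stronger conditional guarantee.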
Step 1: Generate PPL arrays (marginal calibration):

```bash
bash scripts/suite_conformal.sh
```

Step 2: Run asynchronous evaluation (marginal coverage):

```bash
bash scripts/suite_async.sh
```

Or run the Python modules directly:
```bash
# PPL calibration (marginal)
python -m ATTS.ref_conformal \
    --small_model_name "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --eval_model_name "Qwen/QwQ-32B" \
    --dataset_name "aime24" \
    --ppl_array_path "ppls_aime24.npy" \
    --small_model_port 40000 \
    --eval_model_port 40001

# Async inference (marginal)
python -m ATTS.ref_async \
    --small_model_name "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --eval_model_name "Qwen/QwQ-32B" \
    --dataset_name "aime24"
```

Step 1: Generate PPL arrays (per-question calibration):

```bash
bash scripts/suite_conformal_per_question.sh
```

Step 2: Run asynchronous evaluation (conditional coverage):

```bash
bash scripts/suite_async_per_question.sh
```

Or run the Python modules directly:
```bash
# PPL calibration (per-question)
python -m ATTS.ref_conformal_per_question \
    --small_model_name "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --eval_model_name "Qwen/QwQ-32B" \
    --dataset_name "aime24" \
    --ppl_array_path "ppls_aime24_per_question.npy" \
    --small_model_port 40000 \
    --eval_model_port 40001

# Async inference (per-question)
python -m ATTS.ref_async_per_question \
    --small_model_name "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --eval_model_name "Qwen/QwQ-32B" \
    --dataset_name "aime24"
```

Note: All suite scripts automatically handle the SGLang server lifecycle (start → calibrate/evaluate → stop) for each model configuration. Edit the `CONFIGS` array and configuration variables inside each script to customize models, datasets, GPUs, and hyperparameters.
After evaluation completes, the default answer extraction uses regex (`\boxed{}` / `ANSWER: X`). You can use a stronger LLM to re-extract answers from the saved reasoning history and recompute accuracy:
```bash
bash scripts/re_extract_answer.sh
```

This script:

- Reads all `problem_XXXX.json` files from the results directory
- Re-extracts `final_answer` using an LLM (via an OpenAI-compatible API)
- Computes accuracy against ground truth using `anyone_check`
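For reference, the regex fallback can be approximated as follows (a sketch; the exact patterns in the repo may differ):

```python
import re

def extract_answer(text):
    """Extract a final answer via \\boxed{...} or 'ANSWER: X' patterns."""
    m = re.search(r"\\boxed\{([^{}]+)\}", text)
    if m:
        return m.group(1).strip()
    m = re.search(r"ANSWER:\s*(.+)", text)
    if m:
        return m.group(1).strip()
    return None

print(extract_answer(r"so the result is \boxed{42}."))  # 42
print(extract_answer("Final line.\nANSWER: 3/4"))       # 3/4
```

Such patterns miss nested braces and free-form phrasings, which is why LLM-based re-extraction is offered.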
Configuration (edit `scripts/re_extract_answer.sh`):
```bash
OPENAI_API_KEY="your-api-key"
OPENAI_BASE_URL="https://your-api-endpoint/v1"
OPENAI_MODEL="gpt-4o"
DATASET_NAME="math500"  # must match the dataset used during evaluation
REPEATS=16              # must match the repeats used during evaluation
```

You can also run it directly:
```bash
export OPENAI_API_KEY="your-key"
export OPENAI_BASE_URL="https://your-api-endpoint/v1"
export OPENAI_MODEL="gpt-4o"
python -m ATTS.re_extract \
    --input_dir "./results/<your_result_dir>" \
    --dataset_name math500 \
    --repeats 16 \
    --concurrency 32
```

Add `--dry_run` to preview without modifying files.
Supported datasets:

- AIME24, AIME25
- AMC23
- MATH500
- Olympiad
- GPQA
The framework supports various draft-target model pairs:
- DeepSeek-R1-Distill (1.5B/8B) + QwQ-32B
- Qwen2.5-7B + QwQ-32B
- Llama-3.1-8B + simplescaling/s1.1-32B
- ATTS Python code lives under `ATTS/`. Run from the repo root: `python -m ATTS.<module> ...`
  - Marginal coverage: `ref_conformal.py`, `ref_async.py`
  - Conditional coverage (per-question): `ref_conformal_per_question.py`, `ref_async_per_question.py`
- Shell scripts (launch, suite, profiling, etc.) are under `scripts/`. Run them from the repo root:
  - Marginal coverage: `bash scripts/suite_conformal.sh` / `bash scripts/suite_async.sh`
  - Conditional coverage: `bash scripts/suite_conformal_per_question.sh` / `bash scripts/suite_async_per_question.sh`
Edit variables in the shell scripts to customize:
- `SAMPLE_SIZE`: Number of samples per question (default: 16)
- `SMALL_MODEL_MAX_TOKENS`: Max tokens for the small model (default: 500)
- `SMALL_MODEL_TEMPERATURE`: Sampling temperature (default: 0.8)
- `CUDA_VISIBLE_DEVICES`: GPU allocation
| Variable | Default | Description |
|---|---|---|
| `SGLANG_HOST` | `0.0.0.0` | Host address for SGLang servers |
| `SMALL_MODEL_PORT` | `52100` | Port for the small (draft) model server |
| `EVAL_MODEL_PORT` | `52101` | Port for the evaluation (target) model server |
| `SMALL_MODEL_DEVICE` | `"2"` | CUDA device ID for the small model (single GPU) |
| `EVAL_MODEL_DEVICES` | `"3,4"` | CUDA device IDs for the eval model (supports multi-GPU tensor parallelism) |
| Variable | Default | Description |
|---|---|---|
| `SAMPLE_SIZE` | `16` | Number of repeated samples per question (pass@k evaluation) |
| `DEFAULT_TURNS` | `15` | Maximum number of small-model/eval-model interaction turns per sample; can be overridden per-config in `CONFIGS` |
| `SMALL_MODEL_MAX_TOKENS` | `500` | Maximum number of tokens the small model generates per turn |
| `EVAL_MODEL_MAX_TOKENS` | `500` | Maximum number of tokens the eval model generates per turn (when the PPL percentile exceeds the threshold) |
| `SMALL_MODEL_TEMPERATURE` | `0.8` | Sampling temperature for the small model |
| `SMALL_MODEL_CONFORMAL_TEMPERATURE` | `0.8` | Temperature used during conformal PPL calibration (used in log/output file naming to match the corresponding PPL array) |
| `EVAL_MODEL_TEMPERATURE` | `0.8` | Sampling temperature for the eval model |
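The turn loop these variables control can be pictured with a small simulation (a simplification, not the exact ATTS control flow):

```python
def run_turns(draft_ppls, tau, max_turns=15):
    """Per turn, keep the small model's draft when its PPL is within the
    calibrated threshold tau; otherwise escalate the turn to the eval model."""
    return ["eval" if ppl > tau else "draft" for ppl in draft_ppls[:max_turns]]

# Hypothetical per-turn draft PPLs against a calibrated threshold
print(run_turns([1.2, 3.9, 1.0, 4.2], tau=3.5))  # ['draft', 'eval', 'draft', 'eval']
```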
| Variable | Default | Description |
|---|---|---|
| `USE_EVAL_CHAT_TEMPLATE` | `1` | Whether the eval model applies the tokenizer's chat template when computing PPL. `1` = apply chat template (messages formatted with `<\|im_start\|>` / `<\|im_end\|>` etc.), `0` = raw text concatenation. This value is also appended to output paths as `_ct{0\|1}` for easy differentiation |
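For context, the perplexity of a scored span is the exponentiated negative mean token log-probability; a quick sketch (assuming natural-log logprobs, as most inference servers return):

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-mean(logprob)) over the scored tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

print(perplexity([-0.1, -0.2, -0.1]))  # low PPL: confident sequence
print(perplexity([-2.0, -3.0, -2.5]))  # high PPL: uncertain sequence
```

Applying the chat template changes the scored token sequence and hence the PPLs, which is why the setting is recorded in the output paths.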
| Variable | Default | Description |
|---|---|---|
| `SMALL_MODEL_CONCURRENCY` | `16` | Maximum number of concurrent requests to the small model |
| `EVAL_MODEL_CONCURRENCY` | `4` | Maximum number of concurrent requests to the eval model |
| `MAX_RETRIES` | `3` | Number of HTTP request retries on connection errors |
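These caps map naturally onto an `asyncio.Semaphore` with a retry loop; a self-contained sketch (the request function is simulated, not the repo's client code):

```python
import asyncio

seen = {}  # tracks which payloads have already failed once (simulation only)

async def fake_request(payload):
    # Stand-in for an HTTP call: payload 2 fails once, then succeeds
    if payload == 2 and not seen.get(2):
        seen[2] = True
        raise ConnectionError
    return payload * 10

async def send_with_limit(sem, payload, max_retries=3):
    """Bound in-flight requests with a semaphore; retry on connection errors."""
    async with sem:
        for attempt in range(max_retries):
            try:
                return await fake_request(payload)
            except ConnectionError:
                await asyncio.sleep(0.01 * (attempt + 1))  # simple backoff
        raise RuntimeError(f"failed after {max_retries} retries")

async def main():
    sem = asyncio.Semaphore(4)  # e.g. EVAL_MODEL_CONCURRENCY=4
    return await asyncio.gather(*(send_with_limit(sem, i) for i in range(8)))

results = asyncio.run(main())
print(results)  # [0, 10, 20, 30, 40, 50, 60, 70] despite one transient failure
```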
| Variable | Default | Description |
|---|---|---|
| `EXTRACT_MODE` | `"llm"` | Answer extraction mode. `"regex"`: extract via `\boxed{}` / `ANSWER: X` pattern matching. `"llm"`: use an external LLM (configured below) for more robust extraction |
| `OPENAI_API_KEY` | `""` | API key for the OpenAI-compatible endpoint used by `anyone_check` answer evaluation and LLM-based extraction |
| `OPENAI_BASE_URL` | `"https://..."` | Base URL for the OpenAI-compatible API |
| `OPENAI_MODEL` | `"gpt-5.2"` | Model name used for answer evaluation and extraction |
We provide two baselines in this repo for comparison with ATTS. You can reproduce them as follows.
SpecReason is a speculative-reasoning baseline (draft + target with vLLM). To test it:
- Environment (from the repo root):

  ```bash
  conda create -n specreason python=3.12 -y && conda activate specreason
  pip install vllm datasets
  ```

  For vLLM speculative decoding you may need to install from source; see specreason/README.md.

- Start two vLLM servers (e.g. in two terminals; 32B on port 30000, 1.5B on port 30001):

  ```bash
  VLLM_USE_V1=0 vllm serve Qwen/QwQ-32B --dtype auto -tp 2 --max_model_len 8192 --gpu-memory-utilization 0.8 --enable-prefix-caching --port 30000
  VLLM_USE_V1=0 vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --dtype auto -tp 2 --max_model_len 8192 --gpu-memory-utilization 0.1 --enable-prefix-caching --port 30001
  ```

- Run the SpecReason baseline (single problem; optionally change `--problem_id` / `--dataset_name`):

  ```bash
  cd specreason
  mkdir -p results && OUTPUT_DIR=./results python spec_reason.py --dataset_name aime --problem_id 60 --repeat_id 0 --score_threshold 7.0 --score_method greedy --token_budget 8192 --output_dir "$OUTPUT_DIR"
  ```

Results go to `specreason/results/`. For full datasets and batch scripts see specreason/README.md and `specreason/spec_reason_della_*.sh`.
Speculative thinking baseline (SkyThought-style evals with sglang/vLLM). To test it:
- Environment (from the repo root):

  ```bash
  cd speculative_thinking
  python -m venv .venv && source .venv/bin/activate
  pip install sglang vllm  # see speculative_thinking/skythought_evals/requirements.txt for full deps
  ```

- Test the normal (non-speculative) model (no draft model):

  ```bash
  python ./skythought_evals/eval.py --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
      --evals amc23 --n 1 --result-dir ./eval_out --tp 2 --output-file ./eval_out/32B.txt
  ```

- Test speculative thinking (draft + target). Pick a config from `speculative/config/` (e.g. `1b_14b.yml`) or add your own, then:

  ```bash
  python ./skythought_evals/eval.py --evals amc23 --n 1 --result-dir ./eval_out \
      --tp 3 --output-file ./eval_out/1b_14b.txt --spe_config ./speculative/config/1b_14b.yml
  ```

Results are written to the paths given by `--result-dir` and `--output-file`. More options and config format: speculative_thinking/README.md.
```bibtex
@article{xiong2025atts,
  title={ATTS: Asynchronous Test-Time Scaling via Conformal Prediction},
  author={Xiong, Jing and Chen, Qiujiang and Ye, Fanghua and Wan, Zhongwei and Zheng, Chuanyang and Zhao, Chenyang and Shen, Hui and Li, Alexander Hanbo and Tao, Chaofan and Tan, Haochen and others},
  journal={arXiv preprint arXiv:2509.15148},
  url={https://arxiv.org/abs/2509.15148},
  year={2025}
}
```

For questions or issues, please open a GitHub issue.