Official implementation of ATTS: Asynchronous Test-Time Scaling via Conformal Prediction.
ATTS achieves up to 56.7x speedup and 4.14x throughput improvement in test-time scaling while maintaining statistical guarantees through conformal prediction.
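The statistical guarantee comes from split conformal prediction, which reduces to a quantile rule on held-out calibration scores. A minimal illustrative sketch (not the ATTS implementation; the score values here are made up):

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """With n calibration scores, the ceil((n + 1) * (1 - alpha))-th
    smallest score upper-bounds a fresh exchangeable test score with
    probability at least 1 - alpha."""
    n = len(cal_scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(cal_scores)[k - 1]

# Toy calibration scores (e.g. per-sample perplexities)
cal = [0.2, 0.5, 0.9, 1.3, 1.7, 2.0, 2.4, 3.1, 3.8]
tau = conformal_threshold(cal, alpha=0.2)
print(tau)  # 3.1: test scores <= tau are accepted
```

Accepting exactly the test points whose score falls at or below `tau` then yields at least 80% marginal coverage in this toy example.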
ATTS requires SGLang 0.4.3.post4 and sgl-kernel 0.0.3.post6. Since sgl-kernel is not on PyPI, we provide a pre-built wheel in third_party/sgl_kernel.zip. Use the following steps (from the repo root).
1. Prerequisites

- Python 3.11
- uv (recommended) or pip
- CUDA Toolkit (with `nvcc` in `PATH` if you build from source)
- PyTorch with CUDA support (installed via `requirements.txt`)
```bash
# Optional: ensure CUDA is in PATH
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

2. Create virtual environment and activate
```bash
uv venv .sglang --python 3.11
source .sglang/bin/activate
```

3. Install sgl-kernel from bundled wheel
Unzip the pre-built sgl-kernel wheel and copy it into the venv's site-packages (paths assume you are in the repo root from step 2, then `cd third_party` here):
```bash
cd third_party
unzip sgl_kernel.zip
cp -r 0.0.3.post6-cp39-abi3-manylinux2014_x86_64/sgl_kernel ../.sglang/lib/python3.11/site-packages/
cp -r 0.0.3.post6-cp39-abi3-manylinux2014_x86_64/sgl_kernel-0.0.3.post6.dist-info ../.sglang/lib/python3.11/site-packages/
rm -rf 0.0.3.post6-cp39-abi3-manylinux2014_x86_64/
```

4. Install remaining dependencies (SGLang and extras)
Still in `third_party/`:

```bash
uv pip install -r requirements.txt
cd ..
```

This installs SGLang 0.4.3.post4 from the bundled source (third_party/sglang-0.4.3.post4/python) and all other dependencies.
5. Pre-compile FlashInfer kernels (recommended for H100/H200)
On first run, FlashInfer JIT-compiles CUDA kernels which can take minutes or hang. Pre-compile them once:
```bash
bash scripts/precompile_kernels.sh
```

Note: If the `nvcc` version (e.g. 12.9) differs from PyTorch's bundled CUDA runtime (e.g. 12.4), this script automatically patches `libcudart.so.12` to avoid `undefined symbol` errors. See docs/FLASHINFER_WARMUP.md for details.
6. Verify installation
```bash
python -c "import sglang; print('sglang version:', sglang.__version__); from sglang import Engine; print('OK')"
```

Expected output: `sglang version: 0.4.3.post4` and `OK`.
Start the SGLang servers for inference (run from repo root):
```bash
bash scripts/launch_sglang_servers.sh
```

Default configuration:

- Small Model: `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` (port 40000)
- Eval Model: `Qwen/QwQ-32B` (port 40001)
To customize models or GPUs, edit the configuration variables in scripts/launch_sglang_servers.sh:
```bash
SMALL_MODEL="your-model"
EVAL_MODEL="your-eval-model"
SMALL_MODEL_DEVICE="0"
EVAL_MODEL_DEVICES="1,2"
```

To stop the servers:

```bash
kill $(cat small_model.pid) $(cat eval_model.pid)
```

ATTS supports two conformal prediction coverage modes. Choose the one that fits your use case:
| Coverage Mode | Description | PPL Calibration Script | Async Evaluation Script |
|---|---|---|---|
| Marginal Coverage | Calibrates a single threshold across the entire dataset. Guarantees that the overall (marginal) coverage rate meets the target level. | scripts/suite_conformal.sh | scripts/suite_async.sh |
| Conditional Coverage | Calibrates a per-question threshold. Provides stronger, question-level coverage guarantees (conditional on each input). | scripts/suite_conformal_per_question.sh | scripts/suite_async_per_question.sh |
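The practical difference between the two modes can be sketched in plain Python (illustrative only; the `quantile` helper, toy `ppls` array, and `alpha` are assumptions, not the repo's code):

```python
import math

def quantile(scores, alpha):
    """Conformal quantile of a list of nonconformity scores."""
    n = len(scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(scores)[k - 1]

# Hypothetical PPL arrays: rows = questions, columns = sampled generations
ppls = [
    [1.1, 1.4, 2.0, 2.2],  # question 0
    [3.0, 3.5, 4.1, 4.4],  # question 1
]
alpha = 0.25

# Marginal: one threshold pooled over the whole dataset
marginal_tau = quantile([p for row in ppls for p in row], alpha)

# Conditional: one threshold per question
per_question_tau = [quantile(row, alpha) for row in ppls]
print(marginal_tau, per_question_tau)  # 4.1 [2.2, 4.4]
```

An easy question (low PPLs) gets a tighter per-question threshold than the pooled one, which is what buys the stronger conditional guarantee.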
Step 1: Generate PPL arrays (marginal calibration):

```bash
bash scripts/suite_conformal.sh
```

Step 2: Run asynchronous evaluation (marginal coverage):

```bash
bash scripts/suite_async.sh
```

Or run the Python modules directly:
```bash
# PPL calibration (marginal)
python -m ATTS.ref_conformal \
    --small_model_name "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --eval_model_name "Qwen/QwQ-32B" \
    --dataset_name "aime24" \
    --ppl_array_path "ppls_aime24.npy" \
    --small_model_port 40000 \
    --eval_model_port 40001

# Async inference (marginal)
python -m ATTS.ref_async \
    --small_model_name "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --eval_model_name "Qwen/QwQ-32B" \
    --dataset_name "aime24"
```

Step 1: Generate PPL arrays (per-question calibration):

```bash
bash scripts/suite_conformal_per_question.sh
```

Step 2: Run asynchronous evaluation (conditional coverage):

```bash
bash scripts/suite_async_per_question.sh
```

Or run the Python modules directly:
```bash
# PPL calibration (per-question)
python -m ATTS.ref_conformal_per_question \
    --small_model_name "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --eval_model_name "Qwen/QwQ-32B" \
    --dataset_name "aime24" \
    --ppl_array_path "ppls_aime24_per_question.npy" \
    --small_model_port 40000 \
    --eval_model_port 40001

# Async inference (per-question)
python -m ATTS.ref_async_per_question \
    --small_model_name "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --eval_model_name "Qwen/QwQ-32B" \
    --dataset_name "aime24"
```

Note: All suite scripts automatically handle the SGLang server lifecycle (start → calibrate/evaluate → stop) for each model configuration. Edit the `CONFIGS` array and configuration variables inside each script to customize models, datasets, GPUs, and hyperparameters.
After evaluation completes, the default answer extraction uses regex (`\boxed{}` / `ANSWER: X`). You can use a stronger LLM to re-extract answers from the saved reasoning history and recompute accuracy:
```bash
bash scripts/re_extract_answer.sh
```

This script:

- Reads all `problem_XXXX.json` files from the results directory
- Re-extracts `final_answer` using an LLM (via an OpenAI-compatible API)
- Computes accuracy against ground truth using `anyone_check`
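For reference, the regex fallback can be approximated as follows (a sketch; the exact patterns in the repo may differ):

```python
import re

def extract_answer(text):
    """Extract a final answer via \\boxed{...} or 'ANSWER: X' patterns."""
    m = re.search(r"\\boxed\{([^{}]+)\}", text)
    if m:
        return m.group(1).strip()
    m = re.search(r"ANSWER:\s*(.+)", text)
    if m:
        return m.group(1).strip()
    return None

print(extract_answer(r"so the result is \boxed{42}."))  # 42
print(extract_answer("Final line.\nANSWER: 3/4"))       # 3/4
```

Such patterns miss nested braces and free-form phrasings, which is why LLM-based re-extraction is offered.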
Configuration (edit `scripts/re_extract_answer.sh`):
```bash
OPENAI_API_KEY="your-api-key"
OPENAI_BASE_URL="https://your-api-endpoint/v1"
OPENAI_MODEL="gpt-4o"
DATASET_NAME="math500"  # must match the dataset used during evaluation
REPEATS=16              # must match the repeats used during evaluation
```

You can also run it directly:
```bash
export OPENAI_API_KEY="your-key"
export OPENAI_BASE_URL="https://your-api-endpoint/v1"
export OPENAI_MODEL="gpt-4o"
python -m ATTS.re_extract \
    --input_dir "./results/<your_result_dir>" \
    --dataset_name math500 \
    --repeats 16 \
    --concurrency 32
```

Add `--dry_run` to preview without modifying files.
Supported datasets:

- AIME24, AIME25
- AMC23
- MATH500
- Olympiad
- GPQA
The framework supports various draft-target model pairs:
- DeepSeek-R1-Distill (1.5B/8B) + QwQ-32B
- Qwen2.5-7B + QwQ-32B
- Llama-3.1-8B + simplescaling/s1.1-32B
- ATTS Python code lives under `ATTS/`. Run from the repo root: `python -m ATTS.<module> ...`
  - Marginal coverage: `ref_conformal.py`, `ref_async.py`
  - Conditional coverage (per-question): `ref_conformal_per_question.py`, `ref_async_per_question.py`
- Shell scripts (launch, suite, profiling, etc.) are under `scripts/`. Run them from the repo root:
  - Marginal coverage: `bash scripts/suite_conformal.sh` / `bash scripts/suite_async.sh`
  - Conditional coverage: `bash scripts/suite_conformal_per_question.sh` / `bash scripts/suite_async_per_question.sh`
Edit variables in the shell scripts to customize:
- `SAMPLE_SIZE`: Number of samples per question (default: 16)
- `SMALL_MODEL_MAX_TOKENS`: Max tokens for the small model (default: 500)
- `SMALL_MODEL_TEMPERATURE`: Sampling temperature (default: 0.8)
- `CUDA_VISIBLE_DEVICES`: GPU allocation
| Variable | Default | Description |
|---|---|---|
| `SGLANG_HOST` | `0.0.0.0` | Host address for SGLang servers |
| `SMALL_MODEL_PORT` | `52100` | Port for the small (draft) model server |
| `EVAL_MODEL_PORT` | `52101` | Port for the evaluation (target) model server |
| `SMALL_MODEL_DEVICE` | `"2"` | CUDA device ID for the small model (single GPU) |
| `EVAL_MODEL_DEVICES` | `"3,4"` | CUDA device IDs for the eval model (supports multi-GPU tensor parallelism) |
| Variable | Default | Description |
|---|---|---|
| `SAMPLE_SIZE` | `16` | Number of repeated samples per question (pass@k evaluation) |
| `DEFAULT_TURNS` | `15` | Maximum number of small-model/eval-model interaction turns per sample; can be overridden per-config in `CONFIGS` |
| `SMALL_MODEL_MAX_TOKENS` | `500` | Maximum number of tokens the small model generates per turn |
| `EVAL_MODEL_MAX_TOKENS` | `500` | Maximum number of tokens the eval model generates per turn (when the PPL percentile exceeds the threshold) |
| `SMALL_MODEL_TEMPERATURE` | `0.8` | Sampling temperature for the small model |
| `SMALL_MODEL_CONFORMAL_TEMPERATURE` | `0.8` | Temperature used during conformal PPL calibration (used in log/output file naming to match the corresponding PPL array) |
| `EVAL_MODEL_TEMPERATURE` | `0.8` | Sampling temperature for the eval model |
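The turn loop these variables control can be pictured with a small simulation (a simplification, not the exact ATTS control flow):

```python
def run_turns(draft_ppls, tau, max_turns=15):
    """Per turn, keep the small model's draft when its PPL is within the
    calibrated threshold tau; otherwise escalate the turn to the eval model."""
    return ["eval" if ppl > tau else "draft" for ppl in draft_ppls[:max_turns]]

# Hypothetical per-turn draft PPLs against a calibrated threshold
print(run_turns([1.2, 3.9, 1.0, 4.2], tau=3.5))  # ['draft', 'eval', 'draft', 'eval']
```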
| Variable | Default | Description |
|---|---|---|
| `USE_EVAL_CHAT_TEMPLATE` | `1` | Whether the eval model applies the tokenizer's chat template when computing PPL. `1` = apply chat template (messages formatted with `<\|im_start\|>` / `<\|im_end\|>` etc.), `0` = raw text concatenation. This value is also appended to output paths as `_ct{0\|1}` for easy differentiation |
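For context, the perplexity of a scored span is the exponentiated negative mean token log-probability; a quick sketch (assuming natural-log logprobs, as most inference servers return):

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-mean(logprob)) over the scored tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

print(perplexity([-0.1, -0.2, -0.1]))  # low PPL: confident sequence
print(perplexity([-2.0, -3.0, -2.5]))  # high PPL: uncertain sequence
```

Applying the chat template changes the scored token sequence and hence the PPLs, which is why the setting is recorded in the output paths.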
| Variable | Default | Description |
|---|---|---|
| `SMALL_MODEL_CONCURRENCY` | `16` | Maximum number of concurrent requests to the small model |
| `EVAL_MODEL_CONCURRENCY` | `4` | Maximum number of concurrent requests to the eval model |
| `MAX_RETRIES` | `3` | Number of HTTP request retries on connection errors |
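These caps map naturally onto an `asyncio.Semaphore` with a retry loop; a self-contained sketch (the request function is simulated, not the repo's client code):

```python
import asyncio

seen = {}  # tracks which payloads have already failed once (simulation only)

async def fake_request(payload):
    # Stand-in for an HTTP call: payload 2 fails once, then succeeds
    if payload == 2 and not seen.get(2):
        seen[2] = True
        raise ConnectionError
    return payload * 10

async def send_with_limit(sem, payload, max_retries=3):
    """Bound in-flight requests with a semaphore; retry on connection errors."""
    async with sem:
        for attempt in range(max_retries):
            try:
                return await fake_request(payload)
            except ConnectionError:
                await asyncio.sleep(0.01 * (attempt + 1))  # simple backoff
        raise RuntimeError(f"failed after {max_retries} retries")

async def main():
    sem = asyncio.Semaphore(4)  # e.g. EVAL_MODEL_CONCURRENCY=4
    return await asyncio.gather(*(send_with_limit(sem, i) for i in range(8)))

results = asyncio.run(main())
print(results)  # [0, 10, 20, 30, 40, 50, 60, 70] despite one transient failure
```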
| Variable | Default | Description |
|---|---|---|
| `EXTRACT_MODE` | `"llm"` | Answer extraction mode. `"regex"`: extract via `\boxed{}` / `ANSWER: X` pattern matching. `"llm"`: use an external LLM (configured below) for more robust extraction |
| `OPENAI_API_KEY` | `""` | API key for the OpenAI-compatible endpoint used by `anyone_check` answer evaluation and LLM-based extraction |
| `OPENAI_BASE_URL` | `"https://..."` | Base URL for the OpenAI-compatible API |
| `OPENAI_MODEL` | `"gpt-5.2"` | Model name used for answer evaluation and extraction |
We provide two baselines in this repo for comparison with ATTS. You can reproduce them as follows.
SpecReason is a speculative-reasoning baseline (draft + target with vLLM). To test it:
- Environment (from the repo root):

  ```bash
  conda create -n specreason python=3.12 -y && conda activate specreason
  pip install vllm datasets
  ```

  For vLLM speculative decoding you may need to install from source; see specreason/README.md.

- Start two vLLM servers (e.g. in two terminals; 32B on port 30000, 1.5B on port 30001):

  ```bash
  VLLM_USE_V1=0 vllm serve Qwen/QwQ-32B --dtype auto -tp 2 --max_model_len 8192 --gpu-memory-utilization 0.8 --enable-prefix-caching --port 30000
  VLLM_USE_V1=0 vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --dtype auto -tp 2 --max_model_len 8192 --gpu-memory-utilization 0.1 --enable-prefix-caching --port 30001
  ```

- Run the SpecReason baseline (single problem; optionally change `--problem_id` / `--dataset_name`):

  ```bash
  cd specreason
  mkdir -p results && OUTPUT_DIR=./results python spec_reason.py --dataset_name aime --problem_id 60 --repeat_id 0 --score_threshold 7.0 --score_method greedy --token_budget 8192 --output_dir "$OUTPUT_DIR"
  ```

Results go to `specreason/results/`. For full datasets and batch scripts see specreason/README.md and `specreason/spec_reason_della_*.sh`.
Speculative thinking baseline (SkyThought-style evals with sglang/vLLM). To test it:
- Environment (from the repo root):

  ```bash
  cd speculative_thinking
  python -m venv .venv && source .venv/bin/activate
  pip install sglang vllm  # see speculative_thinking/skythought_evals/requirements.txt for full deps
  ```

- Test the normal (non-speculative) model (no draft model):

  ```bash
  python ./skythought_evals/eval.py --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
      --evals amc23 --n 1 --result-dir ./eval_out --tp 2 --output-file ./eval_out/32B.txt
  ```

- Test speculative thinking (draft + target). Pick a config from `speculative/config/` (e.g. `1b_14b.yml`) or add your own, then:

  ```bash
  python ./skythought_evals/eval.py --evals amc23 --n 1 --result-dir ./eval_out \
      --tp 3 --output-file ./eval_out/1b_14b.txt --spe_config ./speculative/config/1b_14b.yml
  ```

Results are written to the paths given by `--result-dir` and `--output-file`. More options and config format: speculative_thinking/README.md.
```bibtex
@article{xiong2025atts,
  title={ATTS: Asynchronous Test-Time Scaling via Conformal Prediction},
  author={Xiong, Jing and Chen, Qiujiang and Ye, Fanghua and Wan, Zhongwei and Zheng, Chuanyang and Zhao, Chenyang and Shen, Hui and Li, Alexander Hanbo and Tao, Chaofan and Tan, Haochen and others},
  journal={arXiv preprint arXiv:2509.15148},
  url={https://arxiv.org/abs/2509.15148},
  year={2025}
}
```

For questions or issues, please open a GitHub issue.