Paper • Quick Start • Methodology • Results • Citation
Align frozen LLMs at test time with a small doctor model, no fine-tuning required!
The patient-doctor paradigm: A small doctor model learns token-level flow-guided preferences and steers the frozen patient LLM during inference.
TL;DR: LLMdoctor uses a small "doctor" model to guide a large frozen "patient" LLM at inference time, achieving better alignment than DPO without any fine-tuning of the large model.
Figure: Comparison of test-time alignment approaches. LLMdoctor (c) uses genuine token-level rewards instead of trajectory-level (a) or sequence-mimicking (b) approaches.
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│     Stage 1     │────▶│     Stage 2     │────▶│     Stage 3     │
│  Token Reward   │     │  TFPO Training  │     │ Guided Decoding │
│   Extraction    │     │    (Doctor)     │     │   (Inference)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
| Stage | Description | Key Innovation |
|---|---|---|
| 1 | Extract token-level rewards from patient model's behavioral variations | Sparse reward signal via importance thresholding |
| 2 | Train doctor model with TFPO (SubTB + Value Discrimination Loss) | O(nΒ²) subtrajectory flow balance |
| 3 | Guide patient model at inference time using doctor's value estimates | Flow-guided reweighting of the decoding distribution `π(y\|x)` |
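At inference, Stage 3 combines the frozen patient's next-token log-probabilities with the doctor's token-value estimates. A minimal sketch of one plausible product-of-experts reweighting, with `alpha` weighting the patient and `beta` the doctor guidance strength (function name and exact combination rule are illustrative, not the repository's API):

```python
import numpy as np

def guided_next_token_logits(patient_logprobs, doctor_values, alpha=1.0, beta=0.8):
    """Combine frozen-patient log-probs with doctor value estimates.

    The guided distribution is proportional to
    exp(alpha * log p_patient + beta * V_doctor): a product of experts
    between the patient policy and the doctor's per-token values.
    """
    scores = alpha * patient_logprobs + beta * doctor_values
    # Renormalize via a stable log-softmax over the vocabulary.
    m = scores.max()
    return scores - (m + np.log(np.exp(scores - m).sum()))

# Toy vocabulary of 4 tokens: the doctor upweights token 2.
patient = np.log(np.array([0.4, 0.3, 0.2, 0.1]))
values = np.array([0.0, 0.0, 2.0, 0.0])
guided = guided_next_token_logits(patient, values)
```

With `beta=0`, this reduces exactly to the patient's own distribution, which is why guidance strength is exposed as a tunable knob.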
git clone https://github.com/yellowsubmarine7/LLMDoctor.git
cd LLMDoctor
pip install -r requirements.txt

from llmdoctor import LLMdoctor
# Load patient (frozen) and doctor (trained) models
doctor = LLMdoctor(
patient_model="Qwen/Qwen2.5-7B-Instruct",
doctor_checkpoint="./outputs/stage2_checkpoints/final",
alpha=1.0, # Patient weight
beta=0.8, # Doctor guidance strength
)
# Generate aligned response
response = doctor.generate("Explain quantum computing to a 5-year-old.")
print(response)

# Stage 1: Extract token-level rewards
python -m src.scripts.stage1_extract_rewards \
--patient_model Qwen/Qwen2.5-7B-Instruct \
--output_dir outputs/stage1_rewards
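Stage 1's "sparse reward signal via importance thresholding" can be pictured as follows: treat the per-token shift in the patient's log-probability between behavioral variants as an importance score and zero out everything below a threshold. A hedged sketch under those assumptions, not the repository's implementation (function and variable names are made up):

```python
import numpy as np

def sparse_token_rewards(logp_preferred, logp_base, threshold=0.5):
    """Illustrative importance thresholding.

    The per-token change in the patient's log-probability between two
    behavioral variants serves as an importance score; tokens whose
    absolute score falls below `threshold` receive zero reward,
    yielding a sparse token-level reward signal.
    """
    importance = logp_preferred - logp_base
    return np.where(np.abs(importance) >= threshold, importance, 0.0)

lp_pref = np.array([-0.1, -2.0, -0.3, -1.5])
lp_base = np.array([-0.2, -0.4, -0.3, -3.0])
rewards = sparse_token_rewards(lp_pref, lp_base)  # only tokens 1 and 3 survive
```

Sparsity matters here because most tokens are preference-neutral; thresholding keeps the doctor's training signal focused on the few tokens that actually drive the behavioral difference.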
# Stage 2: Train doctor with TFPO
python -m src.scripts.stage2_train_tfpo \
--doctor_model Qwen/Qwen2.5-1.5B-Instruct \
--rewards_dir outputs/stage1_rewards \
--output_dir outputs/stage2_checkpoints
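The command above trains the doctor with the TFPO objective, whose SubTB term enforces flow balance over all O(n²) subtrajectories of a token sequence. A toy sketch of such a balance loss (the exact balance equation, value-head parameterization, and the value-discrimination term are not reproduced here; all names are illustrative):

```python
import numpy as np

def subtb_loss(values, log_pf, rewards):
    """Toy subtrajectory-balance (SubTB) objective.

    For every span (i, j) with i < j, the value at position i plus the
    log-probabilities and token rewards accumulated inside the span
    should match the value at position j; the loss is the mean squared
    residual over all O(n^2) spans.
    """
    n = len(values) - 1
    loss, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n + 1):
            inner = np.sum(log_pf[i:j]) + np.sum(rewards[i:j])
            loss += (values[i] + inner - values[j]) ** 2
            count += 1
    return loss / count

log_pf = np.array([0.5, -0.2, 0.1])   # per-step log-probabilities
rewards = np.array([0.1, 0.3, -0.4])  # sparse token rewards from Stage 1
# Values satisfying the balance condition exactly give (near-)zero loss.
values = np.concatenate([[0.0], np.cumsum(log_pf + rewards)])
```

Because every span is constrained, the learned values stay consistent at every prefix length, which is what lets the doctor score partial generations during decoding.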
# Stage 3: Run guided inference
python -m src.scripts.stage3_inference \
--patient_model Qwen/Qwen2.5-7B-Instruct \
--doctor_checkpoint outputs/stage2_checkpoints/final

| Method | Type | AlpacaEval 2.0 ↑ | MT-Bench ↑ | Training Cost |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct | Base | 12.3% | 7.21 | - |
| + SFT | Train-time | 15.1% | 7.45 | Full 7B |
| + DPO | Train-time | 18.7% | 7.89 | Full 7B |
| + RLHF | Train-time | 19.2% | 7.94 | Full 7B + RM |
| + GenARM | Test-time | 17.8% | 7.65 | 1.5B RM |
| + LLMdoctor | Test-time | 21.4% | 8.12 | LoRA 1.5B |
Training Time (H200 GPU):
├── DPO: ~48 hours (fine-tune 7B)
├── RLHF: ~96 hours (7B + reward model)
└── LLMdoctor: ~6 hours (train 1.5B doctor only) ⚡ 8x faster
LLMDoctor/
├── src/
│   ├── models/
│   │   ├── patient.py          # Patient model wrapper
│   │   └── doctor.py           # Doctor model + Value Head
│   ├── training/
│   │   ├── losses.py           # SubTB + Value Discrimination Loss
│   │   └── tfpo_trainer.py     # TFPO trainer
│   ├── inference/
│   │   └── guided_decoding.py  # Flow-guided decoding
│   └── scripts/                # Training & inference scripts
├── outputs/
│   ├── stage1_rewards/         # Token-level rewards
│   ├── stage2_checkpoints/     # Doctor checkpoints
│   └── stage3_results/         # Inference results
└── assets/                     # Figures
If you find LLMdoctor useful, please cite our paper:
@inproceedings{shen2026llmdoctor,
title = {LLMdoctor: Token-Level Flow-Guided Preference Optimization
for Efficient Test-Time Alignment of Large Language Models},
author = {Shen, Tiesunlong and Mao, Rui and Wang, Jin and Sun, Heming
and Zhang, Jian and Zhang, Xuejie and Cambria, Erik},
booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
year = {2026},
url = {https://arxiv.org/abs/2601.10416}
}

- Hugging Face Transformers
- PEFT for LoRA training
- vLLM for efficient inference
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ by the LLMdoctor Team