
🩺 LLMdoctor

Token-Level Flow-Guided Preference Optimization for Efficient Test-Time Alignment

arXiv AAAI 2026 License Python PyTorch

πŸ“„ Paper β€’ πŸš€ Quick Start β€’ πŸ”¬ Methodology β€’ πŸ“Š Results β€’ πŸ“ Citation

Align frozen LLMs at test-time with a small doctor model β€” no fine-tuning required!


LLMdoctor Framework

The patient-doctor paradigm: A small doctor model learns token-level flow-guided preferences and steers the frozen patient LLM during inference.


🎯 Highlights

TL;DR: LLMdoctor uses a small "doctor" model to guide a large frozen "patient" LLM at inference time, achieving better alignment than DPO without any fine-tuning of the large model.

πŸ”₯ Key Features

  • 🚫 No Fine-tuning Required β€” Keep your large LLM frozen
  • 🎯 Token-Level Precision β€” Fine-grained reward signals
  • ⚑ Efficient Inference β€” Small doctor model (1.5B) guides large patient (7B+)
  • 🌈 Diversity Preserved β€” Flow-based training prevents mode collapse
  • πŸ“ˆ Outperforms DPO β€” Better alignment with less compute

πŸ“Š Performance at a Glance

| Method | AlpacaEval 2.0 | MT-Bench | Params Trained |
|---|---|---|---|
| Base (Qwen2.5-7B) | 12.3% | 7.21 | 0 |
| DPO | 18.7% | 7.89 | 7B |
| LLMdoctor | 21.4% | 8.12 | 1.5B |

πŸ”¬ Methodology

Comparison of Test-Time Alignment Approaches

Figure: Comparison of test-time alignment approaches. LLMdoctor (c) uses genuine token-level rewards instead of trajectory-level (a) or sequence-mimicking (b) approaches.

Three-Stage Pipeline

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   πŸ” Stage 1    │───▢│   πŸŽ“ Stage 2    │───▢│   🩺 Stage 3    β”‚
β”‚ Token Reward    β”‚    β”‚  TFPO Training  β”‚    β”‚ Guided Decoding β”‚
β”‚  Extraction     β”‚    β”‚   (Doctor)      β”‚    β”‚  (Inference)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
| Stage | Description | Key Innovation |
|---|---|---|
| 1 | Extract token-level rewards from the patient model's behavioral variations | Sparse reward signal via importance thresholding |
| 2 | Train the doctor model with TFPO (SubTB + Value Discrimination Loss) | O(nΒ²) subtrajectory flow balance |
| 3 | Guide the patient model at inference time using the doctor's value estimates | Per-token guided distribution `π(y_t) ∝ π_patient(y_t)^α · exp(β · v_doctor(y_t))` |
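The Stage 1 idea above can be sketched in a few lines. This is an illustrative toy, not the paper's exact procedure: it assumes per-token log-probabilities of a preferred and a dispreferred response are available, and `importance_threshold` is a made-up parameter name standing in for the thresholding rule.

```python
# Hypothetical sketch of Stage 1: turn per-token log-prob differences
# between a preferred and a dispreferred response into sparse rewards.
def extract_token_rewards(logp_preferred, logp_dispreferred,
                          importance_threshold=0.5):
    """Keep only tokens whose behavioral variation exceeds the threshold."""
    rewards = []
    for lp_w, lp_l in zip(logp_preferred, logp_dispreferred):
        delta = lp_w - lp_l          # behavioral variation at this token
        if abs(delta) >= importance_threshold:
            rewards.append(delta)    # important token: keep signed reward
        else:
            rewards.append(0.0)      # unimportant token: sparse zero reward
    return rewards
```

Tokens with negligible variation receive exactly zero reward, which is what makes the signal sparse.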

πŸš€ Quick Start

Installation

git clone https://github.com/Yellow4Submarine7/LLMDoctor.git
cd LLMDoctor
pip install -r requirements.txt

Inference Example

from llmdoctor import LLMdoctor

# Load patient (frozen) and doctor (trained) models
doctor = LLMdoctor(
    patient_model="Qwen/Qwen2.5-7B-Instruct",
    doctor_checkpoint="./outputs/stage2_checkpoints/final",
    alpha=1.0,  # Patient weight
    beta=0.8,   # Doctor guidance strength
)

# Generate aligned response
response = doctor.generate("Explain quantum computing to a 5-year-old.")
print(response)
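Under the hood, one guided-decoding step combines the frozen patient's next-token logits with the doctor's token-value estimates using the same `alpha`/`beta` weights as the constructor above. A minimal sketch, assuming the doctor produces a per-candidate-token value score (the function name is illustrative, not the repo's API):

```python
import math

# Illustrative sketch of one flow-guided decoding step:
# pi(y_t) ∝ softmax(alpha * patient_logits + beta * doctor_values)
def guided_next_token_probs(patient_logits, doctor_values,
                            alpha=1.0, beta=0.8):
    combined = [alpha * l + beta * v
                for l, v in zip(patient_logits, doctor_values)]
    m = max(combined)                           # shift for numerical stability
    exps = [math.exp(c - m) for c in combined]
    z = sum(exps)
    return [e / z for e in exps]                # normalized distribution
```

Tokens the doctor values more highly are upweighted while the patient's own logits (and weights) stay untouched, which is why no fine-tuning of the large model is needed.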

Training Pipeline

# Stage 1: Extract token-level rewards
python -m src.scripts.stage1_extract_rewards \
    --patient_model Qwen/Qwen2.5-7B-Instruct \
    --output_dir outputs/stage1_rewards

# Stage 2: Train doctor with TFPO  
python -m src.scripts.stage2_train_tfpo \
    --doctor_model Qwen/Qwen2.5-1.5B-Instruct \
    --rewards_dir outputs/stage1_rewards \
    --output_dir outputs/stage2_checkpoints

# Stage 3: Run guided inference
python -m src.scripts.stage3_inference \
    --patient_model Qwen/Qwen2.5-7B-Instruct \
    --doctor_checkpoint outputs/stage2_checkpoints/final
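The Stage 2 objective's subtrajectory balance (SubTB) term enforces, for every subtrajectory (i, j) of a generated sequence, that the log-flow at state i plus the log-probabilities of the actions taken equals the log-flow at state j. A hedged sketch of the O(nΒ²) computation, with an illustrative parameterization (the actual loss lives in `src/training/losses.py`):

```python
# Sketch of the SubTB idea: penalize squared balance violations over
# all n(n+1)/2 subtrajectories of a length-n generation.
def subtb_loss(log_flows, log_probs):
    """log_flows: F(s_0..s_n); log_probs: P_F(a_i | s_i) for i = 0..n-1."""
    n = len(log_probs)
    total, count = 0.0, 0
    for i in range(n):
        path = 0.0
        for j in range(i + 1, n + 1):
            path += log_probs[j - 1]              # actions taken from i to j
            residual = log_flows[i] + path - log_flows[j]
            total += residual ** 2                # squared balance violation
            count += 1
    return total / count                          # mean over all pairs
```

When the flows are perfectly consistent with the action log-probabilities, every residual vanishes and the loss is zero.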

πŸ“Š Results

Main Comparison

| Method | Type | AlpacaEval 2.0 ↑ | MT-Bench ↑ | Training Cost |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct | Base | 12.3% | 7.21 | - |
| + SFT | Train-time | 15.1% | 7.45 | Full 7B |
| + DPO | Train-time | 18.7% | 7.89 | Full 7B |
| + RLHF | Train-time | 19.2% | 7.94 | Full 7B + RM |
| + GenARM | Test-time | 17.8% | 7.65 | 1.5B RM |
| + LLMdoctor | Test-time | 21.4% | 8.12 | LoRA 1.5B |

Efficiency

Training Time (H200 GPU):
β”œβ”€β”€ DPO:       ~48 hours (fine-tune 7B)
β”œβ”€β”€ RLHF:      ~96 hours (7B + reward model)  
└── LLMdoctor: ~6 hours  (train 1.5B doctor only) ⚑ 8x faster

πŸ“ Project Structure

LLMDoctor/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ models/
β”‚   β”‚   β”œβ”€β”€ patient.py          # Patient model wrapper
β”‚   β”‚   └── doctor.py           # Doctor model + Value Head
β”‚   β”œβ”€β”€ training/
β”‚   β”‚   β”œβ”€β”€ losses.py           # SubTB + Value Discrimination Loss
β”‚   β”‚   └── tfpo_trainer.py     # TFPO trainer
β”‚   β”œβ”€β”€ inference/
β”‚   β”‚   └── guided_decoding.py  # Flow-guided decoding
β”‚   └── scripts/                # Training & inference scripts
β”œβ”€β”€ outputs/
β”‚   β”œβ”€β”€ stage1_rewards/         # Token-level rewards
β”‚   β”œβ”€β”€ stage2_checkpoints/     # Doctor checkpoints
β”‚   └── stage3_results/         # Inference results
└── assets/                     # Figures

πŸ“ Citation

If you find LLMdoctor useful, please cite our paper:

@inproceedings{shen2026llmdoctor,
  title     = {LLMdoctor: Token-Level Flow-Guided Preference Optimization 
               for Efficient Test-Time Alignment of Large Language Models},
  author    = {Shen, Tiesunlong and Mao, Rui and Wang, Jin and Sun, Heming 
               and Zhang, Jian and Zhang, Xuejie and Cambria, Erik},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year      = {2026},
  url       = {https://arxiv.org/abs/2601.10416}
}

πŸ™ Acknowledgements


πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.


Made with ❀️ by the LLMdoctor Team

⭐ Star us on GitHub!
