
🩺 LLMdoctor

Token-Level Flow-Guided Preference Optimization for Efficient Test-Time Alignment

arXiv AAAI 2026 License Python PyTorch

πŸ“„ Paper β€’ πŸš€ Quick Start β€’ πŸ”¬ Methodology β€’ πŸ“Š Results β€’ πŸ“ Citation

Align frozen LLMs at test-time with a small doctor model β€” no fine-tuning required!


LLMdoctor Framework

The patient-doctor paradigm: A small doctor model learns token-level flow-guided preferences and steers the frozen patient LLM during inference.


🎯 Highlights

TL;DR: LLMdoctor uses a small "doctor" model to guide a large frozen "patient" LLM at inference time, achieving better alignment than DPO without any fine-tuning of the large model.

πŸ”₯ Key Features

  • 🚫 No Fine-tuning Required β€” Keep your large LLM frozen
  • 🎯 Token-Level Precision β€” Fine-grained reward signals
  • ⚑ Efficient Inference β€” Small doctor model (1.5B) guides large patient (7B+)
  • 🌈 Diversity Preserved β€” Flow-based training prevents mode collapse
  • πŸ“ˆ Outperforms DPO β€” Better alignment with less compute

πŸ“Š Performance at a Glance

| Method | AlpacaEval 2.0 | MT-Bench | Params Trained |
|---|---|---|---|
| Base (Qwen2.5-7B) | 12.3% | 7.21 | 0 |
| DPO | 18.7% | 7.89 | 7B |
| LLMdoctor | 21.4% | 8.12 | 1.5B |

πŸ”¬ Methodology

Comparison of Test-Time Alignment Approaches

Figure: Comparison of test-time alignment approaches. LLMdoctor (c) uses genuine token-level rewards instead of trajectory-level (a) or sequence-mimicking (b) approaches.

Three-Stage Pipeline

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   πŸ” Stage 1    │───▢│   πŸŽ“ Stage 2    │───▢│   🩺 Stage 3    β”‚
β”‚ Token Reward    β”‚    β”‚  TFPO Training  β”‚    β”‚ Guided Decoding β”‚
β”‚  Extraction     β”‚    β”‚   (Doctor)      β”‚    β”‚  (Inference)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
| Stage | Description | Key Innovation |
|---|---|---|
| 1 | Extract token-level rewards from the patient model's behavioral variations | Sparse reward signal via importance thresholding |
| 2 | Train the doctor model with TFPO (SubTB + Value Discrimination Loss) | O(nΒ²) subtrajectory flow balance |
| 3 | Guide the patient model at inference time using the doctor's value estimates | Per-token guided distribution `π(y_t) ∝ π_patient(y_t)^α · exp(β · v_doctor(y_t))` |
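The Stage 1 idea above can be sketched in a few lines. This is an illustrative toy, not the paper's exact procedure: it assumes per-token log-probabilities of a preferred and a dispreferred response are available, and `importance_threshold` is a made-up parameter name standing in for the thresholding rule.

```python
# Hypothetical sketch of Stage 1: turn per-token log-prob differences
# between a preferred and a dispreferred response into sparse rewards.
def extract_token_rewards(logp_preferred, logp_dispreferred,
                          importance_threshold=0.5):
    """Keep only tokens whose behavioral variation exceeds the threshold."""
    rewards = []
    for lp_w, lp_l in zip(logp_preferred, logp_dispreferred):
        delta = lp_w - lp_l          # behavioral variation at this token
        if abs(delta) >= importance_threshold:
            rewards.append(delta)    # important token: keep signed reward
        else:
            rewards.append(0.0)      # unimportant token: sparse zero reward
    return rewards
```

Tokens with negligible variation receive exactly zero reward, which is what makes the signal sparse.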

πŸš€ Quick Start

Installation

git clone https://github.com/Yellow4Submarine7/LLMDoctor.git
cd LLMDoctor
pip install -r requirements.txt

Inference Example

from llmdoctor import LLMdoctor

# Load patient (frozen) and doctor (trained) models
doctor = LLMdoctor(
    patient_model="Qwen/Qwen2.5-7B-Instruct",
    doctor_checkpoint="./outputs/stage2_checkpoints/final",
    alpha=1.0,  # Patient weight
    beta=0.8,   # Doctor guidance strength
)

# Generate aligned response
response = doctor.generate("Explain quantum computing to a 5-year-old.")
print(response)
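Under the hood, one guided-decoding step combines the frozen patient's next-token logits with the doctor's token-value estimates using the same `alpha`/`beta` weights as the constructor above. A minimal sketch, assuming the doctor produces a per-candidate-token value score (the function name is illustrative, not the repo's API):

```python
import math

# Illustrative sketch of one flow-guided decoding step:
# pi(y_t) ∝ softmax(alpha * patient_logits + beta * doctor_values)
def guided_next_token_probs(patient_logits, doctor_values,
                            alpha=1.0, beta=0.8):
    combined = [alpha * l + beta * v
                for l, v in zip(patient_logits, doctor_values)]
    m = max(combined)                           # shift for numerical stability
    exps = [math.exp(c - m) for c in combined]
    z = sum(exps)
    return [e / z for e in exps]                # normalized distribution
```

Tokens the doctor values more highly are upweighted while the patient's own logits (and weights) stay untouched, which is why no fine-tuning of the large model is needed.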

Training Pipeline

# Stage 1: Extract token-level rewards
python -m src.scripts.stage1_extract_rewards \
    --patient_model Qwen/Qwen2.5-7B-Instruct \
    --output_dir outputs/stage1_rewards

# Stage 2: Train doctor with TFPO  
python -m src.scripts.stage2_train_tfpo \
    --doctor_model Qwen/Qwen2.5-1.5B-Instruct \
    --rewards_dir outputs/stage1_rewards \
    --output_dir outputs/stage2_checkpoints

# Stage 3: Run guided inference
python -m src.scripts.stage3_inference \
    --patient_model Qwen/Qwen2.5-7B-Instruct \
    --doctor_checkpoint outputs/stage2_checkpoints/final
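The Stage 2 objective's subtrajectory balance (SubTB) term enforces, for every subtrajectory (i, j) of a generated sequence, that the log-flow at state i plus the log-probabilities of the actions taken equals the log-flow at state j. A hedged sketch of the O(nΒ²) computation, with an illustrative parameterization (the actual loss lives in `src/training/losses.py`):

```python
# Sketch of the SubTB idea: penalize squared balance violations over
# all n(n+1)/2 subtrajectories of a length-n generation.
def subtb_loss(log_flows, log_probs):
    """log_flows: F(s_0..s_n); log_probs: P_F(a_i | s_i) for i = 0..n-1."""
    n = len(log_probs)
    total, count = 0.0, 0
    for i in range(n):
        path = 0.0
        for j in range(i + 1, n + 1):
            path += log_probs[j - 1]              # actions taken from i to j
            residual = log_flows[i] + path - log_flows[j]
            total += residual ** 2                # squared balance violation
            count += 1
    return total / count                          # mean over all pairs
```

When the flows are perfectly consistent with the action log-probabilities, every residual vanishes and the loss is zero.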

πŸ“Š Results

Main Comparison

| Method | Type | AlpacaEval 2.0 ↑ | MT-Bench ↑ | Training Cost |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct | Base | 12.3% | 7.21 | - |
| + SFT | Train-time | 15.1% | 7.45 | Full 7B |
| + DPO | Train-time | 18.7% | 7.89 | Full 7B |
| + RLHF | Train-time | 19.2% | 7.94 | Full 7B + RM |
| + GenARM | Test-time | 17.8% | 7.65 | 1.5B RM |
| + LLMdoctor | Test-time | 21.4% | 8.12 | LoRA 1.5B |

Efficiency

Training Time (H200 GPU):
β”œβ”€β”€ DPO:       ~48 hours (fine-tune 7B)
β”œβ”€β”€ RLHF:      ~96 hours (7B + reward model)  
└── LLMdoctor: ~6 hours  (train 1.5B doctor only) ⚑ 8x faster

πŸ“ Project Structure

LLMDoctor/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ models/
β”‚   β”‚   β”œβ”€β”€ patient.py          # Patient model wrapper
β”‚   β”‚   └── doctor.py           # Doctor model + Value Head
β”‚   β”œβ”€β”€ training/
β”‚   β”‚   β”œβ”€β”€ losses.py           # SubTB + Value Discrimination Loss
β”‚   β”‚   └── tfpo_trainer.py     # TFPO trainer
β”‚   β”œβ”€β”€ inference/
β”‚   β”‚   └── guided_decoding.py  # Flow-guided decoding
β”‚   └── scripts/                # Training & inference scripts
β”œβ”€β”€ outputs/
β”‚   β”œβ”€β”€ stage1_rewards/         # Token-level rewards
β”‚   β”œβ”€β”€ stage2_checkpoints/     # Doctor checkpoints
β”‚   └── stage3_results/         # Inference results
└── assets/                     # Figures

πŸ“ Citation

If you find LLMdoctor useful, please cite our paper:

@inproceedings{shen2026llmdoctor,
  title     = {LLMdoctor: Token-Level Flow-Guided Preference Optimization 
               for Efficient Test-Time Alignment of Large Language Models},
  author    = {Shen, Tiesunlong and Mao, Rui and Wang, Jin and Sun, Heming 
               and Zhang, Jian and Zhang, Xuejie and Cambria, Erik},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year      = {2026},
  url       = {https://arxiv.org/abs/2601.10416}
}

πŸ™ Acknowledgements


πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.


Made with ❀️ by the LLMdoctor Team

⭐ Star us on GitHub!
