This example shows how to run on-policy distillation (OPD) using slime. A small student (Qwen3-8B) is aligned to imitate a larger teacher (Qwen3-32B) by training only on the student's own rollouts and matching the teacher's token-level log-probabilities.
- OPD is orthogonal to advantage estimators: it acts as an additive KL penalty on top of any advantage estimator (GRPO, PPO, REINFORCE++, etc.), not as a separate estimator.
- Two teacher modes:
  - `sglang`: the teacher runs on an external SGLang server; teacher log-probs are obtained during rollout.
  - `megatron`: the teacher is loaded directly into Megatron via `--opd-teacher-load`; teacher log-probs are computed during the training forward pass.
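The additive-penalty formulation above can be sketched in a few lines of plain Python. The function and argument names here are illustrative assumptions, not slime's actual API:

```python
def apply_opd_penalty(advantages, student_logprobs, teacher_logprobs, kl_coef=1.0):
    """Add the OPD KL penalty on top of advantages from any estimator.

    Per token, the reverse KL is approximated by the log-ratio
    (student_logprob - teacher_logprob); subtracting it pushes the
    student toward tokens the teacher also assigns high probability.
    Illustrative sketch only, not slime's implementation.
    """
    return [
        a - kl_coef * (s - t)
        for a, s, t in zip(advantages, student_logprobs, teacher_logprobs)
    ]

# Where student and teacher already agree, advantages are unchanged:
adv = [0.5, -0.2, 1.0]
logp = [-1.0, -2.0, -0.5]
print(apply_opd_penalty(adv, logp, logp))  # [0.5, -0.2, 1.0]
```

Because the penalty is simply added to whatever advantages the chosen estimator produced, it composes with GRPO, PPO, or REINFORCE++ without changing their update rules.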
| Argument | Description |
|---|---|
| `--use-opd` | Enable on-policy distillation. Required flag to use OPD. |
| `--opd-type` | Type of OPD: `sglang` or `megatron`. Required when `--use-opd` is set. |
| `--opd-kl-coef` | OPD KL penalty coefficient (default: 1.0). |
| `--opd-teacher-load` | Path to the teacher checkpoint. Required when `--opd-type=megatron`; must not be set when `--opd-type=sglang`. |
| `--opd-teacher-ckpt-step` | Optional checkpoint step for the teacher model. |
| Mode | Teacher Location | When to use |
|---|---|---|
| `sglang` | External SGLang server | Teacher has a different architecture or is larger than GPU memory |
| `megatron` | Loaded into Megatron training | Teacher has the same architecture as the policy/ref model |
`slime/rollout/on_policy_distillation.py` implements (for SGLang mode):

- `reward_func` calls the teacher server (via `args.rm_url`) with every sample to obtain token-level log-probs.
- `post_process_rewards` trims the teacher log-probs to the generated response span and writes the tensors back to each `Sample` to compute advantages.
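The trimming step can be sketched as follows. This is a hypothetical helper mirroring what `post_process_rewards` does; `prompt_len` and `response_len` are assumptions about how the span is tracked, not slime's actual signature:

```python
def trim_teacher_logprobs(teacher_logprobs, prompt_len, response_len):
    """Keep only the teacher log-probs covering the generated response.

    The teacher scores the full prompt+response token sequence, but only
    the response-span log-probs feed into the OPD advantage computation.
    Hypothetical sketch, not slime's code.
    """
    return teacher_logprobs[prompt_len : prompt_len + response_len]

full = [float(i) for i in range(10)]  # stand-in for per-token log-probs
print(trim_teacher_logprobs(full, prompt_len=6, response_len=4))  # [6.0, 7.0, 8.0, 9.0]
```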
- `run-qwen3-8B-opd.sh` launches an SGLang teacher server, then submits a Ray job that runs `train.py`.
- `run-qwen3-8B-opd-megatron.sh` uses a Megatron-loaded teacher model (no external server needed).
- Download or prepare the required checkpoints and data.
```bash
hf download Qwen/Qwen3-32B --local-dir /root/Qwen3-32B
hf download Qwen/Qwen3-8B --local-dir /root/Qwen3-8B
hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k
```

- Run the HF-to-mcore conversion for the student model:
```bash
cd /root/slime
source scripts/models/qwen3-8B.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /root/Qwen3-8B \
    --save /root/Qwen3-8B_torch_dist
```

- Run on-policy distillation:
```bash
bash examples/on_policy_distillation/run-qwen3-8B-opd.sh
```

- Prepare the student checkpoint (same as above).
- IMPORTANT: Convert your teacher model to Megatron format (change the path to your actual teacher):

```bash
# This example uses the same model as both student and teacher (for demonstration only)
# In practice, use a different (stronger) model as the teacher!
cd /root/slime
source scripts/models/qwen3-8B.sh  # or your teacher model config
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /root/YourTeacherModel \
    --save /root/YourTeacherModel_torch_dist
```
- Edit `run-qwen3-8B-opd-megatron.sh` to update paths:
  - Change `--opd-teacher-load` to your teacher model path
  - Adjust `--opd-kl-coef` based on your task
- Run:
```bash
bash examples/on_policy_distillation/run-qwen3-8B-opd-megatron.sh
```

Using a Qwen3-8B-Base model fine-tuned (SFT) on part of the OpenThoughts3-1.2M dataset, we performed on-policy distillation with a Qwen3-32B teacher on the remaining data. Evaluation on MATH-500 shows:
| Model | Pass@1 |
|---|---|
| Qwen3-8B-Base + SFT | 76% |
| Qwen3-8B-Base + SFT + On-Policy Distillation | 94% |
- Why are there two OPD modes?
  - `sglang` mode: the teacher runs on an independent SGLang server. This is useful when the teacher has a different architecture or is too large to load together with the policy model.
  - `megatron` mode: the teacher is loaded into Megatron using the same parameter-loading mechanism as the reference model. This requires the teacher to have the same architecture as the policy model.
- How do I use a Megatron-based teacher instead of an SGLang server? Replace your OPD arguments:

```bash
# Instead of:
--use-opd --opd-type sglang --opd-kl-coef 1.0
# Use:
--use-opd --opd-type megatron --opd-kl-coef 1.0 --opd-teacher-load /path/to/teacher_checkpoint
```
- What happens if I set wrong arguments? The system raises clear errors:
  - `--use-opd` without `--opd-type`: error asking you to specify the type
  - `--opd-type megatron` without `--opd-teacher-load`: error asking for a teacher checkpoint
  - `--opd-type sglang` with `--opd-teacher-load`: error indicating a conflict
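These validation rules can be sketched with `argparse`. This is an illustrative reconstruction of the checks described above, not slime's actual code; only the flag spellings come from the argument table:

```python
import argparse

def validate_opd_args(args):
    """Sketch of the OPD argument-validation rules (illustrative, not slime's code)."""
    if args.use_opd:
        if args.opd_type is None:
            raise ValueError("--use-opd requires --opd-type (sglang or megatron)")
        if args.opd_type == "megatron" and args.opd_teacher_load is None:
            raise ValueError("--opd-type megatron requires --opd-teacher-load")
        if args.opd_type == "sglang" and args.opd_teacher_load is not None:
            raise ValueError("--opd-teacher-load conflicts with --opd-type sglang")

parser = argparse.ArgumentParser()
parser.add_argument("--use-opd", action="store_true")
parser.add_argument("--opd-type", choices=["sglang", "megatron"])
parser.add_argument("--opd-teacher-load")

# Forgetting the teacher checkpoint in megatron mode trips the second rule:
args = parser.parse_args(["--use-opd", "--opd-type", "megatron"])
try:
    validate_opd_args(args)
except ValueError as e:
    print(e)  # --opd-type megatron requires --opd-teacher-load
```

Failing fast at argument-parsing time is cheaper than discovering a missing teacher checkpoint after rollout workers have already started.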