Reproduction
```python
from datasets import load_dataset
from trl.experimental.distillation import DistillationConfig, DistillationTrainer

# 1. Load dataset and format as prompt-only chat messages
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(
    lambda x: {"messages": [{"role": "user", "content": x["question"]}]},
    remove_columns=dataset.column_names,
)

# 2. Configure distillation
config = DistillationConfig(
    output_dir="results/distill-qwen-gsm8k",
    num_train_epochs=1,
    bf16=True,
    save_strategy="no",
    # Distillation
    lmbda=1.0,  # fully on-policy (student generates)
    beta=1.0,   # reverse KL
    # Teacher
    teacher_model_init_kwargs={"torch_dtype": "bfloat16"},
)

# 3. Train
trainer = DistillationTrainer(
    model="Qwen/Qwen3.5-0.8B",
    teacher_model="Qwen/Qwen3.5-2B",
    args=config,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model()
```
outputs:

```
...
  File "/home/user/venv/.distil/lib/python3.12/site-packages/trl/experimental/distillation/distillation_trainer.py", line 542, in __init__
    teacher_model.resize_token_embeddings(self.model.config.vocab_size)
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/venv/.distil/lib/python3.12/site-packages/transformers/configuration_utils.py", line 422, in __getattribute__
    return super().__getattribute__(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Qwen3_5Config' object has no attribute 'vocab_size'
```
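For context, a minimal sketch of the failure mode, assuming the model's top-level config nests its text settings (including `vocab_size`) in a sub-config rather than exposing it directly. The class names and the vocab size below are illustrative stand-ins, not the actual transformers implementation:

```python
# Hypothetical stand-ins for a multimodal config that keeps its text
# settings in a nested sub-config; names and values are illustrative.
class TextSubConfig:
    vocab_size = 151_936  # hypothetical value


class MultimodalConfig:
    """Top-level config with no vocab_size attribute of its own."""
    text_config = TextSubConfig()

    def get_text_config(self):
        # Return the nested text config when present, else the config itself.
        return getattr(self, "text_config", self)


config = MultimodalConfig()

# What the trainer does today: direct attribute access, which fails
# when vocab_size lives only on the nested text config.
try:
    config.vocab_size
except AttributeError as e:
    print(f"AttributeError: {e}")

# A possible fix: resolve vocab_size via the text sub-config.
vocab_size = config.get_text_config().vocab_size
print(vocab_size)  # 151936
```

If this is the cause, resolving the vocab size through the text sub-config in `DistillationTrainer.__init__` (instead of `self.model.config.vocab_size`) would likely avoid the crash for configs shaped like this.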
System Info
- Platform: Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.39
- Python version: 3.12.3
- TRL version: 1.1.0
- PyTorch version: 2.10.0
- accelerator(s): NVIDIA GeForce RTX 5090
- Transformers version: 5.5.4
- Accelerate version: 1.13.0
- Accelerate config: not found
- Datasets version: 4.8.4
- HF Hub version: 1.11.0
- bitsandbytes version: not installed
- DeepSpeed version: not installed
- Liger-Kernel version: 0.7.0
- LLM-Blender version: not installed
- OpenAI version: 2.24.0
- PEFT version: 0.19.1
- vLLM version: 0.19.0
Checklist