ToDi: Token-wise Distillation via Fine-Grained Divergence Control (EMNLP 2025 Oral/Outstanding Paper Award Nominee)
Official PyTorch implementation of ToDi, as presented in our paper:
ToDi: Token-wise Distillation via Fine-Grained Divergence Control
Seongryong Jung, Suwan Yoon, DongGeon Kim, and Hwanhee Lee
Parts of our code are based on DSKD, MiniLLM, and DistiLLM.
- deepspeed >= 0.14.0
- torch >= 2.0.1
- transformers >= 4.40.2
- peft >= 0.8.2
- rouge_score >= 0.1.2
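A quick way to confirm that an environment meets these minimum versions is sketched below. It uses only the Python standard library. The helper names (`parse_version`, `check_requirements`) are illustrative and not part of this repo, and the simple numeric version parsing is an assumption that ignores pre-release tags.

```python
import re
from importlib import metadata

# Minimum versions copied from the requirements list above.
REQUIREMENTS = {
    "deepspeed": "0.14.0",
    "torch": "2.0.1",
    "transformers": "4.40.2",
    "peft": "0.8.2",
    "rouge_score": "0.1.2",
}


def parse_version(version):
    """Turn '2.0.1+cu118' into (2, 0, 1); non-numeric suffixes are ignored."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_requirements(requirements=REQUIREMENTS):
    """Return {package: 'ok' | 'too old (found X)' | 'missing'}."""
    report = {}
    for name, minimum in requirements.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = parse_version(found) >= parse_version(minimum)
        report[name] = "ok" if ok else f"too old (found {found})"
    return report


if __name__ == "__main__":
    for package, status in check_requirements().items():
        print(f"{package}: {status}")
```

Running the file prints one status line per package, so a missing or outdated dependency shows up before a training script is launched.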
You can download the corresponding model files (e.g., pytorch_model.bin or model.safetensors) of LLMs used in this paper into model_hub/*/*/.
Links to these models on Hugging Face:
- GPT2-120M: Here
- GPT2-1.5B (trained on Dolly by Gu et al.): Here
- TinyLLaMA-1.1B: Here
- Llama2-7B: Here
- OLMo2: Here
- Qwen2.5: Here
- Gemma3: Here
For LLaMA2-7B (LoRA), run:
bash scripts/tinyllama/sft_teacher_llama2.sh
For GPT2-base (full fine-tuning), run:
bash scripts/gpt2/sft_gpt2_base.sh
For TinyLLaMA-1.1B (LoRA), run:
bash scripts/tinyllama/sft_tinyllama.sh
For GPT2-base, run:
bash scripts/gpt2/vanilla_kd_gpt2_base.sh
For TinyLLaMA-1.1B, run:
bash scripts/tinyllama/vanilla_kd_tinyllama.sh
You can change the distance function (e.g., KL Divergence, Reverse KL Divergence, JS Divergence, etc.) by setting KD_OBJ in the above scripts.
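To make the distance functions concrete, here is a minimal pure-Python sketch of forward KL, reverse KL, and JS divergence for a single next-token distribution, plus a token-wise mix in the spirit of ToDi as the paper's abstract describes it (a sigmoid of the teacher-student log-ratio weighting FKL against RKL per token). This is not the repo's training code: the function names and the `beta` scale are assumptions for illustration, and the exact loss in the scripts may differ. All probabilities are assumed to be strictly positive.

```python
import math


def forward_kl(p, q):
    """KL(p || q) with p the teacher and q the student distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def reverse_kl(p, q):
    """KL(q || p): the same divergence with the roles swapped."""
    return forward_kl(q, p)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * forward_kl(p, m) + 0.5 * forward_kl(q, m)


def mix_weight(pi, qi, beta=1.0):
    """Sigmoid of the scaled log-ratio; above 0.5 when the student underestimates."""
    return 1.0 / (1.0 + math.exp(-beta * math.log(pi / qi)))


def tokenwise_mix(p, q, beta=1.0):
    """Per-token blend: weight alpha on the FKL term, 1 - alpha on the RKL term."""
    total = 0.0
    for pi, qi in zip(p, q):
        log_ratio = math.log(pi / qi)
        alpha = mix_weight(pi, qi, beta)
        total += alpha * pi * log_ratio + (1.0 - alpha) * qi * (-log_ratio)
    return total


teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
print(forward_kl(teacher, student), reverse_kl(teacher, student))
print(js_divergence(teacher, student), tokenwise_mix(teacher, student))
```

For a token the student underestimates (teacher probability above student probability) the weight exceeds 0.5 and the FKL term dominates; for an overestimated token the RKL term dominates, matching the complementary roles described in the abstract.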
The output directory will be created under ./outputs automatically after you run the training scripts.
For full fine-tuning, the file structure of the output directory is as follows (take gpt2 SFT as an example):
./outputs/gpt2/gpt2-base/sft/criterion=cross_entropy__default-bf16__.../
│
├── epochA_step... (model files of epoch A, you can directly load it by AutoModelForCausalLM.from_pretrained(this path))/
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── tokenizer.json
│   └── ...
│
├── epochB_step... (only exists when SAVE_BEST_N_CKPTS >= 2, similar to epochA_.../)/
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── tokenizer.json
│   └── ...
│
├── ...
│
├── args.json (The arguments of training)
│
└── train.log (Training log)
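The layout above can be navigated programmatically. The sketch below lists the checkpoint sub-directories of a run and loads one with `AutoModelForCausalLM.from_pretrained`. The helper names are hypothetical, and the assumption that every checkpoint directory name starts with `epoch` is inferred from the tree rather than from the training code. `transformers` is imported lazily so the listing helper works without it.

```python
from pathlib import Path


def list_checkpoints(run_dir):
    """Return sub-directories whose names start with 'epoch', sorted by name."""
    run_dir = Path(run_dir)
    return sorted(
        p for p in run_dir.iterdir() if p.is_dir() and p.name.startswith("epoch")
    )


def load_full_checkpoint(ckpt_dir):
    """Load a fully fine-tuned checkpoint and its tokenizer from one directory."""
    # Imported here so list_checkpoints() stays usable without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(str(ckpt_dir))
    tokenizer = AutoTokenizer.from_pretrained(str(ckpt_dir))
    return model, tokenizer
```

Note that the sort is lexicographic, so a name containing `epoch10` sorts before one containing `epoch2`; pick the checkpoint explicitly when the order matters.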
For LoRA fine-tuning, the file structure of the output directory is as follows (take TinyLLaMA LoRA SFT as an example):
./outputs/tinyllama/tinyllama-1.1b-3T/sft/criterion=cross_entropy__lora-rank=256-alpha=8.../
│
├── epochA_step... (model files of epoch A, you can directly load it by AutoModelForCausalLM.from_pretrained(this path))/
│   ├── adapter_config.json
│   ├── adapter_model.bin
│   ├── tokenizer.json
│   └── ...
│
├── epochB_step... (only exists when SAVE_BEST_N_CKPTS >= 2, similar to epochA_.../)/
│   ├── adapter_config.json
│   ├── adapter_model.bin
│   ├── tokenizer.json
│   └── ...
│
├── ...
│
├── args.json (The arguments of training)
│
└── train.log (Training log)
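A LoRA checkpoint directory holds only the adapter weights, so it can be told apart from a full checkpoint by the presence of `adapter_config.json`. The sketch below shows that check and one explicit way to attach an adapter to its base model with `peft`. The helper names and the `base_model_path` argument are hypothetical, and the base model must be the one the adapter was trained on.

```python
from pathlib import Path


def checkpoint_kind(ckpt_dir):
    """Classify a checkpoint directory as 'lora', 'full', or 'unknown'."""
    ckpt_dir = Path(ckpt_dir)
    if (ckpt_dir / "adapter_config.json").is_file():
        return "lora"
    if (ckpt_dir / "config.json").is_file():
        return "full"
    return "unknown"


def load_lora_checkpoint(base_model_path, adapter_dir):
    """Load the base model, then wrap it with the saved LoRA adapter."""
    # Imported lazily so checkpoint_kind() works without these packages.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(str(base_model_path))
    return PeftModel.from_pretrained(base, str(adapter_dir))
```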
bash scripts/eval/run_eval.sh ${CKPT_PATH} ${EVAL_BATCH_SIZE}
According to the above structure, CKPT_PATH is the absolute path of the model files like /home/xxx/ToDi/outputs/gpt2/gpt2-base/sft/criterion=cross_entropy__default-bf16__.../epochA_step....
bash scripts/eval/run_eval_lora.sh ${LORA_ADAPTER_PATH} ${EVAL_BATCH_SIZE}
Please note that MODEL_PATH in run_eval_lora.sh should be changed for different base models (TinyLLaMA, LLaMA2).
Similarly, LORA_ADAPTER_PATH is the absolute path of the LoRA adapter files like /home/xxx/ToDi/outputs/tinyllama/tinyllama-1.1b-3T/sft/criterion=cross_entropy__lora-rank=256-alpha=8.../epochA_step....
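Because both scripts expect an absolute path, a small wrapper can resolve the path before invoking them. The following is a sketch only: `build_eval_command` and `run_eval` are hypothetical helpers, and they assume the scripts take exactly the two positional arguments shown above and are run from the repo root.

```python
import os
import subprocess


def build_eval_command(ckpt_path, batch_size, lora=False):
    """Build the argv list for the full or LoRA evaluation script."""
    script = "scripts/eval/run_eval_lora.sh" if lora else "scripts/eval/run_eval.sh"
    # The scripts expect an absolute checkpoint or adapter path.
    return ["bash", script, os.path.abspath(ckpt_path), str(batch_size)]


def run_eval(ckpt_path, batch_size, lora=False):
    """Run the evaluation script and raise if it exits with an error."""
    subprocess.run(build_eval_command(ckpt_path, batch_size, lora), check=True)
```

For LoRA checkpoints, MODEL_PATH inside run_eval_lora.sh still has to be edited by hand to point at the matching base model.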
If you find this repo useful for your research, please consider citing us:
@inproceedings{jung-etal-2025-todi,
title = "{T}o{D}i: Token-wise Distillation via Fine-Grained Divergence Control",
author = "Jung, Seongryong and
Yoon, Suwan and
Kim, DongGeon and
Lee, Hwanhee",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.409/",
doi = "10.18653/v1/2025.emnlp-main.409",
pages = "8089--8102",
ISBN = "979-8-89176-332-6",
abstract = "Large language models (LLMs) offer impressive performance but are impractical for resource-constrained deployment due to high latency and energy consumption. Knowledge distillation (KD) addresses this by transferring knowledge from a large teacher to a smaller student model. However, conventional KD, notably approaches like Forward KL (FKL) and Reverse KL (RKL), apply uniform divergence loss across the entire vocabulary, neglecting token-level prediction discrepancies. By investigating these representative divergences via gradient analysis, we reveal that FKL boosts underestimated tokens, while RKL suppresses overestimated ones, showing their complementary roles. Based on this observation, we propose Token-wise Distillation (ToDi), a novel method that adaptively combines FKL and RKL per token using a sigmoid-based weighting function derived from the teacher-student probability log-ratio. ToDi dynamically emphasizes the appropriate divergence for each token, enabling precise distribution alignment. We demonstrate that ToDi consistently outperforms recent distillation baselines using uniform or less granular strategies across instruction-following benchmarks. Extensive ablation studies and efficiency analysis further validate ToDi{'}s effectiveness and practicality."
}
If you have any questions or feedback, feel free to reach out:
- Seongryong Jung: jungsr1116@cau.ac.kr