Skip to content

jungseongryong/ToDi

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 

Repository files navigation

ToDi: Token-wise Distillation via Fine-Grained Divergence Control (EMNLP 2025 Oral/Outstanding Paper Award Nominee)

ACL Anthology Paper PDF BibTeX

Official PyTorch implementation of ToDi, as presented in our paper:
ToDi: Token-wise Distillation via Fine-Grained Divergence Control
Seongryong Jung, Suwan Yoon, DongGeon Kim, and Hwanhee Lee


Some of our code is based on DSKD, MiniLLM and Distillm.

🔧 Requirements

  • deepspeed >= 0.14.0
  • torch >= 2.0.1
  • transformers >= 4.40.2
  • peft >= 0.8.2
  • rouge_score >= 0.1.2

🤖 Models

You can download the corresponding model files (e.g., pytorch_model.bin or model.safetensors) of LLMs used in this paper into model_hub/*/*/.

Here are the links of these models on huggingface:

  • GPT2-120M: Here
  • GPT2-1.5B (trained on Dolly by Gu et al.): Here
  • TinyLLaMA-1.1B: Here
  • Llama2-7B: Here
  • OLMo2: Here
  • Qwen2.5: Here
  • Gemma3: Here

🏋️ Training

SFT for teacher models

For LLaMA2-7B (LoRA), run:

bash scripts/tinyllama/sft_teacher_llama2.sh

SFT for student models

For GPT2-base (full fine-tuning), run:

bash scripts/gpt2/sft_gpt2_base.sh

For TinyLLaMA-1.1B (LoRA), run:

bash scripts/tinyllama/sft_tinyllama.sh

KD

For GPT2-base, run:

bash scripts/gpt2/vanilla_kd_gpt2_base.sh

For TinyLLaMA-1.1B, run:

bash scripts/tinyllama/vanilla_kd_tinyllama.sh

You can change the distance functions (e.g., KL Divergence, Reverse KL Divergence, JS Divergence, etc.) using KD_OBJ in the above scripts.

File Structures in Output Directory

The output directory will be created under ./outputs automatically after you run the training scripts. For full fine-tuning, the file structure of the output directory is as follows (take gpt2 SFT as an example):

./outputs/gpt2/gpt2-base/sft/criterion=cross_entropy__default-bf16__.../
│
├── epochA_step... (model files of epoch A, you can directly load it by AutoModelForCausalLM.from_pretrained(this path))/
│   ├── config.json
│   └── pytorch_model.bin
│   └── tokenizer.json
│   └── ...
│
├── epochB_step... (only exists when SAVE_BEST_N_CKPTS >= 2, similar to epochA_.../)/
│   ├── config.json
│   └── pytorch_model.bin
│   └── tokenizer.json
│   └── ...
│
└── ...
│
└── args.json (The arguments of training)
│
└── train.log (Training log)

For LoRA fine-tuning, the file structure of the output directory is as follows (take TinyLLaMA LoRA SFT as an example):

./outputs/tinyllama/tinyllama-1.1b-3T/sft/criterion=cross_entropy__lora-rank=256-alpha=8.../
│
├── epochA_step... (model files of epoch A, you can directly load it by AutoModelForCausalLM.from_pretrained(this path))/
│   ├── adapter_config.json
│   └── adapter_model.bin
│   └── tokenizer.json
│   └── ...
│
├── epochB_step... (only exists when SAVE_BEST_N_CKPTS >= 2, similar to epochA_.../)/
│   ├── adapter_config.json
│   └── adapter_model.bin
│   └── tokenizer.json
│   └── ...
│
└── ...
│
└── args.json (The arguments of training)
│
└── train.log (Training log)

📊 Evaluation

Evaluate Full Fine-tuning Checkpoints

bash scripts/eval/run_eval.sh ${CKPT_PATH} ${EVAL_BATCH_SIZE}

According to the above structure, CKPT_PATH is the absolute path of the model files like /home/xxx/ToDi/outputs/gpt2/gpt2-base/sft/criterion=cross_entropy__default-bf16__.../epochA_step....

Evaluate LoRA Fine-tuning Checkpoints

bash scripts/eval/run_eval_lora.sh ${LORA_ADAPTER_PATH} ${EVAL_BATCH_SIZE}

Please note that MODEL_PATH in run_eval_lora.sh should be changed for different base models (TinyLLaMA, LLaMA2).

Similarly, LORA_ADAPTER_PATH is the absolute path of the LoRA adapter files like /home/xxx/ToDi/outputs/tinyllama/tinyllama-1.1b-3T/sft/criterion=cross_entropy__lora-rank=256-alpha=8.../epochA_step....

📚 BibTeX

If you find this repo useful for your research, please consider citing us:

@inproceedings{jung-etal-2025-todi,
    title = "{T}o{D}i: Token-wise Distillation via Fine-Grained Divergence Control",
    author = "Jung, Seongryong  and
      Yoon, Suwan  and
      Kim, DongGeon  and
      Lee, Hwanhee",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.409/",
    doi = "10.18653/v1/2025.emnlp-main.409",
    pages = "8089--8102",
    ISBN = "979-8-89176-332-6",
    abstract = "Large language models (LLMs) offer impressive performance but are impractical for resource-constrained deployment due to high latency and energy consumption. Knowledge distillation (KD) addresses this by transferring knowledge from a large teacher to a smaller student model. However, conventional KD, notably approaches like Forward KL (FKL) and Reverse KL (RKL), apply uniform divergence loss across the entire vocabulary, neglecting token-level prediction discrepancies. By investigating these representative divergences via gradient analysis, we reveal that FKL boosts underestimated tokens, while RKL suppresses overestimated ones, showing their complementary roles. Based on this observation, we propose Token-wise Distillation (ToDi), a novel method that adaptively combines FKL and RKL per token using a sigmoid-based weighting function derived from the teacher-student probability log-ratio. ToDi dynamically emphasizes the appropriate divergence for each token, enabling precise distribution alignment. We demonstrate that ToDi consistently outperforms recent distillation baselines using uniform or less granular strategies across instruction-following benchmarks. Extensive ablation studies and efficiency analysis further validate ToDi{'}s effectiveness and practicality."
}

✉️ Contact

If you have any questions or feedback, feel free to reach out:

About

[EMNLP 2025 (Oral Presentation, Outstanding Paper Award Nominee; ∼ 75/1811)] Official PyTorch implementation of ToDi: Token-wise Distillation via Fine-Grained Divergence Control

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors