🎯 Goal (What & Why)
Add activation-level distillation, which usually leads to better student performance than distilling on logits alone.
🚀 Execution Plan
Step 1: What is the smallest working version?
- Distill based on all mixer-layer outputs.
- Support the case where the student has the same number of layers as the teacher.
- Use MSE loss.
- Add a single coefficient that balances feature-level distillation vs logit-level.
Details:
- The teacher stores its intermediate activations in `kwargs`.
- The student reads them from `kwargs` to compute the activation-distillation losses, which it stores in `losses` (a minimal sketch follows below).
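A minimal sketch of this first version in plain PyTorch, independent of Fast-LLM's actual layer and loss interfaces; the function and argument names (`activation_distillation_loss`, `activation_loss_coefficient`, ...) are illustrative assumptions, not existing Fast-LLM APIs:

```python
import torch
import torch.nn.functional as F


def activation_distillation_loss(
    student_activations: list[torch.Tensor],
    teacher_activations: list[torch.Tensor],
) -> torch.Tensor:
    """MSE between matching mixer-layer outputs (assumes equal layer counts)."""
    assert len(student_activations) == len(teacher_activations)
    layer_losses = [
        F.mse_loss(student, teacher.detach())  # no gradient through the teacher
        for student, teacher in zip(student_activations, teacher_activations)
    ]
    return torch.stack(layer_losses).mean()


def total_distillation_loss(
    logit_distillation_loss: torch.Tensor,
    student_activations: list[torch.Tensor],
    teacher_activations: list[torch.Tensor],
    activation_loss_coefficient: float = 1.0,  # the single balancing coefficient
) -> torch.Tensor:
    """Combine logit-level and activation-level distillation with one coefficient."""
    activation_loss = activation_distillation_loss(student_activations, teacher_activations)
    return logit_distillation_loss + activation_loss_coefficient * activation_loss
```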
Step 2: What additional optimizations are possible (but optional)?
- Support TP with sequence parallelism (this is not actually optional, but it can be done in a second step).
- Make the layers used for distillation configurable. For example, distill only from mixer-layer outputs, or also from MLP outputs, etc. Pass a {student -> teacher} mapping of layer names to select which outputs are distilled (see the config sketch below).
- Configurable loss: MSE, cosine, others?
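One way the configurable parts could look, as a sketch only; the `ActivationDistillationConfig` class and its field names are hypothetical, not part of Fast-LLM's config schema:

```python
from dataclasses import dataclass, field

import torch
import torch.nn.functional as F


def _mse(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    return F.mse_loss(student, teacher.detach())


def _cosine(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    # 1 - cosine similarity over the hidden dimension, averaged over tokens.
    return 1.0 - F.cosine_similarity(student, teacher.detach(), dim=-1).mean()


_LOSS_FUNCTIONS = {"mse": _mse, "cosine": _cosine}


@dataclass
class ActivationDistillationConfig:
    # {student layer name -> teacher layer name}; empty means all mixer outputs.
    layer_mapping: dict[str, str] = field(default_factory=dict)
    loss: str = "mse"  # "mse" or "cosine"
    coefficient: float = 1.0

    def loss_fn(self, student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
        return _LOSS_FUNCTIONS[self.loss](student, teacher)
```

With an empty `layer_mapping` and `loss="mse"`, this reduces to the Step 1 behaviour.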
📌 Acceptance Criteria (Must-Haves for Completion)
- The feature must be functional and tested.
- The implementation must be documented in practical terms.
- The PR must include a performance/impact summary.
- No refactors unless directly necessary for feature completion.
🛠️ Project Management
- Assign the project to the Fast-LLM project.
- Set the `Estimate` field (in days) in the GitHub project.
- Use the `Size` field to categorize the PR size (Small/Medium/Large).
- Assign an owner when opening the issue.