experimental: Self-Distillation Zero #5609
LeonEricsson wants to merge 35 commits into huggingface:main
Conversation
…onfig parameters moved to sdpoconfig, + other nits
`BaseSelfDistillationTrainer` was populating `_metrics` in `_log_self_distillation_metric` but had no `log()` override, so those metrics were never forwarded to the Trainer's logging system. The fix merges `_metrics` into the log dict, prefixes eval keys, and clears them after each logging step.
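A minimal sketch of the described fix, following the `log()` override pattern used by other TRL trainers; the actual code in this commit may differ in detail:

```python
from collections import defaultdict

from transformers import Trainer


class BaseSelfDistillationTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Buffered per-step metrics, filled by _log_self_distillation_metric.
        self._metrics = {"train": defaultdict(list), "eval": defaultdict(list)}

    def log(self, logs, start_time=None):
        mode = "train" if self.model.training else "eval"
        # Average the buffered values and merge them into the log dict.
        metrics = {k: sum(v) / len(v) for k, v in self._metrics[mode].items()}
        if mode == "eval":
            metrics = {f"eval_{k}": v for k, v in metrics.items()}
        super().log({**logs, **metrics}, start_time)
        self._metrics[mode].clear()  # reset after each logging step
```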
Force-pushed from 1c4a8f7 to a110ba8
Force-pushed from a110ba8 to ab20ace
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit ab20ace.
```python
def __init__(self, trainer):
    self.trainer = trainer
    self.prompt_tokenizer = PromptTokenizer(trainer)
```
Dead method references a removed `generate_from_teacher` attribute
Low Severity
`DemonstrationTeacherContextBuilder.select_generation_prompts` accesses `self.trainer.generate_from_teacher`, but this instance attribute was removed during the refactor (the old `self.generate_from_teacher = args.generate_from_teacher` line is gone). The method is never called in the new code flow since `finalize_batch` replaced `_build_buffered_batch`, making it dead code that would raise `AttributeError` if ever invoked. The method and its unreachable code path can be removed for clarity.
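A hypothetical reconstruction of the flagged dead path (names taken from the issue text, the surrounding code is guessed): the refactor dropped the attribute assignment, so the method would fail at runtime if anything still called it.

```python
class DemonstrationTeacherContextBuilder:
    def __init__(self, trainer):
        self.trainer = trainer  # trainer no longer sets generate_from_teacher

    def select_generation_prompts(self, prompts):
        # Dead code: finalize_batch replaced the only caller of this method.
        if self.trainer.generate_from_teacher:  # AttributeError if ever invoked
            return prompts
        return []
```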
```python
self.scale_rewards = args.scale_rewards
self.epsilon_low = args.epsilon
self.epsilon_high = args.epsilon_high
self.beta = args.beta
```
Unused `beta` parameter stored but never applied
Medium Severity
`SDPOConfig` declares a `beta` parameter documented as "Reference-model KL coefficient for online policy optimization," and `SDPOTrainer.__init__` stores it as `self.beta`. However, `_compute_policy_loss` never uses `self.beta`: there is no reference-model KL penalty term in the loss. A user setting `beta > 0` would expect a KL regularization effect but get none, leading to silently incorrect training behavior.
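For reference, a hedged sketch of where the missing term would go: in GRPO-style trainers, `beta` scales a per-token KL penalty against the reference model. The function and argument names here are illustrative, not the PR's actual `_compute_policy_loss`.

```python
import torch


def policy_loss_with_kl(logps, old_logps, ref_logps, advantages, mask,
                        epsilon_low, epsilon_high, beta):
    ratio = torch.exp(logps - old_logps)                      # [B, T]
    adv = advantages.unsqueeze(1)                             # [B, 1], broadcasts per token
    clipped = torch.clamp(ratio, 1 - epsilon_low, 1 + epsilon_high)
    pg_loss = -torch.min(ratio * adv, clipped * adv)          # clipped surrogate
    # The piece Bugbot flags as missing: a k3 estimate of KL(pi || pi_ref),
    # scaled by beta, which is how GRPO applies its beta coefficient.
    kl = torch.exp(ref_logps - logps) - (ref_logps - logps) - 1
    loss = pg_loss + beta * kl
    return (loss * mask).sum() / mask.sum()
```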


Implements "Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision" on top of #5573.
SD-Zero is composed of two stages:
Self-Revision Training (SRT):
We sample model responses, evaluate correctness, and prompt the model to revise incorrect outputs. Only traces where the revision succeeds are retained, and the model is fine-tuned on this filtered dataset (a schematic of this loop is sketched below, after the stage descriptions).
Self-Distillation:
The reviser is used as a teacher to provide token-level supervision over the generator’s responses, effectively converting outcome-level (binary) rewards into dense token-level supervision.
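To make stage one concrete, here is a schematic of the revise-and-filter loop; every name in it (`model.generate`, `verifier`, `revise_template`) is illustrative rather than the PR's actual API:

```python
def collect_revision_dataset(model, verifier, prompts, revise_template):
    """Schematic SRT data collection: sample, verify, revise, filter."""
    dataset = []
    for prompt in prompts:
        response = model.generate(prompt)
        if verifier(prompt, response):
            dataset.append((prompt, response))  # already correct, keep as-is
            continue
        revision = model.generate(
            revise_template.format(prompt=prompt, response=response)
        )
        if verifier(prompt, revision):
            dataset.append((prompt, revision))  # keep only successful revisions
    return dataset  # the model is then fine-tuned on this filtered set
```

For stage two, a generic token-level distillation loss; the PR's `distillation_mode` variants (`sampled_token`, `full_logits`, `topk_logits`) differ in which teacher targets they use, so treat this as an assumption-laden sketch rather than the trainer's actual loss:

```python
import torch.nn.functional as F


def self_distillation_loss(student_logits, teacher_logits, response_mask,
                           temperature=1.0):
    # KL(teacher || student) at every response token: dense supervision in
    # place of a single binary sequence-level reward.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    per_token_kl = (
        p_teacher * (p_teacher.clamp_min(1e-9).log() - log_p_student)
    ).sum(-1)  # [B, T]
    return (per_token_kl * response_mask).sum() / response_mask.sum()
```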
Note
High Risk
High risk because it introduces new training algorithms (SD-Zero/SRT) and significantly refactors the experimental self-distillation trainer base, affecting SDPO/SDFT behavior, reward handling, and teacher synchronization paths (including PEFT/EMA).
Overview
Adds Self-Distillation Zero (SD-Zero) as a new experimental pipeline:
- `SRTTrainer` (supervised self-revision training on an expanded revision dataset) and `SDZeroTrainer` (on-policy self-distillation using a binary verifier), plus CLI scripts for training and for collecting revision datasets.
- Refactors the experimental self-distillation foundation by moving SDPO/SDFT onto a new `BaseSelfDistillationTrainer` contract (`sample_rollouts` + `finalize_batch`), centralizing teacher selection/sync (`teacher_model_kind=base|live|ema`), distillation objectives (`distillation_mode=sampled_token|full_logits|topk_logits`), and loss utilities (new `loss_utils.py`).
- Updates SDPO to compute rewards and advantages internally (new reward scaling/weights, token vs. sequence importance sampling, explicit `policy_only`/`hybrid` modes) and updates docs/tests/scripts to use `distillation_mode` (replacing `full_logit_distillation`) and new teacher-EMA knobs; also adds new unit tests for the base self-distillation trainer.
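For orientation, a hypothetical configuration sketch using the knob names from the overview above; the import path, model name, and exact signatures are assumptions, not the PR's verified API:

```python
# Hypothetical usage: parameter names come from the PR text, everything else
# (import location, defaults, placeholders) is assumed.
from trl.experimental.sdpo import SDPOConfig, SDPOTrainer  # assumed location

args = SDPOConfig(
    output_dir="sdzero-out",
    teacher_model_kind="ema",          # base | live | ema
    distillation_mode="topk_logits",   # sampled_token | full_logits | topk_logits
    scale_rewards=True,
    epsilon=0.2,                       # lower clipping range (epsilon_low)
    epsilon_high=0.28,
    beta=0.0,                          # reference-model KL coefficient (see Bugbot issue above)
)
trainer = SDPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model
    args=args,
    train_dataset=...,                   # your verifier-labeled dataset
)
trainer.train()
```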