experimental: Self-Distillation Zero #5609

Open
LeonEricsson wants to merge 35 commits into huggingface:main from LeonEricsson:feature/sd-zero

Conversation


LeonEricsson (Collaborator) commented on Apr 20, 2026

Implements "Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision" on top of #5573.

SD-Zero is composed of two stages:

  1. Self-Revision Training (SRT):
    We sample model responses, evaluate correctness, and prompt the model to revise incorrect outputs. Only traces where the revision succeeds are retained, and the model is fine-tuned on this filtered dataset (see the sketch after this list).

  2. Self-Distillation:
    The reviser is used as a teacher to provide token-level supervision over the generator’s responses, effectively converting outcome-level (binary) rewards into dense token-level supervision.
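To make the stage-1 filtering concrete, here is a minimal sketch of the data-collection loop. The `generate` and `verify` callables and the revision prompt template are illustrative placeholders, not the PR's actual API:

```python
from typing import Callable

# Illustrative revision prompt; the PR's actual template may differ.
REVISION_PROMPT = (
    "{prompt}\n\nYour previous answer was incorrect:\n{response}\n\n"
    "Please revise your answer."
)

def collect_revision_dataset(
    generate: Callable[[str], str],      # prompt -> sampled model response
    verify: Callable[[str, str], bool],  # (prompt, response) -> correctness
    prompts: list[str],
) -> list[dict]:
    """Stage 1 of SD-Zero: retain only traces where self-revision succeeds."""
    retained = []
    for prompt in prompts:
        response = generate(prompt)
        if verify(prompt, response):
            continue  # correct on the first attempt; nothing to revise
        revision_prompt = REVISION_PROMPT.format(prompt=prompt, response=response)
        revision = generate(revision_prompt)
        if verify(prompt, revision):
            retained.append({"prompt": revision_prompt, "completion": revision})
    return retained  # the model is then fine-tuned on this filtered dataset
```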

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.


Note

High Risk
High risk because it introduces new training algorithms (SD-Zero/SRT) and significantly refactors the experimental self-distillation trainer base, affecting SDPO/SDFT behavior, reward handling, and teacher synchronization paths (including PEFT/EMA).

Overview
Adds Self-Distillation Zero (SD-Zero) as a new experimental pipeline: SRTTrainer (supervised self-revision training on an expanded revision dataset) and SDZeroTrainer (on-policy self-distillation using a binary verifier), plus CLI scripts for training and for collecting revision datasets.

Refactors the experimental self-distillation foundation by moving SDPO/SDFT onto a new BaseSelfDistillationTrainer contract (sample_rollouts + finalize_batch), centralizing teacher selection/sync (teacher_model_kind = base|live|ema), distillation objectives (distillation_mode = sampled_token|full_logits|topk_logits), and loss utilities (new loss_utils.py).
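As a rough sketch of what that contract might look like (the method and option names follow the description above; the signatures and attribute names are assumptions):

```python
from abc import ABC, abstractmethod

class BaseSelfDistillationTrainer(ABC):
    """Hypothetical outline of the shared contract described above."""

    teacher_model_kind: str  # "base" | "live" | "ema"
    distillation_mode: str   # "sampled_token" | "full_logits" | "topk_logits"

    @abstractmethod
    def sample_rollouts(self, prompts):
        """Generate candidate completions from the current policy."""

    @abstractmethod
    def finalize_batch(self, rollouts):
        """Attach teacher supervision (and rewards, for SDPO) and build the
        training batch from the sampled rollouts."""

    def get_teacher(self):
        # Centralized teacher selection, per the description above.
        if self.teacher_model_kind == "base":
            return self.ref_model   # frozen copy of the initial policy
        if self.teacher_model_kind == "live":
            return self.model       # the current policy distills from itself
        return self.ema_model       # exponential-moving-average teacher
```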

Updates SDPO to compute rewards and advantages internally (new reward scaling/weights, token vs sequence importance sampling, explicit policy_only/hybrid modes) and updates docs/tests/scripts to use distillation_mode (replacing full_logit_distillation) and new teacher-EMA knobs; also adds new unit tests for the base self-distillation trainer.
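For reference, a minimal sketch of group-relative advantage computation with a reward-scaling toggle, in the GRPO style; whether SDPO's internal computation matches this exactly is an assumption:

```python
import torch

def group_relative_advantages(
    rewards: torch.Tensor, scale_rewards: bool = True
) -> torch.Tensor:
    """Center each rollout's reward against its prompt group; optionally
    normalize by the group standard deviation.

    rewards: (num_prompts, group_size) scalar rewards per sampled completion.
    """
    advantages = rewards - rewards.mean(dim=-1, keepdim=True)
    if scale_rewards:
        advantages = advantages / (rewards.std(dim=-1, keepdim=True) + 1e-4)
    return advantages
```

The token-vs-sequence importance-sampling option presumably controls whether a per-token probability ratio or a single per-sequence ratio multiplies these advantages.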

Reviewed by Cursor Bugbot for commit ab20ace. Bugbot is set up for automated code reviews on this repo.

LeonEricsson marked this pull request as ready for review on April 22, 2026, 19:46.

cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.




def __init__(self, trainer):
    self.trainer = trainer
    self.prompt_tokenizer = PromptTokenizer(trainer)


Dead method references a removed generate_from_teacher attribute

Low Severity

DemonstrationTeacherContextBuilder.select_generation_prompts accesses self.trainer.generate_from_teacher, but this instance attribute was removed during the refactor (the old self.generate_from_teacher = args.generate_from_teacher line is gone). The method is never called in the new code flow since finalize_batch replaced _build_buffered_batch, making it dead code that would raise AttributeError if ever invoked. The method and its unreachable code path can be removed for clarity.
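In other words, the reported pattern looks roughly like this (names taken from the review text; the surrounding structure is assumed):

```python
class DemonstrationTeacherContextBuilder:
    def __init__(self, trainer):
        self.trainer = trainer

    def select_generation_prompts(self, prompts):
        # `generate_from_teacher` was removed from the trainer during the
        # refactor, so this attribute access would raise AttributeError --
        # but finalize_batch replaced the only call site, leaving this dead.
        if self.trainer.generate_from_teacher:
            ...
```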



self.scale_rewards = args.scale_rewards
self.epsilon_low = args.epsilon
self.epsilon_high = args.epsilon_high
self.beta = args.beta


Unused beta parameter stored but never applied

Medium Severity

SDPOConfig declares a beta parameter documented as "Reference-model KL coefficient for online policy optimization," and SDPOTrainer.__init__ stores it as self.beta. However, _compute_policy_loss never uses self.beta — there is no reference-model KL penalty term in the loss. A user setting beta > 0 would expect a KL regularization effect but get none, leading to silently incorrect training behavior.
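For context, one conventional way such a coefficient is applied in GRPO-style trainers is a per-token KL penalty against the reference model. A sketch of one possible fix, not necessarily what this PR intends:

```python
import torch

def add_reference_kl(policy_loss, per_token_logps, ref_per_token_logps,
                     completion_mask, beta: float):
    """Add beta * KL(policy || ref) to the policy loss, estimated per token
    with the k3 estimator and averaged over completion tokens."""
    if beta == 0.0:
        return policy_loss
    log_ratio = ref_per_token_logps - per_token_logps
    per_token_kl = torch.exp(log_ratio) - log_ratio - 1.0  # k3 estimator, >= 0
    kl = (per_token_kl * completion_mask).sum() / completion_mask.sum().clamp(min=1)
    return policy_loss + beta * kl
```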


