experimental: Self-Distillation Zero #5609

Open
LeonEricsson wants to merge 35 commits into huggingface:main from LeonEricsson:feature/sd-zero

Conversation


LeonEricsson (Collaborator) commented on Apr 20, 2026

Implements "Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision" on top of #5573.

SD-Zero is composed of two stages:

  1. Self-Revision Training (SRT):
    We sample model responses, evaluate correctness, and prompt the model to revise incorrect outputs. Only traces where the revision succeeds are retained, and the model is fine-tuned on this filtered dataset (see the sketch after this list).

  2. Self-Distillation:
    The reviser is used as a teacher to provide token-level supervision over the generator’s responses, effectively converting outcome-level (binary) rewards into dense token-level supervision.
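To make the stage-1 filtering concrete, here is a minimal sketch of the data-collection loop. The `generate` and `verify` callables and the revision prompt template are illustrative placeholders, not the PR's actual API:

```python
from typing import Callable

# Illustrative revision prompt; the PR's actual template may differ.
REVISION_PROMPT = (
    "{prompt}\n\nYour previous answer was incorrect:\n{response}\n\n"
    "Please revise your answer."
)

def collect_revision_dataset(
    generate: Callable[[str], str],      # prompt -> sampled model response
    verify: Callable[[str, str], bool],  # (prompt, response) -> correctness
    prompts: list[str],
) -> list[dict]:
    """Stage 1 of SD-Zero: retain only traces where self-revision succeeds."""
    retained = []
    for prompt in prompts:
        response = generate(prompt)
        if verify(prompt, response):
            continue  # correct on the first attempt; nothing to revise
        revision_prompt = REVISION_PROMPT.format(prompt=prompt, response=response)
        revision = generate(revision_prompt)
        if verify(prompt, revision):
            retained.append({"prompt": revision_prompt, "completion": revision})
    return retained  # the model is then fine-tuned on this filtered dataset
```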

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.


Note

High Risk
High risk because it introduces new training algorithms (SD-Zero/SRT) and significantly refactors the experimental self-distillation trainer base, affecting SDPO/SDFT behavior, reward handling, and teacher synchronization paths (including PEFT/EMA).

Overview
Adds Self-Distillation Zero (SD-Zero) as a new experimental pipeline: SRTTrainer (supervised self-revision training on an expanded revision dataset) and SDZeroTrainer (on-policy self-distillation using a binary verifier), plus CLI scripts for training and for collecting revision datasets.

Refactors the experimental self-distillation foundation by moving SDPO/SDFT onto a new BaseSelfDistillationTrainer contract (sample_rollouts + finalize_batch), centralizing teacher selection/sync (teacher_model_kind = base|live|ema), distillation objectives (distillation_mode = sampled_token|full_logits|topk_logits), and loss utilities (new loss_utils.py).
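As a rough sketch of what that contract might look like (the method and option names follow the description above; the signatures and attribute names are assumptions):

```python
from abc import ABC, abstractmethod

class BaseSelfDistillationTrainer(ABC):
    """Hypothetical outline of the shared contract described above."""

    teacher_model_kind: str  # "base" | "live" | "ema"
    distillation_mode: str   # "sampled_token" | "full_logits" | "topk_logits"

    @abstractmethod
    def sample_rollouts(self, prompts):
        """Generate candidate completions from the current policy."""

    @abstractmethod
    def finalize_batch(self, rollouts):
        """Attach teacher supervision (and rewards, for SDPO) and build the
        training batch from the sampled rollouts."""

    def get_teacher(self):
        # Centralized teacher selection, per the description above.
        if self.teacher_model_kind == "base":
            return self.ref_model   # frozen copy of the initial policy
        if self.teacher_model_kind == "live":
            return self.model       # the current policy distills from itself
        return self.ema_model       # exponential-moving-average teacher
```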

Updates SDPO to compute rewards and advantages internally (new reward scaling/weights, token vs sequence importance sampling, explicit policy_only/hybrid modes) and updates docs/tests/scripts to use distillation_mode (replacing full_logit_distillation) and new teacher-EMA knobs; also adds new unit tests for the base self-distillation trainer.
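For reference, a minimal sketch of group-relative advantage computation with a reward-scaling toggle, in the GRPO style; whether SDPO's internal computation matches this exactly is an assumption:

```python
import torch

def group_relative_advantages(
    rewards: torch.Tensor, scale_rewards: bool = True
) -> torch.Tensor:
    """Center each rollout's reward against its prompt group; optionally
    normalize by the group standard deviation.

    rewards: (num_prompts, group_size) scalar rewards per sampled completion.
    """
    advantages = rewards - rewards.mean(dim=-1, keepdim=True)
    if scale_rewards:
        advantages = advantages / (rewards.std(dim=-1, keepdim=True) + 1e-4)
    return advantages
```

The token-vs-sequence importance-sampling option presumably controls whether a per-token probability ratio or a single per-sequence ratio multiplies these advantages.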

Reviewed by Cursor Bugbot for commit ab20ace. Bugbot is set up for automated code reviews on this repo.

LeonEricsson marked this pull request as ready for review on April 22, 2026, 19:46.

cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.




def __init__(self, trainer):
    self.trainer = trainer
    self.prompt_tokenizer = PromptTokenizer(trainer)


Dead method references a removed generate_from_teacher attribute

Low Severity

DemonstrationTeacherContextBuilder.select_generation_prompts accesses self.trainer.generate_from_teacher, but this instance attribute was removed during the refactor (the old self.generate_from_teacher = args.generate_from_teacher line is gone). The method is never called in the new code flow since finalize_batch replaced _build_buffered_batch, making it dead code that would raise AttributeError if ever invoked. The method and its unreachable code path can be removed for clarity.
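In other words, the reported pattern looks roughly like this (names taken from the review text; the surrounding structure is assumed):

```python
class DemonstrationTeacherContextBuilder:
    def __init__(self, trainer):
        self.trainer = trainer

    def select_generation_prompts(self, prompts):
        # `generate_from_teacher` was removed from the trainer during the
        # refactor, so this attribute access would raise AttributeError --
        # but finalize_batch replaced the only call site, leaving this dead.
        if self.trainer.generate_from_teacher:
            ...
```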



self.scale_rewards = args.scale_rewards
self.epsilon_low = args.epsilon
self.epsilon_high = args.epsilon_high
self.beta = args.beta


Unused beta parameter stored but never applied

Medium Severity

SDPOConfig declares a beta parameter documented as "Reference-model KL coefficient for online policy optimization," and SDPOTrainer.__init__ stores it as self.beta. However, _compute_policy_loss never uses self.beta — there is no reference-model KL penalty term in the loss. A user setting beta > 0 would expect a KL regularization effect but get none, leading to silently incorrect training behavior.
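For context, one conventional way such a coefficient is applied in GRPO-style trainers is a per-token KL penalty against the reference model. A sketch of one possible fix, not necessarily what this PR intends:

```python
import torch

def add_reference_kl(policy_loss, per_token_logps, ref_per_token_logps,
                     completion_mask, beta: float):
    """Add beta * KL(policy || ref) to the policy loss, estimated per token
    with the k3 estimator and averaged over completion tokens."""
    if beta == 0.0:
        return policy_loss
    log_ratio = ref_per_token_logps - per_token_logps
    per_token_kl = torch.exp(log_ratio) - log_ratio - 1.0  # k3 estimator, >= 0
    kl = (per_token_kl * completion_mask).sum() / completion_mask.sum().clamp(min=1)
    return policy_loss + beta * kl
```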


