Set _tokenizer attribute in experimental trainers#5566

Merged
albertvillanova merged 20 commits into huggingface:main from albertvillanova:set-tokenizer-attribute-exp on Apr 28, 2026
@albertvillanova (Member) commented Apr 16, 2026

Set _tokenizer attribute in experimental trainers.

Follow-up to:


Note

Medium Risk
Broad but mechanical refactor touching padding/EOS IDs in several training and generation paths; a missed reference could cause runtime errors or subtle masking/padding behavior changes.

Overview
Standardizes tokenizer handling across experimental trainers by storing the resolved tokenizer on self._tokenizer (from either a ProcessorMixin or a PreTrainedTokenizerBase) and ensuring pad_token defaults to eos_token when missing.
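The resolution step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `StubTokenizer`, `StubProcessor`, `resolve_tokenizer`, and `SketchTrainer` are hypothetical names, and the stubs stand in for `PreTrainedTokenizerBase` and `ProcessorMixin` so the snippet runs without `transformers` installed. The real trainers presumably dispatch on the actual types rather than on attribute presence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubTokenizer:
    """Hypothetical stand-in for a PreTrainedTokenizerBase."""
    eos_token: Optional[str] = "</s>"
    pad_token: Optional[str] = None


@dataclass
class StubProcessor:
    """Hypothetical stand-in for a ProcessorMixin that wraps a tokenizer."""
    tokenizer: StubTokenizer


def resolve_tokenizer(processing_class):
    """Return the underlying tokenizer, defaulting pad_token to eos_token."""
    # Assumption: a processor exposes its tokenizer as `.tokenizer`,
    # while a plain tokenizer is used directly.
    tokenizer = getattr(processing_class, "tokenizer", processing_class)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


class SketchTrainer:
    """Hypothetical trainer that keeps the resolved tokenizer on self._tokenizer."""

    def __init__(self, processing_class):
        self._tokenizer = resolve_tokenizer(processing_class)
```

With this shape, a trainer built from either a processor or a bare tokenizer ends up with the same `self._tokenizer` attribute, and a missing pad token is filled in once, at construction time.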

Updates all downstream uses (data collators, dataset preprocessing EOS appends, padding/gathering logic, and generation configs/EOS detection) to reference self._tokenizer.* instead of local tokenizer variables or separate self.pad_token_id/self.eos_token_id fields.
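To illustrate the downstream pattern, the sketch below reads the padding and EOS IDs from `self._tokenizer` at the point of use instead of from separate trainer fields. It is an assumption-laden toy: `StubTokenizer`, `SketchTrainer`, `pad_batch`, and `ends_with_eos` are hypothetical names that do not appear in the PR, and the ID values are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class StubTokenizer:
    """Hypothetical stand-in exposing only the two IDs used below."""
    pad_token_id: int = 0
    eos_token_id: int = 2


class SketchTrainer:
    def __init__(self, tokenizer):
        # Single source of truth for special-token IDs.
        self._tokenizer = tokenizer

    def pad_batch(self, sequences):
        """Right-pad lists of token IDs to a common length."""
        width = max(len(seq) for seq in sequences)
        pad_id = self._tokenizer.pad_token_id
        return [seq + [pad_id] * (width - len(seq)) for seq in sequences]

    def ends_with_eos(self, sequence):
        """Report whether a sequence terminates with the EOS token ID."""
        return bool(sequence) and sequence[-1] == self._tokenizer.eos_token_id
```

Because every helper looks up the IDs through the same attribute, there is no second copy of `pad_token_id` or `eos_token_id` on the trainer that could drift out of sync with the tokenizer.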

Reviewed by Cursor Bugbot for commit ba22aad. Bugbot is set up for automated code reviews on this repo.

@HuggingFaceDocBuilderDev
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread trl/experimental/online_dpo/online_dpo_trainer.py
Comment thread trl/experimental/sdft/sdft_trainer.py
@cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit ad61f84.

@qgallouedec (Member) left a comment

lgtm!

@albertvillanova albertvillanova merged commit 4455858 into huggingface:main Apr 28, 2026
4 checks passed

3 participants