@kamalkraj kamalkraj commented Dec 21, 2025

Add BOS/EOS Token Handling in Tokenizer Adapter

The tokenizer adapter now supports configurable beginning-of-sequence (BOS) and end-of-sequence (EOS) token insertion through add_bos and add_eos flags. This feature is essential for Supervised Fine-Tuning (SFT) workflows where:

  • Input sequences should start with a BOS token
  • Target sequences should end with an EOS token (without a BOS token)

The implementation automatically prepends or appends the respective tokens based on the specified flags.
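The behavior described above can be sketched as follows. This is a minimal, hypothetical illustration: `ToyTokenizer` and the adapter's method names are assumptions for demonstration, not the PR's actual API.

```python
class ToyTokenizer:
    """Stand-in tokenizer: maps each character to its ordinal."""
    bos_id = 1
    eos_id = 2

    def encode(self, text):
        return [ord(c) for c in text]


class TokenizerAdapter:
    """Illustrative adapter with configurable BOS/EOS insertion."""

    def __init__(self, tokenizer, add_bos=False, add_eos=False):
        self.tokenizer = tokenizer
        self.add_bos = add_bos
        self.add_eos = add_eos

    def encode(self, text):
        ids = self.tokenizer.encode(text)
        if self.add_bos:
            # Prepend the BOS token (typical for SFT input sequences).
            ids = [self.tokenizer.bos_id] + ids
        if self.add_eos:
            # Append the EOS token (typical for SFT target sequences).
            ids = ids + [self.tokenizer.eos_id]
        return ids


# SFT-style usage: BOS on inputs, EOS (and no BOS) on targets.
inputs = TokenizerAdapter(ToyTokenizer(), add_bos=True).encode("hi")
targets = TokenizerAdapter(ToyTokenizer(), add_eos=True).encode("hi")
```

With this configuration, `inputs` carries a leading BOS id while `targets` carries a trailing EOS id, matching the two SFT requirements listed above.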

Special EOS Token Support

The tokenizer adapter allows you to specify custom EOS tokens for models that use non-standard sequence terminators. This is particularly useful for instruction-tuned models:

  • Use Case Example: Instruction-tuned Gemma models terminate turns with <end_of_turn> instead of the tokenizer's default EOS token
  • Benefit: Ensures compatibility with diverse model architectures and tokenization schemes
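The custom-EOS override could look roughly like the sketch below. The vocabulary lookup and parameter name `eos_token` are assumptions made for illustration; only the Gemma `<end_of_turn>` use case comes from the PR description.

```python
class ToyTokenizer:
    """Stand-in tokenizer with a tiny special-token vocabulary."""
    vocab = {"<eos>": 2, "<end_of_turn>": 107}
    eos_id = 2

    def token_to_id(self, token):
        return self.vocab[token]

    def encode(self, text):
        return [ord(c) for c in text]


class TokenizerAdapter:
    """Illustrative adapter allowing a non-standard EOS token."""

    def __init__(self, tokenizer, add_eos=False, eos_token=None):
        self.tokenizer = tokenizer
        self.add_eos = add_eos
        # Resolve a custom terminator (e.g. Gemma's <end_of_turn>)
        # through the vocabulary; otherwise fall back to the default EOS.
        self.eos_id = (
            tokenizer.token_to_id(eos_token) if eos_token else tokenizer.eos_id
        )

    def encode(self, text):
        ids = self.tokenizer.encode(text)
        if self.add_eos:
            ids = ids + [self.eos_id]
        return ids


# Gemma-style: terminate targets with <end_of_turn> rather than <eos>.
adapter = TokenizerAdapter(ToyTokenizer(), add_eos=True, eos_token="<end_of_turn>")
ids = adapter.encode("hi")
```

Resolving the token string through the vocabulary (rather than hard-coding an id) keeps the adapter usable across tokenizers whose special-token ids differ.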

These enhancements are accompanied by comprehensive unit tests that verify correct token handling across various scenarios.

Checklist

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and all unit tests pass.
  • I have added all appropriate doc-strings/documentation.
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have signed the Contributor License Agreement.
  • I have followed Contribution Guidelines.
