@kamalkraj kamalkraj commented Dec 21, 2025

Add BOS/EOS Token Handling in Tokenizer Adapter

The tokenizer adapter now supports configurable beginning-of-sequence (BOS) and end-of-sequence (EOS) token insertion through add_bos and add_eos flags. This feature is essential for Supervised Fine-Tuning (SFT) workflows where:

  • Input sequences should start with a BOS token
  • Target sequences should end with an EOS token (without a BOS token)

The implementation automatically prepends or appends the respective tokens based on the specified flags.
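The behavior described above can be sketched as follows. This is a minimal, hypothetical illustration: `ToyTokenizer` and the adapter's method names are assumptions for demonstration, not the PR's actual API.

```python
class ToyTokenizer:
    """Stand-in tokenizer: maps each character to its ordinal."""
    bos_id = 1
    eos_id = 2

    def encode(self, text):
        return [ord(c) for c in text]


class TokenizerAdapter:
    """Illustrative adapter with configurable BOS/EOS insertion."""

    def __init__(self, tokenizer, add_bos=False, add_eos=False):
        self.tokenizer = tokenizer
        self.add_bos = add_bos
        self.add_eos = add_eos

    def encode(self, text):
        ids = self.tokenizer.encode(text)
        if self.add_bos:
            # Prepend the BOS token (typical for SFT input sequences).
            ids = [self.tokenizer.bos_id] + ids
        if self.add_eos:
            # Append the EOS token (typical for SFT target sequences).
            ids = ids + [self.tokenizer.eos_id]
        return ids


# SFT-style usage: BOS on inputs, EOS (and no BOS) on targets.
inputs = TokenizerAdapter(ToyTokenizer(), add_bos=True).encode("hi")
targets = TokenizerAdapter(ToyTokenizer(), add_eos=True).encode("hi")
```

With this configuration, `inputs` carries a leading BOS id while `targets` carries a trailing EOS id, matching the two SFT requirements listed above.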

Special EOS Token Support

The tokenizer adapter allows you to specify custom EOS tokens for models that use non-standard sequence terminators. This is particularly useful for instruction-tuned models:

  • Use Case Example: Instruction-tuned Gemma models terminate turns with <end_of_turn> instead of the tokenizer's default EOS token
  • Benefit: Ensures compatibility with diverse model architectures and tokenization schemes
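The custom-EOS override could look roughly like the sketch below. The vocabulary lookup and parameter name `eos_token` are assumptions made for illustration; only the Gemma `<end_of_turn>` use case comes from the PR description.

```python
class ToyTokenizer:
    """Stand-in tokenizer with a tiny special-token vocabulary."""
    vocab = {"<eos>": 2, "<end_of_turn>": 107}
    eos_id = 2

    def token_to_id(self, token):
        return self.vocab[token]

    def encode(self, text):
        return [ord(c) for c in text]


class TokenizerAdapter:
    """Illustrative adapter allowing a non-standard EOS token."""

    def __init__(self, tokenizer, add_eos=False, eos_token=None):
        self.tokenizer = tokenizer
        self.add_eos = add_eos
        # Resolve a custom terminator (e.g. Gemma's <end_of_turn>)
        # through the vocabulary; otherwise fall back to the default EOS.
        self.eos_id = (
            tokenizer.token_to_id(eos_token) if eos_token else tokenizer.eos_id
        )

    def encode(self, text):
        ids = self.tokenizer.encode(text)
        if self.add_eos:
            ids = ids + [self.eos_id]
        return ids


# Gemma-style: terminate targets with <end_of_turn> rather than <eos>.
adapter = TokenizerAdapter(ToyTokenizer(), add_eos=True, eos_token="<end_of_turn>")
ids = adapter.encode("hi")
```

Resolving the token string through the vocabulary (rather than hard-coding an id) keeps the adapter usable across tokenizers whose special-token ids differ.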

These enhancements are accompanied by comprehensive unit tests that verify correct token handling across various scenarios.

Checklist

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and all unit tests pass.
  • I have added all appropriate doc-strings/documentation.
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have signed the Contributor License Agreement.
  • I have followed Contribution Guidelines.
