
feat(tokenization): add encode_message to tokenize messages one by one #39507

Open · wants to merge 5 commits into main

Conversation

@pco111 commented on Jul 18, 2025

What does this PR do?
This PR introduces a new method, tokenizer.encode_message, to the base tokenizer class. This method allows a single chat message to be tokenized at a time while correctly handling the conversational context provided by conversation_history. This is particularly useful for message-by-message streaming applications, where re-tokenizing the entire conversation history for each new message is inefficient.
The new method works by applying the chat template to the full conversation (history + new message) and then programmatically isolating the tokens that correspond to the new message. This ensures that all special tokens, roles, and formatting are applied correctly according to the model's chat template, maintaining consistency with apply_chat_template.
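
A rough usage sketch of the proposed method (the checkpoint, exact keyword names, and return type below are assumptions based on this description, not the final API):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # example checkpoint

history = []
all_ids = []
for msg in [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]:
    # Encode only the new message in the context of what came before,
    # instead of re-templating and re-tokenizing the whole conversation.
    all_ids += tok.encode_message(msg, conversation_history=history)
    history.append(msg)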

Fixes #39417
Before submitting
[x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
[x] Did you read the contributor guideline, Pull Request section?
[x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
[x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
[x] Did you write any new necessary tests?

Who can review?
@ArthurZucker @Rocketknight1

@ArthurZucker (Collaborator)

I like this! cc @Rocketknight1 if you can have a look!

@Rocketknight1 (Member) commented on Jul 21, 2025

Hi @pco111, this is a cool idea, but I'm not sure about some of the details! In particular, the interaction with add_generation_prompt is awkward. If we set that to True, then a common scenario is that the conversation_history will be tokenized like this, where the "generation prompt" is the last line:

<im_start>user
message<im_end>
<im_start>assistant

But in this case, encode_message() will treat <im_start>assistant as part of the history, and remove it from the encoded message, and then the encoded message will be incomplete. I'm not sure what the best solution is - maybe always set add_generation_prompt to False?
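
A rough illustration of the failure mode using apply_chat_template directly (the checkpoint is an assumption; this is not the PR's code):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # example checkpoint

history = [{"role": "user", "content": "message"}]
new_message = {"role": "assistant", "content": "reply"}

# History rendered with a generation prompt ends with the "<im_start>assistant" line.
prefix_ids = tok.apply_chat_template(history, add_generation_prompt=True)
# Full conversation, including the new assistant message.
full_ids = tok.apply_chat_template(history + [new_message], add_generation_prompt=False)

# Slicing off len(prefix_ids) also strips the "<im_start>assistant" tokens that
# actually open the new message, so the encoded message comes out incomplete.
message_ids = full_ids[len(prefix_ids):]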

…arameter and add the corresponding error handling. Update the document to reflect this change and verify the error handling in the test.
@pco111 (Author) commented on Jul 21, 2025


Hi @Rocketknight1,

Thank you so much for your insightful feedback! You've pointed out a very important edge case with add_generation_prompt that I had overlooked.

Following your thoughts, I've opted for a clearer and more robust approach:

  1. Explicitly Disallowed add_generation_prompt: The encode_message method now raises a ValueError if add_generation_prompt is passed. This prevents any ambiguity.
  2. Updated Documentation: The docstring for encode_message now clearly states that it does not handle the generation prompt and advises users on how to add it separately if needed.
  3. Updated Tests: The tests have been updated to reflect this new design. There is now a test to ensure that the ValueError is raised correctly.

Thank you again for guiding me toward a better solution! I've pushed the new changes for your review.
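
A hypothetical sketch of how the method behaves under this design (method name as proposed in the PR; the exact signature and checkpoint are assumptions):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # example checkpoint
history = [{"role": "user", "content": "Hi there!"}]

# Normal use: encode_message never appends a generation prompt itself.
ids = tok.encode_message({"role": "assistant", "content": "Hello!"}, conversation_history=history)

# Passing add_generation_prompt is rejected outright, avoiding the ambiguity above.
try:
    tok.encode_message(
        {"role": "assistant", "content": "Hello!"},
        conversation_history=history,
        add_generation_prompt=True,
    )
except ValueError:
    pass  # raised by design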

@Rocketknight1 (Member) left a comment


Made some comments! Also, check the CI on Github - you may need to run make fixup to get the style tests to pass.

Comment on lines 1771 to 1772
    if conversation_history is None:
        conversation_history = []

In the case where conversation_history is None, presumably you just want to return the output of apply_chat_template() without changes?
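
Something along these lines inside the method would cover that case (an illustrative fragment only; the surrounding signature and variable names are assumptions, not the actual diff):

if not conversation_history:
    # With no history there is no prefix to strip, so the chat-template output
    # for the single message is already the encoded message.
    return self.apply_chat_template([message], add_generation_prompt=False, tokenize=True)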

@@ -1695,6 +1695,89 @@ def apply_chat_template(
        else:
            return rendered_chat

    def _encode_message(

I'm not sure we need a separate helper function! This can be folded into the main function to keep things simpler.

@@ -3253,7 +3336,7 @@ def pad(
    pad_to_multiple_of (`int`, *optional*):
        If set will pad the sequence to a multiple of the provided value.

-       This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
+       This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability

"Tensor Cores" is correct, so we don't want this change.

@@ -375,3 +376,34 @@ def test_training_new_tokenizer_edge_cases(self):
        tokenizer = PreTrainedTokenizerFast(tokenizer_object=_tokenizer)
        toy_text_iterator = ("a" for _ in range(1000))
        tokenizer.train_new_from_iterator(text_iterator=toy_text_iterator, length=1000, vocab_size=50)


class ChatTemplateTest(unittest.TestCase):

There are some other chat template tests in existing test classes already, so this should probably go in one of those rather than making a new class!
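
For instance, a consistency check along these lines could sit in an existing test class rather than a new one (a sketch only; the checkpoint and exact assertion are assumptions about what such a test might cover):

import unittest

from transformers import AutoTokenizer


class TokenizerUtilsTest(unittest.TestCase):  # existing class, per the later discussion
    def test_encode_message_matches_full_template(self):
        tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # example checkpoint
        conversation = [
            {"role": "user", "content": "Hi!"},
            {"role": "assistant", "content": "Hello!"},
            {"role": "user", "content": "How are you?"},
        ]
        # Encoding message by message should reproduce the full-template encoding.
        ids, history = [], []
        for msg in conversation:
            ids += tok.encode_message(msg, conversation_history=history)
            history.append(msg)
        full_ids = tok.apply_chat_template(conversation, add_generation_prompt=False)
        self.assertEqual(ids, full_ids)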

pco111 added 3 commits July 22, 2025 14:25
… the empty dialogue history, and ensure that the chat template can be applied correctly when the dialogue history is empty. Update the document to reflect these changes.
…simplified, and the functional integrity of the `encode_message` method is ensured. Update the document to reflect these changes.
@pco111 (Author) commented on Jul 22, 2025


Hi @Rocketknight1,

Thank you for the detailed and helpful feedback! I've updated the PR according to all your suggestions:

  • The _encode_message helper has been folded into the main encode_message function.
  • An optimization has been added to handle empty conversation history directly.
  • The "Tensor Cores" typo has been corrected.
  • The new tests have been moved into the existing TokenizerUtilsTest class.

All local checks (make fixup and pytest) are passing. The code should be in much better shape now. Thanks again for your guidance!

Development

Successfully merging this pull request may close these issues.

Option to tokenize messages one after the other
3 participants