feat(tokenization): add encode_message to tokenize messages one by one #39507
Conversation
I like this! cc @Rocketknight1 if you can have a look!
Hi @pco111, this is a cool idea, but I'm not sure about some of the details! In particular, the interaction with …
But in this case, …
…arameter and add the corresponding error handling. Update the document to reflect this change and verify the error handling in the test.
Made some comments! Also, check the CI on GitHub - you may need to run `make fixup` to get the style tests to pass.
```python
if conversation_history is None:
    conversation_history = []
```
In the case where `conversation_history` is None, presumably you just want to return the output of `apply_chat_template()` without changes?
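For illustration, the early return being suggested might look like this (a minimal sketch with assumed names, not the PR's actual code; the history case is elided):

```python
def encode_message(tokenizer, message, conversation_history=None):
    # Reviewer's suggestion: with no prior history, the template output
    # for the single message is exactly what encode_message should return.
    if not conversation_history:
        return tokenizer.apply_chat_template([message], tokenize=True)
    ...  # history case: isolate the new message's tokens (see the PR description)
```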
```diff
@@ -1695,6 +1695,89 @@ def apply_chat_template(
         else:
             return rendered_chat

+    def _encode_message(
```
I'm not sure we need a separate helper function! This can be folded into the main function to keep things simpler.
```diff
@@ -3253,7 +3336,7 @@ def pad(
             pad_to_multiple_of (`int`, *optional*):
                 If set will pad the sequence to a multiple of the provided value.

-                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
+                This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
```
"Tensor Cores" is correct, so we don't want this change.
```diff
@@ -375,3 +376,34 @@ def test_training_new_tokenizer_edge_cases(self):
         tokenizer = PreTrainedTokenizerFast(tokenizer_object=_tokenizer)
         toy_text_iterator = ("a" for _ in range(1000))
         tokenizer.train_new_from_iterator(text_iterator=toy_text_iterator, length=1000, vocab_size=50)
+
+
+class ChatTemplateTest(unittest.TestCase):
```
There are some other chat template tests in existing test classes already, so this should probably go in one of those rather than making a new class!
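As a sketch of what such a test could look like inside an existing chat-template test class (the class name, checkpoint, and assertions here are illustrative assumptions, not the PR's actual test):

```python
import unittest

from transformers import AutoTokenizer


class TokenizerChatTemplateTests(unittest.TestCase):  # stands in for an existing test class
    def test_encode_message_matches_apply_chat_template(self):
        # Any tokenizer that ships a chat template would do here.
        tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
        messages = [
            {"role": "user", "content": "Hello!"},
            {"role": "assistant", "content": "Hi there!"},
            {"role": "user", "content": "How are you?"},
        ]
        # Encoding message by message should reproduce the one-shot encoding.
        ids = []
        for i, message in enumerate(messages):
            ids += tokenizer.encode_message(message, conversation_history=messages[:i])
        self.assertEqual(ids, tokenizer.apply_chat_template(messages, tokenize=True))
```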
… the empty dialogue history, and ensure that the chat template can be applied correctly when the dialogue history is empty. Update the document to reflect these changes.
…simplified, and the functional integrity of the `encode_message` method is ensured. Update the document to reflect these changes.
What does this PR do?
This PR introduces a new method, `tokenizer.encode_message`, to the base tokenizer class. It allows tokenizing a single chat message at a time while correctly handling the conversational context provided by `conversation_history`. This is particularly useful for token-by-token streaming applications, where re-tokenizing the entire conversation history for each new message is inefficient.
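For example, an incremental loop might use it like this (the method name and `conversation_history` parameter come from this PR; the checkpoint and the assumption that it returns a list of token ids are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

history = []
for message in [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "Tell me a joke."},
]:
    # Only the new message is tokenized; history supplies the template context.
    new_ids = tokenizer.encode_message(message, conversation_history=history)
    history.append(message)
```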
The new method works by applying the chat template to the full conversation (history + new message) and then programmatically isolating the tokens that correspond to the new message. This ensures that all special tokens, roles, and formatting are applied correctly according to the model's chat template, maintaining consistency with `apply_chat_template`.
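Concretely, the isolation step could be sketched like this (a simplified illustration of the approach described above, with a hypothetical helper name, not the PR's actual implementation):

```python
def isolate_new_message_ids(tokenizer, message, conversation_history):
    # Render and tokenize the full conversation, then the history alone,
    # and keep only the ids that the new message appended.
    full_ids = tokenizer.apply_chat_template(conversation_history + [message], tokenize=True)
    if not conversation_history:
        return full_ids
    prefix_ids = tokenizer.apply_chat_template(conversation_history, tokenize=True)
    # Tokens past the history-only rendering belong to the new message.
    return full_ids[len(prefix_ids):]
```

This relies on the rendered history being a strict prefix of the full rendering, which holds for typical chat templates; templates that append trailing markup after the last message would need extra care.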
Fixes #39417
Before submitting
[x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
[x] Did you read the contributor guideline, Pull Request section?
[x] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
[x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
[x] Did you write any new necessary tests?
Who can review?
@ArthurZucker @Rocketknight1