Description
First and foremost, thank you for all the excellent work on this project! The code seems very elegant and well-structured.
I have a question about a specific detail, and I apologize in advance if it's something I have overlooked.
Throughout the process, we make use of four special tokens:
`<|begin_of_solution|>`, `<|end_of_solution|>`, `<|begin_of_explanation|>`, `<|end_of_explanation|>`
My question is whether these have been registered as "special tokens" with the tokenizer. In other words, is each one parsed as a single, unique token?
I noticed that when using a tokenizer from the Qwen2.5 family, for example, these tokens appear to be broken down into multiple sub-tokens. For instance, `<|begin_of_explanation|>` might be tokenized into something like this:

```
'<'
'|'
'begin'
'_of'
'_ex'
'planation'
'|'
'>\n\n'
```
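As a quick sanity check, a small helper like the one below can confirm whether each marker encodes to a single id. (This is just a sketch: the checkpoint name in the usage comment is an example, and `encode` here is the standard Hugging Face `tokenizer.encode` API.)

```python
# Markers used throughout the project (copied from the prompt format).
SPECIAL_TOKENS = [
    "<|begin_of_solution|>",
    "<|end_of_solution|>",
    "<|begin_of_explanation|>",
    "<|end_of_explanation|>",
]


def encodes_as_single_token(tokenizer, token: str) -> bool:
    """True if `token` maps to exactly one id, i.e. it is treated
    as an atomic (special) token rather than split into sub-tokens."""
    return len(tokenizer.encode(token, add_special_tokens=False)) == 1


# Usage (requires `transformers` and network access to fetch a checkpoint):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")  # example checkpoint
#   for t in SPECIAL_TOKENS:
#       print(t, encodes_as_single_token(tok, t))
```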
I understand that even with this approach, the model can certainly learn to recognize this sequence of sub-tokens as a coherent pattern.
However, I would be grateful if you could share your thoughts on this implementation detail. Was this multi-token representation an intentional design choice, or are these tokens registered as single special tokens in a step I may have missed?
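For reference, if single-token treatment were the intent, I would expect a registration step roughly like the sketch below, using the standard Hugging Face `add_special_tokens` / `resize_token_embeddings` APIs. (This is only my assumption of what such a step would look like, not something I found in the repository.)

```python
def register_special_tokens(tokenizer, model, tokens):
    """Register `tokens` as additional special tokens and, if any were
    actually new, grow the model's embedding matrix to match the new
    vocabulary size. Returns the number of tokens added."""
    n_added = tokenizer.add_special_tokens(
        {"additional_special_tokens": list(tokens)}
    )
    if n_added > 0 and model is not None:
        # New rows are appended to the embedding (and LM head) matrices.
        model.resize_token_embeddings(len(tokenizer))
    return n_added


# Usage (requires `transformers` and network access):
#   from transformers import AutoTokenizer, AutoModelForCausalLM
#   tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")  # example checkpoint
#   model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
#   register_special_tokens(tok, model, ["<|begin_of_solution|>", "<|end_of_solution|>"])
```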
Thank you for your time and for any insight you can provide.