[Question] A Question Regarding the Handling of Special Tokens #2

@nkkbr

First and foremost, thank you for all the excellent work on this project! The code seems very elegant and well-structured.

I have a question about a specific detail, and I apologize in advance if it's something I have overlooked.

Throughout the process, we make use of four special tokens:

  • <|begin_of_solution|>
  • <|end_of_solution|>
  • <|begin_of_explanation|>
  • <|end_of_explanation|>

My question is whether these have been registered as "special tokens" with the tokenizer. In other words, is each one parsed as a single, unique token?

I noticed that with a tokenizer such as one from the Qwen2.5 family, these markers appear to be split into multiple sub-tokens. For example, <|begin_of_explanation|> might be tokenized into something like this:

'<'                  
'|'                  
'begin'              
'_of'                
'_ex'                
'planation'          
'|'                  
'>\n\n'              
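
A breakdown like the one above can be reproduced with a small check. This is only a minimal sketch: `find_split_markers` is a hypothetical helper name, and it assumes nothing more than a tokenizer object exposing a `tokenize(text)` method that returns a list of token strings, as Hugging Face tokenizers do.

```python
# The four markers discussed in this issue.
MARKERS = [
    "<|begin_of_solution|>",
    "<|end_of_solution|>",
    "<|begin_of_explanation|>",
    "<|end_of_explanation|>",
]


def find_split_markers(tokenizer, markers=MARKERS):
    """Return {marker: pieces} for every marker split into more than one token.

    Assumption: `tokenizer.tokenize(text)` returns a list of token strings.
    """
    split = {}
    for marker in markers:
        pieces = tokenizer.tokenize(marker)
        if len(pieces) > 1:
            # The marker is not a single vocabulary entry for this tokenizer.
            split[marker] = pieces
    return split
```

If the markers are not registered, passing in a tokenizer loaded for a Qwen2.5 checkpoint should return a non-empty dict; an empty dict would mean each marker already maps to a single token.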

I understand that even with this approach, the model can certainly learn to recognize this sequence of sub-tokens as a coherent pattern.

However, I would be grateful if you could share your thoughts on this implementation detail. Was this multi-token representation an intentional design choice, or have these tokens already been registered as single special tokens and I may have missed that step?
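
For reference, if the markers are not registered, the single-token alternative I have in mind would look roughly like the sketch below. It is not taken from this repository: `register_markers` is a hypothetical helper, and it assumes a Hugging Face-style tokenizer with `add_tokens(..., special_tokens=True)` and a model with `resize_token_embeddings`.

```python
def register_markers(
    tokenizer,
    model=None,
    markers=(
        "<|begin_of_solution|>",
        "<|end_of_solution|>",
        "<|begin_of_explanation|>",
        "<|end_of_explanation|>",
    ),
):
    """Register each marker as one special token and return how many were added.

    Assumptions: `tokenizer.add_tokens` returns the number of newly added
    tokens, and `len(tokenizer)` is the vocabulary size including them.
    """
    num_added = tokenizer.add_tokens(list(markers), special_tokens=True)
    if model is not None and num_added > 0:
        # Grow the embedding matrix so the new token ids have rows to index.
        model.resize_token_embeddings(len(tokenizer))
    return num_added
```

The embeddings for newly added tokens carry no pretrained meaning, so this would need to happen before fine-tuning for the model to learn them.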

Thank you for your time and for any insight you can provide.
