Phi-4 Tokenizer Support #7396

Merged
merged 1 commit into from
Feb 23, 2025

tarekgh (Member) commented Feb 23, 2025

Fixes #7337

The Phi-4 model uses the Tiktoken tokenizer with the Cl100K_Base encoding file, like the GPT-4 tokenizer. It uses different special tokens, though: <|im_start|>, <|im_end|>, and <|im_sep|>.

The tokenizer can be created by:

TiktokenTokenizer tokenizer = TiktokenTokenizer.CreateForModel("phi-4");

string text = "Hello, World!";
IReadOnlyList<int> encoded = tokenizer.EncodeToIds(text);

Note

The Phi-4 model on Hugging Face (https://huggingface.co/microsoft/phi-4/tree/main) converts the Cl100K_Base encoding vocabulary into vocab.json and merges.txt, which are compatible with the GPT2TokenizerFast tokenizer in Python. This is likely intended for users who are not using the Tiktoken tokenizer, though both should generally produce the same results.
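
As a quick sanity check of the snippet above, the following sketch round-trips a string through the tokenizer. It assumes a Microsoft.ML.Tokenizers build that includes this change, and that the Cl100K_Base vocabulary data is available to the project (for example through the Microsoft.ML.Tokenizers.Data.Cl100kBase package). The sample text is arbitrary, and the token count is printed rather than asserted because it is not stated anywhere in this PR.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Assumption: the referenced Microsoft.ML.Tokenizers build recognizes "phi-4"
// and can load the Cl100K_Base vocabulary data.
TiktokenTokenizer tokenizer = TiktokenTokenizer.CreateForModel("phi-4");

string text = "Hello, World!";

// Encode the text to token ids, then decode the ids back to a string.
IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
string roundTripped = tokenizer.Decode(ids);

// CountTokens should agree with the number of ids produced by EncodeToIds.
int count = tokenizer.CountTokens(text);

Console.WriteLine($"ids: [{string.Join(", ", ids)}]");
Console.WriteLine($"count matches: {count == ids.Count}");
Console.WriteLine($"round trip matches: {roundTripped == text}");
```

If either check prints False, the installed package version or the vocabulary data package is the first thing to verify.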

Copilot (bot) review requested due to automatic review settings February 23, 2025 03:12

Copilot AI (Contributor) left a comment

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

test/Microsoft.ML.Tokenizers.Tests/TiktokenTests.cs:511

  • Since the Phi-4 model is now supported, adding 'phi-4' as an encoding name in a test case for negative scenarios appears to be a mistake. Consider removing it from TestEncodingNamesNegativeCases or moving it to a positive test case.
[InlineData("phi-4")]

tarekgh (Member, Author) commented Feb 23, 2025

tarekgh added this to the ML.NET 5.0 milestone Feb 23, 2025

codecov bot commented Feb 23, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.89%. Comparing base (99723e7) to head (2c1065c).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7396   +/-   ##
=======================================
  Coverage   68.89%   68.89%           
=======================================
  Files        1473     1473           
  Lines      270817   270836   +19     
  Branches    27883    27884    +1     
=======================================
+ Hits       186568   186583   +15     
- Misses      76975    76985   +10     
+ Partials     7274     7268    -6     
| Flag | Coverage Δ |
| --- | --- |
| Debug | 68.89% <100.00%> (+<0.01%) ⬆️ |
| production | 63.20% <100.00%> (-0.01%) ⬇️ |
| test | 89.39% <100.00%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| ...Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs | 78.47% <100.00%> (+0.21%) ⬆️ |
| ...est/Microsoft.ML.Tokenizers.Tests/TiktokenTests.cs | 99.00% <100.00%> (+0.02%) ⬆️ |

... and 8 files with indirect coverage changes

tarekgh merged commit fd62e6c into dotnet:main Feb 23, 2025
25 checks passed
github-actions (bot) locked and limited conversation to collaborators Mar 26, 2025
Successfully merging this pull request may close these issues.

Add Phi-4 to Tiktoken encoding map