
Add Phi-4 to Tiktoken encoding map #7337

Closed
luisquintanilla opened this issue Dec 16, 2024 · 5 comments · Fixed by #7396
luisquintanilla (Contributor) commented Dec 16, 2024

Phi-4 uses Tiktoken tokenizer (100k vocab).

arXiv:2412.08905v1 (Phi-4 technical report)

we now use the tiktoken tokenizer (for better multilingual support) with a padded vocabulary size of 100,352 (including unused tokens)

Consider adding it as an option to the encoding map so it's easier to create.

private static readonly (string Prefix, ModelEncoding Encoding)[] _modelPrefixToEncoding =
[
// chat
( "o1-", ModelEncoding.O200kBase ), // e.g. o1-mini
( "gpt-4o-", ModelEncoding.O200kBase), // e.g., gpt-4o-2024-05-13
( "gpt-4-", ModelEncoding.Cl100kBase), // e.g., gpt-4-0314, etc., plus gpt-4-32k
( "gpt-3.5-", ModelEncoding.Cl100kBase), // e.g, gpt-3.5-turbo-0301, -0401, etc.
( "gpt-35-", ModelEncoding.Cl100kBase ) // Azure deployment name
];
private static readonly Dictionary<string, ModelEncoding> _modelToEncoding =
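For illustration only, the requested addition would presumably map the `phi-4` model name to `ModelEncoding.Cl100kBase`, since Phi-4's 100,352-token vocabulary appears to be the cl100k base vocabulary padded with extra special tokens. A hedged sketch against the excerpt above (the exact entry and its placement are assumptions, not the change actually merged):

```csharp
// Hedged sketch, not the merged change: map the "phi-4" model name to the
// cl100k encoding. Phi-4's special tokens (<|endoftext|> = 100257,
// <|im_start|> = 100264, <|im_end|> = 100265, <|im_sep|> = 100266) sit in
// the same id range as cl100k_base's specials, which is what suggests
// Cl100kBase here.
private static readonly Dictionary<string, ModelEncoding> _modelToEncoding =
    new(StringComparer.OrdinalIgnoreCase)
    {
        // ... existing entries ...
        { "phi-4", ModelEncoding.Cl100kBase }, // hypothetical Phi-4 entry
    };
```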

@luisquintanilla luisquintanilla added the enhancement New feature or request label Dec 16, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged label Dec 16, 2024
@tarekgh tarekgh added this to the ML.NET Future milestone Dec 16, 2024
@tarekgh tarekgh added Tokenizers and removed untriaged New issue has not been triaged labels Dec 16, 2024
@tarekgh tarekgh self-assigned this Dec 16, 2024
MaxAkbar commented Jan 12, 2025

I got the model to load and generate text using the following code, but I can't work out the stop sequence:

using Microsoft.Extensions.AI;
using Microsoft.ML.GenAI.Core;
using Microsoft.ML.GenAI.Phi;
using Microsoft.ML.Tokenizers;
using static TorchSharp.torch;
using TorchSharp;
using System.Text.Json;

var weightFolder = @"C:\Users\maxim\source\repos\models\microsoft\phi-4\";
var device = "cuda";
if (device == "cuda")
{
    InitializeDeviceType(DeviceType.CUDA);
}

var defaultType = ScalarType.Float16;
manual_seed(1);
set_default_dtype(defaultType);

var model = Phi3ForCasualLM.FromPretrained(weightFolder, "config.json", layersOnTargetDevice: -1, quantizeToInt4: true);
var tokenizerPath = Path.Combine(weightFolder, "config.json");
var fileConfig = File.ReadAllText(tokenizerPath);
var config = JsonSerializer.Deserialize<Phi3Config>(fileConfig)!;
var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");
var pipeline = new CausalLMPipeline<Tokenizer, Phi3ForCasualLM>(tokenizer, model, device);
var client = new Phi3CausalLMChatClient(pipeline);

var task = """
            Can you tell me a funny joke?
            """;
var chatMessage = new ChatMessage(ChatRole.User, task);
var options = new ChatOptions
{
    StopSequences = ["<|endoftext|>"],
};

await foreach (var response in client.CompleteStreamingAsync([chatMessage], options))
{
    Console.Write(response.Text);
}

Console.WriteLine();
Console.WriteLine("End!");

tarekgh (Member) commented Jan 12, 2025

@luisquintanilla Looking at https://huggingface.co/microsoft/phi-4/tree/main, it looks like the model is using vocab.json and merges.txt, similar to the Phi-2 tokenizer file format (which is based on the CodeGen tokenizer). I know the technical report mentions Tiktoken, but I couldn't find vocab files that a Tiktoken tokenizer can load. Do you know who can clarify that?
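If those files really are in the Phi-2 / CodeGen format, a rough way to load them with Microsoft.ML.Tokenizers would be something like the following (a sketch; the paths are illustrative and the exact `CodeGenTokenizer.Create` overload used here is an assumption):

```csharp
using Microsoft.ML.Tokenizers;

// Illustrative paths; substitute the actual download location.
using var vocabStream = File.OpenRead(@"C:\models\microsoft\phi-4\vocab.json");
using var mergesStream = File.OpenRead(@"C:\models\microsoft\phi-4\merges.txt");

// Load the BPE vocab/merges pair in the CodeGen (Phi-2-style) format.
Tokenizer tokenizer = CodeGenTokenizer.Create(vocabStream, mergesStream);
```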

MaxAkbar commented:

Looking at tokenizer_config.json, it says the model is using GPT2Tokenizer. I think the reason StopSequences isn't working is that the tokenizer isn't quite right.

  "bos_token": "<|endoftext|>",
  "chat_template": "{% for message in messages %}{% if (message['role'] == 'system') %}{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|><|im_start|>assistant<|im_sep|>'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>'}}{% endif %}{% endfor %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "model_max_length": 16384,
  "pad_token": "<|endoftext|>",
  "tokenizer_class": "GPT2Tokenizer"

I also ran the following:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "microsoft/phi-4"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(type(tokenizer))
print(tokenizer)
print(tokenizer.eos_token)

Got the following confirmation:

<class 'transformers.models.gpt2.tokenization_gpt2_fast.GPT2TokenizerFast'>
GPT2TokenizerFast(name_or_path='microsoft/phi-4', vocab_size=100352, model_max_length=16384, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
        100256: AddedToken("<|dummy_0|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100257: AddedToken("<|endoftext|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100258: AddedToken("<|fim_prefix|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100259: AddedToken("<|fim_middle|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100260: AddedToken("<|fim_suffix|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100261: AddedToken("<|dummy_1|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100262: AddedToken("<|dummy_2|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100263: AddedToken("<|dummy_3|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100264: AddedToken("<|im_start|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100265: AddedToken("<|im_end|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100266: AddedToken("<|im_sep|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100267: AddedToken("<|dummy_4|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100268: AddedToken("<|dummy_5|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100269: AddedToken("<|dummy_6|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100270: AddedToken("<|dummy_7|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100271: AddedToken("<|dummy_8|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100272: AddedToken("<|dummy_9|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100273: AddedToken("<|dummy_10|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100274: AddedToken("<|dummy_11|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100275: AddedToken("<|dummy_12|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100276: AddedToken("<|endofprompt|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100277: AddedToken("<|dummy_13|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100278: AddedToken("<|dummy_14|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100279: AddedToken("<|dummy_15|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100280: AddedToken("<|dummy_16|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100281: AddedToken("<|dummy_17|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100282: AddedToken("<|dummy_18|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100283: AddedToken("<|dummy_19|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100284: AddedToken("<|dummy_20|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100285: AddedToken("<|dummy_21|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100286: AddedToken("<|dummy_22|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100287: AddedToken("<|dummy_23|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100288: AddedToken("<|dummy_24|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100289: AddedToken("<|dummy_25|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100290: AddedToken("<|dummy_26|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100291: AddedToken("<|dummy_27|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100292: AddedToken("<|dummy_28|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100293: AddedToken("<|dummy_29|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100294: AddedToken("<|dummy_30|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100295: AddedToken("<|dummy_31|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100296: AddedToken("<|dummy_32|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100297: AddedToken("<|dummy_33|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100298: AddedToken("<|dummy_34|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100299: AddedToken("<|dummy_35|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100300: AddedToken("<|dummy_36|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100301: AddedToken("<|dummy_37|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100302: AddedToken("<|dummy_38|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100303: AddedToken("<|dummy_39|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100304: AddedToken("<|dummy_40|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100305: AddedToken("<|dummy_41|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100306: AddedToken("<|dummy_42|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100307: AddedToken("<|dummy_43|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100308: AddedToken("<|dummy_44|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100309: AddedToken("<|dummy_45|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100310: AddedToken("<|dummy_46|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100311: AddedToken("<|dummy_47|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100312: AddedToken("<|dummy_48|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100313: AddedToken("<|dummy_49|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100314: AddedToken("<|dummy_50|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100315: AddedToken("<|dummy_51|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100316: AddedToken("<|dummy_52|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100317: AddedToken("<|dummy_53|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100318: AddedToken("<|dummy_54|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100319: AddedToken("<|dummy_55|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100320: AddedToken("<|dummy_56|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100321: AddedToken("<|dummy_57|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100322: AddedToken("<|dummy_58|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100323: AddedToken("<|dummy_59|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100324: AddedToken("<|dummy_60|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100325: AddedToken("<|dummy_61|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100326: AddedToken("<|dummy_62|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100327: AddedToken("<|dummy_63|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100328: AddedToken("<|dummy_64|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100329: AddedToken("<|dummy_65|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100330: AddedToken("<|dummy_66|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100331: AddedToken("<|dummy_67|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100332: AddedToken("<|dummy_68|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100333: AddedToken("<|dummy_69|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100334: AddedToken("<|dummy_70|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100335: AddedToken("<|dummy_71|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100336: AddedToken("<|dummy_72|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100337: AddedToken("<|dummy_73|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100338: AddedToken("<|dummy_74|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100339: AddedToken("<|dummy_75|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100340: AddedToken("<|dummy_76|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100341: AddedToken("<|dummy_77|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100342: AddedToken("<|dummy_78|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100343: AddedToken("<|dummy_79|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100344: AddedToken("<|dummy_80|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100345: AddedToken("<|dummy_81|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100346: AddedToken("<|dummy_82|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100347: AddedToken("<|dummy_83|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100348: AddedToken("<|dummy_84|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100349: AddedToken("<|dummy_85|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100350: AddedToken("<|dummy_86|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
        100351: AddedToken("<|dummy_87|>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True),
}
<|endoftext|>
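One observation from the dump above: the chat template ends assistant messages with `<|im_end|>` (id 100265), not `<|endoftext|>`, so a stop sequence of `<|endoftext|>` alone may never fire during chat. A hedged tweak to the earlier ChatOptions (untested sketch):

```csharp
// Sketch: stop on the chat-template terminator first, keeping
// <|endoftext|> as a fallback (it is also the configured eos_token).
var options = new ChatOptions
{
    StopSequences = ["<|im_end|>", "<|endoftext|>"],
};
```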

luisquintanilla (Contributor, Author) commented Jan 14, 2025

Adding to this thread: it looks like there may have been a bug in the originally published Phi-4 tokenizer, which validates @MaxAkbar's observations.

https://unsloth.ai/blog/phi4

@github-actions github-actions bot locked and limited conversation to collaborators Mar 26, 2025