[WIP] Minimal Tokenizer Implementation #513
Draft
Overview
In response to #262 and #263, and building off of #271, I've been working on a minimal tokenizer for exo. I initially aimed to remove the `transformers` dependency, but discovered it's a transitive dependency of `mlx_lm`, so fully removing it will require more extensive changes. So far I've tested on:

- llama-3.2-1b
- qwen-2.5-coder-1.5b

(both with the MLX inference engine)
Questions
- `Jinja2` and `tiktoken` dependencies: since my implementation is currently a wrapper around these packages, I wanted to confirm that removing these dependencies is still the desired direction before proceeding with further changes.
- Should the `transformers` dependency be handled in a separate PR, given its transitive nature through `mlx_lm`?