GreedTok - aiming to be your preferred greedy tokenizer

A partition cover approach to tokenization

In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem which has a simple $(1 - 1/e)$-approximation greedy algorithm.
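To make the relaxation concrete, below is a minimal, self-contained sketch of the classic greedy $(1 - 1/e)$-approximation for weighted maximum coverage. The function, names, and toy data are illustrative only, not GreedTok's API:

```python
# Toy sketch of the classic greedy (1 - 1/e)-approximation for weighted
# maximum coverage: pick k sets, each round taking the set that covers
# the most yet-uncovered weight. Illustrative only; GreedTok itself
# operates on token/word cover counts, not this toy structure.
def greedy_max_coverage(sets: dict, weights: dict, k: int) -> list:
    covered = set()
    chosen = []
    for _ in range(k):
        best, best_gain = None, 0
        for name, elems in sets.items():
            gain = sum(weights[e] for e in elems - covered)
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # nothing left to cover
            break
        chosen.append(best)
        covered |= sets[best]
    return chosen

# Example: elements are word positions, weights are word counts.
sets = {"th": {1, 2, 3}, "he": {2, 3, 4}, "er": {4, 5}}
weights = {1: 5, 2: 3, 3: 2, 4: 4, 5: 1}
print(greedy_max_coverage(sets, weights, 2))  # -> ['th', 'er']
```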

Compute Performance Update

Compared to the base implementation, beta6 achieves a considerable speedup! The key difference is the implementation of greedy updates on top of greedy choice.

| Dataset | $k$ | # uniq words | # candidates | Previous (arXiv paper) | Current (≥beta6) | Improvement |
|---|---|---|---|---|---|---|
| UN | 5K | 105,505 | 884,630 | ~140 seconds | ~6 seconds | x23 |
| arXiv | 5K | 881,233 | 7,625,530 | ~28 minutes | ~63 seconds | x26 |
| Wiki | 10K | 8,769,943 | 93,243,449 | ~12.5 hours | ~11 minutes | x68 |
| PubMed | 10K | 6,527,614 | 97,870,366 | ~24.5 hours | ~11 minutes | x133 |
| Wiki-chinese | 10K | 7,035,544 | 69,728,860 | ~12.75 hours | ~8.5 minutes | x90 |
| Wiki-japanese | 10K | 2,737,555 | 60,410,961 | ~10.5 hours | ~8.5 minutes | x74 |
| Wiki-korean | 10K | 5,459,833 | 130,927,124 | ~26 hours | ~18 minutes | x86 |

The table reports the time to solve (obtain a $k$-sized token set) from word counts. Since most of the compute is front-loaded, extending to a larger $k$ adds little extra time. For detailed logs, compare cpp_logs/{$data}/{$data}.log against cpp_logs/{$data}/{$data}_fast.log.
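For intuition only: because marginal coverage gains never increase as tokens are selected, one standard way to combine greedy choice with cheap updates is lazy re-scoring with a max-heap. The sketch below is a hypothetical Python illustration of that pattern, not the repo's actual C++ implementation (see the C++ sources and logs for the real one):

```python
import heapq

# Lazy-greedy sketch: marginal coverage gains only shrink as the covered
# set grows, so stale heap entries can be re-scored on pop instead of
# re-scoring every candidate every round.
def lazy_greedy(sets: dict, weights: dict, k: int) -> list:
    covered, chosen = set(), []
    # Entries are (-gain, name, round_when_scored); round-0 scores are
    # exact because nothing is covered yet.
    heap = [(-sum(weights[e] for e in s), name, 0) for name, s in sets.items()]
    heapq.heapify(heap)
    for rnd in range(k):
        while heap:
            neg_gain, name, scored = heapq.heappop(heap)
            if scored == rnd:  # gain is up to date for this round
                chosen.append(name)
                covered |= sets[name]
                break
            gain = sum(weights[e] for e in sets[name] - covered)
            heapq.heappush(heap, (-gain, name, rnd))  # re-score lazily
    return chosen
```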

Beta: Hugging Face AutoTokenizer interface

Install the beta version (for transformers >= 4):

wget "https://github.com/PreferredAI/pcatt/archive/refs/tags/v0.14-beta8.zip"
unzip v0.14-beta8.zip
cd pcatt-0.14-beta8
pip install -r requirements.txt
pip install transformers
pip install .

For "training" either:

from pcatt.hf.greedtok import GreedTok
greedtok = GreedTok().train_new_from_iterator(word_iterator, 100, max_token_length=5, min_word_count=1)

or

from pcatt.hf.greedtok import GreedTok
greedtok = GreedTok().train_new_from_counts(word_count_dict, 100, max_token_length=5, min_word_count=1)
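For instance, word_count_dict can be any mapping from words to counts; a quick way to build one is collections.Counter (the toy corpus below is illustrative, and the training arguments mirror the example above):

```python
from collections import Counter
from pcatt.hf.greedtok import GreedTok

# Toy corpus; in practice, gather word counts from your dataset.
corpus = ["the cat sat on the mat", "the mat sat still"]
word_count_dict = Counter(w for line in corpus for w in line.split())

# Same arguments as the example above: a 100-token set, tokens of at
# most 5 bytes, and no minimum word count filtering. Counter is a dict
# subclass, so it is assumed to be accepted wherever a dict is.
greedtok = GreedTok().train_new_from_counts(
    word_count_dict, 100, max_token_length=5, min_word_count=1
)
```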

To use a trained tokenizer, either:

from pcatt.hf.greedtok import GreedTok
greedtok = GreedTok.from_pretrained(greedtok_file_directory)

or

import pcatt.hf
from transformers import AutoTokenizer
greedtok = AutoTokenizer.from_pretrained("greedtok_file_directory")
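Either way, the loaded tokenizer should follow the usual Hugging Face calling conventions, assuming it implements the standard PreTrainedTokenizer interface as the AutoTokenizer integration suggests. A small hypothetical usage sketch (the printed ids are illustrative):

```python
from pcatt.hf.greedtok import GreedTok

greedtok = GreedTok.from_pretrained("greedtok_file_directory")

# Standard Hugging Face tokenizer calls; the ids shown are illustrative.
enc = greedtok("a partition cover approach to tokenization")
print(enc["input_ids"])                   # e.g. [17, 52, 9, ...]
print(greedtok.decode(enc["input_ids"]))  # recovers the input text
```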

Refer to eval_hf.ipynb for examples and tips. Note that the code in pcatt.hf is licensed under Apache 2.0 (following Hugging Face Tokenizers).

GreedTok

  1. If using the Python wrapper

    a. Using pip (uses the lightweight source code without data/notebooks):

    wget "https://github.com/PreferredAI/pcatt/archive/refs/tags/v0.13.tar.gz"
    unzip pcatt-0.13.zip -d pcatt
    cd pcatt
    pip install -r requirements.txt
    pip install .
    

    b. Or compile manually, e.g. (include and library paths must be specified):

    c++ -O3 -Wall -shared -std=c++20 \
    -fPIC $(python3 -m pybind11 --includes) \
    -I$CONDA_PREFIX/include/ \
    -I$CONDA_PREFIX/include/tbb \
    -I$CONDA_PREFIX/include/oneapi \
    -L$CONDA_PREFIX/lib/ \
    -l tbb \
    ./pcatt/greedy_builder.cpp \
    -o ./pcatt/greedy_builder$(python3-config --extension-suffix) 
    

    c. Import and use! Examples are in eval_tokenizer_example.ipynb

  2. If using C++ files directly

    a. Install dependencies for the C++ code. We use oneTBB to parallelize the code; the simplest way to install it is with Conda or pip:

    conda install tbb-devel
    

    b. Compile greedy_cache.cpp, e.g.:

    c++ -O3 -std=c++20 \
    -I$CONDA_PREFIX/include/ \
    -I$CONDA_PREFIX/include/tbb \
    -I$CONDA_PREFIX/include/oneapi \
    -L$CONDA_PREFIX/lib/ \
    -l tbb \
    pcatt/greedy_cache.cpp \
    -o pcatt/greedy.exe 
    

    c. Prepare inputs (refer to cpp_inputs for examples; see the sketch after this list for writing these files):

    • counts: a file with '\n' delimited integers
    • words: a file with ' ' (space) delimited words

    d. Run the compiled program: ./greedy.exe <domain> <k> (it currently looks for the domain's inputs at a fixed path under cpp_inputs/*).

    e. Collect the ranked token sequence (refer to cpp_outputs for examples; see the sketch after this list for parsing):

    • merges: the number of covers at each step, delimited by '\n'
    • tokens: byte sequences in hex-format, delimited by '\n'
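As referenced in steps c and e above, here is a hypothetical Python sketch for writing the input files and decoding the output tokens. The exact file names and paths under cpp_inputs/ and cpp_outputs/ are assumptions; check those folders in the repo for the real layout:

```python
from collections import Counter
from pathlib import Path

# Hypothetical I/O helper for the C++ program. File formats follow the
# README: counts are '\n'-delimited integers, words are ' '-delimited,
# and output tokens are '\n'-delimited hex byte sequences. The file
# names/paths below are assumptions, not the repo's fixed layout.
counts = Counter({"the": 5, "cat": 3, "mat": 2})  # toy word counts

Path("cpp_inputs").mkdir(exist_ok=True)
Path("cpp_inputs/demo.counts").write_text("\n".join(str(c) for c in counts.values()))
Path("cpp_inputs/demo.words").write_text(" ".join(counts))

# After running: ./greedy.exe demo 100
tokens = [
    bytes.fromhex(line)
    for line in Path("cpp_outputs/demo.tokens").read_text().splitlines()
    if line
]
print(tokens)  # e.g. [b'th', b'at', ...]
```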

Evaluations in eval_notebook.ipynb

Citation

@article{lim2025partition,
      title={A partition cover approach to tokenization}, 
      author={Lim, Jia Peng and Choo, Davin and Lauw, Hady W.},
      year={2025},
      journal={arXiv preprint arXiv:2501.06246},
      url={https://arxiv.org/abs/2501.06246}, 
}