Byte-Pair Encoding (BPE) is a tokenization method widely used in natural language processing. This Python implementation is based on the paper Neural Machine Translation of Rare Words with Subword Units and influenced by Lei Mao's tutorial.
- Zero Dependencies — The implementation is self-contained and does not require any external libraries.
- Tokenization — Perform tokenization using the Byte-Pair Encoding algorithm.
- Vocabulary Management — Build, analyze, and persist vocabularies.
- Pair Frequency Analysis — Calculate frequencies of token pairs for subword learning.
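The pair-frequency step above is the heart of subword learning: count every adjacent symbol pair across the corpus, weighted by word frequency. A minimal sketch of the idea (the function and corpus representation here are illustrative, not this repository's API):

```python
from collections import Counter

def count_pairs(corpus: dict[tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs, weighted by each word's frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

# Words are pre-split into symbols; "</w>" marks a word boundary.
corpus = {("l", "o", "w", "</w>"): 5, ("l", "o", "w", "e", "r", "</w>"): 2}
print(count_pairs(corpus).most_common(1))  # [(('l', 'o'), 7)]
```

The most frequent pair is the next candidate to merge into a single vocabulary entry.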
- Clone the Repository

  ```sh
  git clone https://github.com/teleprint-me/byte-pair.git
  cd byte-pair
  ```
- Set Up a Virtual Environment

  ```sh
  python -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.txt  # optional dev dependencies
  ```
```sh
# Train on a sample corpus (verbose output)
python -m byte.model -c samples/simple.txt -m 15 -v

# Train and save the resulting model to tokenizer.json
python -m byte.model --save tokenizer.json -c samples/simple.txt -m 20

# Train on a directory of samples with a larger merge budget
python -m byte.model --save tokenizer.json -c samples -m 2500

# Load a saved model
python -m byte.model --load tokenizer.json -v

# Apply a saved model to input text
python -m byte.model --load tokenizer.json -p "Hello, world!" -v

# Show all available options
python -m byte.model -h
```
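Conceptually, training repeats two steps until the merge budget is exhausted: count all adjacent pairs, then merge the most frequent one everywhere. A hedged sketch of that loop, following the Sennrich et al. algorithm (the names and corpus representation are illustrative, not the `byte.model` API):

```python
from collections import Counter

def merge_pair(corpus, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = pair[0] + pair[1]
    out = {}
    for symbols, freq in corpus.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(merged)
                i += 2  # skip both members of the merged pair
            else:
                new_symbols.append(symbols[i])
                i += 1
        out[tuple(new_symbols)] = freq
    return out

def train(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a frequency-weighted corpus."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges, corpus

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2}
merges, corpus = train(corpus, 2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

The learned merge rules, applied in order, are what a saved model like `tokenizer.json` would need to persist to tokenize new text consistently.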
Issues, feature suggestions, and pull requests are welcome.

This project is licensed under the GNU Affero General Public License (AGPL). See the LICENSE file for full licensing terms.
Thanks to Lei Mao for the tutorial that helped shape this implementation.
- On Computable Numbers, with an Application to the Entscheidungsproblem
- Prediction and Entropy of Printed English
- A New Algorithm for Data Compression Optimization
- Neural Machine Translation of Rare Words with Subword Units
- Byte Pair Encoding
- Better language models and their implications
- A Formal Perspective on Byte-Pair Encoding
- A Statistical Extension of Byte-Pair Encoding
- Byte Pair Encoding is Suboptimal for Language Model Pretraining
- Language Model Tokenizers Introduce Unfairness Between Languages
- Rethinking Tokenization
- wHy DoNt YoU jUsT uSe ThE lLaMa ToKeNiZeR??
- Egalitarian Language Representation in Language Models
- Dynamic Chunking for End-to-End Hierarchical Sequence Modeling