Byte-Pair Encoding (BPE) is a tokenization method widely used in natural language processing. This Python implementation is based on the paper Neural Machine Translation of Rare Words with Subword Units and influenced by Lei Mao's tutorial.
- Zero Dependencies — The implementation is self-contained and does not require any external libraries.
- Tokenization — Perform tokenization using the Byte-Pair Encoding algorithm.
- Vocabulary Management — Build, analyze, and persist vocabularies.
- Pair Frequency Analysis — Calculate frequencies of token pairs for subword learning.
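The pair-frequency step above is the heart of subword learning: count every adjacent symbol pair across the corpus, weighted by word frequency. A minimal sketch of the idea (the function and corpus representation here are illustrative, not this repository's API):

```python
from collections import Counter

def count_pairs(corpus: dict[tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs, weighted by each word's frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

# Words are pre-split into symbols; "</w>" marks a word boundary.
corpus = {("l", "o", "w", "</w>"): 5, ("l", "o", "w", "e", "r", "</w>"): 2}
print(count_pairs(corpus).most_common(1))  # [(('l', 'o'), 7)]
```

The most frequent pair is the next candidate to merge into a single vocabulary entry.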
- Clone the Repository

  ```sh
  git clone https://github.com/teleprint-me/byte-pair.git
  cd byte-pair
  ```
- Set Up a Virtual Environment

  ```sh
  python -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.txt  # optional dev dependencies
  ```
```sh
# Train on a sample corpus (verbose output)
python -m byte.model -c samples/simple.txt -m 15 -v

# Train and save the resulting model to tokenizer.json
python -m byte.model --save tokenizer.json -c samples/simple.txt -m 20

# Train on a directory of samples with a larger merge budget
python -m byte.model --save tokenizer.json -c samples -m 2500

# Load a saved model
python -m byte.model --load tokenizer.json -v

# Apply a saved model to input text
python -m byte.model --load tokenizer.json -p "Hello, world!" -v

# Show all available options
python -m byte.model -h
```
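Conceptually, training repeats two steps until the merge budget is exhausted: count all adjacent pairs, then merge the most frequent one everywhere. A hedged sketch of that loop, following the Sennrich et al. algorithm (the names and corpus representation are illustrative, not the `byte.model` API):

```python
from collections import Counter

def merge_pair(corpus, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = pair[0] + pair[1]
    out = {}
    for symbols, freq in corpus.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(merged)
                i += 2  # skip both members of the merged pair
            else:
                new_symbols.append(symbols[i])
                i += 1
        out[tuple(new_symbols)] = freq
    return out

def train(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a frequency-weighted corpus."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges, corpus

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2}
merges, corpus = train(corpus, 2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

The learned merge rules, applied in order, are what a saved model like `tokenizer.json` would need to persist to tokenize new text consistently.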
Issues, feature suggestions, and pull requests are welcome.

This project is licensed under the GNU Affero General Public License (AGPL). See the LICENSE file for full licensing terms.
Thanks to Lei Mao for the tutorial that helped shape this implementation.
- On Computable Numbers, with an Application to the Entscheidungsproblem
- Prediction and Entropy of Printed English
- A New Algorithm for Data Compression Optimization
- Neural Machine Translation of Rare Words with Subword Units
- Byte Pair Encoding
- Better language models and their implications
- A Formal Perspective on Byte-Pair Encoding
- A Statistical Extension of Byte-Pair Encoding
- Byte Pair Encoding is Suboptimal for Language Model Pretraining
- Language Model Tokenizers Introduce Unfairness Between Languages
- Rethinking Tokenization
- wHy DoNt YoU jUsT uSe ThE lLaMa ToKeNiZeR??
- Egalitarian Language Representation in Language Models
- Dynamic Chunking for End-to-End Hierarchical Sequence Modeling