Skip to content

teleprint-me/byte-pair

Repository files navigation

Byte-Pair

Overview

The Byte-Pair Encoder (BPE) is a tokenization method widely used in natural language processing. This Python implementation is based on the paper Neural Machine Translation of Rare Words with Subword Units and influenced by Lei Mao's tutorial.

Features

  • Zero Dependencies — The implementation is self-contained and does not require any external libraries.
  • Tokenization — Perform tokenization using the Byte-Pair Encoding algorithm.
  • Vocabulary Management — Build, analyze, and persist vocabularies.
  • Pair Frequency Analysis — Calculate frequencies of token pairs for subword learning.

Installation

  1. Clone the Repository

    git clone https://github.com/teleprint-me/byte-pair.git
    cd byte-pair
  2. Set Up a Virtual Environment

    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt # optional dev dependencies

Quick Start

Dry Run with Verbose Output

python -m byte.model -c samples/simple.txt -m 15 -v

Save a Trained Tokenizer

python -m byte.model --save tokenizer.json -c samples/simple.txt -m 20

Process an Entire Directory

python -m byte.model --save tokenizer.json -c samples -m 2500

Load and Inspect a Tokenizer

python -m byte.model --load tokenizer.json -v

Predict Token Pairs for Text

python -m byte.model --load tokenizer.json -p "Hello, world!" -v

Show Full CLI Options

python -m byte.model -h

Contributing

Issues, feature suggestions, and pull requests are welcome. See the LICENSE file for full licensing terms.

License

This project is licensed under the GNU Affero General Public License (AGPL).

Acknowledgments

Thanks to Lei Mao for the tutorial that helped shape this implementation.

References

About

Byte Pair Encoder (BPE) for Natural Language Processing.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages