Byte Pair Encoding
Byte Pair Encoding (BPE) is a subword tokenisation algorithm that iteratively merges the most frequent pair of adjacent characters or character sequences. It balances the granularity of character-level tokenisation with the semantic meaning of word-level tokenisation.
1. Prepare a Corpus: Gather a large text dataset.
2. Pre-tokenisation: Split the text into words (often by whitespace) and append a special end-of-word symbol (e.g., `</w>`) to each word to mark boundaries.
3. Initialize Vocabulary: Create a base vocabulary consisting of all unique characters in the corpus.
4. Count Pairs: Count the frequency of every adjacent pair of symbols in the corpus.
5. Merge: Identify the most frequent pair (e.g., 'A', 'B') and merge them into a new symbol 'AB'. Add 'AB' to the vocabulary.
6. Repeat: Repeat steps 4 and 5 until a predefined vocabulary size is reached (a hyperparameter).
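A minimal sketch of this training loop in Python is shown below. It is illustrative rather than a production implementation: the helper names (`count_pairs`, `merge_pair`, `train_bpe`) and the use of space-separated symbol strings as the corpus representation are assumptions made here for readability.

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, corpus):
    """Fuse every occurrence of the pair into a single new symbol."""
    a, b = pair
    merged = {}
    for word, freq in corpus.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a {symbol string: frequency} corpus."""
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(corpus)
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        corpus = merge_pair(best, corpus)
        merges.append(best)
    return merges, corpus
```

Here `corpus` maps each pre-tokenised word (characters separated by spaces, ending in `</w>`) to its frequency, and the returned merge list is the learned vocabulary extension, in order.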
Consider a small corpus with the following word frequencies:
- "hug": 10 times
- "pug": 5 times
- "pun": 12 times
- "bun": 4 times
Step 1: Initial Split
Words are split into characters:
- `h u g </w>`: 10
- `p u g </w>`: 5
- `p u n </w>`: 12
- `b u n </w>`: 4
Step 2: Count Pairs
- `u g`: 10 + 5 = 15
- `u n`: 12 + 4 = 16
- `h u`: 10
- `p u`: 5 + 12 = 17
- `b u`: 4
- `g </w>`: 15
- `n </w>`: 16
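These counts can be checked with a few lines of Python (a quick sanity check using the same space-separated representation assumed above):

```python
from collections import Counter

corpus = {"h u g </w>": 10, "p u g </w>": 5, "p u n </w>": 12, "b u n </w>": 4}

pairs = Counter()
for word, freq in corpus.items():
    symbols = word.split()
    for pair in zip(symbols, symbols[1:]):
        pairs[pair] += freq

best = max(pairs, key=pairs.get)
print(best, pairs[best])                                           # ('p', 'u') 17
print(pairs[("u", "n")], pairs[("n", "</w>")], pairs[("u", "g")])  # 16 16 15
```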
Step 3: Merge Most Frequent
The pair `p u` is the most frequent (17). We merge `p` and `u` into `pu`.
New state:
- `h u g </w>`: 10
- `pu g </w>`: 5
- `pu n </w>`: 12
- `b u n </w>`: 4
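The merge itself is just a substitution over the symbol sequences. One way to sketch it is with a regex over the space-separated representation; the helper name `merge_pair` is an assumption of this page, not a library API:

```python
import re

corpus = {"h u g </w>": 10, "p u g </w>": 5, "p u n </w>": 12, "b u n </w>": 4}

def merge_pair(pair, corpus):
    """Fuse an adjacent symbol pair into one symbol everywhere it occurs."""
    # match the pair only when both symbols stand alone (whole-symbol boundaries)
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in corpus.items()}

print(merge_pair(("p", "u"), corpus))
# {'h u g </w>': 10, 'pu g </w>': 5, 'pu n </w>': 12, 'b u n </w>': 4}
```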
Step 4: Next Iteration
Recount pairs on the updated corpus. Note that after the `pu` merge, `u n` only occurs in `b u n </w>` (4 times), while `n </w>` still occurs 12 + 4 = 16 times, making it the most frequent pair. We merge `n` and `</w>` into `n</w>`.
New state:
- `h u g </w>`: 10
- `pu g </w>`: 5
- `pu n</w>`: 12
- `b u n</w>`: 4
This process continues until the vocabulary limit is hit.
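Putting the pieces together, a short self-contained script reproduces this walkthrough and prints the merge order (two merges, matching the steps above):

```python
from collections import Counter
import re

corpus = {"h u g </w>": 10, "p u g </w>": 5, "p u n </w>": 12, "b u n </w>": 4}

for step in range(2):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    best = max(pairs, key=pairs.get)
    # fuse the most frequent pair everywhere it occurs
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
    corpus = {pattern.sub("".join(best), w): f for w, f in corpus.items()}
    print(f"merge {step + 1}: {best} -> {''.join(best)}")

print(corpus)
# merge 1: ('p', 'u') -> pu
# merge 2: ('n', '</w>') -> n</w>
# {'h u g </w>': 10, 'pu g </w>': 5, 'pu n</w>': 12, 'b u n</w>': 4}
```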

Pros:
- Solves OOV: Can represent any word by falling back to characters if necessary (see the encoding sketch after this list).
- Compression: Efficiently represents common words as single tokens and rare words as multiple subwords.
- Language Independent: Does not rely on complex linguistic rules.

Cons:
- Ambiguity: The same string can be tokenised differently depending on the merge history.
- No Morphology: Merges are based on frequency, not linguistic meaning (e.g., it might merge "th" + "e" instead of "t" + "he").
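To make the OOV point concrete: once merges are learned, a new word is encoded by splitting it into characters (plus `</w>`) and applying the merges in the order they were learned; anything that never matches a merge simply remains as single characters, so no token is ever "unknown". A minimal sketch, where the merge list and the test words are made-up illustrations rather than output from any real tokeniser:

```python
merges = [("p", "u"), ("n", "</w>"), ("u", "g")]  # hypothetical learned merge rules, in order

def encode(word, merges):
    """Split a word into characters and apply the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(encode("bug", merges))   # ['b', 'ug', '</w>']  -- unseen word, no OOV token needed
print(encode("pun", merges))   # ['pu', 'n</w>']
```

Byte-level BPE variants (as in GPT-2) apply the same idea to raw bytes rather than characters, which guarantees coverage of any input string.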
Models that use BPE:
- GPT-2 & GPT-3: Use byte-level BPE.
- RoBERTa: Uses byte-level BPE.
- BART: Uses BPE.