Byte Pair Encoding
Byte Pair Encoding (BPE) is a subword tokenisation algorithm that iteratively merges the most frequent pair of adjacent characters or character sequences. It balances the granularity of character-level tokenisation with the semantic meaning of word-level tokenisation.
1. Prepare a Corpus: Gather a large text dataset.
2. Pre-tokenisation: Split the text into words (often by whitespace) and append a special end-of-word symbol (e.g., `</w>`) to each word to mark boundaries.
3. Initialize Vocabulary: Create a base vocabulary consisting of all unique characters in the corpus.
4. Count Pairs: Count the frequency of every adjacent pair of symbols in the corpus.
5. Merge: Identify the most frequent pair (e.g., 'A', 'B') and merge them into a new symbol 'AB'. Add 'AB' to the vocabulary.
6. Repeat: Repeat steps 4 and 5 until a predefined vocabulary size is reached (a hyperparameter).
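A minimal sketch of this training loop in Python is shown below. It is illustrative rather than a production implementation: the helper names (`count_pairs`, `merge_pair`, `train_bpe`) and the use of space-separated symbol strings as the corpus representation are assumptions made here for readability.

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, corpus):
    """Fuse every occurrence of the pair into a single new symbol."""
    a, b = pair
    merged = {}
    for word, freq in corpus.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a {symbol string: frequency} corpus."""
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(corpus)
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        corpus = merge_pair(best, corpus)
        merges.append(best)
    return merges, corpus
```

Here `corpus` maps each pre-tokenised word (characters separated by spaces, ending in `</w>`) to its frequency, and the returned merge list is the learned vocabulary extension, in order.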
Consider a small corpus with the following word frequencies:
- "hug": 10 times
- "pug": 5 times
- "pun": 12 times
- "bun": 4 times
Step 1: Initial Split
Words are split into characters:
- `h u g </w>`: 10
- `p u g </w>`: 5
- `p u n </w>`: 12
- `b u n </w>`: 4
Step 2: Count Pairs
- `u g`: 10 + 5 = 15
- `u n`: 12 + 4 = 16
- `h u`: 10
- `p u`: 5 + 12 = 17
- `b u`: 4
- `g </w>`: 15
- `n </w>`: 16
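These counts can be checked with a few lines of Python (a quick sanity check using the same space-separated representation assumed above):

```python
from collections import Counter

corpus = {"h u g </w>": 10, "p u g </w>": 5, "p u n </w>": 12, "b u n </w>": 4}

pairs = Counter()
for word, freq in corpus.items():
    symbols = word.split()
    for pair in zip(symbols, symbols[1:]):
        pairs[pair] += freq

best = max(pairs, key=pairs.get)
print(best, pairs[best])                                           # ('p', 'u') 17
print(pairs[("u", "n")], pairs[("n", "</w>")], pairs[("u", "g")])  # 16 16 15
```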
Step 3: Merge Most Frequent
The pair `p u` is the most frequent (17). We merge `p` and `u` into `pu`.
New state:
- `h u g </w>`: 10
- `pu g </w>`: 5
- `pu n </w>`: 12
- `b u n </w>`: 4
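The merge itself is just a substitution over the symbol sequences. One way to sketch it is with a regex over the space-separated representation; the helper name `merge_pair` is an assumption of this page, not a library API:

```python
import re

corpus = {"h u g </w>": 10, "p u g </w>": 5, "p u n </w>": 12, "b u n </w>": 4}

def merge_pair(pair, corpus):
    """Fuse an adjacent symbol pair into one symbol everywhere it occurs."""
    # match the pair only when both symbols stand alone (whole-symbol boundaries)
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in corpus.items()}

print(merge_pair(("p", "u"), corpus))
# {'h u g </w>': 10, 'pu g </w>': 5, 'pu n </w>': 12, 'b u n </w>': 4}
```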
Step 4: Next Iteration
Recount pairs on the updated corpus. Note that after the `pu` merge, `u n` only occurs in `b u n </w>` (4 times), while `n </w>` still occurs 12 + 4 = 16 times, making it the most frequent pair. We merge `n` and `</w>` into `n</w>`.
New state:
- `h u g </w>`: 10
- `pu g </w>`: 5
- `pu n</w>`: 12
- `b u n</w>`: 4
This process continues until the vocabulary limit is hit.
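Putting the pieces together, a short self-contained script reproduces this walkthrough and prints the merge order (two merges, matching the steps above):

```python
from collections import Counter
import re

corpus = {"h u g </w>": 10, "p u g </w>": 5, "p u n </w>": 12, "b u n </w>": 4}

for step in range(2):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    best = max(pairs, key=pairs.get)
    # fuse the most frequent pair everywhere it occurs
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
    corpus = {pattern.sub("".join(best), w): f for w, f in corpus.items()}
    print(f"merge {step + 1}: {best} -> {''.join(best)}")

print(corpus)
# merge 1: ('p', 'u') -> pu
# merge 2: ('n', '</w>') -> n</w>
# {'h u g </w>': 10, 'pu g </w>': 5, 'pu n</w>': 12, 'b u n</w>': 4}
```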

Pros:
- Solves OOV: Can represent any word by falling back to characters if necessary (see the encoding sketch after this list).
- Compression: Efficiently represents common words as single tokens and rare words as multiple subwords.
- Language Independent: Does not rely on complex linguistic rules.

Cons:
- Ambiguity: The same string can be tokenised differently depending on the merge history.
- No Morphology: Merges are based on frequency, not linguistic meaning (e.g., it might merge "th" + "e" instead of "t" + "he").
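To make the OOV point concrete: once merges are learned, a new word is encoded by splitting it into characters (plus `</w>`) and applying the merges in the order they were learned; anything that never matches a merge simply remains as single characters, so no token is ever "unknown". A minimal sketch, where the merge list and the test words are made-up illustrations rather than output from any real tokeniser:

```python
merges = [("p", "u"), ("n", "</w>"), ("u", "g")]  # hypothetical learned merge rules, in order

def encode(word, merges):
    """Split a word into characters and apply the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(encode("bug", merges))   # ['b', 'ug', '</w>']  -- unseen word, no OOV token needed
print(encode("pun", merges))   # ['pu', 'n</w>']
```

Byte-level BPE variants (as in GPT-2) apply the same idea to raw bytes rather than characters, which guarantees coverage of any input string.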
Models that use BPE:
- GPT-2 & GPT-3: Use byte-level BPE.
- RoBERTa: Uses byte-level BPE.
- BART: Uses BPE.