Byte Pair Encoding

Byte Pair Encoding (BPE) is a subword tokenisation algorithm that iteratively merges the most frequent pair of adjacent characters or character sequences. It balances the granularity of character-level tokenisation with the semantic meaning of word-level tokenisation.

Algorithm Steps

  1. Prepare a Corpus: Gather a large text dataset.
  2. Pre-tokenisation: Split the text into words (often by whitespace) and append a special end-of-word symbol (e.g., </w>) to each word to mark boundaries.
  3. Initialize Vocabulary: Create a base vocabulary consisting of all unique characters in the corpus.
  4. Count Pairs: Count the frequency of every adjacent pair of symbols in the corpus.
  5. Merge: Identify the most frequent pair (e.g., 'A', 'B') and merge them into a new symbol 'AB'. Add 'AB' to the vocabulary.
  6. Repeat: Repeat steps 4 and 5 until a predefined vocabulary size (a hyperparameter) is reached; a minimal implementation sketch follows this list.
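
The sketch below is one possible Python implementation of steps 2 to 6; the function name learn_bpe, the word_freqs input format, and the num_merges stopping rule are illustrative choices, not any particular library's API.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of BPE merge rules from {word: frequency}."""
    # Steps 2-3: pre-tokenise into characters plus an end-of-word marker,
    # so the base vocabulary is the set of individual characters.
    corpus = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []

    for _ in range(num_merges):
        # Step 4: count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break

        # Step 5: merge the most frequent pair into a single new symbol.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged = "".join(best)
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] = freq
        corpus = new_corpus

    return merges, corpus

# Toy run on the corpus used in the worked example below.
merges, corpus = learn_bpe({"hug": 10, "pug": 5, "pun": 12, "bun": 4}, num_merges=3)
print(merges)  # first three merges for this corpus: p+u, n+</w>, g+</w>
```

Each merge adds exactly one symbol to the vocabulary, so stopping after a fixed number of merges and stopping at a target vocabulary size are interchangeable criteria.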

Concrete Example

Consider a small corpus with the following word frequencies:

  • "hug": 10 times
  • "pug": 5 times
  • "pun": 12 times
  • "bun": 4 times

Step 1: Initial Split

Words are split into characters, with the end-of-word marker </w> appended:

  • h u g </w>: 10
  • p u g </w>: 5
  • p u n </w>: 12
  • b u n </w>: 4

Step 2: Count Pairs

  • u g: 10 + 5 = 15
  • u n: 12 + 4 = 16
  • h u: 10
  • p u: 5 + 12 = 17
  • b u: 4
  • g </w>: 15
  • n </w>: 16
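
These counts are easy to verify mechanically; the short sketch below (variable names are illustrative) tallies adjacent pairs weighted by word frequency:

```python
from collections import Counter

# Reproducing the Step 2 counts (word_freqs is just the toy corpus above).
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4}

pair_counts = Counter()
for word, freq in word_freqs.items():
    symbols = list(word) + ["</w>"]
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += freq

print(pair_counts.most_common(3))
# ('p', 'u') leads with 17; ('u', 'n') and ('n', '</w>') follow with 16 each.
```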

Step 3: Merge Most Frequent

The pair p u is the most frequent (17). We merge p and u into the new symbol pu. New state:

  • h u g </w>: 10
  • pu g </w>: 5
  • pu n </w>: 12
  • b u n </w>: 4

Step 4: Next Iteration

Recount pairs on the updated corpus. Note that u n now occurs only 4 times (in b u n </w>), because the u in pu n </w> is part of the merged symbol pu. The most frequent pair is now n </w> (12 + 4 = 16), so we merge n </w> -> n</w>. New state:

  • h u g </w>: 10
  • pu g </w>: 5
  • pu n</w>: 12
  • b u n</w>: 4

This process continues until the vocabulary limit is hit.
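
The ordered merge list is the trained tokeniser: to encode text, each word is split into characters plus </w> and the merges are applied in the order they were learned. The sketch below is illustrative (apply_merges and the hard-coded two-merge list are assumptions taken from the walkthrough above, not a library API); note how the unseen word "bug" still gets a valid tokenisation by falling back to characters, which is the OOV property discussed in the next section.

```python
def apply_merges(word, merges):
    """Encode one word with an ordered list of learned BPE merge rules."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (left, right):
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Merges learned in the walkthrough above (p+u first, then n+</w>).
merges = [("p", "u"), ("n", "</w>")]

print(apply_merges("pun", merges))  # ['pu', 'n</w>']
print(apply_merges("bug", merges))  # ['b', 'u', 'g', '</w>'] -- unseen word falls back to characters
```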

Pros and Cons

Pros:

  • Solves OOV: Can represent any word by falling back to characters if necessary.
  • Compression: Efficiently represents common words as single tokens and rare words as multiple subwords.
  • Language Independent: Does not rely on complex linguistic rules.

Cons:

  • Ambiguity: The same string can be tokenised differently depending on the merge history (see the toy illustration below).
  • No Morphology: Merges are based on frequency, not linguistic meaning (e.g., "the" might be split as "th" + "e" rather than "t" + "he").
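
As a toy illustration of both points, the same string "the" comes out differently depending on which merge happened to be learned first (merge_pair is an illustrative helper and the two merge rules are hypothetical, not drawn from any real corpus):

```python
def merge_pair(symbols, left, right):
    """Apply one merge rule (left, right) across a symbol sequence."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (left, right):
            out.append(left + right)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

# Two hypothetical tokenisers whose corpora produced different first merges.
print(merge_pair(list("the"), "t", "h"))  # ['th', 'e']
print(merge_pair(list("the"), "h", "e"))  # ['t', 'he']
```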

Famous Models

  • GPT-2 & GPT-3: Use byte-level BPE.
  • RoBERTa: Uses byte-level BPE.
  • BART: Uses BPE.
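
Byte-level BPE applies the same merge procedure to raw bytes instead of Unicode characters, so every input string is representable. A quick way to see a trained byte-level BPE in action is via the Hugging Face transformers package (a minimal sketch, assuming the package is installed and the GPT-2 vocabulary can be downloaded; the exact splits depend on GPT-2's learned merge table):

```python
# Requires: pip install transformers  (downloads the GPT-2 vocab/merges on first use).
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Common words tend to remain whole tokens; rarer words are split into learned subwords.
print(tokenizer.tokenize("Byte Pair Encoding handles unbelievably rare words"))
print(tokenizer.convert_tokens_to_ids(tokenizer.tokenize("hug")))
```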
