BPE, WordPiece & SentencePiece
Deep dive into common tokenizer algorithms used by modern NLP and LLM systems.
Learn the inner mechanics of how algorithms like Byte Pair Encoding (BPE), WordPiece, and SentencePiece build vocabularies and tokenize text.
Vocabularies are statistical dictionaries built before LLMs are trained.
BPE builds vocabs by merging adjacent character pairings recursively based on counts.
SentencePiece tokenizes byte streams natively, reducing whitespace rules.
Why This Matters
BPE is the standard algorithm used by GPT and Llama. Understanding merge rules helps debug encoding anomalies and vocabulary bias.
Imagine picking stamps to print books. Instead of carving a stamp for every unique word, you look at text patterns and carve stamps for syllables and letters. BPE does this by carving character pair stamps step-by-step.
Visual Diagram: BPE Bottom-Up Merge Process
Tokenization in Simple Words
Vocabularies are built before model training starts. BPE begins with a list of base characters and merges the most frequent pairs in a dataset. WordPiece does something similar but selects merges that maximize the likelihood of the training data. SentencePiece tokenizes raw bytes directly, treating whitespace as a character.
Tokenizer Algorithm Comparison
| Algorithm | Main Idea | Common Usage |
|---|---|---|
| BPE | Merge frequent character pairs | GPT, Llama models |
| WordPiece | Merges maximizing data likelihood | BERT, RoBERTa |
| SentencePiece | Lossless tokenization from raw bytes | T5, Multilingual models |
Example: Text to Tokens to Token IDs
BPE iteratively merges l + o → lo, then lo + w → low, building subword structures.
Deep-Dive Core Concepts
A bottom-up algorithm. It counts all adjacent token pairs in a training corpus and merges the most frequent pair. This is repeated until the target vocabulary size is reached.
Used by BERT. Instead of merging by raw frequency, WordPiece chooses merges that maximize the probability of the training data according to a unigram language model.
A language-independent tokenizer that treats the input as a raw byte stream and spaces as a special character (e.g., '_'). This removes the need for language-specific word pre-segmentation.
Concepts Covered
Why AI Engineers Care About Tokenization
Vocabularies trained on mostly English text merge common English pairs, leaving other languages split into tiny byte-level pieces.
- Simulate BPE merge pairs in the playground.
- Observe step merges for low, lower, lowest.
- Analyze SentencePiece spacing marks (_).
Tokenizer Visualizer Studio
Implement a mini BPE training loop in TypeScript to merge character pairs from user input text.
Input Text → Count Adjacent Pairs → Merge Top Pair → Update Vocabulary Table
Common Beginner Misconceptions
Tokenizers are trained along with the neural network parameters.
Tokenizers are trained beforehand as a separate static preprocessing step. The model's weights learn embeddings for the static vocabulary.
Technical Interview Defense Q&A
Key Takeaways
- •BPE merges adjacent character pairs iteratively based on frequency.
- •WordPiece merges based on database likelihood models.
- •SentencePiece treats spaces as standard characters, enabling language-agnostic tokenization.