AI Lesson & Submodule

BPE, WordPiece & SentencePiece

Deep dive into common tokenizer algorithms used by modern NLP and LLM systems.

Intermediate18 min readLLM FoundationInterview Ready
Lesson Overview

Learn the inner mechanics of how algorithms like Byte Pair Encoding (BPE), WordPiece, and SentencePiece build vocabularies and tokenize text.

From Beginner to Engineer
Beginner Level

Vocabularies are statistical dictionaries built before LLMs are trained.

Engineer Level

BPE builds vocabs by merging adjacent character pairings recursively based on counts.

Production Level

SentencePiece tokenizes byte streams natively, reducing whitespace rules.

Why This Matters

BPE is the standard algorithm used by GPT and Llama. Understanding merge rules helps debug encoding anomalies and vocabulary bias.

Mental Model: Creating word stamps

Imagine picking stamps to print books. Instead of carving a stamp for every unique word, you look at text patterns and carve stamps for syllables and letters. BPE does this by carving character pair stamps step-by-step.

Visual Diagram: BPE Bottom-Up Merge Process

How the word "lowest" is constructed starting from character bytes:
Start state:
lowest
Merge 'e' + 's':
lowest
Merge 'es' + 't':
lowest
Merge 'l'+'o'+'w':
lowest

Tokenization in Simple Words

Vocabularies are built before model training starts. BPE begins with a list of base characters and merges the most frequent pairs in a dataset. WordPiece does something similar but selects merges that maximize the likelihood of the training data. SentencePiece tokenizes raw bytes directly, treating whitespace as a character.

Tokenizer Algorithm Comparison

AlgorithmMain IdeaCommon Usage
BPEMerge frequent character pairsGPT, Llama models
WordPieceMerges maximizing data likelihoodBERT, RoBERTa
SentencePieceLossless tokenization from raw bytesT5, Multilingual models

Example: Text to Tokens to Token IDs

Step 1: Input text string"low lower lowest"
Step 2: Token representation["low"," low","er"," low","est"]
Step 3: Mapped Token IDs[102,102,304,102,592]

BPE iteratively merges l + o → lo, then lo + w → low, building subword structures.

Deep-Dive Core Concepts

Byte Pair Encoding (BPE)

A bottom-up algorithm. It counts all adjacent token pairs in a training corpus and merges the most frequent pair. This is repeated until the target vocabulary size is reached.

WordPiece

Used by BERT. Instead of merging by raw frequency, WordPiece chooses merges that maximize the probability of the training data according to a unigram language model.

SentencePiece

A language-independent tokenizer that treats the input as a raw byte stream and spaces as a special character (e.g., '_'). This removes the need for language-specific word pre-segmentation.

Concepts Covered

Byte Pair EncodingWordPieceSentencePieceVocabulary MergesToken BoundariesByte Fallback

Why AI Engineers Care About Tokenization

Vocabulary Bias

Vocabularies trained on mostly English text merge common English pairs, leaving other languages split into tiny byte-level pieces.

Production Failure Scenario: East Asian Space-clipping Monoliths
Root Cause: Traditional pre-tokenizers split text by spaces. Languages like Japanese don't use space delimiters, causing SentencePiece splits to classify entire paragraphs as single segments.
Fix / Strategy: Utilize SentencePiece models trained directly on byte representations that treat whitespace as standard characters.
Try This in the Lab
  • Simulate BPE merge pairs in the playground.
  • Observe step merges for low, lower, lowest.
  • Analyze SentencePiece spacing marks (_).
Mapped Foundation Project

Tokenizer Visualizer Studio

Implement a mini BPE training loop in TypeScript to merge character pairs from user input text.

Architecture Preview

Input Text → Count Adjacent Pairs → Merge Top Pair → Update Vocabulary Table

Tech Stack Planned
TypeScriptReactBPE Core Engine

Common Beginner Misconceptions

Misconception

Tokenizers are trained along with the neural network parameters.

Reality

Tokenizers are trained beforehand as a separate static preprocessing step. The model's weights learn embeddings for the static vocabulary.

Technical Interview Defense Q&A

Key Takeaways

  • BPE merges adjacent character pairs iteratively based on frequency.
  • WordPiece merges based on database likelihood models.
  • SentencePiece treats spaces as standard characters, enabling language-agnostic tokenization.

Before You Move Next Checklist