Character, Word & Subword Tokenization
Compare character-level, word-level, and subword tokenization with simple examples.
Understand why modern LLMs use subword tokenization by comparing the three fundamental tokenization approaches, their storage requirements, and computational trade-offs.
Text can be split by character, by word, or by subword.
Subword tokenizers merge frequently co-occurring character sequences into single tokens, balancing the size of the vocabulary (and its embedding table) against the sequence length that attention must process.
The choice of granularity determines Out-of-Vocabulary (OOV) risk and the VRAM footprint of the embedding table.
Why This Matters
Choosing the correct tokenization granularity balances vocabulary size against sequence length, preventing context bottlenecks and out-of-vocabulary terms.
Word-level tokenization is like a massive dictionary that tries to list every word. Character-level tokenization is like spelling everything out letter by letter. Subwords are like roots, prefixes, and suffixes: a highly reusable building kit.
Visual Diagram: Tokenization Granularities Compared
Tokenization in Simple Words
Early NLP models split text by characters or by full space-separated words. Character-level splits produce very long sequences that strain memory and compute. Word-level splits need very large, open-ended vocabularies and still fail on typos or new words. Subword tokenization combines the two ideas, splitting text into frequently occurring chunks that keep both vocabulary size and sequence length manageable.
Tokenization Granularity Comparison
| Tokenizer Type | Example | Benefit | Problem |
|---|---|---|---|
| Character | cat → c, a, t | Handles any word, zero OOV terms | Very long sequences, attention memory overhead |
| Word | I love AI → I, love, AI | Intuitive, clean segmentation | Huge vocab, struggles with spelling variations and typos |
| Subword | tokenization → token, ization | Balanced sequence length and vocabulary size | More complex segmentation rules |
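The character and word rows of the table can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the corpus and test sentence are made up, and the character vocabulary has zero OOV terms here only because every character of the test sentence happens to appear in the corpus.

```python
def char_tokenize(text):
    # One token per character, spaces included.
    return list(text)

def word_tokenize(text):
    # One token per whitespace-separated word.
    return text.split()

# Hypothetical training corpus, used only to build the two vocabularies.
corpus = "I love AI and I love cats"
char_vocab = set(char_tokenize(corpus))
word_vocab = set(word_tokenize(corpus))

new_text = "I love data"
char_tokens = char_tokenize(new_text)
word_tokens = word_tokenize(new_text)

# Tokens the vocabulary has never seen (out-of-vocabulary).
char_oov = [t for t in char_tokens if t not in char_vocab]
word_oov = [t for t in word_tokens if t not in word_vocab]

print(len(char_tokens), char_oov)  # long sequence, no OOV characters
print(len(word_tokens), word_oov)  # short sequence, 'data' is OOV
```

The character tokenizer needs 11 tokens for a sentence the word tokenizer covers in 3, but only the word tokenizer hits an unknown term.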
Example: Text to Tokens to Token IDs
A subword tokenizer can split 'learning' into the root 'learn' and the suffix 'ing', so that both pieces can be reused in other words.
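The last step, from tokens to token IDs, is a plain dictionary lookup. The sketch below assumes the split has already been done and uses a hypothetical four-entry vocabulary; real vocabularies and ID assignments will differ.

```python
# Hypothetical toy vocabulary: token string -> integer ID.
vocab = {"learn": 0, "ing": 1, "play": 2, "ed": 3}
id_to_token = {i: t for t, i in vocab.items()}

def encode(tokens):
    # Map each token string to its integer ID.
    return [vocab[t] for t in tokens]

def decode(ids):
    # Map IDs back to strings and join them into the original word.
    return "".join(id_to_token[i] for i in ids)

# Assume the tokenizer has already produced these splits.
learning_ids = encode(["learn", "ing"])
playing_ids = encode(["play", "ing"])

print(learning_ids)          # [0, 1]
print(playing_ids)           # [2, 1]
print(decode(learning_ids))  # learning
```

Both words share ID 1 for 'ing', which is the vocabulary reuse described above.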
Unbelievable Splitting Boundaries
See how the word 'unbelievable' is broken down by different tokenizers:
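As a rough illustration, the sketch below splits 'unbelievable' three ways. The subword split uses a simple greedy longest-match over a hypothetical three-entry vocabulary; it is a stand-in for illustration, and real tokenizers may place the boundaries differently.

```python
def greedy_subword_tokenize(word, vocab):
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # Nothing in the vocabulary matched: fall back to one character.
            tokens.append(word[start])
            start += 1
    return tokens

# Hypothetical subword vocabulary, chosen for this example only.
subword_vocab = {"un", "believ", "able"}

word = "unbelievable"
char_split = list(word)
word_split = word.split()
subword_split = greedy_subword_tokenize(word, subword_vocab)

# A typo no longer matches 'believ', so that part degrades to characters.
typo_split = greedy_subword_tokenize("unbeleivable", subword_vocab)

print(len(char_split), len(word_split), len(subword_split))  # 12 1 3
print(subword_split)  # ['un', 'believ', 'able']
print(typo_split)     # ['un', 'b', 'e', 'l', 'e', 'i', 'v', 'able']
```

The typo case shows why misspelled words tend to break into more, smaller pieces: the known chunks around the error still match, and only the damaged span falls back to characters.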
Deep-Dive Core Concepts
Character-level tokenization splits text into individual characters. The vocabulary is tiny (128 symbols for ASCII, or 256 for a byte-level scheme over UTF-8), but sequences are extremely long, and self-attention cost grows quadratically with sequence length.
Word-level tokenization splits text on whitespace. The vocabulary can grow to millions of entries, and any word missing from it, such as a typo, a new term, or an unseen inflection, becomes an Out-of-Vocabulary (OOV) term.
Subword tokenization splits rare words into frequent subword chunks (e.g., 'playing' → ['play', 'ing']). This balances vocabulary size against sequence length.
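One common way to learn such subword chunks is byte-pair encoding (BPE), which repeatedly merges the most frequent adjacent pair of symbols. The sketch below is a simplified version of that idea over a hypothetical three-word corpus with made-up counts; it omits details that real implementations handle, such as end-of-word markers and pre-tokenization.

```python
from collections import Counter

def most_frequent_pair(words):
    # words maps a tuple of symbols to its corpus frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical corpus: word -> count. Each word starts as single characters.
corpus = {"playing": 5, "played": 3, "saying": 2}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(5):
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

With this toy corpus, five merges are enough for 'playing' to end up as the two symbols 'play' and 'ing', while the rarer 'saying' stays in smaller pieces.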
Why AI Engineers Care About Tokenization
A word-level tokenizer requires a huge embedding matrix, which consumes massive GPU VRAM just to store dictionary representations.
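Character-level tokenization makes prompts several times longer than subword tokenization (English text is commonly cited as averaging roughly four characters per subword token), and because self-attention cost grows quadratically with sequence length, the compute and memory cost grows much faster than the prompt itself.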
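A back-of-the-envelope calculation makes both costs concrete. All numbers below are assumptions chosen for illustration: a hidden size of 4096, 2 bytes per parameter (16-bit weights), vocabularies of 1,000,000 (word), 50,000 (subword), and 256 (byte-level) entries, and a prompt that is 1,000 subword tokens or 4,000 characters long.

```python
def embedding_bytes(vocab_size, hidden_dim, bytes_per_param=2):
    # The embedding table has one row of hidden_dim parameters per vocabulary entry.
    return vocab_size * hidden_dim * bytes_per_param

def attention_score_entries(seq_len):
    # Self-attention scores every position against every other position,
    # giving seq_len * seq_len entries per head and layer.
    return seq_len * seq_len

hidden_dim = 4096  # assumed model width

word_table = embedding_bytes(1_000_000, hidden_dim)
subword_table = embedding_bytes(50_000, hidden_dim)
byte_table = embedding_bytes(256, hidden_dim)

subword_scores = attention_score_entries(1_000)
char_scores = attention_score_entries(4_000)

print(word_table / 1e9, "GB")     # 8.192 GB
print(subword_table / 1e9, "GB")  # 0.4096 GB
print(byte_table / 1e6, "MB")     # 2.097152 MB
print(char_scores // subword_scores, "x more attention scores")  # 16 x
```

Under these assumptions the word-level table is 20 times larger than the subword one, while a 4x longer character sequence costs 16x more attention scores.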
- Enter 'unbelievable' in the text workspace.
- Compare the character count with the token count.
- Check why typos break words into smaller subwords.
Tokenizer Visualizer Studio
Analyze how different tokenization algorithms split typical inputs and measure sequence length shifts.
Input Text → Algorithm Selector → Merged Token Array → Comparison Report
Common Beginner Misconceptions
Misconception: Character tokenization is never used anymore.
Reality: Many subword tokenizers still fall back to characters or bytes for symbols outside their vocabulary, so unknown input can be encoded without crashing.
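A byte fallback can be sketched as follows. This is a simplified illustration working one character at a time: the vocabulary is hypothetical and the `<0xNN>` token naming is an assumption for readability, not the format of any specific library.

```python
def encode_with_byte_fallback(text, vocab):
    tokens = []
    for ch in text:
        if ch in vocab:
            # Known symbol: keep it as a single token.
            tokens.append(ch)
        else:
            # Unknown symbol: emit one token per UTF-8 byte instead of failing.
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

# Hypothetical vocabulary: lowercase letters and the space character.
vocab = set("abcdefghijklmnopqrstuvwxyz ")

# "\u2603" is the snowman symbol, which is not in the vocabulary.
tokens = encode_with_byte_fallback("hi \u2603", vocab)
print(tokens)  # ['h', 'i', ' ', '<0xE2>', '<0x98>', '<0x83>']
```

Since every string has a UTF-8 byte representation, 256 byte tokens are enough to guarantee that any input can be encoded.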
Key Takeaways
- Character tokenizers have small vocabularies but produce long sequences.
- Word tokenizers produce short sequences but need very large, open-ended vocabularies and suffer from OOV errors.
- Subword tokenizers offer a practical balance between the two, which is why modern LLMs use them.