Character, Word & Subword Tokenization

Compare character-level, word-level, and subword tokenization with simple examples.

Beginner | 15 min read | LLM Foundation | Interview Ready

Lesson Overview

Understand why modern LLMs use subword tokenization by comparing the three fundamental tokenization approaches, their storage requirements, and computational trade-offs.

From Beginner to Engineer
Beginner Level

Text can be split by letter, by word, or by word pieces (subwords).

Engineer Level

Subword algorithms such as byte-pair encoding (BPE) repeatedly merge the most frequent adjacent symbol pairs, trading vocabulary (embedding table) size against sequence length and therefore attention cost.

Production Level

The choice of granularity determines how often Out-of-Vocabulary (OOV) terms appear and how much VRAM the embedding table occupies.

Why This Matters

Choosing the right tokenization granularity trades vocabulary size against sequence length: too coarse and the model keeps meeting out-of-vocabulary terms, too fine and long sequences use up the context window.

Mental Model: Word dictionary vs Spelling letters

Word-level tokenization is like a massive dictionary that tries to list every word in existence. Character-level tokenization is like spelling everything letter by letter. Subwords are like roots, prefixes, and suffixes: a small, highly reusable building kit.

Visual Diagram: Tokenization Granularities Compared

Target word: "learning"
Character-Level (Small Vocab, Long Sequences): 8 tokens
l | e | a | r | n | i | n | g
Word-Level (Huge Vocab, OOV Failures): 1 token
learning
Subword-Level (Balanced standard for LLMs): 2 tokens
learn | ing

Tokenization in Simple Words

Early NLP models split text by characters or by full space-separated words. Character-level splits produce very long sequences that strain memory and the context window. Word-level splits need an ever-growing vocabulary and still fail on typos or new words. Subword tokenization combines the two ideas: frequent words stay whole, rare words break into reusable chunks, and both vocabulary size and sequence length stay manageable.

Tokenization Granularity Comparison

Tokenizer Type | Example | Benefit | Problem
Character | cat → c, a, t | Handles any word, zero OOV terms | Very long sequences, attention memory overhead
Word | I love AI → I, love, AI | Intuitive, clean segmentation | Huge vocab, struggles with spelling variations and typos
Subword | tokenization → token, ization | Balanced sequence length and vocab size | Slightly more complex boundary rules
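
The first two rows of the table can be sketched in a few lines of Python. This is a minimal illustration, assuming a naive whitespace rule for words; the function names and the three-line toy corpus are hypothetical and not part of the lesson's tooling.

```python
def char_tokenize(text: str) -> list[str]:
    """Character-level: every character, including spaces, is one token."""
    return list(text)


def word_tokenize(text: str) -> list[str]:
    """Word-level: split on whitespace only (a deliberately naive rule)."""
    return text.split()


# Toy corpus invented for this sketch, not real training data.
corpus = ["I love AI", "I love learning", "learning AI"]

# The vocabulary is the set of distinct tokens seen in the corpus.
char_vocab = {tok for line in corpus for tok in char_tokenize(line)}
word_vocab = {tok for line in corpus for tok in word_tokenize(line)}

sentence = "I love learning"
char_tokens = char_tokenize(sentence)
word_tokens = word_tokenize(sentence)

print(len(char_tokens), "character tokens")  # 15 character tokens
print(len(word_tokens), "word tokens")       # 3 word tokens
print(len(char_vocab), "distinct characters,", len(word_vocab), "distinct words")
```

On a corpus this small the character vocabulary is actually larger than the word vocabulary. The trade-off in the table shows up at scale: the character set stops growing almost immediately, while the word list keeps growing with every new document.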

Example: Text to Tokens to Token IDs

Step 1: Input text string: "learning"
Step 2: Token representation: ["learn", "ing"]
Step 3: Mapped Token IDs: [4658, 278]

A subword tokenizer can split 'learning' into the root 'learn' and the suffix 'ing', so both pieces are reusable in other words such as 'learner' or 'playing'.
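
The three steps amount to a dictionary lookup. The sketch below reuses the two IDs from the example; the "<unk>" entry, its ID of 0, and the helper names encode and decode are assumptions added for illustration.

```python
# Toy vocabulary. The IDs for "learn" and "ing" come from the example above;
# "<unk>" and its ID are assumptions made for this sketch.
vocab = {"<unk>": 0, "learn": 4658, "ing": 278}
id_to_token = {token_id: token for token, token_id in vocab.items()}


def encode(tokens: list[str]) -> list[int]:
    """Map each token to its ID, using the <unk> ID for unknown tokens."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]


def decode(ids: list[int]) -> str:
    """Map IDs back to tokens and join them into text."""
    return "".join(id_to_token[token_id] for token_id in ids)


ids = encode(["learn", "ing"])
print(ids)          # [4658, 278]
print(decode(ids))  # learning
```

The model itself only ever sees the integer IDs; the strings exist purely at the tokenizer boundary.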

Unbelievable Splitting Boundaries

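See how the word 'unbelievable' is broken down by different tokenizers:

Character-level (12 tokens): u | n | b | e | l | i | e | v | a | b | l | e
Word-level (1 token): unbelievable
Subword-level (3 tokens in one plausible split; the exact pieces depend on the tokenizer's learned vocabulary): un | believ | able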

Deep-Dive Core Concepts

Character-Level Tokenization

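Splits text into individual characters. The vocabulary is tiny (128 symbols for ASCII, 256 if the text is tokenized as UTF-8 bytes), but sequences become very long, and because self-attention cost grows quadratically with sequence length, training and inference become slow and memory-hungry.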
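
To make the vocabulary bounds concrete: ASCII defines 128 code points, and a single byte can take 256 values, so a byte-level tokenizer never needs more than 256 base symbols. The sketch below uses only Python built-ins; the sample string is an arbitrary choice.

```python
# Escape sequences keep the accented characters unambiguous in source form.
text = "na\u00efve caf\u00e9"  # "naive cafe" with two accented letters

chars = list(text)                     # character-level tokens
byte_ids = list(text.encode("utf-8"))  # byte-level token IDs

print(len(chars))     # 10
print(len(byte_ids))  # 12, because each accented letter takes two UTF-8 bytes

# Every byte-level ID fits in the range 0..255, whatever the input language.
print(all(0 <= b < 256 for b in byte_ids))  # True

# Plain English text stays inside the 128-symbol ASCII range.
print(max(ord(c) for c in "learning") < 128)  # True
```

The price of the small vocabulary is visible even here: non-ASCII characters expand into several byte tokens, which lengthens the sequence further.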

Word-Level Tokenization

Splits text on whitespace. The vocabulary grows to hundreds of thousands or even millions of entries, and any word missing from it, such as an unseen plural, a typo, or a newly coined term, becomes an Out-of-Vocabulary (OOV) token.
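
The OOV problem is easy to reproduce. The following sketch builds a word-level vocabulary from one toy sentence and then encodes inputs containing a plural and a typo; the "<unk>" token and the helper name encode_words are assumptions for illustration.

```python
UNK = "<unk>"  # assumed placeholder token for anything outside the vocabulary

# Build the vocabulary from a toy training sentence.
training_text = "the cat sat on the mat"
vocab = {UNK: 0}
for word in training_text.split():
    vocab.setdefault(word, len(vocab))  # assign the next free ID to new words


def encode_words(text: str) -> list[int]:
    """Word-level encoding: unseen words all collapse to the <unk> ID."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]


print(encode_words("the cat sat"))   # [1, 2, 3]
print(encode_words("the cats sat"))  # [1, 0, 3]  the plural is out of vocabulary
print(encode_words("teh cat"))       # [0, 2]     the typo is out of vocabulary
```

Once a word maps to the unknown ID, the model can no longer tell 'cats' from 'teh' or from any other unseen word.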

Subword-Level Tokenization

Keeps frequent words whole and splits rarer words into frequent subword chunks (e.g., 'playing' → ['play', 'ing']). This balances vocabulary size and sequence length.
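
One common way to discover such chunks is byte-pair encoding (BPE), which starts from characters and repeatedly merges the most frequent adjacent pair. The sketch below is a simplified teaching version, not the implementation of any particular library; the function name, the toy word frequencies, and the tie-breaking rule are all assumptions.

```python
from collections import Counter


def learn_bpe_merges(word_freqs: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges merge rules from a word -> frequency table."""
    # Start with every word represented as a tuple of single characters.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges: list[tuple[str, str]] = []

    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts: Counter = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break

        # Pick the most frequent pair. Ties are broken by comparing the pairs
        # themselves, an arbitrary choice that keeps this sketch deterministic.
        best = max(pair_counts, key=lambda pair: (pair_counts[pair], pair))
        merges.append(best)

        # Rewrite every word with the chosen pair fused into one symbol.
        merged_words: dict[tuple[str, ...], int] = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_words[key] = merged_words.get(key, 0) + freq
        words = merged_words

    return merges


# Toy frequencies invented for this sketch.
merges = learn_bpe_merges({"playing": 4, "learning": 3, "played": 2}, num_merges=5)
print(merges)
```

With this toy input, the first two merges build 'ing' and the next three build 'play', which mirrors the 'playing' → ['play', 'ing'] split. Real tokenizers learn tens of thousands of merges from large corpora and differ in details such as pre-tokenization and tie-breaking.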

Concepts Covered

Character Tokenization | Word Tokenization | Subword Tokenization | Out-of-Vocabulary (OOV) | Sparsity | Sequence Scaling

Why AI Engineers Care About Tokenization

Vocabulary Size vs Model Weights

A word-level tokenizer requires a huge embedding matrix: one learned vector per vocabulary entry, so memory grows linearly with vocabulary size and can consume gigabytes of GPU VRAM before any other layer is counted.
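
A back-of-the-envelope calculation shows the scale. The embedding table holds vocab_size × d_model parameters; the sketch below assumes a hidden size of 4096, 2 bytes per parameter (16-bit floats), and three round vocabulary sizes chosen purely for illustration.

```python
def embedding_bytes(vocab_size: int, d_model: int, bytes_per_param: int = 2) -> int:
    """Memory needed for an embedding table of shape (vocab_size, d_model)."""
    return vocab_size * d_model * bytes_per_param


D_MODEL = 4096  # assumed hidden size for this sketch

# Illustrative vocabulary sizes, not measurements from any specific model.
scenarios = [("byte-level", 256), ("subword", 50_000), ("word-level", 1_000_000)]

for name, vocab_size in scenarios:
    gib = embedding_bytes(vocab_size, D_MODEL) / 2**30
    print(f"{name:>10}: {vocab_size:>9,} entries -> {gib:.3f} GiB")
```

Under these assumptions the byte-level table is about 2 MiB, the subword table is under half a GiB, and the word-level table is over 7 GiB. Models that use a separate output projection of the same shape pay that cost a second time.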

Sequence Length Complexity

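Character-level tokenization makes prompts several times longer than subword tokenization (a common rule of thumb for English is about four characters per subword token). Because self-attention cost grows quadratically with sequence length, a 4x longer sequence costs roughly 16x more attention compute and memory.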
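
The quadratic effect can be checked with simple arithmetic. The sketch below counts entries in a single attention score matrix (one head, one layer); the prompt length of 1,000 subword tokens and the ratio of four characters per token are assumptions for illustration.

```python
def attention_score_entries(seq_len: int) -> int:
    """Number of entries in one seq_len x seq_len attention score matrix."""
    return seq_len * seq_len


subword_len = 1_000   # assumed prompt length in subword tokens
chars_per_token = 4   # assumed rule-of-thumb ratio for English text
char_len = subword_len * chars_per_token

ratio = attention_score_entries(char_len) / attention_score_entries(subword_len)
print(char_len)  # 4000
print(ratio)     # 16.0
```

A fourfold increase in length therefore means a sixteenfold increase in score-matrix entries, repeated for every head and every layer.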

Production Failure Scenario: Character-level Attention Explosion
Root Cause: An engineer trained a character-only chatbot. Each prompt became roughly four times longer than its subword equivalent, so the O(N^2) attention matrices exceeded GPU memory.
Fix / Strategy: Migrate to subword tokenization (for example, with SentencePiece) to shorten input sequences.

Try This in the Lab

  • Enter 'unbelievable' in the text workspace.
  • Compare the character count with the token count.
  • Check why typos break words into smaller subwords.
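
The typo question in the last lab step can also be explored offline. The sketch below uses greedy longest-match segmentation with a single-character fallback, which is a simplification and not the exact algorithm of the lab tool or of any specific production tokenizer; the five-entry vocabulary is hypothetical.

```python
def greedy_subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Repeatedly take the longest vocabulary piece that matches at position i."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest candidate first, then shrink it one character at a time.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary piece starts here: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens


# Hypothetical toy vocabulary for this sketch.
vocab = {"un", "believ", "able", "learn", "ing"}

clean = greedy_subword_tokenize("unbelievable", vocab)
typo = greedy_subword_tokenize("unbeleivable", vocab)

print(clean)  # ['un', 'believ', 'able']
print(typo)   # ['un', 'b', 'e', 'l', 'e', 'i', 'v', 'able']
```

The misspelling no longer matches the learned piece 'believ', so the middle of the word falls apart into single characters and the token count rises from 3 to 8.
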
Mapped Foundation Project

Tokenizer Visualizer Studio

Analyze how different tokenization algorithms split typical inputs and measure sequence length shifts.

Architecture Preview

Input Text → Algorithm Selector → Merged Token Array → Comparison Report

Tech Stack Planned
TypeScript, React, Subword Tokenizers

Common Beginner Misconceptions

Misconception

Character tokenization is never used anymore.

Reality

Many subword tokenizers still fall back to characters or raw bytes so that unknown symbols can be encoded instead of collapsing into an unknown token.
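
A byte fallback can be sketched as follows. For brevity this version works on single characters rather than learned subword pieces; the function name and the "<0xNN>" token format are illustrative assumptions, loosely modeled on how byte pieces are commonly written.

```python
def tokenize_with_byte_fallback(text: str, vocab: set[str]) -> list[str]:
    """Emit known symbols as-is and unknown symbols as one token per UTF-8 byte."""
    tokens: list[str] = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # Unknown symbol: represent it by its raw UTF-8 bytes.
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


# Toy vocabulary: lowercase ASCII letters and the space character.
vocab = set("abcdefghijklmnopqrstuvwxyz ")

result = tokenize_with_byte_fallback("hi \u2603", vocab)  # \u2603 is a snowman symbol
print(result)  # ['h', 'i', ' ', '<0xE2>', '<0x98>', '<0x83>']
```

Because every possible byte has a token, no input is ever unrepresentable; rare symbols simply cost more tokens.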

Key Takeaways

  • Character tokenizers have small vocabularies but suffer from long sequences.
  • Word tokenizers have short sequences but suffer from very large, open-ended vocabularies and OOV errors.
  • Subword tokenizers offer a practical balance between the two, which is why modern LLMs use them.