Module 1.1: Tokenization · Lesson 02 of 07

AI Lesson & Submodule

Character, Word & Subword Tokenization

Compare character-level, word-level, and subword tokenization with simple examples.

Beginner · 15 min read · LLM Foundation · Interview Ready

Lesson Overview

Understand why modern LLMs use subword tokenization by comparing the three fundamental tokenization approaches, their storage requirements, and computational trade-offs.

From Beginner to Engineer

Beginner Level

Text can be split by letter, by word, or into word pieces (subwords).

Engineer Level

Subword tokenizers build their vocabulary by merging frequent character sequences, trading embedding-table size against sequence length and therefore attention cost.

Production Level

The choice of granularity determines Out-of-Vocabulary (OOV) risk and the VRAM footprint of the embedding table.

Why This Matters

Choosing the correct tokenization granularity balances vocabulary size against sequence length, preventing context bottlenecks and out-of-vocabulary terms.

Mental Model: Word dictionary vs Spelling letters

Word-level tokenization is like a massive dictionary that tries to contain every word in existence. Character-level tokenization is like spelling everything out letter by letter. Subwords are like word roots, prefixes, and suffixes: a highly reusable building kit.

Visual Diagram: Tokenization Granularities Compared

Target word: "learning"

Character-Level (Small Vocab, Long Sequences): 8 tokens

l | e | a | r | n | i | n | g

Word-Level (Huge Vocab, OOV Failures): 1 token

learning

Subword-Level (Balanced standard for LLMs): 2 tokens

learn | ing

Tokenization in Simple Words

Early NLP models split text either into characters or into whitespace-separated words. Character-level splits produce very long sequences that drive up memory and compute. Word-level splits need an ever-growing vocabulary and still fail on typos or new words. Subword tokenization combines the two ideas: it keeps common words whole and breaks rare words into reusable chunks, balancing vocabulary size against sequence length.

Tokenization Granularity Comparison

Tokenizer Type	Example	Benefit	Problem
Character	cat → c, a, t	Handles any word, zero OOV terms	Very long sequences, attention memory overhead
Word	I love AI → I, love, AI	Intuitive, clean segmentation	Huge vocab, struggles with spelling variations and typos
Subword	tokenization → token, ization	Balanced sequence length & vocab size	Slightly more complex boundary rules
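The three rows of the table can be reproduced with a minimal sketch. The subword function below uses a simple greedy longest-match rule against a hand-written toy vocabulary (`TOY_VOCAB` is hypothetical); real subword tokenizers learn their vocabulary from data and may split differently.

```python
def char_tokenize(text: str) -> list[str]:
    """One token per character (spaces included)."""
    return list(text)


def word_tokenize(text: str) -> list[str]:
    """One token per whitespace-separated word."""
    return text.split()


def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split of one word against a toy vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing in the vocabulary matches: fall back to one character.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical vocabulary, written by hand for this illustration only.
TOY_VOCAB = {"token", "ization", "learn", "ing"}

print(char_tokenize("cat"))                         # ['c', 'a', 't']
print(word_tokenize("I love AI"))                   # ['I', 'love', 'AI']
print(subword_tokenize("tokenization", TOY_VOCAB))  # ['token', 'ization']
```

Note how the character fallback in `subword_tokenize` means no input is ever rejected, at the price of more tokens for unfamiliar words.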

Example: Text to Tokens to Token IDs

Step 1: Input text string: "learning"

Step 2: Token representation: ["learn", "ing"]

Step 3: Mapped Token IDs: [4658, 278]

A subword tokenizer can split 'learning' into the root 'learn' and the suffix 'ing', so both pieces are reused across many other words.
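The token-to-ID step is just a dictionary lookup, as the sketch below shows. The mapping `TOKEN_TO_ID` is hypothetical: it reuses the IDs from the example above and does not come from any real tokenizer.

```python
# Hypothetical vocabulary; the IDs mirror the example above and are
# not taken from any real tokenizer.
TOKEN_TO_ID = {"learn": 4658, "ing": 278}
ID_TO_TOKEN = {token_id: token for token, token_id in TOKEN_TO_ID.items()}


def encode(tokens: list[str]) -> list[int]:
    """Map each token string to its integer ID."""
    return [TOKEN_TO_ID[token] for token in tokens]


def decode(token_ids: list[int]) -> str:
    """Map IDs back to token strings and join them into text."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in token_ids)


ids = encode(["learn", "ing"])
print(ids)          # [4658, 278]
print(decode(ids))  # learning
```

The model only ever sees the integer IDs; the strings exist purely on the tokenizer side.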

Unbelievable Splitting Boundaries

See how the word 'unbelievable' is broken down by different tokenizers:

"Character-level"

u | n | b | e | l | i | e | v | a | b | l | e

"Word-level"

unbelievable

"Subword-level"

un | believ | able (one plausible split; the actual pieces depend on the tokenizer's learned vocabulary)

Deep-Dive Core Concepts

Character-Level Tokenization

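Splits text into individual characters. The vocabulary is tiny (128 symbols for ASCII, or 256 for a byte-level vocabulary over UTF-8), but sequences are long, which makes self-attention expensive because its cost grows quadratically with sequence length.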

Word-Level Tokenization

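Splits text on whitespace. The vocabulary can grow to hundreds of thousands or millions of entries, and any word not seen during training, such as a rare inflection, a typo, or a new name, is Out-of-Vocabulary (OOV) and is typically mapped to a single unknown token.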
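A minimal sketch of the OOV problem, assuming a vocabulary built from a two-sentence corpus and a reserved `<unk>` entry (the function names and the corpus are illustrative only):

```python
def build_word_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an ID to every distinct whitespace-separated word."""
    vocab = {"<unk>": 0}  # ID 0 is reserved for unknown words
    for sentence in corpus:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode_words(text: str, vocab: dict[str, int]) -> list[int]:
    """Look up each word; anything unseen collapses to the <unk> ID."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]


vocab = build_word_vocab(["I love AI", "I love cats"])

print(encode_words("I love AI", vocab))   # [1, 2, 3]
print(encode_words("I lvoe cat", vocab))  # [1, 0, 0]
```

Both the typo "lvoe" and the singular "cat" map to the same ID, so the model cannot tell them apart.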

Subword-Level Tokenization

Keeps frequent words whole and splits rare words into frequent subword chunks (e.g., 'playing' → ['play', 'ing']). This balances vocabulary size and sequence length.
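One common way to find frequent chunks is to repeatedly merge the most frequent adjacent pair of symbols, which is the core step of byte-pair encoding. The sketch below performs a single merge on a made-up word-frequency table; the words and counts are assumptions for illustration, and a real trainer would repeat the step thousands of times.

```python
from collections import Counter

Word = tuple[str, ...]


def most_frequent_pair(words: dict[Word, int]) -> tuple[str, str]:
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs: Counter = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words: dict[Word, int], pair: tuple[str, str]) -> dict[Word, int]:
    """Replace every occurrence of `pair` with one merged symbol."""
    merged: dict[Word, int] = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Made-up corpus statistics: each word starts as a tuple of characters.
words = {tuple("playing"): 5, tuple("play"): 6, tuple("saying"): 3}

pair = most_frequent_pair(words)
print(pair)                     # ('a', 'y')
print(merge_pair(words, pair))  # 'a' and 'y' are now the single symbol 'ay'
```

Each merge adds one entry to the vocabulary and shortens every sequence that contains the pair, which is the vocabulary-versus-length trade-off in miniature.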

Concepts Covered

Character Tokenization, Word Tokenization, Subword Tokenization, Out-of-Vocabulary (OOV), Sparsity, Sequence Scaling

Why AI Engineers Care About Tokenization

Vocabulary Size vs Model Weights

The embedding matrix has one row per vocabulary entry, so a word-level vocabulary with millions of entries can consume gigabytes of GPU VRAM on embeddings alone.
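A back-of-the-envelope estimate makes the difference concrete. The vocabulary sizes, the hidden dimension of 4096, and the 2 bytes per parameter (16-bit weights) below are assumptions chosen for illustration, not figures for any particular model.

```python
GIB = 1024 ** 3


def embedding_bytes(vocab_size: int, hidden_dim: int, bytes_per_param: int = 2) -> int:
    """Memory for an embedding table: one hidden_dim-sized row per token."""
    return vocab_size * hidden_dim * bytes_per_param


# Assumed sizes: 1M-entry word vocabulary vs. 32k-entry subword vocabulary.
word_level = embedding_bytes(vocab_size=1_000_000, hidden_dim=4096)
subword_level = embedding_bytes(vocab_size=32_000, hidden_dim=4096)

print(f"word-level:    {word_level / GIB:.2f} GiB")     # 7.63 GiB
print(f"subword-level: {subword_level / GIB:.2f} GiB")  # 0.24 GiB
```

Under these assumptions the word-level table is about 31 times larger, before counting any other model weights.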

Sequence Length Complexity

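Character-level tokenization makes prompts several times longer than their subword equivalents, and self-attention cost grows quadratically with sequence length, so a sequence k times longer costs roughly k^2 times more attention compute.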
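The quadratic growth can be checked with simple arithmetic. This sketch assumes a naive attention implementation that materializes the full score matrix, and it assumes about four characters per subword token; both are illustrative assumptions rather than properties of a specific model.

```python
def attention_score_entries(seq_len: int, num_heads: int = 1) -> int:
    """Entries in the seq_len x seq_len attention score matrix for one layer."""
    return num_heads * seq_len * seq_len


subword_len = 1_000
chars_per_token = 4  # assumption: rough average for English text
char_len = chars_per_token * subword_len

ratio = attention_score_entries(char_len) / attention_score_entries(subword_len)
print(ratio)  # 16.0
```

A 4x longer sequence therefore needs 16x as many attention scores per head and per layer.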

Production Failure Scenario: Character-level Attention Explosion

Root Cause: An engineer trained a character-only chatbot. Character-level tokenization made prompts roughly four times longer than their subword equivalents, so the O(N^2) attention matrices grew about 16 times larger and exceeded GPU memory.

Fix / Strategy: Migrate to a subword tokenizer (for example, one trained with SentencePiece) to shorten input sequences.

Try This in the Lab

Enter 'unbelievable' in the text workspace.
Compare the character count with the token count.
Introduce a typo and observe how the word breaks into smaller subwords.


Mapped Foundation Project

Tokenizer Visualizer Studio

Analyze how different tokenization algorithms split typical inputs and measure sequence length shifts.

Architecture Preview

Input Text → Algorithm Selector → Merged Token Array → Comparison Report

Tech Stack Planned

TypeScript, React, Subword Tokenizers


Common Beginner Misconceptions

Misconception

Character tokenization is never used anymore.

Reality

Many subword tokenizers fall back to characters or raw bytes for symbols outside their vocabulary, so unseen input is still encoded instead of collapsing into an unknown token.
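A simplified sketch of byte fallback follows. It works on single characters rather than subwords to keep the idea visible, and the function name, the lowercase-only vocabulary, and the `<0xNN>` token spelling are assumptions made for this illustration.

```python
def encode_with_byte_fallback(text: str, vocab: set[str]) -> list[str]:
    """Emit known symbols as-is; spell unknown ones out as UTF-8 byte tokens."""
    tokens: list[str] = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # Unknown symbol: one token per UTF-8 byte instead of <unk>.
            tokens.extend(f"<0x{byte:02X}>" for byte in ch.encode("utf-8"))
    return tokens


# Toy vocabulary: lowercase ASCII letters only.
vocab = set("abcdefghijklmnopqrstuvwxyz")

print(encode_with_byte_fallback("café", vocab))
# ['c', 'a', 'f', '<0xC3>', '<0xA9>']
```

Because every string can be written as UTF-8 bytes, 256 byte tokens are enough to guarantee that nothing is ever truly out of vocabulary.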

Key Takeaways

• Character tokenizers have small vocabularies but produce long sequences.
• Word tokenizers produce short sequences but need very large vocabularies and still hit OOV errors.
• Subword tokenizers offer a practical balance between the two and are the standard choice for modern transformer LLMs.

Before You Move Next Checklist

I can explain why character tokenization strains context window limits.
I understand the OOV bottleneck.
I know why subwords are the preferred trade-off.
