Tokenization is the foundational step in NLP where raw text gets converted into model-digestible tokens. This guide explores modern tokenization techniques, focusing on Byte-Pair Encoding (BPE) and its variants.
Why Tokenization Matters
Modern NLP models such as BERT and GPT process text through tokenization, which:
- Converts raw text into structured inputs
- Determines model vocabulary boundaries
- Impacts computational efficiency and semantic representation
Ideal tokenization should:
✅ Preserve semantic meaning
✅ Maintain manageable vocabulary size
✅ Avoid out-of-vocabulary (OOV) issues
✅ Produce reasonable sequence lengths
Granularity Levels in Tokenization
1. Character-Level Tokenization
Process: Splits text into individual characters
Examples:
- English: "apple" → a/p/p/l/e
- Chinese: "苹果" (apple) → 苹/果
Pros:
- Minimal vocabulary (on the order of 100 symbols for cased English text)
- No OOV issues
Cons:
- Loses semantic meaning
- Creates excessively long sequences
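A character-level tokenizer is trivial to sketch (a minimal illustration, not a production tokenizer):

```python
def char_tokenize(text):
    # character-level tokenization: one token per Unicode character
    return list(text)

print(char_tokenize("apple"))  # ['a', 'p', 'p', 'l', 'e']
print(char_tokenize("苹果"))   # ['苹', '果']
```

Note the sequence-length cost: a word-level tokenizer emits one token for "apple", while the character-level version emits five.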
2. Word-Level Tokenization
Process: Uses linguistic units (words/phrases)
Examples:
- "New York" remains intact
- "我爱吃苹果" ("I love eating apples") → 我/爱/吃/苹果 (I / love / eat / apple)
Tools:
- English: SpaCy, NLTK
- Chinese: Jieba, LTP
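A naive word-level tokenizer can be sketched with the standard library (real pipelines use the tools above; note that this simple splitter neither keeps "New York" intact nor segments Chinese, which has no whitespace between words):

```python
import re

def word_tokenize(text):
    # naive splitter: runs of word characters (keeping contractions like
    # "don't" together) or single punctuation marks
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(word_tokenize("I love New York!"))
# ['I', 'love', 'New', 'York', '!']
```

Limitations like these are exactly why dedicated tokenizers (SpaCy, NLTK, Jieba, LTP) exist for word-level segmentation.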
Subword Tokenization: The Modern Approach
Balances character/word methods by splitting rare words while preserving common terms.
3.1 Byte-Pair Encoding (BPE)
Core Principle: Iteratively merge highest-frequency character pairs
BPE Workflow:
- Pre-tokenization: Split corpus into words + frequency counts
- Base Vocabulary: Create initial charset (e.g., [a,b,c,...,z])
Merge Phase:
- Calculate all adjacent pair frequencies
- Merge highest-frequency pair → new token
- Repeat until target vocab size reached
Case Study:
Training Example:
| Iteration | Merge | New Token | Frequency |
|---|---|---|---|
| 1 | u + g | ug | 20 |
| 2 | u + n | un | 16 |
| 3 | h + ug | hug | 15 |
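The merge loop can be sketched in plain Python. Run on a hypothetical toy corpus (hug ×10, pug ×5, pun ×12, bun ×4, hugs ×5; the word counts are illustrative, chosen so the result matches the table above), it reproduces exactly those three merges:

```python
from collections import Counter

def get_pair_counts(vocab):
    # vocab maps a tuple of symbols to that word's corpus frequency
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # replace every adjacent occurrence of `pair` with the merged symbol
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# hypothetical word frequencies
vocab = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
         tuple("bun"): 4, tuple("hugs"): 5}

merges = []
for _ in range(3):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    merges.append((best, pairs[best]))
    vocab = merge_pair(best, vocab)

print(merges)
# [(('u', 'g'), 20), (('u', 'n'), 16), (('h', 'ug'), 15)]
```

In practice the loop continues until the target vocabulary size is reached rather than for a fixed three iterations.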
Key Insight: the original GPT's BPE vocabulary has 40,478 tokens (478 base characters + 40,000 merges); GPT-2's byte-level vocabulary has 50,257 tokens
3.2 WordPiece (Google's BPE Variant)
Difference: Merges based on maximum likelihood increase rather than frequency
Selection Criteria: argmax P(xy) / (P(x) × P(y)) over candidate pairs (x, y)
→ Chooses pairs with highest mutual information
BERT Implementation:
- Initialize with character vocabulary
- Train language model on subword distribution
- Iteratively add highest-likelihood merges
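A count-based sketch of the WordPiece score (approximating P(xy)/(P(x)P(y)) with corpus counts; the toy word frequencies are hypothetical, the same ones used for the BPE table above):

```python
from collections import Counter

def wordpiece_scores(vocab):
    # vocab maps a tuple of symbols to that word's corpus frequency
    pair_counts, unit_counts = Counter(), Counter()
    for symbols, freq in vocab.items():
        for s in symbols:
            unit_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    # score ∝ count(xy) / (count(x) * count(y)): mutual-information style
    return {p: c / (unit_counts[p[0]] * unit_counts[p[1]])
            for p, c in pair_counts.items()}

vocab = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
         tuple("bun"): 4, tuple("hugs"): 5}
scores = wordpiece_scores(vocab)
best = max(scores, key=scores.get)
print(best)  # ('g', 's')
```

Note the contrast: on this corpus BPE merges ('u', 'g') first (raw frequency 20), while WordPiece prefers ('g', 's'), because 's' almost always follows 'g' relative to how rare both units are.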
3.3 Byte-Level BPE (GPT-2 Innovation)
Breakthrough:
- Processes raw UTF-8 bytes, so the 256 possible byte values form the base vocabulary
- Eliminates UNK tokens completely (any string can be encoded)
- Introduced with GPT-2 and since adopted widely (e.g., by RoBERTa)
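A quick illustration of why byte-level inputs avoid OOV entirely (standard library only):

```python
# Any Unicode string decomposes into UTF-8 bytes in the range 0-255,
# so a 256-entry base vocabulary covers every possible input.
text = "苹果 🍎"
byte_ids = list(text.encode("utf-8"))

print(len(byte_ids))  # 11: 3 + 3 (CJK chars) + 1 (space) + 4 (emoji)
print(all(0 <= b < 256 for b in byte_ids))  # True
```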
FAQs: Tokenization Demystified
Q: Why does BPE outperform word-level tokenization?
A: It handles morphology and rare words better by decomposing them into adaptive subword units, avoiding OOV failures.
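For instance, applying the merge rules from the training table above (ug, un, hug) lets a tokenizer decompose words it never saw as whole units (a simplified inference sketch that applies each rule in one left-to-right pass):

```python
def apply_merges(word, merges):
    # apply learned merges in training order (simplified BPE inference)
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = [("u", "g"), ("u", "n"), ("h", "ug")]  # from the table above
print(apply_merges("hugs", merges))  # ['hug', 's']
print(apply_merges("bug", merges))   # ['b', 'ug']
```

A word-level tokenizer would map "bug" to UNK if it never appeared in training; BPE falls back to known subwords instead.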
Q: How to choose vocabulary size?
A: Balance computational cost (smaller) vs. semantic precision (larger). 30k-50k is common.
Q: Does BPE work for all languages?
A: Yes; BPE is data-driven and language-agnostic, and it is particularly effective for morphologically rich languages like German or Turkish.
Q: What's the memory impact of larger vocabularies?
A: Each additional token requires embedding storage (e.g., 50k tokens × 768 dims × 4 bytes for float32 ≈ 154 MB).
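The arithmetic behind that estimate (assuming float32 embeddings):

```python
vocab_size, embed_dim, bytes_per_param = 50_000, 768, 4  # 4 bytes = float32
megabytes = vocab_size * embed_dim * bytes_per_param / 1e6
print(f"{megabytes:.1f} MB")  # 153.6 MB
```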
Advanced Applications
Modern implementations leverage these techniques:
- SentencePiece: Language-agnostic tokenization
- HuggingFace Tokenizers: Optimized Rust implementations
- Subword Regularization: Probabilistic merges for robustness
Future directions include dynamically adaptive vocabulary sizes and cross-lingual subword sharing.