Subword tokenization
New concept: vocabulary construction
A language model operates on discrete symbols, not raw text. The first design decision is: what are the symbols?
Word-level tokenization splits on whitespace — “unbelievable” is one token.
It works well for common words but fails on rare ones: any word not seen during
training becomes [UNK], an unknown token that discards all meaning. A vocabulary
large enough to cover all words is also impractically large.
Character-level tokenization avoids unknowns entirely — every string is representable — but forces the model to learn word structure from scratch. Sequences become very long (every character is a step), and training is expensive.
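Both failure modes are easy to see in a toy sketch (the word-level vocabulary below is made up for illustration):

```python
# Toy word-level vocabulary: anything outside it collapses to [UNK].
word_vocab = {"the": 0, "cat": 1, "sat": 2, "[UNK]": 3}

def word_tokenize(text):
    # Out-of-vocabulary words lose all meaning.
    return [w if w in word_vocab else "[UNK]" for w in text.split()]

def char_tokenize(text):
    # Nothing is ever unknown, but every character is its own step.
    return list(text)

print(word_tokenize("the cat yawned"))    # → ['the', 'cat', '[UNK]']
print(len(char_tokenize("unbelievable"))) # → 12
```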
Subword tokenization sits in between. Common words get their own token;
rare words are split into smaller pieces that the model has seen. “unbelievable”
might become ["un", "believ", "able"]. Nothing is truly unknown, and the
vocabulary stays compact.
Byte-Pair Encoding (BPE)
BPE (Sennrich et al., 2016, Neural Machine Translation of Rare Words with Subword Units; originally proposed for NMT, later adopted by GPT-2 and every subsequent OpenAI model) builds a vocabulary by iteratively merging the most frequent pair of adjacent symbols in a training corpus:
# Start with a character-level vocabulary
corpus = ["l o w </w>", "l o w e r </w>", "n e w e s t </w>", ...]
# Count adjacent pairs
pairs = count_pairs(corpus) # ("e", "s") → 9, ("e", "r") → 6, ...
# Merge the most frequent pair and repeat
# Round 1: merge ("e", "s") → "es"
# Round 2: merge ("es", "t") → "est"
# ...until vocabulary reaches target size
Each merge adds one new token to the vocabulary. Running 50,000 merges starting from a character alphabet produces a ~50k-token vocabulary — GPT-2’s size.
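The merge loop above can be made runnable in a few lines. This is a minimal sketch over a toy weighted corpus (word frequencies are illustrative), not a production implementation:

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace the pair with its concatenation everywhere it occurs.
    Plain str.replace is adequate for this toy corpus; a real implementation
    tracks symbol boundaries explicitly."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in corpus.items()}

# Toy corpus: space-separated symbols with frequencies; </w> marks end of word.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(3):
    pairs = count_pairs(corpus)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)
# → [('e', 's'), ('es', 't'), ('est', '</w>')]
```

Each merge becomes one new vocabulary token; real training simply runs this loop tens of thousands of times on a large corpus.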
Byte-level BPE (tiktoken / GPT-4)
Standard BPE can still fail on unusual Unicode. Byte-level BPE (introduced with GPT-2; Radford et al., 2019) solves this by treating UTF-8 bytes (0–255) as the base alphabet instead of Unicode characters. Every possible string is representable; there is no UNK token. tiktoken is OpenAI's open-source implementation: the cl100k_base encoding used by GPT-3.5/GPT-4 has a vocabulary of 100,277 tokens, and o200k_base, used by GPT-4o, has 200,019.
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
# The vocabulary is ~100k tokens built from byte-level BPE merges
print(enc.n_vocab) # 100277
The trade-off: byte-level tokenization splits emoji and non-Latin scripts into more pieces than character-level schemes, making those sequences longer.
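The expansion is visible with plain UTF-8, independent of any tokenizer library:

```python
# Byte-level BPE's base alphabet is the 256 possible byte values, so any
# string is representable, but non-ASCII text expands before any merging.
for s in ["low", "héllo", "日本語", "👍"]:
    b = s.encode("utf-8")
    print(f"{s!r}: {len(s)} chars → {len(b)} bytes")
# 'low': 3 chars → 3 bytes
# 'héllo': 5 chars → 6 bytes
# '日本語': 3 chars → 9 bytes
# '👍': 1 char → 4 bytes
```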
SentencePiece (LLaMA, Mistral, Gemma)
SentencePiece (Kudo & Richardson, 2018, SentencePiece: A simple and language independent subword tokenizer) is the tokenizer of choice for most non-OpenAI models. Unlike BPE tokenizers that pre-split on whitespace, it applies BPE (or a unigram language model variant) directly to the raw character stream, which makes it language-agnostic. A special marker ▁ (U+2581) denotes a space preceding a token:
from sentencepiece import SentencePieceProcessor
# LLaMA 2 ships its trained SentencePiece model as "tokenizer.model"
sp = SentencePieceProcessor(model_file="tokenizer.model")
sp.encode("Hello world", out_type=str)
# → ['▁Hello', '▁world']
sp.encode("unbelievable", out_type=str)
# → ['▁un', 'bel', 'iev', 'able']
LLaMA 2 uses a 32k-token SentencePiece vocabulary; LLaMA 3 switched to a tiktoken-style BPE vocabulary of roughly 128k tokens, large enough that common English words are almost always a single token, improving both efficiency and the model's ability to reason about word structure.
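The ▁ convention itself can be sketched without the library. The vocabulary below is made up, and greedy longest-match stands in for the learned BPE/unigram segmentation that real SentencePiece models use:

```python
# Toy vocabulary illustrating the ▁ space-marker convention.
vocab = {"▁Hello", "▁world", "▁un", "bel", "iev", "able"}

def segment(text):
    # Replace spaces with ▁ and prepend one, as SentencePiece normalization does.
    stream = "▁" + text.replace(" ", "▁")
    pieces, i = [], 0
    while i < len(stream):
        # Take the longest vocabulary entry matching at position i.
        for j in range(len(stream), i, -1):
            if stream[i:j] in vocab:
                pieces.append(stream[i:j])
                i = j
                break
        else:
            pieces.append(stream[i])  # fall back to a single character
            i += 1
    return pieces

print(segment("Hello world"))   # → ['▁Hello', '▁world']
print(segment("unbelievable"))  # → ['▁un', 'bel', 'iev', 'able']
```

Because spaces are encoded into the tokens themselves, decoding is just concatenation followed by replacing ▁ with a space; no language-specific detokenization rules are needed.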
What the vocabulary actually is
After training the tokenizer, the vocabulary is simply a mapping between integer IDs and token strings:
# Excerpt from a BPE vocabulary file
{
"<|endoftext|>": 0,
"!": 1,
'"': 2,
"#": 3,
...
" the": 262,
...
"izable": 1891,
...
}
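In code, the mapping is used in both directions: token → ID when encoding, ID → token when decoding. A sketch using the entries from the excerpt above (the excerpt itself is illustrative):

```python
# Forward map (token → ID) as loaded from a vocabulary file, plus the
# inverse map (ID → token) used when decoding model output.
token_to_id = {"<|endoftext|>": 0, "!": 1, '"': 2, "#": 3,
               " the": 262, "izable": 1891}
id_to_token = {i: t for t, i in token_to_id.items()}

print(token_to_id[" the"])  # → 262
print(id_to_token[1891])    # → izable
```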
The vocabulary is fixed before model training begins and never changes. It is a hyperparameter of the model architecture, not something learned during training.
The next step shows how a tokenizer uses this vocabulary to encode a string of text into a sequence of integer IDs.