Subword tokenization
New concept: vocabulary construction
A language model operates on discrete symbols, not raw text. The first design decision is: what are the symbols?
Word-level tokenization splits on whitespace — “unbelievable” is one token.
It works well for common words but fails on rare ones: any word not seen during
training becomes [UNK], an unknown token that discards all meaning. A vocabulary
large enough to cover all words is also impractically large.
Character-level tokenization avoids unknowns entirely — every string is representable — but forces the model to learn word structure from scratch. Sequences become very long (every character is a step), and training is expensive.
Subword tokenization sits in between. Common words get their own token;
rare words are split into smaller pieces that the model has seen. “unbelievable”
might become ["un", "believ", "able"]. Nothing is truly unknown, and the
vocabulary stays compact.
Word-level tokenization is shortest on this sentence, but only if the
vocabulary already contains the full word internationalization. Subword
tokenization pays one extra step and gets to reuse pieces like
international and ization instead of memorizing the whole word.
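The three schemes can be compared with a minimal sketch. The helper names and the tiny vocabularies below are hypothetical, and the subword splitter uses greedy longest-match as a stand-in for illustration; it is not BPE, which is described next.

```python
def word_tokenize(text, vocab):
    """Word-level: split on whitespace; unseen words collapse to [UNK]."""
    return [w if w in vocab else "[UNK]" for w in text.split()]


def char_tokenize(text):
    """Character-level: every character is its own token."""
    return list(text)


def subword_tokenize(word, pieces):
    """Toy subword split: greedy longest match, falling back to single characters."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in pieces:
                out.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here, so emit one character.
            out.append(word[i])
            i += 1
    return out


word_vocab = {"an", "result"}                                   # hypothetical
piece_vocab = {"un", "believ", "able", "international", "ization"}  # hypothetical

print(word_tokenize("an unbelievable result", word_vocab))
print(len(char_tokenize("internationalization")))
print(subword_tokenize("unbelievable", piece_vocab))
print(subword_tokenize("internationalization", piece_vocab))
```

The word-level call loses the rare word entirely, the character-level call needs one step per character, and the subword calls reuse a handful of known pieces.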
Byte-Pair Encoding (BPE)
BPE (Sennrich et al., 2016, Neural Machine Translation of Rare Words with Subword Units) was proposed for NMT and later adopted by GPT-2 and subsequent OpenAI models. It builds a vocabulary by iteratively merging the most frequent pair of adjacent symbols in a training corpus:
# Start with a character-level vocabulary
corpus = ["l o w </w>", "l o w e r </w>", "n e w e s t </w>", ...]
# Count adjacent pairs
pairs = count_pairs(corpus) # ("e", "s") → 9, ("e", "r") → 6, ...
# Merge the most frequent pair and repeat
# Round 1: merge ("e", "s") → "es"
# Round 2: merge ("es", "t") → "est"
# ...until vocabulary reaches target size
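Each merge adds one new token to the vocabulary, so the final size is the base alphabet plus the number of merges. GPT-2, for example, runs 50,000 merges on top of a base alphabet of 256 byte values and adds one special end-of-text token, which gives its 50,257-token vocabulary.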
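The merge loop above can be written out as a short runnable sketch. The function names (`count_pairs`, `merge_pair`, `train_bpe`) and the word frequencies in `toy_corpus` are assumptions made for illustration; ties between equally frequent pairs are broken by first occurrence, which is one possible choice rather than a fixed rule.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_freqs, num_merges):
    """Return the ordered list of merges learned from a word-frequency table."""
    # Start from characters plus an end-of-word marker.
    words = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # first pair with the highest count
        words = merge_pair(words, best)
        merges.append(best)
    return merges


toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # assumed counts
merges = train_bpe(toy_corpus, num_merges=5)
print(merges)
```

With these assumed counts the first two merges are ("e", "s") and then ("es", "t"), matching the rounds listed above. The ordered merge list is what a BPE tokenizer later replays to encode new text.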
Byte-level BPE (GPT-2 / nanoGPT)
Standard BPE can still fail on unusual Unicode: a character that is absent from the base alphabet has no token. Byte-level BPE, introduced in GPT-2 (Radford et al., 2019), solves this
by treating UTF-8 bytes (0–255) as the base alphabet instead of Unicode characters.
Every possible string is representable, so there is no UNK token. The GPT-2 tokenizer has a 50,257-token vocabulary, and OpenAI’s tiktoken library can reproduce it with the gpt2 encoding.
import tiktoken
enc = tiktoken.get_encoding("gpt2")
# The vocabulary is GPT-2's 50,257 byte-level BPE tokens
print(enc.n_vocab) # 50257
The trade-off: byte-level tokenization splits emoji and non-Latin scripts into more pieces than character-level schemes, making those sequences longer.
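A standard-library sketch makes the trade-off concrete. It only counts UTF-8 bytes, which is the number of base symbols before any merges are applied; a trained tokenizer will usually end up with fewer tokens than this. The helper name `byte_counts` and the sample strings are illustrative.

```python
def byte_counts(text):
    """Return (number of Unicode characters, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


samples = ["hello", "日本語", "🍣"]
for text in samples:
    data = text.encode("utf-8")
    # Every byte is in 0..255, so a 256-symbol base alphabet covers any string.
    assert all(0 <= b <= 255 for b in data)
    chars, nbytes = byte_counts(text)
    print(f"{text!r}: {chars} characters, {nbytes} bytes")
```

ASCII text costs one byte per character, while the CJK characters here take three bytes each and the emoji takes four.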
SentencePiece (LLaMA, Mistral, Gemma)
SentencePiece (Kudo & Richardson, 2018, SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing) is the tokenizer of choice for many non-OpenAI models. Unlike BPE tokenizers that split on whitespace first, it applies BPE
(or a unigram language model variant) directly to the raw character stream rather
than whitespace-presegmented words, which makes it language-agnostic. A special marker ▁ (U+2581) denotes a
space preceding a token:
from sentencepiece import SentencePieceProcessor
# Placeholder path: any SentencePiece .model file, e.g. the one shipped with Llama 2
sp = SentencePieceProcessor(model_file="tokenizer.model")
sp.encode("Hello world", out_type=str)
# → ['▁Hello', '▁world']
sp.encode("unbelievable", out_type=str)
# → ['▁un', 'bel', 'iev', 'able']
Llama 2 uses a 32k-token SentencePiece vocabulary. LLaMA 3 moved away from SentencePiece to a tiktoken-style BPE tokenizer with a roughly 128k-token vocabulary, large enough that common English words are almost always a single token, which shortens sequences and improves efficiency.
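The role of the ▁ marker can be sketched without the library. The two helpers below are hypothetical and only mimic the convention: spaces become ▁ before segmentation, and detokenization is plain concatenation followed by the reverse substitution. Prepending a marker to the start of the text is an assumption that mirrors the leading ▁ seen in the outputs above.

```python
SPACE_MARKER = "\u2581"  # the ▁ character


def to_marked(text):
    """Replace spaces with the marker and prepend one to the start of the text."""
    return SPACE_MARKER + text.replace(" ", SPACE_MARKER)


def detokenize(pieces):
    """Concatenate pieces, turn markers back into spaces, drop the leading space."""
    text = "".join(pieces).replace(SPACE_MARKER, " ")
    return text[1:] if text.startswith(" ") else text


print(to_marked("Hello world"))
print(detokenize(["\u2581Hello", "\u2581world"]))
print(detokenize(["\u2581un", "bel", "iev", "able"]))
```

Because the space is stored inside the tokens, the original string can be recovered from the pieces alone, with no language-specific rules about where spaces belong.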
Same text, different surface form
For plain ASCII, tiktoken and SentencePiece can look fairly similar. Unicode
characters make the difference clearer. Here the SentencePiece row uses the
actual tokenizer from google/gemma-3-1b-it:
The GPT-2 row above is the real gpt2 encoding. The sushi emoji is
encoded through UTF-8 bytes, so inspecting the tokens one by one reveals byte
fragments rather than a clean 🍣 token string. In the display above, those
fragments are shown as exact hex byte chunks such as <20 f0 9f>. Gemma 3’s
SentencePiece tokenizer stores Unicode text
directly, so the emoji remains readable and the preceding space appears as its
own ▁ token.
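The byte fragments can be reproduced with the standard library alone. This sketch does not run a tokenizer: it encodes a space followed by the sushi emoji to UTF-8, prints the bytes in hex, and checks that the three-byte chunk quoted in the text is not valid UTF-8 on its own. Cutting after the third byte is taken from that quoted chunk and is otherwise an assumption.

```python
text = " 🍣"
data = text.encode("utf-8")

# One byte for the space, four for the emoji.
print(data.hex(" "))

# A chunk that ends in the middle of the emoji cannot be decoded by itself.
fragment = data[:3]
try:
    fragment.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(fragment.hex(" "), "decodable on its own:", decodable)

# The full byte sequence still round-trips to the original string.
print(data.decode("utf-8") == text)
```

This is why a byte-level tokenizer's individual tokens are often shown as hex or escaped bytes: only the concatenated sequence is guaranteed to decode.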
What the vocabulary actually is
After training the tokenizer, the vocabulary is simply a mapping between integer IDs and token strings:
# Excerpt from a BPE vocabulary file
{
"<|endoftext|>": 0,
"!": 1,
'"': 2,
"#": 3,
...
" the": 262,
...
"izable": 1891,
...
}
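The vocabulary is fixed before model training begins and does not change afterwards. The token set is learned when the tokenizer is trained, not when the model is, and the vocabulary size acts as a hyperparameter of the model because it sets the size of the embedding table and the output layer.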
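As a minimal sketch of that mapping, the dictionary below reuses a few entries from the excerpt above; the inverse table and the `decode` helper are illustrative names, not part of any particular library.

```python
# Token string -> integer ID (a few entries from the excerpt above)
token_to_id = {
    "<|endoftext|>": 0,
    "!": 1,
    '"': 2,
    "#": 3,
    " the": 262,
    "izable": 1891,
}

# Integer ID -> token string, built by inverting the mapping
id_to_token = {idx: tok for tok, idx in token_to_id.items()}


def decode(ids):
    """Turn a sequence of IDs back into text by concatenating token strings."""
    return "".join(id_to_token[i] for i in ids)


print(token_to_id[" the"])
print(repr(decode([262, 1])))
```

The model itself only ever sees the integers; the strings matter at the boundaries, when text is encoded on the way in and decoded on the way out.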
The next step shows how a tokenizer uses this vocabulary to encode a string of text into a sequence of integer IDs.