Subword tokenization — LLM Explainer

A language model operates on discrete symbols, not raw text. The first design decision is: what are the symbols?

Word-level tokenization splits on whitespace — “unbelievable” is one token. It works well for common words but fails on rare ones: any word not seen during training becomes [UNK], an unknown token that discards all meaning. A vocabulary large enough to cover all words is also impractically large.

Character-level tokenization avoids unknowns entirely — every string is representable — but forces the model to learn word structure from scratch. Sequences become very long (every character is a step), and training is expensive.

Subword tokenization sits in between. Common words get their own token; rare words are split into smaller pieces that the model has seen. “unbelievable” might become ["un", "believ", "able"]. Nothing is truly unknown, and the vocabulary stays compact.

Text: "internationalization matters"

Word-level: 2 tokens
Character-level: 28 tokens
BPE (GPT-2 / nanoGPT): 3 tokens

Word-level tokenization is shortest on this sentence, but only if the vocabulary already contains the full word internationalization. Subword tokenization pays one extra step and gets to reuse pieces like international and ization instead of memorizing the whole word.
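As a rough illustration of the two baseline counts, the sketch below implements word-level and character-level tokenization in a few lines. The toy vocabulary and the function names are assumptions made for this example; they do not come from any library.

```python
# Toy word-level vocabulary (hypothetical); real ones hold tens of thousands of words.
WORD_VOCAB = {"internationalization", "matters", "low", "lower"}

def word_tokenize(text, vocab=WORD_VOCAB):
    """Split on whitespace; any word outside the vocabulary becomes [UNK]."""
    return [w if w in vocab else "[UNK]" for w in text.split()]

def char_tokenize(text):
    """One token per character, spaces included."""
    return list(text)

sentence = "internationalization matters"
print(len(word_tokenize(sentence)))            # 2
print(len(char_tokenize(sentence)))            # 28
print(word_tokenize("unbelievable matters"))   # ['[UNK]', 'matters']
```

The last line shows the failure mode of word-level schemes: a word missing from the vocabulary collapses to [UNK], while the character-level tokenizer never fails but produces a much longer sequence.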

Byte-Pair Encoding (BPE)

BPE (Sennrich et al., 2016, "Neural Machine Translation of Rare Words with Subword Units") was proposed for neural machine translation and later adopted by GPT-2 and subsequent OpenAI models. It builds a vocabulary by iteratively merging the most frequent pair of adjacent symbols in a training corpus:

# Start with a character-level vocabulary
corpus = ["l o w </w>", "l o w e r </w>", "n e w e s t </w>", ...]

# Count adjacent pairs
pairs = count_pairs(corpus)   # ("e", "s") → 9,  ("e", "r") → 6, ...

# Merge the most frequent pair and repeat
# Round 1: merge ("e", "s") → "es"
# Round 2: merge ("es", "t") → "est"
# ...until vocabulary reaches target size

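Each merge adds one new token to the vocabulary, so the final size is the base alphabet plus the number of merges, plus any special tokens. GPT-2's 50,257-token vocabulary, for example, consists of 256 byte tokens, 50,000 merges, and one end-of-text token.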
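The merge loop can be made concrete with a small runnable sketch. The word frequencies below are an assumed toy corpus, and the helper names (`count_pairs`, `merge_pair`, `train_bpe`) are chosen for illustration. Ties between equally frequent pairs are broken by first-seen order here, which is one possible convention among several.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge rules from a {word: frequency} dict."""
    # Start from characters, with an end-of-word marker on each word.
    words = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # first-seen pair wins ties
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Assumed toy corpus: word -> number of occurrences.
toy = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, words = train_bpe(toy, num_merges=3)
print(merges)
# [('e', 's'), ('es', 't'), ('est', '</w>')]
```

On this corpus the first two learned merges match the rounds sketched above: ("e", "s") is counted 9 times (6 from "newest", 3 from "widest"), and once it is merged, ("es", "t") becomes the most frequent pair.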

Byte-level BPE (GPT-2 / nanoGPT)

Standard BPE can still fail on unusual Unicode: a character that never appeared in the tokenizer's training data has no symbol to fall back on. Byte-level BPE, introduced with GPT-2 (Radford et al., 2019), solves this by treating UTF-8 bytes (0–255) as the base alphabet instead of Unicode characters. Every possible string is representable, so there is no UNK token. The GPT-2 tokenizer has a 50,257-token vocabulary, and OpenAI's tiktoken library can reproduce it with the gpt2 encoding.

import tiktoken
enc = tiktoken.get_encoding("gpt2")

# The vocabulary is GPT-2's 50,257 byte-level BPE tokens
print(enc.n_vocab)   # 50257

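The trade-off: characters outside ASCII take two to four UTF-8 bytes each, so emoji and non-Latin scripts that were rare in the tokenizer's training data are often split into more pieces than a character-level scheme would produce, making those sequences longer.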
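The size of the byte-level base alphabet's footprint is easy to check with the standard library alone. The sketch below prints how many UTF-8 bytes a few sample characters occupy; the sample characters are arbitrary choices for this example. The byte count is only a starting point, since learned merges can recombine bytes into fewer tokens when the sequence was frequent in the tokenizer's training data.

```python
# Sample characters: ASCII, accented Latin, CJK, emoji (arbitrary picks).
samples = ["a", "\u00e9", "日", "🍣"]

# Number of UTF-8 bytes each character occupies before any merges apply.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(repr(ch), n, ch.encode("utf-8").hex(" "))

sushi_hex = "🍣".encode("utf-8").hex(" ")
print(sushi_hex)   # f0 9f 8d a3
```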

SentencePiece (LLaMA, Mistral, Gemma)

SentencePiece (Kudo & Richardson, 2018, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing") is the tokenizer of choice for many non-OpenAI models. Unlike BPE tokenizers that first split on whitespace, it applies BPE (or a unigram language model variant) directly to the raw Unicode character stream and treats whitespace as an ordinary symbol, which makes it language-agnostic. A special marker ▁ (U+2581) denotes a space preceding a token:

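from sentencepiece import SentencePieceProcessor

# Placeholder path: any SentencePiece .model file, e.g. the tokenizer.model shipped with LLaMA 2
sp = SentencePieceProcessor(model_file="tokenizer.model")

# Exact pieces depend on the model; the outputs below are illustrative
sp.encode("Hello world", out_type=str)
# → ['▁Hello', '▁world']

sp.encode("unbelievable", out_type=str)
# → ['▁un', 'bel', 'iev', 'able']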
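The ▁ marker is what makes detokenization a simple string operation. The following is a simplified pure-Python sketch of the convention, not SentencePiece's actual implementation: it assumes the default behaviour of adding a leading marker to the input, ignores normalization, and uses function names invented for this example.

```python
SPACE_MARK = "\u2581"  # the "▁" character

def mark_spaces(text):
    """Add a leading marker and replace each space with the marker."""
    return SPACE_MARK + text.replace(" ", SPACE_MARK)

def detokenize(pieces):
    """Join pieces, turn markers back into spaces, drop the leading space."""
    text = "".join(pieces).replace(SPACE_MARK, " ")
    return text[1:] if text.startswith(" ") else text

print(mark_spaces("Hello world"))                    # ▁Hello▁world
print(detokenize(["▁Hello", "▁world"]))              # Hello world
print(detokenize(["▁un", "bel", "iev", "able"]))     # unbelievable
```

Because spaces are carried inside the pieces themselves, no language-specific rules are needed to decide where spaces go when the pieces are joined back together.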

LLaMA 1 and 2 use a 32k-token SentencePiece vocabulary. LLaMA 3 switched to a tiktoken-style byte-level BPE tokenizer with a vocabulary of roughly 128k tokens, large enough that common English words are almost always a single token, which shortens sequences.

Same text, different surface form

For plain ASCII, tiktoken and SentencePiece can look fairly similar. Unicode characters make the difference clearer. Here the SentencePiece row uses the actual tokenizer from google/gemma-3-1b-it:

GPT-2 BPE (5 tokens)

[I] [ love] [<20 f0 9f>] [<8d>] [<a3>]

SentencePiece, Gemma 3 (4 tokens)

[I] [▁love] [▁] [🍣]

The GPT-2 row above is the real gpt2 encoding. The sushi emoji is encoded through UTF-8 bytes, so inspecting the tokens one by one reveals byte fragments rather than a clean 🍣 token string. In the display above, those fragments are shown as exact hex byte chunks such as <20 f0 9f>. Gemma 3’s SentencePiece tokenizer stores Unicode text directly, so the emoji remains readable and the preceding space appears as its own ▁ token.
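To see why the individual GPT-2 fragments are not readable on their own, the sketch below rebuilds the emoji from the hex chunks shown in the GPT-2 row. It uses only the standard library; the variable names are illustrative.

```python
# Hex chunks copied from the GPT-2 row: three tokens covering " 🍣".
fragments = ["20 f0 9f", "8d", "a3"]
chunks = [bytes.fromhex(h) for h in fragments]

# The first chunk holds a space plus only half of the emoji's four bytes,
# so it is not valid UTF-8 by itself.
try:
    chunks[0].decode("utf-8")
    first_alone = "decoded"
except UnicodeDecodeError:
    first_alone = "invalid UTF-8 on its own"

# Concatenating the bytes of all three tokens restores the original text.
joined = b"".join(chunks).decode("utf-8")

print(first_alone)    # invalid UTF-8 on its own
print(repr(joined))   # ' 🍣'
```

This is why byte-level decoders concatenate the bytes of all tokens first and only then decode to text.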

What the vocabulary actually is

After training the tokenizer, the vocabulary is simply a mapping between integer IDs and token strings:

# Illustrative excerpt from a BPE vocabulary file (token string → integer ID)
{
    "<|endoftext|>": 0,
    "!": 1,
    '"': 2,
    "#": 3,
    ...
    " the": 262,
    ...
    "izable": 1891,
    ...
}

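The vocabulary is fixed before model training begins and does not change afterwards. Its size is a hyperparameter of the model, since it sets the dimensions of the embedding and output layers. The tokens themselves are learned when the tokenizer is trained, not during model training.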
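In practice the mapping is used in both directions: token string to ID when encoding, ID to token string when decoding. A minimal sketch, reusing the entries from the excerpt above as a toy vocabulary (the variable names are assumptions for this example):

```python
# Toy vocabulary built from the excerpt's entries.
token_to_id = {
    "<|endoftext|>": 0,
    "!": 1,
    '"': 2,
    "#": 3,
    " the": 262,
    "izable": 1891,
}

# Invert the mapping once so decoding is a plain lookup as well.
id_to_token = {i: t for t, i in token_to_id.items()}

ids = [token_to_id[t] for t in [" the", "!"]]
text = "".join(id_to_token[i] for i in ids)

print(ids)          # [262, 1]
print(repr(text))   # ' the!'
```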

The next step shows how a tokenizer uses this vocabulary to encode a string of text into a sequence of integer IDs.