Milestone A Phase 0 — Input pipeline Step 00

Subword tokenization

New concept: vocabulary construction

A language model operates on discrete symbols, not raw text. The first design decision is: what are the symbols?

Word-level tokenization splits on whitespace — “unbelievable” is one token. It works well for common words but fails on rare ones: any word not seen during training becomes [UNK], an unknown token that discards all meaning. A vocabulary large enough to cover all words is also impractically large.

Character-level tokenization avoids unknowns entirely — every string is representable — but forces the model to learn word structure from scratch. Sequences become very long (every character is a step), and training is expensive.

Subword tokenization sits in between. Common words get their own token; rare words are split into smaller pieces that the model has seen. “unbelievable” might become ["un", "believ", "able"]. Nothing is truly unknown, and the vocabulary stays compact.
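
A toy comparison of the three schemes (a sketch; the subword split is illustrative, since a real tokenizer learns its own segmentation from data):

text = "unbelievable"

# Word-level: one symbol per whitespace-separated word.
# A word never seen during training collapses to [UNK].
word_tokens = text.split()                    # ['unbelievable']

# Character-level: everything is representable, but sequences get long.
char_tokens = list(text)                      # 12 steps for a single word

# Subword-level: a rare word decomposes into pieces the model has seen.
subword_tokens = ["un", "believ", "able"]     # illustrative split

print(word_tokens, char_tokens, subword_tokens)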

Byte-Pair Encoding (BPE)

BPE (Sennrich et al., 2016, Neural Machine Translation of Rare Words with Subword Units; originally proposed for NMT, later adopted by GPT-2 and every subsequent OpenAI model) builds a vocabulary by iteratively merging the most frequent pair of adjacent symbols in a training corpus:

# Start with a character-level vocabulary
corpus = ["l o w </w>", "l o w e r </w>", "n e w e s t </w>", ...]

# Count adjacent pairs
pairs = count_pairs(corpus)   # ("e", "s") → 9,  ("e", "r") → 6, ...

# Merge the most frequent pair and repeat
# Round 1: merge ("e", "s") → "es"
# Round 2: merge ("es", "t") → "est"
# ...until vocabulary reaches target size

Each merge adds one new token to the vocabulary. Running 50,000 merges starting from a character alphabet produces a ~50k-token vocabulary — GPT-2’s size.
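
A minimal runnable sketch of that merge loop, using a toy corpus like the one in the BPE paper (the helper names are my own; production tokenizers add pre-tokenization, caching, and explicit tie-breaking rules):

from collections import Counter

# Toy corpus as word -> frequency; "</w>" marks the end of a word.
corpus = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

new_tokens = []
for _ in range(3):                    # a real run performs tens of thousands of merges
    best = count_pairs(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, best)
    new_tokens.append("".join(best))

print(new_tokens)                     # ['es', 'est', 'est</w>']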

Byte-level BPE (tiktoken / GPT-4)

Standard BPE can still fail on unusual Unicode. Byte-level BPE (introduced with GPT-2; Radford et al., 2019) solves this by treating the 256 UTF-8 byte values (0–255) as the base alphabet instead of Unicode characters. Every possible string is representable; there is no UNK token. tiktoken is OpenAI's open-source implementation: the cl100k_base encoding used by GPT-3.5/GPT-4 has a vocabulary of 100,277 tokens, and o200k_base, used by GPT-4o, has 200,019.
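
The base alphabet really is just bytes: any string reduces to values in 0–255 before a single merge applies. A quick illustration with plain Python:

text = "café 🤖"
print(list(text.encode("utf-8")))
# [99, 97, 102, 195, 169, 32, 240, 159, 164, 150]
# 'é' and the emoji expand to multi-byte sequences, but every byte is one of
# the 256 base symbols, so nothing can ever fall outside the vocabulary.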

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")

# The vocabulary is ~100k tokens built from byte-level BPE merges
print(enc.n_vocab)   # 100277

The trade-off: byte-level tokenization splits emoji and non-Latin scripts into more pieces than character-level schemes, making those sequences longer.
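
One way to see this is to compare token counts across scripts; the sample strings below are arbitrary, and the exact counts depend on the encoding, so the ratios rather than the numbers are the point:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ["the quick brown fox", "🤖🚀✨", "こんにちは世界"]:
    ids = enc.encode(s)
    print(f"{len(ids):3d} tokens for {len(s)} characters: {s!r}")
# English text packs several characters into each token; emoji and
# non-Latin scripts often cost more tokens per character.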

SentencePiece (LLaMA 2, Mistral, Gemma)

SentencePiece (Kudo & Richardson, 2018, SentencePiece: A simple and language independent subword tokenizer) is the tokenizer of choice for most non-OpenAI models. Unlike tokenizers that pre-segment on whitespace, it applies BPE (or a unigram language model variant) directly to the raw character stream, which makes it language-agnostic. A special marker, ▁ (U+2581), stands in for the space preceding a token:

from sentencepiece import SentencePieceProcessor
sp = SentencePieceProcessor(model_file="tokenizer.model")   # e.g. a LLaMA 2 / Mistral model file

sp.encode("Hello world", out_type=str)
# → ['▁Hello', '▁world']

sp.encode("unbelievable", out_type=str)
# → ['▁un', 'bel', 'iev', 'able']

LLaMA 2 and Mistral use a 32k-token SentencePiece vocabulary; Gemma uses a 256k-token one. (LLaMA 3 moved to a 128k-token, tiktoken-style byte-level BPE vocabulary instead.) A larger vocabulary means common English words are almost always a single token, which shortens sequences and improves throughput.

What the vocabulary actually is

After training the tokenizer, the vocabulary is simply a mapping between integer IDs and token strings:

# Excerpt from a BPE vocabulary file
{
    "<|endoftext|>": 0,
    "!": 1,
    '"': 2,
    "#": 3,
    ...
    " the": 262,
    ...
    "izable": 1891,
    ...
}
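
In memory this is nothing more exotic than a dictionary, plus its inverse for turning IDs back into strings. A small sketch (the entries below are made up for illustration):

# Made-up fragment of a vocabulary: token string -> integer ID.
vocab = {"<|endoftext|>": 0, " the": 262, "un": 517, "believ": 4920, "able": 481}
id_to_token = {i: t for t, i in vocab.items()}     # reverse mapping: ID -> string

print(vocab["able"])        # 481
print(id_to_token[262])     # ' the'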

The vocabulary is fixed before model training begins and never changes. Its size is a hyperparameter of the model architecture (it sets the width of the embedding table and the output layer), not something learned during training.

The next step shows how a tokenizer uses this vocabulary to encode a string of text into a sequence of integer IDs.