Encoding text
New concept: tokenization pipeline
The previous step built a vocabulary — a fixed mapping from token strings to integer IDs. This step uses that vocabulary to convert a raw string into a sequence of integers the model can actually process.
The encoding algorithm
Given a BPE vocabulary, encoding works by repeatedly merging the best available adjacent pair until no more learned merges apply. GPT-2 does not run this over the whole sentence as one stream. It first uses a regex to split the text into chunks, then runs byte-level BPE separately inside each chunk.1This is sometimes called maximal munch tokenization. The exact algorithm varies: tiktoken uses a regex presplit step to prevent merges across word boundaries, punctuation, and whitespace. SentencePiece uses the Viterbi algorithm over a unigram language model to find the most probable segmentation.
def encode(text: str, vocab: dict[str, int], merges: list[tuple]) -> list[int]:
# Start with individual characters (or bytes for byte-level BPE)
pieces = list(text)
# Apply merges in order
for left, right in merges:
i = 0
while i < len(pieces) - 1:
if pieces[i] == left and pieces[i + 1] == right:
pieces[i] = left + right
del pieces[i + 1]
else:
i += 1
return [vocab[p] for p in pieces]
The same string always produces the same token IDs — encoding is deterministic.2SentencePiece has an optional sample_encode mode that stochastically samples different segmentations of the same string during training. This regularization technique, called subword regularization (Kudo, 2018), can improve robustness on noisy text.
A worked GPT-2 example
The trace below shows the full encoding path for the running example
I see the dog chase the cat.. Each card starts from byte-level pieces inside
one regex chunk, then shows one row per merge until the final GPT-2 token is
formed. Spaces are shown as ␣ so the token boundaries stay visible.
Loading visualization…
Notice that tokens like ␣see, ␣the, ␣dog, ␣chase, and ␣cat keep their
leading space. In GPT-2-style tokenizers, that space is part of the token.
Special tokens
Every tokenizer reserves a small set of special tokens that are never produced by normal text encoding:
| Token | Typical ID | Purpose |
|---|---|---|
<|endoftext|> |
50256 | marks end of a document in GPT-2 |
<|pad|> |
varies | pads short sequences in a batch |
<s> / [BOS] |
1 | beginning-of-sequence (SentencePiece) |
</s> / [EOS] |
2 | end-of-sequence (SentencePiece) |
These are inserted by the tokenization pipeline based on context, not by the BPE algorithm. When you send a prompt to an LLM API, the serving infrastructure adds the appropriate special tokens before the model sees the input.
Why this matters
The token ID sequence is the only form in which a language model ever sees text. Every subsequent operation — embeddings, attention, feed-forward layers, output projection — operates on these integers or on vectors derived from them.
The length of this list (the token count) is what determines compute cost. Longer prompts require more memory and more compute, which is why token budgets matter in practice.
The next step shows how each integer ID is converted into a dense vector — the embedding lookup that translates a discrete symbol into a point in continuous space.