Encoding text
New concept: tokenization pipeline
The previous step built a vocabulary — a fixed mapping from token strings to integer IDs. This step uses that vocabulary to convert a raw string into a sequence of integers the model can actually process.
The encoding algorithm
Given a BPE vocabulary (a set of learned merges), encoding works greedily: scan the input left to right, match the longest token in the vocabulary, emit its ID, and advance past it.¹

¹ This is sometimes called maximal munch tokenization. The exact algorithm varies: tiktoken uses a regex pre-split step to prevent merges across word boundaries, punctuation, and whitespace, while SentencePiece uses the Viterbi algorithm over a unigram language model to find the most probable segmentation.
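Before looking at the merge-based routine below, the longest-match scan itself can be sketched directly over a flat vocabulary. This is only an illustration of maximal munch; the vocabulary and IDs are invented:

```python
def encode_greedy(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match (maximal munch) over a flat vocabulary."""
    ids: list[int] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocab entry matches
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers position {i}")
    return ids

vocab = {"the": 0, "t": 1, "h": 2, "e": 3, "re": 4, "r": 5}
print(encode_greedy("there", vocab))
# → [0, 4]   ("the" wins over "t"/"th", then "re")
```

Note that greedy matching is not always optimal: a unigram-LM tokenizer like SentencePiece may pick a different, more probable segmentation of the same string.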
```python
def encode(text: str, vocab: dict[str, int], merges: list[tuple[str, str]]) -> list[int]:
    # Start with individual characters (or bytes for byte-level BPE)
    pieces = list(text)
    # Apply merges in the order they were learned
    for left, right in merges:
        i = 0
        while i < len(pieces) - 1:
            if pieces[i] == left and pieces[i + 1] == right:
                pieces[i] = left + right
                del pieces[i + 1]
            else:
                i += 1
    return [vocab[p] for p in pieces]
```
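As a quick check, here is the merge loop applied to a toy example (the merges and vocabulary IDs are invented; the function is repeated so the snippet runs on its own):

```python
def encode(text: str, vocab: dict[str, int], merges: list[tuple[str, str]]) -> list[int]:
    pieces = list(text)
    for left, right in merges:
        i = 0
        while i < len(pieces) - 1:
            if pieces[i] == left and pieces[i + 1] == right:
                pieces[i] = left + right
                del pieces[i + 1]
            else:
                i += 1
    return [vocab[p] for p in pieces]

merges = [("l", "o"), ("lo", "w"), ("e", "r")]
vocab = {"low": 0, "er": 1}

# "lower" → ['l','o','w','e','r'] → ['lo','w','e','r'] → ['low','e','r'] → ['low','er']
print(encode("lower", vocab, merges))
# → [0, 1]
```

Each merge pass rewrites the piece list in place, so later merges can build on the results of earlier ones ("lo" + "w" only matches after "l" + "o" has been applied).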
The same string always produces the same token IDs — encoding is deterministic.²

² SentencePiece has an optional sample_encode mode that stochastically samples different segmentations of the same string during training. This regularization technique, called subword regularization (Kudo, 2018), can improve robustness on noisy text.
A concrete example with tiktoken
```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5 / GPT-4 encoding
text = "The quick brown fox"
ids = enc.encode(text)
print(ids)
# → [791, 4062, 14198, 39935]

# Decode back to verify the round trip
print(enc.decode(ids))
# → "The quick brown fox"

# Inspect individual tokens
for id_ in ids:
    print(repr(enc.decode([id_])))
# → 'The' ' quick' ' brown' ' fox'
```
Note the leading space on ' quick', ' brown', and ' fox'. BPE vocabularies include both the space-prefixed and non-prefixed versions of common words as distinct tokens; the space is part of the token, not a separator.
Special tokens
Every tokenizer reserves a small set of special tokens that are never produced by normal text encoding:
| Token | Typical ID | Purpose |
|---|---|---|
| `<\|endoftext\|>` | varies (50256 in GPT-2) | marks end of a document |
| `<\|pad\|>` | varies | pads short sequences in a batch |
| `<s>` / `[BOS]` | 1 | beginning-of-sequence (SentencePiece) |
| `</s>` / `[EOS]` | 2 | end-of-sequence (SentencePiece) |
These are inserted by the tokenization pipeline based on context, not by the BPE algorithm. When you send a prompt to an LLM API, the serving infrastructure adds the appropriate special tokens before the model sees the input.
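A minimal sketch of that pipeline step, using the token IDs from the example above (the BOS/EOS IDs and the helper name are invented for illustration):

```python
BOS_ID = 1  # hypothetical beginning-of-sequence ID
EOS_ID = 2  # hypothetical end-of-sequence ID

def add_special_tokens(ids: list[int]) -> list[int]:
    # Wrap the encoded prompt; the BPE step itself never emits these IDs
    return [BOS_ID] + ids + [EOS_ID]

print(add_special_tokens([791, 4062, 14198, 39935]))
# → [1, 791, 4062, 14198, 39935, 2]
```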
The output: a list of integers
After encoding, a string like “The quick brown fox jumps” becomes something like:
[791, 4062, 14198, 39935, 35308]
This is the only form in which a language model ever sees text. Every subsequent operation — embeddings, attention, feed-forward layers, output projection — operates on these integers or on vectors derived from them.
The length of this list (the token count) is what determines compute cost. Longer prompts require more memory and more compute, which is why token budgets matter in practice.
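As a sketch, a client might gate requests on token count before sending a prompt (the limit and helper name here are invented):

```python
MAX_TOKENS = 8192  # hypothetical context-window limit

def fits_budget(ids: list[int], max_tokens: int = MAX_TOKENS) -> bool:
    # Compute and memory scale with the number of token IDs, so gate on length
    return len(ids) <= max_tokens

print(fits_budget([791, 4062, 14198, 39935, 35308]))
# → True
```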
The next step shows how each integer ID is converted into a dense vector — the embedding lookup that translates a discrete symbol into a point in continuous space.