Milestone A Phase 0 — Input pipeline Step 02

Token embeddings

New concept: dense vector representation

Before a language model can do anything, it needs to convert raw text into numbers. The first step is the vocabulary: a fixed list of every token the model knows. Tokens can be words, subwords, or characters¹ — the exact scheme doesn’t matter yet. What matters is that every possible input token maps to a unique integer index.

¹ Modern models use Byte-Pair Encoding (Sennrich et al., 2016), which merges frequent character pairs iteratively until a target vocabulary size is reached. GPT-2 uses a 50,257-token BPE vocabulary; GPT-4 uses roughly 100k tokens.

vocab = ["[PAD]", "the", "quick", "brown", "fox", ...]
token_to_id = {token: i for i, token in enumerate(vocab)}

# "the" → 1,  "fox" → 4,  "dog" → 8

An integer index alone carries no semantic meaning — the number 4 doesn’t know it represents a small canid. So each index is mapped to a dense vector of learnable real numbers called an embedding. The model learns these vectors during training.

$$E \in \mathbb{R}^{|V| \times d}$$

where $|V|$ is the vocabulary size and $d$ is the embedding dimension. Looking up a token is a single row fetch from this matrix — just an array index.

import numpy as np

d = 8        # embedding dimension (tiny for illustration)
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d))   # random init; learned during training

def embed(token: str) -> np.ndarray:
    return E[token_to_id[token]]   # shape: (d,)
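
The same row fetch extends to whole sequences: NumPy integer-array indexing pulls one row of $E$ per id, yielding a $(\text{seq\_len}, d)$ matrix. A self-contained sketch (random $E$ stands in for trained weights):

```python
import numpy as np

vocab = ["[PAD]", "the", "quick", "brown", "fox"]
token_to_id = {token: i for i, token in enumerate(vocab)}

d = 8
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d))   # stand-in for learned embeddings

ids = np.array([token_to_id[t] for t in ["the", "quick", "brown", "fox"]])
X = E[ids]            # one row per token
print(X.shape)        # → (4, 8)
```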

The table below shows a toy vocabulary with $d = 8$; each row is a token’s embedding vector.

A few things to notice:

  • [PAD] is all zeros — a conventional choice for the padding token.
  • Semantically related tokens (“fox”, “dog”, “cat”) have similar embedding patterns. In a trained model this similarity is learned, not hand-crafted.²
  • Individual values are largely uninterpretable. Meaning emerges from relationships between vectors, not from any single number.

² Word2Vec (Mikolov et al., 2013) first demonstrated that trained embeddings encode semantic structure — the classic example: vec(king) − vec(man) + vec(woman) ≈ vec(queen). Modern LLMs exhibit this property at much larger scale.

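The "relationships between vectors" point can be made concrete with cosine similarity. With randomly initialized embeddings the scores are meaningless; in a trained model, related tokens score noticeably higher. A minimal sketch (the `cosine` helper is illustrative, not from any library):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# In a trained model, cosine(embed("fox"), embed("dog")) would exceed
# cosine(embed("fox"), embed("the")); random vectors won't show this.
u = np.array([1.0, 2.0, 0.0])
v = np.array([2.0, 4.0, 0.0])
print(cosine(u, v))   # ≈ 1.0 (parallel vectors)
```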
At this point we have a function $\text{embed}: \text{token} \to \mathbb{R}^d$, but no notion of position or context. The next step adds an output projection so the model can at least produce a probability distribution over the vocabulary.