Token embeddings — LLM Explainer

Integer token IDs from the previous step are merely indices into the vocabulary. They do not carry semantic meaning on their own. The model has no idea that token 3290 refers to a dog. To add meaning to these indices, the next step is an embedding lookup.

The embedding table is a learned matrix

$$ E \in \mathbb{R}^{|V| \times d} $$

with one row per token and one column per embedding dimension. If the current token ID is 3290, the model does not do anything fancy yet. It simply fetches row 3290 from that table. If the token ID at position $i$ is $t_i$, then:

$$ x_i = E[t_i] $$

The lookup is simple. The key idea is that the table itself is learned during training, so these row vectors can encode useful structure.
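The lookup can be sketched in a few lines of plain Python. This is a toy illustration only: the table has 8 rows and 4 columns, random values stand in for learned weights, and the names `make_embedding_table` and `embed` are made up for this example rather than taken from any library.

```python
import random

def make_embedding_table(vocab_size, d, seed=0):
    """Build a toy |V| x d table; random values stand in for learned weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Embedding lookup: x_i = E[t_i], one row fetched per token ID."""
    return [table[t] for t in token_ids]

E = make_embedding_table(vocab_size=8, d=4)
token_ids = [3, 5, 3]        # hypothetical IDs; token 3 appears twice
X = embed(token_ids, E)

print(len(X), len(X[0]))     # 3 4  (sequence length, embedding dimension)
print(X[0] == X[2])          # True (the same ID always fetches the same row)
```

In a real model the table would come from a trained checkpoint, but the indexing step itself would look much the same.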

[Interactive visualization: GPT-2 token embedding lookup]

This visualization uses real GPT-2 token embeddings. Click a token in the sentence to inspect its embedding values. Notice how a given token always retrieves the same row from the embedding table. What changes at this step is the type of representation: we move from symbolic IDs to dense learned vectors.

Individual coordinates are still not very meaningful on their own. The main signal lives in the geometry of the whole embedding space: which vectors point in similar directions, which ones cluster together, and which relationships are easy to express linearly.

A historical aside

Here we replicate a classic demo using the GPT-2 embedding vectors.¹

¹ This example comes from Word2Vec: Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space" (2013).

Some semantic relationships show up as simple vector arithmetic:

$$ \mathrm{vec}(\text{king}) - \mathrm{vec}(\text{man}) + \mathrm{vec}(\text{woman}) \approx \mathrm{vec}(\text{queen}) $$

Using the real GPT-2 rows for king, man, woman, and queen, the query vector king - man + woman lands close to queen: queen is the top match at a cosine similarity of 0.71, well above the random baselines (0.21 to 0.29). We show both cosine similarity and dot product to capture this geometric relationship.²

² Cosine similarity compares direction while ignoring vector length. Dot product combines both direction and magnitude information.

Top matches

Token	Cosine	Dot
`queen`	0.71	11.2
`princess`	0.60	9.4
`Queen`	0.60	8.7

Random baselines

Token	Cosine	Dot
`Either`	0.21	3.1
`Alpine`	0.23	3.8
`pillars`	0.29	4.6

This geometry is not just a Word2Vec curiosity. Even in GPT-2, token embeddings already capture some high-level semantic structure. Later, when looking at output projections, we will come back to why dot product matters more to the model than cosine similarity.
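The analogy query and the two similarity measures can be sketched as follows. The vectors here are hand-made 3-dimensional toys chosen so the arithmetic is easy to follow. They are not GPT-2 rows, so the printed scores will not match the tables above. The helper names `dot` and `cosine` are illustrative.

```python
import math

def dot(a, b):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product of the length-normalized vectors."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

# Hand-made toy vectors (assumption: NOT real GPT-2 embeddings).
toy = {
    "king":    [1.0, 1.0, 0.0],
    "man":     [0.0, 1.0, 0.0],
    "woman":   [0.0, 0.0, 1.0],
    "queen":   [0.9, 0.1, 1.0],
    "pillars": [0.1, 0.2, -0.1],
}

# query = vec(king) - vec(man) + vec(woman)
query = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Rank every token in the toy vocabulary by cosine similarity to the query.
ranked = sorted(toy, key=lambda tok: cosine(query, toy[tok]), reverse=True)
for tok in ranked:
    print(f"{tok:8s} cosine={cosine(query, toy[tok]):.2f} dot={dot(query, toy[tok]):.2f}")

# Doubling a vector's length leaves cosine unchanged but doubles the dot product.
doubled = [2 * x for x in toy["queen"]]
print(math.isclose(cosine(query, doubled), cosine(query, toy["queen"])))
print(math.isclose(dot(query, doubled), 2 * dot(query, toy["queen"])))
```

The last two lines illustrate the difference between the measures: rescaling a vector changes its dot product with the query but not its cosine similarity.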

Embedding tables can be large

The lookup operation is simple, but the table itself is not small. The number of parameters in the input embedding table is

$$ \lvert V \rvert \times d $$

that is, one learned $d$-dimensional vector for every vocabulary item. For smaller models, that can be a surprisingly large fraction of the total parameter count.

Model	Vocab size	Input embedding table (params, |V| × d)	Share of total params
GPT-2 Small	50,257	38.6M (50,257 × 768)	33.0%
Gemma 3 270M	262,144	167.8M (262,144 × 640)	62.1%
Qwen3 1.7B	151,936	311.2M (151,936 × 2,048)	18.3%
Ministral 8B	131,072	536.9M (131,072 × 4,096)	6.7%
DeepSeek V3.2 Speciale	129,280	926.7M (129,280 × 7,168)	0.14%

These counts refer to the input token embedding table only. The absolute size of the table keeps growing with larger vocabularies and hidden sizes, but its share of the total parameter count is most noticeable in smaller models.
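The table sizes above follow directly from $\lvert V \rvert \times d$. As a quick check, the short script below recomputes them from the vocabulary sizes and embedding dimensions listed in the table. The function name `embedding_params` is illustrative.

```python
def embedding_params(vocab_size, d):
    """Parameter count of a |V| x d input embedding table."""
    return vocab_size * d

# (model, vocab size |V|, embedding dimension d), taken from the table above
models = [
    ("GPT-2 Small", 50_257, 768),
    ("Gemma 3 270M", 262_144, 640),
    ("Qwen3 1.7B", 151_936, 2_048),
    ("Ministral 8B", 131_072, 4_096),
    ("DeepSeek V3.2 Speciale", 129_280, 7_168),
]

counts = {name: embedding_params(v, d) for name, v, d in models}
for name, n in counts.items():
    print(f"{name}: {n:,} ({n / 1e6:.1f}M)")
# First line printed: GPT-2 Small: 38,597,376 (38.6M)
```

Dividing any of these counts by a model's total parameter count gives its share of the total. The totals are not listed here, so that step is left out of the sketch.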

What’s next

At this point, each token in the sequence has a learned vector. The next step adds position information so the model can tell where those tokens occur.