Pedagogical LLM Tiny causal LM Step 02

Token embeddings

New concept: dense vector representation

Integer token IDs from the previous step are merely indices into the vocabulary. They do not carry semantic meaning on their own. The model has no idea that token 3290 refers to a dog. To add meaning to these indices, the next step is an embedding lookup.

The embedding table is a learned matrix

$$ E \in \mathbb{R}^{|V| \times d} $$

with one row per token and one column per embedding dimension. If the current token ID is 3290, the model does not do anything fancy yet. It simply fetches row 3290 from that table. If the token ID at position $i$ is $t_i$, then:

$$ x_i = E[t_i] $$

The lookup is simple. The key idea is that the table itself is learned during training, so these row vectors can encode useful structure.
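As a minimal sketch of the lookup $x_i = E[t_i]$, the snippet below builds a tiny table and indexes into it. The sizes, the random initialization, the token IDs, and the helper name `embed` are all assumptions made for illustration; a real model would load trained weights instead.

```python
import random

# Toy sizes chosen for illustration only; GPT-2 Small uses |V| = 50,257 and d = 768.
VOCAB_SIZE = 10
EMBED_DIM = 4

rng = random.Random(0)

# E has one row per token ID and one column per embedding dimension.
# Random values stand in for weights that would normally be learned in training.
E = [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Return x_i = E[t_i] for every token ID t_i in the sequence."""
    return [E[t] for t in token_ids]

token_ids = [3, 7, 3]        # hypothetical IDs; token 3 appears twice
x = embed(token_ids)

print(len(x), len(x[0]))     # sequence length and embedding dimension
print(x[0] == x[2])          # the same ID retrieves the same row
```

Plain lists keep the sketch dependency-free. In practice the table would be a single array or tensor, and the lookup would be one indexing operation over the whole sequence.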

The interactive visualization for this step uses real GPT-2 token embeddings and lets you inspect the embedding values of each token in a sentence. A given token always retrieves the same row from the embedding table. What changes at this step is the type of representation: we move from symbolic IDs to dense learned vectors.

Individual coordinates are still not very meaningful on their own. The main signal lives in the geometry of the whole embedding space: which vectors point in similar directions, which ones cluster together, and which relationships are easy to express linearly.

A historical aside

Here we replicate a classic demo using the GPT-2 embedding vectors. [1]

[1] This example comes from Word2Vec: Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space" (2013).

Some semantic relationships show up as simple vector arithmetic:

$$ \mathrm{vec}(\text{king}) - \mathrm{vec}(\text{man}) + \mathrm{vec}(\text{woman}) \approx \mathrm{vec}(\text{queen}) $$

Using the real GPT-2 rows for king, man, woman, and queen, the query vector king - man + woman lands very close to queen. We show both cosine similarity and dot product to capture this geometric relationship. [2]

[2] Cosine similarity compares direction while ignoring vector length. Dot product combines both direction and magnitude information.

Top matches

| Token | Cosine | Dot |
|---|---|---|
| queen | 0.71 | 11.2 |
| princess | 0.60 | 9.4 |
| Queen | 0.60 | 8.7 |

Random baselines

| Token | Cosine | Dot |
|---|---|---|
| Either | 0.21 | 3.1 |
| Alpine | 0.23 | 3.8 |
| pillars | 0.29 | 4.6 |
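The two scores reported above can be sketched in a few lines. The vectors below are made-up 3-dimensional toys, not GPT-2 rows, so the numbers will not match the tables. Excluding the three query words from the candidates is an assumption of this sketch, since the source does not say how the ranking was filtered.

```python
import math

def dot(u, v):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: the dot product of the length-normalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Made-up vectors for illustration only; these are NOT real embedding rows.
toy = {
    "king":     [0.9, 0.8, 0.1],
    "man":      [0.1, 0.9, 0.1],
    "woman":    [0.1, 0.1, 0.9],
    "queen":    [0.9, 0.1, 0.8],
    "princess": [0.6, 0.2, 0.5],
    "pillars":  [0.3, 0.4, 0.2],
}

# query = king - man + woman, computed coordinate by coordinate.
query = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Assumption: the query words themselves are left out of the ranking.
candidates = [t for t in toy if t not in ("king", "man", "woman")]
ranked = sorted(candidates, key=lambda t: cosine(query, toy[t]), reverse=True)

for token in ranked:
    c = cosine(query, toy[token])
    d = dot(query, toy[token])
    print(f"{token:10s} cosine={c:.2f} dot={d:.2f}")

# Doubling a vector's length leaves its cosine unchanged but doubles its dot product.
doubled = [2 * a for a in toy["queen"]]
print(cosine(query, doubled), dot(query, doubled))
```

With these toy values, queen ranks first by cosine. The final two lines illustrate the difference between the metrics: rescaling a vector changes its dot product with the query but not its cosine similarity.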

This geometry is not just a Word2Vec curiosity. Even in GPT-2, token embeddings already capture some high-level semantic structure. Later, when looking at output projections, we will come back to why dot product matters more to the model than cosine similarity.

Embedding tables can be large

The lookup operation is simple, but the table itself is not small. The number of parameters in the input embedding table is

$$ \lvert V \rvert \times d $$

one learned vector for every vocabulary item. For smaller models, that can be a surprisingly large fraction of the total parameter count.

| Model | Vocab size | Input embedding table | Share of total params |
|---|---|---|---|
| GPT-2 Small | 50,257 | 38.6M (50,257 × 768) | 33.0% |
| Gemma 3 270M | 262,144 | 167.8M (262,144 × 640) | 62.1% |
| Qwen3 1.7B | 151,936 | 311.2M (151,936 × 2,048) | 18.3% |
| Ministral 8B | 131,072 | 536.9M (131,072 × 4,096) | 6.7% |
| DeepSeek V3.2 Speciale | 129,280 | 926.7M (129,280 × 7,168) | 0.14% |

These counts refer to the input token embedding table only. The absolute table size keeps growing with larger vocabularies and hidden sizes, but its share of the total parameter count is most noticeable in smaller models.
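The table sizes above follow directly from $\lvert V \rvert \times d$. As a quick check, the sketch below recomputes them from the vocabulary sizes and embedding dimensions listed in the table; the helper name `embedding_params` is hypothetical.

```python
# (vocab size, embedding dimension) pairs copied from the table above.
tables = {
    "GPT-2 Small": (50_257, 768),
    "Gemma 3 270M": (262_144, 640),
    "Qwen3 1.7B": (151_936, 2_048),
    "Ministral 8B": (131_072, 4_096),
    "DeepSeek V3.2 Speciale": (129_280, 7_168),
}

def embedding_params(vocab_size, dim):
    """Parameter count of the input embedding table: one d-dimensional row per token."""
    return vocab_size * dim

for name, (vocab_size, dim) in tables.items():
    n = embedding_params(vocab_size, dim)
    print(f"{name}: {n:,} parameters (~{n / 1e6:.1f}M)")
```

Dividing each count by a model's total parameter count gives the share column. The totals are not listed in the table, so they are left out of this sketch.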

What’s next

At this point, each token in the sequence has a learned vector. The next step adds position information so the model can tell where those tokens occur.