Token embeddings
New concept: dense vector representation
Integer token IDs from the previous step are merely indices into the vocabulary. They do not carry semantic meaning on their own. The model has no idea that token 3290 refers to a dog. To add meaning to these indices, the next step is an embedding lookup.
The embedding table is a learned matrix
$$ E \in \mathbb{R}^{|V| \times d} $$
with one row per vocabulary token and one column per embedding dimension, where $|V|$ is the vocabulary size and $d$ is the embedding width. If the current
token ID is 3290, the model does not do anything fancy yet. It simply fetches
row 3290 from that table. If the token ID at position $i$ is $t_i$, then:
$$ x_i = E[t_i] $$
The lookup is simple. The key idea is that the table itself is learned during training, so these row vectors can encode useful structure.
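A minimal sketch of the lookup, using plain Python lists in place of a real tensor library. The tiny vocabulary size, the embedding width, the random initialization, and the names `make_embedding_table`, `embed`, and `one_hot_lookup` are all assumptions for illustration. In a real model the rows come from training, not from a random number generator.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """Build a stand-in for E: vocab_size rows, each a list of dim floats."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]

def embed(table, token_ids):
    """Embedding lookup: x_i = E[t_i] for every position i."""
    return [table[t] for t in token_ids]

def one_hot_lookup(table, token_id):
    """The same lookup written as a one-hot vector times E (much slower)."""
    rows, dim = len(table), len(table[0])
    one_hot = [1.0 if r == token_id else 0.0 for r in range(rows)]
    return [sum(one_hot[r] * table[r][c] for r in range(rows)) for c in range(dim)]

# Hypothetical toy sizes: 10 vocabulary entries, 4 dimensions each.
E = make_embedding_table(vocab_size=10, dim=4)
token_ids = [3, 7, 3]  # made-up token IDs; ID 3 appears twice
x = embed(E, token_ids)

print(len(x), len(x[0]))             # sequence length, embedding width
print(x[0] == x[2])                  # the same ID always fetches the same row
print(one_hot_lookup(E, 3) == E[3])  # lookup matches the one-hot product
```

Indexing a row is equivalent to multiplying a one-hot vector by $E$, but it avoids touching the other $|V| - 1$ rows, which is why implementations typically treat the step as a table lookup.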
[Interactive visualization: GPT-2 token embeddings for the tokens of a sentence]
This visualization uses real GPT-2 token embeddings. Click a token in the sentence to inspect its embedding values. Notice how a given token always retrieves the same row from the embedding table. What changes at this step is the type of representation: we move from symbolic IDs to dense learned vectors.
Individual coordinates are still not very meaningful on their own. The main signal lives in the geometry of the whole embedding space: which vectors point in similar directions, which ones cluster together, and which relationships are easy to express linearly.
A historical aside
Here we replicate a classic demo using the GPT-2 embedding vectors. (The example comes from Word2Vec: Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space", 2013.)
Some semantic relationships show up as simple vector arithmetic:
$$ \mathrm{vec}(\text{king}) - \mathrm{vec}(\text{man}) + \mathrm{vec}(\text{woman}) \approx \mathrm{vec}(\text{queen}) $$
Using the real GPT-2 rows for king, man, woman, and queen, the query vector king - man + woman lands very close to queen. We show both cosine similarity and dot product to capture this geometric relationship. (Cosine similarity compares direction while ignoring vector length. Dot product combines both direction and magnitude information.)
This geometry is not just a Word2Vec curiosity. Even in GPT-2, token embeddings already capture some high-level semantic structure. Later, when looking at output projections, we will come back to why dot product matters more to the model than cosine similarity.
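The analogy arithmetic and the two similarity measures can be sketched with hand-made toy vectors. The 3-dimensional vectors below are invented for illustration and are not GPT-2 rows, and the helper names `dot`, `cosine`, and `analogy` are hypothetical. Excluding the three input words from the candidate set is a common convention in analogy demos and is assumed here.

```python
import math

def dot(u, v):
    """Dot product: sensitive to both direction and vector length."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: the dot product of the length-normalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def analogy(a, b, c):
    """Return the query vector a - b + c, computed elementwise."""
    return [x - y + z for x, y, z in zip(a, b, c)]

# Invented toy vectors (NOT real embeddings).
# Axis 0 loosely means "royalty", axis 1 "male", axis 2 "female".
toy = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.8],
    "apple": [0.0, 0.2, 0.2],
}

query = analogy(toy["king"], toy["man"], toy["woman"])

# Rank the remaining words by cosine similarity to the query.
candidates = [w for w in toy if w not in ("king", "man", "woman")]
for w in candidates:
    print(w, round(cosine(query, toy[w]), 3), round(dot(query, toy[w]), 3))
best = max(candidates, key=lambda w: cosine(query, toy[w]))
print("closest:", best)

# Doubling a vector's length leaves cosine unchanged but doubles the dot product.
scaled = [2 * x for x in toy["queen"]]
print(round(cosine(query, scaled), 3), round(dot(query, scaled), 3))
```

With real embeddings the same steps apply, with the candidate set being the whole vocabulary instead of two words.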
Embedding tables can be large
The lookup operation is simple, but the table itself is not small. The number of parameters in the input embedding table is
$$ \lvert V \rvert \times d $$
one learned vector for every vocabulary item. For smaller models, that can be a surprisingly large fraction of the total parameter count.
| Model | Vocab size | Input embedding table | Share of total params |
|---|---|---|---|
| GPT-2 Small | 50,257 | 38.6M | 33.0% |
| Gemma 3 270M | 262,144 | 167.8M | 62.1% |
| Qwen3 1.7B | 151,936 | 311.2M | 18.3% |
| Ministral 8B | 131,072 | 536.9M | 6.7% |
| DeepSeek V3.2 Speciale | 129,280 | 926.7M | 0.14% |
These counts refer to the input token embedding table only. The table's absolute size keeps growing with larger vocabularies and hidden sizes, but its share of the total parameter count is most noticeable in smaller models.
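The table sizes follow directly from $\lvert V \rvert \times d$. In the sketch below, the vocabulary sizes are taken from the table, while the embedding widths `d` are assumptions: they are not stated in the text and were chosen because they reproduce the listed counts. The function name `embedding_params` is also hypothetical.

```python
def embedding_params(vocab_size, dim):
    """Number of parameters in an input embedding table: |V| * d."""
    return vocab_size * dim

# (vocab size from the table, assumed embedding width d)
configs = {
    "GPT-2 Small": (50_257, 768),
    "Gemma 3 270M": (262_144, 640),
    "Qwen3 1.7B": (151_936, 2_048),
    "Ministral 8B": (131_072, 4_096),
    "DeepSeek V3.2 Speciale": (129_280, 7_168),
}

for name, (vocab_size, dim) in configs.items():
    n = embedding_params(vocab_size, dim)
    print(f"{name}: {n:,} parameters (~{n / 1e6:.1f}M)")
```

Dividing each count by a model's total parameter count would give the share column; the totals are not listed in the text, so that step is left out here.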
What’s next
At this point, each token in the sequence has a learned vector. The next step adds position information so the model can tell where those tokens occur.