Position embeddings
New concept: position information
The embedding matrix $X$ tells the model which tokens are present, but not where they appear. If the same token occurs twice, the two rows are initially identical. To inject word order, we add a position embedding to each row:
$$X'_i = X_i + PE_i$$
where $PE \in \mathbb{R}^{n \times d}$ is a matrix with one row per position. This addition is the key idea:
- token embeddings say what token is present
- position embeddings say where it appears
- the sum carries both pieces of information forward
GPT-2 uses learned absolute position embeddings: a trainable lookup table indexed by position, just as the token embedding matrix is indexed by vocabulary token.

By contrast, the original transformer paper (Vaswani et al., 2017) used fixed sinusoidal position embeddings instead of a learned table. That historical variant is useful for intuition, but GPT-2 (Radford et al., 2019) and most later GPT-style models switched to learned position rows.
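As a rough illustration of the lookup-and-add step, here is a minimal NumPy sketch. The table sizes, the random values, and the names `token_table`, `position_table`, and `token_ids` are assumptions made for this example; in a trained model both tables would hold learned parameters rather than random numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for readability; these are not GPT-2's real dimensions.
vocab_size, max_positions, d = 10, 8, 4

# Random stand-ins for the two trainable lookup tables.
token_table = rng.normal(size=(vocab_size, d))        # one row per vocabulary token
position_table = rng.normal(size=(max_positions, d))  # one row per position

# Hypothetical token ids for a sequence of length n = 4.
token_ids = np.array([3, 7, 1, 7])
n = len(token_ids)

X = token_table[token_ids]          # token lookup, shape (n, d)
PE = position_table[np.arange(n)]   # position lookup, shape (n, d)
X_prime = X + PE                    # row i is X_i + PE_i

print(X_prime.shape)
```

Both lookups are plain row indexing: the token table is indexed by token id, the position table by the position index 0, 1, ..., n-1.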
The visualization below shows this addition for our canonical example. The token rows come from the same GPT-2-derived embedding slice used in the previous step, and the position rows come from GPT-2’s learned position embedding table on those same sampled dimensions. The three views show the token rows, the position rows, and their sum; switch between them to compare directly.
[Interactive visualization: token embeddings, position embeddings, and their sum]
On the Token embeddings tab, rows 2 and 5 are both " the", so they are identical. On the Position embeddings tab, every row is different. After addition, those two " the" rows diverge, so the model can distinguish the earlier occurrence from the later one.
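The same effect can be reproduced with made-up numbers. The sketch below assumes 0-based row indices and uses random tables and hypothetical token ids, with the same id placed at rows 2 and 5 to mirror the two " the" rows; none of the values come from GPT-2.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# Random stand-ins for the learned tables (illustrative values only).
token_table = rng.normal(size=(6, d))
position_table = rng.normal(size=(8, d))

# Hypothetical ids: token 2 appears at rows 2 and 5 (0-based).
token_ids = np.array([0, 1, 2, 3, 4, 2])
n = len(token_ids)

X = token_table[token_ids]
X_prime = X + position_table[:n]

# Before the addition the two rows are the same lookup result.
same_before = np.array_equal(X[2], X[5])
# After the addition they differ by exactly PE_5 - PE_2.
same_after = np.allclose(X_prime[2], X_prime[5])
gap = X_prime[5] - X_prime[2]

print(same_before, same_after)
```

Because the token parts cancel, `gap` equals `position_table[5] - position_table[2]`: the only thing separating the two rows is their position.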
Later we will revisit position information through RoPE (Rotary Position Embedding), which encodes position inside the attention mechanism rather than by adding an input-side position row. RoPE (Su et al., 2021) is used by LLaMA, Mistral, and most modern open-weight models. It is covered in step 20.
With position embeddings in place, the input pipeline is complete:
$$\text{token} \to \text{embedding lookup} \to + \text{position} \to X' \in \mathbb{R}^{n \times d}$$
The matrix $X'$ is the input to the first transformer layer. The next step in the model is the attention mechanism that operates on it.
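The pipeline above can be wrapped in a single function. This is a sketch under assumptions, not GPT-2's actual code: `embed_inputs` and its arguments are hypothetical names, and the tables are random. The length check reflects one consequence of a learned table: it has a fixed number of rows, so a sequence longer than the table cannot be indexed.

```python
import numpy as np

def embed_inputs(token_ids, token_table, position_table):
    """Map token ids to X' = X + PE with shape (n, d)."""
    token_ids = np.asarray(token_ids)
    n = token_ids.shape[0]
    # A learned position table only covers positions 0 .. rows - 1.
    if n > position_table.shape[0]:
        raise ValueError("sequence is longer than the position table")
    X = token_table[token_ids]   # embedding lookup
    PE = position_table[:n]      # first n position rows
    return X + PE

# Toy tables with random values (assumed sizes, not GPT-2's).
rng = np.random.default_rng(2)
token_table = rng.normal(size=(10, 4))
position_table = rng.normal(size=(8, 4))

X_prime = embed_inputs([3, 7, 1, 7], token_table, position_table)
print(X_prime.shape)
```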