Milestone A Phase 0 — Input pipeline Step 04

Position embeddings

New concept: position information

The embedding matrix $X$ treats all positions identically. To give the model a sense of order, we add a position embedding to each row:

$$X'_i = X_i + PE_i$$

where $PE \in \mathbb{R}^{n \times d}$ is a matrix of position encodings, one row per position. After this addition, tokens that are identical in vocabulary become distinct in representation because their position vectors differ.
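The addition is worth seeing concretely. In this sketch the embedding and position vectors are random stand-ins for real values; the point is the shapes and the sum:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Two occurrences of the same token share one embedding row.
emb = rng.normal(size=d)
X = np.stack([emb, emb])             # both rows: the same token
PE = rng.normal(size=(2, d))         # one (here random) vector per position
X_prime = X + PE                     # X'_i = X_i + PE_i

assert np.allclose(X[0], X[1])                   # identical before
assert not np.allclose(X_prime[0], X_prime[1])   # distinct after
```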

Sinusoidal position embeddings

The original transformer paper (Vaswani et al., 2017) uses a fixed sinusoidal formula:[^1]

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

[^1]: GPT-2 (Radford et al., 2019) switched to learned absolute embeddings: a trainable lookup table with the same semantics but no fixed formula. nanoGPT follows this approach.

Each pair of adjacent dimensions $(2i, 2i+1)$ encodes position using a sine/cosine pair at a specific frequency. Low-numbered dimensions oscillate rapidly (high frequency), while high-numbered dimensions oscillate slowly (low frequency). Together they create a unique fingerprint for every position.

import numpy as np

def sinusoidal_pe(seq_len: int, d: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1)
    i   = np.arange(d // 2)[None, :]        # (1, d/2)
    div = np.power(10000, 2 * i / d)        # per-pair frequency divisor
    pe  = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(pos / div)         # even dimensions: sine
    pe[:, 1::2] = np.cos(pos / div)         # odd dimensions: cosine
    return pe
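A quick sanity check of the function's properties (the definition is repeated so the snippet runs on its own):

```python
import numpy as np

# Repeats sinusoidal_pe from above so this snippet is self-contained.
def sinusoidal_pe(seq_len: int, d: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]
    i   = np.arange(d // 2)[None, :]
    div = np.power(10000, 2 * i / d)
    pe  = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(pos / div)
    pe[:, 1::2] = np.cos(pos / div)
    return pe

pe = sinusoidal_pe(7, 64)
print(pe.shape)  # (7, 64)

# Position 0 is sin(0) = 0 on even dims and cos(0) = 1 on odd dims.
assert np.allclose(pe[0, 0::2], 0.0)
assert np.allclose(pe[0, 1::2], 1.0)

# Every position gets a distinct fingerprint.
assert all(not np.allclose(pe[a], pe[b])
           for a in range(7) for b in range(a + 1, 7))

# All values stay in [-1, 1], so the addition cannot swamp the embeddings.
assert np.abs(pe).max() <= 1.0
```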

The visualisation below shows all three matrices for our seven-token sequence. Switch between tabs to see how the position encoding modifies the token embeddings.

On the Token embeddings tab, rows 0 and 4 (both “the”) are identical. On the Position embeddings tab, each row is unique — that is the point. On the Sum tab the two “the” rows finally diverge: the model can now distinguish the first “the” from the second.

Learned vs. fixed embeddings

GPT-style models (GPT-2, GPT-3, nanoGPT) typically use learned position embeddings: a trainable matrix $W_{pos} \in \mathbb{R}^{n_{max} \times d}$ instead of a fixed formula. The mechanism is identical to what we saw here; only the values change from hand-crafted to trained.
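The learned variant can be sketched in a few lines. The random initialisation stands in for trained values, and the 0.02 scale is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_max, d = 1024, 64

# A trainable lookup table; in a real model this matrix is updated
# by gradient descent like any other weight.
W_pos = rng.normal(scale=0.02, size=(n_max, d))

seq_len = 7
X = rng.normal(size=(seq_len, d))      # token embeddings (stand-in)
X_prime = X + W_pos[:seq_len]          # the same addition as the fixed version
assert X_prime.shape == (seq_len, d)
```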

A later step will cover RoPE (Rotary Position Embedding), which encodes position in the attention weights rather than the input vectors and handles sequences longer than those seen during training.[^2]

[^2]: RoPE (Su et al., 2021) is used by LLaMA, Mistral, and most modern open-weight models. It is covered in step 20.


With position embeddings in place, we have the complete input pipeline:

$$\text{token} \to \text{embedding lookup} \to +\,\text{position} \to X' \in \mathbb{R}^{n \times d}$$
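The whole pipeline fits in a few lines. Token ids and both weight matrices below are hypothetical random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d, n_max = 50, 16, 32

W_emb = rng.normal(size=(vocab_size, d))   # token embedding table
W_pos = rng.normal(size=(n_max, d))        # position embedding table

tokens = np.array([3, 14, 7, 9, 3, 21, 7])   # hypothetical ids; note the repeats
X = W_emb[tokens]                            # embedding lookup -> (7, d)
X_prime = X + W_pos[:len(tokens)]            # add position     -> (7, d)

# Repeated tokens (id 3 at positions 0 and 4) now have distinct rows.
assert np.allclose(X[0], X[4])
assert not np.allclose(X_prime[0], X_prime[4])
```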

The matrix $X'$ is the input to the first transformer layer. Milestone B will introduce the attention mechanism that operates on it.