Sequence input
New concept: context as a token matrix
So far the model takes a single token and predicts the next one. Real language models take a sequence of tokens and predict the next one after all of them. This gives the model context — the prior tokens it can attend to.
A sequence of $n$ tokens¹ $[t_0, t_1, \ldots, t_{n-1}]$ is embedded into a matrix by looking up each token independently:

$$X \in \mathbb{R}^{n \times d}, \quad X_i = E[t_i]$$

Each row of $X$ is the $d$-dimensional embedding of the token at position $i$.

¹ GPT-2 supports sequences up to n = 1,024 tokens; GPT-4 supports up to 128k. Extending the context window is one of the central engineering challenges in modern LLMs, since attention cost grows quadratically with $n$.
import numpy as np

def embed_sequence(tokens: list[str]) -> np.ndarray:
    # Look up each token's d-dimensional embedding row; E and token_to_id
    # are the embedding table and vocabulary mapping defined earlier.
    return np.stack([E[token_to_id[t]] for t in tokens])  # shape: (seq_len, d)
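As a concrete check, here is a self-contained toy setup (a tiny hypothetical vocabulary and a random embedding table, not the article's actual $E$) showing the resulting shape:

```python
import numpy as np

# Hypothetical toy vocabulary and embedding table (d = 4), for illustration only.
vocab = ["the", "dog", "bit", "man"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 4))  # (vocab_size, d)

def embed_sequence(tokens: list[str]) -> np.ndarray:
    return np.stack([E[token_to_id[t]] for t in tokens])

X = embed_sequence(["the", "dog", "bit", "the", "man"])
print(X.shape)  # (5, 4): one d-dimensional row per token
```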
The matrix $X$ is the input to every subsequent layer. The model reads all rows simultaneously — this is different from recurrent models, which process tokens one by one.² Processing all positions in parallel is what makes GPU-scale training practical.

² RNNs (Elman, 1990) and LSTMs (Hochreiter & Schmidhuber, 1997) pass a hidden state forward through each position serially, so positions cannot be processed in parallel during training.
[Interactive visualization: the sequence embedding matrix $X$, one row per token]
Notice the two highlighted rows (positions 0 and 4): both are the token “the”, so their embeddings are identical. The model cannot tell them apart at this stage — it has no notion of where in the sequence a token appears.
This is a real problem. “The dog bit the man” and “the man bit the dog” use exactly the same tokens; without position information the two sentences produce the same set of embedding rows, and the model has no way to exploit their order. Word order carries meaning, but the embedding lookup discards it.
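A quick sketch makes both failures concrete (again with a toy vocabulary and random embedding table, assumed for illustration): repeated tokens get identical rows, and reordering the sentence merely permutes the rows of $X$.

```python
import numpy as np

vocab = ["the", "dog", "bit", "man"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 8))  # toy embedding table, d = 8

def embed_sequence(tokens):
    return np.stack([E[token_to_id[t]] for t in tokens])

a = embed_sequence("the dog bit the man".split())
b = embed_sequence("the man bit the dog".split())

# The two "the" tokens (positions 0 and 3 here) get identical rows:
print(np.array_equal(a[0], a[3]))  # True

# Reordering the sentence only permutes the rows; column-wise sorting
# shows both sentences contain exactly the same multiset of rows:
print(np.array_equal(np.sort(a, axis=0), np.sort(b, axis=0)))  # True
```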
The next step introduces position embeddings — vectors added to each row of $X$ to encode the token’s position in the sequence. After that addition, even two identical tokens at different positions will have different representations.
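As a preview, learned position embeddings can be sketched as a second lookup table $P$ with one row per position, added elementwise to $X$ (the names and shapes below are illustrative assumptions, not the article's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, max_len = 8, 16
vocab = ["the", "dog", "bit", "man"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
E = rng.normal(size=(len(vocab), d))  # toy token embedding table
P = rng.normal(size=(max_len, d))     # toy position embedding table, one row per position

def embed_with_positions(tokens):
    X = np.stack([E[token_to_id[t]] for t in tokens])  # (n, d) token embeddings
    return X + P[: len(tokens)]                        # add position row i to token row i

X = embed_with_positions("the dog bit the man".split())

# The two "the" tokens now differ, because their position rows differ:
print(np.array_equal(X[0], X[3]))  # False
```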