Pedagogical LLM Tiny causal LM Step 03

Sequence input

New concept: context as a token matrix

The previous step showed the embedding lookup for a single token ID. For a real prompt, the model applies that same lookup at every position and stacks the results into a matrix.

A sequence of $n$ tokens $[t_0, t_1, \ldots, t_{n-1}]$ becomes

$$X \in \mathbb{R}^{n \times d}, \quad X_i = E[t_i]$$

(Note: GPT-2 supports sequences up to $n = 1{,}024$ tokens. Modern LLMs can handle much longer contexts, but extending the context window remains a central engineering challenge because attention cost grows quadratically with $n$.)

Each row of $X$ is the $d$-dimensional embedding at position $i$. The important point is that each lookup still happens independently: at this stage, the model has a stack of token vectors, but no interactions between positions yet.
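The stacking can be sketched in a few lines of NumPy. The vocabulary size, embedding width, token IDs, and randomly filled table `E` below are illustrative assumptions, not values from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d = 10, 4                   # toy sizes, assumed for illustration
E = rng.normal(size=(vocab_size, d))    # stand-in embedding table

token_ids = np.array([3, 7, 1, 7, 2])   # hypothetical prompt, n = 5

# Integer-array indexing performs the lookup at every position at once.
X = E[token_ids]                        # shape (n, d)

# The same matrix built one position at a time, then stacked.
X_loop = np.stack([E[t] for t in token_ids])

print(X.shape)                          # (5, 4)
print(np.array_equal(X, X_loop))        # True
```

Both constructions give the same matrix, which reflects the point above: each row depends only on its own token ID, not on its neighbours.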

This is the transformer’s starting representation of the whole prompt: one row per token position, all available at once.

(Note: RNNs (Elman, 1990) and LSTMs (Hochreiter & Schmidhuber, 1997) pass a hidden state forward through each position serially, which rules out parallelizing across positions during training. Processing all positions in parallel is what makes GPU-scale training practical.)

At full prompt length, this input matrix can be large:

| Model | Shape ($n \times d$) | Total values |
| --- | --- | --- |
| GPT-2 Small | 1,024 × 768 | 0.79M |
| DeepSeek V3.2 Speciale | 163,840 × 7,168 | 1.17G |

Each total is just n × d: prompt length times embedding width.
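As a quick check of the arithmetic, the totals can be recomputed from the shapes listed above. This is only a sketch; the dictionary name `models` is chosen here for illustration.

```python
# (n, d) pairs taken from the table above
models = {
    "GPT-2 Small": (1_024, 768),
    "DeepSeek V3.2 Speciale": (163_840, 7_168),
}

# Total values in the input matrix: prompt length times embedding width
totals = {name: n * d for name, (n, d) in models.items()}

for name, total in totals.items():
    print(f"{name}: {total:,} values")
# GPT-2 Small: 786,432 values
# DeepSeek V3.2 Speciale: 1,174,405,120 values
```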

One limitation that the visualization exposes is that a token embedding directly encodes only which token appears. Where it appears is encoded only indirectly, by the row its embedding vector occupies in the matrix. The next step adds position embeddings so that repeated tokens at different positions no longer share the same vector representation.
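A small NumPy sketch makes the limitation concrete. The table sizes and token IDs are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 4))         # toy embedding table (assumed sizes)

token_ids = np.array([5, 2, 5])      # token 5 appears at positions 0 and 2
X = E[token_ids]

# The two rows for token 5 are identical: nothing in the vector itself
# records which position it came from.
same_rows = np.array_equal(X[0], X[2])

# Reordering the prompt only reorders the rows; the vectors are unchanged.
perm = np.array([2, 0, 1])
X_shuffled = E[token_ids[perm]]
rows_just_moved = np.array_equal(X_shuffled, X[perm])

print(same_rows, rows_just_moved)    # True True
```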