Sequence input
New concept: context as a token matrix
The previous step showed the embedding lookup for a single token ID. For a real prompt, the model applies that same lookup at every position and stacks the results into a matrix.
A sequence of $n$ tokens[1] $[t_0, t_1, \ldots, t_{n-1}]$ becomes
$$X \in \mathbb{R}^{n \times d}, \quad X_i = E[t_i]$$
[1] GPT-2 supports sequences up to $n = 1{,}024$ tokens. Modern LLMs can handle much longer contexts, but extending the context window remains a central engineering challenge because attention cost grows quadratically with $n$.
Each row of $X$ is the $d$-dimensional embedding at position $i$. The important point is that each lookup still happens independently: at this stage, the model has a stack of token vectors, but no interactions between positions yet.
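As a minimal sketch of this stacking step, the lookup can be written as a single integer-array index into the embedding table. The sizes, the random table `E`, and the token IDs below are hypothetical toy values chosen only for illustration; NumPy is assumed.

```python
import numpy as np

# Hypothetical toy sizes, chosen only for illustration.
vocab_size, d = 10, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, d))  # embedding table: one row per token ID

token_ids = np.array([3, 7, 3, 1])    # a prompt of n = 4 token IDs
X = E[token_ids]                      # integer-array indexing stacks the looked-up rows

print(X.shape)  # (4, 3): n rows, d columns

# Row i of X is exactly the embedding of token t_i; no row depends on any other.
assert all(np.array_equal(X[i], E[t]) for i, t in enumerate(token_ids))
```

Indexing with the whole ID array performs the same per-token lookup at every position at once, which is why no row of `X` contains information about its neighbors.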
[Interactive visualization: token matrix $X$]
This is the transformer’s starting representation of the whole prompt: one row per token position, all available at once.[2]
[2] RNNs (Elman, 1990) and LSTMs (Hochreiter & Schmidhuber, 1997) pass a hidden state forward through each position serially, which prevents parallelizing across positions during training. Processing all positions in parallel is what makes GPU-scale training practical.
At full prompt length, this input matrix can be large:
| Model | Total values |
|---|---|
| GPT-2 Small | 0.79M |
| DeepSeek V3.2 Speciale | 1.17G |
Each total is just $n \times d$: prompt length times embedding width.
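The GPT-2 Small figure can be reproduced with a one-line calculation. This sketch uses the 1,024-token limit mentioned earlier and assumes an embedding width of $d = 768$ for GPT-2 Small, which the text does not state explicitly; `total_values` is a hypothetical helper name.

```python
def total_values(n: int, d: int) -> int:
    """Number of scalars in an n x d input matrix."""
    return n * d

# GPT-2 Small: n = 1,024 positions; d = 768 is an assumed embedding width.
gpt2_small = total_values(1024, 768)
print(f"{gpt2_small:,} values = {gpt2_small / 1e6:.2f}M")  # 786,432 values = 0.79M
```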
One limitation the visualization exposes is that each token embedding encodes only which token appears, not where it appears. Position is captured only implicitly, by the row index of the embedding vector within the matrix, so the same token at two different positions produces two identical rows. The next step adds position embeddings so that repeated tokens at different positions no longer share the same vector representation.
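This limitation can be checked directly. The sketch below assumes NumPy and uses a hypothetical random embedding table and toy token IDs: a token that repeats yields identical rows, and reordering the prompt only reorders the rows of the matrix.

```python
import numpy as np

# Hypothetical toy embedding table, for illustration only.
vocab_size, d = 10, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, d))

# Token 3 appears at positions 0 and 2.
token_ids = np.array([3, 7, 3, 1])
X = E[token_ids]

# The two occurrences of token 3 get exactly the same vector.
same_rows = np.array_equal(X[0], X[2])
print(same_rows)  # True

# Shuffling the prompt only shuffles the rows; the vectors themselves are unchanged.
perm = np.array([2, 0, 3, 1])
X_perm = E[token_ids[perm]]
rows_just_move = np.array_equal(X_perm, X[perm])
print(rows_just_move)  # True
```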