Sequence input
New concept: context as a token matrix
So far the model takes a single token and predicts the next one. Real language models take a sequence of tokens and predict the next one after all of them. This gives the model context — the prior tokens it can attend to.
A sequence of $n$ tokens¹ $[t_0, t_1, \ldots, t_{n-1}]$ is embedded into a matrix by looking up each token independently:

$$X \in \mathbb{R}^{n \times d}, \quad X_i = E[t_i]$$

Each row of $X$ is the $d$-dimensional embedding of the token at position $i$.

¹ GPT-2 supports sequences up to n = 1,024 tokens; GPT-4 supports up to 128k. Extending the context window is one of the central engineering challenges in modern LLMs, since attention cost grows quadratically with $n$.
import numpy as np

def embed_sequence(tokens: list[str]) -> np.ndarray:
    # Look up each token's d-dimensional embedding row; E and token_to_id
    # are the embedding table and vocabulary mapping defined earlier.
    return np.stack([E[token_to_id[t]] for t in tokens])  # shape: (seq_len, d)
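As a concrete check, here is a self-contained toy setup (a tiny hypothetical vocabulary and a random embedding table, not the article's actual $E$) showing the resulting shape:

```python
import numpy as np

# Hypothetical toy vocabulary and embedding table (d = 4), for illustration only.
vocab = ["the", "dog", "bit", "man"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 4))  # (vocab_size, d)

def embed_sequence(tokens: list[str]) -> np.ndarray:
    return np.stack([E[token_to_id[t]] for t in tokens])

X = embed_sequence(["the", "dog", "bit", "the", "man"])
print(X.shape)  # (5, 4): one d-dimensional row per token
```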
The matrix $X$ is the input to every subsequent layer. The model reads all rows simultaneously — this is different from recurrent models, which process tokens one by one.² Processing all positions in parallel is what makes GPU-scale training practical.

² RNNs (Elman, 1990) and LSTMs (Hochreiter & Schmidhuber, 1997) pass a hidden state forward through each position serially, so positions cannot be processed in parallel during training.
[Interactive visualization: the sequence embedding matrix $X$, one row per token]
Notice the two highlighted rows (positions 0 and 4): both are the token “the”, so their embeddings are identical. The model cannot tell them apart at this stage — it has no notion of where in the sequence a token appears.
This is a real problem. “The dog bit the man” and “the man bit the dog” use exactly the same tokens; without position information the two sentences produce the same set of embedding rows, and the model has no way to exploit their order. Word order carries meaning, but the embedding lookup discards it.
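A quick sketch makes both failures concrete (again with a toy vocabulary and random embedding table, assumed for illustration): repeated tokens get identical rows, and reordering the sentence merely permutes the rows of $X$.

```python
import numpy as np

vocab = ["the", "dog", "bit", "man"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 8))  # toy embedding table, d = 8

def embed_sequence(tokens):
    return np.stack([E[token_to_id[t]] for t in tokens])

a = embed_sequence("the dog bit the man".split())
b = embed_sequence("the man bit the dog".split())

# The two "the" tokens (positions 0 and 3 here) get identical rows:
print(np.array_equal(a[0], a[3]))  # True

# Reordering the sentence only permutes the rows; column-wise sorting
# shows both sentences contain exactly the same multiset of rows:
print(np.array_equal(np.sort(a, axis=0), np.sort(b, axis=0)))  # True
```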
The next step introduces position embeddings — vectors added to each row of $X$ to encode the token’s position in the sequence. After that addition, even two identical tokens at different positions will have different representations.
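As a preview, learned position embeddings can be sketched as a second lookup table $P$ with one row per position, added elementwise to $X$ (the names and shapes below are illustrative assumptions, not the article's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, max_len = 8, 16
vocab = ["the", "dog", "bit", "man"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
E = rng.normal(size=(len(vocab), d))  # toy token embedding table
P = rng.normal(size=(max_len, d))     # toy position embedding table, one row per position

def embed_with_positions(tokens):
    X = np.stack([E[token_to_id[t]] for t in tokens])  # (n, d) token embeddings
    return X + P[: len(tokens)]                        # add position row i to token row i

X = embed_with_positions("the dog bit the man".split())

# The two "the" tokens now differ, because their position rows differ:
print(np.array_equal(X[0], X[3]))  # False
```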