Milestone A Phase 0 — Output pipeline Step 05

Output projection

New concept: vocabulary scoring

After the embedding lookup we have a dense vector $e \in \mathbb{R}^d$. But a language model’s job is to predict which token comes next, not to produce an arbitrary vector. To connect the embedding space back to the vocabulary we need an output projection.

The output projection is a weight matrix $W_{out} \in \mathbb{R}^{d \times |V|}$. Multiplying the embedding by this matrix produces one real-valued score — a logit — for every token in the vocabulary:

$$\text{logits} = e \cdot W_{out} \quad \in \mathbb{R}^{|V|}$$

A higher logit means the model considers that token more likely to come next. The values are unbounded real numbers; we haven’t turned them into probabilities yet.

import numpy as np

d, vocab_size = 64, 1000                       # illustrative sizes
rng = np.random.default_rng(0)
W_out = rng.standard_normal((d, vocab_size))   # learned during training

def project(embedding: np.ndarray) -> np.ndarray:
    return embedding @ W_out   # shape: (vocab_size,)

Select a token below to see its embedding vector and the resulting logit scores. Notice that semantically related tokens tend to score higher — “fox” assigns high logits to other animals and action words, while “the” scores function words and adjectives more favourably.

A few things to notice:

  • The embedding dimensions (left) are in the range $[-1, 1]$; logits (right) can be larger because they accumulate contributions across all $d$ dimensions.
  • [PAD] almost always receives the lowest logit — the model is trained to never predict a padding token as output.
  • $W_{out}$ is just a learned linear map.¹ There is no activation function here; the non-linearity comes later (softmax, and eventually the full network).

¹ GPT-2 (Radford et al., 2019) ties the output projection to the embedding matrix: $W_{out} = E^\top$. This halves the parameter count for the projection layer and consistently improves perplexity. nanoGPT follows the same convention.
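Weight tying is a one-line change: reuse the transposed embedding matrix as the projection. A minimal sketch, assuming a toy embedding matrix `E` with random stand-in values (the sizes and names here are illustrative, not from any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 8, 50
E = rng.standard_normal((vocab_size, d))   # embedding table: one row per token

# Tied projection, GPT-2 style: no separate W_out parameter is stored.
W_out = E.T                                # shape: (d, vocab_size)

e = E[17]                                  # embedding of token 17
logits = e @ W_out                         # shape: (vocab_size,)

# With tied weights, token j's logit is the dot product of the current
# vector with token j's own embedding row; token 17's logit is just
# the squared norm of its embedding, e . e.
```

One consequence: under tying, logits directly measure similarity between the current vector and each token's embedding, which is one intuition for why semantically related tokens score each other highly.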

At this point we have $\text{embed} \to \text{project}$, a two-step pipeline that maps a token string to a vector of raw scores. The next step turns those scores into a proper probability distribution.
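The two-step pipeline can be sketched end to end in a few lines. The tiny vocabulary and random weights below are stand-ins for learned values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "quick", "brown", "fox", "[PAD]"]
d = 4

E = rng.standard_normal((len(vocab), d))        # embedding table (learned)
W_out = rng.standard_normal((d, len(vocab)))    # output projection (learned)

def embed(token: str) -> np.ndarray:
    return E[vocab.index(token)]                # step 1: table lookup

def project(e: np.ndarray) -> np.ndarray:
    return e @ W_out                            # step 2: one raw score per token

logits = project(embed("fox"))                  # shape: (5,), unbounded reals
```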