Pedagogical LLM Tiny causal LM Step 03

Sequence input

New concept: context as a token matrix

The previous step showed the embedding lookup for a single token ID. For a real prompt, the model applies that same lookup at every position and stacks the results into a matrix.

A sequence of $n$ tokens¹ $[t_0, t_1, \ldots, t_{n-1}]$ becomes

$$X \in \mathbb{R}^{n \times d}, \quad X_i = E[t_i]$$

¹ GPT-2 supports sequences up to $n = 1{,}024$ tokens. Modern LLMs can handle much longer contexts, but extending the context window remains a central engineering challenge because attention cost grows quadratically with $n$.

Row $i$ of $X$ is the $d$-dimensional embedding of the token at position $i$. The important point is that each lookup still happens independently: at this stage, the model has a stack of token vectors, but no interactions between positions yet.
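The stacking can be sketched in a few lines of NumPy. This is a minimal illustration, not any model's actual code: the table `E` is random, and the sizes (`vocab_size = 10`, `d = 4`) and the token IDs are hypothetical values chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d = 10, 4                   # hypothetical sizes, not from a real model
E = rng.normal(size=(vocab_size, d))    # embedding table: one row per token ID

token_ids = np.array([3, 7, 1, 7, 2])   # hypothetical prompt of n = 5 token IDs

# Integer-array indexing performs the lookup at every position at once.
X = E[token_ids]

# The same matrix, built one position at a time and then stacked.
X_loop = np.stack([E[t] for t in token_ids])

print(X.shape)                     # (n, d)
print(np.array_equal(X, X_loop))   # row i depends only on token_ids[i]
```

Both constructions give the same matrix, which is the sense in which the lookups are independent: no row is influenced by any other position.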

[Interactive visualization]

This is the transformer’s starting representation of the whole prompt: one row per token position, all available at once.²

² RNNs (Elman, 1990) and LSTMs (Hochreiter & Schmidhuber, 1997) pass a hidden state forward through the positions one at a time, so the computation cannot be parallelized across positions during training. Processing all positions in parallel is what makes GPU-scale training practical.

At the full context length, this input matrix is already large:

Model	Context length n	Embedding width d	Total values (n × d)
GPT-2 Small	1,024	768	0.79M
DeepSeek V3.2 Speciale	163,840	7,168	1.17G

Each total is just n × d: prompt length times embedding width.
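The totals can be reproduced with plain arithmetic. The sketch below only multiplies the context lengths and embedding widths listed in the table; the dictionary layout and variable names are illustrative.

```python
# (context length n, embedding width d), as listed in the table
models = {
    "GPT-2 Small": (1_024, 768),
    "DeepSeek V3.2 Speciale": (163_840, 7_168),
}

# Number of values in the n x d input matrix for each model.
totals = {name: n * d for name, (n, d) in models.items()}

for name, total in totals.items():
    print(f"{name}: {total:,} values")
```

This gives 786,432 values (about 0.79M) for GPT-2 Small and 1,174,405,120 values (about 1.17G) for DeepSeek V3.2 Speciale.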

One limitation that the visualization exposes is that a token embedding directly encodes only which token appears. Where it appears is encoded only implicitly, by the row index of its embedding vector in the matrix. The next step adds position embeddings so repeated tokens at different positions no longer share the same vector representation.
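This limitation is easy to check numerically. The sketch below is a minimal illustration under assumed values: a random table `E` with hypothetical sizes, and a made-up prompt in which token ID 7 appears twice.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 4))            # hypothetical table: 10 token IDs, d = 4

token_ids = np.array([3, 7, 1, 7, 2])   # token 7 appears at positions 1 and 3
X = E[token_ids]

# Both occurrences of token 7 receive exactly the same vector.
print(np.array_equal(X[1], X[3]))

# Reordering the prompt only reorders the rows; no row's contents change.
perm = np.array([4, 2, 0, 3, 1])
print(np.array_equal(E[token_ids[perm]], X[perm]))
```

Both checks print `True`: the rows themselves carry no trace of where in the prompt they came from.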