From Token to Transformer
An interactive step-by-step guide to how large language models work — from a single embedding lookup all the way to a production MoE serving stack.
Start from Step 0
Curriculum
93 steps across 10 milestones. Each step introduces exactly one new concept.
A. Tiny causal LM
Steps 0–6 · Tokenization, embeddings, decoding
B. Minimal transformer
Steps 7–11 · Attention, masking, QKV
C. Classic dense decoder
Steps 12–17 · Residual stream, MLP, LayerNorm
D. Modern dense backbone
Steps 18–27 · MHA, RoPE, RMSNorm, SwiGLU, GQA
E. Practical long-context
Steps 28–37 · KV cache, batching, RoPE scaling
F. Modern MoE
Steps 38–49 · Expert routing, load balancing
G. Frontier architecture
Steps 50–64 · FlashAttention, paged KV, expert parallelism
H. Production inference
Steps 65–70 · Continuous batching, speculative decoding
I. Production training
Steps 71–79 · Mixed precision, tensor/pipeline parallelism
J. Full production
Steps 80–92 · Post-training, eval, safety, rollout
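To give a concrete flavor of the early milestones, here are a few small sketches of ideas named above. Each one uses toy sizes and random weights chosen only for illustration, not values or code from the guide itself. First, milestone A's core loop: an embedding lookup, a tied unembedding, and greedy decoding.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 16, 8                      # toy sizes, chosen for illustration
embed = rng.normal(size=(vocab_size, d_model))   # token embedding table

def next_token(token_ids):
    """Greedy next-token choice from the last token's embedding."""
    h = embed[token_ids[-1]]       # embedding lookup
    logits = h @ embed.T           # tied unembedding: a score for every vocab entry
    return int(np.argmax(logits))  # greedy decoding takes the top logit

ids = [3]                          # start from an arbitrary token id
for _ in range(5):
    ids.append(next_token(ids))
print(ids)
```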
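Milestone B centers on attention. Here is a sketch of single-head causal self-attention with explicit Q, K, V projections; the sequence length, width, and weights are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))                   # one toy token sequence
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = x @ Wq, x @ Wk, x @ Wv                          # project into queries, keys, values
scores = Q @ K.T / np.sqrt(d_model)                       # scaled dot-product scores
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                                    # causal mask: no attending to the future
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)            # softmax over visible positions
out = weights @ V                                         # weighted sum of values
print(out.shape)                                          # (4, 8)
```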
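For milestone C, a sketch of one pre-norm MLP sublayer writing back into the residual stream; the layer sizes and the ReLU nonlinearity are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp_sublayer(x):
    h = np.maximum(layer_norm(x) @ W1, 0.0)   # pre-norm, then position-wise MLP (ReLU)
    return x + h @ W2                         # residual add: the update flows back into the stream

x = rng.normal(size=(4, d_model))
print(mlp_sublayer(x).shape)                  # (4, 8)
```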
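Milestone D covers several components; RMSNorm is the smallest to show. A sketch, with an illustrative epsilon and a unit gain:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """Normalize by root-mean-square only, unlike LayerNorm's mean/variance."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain          # learned per-channel gain, no bias or mean subtraction

x = np.array([[1.0, -2.0, 3.0, 0.5]])
print(rms_norm(x, gain=np.ones(4)))
```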
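Milestone E's KV cache in miniature: during decoding, each token's key and value are stored once, so later steps only compute a query for the newest token. Single head, toy sizes, random weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
k_cache, v_cache = [], []                        # grows by one entry per decoded token

def decode_step(x_new):
    """Attend from the newest token over every cached position."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)                   # this token's key/value, computed once
    v_cache.append(x_new @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d_model)
    w = np.exp(scores - scores.max())
    w /= w.sum()                                 # softmax over all cached positions
    return w @ V

for _ in range(3):
    out = decode_step(rng.normal(size=d_model))
print(len(k_cache), out.shape)                   # 3 (8,)
```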
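And for milestone F, a sketch of top-k expert routing: a gate scores the experts, only the best k run, and their outputs are mixed by the renormalized gate weights. The expert count, k, and all weights here are invented, and load balancing is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2
W_gate = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]  # stand-in experts

def moe_layer(x):
    logits = x @ W_gate
    chosen = np.argsort(logits)[-top_k:]              # indices of the k highest-scoring experts
    gate = np.exp(logits[chosen])
    gate /= gate.sum()                                # renormalize the gate over chosen experts
    return sum(g * (x @ experts[i]) for g, i in zip(gate, chosen))

print(moe_layer(rng.normal(size=d_model)).shape)      # (8,)
```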