From Token to Transformer

An interactive step-by-step guide to how large language models work — from a single embedding lookup all the way to a production MoE serving stack.
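To make that starting point concrete, here is a minimal sketch of what "a single embedding lookup" plus greedy decoding can look like in plain NumPy. The sizes, variable names, and random weights below are illustrative assumptions for this page only, not the guide's actual code.

```python
import numpy as np

# Illustrative sizes only -- not the guide's actual configuration.
VOCAB_SIZE, D_MODEL = 16, 8
rng = np.random.default_rng(0)

# An "embedding lookup" is just row indexing into a learned table.
embedding_table = rng.normal(size=(VOCAB_SIZE, D_MODEL))

# The simplest possible LM head: project the hidden vector back to
# vocabulary logits. The weights here are random, so the output is
# meaningless -- the point is the shape of the computation.
lm_head = rng.normal(size=(D_MODEL, VOCAB_SIZE))

def next_token_greedy(token_id: int) -> int:
    """Embed one token, score the whole vocabulary, pick the argmax."""
    hidden = embedding_table[token_id]   # (D_MODEL,)
    logits = hidden @ lm_head            # (VOCAB_SIZE,)
    return int(np.argmax(logits))        # greedy decoding

print(next_token_greedy(3))  # some token id in [0, VOCAB_SIZE)
```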

Start from Step 0

Curriculum

93 steps across 10 milestones. Each step introduces exactly one new concept.

A · Tiny causal LM · Steps 0–6 · Tokenization, embeddings, decoding
B · Minimal transformer · Steps 7–11 · Attention, masking, QKV
C · Classic dense decoder · Steps 12–17 · Residual stream, MLP, LayerNorm
D · Modern dense backbone · Steps 18–27 · MHA, RoPE, RMSNorm, SwiGLU, GQA
E · Practical long-context · Steps 28–37 · KV cache, batching, RoPE scaling
F · Modern MoE · Steps 38–49 · Expert routing, load balancing
G · Frontier architecture · Steps 50–64 · FlashAttention, paged KV, expert parallelism
H · Production inference · Steps 65–70 · Continuous batching, speculative decoding
I · Production training · Steps 71–79 · Mixed precision, tensor/pipeline parallelism
J · Full production · Steps 80–92 · Post-training, eval, safety, rollout
