From Token to Transformer

An interactive step-by-step guide to how large language models work — from a single embedding lookup all the way to a production MoE serving stack.

Curriculum

93 steps across 10 milestones. Each step introduces exactly one new concept.

B · Minimal transformer · Steps 7–11 · Attention, masking, QKV

C · Classic dense decoder · Steps 12–17 · Residual stream, MLP, LayerNorm

D · Modern dense backbone · Steps 18–27 · MHA, RoPE, RMSNorm, SwiGLU, GQA

E · Practical long-context · Steps 28–37 · KV cache, batching, RoPE scaling

F · Modern MoE · Steps 38–49 · Expert routing, load balancing

G · Frontier architecture · Steps 50–64 · FlashAttention, paged KV, expert parallelism

H · Production inference · Steps 65–70 · Continuous batching, speculative decoding

I · Production training · Steps 71–79 · Mixed precision, tensor/pipeline parallelism

J · Full production · Steps 80–92 · Post-training, eval, safety, rollout
