How do LLMs work?
A step-by-step guide for engineers who have used large language models and want to deeply understand the mechanics of what's happening inside.
From a toy model to a production LLM, across three milestones.
Start here
You'll need a basic understanding of vectors, matrices, and Python.
Pedagogical LLM
Build a working language model from scratch — no framework, no pretrained weights. You will trace every number through every operation.
Tiny causal LM
Builds the input/output skeleton of a language model — tokenization, embeddings, and probabilistic decoding.
Minimal transformer
Adds self-attention, the mechanism that lets tokens influence each other.
Classic dense decoder
Completes the transformer block with residual connections, feed-forward layers, and layer normalization.
Local LLM
Add the components that make modern models work in practice: multi-head and grouped-query attention, rotary positional encoding (RoPE), and mixture-of-experts (MoE) layers. By the end you can read a real model config.
Modern dense backbone
Extends the architecture to match a real model — multi-head attention, RoPE, RMSNorm, SwiGLU, and GQA.
Practical long-context
Shows how models handle long inputs — the KV cache, batching, and context extension.
Modern MoE
Introduces mixture-of-experts, routing tokens to specialized sub-networks for scale without proportional compute.
Server LLM
Study the engineering behind production deployment: FlashAttention, KV cache management, multi-GPU serving, and the full training and post-training pipeline.
Frontier architecture
Examines the systems-level optimizations in frontier models — FlashAttention, paged KV, and expert parallelism.
Production inference
Shows how a serving system handles thousands of concurrent requests efficiently.
Production training
Covers distributed training at scale — mixed precision, tensor parallelism, and pipeline parallelism.
Full production
Walks through the complete production lifecycle — data, post-training, evaluation, safety, and deployment.