Tiny causal LM

A language model is a function: text goes in, a probability distribution over the next piece of text comes out. Everything else — attention, transformers, mixture of experts — is a choice about how to compute that function well.

What this checkpoint covers

This checkpoint builds the simplest thing that fits the above definition. No attention, no transformer blocks. Just the skeleton: text becomes tokens, tokens become vectors, vectors get scored, and a probability distribution is sampled to produce the next token.

After this checkpoint you will be able to explain what happens at each boundary between these stages: why the numbers have the shapes they do, what information they carry, and what would break if a step were removed.
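The stages above can be sketched in a few lines of plain Python. This is a minimal illustration, not this checkpoint's actual code: the character-level vocabulary, the embedding width `D`, the random untrained weights, and the names `embedding`, `w_out`, and `next_token_distribution` are all assumptions made for the example.

```python
import math
import random

# Hypothetical toy setup: character-level tokens and random, untrained weights.
text = "hello world"
vocab = sorted(set(text))                      # unique characters
stoi = {ch: i for i, ch in enumerate(vocab)}   # character -> token id
itos = {i: ch for ch, i in stoi.items()}       # token id -> character

V = len(vocab)   # vocabulary size
D = 4            # embedding width (arbitrary choice for the sketch)

rng = random.Random(0)
# Embedding table: V rows, one D-dimensional vector per token.
embedding = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(V)]
# Output weights: D x V, turning a vector into one score per vocabulary entry.
w_out = [[rng.gauss(0, 1) for _ in range(V)] for _ in range(D)]

def softmax(scores):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    m = max(scores)                            # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(token_id):
    vec = embedding[token_id]                  # token -> vector, shape (D,)
    logits = [sum(vec[d] * w_out[d][v] for d in range(D))
              for v in range(V)]               # vector -> scores, shape (V,)
    return softmax(logits)                     # scores -> probabilities, shape (V,)

tokens = [stoi[ch] for ch in text]             # text -> token ids, shape (T,)
probs = next_token_distribution(tokens[-1])    # distribution over the next token
next_id = rng.choices(range(V), weights=probs, k=1)[0]   # sample one token id
next_char = itos[next_id]                      # token id -> text
```

Each comment marks one boundary: a string becomes a list of `T` integers, an integer becomes a vector of length `D`, a vector becomes `V` scores, and the scores become `V` probabilities from which a single token is drawn. Because the weights are random, the sampled character is meaningless; only the shapes and the flow matter here.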

What will come later

There is no attention here. Without it, the model cannot learn how tokens relate to each other: each token is scored on its own, so the prediction after “cat” is the same whether the text so far is “the cat” or “the dog chased the cat”. Attention comes in the next checkpoint.
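The lack of context can be checked directly. The sketch below assumes the model scores each position from that position's own token embedding and nothing else, which is one reading of the text above rather than a confirmed detail of this checkpoint. The word-level vocabulary, the random weights, and the helper `predict_all` are hypothetical.

```python
import math
import random

# Hypothetical word-level vocabulary and random, untrained weights.
vocab = ["the", "a", "cat", "dog", "sat", "ran"]
stoi = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 3

rng = random.Random(1)
embedding = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(V)]
w_out = [[rng.gauss(0, 1) for _ in range(V)] for _ in range(D)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_all(words):
    """One next-token distribution per position, each computed
    from that position's token alone (assumed: no attention, no mixing)."""
    out = []
    for w in words:
        vec = embedding[stoi[w]]
        logits = [sum(vec[d] * w_out[d][v] for d in range(D))
                  for v in range(V)]
        out.append(softmax(logits))
    return out

a = predict_all(["the", "cat", "sat"])
b = predict_all(["a", "dog", "sat"])

# Both sequences end in "sat", so the final predictions are identical
# even though everything before "sat" differs.
same_after_sat = a[-1] == b[-1]
# The first positions hold different tokens, so those predictions differ.
differs_at_start = a[0] != b[0]
```

Under this assumption the model behaves like a lookup from the current token to a distribution, so the preceding words cannot influence the prediction.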

The goal here is to understand the wiring before adding what makes it work.