Tiny causal LM
A language model is a function: text goes in, a probability distribution over the next piece of text comes out. Everything else — attention, transformers, mixture of experts — is a choice about how to compute that function well.
What this checkpoint covers
This checkpoint builds the simplest thing that fits the definition above. No attention, no transformer blocks. Just the skeleton: text becomes tokens, tokens become vectors, vectors get scored, the scores become a probability distribution, and that distribution is sampled to produce the next token.
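The skeleton can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the checkpoint's actual code: the character-level vocabulary, the embedding width `D`, the random untrained weights, and the helper names `tokenize` and `next_token_distribution` are all hypothetical choices made for the example.

```python
import math
import random

# Assumption: a toy character-level vocabulary built from one string.
text = "hello world"
vocab = sorted(set(text))                      # unique characters
stoi = {ch: i for i, ch in enumerate(vocab)}   # character -> token id
itos = {i: ch for ch, i in stoi.items()}       # token id -> character

V = len(vocab)   # vocabulary size
D = 4            # embedding width (arbitrary for this sketch)

# Assumption: random, untrained weights. A real model would learn these.
rng = random.Random(0)
embedding = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(V)]    # shape (V, D)
unembedding = [[rng.gauss(0, 1) for _ in range(V)] for _ in range(D)]  # shape (D, V)

def tokenize(s):
    """Text -> list of token ids."""
    return [stoi[ch] for ch in s]

def next_token_distribution(token_id):
    """Token id -> vector -> scores (logits) -> probabilities."""
    vec = embedding[token_id]                                   # shape (D,)
    logits = [sum(vec[d] * unembedding[d][v] for d in range(D))
              for v in range(V)]                                # shape (V,)
    # Softmax, shifted by the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

tokens = tokenize("hello")                      # text becomes tokens
probs = next_token_distribution(tokens[-1])     # one probability per vocabulary entry
next_id = rng.choices(range(V), weights=probs, k=1)[0]   # sample the next token id
next_char = itos[next_id]
```

With untrained weights the sampled character is effectively noise. The point is the shapes: one id per character, one `D`-wide vector per id, and one score and one probability per vocabulary entry.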
After this checkpoint you will be able to explain what happens at each boundary between those steps: why the numbers have the shapes they do, what information they carry, and what would break if a step were removed.
What will come later
There is no attention here. Without it, the model has no way to relate tokens to each other: each token is scored on its own, so a token such as “cat” produces the same output regardless of the context around it. Attention comes in the next checkpoint.
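The limitation can be made concrete with a small sketch. It assumes a model that scores only the most recent token (an embedding lookup followed by a linear layer). The word-level vocabulary, the random weights, and the `predict` helper are hypothetical and exist only for this illustration.

```python
import math
import random

# Assumption: a tiny word-level vocabulary and random, untrained weights.
vocab = ["the", "a", "cat", "dog", "sat"]
stoi = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 3

rng = random.Random(0)
embedding = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(V)]    # shape (V, D)
unembedding = [[rng.gauss(0, 1) for _ in range(V)] for _ in range(D)]  # shape (D, V)

def predict(words):
    """Next-token distribution computed from the last word only."""
    vec = embedding[stoi[words[-1]]]   # earlier words are never looked at
    logits = [sum(vec[d] * unembedding[d][v] for d in range(D))
              for v in range(V)]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

after_the_cat = predict(["the", "cat"])
after_a_cat = predict(["a", "cat"])

# Different contexts, same final token: the prediction cannot tell them apart.
same_prediction = after_the_cat == after_a_cat
```

Here `same_prediction` is `True` because nothing in `predict` reads the earlier words. Training would change the weights but not this property.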
The goal here is to understand the wiring before adding what makes it work.