Milestone A Phase 0 — Output pipeline Step 05

Output projection

New concept: vocabulary scoring

After the embedding lookup we have a dense vector $e \in \mathbb{R}^d$. But a language model’s job is to predict which token comes next, not to produce an arbitrary vector. To connect the embedding space back to the vocabulary we need an output projection.

The output projection is a weight matrix $W_{out} \in \mathbb{R}^{d \times |V|}$. Multiplying the embedding by this matrix produces one real-valued score — a logit — for every token in the vocabulary:

$$\text{logits} = e \cdot W_{out} \quad \in \mathbb{R}^{|V|}$$

A higher logit means the model considers that token more likely to come next. The values are unbounded real numbers; we haven’t turned them into probabilities yet.

import numpy as np

d, vocab_size = 64, 1000                       # illustrative sizes
rng = np.random.default_rng(0)
W_out = rng.standard_normal((d, vocab_size))   # learned during training

def project(embedding: np.ndarray) -> np.ndarray:
    return embedding @ W_out   # shape: (vocab_size,)

Select a token below to see its embedding vector and the resulting logit scores. Notice that semantically related tokens tend to score higher — “fox” assigns high logits to other animals and action words, while “the” scores function words and adjectives more favourably.

A few things to notice:

  • The embedding dimensions (left) are in the range $[-1, 1]$; logits (right) can be larger because they accumulate contributions across all $d$ dimensions.
  • [PAD] almost always receives the lowest logit — the model is trained to never predict a padding token as output.
  • $W_{out}$ is just a learned linear map.¹ There is no activation function here; the non-linearity comes later (softmax, and eventually the full network).

¹ GPT-2 (Radford et al., 2019) ties the output projection to the embedding matrix: $W_{out} = E^\top$. This halves the parameter count for the projection layer and consistently improves perplexity. nanoGPT follows the same convention.
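Weight tying is a one-line change: reuse the transposed embedding matrix as the projection. A minimal sketch, assuming a toy embedding matrix `E` with random stand-in values (the sizes and names here are illustrative, not from any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 8, 50
E = rng.standard_normal((vocab_size, d))   # embedding table: one row per token

# Tied projection, GPT-2 style: no separate W_out parameter is stored.
W_out = E.T                                # shape: (d, vocab_size)

e = E[17]                                  # embedding of token 17
logits = e @ W_out                         # shape: (vocab_size,)

# With tied weights, token j's logit is the dot product of the current
# vector with token j's own embedding row; token 17's logit is just
# the squared norm of its embedding, e . e.
```

One consequence: under tying, logits directly measure similarity between the current vector and each token's embedding, which is one intuition for why semantically related tokens score each other highly.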

At this point we have $\text{embed} \to \text{project}$, a two-step pipeline that maps a token string to a vector of raw scores. The next step turns those scores into a proper probability distribution.
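The two-step pipeline can be sketched end to end in a few lines. The tiny vocabulary and random weights below are stand-ins for learned values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "quick", "brown", "fox", "[PAD]"]
d = 4

E = rng.standard_normal((len(vocab), d))        # embedding table (learned)
W_out = rng.standard_normal((d, len(vocab)))    # output projection (learned)

def embed(token: str) -> np.ndarray:
    return E[vocab.index(token)]                # step 1: table lookup

def project(e: np.ndarray) -> np.ndarray:
    return e @ W_out                            # step 2: one raw score per token

logits = project(embed("fox"))                  # shape: (5,), unbounded reals
```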