Softmax and sampling
New concept: probabilistic decoding
We now have a logit vector — one real number per vocabulary token. Logits are useful for training (they feed directly into cross-entropy loss), but to actually sample a next token we need a valid probability distribution: non-negative values that sum to 1.
Softmax does exactly this:
$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
Exponentiation makes all values positive; dividing by the sum normalises them to sum to 1. The operation preserves relative order (the highest-logit token still gets the highest probability) but it is not a simple rescaling. In practice we also subtract the maximum logit before exponentiating to prevent float32 overflow: without the shift, exp(800) would exceed the largest representable value (~3.4×10³⁸). The shift is mathematically neutral, cancelling exactly in numerator and denominator.
```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()
```
Temperature
A single hyperparameter — temperature $T$ — controls the sharpness of the distribution:
$$\text{softmax}(x / T)_i = \frac{e^{x_i / T}}{\sum_j e^{x_j / T}}$$
- $T < 1$: amplifies differences between logits, making the distribution sharper (the model is more “confident” and repetitive).
- $T > 1$: flattens the distribution, giving lower-probability tokens more of a chance (the model is more “creative” and unpredictable).
- $T = 1$: the default; no modification.
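Temperature scaling is a one-line change to the softmax above: divide the logits by $T$ before exponentiating. A minimal sketch (the logit values here are illustrative, not taken from any model):

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Softmax with temperature; subtract max for numerical stability."""
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
print(softmax(logits, 0.5))  # sharper: the top token takes even more mass
print(softmax(logits, 1.0))  # unmodified softmax
print(softmax(logits, 2.0))  # flatter: mass spreads toward low-logit tokens
```

Note that the max-subtraction happens after the division, so the stability trick works at any temperature.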
Switch the view to Probabilities and drag the temperature slider to see the effect. Compare $T = 0.5$ (spiky) against $T = 2.0$ (flat).
The logits shown are from “fox” predicting the next token. “jumps”, “cat”, and “dog” dominate because the model (in this toy example) has learned that animals and action words follow “fox”. At $T = 1$ these tokens receive most of the probability mass; at high temperature the mass spreads across the vocabulary.
Sampling
In practice, next-token generation involves one of several sampling strategies applied to this probability distribution:
- Greedy: always pick the highest-probability token.
- Top-$k$: sample from the $k$ highest-probability tokens only.
- Top-$p$ (nucleus): sample from the smallest set of tokens whose cumulative probability exceeds $p$. Introduced by Holtzman et al. (2020) in *The Curious Case of Neural Text Degeneration*, nucleus sampling outperforms fixed top-$k$ because the candidate pool adapts to the model's confidence: a spiky distribution yields a small nucleus, a flat one a large nucleus.
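The three strategies can be sketched in a few lines each; a hedged illustration over a hand-picked toy distribution (the probability values are made up for the example):

```python
import numpy as np

def greedy(probs: np.ndarray) -> int:
    return int(np.argmax(probs))  # always the single most likely token

def top_k_sample(probs: np.ndarray, k: int, rng: np.random.Generator) -> int:
    idx = np.argsort(probs)[-k:]           # the k highest-probability tokens
    p = probs[idx] / probs[idx].sum()      # renormalise within the pool
    return int(rng.choice(idx, p=p))

def top_p_sample(probs: np.ndarray, p: float, rng: np.random.Generator) -> int:
    order = np.argsort(probs)[::-1]        # tokens in descending probability
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1  # smallest prefix with mass >= p
    pool = order[:cutoff]
    q = probs[pool] / probs[pool].sum()    # renormalise within the nucleus
    return int(rng.choice(pool, p=q))

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(greedy(probs))                   # 0
print(top_k_sample(probs, 2, rng))     # 0 or 1, weighted 2:1
print(top_p_sample(probs, 0.9, rng))   # drawn from the nucleus only
```

Note how greedy is deterministic while the other two trade determinism for diversity, and how top-$p$ derives its pool size from the distribution itself rather than from a fixed $k$.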
We now have a complete (if minimal) forward pass: token → embedding → logits → probabilities. The next step extends the input from a single token to a sequence.