03 · Attention: How Tokens Talk to Each Other

Here’s the problem attention solves. By lesson 01’s design, each position must predict its next character, but a single character is nearly useless as context. The n in an… needs to know that an a came before it; the second n in ann… needs to know it’s the second n. Attention is the mechanism that lets every position reach back, ask “who before me matters right now?”, and pull in the information it needs, using nothing but dot products and a softmax. This single idea is the heart of the Transformer, the “T” in GPT, and the reason this architecture took over the world.

Before you dive in. Attention is built almost entirely out of the dot product from 00 · Foundations §3 (multiply two lists of numbers position by position, add it up, and read the result as a similarity score), plus softmax (§5) to turn those scores into weights that sum to 1. When the lesson says q, k, and v are “a learned linear map of the embedding”, it means: take the token’s vector and multiply it by a matrix of learned dials. It is the same operation with three different dial-sets for three different purposes. The √d divisor, where d is the head dimension (d_head below), is just a volume knob that keeps scores in a comfortable range.
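The three building blocks named above fit in a few lines of plain Python. This is a minimal sketch, not the lesson's own code: the helper names, the 4-dimensional embedding, and the three small "dial" matrices are all made up for illustration, and `linear` assumes each row of the matrix produces one output number.

```python
import math

def dot(a, b):
    """Multiply two equal-length lists position by position and add it up."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, w):
    """Apply a matrix of dials: one dot product per row of w (assumed layout)."""
    return [dot(row, x) for row in w]

# Hypothetical embedding and three made-up 2x4 dial-sets.
x = [1.0, 0.0, -1.0, 0.5]
w_q = [[0.5, 0.1, 0.0, 0.2], [0.0, 0.3, -0.4, 0.1]]
w_k = [[0.2, 0.0, 0.1, 0.0], [0.1, 0.1, 0.0, 0.4]]
w_v = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
q, k, v = linear(x, w_q), linear(x, w_k), linear(x, w_v)
print("q:", q, "k:", k, "v:", v)

# The sqrt(d) volume knob: the ranking is unchanged, the weights are softer.
d = 16
raw = [8.0, 4.0, 0.0]
scaled = [s / math.sqrt(d) for s in raw]
print("unscaled:", [round(w, 3) for w in softmax(raw)])
print("scaled:  ", [round(w, 3) for w in softmax(scaled)])
```

Without the divisor, the largest score takes almost all of the weight; with it, the other positions still get a say.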

Theory

Self-attention asks, for each query token q_i, “how much should I listen to each visible token’s value v_j?”, where the visible tokens are the earlier ones and token i itself. The recipe:

1. Project every token into three vectors: query q, key k, value v. Each is a learned linear map of the token’s embedding.
2. Score the query against every key: score[i][j] = (q_i · k_j) / sqrt(d_head). Dividing by sqrt(d_head) keeps the variance of the scores roughly constant as the head dimension grows.
3. Apply a causal mask so that token i can only see tokens j ≤ i, then softmax the masked row so the attention weights sum to 1.
4. Take the weighted sum of the value vectors: output_i = Σ_j softmax(score)[i][j] * v_j.
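The recipe above can be sketched for a single head as follows. This is an illustrative sketch under stated assumptions: the projection step is taken as already done, the function name `causal_attention` is hypothetical, and the query, key, and value vectors are made-up numbers rather than anything from the trained model.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(qs, ks, vs):
    """Single-head causal self-attention over a whole sequence.

    qs, ks, vs: lists of T already-projected vectors (q, k of length d_head).
    Returns the per-position outputs and the per-position weight rows.
    """
    d_head = len(qs[0])
    outputs, all_weights = [], []
    for i, q_i in enumerate(qs):
        # Score only the visible keys j <= i, scaled by sqrt(d_head).
        scores = [sum(a * b for a, b in zip(q_i, ks[j])) / math.sqrt(d_head)
                  for j in range(i + 1)]
        weights = softmax(scores)  # one row of weights, summing to 1
        # Weighted sum of the visible value vectors.
        out = [sum(weights[j] * vs[j][c] for j in range(i + 1))
               for c in range(len(vs[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up projected vectors for a 3-token sequence with d_head = 2.
qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = causal_attention(qs, ks, vs)
for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

Row i has i + 1 entries because position i never sees anything after itself; the first row is a single weight, so the first output is just the first value vector.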

The whole layer does this per head in parallel. Each head works on its own head_dim-wide slice of the projected q, k, and v vectors, which amounts to each head learning its own query/key/value projections. The per-head outputs are concatenated and projected back to the embedding dimension via W_o.
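The slice, attend, concatenate, project bookkeeping can be sketched like this. Everything here is an assumption made for illustration: the helper names `split_heads` and `head_weights`, the toy vectors with n_embd = 4 and two heads, and an identity matrix standing in for the learned W_o.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def split_heads(vec, n_head):
    """Cut one n_embd-long vector into n_head contiguous slices."""
    head_dim = len(vec) // n_head
    return [vec[h * head_dim:(h + 1) * head_dim] for h in range(n_head)]

def head_weights(q_h, k_hs):
    """Softmax weights of one head's query slice over that head's key slices."""
    d = len(q_h)
    return softmax([sum(a * b for a, b in zip(q_h, k_h)) / math.sqrt(d)
                    for k_h in k_hs])

n_head = 2
# Made-up projected vectors (n_embd = 4): one query, two visible keys/values.
q = [1.0, 0.0, 0.0, 1.0]
keys = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0]]
values = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]]

q_heads = split_heads(q, n_head)
k_heads = [split_heads(k, n_head) for k in keys]    # indexed [t][h]
v_heads = [split_heads(v, n_head) for v in values]  # indexed [t][h]

concat, per_head_weights = [], []
for h in range(n_head):
    w = head_weights(q_heads[h], [k_heads[t][h] for t in range(len(keys))])
    per_head_weights.append(w)
    head_dim = len(q_heads[h])
    head_out = [sum(w[t] * v_heads[t][h][c] for t in range(len(values)))
                for c in range(head_dim)]
    concat.extend(head_out)  # concatenation restores length n_embd

# Identity stand-in for W_o; a trained model uses a learned matrix here.
n = len(concat)
w_o = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
output = [sum(w_o[r][c] * concat[c] for c in range(n)) for r in range(n)]
print("weights per head:", per_head_weights)
print("output:", output)
```

With these toy numbers head 0 prefers the first key and head 1 prefers the second, even though both look at the same two tokens.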

The sandbox below is a token communication lab for one head and one query token you choose. Watch the query send out score beams to each visible key, the causal mask wall block the future tokens, the raw scores turn into softmax weights (which sum to 1), and the weighted values flow into a mixer to form output_i. A few things the picture makes explicit:

Scores and softmax are different stages. attention_logits (the raw q·k/√d) and attention_softmax (the weights) are captured separately from the real model — the beams show the scores first, then re-label as weights.
Masked future tokens never enter the softmax. A score q_i·k_j could be computed for a future key j > i, and the lab draws the mask wall over those positions purely to make the rule visible. Because the mask is applied before softmax, a future token never receives any weight, and the weights normalise only over j ≤ i.
The Q/K/V strips are labels, not magnitudes. The three coloured strips under each token just mark which projection is the query, key, and value — their length is illustrative. The actual numbers live in the q·k breakdown panel.
The value beams are a weighted sum. Each visible value flows into the mixer tagged wⱼ·vⱼ; the mixer adds them up to output_i = Σ wⱼ·vⱼ.
Multi-head is real, not decoration. Each ring in the bottom row shows that head’s own softmax distribution over the same keys — flip between heads (or watch the mini-bars) and you’ll see different heads focus on different positions.
This is one selected head. output_i here is that head’s output. The full layer runs all heads in parallel, then concatenates their outputs and projects them with W_o.
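The "mask before softmax" rule from the list above can be made concrete with an explicit mask. This is a sketch of one common way to do it (setting masked logits to negative infinity), not a description of how the lab or the model implements the mask; the function name and the logits are hypothetical.

```python
import math

def masked_softmax(scores, visible):
    """Softmax over the visible positions only; masked positions get weight 0."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, visible)]
    m = max(masked)  # the query's own position is always visible, so m is finite
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw logits for a 5-token sequence, seen from query i = 2.
scores = [0.2, 1.5, -0.3, 2.0, 0.7]
i = 2
visible = [j <= i for j in range(len(scores))]
weights = masked_softmax(scores, visible)
print([round(w, 3) for w in weights])
```

Note that the largest raw logit belongs to a future position (j = 3), yet it ends up with weight 0, and the remaining weights still sum to 1 over the visible past.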

Annotated Code

The attention block lives in src/microgpt_annotated.py, subsection attention-multihead:


def gpt(token_id, pos_id, keys, values):
    # ...embedding + rmsnorm above this point...
    for li in range(n_layer):
        x_residual = x  # kept for the residual connection below
        x = rmsnorm(x)
        # Three learned linear maps of the same normalised vector.
        q = linear(x, state_dict[f'layer{li}.attn_wq'])
        k = linear(x, state_dict[f'layer{li}.attn_wk'])
        v = linear(x, state_dict[f'layer{li}.attn_wv'])
        # Grow the KV cache: it holds the positions seen so far, never the future.
        keys[li].append(k)
        values[li].append(v)
        x_attn = []
        for h in range(n_head):
            hs = h * head_dim  # start of this head's slice
            q_h = q[hs:hs+head_dim]
            k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
            v_h = [vi[hs:hs+head_dim] for vi in values[li]]
            # One scaled dot product per cached position t (here j indexes dimensions).
            attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
                           for t in range(len(k_h))]
            attn_weights = softmax(attn_logits)
            # Weighted sum of the cached value slices.
            head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
                        for j in range(head_dim)]
            x_attn.extend(head_out)  # concatenate the heads
        x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])  # output projection W_o
        x = [a + b for a, b in zip(x, x_residual)]  # residual add
        # ...MLP block follows...

Notice that no explicit mask appears in this code. The causal structure is built into the call pattern: gpt() is called once per position, and the KV cache (keys, values) grows by one row per call, so when position i is processed the cache holds only positions j ≤ i and future keys simply do not exist yet. The TypeScript port in src/inference/model.ts instead computes the whole T-length sequence in one call with an explicit j ≤ i loop: same math, different control flow.
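The two control flows can be compared side by side in a small sketch. It is written in Python for consistency with the excerpt above and is not a transcription of the TypeScript port; the function names, the single-head simplification, and the toy vectors are assumptions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, ks, vs):
    """One query against the given keys and values (single head)."""
    d = len(q)
    w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in ks])
    return [sum(w[t] * vs[t][c] for t in range(len(vs)))
            for c in range(len(vs[0]))]

def incremental(qs, ks, vs):
    """Cache style: one step per position, the cache grows by one row per step."""
    key_cache, value_cache, outs = [], [], []
    for q, k, v in zip(qs, ks, vs):
        key_cache.append(k)
        value_cache.append(v)
        # Future rows are not in the cache yet, so no mask is needed.
        outs.append(attend(q, key_cache, value_cache))
    return outs

def whole_sequence(qs, ks, vs):
    """Batch style: all positions in one call, restricted explicitly to j <= i."""
    return [attend(qs[i], ks[:i + 1], vs[:i + 1]) for i in range(len(qs))]

# Made-up projected vectors for a 3-token sequence.
qs = [[1.0, 0.5], [0.0, 1.0], [-1.0, 2.0]]
ks = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
vs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out_a = incremental(qs, ks, vs)
out_b = whole_sequence(qs, ks, vs)
print(out_a == out_b)
```

Both functions hand the same keys and values to `attend` at every position, which is why the choice between them is about control flow rather than math.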

Sandbox

Pick a sentence (≤6 chars), a head (0–3), and the query token i. Press Play to watch the phases — tokens → Q/K/V → scores → mask → softmax → weighted-value sum → multi-head. The toggles show/hide the Q/K/V vectors, raw scores, softmax weights, the masked future, and the multi-head overview. Click an inspect q·k button (or a beam label) for the per-dimension dot-product breakdown.

Two colours carry the two stages, top to bottom: orange beams are the attention weights from the query to each visible key (upper lane), and green beams are the weighted values wⱼ·vⱼ flowing from each Vⱼ chip down into the output mixer (lower lane). A future token (j > i) is masked: its chip greys out and it gets no orange edge and no green beam.

Try this.

Load anna, query the final a, and look at the orange weights. Which earlier character does this head care about most? Now switch heads (0–3) — do they agree? (They shouldn’t: each head learned its own habit.)

Set the query to the first character. Its softmax row has exactly one visible key — itself. What must its attention weight be, before you even look?

Click an inspect q·k button and find which dimensions of the dot product contribute most. A similarity score is a sum of per-dimension agreements — here they are, one by one.

Watch the mask wall. The greyed-out future tokens get no orange beam at all — confirm the weights over the visible past still sum to 1.
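The per-dimension view from the third exercise can be reproduced by hand. This is a hedged sketch: the function name `qk_breakdown` and the two vectors are made up, and it only shows the arithmetic of a dot-product breakdown, not how the lab's panel is implemented.

```python
import math

def qk_breakdown(q, k):
    """Return the per-dimension products, the raw dot product, and the scaled score."""
    parts = [a * b for a, b in zip(q, k)]
    raw = sum(parts)
    return parts, raw, raw / math.sqrt(len(q))

# Hypothetical query and key slices for one head (head_dim = 4).
q = [0.5, -1.0, 2.0, 0.0]
k = [1.0, 0.5, 1.5, -3.0]
parts, raw, score = qk_breakdown(q, k)

# Rank dimensions by how strongly they push the score, in either direction.
ranked = sorted(range(len(parts)), key=lambda d: abs(parts[d]), reverse=True)
for d in ranked:
    print(f"dim {d}: q={q[d]:+.2f} k={k[d]:+.2f} product={parts[d]:+.2f}")
print("raw:", raw, "scaled:", score)
```

Here dimension 2 dominates, dimensions 0 and 1 cancel each other out, and dimension 3 contributes nothing because the query is zero there, however large the key is.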