04 · Transformer Block: One Pass from token_id to logits


This is the lesson where the machine stops being a diagram and becomes a specific, finite object. A token id goes in (together with its position), thirty-ish operations happen, every one of them visible below, and 27 numbers come out: one score per possible next character. GPT-4 and microGPT differ here mainly in scale: more blocks stacked, wider vectors, more heads. The shape of the journey you’re about to trace is the same one a trillion-dollar industry runs billions of times per second.

Before you dive in. This lesson is an assembly line: a token’s vector (its embedding, a list of 16 numbers — 00 · Foundations §3) enters, passes through a fixed sequence of stations, and comes out as logits — one raw score per possible next character (§5). Every station is something you’ve already met: matrix multiplications (batched dot products), the attention recipe from lesson 03, additions, and two small new pieces. RMSNorm rescales a vector so its numbers stay a healthy overall size — normalization, nothing more. ReLU is the simplest curve in deep learning: negatives become 0, positives pass through. Don’t memorize the station order; the sandbox is the map.
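The two new pieces are small enough to try on plain floats before meeting them inside the block. The following is a minimal sketch, not the reference code: it works on ordinary Python floats rather than Value objects, and the eps constant is an assumed small number added only to avoid dividing by zero.

```python
def rmsnorm(x, eps=1e-5):
    # eps is an assumption for this sketch: a tiny constant for numerical safety.
    mean_square = sum(xi * xi for xi in x) / len(x)
    scale = (mean_square + eps) ** -0.5
    # Every element is multiplied by the same factor, so direction is preserved.
    return [xi * scale for xi in x]

def relu(x):
    # Negatives become 0, positives pass through unchanged.
    return [max(0.0, xi) for xi in x]

x = [3.0, -4.0, 0.0, 12.0]
normed = rmsnorm(x)
# After rescaling, the mean of the squared entries is very close to 1.
mean_square_after = sum(v * v for v in normed) / len(normed)
rectified = relu(x)
```

Note that rmsnorm changes only the overall size of the vector, while relu changes individual entries; that difference is why one counts as normalization and the other as a non-linearity.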

Theory

Lessons 01–03 zoomed out (the forward → loss → sample loop) and in (autograd, attention). This lesson wires up the piece in the middle: the single function gpt() that turns one (token_id, pos_id) into a vector of logits over the next character. For microGPT that’s exactly one transformer block (n_layer = 1), n_embd = 16, n_head = 4, head_dim = 4.

The data path, in the order the reference code runs it:

1. Embedding: look up the token’s row of wte and the position’s row of wpe, add them element-wise → a length-16 vector x.
2. RMSNorm (the initial one): Karpathy applies rmsnorm once right here, before the block. It’s easy to miss and looks redundant next to the norm inside the block, but his comment is explicit: “not redundant due to backward pass via the residual connection.” It changes what the residual branch carries.
3. Attention sub-block (pre-norm + residual):
   - save x_residual = x (branch ①),
   - rmsnorm a copy,
   - multi-head attention: the same q·kᵀ/√head_dim → softmax → ·v from lesson 03, run per head and concatenated,
   - project with attn_wo,
   - add the saved branch ① back.
4. MLP sub-block (pre-norm + residual):
   - save x_residual = x (branch ②),
   - rmsnorm a copy,
   - mlp_fc1 expands 16 → 64,
   - ReLU (not GeLU; the reference uses max(0, x)),
   - mlp_fc2 projects 64 → 16,
   - add the saved branch ② back.
5. LM Head: linear(x, lm_head) projects the final 16-vector to one logit per vocabulary token. There is no final norm before lm_head in the reference.
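The "run per head and concatenated" step in the attention sub-block is mostly bookkeeping with slices. As a rough illustration (plain floats, and with each head's slice standing in for the head's real output, which is an assumption made only to keep the sketch short), the 16-vector is cut into 4 contiguous pieces of 4 and glued back in the same order:

```python
n_embd, n_head = 16, 4
head_dim = n_embd // n_head  # 4 numbers per head

# A stand-in query vector: 0.0, 1.0, ..., 15.0 so the slices are easy to read.
q = [float(i) for i in range(n_embd)]

# Head h owns the contiguous slice [h*head_dim, (h+1)*head_dim).
heads = [q[h * head_dim:(h + 1) * head_dim] for h in range(n_head)]

# Concatenation is just extend(), head after head, back to length n_embd.
x_attn = []
for head in heads:
    x_attn.extend(head)
```

In the real block each head would transform its slice through softmax-weighted values before the extend, but the slice boundaries and the concatenation order are the same.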

That’s the whole block. Things microGPT deliberately does not have, and which therefore are not in the sandbox: LayerNorm (it uses RMSNorm), GeLU (it uses ReLU), dropout, and bias terms on any linear layer.

A note on the two residuals: this is a pre-norm transformer. Each sub-block normalizes a copy of x, runs its sub-layer, and adds the result back to the un-normalized x it saved. That saved-then-added bypass is drawn as the two arcs on the right of the scene.
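The pre-norm pattern described above fits in one small helper. This is a sketch under assumptions: prenorm_residual and the two toy sub-layers are hypothetical names invented for illustration, and everything runs on plain floats instead of Value objects.

```python
def rmsnorm(x, eps=1e-5):
    # eps is an assumed small constant for numerical safety.
    scale = (sum(xi * xi for xi in x) / len(x) + eps) ** -0.5
    return [xi * scale for xi in x]

def prenorm_residual(x, sublayer):
    # Hypothetical helper: save x, normalize a copy, run the sub-layer,
    # then add the result back to the saved (not re-normalized) x.
    x_residual = x
    h = sublayer(rmsnorm(x))
    return [a + b for a, b in zip(h, x_residual)]

x = [1.0, -2.0, 3.0, 0.5]

# A sub-layer that contributes nothing: the bypass carries x through untouched.
silent = lambda h: [0.0] * len(h)
out_silent = prenorm_residual(x, silent)

# A sub-layer that echoes its input: the output is x plus the normalized copy.
echo = lambda h: h
out_echo = prenorm_residual(x, echo)
```

The first case is the useful one to remember: whatever the sub-layer does, the original x always has a direct path to the output.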

Parameter Initialization

Before any forward pass, the weights have to exist. microGPT builds them once, as plain Gaussian random scalars wrapped in Value (py lines 99–114):


matrix = lambda nout, nin, std=0.08: [[Value(random.gauss(0, std)) for _ in range(nin)] for _ in range(nout)]
state_dict = {'wte': matrix(vocab_size, n_embd), 'wpe': matrix(block_size, n_embd), 'lm_head': matrix(vocab_size, n_embd)}
for i in range(n_layer):
    state_dict[f'layer{i}.attn_wq'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wk'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wv'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wo'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.mlp_fc1'] = matrix(4 * n_embd, n_embd)
    state_dict[f'layer{i}.mlp_fc2'] = matrix(n_embd, 4 * n_embd)
params = [p for mat in state_dict.values() for row in mat for p in row]

Every matrix is nout × nin of random.gauss(0, 0.08) values — a plain normal distribution with standard deviation 0.08. That is the entire initialization: no Xavier, no Kaiming, no special scaling. With n_embd = 16, block_size = 16, n_head = 4, and vocab_size = len(uchars) + 1, the state_dict holds:

| matrix | shape (nout × nin) | role |
| --- | --- | --- |
| `wte` | vocab_size × 16 | token embedding table |
| `wpe` | 16 × 16 | position embedding table (block_size × n_embd) |
| `attn_wq` / `wk` / `wv` / `wo` | 16 × 16 each | per-layer Q/K/V projections + output projection |
| `mlp_fc1` | 64 × 16 | MLP up-projection (4·n_embd × n_embd) |
| `mlp_fc2` | 16 × 64 | MLP down-projection (n_embd × 4·n_embd) |
| `lm_head` | vocab_size × 16 | final projection to logits |
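The shapes in the table are enough to count every scalar without building a single weight. The sketch below assumes vocab_size = 27, taken from the "27 numbers come out" figure at the top of this lesson; with a different dataset that number would change.

```python
vocab_size = 27  # assumption: matches the 27 logits mentioned in this lesson
n_embd, block_size, n_layer = 16, 16, 1

# (nout, nin) for every matrix, mirroring the keys of state_dict.
shapes = {
    'wte': (vocab_size, n_embd),
    'wpe': (block_size, n_embd),
    'lm_head': (vocab_size, n_embd),
}
for i in range(n_layer):
    for name in ('attn_wq', 'attn_wk', 'attn_wv', 'attn_wo'):
        shapes[f'layer{i}.{name}'] = (n_embd, n_embd)
    shapes[f'layer{i}.mlp_fc1'] = (4 * n_embd, n_embd)
    shapes[f'layer{i}.mlp_fc2'] = (n_embd, 4 * n_embd)

# One scalar per matrix cell.
counts = {name: nout * nin for name, (nout, nin) in shapes.items()}
total = sum(counts.values())
```

Under that assumption the total comes to 4,192 scalars, which is the length the flattened params list would have.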

linear(x, w) reads each weight matrix as [nout][nin], so output j is the dot product of w[j] with the input. Finally params flattens every scalar from every matrix into one flat list — exactly what the 05 · Training & Generation Adam loop walks over, with one m/v buffer and one update per scalar, every step.
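The [nout][nin] reading can be checked by hand on a tiny matrix. This is a minimal float-only sketch of the idea described above, not a copy of the reference helper, which operates on Value objects.

```python
def linear(x, w):
    # w is [nout][nin]: each row of w produces one output number,
    # the dot product of that row with the input vector x.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

# 3 outputs, 2 inputs: row 0 picks x[0], row 1 picks x[1], row 2 adds them.
w = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
x = [2.0, 5.0]
y = linear(x, w)  # -> [2.0, 5.0, 7.0]

# Flattening works the same way as for params: row by row, scalar by scalar.
flat = [p for row in w for p in row]
```

The output length always equals the number of rows, which is why mlp_fc1 (64 rows of 16) turns a 16-vector into a 64-vector.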

Annotated Code

The block lives in src/microgpt_annotated.py, subsection attention-multihead (the helpers linear / softmax / rmsnorm are in overview-pipeline-helpers):


def gpt(token_id, pos_id, keys, values):
    tok_emb = state_dict['wte'][token_id]          # token embedding
    pos_emb = state_dict['wpe'][pos_id]            # position embedding
    x = [t + p for t, p in zip(tok_emb, pos_emb)]  # joint embedding
    x = rmsnorm(x)  # note: not redundant due to backward pass via the residual connection
 
    for li in range(n_layer):
        # 1) Multi-head Attention block
        x_residual = x
        x = rmsnorm(x)
        q = linear(x, state_dict[f'layer{li}.attn_wq'])
        k = linear(x, state_dict[f'layer{li}.attn_wk'])
        v = linear(x, state_dict[f'layer{li}.attn_wv'])
        keys[li].append(k); values[li].append(v)
        x_attn = []
        for h in range(n_head):
            hs = h * head_dim
            q_h = q[hs:hs+head_dim]
            k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
            v_h = [vi[hs:hs+head_dim] for vi in values[li]]
            attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
                           for t in range(len(k_h))]
            attn_weights = softmax(attn_logits)
            head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
                        for j in range(head_dim)]
            x_attn.extend(head_out)
        x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
        x = [a + b for a, b in zip(x, x_residual)]
        # 2) MLP block
        x_residual = x
        x = rmsnorm(x)
        x = linear(x, state_dict[f'layer{li}.mlp_fc1'])
        x = [xi.relu() for xi in x]
        x = linear(x, state_dict[f'layer{li}.mlp_fc2'])
        x = [a + b for a, b in zip(x, x_residual)]
 
    logits = linear(x, state_dict['lm_head'])
    return logits

The TypeScript port in src/inference/model.ts computes the same path. Its difference is mechanical: Python calls gpt() once per position with a growing KV cache, while the port takes the whole sequence and applies an explicit j ≤ i causal mask — same math, different control flow (the same point lesson 03 makes about attention).
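The "same math, different control flow" claim can be checked on a single head with random floats. The sketch below is an illustration only: it masks future positions with negative infinity before the softmax, which is one common way to implement a j ≤ i mask and is an assumption here, not a description of how model.ts does it.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # exp(-inf) is 0.0
    s = sum(exps)
    return [e / s for e in exps]

def score(q, k):
    # Scaled dot product for one query/key pair.
    return sum(a * b for a, b in zip(q, k)) / len(q) ** 0.5

random.seed(0)
T, d = 5, 4  # sequence length and head_dim for this toy example
rand_vecs = lambda: [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
qs, ks, vs = rand_vecs(), rand_vecs(), rand_vecs()

# Style 1: one position per call, attending over a growing KV cache.
cache_k, cache_v, out_cached = [], [], []
for i in range(T):
    cache_k.append(ks[i])
    cache_v.append(vs[i])
    w = softmax([score(qs[i], k) for k in cache_k])
    out_cached.append([sum(w[t] * cache_v[t][c] for t in range(len(cache_v)))
                       for c in range(d)])

# Style 2: whole sequence at once, future positions (j > i) masked out.
out_masked = []
for i in range(T):
    logits = [score(qs[i], ks[j]) if j <= i else float('-inf') for j in range(T)]
    w = softmax(logits)
    out_masked.append([sum(w[j] * vs[j][c] for j in range(T)) for c in range(d)])

max_diff = max(abs(a - b) for ra, rb in zip(out_cached, out_masked)
               for a, b in zip(ra, rb))
```

Masked positions receive a weight of exactly zero, so they contribute nothing to the weighted sum, and the two styles agree position by position.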

Sandbox

Each module on the path is a block you can click to see its input → output shape and the exact Python line it runs. Press play (or scrub) to send a data pulse down the path; the two green arcs are the residual bypasses (saved at ①/② and added back at the matching Add stages). The attention stage summarizes the same computation explained in lesson 03 — this lesson focuses on where attention sits inside the complete block. It is a map of the block’s structure and execution order, not a per-stage inspector of real tensor values (the shapes shown are the static layer dimensions; lesson 03 is where you watch the actual attention numbers).

Try this.

Play the pulse once end to end, then answer without looking: how many times does the data get normalized? How many times does a residual arc rejoin the path?

Click the mlp_fc1 station and check its shape: 16 in, 64 out. Then mlp_fc2: 64 in, 16 out. The block breathes — expand, think, then compress.

Find the station whose Python line you could now rewrite from memory. (Candidate: the embedding add — it’s one zip away from lesson 00’s vector addition.)

Using the parameter table above, estimate what fraction of all ~4,000 dials live in the embedding and lm_head tables, versus inside the block itself. Surprised?