04 · Transformer Block: One Pass from token_id to logits
This is the lesson where the machine stops being a diagram and becomes a specific, finite object. One number goes in (a token id), thirty-ish operations happen — every one of them visible below — and 27 numbers come out. GPT-4 and microGPT differ here only in quantity: more blocks stacked, wider vectors, more heads. The shape of the journey you’re about to trace is the same one a trillion-dollar industry runs billions of times per second.
Before you dive in. This lesson is an assembly line: a token’s vector (its embedding, a list of 16 numbers — 00 · Foundations §3) enters, passes through a fixed sequence of stations, and comes out as logits — one raw score per possible next character (§5). Every station is something you’ve already met: matrix multiplications (batched dot products), the attention recipe from lesson 03, additions, and two small new pieces. RMSNorm rescales a vector so its numbers stay a healthy overall size — normalization, nothing more. ReLU is the simplest curve in deep learning: negatives become 0, positives pass through. Don’t memorize the station order; the sandbox is the map.
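The two new pieces can be sketched on plain Python floats. This is a rough illustration, not the reference code: the reference works on `Value` objects, and the epsilon `1e-5` and the list-based `relu` helper here are assumptions made for the example.

```python
import math

def rmsnorm(x, eps=1e-5):
    """Rescale x so its root-mean-square is close to 1 (eps is an assumed stabilizer)."""
    mean_sq = sum(xi * xi for xi in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [xi * scale for xi in x]

def relu(x):
    """Negatives become 0, positives pass through unchanged."""
    return [max(0.0, xi) for xi in x]

x = [3.0, -4.0, 0.0, 12.0]
normed = rmsnorm(x)

# After normalization the overall size (root-mean-square) is about 1,
# while the relative proportions between entries are unchanged.
rms_after = math.sqrt(sum(v * v for v in normed) / len(normed))
print(round(rms_after, 3))  # 1.0
print(relu(x))              # [3.0, 0.0, 0.0, 12.0]
```

Note that neither function has any learned parameters in this sketch: both are fixed transformations of their input.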
Theory
Lessons 01–03 zoomed out (the forward → loss → sample loop) and in (autograd, attention). This lesson wires up the piece in the middle: the single function gpt() that turns one (token_id, pos_id) into a vector of logits over the next character. For microGPT that’s exactly one transformer block (n_layer = 1), n_embd = 16, n_head = 4, head_dim = 4.
The data path, in the order the reference code runs it:
- Embedding: look up the token’s row of `wte` and the position’s row of `wpe`, add them element-wise → a length-16 vector `x`.
- RMSNorm (the initial one): Karpathy applies `rmsnorm` once right here, before the block. It’s easy to miss and looks redundant next to the norm inside the block, but his comment is explicit: “not redundant due to backward pass via the residual connection.” It changes what the residual branch carries.
- Attention sub-block (pre-norm + residual):
  - save `x_residual = x` (branch ①),
  - `rmsnorm` a copy,
  - multi-head attention: the same `q·kᵀ/√head_dim → softmax → ·v` from lesson 03, run per head and concatenated,
  - project with `attn_wo`,
  - add the saved branch ① back.
- MLP sub-block (pre-norm + residual):
  - save `x_residual = x` (branch ②),
  - `rmsnorm` a copy,
  - `mlp_fc1` expands 16 → 64,
  - ReLU (not GeLU; the reference uses `max(0, x)`),
  - `mlp_fc2` projects 64 → 16,
  - add the saved branch ② back.
- LM Head: `linear(x, lm_head)` projects the final 16-vector to one logit per vocabulary token. There is no final norm before `lm_head` in the reference.
That’s the whole block. Things microGPT deliberately does not have, and which therefore are not in the sandbox: LayerNorm (it uses RMSNorm), GeLU (it uses ReLU), dropout, and biases on any linear.
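A note on the two residuals: this is a pre-norm transformer. Each sub-block normalizes a copy of x, runs its sub-layer, and adds the result back to the x it saved before that normalization. (For the attention sub-block, the saved x has already passed through the initial rmsnorm, but not through the sub-block’s own.) That saved-then-added bypass is drawn as the two arcs on the right of the scene.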
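The pre-norm pattern can be written as a small helper. This is a minimal sketch on plain floats; `prenorm_residual`, `zero_sublayer`, and `double` are hypothetical names introduced for illustration, and the epsilon in `rmsnorm` is an assumption.

```python
import math

def rmsnorm(x, eps=1e-5):
    """Rescale x so its root-mean-square is close to 1 (eps is assumed)."""
    mean_sq = sum(xi * xi for xi in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [xi * scale for xi in x]

def prenorm_residual(x, sublayer):
    """Compute x + sublayer(rmsnorm(x)); the saved x skips both norm and sub-layer."""
    saved = x                       # the residual branch
    out = sublayer(rmsnorm(x))      # only the copy gets normalized
    return [a + b for a, b in zip(out, saved)]

x = [0.5, -1.5, 2.0, 0.0]

# Hypothetical sub-layer that outputs zeros: the wrapper then reduces to
# the identity, because the bypass carries x through untouched.
zero_sublayer = lambda v: [0.0 for _ in v]
print(prenorm_residual(x, zero_sublayer) == x)  # True

# Hypothetical sub-layer that doubles its input: the result is the saved x
# plus twice the normalized copy.
double = lambda v: [2.0 * vi for vi in v]
y = prenorm_residual(x, double)
print(len(y))  # 4
```

In the block itself, the attention path and the MLP path each play the role of `sublayer`.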
Parameter Initialization
Before any forward pass, the weights have to exist. microGPT builds them once, as plain Gaussian random scalars wrapped in Value (py lines 99–114):
```python
matrix = lambda nout, nin, std=0.08: [[Value(random.gauss(0, std)) for _ in range(nin)] for _ in range(nout)]
state_dict = {'wte': matrix(vocab_size, n_embd), 'wpe': matrix(block_size, n_embd), 'lm_head': matrix(vocab_size, n_embd)}
for i in range(n_layer):
    state_dict[f'layer{i}.attn_wq'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wk'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wv'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wo'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.mlp_fc1'] = matrix(4 * n_embd, n_embd)
    state_dict[f'layer{i}.mlp_fc2'] = matrix(n_embd, 4 * n_embd)
params = [p for mat in state_dict.values() for row in mat for p in row]
```

Every matrix is nout × nin values drawn from random.gauss(0, 0.08), a plain normal distribution with standard deviation 0.08. That is the entire initialization: no Xavier, no Kaiming, no special scaling. With n_embd = 16, block_size = 16, n_head = 4, and vocab_size = len(uchars) + 1, the state_dict holds:
| matrix | shape (nout × nin) | role |
|---|---|---|
| wte | vocab_size × 16 | token embedding table |
| wpe | 16 × 16 | position embedding table (block_size × n_embd) |
| attn_wq / attn_wk / attn_wv / attn_wo | 16 × 16 each | per-layer Q/K/V projections + output projection |
| mlp_fc1 | 64 × 16 | MLP up-projection (4·n_embd × n_embd) |
| mlp_fc2 | 16 × 64 | MLP down-projection (n_embd × 4·n_embd) |
| lm_head | vocab_size × 16 | final projection to logits |
linear(x, w) reads each weight matrix as [nout][nin], so output j is the dot product of w[j] with the input. Finally params flattens every scalar from every matrix into one flat list — exactly what the 05 · Training & Generation Adam loop walks over, with one m/v buffer and one update per scalar, every step.
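The `[nout][nin]` convention and the flattened parameter list can be checked with a short sketch. It makes two assumptions: weights are plain floats rather than `Value` objects, and vocab_size is 27, matching the 27 output numbers mentioned at the top of the lesson. The `shapes` dictionary is a hypothetical helper for this example only.

```python
import random

vocab_size, n_embd, block_size, n_layer = 27, 16, 16, 1  # vocab_size = 27 is assumed

def matrix(nout, nin, std=0.08):
    # Plain floats stand in for Value objects in this sketch.
    return [[random.gauss(0, std) for _ in range(nin)] for _ in range(nout)]

def linear(x, w):
    """Output j is the dot product of row w[j] with the input x."""
    return [sum(wj * xj for wj, xj in zip(row, x)) for row in w]

# Same shapes as the table above, written as (nout, nin).
shapes = {
    'wte': (vocab_size, n_embd),
    'wpe': (block_size, n_embd),
    'lm_head': (vocab_size, n_embd),
}
for i in range(n_layer):
    for name in ('attn_wq', 'attn_wk', 'attn_wv', 'attn_wo'):
        shapes[f'layer{i}.{name}'] = (n_embd, n_embd)
    shapes[f'layer{i}.mlp_fc1'] = (4 * n_embd, n_embd)
    shapes[f'layer{i}.mlp_fc2'] = (n_embd, 4 * n_embd)

state = {name: matrix(nout, nin) for name, (nout, nin) in shapes.items()}
params = [p for mat in state.values() for row in mat for p in row]
print(len(params))  # 4192

# A 16-vector through the 64 x 16 up-projection gives a 64-vector.
h = linear([1.0] * n_embd, state['layer0.mlp_fc1'])
print(len(h))  # 64
```

Because `linear` iterates over rows, the number of rows in a matrix is always the output length and the number of columns is always the input length.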
Annotated Code
The block lives in src/microgpt_annotated.py, subsection attention-multihead (the helpers linear / softmax / rmsnorm are in overview-pipeline-helpers):
```python
def gpt(token_id, pos_id, keys, values):
    tok_emb = state_dict['wte'][token_id] # token embedding
    pos_emb = state_dict['wpe'][pos_id] # position embedding
    x = [t + p for t, p in zip(tok_emb, pos_emb)] # joint embedding
    x = rmsnorm(x) # note: not redundant due to backward pass via the residual connection
    for li in range(n_layer):
        # 1) Multi-head Attention block
        x_residual = x
        x = rmsnorm(x)
        q = linear(x, state_dict[f'layer{li}.attn_wq'])
        k = linear(x, state_dict[f'layer{li}.attn_wk'])
        v = linear(x, state_dict[f'layer{li}.attn_wv'])
        keys[li].append(k); values[li].append(v)
        x_attn = []
        for h in range(n_head):
            hs = h * head_dim
            q_h = q[hs:hs+head_dim]
            k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
            v_h = [vi[hs:hs+head_dim] for vi in values[li]]
            attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
                           for t in range(len(k_h))]
            attn_weights = softmax(attn_logits)
            head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
                        for j in range(head_dim)]
            x_attn.extend(head_out)
        x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
        x = [a + b for a, b in zip(x, x_residual)]
        # 2) MLP block
        x_residual = x
        x = rmsnorm(x)
        x = linear(x, state_dict[f'layer{li}.mlp_fc1'])
        x = [xi.relu() for xi in x]
        x = linear(x, state_dict[f'layer{li}.mlp_fc2'])
        x = [a + b for a, b in zip(x, x_residual)]
    logits = linear(x, state_dict['lm_head'])
    return logits
```

The TypeScript port in src/inference/model.ts computes the same path. Its difference is mechanical: Python calls gpt() once per position with a growing KV cache, while the port takes the whole sequence and applies an explicit j ≤ i causal mask. Same math, different control flow (the same point lesson 03 makes about attention).
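The equivalence of the two control flows can be checked on a toy single-head example. This is a sketch under stated assumptions, not code from either implementation: the queries, keys, and values are random floats rather than projections of real embeddings, and `attend_cached` and `attend_masked` are hypothetical names.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def attend_cached(q, cache_k, cache_v):
    """Style 1: one query against whatever the cache holds so far."""
    d = len(q)
    scores = [sum(q[c] * k[c] for c in range(d)) / math.sqrt(d) for k in cache_k]
    w = softmax(scores)
    return [sum(w[t] * cache_v[t][c] for t in range(len(cache_v))) for c in range(d)]

def attend_masked(i, qs, ks, vs):
    """Style 2: score every position, then mask out j > i with -inf."""
    d = len(qs[i])
    scores = [
        sum(qs[i][c] * ks[j][c] for c in range(d)) / math.sqrt(d) if j <= i else float('-inf')
        for j in range(len(ks))
    ]
    w = softmax(scores)
    return [sum(w[j] * vs[j][c] for j in range(len(vs))) for c in range(d)]

random.seed(0)
T, d = 5, 4  # toy sequence length and head size
qs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
ks = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
vs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]

# Style 1: grow the cache one position at a time.
cache_k, cache_v, out_cached = [], [], []
for i in range(T):
    cache_k.append(ks[i])
    cache_v.append(vs[i])
    out_cached.append(attend_cached(qs[i], cache_k, cache_v))

# Style 2: process the whole sequence with an explicit causal mask.
out_masked = [attend_masked(i, qs, ks, vs) for i in range(T)]

same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_a, row_b in zip(out_cached, out_masked)
    for a, b in zip(row_a, row_b)
)
print(same)  # True
```

Masked positions receive a softmax weight of exactly 0, so they contribute nothing to the weighted sum. That is why leaving future positions out of the cache and masking them out give the same result.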
Sandbox
Each module on the path is a block you can click to see its input → output shape and the exact Python line it runs. Press play (or scrub) to send a data pulse down the path; the two green arcs are the residual bypasses (saved at ①/② and added back at the matching Add stages). The attention stage summarizes the same computation explained in lesson 03 — this lesson focuses on where attention sits inside the complete block. It is a map of the block’s structure and execution order, not a per-stage inspector of real tensor values (the shapes shown are the static layer dimensions; lesson 03 is where you watch the actual attention numbers).
Try this.
- Play the pulse once end to end, then answer without looking: how many times does the data get normalized? How many times does a residual arc rejoin the path?
- Click the mlp_fc1 station and check its shape: 16 in, 64 out. Then mlp_fc2: 64 in, 16 out. The block breathes — expand, think, then compress.
- Find the station whose Python line you could now rewrite from memory. (Candidate: the embedding add; it’s one `zip` away from lesson 00’s vector addition.)
- Using the parameter table above, estimate what fraction of all ~4,000 dials live in the embedding and lm_head tables, versus inside the block itself. Surprised?