Glossary

Every term the lessons use, one ruthless sentence each. Terms link back to the lesson that shows them in action.

The model

Parameter / weight — one of the model’s ~4,000 adjustable dials; training is the process of setting them. (00, 05)
Token — a character converted to a number so the math can eat it; big models use word-chunks, microGPT uses single characters. (01)
Tokenizer — the lookup table that converts characters to tokens and back. (01)
Embedding — the list of 16 numbers (a vector) that represents one token inside the model. (04)
Logits — the model’s 27 raw output scores, one per possible next character, before they’re turned into probabilities. (01, 04)
Softmax — the converter from raw scores to probabilities: exponentiate everything, then divide by the total. (00, 03, 05)
Distribution — the full set of 27 probabilities summing to 1; the model’s actual answer is always a distribution, never a single character. (01)
Sampling — drawing one character at random according to the distribution, rather than always taking the most likely one. (01, 05)
Temperature — a constant the scores are divided by before softmax; high = flatter and more random, low = sharper and more repetitive. (05)
START / STOP sentinel — the one special token that marks “beginning of text” on the way in and “end of text” when predicted as output. (01)
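
To make the chain above concrete (tokenizer, logits, softmax, temperature, sampling), here is a minimal sketch in plain Python. It is not the lessons' code: the vocabulary layout (one sentinel written as "." plus a to z), the helper names, and the logit values are all assumptions made up for illustration.

```python
import math
import random

# Assumed 27-symbol vocabulary: one sentinel plus a-z (hypothetical layout).
SENTINEL = "."
VOCAB = [SENTINEL] + [chr(ord("a") + i) for i in range(26)]
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCAB)}

def encode(text):
    """Tokenizer, forward direction: characters -> token ids."""
    return [CHAR_TO_ID[ch] for ch in text]

def decode(ids):
    """Tokenizer, reverse direction: token ids -> characters."""
    return "".join(VOCAB[i] for i in ids)

def softmax(logits, temperature=1.0):
    """Divide scores by the temperature, exponentiate, divide by the total."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow, result unchanged
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    """Draw one token id at random according to the distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Made-up logits: the "model" prefers 'a', then 'e', and is neutral otherwise.
logits = [0.0] * len(VOCAB)
logits[CHAR_TO_ID["a"]] = 2.0
logits[CHAR_TO_ID["e"]] = 1.0

rng = random.Random(0)  # fixed seed so runs are repeatable
for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, temperature=t)
    next_id = sample(probs, rng)
    print(f"T={t}: p(a)={probs[CHAR_TO_ID['a']]:.3f}, sampled {decode([next_id])!r}")
```

Running it shows p(a) shrinking as the temperature rises: the same logits, a flatter distribution, and more varied samples.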

The architecture (lessons 03–04)

Attention — the step where each position scores itself and all earlier positions for relevance (dot products), softmaxes the scores into weights, and takes a weighted average of what those positions offer. (03)
Query / Key / Value (q, k, v) — three vectors derived from each position: what I’m looking for (q), what I advertise (k), what I hand over if chosen (v). (03)
Causal mask — the rule that a position may only look at itself and earlier positions, never future ones, enforced by setting the forbidden scores to −∞ before softmax so their weights come out as exactly 0. (03)
Attention head — one independent copy of the attention mechanism; microGPT runs several in parallel so they can each learn a different pattern. (03, 04)
MLP (multi-layer perceptron) — the “think it over” stage after attention: two matrix multiplications with a ReLU between them, applied to each position separately. (04)
ReLU — the simplest curve in deep learning: negative numbers become 0, positive numbers pass through unchanged. (02, 04)
RMSNorm — a rescaling step that keeps each vector’s numbers at a healthy overall size so nothing explodes or vanishes; normalization, nothing more. (04)
Residual connection — adding a block’s input back to its output, so each block only has to learn a correction, not a full transformation. (04)
Rotary embedding (RoPE) — the trick that tells attention where each token sits in the sequence, by rotating q and k by position-dependent angles. (04)
Transformer block — one full unit of the architecture: norm → attention → residual → norm → MLP → residual. (04)
Decoder-only transformer — the GPT family shape: a stack of transformer blocks with a causal mask, trained to predict the next token. (04)
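
The attention, causal mask, RMSNorm, ReLU, and residual entries can be sketched in a few lines of plain Python. This is an illustrative toy, not the lessons' implementation: it uses a single head, skips the learned q/k/v projections (the normalized inputs stand in for all three), assumes RMSNorm without a learned gain, and scales scores by 1/sqrt(dimension), which is the usual convention but is an assumption here. The mask is applied by setting future scores to -inf before softmax, which makes their weights exactly 0.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_attention(q, k, v):
    """q, k, v: one vector per position. Returns (outputs, weights)."""
    n, dim = len(q), len(q[0])
    scale = 1.0 / math.sqrt(dim)  # assumed scaling, standard in transformers
    outputs, all_weights = [], []
    for i in range(n):
        # Position i scores itself and earlier positions; the future gets -inf.
        scores = [dot(q[i], k[j]) * scale if j <= i else -math.inf
                  for j in range(n)]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(weights[j] * v[j][d] for j in range(n))
               for d in range(len(v[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

def rmsnorm(x, eps=1e-5):
    """Rescale x so its root-mean-square is about 1 (no learned gain here)."""
    rms = math.sqrt(sum(xi * xi for xi in x) / len(x) + eps)
    return [xi / rms for xi in x]

def relu(x):
    """Negative numbers become 0, positive numbers pass through."""
    return [max(0.0, xi) for xi in x]

# Three positions, two numbers each (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
normed = [rmsnorm(row) for row in x]
attn_out, weights = causal_attention(normed, normed, normed)

# Residual connection: add the block's input back to its output.
x_after = [[a + b for a, b in zip(row, out)] for row, out in zip(x, attn_out)]

for i, row in enumerate(weights):
    print(f"position {i} attends with weights {[round(w, 3) for w in row]}")
```

The printed weights form a lower triangle: position 0 can only see itself, position 1 sees positions 0 and 1, and so on.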

The training loop (lessons 02 & 05)

Forward pass — running the function: tokens in, logits out. (01, 04)
Loss — one number measuring how wrong the model is; here, -log p(truth) averaged over positions, so confident wrong answers hurt most. (01, 05)
Gradient — for each parameter, the answer to “if I nudge this dial up a little, how much does the loss change, and which way?” (00, 02)
Chain rule — sensitivities through chained operations multiply, like gear ratios; the backbone of backpropagation. (00, 02)
Backpropagation / .backward() — computing every parameter’s gradient in one backward sweep through the computation graph. (02)
Autograd — the bookkeeping machinery that records the computation graph during the forward pass so .backward() can replay it in reverse. (02)
DAG (directed acyclic graph) — the shape of that recorded computation: arrows flow forward, no loops; “acyclic” just means you can’t walk in a circle. (02)
Topological order — visiting the graph’s nodes so that every node comes after the nodes it depends on; the order .backward() walks in reverse. (02)
Gradient descent — the training recipe: nudge every dial a small step against its gradient, repeat thousands of times. (00, 05)
Learning rate — the size of that nudge; too big overshoots, too small crawls. (05)
Adam — a smarter version of gradient descent that adapts each dial’s step size based on the history of its gradients. (05)
Step / iteration — one full cycle of forward pass → loss → backward pass → parameter update. (05)
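
The training-loop terms fit together in a very small program. The sketch below is a toy written for this glossary, not the lessons' autograd: the `Value` class, its fields, and the one-parameter problem (find w so that w * 3 is 6) are all hypothetical. It uses plain gradient descent with a fixed learning rate rather than Adam, and a squared-error loss rather than -log p(truth), to keep the example short.

```python
class Value:
    """A number that remembers how it was computed: one node in the DAG."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this one depends on
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Topological order: every node appears after the nodes it depends on.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk the order in reverse, multiplying sensitivities (chain rule).
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad

w = Value(0.0)        # the single parameter ("dial")
learning_rate = 0.05  # assumed value, chosen so the toy converges quickly

for step in range(20):
    w.grad = 0.0                       # clear the previous step's gradient
    x, y = Value(3.0), Value(6.0)      # one made-up training example
    err = w * x - y                    # forward pass
    loss = err * err                   # loss: squared error
    loss.backward()                    # backward pass
    w.data -= learning_rate * w.grad   # update: step against the gradient

print(f"w after training: {w.data:.6f}")
```

Each pass through the loop is one step in the glossary's sense: forward pass, loss, backward pass, parameter update.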