Glossary

Every term the lessons use, one ruthless sentence each. Terms link back to the lesson that shows them in action.

The model

  • Parameter / weight — one of the model’s ~4,000 adjustable dials; training is the process of setting them. (00, 05)
  • Token — a character converted to a number so the math can eat it; big models use word-chunks, microGPT uses single characters. (01)
  • Tokenizer — the lookup table that converts characters to tokens and back. (01)
  • Embedding — the list of 16 numbers (a vector) that represents one token inside the model. (04)
  • Logits — the model’s 27 raw output scores, one per possible next character, before they’re turned into probabilities. (01, 04)
  • Softmax — the converter from raw scores to probabilities: exponentiate everything, then divide by the total. (00, 03, 05)
  • Distribution — the full set of 27 probabilities summing to 1; the model’s actual answer is always a distribution, never a single character. (01)
  • Sampling — drawing one character at random according to the distribution, rather than always taking the most likely one. (01, 05)
  • Temperature — a constant the scores are divided by before softmax; high = flatter and more random, low = sharper and more repetitive. (05)
  • START / STOP sentinel — the one special token that marks “beginning of text” on the way in and “end of text” when predicted as output. (01)
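
The softmax, temperature, and sampling entries above fit together as one short pipeline. The sketch below is a minimal illustration in plain Python, not microGPT's actual code: the function names `softmax` and `sample` and the three-score example are assumptions made for readability (the real model produces 27 scores).

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    scaled = [s / temperature for s in scores]   # temperature divides the scores
    peak = max(scaled)                           # subtracting the max avoids overflow
    exps = [math.exp(s - peak) for s in scaled]  # exponentiate everything
    total = sum(exps)
    return [e / total for e in exps]             # then divide by the total

def sample(probs, rng):
    """Draw one index at random according to the distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits for a 3-character vocabulary.
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)                   # the distribution at temperature 1
sharp = softmax(logits, temperature=0.5)  # low temperature: top score dominates more
flat = softmax(logits, temperature=2.0)   # high temperature: closer to uniform

rng = random.Random(0)                    # seeded so runs are repeatable
next_token = sample(probs, rng)
```

Comparing `sharp[0]`, `probs[0]`, and `flat[0]` shows the effect of temperature directly: the most likely character gets a larger share as the temperature drops and a smaller share as it rises.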

The architecture (lessons 03–04)

  • Attention — the step where each position scores all earlier positions for relevance (dot products), softmaxes the scores into weights, and takes a weighted average of what those positions offer. (03)
  • Query / Key / Value (q, k, v) — three vectors derived from each position: what I’m looking for (q), what I advertise (k), what I hand over if chosen (v). (03)
  • Causal mask — the rule that a position may only look at itself and earlier positions, never future ones; typically enforced by setting the forbidden scores to negative infinity before softmax, so their weights come out as exactly zero. (03)
  • Attention head — one independent copy of the attention mechanism; microGPT runs several in parallel so they can each learn a different pattern. (03, 04)
  • MLP (multi-layer perceptron) — the “think it over” stage after attention: two matrix multiplications with a ReLU between them, applied to each position separately. (04)
  • ReLU — the simplest curve in deep learning: negative numbers become 0, positive numbers pass through unchanged. (02, 04)
  • RMSNorm — a rescaling step that keeps each vector’s numbers at a healthy overall size so nothing explodes or vanishes; normalization, nothing more. (04)
  • Residual connection — adding a block’s input back to its output, so each block only has to learn a correction, not a full transformation. (04)
  • Rotary embedding (RoPE) — the trick that tells attention where each token sits in the sequence, by rotating q and k by position-dependent angles. (04)
  • Transformer block — one full unit of the architecture: norm → attention → residual → norm → MLP → residual. (04)
  • Decoder-only transformer — the GPT family shape: a stack of transformer blocks with a causal mask, trained to predict the next token. (04)
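
The attention, q/k/v, and causal mask entries can be seen working together in a few lines. This is a minimal single-head sketch in plain Python, not microGPT's actual code: the name `causal_attention`, the toy 2-number vectors, and the division by the square root of the vector size (a common convention the glossary does not mention) are all assumptions.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head attention; q, k, v are lists of vectors, one per position."""
    dim = len(q[0])
    outputs = []
    for t in range(len(q)):
        # Causal mask: position t only scores positions 0..t (itself included),
        # so later positions never receive any weight.
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        weights = softmax(scores)  # relevance scores become weights summing to 1
        # Weighted average of the values the visible positions hand over.
        out = [
            sum(w * v[s][d] for s, w in enumerate(weights))
            for d in range(len(v[0]))
        ]
        outputs.append(out)
    return outputs

# Hypothetical toy sequence of three positions.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

out = causal_attention(q, k, v)

# Changing only the last position's value must not affect earlier outputs.
v_changed = [v[0], v[1], [9.0, 9.0]]
out_changed = causal_attention(q, k, v_changed)
```

Here the mask is applied by simply not looping over future positions, which gives the same result as assigning them zero weight. The first position can only see itself, so its output is exactly its own value vector.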

The training loop (lessons 02 and 05)

  • Forward pass — running the function: tokens in, logits out. (01, 04)
  • Loss — one number measuring how wrong the model is; here, -log p(truth) averaged over positions, so confident wrong answers hurt most. (01, 05)
  • Gradient — for each parameter, the answer to “if I nudge this dial up a little, how much does the loss change, and which way?” (00, 02)
  • Chain rule — sensitivities through chained operations multiply, like gear ratios; the backbone of backpropagation. (00, 02)
  • Backpropagation / .backward() — computing every parameter’s gradient in one backward sweep through the computation graph. (02)
  • Autograd — the bookkeeping machinery that records the computation graph during the forward pass so .backward() can replay it in reverse. (02)
  • DAG (directed acyclic graph) — the shape of that recorded computation: arrows flow forward, no loops; “acyclic” just means you can’t walk in a circle. (02)
  • Topological order — visiting the graph’s nodes so that every node comes after the nodes it depends on; the order .backward() walks in reverse. (02)
  • Gradient descent — the training recipe: nudge every dial a small step against its gradient, repeat thousands of times. (00, 05)
  • Learning rate — the size of that nudge; too big overshoots, too small crawls. (05)
  • Adam — a smarter version of gradient descent that adapts each dial’s step size based on the history of its gradients. (05)
  • Step / iteration — one full cycle of forward pass → loss → backward pass → parameter update. (05)
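
The entries above describe one loop: forward pass, loss, backward pass, parameter update. The sketch below shows that loop on a single dial, using a tiny hand-rolled autograd. It is an illustration under stated assumptions, not microGPT's actual code: the `Value` class, its attribute names, the squared-error loss, and the toy numbers are all hypothetical, and the update is plain gradient descent rather than Adam.

```python
class Value:
    """A scalar that remembers how it was computed (one node in the DAG)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this one depends on
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Topological order: every node comes after the nodes it depends on.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk the order in reverse; sensitivities multiply (chain rule).
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad

# One dial, fitted so that w * 2 is close to 6 (so w should approach 3).
w = Value(0.0)
learning_rate = 0.1

for step in range(50):
    x = Value(2.0)
    negative_target = Value(-6.0)   # stored negated because Value has no subtraction
    error = w * x + negative_target  # forward pass
    loss = error * error             # one number measuring how wrong we are
    w.grad = 0.0                     # clear the gradient from the previous step
    loss.backward()                  # backward pass fills in w.grad
    w.data -= learning_rate * w.grad  # nudge the dial against its gradient
```

After the loop, `w.data` sits very close to 3 and `loss.data` is close to 0. Raising `learning_rate` well above this value makes the updates overshoot, and lowering it makes the same fit take many more steps.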