02 · Autograd: How Gradients Flow Through a Tiny DAG

🌐 English · Русский · Eesti

Every deep-learning framework on Earth — PyTorch, JAX, TensorFlow — has one spell at its core: loss.backward(). Call it, and somehow every one of a billion parameters learns exactly how it contributed to the error. This lesson de-mystifies that spell with a version small enough to watch: 25 lines of Python, and a 3D graph where you can literally see the gradients flow backward, arrow by arrow, number by number.

Before you dive in. Everything here rests on one idea from 00 · Foundations: the derivative as a sensitivity dial (§4) — “if I nudge this input a tiny bit, how much does the output move?” — and the chain rule as gear ratios: sensitivities through chained operations multiply. Two new words to decode the title: a DAG is just the diagram of a computation (arrows point forward, no loops), and topological order means “visit every node after the nodes it depends on” — the order you’d naturally compute things in anyway. That’s all the prerequisite theory; the sandbox makes the rest visible.

Theory

A neural network is a function composed of many small ops: add, multiply, exponentiate, ReLU. To train it, we need the gradient of the loss with respect to every parameter. The trick that makes this practical is reverse-mode automatic differentiation.

Karpathy’s Value class implements it in about 25 lines. Each Value records:

its scalar data,
the list of _children it was built from,
the local gradient d(self) / d(child_i) for each child.

When you call loss.backward():

Traverse the graph in topological order (children before parents).
Initialize loss.grad = 1.
Walk the topo list in reverse and, for each node v, distribute v.grad into each child via the chain rule: child.grad += local_grad_i * v.grad.

That’s it. No symbolic math, no static graph — the graph is built on the fly as you do the forward pass, and backward() just plays it in reverse.

The 3D sandbox below lets you type an expression, see the live DAG that the Value ops build, then play either the forward pulse (data flowing from leaves to root) or the backward pulse (gradients flowing root to leaves). Drag a slider to change a leaf’s value and watch the whole graph recompute.

Reading the graph

Layout. Leaf variables sit on the left, each operation to the right of its inputs, and the root on the far right. Branches that merge — like a and b both feeding +, then (a+b) and c feeding * — are drawn as branches that visibly come together, so you can see it’s a DAG, not a chain.
Backward is progressive. Press Backward and the reveal starts at the root with root.grad = 1, then flows outward; each node’s g=… only appears once the gradient has actually reached it. Nothing shows all the final gradients up front.
Chain rule on the arrows. Every backward arrow is labelled incoming grad × local derivative = contribution. For the default example, the * node sends 1 × c = 10 toward (a+b) and 1 × (a+b) = -1 toward c; (a+b) then sends 10 × 1 = 10 to each of a and b. That is the chain rule, made literal.
Derived ops. - and / are tagged derived: the engine builds them from primitives (a - b is a + (b·-1), a / b is a · b^-1), so the single node you see folds that internal structure. The gradients are still exact — e.g. the right input of - gets a local derivative of -1.
Exponents are constants. ** only accepts a numeric literal exponent (a ** 3), shown tagged const with no gradient flowing into it. A variable exponent like a ** b is rejected, because this engine’s pow differentiates only the base — accepting it would silently leave b.grad = 0.

Annotated Code

The Value class lives in src/microgpt_annotated.py, in the subsection marked autograd-value-class:


class Value:
    def __init__(self, data, _children=(), _local_grads=()):
        self.data = data
        self.grad = 0
        self._children = _children
        self._local_grads = _local_grads
 
    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1, 1))
 
    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))
 
    # ... pow, exp, log, relu identical in spirit ...
 
    def backward(self):
        topo = []
        visited = set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children: build(c)
                topo.append(v)
        build(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

The TypeScript port in src/inference/value.ts mirrors this one-for-one — same field names, same op semantics — so the equivalence tests can introspect both sides.

Sandbox

Type any expression using + - * / **, relu(x), exp(x), log(x), single-letter variables, and parentheses. Hit a preset for a starting point. Drag sliders to change variable values. Press Play to watch the pulse sweep the graph, or scrub the timeline by hand; switching Forward/Backward restarts the pulse from the start.

Each node is a small computation chip — a structured card shows its value and grad (the grad reads -- until the backward wave reaches it). Two toggles: local derivatives annotates the derived ops with their primitive expansion, and final gradients reveals every gradient at once (off by default, so the step-by-step backward stays intact). Drag to orbit a little; the view is clamped so the graph always stays readable.

Try this.

Keep the default (a + b) * c and — before pressing Backward — predict the three gradients yourself with the chain rule. Then press play and check. (Spoiler-free hint: c’s gradient is whatever (a+b) currently equals.)

Type relu(a * b) and drag the sliders until a * b goes negative. Watch every gradient upstream of the ReLU snap to zero — the “dead ReLU” you’ll meet again in lesson 04, live.

Build a * a (the same variable twice). Notice the gradient adds up from both paths — that’s the += in the backward loop doing its job.

Find an expression where nudging a variable up makes the output go down. What sign is its gradient?