Autodifferentiation in a nutshell
This post is going to be a short one. There are two ways to perform automatic differentiation (“autodiff”): forward and backward differentiation.
Forward differentiation
Dual numbers for forward differentiation are quite similar to complex numbers. We define a dual number as $$a + b \epsilon$$ where $\epsilon^2 = 0$. The key idea is that given a function $f(x)$, we can compute its Taylor expansion: $$f(y) = f(x) + f'(x)(y - x) + \frac{f''(x)}{2}(y - x)^2 + \cdots$$ Then letting $y = x + \epsilon$, every term of order $\epsilon^2$ or higher vanishes, and we get: $$f(x + \epsilon) = f(x) + f'(x)\epsilon$$ This means that we can compute the derivative of $f(x)$ by evaluating $f(x + \epsilon)$ and reading off the coefficient of $\epsilon$. This is the forward mode of autodiff. That was the single-dimensional case. We define dual numbers in a similar fashion for the multi-dimensional case; then, where $e_i$ is the $i$-th standard basis vector: $$f(x + \epsilon e_i) = f(x) + \nabla f(x)^T (\epsilon e_i) = f(x) + \frac{\partial f}{\partial x_i} \epsilon$$ Doing this once for each of the $n$ basis vectors, we get the gradient of $f(x)$.
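To make this concrete, here is a minimal sketch of dual-number forward differentiation in Python. The `Dual` class and the example function are illustrative, not from any particular library, and only addition and multiplication are implemented:

```python
class Dual:
    """A dual number a + b*eps, where eps**2 = 0.
    `a` holds the value, `b` holds the derivative."""

    def __init__(self, a, b=0.0):
        self.a = a  # real part: f(x)
        self.b = b  # epsilon part: f'(x)

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a1 + b1*eps)(a2 + b2*eps) = a1*a2 + (a1*b2 + b1*a2)*eps
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)

    __rmul__ = __mul__


def f(x):
    return 3 * x * x + 2 * x  # f'(x) = 6x + 2

# Seed the epsilon part with 1 to differentiate with respect to x.
result = f(Dual(2.0, 1.0))
print(result.a, result.b)  # 16.0 14.0  (f(2) = 16, f'(2) = 14)
```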
Backward differentiation
Backward differentiation is a bit more involved, but requires nothing beyond the multivariate chain rule. Suppose we have a function $f(x, y, z)$ with $x = g(a, b, c)$, $y = h(a, b, c)$, $z = k(a, b, c)$. Then we can compute the gradient of $f$ with respect to $a, b, c$ by computing the partial derivatives of $f$ with respect to $x, y, z$ and combining them with the partial derivatives of $x, y, z$ with respect to $a, b, c$. This is the backward mode of autodiff. $$\frac{\partial f}{\partial a} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial a} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial a} + \frac{\partial f}{\partial z} \frac{\partial z}{\partial a}$$ $$\frac{\partial f}{\partial b} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial b} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial b} + \frac{\partial f}{\partial z} \frac{\partial z}{\partial b}$$ $$\frac{\partial f}{\partial c} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial c} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial c} + \frac{\partial f}{\partial z} \frac{\partial z}{\partial c}$$
Note the computational trick we can employ here: at any point we need only materialize the partial derivatives of the most recent “layer”, pushing the accumulated gradient backward one layer at a time. A minimal sketch of this follows.
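Here is a minimal reverse-mode sketch in Python (the `Var` class and `backward` method are illustrative names, not any framework's API). Each node stores only its local partial derivatives, and the backward pass multiplies the upstream gradient by them via the chain rule:

```python
class Var:
    """A scalar node in a computation graph for reverse-mode autodiff."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # list of (parent_node, local_partial) pairs
        self.grad = 0.0

    def __add__(self, other):
        # d(x + y)/dx = 1, d(x + y)/dy = 1
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # d(x * y)/dx = y, d(x * y)/dy = x
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, upstream=1.0):
        # Accumulate dL/dself, then propagate to parents via the chain rule.
        # (Naive recursion may revisit shared subgraphs; real implementations
        # process nodes in topological order instead.)
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)


a, b = Var(2.0), Var(3.0)
f = a * b + a  # f = ab + a, so df/da = b + 1 = 4, df/db = a = 2
f.backward()
print(a.grad, b.grad)  # 4.0 2.0
```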
Notes on Jacobian-Vector and Vector-Jacobian products
(edit: 03/22/2026)
One thing we can note is that the operations above are simply JVP and VJP operations: forward differentiation computes Jacobian-vector products, and backward differentiation computes vector-Jacobian products. To see this, let’s look at forward differentiation first. Given a direction $\vec{v}$, we have that:
$$f(x + \epsilon \vec{v}) = f(x) + \nabla f(x)^T \epsilon \vec{v} = f(x) + \epsilon \nabla f(x)^T \vec{v}$$ In other words, the coefficient of $\epsilon$ is simply the directional derivative of $f$ in the direction $\vec{v}$. When $f: \mathbb{R}^n \to \mathbb{R}$, the Jacobian of $f$ is simply a row vector, so the directional derivative is given by: $$J_f(x) \vec{v} = \nabla f(x)^T \vec{v}$$ However, when $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is a matrix, and the directional derivative is given by: $$J_f(x) \vec{v} = \begin{bmatrix} \nabla f_1(x)^T \vec{v} \\ \nabla f_2(x)^T \vec{v} \\ \vdots \\ \nabla f_m(x)^T \vec{v} \end{bmatrix}$$ This is precisely the JVP operation: a Jacobian times a vector.
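As a sketch, seeding the dual parts of the inputs with the components of a direction $\vec{v}$ recovers $J_f(x)\vec{v}$ in a single forward pass. The `Dual` class is the same illustrative one as above, and `f` and the numbers are made up for the example:

```python
import numpy as np

class Dual:
    """a + b*eps with eps**2 = 0; b carries the tangent."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)
    __rmul__ = __mul__

def f(x1, x2):
    # f: R^2 -> R^2 with Jacobian [[2*x1, 1], [x2, x1]]
    return [x1 * x1 + x2, x1 * x2]

x = np.array([1.0, 2.0])
v = np.array([1.0, 0.0])

# Seed each input's epsilon part with the matching component of v.
outputs = f(Dual(x[0], v[0]), Dual(x[1], v[1]))
jvp = np.array([out.b for out in outputs])
print(jvp)  # [2. 2.] == J_f(x) @ v, since J_f(x) = [[2, 1], [2, 1]]
```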
Now let’s look at backward differentiation. Suppose we have a scalar output; for concreteness, fix it to be an MSE-style loss function: $$L(\theta) = \frac{1}{2} \sum_{i=1}^n (y_i - f_\theta(x_i))^2$$ Suppose that $f$ is a deep learning model, so it is deeply nested, and that we have already computed $\frac{\partial L}{\partial g}$ where $g$ is some intermediate output of the model. We are concerned with computing $\frac{\partial L}{\partial h_i}$. Let $g$ be a function of $h_1, \dots, h_n$ with $g: \mathbb{R}^n \to \mathbb{R}^m$. Then we have:
$$\frac{\partial L}{\partial h_i} = \sum_{j=1}^m \frac{\partial L}{\partial g_j} \frac{\partial g_j}{\partial h_i}$$ Expanding it into vector form, we have: $$\nabla_h L = J_g(h)^T \nabla_g L, \quad \text{or equivalently} \quad (\nabla_h L)^T = (\nabla_g L)^T J_g(h)$$ The right-hand form is a row vector times a Jacobian: precisely the VJP operation.
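As a sketch (the layer `g`, its Jacobian, and the upstream gradient below are made up for illustration), one backward step pushes $\nabla_g L$ through the transposed Jacobian:

```python
import numpy as np

# One "layer": g(h) = (h1 * h2, h1 + h2, h2^2), so g: R^2 -> R^3
def jacobian_g(h):
    h1, h2 = h
    return np.array([[h2, h1],
                     [1.0, 1.0],
                     [0.0, 2 * h2]])

h = np.array([3.0, 4.0])
grad_g = np.array([1.0, 1.0, 1.0])  # upstream gradient dL/dg (illustrative)

# VJP: grad_h = J_g(h)^T @ grad_g, i.e. the row vector (dL/dg)^T
# multiplied by the Jacobian J_g(h).
grad_h = jacobian_g(h).T @ grad_g
print(grad_h)  # [ 5. 12.]
```

For a scalar loss, chaining such steps from the output back to the parameters is exactly backpropagation.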