Autodifferentiation in a nutshell
This post is going to be a short one. There are two ways to perform automatic differentiation (“autodiff”): forward and backward differentiation.
Forward differentiation
Dual numbers for forward differentiation are quite similar to complex numbers. We define a dual number as $$a + b \epsilon$$ where $\epsilon^2 = 0$. The key idea is that given a function $f(x)$, we can compute its Taylor expansion: $$f(y) = f(x) + f'(x)(y - x) + \frac{f''(x)}{2}(y - x)^2 + \cdots$$ Then letting $y = x + \epsilon$, every term of order $\epsilon^2$ or higher vanishes, and we get: $$f(x + \epsilon) = f(x) + f'(x)\epsilon$$ This means that we can compute the derivative of $f(x)$ by evaluating $f(x + \epsilon)$ and reading off the coefficient of $\epsilon$. This is the forward mode of autodiff. That was the single-dimensional case. We define dual numbers in a similar fashion for the multi-dimensional case; then, where $e_i$ is the $i$-th standard basis vector: $$f(x + \epsilon e_i) = f(x) + \nabla f(x)^T (\epsilon e_i) = f(x) + \frac{\partial f}{\partial x_i} \epsilon$$ Doing this once for each of the $n$ basis vectors, we get the gradient of $f(x)$.
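To make this concrete, here is a minimal sketch of dual-number forward differentiation in Python. The `Dual` class and the example function are illustrative, not from any particular library, and only addition and multiplication are implemented:

```python
class Dual:
    """A dual number a + b*eps, where eps**2 = 0.
    `a` holds the value, `b` holds the derivative."""

    def __init__(self, a, b=0.0):
        self.a = a  # real part: f(x)
        self.b = b  # epsilon part: f'(x)

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a1 + b1*eps)(a2 + b2*eps) = a1*a2 + (a1*b2 + b1*a2)*eps
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)

    __rmul__ = __mul__


def f(x):
    return 3 * x * x + 2 * x  # f'(x) = 6x + 2

# Seed the epsilon part with 1 to differentiate with respect to x.
result = f(Dual(2.0, 1.0))
print(result.a, result.b)  # 16.0 14.0  (f(2) = 16, f'(2) = 14)
```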
Backward differentiation
Backward differentiation is a bit more involved, but requires nothing beyond the multivariate chain rule. Suppose we have a function $f(x, y, z)$ with $x = g(a, b, c)$, $y = h(a, b, c)$, $z = k(a, b, c)$. Then we can compute the gradient of $f$ with respect to $a, b, c$ by computing the partial derivatives of $f$ with respect to $x, y, z$ and combining them with the partial derivatives of $x, y, z$ with respect to $a, b, c$. This is the backward mode of autodiff. $$\frac{\partial f}{\partial a} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial a} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial a} + \frac{\partial f}{\partial z} \frac{\partial z}{\partial a}$$ $$\frac{\partial f}{\partial b} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial b} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial b} + \frac{\partial f}{\partial z} \frac{\partial z}{\partial b}$$ $$\frac{\partial f}{\partial c} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial c} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial c} + \frac{\partial f}{\partial z} \frac{\partial z}{\partial c}$$
Note the computational trick we can employ here: at any point we need only materialize the partial derivatives of the most recent “layer”, pushing the accumulated gradient backward one layer at a time. A minimal sketch of this follows.
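Here is a minimal reverse-mode sketch in Python (the `Var` class and `backward` method are illustrative names, not any framework's API). Each node stores only its local partial derivatives, and the backward pass multiplies the upstream gradient by them via the chain rule:

```python
class Var:
    """A scalar node in a computation graph for reverse-mode autodiff."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # list of (parent_node, local_partial) pairs
        self.grad = 0.0

    def __add__(self, other):
        # d(x + y)/dx = 1, d(x + y)/dy = 1
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # d(x * y)/dx = y, d(x * y)/dy = x
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, upstream=1.0):
        # Accumulate dL/dself, then propagate to parents via the chain rule.
        # (Naive recursion may revisit shared subgraphs; real implementations
        # process nodes in topological order instead.)
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)


a, b = Var(2.0), Var(3.0)
f = a * b + a  # f = ab + a, so df/da = b + 1 = 4, df/db = a = 2
f.backward()
print(a.grad, b.grad)  # 4.0 2.0
```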
Notes on Jacobian-Vector and Vector-Jacobian products
(edit: 03/22/2026)
One thing we can note is that the operations above are simply JVP and VJP operations: forward differentiation computes Jacobian-vector products, and backward differentiation computes vector-Jacobian products. To see this, let’s look at forward differentiation first. Given a direction $\vec{v}$, we have that:
$$f(x + \epsilon \vec{v}) = f(x) + \nabla f(x)^T \epsilon \vec{v} = f(x) + \epsilon \nabla f(x)^T \vec{v}$$ In other words, the coefficient of $\epsilon$ is simply the directional derivative of $f$ in the direction $\vec{v}$. When $f: \mathbb{R}^n \to \mathbb{R}$, the Jacobian of $f$ is simply a row vector, so the directional derivative is given by: $$J_f(x) \vec{v} = \nabla f(x)^T \vec{v}$$ However, when $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is a matrix, and the directional derivative is given by: $$J_f(x) \vec{v} = \begin{bmatrix} \nabla f_1(x)^T \vec{v} \\ \nabla f_2(x)^T \vec{v} \\ \vdots \\ \nabla f_m(x)^T \vec{v} \end{bmatrix}$$ This is precisely the JVP operation: a Jacobian times a vector.
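As a sketch, seeding the dual parts of the inputs with the components of a direction $\vec{v}$ recovers $J_f(x)\vec{v}$ in a single forward pass. The `Dual` class is the same illustrative one as above, and `f` and the numbers are made up for the example:

```python
import numpy as np

class Dual:
    """a + b*eps with eps**2 = 0; b carries the tangent."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)
    __rmul__ = __mul__

def f(x1, x2):
    # f: R^2 -> R^2 with Jacobian [[2*x1, 1], [x2, x1]]
    return [x1 * x1 + x2, x1 * x2]

x = np.array([1.0, 2.0])
v = np.array([1.0, 0.0])

# Seed each input's epsilon part with the matching component of v.
outputs = f(Dual(x[0], v[0]), Dual(x[1], v[1]))
jvp = np.array([out.b for out in outputs])
print(jvp)  # [2. 2.] == J_f(x) @ v, since J_f(x) = [[2, 1], [2, 1]]
```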
Now let’s look at backward differentiation. Suppose we have a scalar output; for concreteness, fix it to be an MSE-style loss function: $$L(\theta) = \frac{1}{2} \sum_{i=1}^n (y_i - f_\theta(x_i))^2$$ Suppose that $f$ is a deep learning model, so it is deeply nested, and that we have already computed $\frac{\partial L}{\partial g}$ where $g$ is some intermediate output of the model. We are concerned with computing $\frac{\partial L}{\partial h_i}$. Let $g$ be a function of $h_1, \dots, h_n$ with $g: \mathbb{R}^n \to \mathbb{R}^m$. Then we have:
$$\frac{\partial L}{\partial h_i} = \sum_{j=1}^m \frac{\partial L}{\partial g_j} \frac{\partial g_j}{\partial h_i}$$ Expanding it into vector form, we have: $$\nabla_h L = J_g(h)^T \nabla_g L, \quad \text{or equivalently} \quad (\nabla_h L)^T = (\nabla_g L)^T J_g(h)$$ The right-hand form is a row vector times a Jacobian: precisely the VJP operation.
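As a sketch (the layer `g`, its Jacobian, and the upstream gradient below are made up for illustration), one backward step pushes $\nabla_g L$ through the transposed Jacobian:

```python
import numpy as np

# One "layer": g(h) = (h1 * h2, h1 + h2, h2^2), so g: R^2 -> R^3
def jacobian_g(h):
    h1, h2 = h
    return np.array([[h2, h1],
                     [1.0, 1.0],
                     [0.0, 2 * h2]])

h = np.array([3.0, 4.0])
grad_g = np.array([1.0, 1.0, 1.0])  # upstream gradient dL/dg (illustrative)

# VJP: grad_h = J_g(h)^T @ grad_g, i.e. the row vector (dL/dg)^T
# multiplied by the Jacobian J_g(h).
grad_h = jacobian_g(h).T @ grad_g
print(grad_h)  # [ 5. 12.]
```

For a scalar loss, chaining such steps from the output back to the parameters is exactly backpropagation.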