Autodifferentiation in a nutshell
This post is going to be a short one. There are two ways to perform automatic differentiation (“autodiff”), forward and backward differentiation.
Forward differentiation
Dual numbers for forward differentation are quite similar to complex numbers. We define a dual number as such: $$a + b \epsilon$$ where $\epsilon^2 = 0$. The whole premise of this is that suppose that we have a function $f(x)$, we can compute it’s Taylor expansion: $$f(y) = f(x) + f’(x)(y - x) + \frac{f’’(x)}{2}(y - x)^2 + \cdots$$ Then letting $y = x + \epsilon$, we get: $$f(x + \epsilon) = f(x) + f’(x)\epsilon$$ This means that we can compute the derivative of $f(x)$ by computing $f(x + \epsilon)$ and looking at the coefficient of $\epsilon$. This is the forward mode of autodiff. This was the single dimensional case. We define dual numbers in a similar fashion for the multi-dimensional case, then we have, where $e_i$ is the $i$-th standard basis vector: $$f(x + \epsilon e_i) = f(x) + \nabla f(x)^T \epsilon e_i = \frac{\partial f}{\partial x_i} \epsilon$$ Doing this $n$ times for each basis vector, we get the gradient of $f(x)$.
Backward differentiation
Backward differentiation is a bit more complex but no more than multivariate calculus. Suppose we have a function $f(x, y, z)$ and $x = g(a, b, c)$, $y = h(a, b, c)$, $z = k(a, b, c)$. Then we can compute the gradient of $f$ with respect to $a, b, c$ by computing the partial derivatives of $f$ with respect to $x, y, z$ and then multiplying by the gradient of $x, y, z$ with respect to $a, b, c$. This is the backward mode of autodiff. $$\frac{\partial f}{\partial a} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial a} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial a} + \frac{\partial f}{\partial z} \frac{\partial z}{\partial a}$$ $$\frac{\partial f}{\partial b} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial b} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial b} + \frac{\partial f}{\partial z} \frac{\partial z}{\partial b}$$ $$\frac{\partial f}{\partial c} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial c} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial c} + \frac{\partial f}{\partial z} \frac{\partial z}{\partial c}$$
Note the computational tricks that we can employ here. At any point we need only materialize the partial derivatives of the most recent “layer”.