KZ's Brain Dump

Variational Inference: ELBO

Dimensionality reduction is a fundamental problem in machine learning. Oftentimes, high-dimensional distributions over data or parameters can be represented on a much more tractable lower-dimensional manifold. In fact, there is an interesting result known as the “Johnson-Lindenstrauss Lemma” (JL Lemma) that shows something quite remarkable.

Theorem (Probabilistic Johnson-Lindenstrauss Lemma): Let $X$ be a set of $n$ points in $\mathbb{R}^d$ and fix $\epsilon, \delta \in (0, 1)$. Then, for $k$ large enough (depending only on $n$, $\epsilon$, and $\delta$), there exists a random linear map $f: \mathbb{R}^d \to \mathbb{R}^k$ such that: $$P\Big(\forall \, x, y \in X: \ ||f(x) - f(y)||_2^2 \in [(1 - \epsilon) ||x - y||_2^2, (1 + \epsilon) ||x - y||_2^2]\Big) \geq 1 - \delta$$ Colloquially: with high probability, the map preserves every pairwise Euclidean distance in our dataset simultaneously, so the guarantee is global rather than per-pair.
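To get a feel for the scale of $k$, we can plug representative numbers into the explicit bound that falls out of the proof below (the values $n = 10^6$, $\epsilon = 0.1$, $\delta = 0.01$ are arbitrary illustrative choices): $$k \geq \frac{8}{\epsilon^2} \ln \frac{n^2}{\delta} = 800 \cdot \ln 10^{14} \approx 2.6 \times 10^4$$ Notably, this is independent of the ambient dimension $d$: the required target dimension depends only on $n$, $\epsilon$, and $\delta$.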

Proof: Let $Z_{ij} \sim N(0, 1)$ be iid standard normal random variables, where $Z \in \mathbb{R}^{k \times d}$, and assume $k < d$ because we are concerned with dimensionality reduction. We claim that the random linear map $$f(x) = \frac{1}{\sqrt{k}} Z x$$ preserves the Euclidean distances between points in our dataset. Since $f$ is linear, it suffices to show that $||f(u)||_2^2$ concentrates around $1$ for a fixed unit vector $u$ (take $u = \frac{x - y}{||x - y||_2}$ for a pair $x, y \in X$). We observe the quantity: \begin{align*} (Zu)_{i} &= \sum_{j=1}^d Z_{ij} u_j \sim N\Big(0, \sum_{j=1}^d u_{j}^2\Big) = N(0, 1) \\ \end{align*} This implies that: \begin{align*} ||Zu||_2^2 = u^TZ^TZu &= \sum_{i=1}^k (Zu)_i^2 \sim \sum_{i=1}^k \chi^2(1) = \chi^2(k) \end{align*} In other words: $$||f(u)||_2^2 = \frac{1}{k} ||Zu||_2^2 \sim \frac{1}{k} \chi^2(k)$$ which has mean $1$. Knowing this, because $\chi^2(1)$ is sub-Exponential, we can apply a sub-Exponential tail bound to this average of $k$ iid $\chi^2(1)$ variables to obtain, for $\epsilon \in (0, 1)$: \begin{align*} P(||f(u)||_2^2 \not \in 1 \pm \epsilon) &= P(\big | ||f(u)||_2^2 - 1 \big | \geq \epsilon) \leq 2e^{-\frac{\epsilon^2 k}{8}} \\ \end{align*} However, this is just for a single pair. Because there are $\binom{n}{2}$ pairs of points in our dataset, we can apply a union bound to obtain: \begin{align*} P\big(\exists \, x, y \in X: ||f(x) - f(y)||_2^2 \not \in (1 \pm \epsilon) ||x - y||_2^2\big) \leq \binom{n}{2} \cdot 2e^{-\frac{\epsilon^2 k}{8}} \leq n^2 e^{-\frac{\epsilon^2 k}{8}} \\ \end{align*} This means that for any $\delta \in (0, 1)$, we can choose $k \geq \frac{8}{\epsilon^2} \ln \frac{n^2}{\delta}$ so that the probability of the map failing to preserve some distance is less than $\delta$. Because the probability of the map preserving all the distances is nonzero, we have an existence proof as well.
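As a quick numerical sanity check, here is a minimal Python sketch of the construction above: project $n$ points with a scaled Gaussian matrix and measure the worst pairwise distortion. The sizes $n$, $d$, $\epsilon$, $\delta$ are arbitrary illustrative choices, and `pairwise_sq_dists` is just a helper written for this snippet.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only: n points in R^d, target dimension k taken from the
# bound in the proof, k >= 8 eps^-2 ln(n^2 / delta).
n, d, eps, delta = 200, 5_000, 0.25, 0.1
k = int(np.ceil(8 / eps**2 * np.log(n**2 / delta)))

X = rng.normal(size=(n, d))        # the dataset (any point cloud works here)
Z = rng.normal(size=(k, d))        # Z_ij ~ N(0, 1), iid
Y = X @ Z.T / np.sqrt(k)           # rows of Y are f(x) = Zx / sqrt(k)

def pairwise_sq_dists(A):
    """All pairwise squared Euclidean distances between rows of A."""
    sq = np.sum(A**2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * A @ A.T

iu = np.triu_indices(n, 1)         # the C(n, 2) distinct pairs
ratios = pairwise_sq_dists(Y)[iu] / pairwise_sq_dists(X)[iu]

print(f"k = {k}, worst distortion = {np.abs(ratios - 1).max():.3f}, target eps = {eps}")
```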

The “next” phase of dimensionality reduction comes in the form of latent variable models. For any dataset we have, we posit a latent variable $z$ governing the underlying data distribution. It is ultimately a “generative” model in which we model the full joint distribution $p(x, z)$. This is a natural framework to introduce the ELBO (Evidence Lower Bound) typically used in the optimization of variational auto-encoders (VAEs). Given a dataset $x$ and this latent variable model, we are naturally concerned with maximizing the log-likelihood of our model under the obtained dataset: \begin{align*} \log p(x) = \log \int p(x, z) dz &= \log \int p(x, z) \frac{q_\theta(z | x)}{q_\theta(z | x)} dz \\ &= \log \int \frac{p(x, z)}{q_\theta(z | x)} q_\theta(z | x) dz \\ &= \log E_{q_\theta(z|x)}\Big[\frac{p(x, z)}{q_\theta(z | x)}\Big] \\ &\geq E_{q_\theta(z | x)}\Big[\log \frac{p(x, z)}{q_\theta(z | x)}\Big] && \text{(Jensen's inequality, since $\log$ is concave)} \\ &= E_{q_\theta(z | x)}[\log p(x, z)] + E_{q_\theta(z | x)}\Big[\log \frac{1}{q_\theta(z | x)}\Big] \\ &= E_{q_\theta(z | x)}[\log p(x, z)] + H(q_\theta(z|x)) \end{align*} We can interpret this as a tradeoff between placing the mass of $q_\theta$ on values of $z$ that explain the data well and incentivizing some entropy in the learned distribution. This lower bound is the ELBO: $$\mathcal{L}_{\theta} = E_{q_\theta(z | x)}[\log p(x, z)] + H(q_\theta(z|x))$$ The gap in the bound is exactly a KL divergence; writing $p(x, z) = p(z | x) p(x)$ gives: \begin{align*} \log p(x) &= \mathcal{L}_\theta + D_{KL}(q_\theta(z | x) \, || \, p(z | x)) \\ \end{align*} so equality holds precisely when $q_\theta(z | x)$ matches the true posterior $p(z | x)$. For practical purposes, this means we can learn a lower-dimensional representation of our intended probability distribution through this latent-variable model by maximizing the ELBO, without directly optimizing $\log p(x)$, which contains an intractable integral.
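To make the bound concrete, here is a minimal Python sketch (not a VAE) of a toy conjugate Gaussian latent-variable model in which $\log p(x)$ is tractable, so the Monte Carlo ELBO can be compared against it directly; the gap closes exactly when $q_\theta(z | x)$ is the true posterior. The model, the single datapoint, and the names `sigma_x` and `elbo` are ad hoc choices for illustration, not part of the derivation above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

sigma_x = 0.5          # p(z) = N(0, 1), p(x | z) = N(z, sigma_x^2)
x = 1.3                # a single observed datapoint

def elbo(mu, s, num_samples=100_000):
    """Monte Carlo estimate of E_q[log p(x, z)] + H(q) for q(z | x) = N(mu, s^2)."""
    z = rng.normal(mu, s, size=num_samples)
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, sigma_x)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)   # entropy of N(mu, s^2)
    return log_joint.mean() + entropy

# Exact quantities for this conjugate model.
log_px = norm.logpdf(x, 0.0, np.sqrt(1.0 + sigma_x**2))   # marginally, x ~ N(0, 1 + sigma_x^2)
post_var = 1.0 / (1.0 + 1.0 / sigma_x**2)                 # exact posterior p(z | x) = N(post_mean, post_var)
post_mean = post_var * x / sigma_x**2

print(f"log p(x)            = {log_px:.4f}")
print(f"ELBO, q = posterior = {elbo(post_mean, np.sqrt(post_var)):.4f}  (gap ~ 0)")
print(f"ELBO, q = N(0, 1)   = {elbo(0.0, 1.0):.4f}  (strictly below)")
```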