jonathan’s blog

mathematics, physics, programming, and machine learning

Variational autoencoders

The following is a summary of some of the results of KW2013.

Evidence lower bound

Suppose we have a probability space \(X\) from which we can sample, but whose probability density \(p(x)\) is unknown. How can we try to estimate \(p(x)\)? Imagine that \(X\) is the set of all possible natural images, or the set of all English sentences. How do we assign probabilities to such objects?

We can try to model \(X\) as follows. Let’s suppose that there is some generative process \(Z \to X\) from some latent variable \(z\). We assume the following:

  • There is some process \(Z \to X\) with tractable conditional density \(p(x|z)\)
  • There is some tractable prior \(p(z)\)

In this setup, we are free to choose \(p(z)\) and \(p(x|z)\). However, once we have fixed these densities, the posterior \(p(z|x)\) is completely determined by Bayes' rule \[ p(z|x) = \frac{p(x|z)p(z)}{p(x)}, \] and is typically intractable.

So once we have specified the prior \(p(z)\) and the decoder \(p(x|z)\), we have no freedom to choose \(p(z|x)\). Furthermore, without access to \(p(x)\), we have no way to compute \(p(z|x)\) analytically. To get around this problem, we introduce another probability density \(q(z|x)\), the approximate posterior, which is intended to be a tractable approximation to the intractable \(p(z|x)\). We can quantify the difference between the true posterior and the approximate posterior with the Kullback-Leibler divergence,

\[ \begin{aligned} & D_{KL}(q(z|x)||p(z|x)) \cr &= E_{z \sim q(z|x)}\left(\log \frac{q(z|x)}{p(z|x)}\right) \cr &=E_{z \sim q(z|x)}\left(\log q(z|x)-\log p(z|x)\right) \cr &=E_{z \sim q(z|x)}\left(\log q(z|x)-\log p(x,z)+\log p(x)\right) \cr &=E_{z \sim q(z|x)}\left(\log q(z|x)-\log p(x|z) - \log p(z) +\log p(x)\right) \cr &=\log p(x) + D_{KL}(q(z|x)||p(z)) - E_{z \sim q(z|x)}\left(\log p(x|z) \right) \end{aligned} \]

Rearranging this expression, we have

\[ \log p(x) = D_{KL}(q(z|x)||p(z|x)) + L \]

where \(L\) is the variational lower bound,

\[ L = -D_{KL}(q(z|x)||p(z)) + E_{z \sim q(z|x)}(\log p(x|z)) \]
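
Combining the two terms into a single expectation gives an equivalent form that is convenient for Monte Carlo estimation:

\[ L = E_{z \sim q(z|x)}\left(\log p(x|z) + \log p(z) - \log q(z|x)\right) = E_{z \sim q(z|x)}\left(\log p(x,z) - \log q(z|x)\right) \]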

Since the Kullback-Leibler divergence is non-negative, we obtain the evidence lower bound (ELBO)

\[ \log p(x) \geq -D_{KL}(q(z|x)||p(z)) + E_{z \sim q(z|x)}(\log p(x|z)) \]

This inequality holds for any choice of \(p(z)\), \(q(z|x)\), and \(p(x|z)\), and it gives us a natural optimization objective. Given a sample from \(X\), we typically try to maximize the likelihood of the observed data. Since the intractable \(\log p(x)\) is bounded below by the ELBO, we can maximize the ELBO as a proxy for the true log-likelihood.
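
To make the inequality concrete, here is a small numerical sketch (not from KW2013) using a toy model in which everything is tractable, so the Monte Carlo ELBO estimate can be compared directly against the exact \(\log p(x)\). The particular choice of \(q(z|x)\) below is an arbitrary, deliberately imperfect guess.

```python
# Numerical sanity check of the ELBO inequality on a toy model where everything
# is tractable: p(z) = N(0, 1), p(x|z) = N(z, 1), so the exact marginal is
# p(x) = N(0, 2) and the exact posterior is p(z|x) = N(x/2, 1/2). The
# approximate posterior q(z|x) = N(x/2, 1) is a deliberately imperfect guess.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3                                    # a single observed data point

# Sample z ~ q(z|x) via the reparametrization z = mu + sigma * eps
mu, sigma = x / 2, 1.0
eps = rng.standard_normal(100_000)
z = mu + sigma * eps

# Monte Carlo estimate of L = E_q[log p(x|z) + log p(z) - log q(z|x)]
elbo = np.mean(
    norm.logpdf(x, loc=z, scale=1.0)       # log p(x|z)
    + norm.logpdf(z, loc=0.0, scale=1.0)   # log p(z)
    - norm.logpdf(z, loc=mu, scale=sigma)  # log q(z|x)
)

log_px = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))  # exact log p(x)
print(f"ELBO estimate: {elbo:.3f} <= log p(x): {log_px:.3f}")
```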

Reparametrization trick

The second term on the RHS of the ELBO is an expectation over \(z \sim q(z|x)\). Suppose that our model \(q(z|x)\) depends differentiably on some parameters \(\phi\). The expectation can be approximated by taking a sample of \(z\) values. The question is, can we make this expression differentiable in \(\phi\)? Differentiability is a necessary condition to be able to solve this problem using gradient descent/ascent.

To obtain a differentiable expression, we assume that there is some fixed random variable \(\epsilon\) with density \(p(\epsilon)\), and that \(z\) is a deterministic function of \(\epsilon\) and \(x\): \[ z = g_\phi(\epsilon, x) \] Then, using the change of variables \(q(z|x)\,dz = p(\epsilon)\,d\epsilon\), we have \[ \begin{aligned} E_{z \sim q(z|x)}[f(z)] &= \int f(z) q(z|x) dz \cr &= \int f(g_\phi(\epsilon, x)) p(\epsilon) d\epsilon \end{aligned} \] The latter expression can be approximated using a finite sample \(\epsilon \sim p(\epsilon)\), which is independent of \(\phi\), and therefore the approximation is differentiable in \(\phi\).
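
Here is a minimal autograd sketch of this idea (a PyTorch setup is assumed; the integrand \(f\) and the parameter values are placeholders). Because all randomness lives in \(\epsilon\), the Monte Carlo estimate is a deterministic, differentiable function of \(\phi\).

```python
# Minimal sketch of the reparametrization trick. All randomness lives in
# eps ~ p(eps), so the Monte Carlo estimate of E_{z ~ q(z|x)}[f(z)] is a
# deterministic, differentiable function of phi = (mu, log_sigma).
import torch

mu = torch.tensor(0.5, requires_grad=True)           # phi: mean of q(z|x)
log_sigma = torch.tensor(-1.0, requires_grad=True)   # phi: log std of q(z|x)

def f(z):
    # Placeholder integrand, e.g. a reconstruction log-likelihood term.
    return -(z - 2.0) ** 2

eps = torch.randn(10_000)              # eps ~ p(eps), independent of phi
z = mu + torch.exp(log_sigma) * eps    # z = g_phi(eps, x)
estimate = f(z).mean()                 # Monte Carlo estimate of E_q[f(z)]

estimate.backward()                    # gradients flow through g_phi
print(mu.grad, log_sigma.grad)         # d(estimate)/d(phi)
```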

The mapping \(g_\phi: (x, \epsilon) \mapsto z\) can be interpreted as an encoder from \(X\) to \(Z\).

Variational autoencoder

By making a few simplifications, including the reparametrization trick above, we can reinterpret the aforementioned ELBO optimization problem as an autoencoder. Let us consider the following setup:

  • \(p(\epsilon)\) is standard normal in \(d\) dimensions
  • \(p(z)\) is standard normal in \(d\) dimensions
  • \(q(z|x)\) is determined implicitly by \(z = \mu_\phi(x) + \sigma_\phi(x) \cdot \epsilon\) via the reparametrization trick
  • We have a decoder map \(f_\theta: Z \to X\)
  • The conditional probability density \(p(x|z)\) is modeled as a Gaussian distribution centered on \(f_\theta(z)\) with some fixed covariance matrix

Then in this case, the RHS of the ELBO can be approximated by an expression that is differentiable in \((\phi, \theta)\). The expectation term involving \(p(x|z)\) simplifies to the sample-wise (negative) MSE between \(x\) and \(f_\theta(z)\), i.e. it is a kind of reconstruction error, and the KL-divergence term acts as a regularizer for the latent embedding. In summary, under the above simplifications, maximizing the ELBO corresponds to training an autoencoder with a regularization term that encourages the distribution of the observed data in latent space to be consistent with a fixed prior distribution \(p(z)\).
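
For a diagonal Gaussian \(q(z|x)\) with mean \(\mu_\phi(x)\) and standard deviation \(\sigma_\phi(x)\), and a standard normal prior, the KL term has the standard closed form \[ D_{KL}(q(z|x)||p(z)) = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right). \] The following is a minimal PyTorch sketch of the resulting training objective (the negative ELBO); the network architectures and dimensions are placeholders, and the fixed decoder covariance is absorbed into the weighting of the MSE term.

```python
# Minimal sketch of the VAE objective under the assumptions above: diagonal
# Gaussian encoder q(z|x), standard normal prior p(z), and a Gaussian decoder
# with fixed covariance, so the likelihood term reduces to an MSE up to
# constants. Architectures and dimensions are placeholders.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)      # mu_phi(x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)  # log sigma_phi(x)^2
        self.dec = nn.Sequential(                  # f_theta: Z -> X
            nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim)
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)               # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps   # reparametrization trick
        x_hat = self.dec(z)                      # f_theta(z)
        # Negative ELBO = reconstruction error + KL(q(z|x) || N(0, I)),
        # using the closed-form KL for a diagonal Gaussian.
        recon = ((x_hat - x) ** 2).sum(dim=1)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
        return (recon + kl).mean()

# Usage sketch: loss = VAE()(x_batch); loss.backward(); optimizer.step()
```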

References

[KW2013] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes”, arXiv:1312.6114, 2013.