What Are Generative Models?

A generative model can be seen as a way to model the conditional probability of the observed $X$ given a target $y$ (e.g., given the target ‘dog’, generate a picture of a dog). Once trained, we can easily sample an instance of $X$. While training a generative model is significantly more challenging than training a discriminative model (e.g., it is more difficult to generate an image of a dog than to identify a dog in a picture), it offers the ability to create entirely new data.

Latent Variable Model

For the data $x$ we observe, we imagine a latent variable $z$ and model the two together as a joint distribution $p(x, z)$. Marginalizing out the latent gives $$p(x) = \int p(x,z)dz$$ Equivalently, by the chain rule of probability, $$p(x) = \frac{p(x,z)}{p(z|x)}$$ Introducing an approximate posterior $q_{\phi}(z|x)$ and applying Jensen's inequality, the log-likelihood of $p(x)$ can be lower-bounded: $$\begin{align*} \log p(x) &= \log \int p(x,z)dz \\ &= \log \int \frac{p(x,z)q_{\phi}(z|x)}{q_{\phi}(z|x)}dz \\ &= \log \mathbb{E}_{q_{\phi}(z|x)}\frac{p(x,z)}{q_{\phi}(z|x)} \\ &\geq \mathbb{E}_{q_{\phi}(z|x)} \log \frac{p(x,z)}{q_{\phi}(z|x)} \\ \end{align*}$$ The term $\mathbb{E}_{q_{\phi}(z|x)} \log \frac{p(x,z)}{q_{\phi}(z|x)}$ is called the Evidence Lower Bound (ELBO); maximizing the ELBO therefore becomes a proxy objective for optimizing a latent variable model.
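To make the bound concrete, here is a minimal PyTorch sketch of a single-batch Monte Carlo estimate of the ELBO. The `encoder` and `decoder` callables are hypothetical placeholders: `encoder(x)` is assumed to return the mean and log standard deviation of $q_{\phi}(z|x)$, and `decoder(x, z)` to return the per-example $\log p_{\theta}(x|z)$.

```python
import torch
from torch.distributions import Independent, Normal

def elbo_estimate(x, encoder, decoder, num_samples=8):
    """Monte Carlo estimate of E_{q(z|x)}[log p(x,z) - log q(z|x)] for a batch x."""
    mu, log_sigma = encoder(x)                        # hypothetical: parameters of q(z|x)
    q = Independent(Normal(mu, log_sigma.exp()), 1)   # diagonal Gaussian posterior
    prior = Independent(Normal(torch.zeros_like(mu), torch.ones_like(mu)), 1)  # p(z) = N(0, I)

    estimates = []
    for _ in range(num_samples):
        z = q.rsample()                               # reparameterized sample (see the trick below)
        log_px_given_z = decoder(x, z)                # hypothetical: per-example log p(x|z)
        # log p(x, z) = log p(x|z) + log p(z)
        estimates.append(log_px_given_z + prior.log_prob(z) - q.log_prob(z))
    return torch.stack(estimates).mean()              # average over samples and the batch
```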

Variational Autoencoders (VAE)

The purpose of the variational autoencoder is to maximize the ELBO by optimizing for the best $q_{\phi}(z|x)$ amongst a family of posterior distributions parameterized by $\phi$. $$\begin{align*} \mathbb{E}_{q_{\phi}(z|x)} \log \frac{p(x,z)}{q_{\phi}(z|x)} &= \mathbb{E}_{q_{\phi}(z|x)} \log \frac{p_{\theta}(x|z)p(z)}{q_{\phi}(z|x)} \\ &= \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] + \mathbb{E}_{q_{\phi}(z|x)} [\log \frac{p(z)}{q_{\phi}(z|x)}] \\ &= \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{\text{KL}}(q_{\phi}(z|x) \parallel p(z)) \end{align*}$$ We thus obtain a decoder (reconstruction) term $\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]$ and an encoder (prior-matching) term $D_{\text{KL}}(q_{\phi}(z|x) \parallel p(z))$. Our objective is to maximize the first term and minimize the second.

The encoder of a VAE is commonly chosen to model a multivariate Gaussian with diagonal covariance, and the prior is typically assumed to be a standard multivariate Gaussian: $$q_{\phi}(z|x) = \mathcal{N}(z;\mu_{\phi}(x),\sigma_{\phi}^{2}(x)\text{I})$$ $$p(z) = \mathcal{N}(z;0,\text{I})$$ The objective can then be rewritten through Monte Carlo sampling as below (here each $z^{(l)}$ is sampled from $q_{\phi}(z|x)$ for every $x$ in the dataset): $$\underset{\phi, \theta}{\mathrm{arg\ max}}\ \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{\text{KL}}(q_{\phi}(z|x) \parallel p(z)) \ \approx \underset{\phi, \theta}{\mathrm{arg\ max}} \sum_{l=1}^{L}\log p_{\theta}(x|z^{(l)})-D_{\text{KL}}(q_{\phi}(z|x) \parallel p(z))$$ However, a problem remains: each $z^{(l)}$ is produced by a stochastic sampling procedure, which is non-differentiable and therefore cannot be optimized through gradient descent. To solve this issue, the reparameterization trick rewrites the random variable as a deterministic function of a noise variable: $$z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon\ \ \text{where}\ \ \epsilon \sim \mathcal{N}(\epsilon; 0,\text{I})$$ where $\odot$ denotes an element-wise product. Under this reparameterized version of $z$, gradients can be computed with respect to $\phi$, so $\mu_{\phi}$ and $\sigma_\phi$ can be optimized.

After training, new data can be generated by sampling a latent variable from $p(z)$ and feeding it into the decoder of the VAE. Furthermore, when a powerful semantic latent space is learned, latent vectors can be edited or controlled before being passed into the decoder to generate the desired data.
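As a concrete illustration, below is a minimal PyTorch sketch of a VAE with the reparameterization trick and the closed-form KL divergence between a diagonal Gaussian posterior and the standard Gaussian prior. The MLP architecture, the Bernoulli (logit) decoder for flattened binary images, and the single-sample estimate of the reconstruction term are illustrative assumptions, not choices made in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal sketch: Gaussian encoder q_phi(z|x), Bernoulli decoder p_theta(x|z)."""
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)        # mu_phi(x)
        self.log_var = nn.Linear(h_dim, z_dim)   # log sigma_phi^2(x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
        eps = torch.randn_like(mu)
        z = mu + (0.5 * log_var).exp() * eps
        logits = self.dec(z)                     # parameters of p_theta(x|z)
        # Decoder term E_q[log p_theta(x|z)], estimated with a single sample
        recon = -F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)
        # D_KL(q_phi(z|x) || N(0, I)) in closed form for a diagonal Gaussian
        kl = 0.5 * (log_var.exp() + mu.pow(2) - 1.0 - log_var).sum(-1)
        return (recon - kl).mean()               # ELBO, to be maximized
```

Training then amounts to gradient ascent on the ELBO, e.g. computing `loss = -model(x)` and calling `loss.backward()`; generating new data afterwards means drawing $z \sim \mathcal{N}(0,\text{I})$ and passing it through the decoder.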

Hierarchical Variational Autoencoders

A hierarchical variational autoencoder is a generalization of a standard VAE that extends to multiple hierarchies over latent variables.
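In the Markovian case shown in Figure 1 (a Markovian HVAE), each latent is conditioned only on the latent one level above it, so the generative process forms a Markov chain and the joint distribution and approximate posterior factorize as $$p(x, z_{1:T}) = p(z_T)\, p_{\theta}(x|z_1)\prod_{t=2}^{T} p_{\theta}(z_{t-1}|z_t)$$ $$q_{\phi}(z_{1:T}|x) = q_{\phi}(z_1|x)\prod_{t=2}^{T} q_{\phi}(z_t|z_{t-1})$$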

Figure 1. The process of a Markovian Hierarchical Variational Autoencoder with T hierarchical latents. (Image source: Diffusion Models.)
