Auto-Encoding Variational Bayes

12/14/2024

Introduction

For the case of an i.i.d. dataset and continuous latent variables per datapoint, we propose the Auto-Encoding VB (AEVB) algorithm.
The learned approximate posterior inference model can also be used for a host of tasks such as recognition, denoising, representation and visualization. When a neural network is used for the recognition model, we arrive at the variational auto-encoder (VAE).

Method

The strategy in this section can be used to derive a lower bound estimator (a stochastic objective function) for a variety of directed graphical models with continuous latent variables.
We will restrict ourselves here to the common case where we have an i.i.d. dataset with latent variables per datapoint, and where we would like to perform maximum likelihood (ML) or maximum a posteriori (MAP) inference on the (global) parameters, and variational inference on the latent variables.
Note that our method can be applied to online, non-stationary settings, e.g. streaming data, but here we assume a fixed dataset for simplicity.

Problem scenario

We assume that the data are generated by some random process, involving an unobserved continuous random variable $\mathbf{z}$. The process consists of two steps: (1) a value $\mathbf{z}^{(i)}$ is generated from some prior distribution $p_{\theta^*} (\mathbf{z})$; (2) a value $\mathbf{x}^{(i)}$ is generated from some conditional distribution $p_{\theta^*}(\mathbf{x}|\mathbf{z})$.
We assume that the prior $p_{\theta^*} (\mathbf{z}) $ and likelihood $p_{\theta^*} (\mathbf{x}|\mathbf{z}) $ come from parametric families of distributions $p_{\theta} (\mathbf{z}) $ and $p_{\theta} (\mathbf{x}|\mathbf{z}) $, and that their PDFs are differentiable almost everywhere w.r.t. both $\theta$ and $\mathbf{z} $.
We are interested in, and propose a solution to, the following problems in the above scenario:
1. Efficient approximate ML or MAP estimation for the parameters $\theta$.
2. Efficient approximate posterior inference of the latent variable $\mathbf{z}$ given an observed value $\mathbf{x}$ for a choice of parameters $\theta$. This is useful for coding or data representation tasks.

For the purpose of solving the above problems, let us introduce a recognition model $q_{\phi}(\mathbf{z}|\mathbf{x})$ : an approximation to the intractable true posterior $p_{\theta}(\mathbf{z}|\mathbf{x}) $.
From a coding theory perspective, the unobserved variables $\mathbf{z}$ have an interpretation as a latent representation. In this paper we will therefore also refer to the recognition model $q_{\phi}(\mathbf{z}|\mathbf{x})$ as a probabilistic encoder, since given a datapoint $\mathbf{x}$ it produces a distribution over the possible values of the code $\mathbf{z}$ from which the datapoint $\mathbf{x}$ could have been generated.
1. $q_{\phi}(\mathbf{z}|\mathbf{x})$ is the encoder, mapping $\mathbf{x}$ to a distribution over $\mathbf{z}$.
2. $p_{\theta}(\mathbf{x}|\mathbf{z}) $ is the decoder, generating $\mathbf{x}$ from the latent variable $\mathbf{z}$.
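
The two-step generative process described in this section can be sketched as ancestral sampling. The example below is only an illustration under an assumed toy model that is not part of the original text: a scalar standard normal prior and a linear-Gaussian likelihood with hypothetical parameters `w` and `b`.

```python
import random
import statistics

def sample_dataset(n, w=2.0, b=1.0, seed=0):
    """Draw n i.i.d. pairs (z, x) by ancestral sampling.

    Assumed toy model (not from the text):
      prior       p(z)   = N(0, 1)
      likelihood  p(x|z) = N(w * z + b, 1)
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)          # step 1: z ~ p(z)
        x = rng.gauss(w * z + b, 1.0)    # step 2: x ~ p(x | z)
        pairs.append((z, x))
    return pairs

pairs = sample_dataset(20000)
xs = [x for _, x in pairs]               # only x is observed; z stays latent
print(statistics.fmean(xs), statistics.pvariance(xs))
```

For this toy model the marginal of $\mathbf{x}$ is Gaussian with mean `b` and variance `w**2 + 1`, so the printed sample statistics should land close to those values.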

The variational bound

The marginal likelihood (also known as the Bayesian evidence) is composed of a sum over the marginal likelihoods of individual datapoints $\log p_{\theta} (\mathbf{x}^{(1)}, ..., \, \mathbf{x}^{(N)} ) \,=\, \sum^{N}_{i=1} \log p_{\theta}(\mathbf{x}^{(i)}) $, which can each be rewritten as: \[ \,\\ \log p_{\theta} (\mathbf{x}^{(i)}) = D_{KL}( q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)}) \,\|\, p_{\theta}(\mathbf{z}|\mathbf{x}^{(i)}) ) + \mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) \,\\ \] where $ D_{KL}( q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)}) \,\|\, p_{\theta}(\mathbf{z}|\mathbf{x}^{(i)}) ) \ge 0 $. The divergence attains its minimum of zero exactly when $ q_{\phi}(\mathbf{z} | \mathbf{x}^{(i)}) $ equals $p_{\theta}(\mathbf{z} | \mathbf{x}^{(i)}) $.
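
This decomposition can be checked in a few steps. The sketch below abbreviates $q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})$ as $q$ inside expectations and takes $\mathcal{L}(\theta, \phi; \mathbf{x}^{(i)})$ to mean $\mathbb{E}_{q}[\log p_{\theta}(\mathbf{x}^{(i)}, \mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})]$. Since $\log p_{\theta}(\mathbf{x}^{(i)})$ does not depend on $\mathbf{z}$, it can be placed inside an expectation over $q$ and expanded with Bayes' rule:
\[ \begin{aligned} \log p_{\theta}(\mathbf{x}^{(i)}) &= \mathbb{E}_{q} \big[ \log p_{\theta}(\mathbf{x}^{(i)}) \big] \\ &= \mathbb{E}_{q} \left[ \log \frac{p_{\theta}(\mathbf{x}^{(i)}, \mathbf{z})}{p_{\theta}(\mathbf{z}|\mathbf{x}^{(i)})} \right] \\ &= \mathbb{E}_{q} \left[ \log \frac{q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})}{p_{\theta}(\mathbf{z}|\mathbf{x}^{(i)})} \right] + \mathbb{E}_{q} \big[ \log p_{\theta}(\mathbf{x}^{(i)}, \mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)}) \big] \\ &= D_{KL}( q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)}) \,\|\, p_{\theta}(\mathbf{z}|\mathbf{x}^{(i)}) ) + \mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) \end{aligned} \]
The third line multiplies and divides by $q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})$ inside the logarithm and splits the result into two expectations.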

The first RHS term is the KL divergence of the approximate from the true posterior. Since this KL divergence is non-negative, the second RHS term $\mathcal{L}(\theta, \phi; \mathbf{x}^{(i)})$ is called the (variational) lower bound, or evidence lower bound (ELBO), on the marginal likelihood of datapoint $i$, and can be written as:
\[ \,\\ \log p_{\theta}(\mathbf{x}^{(i)}) \ge \mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})} [-\log q_{\phi}(\mathbf{z}|\mathbf{x}) + \log p_{\theta}(\mathbf{x}, \mathbf{z})] \,\\ \]
Applying the factorization $ p_{\theta}(\mathbf{x}, \mathbf{z}) = p_{\theta}(\mathbf{x}|\mathbf{z})p_{\theta}(\mathbf{z}) $ of the joint probability gives:
\[ \,\\ \mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})} \big[\log p_{\theta}(\mathbf{x}|\mathbf{z}) + \log p_{\theta}(\mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})\big] \,\\ \]
To express the bound in terms of a KL divergence from the prior, recall the definition: \[ \,\\ D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \| p_{\theta}(\mathbf{z})) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})} \big[\log q_{\phi}(\mathbf{z}|\mathbf{x}) - \log p_{\theta}(\mathbf{z})\big]. \,\\ \] Rearranging this definition gives:
\[ \,\\ \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})} \big[\log p_{\theta}(\mathbf{z})\big] - \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})} \big[\log q_{\phi}(\mathbf{z}|\mathbf{x})\big] = -D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \| p_{\theta}(\mathbf{z})). \,\\ \]
Substituting this into the ELBO, we have:
\[ \,\\ \mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) = - D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \| p_{\theta}(\mathbf{z})) + \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})} \big[\log p_{\theta}(\mathbf{x}|\mathbf{z})\big] \,\\ \]
The first term, the KL divergence term, regularizes the approximate posterior $q_{\phi}(\mathbf{z}|\mathbf{x})$ to be close to the prior $p_{\theta}(\mathbf{z})$. The second term is the reconstruction term, an expected log-likelihood that measures how well the data can be reconstructed from the latent variables.
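
The identities in this section can be checked numerically on a model where every quantity has a closed form. The sketch below assumes a scalar toy model that does not appear in the original text: prior $\mathcal{N}(0, 1)$, likelihood $\mathcal{N}(z, 1)$, and a Gaussian $q$ with hypothetical mean `m` and standard deviation `s`. It computes the bound as negative KL plus reconstruction term and compares it with the log marginal likelihood.

```python
import math

def kl_gauss(m1, s1, m2, s2):
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) ) for scalar Gaussians."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def check_decomposition(x, m, s):
    # Assumed toy model: p(z) = N(0, 1), p(x|z) = N(z, 1), q(z|x) = N(m, s^2).
    # Then p(x) = N(0, 2) and the true posterior is p(z|x) = N(x/2, 1/2).
    log_px = -0.5 * math.log(2 * math.pi * 2.0) - x**2 / 4.0
    # Reconstruction term E_q[log p(x|z)], available in closed form here.
    recon = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m)**2 + s**2)
    elbo = -kl_gauss(m, s, 0.0, 1.0) + recon      # -KL(q || prior) + reconstruction
    gap = kl_gauss(m, s, x / 2.0, math.sqrt(0.5)) # KL(q || true posterior)
    return log_px, elbo, gap

log_px, elbo, gap = check_decomposition(x=1.0, m=0.3, s=0.8)
print(log_px, elbo + gap)   # the two numbers agree up to rounding
```

The gap between the log marginal likelihood and the bound is exactly the KL divergence from the true posterior, so choosing `m = x / 2` and `s = math.sqrt(0.5)` in this toy model makes the bound tight.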

We want to differentiate and optimize the lower bound $\mathcal{L}(\theta, \phi; \mathbf{x}^{(i)})$ w.r.t both the variational parameters $\phi$ and generative parameters $\theta$.

The SGVB estimator and AEVB algorithm

In this section we introduce a practical estimator of the lower bound and its derivatives w.r.t the parameters.
\[ \begin{array}{l} \hline \textbf{Algorithm 1} \text{ Minibatch version of the Auto-Encoding VB (AEVB) algorithm. Either of the two} \\ \text{SGVB estimators in section 2.3 can be used. We use settings } M = 100 \text{ and } L = 1 \text{ in experiments.} \\ \hline \quad \theta, \phi \leftarrow \text{Initialize parameters} \\ \quad \textbf{repeat} \\ \qquad \mathbf{X}^M \leftarrow \text{Random minibatch of } M \text{ datapoints (drawn from full dataset)} \\ \qquad \epsilon \leftarrow \text{Random samples from noise distribution } p(\epsilon) \\ \qquad \mathbf{g} \leftarrow \nabla_{\theta,\phi}\tilde{\mathcal{L}}^M(\theta, \phi; \mathbf{X}^M, \epsilon) \text{ (Gradients of minibatch estimator)} \\ \qquad \theta, \phi \leftarrow \text{Update parameters using gradients } \mathbf{g} \text{ (e.g. SGD or Adagrad)} \\ \quad \textbf{until } \text{convergence of parameters } (\theta, \phi) \\ \quad \textbf{return } \theta, \phi \\ \hline \end{array} \]
Under certain mild conditions, outlined in the section on the reparameterization trick, for a chosen approximate posterior $q_{\phi}(\mathbf{z}|\mathbf{x}) $ we can reparameterize the random variable $\tilde{\mathbf{z}} \sim q_{\phi}(\mathbf{z}|\mathbf{x}) $ using a differentiable transformation $g_{\phi}(\epsilon, \mathbf{x}) $ of a noise variable $\epsilon$: \[ \,\\ \tilde{\mathbf{z}} = g_{\phi}(\epsilon, \mathbf{x}) \quad \text{with} \quad \epsilon \sim p(\epsilon) \,\\ \] The reparameterization trick makes the sampling process differentiable. Instead of sampling directly from $q_{\phi}(\mathbf{z}|\mathbf{x})$, we sample random noise $\epsilon$ and transform it using the function $g_{\phi}$. This allows us to compute gradients w.r.t. $\phi$ and train the network using backpropagation.
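
The loop in Algorithm 1 can be sketched end to end on a model small enough to differentiate by hand. Everything model-specific below is an assumption made for illustration, not the setup of the original text: a scalar linear-Gaussian decoder with hypothetical parameters `w, b`, a linear-Gaussian encoder with hypothetical parameters `a, c, rho`, a standard normal prior, synthetic data, and plain SGD ascent with a fixed step count in place of a convergence test.

```python
import math
import random

def elbo_and_grads(x, eps, theta, phi):
    """Single-sample (L = 1) lower-bound estimate and its gradients for one datapoint.

    Assumed toy model (hypothetical, chosen so the gradients fit on a few lines):
      decoder  p(x|z) = N(w*z + b, 1),    theta = [w, b]
      encoder  q(z|x) = N(a*x + c, s^2),  phi   = [a, c, rho], s = exp(rho)
      prior    p(z)   = N(0, 1)
    """
    w, b = theta
    a, c, rho = phi
    mu, s = a * x + c, math.exp(rho)
    z = mu + s * eps                     # reparameterized sample z = g_phi(eps, x)
    r = x - (w * z + b)                  # decoder residual
    neg_kl = 0.5 * (1.0 + 2.0 * rho - mu**2 - s**2)
    value = neg_kl - 0.5 * math.log(2 * math.pi) - 0.5 * r**2
    d_mu = -mu + r * w                   # KL part plus the path through z
    d_rho = (1.0 - s**2) + r * w * eps * s
    return value, [r * z, r], [d_mu * x, d_mu, d_rho]

def train(data, steps=1500, M=100, lr=0.01, seed=0):
    rng = random.Random(seed)
    theta, phi = [0.5, 0.0], [0.0, 0.0, 0.0]     # initialize parameters
    for _ in range(steps):
        batch = rng.sample(data, M)              # random minibatch of M datapoints
        g_theta, g_phi = [0.0] * 2, [0.0] * 3
        for x in batch:
            eps = rng.gauss(0.0, 1.0)            # noise sample from p(eps)
            _, gt, gp = elbo_and_grads(x, eps, theta, phi)
            g_theta = [g + d / M for g, d in zip(g_theta, gt)]
            g_phi = [g + d / M for g, d in zip(g_phi, gp)]
        # plain SGD ascent on the minibatch estimate of the lower bound
        theta = [p + lr * g for p, g in zip(theta, g_theta)]
        phi = [p + lr * g for p, g in zip(phi, g_phi)]
    return theta, phi

def mean_elbo(data, theta, phi, seed=1):
    """Average single-sample lower-bound estimate over the dataset."""
    rng = random.Random(seed)
    total = sum(elbo_and_grads(x, rng.gauss(0.0, 1.0), theta, phi)[0] for x in data)
    return total / len(data)

# Synthetic data from the assumed model with w = 2, b = 1.
data_rng = random.Random(42)
data = [data_rng.gauss(2.0 * data_rng.gauss(0.0, 1.0) + 1.0, 1.0) for _ in range(2000)]

before = mean_elbo(data, [0.5, 0.0], [0.0, 0.0, 0.0])
theta, phi = train(data)
after = mean_elbo(data, theta, phi)
print(before, after)
```

Both $\theta$ and $\phi$ are updated from the same noisy gradient, which is possible only because the sample `z` is written as a differentiable function of `phi` and `eps`. In practice the hand-written gradients would be replaced by automatic differentiation.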

The reparameterization trick

In order to solve our problem we invoke an alternative method for generating samples from $q_{\phi}(\mathbf{z}|\mathbf{x}) $. Let $\mathbf{z}$ be a continuous random variable, and let $\mathbf{z} \sim q_{\phi}(\mathbf{z}|\mathbf{x}) $ follow some conditional distribution. It is then often possible to express $\mathbf{z}$ as a deterministic variable: \[ \,\\ \mathbf{z} = g_{\phi}(\epsilon, \mathbf{x}) \,\\ \] where $\epsilon$ is an auxiliary variable with independent marginal $p(\epsilon)$, and $g_{\phi}(.) $ is some vector-valued function parameterized by $\phi$.
Take, for example, the univariate Gaussian case: let $z \sim q_{\phi}(z|x) = \mathcal{N}(\mu, \sigma^2) $. In this case, a valid reparameterization is $z = \mu + \sigma \epsilon $, where $\epsilon$ is an auxiliary noise variable $\epsilon \sim \mathcal{N}(0, 1) $.

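(Sampling $z$ directly is not differentiable. Instead, the encoder with parameters $\phi$ maps the input $\mathbf{x}$ to a distribution over the latent variable $z$ by estimating $\mu$ and $\sigma$. The latent variable $z$ is then generated using noise sampled from the standard normal distribution $\epsilon \sim \mathcal{N}(0, 1) $, which keeps the tensor computation graph connected.)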
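
The univariate Gaussian reparameterization can be tried out directly. The sketch below is a minimal illustration, with the test function $f(z) = z^2$ chosen here as an assumption because its expectation $\mu^2 + \sigma^2$ has known gradients $2\mu$ and $2\sigma$. It estimates both gradients by pushing the chain rule through $z = \mu + \sigma \epsilon$.

```python
import random

def reparam_grads(mu, sigma, n=100000, seed=0):
    """Monte Carlo gradients of E_{z ~ N(mu, sigma^2)}[z^2] w.r.t. mu and sigma.

    f(z) = z^2 is an illustrative test function (an assumption, not from the text).
    Its exact expectation is mu^2 + sigma^2, so the exact gradients are
    2*mu and 2*sigma.
    """
    rng = random.Random(seed)
    g_mu = g_sigma = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)    # noise does not depend on mu or sigma
        z = mu + sigma * eps         # deterministic and differentiable in (mu, sigma)
        df_dz = 2.0 * z              # f'(z)
        g_mu += df_dz * 1.0          # chain rule: dz/dmu = 1
        g_sigma += df_dz * eps       # chain rule: dz/dsigma = eps
    return g_mu / n, g_sigma / n

g_mu, g_sigma = reparam_grads(mu=1.5, sigma=0.5)
print(g_mu, g_sigma)   # Monte Carlo estimates of 2*mu = 3.0 and 2*sigma = 1.0
```

All randomness enters through `eps`, whose distribution is fixed, so the derivative can be taken inside the average. That is the property that lets backpropagation reach the encoder parameters.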

Example: Variational Auto-Encoder

In this section we give an example where we use a neural network for the probabilistic encoder $q_{\phi}(\mathbf{z}|\mathbf{x}) $ (the approximation to the posterior of the generative model $p_{\theta}(\mathbf{x}, \mathbf{z})$), and take the approximate posterior to be a multivariate Gaussian with diagonal covariance: \[ \,\\ \log q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)}) = \log \mathcal{N} (\mathbf{z}; \mu^{(i)}, (\sigma^{(i)})^2 \mathbf{I}) \,\\ \] where the mean and s.d. of the approximate posterior, $\mu^{(i)}$ and $\sigma^{(i)} $, are outputs of the encoding MLP, i.e. nonlinear functions of datapoint $\mathbf{x}^{(i)}$ and the variational parameters $\phi$.
As explained in the section on the reparameterization trick, we sample from the posterior $\mathbf{z}^{(i, l)} \sim q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)}) $ using \[ \,\\ \mathbf{z}^{(i, l)} = g_{\phi}( \epsilon^{(l)}, \mathbf{x}^{(i)} ) = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)} \,\\ \] where $\epsilon^{(l)} \sim \mathcal{N}(0, \mathbf{I}) $. With $\odot$ we signify an element-wise product.

(
  $\mathcal{N}(0, \mathbf{I})$: mean $0$ and covariance $\mathbf{I}$, for a vector $\mathbf{z} \in \mathbb{R}^d $.
  $\mathcal{N}(0, 1)$: mean $0$ and variance $1$, for a scalar $z \in \mathbb{R} $.
)
\[ \,\\ \mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) \simeq \frac{1}{2} \sum^J_{j=1} \left( 1 + \log((\sigma_j^{(i)})^2) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2 \right) + \frac{1}{L} \sum^L_{l=1} \log p_{\theta}(\mathbf{x}^{(i)}|\mathbf{z}^{(i,l)}) \,\\ \]
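where $\mathbf{z}^{(i, l)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)} $ and $\epsilon^{(l)} \sim \mathcal{N}(0, \mathbf{I}) $. The first sum is the closed-form $-D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)}) \| p_{\theta}(\mathbf{z}))$ for a standard normal prior $p_{\theta}(\mathbf{z}) = \mathcal{N}(0, \mathbf{I})$, with $J$ the dimensionality of $\mathbf{z}$ and $\mu_j^{(i)}$, $\sigma_j^{(i)}$ the $j$-th elements of $\mu^{(i)}$ and $\sigma^{(i)}$.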
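
The estimator above translates almost line by line into code. The sketch below is a minimal illustration that takes the encoder outputs `mu` and `sigma` as given lists rather than computing them with an MLP. The `log_likelihood` callable and the unit-variance Gaussian decoder used in the usage lines are hypothetical stand-ins, since this section does not specify a decoder.

```python
import math
import random

def neg_kl_term(mu, sigma):
    """Closed-form -KL( N(mu, diag(sigma^2)) || N(0, I) ): the first term of the estimator."""
    return 0.5 * sum(1.0 + math.log(s**2) - m**2 - s**2 for m, s in zip(mu, sigma))

def sgvb_estimate(x, mu, sigma, log_likelihood, L=1, seed=0):
    """Estimate the lower bound for one datapoint x.

    mu, sigma: encoder outputs for x (lists of length J).
    log_likelihood(x, z): hypothetical decoder callable returning log p(x|z).
    """
    rng = random.Random(seed)
    recon = 0.0
    for _ in range(L):
        eps = [rng.gauss(0.0, 1.0) for _ in mu]             # eps ~ N(0, I)
        z = [m + s * e for m, s, e in zip(mu, sigma, eps)]  # z = mu + sigma * eps (element-wise)
        recon += log_likelihood(x, z)
    return neg_kl_term(mu, sigma) + recon / L

# Hypothetical decoder for illustration only: p(x|z) = N(x; z, I).
def unit_gaussian_loglik(x, z):
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (xi - zi)**2 for xi, zi in zip(x, z))

print(neg_kl_term([0.0, 0.0], [1.0, 1.0]))   # q equals the prior, so the KL term vanishes
print(sgvb_estimate([0.2, -0.1], [0.1, 0.0], [0.9, 1.1], unit_gaussian_loglik, L=5))
```

Only the reconstruction term needs sampling. The KL term is exact, which typically gives a lower-variance estimate than sampling both terms.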