Variational Auto-Encoder (VAE)

Matheus Schmitz
LinkedIn
Github Portfolio

Open In Colab

Variational Auto-Encoders (VAEs) are a widely used class of generative models. They are simple to implement and, in contrast to other generative model classes like Generative Adversarial Networks, they optimize an explicit maximum likelihood objective to train the model. Finally, their architecture makes them well-suited for unsupervised representation learning, i.e. learning low-dimensional representations of high-dimensional inputs, like images, with only self-supervised objectives (data reconstruction in the case of VAEs).

Figure: VAE sketch (image source: https://mlexplained.com/2017/12/28/an-intuitive-explanation-of-variational-autoencoders-vaes-part-1)

By working on this problem you will learn and practice the following steps:

  1. Set up a data loading pipeline in PyTorch.
  2. Implement, train and visualize an auto-encoder architecture.
  3. Extend your implementation to a variational auto-encoder.
  4. Learn how to tune the critical beta parameter of your VAE.
  5. Inspect the learned representation of your VAE.

Note: For faster training of the models in this assignment you can use Colab with enabled GPU support. In Colab, navigate to "Runtime" --> "Change Runtime Type" and set the "Hardware Accelerator" to "GPU".

1. MNIST Dataset

We will perform all experiments for this problem using the MNIST dataset, a standard dataset of handwritten digits. The main benefits of this dataset are that it is small and relatively easy to model. It therefore allows for quick experimentation and serves as an initial test bed in many papers.

Another benefit is that it is so widely used that PyTorch even provides functionality to automatically download it.

Let's start by downloading the data and visualizing some samples.
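As a rough sketch of what this could look like (the batch size, data directory, and lack of extra normalization below are assumptions, not requirements of the assignment):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Convert PIL images to tensors with pixel values in [0, 1].
transform = transforms.ToTensor()

# torchvision downloads MNIST automatically if it is not found under root.
train_data = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_data = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

# DataLoaders provide shuffled mini-batches for training.
train_loader = DataLoader(train_data, batch_size=128, shuffle=True)
test_loader = DataLoader(test_data, batch_size=128, shuffle=False)

# Grab one batch to visualize a few samples.
images, labels = next(iter(train_loader))
print(images.shape)  # expected: torch.Size([128, 1, 28, 28])
```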

2. Auto-Encoder

Before implementing the full VAE, we will first implement an auto-encoder architecture. Auto-encoders feature the same encoder-decoder architecture as VAEs and therefore also learn a low-dimensional representation of the input data without supervision. In contrast to VAEs they are fully deterministic models and do not employ variational inference for optimization.

The architecture is simple: we encode the input image into a low-dimensional representation using a convolutional network with strided convolutions that reduce the image resolution in every layer. This representation is then decoded back to the dimensionality of the input image by a convolutional decoder network that mirrors the encoder and uses transposed convolutions to increase the resolution of its input in every layer. The whole model is trained by minimizing a reconstruction loss between the input and the decoded image.

Intuitively, the auto-encoder needs to compress the information contained in the input image into a much lower dimensional representation (e.g. 28x28=784px vs. 64 embedding dimensions for our MNIST model). This is possible since the information captured in the pixels is highly redundant. E.g. an MNIST image requires <4 bits to encode which of the 10 possible digits is displayed and a few additional bits to capture information about shape and orientation. This is far less than the $28 \cdot 28 \cdot 8 = 6272$ bits that could theoretically be stored in the raw image (784 pixels with 256 gray levels each).

Learning such a compressed representation can make downstream task learning easier. For example, learning to add two numbers based on the inferred digits is much easier than performing the task based on two piles of pixel values that depict the digits.

In the following, we will first define the architecture of encoder and decoder and then train the auto-encoder model.

Defining the Auto-Encoder Architecture [6pt]
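One possible sketch of such an encoder / decoder pair, assuming 28x28 grayscale inputs and a 64-dimensional embedding (the layer widths and kernel sizes below are illustrative choices, not the required architecture):

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Strided convolutions halve the spatial resolution in every layer: 28 -> 14 -> 7."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, embed_dim),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Mirrors the encoder; transposed convolutions double the resolution: 7 -> 14 -> 28."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.fc = nn.Linear(embed_dim, 64 * 7 * 7)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),  # 7x7 -> 14x14
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),   # 14x14 -> 28x28
            nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 64, 7, 7)
        return self.net(h)
```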

Testing the Auto-Encoder Forward Pass [1pt]

Now that we have defined the encoder and decoder networks, our architecture is nearly complete. However, before we start training, we can wrap the encoder and decoder into an auto-encoder class for easier handling.
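A minimal sketch of such a wrapper, reusing the Encoder and Decoder classes sketched above:

```python
class AutoEncoder(nn.Module):
    """Encode the input to a low-dimensional embedding, then decode it back to image space."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.encoder = Encoder(embed_dim)
        self.decoder = Decoder(embed_dim)

    def forward(self, x):
        z = self.encoder(x)     # (batch, embed_dim)
        return self.decoder(z)  # (batch, 1, 28, 28)
```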

Setting up the Auto-Encoder Training Loop [6pt]

After implementing the network architecture, we can now set up the training loop and run training.
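A minimal training-loop sketch under the same assumptions as above (Adam, an MSE reconstruction loss, and ~20 epochs are reasonable defaults here, not prescribed values):

```python
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoEncoder(embed_dim=64).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    model.train()
    for images, _ in train_loader:        # the labels are not used
        images = images.to(device)
        recon = model(images)
        loss = F.mse_loss(recon, images)  # pixel-wise reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch:02d} | reconstruction loss {loss.item():.4f}")
```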

Verifying reconstructions [0pt]

Now that we have trained the auto-encoder we can visualize some of the reconstructions on the test set to verify that it has converged and did not overfit. Before continuing, make sure that your auto-encoder is able to reconstruct these samples near-perfectly.

Sampling from the Auto-Encoder [2pt]

To test whether the auto-encoder is useful as a generative model, we can use it like any other generative model: draw embedding samples from a prior distribution and decode them through the decoder network. We will choose a unit Gaussian prior to allow for easy comparison to the VAE later.
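A sketch of this sampling procedure, assuming the 64-dimensional embedding used above:

```python
model.eval()
with torch.no_grad():
    z = torch.randn(16, 64, device=device)  # 16 embeddings from the unit Gaussian prior
    samples = model.decoder(z).cpu()         # decoded images, shape (16, 1, 28, 28)
```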

Inline Question: Describe your observations, why do you think they occur? [2pt] \ (please limit your answer to <150 words) \ Answer: The compressed encoder representation is unconstrained and therefore unlikely to be normally distributed, so we are sampling from areas of the embedding space that were unused during encoding and that the decoder never learned to handle.

3. Variational Auto-Encoder (VAE)

Variational auto-encoders use a very similar architecture to deterministic auto-encoders, but are inherently stochastic models, i.e. we perform a stochastic sampling operation during the forward pass, leading to different outputs every time we run the network for the same input. This sampling is required to optimize the VAE objective, also known as the evidence lower bound (ELBO):

$$ \log p(x) \geq \underbrace{\mathbb{E}_{z\sim q(z\vert x)} \log p(x \vert z)}_{\text{reconstruction}} - \underbrace{D_{\text{KL}}\big(q(z \vert x), p(z)\big)}_{\text{prior divergence}} $$

Here, $D_{\text{KL}}(q, p)$ denotes the Kullback-Leibler (KL) divergence between the posterior distribution $q(z \vert x)$, i.e. the output of our encoder, and $p(z)$, the prior over the embedding variable $z$, which we can choose freely.

For simplicity, we will again choose a unit Gaussian prior. The first term is the reconstruction term we already know from training the auto-encoder. When assuming a Gaussian output distribution for both encoder $q(z \vert x)$ and decoder $p(x \vert z)$, maximizing the ELBO is equivalent to minimizing the loss:

$$ \mathcal{L}_{\text{VAE}} = \sum_{x\sim \mathcal{D}} (x - \hat{x})^2 + \beta \cdot D_{\text{KL}}\big(\mathcal{N}(\mu_q, \sigma_q), \mathcal{N}(0, I)\big) $$

Here, $\hat{x}$ is the reconstruction output of the decoder. In comparison to the auto-encoder objective, the VAE adds a regularizing term between the output of the encoder and a chosen prior distribution, effectively forcing the encoder output to not stray too far from the prior during training. As a result, the decoder gets trained with samples that look quite similar to samples from the prior, which will hopefully allow us to generate better images when using the VAE as a generative model and actually feeding it samples from the prior (as we did for the AE before).

The coefficient $\beta$ is a scalar weighting factor that trades off between the reconstruction and regularization objectives. We will investigate the influence of this factor in our experiments below.

If you need a refresher on VAEs you can check out this tutorial paper: https://arxiv.org/abs/1606.05908
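For the unit Gaussian prior and a diagonal Gaussian posterior, the KL term has a closed form, $D_{\text{KL}} = \frac{1}{2}\sum_i \big(\mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2\big)$, so the loss above can be computed directly from the encoder outputs. A sketch, assuming the encoder predicts $\log \sigma_q$ as discussed below:

```python
def vae_loss(recon, x, mu, log_sigma, beta):
    """Reconstruction loss plus beta-weighted KL divergence to a unit Gaussian prior."""
    recon_loss = F.mse_loss(recon, x, reduction="sum") / x.size(0)
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian posterior.
    kl = 0.5 * torch.sum(mu.pow(2) + (2 * log_sigma).exp() - 1 - 2 * log_sigma) / x.size(0)
    return recon_loss + beta * kl, recon_loss, kl
```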

Reparametrization Trick

The sampling procedure inside the VAE's forward pass for obtaining a sample $z$ from the posterior distribution $q(z \vert x)$, when implemented naively, is non-differentiable. However, since $q(z\vert x)$ is parametrized as a Gaussian, there is a simple trick to obtain a differentiable sampling operator, known as the reparametrization trick.

Instead of directly sampling $z \sim \mathcal{N}(\mu_q, \sigma_q)$ we can "separate" the network's predictions and the random sampling by computing the sample as:

$$ z = \mu_q + \sigma_q \cdot \epsilon , \quad \epsilon \sim \mathcal{N}(0, I) $$

Note that in this equation, the sample $z$ is computed as a deterministic function of the network's predictions $\mu_q$ and $\sigma_q$, which allows gradients to propagate through the sampling procedure (all stochasticity is isolated in $\epsilon$, which requires no gradient).

Note: While in the equations above the encoder network parametrizes the standard deviation $\sigma_q$ of the Gaussian posterior distribution, in practice we usually parametrize the logarithm of the standard deviation $\log \sigma_q$ for numerical stability. Before sampling $z$ we will then exponentiate the network's output to obtain $\sigma_q$.
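A minimal sketch of this reparametrized sampling step:

```python
def reparameterize(mu, log_sigma):
    """Differentiable sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = torch.randn_like(mu)          # all stochasticity lives in eps (no gradient needed)
    return mu + log_sigma.exp() * eps   # exponentiate the predicted log-std to get sigma
```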

Defining the VAE Model [7pt]
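A hedged sketch of how the pieces could fit together, reusing the Encoder / Decoder layout from the auto-encoder and assuming the encoder now outputs both $\mu_q$ and $\log \sigma_q$:

```python
class VAE(nn.Module):
    def __init__(self, embed_dim=64):
        super().__init__()
        self.encoder = Encoder(2 * embed_dim)  # first half: mu, second half: log sigma
        self.decoder = Decoder(embed_dim)
        self.embed_dim = embed_dim

    def forward(self, x):
        mu, log_sigma = self.encoder(x).chunk(2, dim=-1)
        z = reparameterize(mu, log_sigma)      # differentiable sampling (see above)
        return self.decoder(z), mu, log_sigma
```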


Setting up the VAE Training Loop [4pt]

Let's start training the VAE model! We will first verify our implementation by setting $\beta = 0$.

Let's look at some reconstructions and decoded embedding samples!

Inline Question: What can you observe when setting $\beta = 0$? Explain your observations! [3pt] \ (please limit your answer to <150 words) \ Answer: When β = 0, the model only optimizes the reconstruction loss, so reconstructions of input images look good. However, when we decode embeddings sampled from the prior, the generated images are ambiguous, since nothing constrains the encoder output to match the prior.

Let's repeat the same experiment for $\beta = 10$, a very high value for the coefficient. You can modify the $\beta$ value in the cell above and rerun it (it is okay to overwrite the outputs of the previous experiment, but make sure to copy the visualizations of training curves, reconstructions and samples for $\beta = 0$ into your solution PDF before deleting them).

Inline Question: What can you observe when setting $\beta = 10$? Explain your observations! [3pt] \ (please limit your answer to <200 words) \ Answer: When β = 10, the model puts relatively little weight on the reconstruction loss. Therefore, the generated images are blurry and unclear for both the reconstruction and the sample decoding task. Moreover, the decoded samples look very similar to each other, since the strong KL penalty pushes the posterior toward the prior, so the embeddings carry little information about the input.

Now we can start tuning the beta value to achieve a good result. First, describe what a "good result" would look like (focus on what you would expect for reconstructions and sample quality).

Inline Question: Characterize what properties you would expect for reconstructions (1pt) and samples (2pt) of a well-tuned VAE! [3pt] \ (please limit your answer to <200 words) \ Answer:

  • For reconstructions, since the model is trained to reproduce its input, the decoded outputs should closely match the corresponding input images.
  • For samples, the decoded images should be sharp and look like plausible digits from the training distribution. In addition, different prior samples should decode to diverse outputs rather than near-identical images.

Tuning the $\beta$-factor [5pt]

Now that you know what outcome we would like to obtain, try to tune $\beta$ to achieve this result.

(A logarithmic search in steps of 10x will be helpful; good results can be achieved after ~20 epochs of training.) It is again okay to overwrite the results of the previous $\beta=10$ experiment after copying them to the solution PDF.
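A minimal sketch of such a sweep, reusing the VAE class and vae_loss sketched above (the specific beta grid is an assumption):

```python
results = {}
for beta in [1e-3, 1e-2, 1e-1, 1.0]:               # logarithmic search in steps of 10x
    model = VAE(embed_dim=64).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(20):
        for images, _ in train_loader:
            images = images.to(device)
            recon, mu, log_sigma = model(images)
            loss, _, _ = vae_loss(recon, images, mu, log_sigma, beta)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    results[beta] = model  # afterwards, compare reconstructions and prior samples per beta
```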

Your final notebook should include the visualizations of your best-tuned VAE.

4. Embedding Space Interpolation [3pt]

As mentioned in the introduction, AEs and VAEs can not only be used to generate images, but also to learn low-dimensional representations of their inputs. In this final section we will investigate the representations we learned with both models by interpolating in embedding space between different images. We will encode two images into their low-dimensional embedding representations, then interpolate these embeddings and reconstruct the result.
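A sketch of this interpolation for the deterministic AE (for the VAE sketched above, one would interpolate the posterior means $\mu_q$ instead of the raw encoder output):

```python
def interpolate(model, img_a, img_b, steps=8):
    """Encode two images, linearly interpolate their embeddings, decode every step."""
    model.eval()
    with torch.no_grad():
        z_a = model.encoder(img_a.unsqueeze(0).to(device))  # (1, embed_dim)
        z_b = model.encoder(img_b.unsqueeze(0).to(device))
        alphas = torch.linspace(0, 1, steps, device=device).view(-1, 1)
        z_interp = (1 - alphas) * z_a + alphas * z_b        # (steps, embed_dim)
        return model.decoder(z_interp).cpu()                # (steps, 1, 28, 28)

# Example: interpolate between the first two test images (assumed to have different labels).
imgs = interpolate(model, test_data[0][0], test_data[1][0])
```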

Repeat the experiment for different start / end labels and different samples. Describe your observations.

Inline Question: Repeat the interpolation experiment with different start / end labels and multiple samples. Describe your observations! Focus on: \

  1. How do AE and VAE embedding space interpolations differ? \
  2. How do you expect these differences to affect the usefulness of the learned representation for downstream learning? \

Answer:

  • For the AE, images in the middle of the interpolation are ambiguous and unclear: the interpolated embeddings differ from the embeddings produced for the training data, so the decoder never learned to handle them. For the VAE, the interpolations remain clear and recognizable throughout, since the encoder outputs distributions (means and variances) and the decoder is trained on samples drawn from them, which densely cover the space between embeddings.
  • The VAE representation looks more useful. It reliably decodes to clear outputs that share the characteristics of the training data, so it can be used to generate new images similar to the training set (e.g. generating face images). In addition, due to the KL divergence loss, embeddings of similar outputs are grouped more tightly in the VAE embedding space than in the AE embedding space.

End

Matheus Schmitz
LinkedIn
Github Portfolio