Undercomplete and overcomplete autoencoders

Vikas Jha
Published in Analytics Vidhya
6 min read · May 14, 2021


Introduction

Cat & Dog (source: Photo by Tran Mau Tri Tam on Unsplash)

If a person is shown same-size images of a dog and a cat and asked what image would lie halfway between them, a sensible answer would not be to average the two images’ pixel values. However ridiculous it might sound, the more reasonable answer would be some sort of image of an animal that is half cat and half dog.

Or, as a more sensible example, consider an image of a building at night. If one wants to see how it looks in the day, the right approach would not be to take individual pixels and shift their intensity towards white. The target image should account for each pixel’s response to sunlight, for how pixels in different locations might respond to light differently, and for multiple other factors.

Photo by Rachit Chaudhary on Unsplash

The common factor in both cases is that image reconstruction or modification, if performed in pixel space, produces unnatural, poor results. However, humans are able to imagine and visualize in both cases what the end result might look like. Humans are clearly not thinking in pixel space.

Latent Space

The reason humans are able to process images and visualize what a non-existent mix of cat and dog would look like, or look at a building at night and picture how it might look in daylight, is that human thinking seems to operate in a latent space. A latent space is an abstract multi-dimensional space of latent variables: feature vectors that encode useful, generic information about easily observable features. The latent space contains the internal hidden representation of the actual vector. This applies not only to images, but to any kind of information.

Real-world high-dimensional data is assumed to lie on low-dimensional (latent-space) manifolds embedded within the high-dimensional space (the manifold hypothesis). Capturing the manifold captures the latent-space characteristics, which helps in learning the generalized features of the data. For example, consider the Swiss roll, a 3-D representation of data with multiple classes represented by different colours.

Fig 1. Swiss Roll Data representation (source: https://i.stack.imgur.com/pa1FR.png)

The data lies in three dimensions, and no linear separator can separate the classes. However, once the actual manifold representation is understood, not only does separating the different classes become easier, but imputation also becomes more effective.
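The Swiss roll illustrates the manifold hypothesis directly: the 3-D points are generated from only two underlying coordinates. A minimal numpy sketch (the variable names `t` and `y` are illustrative choices, not from the original figure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample 2-D latent coordinates: angle along the spiral (t) and depth (y).
n_points = 1000
t = rng.uniform(1.5 * np.pi, 4.5 * np.pi, n_points)
y = rng.uniform(0.0, 10.0, n_points)

# Embed the 2-D manifold in 3-D ambient space (the classic Swiss roll).
X = np.column_stack([t * np.cos(t), y, t * np.sin(t)])

print(X.shape)  # (1000, 3): the data "lives" in 3-D ambient space,
                # yet every point is fully determined by just (t, y).
```

Although each point has three coordinates, it has only two degrees of freedom; a method that recovers `(t, y)` has effectively found the latent-space representation.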

Along the same lines, learning a latent-space representation forces the system to learn a more efficient representation of the data.

Autoencoders

An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner (Wikipedia). The autoencoder filters out noise and learns an encoding (representation) of the data in a lower-dimensional space. As there is no target variable, the learning is unsupervised. The output of an autoencoder is produced from the encoded representation of the input.

Why predict the output that we already know as input?

The predicted output produced by the decoder is compared with the input. The difference between the two, expressed as a loss, is used to learn the weights inside the autoencoder. The function used to calculate the loss depends on the input type and the task at hand.

Fig 2. Basic Autoencoder

An autoencoder has two parts:

  1. Encoder: converts the input into an encoded representation. The code ‘h’ in Fig 2 is the latent-space representation.
  2. Decoder: takes the encoded representation and produces the output from it.
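To make the encoder/decoder split concrete, here is a minimal linear autoencoder in numpy, trained by plain gradient descent on the mean-squared reconstruction loss. This is a sketch for illustration (a single linear layer each way, no nonlinearities), not the architecture from Fig 2:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data that really lives on a low-dimensional manifold:
# 200 points in 10-D ambient space, generated from 3 latent factors.
m, n, d = 200, 10, 3
Z_true = rng.normal(size=(m, d))
X = Z_true @ rng.normal(size=(d, n))

# Linear encoder (n -> d) and decoder (d -> n), small random init.
We = rng.normal(scale=0.1, size=(n, d))
Wd = rng.normal(scale=0.1, size=(d, n))

lr = 0.02
losses = []
for _ in range(1000):
    H = X @ We              # encode: latent codes, shape (m, d)
    X_hat = H @ Wd          # decode: reconstruction, shape (m, n)
    E = X_hat - X
    losses.append(np.mean(E ** 2))
    # Gradients of the mean-squared reconstruction loss.
    gWd = H.T @ E * (2 / E.size)
    gWe = X.T @ (E @ Wd.T) * (2 / E.size)
    Wd -= lr * gWd
    We -= lr * gWe

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because the data was generated from 3 latent factors and the code `h` also has 3 dimensions, the bottleneck is wide enough to reconstruct the data while still forcing a compressed representation.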

A little taste of mathematics

Reconstruction loss

Let the loss be represented by L. The loss can be calculated as the mean of the individual losses over the data points.
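Concretely, for m data points with inputs x⁽ⁱ⁾ and reconstructions x̂⁽ⁱ⁾, the mean reconstruction loss, using squared error as one common choice, is:

```latex
L = \frac{1}{m} \sum_{i=1}^{m} \ell\big(x^{(i)}, \hat{x}^{(i)}\big),
\qquad
\ell(x, \hat{x}) = \lVert x - \hat{x} \rVert^{2} = \sum_{j=1}^{n} (x_j - \hat{x}_j)^{2}
```

Here x ∈ ℝⁿ is the input, h = f(x) ∈ ℝᵈ is the encoding, and x̂ = g(h) ∈ ℝⁿ is the reconstruction.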

The input and output both lie in the same n-dimensional space, while the encoded representation lies in a d-dimensional space. Based on the dimension d of the latent space relative to n, we can have two types of autoencoders.

Undercomplete and overcomplete autoencoders

When d < n, the autoencoder is undercomplete. It forces the learning of an efficient representation of the input by creating a bottleneck in the dimensions/degrees of freedom through which information can pass and be learnt.

However, when d ≥ n (an overcomplete autoencoder), the encoder and decoder can simply be identity functions, producing output identical to the input. This is because the loss function only penalizes dissimilarity between input and output, and the simplest way to minimize it is to copy the input unchanged, which becomes possible when the hidden layer is wide enough to pass the input signal through untouched. The autoencoder is then overfitting. An autoencoder should be able to reconstruct data that lives on the manifold while ignoring the parts of the data that represent noise. We want to enforce that the autoencoder can only reconstruct the small set of inputs that lie on the manifold, and cannot reconstruct data points that lie away from it.

An undercomplete autoencoder’s encoder and decoder layers cannot act as identity functions, as the latent space/hidden layer does not have enough dimensions to copy the input.

Missing pixels (source: https://paperswithcode.com/task/facial-inpainting)

For instance, if an autoencoder is trained on images of faces, and a new face is presented with a part removed (a missing/masked image), the autoencoder should reconstruct only those pixels that lie on the facial manifold it has learnt. This is similar to how humans would impute the rest of a face after observing a partial one. Any variation in the input that lies off the manifold is treated as noise and ignored, as the network is indifferent to those points.

Then why take d ≥ n?

If an overcomplete autoencoder, whose hidden-layer (latent-space) dimension is greater than the input dimension, is so prone to overfitting, and might end up with both encoder and decoder represented by identity matrices, what is the point of using it? The answer is that the larger intermediate representation (an expanded space compared to the input space) makes features easier to extract and the network easier to optimize, because there is more room to move around.

However, to prevent the identity mapping from copying the input data directly to the output, constraints on the data flow have to be introduced. The constraints force the network to use only a selective subset of the larger intermediate hidden space.

The constraints can be applied by introducing noise into the input (a denoising autoencoder) and learning the weights by comparing the output with the actual clean input.
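A denoising sketch, again with a linear numpy autoencoder for brevity: the network encodes a corrupted input but is scored against the clean one, so the identity map is no longer optimal, even though the code here is overcomplete (d > n). This is an illustrative toy, not a production denoising architecture:

```python
import numpy as np

rng = np.random.default_rng(7)

m, n, d = 200, 10, 12            # overcomplete: d > n
Z = rng.normal(size=(m, 3))
X = Z @ rng.normal(size=(3, n))  # clean data on a 3-D manifold

We = rng.normal(scale=0.1, size=(n, d))
Wd = rng.normal(scale=0.1, size=(d, n))

lr = 0.01
for _ in range(1000):
    X_noisy = X + rng.normal(scale=0.5, size=X.shape)  # corrupt the input
    H = X_noisy @ We            # encode the *corrupted* input
    X_hat = H @ Wd
    E = X_hat - X               # ...but compare against the *clean* input
    gWd = H.T @ E * (2 / E.size)
    gWe = X_noisy.T @ (E @ Wd.T) * (2 / E.size)
    Wd -= lr * gWd
    We -= lr * gWe

# Copying the noisy input would keep the noise; the trained network does better.
denoised_err = np.mean((X_hat - X) ** 2)
identity_err = np.mean((X_noisy - X) ** 2)  # ≈ 0.25, the noise variance
print(denoised_err, identity_err)
```

The noise pushes inputs off the 3-D manifold, and the network learns to project them back onto it, which is exactly the behaviour the constraint is meant to enforce.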

Denoising autoencoder

Another way to add a constraint is through a regularization term. The contractive autoencoder adds the ‘squared norm of the gradient of the hidden representation with respect to the input’ to the reconstruction loss. While the reconstruction loss penalizes insensitivity along reconstruction directions, the squared gradient norm penalizes sensitivity in any direction. The added term reduces the variance in the hidden-layer/latent-space representation at the cost of increased bias.

Contractive autoencoder(source:https://atcold.github.io/pytorch-Deep-Learning/en/week07/07-3/)

As overcomplete autoencoders are high-variance models, introducing constraints reduces variance while increasing bias. The aim is to reach the sweet spot where the marginal increase in bias equals the marginal reduction in variance, so that no further tradeoff is beneficial.

Conclusion

We have covered the fundamental idea behind how autoencoders work: they try to find a latent-space representation of the data that captures its generic and most important features. The data is assumed to lie on a lower-dimensional manifold, and the autoencoder tries to extract that representation from the data provided.
