Implementing under & over autoencoders using PyTorch

Vikas Jha · Published in Analytics Vidhya · May 20, 2021 · 7 min read


Introduction

An autoencoder is a neural network that converts data into a more efficient representation in a latent space using an encoder, and then tries to reconstruct the original data from that latent representation using a decoder. The bottleneck, i.e. the constraint applied to the information flow, prevents direct copying of data from encoder to decoder, so the network learns to keep the most general and efficient representation of the data while ignoring noise or transient patterns.

Autoencoders can be implemented from scratch in Python using NumPy, but that would require writing the gradient computations manually. Differentiable programming is, however, available in Python through efficient frameworks like PyTorch, which is used in this article.

A brief visit to the architecture

As described in my article, one of the many ways to look at autoencoders is to characterise them by the dimension of the hidden/intermediate layer, i.e. the latent space. If the information bottleneck is applied by restricting the dimension of the hidden/intermediate layer, the network is an under-autoencoder; otherwise it is an over-autoencoder. In the latter case, the bottleneck is applied by introducing noise into the input data, or by modifying the loss function, which reduces the effective space in which the latent representations can lie.

Data

DataLoader from torch.utils.data is used to create an iterable (or map-style) wrapper over the dataset that serves it in batches. torchvision provides the transforms module, whose transformation methods can be chained together and applied to the images in one go. Here the pixel values of the MNIST dataset are rescaled from the range 0 to 1 to the range -1 to +1.

PyTorch expects image data in the form (batch size, channels, height, width). If the data is in some other form, appropriate transformations should be applied to bring it into the required shape.
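The loading and normalisation step could look like the sketch below. The batch size, download path, and the use of transforms.Normalize with mean 0.5 and std 0.5 are illustrative assumptions, not necessarily the exact values in the original script:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Chain the transformations: PIL image -> tensor in [0, 1] -> normalised to [-1, +1].
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Download path and batch size are assumptions for illustration.
train_data = datasets.MNIST(root="./data", train=True, download=True,
                            transform=transform)
train_loader = DataLoader(train_data, batch_size=128, shuffle=True)

images, labels = next(iter(train_loader))
print(images.shape)  # torch.Size([128, 1, 28, 28]) -> (batch, channel, height, width)
```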

Also note that torch.nn.functional (often imported as F) contains useful functions such as activation functions and convolution operations. However, these are stateless functions, not full layers. To define a layer with learnable parameters, a torch.nn module should be used.
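For instance, a minimal illustration of the distinction (not taken from the article's script):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 784)

# torch.nn.functional provides stateless functions; nothing here is learned.
h = F.relu(x)

# Layers with learnable parameters are defined via torch.nn modules.
layer = nn.Linear(784, 30)
h = F.relu(layer(x))
```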

Fully connected and convolutional autoencoders

The autoencoder is implemented in two ways: as a fully connected network and as a convolutional network. The former has more weights and can be applied to most kinds of data. CNNs are well suited to signals/inputs that come in the form of multidimensional arrays and exhibit three major properties: locality (strong local correlation between values), stationarity (properties of the signal repeat themselves, so shared weights can be used), and compositionality (features compose the image in a hierarchical manner, justifying multiple layers to capture different levels of detail).

Model architecture: Fully connected autoencoder

We define an Autoencoder class that inherits from the parent class nn.Module. The input is a tensor of size 28x28 (the MNIST images are 28x28). The encoder and decoder layers are specified inside the Autoencoder class, and the forward pass is defined as data passing through the encoder followed by the decoder. The output of the forward pass is compared with the input to give the loss. Mean squared error is used as the loss function here because pixel intensity is continuous; for categorical data, a loss such as cross-entropy would be more suitable.

Fully connected autoencoder. Depending on the size of the hidden layer, it can be either an under- or an over-autoencoder
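The full script is linked at the end of the article; a minimal sketch of such a model is shown below. The single hidden layer and the ReLU/Tanh activations are assumptions chosen to match the 28x28 input, the [-1, +1] pixel range and the hidden dimensions (30 and 500) discussed later; the actual implementation may differ.

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    """Fully connected autoencoder; hidden_dim decides under- vs over-autoencoder."""
    def __init__(self, hidden_dim=30):
        super().__init__()
        # Encoder: compress/expand the flattened 784-pixel image to hidden_dim values.
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, hidden_dim),
            nn.ReLU()
        )
        # Decoder: map the latent representation back to 784 pixel values in [-1, +1].
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, 28 * 28),
            nn.Tanh()
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = Autoencoder(hidden_dim=30)  # 30 -> under-autoencoder, 500 -> over-autoencoder
criterion = nn.MSELoss()            # pixel intensities are continuous
```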

Model architecture: Convolutional autoencoder

The architecture of the convolutional autoencoder is similar, except that instead of feeding a long flattened vector along with the batch-size and channel dimensions (a 3-d tensor), a 4-d tensor with batch size, channel, height and width as its dimensions is fed in.
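A minimal sketch of such a convolutional autoencoder is shown below. The kernel sizes, strides and single conv/deconv pair are assumptions; only the expansion from 1 to 16 channels is taken from the discussion later in the article.

```python
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Convolutional autoencoder; 1 input channel expanded to 16 feature maps."""
    def __init__(self):
        super().__init__()
        # Encoder: (batch, 1, 28, 28) -> (batch, 16, 14, 14)
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU()
        )
        # Decoder: (batch, 16, 14, 14) -> (batch, 1, 28, 28)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.Tanh()
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```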

Training Loop

It is the typical training loop used for training any neural network (a minimal sketch follows the list):

  • The image batch is moved to the device (CPU or GPU) and the output is calculated.
  • The loss is calculated from the output.
  • The gradients are zeroed.
  • Back-propagation is run and the gradients are accumulated.
  • optimizer.step() moves the weights opposite to the direction of the gradients, by a magnitude governed by the learning rate.
Training loop for autoencoder
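The sketch below follows those steps; it reuses the Autoencoder class and train_loader from the earlier sketches. The choice of the Adam optimizer and the learning rate are assumptions; the full script linked at the end has the exact settings.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Autoencoder(hidden_dim=30).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

num_epochs = 20
for epoch in range(num_epochs):
    for images, _ in train_loader:                            # labels are not needed
        images = images.view(images.size(0), -1).to(device)   # flatten and move to device
        output = model(images)                                # forward pass
        loss = criterion(output, images)                      # reconstruction loss vs. input
        optimizer.zero_grad()                                 # zero the gradients
        loss.backward()                                       # back-propagate and accumulate gradients
        optimizer.step()                                      # update weights against the gradient
    print(f"epoch [{epoch + 1}/{num_epochs}], loss:{loss.item():.4f}")
```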

Results

Under-autoencoder

For an under-autoencoder with hidden layer/latent space dimension 30, compared to the input dimension of 28x28 = 784, we get the following reduction in MSE over 20 epochs:

epoch [1/20], loss:0.1892
epoch [2/20], loss:0.1600
epoch [3/20], loss:0.1274
epoch [4/20], loss:0.1137
…..
epoch [18/20], loss:0.0520
epoch [19/20], loss:0.0558
epoch [20/20], loss:0.0574
Images from under-autoencoder

The top row shows the actual images, and the bottom row shows the corresponding recreated images. As we can see, for the third and fourth columns the recreation is not very clear.

Weights in Linear layer of Encoder for under-autoencoder

Looking at some of the weights of the encoder's linear layer, it is clear that some weights appear to be just random noise (fourth column of the first row and third column of the second row), with no pattern in the activations. The other weights seem to be capturing some patterns.

Let’s compare them with the output of over-autoencoder.

Over-autoencoder

Consider an over-autoencoder with hidden layer/latent space dimension of 500. Since the input dimension is 28x28 = 784, how can this be called an over-autoencoder? The answer lies in the effective dimension of the input. More than 80% of the input image pixels do not contribute to the image of the numeral; the surrounding area is mostly dark. Hence the effective dimension of the input is only around 150, which is the average number of active pixels in the images.

Images from over-autoencoder

There are three rows of images from the over-autoencoder. The top row is the corrupted input, i.e. the image fed to the autoencoder after adding noise. The second row contains the reconstructed images, and the third row shows the actual input images before corruption.
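The article does not spell out the exact corruption scheme; one common choice, shown here purely as an illustrative assumption, is additive Gaussian noise clamped back to the valid pixel range, with the loss still computed against the clean image.

```python
import torch

def add_noise(images, noise_factor=0.3):
    """Corrupt the input with additive Gaussian noise, keeping values in [-1, +1]."""
    noisy = images + noise_factor * torch.randn_like(images)
    return noisy.clamp(-1.0, 1.0)

# Inside the training loop, the corrupted image goes into the model,
# but the loss is computed against the clean image:
#   output = model(add_noise(images))
#   loss = criterion(output, images)
```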

Even on visual inspection, it is apparent that the over-autoencoder performs better than the under-autoencoder. The assertion becomes clearer when you look at the loss values:

epoch [1/20], loss:0.0702
epoch [2/20], loss:0.0594
epoch [3/20], loss:0.0551
epoch [4/20], loss:0.0519
epoch [5/20], loss:0.0534
epoch [6/20], loss:0.0463
....
epoch [16/20], loss:0.0415
epoch [17/20], loss:0.0416
epoch [18/20], loss:0.0423
epoch [19/20], loss:0.0400
epoch [20/20], loss:0.0407

The over-autoencoder converges faster than the under-autoencoder, and even its final loss value is lower. This is because the input is expanded into a higher-dimensional space, where it has many more ways to move in order to fit the model to the actual manifold, even with the constraint on data flow. For under-autoencoders, we move from a higher-dimensional space to a lower-dimensional one, where the movement is far more constrained.

Weights in Linear layer of Encoder for over-autoencoder

The weights also seem to have learnt better representations, with hardly any weight devoid of patterns.

Convolutional autoencoder

The convolutional autoencoder presented here is also a type of over-autoencoder, as the single-channel input is expanded to 16 channels. Further, because CNN layers are specialised for image-like data, it fits faster than the autoencoders mentioned above. Consider the loss values over the epochs:

epoch [1/20], loss: 0.0497
epoch [2/20], loss: 0.0227
epoch [3/20], loss: 0.0172
epoch [4/20], loss: 0.0143
epoch [5/20], loss: 0.0124
epoch [6/20], loss: 0.0114
epoch [7/20], loss: 0.0105
epoch [8/20], loss: 0.0088
...
epoch [16/20], loss: 0.0050
epoch [17/20], loss: 0.0045
epoch [18/20], loss: 0.0043
epoch [19/20], loss: 0.0040
epoch [20/20], loss: 0.0038

As expected, the loss drops quickly and reaches a much lower value compared to the under- and over-autoencoders based on conventional fully connected layers.

Images from convolutional autoencoder

Visual inspection of the generated images validates our hypothesis. The three rows show the noisy input fed to the encoder, the generated clean output, and the actual clean input, respectively.

Conclusion

Based on the implementation and the results on the MNIST dataset, it is clear that the over-autoencoder outperforms the under-autoencoder thanks to its bigger latent space, which gives it the freedom to model the manifold more accurately. However, it is also more prone to over-fitting.

On the other hand, convolutional autoencoders (which are over-autoencoders too) outperform fully-connected autoencoders because they take the properties of images into account to extract better representations of the data at hand.

All the autoencoders above share a flaw: they map a point in the input space to a single point on the manifold. It would generalise better to map each point to a distribution over the manifold rather than to a discrete point, giving a smoother mapping of the manifold. Variational Autoencoders (VAEs) achieve this by introducing a conditional distribution whose mean is the point value of the latent representation, with some variance. We will look at VAEs in subsequent articles.

For the full script: https://github.com/jha-vikas/pyTorch-implementations
