In this article, we will learn about autoencoders in deep learning. We will walk through a practical implementation of a denoising autoencoder on the MNIST handwritten digits dataset. In addition, we are sharing an implementation of the idea in TensorFlow.

## 1. What is an autoencoder?

An autoencoder is an unsupervised machine learning algorithm that takes an image as input and reconstructs it using fewer bits. That may sound like image compression, but the biggest difference between an autoencoder and a general-purpose image compression algorithm is that in the case of autoencoders, the compression is achieved by learning on a training set of data. While reasonable compression is achieved when an image is similar to the training set used, autoencoders are poor general-purpose image compressors; JPEG compression will do vastly better.

Autoencoders are similar in spirit to dimensionality reduction techniques like principal component analysis. They create a space where the essential parts of the data are preserved, while non-essential (or noisy) parts are removed.

There are two parts to an autoencoder:

- **Encoder**: This is the part of the network that compresses the input into a smaller number of bits. The space represented by these bits is called the “latent space”, and the point of maximum compression is called the bottleneck. The compressed bits that represent the original input are together called an “encoding” of the input.
- **Decoder**: This is the part of the network that reconstructs the input image using the encoding of the image.

Let’s look at an example to understand the concept better.

In the picture above, we show a vanilla autoencoder: a 2-layer autoencoder with one hidden layer. The input and output layers have the same number of neurons. We feed five real values into the autoencoder, which the encoder compresses into three real values at the bottleneck (the middle layer). Using these three real values, the decoder tries to reconstruct the five real values that we fed as input to the network.

In practice, there are far more hidden layers between the input and the output.
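To make the picture concrete, here is a minimal sketch in plain NumPy of the 5 → 3 → 5 shape described above. The weights here are made-up random values purely to show the shapes, not the trained network from this post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, untrained weights for a 5 -> 3 -> 5 autoencoder
W_enc = rng.normal(size=(5, 3))   # encoder: 5 inputs -> 3 bottleneck units
W_dec = rng.normal(size=(3, 5))   # decoder: 3 bottleneck units -> 5 outputs

x = rng.normal(size=(1, 5))       # five real values fed to the network
code = np.tanh(x @ W_enc)         # compressed to three real values (bottleneck)
recon = np.tanh(code @ W_dec)     # decoder's attempt to rebuild the five values

print(code.shape, recon.shape)    # (1, 3) (1, 5)
```

Training would adjust `W_enc` and `W_dec` so that `recon` matches `x` as closely as possible.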

There are various kinds of autoencoders like sparse autoencoder, variational autoencoder, and denoising autoencoder. In this post, we will learn about a denoising autoencoder.

## 2. Denoising Autoencoder

The idea behind a denoising autoencoder is to learn a representation (latent space) that is robust to noise. We add noise to an image and then feed this noisy image as an input to our network. The encoder part of the autoencoder transforms the image into a different space that preserves the handwritten digits but removes the noise. As we will see later, the original image is a 28 x 28 x 1 image, and the transformed image is 7 x 7 x 32. You can think of the 7 x 7 x 32 image as a 7 x 7 image with 32 color channels.

The decoder part of the network then reconstructs the original image from this 7 x 7 x 32 image and, voilà, the noise is gone!

How does this magic happen?

During training, we define a loss (cost) function to minimize the difference between the reconstructed image and the original noise-free image. In other words, we learn a 7 x 7 x 32 space that is noise-free.

**Download Code**

To follow along with this tutorial easily, please download the iPython notebook code by clicking on the button below. It’s FREE!

## 3. Implementation of Denoising Autoencoder

### 3.1 The Network

The images are matrices of size 28 x 28. We reshape each image to size 28 x 28 x 1, rescale its pixel values to lie between 0 and 1, and feed it as an input to the network. The encoder transforms the 28 x 28 x 1 image into a 7 x 7 x 32 image. You can think of this 7 x 7 x 32 image as a point in a 1568-dimensional space (because 7 x 7 x 32 = 1568). This 1568-dimensional space is called the bottleneck or the latent space. The architecture is shown graphically below.
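As a quick sanity check on these dimensions, here is a plain NumPy sketch (shapes only, with zero-filled stand-in arrays rather than real MNIST data):

```python
import numpy as np

img = np.zeros((28, 28), dtype=np.float32)        # one raw MNIST image
net_input = img.reshape(28, 28, 1)                # add the channel dimension
latent = np.zeros((7, 7, 32), dtype=np.float32)   # encoder output at the bottleneck
flat = latent.reshape(-1)                         # the same point as a plain vector

print(net_input.shape, flat.shape)                # (28, 28, 1) (1568,)
```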

The decoder does the exact opposite of an encoder; it transforms this 1568 dimensional vector back to a 28 x 28 x 1 image. We call this output image a “reconstruction” of the original image. The structure of the decoder is shown below.

Let’s dive into the implementation of an autoencoder using TensorFlow.

### 3.2 Encoder

The encoder has two convolutional layers and two max-pooling layers. Both Convolution layer-1 and Convolution layer-2 have 32 filters of size 3 x 3. There are two max-pooling layers, each of size 2 x 2.

```python
# Encoder
with tf.name_scope('en-convolutions'):
    conv1 = tf.layers.conv2d(inputs_, filters=32, kernel_size=(3, 3),
                             strides=(1, 1), padding='SAME', use_bias=True,
                             activation=lrelu, name='conv1')
    # Now 28x28x32
with tf.name_scope('en-pooling'):
    maxpool1 = tf.layers.max_pooling2d(conv1, pool_size=(2, 2),
                                       strides=(2, 2), name='pool1')
    # Now 14x14x32
with tf.name_scope('en-convolutions'):
    conv2 = tf.layers.conv2d(maxpool1, filters=32, kernel_size=(3, 3),
                             strides=(1, 1), padding='SAME', use_bias=True,
                             activation=lrelu, name='conv2')
    # Now 14x14x32
with tf.name_scope('encoding'):
    encoded = tf.layers.max_pooling2d(conv2, pool_size=(2, 2),
                                      strides=(2, 2), name='encoding')
    # Now 7x7x32 (latent space)
```

### 3.3 Decoder

The decoder has two conv2d_transpose layers, two convolution layers, and one sigmoid activation function. A conv2d_transpose layer performs upsampling, the opposite of a convolution layer's role; each conv2d_transpose layer we use doubles the spatial size of the compressed image.

```python
# Decoder
with tf.name_scope('decoder'):
    conv3 = tf.layers.conv2d(encoded, filters=32, kernel_size=(3, 3),
                             strides=(1, 1), name='conv3', padding='SAME',
                             use_bias=True, activation=lrelu)
    # Now 7x7x32
    upsample1 = tf.layers.conv2d_transpose(conv3, filters=32, kernel_size=3,
                                           padding='same', strides=2,
                                           name='upsample1')
    # Now 14x14x32
    upsample2 = tf.layers.conv2d_transpose(upsample1, filters=32, kernel_size=3,
                                           padding='same', strides=2,
                                           name='upsample2')
    # Now 28x28x32
    logits = tf.layers.conv2d(upsample2, filters=1, kernel_size=(3, 3),
                              strides=(1, 1), name='logits', padding='SAME',
                              use_bias=True)
    # Now 28x28x1
    # Pass logits through a sigmoid to get the denoised image
    decoded = tf.sigmoid(logits, name='recon')
```

Finally, we calculate the loss on the output using the cross-entropy loss function and use the Adam optimizer to minimize it.
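To see what this loss computes, here is a standalone NumPy sketch of per-pixel sigmoid cross-entropy, written in the numerically stable form used by libraries such as TensorFlow's `sigmoid_cross_entropy_with_logits`. The `logits` and `targets` values here are toy stand-ins, not the tensors from the network above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_cross_entropy(logits, targets):
    # Numerically stable form of -t*log(p) - (1-t)*log(1-p), with p = sigmoid(logits)
    return np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))

logits = np.array([2.0, -1.0, 0.0])    # raw decoder outputs for three pixels
targets = np.array([1.0, 0.0, 1.0])    # the clean target pixel values
print(sigmoid_cross_entropy(logits, targets).mean())
```

Averaging this quantity over all pixels in the batch gives the scalar cost that Adam then minimizes.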

### 3.4 Why do we use a leaky ReLU and not a ReLU as an activation function?

We want gradients to flow while we backpropagate through the network. In a deep stack of layers, some neurons' values drop to zero or become negative. A ReLU activation clips negative values to zero, so in the backward pass the gradients do not flow through those neurons. Because of this, their weights do not get updated, and the network stops learning at those units. So using ReLU is not always a good idea. However, we encourage you to change the activation function to ReLU and see the difference.

```python
def lrelu(x, alpha=0.1):
    return tf.maximum(alpha * x, x)
```

Therefore, we use a leaky ReLU which, instead of clipping negative values to zero, scales them down by a hyperparameter alpha. This ensures that the network learns something even when a unit's input is negative.
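For intuition, here is the same leaky ReLU written in plain NumPy (a mirror of the `lrelu` helper above, not TensorFlow code):

```python
import numpy as np

def lrelu_np(x, alpha=0.1):
    # Positives pass through unchanged; negatives are scaled by alpha
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(lrelu_np(x))   # [-0.2  -0.05  0.    1.5 ]
```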

### 3.5 Load the data

Once the architecture has been defined, we load the training and validation data.

As shown below, TensorFlow allows us to easily load the MNIST data. The loaded training and testing data are stored in the variables train_X and test_X respectively. Since it's an unsupervised task, we do not care about the labels.

```python
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
train_X = mnist.train.images
test_X = mnist.test.images
```

### 3.6 Data Analysis

Before training a neural network, it is always a good idea to do a sanity check on the data.

Let’s see what the data looks like. The data consists of handwritten numbers ranging from 0 to 9, along with their ground truth labels. It has 55,000 train samples and 10,000 test samples. Each sample is a 28×28 grayscale image.

```python
print('Training data shape:', train_X.shape)
print('Testing data shape:', test_X.shape)

nsample = 1
rand_train_idx = np.random.randint(mnist.train.images.shape[0], size=nsample)
for i in rand_train_idx:
    curr_img = np.reshape(mnist.train.images[i, :], (28, 28))
    curr_lbl = np.argmax(mnist.train.labels[i, :])
    plt.matshow(curr_img, cmap=plt.get_cmap('gray'))
    plt.title(str(i) + "th Training Image (label: " + str(curr_lbl) + ")")
    plt.show()

rand_test_idx = np.random.randint(mnist.test.images.shape[0], size=nsample)
for i in rand_test_idx:
    curr_img = np.reshape(mnist.test.images[i, :], (28, 28))
    curr_lbl = np.argmax(mnist.test.labels[i, :])
    plt.matshow(curr_img, cmap=plt.get_cmap('gray'))
    plt.title(str(i) + "th Test Image (label: " + str(curr_lbl) + ")")
    plt.show()
```

**Output:**

```
Training data shape:  (55000, 784)
Testing data shape:   (10000, 784)
```

### 3.7 Preprocessing the data

The images are grayscale and the pixel values range from 0 to 255. We apply the following preprocessing to the data before feeding it to the network.

- Convert each 784-dimensional vector into a matrix of size 28 x 28 x 1 which is fed into the network.

```python
batch_train_x = mnist.train.next_batch(batch_size)
batch_test_x = mnist.test.next_batch(batch_size)
imgs_train = batch_train_x[0].reshape((-1, 28, 28, 1))
imgs_test = batch_test_x[0].reshape((-1, 28, 28, 1))
```

- Add noise to both train and test images, which we then feed into the network. The noise factor is a hyperparameter and can be tuned accordingly.

```python
noise_factor = 0.5
x_train_noisy = imgs_train + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=imgs_train.shape)
x_test_noisy = imgs_test + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=imgs_test.shape)
x_train_noisy = np.clip(x_train_noisy, 0., 1.)
x_test_noisy = np.clip(x_test_noisy, 0., 1.)
```
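A quick self-contained check (NumPy only, using made-up random images rather than the MNIST batches above) that the noise-then-clip step keeps pixel values in [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)
imgs = rng.uniform(0.0, 1.0, size=(4, 28, 28, 1))   # stand-in "images" in [0, 1]

noise_factor = 0.5
noisy = np.clip(imgs + noise_factor * rng.normal(size=imgs.shape), 0.0, 1.0)

print(noisy.min() >= 0.0, noisy.max() <= 1.0)       # True True
```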

### 3.8 Training the model

The network is ready to be trained. We specify the number of epochs as 25 with a batch size of 64. This means that the whole dataset will be fed to the network 25 times. We will be using the test data for validation.

```python
batch_cost, _ = sess.run([cost, opt],
                         feed_dict={inputs_: x_train_noisy,
                                    targets_: imgs_train,
                                    learning_rate: lr})
```
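The single `sess.run` call above sits inside an epoch/batch loop. The loop's bookkeeping can be sketched on its own in plain Python, with a counter standing in for the actual optimizer step. With 55,000 training samples and a batch size of 64, each epoch covers 859 full batches:

```python
num_samples, batch_size, epochs = 55000, 64, 25
batches_per_epoch = num_samples // batch_size   # 859 full batches per epoch

steps = 0
for e in range(epochs):
    for _ in range(batches_per_epoch):
        steps += 1   # here we would fetch a noisy batch and run one Adam step

print(batches_per_epoch, steps)   # 859 21475
```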

### 3.9 Evaluate the model

We check the performance of the model on our test set by checking the cost (loss).

```python
batch_cost_test = sess.run(cost, feed_dict={inputs_: x_test_noisy,
                                            targets_: imgs_test})
```

**Output**

```
Epoch: 25/25... Training loss: 0.1196 Validation loss: 0.1171
```

After 25 epochs, we can see that our training loss and validation loss are quite low, which means our network did a pretty good job. Let’s now look at the plot of training and validation loss.

### 3.10 Training Vs. Validation Loss Plot

```python
loss.append(batch_cost)
valid_loss.append(batch_cost_test)
plt.plot(range(e + 1), loss, 'bo', label='Training loss')
plt.plot(range(e + 1), valid_loss, 'r', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs', fontsize=16)
plt.ylabel('Loss', fontsize=16)
plt.legend()
plt.show()
```

From the loss plot above, we can observe that both the validation loss and the training loss decrease steadily over the first ten epochs. The training and validation losses are also very close to each other. This means that our model has generalized well to unseen test data.

We can further validate our results by observing the original, noisy, and reconstructed versions of the test images.

### 3.11 Results

From the above figures, we can observe that our model did a good job in denoising the noisy images that we had fed into our model.

## Subscribe & Download Code

If you liked this article and would like to download code (iPython notebook), please subscribe to our newsletter. You will also receive a free Computer Vision Resource Guide. In our newsletter, we share OpenCV tutorials and examples written in C++/Python, and Computer Vision and Machine Learning algorithms and news.