 
Convolutional Autoencoder (CAE)
- Python for Integrated Circuits -
- An Online Book -



=================================================================================

Autoencoders have been part of deep learning research for a long time and are especially popular for data compression tasks. Autoencoders and convolutional autoencoders are both types of neural networks used for unsupervised learning, but they differ in their architecture and use cases.

  1. Autoencoders:

    • Autoencoders are a type of neural network used for data compression and feature learning.
    • They consist of an encoder and a decoder. The encoder compresses the input data into a lower-dimensional representation (encoding), while the decoder reconstructs the original input data from the encoding.
    • Autoencoders can be fully connected, meaning all neurons in one layer are connected to all neurons in the next layer. This type is often called a "fully connected autoencoder" or "dense autoencoder."
  2. Convolutional Autoencoders:
    • Convolutional Autoencoders (CAEs) are a specific type of autoencoder designed for processing grid-like data, such as images.
    • They use convolutional layers in both the encoder and decoder parts of the network. Convolutional layers are well-suited for capturing spatial patterns in data, making CAEs particularly effective for image denoising, image generation, and feature learning in computer vision tasks.
    • In the encoder, convolutional layers are used to extract hierarchical features from the input image, and in the decoder, transposed convolutional layers (sometimes called "deconvolution" or "up-sampling" layers) are used to reconstruct the image.


Figure 4215. Relationship between convolutional autoencoders and autoencoders, and convolutional layers.

Regarding data compression via autoencoders, consider the following example. Assume that a program is created to send some data from one PC to another. The data is a collection of data points, each with two dimensions, as shown in Figure 4215a. Since the network bandwidth is limited, every bit of data to be sent should be optimized. Instead of sending all the data points in full, we can send only the first dimension of every data point to the other PC, and then, on the other PC, compute the value of the second dimension from the first dimension using their linear relationship. This method requires some computation, and the compression is lossy, but it reduces the network traffic by ~50%. In practice, autoencoders can be implemented with TensorFlow.


Figure 4215a. Data to be sent from one PC to another PC. The blue dots represent the data points. The horizontal axis is the value of the first dimension, while the vertical axis is the value of the second dimension.
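
The idea above can be sketched in a few lines of Python. This is a minimal sketch with made-up data in which the second dimension is approximately a linear function of the first; the slope and intercept are fit once on the sending side, and the receiver reconstructs the second dimension from the transmitted first dimension.

          import numpy as np

          # Hypothetical 2D data as in Figure 4215a: the second dimension is
          # roughly a linear function of the first.
          rng = np.random.default_rng(0)
          x = rng.uniform(0, 10, size=100)                   # first dimension
          y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=100)   # second dimension

          # "Encoding" on the first PC: fit the linear relationship once and
          # transmit only x (plus the two fit coefficients).
          slope, intercept = np.polyfit(x, y, deg=1)
          payload = x                                        # ~50% of the original traffic

          # "Decoding" on the second PC: reconstruct the second dimension from x.
          y_reconstructed = slope * payload + intercept
          print("mean squared reconstruction error:", np.mean((y - y_reconstructed) ** 2))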

The problem above is two-dimensional (2D). If the data is high-dimensional, then the set of data points can be given by {a(1), a(2), ..., a(m)}, where each data point has many dimensions. Therefore, a method is needed to map these points to another set of data points {z(1), z(2), ..., z(m)}, where the z's have lower dimensionality than the a's and the a's can be faithfully reconstructed from the z's.

Recall that sending the data from one PC to another includes the steps below:
          i) Encoding on the first PC. Map the original data a(i) to the compressed data z(i).
          ii) Sending the data z(i) to the second PC.
          iii) Decoding the data. Map the compressed data z(i) back to reconstructed data ã(i), which approximates the original data a(i).

Therefore, the following equations can be obtained:
          z(i) = W1a(i) + b1 ------------------------------------------------------------------------------------ [4215a]
          ã(i) = W2z(i) + b2 ------------------------------------------------------------------------------------ [4215b]
where,         
          W1 and W2 -- The weight matrices: W1 transforms the input a(i) into the hidden representation z(i), and W2 maps z(i) back to the input space.
          b1 and b2 -- The bias vectors of the hidden units and of the output (reconstruction) units, respectively.

The error can be given by,
          Reconstruction Error = Reconstructed data – original data ------------------------------------ [4215c]
          Error = Decoder(Encoder(X)) - X ---------------------------------------------------------------- [4215d]
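
As a concrete illustration of Equations [4215a] through [4215d], the minimal NumPy sketch below encodes a single 4-dimensional point into 2 dimensions and decodes it back; the weights and biases here are random, purely for illustration (in practice they are learned, as discussed below).

          import numpy as np

          rng = np.random.default_rng(1)
          a = rng.normal(size=4)                                 # original data point a(i)
          W1, b1 = rng.normal(size=(2, 4)), rng.normal(size=2)   # encoder parameters
          W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=4)   # decoder parameters

          z = W1 @ a + b1            # Equation [4215a]: encoding
          a_tilde = W2 @ z + b2      # Equation [4215b]: decoding
          error = a_tilde - a        # Equations [4215c]/[4215d]: reconstruction error
          print("squared reconstruction error:", np.sum(error ** 2))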

When no other constraints are imposed on the loss function, the auto-encoder weights tend to learn the identity function. Some form of regularization must then be imposed on the model so that it can uncover the underlying structure in the data. Forms of regularization include adding noise to the input units [1], see Equation 4215e, requiring the hidden unit activations to be sparse [5], or requiring them to have small derivatives [6]. These models are known as de-noising, sparse, and contractive auto-encoders, respectively.

          Error = Decoder(Encoder(X + noise)) - X ------------------------------------------------------- [4215e]
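
A minimal Keras sketch of the de-noising criterion in Equation [4215e] is shown below; the layer sizes, noise level, and random training data are illustrative assumptions. The encoder sees a corrupted copy of the input, while the loss compares the reconstruction against the clean input.

          import numpy as np
          from tensorflow.keras import layers, models

          inputs = layers.Input(shape=(8,))
          noisy = layers.GaussianNoise(0.2)(inputs)           # X + noise (active during training only)
          encoded = layers.Dense(4, activation='relu')(noisy)
          decoded = layers.Dense(8, activation='linear')(encoded)
          denoising_ae = models.Model(inputs, decoded)
          denoising_ae.compile(optimizer='adam', loss='mse')

          X = np.random.rand(256, 8).astype('float32')
          denoising_ae.fit(X, X, epochs=5, batch_size=32, verbose=0)   # target is the clean X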

The auto-encoder model learns a function that minimizes the squared error between the input a(i) ∈ Rn and its reconstruction ã(i):
          L = ||a(i) - ã(i)||²₂ ------------------------------------------------------------------------------------- [4215f]
          ã(i) = f(Wdz(i) + c) ---------------------------------------------------------------------------------- [4215g]
where,
          f(·) -- Some nonlinear function. Commonly chosen examples for f(·) include the sigmoid and hyperbolic tangent functions.
         Wd -- The weight matrix that maps back from the hidden representation to the input space.
          c -- A vector of biases for each input (visible) unit.
These parameters above are normally learned by minimizing the loss function over the training data via stochastic gradient descent.

In practice, if a(i) is a two-dimensional vector, it may be possible to visualize the data and find W1, b1 and W2, b2 analytically. However, in most cases it is difficult to find those matrices by visualization alone; therefore, gradient descent is needed. Since the goal is to have ã(i) approximately equal to a(i), the objective function is the sum of squared differences between them,
          J(W1, b1, W2, b2) = Σi ||ã(i) - a(i)||²₂ --------------------------------------------------- [4215h]
                                            = Σi ||W2(W1a(i) + b1) + b2 - a(i)||²₂ ------------------------ [4215i]
which can be minimized using stochastic gradient descent.
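
The sketch below minimizes this objective with stochastic gradient descent in TensorFlow; the random data, batch size, and the 8-to-4 dimensionality reduction are illustrative assumptions.

          import tensorflow as tf

          m, n, k = 512, 8, 4
          A = tf.random.normal((m, n))                          # data points a(i), one per row
          W1 = tf.Variable(tf.random.normal((n, k), stddev=0.1))
          b1 = tf.Variable(tf.zeros(k))
          W2 = tf.Variable(tf.random.normal((k, n), stddev=0.1))
          b2 = tf.Variable(tf.zeros(n))
          opt = tf.keras.optimizers.SGD(learning_rate=0.05)

          for step in range(200):
              # draw a random mini-batch of data points
              batch = tf.gather(A, tf.random.uniform((64,), maxval=m, dtype=tf.int32))
              with tf.GradientTape() as tape:
                  Z = batch @ W1 + b1                           # Equation [4215a]
                  A_tilde = Z @ W2 + b2                         # Equation [4215b]
                  loss = tf.reduce_sum(tf.square(A_tilde - batch))   # Equation [4215h]
              grads = tape.gradient(loss, [W1, b1, W2, b2])
              opt.apply_gradients(zip(grads, [W1, b1, W2, b2]))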

Autoencoders are unsupervised neural network models that summarize the general properties of the original data in fewer parameters while learning how to reconstruct the data after compression. [3] The particular architecture above is also known as a linear autoencoder, shown in Figure 4215b. In the case in the figure, we are trying to map data from 8 dimensions to 4 dimensions using a neural network with one hidden layer z. The activation function of the hidden layer is linear, hence the name linear autoencoder, which works when the data lie on a linear surface. The encoder network transforms the original data into a lower dimensional representation; that is, it approximates a function that maps the data from its full input space into a lower dimensional coordinate system that exploits the structure of the data. The decoder network attempts to recreate the original input from the output of the encoder; in other words, it tries to reverse the encoding process. The vector z is called the embedding vector.

If the data lie on a nonlinear surface, it makes more sense to use a nonlinear autoencoder. Furthermore, if the data is highly nonlinear, one could add more hidden layers to the network to have a deep autoencoder.


Figure 4215b. Linear autoencoder. The part to the left of z is called the encoder network, while the part to the right of z is called the decoder network. The encoder network transforms the original data into a lower dimensional representation. The vector z is called the embedding vector.
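
A Keras sketch of the 8-to-4 linear autoencoder of Figure 4215b is given below; the training data here is random and purely illustrative.

          import numpy as np
          from tensorflow.keras import layers, models

          input_tensor = layers.Input(shape=(8,))
          z = layers.Dense(4, activation='linear', name='embedding')(input_tensor)   # encoder
          output_tensor = layers.Dense(8, activation='linear')(z)                    # decoder
          linear_ae = models.Model(input_tensor, output_tensor)
          linear_ae.compile(optimizer='adam', loss='mse')

          X = np.random.rand(1000, 8).astype('float32')
          linear_ae.fit(X, X, epochs=10, batch_size=32, verbose=0)

          # The encoder alone maps new data to the embedding vector z.
          encoder = models.Model(input_tensor, z)
          embeddings = encoder.predict(X[:5])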

Auto-encoders are popular models for performing unsupervised feature extraction on highly nonlinear data. In other words, unlike in supervised learning, the data above consist only of inputs (the a's) without corresponding labels. Unsupervised learning and data compression through autoencoders require modifications to the loss function. The simplest implementation of an auto-encoder is a simple feed-forward neural network where the learned latent representation is given by the hidden vector,
          h = σ(Wxx + bx) ------------------------------------------------------------------------------- [4215j]
where,
          σ -- An activation function.
          Wx -- Weight matrix.
          bx -- Bias.
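
For instance, a few lines of NumPy compute the hidden vector of Equation [4215j] with a sigmoid activation; the input and weights here are random, purely for illustration.

          import numpy as np

          def sigmoid(v):
              return 1.0 / (1.0 + np.exp(-v))

          rng = np.random.default_rng(2)
          x = rng.normal(size=8)                                  # input vector
          Wx, bx = rng.normal(size=(4, 8)), rng.normal(size=4)    # weight matrix and bias
          h = sigmoid(Wx @ x + bx)                                # latent (hidden) representation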

While the autoencoders mentioned above have shown impressive results, they do not directly address the structure of images. Convolutional neural networks (CNNs) [7-8], see page4237, show a way to reduce the number of connections by having each hidden unit be responsible only for a small local neighborhood of visible units. Such schemes allow for dense feature extraction, and when followed by pooling layers and stacked, they allow the network to learn over larger and larger receptive fields. Convolutional auto-encoders (CAEs) combine aspects of both autoencoders and convolutional neural networks, which makes it possible to extract highly localized, patch-based information in an unsupervised fashion. The CAE is an unsupervised learning model (page4322) for extracting hierarchical features from natural images. CAEs can be stacked in such a way that each CAE takes the latent representation of the previous CAE to form higher-level representations [4]. A CAE is similar to a traditional auto-encoder except that it uses convolutional (and optionally pooling) layers for the hidden layers in the network, and the kth feature map output by a convolutional layer is given by,
          hk = σ(x*Wk + bk) ---------------------------------------------------------------------------- [4215k]
where,
          * -- Denotes the 2D convolution operator.
          Wk -- The kth convolutional filter (kernel).
          bk -- The bias of the kth feature map.

Similar to CNNs, max-pooling can optionally be applied to the feature maps output by a convolutional layer. The activation values are then the maxima of small patches spanning a given feature map. For highly non-linear data, CAEs can be stacked (CAES) to obtain a deep structure for modelling the data (similar to [1]). In experiments on MNIST, such a CAE model is capable of learning robust feature representations for image data.
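
A minimal Keras sketch of a convolutional autoencoder on MNIST is given below; the filter counts, depths, and training settings are illustrative assumptions rather than the architecture of any particular reference. The encoder applies convolutions as in Equation [4215k] followed by max-pooling, and the decoder uses up-sampling and convolutions to reconstruct the image.

          from tensorflow.keras import layers, models
          from tensorflow.keras.datasets import mnist

          inputs = layers.Input(shape=(28, 28, 1))
          x = layers.Conv2D(16, (3, 3), activation='relu', padding='same')(inputs)
          x = layers.MaxPooling2D((2, 2), padding='same')(x)          # 28x28 -> 14x14
          x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)
          encoded = layers.MaxPooling2D((2, 2), padding='same')(x)    # 14x14 -> 7x7

          x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
          x = layers.UpSampling2D((2, 2))(x)                          # 7x7 -> 14x14
          x = layers.Conv2D(16, (3, 3), activation='relu', padding='same')(x)
          x = layers.UpSampling2D((2, 2))(x)                          # 14x14 -> 28x28
          decoded = layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

          cae = models.Model(inputs, decoded)
          cae.compile(optimizer='adam', loss='binary_crossentropy')

          (x_train, _), (x_test, _) = mnist.load_data()
          x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
          x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255.0
          cae.fit(x_train, x_train, epochs=5, batch_size=128,
                  validation_data=(x_test, x_test), verbose=0)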

The deep topology of a CAES enables each layer of the network to model increasingly abstract latent representations of the input based on the latent output of the previous layer. Therefore, a CAES offers a powerful model for learning robust hierarchical latent representations of highly structured inputs such as natural images. Given the similarity in structure between a CAES and popular CNN classification models based on the architecture of AlexNet [2], the learned weights of a CAES can also be used to initialize the weights of the latter group of networks; the intuition is that this ensures the weights of the CNN are initially set to sensible values for training with back-propagation. The tricky part of CAEs is the decoder side of the model: during encoding, the image size is shrunk by subsampling with either average pooling or max-pooling, resulting in an information loss that is hard to recover while decoding.

In practice, CAEs have been developed in several directions:
          i) Rely on sparse coding to force the unsupervised learning to find non-trivial solutions. [9-10]
          ii) The work in i) was extended by introducing pooling/unpooling and by visualizing how individual feature maps at different layers influenced specific portions of the reconstruction. However, these sparse coding approaches had limitations because they used an iterative procedure for inference. [11]
          iii) Trained deep feed-forward convolutional autoencoders, using only max-pooling and saturating tanh non-linearities as a form of regularization, while still showing a modest improvement over randomly initialized CNNs. [12]
          iv) Showed that ReLUs are more suitable for learning given their non-saturating behavior. [13]

Autoencoders have many interesting applications:
          i) Data compression.
          ii) Visualization.
          iii) Pre-training of neural networks, because pre-training can improve deep neural networks (probably due to the fact that pre-training is
               done one layer at a time, which means it does not suffer from the difficulty of full supervised learning) and thus can address the
               problems below:
               iii.A) Training very deep neural networks is difficult.
               iii.B) The magnitudes of gradients in the lower layers and in the higher layers are different.
               iii.C) The landscape or curvature of the objective function makes it difficult for stochastic gradient descent to find a good local optimum.
               iii.D) Deep networks have many parameters, which can memorize the training data and thus do not generalize well.
               iii.E) With pre-training, the process of training a deep network can be divided into a sequence of steps (see the sketch after this list):
                  iii.E.a) Pre-train a sequence of shallow autoencoders, greedily one layer at a time, using unsupervised data.
                  iii.E.b) Use a fine-tuning step to train the last layer using supervised data (e.g. the code line "output_tensor = layers.Dense(8,
                           activation='softmax')(dense_2)" in code).
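
The sketch below illustrates the recipe in iii.E with hypothetical shapes and random data: two shallow autoencoders are pre-trained greedily on unlabeled data, and their encoder layers then initialize a classifier whose softmax output layer (as in the quoted layers.Dense(8, activation='softmax') line) is trained with labeled data.

          import numpy as np
          from tensorflow.keras import layers, models

          # Hypothetical data: 32-dimensional inputs, 8 classes.
          X_unlabeled = np.random.rand(2000, 32).astype('float32')
          X_labeled = np.random.rand(500, 32).astype('float32')
          y_labeled = np.random.randint(0, 8, size=500)

          def pretrain_layer(data, units):
              """Train a one-hidden-layer autoencoder and return its encoder layer."""
              inp = layers.Input(shape=(data.shape[1],))
              enc = layers.Dense(units, activation='relu')
              dec = layers.Dense(data.shape[1], activation='linear')(enc(inp))
              ae = models.Model(inp, dec)
              ae.compile(optimizer='adam', loss='mse')
              ae.fit(data, data, epochs=5, batch_size=64, verbose=0)
              return enc

          # iii.E.a) Greedy, layer-wise pre-training with unsupervised data.
          enc_1 = pretrain_layer(X_unlabeled, 16)
          H1 = enc_1(X_unlabeled).numpy()       # representation fed to the next autoencoder
          enc_2 = pretrain_layer(H1, 8)

          # iii.E.b) Fine-tuning with supervised data: only the softmax layer is
          # trained here, per iii.E.b (in practice the whole network is often
          # fine-tuned as well).
          enc_1.trainable = False
          enc_2.trainable = False
          input_tensor = layers.Input(shape=(32,))
          dense_1 = enc_1(input_tensor)
          dense_2 = enc_2(dense_1)
          output_tensor = layers.Dense(8, activation='softmax')(dense_2)
          classifier = models.Model(input_tensor, output_tensor)
          classifier.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
          classifier.fit(X_labeled, y_labeled, epochs=5, batch_size=32, verbose=0)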


============================================

[1] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.
[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[3] Gallinari, P., LeCun, Y., Thiria, S., & Fogelman-Soulie, F. (1987). “Memoires associatives distribuees”. Proceedings of COGNITIVA 87. Paris, La Villette.
[4] Jonathan Masci, Ueli Meier, Dan Cireşan, and Jürgen Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. In Artificial Neural Networks and Machine Learning–ICANN 2011, pages 52–59. Springer, 2011. doi: 10.1007/978-3-642-21735-7_7.
[5] Adam Coates, Andrew Y Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics, pp 215–223, 2011.
[6] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive autoencoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 833–840, 2011.
[7] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[8] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 609–616. ACM, 2009.
[9] Kevin Jarrett, Koray Kavukcuoglu, M Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International Conference on, pages 2146–2153. IEEE, 2009.
[10] Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Robert Fergus. Deconvolutional net-works. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2528–2535. IEEE, 2010.
[11] Matthew D Zeiler, Graham W Taylor, and Rob Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2018–2025. IEEE, 2011.
[12] Jonathan Masci, Ueli Meier, Dan Cireşan, and Jürgen Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. In Artificial Neural Networks and Machine Learning–ICANN 2011, pages 52–59. Springer, 2011.
[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

=================================================================================