Comprehensive Overview of Autoencoders and Their Applications
Autoencoders (AEs) are neural networks trained using unsupervised learning to copy input to output, learning an embedding. This article discusses various types of autoencoders, topics in autoencoders, applications such as dimensionality reduction and image compression, and related concepts like embeddings and other dimensionality reduction methods like PCA and Multidimensional Scaling.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Autoencoders (AEs) Thanks to Sargur Srihari, Fei-Fei Li, Justin Johnson, Serena Yeung, Sosuke Kobayashi, Yingyu Liang, Guy Golan, Song Han, Jason Brownlee, Jefferson Hernandez
Previously 1. Principles of machine learning 2. Deep Feedforward NNs 3. Regularization 4. Optimization 5. Convolutional NNs 6. Recurrent NNs 7. Memory NNs 8. Today Autoencoders, GANs
Generic Neural Architectures (1-11) 14 types of neurons
Topics in Autoencoders What is an autoencoder? 1. Undercomplete Autoencoders 2. Regularized Autoencoders 3. Representational Power, Layout Size and Depth 4. Stochastic Encoders and Decoders 5. Denoising Autoencoders 6. Learning Manifolds and Autoencoders 7. Contractive Autoencoders 8. Predictive Sparse Decomposition 9. Applications of Autoencoders
Some Autoencoder Applications 1.Dimensionality Reduction 2.Image Compression 3.Image Denoising 4.Feature Extraction 5.Image generation 6.Sequence to sequence prediction 7.Encoders for transformers
What is an Autoencoder (AE) ? A neural network trained using unsupervised learning Trained to copy its input to itsoutput Learns an embedding h
Embedding is a Point on a Manifold An embedding is a low-dimensional vector With fewer dimensions than the ambient space of which the manifold is a low-dimensional subset Embedding Algorithm Maps any point in ambient space x to its embedding h Embeddings of related inputs form a manifold
Other Embeddings All are dimensionally reduction methods: Principle component analysis (PCA): PCA is a feature extraction technique it combines the variables, and then it drops the least important variables while still retains the valuable parts of the variables Probably the most widely used embedding to date. The idea is simple: Find a linear transformation of features that maximizes the captured variance or (equivalently) minimizes the quadratic reconstruction error. Multidimensional Scaling (MDS): Unsupervised ML methods that represent high- dimensional data in a lower dimensional space, while preserving the inter-point distances as best as possible.
General Structure of an Autoencoder Maps an input x to an output r (called a reconstruction) through an internal representation code h Hidden layer h describes a code used to represent theinput The network has two parts The encoder function h=f(x) A decoder that produces a reconstructionr=g(h)
Autoencoders Differ from Classical Data Compression Autoencoders are data-specific i.e., only able to compress data similar to what they have been trainedon Different from MP3 or JPEG compression algorithm These make general assumptions about "sound/images , but not about specific types of sounds/images Autoencoder for pictures of cats would do poorly in compressing pictures of trees Features it would learn would be cat-specific Autoencoders are lossy Their decompressed outputs will be degraded compared to the original inputs (similar to MP3 or JPEGcompression). This differs from lossless arithmetic compression Autoencoders are learned
What does an Autoencoder Learn? Learning g (f (x))=x everywhere is not useful Autoencoders are designed to be unable to copy perfectly Restricted to copying only approximately Autoencoders learn useful properties of the data Forced to prioritize which aspects of input should becopied Can learn stochastic mappings Go beyond deterministic functions to mappings pencoder(h|x) andpdecoder(x|h)
Autoencoder History Part of neural network landscape for decades Used for dimensionality reduction and feature learning Historical note: goes back to (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994). Theoretical connection to latent variable models AE s brought them into forefront of generative models Variational Autoencoders
Basic Types of Autoencoders (AEs) We distinguish between two types of AE structures: Undercomplete Overcomplete
Undercomplete AE Hidden layer is Undercomplete if smaller than the input layer Compresses the input Compresses well only for the training distribution ? ? ? ? ? Hidden nodes will be Good features for the training distribution. Bad for other types on input ?
Overcomplete AE Hidden layer is Overcomplete if greater than the input layer No compression in hidden layer. Each hidden unit could copy a different input component. No guarantee that the hidden units will extract meaningful structure. Adding dimensions is good for training a linear classifier (XOR case example). A higher dimension code helps model a more complex distribution. ? ? ? ? ? ?
An autoencoder architecture Decoderg Weights W are learned using: 1. Training samples, and 2. a loss function Encoderf
Autoencoder Training Methods 1. Autoencoder is a feed-forward non-recurrent neural net With an input layer, an output layer and one or more hiddenlayers Can be trained using the same techniques Compute gradients using back-propagation Followed by minibatch gradient descent 2. Unlike feedforward networks, can also be trained using Recirculation Compare activations on the input to activations of the reconstructed input More biologically plausible than back-prop but rarely used in ML
1. Undercomplete Autoencoder Copying input to output seems useless but we have no interest in decoder output Want h to take on useful properties Undercomplete autoencoder Constrain h to have lower dimension than x Force it to capture most salient features of training data
Autoencoder with Linear Decoder +MSE is a PCA Learning process is minimizing a loss function L(x, g ( f (x))) where L is a loss function penalizing g( f (x)) for being dissimilar fromx Exs: L2 norm of difference: mean squarederror When the decoder g is linear and L is the mean squared error, an undercomplete autoencoder learns to span the same subspace asPCA In this case the autoencoder trained to perform the copying task has learned the principal subspace of the training data as a side-effect Autoencoders with nonlinear f and g can learn morepowerful nonlinear generalizations of PCA But high capacity is not desirable
Autoencoder Training Using a Loss Function Autoencoder with 3 fully connected hidden layers Encoder f and decoderg f : g : h h X 2 (f !g)X arg min X f,g h One hidden layer Non-linear encoder Takes input x Rd Maps into output h Rp h = 1(Wx +b) x '= 2(W 'h +b') Trained to minimize reconstruction error (such as sum of squared errors) Decoderg Encoderf o is an element-wise activation function such as sigmoid or Relu 2 2 (Wt( (Wx +b))+b') 2 1 L(x,x ') = x = x x ' Provides a compressed representation of the inputx
Encoder/decoder Capacity If encoder f and decoder g are allowed too much capacity autoencoder can learn to perform the copying task without learning any useful information about the distribution of data Autoencoder with a one-dimensional code and a very powerful nonlinear encoder can learn to map x(i) to code i. The decoder can learn to map these integer indices back to the valuesof specific training examples Autoencoder trained for copying task fails to learn anything useful if f/g capacity is too great A model with too little capacity cannot learn the training dataset meaning it will underfit, whereas a model with too much capacity may memorize the training dataset, meaning it will overfit or may get stuck or lost during the optimization process. The capacity of a neural network model is defined by configuring the number of nodes and the number of layers.
Cases When Autoencoder Learning Fails When do autoencoders fail to learn anything useful: 1. Capacity of encoder/decoder f/g is too high Capacity controlled by depth 2. Hidden code h has dimension equal to input x 3. Overcomplete case: where hidden code h has dimension greater than input x Even a linear encoder/decoder can learn to copy input tooutput without learning anything useful about data distribution
2. Correct AE Design: use Regularization Ideally, choose code size (dimension of h) small and capacityof encoder f and decoder g based on complexity of distribution modeled Regularized autoencoders Rather than limiting model capacity by keeping encoder/decoder shallow and code size small, use a loss function that encourages the model to have properties other than copy its input to output
Regularized Autoencoder Properties Regularized AEs have properties beyond copying input to output: Sparsity of representation Smallness of the derivative of the representation Robustness to noise Robustness to missing inputs Regularized autoencoders can be nonlinear and overcomplete Still can learn something useful about the data distribution even if model capacity is great enough to learn trivial identity function
Generative Models Viewed as AEs Beyond regularized autoencoders Generative models with latent variables and an inference procedure (for computing latent representations given input) can be viewed as a particular form of autoencoder Generative modeling approaches which have a connection with autoencoders are descendants of the Helmholtz machine. Examples 1. Variational autoencoder 2. Generative stochastic networks
Latent variables treated as distributions Source: https://www.jeremyjordan.me/variational-autoencoders/
Variational Autoencoder (VAE) VAE is a generative model able to generate samples that look like samples from trainingdata With MNIST, these fake samples would be synthetic images of digits Due to random variable between input & output it cannot be trained using backprop Instead, backprop uses the parameters of the latent distribution Called reparameterization trick N( , ) = + N(0, I) Where is diagonal 2 1
Sparse Autoencoder Only a few nodes are encouraged to activate when a single sample is fed into the network Fewer nodes activating while still maintaining performance guarantees that the autoencoder is actually learning latent representations instead of redundant information in the input data
Sparse Autoencoder Loss Function A sparse autoencoder is an autoencoder whose Training criterion includes a sparsity penalty (h) on the code layer hin addition to the reconstruction error: L(x, g ( f (x))) + (h) where g (h) is the decoder output and typically we have h = f(x) Sparse encoders are typically used to learn features for another task such as classification An autoencoder that has been trained to be sparse must respond to unique statistical features of the dataset rather than simply perform the copying task A sparsity penalty can yield a model that has learned useful features as a byproduct
Sparse encoder doesnt have a Bayesian Interpretation Penalty term (h) is a regularizer term added to afeedforward network Primary task: copy input to output (with Unsupervised learning objective) Also perform some supervised task (with Supervised learning objective) that depends on the sparse features In supervised learning regularization term corresponds to prior probabilities over model parameters Regularized MLE corresponds to maximizing p( |x), which is equivalent to maximizing log p(x| )+logp( ) First term is data log-likelihood and second term is log-prior over parameters Regularizer depends on data and thus is not a prior Instead, regularization terms express a preference over functions
Generative Model View of Sparse AE Rather than thinking of a sparsity penalty as a regularizer for the copying task, think of a sparse autoencoder as approximating ML training of a generative model that has latent variables Suppose model has visible/latent variables x and h Explicit joint distribution is pmodel(x,h) = pmodel(h)pmodel(x|h) where pmodel(h) is model s prior distribution over latentvariables Different from p( ) being distribution of parameters The log-likelihood can be decomposed aslogpmodel(x,h) Autoencoder approximates the sum with a point estimatefor just one highly likely value of h, the output of a parametric encoder For a chosen h we are maximizing log pmodel(x,h) = log pmodel(h)+logpmodel(x|h) log pmodel(h,x) h
Denoising Autoencoders (DAE) Rather than adding a penalty to the cost function, we can obtain an autoencoder that learns something useful by changing the reconstruction error of the cost function Traditional autoencoders minimize L(x, g ( f(x))) where L is a loss function penalizing g( f (x)) for being dissimilar fromx, such as L2 norm of difference: mean squarederror A DAE minimizes where is a copy of x that has been corrupted by some form ofnoise The autoencoder must undo this corruption rather than simply copying their input Denoising training forces f and g to implicitly learn the structure of pdata(x) Another example of how useful properties can emerge as a by -product of minimizing reconstruction error L(x,g(f()))
Regularizing by Penalizing Derivatives Another strategy for regularizing an autoencoder Use penalty as in sparse autoencoders L(x, g ( f (x))) + (h,x) But with a different form of 2 xhi (h,x) i Forces the model to learn a function that does not change much when x changes slightly Called a Contractive Auto Encoder (CAE) This model has theoretical connections to Denoising autoencoders Manifold learning Probabilistic modeling
3. Representational Power, Layer Size and Depth Autoencoders are often trained with with a single layer However using a deep encoder offers many advantages Recall: Although universal approximation theorem states that a single layer is sufficient, there are disadvantages: 1. number of units needed may be too large 2. may not generalize well Common strategy: greedily pretrain a stack of shallow autoencoders
4. Stochastic Encoders and Decoders General strategy for designing the output units and loss function of a feedforward network is to Define the output distribution p(y|x) Minimize the negative log-likelihood logp(y|x) In this case y is a vector of targets such as classlabels In an autoencoder x is the target as well as the input Yet we can apply the same machinery as before
Loss function for Stochastic Decoder Given a hidden code h, we may think of the decoderas providing a conditional distribution pdecoder(x|h) We train the autoencoder by minimizing logpdecoder(x|h) The exact form of this loss function will change depending on the form of pdecoder(x|h) As with feedforward networks we use linear output units to parameterize the mean of the Gaussian distribution if x is real In this case negative log-likelihood is the mean-squarederror With binary x values correspond to a Bernoulli distribution with parameters given by a sigmoid output Discrete x values correspond to a softmax output The output variables are treated as being conditionally independent given h so the probably distribution is inexpensive to evaluate
Stochastic Encoder We can also generalize the notion of an encoding function f(x) to an encoding distribution pencoder(h|x)
Structure of stochastic autoencoder Both the encoder and decoder are not simple functions but involve a distribution The output is sampled from a distribution pencoder(h|x) for the encoder and pdecoder(x|h) for the decoder
Relationship to the Joint Distribution Any latent variable model pmodel(h|x) defines a stochastic encoder pencoder(h|x)=pmodel(h|x) And a stochastic decoder pdecoder(x|h)=pmodel(x|h) In general the encoder and decoder distributions are not conditional distributions compatible with a unique joint distribution pmodel(x,h) Training the autoencoder as a denoising autoencoder will tend to make them compatible asymptotically With enough capacity and examples
Sampling pmodel(h|x) x pencoder(h|x) pdecoder(x|h)
Ex: Sampling p(x|h): Deepstyle Look at a representation which relates to style By iterating neural network through a set of images learn efficient representations Choosing a random numerical description in encoded space will generate new images of styles not seen Using one input image and changing values along different dimensions of feature space you can see how the generated image changes (patterning, color texture) in style space
Topics in Autoencoders What is an autoencoder? 1. Undercomplete Autoencoders 2. Regularized Autoencoders 3. Representational Power, Layout Size and Depth 4. Stochastic Encoders and Decoders 5. Denoising Autoencoders 6. Learning Manifolds and Autoencoders 7. Contractive Autoencoders 8. Predictive Sparse Decomposition 9. Applications of Autoencoders
5. Denoising Autoencoders (DAEs) Defined as an autoencoder that receives a corrupted data point as input and is trained to predict the original, uncorrupted data point as its output Traditional autoencoders minimize L(x, g ( f (x))) where L is a loss function penalizing g( f (x)) for being dissimilarfrom x, such as L2 norm of difference: mean squarederror A DAE minimizes L(x,g(f())) where The autoencoder must undo this corruption rather than simply copying their input is a copy of x that is corrupted by some form ofnoise
Example of Noise in a DAE An autoencoder with high capacity can end up learning an identity function (also called null function) where input=output A DAE can solve this problem by corrupting the datainput How much noise to add? Corrupt input nodes by setting 30-50% of random input nodes to zero Original input, corrupted data, reconstructed data
DAE Training Procedure Computational graph of cost function below DAE trained to reconstruct clean data point x from the corrupted Accomplished by minimizing loss L=-logpencoder(x|h=f(x)) Corruption process, C( |x) is a conditional distribution over corrupted samples data sample x given the The autoencoder learns a reconstruction distribution preconstruct(x| )) ) estimated from training pairs (x,))asfollows: 1 Sample a training sample x from the trainingdata 2. Sample a corrupted version from C(|||x) 3.Use (x, )) as a training example for estimating the autoencoder distribution precoconstruct(x| | ) =pdecoder(x|h) with h the output of encoder f( ) and pdecodertypically defined by a decoder g(h) DAE performs SGD on the expectation Ex! ~p^data(x) logpdecoder(x|h=f())
DAE for MNIST Data Python/Theano import theano.tensor as T from opendeep.models.model import Model from opendeep.utils.nnet import get_weights_uniform, get_bias from opendeep.utils.noise import salt_and_pepper from opendeep.utils.activation import tanh, sigmoid from opendeep.utils.cost import binary_crossentropy # create our class initialization! class DenoisingAutoencoder(Model): """ A denoising autoencoder will corrupt an input (add noise) and try to reconstructit. """ def init (self): # Define some model hyperparameters to work with MNISTimages! input_size = 28*28 # dimensions of image hidden_size = 1000 # number of hidden units - generally bigger than input size for DAE # Now, define the symbolic input to the model(Theano) # We use a matrix rather than a vector so that minibatch processing can be done inparallel. x = T.fmatrix("X") self.inputs = [x] # Build the model's parameters - a weight matrix and two biasvectors W = get_weights_uniform(shape=(input_size, hidden_size), name="W") b0 = get_bias(shape=input_size, name="b0") b1 = get_bias(shape=hidden_size, name="b1") self.params = [W, b0, b1] # Perform the computation for a denoising autoencoder! # first, add noise (corrupt) theinput corrupted_input = salt_and_pepper(input=x, corruption_level=0.4) # next, compute the hidden layer given the inputs (the encodingfunction) hiddens = tanh(T.dot(corrupted_input, W) + b1) # finally, create the reconstruction from the hidden layer (we tie the weights withW.T) reconstruction = sigmoid(T.dot(hiddens, W.T) + b0) # the training cost is reconstruction error - with MNIST this is binarycross-entropy self.train_cost = binary_crossentropy(output=reconstruction, target=x) Unsupervised Denoising Autoencoder Left: original test images Center: corrupted noisy images Right: reconstructed images
Denoising Autoencoders Intuition: - We still aim to encode the input and to NOT mimic the identity function. - We try to undo the effect of corruption process stochastically applied to the input. A more robust model Encoder Decoder Noisy Input Denoised Input Latent space representation
Denoising Autoencoders Use Case: - Extract robust representation for a NN classifier. Encoder Noisy Input Latent space representation
Denoising Autoencoders Instead of trying to mimic the identity function by minimizing: ? ?,? ? ? where L is some loss function A DAE instead minimizes: ? ?,? ? ? where ? is a copy of ? that has been corrupted by some form of noise.
Denoising Autoencoders ? Idea: A robust representation against noise: ? - Random assignment of subset of inputs to 0, with probability ?. - Gaussian additive noise. ? ? ? ?