Chetan Arora
Disclaimer: The contents of these slides are taken from various publicly available resources such as research papers,
talks and lectures. The sources are usually acknowledged but sometimes not. To be used for the purpose of
classroom teaching, and academic dissemination only.
Chetan Arora
Computer Vision and Graphics Lab, IIT Delhi
Taxonomy: tractable density (e.g. PixelRNN/CNN) vs. approximate density (e.g. variational, Markov chain).
PixelRNN/CNN
• Generate image pixels starting from the corner.
• Dependency of a pixel (i, j) on the previous pixels, e.g. (i, j − 1), is modelled using an RNN (LSTM) over the input layer.
• Drawback: sequential generation is slow!
Aaron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu. Pixel Recurrent Neural Networks. ICML 2016.
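The sequential-generation drawback can be seen directly in a sketch of the sampling loop: every pixel must wait for all earlier pixels in raster order. The `conditional_probs` function below is a hypothetical stand-in for the trained RNN (here it just returns a uniform distribution), purely to illustrate the O(h·w) sequential structure:

```python
import numpy as np

rng = np.random.default_rng(0)

def conditional_probs(image, i, j):
    """Stand-in for the trained RNN/CNN: returns a distribution over the
    256 intensity values for pixel (i, j) given all previously generated
    pixels. A fixed uniform distribution here, purely for illustration."""
    return np.full(256, 1.0 / 256)

def sample_image(h, w):
    # Pixels are generated one at a time in raster-scan order, starting
    # from the top-left corner: h*w strictly sequential steps.
    img = np.zeros((h, w), dtype=np.int64)
    for i in range(h):
        for j in range(w):
            p = conditional_probs(img, i, j)
            img[i, j] = rng.choice(256, p=p)
    return img

img = sample_image(4, 4)
```

The inner loop cannot be parallelized at sampling time, which is exactly why PixelRNN/CNN generation is slow.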
Row LSTM
• Gives rise to a triangular context.
Diagonal BiLSTM
• Designed to capture the entire available context for any
image size.
Diagonal BiLSTM
Rotation Trick
• To allow for parallelization along the diagonals, the input map is skewed by offsetting each row by one position with respect to the previous row.
• Location (i, j) of the input moves to (i, i + j) in the skewed map, so each diagonal becomes a column and all pixels of a column can be computed in parallel.
• The Diagonal model is able to capture the whole available context (even for the border pixels), which is not true for the Row model.
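The skewing operation itself is simple; a minimal numpy sketch (my own illustrative helper, not code from the paper) shifts row i right by i positions so that diagonals of the original map line up as columns:

```python
import numpy as np

def skew(x):
    """Skew an (h, w) map: row i is shifted right by i positions, so the
    diagonals of the original map become columns of the (h, 2w-1) output."""
    h, w = x.shape
    out = np.zeros((h, 2 * w - 1), dtype=x.dtype)
    for i in range(h):
        out[i, i:i + w] = x[i]  # entry (i, j) lands in column i + j
    return out

x = np.arange(9).reshape(3, 3)
s = skew(x)
```

After skewing, column k of `s` holds exactly the pixels (i, j) with i + j = k, i.e. one diagonal of the original image, which is what the Diagonal BiLSTM processes in parallel.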
PixelCNN
• The Row and Diagonal LSTM layers capture long-range dependencies in images.
• This comes at a computational cost: each LSTM state must be computed sequentially.
• Standard convolutional layers, in contrast, capture a bounded receptive field and compute features for all pixel positions at once.
PixelCNN
• Still generates image pixels starting from the corner.
• Softmax loss over the 256 intensity values at each pixel.
Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, Koray Kavukcuoglu
Conditional Image Generation with PixelCNN Decoders. NIPS 2016
PixelCNN: Masking
Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, Koray Kavukcuoglu
Conditional Image Generation with PixelCNN Decoders. NIPS 2016
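The masking idea can be sketched for a single-channel kernel: zero out the kernel weights that would look at the current pixel or at later pixels in raster order. (Real PixelCNN masks additionally handle the ordering of the RGB channels; this hypothetical helper ignores channels.)

```python
import numpy as np

def causal_mask(k, mask_type="A"):
    """Binary mask for a k x k convolution kernel (single channel).
    Zeros out weights at and after the centre in raster-scan order, so the
    output at (i, j) never depends on pixel (i, j) or later pixels.
    Mask 'A' (used in the first layer) also hides the centre pixel itself;
    mask 'B' (used in later layers) allows it."""
    m = np.ones((k, k))
    c = k // 2
    m[c, c + 1:] = 0   # pixels to the right of centre, same row
    m[c + 1:, :] = 0   # all rows below the centre
    if mask_type == "A":
        m[c, c] = 0    # hide the current pixel too
    return m

mA = causal_mask(3, "A")
mB = causal_mask(3, "B")
```

The masked kernel is multiplied elementwise with the convolution weights before each forward pass, so all pixel positions can still be computed in one convolution during training.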
Generation Samples
• The model does not know that a value of 128 is close to a value of 127 or
129.
• Need to learn such relationships first before higher-level structures can be learnt.
• Especially problematic for data with higher precision on the observed pixels than the usual 8 bits.
PixelCNN++
• Use a parameterised probability distribution: the Discretized Logistic Mixture Likelihood:
  ν ∼ Σ_{i=1}^{K} π_i logistic(μ_i, s_i)
  P(x | π, μ, s) = Σ_{i=1}^{K} π_i [ σ((x + 0.5 − μ_i) / s_i) − σ((x − 0.5 − μ_i) / s_i) ]
• K is the number of components in the mixture, σ the logistic sigmoid function, and π_i, μ_i, s_i the weight, mean and scale of component i respectively.
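The formula above assigns pixel value x the probability mass of the interval [x − 0.5, x + 0.5] under a mixture of logistics. A minimal numpy sketch (edge cases at x = 0 and x = 255, where PixelCNN++ absorbs the tails, are omitted here for brevity):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def discretized_logistic_mixture(x, pi, mu, s):
    """P(x) = sum_i pi_i * [sigma((x+0.5-mu_i)/s_i) - sigma((x-0.5-mu_i)/s_i)].
    x is an integer intensity; pi, mu, s hold per-component weight, mean
    and scale. Tail edge cases at x = 0 and x = 255 are not handled."""
    cdf_plus = sigmoid((x + 0.5 - mu) / s)
    cdf_minus = sigmoid((x - 0.5 - mu) / s)
    return float(np.sum(pi * (cdf_plus - cdf_minus)))

# Illustrative two-component mixture (made-up parameters).
pi = np.array([0.6, 0.4])
mu = np.array([100.0, 180.0])
s = np.array([5.0, 10.0])

# Summing over all 256 levels recovers (almost) all the probability mass;
# the small remainder is the tail mass outside [−0.5, 255.5].
total = sum(discretized_logistic_mixture(v, pi, mu, s) for v in range(256))
```

Because nearby intensities fall under overlapping intervals of the same smooth logistics, the model gets "128 is close to 127" for free, unlike the 256-way softmax.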
PixelCNN++
• Use down-sampling to efficiently capture structure at multiple resolutions.
• Short-cut connections to further speed up optimization.
• Regularize the model using dropout
Con:
• Sequential generation ⇒ slow
So far…
PixelCNNs define a tractable density function and optimize the likelihood of the training data:
  p_θ(x) = ∏_{i=1}^{n} p_θ(x_i | x_1, …, x_{i−1})
Is there any other way to learn a tractable density function?
p_G(x) = p_z(G⁻¹(x)), i.e. learn an invertible transformation G from a simple latent density p_z, so that the density of x is obtained by a change of variables.
Carl Doersch. Tutorial on Variational Autoencoders.
Variational Autoencoders
• Learn both the latent variables and the transformation function used to sample x from them.
• Cannot optimize the likelihood directly; derive and optimize a lower bound on the log-likelihood instead.
How to learn this feature representation?
Train such that the features can be used to reconstruct the original data. "Autoencoding": encoding the data itself.
[Diagram: input data x → Encoder (e.g. 4-layer conv) → features z → Decoder (e.g. 4-layer upconv) → reconstructed input x̂; L2 loss ‖x − x̂‖².]
[Diagram: input data x → Encoder → features z → Classifier. Fine-tune the encoder jointly with the classifier, training for the final task (sometimes with small data).]
Variational Autoencoders
Probabilistic spin on autoencoders - will let us sample from the model
to generate data!
" %
Assume training data ! "#$ is generated from underlying unobserved
(latent) representation &
Variational Autoencoders
We want to estimate the true parameters θ* of this generative model.
[Diagram: sample z from the true prior p_θ*(z); sample x from the true conditional p_θ*(x | z) via the decoder network.]
How should we represent this model?
Choose the prior p(z) to be simple, e.g. Gaussian; reasonable for latent attributes such as pose or how much smile.
Variational Autoencoders
We want to estimate the true parameters θ* of this generative model.
[Diagram: sample z from the true prior p_θ*(z); sample x from the true conditional p_θ*(x | z) via the decoder network.]
How to train the model?
Learn model parameters to maximize the likelihood of the training data:
  p_θ(x) = ∫ p_θ(z) p_θ(x | z) dz
Now with a latent z (unlike the single deterministic code z in plain autoencoders).
Variational Autoencoders
p_θ(x) = ∫ p_θ(z) p_θ(x | z) dz
Variational Autoencoders
p_θ(x) = ∫ p_θ(z) p_θ(x | z) dz
What if x is binary?
• One can use a Bernoulli distribution parameterized by f(z, θ).
Variational Autoencoders
How to compute p(x) in a simple manner?
• Sample z^(1), …, z^(N) from the prior p(z) and compute p(x) ≈ (1/N) Σ_i p(x | z^(i)).
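This naive Monte Carlo estimate is easy to write down; a toy one-dimensional sketch (my own illustrative model, z ∼ N(0, 1) and x | z ∼ N(z, 1), chosen so the true marginal is known in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_x_given_z(x, z):
    """Toy decoder likelihood: x | z ~ N(z, 1) in one dimension."""
    return np.exp(-0.5 * (x - z) ** 2) / np.sqrt(2 * np.pi)

def mc_marginal(x, n):
    # p(x) = ∫ p(z) p(x|z) dz  ≈  (1/n) Σ_i p(x | z_i),  z_i ~ p(z) = N(0, 1)
    z = rng.standard_normal(n)
    return float(np.mean(p_x_given_z(x, z)))

est = mc_marginal(0.5, 100_000)
# For this toy model the true marginal is x ~ N(0, 2).
```

In one dimension the estimate converges quickly, but with a high-dimensional z almost every sampled z^(i) gives p(x | z^(i)) ≈ 0, so the estimate needs impractically many samples; this is what motivates sampling z from a learned q(z | x) instead.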
Variational Autoencoders
Maximize data likelihood: p_θ(x) = ∫ p_θ(z) p_θ(x | z) dz
• Problem: the integral over all z is intractable, and so is the posterior p_θ(z | x) = p_θ(x | z) p_θ(z) / p_θ(x).
• Solution?
• Use an encoder q_φ(z | x) that approximates the intractable p_θ(z | x).
Variational Autoencoders
Since we're modelling probabilistic generation of data, the encoder and decoder networks are probabilistic:
• Encoder network q_φ(z | x): outputs the mean μ_{z|x} and (diagonal) covariance Σ_{z|x}; sample z | x ∼ N(μ_{z|x}, Σ_{z|x}).
• Decoder network p_θ(x | z): outputs the mean μ_{x|z} and (diagonal) covariance Σ_{x|z}; sample x | z ∼ N(μ_{x|z}, Σ_{x|z}).
Variational Autoencoders
Now equipped with our encoder and decoder networks, let’s work out the
(log) data likelihood:
log p_θ(x^(i)) = E_{z ∼ q_φ(z|x^(i))} [ log p_θ(x^(i)) ]    (p_θ(x^(i)) does not depend on z; taking the expectation w.r.t. z using the encoder network will come in handy later)

= E_z [ log ( p_θ(x^(i)|z) p_θ(z) / p_θ(z|x^(i)) ) ]    (Bayes' rule)

= E_z [ log ( p_θ(x^(i)|z) p_θ(z) q_φ(z|x^(i)) / ( p_θ(z|x^(i)) q_φ(z|x^(i)) ) ) ]    (multiply and divide by q_φ(z|x^(i)))

= E_z [ log p_θ(x^(i)|z) ] − E_z [ log ( q_φ(z|x^(i)) / p_θ(z) ) ] + E_z [ log ( q_φ(z|x^(i)) / p_θ(z|x^(i)) ) ]
Variational Autoencoders
log p_θ(x^(i))
= E_z [ log p_θ(x^(i)|z) ] − D_KL( q_φ(z|x^(i)) ‖ p_θ(z) ) + D_KL( q_φ(z|x^(i)) ‖ p_θ(z|x^(i)) )
• First term: the decoder gives p_θ(x|z); we can compute an estimate of this term through sampling.
• Second term: this KL divergence (between the Gaussian encoder distribution and the z prior) has a nice closed-form solution!
• Third term: p_θ(z|x^(i)) is intractable (saw earlier), so we can't compute this KL term :( But we know KL divergence is always ≥ 0.
Variational Autoencoders
log p_θ(x^(i)) ≥ ℒ(x^(i); θ, φ) = E_{z ∼ q_φ} [ log p_θ(x^(i)|z) ] − D_KL( q_φ(z|x^(i)) ‖ p_θ(z) )
[Diagram: input x → Encoder (Q) → μ, Σ → sample z (a non-differentiable step) → Decoder (P) → reconstruction.]
Carl Doersch. Tutorial on Variational Autoencoders.
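The lower bound can be turned directly into a training objective. A minimal numpy sketch, assuming a Gaussian decoder (so the reconstruction term reduces to a squared error up to constants) and a diagonal-Gaussian encoder; this is an illustration of the bound, not the slide's exact implementation:

```python
import numpy as np

def elbo(x, x_hat, mu, log_var):
    """Evidence lower bound L = E_q[log p(x|z)] - KL(q(z|x) || p(z)).
    Gaussian decoder: reconstruction term is -1/2 ||x - x_hat||^2 (up to
    additive constants). Diagonal-Gaussian encoder q = N(mu, diag(exp(log_var)))
    against a standard-normal prior gives the closed-form KL
    1/2 * sum(exp(log_var) + mu^2 - 1 - log_var)."""
    recon = -0.5 * np.sum((x - x_hat) ** 2)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return float(recon - kl)

x = np.array([0.5, -0.2])       # observed data (toy)
x_hat = np.array([0.4, -0.1])   # hypothetical decoder output
mu = np.zeros(2)
log_var = np.zeros(2)           # q exactly matches the prior, so KL = 0
L = elbo(x, x_hat, mu, log_var)
```

Training maximizes L over θ (decoder) and φ (encoder), which simultaneously improves reconstruction and keeps q(z|x) close to the prior.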
Variational Autoencoders
With Reparameterization Trick
[Diagram: input x → Encoder (Q) → μ, Σ; sample ε ∼ N(0, 1) externally and set z = μ + Σ^{1/2} ε → Decoder (P) → f(z); loss ‖f(z) − x‖². The sampling of ε is still non-differentiable, but the gradient can now flow through μ and Σ into the encoder.]
Tutorial on Variational Autoencoders, Carl Doersch.
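The reparameterization trick is a one-liner in code. A minimal numpy sketch for a diagonal-Gaussian encoder output (mu, log_var are assumed encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(mu, log_var):
    """Reparameterization: instead of sampling z ~ N(mu, diag(sigma^2))
    directly (a non-differentiable node between encoder and decoder),
    sample eps ~ N(0, I) outside the network and set z = mu + sigma * eps,
    a deterministic, differentiable function of mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

mu = np.array([1.0, -2.0])
log_var = np.log(np.array([0.25, 0.25]))  # sigma = 0.5 per dimension
zs = np.stack([sample_z(mu, log_var) for _ in range(20000)])
```

Averaged over many draws, the samples have the intended mean μ and standard deviation σ, while gradients with respect to μ and log σ² pass straight through the arithmetic.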
Variational Autoencoders
D_KL( N(μ, Σ) ‖ N(0, I) ) = ½ ( tr(Σ) + μᵀμ − k − log det(Σ) )
• k is the dimensionality of the distribution.
• Easy to backpropagate through when Σ is diagonal. Possible otherwise also.
[Diagram: x → Encoder (Q) → μ, Σ; z = μ + Σ^{1/2} ε with ε ∼ N(0, 1) → Decoder (P) → f(z); loss ‖f(z) − x‖².]
Tutorial on Variational Autoencoders, Carl Doersch.
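The closed-form KL term above is straightforward to evaluate; a minimal numpy sketch for a full (not necessarily diagonal) covariance:

```python
import numpy as np

def kl_gaussian_vs_standard(mu, Sigma):
    """KL( N(mu, Sigma) || N(0, I) )
       = 1/2 * ( tr(Sigma) + mu^T mu - k - log det(Sigma) ),
    where k is the dimensionality of the distribution."""
    k = mu.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)  # numerically stable log det
    return float(0.5 * (np.trace(Sigma) + mu @ mu - k - logdet))

# KL is zero exactly when the two distributions coincide:
kl0 = kl_gaussian_vs_standard(np.zeros(3), np.eye(3))
# ... and positive otherwise:
kl1 = kl_gaussian_vs_standard(np.ones(3), 2 * np.eye(3))
```

With a diagonal Σ, trace and log-determinant reduce to per-dimension sums, which is the form usually implemented in VAE code.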
[Figure: images produced by the Decoder (P) while varying the latent coordinates z₁ and z₂ over the prior p(z), tracing out the learned data manifold.]
Kingma and Welling
Auto-Encoding Variational Bayes. ICLR 2014
Conditional VAEs
• Mix of PixelRNN-style sequential generation and VAEs, e.g. DRAW.
Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, Daan Wierstra
DRAW: A Recurrent Neural Network For Image Generation. PMLR 2015.
Variational Autoencoders
Summary:
Pros:
• Principled approach to generative models
• Allows inference of q(z|x), which can be a useful feature representation
for other tasks
Cons:
• Maximizes a lower bound on the likelihood: okay, but not as good an evaluation metric as PixelRNN/PixelCNN
• Samples are blurrier and of lower quality compared to the state of the art (GANs).