
10707

Deep Learning
Russ Salakhutdinov
Machine Learning Department
rsalakhu@cs.cmu.edu
http://www.cs.cmu.edu/~rsalakhu/10707/

Convolutional Networks I
Used Resources
Disclaimer: Much of the material in this lecture was borrowed from Hugo Larochelle's class on Neural Networks:
https://sites.google.com/site/deeplearningsummerschool2016/

Some tutorial slides were borrowed from Rob Fergus's CIFAR tutorial on ConvNets:
https://sites.google.com/site/deeplearningsummerschool2016/speakers

Some slides were borrowed from Marc'Aurelio Ranzato's CVPR 2014 tutorial on Convolutional Nets:
https://sites.google.com/site/lsvrtutorialcvpr14/home/deeplearning
Computer Vision
Design algorithms that can process visual data to accomplish a given task:
- For example, object recognition: given an input image, identify which object it contains
Computer Vision
Our goal is to design neural networks that are specifically adapted for such problems:
- Must deal with very high-dimensional inputs: 150 x 150 pixels = 22,500 inputs, or 3 x 22,500 if RGB pixels
- Can exploit the 2D topology of pixels (or 3D for video data)
- Can build in invariance to certain variations: translation, illumination, etc.

Convolutional networks leverage these ideas:
- Local connectivity
- Parameter sharing
- Convolution
- Pooling / subsampling hidden units
Local Connectivity
Use local connectivity of hidden units:
- Each hidden unit is connected only to a sub-region (patch) of the input image
- It is connected to all channels: 1 if grayscale, 3 (R, G, B) if a color image

Why local connectivity?
- A fully connected layer has a lot of parameters to fit, requiring a lot of data
- Spatial correlation is local
Local Connectivity
Units are connected to all channels: 1 channel if grayscale image, 3 channels (R, G, B) if color image
Local Connectivity
Example: 200x200 image, 40K hidden units → ~2B parameters!
- Spatial correlation is local
- Too many parameters; will require a lot of training data!
Local Connectivity
Example: 200x200 image, 40K hidden units, filter size 10x10 → 4M parameters!
- This parameterization is good when the input image is registered
Computer Vision
Our goal is to design neural networks that are specifically adapted for such problems:
- Must deal with very high-dimensional inputs: 150 x 150 pixels = 22,500 inputs, or 3 x 22,500 if RGB pixels
- Can exploit the 2D topology of pixels (or 3D for video data)
- Can build in invariance to certain variations: translation, illumination, etc.

Convolutional networks leverage these ideas:
- Local connectivity
- Parameter sharing
- Convolution
- Pooling / subsampling hidden units
Parameter Sharing
Share a matrix of parameters across some units:
- Units organized into the same feature map share parameters
- Hidden units within a feature map cover different positions in the image

Same color = same matrix connection; W_ij is the matrix connecting the ith input channel with the jth feature map
Parameter Sharing
Why parameter sharing?
- Reduces the number of parameters even further
- Will extract the same features at every position (features are equivariant)

Same color = same matrix connection; W_ij is the matrix connecting the ith input channel with the jth feature map
Parameter Sharing
Share a matrix of parameters across certain units:
- Convolutions with certain kernels
Computer Vision
Our goal is to design neural networks that are specifically adapted for such problems:
- Must deal with very high-dimensional inputs: 150 x 150 pixels = 22,500 inputs, or 3 x 22,500 if RGB pixels
- Can exploit the 2D topology of pixels (or 3D for video data)
- Can build in invariance to certain variations: translation, illumination, etc.

Convolutional networks leverage these ideas:
- Local connectivity
- Parameter sharing
- Convolution
- Pooling / subsampling hidden units
Parameter Sharing
Each feature map forms a 2D grid of features
- It can be computed with a discrete convolution (∗) of a kernel matrix k_ij, which is the hidden weights matrix W_ij with its rows and columns flipped:

y_j = g_j tanh( Σ_i k_ij ∗ x_i )

- x_i is the ith channel of the input
- k_ij is the convolution kernel
- g_j is a learned scaling factor
- y_j is the hidden layer (a bias can also be added)

Jarrett et al. 2009
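To make this concrete, here is a minimal NumPy/SciPy sketch of the filter-bank computation above. The array shapes, kernel contents, and scaling factors are illustrative assumptions, not values from the slides:

```python
import numpy as np
from scipy.signal import convolve2d

def filter_bank_layer(x, k, g):
    """Compute y_j = g_j * tanh(sum_i k_ij * x_i) for every feature map j.

    x : (n_channels, H, W) input channels x_i
    k : (n_channels, n_maps, kh, kw) convolution kernels k_ij
    g : (n_maps,) learned scaling factors g_j
    """
    n_channels, n_maps = k.shape[0], k.shape[1]
    maps = []
    for j in range(n_maps):
        # Sum the convolutions of every input channel with its kernel.
        s = sum(convolve2d(x[i], k[i, j], mode='valid')
                for i in range(n_channels))
        maps.append(g[j] * np.tanh(s))
    return np.stack(maps)  # (n_maps, H-kh+1, W-kw+1)

# Illustrative shapes: 3-channel 32x32 input, 4 feature maps, 5x5 kernels.
rng = np.random.default_rng(0)
y = filter_bank_layer(rng.normal(size=(3, 32, 32)),
                      rng.normal(size=(3, 4, 5, 5)),
                      np.ones(4))
print(y.shape)  # (4, 28, 28)
```

Since convolve2d flips its kernel internally, passing k_ij = W̃_ij makes this equivalent to correlating x_i with W_ij, as noted a few slides later.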
Discrete Convolution
The convolution of an image x with a kernel k is computed as follows:

(x ∗ k)_{ij} = Σ_{pq} x_{i+p, j+q} k_{r−p, r−q}

i.e., each image patch is multiplied by k̃ = k with its rows and columns flipped.

Example (2x2 flipped kernel k̃ = [[1, 0.5], [0.25, 0]], one output per position):
- 1 x 0 + 0.5 x 80 + 0.25 x 20 + 0 x 40 = 45
- 1 x 80 + 0.5 x 40 + 0.25 x 40 + 0 x 0 = 110
- 1 x 20 + 0.5 x 40 + 0.25 x 0 + 0 x 0 = 40
- 1 x 40 + 0.5 x 0 + 0.25 x 0 + 0 x 40 = 40


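The same numbers can be checked with scipy. The 3x3 image and the 2x2 kernel below are reconstructed from the four products in the example above (an assumption about the original figure, but the arithmetic matches):

```python
import numpy as np
from scipy.signal import convolve2d

# Image and kernel reconstructed from the worked example: the flipped
# kernel k~ = [[1, 0.5], [0.25, 0]] is what multiplies each patch.
x = np.array([[ 0., 80., 40.],
              [20., 40.,  0.],
              [ 0.,  0., 40.]])
k = np.array([[0.0, 0.25],
              [0.5, 1.0 ]])   # convolve2d flips k, giving k~ above

print(convolve2d(x, k, mode='valid'))
# [[ 45. 110.]
#  [ 40.  40.]]
```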
Discrete Convolution
Pre-activations from channel x_i into feature map y_j can be computed by:
- getting the convolution kernel k_ij = W̃_ij from the connection matrix W_ij (its rows and columns flipped)
- applying the convolution x_i ∗ k_ij

This is equivalent to computing the discrete correlation of x_i with W_ij
Example
Illustration: computing a "simple cell" layer. First step: compute the convolution x ∗ k_ij, where k_ij = W̃_ij
[Figure: an input image X, the weight matrix W (entries 0 and 0.5), and the connections to the hidden units]
Example
With a non-linearity, we get a detector of a feature at any position in the image
[Figure: the same convolution x ∗ k_ij as on the previous slide, followed by the non-linearity]
Example
Can use zero padding to allow going over the borders (∗)
Multiple Feature Maps
Example: 200x200 image, 100 filters, filter size 10x10 → 10K parameters
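The parameter counts quoted in these 200x200-image examples follow from simple multiplication; a quick sanity check:

```python
# Parameter counts from the slides above.
inputs = 200 * 200                  # 40,000 pixels
hidden = 40_000                     # 40K hidden units

fully_connected = inputs * hidden   # 1.6e9 -> the "~2B parameters" slide
local_10x10 = hidden * 10 * 10      # 4.0e6 -> "4M parameters"
shared_100_filters = 100 * 10 * 10  # 1.0e4 -> "10K parameters"

print(fully_connected, local_10x10, shared_100_filters)
```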
Computer Vision
Our goal is to design neural networks that are specifically adapted for such problems:
- Must deal with very high-dimensional inputs: 150 x 150 pixels = 22,500 inputs, or 3 x 22,500 if RGB pixels
- Can exploit the 2D topology of pixels (or 3D for video data)
- Can build in invariance to certain variations: translation, illumination, etc.

Convolutional networks leverage these ideas:
- Local connectivity
- Parameter sharing
- Convolution
- Pooling / subsampling hidden units
Pooling
Pool hidden units in the same neighborhood
Pooling is performed in non-overlapping neighborhoods (subsampling):

y_{ijk} = max_{p,q} x_{i, j+p, k+q}

- x_i is the ith channel of the input
- x_{i,j,k} is the value of the ith feature map at position j,k
- p is the vertical index in the local neighborhood
- q is the horizontal index in the local neighborhood
- y_{ijk} is the pooled / subsampled layer

Jarrett et al. 2009

Pooling
Pool hidden units in the same neighborhood
An alternative to max pooling is average pooling:

y_{ijk} = (1/m²) Σ_{p,q} x_{i, j+p, k+q}

- x_i is the ith channel of the input
- x_{i,j,k} is the value of the ith feature map at position j,k
- p is the vertical index in the local neighborhood
- q is the horizontal index in the local neighborhood
- y_{ijk} is the pooled / subsampled layer
- m is the neighborhood height/width

Jarrett et al. 2009
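A minimal NumPy sketch of both pooling operations over non-overlapping m x m neighborhoods (assuming, for simplicity, that the map height and width are divisible by m):

```python
import numpy as np

def pool(x, m, mode="max"):
    """Pool each feature map over non-overlapping m x m neighborhoods.

    x : (n_maps, H, W) feature maps; H and W must be divisible by m.
    """
    n, H, W = x.shape
    # Reshape so each m x m neighborhood sits in its own trailing axes.
    blocks = x.reshape(n, H // m, m, W // m, m)
    if mode == "max":
        return blocks.max(axis=(2, 4))
    return blocks.mean(axis=(2, 4))   # average pooling: the (1/m^2) sum

x = np.arange(16, dtype=float).reshape(1, 4, 4)
print(pool(x, 2, "max"))   # [[[ 5.  7.] [13. 15.]]]
print(pool(x, 2, "avg"))   # [[[ 2.5  4.5] [10.5 12.5]]]
```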
Example: Pooling
Illustration of the pooling/subsampling operation: a "complex cell" layer is computed by taking the maximum within several segments of the "simple cell" responses
[Figure: 2x2 max pooling applied to the logistic "simple cell" responses of the earlier example]

Why pooling?
- Introduces invariance to local translations
- Reduces the number of hidden units in the hidden layer
Example: Pooling
Can we make the detection robust to the exact location of the eye?
Example: Pooling
By pooling (e.g., taking the max of) filter responses at different locations, we gain robustness to the exact spatial location of features.
Translation Invariance
Illustration of local translation invariance:
- both images result in the same feature map after pooling/subsampling
Convolutional Network
A convolutional neural network alternates between convolutional and pooling layers

From Yann LeCun's slides
Convolutional Network
For classification: the output layer is a regular, fully connected layer with a softmax non-linearity
- The output provides an estimate of the conditional probability of each class

The network is trained by stochastic gradient descent:
- Backpropagation is used as in a fully connected network
- We have seen how to pass gradients through an element-wise activation function
- We also need to pass gradients through the convolution operation and the pooling operation
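As an illustration, here is a minimal PyTorch sketch of this pattern: alternating convolution and pooling layers, a fully connected softmax output, and one SGD/backprop step. All layer sizes and the data are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Conv/pool stacks followed by a fully connected output layer.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5),  # convolutional layer
    nn.Tanh(),
    nn.MaxPool2d(2),                  # pooling / subsampling layer
    nn.Conv2d(16, 32, kernel_size=5),
    nn.Tanh(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 4 * 4, 10),        # fully connected output (10 classes)
)

opt = torch.optim.SGD(model.parameters(), lr=0.1)
# CrossEntropyLoss applies the softmax: the output estimates P(class | image).
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 1, 28, 28)         # dummy mini-batch of 28x28 images
y = torch.randint(0, 10, (8,))
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()                       # gradients flow through conv and pooling
opt.step()
```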
Gradient of Convolutional Layer
Let l be the loss function

For the max pooling operation y_{ijk} = max_{p,q} x_{i,j+p,k+q}, the gradient for x_{ijk} is:

∇_{x_{ijk}} l = 0, except ∇_{x_{i,j+p',k+q'}} l = ∇_{y_{ijk}} l, where p', q' = argmax_{p,q} x_{i,j+p,k+q}

In other words, only the "winning" units in layer x get the gradient from the pooled layer

For the average pooling operation y_{ijk} = (1/m²) Σ_{p,q} x_{i,j+p,k+q}, the gradient for x_{ijk} is:

∇_x l = (1/m²) upsample(∇_y l)

where upsample inverts the subsampling
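A minimal NumPy sketch of this gradient routing for non-overlapping max pooling: the forward pass records the argmax in each neighborhood, and the backward pass sends each pooled gradient only to that winning unit:

```python
import numpy as np

def maxpool_forward(x, m=2):
    """2D max pooling over non-overlapping m x m blocks; x is (H, W)."""
    H, W = x.shape
    blocks = x.reshape(H // m, m, W // m, m).transpose(0, 2, 1, 3)
    y = blocks.max(axis=(2, 3))
    # Remember which unit won in each block, for the backward pass.
    winners = blocks.reshape(H // m, W // m, m * m).argmax(axis=2)
    return y, winners

def maxpool_backward(dy, winners, m=2):
    """Route each pooled gradient dy only to the winning input unit."""
    Hb, Wb = dy.shape
    dx_blocks = np.zeros((Hb, Wb, m * m))
    rows, cols = np.indices((Hb, Wb))
    dx_blocks[rows, cols, winners] = dy
    return (dx_blocks.reshape(Hb, Wb, m, m)
                     .transpose(0, 2, 1, 3)
                     .reshape(Hb * m, Wb * m))

x = np.array([[1., 2., 0., 1.],
              [3., 0., 1., 0.],
              [0., 1., 5., 0.],
              [2., 0., 0., 4.]])
y, win = maxpool_forward(x)
dx = maxpool_backward(np.ones_like(y), win)
print(dx)   # 1s exactly where x held each block's maximum, 0s elsewhere
```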
Convolutional Network
A convolutional neural network alternates between convolutional and pooling layers
We need to introduce other operations that can improve object recognition.
Rectification
Rectification layer: y_{ijk} = |x_{ijk}|
- introduces invariance to the sign of the unit in the previous layer
- for instance, the information of whether an edge is black-to-white or white-to-black is lost

Jarrett et al. 2009
Local Contrast Normalization
Perform local contrast normalization:

Local average: v_{ijk} = x_{ijk} − Σ_{ipq} w_{pq} x_{i,j+p,k+q}

y_{ijk} = v_{ijk} / max(c, σ_{jk})

Local stdev: σ_{jk} = ( Σ_{ipq} w_{pq} v²_{i,j+p,k+q} )^{1/2}, where Σ_{pq} w_{pq} = 1

- c is a small constant to prevent division by 0
- reduces a unit's activation if its neighbors are also active
- creates competition between feature maps
- scales activations at each layer, which is better for learning
Local Contrast Normalization
Perform local contrast normalization:
- Local mean = 0, local std. = 1, where "local" means a 7x7 Gaussian window
[Figure: feature maps before and after contrast normalization]
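A minimal NumPy sketch of this normalization with a Gaussian weighting window w, shown here for a single channel for readability; the per-pixel σ and the constant c follow the formulas above:

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian_window(size=7, sigma=1.0):
    """A size x size Gaussian weighting window, normalized to sum to 1."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def local_contrast_normalize(x, c=1e-4, size=7):
    """Subtract the local weighted mean, then divide by the local stdev."""
    w = gaussian_window(size)
    v = x - convolve2d(x, w, mode='same')                 # local mean -> 0
    sigma = np.sqrt(convolve2d(v ** 2, w, mode='same'))   # local stdev
    return v / np.maximum(c, sigma)                       # local std -> ~1

x = np.random.default_rng(0).normal(size=(32, 32))
print(local_contrast_normalize(x).std())
```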
Convolutional Network
These operations are inserted after the convolutions and before the pooling

[Figure 1 from Jarrett et al. 2009: an example of a feature extraction stage of the type F_CSG − R_abs − N − P_A. An input image (or a feature map) is passed through a non-linear filterbank, followed by rectification, local contrast normalization, and spatial pooling/sub-sampling.]

Jarrett et al. 2009
Remember Batch Normalization
- Learned linear transformation to adapt to the non-linear activation function (γ and β are trained)
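As a reminder, a minimal NumPy sketch of the batch-norm transformation: standardize each unit over the mini-batch, then apply the trained linear map y = γ x̂ + β:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization for a (batch, units) activation matrix:
    standardize each unit over the mini-batch, then apply the learned
    linear transformation y = gamma * x_hat + beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(2.0, 3.0, size=(64, 5))
y = batch_norm(x, gamma=np.ones(5), beta=np.zeros(5))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))
```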

Вам также может понравиться