
CS7015 (Deep Learning) : Lecture 11

Convolutional Neural Networks, LeNet, AlexNet, ZF-Net, VGGNet, GoogLeNet and ResNet

Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

Module 11.1 : The convolution operation

Suppose we are tracking the position of an aeroplane using a laser sensor at discrete time intervals, giving readings x_0, x_1, x_2, . . .

Now suppose our sensor is noisy. To obtain a less noisy estimate we would like to average several measurements. More recent measurements are more important, so we would like to take a weighted average:

$$s_t = \sum_{a=0}^{\infty} x_{t-a}\, w_{-a} = (x * w)_t$$

Here x is the input, w is the filter, and this operation is called convolution.
In practice, we would only sum over a small window:

$$s_t = \sum_{a=0}^{6} x_{t-a}\, w_{-a}$$

The weight array (w) is known as the filter. We just slide the filter over the input and compute the value of s_t based on a window around x_t. For example, with the filter W = (w_{-6}, ..., w_{-1}, w_0):

W: 0.01 0.01 0.02 0.02 0.04 0.40 0.50
X: 1.00 1.10 1.20 1.40 1.70 1.80 1.90 2.10 2.20 2.40 2.50 2.70
S: 1.80 1.96 2.11 2.16 2.28 2.42

The first output is s_6, since a full window of seven inputs is needed:

$$s_6 = x_6 w_0 + x_5 w_{-1} + x_4 w_{-2} + x_3 w_{-3} + x_2 w_{-4} + x_1 w_{-5} + x_0 w_{-6}$$

Here the input (and the kernel) is one dimensional. Can we use the convolution operation on a 2D input also?
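As a quick sanity check, here is a minimal NumPy sketch of this windowed weighted average (the code and variable names are ours, not from the lecture). np.correlate slides w over x and takes dot products, which is exactly the sum above when w is stored in the order w_{-6}, ..., w_0, as in the table:

```python
import numpy as np

x = np.array([1.00, 1.10, 1.20, 1.40, 1.70, 1.80,
              1.90, 2.10, 2.20, 2.40, 2.50, 2.70])
w = np.array([0.01, 0.01, 0.02, 0.02, 0.04, 0.4, 0.5])

# Sliding dot product of each length-7 window of x with w;
# s[0] corresponds to s_6 in the notation above.
s = np.correlate(x, w, mode='valid')
print(np.round(s, 2))
```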
We can think of images as 2D inputs. We would now like to use a 2D filter (m × n). First let us see what the 2D formula looks like:

$$S_{ij} = (I * K)_{ij} = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} I_{i-a,\, j-b}\, K_{a,b}$$

This formula looks at all the preceding neighbours (i − a, j − b). In practice, we use the following formula, which looks at the succeeding neighbours:

$$S_{ij} = (I * K)_{ij} = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} I_{i+a,\, j+b}\, K_{a,b}$$
Let us apply this idea to a toy example and see the results.

Input (3 × 4):
a b c d
e f g h
i j k ℓ

Kernel (2 × 2):
w x
y z

Output (2 × 3):
aw+bx+ey+fz   bw+cx+fy+gz   cw+dx+gy+hz
ew+fx+iy+jz   fw+gx+jy+kz   gw+hx+ky+ℓz
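The same computation as a minimal code sketch (ours): a double loop implementing S_ij = Σ_{a,b} I_{i+a,j+b} K_{a,b}, with arbitrary numbers standing in for a…ℓ and w, x, y, z:

```python
import numpy as np

def conv2d_valid(I, K):
    """S[i, j] = sum_{a, b} I[i + a, j + b] * K[a, b] (no padding)."""
    m, n = K.shape
    H, W = I.shape
    S = np.zeros((H - m + 1, W - n + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            S[i, j] = np.sum(I[i:i + m, j:j + n] * K)
    return S

I = np.arange(12, dtype=float).reshape(3, 4)  # stands in for a..l
K = np.array([[1.0, 2.0],
              [3.0, 4.0]])                    # stands in for w, x, y, z
print(conv2d_valid(I, K).shape)               # (2, 3), as on the slide
```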
For the rest of the discussion we will use the following formula for convolution:

$$S_{ij} = (I * K)_{ij} = \sum_{a=-\lfloor m/2 \rfloor}^{\lfloor m/2 \rfloor} \sum_{b=-\lfloor n/2 \rfloor}^{\lfloor n/2 \rfloor} I_{i-a,\, j-b}\, K_{\lfloor m/2 \rfloor + a,\, \lfloor n/2 \rfloor + b}$$

In other words, we will assume that the kernel is centered on the pixel of interest, so we will be looking at both preceding and succeeding neighbors.
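For odd-sized kernels, this centered formula is ordinary 2D convolution with the output cropped to the input size, which scipy.signal.convolve2d provides via mode='same'. A small sketch (the image and kernel below are ours):

```python
import numpy as np
from scipy.signal import convolve2d

I = np.random.rand(8, 8)           # any grayscale image
K = np.ones((3, 3)) / 9.0          # a 3x3 averaging kernel

S = convolve2d(I, K, mode='same')  # kernel centered on each pixel
print(S.shape)                     # (8, 8): same size as the input
```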
Let us see some examples of 2D convolutions applied to images

Convolving with

1 1 1
1 1 1
1 1 1

(usually scaled by 1/9 so that the weights average rather than sum) blurs the image.
Convolving with

 0 -1  0
-1  5 -1
 0 -1  0

sharpens the image.
Convolving with

1  1 1
1 -8 1
1  1 1

detects the edges.
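The three kernels above can be tried directly with the same-size convolution sketched earlier; a short example (ours), using a random array as a stand-in for a grayscale image:

```python
import numpy as np
from scipy.signal import convolve2d

blur    = np.ones((3, 3)) / 9.0
sharpen = np.array([[ 0., -1.,  0.],
                    [-1.,  5., -1.],
                    [ 0., -1.,  0.]])
edges   = np.array([[ 1.,  1.,  1.],
                    [ 1., -8.,  1.],
                    [ 1.,  1.,  1.]])

img = np.random.rand(32, 32)      # stand-in for a real grayscale image
for K in (blur, sharpen, edges):
    print(convolve2d(img, K, mode='same').shape)   # (32, 32) each
```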
We will now see a working example of 2D convolution.

We just slide the kernel over the input image. Each time we slide the kernel we get one value in the output. The resulting output is called a feature map. We can use multiple filters to get multiple feature maps.
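A sketch (ours) of the multiple-filters idea, stacking one 2D feature map per filter:

```python
import numpy as np
from scipy.signal import convolve2d

img     = np.random.rand(32, 32)
filters = [np.random.rand(3, 3) for _ in range(6)]   # 6 different kernels

feature_maps = np.stack([convolve2d(img, K, mode='valid')
                         for K in filters])
print(feature_maps.shape)    # (6, 30, 30): one 30x30 map per filter
```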
Question: in the 1D case, we slide a one dimensional filter over a one dimensional input; in the 2D case, we slide a two dimensional filter over a two dimensional input. What would happen in the 3D case?
Consider a 3D input, such as an RGB image (the R, G and B channels give it a depth of 3). What would a 3D filter look like?

It will be 3D, and we will refer to it as a volume. Once again we will slide the volume over the 3D input and compute the convolution operation. Note that in this lecture we will assume that the filter always extends to the depth of the image. In effect, we are doing a 2D convolution operation on a 3D input (because the filter moves along the height and the width but not along the depth). As a result the output will be 2D (only width and height, no depth). Once again we can apply multiple filters to get multiple feature maps.
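A minimal sketch (ours) of this full-depth convolution: each placement of the filter covers all input channels and produces a single number, so the result is a 2D map:

```python
import numpy as np

def conv_volume(I, K):
    """I: H x W x D input (D = 3 for RGB); K: f x f x D filter.
    The filter spans the full depth, so the output is 2D."""
    f = K.shape[0]
    H, W, D = I.shape
    assert K.shape[2] == D, "filter depth must equal input depth"
    S = np.zeros((H - f + 1, W - f + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            S[i, j] = np.sum(I[i:i + f, j:j + f, :] * K)
    return S

rgb = np.random.rand(32, 32, 3)
K   = np.random.rand(5, 5, 3)
print(conv_volume(rgb, K).shape)   # (28, 28): a single 2D feature map
```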
Module 11.2 : Relation between input size, output size and filter size
So far we have not said anything explicit about the dimensions of the
1. inputs
2. filters
3. outputs
and the relations between them. We will see how they are related, but before that we will define a few quantities.
We first define the following quantities:
- the Width (W1), Height (H1) and Depth (D1) of the original input
- the Stride S (we will come back to this later)
- the number of filters K
- the spatial extent (F) of each filter (the depth of each filter is the same as the depth of the input)

The output is W2 × H2 × D2 (we will soon see a formula for computing W2, H2 and D2).
Let us compute the dimensions (W2, H2) of the output.

Notice that we can't place the kernel at the corners, as it would cross the input boundary. The same is true for every pixel close enough to the boundary (the kernel crosses the input boundary there), so the output has smaller dimensions than the input. As the size of the kernel increases, this becomes true for even more pixels: with a 5 × 5 kernel, for example, we get an even smaller output.

In general,
W2 = W1 − F + 1
H2 = H1 − F + 1
We will refine this formula further.
What if we want the output to be of the same size as the input? We can use something known as padding: pad the input with an appropriate number of 0 inputs so that we can now apply the kernel at the corners.

For example, with a 3 × 3 kernel we can use pad P = 1, which means we add one row and one column of 0 inputs at the top, bottom, left and right.

We now have,
W2 = W1 − F + 2P + 1
H2 = H1 − F + 2P + 1
We will refine this formula further.
What does the stride S do? It defines the intervals at which the filter is applied (for example, S = 2). With S = 2 we are essentially skipping every 2nd pixel, which again results in an output of smaller dimensions.

So our final formula looks like,

$$W_2 = \frac{W_1 - F + 2P}{S} + 1, \qquad H_2 = \frac{H_1 - F + 2P}{S} + 1$$
Finally, coming to the depth of the output: each filter gives us one 2D output, and K filters will give us K such 2D outputs. We can think of the resulting output as a K × W2 × H2 volume. Thus D2 = K.

To summarize:
W2 = (W1 − F + 2P)/S + 1
H2 = (H1 − F + 2P)/S + 1
D2 = K
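These formulas fit in a few lines of code; a small helper (ours, with illustrative numbers):

```python
def conv_output_shape(W1, H1, F, K, S=1, P=0):
    """(W2, H2, D2) for K FxF filters with stride S and padding P."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    return W2, H2, K   # D2 = K

print(conv_output_shape(W1=7, H1=7, F=3, K=10, S=2, P=1))  # (4, 4, 10)
```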
Let us do a few exercises.

Exercise 1: input 227 × 227 × 3; kernel 11 × 11; 96 filters; Stride = 4; Padding = 0.

W2 = (227 − 11)/4 + 1 = 55
H2 = (227 − 11)/4 + 1 = 55
D2 = 96

So the output is 55 × 55 × 96.
Exercise 2: input 32 × 32 × 1; kernel 5 × 5; 6 filters; Stride = 1; Padding = 0.

W2 = (32 − 5)/1 + 1 = 28
H2 = (32 − 5)/1 + 1 = 28
D2 = 6

So the output is 28 × 28 × 6.
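Both exercises check out in code (a compact restatement of the helper sketched earlier):

```python
def conv_output_shape(W1, H1, F, K, S=1, P=0):   # as defined earlier
    return (W1 - F + 2 * P) // S + 1, (H1 - F + 2 * P) // S + 1, K

print(conv_output_shape(227, 227, F=11, K=96, S=4, P=0))  # (55, 55, 96)
print(conv_output_shape(32, 32, F=5, K=6, S=1, P=0))      # (28, 28, 6)
```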
Module 11.3 : Convolutional Neural Networks

Putting things into perspective: what is the connection between this operation (convolution) and neural networks? We will try to understand this by considering the task of “image classification”.
A classical image-classification pipeline extracts features from the input image and feeds them to a classifier that predicts a label such as car, bus, monument or flower. The features may be the raw pixels, the output of an edge detector, or descriptors such as SIFT/HOG. In all of these cases the feature extraction is static (no learning); only the weights of the classifier are learned.
Instead of using handcrafted kernels, such as the edge detector

0 0 0 0 0
0 1 1 1 0
0 1 -8 1 0
0 1 1 1 0
0 0 0 0 0

can we learn meaningful kernels/filters, in addition to learning the weights of the classifier? In other words, can we treat the kernel entries themselves as weights to be learned?
Even better: instead of using handcrafted kernels (such as edge detectors), can we learn multiple meaningful kernels/filters, each its own matrix of learned weights, in addition to learning the weights of the classifier?
Can we learn multiple layers of meaningful kernels/filters in addition to learning the weights of the classifier? Yes, we can! Simply by treating these kernels as parameters and learning them, in addition to the weights of the classifier, using backpropagation.

Such a network is called a Convolutional Neural Network.
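A minimal PyTorch sketch (ours; the layer sizes are illustrative and not from the lecture) showing that the kernels are ordinary parameters through which gradients flow during backpropagation:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5),   # 6 learnable 5x5 kernels over 3 channels
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(6 * 28 * 28, 4),        # classifier over 4 classes
)

x = torch.randn(1, 3, 32, 32)         # one 32x32 RGB image
model(x).sum().backward()             # backprop reaches the kernels too
print(model[0].weight.grad.shape)     # torch.Size([6, 3, 5, 5])
```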
Okay, I get it that the idea is to learn the kernels/filters by just treating them as parameters of the classification model. But how is this different from a regular feedforward neural network? Let us see.
Consider classifying an image with 16 input pixels into 10 classes (digits). This is what a regular feed-forward neural network would look like: there are many dense connections here. For example, all the 16 input neurons contribute to the computation of h11. Contrast this to what happens in the case of convolution.
Only a few local neurons participate in the computation of each output; for example, only pixels 1, 2, 5, 6 contribute to h11. The connections are much sparser. We are taking advantage of the structure of the image (interactions between neighboring pixels are more interesting). This sparse connectivity reduces the number of parameters in the model.
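A quick, illustrative count (ours), assuming a 4 × 4 input and a 2 × 2 kernel producing a 3 × 3 layer of hidden units: sparse connectivity alone already cuts the parameters, and the weight sharing discussed next cuts them further:

```python
n_inputs = 16                      # 4x4 image
n_hidden = 9                       # 3x3 map from a 2x2 kernel, stride 1

dense  = n_inputs * n_hidden       # 144: every pixel feeds every unit
sparse = n_hidden * 4              # 36: each unit sees only 4 pixels
shared = 2 * 2                     # 4: one 2x2 kernel reused everywhere
print(dense, sparse, shared)       # 144 36 4
```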
But is sparse connectivity really good
thing ?

Aren’t we losing information (by los-


ing interactions between some input
pixels)

Well, not really

The two highlighted neurons (x1 &


x5 )∗ do not interact in layer 1

But they indirectly contribute to the


computation of g3 and hence interact
indirectly

Goodfellow-et-al-2016
35/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
[Figure: a 4 × 4 image with two kernels, Kernel 1 (orange) applied to one portion of the image and Kernel 2 (pink) applied to another]

Another characteristic of CNNs is weight sharing
Consider the following network
Do we want the kernel weights to be different for different portions of the image?
Imagine that we are trying to learn a kernel that detects edges
Shouldn't we be applying the same kernel at all the portions of the image?
In other words, shouldn't the orange and pink kernels be the same?
Yes, indeed
This would make the job of learning easier (instead of trying to learn the same weights/kernels at different locations again and again)
But does that mean we can have only one kernel?
No, we can have many such kernels, but the kernels will be shared by all locations in the image
This is called "weight sharing"
So far, we have focused only on the convolution operation
Let us see what a full convolutional neural network looks like
[Figure: a full convolutional neural network: Input (32 × 32) → Convolution Layer 1 (S = 1, F = 5, K = 6, P = 0; Param = 150) → 28 × 28 × 6 → Pooling Layer 1 (S = 2, F = 2; Param = 0) → 14 × 14 × 6 → Convolution Layer 2 (S = 1, F = 5, K = 16, P = 0; Param = 2400) → 10 × 10 × 16 → Pooling Layer 2 (S = 2, F = 2; Param = 0) → 5 × 5 × 16 → FC 1 (120; Param = 48120) → FC 2 (84; Param = 10164) → Output (10; Param = 850)]

It has alternate convolution and pooling layers
What does a pooling layer do?
Let us see
[Figure: the input is convolved with 1 filter to give a 4 × 4 feature map, which is then max-pooled]

Feature map (4 × 4):

1 4 2 1
5 8 3 4
7 6 4 5
1 3 1 2

maxpool with 2 × 2 filters (stride 2):

8 4
7 5

maxpool with 2 × 2 filters (stride 1):

8 8 4
8 8 5
7 6 5

Instead of max pooling we can also do average pooling
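The pooling outputs above are easy to verify in code. A minimal sketch, assuming NumPy (the lecture itself prescribes no library), that reproduces both results:

import numpy as np

x = np.array([[1, 4, 2, 1],
              [5, 8, 3, 4],
              [7, 6, 4, 5],
              [1, 3, 1, 2]])

def maxpool2x2(x, stride):
    # Slide a 2x2 window over x and take the max in each window
    h = (x.shape[0] - 2) // stride + 1
    w = (x.shape[1] - 2) // stride + 1
    return np.array([[x[i*stride:i*stride+2, j*stride:j*stride+2].max()
                      for j in range(w)] for i in range(h)])

print(maxpool2x2(x, stride=2))  # [[8 4] [7 5]]
print(maxpool2x2(x, stride=1))  # [[8 8 4] [8 8 5] [7 6 5]]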
We will now see some case studies where convolutional neural networks have been successful
LeNet-5 for handwritten character recognition

Input: 32 × 32
Convolution Layer 1: S = 1, F = 5, K = 6, P = 0 → 28 × 28 × 6; Param = 150
Pooling Layer 1: S = 2, F = 2, K = 6, P = 0 → 14 × 14 × 6; Param = 0
Convolution Layer 2: S = 1, F = 5, K = 16, P = 0 → 10 × 10 × 16; Param = 2400
Pooling Layer 2: S = 2, F = 2, K = 16, P = 0 → 5 × 5 × 16; Param = 0
FC 1 (120): Param = 48120
FC 2 (84): Param = 10164
Output (10): Param = 850
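As a sketch of how this architecture could be written in code (PyTorch is an assumption, the lecture prescribes no framework; the activation choice is ours, and the framework's default bias terms are not counted in the slide's parameter totals):

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),         # 32x32x1 -> 28x28x6
            nn.ReLU(),                              # activation is our choice
            nn.MaxPool2d(kernel_size=2, stride=2),  # -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),        # -> 10x10x16
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # -> 5x5x16
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120),             # FC 1
            nn.ReLU(),
            nn.Linear(120, 84),                     # FC 2
            nn.ReLU(),
            nn.Linear(84, 10),                      # Output (10 digits)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)                     # stretch to a 1d vector
        return self.classifier(x)

model = LeNet5()
print(model(torch.randn(1, 1, 32, 32)).shape)       # torch.Size([1, 10])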
How do we train a convolutional neural network?
[Figure: a 3 × 3 input (pixels b, c, d, e, f, g, h, i, j) convolved with a 2 × 2 kernel (w, x, y, z) gives a 2 × 2 output (l, m, n, o); drawn as a feedforward network, each output connects to only four of the nine inputs]

A CNN can be implemented as a feedforward neural network
wherein only a few weights (in color) are active
the rest of the weights (in gray) are zero
We can thus train a convolutional neural network using backpropagation by thinking of it as a feedforward neural network with sparse connections
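To make this equivalence concrete, here is a small sketch (NumPy assumed; the input and kernel values are placeholders) that builds the sparse, weight-shared matrix for a 3 × 3 input and a 2 × 2 kernel and checks that it reproduces the convolution:

import numpy as np

X = np.arange(9, dtype=float).reshape(3, 3)   # input pixels b..j
K = np.array([[1., 2.], [3., 4.]])            # kernel entries w, x, y, z

# Direct convolution (cross-correlation, as CNNs implement it)
direct = np.array([[(X[i:i+2, j:j+2] * K).sum() for j in range(2)]
                   for i in range(2)])

# Same computation as a sparse 4x9 matrix: each row holds only w, x, y, z
W = np.zeros((4, 9))
for r, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    for di in range(2):
        for dj in range(2):
            W[r, (i + di) * 3 + (j + dj)] = K[di, dj]

assert np.allclose(W @ X.ravel(), direct.ravel())

Each row of W reuses the same four entries w, x, y, z, so backpropagating through W is ordinary feedforward backpropagation with the gradients of the shared entries summed across locations.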
Module 11.4 : CNNs (success stories on ImageNet)
ImageNet Success Stories (roadmap for rest of the talk)

AlexNet
ZFNet
VGGNet
[Figure: top-5 error rates (%) of successive ILSVRC winners, with network depth]

ILSVRC’10: 28.2 (shallow)
ILSVRC’11: 25.8 (shallow)
ILSVRC’12: 16.4 (AlexNet, 8 layers)
ILSVRC’13: 11.7 (ZFNet, 8 layers)
ILSVRC’14: 7.3 (VGG, 19 layers)
ILSVRC’14: 6.7 (GoogLeNet, 22 layers)
ILSVRC’15: 3.57 (ResNet, 152 layers)
AlexNet

Input: 227 × 227 × 3
Conv 1: K = 96, F = 11, S = 4, P = 0 → 55 × 55 × 96; Parameters: (11 × 11 × 3) × 96 = 34K
Max Pool 1: F = 3, S = 2 → 27 × 27 × 96; Parameters: 0
Conv 2: K = 256, F = 5, S = 1, P = 0 → 23 × 23 × 256; Parameters: (5 × 5 × 96) × 256 = 0.6M
Max Pool 2: F = 3, S = 2 → 11 × 11 × 256; Parameters: 0
Conv 3: K = 384, F = 3, S = 1, P = 0 → 9 × 9 × 384; Parameters: (3 × 3 × 256) × 384 = 0.8M
Conv 4: K = 384, F = 3, S = 1, P = 0 → 7 × 7 × 384; Parameters: (3 × 3 × 384) × 384 = 1.327M
Conv 5: K = 256, F = 3, S = 1, P = 0 → 5 × 5 × 256; Parameters: (3 × 3 × 384) × 256 = 0.8M
Max Pool 3: F = 3, S = 2 → 2 × 2 × 256; Parameters: 0
FC 1 (4096, dense): Parameters: (2 × 2 × 256) × 4096 = 4M
FC 2 (4096, dense): Parameters: 4096 × 4096 = 16M
Output (1000, dense): Parameters: 4096 × 1000 = 4M

Total Parameters: 27.55M
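The spatial sizes and parameter counts above follow from W2 = (W1 − F + 2P)/S + 1; a small helper (plain Python, shapes hard-coded from the slide, biases ignored) to verify them:

def conv_out(w, f, s, p=0):
    return (w - f + 2 * p) // s + 1

w, d = 227, 3
layers = [("Conv 1", 96, 11, 4), ("Max Pool 1", None, 3, 2),
          ("Conv 2", 256, 5, 1), ("Max Pool 2", None, 3, 2),
          ("Conv 3", 384, 3, 1), ("Conv 4", 384, 3, 1),
          ("Conv 5", 256, 3, 1), ("Max Pool 3", None, 3, 2)]
for name, k, f, s in layers:
    w2 = conv_out(w, f, s)
    params = f * f * d * k if k else 0   # pooling layers have no parameters
    print(f"{name}: {w}x{w}x{d} -> {w2}x{w2}x{k or d}, params = {params}")
    w, d = w2, (k or d)

# FC layers: 2*2*256 -> 4096 -> 4096 -> 1000
print(2 * 2 * 256 * 4096, 4096 * 4096, 4096 * 1000)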
Let us look at the connections in the fully connected layers in more detail
We will first stretch out the last conv or maxpool layer to make it a 1d vector: 2 × 2 × 256 = 1024
This 1d vector is then densely connected to the other layers just as in a regular feedforward neural network (1024 → 4096 → 4096 → 1000)
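A sketch of this stretching step (NumPy assumed; the weights are random placeholders, not trained values):

import numpy as np

volume = np.random.randn(2, 2, 256)    # last maxpool output
v = volume.reshape(-1)                 # 2 * 2 * 256 = 1024
W1 = np.random.randn(4096, 1024)       # first dense layer
h1 = np.maximum(0, W1 @ v)             # a ReLU nonlinearity, for illustration
print(v.shape, h1.shape)               # (1024,) (4096,)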
ZFNet

[Figure: AlexNet (top) and ZFNet (bottom) compared layer by layer; the two architectures are identical except for the changes listed below]

Layer 1 (conv): F = 11 → 7
Difference in Parameters: ((11² − 7²) × 3) × 96 ≈ 20.7K
Layer 2 (maxpool): No difference
Layer 3 (conv): No difference
Layer 4 (maxpool): No difference
Layer 5 (conv): K = 384 → 512
Difference in Parameters: (3 × 3 × 256) × (512 − 384) ≈ 0.29M
Layer 6 (conv): K = 384 → 1024
Difference in Parameters: (3 × 3) × ((512 × 1024) − (384 × 384)) ≈ 3.39M
Layer 7 (conv): K = 256 → 512
Difference in Parameters: (3 × 3) × ((1024 × 512) − (384 × 256)) ≈ 3.83M
Layer 8 (maxpool): No difference
Layer 9 (FC): No difference
Layer 10 (FC): No difference

Difference in Total No. of Parameters (summing the layer-wise differences above): ≈ 7.54M
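A short sketch (plain Python, biases ignored as elsewhere in the lecture) to recompute these layer-wise differences and their sum:

def conv_params(f, d_in, k):     # FxF filters, input depth d_in, k filters
    return f * f * d_in * k

diff1 = (11 * 11 - 7 * 7) * 3 * 96                            # ~20.7K
diff5 = conv_params(3, 256, 512) - conv_params(3, 256, 384)   # ~0.29M
diff6 = conv_params(3, 512, 1024) - conv_params(3, 384, 384)  # ~3.39M
diff7 = conv_params(3, 1024, 512) - conv_params(3, 384, 256)  # ~3.83M
print(diff1, diff5, diff6, diff7, diff1 + diff5 + diff6 + diff7)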
VGGNet

[Figure: VGG-16: Input 224 × 224 × 3 → Conv → 224 × 224 × 64 → maxpool → 112 × 112 × 64 → Conv → 112 × 112 × 128 → maxpool → 56 × 56 × 128 → Conv → 56 × 56 × 256 → maxpool → 28 × 28 × 256 → Conv → 28 × 28 × 512 → maxpool → 14 × 14 × 512 → Conv → 14 × 14 × 512 → maxpool → 7 × 7 × 512 → fc (4096) → fc (4096) → softmax (1000)]

Kernel size is 3 × 3 throughout
Total parameters in non-FC layers = ∼16M
Total Parameters in FC layers = (512 × 7 × 7 × 4096) + (4096 × 4096) + (4096 × 1000) = ∼122M
Most parameters are in the first FC layer (∼102M)
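The FC count is quick to check (plain Python; biases ignored):

fc1 = 512 * 7 * 7 * 4096    # flattened 7x7x512 volume into 4096 units, ~102M
fc2 = 4096 * 4096
out = 4096 * 1000           # 1000 ImageNet classes
print(fc1, fc2, out, fc1 + fc2 + out)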
Module 11.5 : Image Classification continued (GoogLeNet and ResNet)
Consider the output at a certain layer of a convolutional neural network
After this layer we could apply a maxpooling layer
Or a 1 × 1 convolution
Or a 3 × 3 convolution
Or a 5 × 5 convolution
Question: Why choose between these options (convolution, maxpooling, filter sizes)?
Idea: Why not apply all of them at the same time and then concatenate the feature maps?
Well, this naive idea could result in a large number of computations
If P = 0 & S = 1 then convolving a W × H × D input with a F × F × D filter results in a (W − F + 1) × (H − F + 1) sized output
Each element of the output requires O(F × F × D) computations
Can we reduce the number of computations?
Yes, by using 1 × 1 convolutions
Huh?? What does a 1 × 1 convolution do?
It aggregates along the depth
So convolving a D × W × H input with D1 1 × 1 filters (D1 < D) will result in a D1 × W × H output (S = 1, P = 0)
If D1 < D then this effectively reduces the dimension of the input and hence the computations
Specifically, instead of O(F × F × D) we will need O(F × F × D1) computations
We could then apply subsequent 3 × 3, 5 × 5 filters on this reduced output
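A rough sketch of the savings (plain Python; the shapes are illustrative assumptions, and padding is taken to preserve the spatial size):

# Multiply-accumulate counts for a 5x5 convolution applied directly vs.
# after a 1x1 reduction of the depth from D to D1
W = H = 28
D, D1, K, F = 256, 64, 32, 5    # input depth, reduced depth, filters, size

direct = W * H * K * (F * F * D)                     # 5x5 conv on full depth
reduced = W * H * D1 * D + W * H * K * (F * F * D1)  # 1x1 reduce, then 5x5
print(direct, reduced, round(direct / reduced, 1))   # roughly 3x fewer ops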
But we might want to use different dimensionality reductions before the 3 × 3 and 5 × 5 filters
So we can use D1 and D2 1 × 1 filters before the 3 × 3 and 5 × 5 filters respectively
We can then add the maxpooling layer followed by dimensionality reduction (a 1 × 1 convolution after 3 × 3 maxpooling)
And a new set of 1 × 1 convolutions
And finally we concatenate all these layers

[Figure: parallel branches over the same 28 × 28 × 256 input: 1 × 1 convolutions; 1 × 1 convolutions (dimensionality reduction) followed by 3 × 3 convolutions (on reduced input); 1 × 1 convolutions (dimensionality reduction) followed by 5 × 5 convolutions (on reduced input); 3 × 3 maxpooling followed by 1 × 1 convolutions (dimensionality reduction); the outputs are merged by filter concatenation]

This is called the Inception module
We will now see GoogLeNet which contains many such inception modules
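A sketch of such a module (PyTorch assumed; the six branch widths are the module's hyperparameters, and padding keeps every branch at the input's spatial size so the feature maps can be concatenated along the depth):

import torch
import torch.nn as nn

class Inception(nn.Module):
    def __init__(self, d_in, d1, d3r, d3, d5r, d5, dp):
        super().__init__()
        self.b1 = nn.Conv2d(d_in, d1, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(d_in, d3r, kernel_size=1),
                                nn.Conv2d(d3r, d3, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(d_in, d5r, kernel_size=1),
                                nn.Conv2d(d5r, d5, kernel_size=5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
                                nn.Conv2d(d_in, dp, kernel_size=1))

    def forward(self, x):
        # Concatenate the four branches along the depth dimension
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

# Widths of GoogLeNet's Inception 3a: 64 + 128 + 32 + 32 = 256 output maps
m = Inception(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).shape)   # torch.Size([1, 256, 28, 28])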
[Figure: GoogLeNet, layer by layer —
Input (229 × 229) → Conv → 112 × 112 × 64 → maxpool → 56 × 56 × 64 → Conv → 56 × 56 × 192 → maxpool → 28 × 28 × 192 → Inception 3a → 28 × 28 × 256 → Inception 3b → 28 × 28 × 480 → maxpool → 14 × 14 × 480 → Inception 4a → 14 × 14 × 512 → Inception 4b → 14 × 14 × 512 → Inception 4c → 14 × 14 × 512 → Inception 4d → 14 × 14 × 528 → Inception 4e → 14 × 14 × 832 → maxpool → 7 × 7 × 832 → Inception 5a → 7 × 7 × 832 → Inception 5b → 7 × 7 × 1024 → avgpool → 1 × 1 × 1024 → dropout(40%) → fc (1000) → softmax (1000)]

The per-module slides specify the branch widths of each inception module; collected into one table:

Module   Input        #1×1   #3×3 reduce   #3×3   #5×5 reduce   #5×5   pool proj   Output depth
3a       28×28×192      64        96        128       16          32       32          256
3b       28×28×256     128       128        192       32          96       64          480
4a       14×14×480     192        96        208       16          48       64          512
4b       14×14×512     160       112        224       24          64       64          512
4c       14×14×512     128       128        256       24          64       64          512
4d       14×14×512     112       144        288       32          64       64          528
4e       14×14×528     256       160        320       32         128      128          832
5a       7×7×832       256       160        320       32         128      128          832
5b       7×7×832       384       192        384       48         128      128         1024

60/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
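Every output depth in the table is just filter concatenation at work: the four branch widths (#1×1, #3×3, #5×5 and the pool projection) add up along the channel axis. A quick sanity check as a small Python sketch:

# Branch widths (#1x1, #3x3, #5x5, pool projection) copied from the table above;
# the output depth of each inception module is their sum.
modules = {
    "3a": (64, 128, 32, 32),    "3b": (128, 192, 96, 64),
    "4a": (192, 208, 48, 64),   "4b": (160, 224, 64, 64),
    "4c": (128, 256, 64, 64),   "4d": (112, 288, 64, 64),
    "4e": (256, 320, 128, 128), "5a": (256, 320, 128, 128),
    "5b": (384, 384, 128, 128),
}
for name, widths in modules.items():
    print(name, sum(widths))  # 256, 480, 512, 512, 512, 528, 832, 832, 1024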
Important Trick: Got rid of the fully connected layer
Notice that the output of the last layer is 7 × 7 × 1024 dimensional.
What if we were to add a fully connected layer with 1000 nodes (for the 1000 classes) on top of this, after flattening?
We would need W ∈ R^{50176×1000}, i.e., 7 × 7 × 1024 × 1000 ≈ 49M parameters.
Instead they use an average pooling of size 7 × 7 on each of the 1024 feature maps.
This results in a 1024 dimensional output, so the final layer only needs W ∈ R^{1024×1000}.
This significantly reduces the number of parameters.
Overall, GoogLeNet has 12× fewer parameters than AlexNet, but performs 2× more computations.

61/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
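The arithmetic behind this trick, as a small sketch:

# Weights in the final classifier, with and without the average-pooling trick
flatten_fc = 7 * 7 * 1024 * 1000  # flatten + fc: 50,176,000 weights (the ~49M above)
avgpool_fc = 1024 * 1000          # 7 x 7 avgpool + fc: 1,024,000 weights
print(flatten_fc // avgpool_fc)   # 49, i.e., a 49x reduction in this layer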
GoogLeNet
ResNet

62/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
Suppose we have been able to train a shallow neural network well.
Now suppose we construct a deeper network which has a few more layers (shown in orange in the figure).
Intuitively, if the shallow network works well then the deep network should also work well, by simply learning to compute identity functions in the new layers.
Essentially, the solution space of a shallow neural network is a subset of the solution space of a deep neural network.

63/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
But in practice it is observed that this doesn't happen.
Notice that the deeper network has a higher error rate on the test set.

64/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
Consider any two stacked layers in a CNN.
The two layers are essentially learning some function H(x) of the input.
What if we enable them to learn only a residual function of the input?
That is, the stacked layers now compute F(x), and an identity connection carries x around them, so that H(x) = F(x) + x.

[Figure: left — two stacked (layer, relu) blocks mapping x to H(x); right — the same two blocks compute F(x), an identity connection adds x, and the output is H(x) = F(x) + x]

65/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
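A minimal PyTorch sketch of such a residual block (an illustration, not the exact block from the paper: the batch normalization that ResNet inserts after every convolution is omitted, and the two stacked layers are assumed to preserve the input shape so that the identity can be added directly):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    # Two stacked 3 x 3 conv layers learn the residual F(x);
    # the identity connection adds the input back: H(x) = F(x) + x
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))  # first layer + relu
        out = self.conv2(out)        # second layer completes F(x)
        return F.relu(out + x)       # H(x) = F(x) + x, then relu

The addition out + x requires F(x) and x to have the same shape; when the dimensions change between stages, the paper uses a projection on the shortcut instead of the plain identity.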
Why would this help?
Remember our argument that a deeper version of a shallow network would do just fine by learning identity transformations in the new layers.
This identity connection from the input allows a ResNet to retain a copy of the input: to realize an identity mapping, the stacked layers only have to drive F(x) to 0, which is easier than learning H(x) = x from scratch.
Using this idea they were able to train really deep networks.

66/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
ResNet, 152 layers: 1st place in all five main tracks
ImageNet Classification: “Ultra-deep” 152-layer nets
ImageNet Detection: 16% better than the 2nd best system
ImageNet Localization: 27% better than the 2nd best system
COCO Detection: 11% better than the 2nd best system
COCO Segmentation: 12% better than the 2nd best system

67/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
Bag of tricks (ResNet, 152 layers)
Batch Normalization after every CONV layer
Xavier/2 initialization from [He et al.]
SGD + Momentum (0.9)
Learning rate: 0.1, divided by 10 when the validation error plateaus
Mini-batch size 256
Weight decay of 1e-5
No dropout used

68/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
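As a sketch, this recipe maps onto standard PyTorch components roughly as follows (the tiny stand-in model and the commented epoch step are placeholders; the data loader is assumed to serve mini-batches of 256):

import torch
import torch.nn as nn

def init_weights(m):
    # "Xavier/2", i.e. He initialization, for every CONV layer
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')

model = nn.Sequential(nn.Conv2d(3, 64, kernel_size=3, padding=1),
                      nn.ReLU())  # stand-in for the 152-layer ResNet
model.apply(init_weights)

# SGD + momentum(0.9), learning rate 0.1, weight decay 1e-5
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-5)
# divides the learning rate by 10 when the monitored quantity
# (here, the validation error) plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                       mode='min', factor=0.1)
# after each training epoch: scheduler.step(current_validation_error)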