
CS7015 (Deep Learning) : Lecture 11

Convolutional Neural Networks, LeNet, AlexNet, ZF-Net, VGGNet, GoogLeNet and ResNet

Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

Module 11.1 : The convolution operation

Suppose we are tracking the position of an aeroplane using a laser sensor at discrete time intervals, giving readings x_0, x_1, x_2, . . .

Now suppose our sensor is noisy. To obtain a less noisy estimate we would like to average several measurements. More recent measurements are more important, so we would like to take a weighted average:

$$s_t = \sum_{a=0}^{\infty} x_{t-a}\, w_{-a} = (x * w)_t$$

Here x is the input, w is the filter, and this operation is called convolution.
In practice, we would only sum over a small window:

$$s_t = \sum_{a=0}^{6} x_{t-a}\, w_{-a}$$

The weight array (w) is known as the filter. We just slide the filter over the input and compute the value of s_t based on a window around x_t. For example, with the filter W = (w_{-6}, ..., w_{-1}, w_0):

W: 0.01 0.01 0.02 0.02 0.04 0.40 0.50
X: 1.00 1.10 1.20 1.40 1.70 1.80 1.90 2.10 2.20 2.40 2.50 2.70
S: 1.80 1.96 2.11 2.16 2.28 2.42

The first output is s_6, since a full window of seven inputs is needed:

$$s_6 = x_6 w_0 + x_5 w_{-1} + x_4 w_{-2} + x_3 w_{-3} + x_2 w_{-4} + x_1 w_{-5} + x_0 w_{-6}$$

Here the input (and the kernel) is one dimensional. Can we use the convolution operation on a 2D input also?
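As a quick sanity check, here is a minimal NumPy sketch of this windowed weighted average (the code and variable names are ours, not from the lecture). np.correlate slides w over x and takes dot products, which is exactly the sum above when w is stored in the order w_{-6}, ..., w_0, as in the table:

```python
import numpy as np

x = np.array([1.00, 1.10, 1.20, 1.40, 1.70, 1.80,
              1.90, 2.10, 2.20, 2.40, 2.50, 2.70])
w = np.array([0.01, 0.01, 0.02, 0.02, 0.04, 0.4, 0.5])

# Sliding dot product of each length-7 window of x with w;
# s[0] corresponds to s_6 in the notation above.
s = np.correlate(x, w, mode='valid')
print(np.round(s, 2))
```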
We can think of images as 2D inputs. We would now like to use a 2D filter (m × n). First let us see what the 2D formula looks like:

$$S_{ij} = (I * K)_{ij} = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} I_{i-a,\, j-b}\, K_{a,b}$$

This formula looks at all the preceding neighbours (i − a, j − b). In practice, we use the following formula, which looks at the succeeding neighbours:

$$S_{ij} = (I * K)_{ij} = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} I_{i+a,\, j+b}\, K_{a,b}$$
Let us apply this idea to a toy example and see the results.

Input (3 × 4):
a b c d
e f g h
i j k ℓ

Kernel (2 × 2):
w x
y z

Output (2 × 3):
aw+bx+ey+fz   bw+cx+fy+gz   cw+dx+gy+hz
ew+fx+iy+jz   fw+gx+jy+kz   gw+hx+ky+ℓz
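The same computation as a minimal code sketch (ours): a double loop implementing S_ij = Σ_{a,b} I_{i+a,j+b} K_{a,b}, with arbitrary numbers standing in for a…ℓ and w, x, y, z:

```python
import numpy as np

def conv2d_valid(I, K):
    """S[i, j] = sum_{a, b} I[i + a, j + b] * K[a, b] (no padding)."""
    m, n = K.shape
    H, W = I.shape
    S = np.zeros((H - m + 1, W - n + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            S[i, j] = np.sum(I[i:i + m, j:j + n] * K)
    return S

I = np.arange(12, dtype=float).reshape(3, 4)  # stands in for a..l
K = np.array([[1.0, 2.0],
              [3.0, 4.0]])                    # stands in for w, x, y, z
print(conv2d_valid(I, K).shape)               # (2, 3), as on the slide
```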
For the rest of the discussion we will use the following formula for convolution:

$$S_{ij} = (I * K)_{ij} = \sum_{a=-\lfloor m/2 \rfloor}^{\lfloor m/2 \rfloor} \sum_{b=-\lfloor n/2 \rfloor}^{\lfloor n/2 \rfloor} I_{i-a,\, j-b}\, K_{\lfloor m/2 \rfloor + a,\, \lfloor n/2 \rfloor + b}$$

In other words, we will assume that the kernel is centered on the pixel of interest, so we will be looking at both preceding and succeeding neighbors.
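For odd-sized kernels, this centered formula is ordinary 2D convolution with the output cropped to the input size, which scipy.signal.convolve2d provides via mode='same'. A small sketch (the image and kernel below are ours):

```python
import numpy as np
from scipy.signal import convolve2d

I = np.random.rand(8, 8)           # any grayscale image
K = np.ones((3, 3)) / 9.0          # a 3x3 averaging kernel

S = convolve2d(I, K, mode='same')  # kernel centered on each pixel
print(S.shape)                     # (8, 8): same size as the input
```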
Let us see some examples of 2D convolutions applied to images

Convolving with

1 1 1
1 1 1
1 1 1

(usually scaled by 1/9 so that the weights average rather than sum) blurs the image.
Convolving with

 0 -1  0
-1  5 -1
 0 -1  0

sharpens the image.
Convolving with

1  1 1
1 -8 1
1  1 1

detects the edges.
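The three kernels above can be tried directly with the same-size convolution sketched earlier; a short example (ours), using a random array as a stand-in for a grayscale image:

```python
import numpy as np
from scipy.signal import convolve2d

blur    = np.ones((3, 3)) / 9.0
sharpen = np.array([[ 0., -1.,  0.],
                    [-1.,  5., -1.],
                    [ 0., -1.,  0.]])
edges   = np.array([[ 1.,  1.,  1.],
                    [ 1., -8.,  1.],
                    [ 1.,  1.,  1.]])

img = np.random.rand(32, 32)      # stand-in for a real grayscale image
for K in (blur, sharpen, edges):
    print(convolve2d(img, K, mode='same').shape)   # (32, 32) each
```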
We will now see a working example of 2D convolution.

We just slide the kernel over the input image. Each time we slide the kernel we get one value in the output. The resulting output is called a feature map. We can use multiple filters to get multiple feature maps.
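A sketch (ours) of the multiple-filters idea, stacking one 2D feature map per filter:

```python
import numpy as np
from scipy.signal import convolve2d

img     = np.random.rand(32, 32)
filters = [np.random.rand(3, 3) for _ in range(6)]   # 6 different kernels

feature_maps = np.stack([convolve2d(img, K, mode='valid')
                         for K in filters])
print(feature_maps.shape)    # (6, 30, 30): one 30x30 map per filter
```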
Question: in the 1D case, we slide a one dimensional filter over a one dimensional input; in the 2D case, we slide a two dimensional filter over a two dimensional input. What would happen in the 3D case?
Consider a 3D input, such as an RGB image (the R, G and B channels give it a depth of 3). What would a 3D filter look like?

It will be 3D, and we will refer to it as a volume. Once again we will slide the volume over the 3D input and compute the convolution operation. Note that in this lecture we will assume that the filter always extends to the depth of the image. In effect, we are doing a 2D convolution operation on a 3D input (because the filter moves along the height and the width but not along the depth). As a result the output will be 2D (only width and height, no depth). Once again we can apply multiple filters to get multiple feature maps.
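A minimal sketch (ours) of this full-depth convolution: each placement of the filter covers all input channels and produces a single number, so the result is a 2D map:

```python
import numpy as np

def conv_volume(I, K):
    """I: H x W x D input (D = 3 for RGB); K: f x f x D filter.
    The filter spans the full depth, so the output is 2D."""
    f = K.shape[0]
    H, W, D = I.shape
    assert K.shape[2] == D, "filter depth must equal input depth"
    S = np.zeros((H - f + 1, W - f + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            S[i, j] = np.sum(I[i:i + f, j:j + f, :] * K)
    return S

rgb = np.random.rand(32, 32, 3)
K   = np.random.rand(5, 5, 3)
print(conv_volume(rgb, K).shape)   # (28, 28): a single 2D feature map
```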
Module 11.2 : Relation between input size, output size and filter size
So far we have not said anything explicit about the dimensions of the
1. inputs
2. filters
3. outputs
and the relations between them. We will see how they are related, but before that we will define a few quantities.
We first define the following quantities:
- the Width (W1), Height (H1) and Depth (D1) of the original input
- the Stride S (we will come back to this later)
- the number of filters K
- the spatial extent (F) of each filter (the depth of each filter is the same as the depth of the input)

The output is W2 × H2 × D2 (we will soon see a formula for computing W2, H2 and D2).
Let us compute the dimensions (W2, H2) of the output.

Notice that we can't place the kernel at the corners, as it would cross the input boundary. The same is true for every pixel close enough to the boundary (the kernel crosses the input boundary there), so the output has smaller dimensions than the input. As the size of the kernel increases, this becomes true for even more pixels: with a 5 × 5 kernel, for example, we get an even smaller output.

In general,
W2 = W1 − F + 1
H2 = H1 − F + 1
We will refine this formula further.
What if we want the output to be of the same size as the input? We can use something known as padding: pad the input with an appropriate number of 0 inputs so that we can now apply the kernel at the corners.

For example, with a 3 × 3 kernel we can use pad P = 1, which means we add one row and one column of 0 inputs at the top, bottom, left and right.

We now have,
W2 = W1 − F + 2P + 1
H2 = H1 − F + 2P + 1
We will refine this formula further.
What does the stride S do? It defines the intervals at which the filter is applied (for example, S = 2). With S = 2 we are essentially skipping every 2nd pixel, which again results in an output of smaller dimensions.

So our final formula looks like,

$$W_2 = \frac{W_1 - F + 2P}{S} + 1, \qquad H_2 = \frac{H_1 - F + 2P}{S} + 1$$
Finally, coming to the depth of the output: each filter gives us one 2D output, and K filters will give us K such 2D outputs. We can think of the resulting output as a K × W2 × H2 volume. Thus D2 = K.

To summarize:
W2 = (W1 − F + 2P)/S + 1
H2 = (H1 − F + 2P)/S + 1
D2 = K
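These formulas fit in a few lines of code; a small helper (ours, with illustrative numbers):

```python
def conv_output_shape(W1, H1, F, K, S=1, P=0):
    """(W2, H2, D2) for K FxF filters with stride S and padding P."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    return W2, H2, K   # D2 = K

print(conv_output_shape(W1=7, H1=7, F=3, K=10, S=2, P=1))  # (4, 4, 10)
```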
Let us do a few exercises.

Exercise 1: input 227 × 227 × 3; kernel 11 × 11; 96 filters; Stride = 4; Padding = 0.

W2 = (227 − 11)/4 + 1 = 55
H2 = (227 − 11)/4 + 1 = 55
D2 = 96

So the output is 55 × 55 × 96.
Exercise 2: input 32 × 32 × 1; kernel 5 × 5; 6 filters; Stride = 1; Padding = 0.

W2 = (32 − 5)/1 + 1 = 28
H2 = (32 − 5)/1 + 1 = 28
D2 = 6

So the output is 28 × 28 × 6.
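Both exercises check out in code (a compact restatement of the helper sketched earlier):

```python
def conv_output_shape(W1, H1, F, K, S=1, P=0):   # as defined earlier
    return (W1 - F + 2 * P) // S + 1, (H1 - F + 2 * P) // S + 1, K

print(conv_output_shape(227, 227, F=11, K=96, S=4, P=0))  # (55, 55, 96)
print(conv_output_shape(32, 32, F=5, K=6, S=1, P=0))      # (28, 28, 6)
```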
Module 11.3 : Convolutional Neural Networks

Putting things into perspective: what is the connection between this operation (convolution) and neural networks? We will try to understand this by considering the task of “image classification”.
A classical image-classification pipeline extracts features from the input image and feeds them to a classifier that predicts a label such as car, bus, monument or flower. The features may be the raw pixels, the output of an edge detector, or descriptors such as SIFT/HOG. In all of these cases the feature extraction is static (no learning); only the weights of the classifier are learned.
Instead of using handcrafted kernels, such as the edge detector

0 0 0 0 0
0 1 1 1 0
0 1 -8 1 0
0 1 1 1 0
0 0 0 0 0

can we learn meaningful kernels/filters, in addition to learning the weights of the classifier? In other words, can we treat the kernel entries themselves as weights to be learned?
Even better: instead of using handcrafted kernels (such as edge detectors), can we learn multiple meaningful kernels/filters, each its own matrix of learned weights, in addition to learning the weights of the classifier?
Can we learn multiple layers of meaningful kernels/filters in addition to learning the weights of the classifier? Yes, we can! Simply by treating these kernels as parameters and learning them, in addition to the weights of the classifier, using backpropagation.

Such a network is called a Convolutional Neural Network.
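A minimal PyTorch sketch (ours; the layer sizes are illustrative and not from the lecture) showing that the kernels are ordinary parameters through which gradients flow during backpropagation:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5),   # 6 learnable 5x5 kernels over 3 channels
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(6 * 28 * 28, 4),        # classifier over 4 classes
)

x = torch.randn(1, 3, 32, 32)         # one 32x32 RGB image
model(x).sum().backward()             # backprop reaches the kernels too
print(model[0].weight.grad.shape)     # torch.Size([6, 3, 5, 5])
```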
Okay, I get it that the idea is to learn the kernels/filters by just treating them as parameters of the classification model. But how is this different from a regular feedforward neural network? Let us see.
Consider classifying an image with 16 input pixels into 10 classes (digits). This is what a regular feed-forward neural network would look like: there are many dense connections here. For example, all the 16 input neurons contribute to the computation of h11. Contrast this to what happens in the case of convolution.
Only a few local neurons participate in the computation of each output; for example, only pixels 1, 2, 5, 6 contribute to h11. The connections are much sparser. We are taking advantage of the structure of the image (interactions between neighboring pixels are more interesting). This sparse connectivity reduces the number of parameters in the model.
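A quick, illustrative count (ours), assuming a 4 × 4 input and a 2 × 2 kernel producing a 3 × 3 layer of hidden units: sparse connectivity alone already cuts the parameters, and the weight sharing discussed next cuts them further:

```python
n_inputs = 16                      # 4x4 image
n_hidden = 9                       # 3x3 map from a 2x2 kernel, stride 1

dense  = n_inputs * n_hidden       # 144: every pixel feeds every unit
sparse = n_hidden * 4              # 36: each unit sees only 4 pixels
shared = 2 * 2                     # 4: one 2x2 kernel reused everywhere
print(dense, sparse, shared)       # 144 36 4
```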
But is sparse connectivity really good
thing ?

Aren’t we losing information (by los-


ing interactions between some input
pixels)

Well, not really

The two highlighted neurons (x1 &


x5 )∗ do not interact in layer 1

But they indirectly contribute to the


computation of g3 and hence interact
indirectly

Goodfellow-et-al-2016
35/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
[Figure: a 4 × 4 image with two kernels, Kernel 1 (orange) applied to one portion of the image and Kernel 2 (pink) applied to another]

Another characteristic of CNNs is weight sharing
Consider the following network
Do we want the kernel weights to be different for different portions of the image?
Imagine that we are trying to learn a kernel that detects edges
Shouldn't we be applying the same kernel at all the portions of the image?
In other words, shouldn't the orange and pink kernels be the same?
Yes, indeed
This would make the job of learning easier (instead of trying to learn the same weights/kernels at different locations again and again)
But does that mean we can have only one kernel?
No, we can have many such kernels, but the kernels will be shared by all locations in the image
This is called "weight sharing"
So far, we have focused only on the convolution operation
Let us see what a full convolutional neural network looks like
[Figure: a full convolutional neural network: Input (32 × 32) → Convolution Layer 1 (S = 1, F = 5, K = 6, P = 0; Param = 150) → 28 × 28 × 6 → Pooling Layer 1 (S = 2, F = 2; Param = 0) → 14 × 14 × 6 → Convolution Layer 2 (S = 1, F = 5, K = 16, P = 0; Param = 2400) → 10 × 10 × 16 → Pooling Layer 2 (S = 2, F = 2; Param = 0) → 5 × 5 × 16 → FC 1 (120; Param = 48120) → FC 2 (84; Param = 10164) → Output (10; Param = 850)]

It has alternate convolution and pooling layers
What does a pooling layer do?
Let us see
[Figure: the input is convolved with 1 filter to give a 4 × 4 feature map, which is then max-pooled]

Feature map (4 × 4):

1 4 2 1
5 8 3 4
7 6 4 5
1 3 1 2

maxpool with 2 × 2 filters (stride 2):

8 4
7 5

maxpool with 2 × 2 filters (stride 1):

8 8 4
8 8 5
7 6 5

Instead of max pooling we can also do average pooling
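The pooling outputs above are easy to verify in code. A minimal sketch, assuming NumPy (the lecture itself prescribes no library), that reproduces both results:

import numpy as np

x = np.array([[1, 4, 2, 1],
              [5, 8, 3, 4],
              [7, 6, 4, 5],
              [1, 3, 1, 2]])

def maxpool2x2(x, stride):
    # Slide a 2x2 window over x and take the max in each window
    h = (x.shape[0] - 2) // stride + 1
    w = (x.shape[1] - 2) // stride + 1
    return np.array([[x[i*stride:i*stride+2, j*stride:j*stride+2].max()
                      for j in range(w)] for i in range(h)])

print(maxpool2x2(x, stride=2))  # [[8 4] [7 5]]
print(maxpool2x2(x, stride=1))  # [[8 8 4] [8 8 5] [7 6 5]]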
We will now see some case studies where convolutional neural networks have been successful
LeNet-5 for handwritten character recognition

Input: 32 × 32
Convolution Layer 1: S = 1, F = 5, K = 6, P = 0 → 28 × 28 × 6; Param = 150
Pooling Layer 1: S = 2, F = 2, K = 6, P = 0 → 14 × 14 × 6; Param = 0
Convolution Layer 2: S = 1, F = 5, K = 16, P = 0 → 10 × 10 × 16; Param = 2400
Pooling Layer 2: S = 2, F = 2, K = 16, P = 0 → 5 × 5 × 16; Param = 0
FC 1 (120): Param = 48120
FC 2 (84): Param = 10164
Output (10): Param = 850
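As a sketch of how this architecture could be written in code (PyTorch is an assumption, the lecture prescribes no framework; the activation choice is ours, and the framework's default bias terms are not counted in the slide's parameter totals):

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),         # 32x32x1 -> 28x28x6
            nn.ReLU(),                              # activation is our choice
            nn.MaxPool2d(kernel_size=2, stride=2),  # -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),        # -> 10x10x16
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # -> 5x5x16
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120),             # FC 1
            nn.ReLU(),
            nn.Linear(120, 84),                     # FC 2
            nn.ReLU(),
            nn.Linear(84, 10),                      # Output (10 digits)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)                     # stretch to a 1d vector
        return self.classifier(x)

model = LeNet5()
print(model(torch.randn(1, 1, 32, 32)).shape)       # torch.Size([1, 10])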
How do we train a convolutional neural network?
[Figure: a 3 × 3 input (pixels b, c, d, e, f, g, h, i, j) convolved with a 2 × 2 kernel (w, x, y, z) gives a 2 × 2 output (l, m, n, o); drawn as a feedforward network, each output connects to only four of the nine inputs]

A CNN can be implemented as a feedforward neural network
wherein only a few weights (in color) are active
the rest of the weights (in gray) are zero
We can thus train a convolutional neural network using backpropagation by thinking of it as a feedforward neural network with sparse connections
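To make this equivalence concrete, here is a small sketch (NumPy assumed; the input and kernel values are placeholders) that builds the sparse, weight-shared matrix for a 3 × 3 input and a 2 × 2 kernel and checks that it reproduces the convolution:

import numpy as np

X = np.arange(9, dtype=float).reshape(3, 3)   # input pixels b..j
K = np.array([[1., 2.], [3., 4.]])            # kernel entries w, x, y, z

# Direct convolution (cross-correlation, as CNNs implement it)
direct = np.array([[(X[i:i+2, j:j+2] * K).sum() for j in range(2)]
                   for i in range(2)])

# Same computation as a sparse 4x9 matrix: each row holds only w, x, y, z
W = np.zeros((4, 9))
for r, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    for di in range(2):
        for dj in range(2):
            W[r, (i + di) * 3 + (j + dj)] = K[di, dj]

assert np.allclose(W @ X.ravel(), direct.ravel())

Each row of W reuses the same four entries w, x, y, z, so backpropagating through W is ordinary feedforward backpropagation with the gradients of the shared entries summed across locations.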
Module 11.4 : CNNs (success stories on ImageNet)
ImageNet Success Stories (roadmap for rest of the talk)

AlexNet
ZFNet
VGGNet
[Figure: top-5 error rates (%) of successive ILSVRC winners, with network depth]

ILSVRC’10: 28.2 (shallow)
ILSVRC’11: 25.8 (shallow)
ILSVRC’12: 16.4 (AlexNet, 8 layers)
ILSVRC’13: 11.7 (ZFNet, 8 layers)
ILSVRC’14: 7.3 (VGG, 19 layers)
ILSVRC’14: 6.7 (GoogLeNet, 22 layers)
ILSVRC’15: 3.57 (ResNet, 152 layers)
AlexNet

Input: 227 × 227 × 3
Conv 1: K = 96, F = 11, S = 4, P = 0 → 55 × 55 × 96; Parameters: (11 × 11 × 3) × 96 = 34K
Max Pool 1: F = 3, S = 2 → 27 × 27 × 96; Parameters: 0
Conv 2: K = 256, F = 5, S = 1, P = 0 → 23 × 23 × 256; Parameters: (5 × 5 × 96) × 256 = 0.6M
Max Pool 2: F = 3, S = 2 → 11 × 11 × 256; Parameters: 0
Conv 3: K = 384, F = 3, S = 1, P = 0 → 9 × 9 × 384; Parameters: (3 × 3 × 256) × 384 = 0.8M
Conv 4: K = 384, F = 3, S = 1, P = 0 → 7 × 7 × 384; Parameters: (3 × 3 × 384) × 384 = 1.327M
Conv 5: K = 256, F = 3, S = 1, P = 0 → 5 × 5 × 256; Parameters: (3 × 3 × 384) × 256 = 0.8M
Max Pool 3: F = 3, S = 2 → 2 × 2 × 256; Parameters: 0
FC 1 (4096, dense): Parameters: (2 × 2 × 256) × 4096 = 4M
FC 2 (4096, dense): Parameters: 4096 × 4096 = 16M
Output (1000, dense): Parameters: 4096 × 1000 = 4M

Total Parameters: 27.55M
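The spatial sizes and parameter counts above follow from W2 = (W1 − F + 2P)/S + 1; a small helper (plain Python, shapes hard-coded from the slide, biases ignored) to verify them:

def conv_out(w, f, s, p=0):
    return (w - f + 2 * p) // s + 1

w, d = 227, 3
layers = [("Conv 1", 96, 11, 4), ("Max Pool 1", None, 3, 2),
          ("Conv 2", 256, 5, 1), ("Max Pool 2", None, 3, 2),
          ("Conv 3", 384, 3, 1), ("Conv 4", 384, 3, 1),
          ("Conv 5", 256, 3, 1), ("Max Pool 3", None, 3, 2)]
for name, k, f, s in layers:
    w2 = conv_out(w, f, s)
    params = f * f * d * k if k else 0   # pooling layers have no parameters
    print(f"{name}: {w}x{w}x{d} -> {w2}x{w2}x{k or d}, params = {params}")
    w, d = w2, (k or d)

# FC layers: 2*2*256 -> 4096 -> 4096 -> 1000
print(2 * 2 * 256 * 4096, 4096 * 4096, 4096 * 1000)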
Let us look at the connections in the fully connected layers in more detail
We will first stretch out the last conv or maxpool layer to make it a 1d vector: 2 × 2 × 256 = 1024
This 1d vector is then densely connected to the other layers just as in a regular feedforward neural network (1024 → 4096 → 4096 → 1000)
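A sketch of this stretching step (NumPy assumed; the weights are random placeholders, not trained values):

import numpy as np

volume = np.random.randn(2, 2, 256)    # last maxpool output
v = volume.reshape(-1)                 # 2 * 2 * 256 = 1024
W1 = np.random.randn(4096, 1024)       # first dense layer
h1 = np.maximum(0, W1 @ v)             # a ReLU nonlinearity, for illustration
print(v.shape, h1.shape)               # (1024,) (4096,)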
ZFNet

[Figure: AlexNet (top) and ZFNet (bottom) compared layer by layer; the two architectures are identical except for the changes listed below]

Layer 1 (conv): F = 11 → 7
Difference in Parameters: ((11² − 7²) × 3) × 96 ≈ 20.7K
Layer 2 (maxpool): No difference
Layer 3 (conv): No difference
Layer 4 (maxpool): No difference
Layer 5 (conv): K = 384 → 512
Difference in Parameters: (3 × 3 × 256) × (512 − 384) ≈ 0.29M
Layer 6 (conv): K = 384 → 1024
Difference in Parameters: (3 × 3) × ((512 × 1024) − (384 × 384)) ≈ 3.39M
Layer 7 (conv): K = 256 → 512
Difference in Parameters: (3 × 3) × ((1024 × 512) − (384 × 256)) ≈ 3.83M
Layer 8 (maxpool): No difference
Layer 9 (FC): No difference
Layer 10 (FC): No difference

Difference in Total No. of Parameters (summing the layer-wise differences above): ≈ 7.54M
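A short sketch (plain Python, biases ignored as elsewhere in the lecture) to recompute these layer-wise differences and their sum:

def conv_params(f, d_in, k):     # FxF filters, input depth d_in, k filters
    return f * f * d_in * k

diff1 = (11 * 11 - 7 * 7) * 3 * 96                            # ~20.7K
diff5 = conv_params(3, 256, 512) - conv_params(3, 256, 384)   # ~0.29M
diff6 = conv_params(3, 512, 1024) - conv_params(3, 384, 384)  # ~3.39M
diff7 = conv_params(3, 1024, 512) - conv_params(3, 384, 256)  # ~3.83M
print(diff1, diff5, diff6, diff7, diff1 + diff5 + diff6 + diff7)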
VGGNet

[Figure: VGG-16: Input 224 × 224 × 3 → Conv → 224 × 224 × 64 → maxpool → 112 × 112 × 64 → Conv → 112 × 112 × 128 → maxpool → 56 × 56 × 128 → Conv → 56 × 56 × 256 → maxpool → 28 × 28 × 256 → Conv → 28 × 28 × 512 → maxpool → 14 × 14 × 512 → Conv → 14 × 14 × 512 → maxpool → 7 × 7 × 512 → fc (4096) → fc (4096) → softmax (1000)]

Kernel size is 3 × 3 throughout
Total parameters in non-FC layers = ∼16M
Total Parameters in FC layers = (512 × 7 × 7 × 4096) + (4096 × 4096) + (4096 × 1000) = ∼122M
Most parameters are in the first FC layer (∼102M)
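The FC count is quick to check (plain Python; biases ignored):

fc1 = 512 * 7 * 7 * 4096    # flattened 7x7x512 volume into 4096 units, ~102M
fc2 = 4096 * 4096
out = 4096 * 1000           # 1000 ImageNet classes
print(fc1, fc2, out, fc1 + fc2 + out)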
Module 11.5 : Image Classification continued (GoogLeNet and ResNet)
Consider the output at a certain layer of a convolutional neural network
After this layer we could apply a maxpooling layer
Or a 1 × 1 convolution
Or a 3 × 3 convolution
Or a 5 × 5 convolution
Question: Why choose between these options (convolution, maxpooling, filter sizes)?
Idea: Why not apply all of them at the same time and then concatenate the feature maps?
Well, this naive idea could result in a large number of computations
If P = 0 & S = 1 then convolving a W × H × D input with a F × F × D filter results in a (W − F + 1) × (H − F + 1) sized output
Each element of the output requires O(F × F × D) computations
Can we reduce the number of computations?
Yes, by using 1 × 1 convolutions
Huh?? What does a 1 × 1 convolution do?
It aggregates along the depth
So convolving a D × W × H input with D1 1 × 1 filters (D1 < D) will result in a D1 × W × H output (S = 1, P = 0)
If D1 < D then this effectively reduces the dimension of the input and hence the computations
Specifically, instead of O(F × F × D) we will need O(F × F × D1) computations
We could then apply subsequent 3 × 3, 5 × 5 filters on this reduced output
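A rough sketch of the savings (plain Python; the shapes are illustrative assumptions, and padding is taken to preserve the spatial size):

# Multiply-accumulate counts for a 5x5 convolution applied directly vs.
# after a 1x1 reduction of the depth from D to D1
W = H = 28
D, D1, K, F = 256, 64, 32, 5    # input depth, reduced depth, filters, size

direct = W * H * K * (F * F * D)                     # 5x5 conv on full depth
reduced = W * H * D1 * D + W * H * K * (F * F * D1)  # 1x1 reduce, then 5x5
print(direct, reduced, round(direct / reduced, 1))   # roughly 3x fewer ops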
But we might want to use different dimensionality reductions before the 3 × 3 and 5 × 5 filters
So we can use D1 and D2 1 × 1 filters before the 3 × 3 and 5 × 5 filters respectively
We can then add the maxpooling layer followed by dimensionality reduction (a 1 × 1 convolution after 3 × 3 maxpooling)
And a new set of 1 × 1 convolutions
And finally we concatenate all these layers

[Figure: parallel branches over the same 28 × 28 × 256 input: 1 × 1 convolutions; 1 × 1 convolutions (dimensionality reduction) followed by 3 × 3 convolutions (on reduced input); 1 × 1 convolutions (dimensionality reduction) followed by 5 × 5 convolutions (on reduced input); 3 × 3 maxpooling followed by 1 × 1 convolutions (dimensionality reduction); the outputs are merged by filter concatenation]

This is called the Inception module
We will now see GoogLeNet which contains many such inception modules
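A sketch of such a module (PyTorch assumed; the six branch widths are the module's hyperparameters, and padding keeps every branch at the input's spatial size so the feature maps can be concatenated along the depth):

import torch
import torch.nn as nn

class Inception(nn.Module):
    def __init__(self, d_in, d1, d3r, d3, d5r, d5, dp):
        super().__init__()
        self.b1 = nn.Conv2d(d_in, d1, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(d_in, d3r, kernel_size=1),
                                nn.Conv2d(d3r, d3, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(d_in, d5r, kernel_size=1),
                                nn.Conv2d(d5r, d5, kernel_size=5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
                                nn.Conv2d(d_in, dp, kernel_size=1))

    def forward(self, x):
        # Concatenate the four branches along the depth dimension
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

# Widths of GoogLeNet's Inception 3a: 64 + 128 + 32 + 32 = 256 output maps
m = Inception(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).shape)   # torch.Size([1, 256, 28, 28])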
[Figure: GoogLeNet, layer by layer —
Input (229 × 229) → Conv → 112 × 112 × 64 → maxpool → 56 × 56 × 64 → Conv → 56 × 56 × 192 → maxpool → 28 × 28 × 192 → Inception 3a → 28 × 28 × 256 → Inception 3b → 28 × 28 × 480 → maxpool → 14 × 14 × 480 → Inception 4a → 14 × 14 × 512 → Inception 4b → 14 × 14 × 512 → Inception 4c → 14 × 14 × 512 → Inception 4d → 14 × 14 × 528 → Inception 4e → 14 × 14 × 832 → maxpool → 7 × 7 × 832 → Inception 5a → 7 × 7 × 832 → Inception 5b → 7 × 7 × 1024 → avgpool → 1 × 1 × 1024 → dropout(40%) → fc (1000) → softmax (1000)]

The per-module slides specify the branch widths of each inception module; collected into one table:

Module   Input        #1×1   #3×3 reduce   #3×3   #5×5 reduce   #5×5   pool proj   Output depth
3a       28×28×192      64        96        128       16          32       32          256
3b       28×28×256     128       128        192       32          96       64          480
4a       14×14×480     192        96        208       16          48       64          512
4b       14×14×512     160       112        224       24          64       64          512
4c       14×14×512     128       128        256       24          64       64          512
4d       14×14×512     112       144        288       32          64       64          528
4e       14×14×528     256       160        320       32         128      128          832
5a       7×7×832       256       160        320       32         128      128          832
5b       7×7×832       384       192        384       48         128      128         1024

60/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
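Every output depth in the table is just filter concatenation at work: the four branch widths (#1×1, #3×3, #5×5 and the pool projection) add up along the channel axis. A quick sanity check as a small Python sketch:

# Branch widths (#1x1, #3x3, #5x5, pool projection) copied from the table above;
# the output depth of each inception module is their sum.
modules = {
    "3a": (64, 128, 32, 32),    "3b": (128, 192, 96, 64),
    "4a": (192, 208, 48, 64),   "4b": (160, 224, 64, 64),
    "4c": (128, 256, 64, 64),   "4d": (112, 288, 64, 64),
    "4e": (256, 320, 128, 128), "5a": (256, 320, 128, 128),
    "5b": (384, 384, 128, 128),
}
for name, widths in modules.items():
    print(name, sum(widths))  # 256, 480, 512, 512, 512, 528, 832, 832, 1024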
Important Trick: Got rid of the fully connected layer
Notice that the output of the last layer is 7 × 7 × 1024 dimensional.
What if we were to add a fully connected layer with 1000 nodes (for the 1000 classes) on top of this, after flattening?
We would need W ∈ R^{50176×1000}, i.e., 7 × 7 × 1024 × 1000 ≈ 49M parameters.
Instead they use an average pooling of size 7 × 7 on each of the 1024 feature maps.
This results in a 1024 dimensional output, so the final layer only needs W ∈ R^{1024×1000}.
This significantly reduces the number of parameters.
Overall, GoogLeNet has 12× fewer parameters than AlexNet, but performs 2× more computations.

61/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
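The arithmetic behind this trick, as a small sketch:

# Weights in the final classifier, with and without the average-pooling trick
flatten_fc = 7 * 7 * 1024 * 1000  # flatten + fc: 50,176,000 weights (the ~49M above)
avgpool_fc = 1024 * 1000          # 7 x 7 avgpool + fc: 1,024,000 weights
print(flatten_fc // avgpool_fc)   # 49, i.e., a 49x reduction in this layer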
GoogLeNet
ResNet

62/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
Suppose we have been able to train a shallow neural network well.
Now suppose we construct a deeper network which has a few more layers (shown in orange in the figure).
Intuitively, if the shallow network works well then the deep network should also work well, by simply learning to compute identity functions in the new layers.
Essentially, the solution space of a shallow neural network is a subset of the solution space of a deep neural network.

63/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
But in practice it is observed that this doesn't happen.
Notice that the deeper network has a higher error rate on the test set.

64/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
Consider any two stacked layers in a CNN.
The two layers are essentially learning some function H(x) of the input.
What if we enable them to learn only a residual function of the input?
That is, the stacked layers now compute F(x), and an identity connection carries x around them, so that H(x) = F(x) + x.

[Figure: left — two stacked (layer, relu) blocks mapping x to H(x); right — the same two blocks compute F(x), an identity connection adds x, and the output is H(x) = F(x) + x]

65/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
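A minimal PyTorch sketch of such a residual block (an illustration, not the exact block from the paper: the batch normalization that ResNet inserts after every convolution is omitted, and the two stacked layers are assumed to preserve the input shape so that the identity can be added directly):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    # Two stacked 3 x 3 conv layers learn the residual F(x);
    # the identity connection adds the input back: H(x) = F(x) + x
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))  # first layer + relu
        out = self.conv2(out)        # second layer completes F(x)
        return F.relu(out + x)       # H(x) = F(x) + x, then relu

The addition out + x requires F(x) and x to have the same shape; when the dimensions change between stages, the paper uses a projection on the shortcut instead of the plain identity.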
Why would this help?
Remember our argument that a deeper version of a shallow network would do just fine by learning identity transformations in the new layers.
This identity connection from the input allows a ResNet to retain a copy of the input: to realize an identity mapping, the stacked layers only have to drive F(x) to 0, which is easier than learning H(x) = x from scratch.
Using this idea they were able to train really deep networks.

66/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
ResNet, 152 layers: 1st place in all five main tracks
ImageNet Classification: “Ultra-deep” 152-layer nets
ImageNet Detection: 16% better than the 2nd best system
ImageNet Localization: 27% better than the 2nd best system
COCO Detection: 11% better than the 2nd best system
COCO Segmentation: 12% better than the 2nd best system

67/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
Bag of tricks (ResNet, 152 layers)
Batch Normalization after every CONV layer
Xavier/2 initialization from [He et al.]
SGD + Momentum (0.9)
Learning rate: 0.1, divided by 10 when the validation error plateaus
Mini-batch size 256
Weight decay of 1e-5
No dropout used

68/68
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
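As a sketch, this recipe maps onto standard PyTorch components roughly as follows (the tiny stand-in model and the commented epoch step are placeholders; the data loader is assumed to serve mini-batches of 256):

import torch
import torch.nn as nn

def init_weights(m):
    # "Xavier/2", i.e. He initialization, for every CONV layer
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')

model = nn.Sequential(nn.Conv2d(3, 64, kernel_size=3, padding=1),
                      nn.ReLU())  # stand-in for the 152-layer ResNet
model.apply(init_weights)

# SGD + momentum(0.9), learning rate 0.1, weight decay 1e-5
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-5)
# divides the learning rate by 10 when the monitored quantity
# (here, the validation error) plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                       mode='min', factor=0.1)
# after each training epoch: scheduler.step(current_validation_error)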