Lecture 9a
Overview of ways to improve generalization
Geoffrey Hinton
Nitish Srivastava,
Kevin Swersky
Tijmen Tieleman
Abdel-rahman Mohamed
Reminder: Overfitting
The training data contains information about the regularities in the mapping from input to output. But it also contains sampling error.
There will be accidental regularities just because of the particular training cases that were chosen.
When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
So it fits both kinds of regularity. If the model is very flexible it can model the sampling error really well.
Preventing overfitting
Approach 1: Get more data! Almost always the best bet if you have enough compute power to train on more data.
Approach 2: Use a model that has the right capacity: enough to fit the true regularities, but not enough to also fit spurious regularities (if they are weaker).
Approach 3: Average many different models. Use models with different forms, or train the model on different subsets of the training data (this is called "bagging"; see the sketch below).
Approach 4: (Bayesian) Use a single neural network architecture, but average the predictions made by many different weight vectors.
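As a rough illustration of Approach 3, here is a minimal bagging sketch (my own example, not from the lecture): the same flexible polynomial model is fit to several bootstrap resamples of a toy data set and the predictions are averaged. The toy data, the polynomial degree, and the number of models are arbitrary choices for the demo.

```python
import numpy as np

# Bagging sketch: fit an over-flexible model on bootstrap resamples of the
# training cases, then average the resulting predictions.
rng = np.random.default_rng(0)

# Toy 1-D regression data: a smooth true mapping plus sampling noise.
x = np.linspace(-1.0, 1.0, 30)
t = np.sin(3 * x) + rng.normal(0.0, 0.3, size=x.shape)

def fit_and_predict(x_train, t_train, x_test, degree=7):
    """Fit a polynomial by least squares and predict on x_test."""
    coeffs = np.polyfit(x_train, t_train, degree)
    return np.polyval(coeffs, x_test)

n_models = 20
preds = []
for _ in range(n_models):
    # Each model sees a different bootstrap subset (cases sampled with replacement).
    idx = rng.integers(0, len(x), size=len(x))
    preds.append(fit_and_predict(x[idx], t[idx], x))
bagged = np.mean(preds, axis=0)

single = fit_and_predict(x, t, x)
true_y = np.sin(3 * x)
print("squared error vs true mapping, single fit:", np.mean((single - true_y) ** 2))
print("squared error vs true mapping, bagged    :", np.mean((bagged - true_y) ** 2))
```

Whether the averaged predictions beat the single fit varies from run to run, but averaging typically reduces the variance that comes from fitting the sampling error.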
[Figure: a layered net with inputs connected by weights W1, and weights W2 leading to the outputs]
$$C = E + \frac{\lambda}{2}\sum_i w_i^2$$

$$\frac{\partial C}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda w_i$$

$$\text{when } \frac{\partial C}{\partial w_i} = 0, \qquad w_i = -\frac{1}{\lambda}\,\frac{\partial E}{\partial w_i}$$

[Figure: the cost C plotted against a weight w; a pair of equal weights w/2, w/2]
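The derivative above is easy to verify numerically. Below is a small check (my own sketch, not part of the slides) on a toy linear model with squared error E: the analytic gradient dE/dw_i + λ·w_i is compared against a finite-difference gradient of C. The data, the 0.5 factor inside E, and λ = 0.1 are arbitrary choices.

```python
import numpy as np

# Numeric check that dC/dw_i = dE/dw_i + lambda * w_i for
# C = E + (lambda/2) * sum_i w_i^2, with E = 0.5 * sum_c (x_c . w - t_c)^2.
rng = np.random.default_rng(1)
lam = 0.1                                   # weight-cost coefficient (lambda); arbitrary
x = rng.normal(size=(20, 3))                # 20 cases, 3 inputs
t = rng.normal(size=20)                     # targets
w = rng.normal(size=3)                      # current weights

def cost(w):
    E = 0.5 * np.sum((x @ w - t) ** 2)      # squared error term E
    return E + 0.5 * lam * np.sum(w ** 2)   # plus the L2 weight penalty

# Analytic gradient: dE/dw + lambda * w.
grad_analytic = x.T @ (x @ w - t) + lam * w

# Finite-difference gradient of C, one weight at a time.
eps = 1e-6
grad_numeric = np.array([
    (cost(w + eps * np.eye(3)[i]) - cost(w - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))   # expect True
```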
[Figure: input unit i carries $x_i + N(0,\,\sigma_i^2)$ Gaussian noise; through weight $w_i$ the output unit j receives $y_j + N(0,\, w_i^2\sigma_i^2)$]

Output on one case:
$$y^{\text{noisy}} = \sum_i w_i x_i + \sum_i w_i \varepsilon_i, \qquad \varepsilon_i \sim N(0,\,\sigma_i^2)$$

$$E\left[(y^{\text{noisy}} - t)^2\right] = E\left[\Big(y + \sum_i w_i\varepsilon_i - t\Big)^{\!2}\right] = E\left[\Big((y - t) + \sum_i w_i\varepsilon_i\Big)^{\!2}\right]$$

$$= (y - t)^2 + E\left[2(y - t)\sum_i w_i\varepsilon_i\right] + E\left[\Big(\sum_i w_i\varepsilon_i\Big)^{\!2}\right]$$

$$= (y - t)^2 + E\left[\sum_i w_i^2\varepsilon_i^2\right] \qquad \text{because } \varepsilon_i \text{ is independent of } \varepsilon_j \text{ and } \varepsilon_i \text{ is independent of } (y - t)$$

$$= (y - t)^2 + \sum_i w_i^2\sigma_i^2$$

So $\sigma_i^2$ is equivalent to an L2 penalty.
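A quick Monte Carlo check of this result (again my own sketch): for a linear unit, adding independent Gaussian noise N(0, σ_i²) to each input raises the expected squared error on a case by Σ_i w_i² σ_i². The particular weights, noise levels, input, and target below are arbitrary.

```python
import numpy as np

# Monte Carlo check: for a linear unit y = sum_i w_i * x_i, adding independent
# Gaussian noise N(0, sigma_i^2) to each input raises the expected squared error
# on a case by sum_i w_i^2 * sigma_i^2.
rng = np.random.default_rng(2)
w = np.array([0.5, -1.2, 2.0])              # arbitrary weights
sigma = np.array([0.3, 0.7, 0.1])           # per-input noise standard deviations
x = np.array([1.0, -0.4, 0.8])              # one training case
t = 1.5                                     # its target

y = w @ x                                   # noise-free output
eps = rng.normal(0.0, sigma, size=(200_000, 3))
y_noisy = (x + eps) @ w                     # outputs with noisy inputs

empirical = np.mean((y_noisy - t) ** 2)
predicted = (y - t) ** 2 + np.sum(w ** 2 * sigma ** 2)
print(f"empirical E[(y_noisy - t)^2]     = {empirical:.4f}")
print(f"(y - t)^2 + sum w_i^2 sigma_i^2  = {predicted:.4f}")   # should be close
```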
$$p(s = 1) = \frac{1}{1 + e^{-z}}$$

[Figure: logistic curve of p against the total input, rising from 0 through 0.5 toward 1]
$$P(D) = p^{53}\,(1 - p)^{47}$$

$$\frac{dP(D)}{dp} = 53\,p^{52}(1 - p)^{47} - 47\,p^{53}(1 - p)^{46}$$

$$= \left[\frac{53}{p} - \frac{47}{1 - p}\right]\left[p^{53}(1 - p)^{47}\right]$$

$$= 0 \;\text{ if }\; p = .53$$
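A small numerical cross-check of the calculus (my own sketch): evaluate the log of P(D) = p^53 (1 − p)^47 on a grid of p values and read off the maximizer.

```python
import numpy as np

# Evaluate the log of P(D) = p^53 * (1 - p)^47 on a fine grid of p values
# and read off the maximizer.
p = np.linspace(0.001, 0.999, 9999)
log_lik = 53 * np.log(p) + 47 * np.log(1 - p)
print(p[np.argmax(log_lik)])   # approximately 0.53, agreeing with the derivative above
```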
[Figure: probability density plots over p on the interval from 0 to 1, each normalized so that the area under the curve is 1]
Bayes Theorem

$$p(W \mid D) = \frac{p(W)\,p(D \mid W)}{p(D)}$$

$p(W \mid D)$: posterior probability of weight vector W given training data D
$p(W)$: prior probability of the weight vector W
$p(D \mid W)$: probability of observed data given W
$p(W)\,p(D \mid W)$: the joint probability, a prior times a conditional probability
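To connect Bayes' theorem back to the coin example, here is a tiny grid computation (my own sketch, assuming the 53 heads / 47 tails data and a uniform prior): the posterior over the coin's head probability p is the prior times the likelihood, renormalized so it sums to 1.

```python
import numpy as np

# Grid-based Bayes' rule for the coin example: W is the single parameter p,
# the prior is uniform, and D is 53 heads and 47 tails.
p_grid = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(p_grid) / len(p_grid)            # p(W): uniform prior
likelihood = p_grid ** 53 * (1 - p_grid) ** 47        # p(D | W)
posterior = prior * likelihood                        # proportional to p(W) p(D | W)
posterior /= posterior.sum()                          # divide by p(D) so it sums to 1
print("posterior mode:", p_grid[np.argmax(posterior)])   # ~0.53
print("posterior mean:", np.sum(p_grid * posterior))     # ~0.53, pulled slightly toward 0.5
```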
[Figure: a Gaussian centered at y, the model's estimate of the most probable value, with t, the correct answer, marked]

$$y_c = f(\text{input}_c,\, W)$$

Gaussian distribution centered at the net's output:
$$p(t_c \mid y_c) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(t_c - y_c)^2}{2\sigma^2}}$$

$$-\log p(t_c \mid y_c) = k + \frac{(t_c - y_c)^2}{2\sigma^2}$$

Minimizing squared error is the same as maximizing log prob under a Gaussian.
Because the log function is monotonic, it does not change where the maxima are. So we can maximize sums of log probabilities.
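The equivalence can also be seen numerically. The sketch below (my own, with arbitrary σ, target, and candidate outputs) shows that −log of the Gaussian density differs from (t − y)²/(2σ²) by the same constant k for every output y, so the same y minimizes both.

```python
import numpy as np

# -log of a Gaussian density around y differs from (t - y)^2 / (2 * sigma^2)
# only by a constant that does not depend on y.
sigma = 0.8                                 # assumed output noise level
t = 2.0                                     # the correct answer
y_values = np.linspace(-3.0, 3.0, 7)        # candidate net outputs

neg_log_prob = 0.5 * np.log(2 * np.pi * sigma ** 2) + (t - y_values) ** 2 / (2 * sigma ** 2)
scaled_sq_err = (t - y_values) ** 2 / (2 * sigma ** 2)

print(neg_log_prob - scaled_sq_err)         # the same constant k for every candidate y
```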
Maximizing the posterior is the same as minimizing its negative log:

$$-\log p(W \mid D) = -\log p(W) \;-\; \log p(D \mid W) \;+\; \log p(D)$$

$-\log p(D \mid W)$: log prob of target values given W
$-\log p(W)$: log prob of W under the prior
$\log p(D)$: a constant that does not depend on W
[Figure: the zero-mean Gaussian prior p(w) plotted against a weight w, centered at 0]

$$p(w) = \frac{1}{\sqrt{2\pi}\,\sigma_W}\, e^{-\frac{w^2}{2\sigma_W^2}}$$

$$-\log p(w) = \frac{w^2}{2\sigma_W^2} + k$$
$$-\log p(W \mid D) = \frac{1}{2\sigma_D^2}\sum_c (y_c - t_c)^2 \;+\; \frac{1}{2\sigma_W^2}\sum_i w_i^2 \;+\; \text{constant}$$

Multiplying through by $2\sigma_D^2$ gives the usual weight-decay cost:

$$C = E + \frac{\sigma_D^2}{\sigma_W^2}\sum_i w_i^2, \qquad \text{where } E = \sum_c (y_c - t_c)^2$$
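As a closing numeric sketch (my own, on an arbitrary toy linear model with assumed σ_D = 0.5 and σ_W = 2.0): the negative log posterior, scaled by 2σ_D² and with the constant dropped, matches the squared error plus a weight penalty of strength σ_D²/σ_W², i.e. exactly the weight-decay cost above.

```python
import numpy as np

# MAP cost for a tiny linear model: Gaussian output noise with std sigma_D and a
# zero-mean Gaussian prior on the weights with std sigma_W. Scaling -log p(W|D)
# by 2*sigma_D^2 (and dropping the constant) gives squared error plus
# (sigma_D^2 / sigma_W^2) * sum_i w_i^2 -- the weight-decay cost above.
rng = np.random.default_rng(3)
sigma_D, sigma_W = 0.5, 2.0                 # assumed noise and prior widths
x = rng.normal(size=(10, 2))                # toy inputs
t = rng.normal(size=10)                     # toy targets
w = rng.normal(size=2)                      # some weight vector

y = x @ w
neg_log_posterior = (np.sum((y - t) ** 2) / (2 * sigma_D ** 2)
                     + np.sum(w ** 2) / (2 * sigma_W ** 2))    # + constant, dropped

E = np.sum((y - t) ** 2)
weight_decay_cost = E + (sigma_D ** 2 / sigma_W ** 2) * np.sum(w ** 2)

print(np.isclose(2 * sigma_D ** 2 * neg_log_posterior, weight_decay_cost))   # True
```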