
Bayesian Deep Learning

Yarin Gal
Research Fellow, University of Cambridge
Research Fellow, The Alan Turing Institute
yg279@cam.ac.uk

Unless specified otherwise, photos are either original work or taken from Wikimedia, under Creative Commons license
Pillar I: Deep learning

Conceptually simple models
Data: X = {x1, x2, ..., xN}, Y = {y1, y2, ..., yN}
Model: given weight matrices W1, W2 and a non-linear function σ(·), define the "network"

      ỹi(xi) = W2 · σ(W1 xi)

Objective: find W for which ỹi(xi) is close to yi for all i ≤ N.

Deep learning is awesome...
• Simple and modular
• Huge attention from practitioners and engineers
• Great software tools
• Scales with data and compute
• Real-world impact

...but has many issues
• What does a model not know? No uncertainty!
• Uninterpretable black boxes
• Easily fooled (AI safety)
• Lacks solid mathematical foundations (mostly ad hoc)
• Crucially relies on big data

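To make the setup concrete, here is a minimal numpy sketch of the two-layer model above; the sizes, the random initialisation, and the choice of ReLU for σ are illustrative assumptions, not taken from the slides.

import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

D, K = 8, 64                                  # input dimension, hidden units (illustrative)
rng = np.random.default_rng(0)
W1 = 0.1 * rng.standard_normal((K, D))
W2 = 0.1 * rng.standard_normal((1, K))

def network(x):
    # y~(x) = W2 · sigma(W1 x), with sigma taken to be a ReLU here
    return W2 @ relu(W1 @ x)

print(network(rng.standard_normal(D)))        # one (untrained) prediction
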
Why should I care about uncertainty?
• We need a way to tell what our model knows and what it does not.
• We train a model to recognise dog breeds
• ...and are given a cat to classify.
• What would you want your model to do?
• Similar problems arise in decision making, physics, the life sciences, etc.
• Uncertainty gives insight into the black box when it fails: where am I not certain?
• Uncertainty might even be useful to identify when the model is being attacked with adversarial examples!
• Lastly, we need less data if we label only where the model is uncertain: less wear-and-tear in robotics, less expert time in medical analysis.

Pillar II: Bayes

The language of uncertainty
• Probability theory
• Specifically Bayesian probability theory (1750!)

When applied to Information Engineering...
• Bayesian modelling
• Built on solid mathematical foundations
• Orthogonal to deep learning...

A simple way to tie the two pillars together
• "Dropout": a popular method in deep learning, cited hundreds and hundreds of times
• Works by randomly setting network units to zero (sketched below)
• This somehow improves performance and reduces over-fitting
• Used in almost all modern deep learning models

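As a rough sketch of the mechanism only (not of dropout training itself), randomly setting units to zero amounts to multiplying a layer's activations by a Bernoulli mask; the activation values and dropout probability below are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(10)                   # some layer's activations
p_drop = 0.5                                  # dropout probability (illustrative)
mask = rng.binomial(1, 1.0 - p_drop, size=h.shape)
h_dropped = h * mask                          # units where mask == 0 are "dropped"
print(h_dropped)

Common implementations also rescale the surviving units by 1 / (1 − p_drop) so that the expected activation is unchanged.
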
A simple way to tie the two pillars together
• It can be shown that dropout training is identical to approximate inference in Bayesian modelling [Gal, 2016], connecting deep learning to Bayesian probability theory.
• This mathematically grounded connection gives a treasure trove of new research opportunities:
  • uncertainty in deep learning, e.g. interpretability and AI safety
  • principled extensions to deep learning
  • enabling deep learning in small-data domains

More in a second. First, some theory.

Some theory

From Bayesian neural networks to Dropout
• Place a prior distribution p(W) on the weights, making them random variables.
• Given a dataset X, Y, the random variable W has a posterior: p(W | X, Y)
• This posterior is difficult to evaluate; many great researchers have tried.
• We can instead define a simple distribution qM(·) and approximate

      qM(W) ≈ p(W | X, Y)

  [Figure: an approximating distribution qθ(W) being fitted, step by step, to the posterior p(W | X, Y).]
• This is called approximate variational inference.

Some theory

Theorem (Dropout as approximate variational inference)
Define qM(W) := M · diag(Bernoulli) with variational parameter M. The optimisation objective of (stochastic) variational inference with qM(W) is identical to the objective of a dropout neural network.

Proof. See Gal [2016].

Implementing inference with qM(W) = implementing dropout training, line to line.

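As an illustration of what a draw from this approximating distribution looks like, here is a sketch under the simplest reading of the theorem, with an assumed dropout probability; the exact parameterisation is given in Gal [2016].

import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((64, 8))              # variational parameter: a weight matrix
p_drop = 0.5                                  # dropout probability (assumed)

def sample_W(M):
    # one draw W ~ q_M(W): W = M · diag(z), with z_i ~ Bernoulli(1 - p_drop)
    z = rng.binomial(1, 1.0 - p_drop, size=M.shape[1])
    return M * z                              # zeroes whole columns of M, i.e. drops input units
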
Corollary (Model uncertainty with dropout)
Given p(y* | f^W(x*)) = N(y*; f^W(x*), τ⁻¹ I) for some τ > 0, the model's predictive variance can be estimated with the unbiased estimator

      Var~[y*] := τ⁻¹ I + (1/T) Σ_{t=1}^{T} f^{Ŵt}(x*)ᵀ f^{Ŵt}(x*) − Ẽ[y*]ᵀ Ẽ[y*]

with Ŵt ∼ q*_M(W) (the optimised approximating distribution) and Ẽ[y*] := (1/T) Σ_t f^{Ŵt}(x*) the dropout predictive mean.

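A minimal sketch of this estimator with Monte Carlo dropout, assuming a model.output(x, dropout=True) call as in the Python snippet on the next slide, an assumed precision tau, and taking the element-wise (diagonal) version of the formula.

import numpy as np

def mc_dropout_predict(model, x, T=100, tau=1.0):
    # T stochastic forward passes, with dropout kept switched on
    samples = np.array([model.output(x, dropout=True) for _ in range(T)])
    mean = samples.mean(axis=0)                                  # E~[y*]
    var = 1.0 / tau + (samples ** 2).mean(axis=0) - mean ** 2    # Var~[y*], element-wise
    return mean, var
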
Some code, just for fun

In practical terms¹, given a point x:
• drop units at test time
• repeat 10 times
• and look at the mean and sample variance.
• Or in Python:

import numpy

y = []
for _ in range(10):
    # stochastic forward pass with dropout kept on at test time
    y.append(model.output(x, dropout=True))
y_mean = numpy.mean(y)
y_var = numpy.var(y)

¹ A friendly introduction is given at yarin.co/blog

Example uncertainty in deep learning

What would be the CO2 concentration level in Mauna Loa, Hawaii, in 20 years' time?

[Figures: predictions on the Mauna Loa CO2 data with normal dropout, and with the same network from the Bayesian perspective.]

What can we do with this?
• Interpretability & AI safety
• Principled deep learning extensions
• Deep learning in small data domains
• Cancer diagnosis

Active Learning with image data

Active learning of images [Gal, Islam & Ghahramani, 2017]


E.g. diagnose melanoma with a handful of images.

Active Learning acquisition functions

Choose the x* that maximises an acquisition function a(x):

      x* = argmax_{x ∈ D_pool} a(x)

E.g. points that maximise uncertainty. But which uncertainty?
• Aleatoric uncertainty captures noise inherent in the data
• Epistemic uncertainty captures the model's lack of knowledge
• Predictive uncertainty captures the sum of the two

Figures adapted from Hanna M. Wallach (Cambridge, UMass Amherst)

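Putting this into a pool-based loop looks roughly like the sketch below; train, acquisition, and label are placeholder callables standing in for model fitting, the chosen a(x), and the human oracle, not an API from the paper.

import numpy as np

def active_learning_loop(train, acquisition, label, D_train, D_pool, n_steps=10):
    # Repeatedly acquire the pool point that maximises the acquisition function
    for _ in range(n_steps):
        model = train(D_train)                              # fit on the labelled data so far
        scores = [acquisition(model, x) for x in D_pool]    # a(x) for every pool point
        i_star = int(np.argmax(scores))                     # x* = argmax_{x in D_pool} a(x)
        x_star = D_pool.pop(i_star)
        D_train.append((x_star, label(x_star)))             # query the oracle for a label
    return train(D_train)
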
Acquisition functions for classification

Choose the x* that maximises an acquisition function a(x):

      x* = argmax_{x ∈ D_pool} a(x)

Possible measures of uncertainty in classification (sketched in numpy below):
• Predictive entropy H[y | x, D_train]:

      a_PE(x) = − Σ_c p(y = c | x, D_train) log p(y = c | x, D_train)

• Information gained about the model parameters, I[y, W | x, D_train] (the mutual information, "BALD"):

      a_MI(x) = H[y | x, D_train] − E_{p(W | D_train)} [ H[y | x, W] ]

• Variation ratios:

      a_VR(x) = 1 − max_y p(y | x, D_train)

• Random acquisition (baseline): a_U(x) = unif()

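A minimal numpy sketch of the first three scores, assuming probs is a T × C array of class-probability vectors from T stochastic (dropout) forward passes for a single pool point x.

import numpy as np

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=-1)

def acquisition_scores(probs):
    # probs: shape (T, C), one probability vector per stochastic forward pass
    p_mean = probs.mean(axis=0)              # approximates p(y | x, D_train)
    a_pe = entropy(p_mean)                   # predictive entropy
    a_mi = a_pe - entropy(probs).mean()      # mutual information (BALD)
    a_vr = 1.0 - p_mean.max()                # variation ratios
    return a_pe, a_mi, a_vr
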
Acquisition functions intuition

Want to classify dogs vs. cats given image x, with models M1, M2, M3.
• Stochastic forward passes give probability vectors for each model:
  1. (1, 0), ..., (1, 0)
  2. (0.5, 0.5), ..., (0.5, 0.5), and
  3. (1, 0), (0, 1), (1, 0), ..., (0, 1)

What's the epistemic uncertainty? Models M1 and M2 are confident about the output; model M3 is uncertain.
What's the predictive uncertainty? M1 has low uncertainty; M2 and M3 have high uncertainty.

Acquisition functions intuition:
• M1: all acquisition functions give low uncertainty
• M2: variation ratios and predictive entropy give high uncertainty; mutual information gives low uncertainty
• M3: all acquisition functions give high uncertainty

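A quick, self-contained numerical check of this intuition (with T = 4 made-up forward passes per model):

import numpy as np

def H(p):
    # entropy of a single probability vector
    return -np.sum(p * np.log(p + 1e-12))

cases = {
    "M1": np.array([[1.0, 0.0]] * 4),                  # always (1, 0)
    "M2": np.array([[0.5, 0.5]] * 4),                  # always (0.5, 0.5)
    "M3": np.array([[1.0, 0.0], [0.0, 1.0]] * 2),      # alternates (1, 0) and (0, 1)
}
for name, probs in cases.items():
    p_mean = probs.mean(axis=0)
    pe = H(p_mean)                                     # predictive entropy
    mi = pe - np.mean([H(p) for p in probs])           # mutual information
    vr = 1.0 - p_mean.max()                            # variation ratios
    print(name, round(pe, 2), round(mi, 2), round(vr, 2))
# M1: all three near 0; M2: high entropy and variation ratios but near-zero mutual
# information; M3: all three high.
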
MNIST experiments
Test accuracy as a function of number of acquired images (up to 1K):
[Figure: test accuracy (80–100%) vs. number of acquired images (0–1000) for BALD, Var Ratios, and Max Entropy, each using both a Bayesian CNN (red) and a deterministic CNN (blue).]
Number of acquired images needed to reach a given test error:

  Test error   BALD   Var Ratios   Max Entropy   Random
  10%          145    120          165           255
  5%           335    295          355           835

Active learning vs. semi-supervised learning

Test error on MNIST with 1000 labelled training samples, for active learning (using a simple LeNet) vs. semi-supervised techniques:

  Technique                                    Test error
  Semi-supervised:
    SS Embedding (Weston et al., 2012)         5.73%
    DGN (Kingma et al., 2014)                  2.40%
    Γ Ladder Network (Rasmus et al., 2015)     1.53%
    Virtual Adversarial (Miyato et al., 2015)  1.32%
  Active learning with various acquisitions:
    Random                                     4.66%
    BALD                                       1.80%
    Max Entropy                                1.74%
    Var Ratios                                 1.64%

Medical analysis with Active Learning

Active learning of images [Gal, Islam & Ghahramani, 2017]

E.g. diagnose melanoma with a handful of images:

[Figure: melanoma diagnosis AUC (0.60–0.74) vs. acquisition steps (0–4), comparing BALD with uniform (random) acquisition.]
[Figure: number of positive examples acquired (20–70) vs. acquisition steps (0–4), comparing BALD with uniform (random) acquisition.]

New horizons

Most exciting is the work to come:
• What is interesting data to label? (where the model is uncertain)
• Active learning in real-world medical applications

and much, much more.

Thank you for listening.

References
• Y Gal, R Turner, "Improving the Gaussian process sparse spectrum approximation by representing uncertainty in frequency inputs", ICML (2015)
• Y Gal, Z Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning", ICML (2016)
• Y Gal, Z Ghahramani, "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks", NIPS (2016)
• Y Gal, R McAllister, C Rasmussen, "Improving PILCO with Bayesian Neural Network Dynamics Models", DEML workshop, ICML (2016)
• Y Gal, R Islam, Z Ghahramani, "Deep Bayesian Active Learning with Image Data", ICML (2017)
• Y Li, Y Gal, "Dropout Inference in Bayesian Neural Networks with Alpha-divergences", ICML (2017)
• A Kendall, Y Gal, "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?", arXiv preprint arXiv:1703.04977 (2017)
• A Shah, Y Gal, "Invertible Transformations for Bayesian Neural Network Inference" (2017)
• and more...
