
Bayesian Deep Learning

Yarin Gal
Research Fellow, University of Cambridge
Research Fellow, The Alan Turing Institute
yg279@cam.ac.uk

Unless specified otherwise, photos are either original work or taken from Wikimedia, under Creative Commons license
Pillar I: Deep learning

Conceptually simple models
Data: X = {x1, x2, ..., xN}, Y = {y1, y2, ..., yN}
Model: given weight matrices W1, W2 and a non-linear function σ(·), define the "network"

      ỹi(xi) = W2 · σ(W1 xi)

Objective: find W for which ỹi(xi) is close to yi for all i ≤ N.

Deep learning is awesome...
• Simple and modular
• Huge attention from practitioners and engineers
• Great software tools
• Scales with data and compute
• Real-world impact

...but has many issues
• What does a model not know? No uncertainty!
• Uninterpretable black boxes
• Easily fooled (AI safety)
• Lacks solid mathematical foundations (mostly ad hoc)
• Crucially relies on big data

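To make the setup concrete, here is a minimal numpy sketch of the two-layer model above; the sizes, the random initialisation, and the choice of ReLU for σ are illustrative assumptions, not taken from the slides.

import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

D, K = 8, 64                                  # input dimension, hidden units (illustrative)
rng = np.random.default_rng(0)
W1 = 0.1 * rng.standard_normal((K, D))
W2 = 0.1 * rng.standard_normal((1, K))

def network(x):
    # y~(x) = W2 · sigma(W1 x), with sigma taken to be a ReLU here
    return W2 @ relu(W1 @ x)

print(network(rng.standard_normal(D)))        # one (untrained) prediction
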
Why should I care about uncertainty?
• We need a way to tell what our model knows and what it does not.
• We train a model to recognise dog breeds
• ...and are given a cat to classify.
• What would you want your model to do?
• Similar problems arise in decision making, physics, the life sciences, etc.
• Uncertainty gives insight into the black box when it fails: where am I not certain?
• Uncertainty might even be useful to identify when the model is being attacked with adversarial examples!
• Lastly, we need less data if we label only where the model is uncertain: less wear-and-tear in robotics, less expert time in medical analysis.

Pillar II: Bayes

The language of uncertainty
• Probability theory
• Specifically Bayesian probability theory (1750!)

When applied to Information Engineering...
• Bayesian modelling
• Built on solid mathematical foundations
• Orthogonal to deep learning...

A simple way to tie the two pillars together
• "Dropout": a popular method in deep learning, cited hundreds and hundreds of times
• Works by randomly setting network units to zero (sketched below)
• This somehow improves performance and reduces over-fitting
• Used in almost all modern deep learning models

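As a rough sketch of the mechanism only (not of dropout training itself), randomly setting units to zero amounts to multiplying a layer's activations by a Bernoulli mask; the activation values and dropout probability below are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(10)                   # some layer's activations
p_drop = 0.5                                  # dropout probability (illustrative)
mask = rng.binomial(1, 1.0 - p_drop, size=h.shape)
h_dropped = h * mask                          # units where mask == 0 are "dropped"
print(h_dropped)

Common implementations also rescale the surviving units by 1 / (1 − p_drop) so that the expected activation is unchanged.
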
A simple way to tie the two pillars together
• It can be shown that dropout training is identical to approximate inference in Bayesian modelling [Gal, 2016], connecting deep learning to Bayesian probability theory.
• This mathematically grounded connection gives a treasure trove of new research opportunities:
  • uncertainty in deep learning, e.g. interpretability and AI safety
  • principled extensions to deep learning
  • enabling deep learning in small-data domains

More in a second. First, some theory.

Some theory

From Bayesian neural networks to Dropout
• Place a prior distribution p(W) on the weights, making them random variables.
• Given a dataset X, Y, the random variable W has a posterior: p(W | X, Y)
• This posterior is difficult to evaluate; many great researchers have tried.
• We can instead define a simple distribution qM(·) and approximate

      qM(W) ≈ p(W | X, Y)

  [Figure: an approximating distribution qθ(W) being fitted, step by step, to the posterior p(W | X, Y).]
• This is called approximate variational inference.

Some theory

Theorem (Dropout as approximate variational inference)
Define qM(W) := M · diag(Bernoulli) with variational parameter M. The optimisation objective of (stochastic) variational inference with qM(W) is identical to the objective of a dropout neural network.

Proof. See Gal [2016].

Implementing inference with qM(W) = implementing dropout training, line to line.

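As an illustration of what a draw from this approximating distribution looks like, here is a sketch under the simplest reading of the theorem, with an assumed dropout probability; the exact parameterisation is given in Gal [2016].

import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((64, 8))              # variational parameter: a weight matrix
p_drop = 0.5                                  # dropout probability (assumed)

def sample_W(M):
    # one draw W ~ q_M(W): W = M · diag(z), with z_i ~ Bernoulli(1 - p_drop)
    z = rng.binomial(1, 1.0 - p_drop, size=M.shape[1])
    return M * z                              # zeroes whole columns of M, i.e. drops input units
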
Corollary (Model uncertainty with dropout)
Given p(y* | f^W(x*)) = N(y*; f^W(x*), τ⁻¹ I) for some τ > 0, the model's predictive variance can be estimated with the unbiased estimator

      Var~[y*] := τ⁻¹ I + (1/T) Σ_{t=1}^{T} f^{Ŵt}(x*)ᵀ f^{Ŵt}(x*) − Ẽ[y*]ᵀ Ẽ[y*]

with Ŵt ∼ q*_M(W) (the optimised approximating distribution) and Ẽ[y*] := (1/T) Σ_t f^{Ŵt}(x*) the dropout predictive mean.

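A minimal sketch of this estimator with Monte Carlo dropout, assuming a model.output(x, dropout=True) call as in the Python snippet on the next slide, an assumed precision tau, and taking the element-wise (diagonal) version of the formula.

import numpy as np

def mc_dropout_predict(model, x, T=100, tau=1.0):
    # T stochastic forward passes, with dropout kept switched on
    samples = np.array([model.output(x, dropout=True) for _ in range(T)])
    mean = samples.mean(axis=0)                                  # E~[y*]
    var = 1.0 / tau + (samples ** 2).mean(axis=0) - mean ** 2    # Var~[y*], element-wise
    return mean, var
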
Some code, just for fun

In practical terms¹, given a point x:
• drop units at test time
• repeat 10 times
• and look at the mean and sample variance.
• Or in Python:

import numpy

y = []
for _ in range(10):
    # stochastic forward pass with dropout kept on at test time
    y.append(model.output(x, dropout=True))
y_mean = numpy.mean(y)
y_var = numpy.var(y)

¹ A friendly introduction is given at yarin.co/blog

Example uncertainty in deep learning

What would be the CO2 concentration level in Mauna Loa, Hawaii, in 20 years' time?

[Figures: predictions on the Mauna Loa CO2 data with normal dropout, and with the same network from the Bayesian perspective.]

What can we do with this?
• Interpretability & AI safety
• Principled deep learning extensions
• Deep learning in small data domains
• Cancer diagnosis

Active Learning with image data

Active learning of images [Gal, Islam & Ghahramani, 2017]


E.g. diagnose melanoma with a handful of images.

Active Learning acquisition functions

Choose the x* that maximises an acquisition function a(x):

      x* = argmax_{x ∈ D_pool} a(x)

E.g. points that maximise uncertainty. But which uncertainty?
• Aleatoric uncertainty captures noise inherent in the data
• Epistemic uncertainty captures the model's lack of knowledge
• Predictive uncertainty captures the sum of the two

Figures adapted from Hanna M. Wallach (Cambridge, UMass Amherst)

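Putting this into a pool-based loop looks roughly like the sketch below; train, acquisition, and label are placeholder callables standing in for model fitting, the chosen a(x), and the human oracle, not an API from the paper.

import numpy as np

def active_learning_loop(train, acquisition, label, D_train, D_pool, n_steps=10):
    # Repeatedly acquire the pool point that maximises the acquisition function
    for _ in range(n_steps):
        model = train(D_train)                              # fit on the labelled data so far
        scores = [acquisition(model, x) for x in D_pool]    # a(x) for every pool point
        i_star = int(np.argmax(scores))                     # x* = argmax_{x in D_pool} a(x)
        x_star = D_pool.pop(i_star)
        D_train.append((x_star, label(x_star)))             # query the oracle for a label
    return train(D_train)
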
Acquisition functions for classification

Choose the x* that maximises an acquisition function a(x):

      x* = argmax_{x ∈ D_pool} a(x)

Possible measures of uncertainty in classification (sketched in numpy below):
• Predictive entropy H[y | x, D_train]:

      a_PE(x) = − Σ_c p(y = c | x, D_train) log p(y = c | x, D_train)

• Information gained about the model parameters, I[y, W | x, D_train] (the mutual information, "BALD"):

      a_MI(x) = H[y | x, D_train] − E_{p(W | D_train)} [ H[y | x, W] ]

• Variation ratios:

      a_VR(x) = 1 − max_y p(y | x, D_train)

• Random acquisition (baseline): a_U(x) = unif()

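A minimal numpy sketch of the first three scores, assuming probs is a T × C array of class-probability vectors from T stochastic (dropout) forward passes for a single pool point x.

import numpy as np

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=-1)

def acquisition_scores(probs):
    # probs: shape (T, C), one probability vector per stochastic forward pass
    p_mean = probs.mean(axis=0)              # approximates p(y | x, D_train)
    a_pe = entropy(p_mean)                   # predictive entropy
    a_mi = a_pe - entropy(probs).mean()      # mutual information (BALD)
    a_vr = 1.0 - p_mean.max()                # variation ratios
    return a_pe, a_mi, a_vr
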
Acquisition functions intuition

Want to classify dogs vs. cats given image x, with models M1, M2, M3.
• Stochastic forward passes give probability vectors for each model:
  1. (1, 0), ..., (1, 0)
  2. (0.5, 0.5), ..., (0.5, 0.5), and
  3. (1, 0), (0, 1), (1, 0), ..., (0, 1)

What's the epistemic uncertainty? Models M1 and M2 are confident about the output; model M3 is uncertain.
What's the predictive uncertainty? M1 has low uncertainty; M2 and M3 have high uncertainty.

Acquisition functions intuition:
• M1: all acquisition functions give low uncertainty
• M2: variation ratios and predictive entropy give high uncertainty; mutual information gives low uncertainty
• M3: all acquisition functions give high uncertainty

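A quick, self-contained numerical check of this intuition (with T = 4 made-up forward passes per model):

import numpy as np

def H(p):
    # entropy of a single probability vector
    return -np.sum(p * np.log(p + 1e-12))

cases = {
    "M1": np.array([[1.0, 0.0]] * 4),                  # always (1, 0)
    "M2": np.array([[0.5, 0.5]] * 4),                  # always (0.5, 0.5)
    "M3": np.array([[1.0, 0.0], [0.0, 1.0]] * 2),      # alternates (1, 0) and (0, 1)
}
for name, probs in cases.items():
    p_mean = probs.mean(axis=0)
    pe = H(p_mean)                                     # predictive entropy
    mi = pe - np.mean([H(p) for p in probs])           # mutual information
    vr = 1.0 - p_mean.max()                            # variation ratios
    print(name, round(pe, 2), round(mi, 2), round(vr, 2))
# M1: all three near 0; M2: high entropy and variation ratios but near-zero mutual
# information; M3: all three high.
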
MNIST experiments
Test accuracy as a function of number of acquired images (up to 1K):
[Figure: test accuracy (80–100%) vs. number of acquired images (0–1000) for BALD, Var Ratios, and Max Entropy, each using both a Bayesian CNN (red) and a deterministic CNN (blue).]
Number of acquired images needed to reach a given test error:

  Test error   BALD   Var Ratios   Max Entropy   Random
  10%          145    120          165           255
  5%           335    295          355           835

Active learning vs. semi-supervised learning

Test error on MNIST with 1000 labelled training samples, for active learning (using a simple LeNet) vs. semi-supervised techniques:

  Technique                                    Test error
  Semi-supervised:
    SS Embedding (Weston et al., 2012)         5.73%
    DGN (Kingma et al., 2014)                  2.40%
    Γ Ladder Network (Rasmus et al., 2015)     1.53%
    Virtual Adversarial (Miyato et al., 2015)  1.32%
  Active learning with various acquisitions:
    Random                                     4.66%
    BALD                                       1.80%
    Max Entropy                                1.74%
    Var Ratios                                 1.64%

Medical analysis with Active Learning

Active learning of images [Gal, Islam & Ghahramani, 2017]

E.g. diagnose melanoma with a handful of images:

[Figure: melanoma diagnosis AUC (0.60–0.74) vs. acquisition steps (0–4), comparing BALD with uniform (random) acquisition.]
[Figure: number of positive examples acquired (20–70) vs. acquisition steps (0–4), comparing BALD with uniform (random) acquisition.]

New horizons

Most exciting is the work to come:
• What is interesting data to label? (where the model is uncertain)
• Active learning in real-world medical applications

and much, much more.

Thank you for listening.

References
• Y Gal, R Turner, "Improving the Gaussian process sparse spectrum approximation by representing uncertainty in frequency inputs", ICML (2015)
• Y Gal, Z Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning", ICML (2016)
• Y Gal, Z Ghahramani, "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks", NIPS (2016)
• Y Gal, R McAllister, C Rasmussen, "Improving PILCO with Bayesian Neural Network Dynamics Models", DEML workshop, ICML (2016)
• Y Gal, R Islam, Z Ghahramani, "Deep Bayesian Active Learning with Image Data", ICML (2017)
• Y Li, Y Gal, "Dropout Inference in Bayesian Neural Networks with Alpha-divergences", ICML (2017)
• A Kendall, Y Gal, "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?", arXiv preprint arXiv:1703.04977 (2017)
• A Shah, Y Gal, "Invertible Transformations for Bayesian Neural Network Inference" (2017)
• and more...
