Yarin Gal
Research Fellow, University of Cambridge
Research Fellow, The Alan Turing Institute
yg279@cam.ac.uk
Unless specified otherwise, photos are either original work or taken from Wikimedia, under a Creative Commons license
Pillar I: Deep learning
Conceptually simple models
Data: X = {x_1, x_2, ..., x_N}, Y = {y_1, y_2, ..., y_N}
Model: given weight matrices W_1, W_2 and a non-linear function σ(·), define the “network”

    ỹ_i(x_i) = W_2 · σ(W_1 x_i)
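As a minimal sketch (not from the slides), this model is a few lines of numpy; the ReLU choice for σ and the layer shapes are illustrative assumptions:

    import numpy as np

    def network(x, W1, W2, sigma=lambda a: np.maximum(a, 0.0)):
        # y~(x) = W2 · σ(W1 x); σ is ReLU here, one common choice
        return W2 @ sigma(W1 @ x)

    # illustrative shapes: 4-dimensional input, 8 hidden units, 1 output
    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(1, 8))
    y = network(rng.normal(size=4), W1, W2)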
No uncertainty!
Pros:
• Huge attention from practitioners and engineers
• Great software tools
• Scales with data and compute
• Real-world impact

Cons:
• Uninterpretable black-boxes
• Easily fooled (AI safety)
• Lacks solid mathematical foundations (mostly ad hoc)
• Crucially relies on big data
Why should I care about uncertainty?
• We need a way to tell what our model knows and what it does not.
• Example: we train a model to recognise dog breeds
Pillar II: Bayes
The language of uncertainty
• Probability theory
• Specifically, Bayesian probability theory (1750!)

When applied to Information Engineering...
• Bayesian modelling
A simple way to tie the two pillars together
• “Dropout”: a popular method in deep learning, cited hundreds and hundreds of times
• It can be shown that dropout training is identical to approximate inference in Bayesian modelling [Gal, 2016].
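As a sketch of the correspondence, following the form of the result in Gal [2016] (the regulariser weight λ is schematic): the standard dropout objective

\[
\mathcal{L}_{\text{dropout}} = \frac{1}{N}\sum_{i=1}^{N} E\big(y_i,\, \tilde{y}_i(x_i)\big) + \lambda \sum_{l=1}^{2} \lVert W_l \rVert^2
\]

with dropout applied when evaluating ỹ_i, can be shown to minimise (up to constants) the KL divergence between an approximating distribution q(W) and the posterior p(W | X, Y) of a corresponding Bayesian network.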
Some theory
• Given a dataset X, Y, the random variable W has a posterior: p(W | X, Y)
[Figure: a sequence of approximating distributions q_θ1(W), ..., q_θ4(W) fitted increasingly closely to the posterior p(W | X, Y)]
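In symbols, a standard way to write what these successive fits are optimising, assuming the usual variational-inference setup with variational parameters θ:

\[
\theta^{*} = \arg\min_{\theta} \; \mathrm{KL}\big(q_{\theta}(W) \,\|\, p(W \mid X, Y)\big)
\]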
Some theory
Claim: dropout training in a deep network is identical to approximate inference in a corresponding Bayesian neural network.

Proof.
See Gal [2016].
Some theory
Approximate the predictive distribution by averaging T stochastic forward passes through the network, with Ŵ_t ∼ q*_M(W).
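In the notation above, this Monte Carlo estimator can be sketched as (following Gal [2016]):

\[
\widetilde{\mathbb{E}}[y^{*}] = \frac{1}{T} \sum_{t=1}^{T} \tilde{y}\big(x^{*}, \widehat{W}_t\big), \qquad \widehat{W}_t \sim q^{*}_{M}(W)
\]

The code on the next slide implements this with T = 10.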
Some code, just for fun
• Repeat 10 times
• Or in Python:
import numpy

y = []
for _ in range(10):  # ten stochastic forward passes, with dropout left on
    y.append(model.output(x, dropout=True))
y_mean = numpy.mean(y, axis=0)  # predictive mean
y_var = numpy.var(y, axis=0)    # predictive uncertainty
Friendly introduction given at yarin.co/blog
Example uncertainty in deep learning
What will the CO2 concentration level at Mauna Loa, Hawaii, be in 20 years’ time?
• Normal dropout: [extrapolation figure]
Active Learning with image data
Active Learning acquisition functions
Choose the point x∗ that maximises an acquisition function a(x):

    x∗ = argmax_{x ∈ D_pool} a(x)
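A minimal sketch of how this argmax is used in a pool-based loop; model, pool_x, and the acquisition function a are placeholders, not an API from the slides:

    import numpy as np

    def acquire(model, pool_x, a):
        # score every unlabelled pool point and return the index of the best one
        scores = np.array([a(model, x) for x in pool_x])
        return int(np.argmax(scores))

    # typical loop: label the chosen point, move it from the pool to the
    # training set, retrain the model, and repeat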
Acquisition functions intuition
We want to classify dogs vs. cats given an image x, using models M1, M2, M3
• Stochastic forward passes give probability vectors for each model:
1. (1, 0), ..., (1, 0)
2. (0.5, 0.5), ..., (0.5, 0.5), and
3. (1, 0), (0, 1), (1, 0), ..., (0, 1)
M1 is confident on every pass, and M2's passes all agree on an uncertain answer; only M3's passes disagree with one another. That disagreement is what an acquisition function such as BALD rewards, as the sketch below makes concrete.
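A minimal numpy sketch of the BALD score (the mutual information between predictions and model weights) evaluated on the three cases above; the function name and the eps constant are illustrative, not from the slides:

    import numpy as np

    def bald_score(probs, eps=1e-12):
        # probs: (T, num_classes), one softmax vector per stochastic forward pass
        mean_p = probs.mean(axis=0)
        # entropy of the averaged prediction: total predictive uncertainty
        entropy_of_mean = -np.sum(mean_p * np.log(mean_p + eps))
        # average per-pass entropy: the uncertainty the passes agree on
        mean_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
        return entropy_of_mean - mean_entropy

    m1 = np.tile([1.0, 0.0], (10, 1))               # case 1: always confident
    m2 = np.tile([0.5, 0.5], (10, 1))               # case 2: always uncertain
    m3 = np.tile([[1.0, 0.0], [0.0, 1.0]], (5, 1))  # case 3: passes disagree
    print(bald_score(m1), bald_score(m2), bald_score(m3))  # ~0, ~0, ~log 2

Only case 3 scores highly: its total predictive uncertainty is large while each individual pass is confident, so the passes carry information about the weights.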
Medical analysis with Active Learning

[Figure: test AUC over acquisition steps, BALD vs. uniform acquisition]
[Figure: number of positive examples acquired over acquisition steps, BALD vs. uniform acquisition]
New horizons
References
• Y Gal, R Turner, “Improving the Gaussian process sparse spectrum approximation by representing uncertainty in frequency inputs”, ICML (2015)
• Y Gal, Z Ghahramani, “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning”, ICML (2016)
• Y Gal, Z Ghahramani, “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks”, NIPS (2016)
• Y Gal, R McAllister, C Rasmussen, “Improving PILCO with Bayesian Neural Network Dynamics Models”, DEML workshop, ICML (2016)
• Y Gal, R Islam, Z Ghahramani, “Deep Bayesian Active Learning with Image Data”, ICML (2017)
• Y Li, Y Gal, “Dropout Inference in Bayesian Neural Networks with Alpha-divergences”, ICML (2017)
• A Kendall, Y Gal, “What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?”, arXiv preprint, arXiv:1703.04977 (2017)
• A Shah, Y Gal, “Invertible Transformations for Bayesian Neural Network Inference” (2017)
• and more...