
FALL 2017

COURSE ADMIN

TERM TIMELINE

(Timeline: term start Sep 5/6, tutorials, midterm exam, final project due Dec 7/11.)

Dates
- Python tutorial: 12/13 September
- Midterm exam: 19/23 October
- TensorFlow tutorial: 24/25 October
- Final project due: 11 December

ASSISTANTS AND GRADING

Teaching Assistants

Ian Kinsella and Wenda Zhou

Office Hours Mon/Tue 5:30-7:30pm, Room 1025, Dept of Statistics, 10th floor SSW

Class Homepage

https://wendazhou.com/teaching/AdvancedMLFall17/

Homework

Some homework problems and the final project require coding.

Coding: Python

Homework due: Tue/Wed at 4pm; no late submissions.

You can drop two homeworks from your final score

Grade

Homework (20%) + Midterm Exam (40%) + Final Project (40%)

HOUSE RULES

Email

All email to the TAs, please.

The instructors will not read your email unless it is forwarded by a TA.

If you cannot take the exam or finish the project: you must let us know at least one week before the midterm exam/project due date.

READING

Books (optional)

OUTLINE (VERY TENTATIVE)

- Neural networks (basic definitions and training)
- Graphical models (ditto)
- Sampling algorithms
- Variational inference
- Optimization for GMs and NNs
- NN software
- Convolutional NNs and computer vision
- Recurrent NNs
- Reinforcement learning
- Dimension reduction and autoencoders

INTRODUCTION

AGAIN: MACHINE LEARNING

Machines need to...
- recognize patterns (e.g. vision, language)
- predict
- cope with uncertainty

Example applications: medical diagnosis, recommender systems, face detection/recognition, bioinformatics, speech and handwriting recognition, natural language processing, web search, computer vision.

Today

Machine learning and statistics have become hard to tell apart.

LEARNING AND STATISTICS

Task

Balance the pendulum upright by moving the sled left and right.

The computer can control only the motion of the sled.

LEARNING AND STATISTICS

Formalization

State = 4 variables (sled location, sled velocity, angle, angular velocity)

Actions = sled movements

f : S × A → S
(state, action) ↦ next state

LEARNING AND STATISTICS

Fit a function

f : S × A → S
(state, action) ↦ next state

to the data obtained in previous runs.

This splits the problem into two components:

1. The function f, which tells the system how the world works.

2. An optimization method that uses f to determine how to move towards the optimal state.

Note well

Learning how the world works is a regression problem.

OUR MAIN TOPICS

Neural networks
- Define functions
- Represented by a directed graph
- Incoming edges: function arguments
- Outgoing edges: function values
- Learning: differentiation/optimization

Graphical models
- Define distributions
- Represented by a directed graph
- Incoming edges: conditions
- Outgoing edges: draws from the distribution
- Learning: estimation/inference

RECALL PREVIOUS TERM

A linear classifier splits the space into the regions sgn(⟨v_H, x⟩ − c) > 0 and sgn(⟨v_H, x⟩ − c) < 0.

Problems
- Classification, regression, dimension reduction (PCA)
- Clustering (mixture models), HMMs

Solutions
- Functions: neural networks
- Distributions: graphical models

(Diagram: a layered neural network with inputs x1, x2, x3, weights v1, v2, v3 and outputs y1, y2, y3, where e.g. y2 = σ(vᵗx).)

NNS AND GRAPHICAL MODELS

Neural networks
- Representation of a function using a graph
- (Diagram: layers of weights v_ij applied to the input x, computing g and then f.)
- Symbolizes: f(g(x))

Graphical models
- Representation of a distribution using a graph
- Layers: X → Y → Z
- Grouping dependent variables into layers is a good thing.

HISTORICAL PERSPECTIVE: MCCULLOCH-PITTS NEURON MODEL (1943)

McCulloch-Pitts model
- Collect the input signals x1, x2, x3 into a vector x = (x1, x2, x3) ∈ ℝ³.
- Choose a fixed vector v ∈ ℝ³ and a constant c ∈ ℝ.
- Compute: y = 𝕀{⟨v, x⟩ > c}.
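The computation above is a one-liner; here is a minimal sketch in Python (the course's language), where the particular weights v and threshold c are arbitrary illustration values, not from the lecture:

```python
# McCulloch-Pitts neuron: y = 1 if <v, x> > c, else 0.

def mcculloch_pitts(x, v, c):
    """Return the indicator I{<v, x> > c} for input vector x."""
    activation = sum(vj * xj for vj, xj in zip(v, x))
    return 1 if activation > c else 0

v = (0.5, -1.0, 2.0)  # fixed weight vector (hypothetical)
c = 0.25              # fixed threshold (hypothetical)

print(mcculloch_pitts((1.0, 1.0, 1.0), v, c))  # <v,x> = 1.5 > 0.25 -> 1
print(mcculloch_pitts((1.0, 2.0, 0.0), v, c))  # <v,x> = -1.5      -> 0
```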

READING THE DIAGRAM

(Diagram: inputs x1, x2, x3 with weights v1, v2, v3, a bias input 1 with weight −c, and a threshold node 𝕀{· > 0}.)

The diagram computes f(x) = sgn(⟨v, x⟩ − c).

LINEAR CLASSIFICATION

The classifier thresholds the projection ⟨x, v⟩/‖v‖ of x onto the normal direction v:

f(x) = sgn(⟨v, x⟩ − c)

REMARKS

The McCulloch-Pitts model computes y = 𝕀{vᵗx > c}, but it does not specify the training method.

To train the classifier, we need a cost function and an optimization method.

TRAINING

Idea
- Choose the 0-1 loss as the simplest loss for classification.
- Minimize the empirical risk (on the training data) under this loss.

(Figure: the 0-1 empirical risk J(a) is piecewise constant in the parameters, which makes it hard to optimize directly. Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001.)

THE PERCEPTRON CRITERION

Approximate the piecewise constant 0-1 risk by a piecewise linear function, the perceptron cost function J_p(a).

(Figure: the perceptron cost J_p(a) and further smoothed variants J_q(a), J_r(a), each with the solution region marked. Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001.)

PERCEPTRON

Train the McCulloch-Pitts model (that is: estimate (c, v)) by applying gradient descent to the function

C_p(c, v) := Σ_{i=1}^n 𝕀{sgn(⟨v, x_i⟩ − c) ≠ y_i} · |⟨v, x_i⟩ − c| ,

called the perceptron cost function: each misclassified point contributes its (unnormalized) distance to the boundary.
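Applying a stochastic gradient step per misclassified point gives the classic perceptron update. A minimal sketch, assuming ±1 labels and homogeneous coordinates w = (v, −c) acting on augmented inputs (x, 1); the toy data set is made up and linearly separable:

```python
# Perceptron training as stochastic gradient descent on the perceptron cost:
# for each misclassified (x, y), step w in the direction y * x.

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1

def train_perceptron(data, labels, epochs=100, eta=1.0):
    w = [0.0] * len(data[0])
    for _ in range(epochs):
        errors = 0
        for x, y in zip(data, labels):
            if predict(w, x) != y:          # misclassified: gradient step
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                errors += 1
        if errors == 0:                     # converged on separable data
            break
    return w

# Augmented inputs (x1, x2, 1); made-up separable toy set.
data   = [(1, 1, 1), (2, 0, 1), (3, 3, 1), (4, 2, 1)]
labels = [-1, -1, 1, 1]
w = train_perceptron(data, labels)
print([predict(w, x) for x in data])  # [-1, -1, 1, 1]
```

On separable data the perceptron convergence theorem guarantees that the loop terminates after finitely many updates.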

OUTLINE (VERY TENTATIVE)

- Neural networks (basic definitions and training)
- Graphical models (ditto)
- Sampling algorithms
- Variational inference
- Optimization for GMs and NNs
- NN software
- Convolutional NNs and computer vision
- Recurrent NNs
- Reinforcement learning
- Dimension reduction and autoencoders

TOOLS: LOGISTIC REGRESSION

SIGMOIDS

Sigmoid function

σ(x) = 1 / (1 + e^(−x))

Note

1 − σ(x) = (1 + e^(−x) − 1) / (1 + e^(−x)) = 1 / (e^x + 1) = σ(−x)

Derivative

dσ/dx (x) = e^(−x) / (1 + e^(−x))² = σ(x)(1 − σ(x))

(Plot: the sigmoid (blue) and its derivative (red).)
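Both identities above are easy to check numerically; a small sketch (the sample points are arbitrary):

```python
# Numerically check the two sigmoid identities:
# (i) 1 - s(x) = s(-x),  (ii) s'(x) = s(x)(1 - s(x)).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    # symmetry identity
    assert abs((1 - sigmoid(x)) - sigmoid(-x)) < 1e-12
    # derivative identity, checked against a central finite difference
    h = 1e-6
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(numeric - sigmoid(x) * (1 - sigmoid(x))) < 1e-8

print("both identities hold numerically")
```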

APPROXIMATING DECISION BOUNDARIES

In linear classification, the decision boundary is a discontinuity. The boundary can be represented either by the indicator function 𝕀{⟨v, x⟩ − c > 0} or by the function sgn(⟨v, x⟩ − c); these representations are equivalent.

The most important use of the sigmoid function in machine learning is as a smooth approximation to the indicator function.

Given a sigmoid σ and a data point x, we decide which side of the approximated boundary we are on by thresholding

σ(x) ≥ 1/2 .

SCALING

We can add a scale parameter θ by defining

σ_θ(x) := σ(θx) = 1 / (1 + e^(−θx))   for θ ∈ ℝ.

(Plot: σ_θ for increasing values of θ.)

Influence of θ
- As θ increases, σ_θ approximates 𝕀 more closely.
- For θ → ∞, the sigmoid converges to 𝕀 pointwise, that is: for every x ≠ 0, we have σ_θ(x) → 𝕀{x > 0} as θ → +∞.
- Note σ_θ(0) = 1/2 always, regardless of θ.

APPROXIMATING A LINEAR CLASSIFIER

The decision boundary of a linear classifier in ℝ² is a discontinuous ridge: 𝕀{⟨v, x⟩ ≥ c}. Here: v = (1, 1) and c = 0.

We can stretch σ into a ridge function on ℝ²:

x = (x1, x2) ↦ σ(x1).

The ridge runs parallel to the x2-axis. If we use σ(x2) instead, we rotate by 90 degrees (still axis-parallel).

STEERING A SIGMOID

The function σ(⟨v, x⟩ − c) is a sigmoid ridge, where the ridge is orthogonal to the normal vector v, and c is an offset that shifts the ridge out of the origin.

The plot on the right shows the normal vector (here: v = (1, 1)) in black.

The parameters v and c have the same meaning for 𝕀 and σ, that is, σ(⟨v, x⟩ − c) approximates 𝕀{⟨v, x⟩ ≥ c}.

LOGISTIC REGRESSION

Logistic regression models the class probabilities of a linear classifier using sigmoids.

Setup
- Two-class classification problem.
- Observations x1, ..., xn ∈ ℝ^d, class labels y_i ∈ {0, 1}.

We model the conditional distribution of the class label given the data as

P(y|x) := Bernoulli( σ(⟨v, x⟩ − c) ) .

Recall σ(⟨v, x⟩ − c) takes values in [0, 1], with value 1/2 on the class boundary. The logistic regression model interprets this value as the probability of being in class y.

LEARNING LOGISTIC REGRESSION

Since the model is defined by a parametric distribution, we can apply maximum likelihood.

Notation

Recall from Statistical Machine Learning: we collect the parameters in a vector w by writing

w := (v, −c)  and  x := (x, 1)  so that  ⟨w, x⟩ = ⟨v, x⟩ − c .

Likelihood

∏_{i=1}^n σ(⟨w, x_i⟩)^{y_i} (1 − σ(⟨w, x_i⟩))^{1−y_i}

Negative log-likelihood

L(w) := −Σ_{i=1}^n [ y_i log σ(⟨w, x_i⟩) + (1 − y_i) log(1 − σ(⟨w, x_i⟩)) ]

MAXIMUM LIKELIHOOD

Gradient

∇L(w) = Σ_{i=1}^n ( σ(wᵗx_i) − y_i ) x_i

Note

Each training data point x_i contributes to the sum proportionally to the approximation error σ(wᵗx_i) − y_i incurred at x_i by approximating the linear classifier by a sigmoid.

Maximum likelihood

The ML estimator ŵ for w is the solution of

∇L(w) = 0 .

- For logistic regression, this equation has no solution in closed form.
- To find ŵ, we use numerical optimization.
- The function L is convex (∪-shaped).
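The simplest numerical option is plain gradient descent on L(w), using the gradient formula above. A minimal sketch; the 1-d toy data and the step size are arbitrary choices, not from the lecture:

```python
# Gradient descent on the logistic regression negative log-likelihood.
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def gradient(w, xs, ys):
    """grad L(w) = sum_i (sigmoid(<w, x_i>) - y_i) x_i."""
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, x))) - y
        g = [gj + err * xj for gj, xj in zip(g, x)]
    return g

# Made-up toy data in augmented coordinates (x, 1); class 1 for x >= 3.
xs = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 1.0), (4.0, 1.0)]
ys = [0, 0, 0, 1, 1]

w = [0.0, 0.0]
for _ in range(5000):        # plain gradient descent, arbitrary step size
    g = gradient(w, xs, ys)
    w = [wj - 0.1 * gj for wj, gj in zip(w, g)]

preds = [1 if sigmoid(w[0] * x + w[1]) >= 0.5 else 0 for x, _ in xs]
print(preds)  # [0, 0, 0, 1, 1]
```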

RECALL FROM STATISTICAL MACHINE LEARNING

Gradient descent minimizes a differentiable function f by moving against the gradient:

x^(k+1) := x^(k) − ∇f(x^(k)) ,

where x^(k) is the candidate solution in step k of the algorithm.

If the Hessian matrix H_f of partial second derivatives exists and is invertible, we can apply Newton's method, which converges faster:

x^(k+1) := x^(k) − H_f^(−1)(x^(k)) ∇f(x^(k))

The Hessian of a twice continuously differentiable f : ℝ^d → ℝ is

H_f(x) := ( ∂²f / ∂x_i ∂x_j )_{i,j ≤ d} .

Since f is twice differentiable, each ∂²f/∂x_i∂x_j exists; since it is twice continuously differentiable, ∂²f/∂x_i∂x_j = ∂²f/∂x_j∂x_i, so H_f is symmetric.

For a convex f, the Hessian is positive semidefinite, and the inverse of H_f(x) exists if and only if the matrix is positive definite (semidefinite does not suffice); a positive definite Hessian everywhere in turn implies that f is strictly convex.

NEWTON'S METHOD FOR LOGISTIC REGRESSION

Applying Newton

w^(k+1) := w^(k) − H_L^(−1)(w^(k)) ∇L(w^(k))

Matrix notation

X is the data matrix (or design matrix) you know from linear regression: the n × (d+1) matrix whose i-th row is x_iᵗ, i.e. X_ij = (x_i)_j. D_σ is the n × n diagonal matrix

D_σ := diag( σ(wᵗx_1)(1 − σ(wᵗx_1)), ..., σ(wᵗx_n)(1 − σ(wᵗx_n)) ) .

Newton step

w^(k+1) = (Xᵗ D_σ X)^(−1) Xᵗ D_σ ( X w^(k) − D_σ^(−1) ( σ(⟨w^(k), x_1⟩) − y_1, ..., σ(⟨w^(k), x_n⟩) − y_n )ᵗ )

NEWTON'S METHOD FOR LOGISTIC REGRESSION

Abbreviate

u^(k) := X w^(k) − D_σ^(−1) ( σ(⟨w^(k), x_1⟩) − y_1, ..., σ(⟨w^(k), x_n⟩) − y_n )ᵗ ,

so the Newton step reads

w^(k+1) = (Xᵗ D_σ X)^(−1) Xᵗ D_σ u^(k) .

Compare this to the least-squares solution of linear regression, ŵ = (Xᵗ X)^(−1) Xᵗ y.

Differences:
- The vector y of regression responses is substituted by the vector u^(k) above.
- The matrix Xᵗ X is reweighted. Note that matrices of product form Xᵗ X are positive semidefinite; since D_σ is diagonal with non-negative entries, so is Xᵗ D_σ X.

At each step, the algorithm solves a least-squares problem reweighted by the matrix D_σ. Since this happens at each step of an iterative algorithm, Newton's method applied to the logistic regression negative log-likelihood is also known as Iteratively Reweighted Least Squares.
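The IRLS update can be sketched directly from the formulas above. A minimal illustration for one feature plus intercept, so that Xᵗ D_σ X is 2×2 and can be inverted by hand; the toy data (deliberately not separable), the iteration count, and the small clamp on the diagonal are arbitrary choices:

```python
# Iteratively Reweighted Least Squares for logistic regression:
# w <- (X^t D X)^{-1} X^t D u,  u = X w - D^{-1}(sigma(Xw) - y).
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def irls(xs, ys, steps=8):
    w = [0.0, 0.0]                       # w = (slope, intercept)
    for _ in range(steps):
        A = [[0.0, 0.0], [0.0, 0.0]]     # accumulates X^t D X
        b = [0.0, 0.0]                   # accumulates X^t D u
        for (x, one), y in zip(xs, ys):
            s = sigmoid(w[0] * x + w[1])
            d = max(s * (1 - s), 1e-12)  # diagonal entry of D
            u = (w[0] * x + w[1]) - (s - y) / d
            for i, xi in enumerate((x, one)):
                b[i] += d * u * xi
                for j, xj in enumerate((x, one)):
                    A[i][j] += d * xi * xj
        det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
        w = [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
             (A[0][0] * b[1] - A[1][0] * b[0]) / det]
    return w

# Made-up, non-separable toy data in augmented coordinates (x, 1).
xs = [(0.0, 1), (1.0, 1), (2.0, 1), (3.0, 1), (4.0, 1), (5.0, 1)]
ys = [0, 0, 1, 0, 1, 1]
w = irls(xs, ys)
print([round(sigmoid(w[0] * x + w[1]), 2) for x, _ in xs])
```

Non-separable data keeps the maximum likelihood estimate finite; on separable data the weights diverge (see the overfitting discussion below).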

OTHER OPTIMIZATION METHODS

Newton: cost
- The size of the Hessian is (d + 1) × (d + 1).
- In high-dimensional problems, inverting H_L can become problematic.

Other methods

Maximum likelihood only requires that we minimize the negative log-likelihood; we can choose any numerical method, not just Newton. Alternatives include:
- Quasi-Newton methods (e.g. invert H_L only once, for w^(1); these do not guarantee quadratic convergence).
- Gradient methods.
- Approximate gradient methods, like stochastic gradient descent.

OVERFITTING

Recall from Statistical Machine Learning
- If we increase the length of v without changing its direction, the sign of ⟨v, x⟩ does not change, but the value changes.
- That means: if v is the normal vector of a classifier, and we scale v by some λ > 0, the decision boundary does not move, but ⟨λv, x⟩ = λ⟨v, x⟩.
- The longer v becomes, the more similar σ(⟨v, x⟩) becomes to 𝕀{⟨v, x⟩ > 0}.

(Plot: sigmoid ridges for increasingly long v.)

EFFECT ON ML FOR LOGISTIC REGRESSION

Recall each training data point x_i contributes an error term σ(wᵗx_i) − y_i to the log-likelihood.

By increasing the length of w, we can make σ(wᵗx_i) − y_i arbitrarily small without moving the decision boundary.

OVERFITTING

- Once the decision boundary is correctly located between the two classes, the maximization algorithm can increase the log-likelihood arbitrarily by increasing the length of w.
- That does not move the decision boundary, but the logistic function looks more and more like the indicator 𝕀.
- That may fit the training data more tightly, but can lead to bad generalization (e.g. for similar reasons as for the perceptron, where the decision boundary may end up very close to a training data point).
- That is a form of overfitting.
- If the data is not separable, sufficiently many points on the wrong side of the decision boundary prevent overfitting (since making w larger increases the error contributions of these points).
- For large data sets, overfitting can still occur if the fraction of such points is small.

Solutions

Overfitting can be addressed by including an additive penalty of the form L(w) + λ‖w‖.

LOGISTIC REGRESSION FOR MULTIPLE CLASSES

The multinomial distribution of N draws from K categories with parameter vector (θ_1, ..., θ_K) (where Σ_{k≤K} θ_k = 1) has probability mass function

P(m_1, ..., m_K | θ_1, ..., θ_K) = N! / (m_1! ⋯ m_K!) · ∏_{k=1}^K θ_k^{m_k}   where m_k = # draws in category k

Logistic regression

Recall two-class logistic regression is defined by P(Y|x) = Bernoulli(σ(wᵗx)).

Idea: to generalize logistic regression to K classes, choose a separate weight vector w_k for each class k, and define P(Y|x) by

Multinomial( σ̄(w_1ᵗx), ..., σ̄(w_Kᵗx) )   where σ̄(w_kᵗx) := σ(w_kᵗx) / Σ_{j≤K} σ(w_jᵗx) .

LOGISTIC REGRESSION FOR MULTIPLE CLASSES

Logistic regression for K classes

The label y now takes values in {1, ..., K}.

P(y|x) = ∏_{k=1}^K σ̄(w_kᵗx)^{𝕀{y=k}}

Negative log-likelihood:

L(w_1, ..., w_K) = −Σ_{i≤n, k≤K} 𝕀{y_i = k} log σ̄(w_kᵗx_i)

Recall that 1 − σ(x) = σ(−x). That means

Bernoulli( σ(⟨v, x⟩ − c) ) ≙ Multinomial( σ̄(wᵗx), σ̄((−1)·wᵗx) ) ,

so two-class logistic regression coincides with multiclass logistic regression with K = 2 provided we choose w_2 = −w_1.
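The normalized-sigmoid construction above is easy to sketch; the weight vectors and input below are arbitrary illustration values. The last check verifies the K = 2 reduction numerically (it works because σ(a) + σ(−a) = 1):

```python
# K-class logistic regression probabilities via normalized sigmoids.
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def class_probs(ws, x):
    """P(y = k | x) proportional to sigmoid(w_k . x)."""
    scores = [sigmoid(sum(wj * xj for wj, xj in zip(w, x))) for w in ws]
    total = sum(scores)
    return [s / total for s in scores]

x = (1.0, -2.0)
ws = [(0.5, 1.0), (-1.0, 0.2), (2.0, -0.5)]   # K = 3 hypothetical classes
p = class_probs(ws, x)
print(p)

# Probabilities sum to 1, and K = 2 with w2 = -w1 reduces to the
# two-class sigmoid model.
assert abs(sum(p) - 1.0) < 1e-12
w1 = (0.7, -0.3)
w2 = tuple(-wj for wj in w1)
a = sum(wj * xj for wj, xj in zip(w1, x))
assert abs(class_probs([w1, w2], x)[0] - sigmoid(a)) < 1e-12
```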

GRAPHICAL MODELS

A graphical model represents the dependence structure within a set of random variables as a graph.

Overview

Roughly speaking:
- Each random variable is represented by a vertex.
- An edge indicates that one variable depends on another, for example: X → Y.
- We have to be careful: this does not yet tell us which variables are independent. We have to make more precise what "depends on" means.

We will use the notation:

L(X) = distribution of the random variable X
L(X|Y) = conditional distribution of X given Y

(L means "law".)

Reason
- If X is discrete, L(X) is usually given by a mass function P(x).
- If it is continuous, L(X) is usually given by a density p(x).
- With the notation above, we do not have to distinguish between discrete and continuous variables.

DEPENDENCE AND INDEPENDENCE

Dependence between random variables X_1, ..., X_n is a property of their joint distribution L(X_1, ..., X_n).

Recall

Two random variables are stochastically independent, or independent for short, if their joint distribution factorizes:

L(X, Y) = L(X)L(Y)

For mass functions/densities: P(x, y) = P(x)P(y) or p(x, y) = p(x)p(y).

Dependent means not independent.

Intuitively

X and Y are dependent if knowing the outcome of X provides any information about the outcome of Y.

More precisely: if someone draws (X, Y) simultaneously, and only discloses X = x to you, does that change what you can say about Y? Once X = x is given, the distribution of Y is the conditional L(Y|X = x). If that is still L(Y), as before X was drawn, the two are independent. If L(Y|X = x) ≠ L(Y), they are dependent.

CONDITIONAL INDEPENDENCE

Definition

Given random variables X, Y, Z, we say that X is conditionally independent of Y given Z if

L(X, Y|Z = z) = L(X|Z = z) L(Y|Z = z) .

That is equivalent to

L(X|Y = y, Z = z) = L(X|Z = z) .

Notation

X ⊥⊥ Y | Z   (graph: X ← Z → Y)

Intuitively

X and Y are dependent given Z = z if, although Z is known, knowing the outcome of X provides additional information about the outcome of Y.

GRAPHICAL MODEL NOTATION

The joint probability of random variables X_1, ..., X_n can always be factorized as

L(X_1, ..., X_n) = L(X_n|X_1, ..., X_{n−1}) L(X_{n−1}|X_1, ..., X_{n−2}) ⋯ L(X_1) .

Note that we can re-arrange the variables in any order.

If there are conditional independencies, we can remove some variables from the conditionals:

L(X_1, ..., X_n) = L(X_n|X̂_n) L(X_{n−1}|X̂_{n−1}) ⋯ L(X_1) ,

where X̂_i is the subset of X_1, ..., X_{i−1} on which X_i depends.

Definition

Let X_1, ..., X_n be random variables. A (directed) graphical model represents a factorization of the joint distribution L(X_1, ..., X_n) as follows:
- Factorize L(X_1, ..., X_n).
- For each variable X_i, add an edge from each variable X_j ∈ X̂_i to X_i.

That is: an edge X_j → X_i is added if L(X_1, ..., X_n) contains a factor in which X_i is conditioned on X_j.

GRAPHICAL MODEL NOTATION

Lack of uniqueness

The factorization is usually not unique, since e.g.

L(X, Y) = L(X|Y)L(Y) = L(Y|X)L(X) .

That means the direction of edges is not generally determined.

Remark
- If we use a graphical model to define a model or visualize a model, we decide on the direction of the edges.
- Estimating the direction of edges from data is a very difficult (and very important) problem. This is one of the main subjects of a research field called causal inference or causality.

A simple example

X → Y

X ← Z → Y

Layered graphs

(Diagram: variables arranged in layers, with edges from the kth layer to the (k+1)st layer.)

All variables in the (k+1)st layer are conditionally independent given the variables in the kth layer.

WORDS OF CAUTION I

X ← Z → Y

Important

X and Y are not independent; independence holds only conditionally on Z.

In other words: if we do not observe Z, X and Y are dependent, and we have to change the graph:

X → Y   or   X ← Y

WORDS OF CAUTION II

X → Z ← Y

Example
- Suppose we start with two independent normal variables X and Y.
- Define Z = X + Y.
- If we know Z, and someone reveals the value of Y to us, we know everything about X: conditionally on Z, the variables X and Y are dependent.

MACHINE LEARNING EXAMPLES I

(Diagram: a chain Z_1 → Z_2 → ⋯ → Z_{n−1} → Z_n with an emission edge Z_i → X_i at each step — the structure of a hidden Markov model.)
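A directed model like this can be sampled by ancestral sampling: draw Z_1, then each Z_i given Z_{i−1}, then each X_i given Z_i. A minimal sketch; the two-state initial, transition, and emission probabilities are made up:

```python
# Ancestral sampling from a chain-structured directed graphical model.
import random

random.seed(0)

def sample_categorical(probs):
    """Draw an index with the given probabilities."""
    u, acc = random.random(), 0.0
    for value, p in enumerate(probs):
        acc += p
        if u < acc:
            return value
    return len(probs) - 1

initial    = [0.5, 0.5]                   # P(Z1)          (hypothetical)
transition = [[0.9, 0.1], [0.2, 0.8]]     # P(Z_i | Z_i-1) (hypothetical)
emission   = [[0.8, 0.2], [0.3, 0.7]]     # P(X_i | Z_i)   (hypothetical)

def sample_chain(n):
    z = [sample_categorical(initial)]
    for _ in range(n - 1):
        z.append(sample_categorical(transition[z[-1]]))
    x = [sample_categorical(emission[zi]) for zi in z]
    return z, x

z, x = sample_chain(10)
print(z)
print(x)
```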

MACHINE LEARNING EXAMPLES II

A graphical model in which each node is a binary random variable and each conditional probability is a logistic regression model is called a sigmoid belief network.

Terminology: belief network and Bayes net are alternative names for graphical models.

A deep belief network is a layered graphical model: layers of unobserved variables Θ above a layer of observed data X_1, ..., X_d, where the variables in each layer are conditionally independent given the previous layer.

Two tasks for deep belief nets are:
- Sampling: draw X_{1:d} from L(X_{1:d}|Θ_{1:d}).
- Inference: estimate L(Θ_{1:d}|X_{1:d} = x_{1:d}) when data x_{1:d} is observed.

(More details later.)

MARKOV RANDOM FIELDS

UNDIRECTED GRAPHICAL MODELS

In an undirected graphical model, each edge in the graph is either absent, or present in both directions.

(Diagram: directed and undirected versions of a graph on X, Y, Z.)

Under this convention, Markov random fields are special cases of (directed) graphical models, but have distinct properties. We treat them separately, and consider the undirected case first.

OVERVIEW

(Diagram: a 2-dimensional grid graph with vertices ..., i−1, i, i+1, ... and edge weights such as w_{i−1,i}.)

A random variable Θ_i is associated with each vertex. Two random variables interact if they are neighbors in the graph.

NEIGHBORHOOD GRAPH

N = (V_N, W_N), where V_N is the vertex set and W_N the set of edge weights.

The edge weights are scalars w_ij ∈ ℝ. Since the graph is undirected, the weights are symmetric (w_ij = w_ji).

An edge weight w_ij = 0 means "no edge between v_i and v_j".

Neighborhoods

The set of all neighbors of v_i in the graph,

∂(i) := { j | w_ij ≠ 0 } ,

is called the neighborhood of v_i.

MARKOV RANDOM FIELDS

We say that the joint distribution P of (Θ_1, ..., Θ_n) satisfies the Markov property with respect to N if

L(Θ_i | Θ_j, j ≠ i) = L(Θ_i | Θ_j, j ∈ ∂(i)) .

The set {Θ_j, j ∈ ∂(i)} of random variables indexed by neighbors of v_i is called the Markov blanket of Θ_i.

In words

The Markov property says that each Θ_i is conditionally independent of the remaining variables given its Markov blanket.

Definition

A distribution L(Θ_1, ..., Θ_n) which satisfies the Markov property for a given graph N is called a Markov random field.

ENERGY FUNCTIONS

A (strictly positive) density p(x) can always be written in the form

p(x) = (1/Z) exp(−H(x))   where H : X → ℝ₊

and Z is a normalization constant. The function H is called an energy function, or cost function, or a potential.

MRF energy

In particular, we can write an MRF density for random variables Θ_{1:n} as

p(θ_1, ..., θ_n) = (1/Z) exp(−H(θ_1, ..., θ_n))

CLIQUES

Graphical models factorize over the graph. How does that work for MRFs? The relevant subgraphs are the cliques, the fully connected subsets of vertices.

(Example graph on vertices 1, ..., 6: the cliques are i) the triangles (1, 2, 3) and (1, 3, 4), and ii) each pair of vertices connected by an edge, e.g. (2, 6).)

THE HAMMERSLEY-CLIFFORD THEOREM

Theorem

Let N be a neighborhood graph with vertex set V_N. Suppose the random variables {Θ_i, i ∈ V_N} take values in T, and their joint distribution satisfies the Markov property with respect to N and has a (strictly positive) probability mass function P, so there is an energy function H such that

P(θ_1, ..., θ_n) = e^(−H(θ_1,...,θ_n)) / Σ_{θ'_{1:n} ∈ Tⁿ} e^(−H(θ'_1,...,θ'_n)) .

Then H decomposes over the cliques:

H(θ_1, θ_2, ...) = Σ_{C ∈ 𝒞} H_C(θ_i, i ∈ C) ,

where 𝒞 is the set of cliques in N, and each H_C is a non-negative function with |C| arguments. Hence,

P(θ_1, ..., θ_n) ∝ ∏_{C ∈ 𝒞} e^(−H_C(θ_i, i∈C)) ,

that is, the distribution factorizes over the cliques of the graph.

USE OF MRFS

In general
- Modeling systems of dependent RVs is one of the hardest problems in probability.
- MRFs model dependence, but break it down to a limited number of interactions to make the model tractable.

Grids
- The most widely used neighborhood graphs are 2-dimensional grids.
- Grid MRFs model spatial interactions between RVs.
- Hammersley-Clifford for grids: the only cliques are the edges!

THE POTTS MODEL

Definition

Suppose N = (V_N, W_N) is a neighborhood graph with n vertices and β > 0 a constant. Then

p(θ_{1:n}) := (1/Z(β, W_N)) exp( β Σ_{i,j} w_ij 𝕀{θ_i = θ_j} )

defines the Potts model.

Interpretation
- If w_ij > 0: the overall probability increases if θ_i = θ_j.
- If w_ij < 0: the overall probability decreases if θ_i = θ_j.
- If w_ij = 0: no interaction between θ_i and θ_j.

Positive weights encourage smoothness.

EXAMPLE

Ising model

The simplest choice is w_ij = 1 if (i, j) is an edge (of the grid graph):

p(θ_{1:n}) = (1/Z(β)) exp( β Σ_{(i,j) is an edge} 𝕀{θ_i = θ_j} )

Example

(Figure: samples from an Ising model on a 56 × 56 grid graph, for increasing β.)
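Such samples are generated with Markov chain methods, which the course treats later; purely as an illustration, here is a single-site Gibbs sampler sketch for the Ising model above, where each site is resampled from its conditional given its neighbors. Grid size, β, and sweep count are arbitrary choices:

```python
# Single-site Gibbs sampling for p(theta) ∝ exp(beta * sum_edges I{th_i = th_j}).
# Conditional of one site: P(th_s = v | rest) ∝ exp(beta * #neighbors equal to v).
import math, random

random.seed(1)
N, BETA = 16, 0.8          # grid size and coupling (illustration values)

def neighbors(i, j):
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        if 0 <= i + di < N and 0 <= j + dj < N:
            yield i + di, j + dj

def gibbs_sweep(grid):
    for i in range(N):
        for j in range(N):
            n1 = sum(grid[a][b] for a, b in neighbors(i, j))   # neighbors = 1
            n0 = sum(1 for _ in neighbors(i, j)) - n1          # neighbors = 0
            p1 = math.exp(BETA * n1) / (math.exp(BETA * n0) + math.exp(BETA * n1))
            grid[i][j] = 1 if random.random() < p1 else 0

grid = [[random.randint(0, 1) for _ in range(N)] for _ in range(N)]
for _ in range(50):
    gibbs_sweep(grid)

# Positive beta encourages neighboring sites to agree (smooth samples).
agreements = sum(grid[i][j] == grid[a][b]
                 for i in range(N) for j in range(N)
                 for a, b in neighbors(i, j))
total_pairs = sum(1 for i in range(N) for j in range(N) for _ in neighbors(i, j))
print(agreements / total_pairs)  # well above 0.5 for this beta
```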

MRFS AS SMOOTHNESS PRIORS

We consider a spatial problem with observations X_i. Each i is a location on a grid.

Spatial model

Suppose we model each X_i by a distribution L(X|θ_i), i.e. each location i has its own parameter variable θ_i. This model is Bayesian (the parameter is a random variable). We use an MRF as prior distribution.

(Diagram: an observed grid layer X_i, X_{i+1}, ... above an unobserved parameter layer θ_i, θ_{i+1}, ..., with each X_i generated from p(·|θ_i).)

Spatial smoothing
- We can define the joint distribution of (θ_1, ..., θ_n) as an MRF on the grid graph.
- For positive weights, the MRF will encourage the model to explain neighbors X_i and X_j by the same parameter value: spatial smoothing.

EXAMPLE: SEGMENTATION OF NOISY IMAGES

Mixture model
- A BMM can be used for image segmentation.
- The BMM prior on the component parameters is a natural conjugate prior q(θ).
- In the spatial setting, we index the parameter of each X_i separately as θ_i. For K mixture components, θ_{1:n} contains only K distinct values.
- The joint BMM prior on θ_{1:n} is

  q_BMM(θ_{1:n}) = ∏_{i=1}^n q(θ_i) .

q_BMM with an MRF prior

Multiplying in an MRF term,

q_MRF(θ_{1:n}) = (1/Z(λ)) exp( λ Σ_{w_ij ≠ 0} 𝕀{θ_i = θ_j} ) ,

encourages spatial smoothness of the segmentation.

(Figures: a SAR image with a high noise level and ambiguous segments, segmented without and with smoothing; segmentation results at different levels of smoothing — unconstrained, standard smoothing, and strong smoothing; an MR frontal view image of a monkey's head with unsmoothed and smoothed MDP segmentations, the original image overlaid with segment boundaries. A table reports the average number of clusters (with standard deviations) chosen by the algorithm on two images for a range of hyperparameter values: when smoothing is activated, the number of clusters tends to be more stable with respect to a changing hyperparameter, and the stabilizing effect is particularly pronounced for large hyperparameter values, which otherwise result in a large number of clusters.)

SAMPLING AND INFERENCE

Problem 1: Sampling

Generate samples from the joint distribution of (Θ_1, ..., Θ_n).

Problem 2: Inference

If the MRF is used as a prior, we have to compute or approximate the posterior distribution.

Solution
- MRF distributions on grids are not analytically tractable. The only known exception is the Ising model in 1 dimension.
- Both sampling and inference are based on Markov chain sampling algorithms.

SAMPLING ALGORITHMS

In general

A sampling algorithm is an algorithm that outputs samples X_1, X_2, ... from a given distribution P or density p.

Sampling algorithms can for example be used to approximate expectations:

E_p[f(X)] ≈ (1/n) Σ_{i=1}^n f(X_i)
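The approximation above can be sketched in one line of Python; the target here (X ~ Uniform[0, 1], f(x) = x², exact expectation 1/3) is a made-up example:

```python
# Monte Carlo approximation of an expectation: E[f(X)] ≈ (1/n) sum f(X_i).
import random

random.seed(0)

def mc_expectation(f, sampler, n):
    return sum(f(sampler()) for _ in range(n)) / n

# X ~ Uniform[0, 1], f(x) = x^2, so E[f(X)] = 1/3 exactly.
estimate = mc_expectation(lambda x: x * x, random.random, 100_000)
print(estimate)  # close to 1/3
```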

- Suppose we work with a Bayesian model whose posterior Q_n := L(Θ|X_{1:n}) cannot be computed analytically.
- We will see that it can still be possible to sample from Q_n.
- Doing so, we obtain samples Θ_1, Θ_2, ... distributed according to Q_n.
- This reduces posterior estimation to a density estimation problem (i.e. estimate Q_n from Θ_1, Θ_2, ...).

PREDICTIVE DISTRIBUTIONS

Posterior expectations

If we are only interested in some statistic of the posterior of the form E_{Q_n}[f(Θ)] (e.g. the posterior mean), we can again approximate by

E_{Q_n}[f(Θ)] ≈ (1/m) Σ_{i=1}^m f(Θ_i) .

Example: Predictive distribution

The posterior predictive distribution is our best guess of what the next data point x_{n+1} looks like, given the posterior under previous observations. In terms of densities:

p(x_{n+1}|x_{1:n}) := ∫_T p(x_{n+1}|θ) Q_n(dθ|X_{1:n} = x_{1:n}) .

This is one of the key quantities of interest in Bayesian statistics.

The predictive is a posterior expectation, and can be approximated as a sample average:

p(x_{n+1}|x_{1:n}) = E_{Q_n}[ p(x_{n+1}|Θ) ] ≈ (1/m) Σ_{i=1}^m p(x_{n+1}|Θ_i)

BASIC SAMPLING: AREA UNDER THE CURVE

Say we are interested in a probability density p on the interval [a, b].

(Figure: the area A under the curve p, with a sampled point (X_i, Y_i).)

Key observation

Suppose we can define a uniform distribution U_A on the blue area A under the curve. If we sample

(X_1, Y_1), (X_2, Y_2), ... ~ iid U_A

and discard the vertical coordinates Y_i, the X_i are distributed according to p:

X_1, X_2, ... ~ iid p .

This reduces sampling from p to sampling uniformly from a set, possibly an arbitrarily shaped one.

REJECTION SAMPLING ON THE INTERVAL

We can enclose p in a box B = [a, b] × [0, c], and sample uniformly from the box:

X_i ~ Uniform[a, b] and Y_i ~ Uniform[0, c] .

- If (X_i, Y_i) ∈ A, keep the sample. That is: if Y_i ≤ p(X_i).
- Otherwise: discard it ("reject" it).

Result: the remaining (non-rejected) samples are uniformly distributed on A.
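A minimal sketch of this box scheme; the target density p(x) = 2x on [0, 1] and the sample count are arbitrary choices (the box [0, 1] × [0, 2] encloses p, and the target's exact mean is ∫₀¹ x·2x dx = 2/3):

```python
# Rejection sampling from a box: keep (x, y) pairs that land under the curve.
import random

random.seed(0)
a, b, c = 0.0, 1.0, 2.0    # box [a, b] x [0, c] encloses p

def p(x):
    return 2.0 * x          # target density on [0, 1]

samples = []
while len(samples) < 20_000:
    x = random.uniform(a, b)
    y = random.uniform(0.0, c)
    if y <= p(x):           # point lies under the curve: accept
        samples.append(x)

print(sum(samples) / len(samples))  # close to 2/3
```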

SCALING

This strategy still works if we scale the curve vertically by some constant k > 0: we simply draw Y_i ~ Uniform[0, kc] instead of Y_i ~ Uniform[0, c].

Consequence

For sampling, it is sufficient if p is known only up to normalization (only the shape of p is known).

DISTRIBUTIONS KNOWN UP TO SCALING

Sampling methods usually assume that we can evaluate the target distribution p up to a constant. That is:

p(x) = (1/Z) p̃(x) ,

and we can compute p̃(x) for any given x, but we do not know Z.

We have to pause for a moment and convince ourselves that there are useful examples where this assumption holds.

For an arbitrary posterior computed with Bayes' theorem, we could write

Π(θ|x_{1:n}) = ( ∏_{i=1}^n p(x_i|θ) ) q(θ) / Z   with   Z = ∫_T ∏_{i=1}^n p(x_i|θ) q(θ) dθ .

Provided that we can compute the numerator, we can sample without computing the normalization integral Z.

DISTRIBUTIONS KNOWN UP TO SCALING

Recall that the posterior of the BMM is (up to normalization):

q̃_n(c_{1:K}, θ_{1:K}|x_{1:n}) ∝ ( ∏_{i=1}^n Σ_{k=1}^K c_k p(x_i|θ_k) ) ( ∏_{k=1}^K q(θ_k|λ, y) ) q_Dirichlet(c_{1:K})

We already know that we can discard the normalization constant, but can we evaluate the non-normalized posterior q̃_n?

- The problem with computing q̃_n as a formula in terms of unknowns is that the term ∏_{i=1}^n Σ_{k=1}^K ... blows up into Kⁿ individual terms.
- If we evaluate q̃_n for specific values of c, x and θ, each sum Σ_{k=1}^K c_k p(x_i|θ_k) collapses to a single number.

So: computing q̃_n as a formula in terms of unknowns is difficult; evaluating it for specific values of the arguments is easy.

DISTRIBUTIONS KNOWN UP TO SCALING

In an MRF, the normalization function is the real problem. For the Ising model,

p(θ_{1:n}) = (1/Z(β)) exp( β Σ_{(i,j) is an edge} 𝕀{θ_i = θ_j} )

with

Z(β) = Σ_{θ_{1:n} ∈ {0,1}ⁿ} exp( β Σ_{(i,j) is an edge} 𝕀{θ_i = θ_j} ) ,

and hence a sum over 2ⁿ terms. The general Potts model, with its unnormalized density

p̃(θ_{1:n}) = exp( β Σ_{(i,j) is an edge} w_ij 𝕀{θ_i = θ_j} ) ,

is even more difficult.

R EJECTION S AMPLING ON Rd

If we are not on an interval, sampling uniformly from an enclosing box is not possible (since there is no uniform distribution on all of R or Rd).

Instead of a box, we use another distribution r to enclose p:

[Figure: a target density p(x) enclosed by a proposal density r.]

Sample Xi ∼ r.

R EJECTION S AMPLING ON Rd

Scale p̃ such that p̃(x) < r(x) everywhere.

Sampling: For i = 1, 2, . . .:

1. Sample Xi ∼ r.
2. Sample Yi | Xi ∼ Uniform[0, r(Xi)].
3. If Yi < p̃(Xi), keep Xi.
4. Else, discard Xi and start again at (1).

The surviving samples X1 , X2 , . . . are distributed according to p.
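The scheme above can be sketched in Python. The concrete target (an unnormalized Laplace shape) and the Cauchy proposal are illustrative assumptions; they are chosen because exp(−|x|) ≤ π · cauchy(x) holds for all x, with equality at 0.

```python
import math
import random

def envelope_rejection(p_tilde, sample_r, r_density, scale, n):
    """Rejection sampling on R: requires p_tilde(x) <= scale * r_density(x)."""
    out = []
    while len(out) < n:
        x = sample_r()                                 # step 1: X_i ~ r
        y = random.uniform(0.0, scale * r_density(x))  # step 2: Y_i | X_i
        if y < p_tilde(x):                             # step 3: accept
            out.append(x)                              # step 4: else retry
    return out

# Target (unnormalized): p_tilde(x) = exp(-|x|), a Laplace shape with Z = 2.
p_tilde = lambda x: math.exp(-abs(x))
# Proposal: standard Cauchy, sampled by inverting its CDF.
r_density = lambda x: 1.0 / (math.pi * (1.0 + x * x))
sample_r = lambda: math.tan(math.pi * (random.random() - 0.5))

random.seed(1)
draws = envelope_rejection(p_tilde, sample_r, r_density, math.pi, 4000)
mean = sum(draws) / len(draws)   # the Laplace(0, 1) target has mean 0
```

Note the envelope must have tails at least as heavy as the target; a Gaussian proposal would fail here, since exp(−|x|) eventually exceeds any scaled Gaussian.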

FACTORIZATION P ERSPECTIVE

Factorization

We factorize the target distribution or density p as

p(x) ∝ r(x) · A(x)

where r is a distribution from which we know how to sample, and A is a probability function we can evaluate once a specific value of x is given.

The sampler draws X′ ∼ r and Z | X′ ∼ Bernoulli(A(X′)), and keeps X′ whenever Z = 1.

I NDEPENDENCE

If we draw proposal samples Xi i.i.d. from r, the resulting sequence of accepted samples

produced by rejection sampling is again i.i.d. with distribution p. Hence:

Rejection samplers produce i.i.d. sequences of samples.

Important consequence

If samples X1, X2, . . . are drawn by a rejection sampler, the sample average

(1/m) Σ_{i=1}^m f(Xi)

is a Monte Carlo estimate of Ep[f(X)] based on i.i.d. samples, so the law of large numbers applies directly.

E FFICIENCY

The fraction of accepted samples is the ratio |A| / |B| of the areas under the curves p̃ and r.

If r encloses p̃ loosely, that ratio is small, and most proposal samples are rejected: we waste samples.

A N IMPORTANT BIT OF IMPRECISE INTUITION

Example figures for sampling methods tend to look like this: a single smooth, low-dimensional bump. A high-dimensional distribution of correlated RVs will look rather more like this: many small regions of concentrated mass, separated by large low-probability regions.

Intractable posterior distributions arise when there are several interacting random variables.

In one-dimensional problems (1 RV), we can usually compute the posterior analytically.

Independent multi-dimensional distributions factorize and reduce to the one-dimensional case.

W HY IS NOT EVERY SAMPLER A REJECTION SAMPLER ?

We can easily end up in situations where we accept only one in 10^6 (or 10^10, or 10^20, . . .) proposal samples. Especially in higher dimensions, we have to expect this to be not the exception but the rule.

I MPORTANCE S AMPLING

The rejection problem can be fixed easily if we are only interested in approximating an

expectation Ep [ f (X)].

Suppose p is the target density and q a proposal density. An expectation under p can be

rewritten as

Ep[f(X)] = ∫ f(x) p(x) dx = ∫ f(x) (p(x) / q(x)) q(x) dx = Eq[ f(X) p(X) / q(X) ]

Importance sampling

We can sample X1, X2, . . . from q and approximate Ep[f(X)] as

Ep[f(X)] ≈ (1/m) Σ_{i=1}^m f(Xi) · (p(Xi) / q(Xi))

This method is called importance sampling. The coefficients p(Xi)/q(Xi) are called importance weights.

I MPORTANCE S AMPLING

General case: We can only evaluate p̃

In the general case,

p = (1/Zp) p̃   and   q = (1/Zq) q̃ ,

and Zp (and possibly Zq) are unknown. We can write Zp/Zq as

Zp / Zq = ( ∫ p̃(x) dx ) / Zq = ∫ (p̃(x) / q̃(x)) · (q̃(x) / Zq) dx = ∫ (p̃(x) / q̃(x)) q(x) dx = Eq[ p̃(X) / q̃(X) ]

The fraction Zp/Zq can hence be approximated using samples X1:m from q:

Zp / Zq = Eq[ p̃(X) / q̃(X) ] ≈ (1/m) Σ_{i=1}^m p̃(Xi) / q̃(Xi)

Approximating Ep[f(X)]

Ep[f(X)] ≈ (1/m) Σ_{i=1}^m f(Xi) (p(Xi)/q(Xi)) = (1/m) Σ_{i=1}^m (Zq/Zp) f(Xi) (p̃(Xi)/q̃(Xi)) ≈ Σ_{i=1}^m f(Xi) (p̃(Xi)/q̃(Xi)) / Σ_{j=1}^m (p̃(Xj)/q̃(Xj))

I MPORTANCE S AMPLING IN G ENERAL

Conditions

Given are a target distribution p and a proposal distribution q, with

p = (1/Zp) p̃   and   q = (1/Zq) q̃ .

We can evaluate p̃ and q̃, and we can sample from q.

The objective is to compute Ep[f(X)] for a given function f.

Algorithm

1. Sample X1, . . . , Xm from q.
2. Approximate Ep[f(X)] as

Ep[f(X)] ≈ Σ_{i=1}^m f(Xi) (p̃(Xi)/q̃(Xi)) / Σ_{j=1}^m (p̃(Xj)/q̃(Xj))

M ARKOV C HAIN M ONTE C ARLO

M OTIVATION

[Figure: a distribution concentrating its mass in a narrow region of interest.]

Once we have drawn a sample in the narrow region of interest, we would like to continue

drawing samples within the same region. That is only possible if each sample depends on the

location of the previous sample.

Proposals in rejection sampling are i.i.d. Hence, once we have found the region where p

concentrates, we forget about it for the next sample.

MCMC: I DEA

A sufficiently nice Markov chain (MC) has an invariant distribution Pinv .

Once the MC has converged to Pinv , each sample Xi from the chain has marginal

distribution Pinv .

We want to sample from a distribution with density p. Suppose we can define a MC with invariant distribution Pinv ≡ p. If we sample X1, X2, . . . from the chain, then once it has converged, we obtain samples

Xi ∼ p .

This sampling technique is called Markov chain Monte Carlo (MCMC).

Note: For a Markov chain, Xi+1 can depend on Xi , so at least in principle, it is possible for an

MCMC sampler to "remember" the previous step and remain in a high-probability location.

C ONTINUOUS M ARKOV C HAIN

The Markov chains we discussed so far had a finite state space X. For MCMC, the state space now has to be the domain of p, so we often need to work with continuous state spaces.

A continuous Markov chain is defined by an initial distribution Pinit and conditional probability

t(y|x), the transition probability or transition kernel.

In the discrete case, t(y = i|x = j) is the entry pij of the transition matrix p.

We can define a very simple Markov chain by sampling

Xi+1 | Xi = xi ∼ g( · | xi, σ²)

where g(x | μ, σ²) is a spherical Gaussian with fixed variance. In other words, the transition distribution is

t(xi+1 | xi) := g(xi+1 | xi, σ²) .

[Figure: a Gaussian (gray contours) is placed around the current point xi to sample Xi+1.]

I NVARIANT D ISTRIBUTION

The invariant distribution Pinv is a distribution on the finite state space X of the MC

(i.e. a vector of length |X|).

"Invariant" means that, if Xi is distributed according to Pinv, and we execute a step Xi+1 ∼ t( · | xi) of the chain, then Xi+1 again has distribution Pinv.

In terms of the transition matrix p:

p Pinv = Pinv

Continuous case

X is now uncountable (e.g. X = Rd ).

The transition matrix p is substituted by the conditional probability t.

A distribution Pinv with density pinv is invariant if

∫_X t(y | x) pinv(x) dx = pinv(y)

This is simply the continuous analogue of the equation Σ_i p_ij (Pinv)_i = (Pinv)_j.

M ARKOV C HAIN S AMPLING

[Figure: three panels. (1) We run the Markov chain for n steps; each step moves from the current location xi to a new location xi+1. (2) We "forget" the order and regard the locations x1:n as a random set of points. (3) If p (red contours) is both the invariant and the initial distribution, each Xi is distributed as Xi ∼ p.]

1. We have to construct a MC with invariant distribution p.

2. We cannot actually start sampling with X1 ∼ p; if we knew how to sample from p, all of this would be pointless.

3. Each point Xi is marginally distributed as Xi ∼ p, but the points are not i.i.d.

C ONSTRUCTING THE M ARKOV C HAIN

Given is a continuous target distribution with density p.

1. We start by defining a conditional probability q(y|x) on X.

   q has nothing to do with p. We could e.g. choose q(y|x) = g(y | x, σ²), as in the previous example.

2. We define the Metropolis-Hastings acceptance probability

   A(xi+1 | xi) := min{ 1, ( q(xi | xi+1) p̃(xi+1) ) / ( q(xi+1 | xi) p̃(xi) ) }

   The normalization of p cancels in the quotient, so knowing p̃ is again enough.

3. We define the transition probability of the chain as

   t(xi+1 | xi) := q(xi+1 | xi) A(xi+1 | xi) + δ_{xi}(xi+1) c(xi)   where   c(xi) := ∫ q(y | xi)(1 − A(y | xi)) dy

   is the total probability that a proposal is sampled and then rejected.

Sampling from the chain: At each step i + 1, generate a proposal X* ∼ q( · | xi) and Ui ∼ Uniform[0, 1].

If Ui ≤ A(X* | xi), accept the proposal: set xi+1 := X*.

Otherwise, reject it: set xi+1 := xi.
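A minimal sketch of this sampler in Python, under two illustrative assumptions: the proposal is a random-walk Gaussian (which is symmetric, so the q-ratio cancels), and the target is an unnormalized standard normal.

```python
import math
import random

def metropolis_hastings(p_tilde, x0, sigma, n_steps):
    """Random-walk Metropolis-Hastings with a Gaussian proposal.

    Because q(y|x) = N(y; x, sigma^2) is symmetric, q(x|y)/q(y|x) = 1
    and the acceptance probability reduces to min(1, p_tilde(y)/p_tilde(x)).
    Only the unnormalized target p_tilde is needed.
    """
    x = x0
    chain = []
    for _ in range(n_steps):
        proposal = random.gauss(x, sigma)
        accept_prob = min(1.0, p_tilde(proposal) / p_tilde(x))
        if random.random() <= accept_prob:
            x = proposal                 # accept the proposal
        chain.append(x)                  # on rejection, the chain repeats x
    return chain

# Target: standard normal, known only up to its normalizer.
random.seed(3)
chain = metropolis_hastings(lambda x: math.exp(-0.5 * x * x),
                            x0=5.0, sigma=1.0, n_steps=20000)
kept = chain[2000:]                      # discard burn-in (see next slides)
mean = sum(kept) / len(kept)
```

The chain starts far from the mass of the target (x0 = 5), which is exactly why the first stretch of samples has to be discarded as burn-in.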

P ROBLEM 1: I NITIAL DISTRIBUTION

Suppose we sample X1 ∼ Pinit and Xi+1 ∼ t( · | xi). This defines a distribution Pi of Xi, which can change from step to step. If the MC is nice (recall: recurrent and aperiodic), then

Pi → Pinv as i → ∞.

Note: Making precise what aperiodic means in a continuous state space is a bit more technical than in the finite case, but the

theorem still holds. We will not worry about the details here.

Implication

If we can show that Pinv ≡ p, we do not have to know how to sample from p.

Instead, we can start with any Pinit, and will get arbitrarily close to p for sufficiently large i.

B URN -I N AND M IXING T IME

The number m of steps required until Pm ≈ Pinv ≡ p is called the mixing time of the Markov chain. (In probability theory, there is a range of definitions for what exactly Pm ≈ Pinv means.)

In MC samplers, the first m samples are also called the burn-in phase. The first m samples of

each run of the sampler are discarded:

X1, . . . , Xm−1 (burn-in; discard)  |  Xm, Xm+1, . . . (samples from (approximately) p; keep)

Convergence diagnostics

In practice, we do not know how large m is. There are a number of methods for assessing whether

the sampler has mixed. Such heuristics are often referred to as convergence diagnostics.

P ROBLEM 2: S EQUENTIAL D EPENDENCE

Even after burn-in, the samples from a MC are not i.i.d.

Strategy:

Estimate empirically how many steps L are needed for xi and xi+L to be approximately independent. The number L is called the lag.

After burn-in, keep only every Lth sample; discard the samples in between.

Estimating the lag

The most common method uses the autocorrelation function:

Auto(xi, xj) := E[(xi − μi)(xj − μj)] / (σi σj)

We compute Auto(xi, xi+L) empirically from the sample for different values of L, and find the smallest L for which the autocorrelation is close to zero.

[Figure: autocorrelation plots (in R: > autocorr.plot(mh.draws)); the autocorrelation decays from 1 toward 0 as the lag L grows from 0 to 25.]
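The empirical autocorrelation and the thinning step can be sketched as follows; the AR(1) chain used as input is a stand-in for MCMC output, chosen because its true lag-L autocorrelation (0.8^L) is known.

```python
import random

def autocorrelation(xs, lag):
    """Empirical lag-L autocorrelation of a sequence of samples."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean)
              for i in range(n - lag)) / (n - lag)
    return cov / var

def thin(xs, lag):
    """Keep only every lag-th sample (applied after burn-in removal)."""
    return xs[::lag]

# Illustrative autocorrelated chain: x_t = 0.8 x_{t-1} + noise.
random.seed(4)
xs, x = [], 0.0
for _ in range(50000):
    x = 0.8 * x + random.gauss(0.0, 1.0)
    xs.append(x)

rho1 = autocorrelation(xs, 1)    # near 0.8: adjacent samples are dependent
rho20 = autocorrelation(xs, 20)  # near 0.8**20, i.e. essentially zero
```

Scanning lags until the autocorrelation is close to zero (here, around L ≈ 20) gives the thinning interval; `thin(xs, 20)` then yields approximately independent samples.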

C ONVERGENCE D IAGNOSTICS

There are about half a dozen popular convergence criteria; the one below is an example.

Gelman-Rubin criterion

Start several chains at random. For each chain k, sample Xi^k has a marginal distribution Pi^k.

The distributions Pi^k will differ between chains in early stages.

Once the chains have converged, all Pi = Pinv are identical.

Criterion: Use a hypothesis test to compare Pki for different k

(e.g. compare P2i against null hypothesis P1i ). Once the test

does not reject anymore, assume that the chains are past

burn-in.

Reference: A. Gelman and D. B. Rubin: "Inference from Iterative Simulation Using Multiple Sequences", Statistical Science, Vol. 7 (1992) 457-511.

S TOCHASTIC H ILL -C LIMBING

Recall the MH acceptance probability:

A(xi+1 | xi) = min{ 1, ( q(xi | xi+1) p̃(xi+1) ) / ( q(xi+1 | xi) p̃(xi) ) } .

Hence, we certainly accept if the second term is larger than 1, i.e. if

q(xi | xi+1) p̃(xi+1) > q(xi+1 | xi) p̃(xi) .

That means:

We always accept the proposal value xi+1 if it increases the probability under p.

If it decreases the probability, we still accept with a probability which depends on the

difference to the current probability.

Hill-climbing interpretation

The MH sampler somewhat resembles a gradient ascent algorithm on p, which tends to

move in the direction of increasing probability p.

However:

The actual steps are chosen at random.

The sampler can move "downhill" with a certain probability.

When it reaches a local maximum, it does not get stuck there.

S ELECTING A P ROPOSAL D ISTRIBUTION

Everyone's favorite example: Two Gaussians

Var[q] too large: will overstep p; many rejections.

Var[q] too small: many steps needed to achieve good coverage of the domain.

If p is unimodal and can be roughly approximated by a Gaussian, Var[q] should be chosen as the smallest covariance component of p.

(red = target distribution p, gray = proposal distribution q)

More generally

For complicated posteriors (recall: small regions of concentration, large low-probability regions

in between) choosing q is much more difficult. To choose q with good performance, we already

need to know something about the posterior.

There are many strategies, e.g. mixture proposals (with one component for large steps and one

for small steps).

S UMMARY: MH S AMPLER

MCMC works by constructing a Markov chain whose invariant distribution is the target p. The MH kernel is one generic way to construct such a chain from p̃ and a proposal distribution q.

Formally, q does not depend on p (but an arbitrary choice of q usually means bad performance).

We have to discard an initial number m of samples as burn-in to obtain samples

(approximately) distributed according to p.

After burn-in, we keep only every Lth sample (where L = lag) to make sure the xi are

(approximately) independent.

X1, . . . (burn-in; discard) | keep Xj; samples correlated with Xj: discard | keep Xj+L; samples correlated with Xj+L: discard | . . .

T HE G IBBS SAMPLER

G IBBS S AMPLING

By far the most widely used MCMC algorithm is the Gibbs sampler.

Full conditionals

Suppose L(X) is a distribution on RD , so X = (X1 , . . . , XD ). The conditional probability of the

entry Xd given all other entries,

L(Xd | X1, . . . , Xd−1, Xd+1, . . . , XD)

is called the full conditional distribution of Xd .

On RD, that means we are interested in a density

p(xd | x1, . . . , xd−1, xd+1, . . . , xD)

Gibbs sampling

The Gibbs sampler is the special case of the Metropolis-Hastings algorithm defined by

proposal distribution for Xd = full conditional of Xd .

Gibbs sampling is only applicable if we can compute the full conditionals for each

dimension d.

If so, it provides us with a generic way to derive a proposal distribution.

T HE G IBBS S AMPLER

Proposal distribution

Suppose p is a distribution on RD , so each sample is of the form Xi = (Xi,1 , . . . , Xi,D ). We

generate a proposal Xi+1 coordinate-by-coordinate as follows:

Xi+1,1 ∼ p( · | xi,2, . . . , xi,D)
   ⋮
Xi+1,d ∼ p( · | xi+1,1, . . . , xi+1,d−1, xi,d+1, . . . , xi,D)
   ⋮
Xi+1,D ∼ p( · | xi+1,1, . . . , xi+1,D−1)

Note: Each new Xi+1,d is immediately used in the update of the next dimension d + 1.

No rejections

It is straightforward to show that the Metropolis-Hastings acceptance probability for each

xi+1,d+1 is 1, so proposals in Gibbs sampling are always accepted.

E XAMPLE : MRF

[Figure: a grid site θd with neighbors θup, θdown, θleft, θright.]

Full conditionals

In a grid with 4-neighborhoods, for instance, the Markov property implies that

p(θd | θ1, . . . , θd−1, θd+1, . . . , θD) = p(θd | θleft, θright, θup, θdown)

Recall that, for sampling, knowing only p̃ (unnormalized) is sufficient:

p(θd | θ1, . . . , θd−1, θd+1, . . . , θD) ∝ exp( β (I{θd = θleft} + I{θd = θright} + I{θd = θup} + I{θd = θdown}) )

E XAMPLE : MRF

Each step of the sampler generates a sample

Θi = (Θi,1, . . . , Θi,D) ,

where D is the number of vertices in the grid.

Gibbs sampler

Each step of the Gibbs sampler generates D coordinate updates according to

Θi+1,d ∼ p( · | θi+1,1, . . . , θi+1,d−1, θi,d+1, . . . , θi,D)
        ∝ exp( β (I{θi+1,d = θleft} + I{θi+1,d = θright} + I{θi+1,d = θup} + I{θi+1,d = θdown}) )
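One sweep of this Gibbs sampler can be sketched for a two-state model on a small grid; the grid size and the coupling β are illustrative choices. Each site is resampled from its full conditional, which needs the target only up to normalization.

```python
import math
import random

def gibbs_sweep(grid, beta):
    """One Gibbs sweep over a 2-state grid MRF with coupling beta.

    Each site's full conditional is proportional to
    exp(beta * #{neighbors that agree with the proposed state}).
    """
    rows, cols = len(grid), len(grid[0])
    for i in range(rows):
        for j in range(cols):
            weights = []
            for state in (0, 1):
                agree = sum(1 for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                            if 0 <= i + di < rows and 0 <= j + dj < cols
                            and grid[i + di][j + dj] == state)
                weights.append(math.exp(beta * agree))
            total = weights[0] + weights[1]
            grid[i][j] = 0 if random.random() < weights[0] / total else 1
    return grid

random.seed(5)
grid = [[random.randint(0, 1) for _ in range(20)] for _ in range(20)]
for _ in range(50):
    gibbs_sweep(grid, beta=1.5)

# With a strongly smoothing beta, most horizontally adjacent pairs agree.
agree = sum(1 for i in range(20) for j in range(19)
            if grid[i][j] == grid[i][j + 1])
```

Starting from i.i.d. noise (about half of the 380 horizontal pairs agreeing), the sweeps visibly smooth the field, which previews the burn-in example that follows.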

B URN -I N M ATTERS

This example is due to Erik Sudderth (UC Irvine).

MRFs as "segmentation" priors

MRFs were introduced as tools for image smoothing and segmentation by D. and S. Geman in 1984.

They sampled from a Potts model with a Gibbs sampler, discarding 200 iterations as burn-in.

burn-in.

Such a sample (after 200 steps) is shown above, for a Potts model in which each variable

can take one out of 5 possible values.

These patterns led computer vision researchers to conclude that MRFs are "natural" priors

for image segmentation, since samples from the MRF resemble a segmented image.

E XAMPLE : B URN -I N M ATTERS

E. Sudderth ran a Gibbs sampler on the same model (a Potts model with K = 5 states on a grid) and sampled after 200 iterations (as the Geman brothers did), and again after 10,000 iterations.

[Figure: samples from several chains after 200 iterations and after 10,000 iterations.]

The "segmentation" patterns are not sampled from the MRF distribution p ≡ Pinv, but rather from P200 ≠ Pinv.

The patterns occur not because MRFs are "natural" priors for segmentations, but because the sampler's Markov chain has not mixed.

MRFs are smoothness priors, not segmentation priors.

VARIATIONAL I NFERENCE

VARIATIONAL I NFERENCE : I DEA

Problem

We have to solve an inference problem where the correct solution is an intractable distribution with density p* (e.g. a complicated posterior in a Bayesian inference problem).

Variational approach

Approximate p* as

q* := arg min_{q ∈ Q} φ(q, p*)

where Q is a class of simple distributions and φ is a cost function (small φ means good fit).

That turns the inference problem into a constrained optimization problem

min φ(q, p*)   s.t.  q ∈ Q

In other words, we approximate p* by minimizing a distance (or discrepancy) to a class of tractable distributions.

BACKGROUND : VARIATIONAL M ETHODS

Formulate your problem such that the solution x* ∈ Rd is the minimum of some function f, and solve

x* := arg min_{x ∈ Rd} f(x)

possibly under constraints.

Examples: Support vector machines, linear regression, logistic regression, . . .

In variational inference, the optimization variable is itself a function:

q* := arg min_{q ∈ Q} φ(q, p*)

We have to optimize over a space of functions. Such spaces are in general infinite-dimensional.

Often: Q is a parametric model, with parameter space T ⊂ Rd ⇒ the problem reduces to optimization over Rd.

However: Optimization over infinite-dimensional spaces is in principle possible.

O PTIMIZATION OF FUNCTIONALS

A function φ : F → R (a function whose arguments are functions) is called a functional.

Examples: (1) The integral of a function. (2) The differential entropy of a density.

The differential of f : Rd → R at a point x is

f′(x) = lim_{ε → 0} ( f(x + ε) − f(x) ) / ε   if d = 1,   or   f′(x) = lim_{‖Δx‖ → 0} ( f(x + Δx) − f(x) ) / ‖Δx‖   in general.

The d-dimensional case works by reducing to the 1-dimensional case using a norm.

Derivatives of functionals

If F is a function space and ‖·‖ a norm on F, we can apply the same idea to φ : F → R:

φ′(f) := lim_{‖Δf‖ → 0} ( φ(f + Δf) − φ(f) ) / ‖Δf‖

φ′(f) is called the Fréchet derivative of φ at f.

f* is a minimum of a Fréchet-differentiable functional φ only if φ′(f*) = 0.

O PTIMIZATION OF F UNCTIONALS

Optimization

We can in principle find a minimum of φ by gradient descent: add increment functions Δfk in the direction of −φ′(fk) to the current solution candidate fk.

The maximum entropy problem is often cited as an example.

Horseshoes

We have to represent the infinite-dimensional quantities fk and φ′(fk) in some way.

Many interesting functionals are not Fréchet-differentiable as functionals on F. They only become differentiable when constrained to a much smaller subspace.

One solution is variational calculus, an analytic technique that addresses both problems. (We

will not need the details.)

The maximum entropy principle chooses a distribution within some set P of candidates

by selecting the one with the largest entropy.

That is: It solves the optimization problem

max H(p)   s.t.  p ∈ P

For example, if P are all those distributions under which some given statistic S takes a

given expected value, we obtain exponential family distributions with sufficient statistic S.

O PTIMIZATION OF F UNCTIONALS

Maximum entropy as functional optimization

The entropy H assigns a scalar to a distribution: it is a functional!

Problem: The entropy as a functional e.g. on all distributions on R is concave, but it is not

differentiable; it is not even continuous.

The solution for exponential families can be determined using variational calculus.

We will be interested in problems of the form

min_q φ(q)   s.t.  q ∈ Q

where Q is a parametric family.

That means each element of Q is of the form q( · | θ), for θ ∈ T ⊂ Rd.

The problem then reduces back to optimization in Rd:

min_θ φ(q( · | θ))   s.t.  θ ∈ T

We can apply gradient descent, Newton, etc.

K ULLBACK -L EIBLER D IVERGENCE

Recall

The information in observing X = x under a probability mass function P is

JP(x) := log( 1 / P(x) ) = − log P(x) .

Its expectation H(P) := EP[JP(X)] is the entropy of P.

The Kullback-Leibler divergence of P and Q is

DKL(P ‖ Q) := EP[JQ(X)] − H(P) = Σ_x P(x) log( P(x) / Q(x) )

If p and q are probability densities, then

H(p) := − ∫ p(x) log p(x) dx   and   DKL(p ‖ q) := ∫ p(x) log( p(x) / q(x) ) dx

are the differential entropy of p and the Kullback-Leibler divergence of p and q.

Be careful

The differential entropy does not behave like the entropy (e.g. it can be negative).

The KL divergence for densities has properties analogous to the mass function case.
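For mass functions, both quantities are short sums; the toy distributions below are illustrative. The example also makes the asymmetry of DKL concrete.

```python
import math

def entropy(p):
    """Entropy H(P) = E_P[-log P(X)] of a probability mass function."""
    return -sum(px * math.log(px) for px in p if px > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]

d_pq = kl_divergence(p, q)   # both divergences are nonnegative ...
d_qp = kl_divergence(q, p)   # ... but they are not equal in general
```

D_KL(P ‖ Q) vanishes exactly when P = Q, and the entropy of the uniform distribution on 3 events is log 3, the maximum possible.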

VARIATIONAL I NFERENCE WITH KL

Recall VI optimization problem

q* := arg min_{q ∈ Q} φ(q, p*)

We have to choose a cost function φ.

The term variational inference in machine learning typically implies that φ is a KL divergence:

q* := arg min_{q ∈ Q} DKL(q ‖ p*)

Recall that DKL is not symmetric, so

DKL(q ‖ p*) ≠ DKL(p* ‖ q)

Which order should we use?

Recall DKL(p ‖ q) is an expectation with respect to p.

DKL(p* ‖ q) emphasizes regions where the true model p* has high probability. That is what we should use if possible.

We use VI because p* is intractable, so we can usually not compute expectations under it. We use the expectation DKL(q ‖ p*) under the approximating simpler model instead.

We have to understand the implications of this choice.

We have to understand the implications of this choice.

E XAMPLE

[Figure: contours of a correlated Gaussian p* (green) on coordinates (z1, z2), together with factorized Gaussian approximations q (red). Left, "what VI would do if possible": minimizing DKL(p* ‖ q) = DKL(green ‖ red). Right, "what VI does": minimizing DKL(q ‖ p*) = DKL(red ‖ green).]

VI FOR P OSTERIOR D ISTRIBUTIONS

If the posterior density is p*(z) = p(z | x) = p(x | z) p(z) / p(x), then

DKL( q( · ) ‖ p( · | x) ) = E_q[ log( q(Z) / p(Z | x) ) ]
                         = E_q[log q(Z)] − E_q[log p(Z | x)]
                         = E_q[log q(Z)] − E_q[log p(Z, x)] + log p(x)

The term log p(x) depends only on x, so it is an additive constant w.r.t. the optimization problem. Dropping it from the objective function does not change the location of the minimum. That leaves the objective

F(q) := E_q[log q(Z)] − E_q[log p(Z, x)]

VI FOR P OSTERIOR D ISTRIBUTIONS

Summary: VI approximation

min F(q)   where   F(q) = E_q[log q(Z)] − E_q[log p(Z, x)]
s.t.  q ∈ Q

Terminology

The function F is called a free energy in statistical physics.

Since there are different forms of free energies, various authors attach different adjectives

(variational free energy, Helmholtz free energy, etc).

Parts of the machine learning literature have renamed F: they maximize the objective function −F and call it an evidence lower bound (ELBO), since

−F(q) + DKL( q ‖ p( · | x) ) = log p(x)   hence   e^{−F(q)} ≤ p(x) ,

and p(x) is the evidence in the Bayes equation.
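The free-energy identity can be verified numerically on a toy discrete model; the joint table and the approximating q below are arbitrary illustrative choices (signs as in the definition F(q) = E[log q(Z)] − E[log p(Z, x)]).

```python
import math

# Toy joint p(z, x) for one fixed observed x: two latent states z in {0, 1}.
p_joint = {0: 0.3, 1: 0.1}                      # values of p(z, x)
p_x = sum(p_joint.values())                     # evidence p(x) = 0.4
p_post = {z: p_joint[z] / p_x for z in p_joint} # posterior p(z | x)

def free_energy(q):
    """F(q) = E_q[log q(Z)] - E_q[log p(Z, x)]."""
    return sum(q[z] * (math.log(q[z]) - math.log(p_joint[z])) for z in q)

def kl(q, p):
    return sum(q[z] * math.log(q[z] / p[z]) for z in q)

q = {0: 0.5, 1: 0.5}                            # some approximating q
# Identity: -F(q) + D_KL(q || p(.|x)) = log p(x),
# so exp(-F(q)) <= p(x): the evidence lower bound.
lhs = -free_energy(q) + kl(q, p_post)
```

Since DKL ≥ 0, minimizing F over q simultaneously tightens the lower bound on the evidence and pushes q toward the posterior.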

M EAN F IELD A PPROXIMATION

Definition

A variational approximation q* := arg min_{q ∈ Q} DKL(q ‖ p*) of a probability distribution p* on a d-dimensional space is called a mean field approximation if each q ∈ Q factorizes over the dimensions,

q(z) = q1(z1) · · · qd(zd) .

In previous example

[Figure: of the Gaussian approximations on (z1, z2) above, the spherical Gaussian is a mean field approximation; the correlated (elongated) Gaussian is not.]

E XAMPLE : M EAN F IELD FOR THE P OTTS M ODEL

Model

We consider a MRF distribution for X1, . . . , Xn with values in {−1, +1}, given by

P(X1, . . . , Xn) = (1 / Z(β)) exp( −β H(X1, . . . , Xn) )   where   H = − Σ_{i,j=1}^n wij Xi Xj − Σ_{i=1}^n hi Xi

(the second term is the external field). Physicists call this a Potts model with an external magnetic field.

Variational approximation

We choose Q as the family

Q := { ∏_{i=1}^n Q_{mi} | mi ∈ [−1, 1] }   where   Q_m(X) := (1 + m)/2 if X = +1 and (1 − m)/2 if X = −1.

Each factor is a Bernoulli( (1 + m)/2 ), except that the range is {−1, 1} rather than {0, 1}.

E XAMPLE : M EAN F IELD FOR THE P OTTS M ODEL

Optimization Problem

min DKL( ∏_{i=1}^n Q_{mi} ‖ P )
s.t.  mi ∈ [−1, 1] for i = 1, . . . , n

The mean field approximation is given by the parameter values mi satisfying the equations

mi = tanh( β ( Σ_{j=1}^n wij mj + hi ) ) .

That is: For given values of wij and hi, we have to solve this system of equations for the values mi.
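The fixed-point equations can be solved by simple iteration; the small coupling matrix, field, and β below are illustrative values for which the iteration is a contraction and converges.

```python
import math

def mean_field_update(w, h, beta, n_iter=200):
    """Iterate the mean field fixed-point equations
         m_i = tanh(beta * (sum_j w_ij m_j + h_i))
    starting from an arbitrary point in [-1, 1]^n."""
    n = len(h)
    m = [0.5] * n
    for _ in range(n_iter):
        m = [math.tanh(beta * (sum(w[i][j] * m[j] for j in range(n)) + h[i]))
             for i in range(n)]
    return m

# Hypothetical small system: 3 fully coupled sites, weak external field.
w = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]
h = [0.1, 0.1, 0.1]
m = mean_field_update(w, h, beta=0.3)

# Verify the fixed point: each m_i satisfies its own equation.
residual = max(abs(m[i] - math.tanh(0.3 * (sum(w[i][j] * m[j] for j in range(3)) + h[i])))
               for i in range(3))
```

For larger β the iteration can have multiple fixed points (the two magnetized branches), in which case the starting point decides which solution is found.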

E XAMPLE : M EAN F IELD FOR THE P OTTS M ODEL

We have approximated the MRF

P(X1, . . . , Xn)   by   ∏_{i=1}^n Bernoulli(mi)   satisfying   mi = tanh( β ( Σ_{j=1}^n wij mj + hi ) )

Interpretation

In the MRF P, the random variables Xi interact.

There is no interaction in the approximation.

Instead, the effect of interactions is approximated by encoding them in the parameters.

This is somewhat like a single effect (field) acting on all variables simultaneously (mean field).

In physics, P is used to model a ferromagnet with an external magnetic field.

In this case, β is the inverse temperature.

These systems exhibit a phenomenon called spontaneous magnetization at certain temperatures. The mean field solution predicts spontaneous magnetization, but at the wrong temperature values.

D IRECTED G RAPHICAL M ODELS :

M IXTURES AND A DMIXTURES

N EXT: BAYESIAN M IXTURES AND A DMIXTURES

Bayesian mixture models (mixtures with priors).

Admixture models, in which each observation can be influenced by several components (e.g. topics).

One particular admixture model, called latent Dirichlet allocation, is one of the most successful machine learning models of the past ten years.

F INITE M IXTURE AS A G RAPHICAL M ODELS

π(x) = Σ_{k=1}^K ck p(x | θk)

1. Fix ck and θk for k = 1, . . . , K.

2. Generate Zi ∼ Multinomial(c1, . . . , cK).

3. Generate Xi | Zi ∼ p( · | θ_{Zi}).

As a graphical model

Box notation indicates

c c and are not random

Z1 ... Zn

X1 ... Xn

P LATE N OTATION

If variables are sampled repeatedly in a graphical model, we enclose these variables in a plate.

[Diagram: the repeated nodes Z1, . . . , Zn and X1, . . . , Xn collapse to single nodes Z and X inside a plate labeled n.]

BAYESIAN M IXTURE M ODEL

π(x) = Σ_{k=1}^K ck p(x | θk) = ∫_T p(x | θ) m(θ) dθ   with   m := Σ_{k=1}^K ck δ_{θk}

All parameters are summarized in the mixing distribution m.

In a Bayesian model, parameters are random variables. Here, that means a random mixing distribution:

M( · ) = Σ_{k=1}^K Ck δ_{Θk}( · )

R ANDOM M IXING D ISTRIBUTION

Since M is discrete with finitely many terms, we only have to generate the random variables Ck and Θk:

M( · ) = Σ_{k=1}^K Ck δ_{Θk}( · )

More precisely

Specifically, the term BMM implies that all priors are natural conjugate priors. That is:

The mixture components p(x|) are an exponential family model.

The prior on each k is a natural conjugate prior of p.

The prior of the vector (C1 , . . . , CK ) is a Dirichlet distribution.

When we sample from a finite mixture, we choose a component k from a multinomial

distribution with parameter vector (c1 , . . . , ck ).

The conjugate prior of the multinomial is the Dirichlet distribution.

BAYESIAN M IXTURE M ODELS

Definition

A model of the form

π(x) = Σ_{k=1}^K Ck p(x | Θk) = ∫_T p(x | θ) M(θ) dθ

is called a Bayesian mixture model if p(x | θ) is an exponential family model and M a random mixing distribution, where:

Θ1, . . . , ΘK ∼iid q( · | λ, y), where q is a natural conjugate prior for p.

(C1, . . . , CK) is drawn from a Dirichlet distribution.

BAYESIAN M IXTURE AS A G RAPHICAL M ODEL

Sampling from a Bayesian Mixture

1. Draw C = (C1, . . . , CK) from a Dirichlet prior.

2. Draw Θ1, . . . , ΘK ∼iid q, where q is the conjugate prior of p.

3. Draw Zi | C ∼ Multinomial(C).

4. Draw Xi | Zi, Θ ∼ p( · | Θ_{Zi}).

As a graphical model

[Diagram: a plate over i = 1, . . . , n containing Zi → Xi, with C pointing into Zi and Θ pointing into Xi.]

BAYESIAN M IXTURE : I NFERENCE

Posterior distribution

The posterior density of a BMM under observations x1, . . . , xn is (up to normalization):

Π(c1:K, θ1:K | x1:n) ∝ ∏_{i=1}^n ( Σ_{k=1}^K ck p(xi | θk) ) · ∏_{k=1}^K q(θk | λ, y) · qDirichlet(c1:K)

Thanks to conjugacy, we can evaluate each term of the posterior.

However: Multiplied out, the ∏_{i=1}^n Σ_{k=1}^K . . . part means the posterior has K^n terms!

G IBBS S AMPLER FOR THE BMM

This Gibbs sampler is a bit harder to derive, so we skip the derivation and only look at the algorithm.

Exponential family likelihood p(x | θk) for each cluster k = 1, . . . , K.

Natural conjugate prior q for all Θk.

Dirichlet prior Dirichlet(α, g) for the mixture weights c1:K.

Assignment probabilities

Each step of the Gibbs sampler computes an assignment matrix

a = ( aik )_{i=1..n, k=1..K}   where   aik = Pr{ xi in cluster k } .

Entries are computed as they are in the EM algorithm:

aik = Ck p(xi | Θk) / Σ_{l=1}^K Cl p(xi | Θl)

In contrast to EM, the values Ck and Θk are random.

G IBBS FOR BMM: A LGORITHM

In each iteration j, the algorithm cycles through these steps:

1. For each xi, sample an assignment (the assignment probabilities are computed exactly as in EM):

   Zi^j ∼ Multinomial(ai1^j, . . . , aiK^j)   where   aik^j = Ck^{j−1} p(xi | Θk^{j−1}) / Σ_{l=1}^K Cl^{j−1} p(xi | Θl^{j−1})

2. For each cluster k, sample a new value Θk^j from the conjugate posterior Π(θk) under the observations currently assigned to k:

   Θk^j ∼ q( · | λ + Σ_{i=1}^n I{Zi^j = k},  y + Σ_{i=1}^n I{Zi^j = k} S(xi) )

   The first sum counts the observations assigned to cluster k; the second adds up their sufficient statistics.

3. Sample new cluster proportions C1:K^j from the Dirichlet posterior (under all xi):

   C1:K^j ∼ Dirichlet( α + n, g1:K^j )   where   gk^j = ( α gk + Σ_{i=1}^n I{Zi^j = k} ) / ( α + n )

   Here α gk is the prior concentration on cluster k, the sum counts the points assigned to k, and α + n normalizes.
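The three steps can be sketched for one concrete conjugate setup. This is a toy instantiation, not the general algorithm: unit-variance Gaussian components, a Normal(0, τ²) prior on each mean, and a symmetric Dirichlet prior on the weights (sampled by normalizing Gammas) are all assumptions made for the sketch.

```python
import math
import random

def gibbs_bmm(xs, K, n_iter=200, tau2=25.0, alpha=1.0):
    """Toy Gibbs sampler for a Bayesian Gaussian mixture (sketch)."""
    mu = [random.gauss(0.0, 1.0) for _ in range(K)]
    c = [1.0 / K] * K
    for _ in range(n_iter):
        # Step 1: sample assignments z_i from the probabilities a_ik.
        z = []
        for x in xs:
            w = [c[k] * math.exp(-0.5 * (x - mu[k]) ** 2) for k in range(K)]
            u, acc, pick = random.random() * sum(w), 0.0, 0
            for k in range(K):
                acc += w[k]
                if u <= acc:
                    pick = k
                    break
            z.append(pick)
        # Step 2: sample each mean from its conjugate normal posterior.
        for k in range(K):
            pts = [x for x, zk in zip(xs, z) if zk == k]
            var = 1.0 / (1.0 / tau2 + len(pts))
            mu[k] = random.gauss(var * sum(pts), math.sqrt(var))
        # Step 3: sample weights from the Dirichlet posterior (via Gammas).
        g = [random.gammavariate(alpha + z.count(k), 1.0) for k in range(K)]
        c = [gk / sum(g) for gk in g]
    return mu, c

# Two well-separated clusters around -5 and +5.
random.seed(6)
xs = [random.gauss(-5.0, 1.0) for _ in range(100)] + \
     [random.gauss(5.0, 1.0) for _ in range(100)]
mu, c = gibbs_bmm(xs, K=2)
```

After a few hundred sweeps the sampled means sit near the two cluster centers (up to label switching), and the sampled weights are near (1/2, 1/2).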

C OMPARISON : EM AND K- MEANS

The BMM Gibbs sampler looks very similar to the EM algorithm, with maximization steps (in

EM) substituted by posterior sampling steps:

                  Representation of assignments      Parameters
EM                assignment probabilities ai,1:K    aik-weighted MLE
K-means           mi = arg max_k (ai,1:K)            MLE for each cluster
Gibbs for BMM     mi ∼ Multinomial(ai,1:K)           sample posterior for each cluster

T OOLS : T HE D IRICHLET D ISTRIBUTION

T HE D IRICHLET D ISTRIBUTION

The set of all probability distributions on K events is the simplex

△K := { (c1, . . . , cK) ∈ R^K | ck ≥ 0 and Σ_k ck = 1 } .

[Figure: the simplex △3 with corners e1, e2, e3 and a point with coordinates (c1, c2, c3).]

Dirichlet distribution

The Dirichlet distribution is the distribution on △K with density

qDirichlet(c1:K | α, g1:K) := (1 / Z(α, g1:K)) exp( Σ_{k=1}^K (α gk − 1) log(ck) )

Parameters:

g1:K ∈ △K: Mean parameter, i.e. E[c1:K] = g1:K.

α ∈ R+: Concentration. Larger α ⇒ sharper concentration around g1:K.

T HE D IRICHLET D ISTRIBUTION

In all plots, g1:K = (1/3, 1/3, 1/3). Light colors = large density values.

[Figure: density plots and heat maps of the Dirichlet for α = 0.8, α = 1, α = 1.8, α = 10. For α = 0.8, large density values sit at the extreme points; α = 1 gives the uniform distribution on △K; for α = 1.8, the density peaks around its mean; for α = 10, the peak sharpens with increasing α.]

M ULTINOMIAL -D IRICHLET M ODEL

Model

The Dirichlet is the natural conjugate prior on the multinomial parameters. If we observe hk counts in category k, the posterior is

Π(c1:K | h1, . . . , hK) = qDirichlet( c1:K | α + n, ( (α g1 + h1)/(α + n), . . . , (α gK + hK)/(α + n) ) )

where n = Σ_k hk is the total number of observations.

Example: Suppose K = 3 and we obtain a single observation in category 3. The posterior mean shifts toward the corner of the simplex that corresponds to k = 3.
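The conjugate update in this (concentration, mean) parametrization is a one-liner per coordinate; the prior values below are illustrative.

```python
def dirichlet_posterior(alpha, g, counts):
    """Posterior of a Dirichlet prior with concentration alpha and mean
    vector g, under multinomial counts h_1, ..., h_K."""
    n = sum(counts)
    alpha_post = alpha + n
    g_post = [(alpha * gk + hk) / alpha_post for gk, hk in zip(g, counts)]
    return alpha_post, g_post

# K = 3 categories, uniform prior mean, one observation in category 3.
alpha_post, g_post = dirichlet_posterior(6.0, [1/3, 1/3, 1/3], [0, 0, 1])
# The posterior mean shifts toward category 3; the concentration grows by n.
```

The posterior mean stays on the simplex, and each observation both pulls the mean toward its category and sharpens the distribution.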

T HE D IRICHLET IS S PECIAL

Dependence between the components of the prior amplifies in the posterior.

To keep inference feasible: Keep variables in the prior as independent as possible.

If (c1, . . . , cK) is a random probability distribution, the ck cannot be independent.

How do we define c1:K so that the components are as independent as possible?

Idea: Start with independent variables X1, . . . , XK in (0, ∞). If we define

X̄k := Xk / Σ_{j=1}^K Xj   then   (X̄1, . . . , X̄K) ∈ △K

Fact: Suppose X1, . . . , XK are independent random variables. If

Xk ∼ Gamma(αk, 1)   then   (X̄1, . . . , X̄K) ∼ Dirichlet( Σ_k αk ; α1 / Σ_j αj, . . . , αK / Σ_j αj )
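This fact is also the standard recipe for sampling a Dirichlet; the parameter values below are illustrative.

```python
import random

def dirichlet_sample(alphas):
    """Sample from a Dirichlet by normalizing independent Gammas:
    X_k ~ Gamma(alpha_k, 1), then return (X_1, ..., X_K) / sum_j X_j."""
    xs = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(xs)
    return [x / total for x in xs]

random.seed(7)
draws = [dirichlet_sample([2.0, 2.0, 2.0]) for _ in range(10000)]
# Mean of each coordinate should approach alpha_k / sum(alphas) = 1/3.
mean0 = sum(d[0] for d in draws) / len(draws)
```

Every draw lies on the simplex by construction, and the empirical coordinate means match the mean parameter α_k / Σ_j α_j.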

T HE D IRICHLET IS S PECIAL

In general: Even if X1, . . . , XK are independent,

X̄k = Xk / Σ_j Xj   and   Σ_j Xj   are stochastically dependent.

So: Components of the prior couple (1) through normalization and (2) through the latent variable Σ_j Xj.

If X and Y are independent random variables in (0, ∞) (and not constant), then

X / (X + Y)  ⊥⊥  (X + Y)   if and only if   X and Y are gamma with the same scale parameter.

In the Dirichlet, components couple only through the normalization constraint.

Any other random probability defined by normalizing independent variables introduces

more dependence.

T EXT M ODELS

We assume the corpus is generated by a multinomial mixture model of the form

π(H) = Σ_{k=1}^K ck P(H | θk) ,

A document is represented by a histogram H.

Topics θ1, . . . , θK.

θkj = Pr{ word j in topic k }.

Problem

Each document is generated by a single topic; that is a very restrictive assumption.

S AMPLING D OCUMENTS

Parameters

Suppose we consider a corpus with K topics and a vocabulary of d words.

K topic parameter vectors θ1, . . . , θK, and K topic proportions π (πk = Pr{ topic k }).

Note: For random generation of documents, we assume that π and the topic parameters θk are given (they are properties of the corpus). To train the model, they have to be learned from data.

To sample a document containing M words:

1. Sample a topic Z ∼ Multinomial(π).

2. For i = 1, . . . , M: Sample wordi | Z ∼ Multinomial(θZ).

The entire document is sampled from the single topic Z.

LATENT DIRICHLET ALLOCATION

Mixtures of topics
Whether we sample words or entire documents makes a big difference.

When we sample from the multinomial mixture, we choose a topic at random, then
sample the entire document from that topic.

For several topics to be represented in the document, we have to sample each word
individually (i.e. choose a new topic for each word).

Problem: If we do that in the mixture above, every document has the same topic
proportions.

Instead, each document is explained as a mixture of topics, with document-specific mixture
weights C_{1:K}. Fix a matrix Φ of size #topics × #words, where

    φ_kj := probability that word j occurs under topic k .

To sample a document containing M words:

1. Sample topic proportions C_{1:K} ~ Dirichlet(α).
2. For i = 1, ..., M:
   2.1 Sample topic for word i as Z_i | C_{1:K} ~ Multinomial(C_{1:K}).
   2.2 Sample word_i | Z_i ~ Multinomial(φ_{Z_i}).

This model is known as Latent Dirichlet Allocation (LDA).
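For contrast with the mixture sampler, an LDA document draws a fresh topic for every word; again, the values of α and Φ below are illustrative toys:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([1.0, 1.0])              # Dirichlet parameter (illustrative)
Phi = np.array([[0.8, 0.1, 0.1],          # row k: word probabilities of topic k
                [0.1, 0.1, 0.8]])

def sample_document_lda(M):
    # 1. Per-document topic proportions.
    C = rng.dirichlet(alpha)
    # 2. A fresh topic for every word -> several topics per document.
    Z = rng.choice(len(alpha), size=M, p=C)
    words = np.array([rng.choice(Phi.shape[1], p=Phi[z]) for z in Z])
    return C, Z, words

C, Z, words = sample_document_lda(M=20)
print(C, Z, words)
```

The only change relative to the mixture sampler is moving the topic draw inside the loop over words, plus randomizing the proportions per document via the Dirichlet.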

COMPARISON: LDA AND BMM

Observation
LDA is almost a Bayesian mixture model: Both use multinomial components and a Dirichlet
prior on the mixture weights. However, they are not identical.

Comparison

    Bayesian MM                                    Admixture (LDA)
    Sample c_{1:K} ~ Dirichlet(α).                 Sample c_{1:K} ~ Dirichlet(α).
    Sample topic k ~ Multinomial(c_{1:K}).         For i = 1, ..., M:
    For i = 1, ..., M:                                 Sample topic k_i ~ Multinomial(c_{1:K}).
        Sample word_i ~ Multinomial(θ_k).              Sample word_i ~ Multinomial(θ_{k_i}).

In admixtures:
c_{1:K} is generated at random, once for each document.
LDA explains each document by a separate parameter c_{1:K} ∈ △_K. That is, LDA models
documents as topic proportions.

LDA AS A GRAPHICAL MODEL

(Plate diagrams: Bayesian mixture for M documents of N words each, and LDA.)

    C ~ Dirichlet(α)                  topic proportions
    Z ~ Multinomial(C)                topic of word
    word ~ Multinomial(row Z of Φ)    observed word
    N = # words,  M = # documents
    Φ has #topics × |vocabulary| entries; each row has sum 1.

PRIOR ON TOPIC PROBABILITIES

The parameter matrix Φ is of size #topics × |vocabulary|.

Meaning: φ_ki = probability that term i is observed under topic k.

Note: entries of Φ are non-negative and each row sums to 1.

To learn Φ along with the other parameters, we add a Dirichlet prior with parameters η.
The rows of Φ are drawn i.i.d. from this prior. Φ is now random.

VARIATIONAL INFERENCE FOR LDA

Target distribution
Posterior L(Z, C, Φ | words, α, η)

Variational approximation

    Q = { q(z, c, Φ | λ, γ, φ)  |  λ, γ, φ }

where

    q(z, c, Φ | λ, γ, φ) := Π_{k=1}^K q(Φ_k | λ_k) · Π_{m=1}^M q(c_m | γ_m) · Π_{n=1}^N q(z_mn | φ_mn)
                              Dirichlet               Dirichlet                 Multinomial

and we solve

    min  D_KL( q(z, c, Φ | λ, γ, φ) ‖ p(z, c, Φ | words, α, η) )
    s.t. q ∈ Q

VARIATIONAL INFERENCE FOR LDA

    q(z, c, Φ | λ, γ, φ) := Π_{k=1}^K q(Φ_k | λ_k) Π_{m=1}^M q(c_m | γ_m) Π_{n=1}^N q(z_mn | φ_mn)

(Plate diagrams: the LDA model p on the left, and the fully factorized variational family q,
in which the couplings between C, Z, Φ and the words are cut, on the right.)

ITERATIVE SOLUTION

Algorithmic solution
We solve the minimization problem

    min  D_KL( q(z, c, Φ | λ, γ, φ) ‖ p(z, c, Φ | words, α, η) )
    s.t. q ∈ Q

by applying gradient descent to the resulting free energy.

VI algorithm
It can be shown that gradient descent amounts to the following algorithm: iterate between
updating the local (per-document) parameters φ and γ and updating the global parameters λ.

UPDATE EQUATIONS

Local updates

    φ_mn^(t+1) := ψ_mn / Σ_n ψ_mn

where

    ψ_mn = exp( E_q[ log(C_m1), ..., log(C_mK) | γ^(t) ] + E_q[ log(Φ_{1,w_n}), ..., log(Φ_{K,w_n}) | λ^(t) ] )

    γ_m^(t+1) := α + Σ_{n=1}^N φ_mn^(t+1)

Global updates

    λ_k^(t+1) = η + Σ_m Σ_n word_mn φ_mn^(t+1)
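A minimal single-document sketch of these updates, using that E_q[log C_k] = ψ(γ_k) − ψ(Σ_j γ_j) under a Dirichlet (ψ = digamma); the toy word list, K, d, and iteration count are arbitrary choices, and the document index m is dropped:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)

K, d = 2, 5                                # topics, vocabulary size
alpha, eta = 1.0, 1.0
words = np.array([0, 0, 3, 4, 4, 4])       # word indices of one toy document
N = len(words)

gamma = np.full(K, alpha + N / K)          # variational Dirichlet over C
lam = eta + rng.gamma(1.0, size=(K, d))    # variational Dirichlets over rows of Phi

for _ in range(50):
    # Local update: phi_n proportional to exp(E[log C_k] + E[log Phi_{k, w_n}]).
    log_psi = (digamma(gamma) - digamma(gamma.sum()))[None, :] \
            + (digamma(lam[:, words]) - digamma(lam.sum(axis=1, keepdims=True))).T
    phi = np.exp(log_psi - log_psi.max(axis=1, keepdims=True))
    phi /= phi.sum(axis=1, keepdims=True)  # shape (N, K), rows sum to 1
    # gamma update: alpha + sum over words of phi_n.
    gamma = alpha + phi.sum(axis=0)
    # Global update: lambda_kj = eta + sum_n phi_nk * [w_n = j].
    lam = eta + np.array([np.bincount(words, weights=phi[:, k], minlength=d)
                          for k in range(K)])

print(gamma, gamma.sum())                  # gamma sums to K*alpha + N
```

Subtracting the row maximum before exponentiating is only for numerical stability; it cancels in the normalization of φ.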

EXAMPLE: MIXTURE OF TOPICS

"The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropoli-
tan Opera Co., New York Philharmonic and Juilliard School. 'Our board felt that we had a
real opportunity to make a mark on the future of the performing arts with these grants an act
every bit as important as our traditional areas of support in health, medical research, education
and the social services,' Hearst Foundation President Randolph A. Hearst said Monday in
announcing the grants. Lincoln Center's share will be $200,000 for its new building, which
will house young artists and provide new public facilities. The Metropolitan Opera Co. and
New York Philharmonic will receive $400,000 each. The Juilliard School, where music and
the performing arts are taught, will get $250,000. The Hearst Foundation, a leading supporter
of the Lincoln Center Consolidated Corporate Fund, will make its usual annual $100,000
donation, too."

Figure 8: An example article from the AP corpus. Each color codes a different factor from which
the word is putatively generated. From Blei, Ng, Jordan, "Latent Dirichlet Allocation", 2003.

RESTRICTED BOLTZMANN MACHINES

BOLTZMANN MACHINE

Definition
A Markov random field distribution of variables X_1, ..., X_n with values in {0, 1} is called a
Boltzmann machine if its joint law is

    P(x_1, ..., x_n) = (1/Z) exp( Σ_{i<j≤n} w_ij x_i x_j + Σ_{i≤n} c_i x_i ) ,

where w_ij are the edge weights of the MRF neighborhood graph, and c_1, ..., c_n are scalar
parameters.

Remarks
The Markov blanket of X_i are those X_j with w_ij ≠ 0.
For x ∈ {−1, 1}^n instead: Potts model with external magnetic field.
This is an exponential family with sufficient statistics x_i x_j and x_i.
As an exponential family, it is also a maximum entropy model.

WEIGHT MATRIX

    P(x_1, ..., x_n) = (1/Z) exp( Σ_{i<j≤n} w_ij x_i x_j + Σ_{i≤n} c_i x_i )

Matrix representation
We collect the parameters in a matrix W := (w_ij) and a vector c = (c_1, ..., c_n), and write
equivalently:

    P(x_1, ..., x_n) = e^{x^t W x + c^t x} / Z(W, c)

Because the MRF is undirected, the matrix is symmetric.
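For a handful of units, the normalization constant Z(W, c) can be computed exactly by enumerating all 2^n states; the sketch below uses an upper-triangular W so that x^T W x equals the sum over pairs i < j, and the random parameter values are purely illustrative:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Upper-triangular weights w_ij (i < j) and biases c_i (illustrative values).
W = np.triu(rng.normal(size=(n, n)), k=1)
c = rng.normal(size=n)

# Enumerate all 2^n binary states to compute Z exactly (feasible only for tiny n).
states = np.array(list(itertools.product([0, 1], repeat=n)))
scores = np.einsum("si,ij,sj->s", states, W, states) + states @ c
probs = np.exp(scores)
Z = probs.sum()
probs /= Z

print(probs.sum())   # 1.0
```

This brute-force Z is exactly what becomes intractable for realistic n, which is why sampling methods such as the Gibbs sampler below are needed.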

RESTRICTED BOLTZMANN MACHINE

With observations
If some vertices represent observation variables Y_i:

    P(x_1, ..., x_n, y_1, ..., y_m) = e^{(x,y)^t W (x,y) + c^t x + ĉ^t y} / Z(W, c, ĉ)

Recall our hierarchical design approach:
Only permit layered structure.
Obvious grouping: One layer for X, one for Y.
As before: No connections within layers.
Since the graph is undirected, that makes it bipartite.

RESTRICTED BOLTZMANN MACHINE

Bipartite graphs
A bipartite graph is a graph whose vertex set V can be subdivided into two sets A and
B = V \ A such that every edge has one end in A and one end in B.

Definition
A restricted Boltzmann machine (RBM) is a Boltzmann machine whose neighborhood graph
is bipartite.

That defines two layers. We usually think of one of these layers as observed, and one as
unobserved.

BOLTZMANN MACHINES AND SIGMOIDS

    P(x) = (1/Z) e^{θx}    for x ∈ {0, 1} and a fixed θ ∈ ℝ.

This is what the probability of a single variable in an RBM looks like, if the distribution
factorizes.

    P(x = 1) = e^θ / (e^θ + e^0) = 1 / (1 + e^{−θ}) = σ(θ)

where σ again denotes the sigmoid function.

Consequence

    P(x) = (1/Z) e^{θx}    ⟺    X ~ Bernoulli(σ(θ))
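The identity P(x = 1) = σ(θ) is just the two-state normalization, which can be verified in a couple of lines (the value θ = 0.7 is arbitrary):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

theta = 0.7
# Normalize e^{theta*x} over x in {0, 1}: the x=1 mass is exactly sigma(theta).
Z = np.exp(theta * 0) + np.exp(theta * 1)
p1 = np.exp(theta) / Z

print(p1, sigmoid(theta))   # identical
```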

GIBBS SAMPLING BOLTZMANN MACHINES

    P(X = x) = e^{x^t W x + c^t x} / Z(W, c)

    P(X_i = 1 | x_{−i}) = σ( W_i^t x + c_i )

Variables in the X-layer are conditionally independent given the Y-layer, and vice versa.
Two groups of conditionals: X|Y and Y|X.

Blocked Gibbs sampler (componentwise):

    P(X_i = 1 | Y = y) = σ( (W y)_i + c_i )
    P(Y_j = 1 | X = x) = σ( (W^t x)_j + ĉ_j )
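These two conditionals yield a blocked Gibbs sampler in which each layer is resampled in one vectorized step; the tiny RBM below uses made-up parameters (W couples X_i to Y_j, while c and b are the layer biases, with b standing in for ĉ above):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Tiny RBM (illustrative parameters): 3 visible X units, 2 hidden Y units.
W = rng.normal(scale=0.5, size=(3, 2))   # w_ij couples X_i and Y_j
c = rng.normal(scale=0.1, size=3)        # biases of the X-layer
b = rng.normal(scale=0.1, size=2)        # biases of the Y-layer

def blocked_gibbs(steps, x=None):
    x = rng.integers(0, 2, size=3) if x is None else x
    for _ in range(steps):
        # All Y_j are conditionally independent given x (bipartite graph) ...
        y = (rng.random(2) < sigmoid(x @ W + b)).astype(int)
        # ... and all X_i given y, so each layer is resampled as one block.
        x = (rng.random(3) < sigmoid(W @ y + c)).astype(int)
    return x, y

x, y = blocked_gibbs(steps=100)
print(x, y)
```

This alternation between the layers is exactly the Markov chain that is unrolled into a directed network in the deep belief network construction later in the section.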

DEEP BELIEF NETWORKS

DIRECTED GRAPHICAL MODEL
(Layered directed network: an input layer θ_1, ..., θ_N at the top, hidden layers in between,
and the data X_1, ..., X_N at the bottom.)

BAYESIAN VIEW

    L(θ_{1:N}) = Prior        L(X_{1:N} | θ_{1:N}) = Likelihood

INFERENCE PROBLEM
Task: Given X_1, ..., X_N, find L(θ_{1:N} | X_{1:N}).

Problem: Recall explaining away. Conditioning on a common child Z makes its parents X and Y
dependent.

That means: Although each layer is conditionally independent given the previous one,
conditioning on the subsequent one creates dependence within the layer.

COMPLEMENTARY PRIOR: IDEA

The prior L(θ_{1:N}) can itself be represented as a directed (layered) graphical model.
Combining this prior with the likelihood stacks the two networks on top of each other: the
prior network sits above the layer θ_1, ..., θ_N, and the likelihood network below it, ending
in X_1, ..., X_N.

Idea
Invent a prior such that the dependencies in the prior and likelihood cancel out.
Such a prior is called a complementary prior.
We will see how to construct a complementary prior below.

COMPLEMENTARY PRIOR

Denote the vector of variables in the kth layer X^(k), so

    X^(k) = (X_1^(k), ..., X_N^(k))

Observation
X^(1), X^(2), ..., X^(K) is a Markov chain.

Suppose the Markov chain is reversible. Then all arrows can be reversed.
Now: Inference is easy.

See: G.E. Hinton, S. Osindero and Y.W. Teh. Neural Computation 18(7):1527-1554, 2006.

BUILDING A COMPLEMENTARY PRIOR

Write P^(k) for the distribution of layer X^(k), and p_T for the transition kernel of the chain.
Start with P^(1) = P and choose

    P^(k+1)( · | X^(k) = x) = p_T( · | x) .

Then P^(2) = ... = P^(K) = P, i.e. P is the invariant distribution of the chain.

Since the chain is reversible,

    P^(k)( · | X^(k+1) = x) = p_T( · | x)

and the edges flip.

See: G.E. Hinton, S. Osindero and Y.W. Teh. Neural Computation 18(7):1527-1554, 2006.

WHERE DO WE GET THE MARKOV CHAIN?

Start with an RBM with layers X and Y. Alternating between its two conditionals generates
the sequence

    X^(1) → Y^(1) → X^(2) → Y^(2) → ... → X^(K) → Y^(K)

The Gibbs sampler for the RBM becomes the model for the directed network.

See: G.E. Hinton, S. Osindero and Y.W. Teh. Neural Computation 18(7):1527-1554, 2006.

WHERE DO WE GET THE MARKOV CHAIN?

As the number of Gibbs steps grows to infinity, we have

    Y^(∞) =_d Y .

Now suppose we use L(Y^(∞)) as our prior distribution. Then:

Using the infinitely deep graphical model given by the Gibbs sampler for the RBM as a
prior is equivalent to using the Y-layer of the RBM as a prior.

DEEP BELIEF NETWORKS

First two layers: RBM (undirected).
Remaining layers: Directed, from the layer θ_1, ..., θ_N down to the data X_1, ..., X_N.

DEEP BELIEF NETWORKS

Summary
The RBM consisting of the first two layers is equivalent to an infinitely deep directed
network representing the Gibbs sampler (the unrolled Gibbs sampler on the previous slide).

That network is infinitely deep because a draw from the actual RBM distribution corresponds
to a Gibbs sampler that has reached its invariant distribution.

When we draw from the RBM, the second layer is distributed according to the invariant
distribution of the Markov chain given by the RBM Gibbs sampler.

If the transition from each layer to the next in the directed part is given by the Markov
chain's transition kernel p_T, the RBM is a complementary prior.

We can then reverse all edges between the θ-layer and the X-layer.
