
All of Graphical Models

Xiaojin Zhu
Department of Computer Sciences
University of Wisconsin–Madison, USA

Tutorial at ICMLA 2011

The Whole Tutorial in One Slide

Given GM = joint distribution p(x1 , . . . , xn )

Do inference = p(XQ | XE ), in general


XQ ∪ XE ⊆ {x1 . . . xn }

If p(x1 , . . . , xn ) not given, estimate it from data

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Life without Graphical Models


. . . is fine mathematically:

The universe is reduced to a set of random variables x1 , . . . , xn

    e.g., x1 , . . . , xn−1 can be the discrete or continuous features

    e.g., xn ≡ y can be the discrete class label

The joint p(x1 , . . . , xn ) completely describes how the universe works

Machine learning: estimate p(x1 , . . . , xn ) from training data X(1) , . . . , X(N) , where X(i) = (x1(i) , . . . , xn(i) )

Prediction: y = argmax_{xn} p(xn | x1 , . . . , xn−1 ), a.k.a. inference, by the definition of conditional probability

    p(xn | x1 , . . . , xn−1 ) = p(x1 , . . . , xn−1 , xn ) / ∑_v p(x1 , . . . , xn−1 , xn = v)

Conclusion

Life without graphical models is just fine

So why are we still here?

Life can be Better for Computer Scientists

Given GM = joint distribution p(x1 , . . . , xn )

    exponential naïve storage (2^n for binary r.v.)

    hard to interpret (conditional independence)

Do inference = p(XQ | XE ), in general, where XQ ∪ XE ⊆ {x1 . . . xn }

    Often can't do it computationally

If p(x1 , . . . , xn ) not given, estimate it from data

    Can't do it either

Acknowledgments Before We Start

Much of this tutorial is based on:

Koller & Friedman, Probabilistic Graphical Models. MIT Press, 2009.

Wainwright & Jordan, Graphical Models, Exponential Families, and Variational Inference. FTML, 2008.

Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Graphical-Model-Nots

Graphical modeling is the study of probabilistic models

Just because there is a graph with nodes and edges doesn't mean it's a GM

These are not graphical models:

neural network

decision tree

network flow

HMM template

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Bayesian Network
A directed graph has nodes X = (x1 , . . . , xn ), some of them connected by directed edges xi → xj

A cycle is a directed path x1 → . . . → xk where x1 = xk

A directed acyclic graph (DAG) contains no cycles

A Bayesian network on the DAG is a family of distributions satisfying

    {p | p(X) = ∏_i p(xi | Pa(xi ))}

where Pa(xi ) is the set of parents of xi .

p(xi | Pa(xi )) is the conditional probability distribution (CPD) at xi

By specifying the CPDs for all i, we specify a particular distribution p(X)

Example: Alarm
Binary variables
P(B)=0.001

P(E)=0.002

B
P(A | B, E) = 0.95
P(A | B, ~E) = 0.94
P(A | ~B, E) = 0.29
P(A | ~B, ~E) = 0.001

A
J

P(J | A) = 0.9
P(J | ~A) = 0.05

M
P(M | A) = 0.7
P(M | ~A) = 0.01

P (B, ~E, A, J, ~M )
= P (B) P (~E) P (A | B, ~E) P (J | A) P (~M | A)
= 0.001 × (1 − 0.002) × 0.94 × 0.9 × (1 − 0.7)
≈ 0.000253
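As a sanity check, the joint above can be computed directly from the CPDs via the Bayesian network factorization. A minimal Python sketch (the dictionary encoding of the CPDs is mine, not from the slides):

# Joint probability in the alarm network: P(B, ~E, A, J, ~M)
# CPDs copied from the slide above.
P_B = 0.001
P_E = 0.002
P_A_given = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
P_J_given_A = {1: 0.9, 0: 0.05}
P_M_given_A = {1: 0.7, 0: 0.01}

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) = product of the CPDs (BN factorization)."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p_a = P_A_given[(b, e)]
    p *= p_a if a else 1 - p_a
    p *= P_J_given_A[a] if j else 1 - P_J_given_A[a]
    p *= P_M_given_A[a] if m else 1 - P_M_given_A[a]
    return p

print(joint(1, 0, 1, 1, 0))  # ~0.000253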

Example: Naive Bayes

[Graph: y → x1 , . . . , xd , and the equivalent plate representation with x repeated d times]

p(y, x1 , . . . , xd ) = p(y) ∏_{i=1}^{d} p(xi | y)

Used extensively in natural language processing

Plate representation on the right
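To make the model concrete, here is a small Python sketch of naive Bayes prediction, y* = argmax_y p(y) ∏_i p(xi | y); the class prior and CPD numbers are invented for illustration:

import numpy as np

# Naive Bayes prediction: y* = argmax_y p(y) * prod_i p(x_i | y)
# Toy CPDs (made up): 2 classes, 3 binary features.
p_y = np.array([0.6, 0.4])                      # p(y)
p_x_given_y = np.array([[0.8, 0.3],             # p(x_i = 1 | y) for each feature i
                        [0.5, 0.9],
                        [0.1, 0.4]])

def predict(x):
    # Work in log space to avoid underflow with many features.
    log_post = np.log(p_y).copy()
    for i, xi in enumerate(x):
        p1 = p_x_given_y[i]
        log_post += np.log(p1 if xi == 1 else 1 - p1)
    return int(np.argmax(log_post))

print(predict([1, 0, 0]))  # most likely class under these toy CPDs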

No Causality Whatsoever

BN 1 (A → B):
P(A) = a
P(B|A) = b
P(B|~A) = c

BN 2 (B → A):
P(B) = ab + (1−a)c
P(A|B) = ab / (ab + (1−a)c)
P(A|~B) = a(1−b) / (1 − ab − (1−a)c)

The two BNs are equivalent in all respects

Bayesian networks imply no causality at all

They only encode the joint probability distribution (hence correlation)

However, people tend to design BNs based on causal relations

Example: Latent Dirichlet Allocation (LDA)

[Plate diagram: hyperparameters α, β; topics φ; per-document θ; topic assignments z and observed words w, with Nd words per document]

A generative model for p(θ, φ, z, w | α, β):

For each topic t
    φt ∼ Dirichlet(β)
For each document d
    θ ∼ Dirichlet(α)
    For each word position in d
        topic z ∼ Multinomial(θ)
        word w ∼ Multinomial(φz )

Inference goals: p(z | w, α, β), argmax_{θ,φ} p(θ, φ | w, α, β)
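The generative process can be simulated directly. A minimal Python sketch, with toy sizes and hyperparameter values chosen only for illustration:

import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (illustrative): T topics, D documents, V vocabulary words, Nd words per document.
T, D, V, Nd = 3, 5, 20, 50
alpha, beta = 0.5, 0.1

# For each topic t: phi_t ~ Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=T)      # T x V
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))       # theta ~ Dirichlet(alpha)
    words = []
    for _ in range(Nd):
        z = rng.choice(T, p=theta)                 # topic z ~ Multinomial(theta)
        w = rng.choice(V, p=phi[z])                # word w ~ Multinomial(phi_z)
        words.append(w)
    docs.append(words)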

Some Topics by LDA on the Wish Corpus

p(word | topic) shown for three example topics (top words: "troops", "election", "love")

Conditional Independence

Two r.v.s A, B are independent if P (A, B) = P (A)P (B), or equivalently P (A|B) = P (A)

Two r.v.s A, B are conditionally independent given C if P (A, B | C) = P (A | C)P (B | C), or equivalently P (A | B, C) = P (A | C)

This extends to groups of r.v.s

Conditional independence in a BN is precisely specified by d-separation (directed separation)

d-Separation Case 1: Tail-to-Tail

[Graph: A ← C → B]

A, B in general dependent

A, B conditionally independent given C

C is a tail-to-tail node, blocks the undirected path A-B

d-Separation Case 2: Head-to-Tail

A, B in general dependent

A, B conditionally independent given C

C is a head-to-tail node, blocks the path A-B

d-Separation Case 3: Head-to-Head

A, B in general independent

A, B conditionally dependent given C, or any of C's descendants

C is a head-to-head node, unblocks the path A-B

d-Separation

Any groups of nodes A and B are conditionally independent given another group C, if all undirected paths from any node in A to any node in B are blocked

A path is blocked if it includes a node x such that either

    the path is head-to-tail or tail-to-tail at x and x ∈ C, or

    the path is head-to-head at x, and neither x nor any of its descendants is in C.

d-Separation Example 1

The path from A to B is not blocked by either E or F

A, B dependent given C

d-Separation Example 2

The path from A to B is blocked at both E and F

A, B conditionally independent given F

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Markov Random Fields

The efficiency of directed graphical models (acyclic graph, locally normalized CPDs) also makes them restrictive

A clique C in an undirected graph is a fully connected set of nodes (note: full of loops!)

Define a nonnegative potential function ψC : XC → R+

An undirected graphical model (aka Markov Random Field) on the graph is a family of distributions satisfying

    {p | p(X) = (1/Z) ∏_C ψC (XC )}

where Z = ∫ ∏_C ψC (XC ) dX is the partition function

Example: A Tiny Markov Random Field

x1 —— x2 , forming a single clique C

x1 , x2 ∈ {−1, 1}

A single clique potential: ψC (x1 , x2 ) = e^{a x1 x2}

p(x1 , x2 ) = (1/Z) e^{a x1 x2}

Z = 2e^a + 2e^{−a}

p(1, 1) = p(−1, −1) = e^a / (2e^a + 2e^{−a})

p(−1, 1) = p(1, −1) = e^{−a} / (2e^a + 2e^{−a})

When the parameter a > 0, favor homogeneous chains

When the parameter a < 0, favor inhomogeneous chains
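The partition function and the probabilities above can be verified by enumerating the four configurations. A minimal Python sketch (the value of a is arbitrary):

import numpy as np

# Tiny MRF: two nodes x1, x2 in {-1, +1}, one clique with psi(x1, x2) = exp(a * x1 * x2)
a = 1.0  # coupling parameter; a > 0 favors x1 == x2

values = [-1, +1]
psi = {(x1, x2): np.exp(a * x1 * x2) for x1 in values for x2 in values}
Z = sum(psi.values())                      # partition function, here 2e^a + 2e^-a
p = {cfg: v / Z for cfg, v in psi.items()} # normalized joint

for cfg, prob in p.items():
    print(cfg, round(prob, 4))             # homogeneous configurations get higher probability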

Log Linear Models

Real-valued feature functions f1 (X), . . . , fk (X)

Real-valued weights w1 , . . . , wk

    p(X) = (1/Z) exp( ∑_{i=1}^{k} wi fi (X) )

Example: The Ising Model

[Graph: node s with value xs and parameter θs , node t with value xt , edge (s, t) with parameter θst ]

This is an undirected model with each xs ∈ {0, 1}:

    p_θ (x) = (1/Z) exp( ∑_{s∈V} θs xs + ∑_{(s,t)∈E} θst xs xt )

In log linear form: fs (X) = xs , fst (X) = xs xt , ws = θs , wst = θst

Example: Image Denoising

[Figure from Bishop PRML: a noisy binary image and its denoised reconstruction]

Denoising as inference: argmax_X P (X|Y )

Example: Gaussian Random Field

p(X) ∼ N (μ, Σ):

    p(X) = 1 / ( (2π)^{n/2} |Σ|^{1/2} ) · exp( −(1/2) (X − μ)^T Σ^{−1} (X − μ) )

Multivariate Gaussian

The n × n covariance matrix Σ is positive semi-definite

Let Ω = Σ^{−1} be the precision matrix

xi , xj are conditionally independent given all other variables, if and only if Ωij = 0

When Ωij ≠ 0, there is an edge between xi , xj
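A small Python sketch of reading graph structure off the precision matrix; the 3-node chain precision matrix below is a toy example of my own:

import numpy as np

# Gaussian random field structure from the precision matrix Omega = inv(Sigma):
# zero entries of Omega <=> missing edges (conditional independence).
# Toy chain-structured precision matrix: x1 - x2 - x3
Omega = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.5, -1.0],
                  [ 0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(Omega)

# Omega[0, 2] == 0, so x1 and x3 are conditionally independent given x2,
# even though their marginal covariance Sigma[0, 2] is nonzero.
print(Omega[0, 2], Sigma[0, 2])

edges = [(i, j) for i in range(3) for j in range(i + 1, 3) if not np.isclose(Omega[i, j], 0)]
print(edges)  # [(0, 1), (1, 2)]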

Conditional Independence in Markov Random Fields


Two groups of variables A, B are conditionally independent given another group C, if:

    Remove C and all edges involving C

    A, B become disconnected

Factor Graph
I

For both directed and undirected graphical models

Bipartite: edges between a variable node and a factor node

Factors represent computation

[Example: a factor node f connected to variable nodes A, B, C; for an MRF the factor is f (A, B, C) = ψ(A, B, C), while for a BN it is f (A, B, C) = P(A)P(B)P(C|A,B)]

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Inference by Enumeration

Let X = (XQ , XE , XO ) for query, evidence, and other variables.

Infer P (XQ | XE )

By definition

    P (XQ | XE ) = P (XQ , XE ) / P (XE ) = ∑_{XO} P (XQ , XE , XO ) / ∑_{XQ ,XO} P (XQ , XE , XO )

Summing an exponential number of terms: with k variables in XO each taking r values, there are r^k terms

Details of the summing problem

We want ∑_{x1} · · · ∑_{xk} p(X)

There are a bunch of other variables x1 , . . . , xk

We sum over the r values each variable can take: ∑_{xi = v1 ,...,vr}

This is exponential: r^k terms

For a graphical model, the joint probability factors: p(X) = ∏_{j=1}^{m} fj (X(j) )

Each factor fj operates on X(j) ⊆ X

Eliminating a Variable

Obviously equivalent: ∑_{x1 ,...,xk} f1 · · · fl fl+1 · · · fm

Rearrange factors by whether x1 ∈ X(j) :

    ∑_{x2 ,...,xk} f1 · · · fl ( ∑_{x1} fl+1 · · · fm )

Introduce a new factor fm+1 = ∑_{x1} fl+1 · · · fm

fm+1 contains the union of the variables in fl+1 · · · fm except x1

In fact, x1 disappears altogether in ∑_{x2 ,...,xk} f1 · · · fl fm+1

Dynamic programming: compute fm+1 once, use it thereafter

Hope: fm+1 contains very few variables

Recursively eliminate other variables in turn

Example: Chain Graph


A → B → C → D

Binary variables

Say we want P (D) = ∑_{A,B,C} P (A)P (B|A)P (C|B)P (D|C)

Let f1 (A) = P (A). Note f1 is an array of size two:
    P (A = 0)
    P (A = 1)

f2 (A, B) is a table of size four:
    P (B = 0|A = 0)
    P (B = 0|A = 1)
    P (B = 1|A = 0)
    P (B = 1|A = 1)

∑_{A,B,C} f1 (A)f2 (A, B)f3 (B, C)f4 (C, D) = ∑_{B,C} f3 (B, C)f4 (C, D) ( ∑_A f1 (A)f2 (A, B) )

Example: Chain Graph

f1 (A)f2 (A, B) is an array of size four: match A values
    P (A = 0)P (B = 0|A = 0)
    P (A = 1)P (B = 0|A = 1)
    P (A = 0)P (B = 1|A = 0)
    P (A = 1)P (B = 1|A = 1)

f5 (B) ≡ ∑_A f1 (A)f2 (A, B) is an array of size two:
    P (A = 0)P (B = 0|A = 0) + P (A = 1)P (B = 0|A = 1)
    P (A = 0)P (B = 1|A = 0) + P (A = 1)P (B = 1|A = 1)

For this example, f5 (B) happens to be P (B)

∑_{B,C} f3 (B, C)f4 (C, D)f5 (B) = ∑_C f4 (C, D) ( ∑_B f3 (B, C)f5 (B) ), and so on

In the end, f7 (D) = (P (D = 0), P (D = 1))

Example: Chain Graph

Computation for P (D): 12 ×, 6 +

Enumeration: 48 ×, 14 +

Saving depends on the elimination order. Finding the optimal order is NP-hard; there are heuristic methods.

Saving depends more critically on the graph structure (tree width); it can still be intractable
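A minimal Python sketch of this elimination order on the chain A → B → C → D; the CPT values are made up for illustration, and the brute-force check at the end enumerates the full joint:

import numpy as np

# Variable elimination on the chain A -> B -> C -> D (binary variables).
f1 = np.array([0.6, 0.4])                       # f1(A)   = P(A)
f2 = np.array([[0.7, 0.3], [0.2, 0.8]])         # f2(A,B) = P(B|A), rows index A
f3 = np.array([[0.9, 0.1], [0.4, 0.6]])         # f3(B,C) = P(C|B)
f4 = np.array([[0.5, 0.5], [0.1, 0.9]])         # f4(C,D) = P(D|C)

f5 = f1 @ f2          # eliminate A: f5(B) = sum_A f1(A) f2(A,B)
f6 = f5 @ f3          # eliminate B: f6(C) = sum_B f5(B) f3(B,C)
f7 = f6 @ f4          # eliminate C: f7(D) = sum_C f6(C) f4(C,D)
print(f7)             # P(D), sums to 1

# Sanity check against brute-force enumeration of the full joint.
joint = f1[:, None, None, None] * f2[:, :, None, None] * f3[None, :, :, None] * f4[None, None, :, :]
print(joint.sum(axis=(0, 1, 2)))  # same P(D)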

Handling Evidence

For evidence variables XE , simply plug in their value e

Eliminate the variables XO = X − XE − XQ

The final factor will be the joint f (XQ ) = P (XQ , XE = e)

Normalize to answer the query:

    P (XQ | XE = e) = f (XQ ) / ∑_{XQ} f (XQ )

Summary: Exact Inference

Enumeration

Variable elimination

Not covered: junction tree (aka clique tree)

Exact, but intractable for large graphs

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Inference by Monte Carlo


Consider the inference problem p(XQ = cQ | XE ) where XQ ∪ XE ⊆ {x1 . . . xn }:

    p(XQ = cQ | XE ) = ∫ 1_{(xQ =cQ )} p(xQ | XE ) dxQ

If we can draw samples xQ(1) , . . . , xQ(m) ∼ p(xQ | XE ), an unbiased estimator is

    p(XQ = cQ | XE ) ≈ (1/m) ∑_{i=1}^{m} 1_{(xQ(i) =cQ )}

The variance of the estimator decreases as V/m

Inference reduces to sampling from p(xQ | XE )

Forward Sampling Example


P(B)=0.001

P(E)=0.002

B
P(A | B, E) = 0.95
P(A | B, ~E) = 0.94
P(A | ~B, E) = 0.29
P(A | ~B, ~E) = 0.001

A
J

P(J | A) = 0.9
P(J | ~A) = 0.05

M
P(M | A) = 0.7
P(M | ~A) = 0.01

To generate a sample X = (B, E, A, J, M ):

1. Sample B ∼ Ber(0.001): draw r ∼ U (0, 1); if r < 0.001 then B = 1, else B = 0
2. Sample E ∼ Ber(0.002)
3. If B = 1 and E = 1, sample A ∼ Ber(0.95), and so on
4. If A = 1 sample J ∼ Ber(0.9), else J ∼ Ber(0.05)
5. If A = 1 sample M ∼ Ber(0.7), else M ∼ Ber(0.01)

Works for Bayesian networks.
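A minimal Python sketch of forward sampling for this network (the helper names are mine):

import random

# Forward sampling from the alarm network (ancestral order: B, E, A, J, M).
P_A_given = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

def bern(p):
    return 1 if random.random() < p else 0

def forward_sample():
    B = bern(0.001)
    E = bern(0.002)
    A = bern(P_A_given[(B, E)])
    J = bern(0.9 if A else 0.05)
    M = bern(0.7 if A else 0.01)
    return B, E, A, J, M

samples = [forward_sample() for _ in range(100000)]
# Monte Carlo estimate of a marginal, e.g. P(J = 1):
print(sum(s[3] for s in samples) / len(samples))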

Inference with Forward Sampling

Say the inference task is P (B = 1 | E = 1, M = 1)

Throw away all samples except those with (E = 1, M = 1):

    p(B = 1 | E = 1, M = 1) ≈ (1/m) ∑_{i=1}^{m} 1_{(B(i) =1)}

where m is the number of surviving samples

Can be highly inefficient (note P (E = 1) is tiny)

Does not work for Markov Random Fields

Gibbs Sampler Example: P (B = 1 | E = 1, M = 1)


The Gibbs sampler is a Markov Chain Monte Carlo (MCMC) method.

Directly sample from p(xQ | XE )

Works for both directed and undirected graphical models

Initialization: fix the evidence; randomly set the other variables,
e.g. X (0) = (B = 0, E = 1, A = 0, J = 0, M = 1)
P(B)=0.001

P(E)=0.002

B
P(A | B, E) = 0.95
P(A | B, ~E) = 0.94
P(A | ~B, E) = 0.29
P(A | ~B, ~E) = 0.001

A
J

P(J | A) = 0.9
P(J | ~A) = 0.05

M
P(M | A) = 0.7
P(M | ~A) = 0.01

Gibbs Update
For each non-evidence variable xi , fixing all other nodes X−i , resample its value xi ∼ P (xi | X−i )

This is equivalent to xi ∼ P (xi | MarkovBlanket(xi ))

For a Bayesian network, MarkovBlanket(xi ) includes xi 's parents, spouses, and children:

    P (xi | MarkovBlanket(xi )) ∝ P (xi | Pa(xi )) ∏_{y∈C(xi )} P (y | Pa(y))

where Pa(x) are the parents of x, and C(x) the children of x.

For many graphical models the Markov Blanket is small.

For example, B ∼ P (B | E = 1, A = 0) ∝ P (B)P (A = 0 | B, E = 1)
P(B)=0.001

P(E)=0.002

B
P(A | B, E) = 0.95
P(A | B, ~E) = 0.94
P(A | ~B, E) = 0.29
P(A | ~B, ~E) = 0.001

A
J

Gibbs Update
Say we sampled B = 1. Then X (1) = (B = 1, E = 1, A = 0, J = 0, M = 1)

Starting from X (1) , sample A ∼ P (A | B = 1, E = 1, J = 0, M = 1) to get X (2)

Move on to J, then repeat B, A, J, B, A, J . . .

Keep all later samples. P (B = 1 | E = 1, M = 1) is the fraction of samples with B = 1.
P(B)=0.001

P(E)=0.002

B
P(A | B, E) = 0.95
P(A | B, ~E) = 0.94
P(A | ~B, E) = 0.29
P(A | ~B, ~E) = 0.001

A
J

P(J | A) = 0.9
P(J | ~A) = 0.05

M
P(M | A) = 0.7
P(M | ~A) = 0.01

Gibbs Example 2: The Ising Model

[Grid graph: node s with value xs and its neighbors]

This is an undirected model with each xs ∈ {0, 1}:

    p_θ (x) = (1/Z) exp( ∑_{s∈V} θs xs + ∑_{(s,t)∈E} θst xs xt )

Gibbs Example 2: The Ising Model


[Grid graph: node s with neighbors A, B, C, D]

The Markov blanket of xs is {A, B, C, D}

In general, for undirected graphical models

    p(xs | x−s ) = p(xs | xN (s) )

where N (s) is the set of neighbors of s.

The Gibbs update is

    p(xs = 1 | xN (s) ) = 1 / ( exp( −(θs + ∑_{t∈N (s)} θst xt ) ) + 1 )
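A Python sketch of one Gibbs sweep over an n × n grid Ising model using this update; grid size and parameters are placeholders:

import numpy as np

# One sweep of Gibbs sampling for an Ising model on an n x n grid, x in {0, 1}.
n = 10
theta_s = 0.0        # unary parameter, shared by all nodes for simplicity
theta_st = 1.0       # coupling, shared by all edges
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(n, n))

def gibbs_sweep(x):
    for i in range(n):
        for j in range(n):
            # sum over the Markov blanket: the 4 grid neighbors (fewer at the border)
            nb = 0
            if i > 0:     nb += x[i - 1, j]
            if i < n - 1: nb += x[i + 1, j]
            if j > 0:     nb += x[i, j - 1]
            if j < n - 1: nb += x[i, j + 1]
            p1 = 1.0 / (np.exp(-(theta_s + theta_st * nb)) + 1.0)
            x[i, j] = rng.random() < p1
    return x

for _ in range(100):
    x = gibbs_sweep(x)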

Gibbs Sampling as a Markov Chain

A Markov chain is defined by a transition matrix T (X' | X)

Certain Markov chains have a stationary distribution π such that π = T π

The Gibbs sampler is such a Markov chain, with Ti ((X−i , x'i ) | (X−i , xi )) = p(x'i | X−i ) and stationary distribution p(xQ | XE )

But it takes time for the chain to reach its stationary distribution (to mix)

Can be difficult to assert mixing

In practice, burn in: discard X (0) , . . . , X (T )

Use all of X (T +1) , . . . for inference (they are correlated)

Do not thin

Collapsed Gibbs Sampling


In general, Ep [f (X)] ≈ (1/m) ∑_{i=1}^{m} f (X (i) ) if X (i) ∼ p

Sometimes X = (Y, Z) where Z has closed-form operations

If so,

    Ep [f (X)] = Ep(Y ) [ Ep(Z|Y ) [f (Y, Z)] ] ≈ (1/m) ∑_{i=1}^{m} Ep(Z|Y (i) ) [f (Y (i) , Z)]   if Y (i) ∼ p(Y )

No need to sample Z: it is collapsed

Collapsed Gibbs sampler: Ti ((Y−i , y'i ) | (Y−i , yi )) = p(y'i | Y−i )

Note p(y'i | Y−i ) = ∫ p(y'i , Z | Y−i ) dZ

Example: Collapsed Gibbs Sampling for LDA

[Plate diagram as before, with θ and φ collapsed (integrated out)]

Collapse θ, φ. Gibbs update:

    P (zi = j | z−i , w) ∝ [ (n(wi)−i,j + β) / (n(·)−i,j + W β) ] · [ (n(di)−i,j + α) / (n(di)−i,· + T α) ]

n(wi)−i,j : number of times word wi has been assigned to topic j, excluding the current position

n(di)−i,j : number of times a word from document di has been assigned to topic j, excluding the current position

n(·)−i,j : number of times any word has been assigned to topic j, excluding the current position

n(di)−i,· : length of document di , excluding the current position

(W is the vocabulary size and T the number of topics.)
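A Python sketch of this collapsed Gibbs update on a toy corpus (sizes and hyperparameters are placeholders). The document-length denominator is constant in j and cancels after normalization, so the code drops it:

import numpy as np

# Collapsed Gibbs sampling for LDA (sketch). Counts n_wt, n_dt, n_t are kept incrementally.
rng = np.random.default_rng(0)
W, T, alpha, beta = 20, 3, 0.5, 0.1
docs = [rng.integers(0, W, size=30).tolist() for _ in range(5)]    # toy corpus of word ids

words = [(d, w) for d, doc in enumerate(docs) for w in doc]
z = rng.integers(0, T, size=len(words))                            # random initial topics
n_wt = np.zeros((W, T)); n_dt = np.zeros((len(docs), T)); n_t = np.zeros(T)
for (d, w), t in zip(words, z):
    n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1

for sweep in range(50):
    for i, (d, w) in enumerate(words):
        old = z[i]
        n_wt[w, old] -= 1; n_dt[d, old] -= 1; n_t[old] -= 1        # exclude current position
        # P(z_i = j | z_-i, w): document-length denominator dropped (constant in j).
        p = (n_wt[w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
        new = rng.choice(T, p=p / p.sum())
        n_wt[w, new] += 1; n_dt[d, new] += 1; n_t[new] += 1
        z[i] = new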

Summary: Markov Chain Monte Carlo

Gibbs sampling

Not covered: block Gibbs, Metropolis-Hastings

Unbiased (after burn-in), but can have high variance


To learn more, come to Prof. Prasad Tetali's tutorial "Markov Chain Mixing with Applications", 2pm Monday.

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

The Sum-Product Algorithm

Also known as belief propagation (BP)

Exact if the graph is a tree; otherwise known as loopy BP,


approximate

The algorithm involves passing messages on the factor graph

Alternative view: variational approximation (more later)

Example: A Simple HMM


The Hidden Markov Model template (not a graphical model):

[Two hidden states with initial probabilities π1 = π2 = 1/2, self-transition probabilities 1/4 (state 1) and 1/2 (state 2), and emission distributions
P(x | z=1) = (1/2, 1/4, 1/4) over R, G, B
P(x | z=2) = (1/4, 1/2, 1/4) over R, G, B]

Observing x1 = R, x2 = G, the directed graphical model is

    z1 → z2 ,   z1 → x1 = R,   z2 → x2 = G

Factor graph:

    f1 —— z1 —— f2 —— z2
    f1 = P(z1 )P(x1 | z1 ),   f2 = P(z2 | z1 )P(x2 | z2 )

Messages

A message is a vector of length K, where K is the number of values x takes.

There are two types of messages:
1. μf→x : message from a factor node f to a variable node x; μf→x (i) is its ith element, i = 1 . . . K.
2. μx→f : message from a variable node x to a factor node f

Leaf Messages
Assume a tree factor graph. Pick an arbitrary root, say z2

Start messages at the leaves:

    If a leaf is a factor node f , μf→x (x) = f (x)

    If a leaf is a variable node x, μx→f (x) = 1

For the factor graph f1 —— z1 —— f2 —— z2 with f1 = P(z1 )P(x1 | z1 ), f2 = P(z2 | z1 )P(x2 | z2 ):

    μf1→z1 (z1 = 1) = P (z1 = 1)P (R|z1 = 1) = 1/2 × 1/2 = 1/4

    μf1→z1 (z1 = 2) = P (z1 = 2)P (R|z1 = 2) = 1/2 × 1/4 = 1/8

Message from Variable to Factor


A node (factor or variable) can send out a message once all its other incoming messages have arrived

Let x be in factor fs . Then

    μx→fs (x) = ∏_{f ∈ne(x)\fs} μf→x (x)

where ne(x)\fs are the factors connected to x, excluding fs .

In the example:

    μz1→f2 (z1 = 1) = 1/4
    μz1→f2 (z1 = 2) = 1/8

Message from Factor to Variable


Let x be in factor fs , and let the other variables in fs be x1:M . Then

    μfs→x (x) = ∑_{x1} . . . ∑_{xM} fs (x, x1 , . . . , xM ) ∏_{m=1}^{M} μxm→fs (xm )

In the example:

    μf2→z2 (s) = ∑_{s'=1}^{2} μz1→f2 (s') f2 (z1 = s', z2 = s)
               = 1/4 · P (z2 = s|z1 = 1)P (x2 = G|z2 = s) + 1/8 · P (z2 = s|z1 = 2)P (x2 = G|z2 = s)

We get

    μf2→z2 (z2 = 1) = 1/32
    μf2→z2 (z2 = 2) = 1/8

Up to Root, Back Down

The message has reached the root; now pass messages back down:

    μz2→f2 (z2 = 1) = 1
    μz2→f2 (z2 = 2) = 1

Keep Passing Down

    μf2→z1 (s) = ∑_{s'=1}^{2} μz2→f2 (s') f2 (z1 = s, z2 = s')
               = 1 · P (z2 = 1|z1 = s)P (x2 = G|z2 = 1) + 1 · P (z2 = 2|z1 = s)P (x2 = G|z2 = 2)

We get

    μf2→z1 (z1 = 1) = 7/16
    μf2→z1 (z1 = 2) = 3/8

From Messages to Marginals


Once a variable has received all incoming messages, we compute its marginal as

    p(x) ∝ ∏_{f ∈ne(x)} μf→x (x)

In this example:

    P (z1 |x1 , x2 ) ∝ μf1→z1 · μf2→z1 = (1/4 × 7/16, 1/8 × 3/8) = (7/64, 3/64) ∝ (0.7, 0.3)

    P (z2 |x1 , x2 ) ∝ μf2→z2 = (1/32, 1/8) ∝ (0.2, 0.8)

One can also compute the marginal of the set of variables xs involved in a factor fs :

    p(xs ) ∝ fs (xs ) ∏_{x∈ne(fs )} μx→fs (x)
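These marginals can be verified by brute-force enumeration of the tiny joint. A Python sketch, assuming the HMM parameters stated earlier in this example (π1 = π2 = 1/2, self-transitions 1/4 and 1/2, and the two emission rows):

import numpy as np

# Verify the BP marginals on the two-step HMM by brute-force enumeration.
pi = np.array([0.5, 0.5])
A = np.array([[0.25, 0.75],   # P(z2 | z1 = 1)
              [0.50, 0.50]])  # P(z2 | z1 = 2)
E = np.array([[0.5, 0.25, 0.25],   # P(x | z = 1) over (R, G, B)
              [0.25, 0.5, 0.25]])  # P(x | z = 2)
R, G = 0, 1

# joint[z1, z2] = P(z1) P(x1=R | z1) P(z2 | z1) P(x2=G | z2)
joint = pi[:, None] * E[:, R][:, None] * A * E[:, G][None, :]
print(joint.sum(axis=1) / joint.sum())  # P(z1 | x1, x2) = (0.7, 0.3)
print(joint.sum(axis=0) / joint.sum())  # P(z2 | x1, x2) = (0.2, 0.8)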

Handling Evidence
Observing x = v:

    we can absorb it in the factor (as we did); or

    set messages μx→f (x) = 0 for all x ≠ v

Observing XE :

    multiplying the incoming messages to a variable x ∉ XE gives the joint (not p(x|XE )):

        p(x, XE ) ∝ ∏_{f ∈ne(x)} μf→x (x)

    the conditional is easily obtained by normalization:

        p(x|XE ) = p(x, XE ) / ∑_{x'} p(x', XE )

Loopy Belief Propagation

So far, we assumed a tree graph

When the factor graph contains loops, pass messages


indefinitely until convergence

But convergence may not happen

Even so, in many cases loopy BP still works well, empirically

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Example: The Ising Model

[Graph: node s with value xs and parameter θs , node t with value xt , edge (s, t) with parameter θst ]

The random variables x take values in {0, 1}:

    p_θ (x) = (1/Z) exp( ∑_{s∈V} θs xs + ∑_{(s,t)∈E} θst xs xt )

The Conditional
Markovian: the conditional distribution for xs is

    p(xs | x−s ) = p(xs | xN (s) )

where N (s) is the set of neighbors of s.

This reduces to

    p(xs = 1 | xN (s) ) = 1 / ( exp( −(θs + ∑_{t∈N (s)} θst xt ) ) + 1 )

Gibbs sampling would draw xs like this.

The Mean Field Algorithm for Ising Model

p(xs = 1 | xN (s) ) = 1 / ( exp( −(θs + ∑_{t∈N (s)} θst xt ) ) + 1 )

Instead of Gibbs sampling, let μs be the estimated marginal p(xs = 1):

    μs ← 1 / ( exp( −(θs + ∑_{t∈N (s)} θst μt ) ) + 1 )

The μs are updated iteratively

The Mean Field algorithm is coordinate ascent and is guaranteed to converge to a local optimum (more later).

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Exponential Family

Let φ(X) = (φ1 (X), . . . , φd (X))^T be d sufficient statistics, where φi : X → R

Note X is all the nodes in a graphical model

φi (X) is sometimes called a feature function

Let θ = (θ1 , . . . , θd )^T ∈ R^d be the canonical parameters.

The exponential family is a family of probability densities:

    p_θ (x) = exp( θ^T φ(x) − A(θ) )

Exponential Family



p_θ (x) = exp( θ^T φ(x) − A(θ) )

The key is the inner product between the parameters θ and the sufficient statistics φ.

A is the log partition function:

    A(θ) = log ∫ exp( θ^T φ(x) ) ν(dx)

i.e., A = log Z

Minimal vs. Overcomplete Models

The set of parameters for which the density is normalizable:

    Ω = {θ ∈ R^d | A(θ) < ∞}

A minimal exponential family is one where the φ's are linearly independent.

An overcomplete exponential family is one where the φ's are linearly dependent:

    ∃ α ∈ R^d such that α^T φ(x) = constant ∀x

Both minimal and overcomplete representations are useful.

Exponential Family Example 1: Bernoulli

p(x) = γ^x (1 − γ)^{1−x} for x ∈ {0, 1} and γ ∈ (0, 1)

Does not look like an exponential family!

Can be rewritten as

    p(x) = exp( x log γ + (1 − x) log(1 − γ) )

Now in exponential family form with φ1 (x) = x, φ2 (x) = 1 − x, θ1 = log γ, θ2 = log(1 − γ), and A(θ) = 0.

Overcomplete: α1 = α2 = 1 makes α^T φ(x) = 1 for all x

Exponential Family Example 1: Bernoulli

p(x) = exp( x log γ + (1 − x) log(1 − γ) )

Can be further rewritten as

    p(x) = exp( θx − log(1 + exp(θ)) )

A minimal exponential family with φ(x) = x, θ = log( γ / (1 − γ) ), A(θ) = log(1 + exp(θ)).

Many distributions (e.g., Gaussian, exponential, Poisson, Beta) are in the exponential family, but not all (e.g., the Laplace distribution).

Exponential Family Example 2: Ising Model

[Graph: node s with parameter θs , edge (s, t) with parameter θst , node t]

p_θ (x) = exp( ∑_{s∈V} θs xs + ∑_{(s,t)∈E} θst xs xt − A(θ) )

Binary random variables xs ∈ {0, 1}

d = |V | + |E| sufficient statistics: φ(x) = (. . . xs . . . xs xt . . .)^T

This is a regular (Ω = R^d ), minimal exponential family.

Exponential Family Example 3: Potts Model


[Graph: node s with value xs , node t with value xt ]

Similar to the Ising model, but generalizing to xs ∈ {0, . . . , r − 1}.

Indicator functions: fsj (x) = 1 if xs = j and 0 otherwise; fstjk (x) = 1 if xs = j and xt = k, and 0 otherwise.

    p_θ (x) = exp( ∑_{sj} θsj fsj (x) + ∑_{stjk} θstjk fstjk (x) − A(θ) )

d = r|V | + r^2 |E|

Regular but overcomplete, because ∑_{j=0}^{r−1} φsj (x) = 1 for any s ∈ V and all x.

The Potts model is a special case where the parameters are tied: θstkk takes one shared value, and θstjk for j ≠ k takes another.

Important Relation

For sufficient statistics defined by indicator functions,

    e.g., φsj (x) = fsj (x) = 1 if xs = j and 0 otherwise,

the marginal can be obtained via the mean:

    Eθ [φsj (x)] = P (xs = j)

Since inference is about computing marginals, in this case it is equivalent to computing means.

Mean Parameters
Let p be any density (not necessarily in the exponential family).

Given sufficient statistics φ, the mean parameters μ = (μ1 , . . . , μd )^T are

    μi = Ep [φi (x)] = ∫ φi (x) p(x) dx

The set of mean parameters:

    M = {μ ∈ R^d | ∃ p s.t. Ep [φ(x)] = μ}

If μ(1) , μ(2) ∈ M, there must exist corresponding p(1) , p(2)

Convex combinations of p(1) , p(2) lead to other mean parameters in M

Therefore M is convex

Example: The First Two Moments

Let φ1 (x) = x, φ2 (x) = x^2

For any p (not necessarily Gaussian) on x, the mean parameters are μ = (μ1 , μ2 )^T = (E(x), E(x^2 ))^T

Note V(x) = E(x^2 ) − E^2 (x) = μ2 − μ1^2 ≥ 0 for any p

So M is not R^2 but rather the subset μ1 ∈ R, μ2 ≥ μ1^2

The Marginal Polytope

The marginal polytope is defined for discrete random variables

Recall M = {μ ∈ R^d | μ = ∑_x φ(x)p(x) for some p}

p can be a point mass function on a particular x

In fact any p is a convex combination of such point mass functions

M = conv{φ(x), ∀x} is a convex hull, called the marginal polytope

Marginal Polytope Example


Tiny Ising model: two nodes x1 , x2 ∈ {0, 1} connected by an edge

Minimal sufficient statistics: φ(x1 , x2 ) = (x1 , x2 , x1 x2 )^T

Only 4 different x = (x1 , x2 )

The marginal polytope is

    M = conv{(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)}

The convex hull is a polytope inside the unit cube

The three coordinates are the node marginals μ1 = P (x1 = 1), μ2 = P (x2 = 1) and the edge marginal μ12 = P (x1 = x2 = 1), hence the name

The Log Partition Function A

For any regular exponential family, A(θ) is convex in θ

Strictly convex for a minimal exponential family

Nice property:

    ∂A(θ) / ∂θi = Eθ [φi (x)]

Therefore ∇A(θ) = μ, the mean parameters of p_θ

Conjugate Duality
The conjugate dual function A* to A is defined as

    A* (μ) = sup_{θ∈Ω} θ^T μ − A(θ)

Such a definition, where a quantity is expressed as the solution to an optimization problem, is called a variational definition.

For any μ in M's interior, let θ(μ) satisfy

    Eθ(μ) [φ(x)] = ∇A(θ(μ)) = μ

Then A* (μ) = −H(p_{θ(μ)} ), the negative entropy.

The dual of the dual gives back A:

    A(θ) = sup_{μ∈M} θ^T μ − A* (μ)

For all θ ∈ Ω, the supremum is attained uniquely at the μ ∈ M0 given by the moment matching condition μ = Eθ [φ(x)].

Example: Conjugate Dual for Bernoulli

Recall the minimal exponential family for Bernoulli with φ(x) = x, A(θ) = log(1 + exp(θ)), Ω = R.

By definition

    A* (μ) = sup_{θ∈R} θμ − log(1 + exp(θ))

Taking the derivative and solving gives

    A* (μ) = μ log μ + (1 − μ) log(1 − μ)

i.e., the negative entropy.

Inference with Variational Representation

A(θ) = sup_{μ∈M} θ^T μ − A* (μ) is attained by μ = Eθ [φ(x)].

Want to compute the marginals P (xs = j)? They are the mean parameters μsj = Eθ [φsj (x)] under the standard overcomplete representation.

Want to compute the mean parameters μsj ? They are the arg sup of the optimization problem above.

This variational representation is exact, not approximate (we will relax it next to derive loopy BP and mean field).

The Difficulties with Variational Representation

A(θ) = sup_{μ∈M} θ^T μ − A* (μ)

Difficult to solve even though it is a convex problem. Two issues:

    Although the marginal polytope M is convex, it can be quite complex (exponential number of vertices)

    The dual function A* (μ) usually does not admit an explicit form

Variational approximation modifies the optimization problem so that it is tractable, at the price of an approximate solution.

Next, we cast the mean field and sum-product algorithms as variational approximations.

The Mean Field Method as Variational Approximation

The mean field method replaces M with a simpler subset M(F ) on which A* (μ) has a closed form.

Consider the fully disconnected subgraph F = (V, ∅) of the original graph G = (V, E)

Set θi = 0 whenever φi involves edges

The densities in this sub-family are all fully factorized:

    p_θ (x) = ∏_{s∈V} p(xs ; θs )

The Geometry of M(F )


Let M(F ) be the set of mean parameters of the fully factorized sub-family. In general, M(F ) ⊆ M

Recall M is the convex hull of the extreme points {φ(x)}.

It turns out the extreme points {φ(x)} are in M(F ).

Example:

    The tiny Ising model x1 , x2 ∈ {0, 1} with φ = (x1 , x2 , x1 x2 )^T

    The point mass distribution p(x = (0, 1)^T ) = 1 is realized as the limit of the series p(x) = exp(θ1 x1 + θ2 x2 − A(θ)) where θ1 → −∞ and θ2 → ∞

    This series is in F because θ12 = 0

    Hence the extreme point φ(x) = (0, 1, 0) is in M(F )

The Geometry of M(F )

Because the extreme points of M are in M(F ), if M(F ) were convex, we would have M = M(F ).

But in general M(F ) is a true subset of M

Therefore, M(F ) is a nonconvex inner set of M

[Figure: the polytope M with extreme points φ(x), containing the nonconvex set M(F )]

The Mean Field Method as Variational Approximation


Recall the exact variational problem

    A(θ) = sup_{μ∈M} θ^T μ − A* (μ)

attained by the solution to the inference problem, μ = Eθ [φ(x)].

The mean field method simply replaces M with M(F ):

    L(θ) = sup_{μ∈M(F )} θ^T μ − A* (μ)

Obviously L(θ) ≤ A(θ).

The original solution μ may not be in M(F )

Even if μ ∈ M(F ), we may hit a local maximum and not find it

Why bother? Because A* (μ) = −H(p_μ ) has a very simple form for μ ∈ M(F )

Example: Mean Field for Ising Model


The mean parameters for the Ising model are the node and edge marginals: μs = p(xs = 1), μst = p(xs = 1, xt = 1)

A fully factorized member of M(F ) has no edges: μst = μs μt

For μ ∈ M(F ), the dual function A* (μ) has the simple form

    A* (μ) = −∑_{s∈V} H(μs ) = ∑_{s∈V} ( μs log μs + (1 − μs ) log(1 − μs ) )

Thus the mean field problem is

    L(θ) = sup_{μ∈M(F )} θ^T μ − ∑_{s∈V} ( μs log μs + (1 − μs ) log(1 − μs ) )

         = max_{(μ1 ...μm )∈[0,1]^m} ∑_{s∈V} θs μs + ∑_{(s,t)∈E} θst μs μt + ∑_{s∈V} H(μs )

Example: Mean Field for Ising Model

L(θ) = max_{(μ1 ...μm )∈[0,1]^m} ∑_{s∈V} θs μs + ∑_{(s,t)∈E} θst μs μt + ∑_{s∈V} H(μs )

Bilinear in μ, not jointly concave

But concave in a single dimension μs , fixing the others

Iterative coordinate-wise maximization: fix μt for t ≠ s and optimize μs

Setting the partial derivative w.r.t. μs to 0 yields

    μs = 1 / ( 1 + exp( −(θs + ∑_{(s,t)∈E} θst μt ) ) )

as we've seen before.

Caution: mean field converges to a local maximum depending on the initialization of μ1 . . . μm .

The Sum-Product Algorithm as Variational Approximation

A(θ) = sup_{μ∈M} θ^T μ − A* (μ)

The sum-product algorithm makes two approximations:

    it relaxes M to an outer set L

    it replaces the dual A* with an approximation

This yields

    Ã(θ) = sup_{τ∈L} θ^T τ − Ã* (τ)

The Outer Relaxation


For overcomplete exponential families on discrete nodes, the mean parameters are node and edge marginals: μsj = p(xs = j), μstjk = p(xs = j, xt = k).

The marginal polytope is M = {μ | ∃ p with marginals μ}.

Now consider τ ∈ R^d_+ satisfying node normalization and edge–node marginal consistency conditions:

    ∑_{j=0}^{r−1} τsj = 1                ∀s ∈ V

    ∑_{k=0}^{r−1} τstjk = τsj            ∀s, t ∈ V, j = 0 . . . r − 1

    ∑_{j=0}^{r−1} τstjk = τtk            ∀s, t ∈ V, k = 0 . . . r − 1

Define L = {τ satisfying the above conditions}.

The Outer Relaxation


If the graph is a tree then M = L

If the graph has cycles then M ⊊ L

    L is too lax: it misses other constraints that true marginals need to satisfy

Nice property: L is still a polytope, but a much simpler one than M.

[Figure: the polytope M with vertices φ(x), contained in the larger polytope L]

The first approximation in sum-product is to replace M with L.

Approximating A

Recall μ are the node and edge marginals

If the graph is a tree, one can exactly reconstruct the joint probability from them:

    p_μ (x) = ∏_{s∈V} μsxs ∏_{(s,t)∈E} ( μstxs xt / (μsxs μtxt ) )

If the graph is a tree, the entropy of the joint distribution is

    H(p_μ ) = −∑_{s∈V} ∑_{j=0}^{r−1} μsj log μsj − ∑_{(s,t)∈E} ∑_{j,k} μstjk log( μstjk / (μsj μtk ) )

Neither holds for graphs with cycles.

Approximating A

Define the Bethe entropy for τ ∈ L on loopy graphs in the same way:

    HBethe (p_τ ) = −∑_{s∈V} ∑_{j=0}^{r−1} τsj log τsj − ∑_{(s,t)∈E} ∑_{j,k} τstjk log( τstjk / (τsj τtk ) )

Note HBethe is not a true entropy. The second approximation in sum-product is to replace −A* (τ) with HBethe (p_τ ).

The Sum-Product Algorithm as Variational Approximation

With these two approximations, we arrive at the variational problem

    Ã(θ) = sup_{τ∈L} θ^T τ + HBethe (p_τ )

Optimality conditions require the gradients to vanish w.r.t. both τ and the Lagrange multipliers on the constraints τ ∈ L.

The sum-product algorithm can be derived as an iterative fixed-point procedure for satisfying these optimality conditions.

At the solution, Ã(θ) is not guaranteed to be either an upper or a lower bound of A(θ)

τ may not correspond to a true marginal distribution

Summary: Variational Inference

The sum-product algorithm (loopy belief propagation)

The mean field method

Not covered: Expectation Propagation

Efficient computation. But often unknown bias in solution.

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Maximizing Problems
Recall the HMM example

[HMM as before: π1 = π2 = 1/2, emissions P(x | z=1) = (1/2, 1/4, 1/4) and P(x | z=2) = (1/4, 1/2, 1/4) over R, G, B]

There are two senses of "best" states z1:N given x1:N :

1. So far we computed the marginal p(zn |x1:N )

    We can define "best" as zn* = arg max_k p(zn = k|x1:N )

    However z*1:N as a whole may not be the best

    In fact z*1:N can even have zero probability!

2. An alternative is to find

    z*1:N = arg max_{z1:N} p(z1:N |x1:N )

    This finds the most likely state configuration as a whole

    The max-sum algorithm solves this

    It generalizes the Viterbi algorithm for HMMs

Intermediate: The Max-Product Algorithm

Simple modification to the sum-product algorithm: replace ∑ with max in the factor-to-variable messages:

    μfs→x (x) = max_{x1} . . . max_{xM} fs (x, x1 , . . . , xM ) ∏_{m=1}^{M} μxm→fs (xm )

    μxm→fs (xm ) = ∏_{f ∈ne(xm )\fs} μf→xm (xm )

with leaf messages

    leaf factor f : μf→x (x) = f (x)
    leaf variable x: μx→f (x) = 1

Intermediate: The Max-Product Algorithm

As in sum-product, pick an arbitrary variable node x as the root

Pass messages up from the leaves until they reach the root

Unlike sum-product, do not pass messages back from the root to the leaves

At the root, multiply the incoming messages:

    p_max = max_x ∏_{f ∈ne(x)} μf→x (x)

This is the probability of the most likely state configuration

Intermediate: The Max-Product Algorithm

To identify the configuration itself, keep back pointers:

When creating the message

    μfs→x (x) = max_{x1} . . . max_{xM} fs (x, x1 , . . . , xM ) ∏_{m=1}^{M} μxm→fs (xm )

for each value of x, we separately create M pointers back to the values of x1 , . . . , xM that achieve the maximum.

At the root, backtrack the pointers.

Intermediate: The Max-Product Algorithm

Factor graph: f1 —— z1 —— f2 —— z2 , with f1 = P (z1 )P (x1 | z1 ), f2 = P (z2 | z1 )P (x2 | z2 )

Message from leaf f1 :

    μf1→z1 (z1 = 1) = P (z1 = 1)P (R|z1 = 1) = 1/2 × 1/2 = 1/4
    μf1→z1 (z1 = 2) = P (z1 = 2)P (R|z1 = 2) = 1/2 × 1/4 = 1/8

The second message:

    μz1→f2 (z1 = 1) = 1/4
    μz1→f2 (z1 = 2) = 1/8

Intermediate: The Max-Product Algorithm

    μf2→z2 (z2 = 1)
    = max_{z1} f2 (z1 , z2 ) μz1→f2 (z1 )
    = max_{z1} P (z2 = 1 | z1 )P (x2 = G | z2 = 1) μz1→f2 (z1 )
    = max(1/4 × 1/4 × 1/4, 1/2 × 1/4 × 1/8) = 1/64

Back pointer for z2 = 1: either z1 = 1 or z1 = 2

Intermediate: The Max-Product Algorithm


The other element of the same message:

    μf2→z2 (z2 = 2)
    = max_{z1} f2 (z1 , z2 ) μz1→f2 (z1 )
    = max_{z1} P (z2 = 2 | z1 )P (x2 = G | z2 = 2) μz1→f2 (z1 )
    = max(3/4 × 1/2 × 1/4, 1/2 × 1/2 × 1/8) = 3/32

Back pointer for z2 = 2: z1 = 1

Intermediate: The Max-Product Algorithm


So μf2→z2 = (1/64 with back pointer z1 = 1 or 2, 3/32 with back pointer z1 = 1)

At the root z2 :

    max_{s=1,2} μf2→z2 (s) = 3/32, attained at z2 = 2, which backtracks to z1 = 1

    z*1:2 = arg max_{z1:2} p(z1:2 |x1:2 ) = (1, 2)

In this example, sum-product and max-product produce the same best sequence; in general they differ.

From Max-Product to Max-Sum


The max-sum algorithm is equivalent to the max-product algorithm, but works in log space to avoid underflow:

    μfs→x (x) = max_{x1 ...xM} [ log fs (x, x1 , . . . , xM ) + ∑_{m=1}^{M} μxm→fs (xm ) ]

    μxm→fs (xm ) = ∑_{f ∈ne(xm )\fs} μf→xm (xm )

    leaf factor f : μf→x (x) = log f (x)
    leaf variable x: μx→f (x) = 0

When at the root,

    log p_max = max_x ∑_{f ∈ne(x)} μf→x (x)

The back pointers are the same.
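A minimal Python sketch of max-sum (Viterbi) on the two-observation example above, working in log space; under the same assumed HMM parameters it recovers z* = (1, 2) with probability 3/32:

import numpy as np

# Max-sum (Viterbi) on the two-step HMM from the slides, in log space.
pi = np.array([0.5, 0.5])
A = np.array([[0.25, 0.75],   # P(z2 | z1 = 1)
              [0.50, 0.50]])  # P(z2 | z1 = 2)
E = np.array([[0.5, 0.25, 0.25],   # P(x | z = 1) over (R, G, B)
              [0.25, 0.5, 0.25]])  # P(x | z = 2)
obs = [0, 1]  # x1 = R, x2 = G

log_msg = np.log(pi) + np.log(E[:, obs[0]])          # mu_{f1 -> z1} in log space
back = []
for x in obs[1:]:
    scores = log_msg[:, None] + np.log(A) + np.log(E[:, x])[None, :]
    back.append(scores.argmax(axis=0))               # back pointer for each value of the next z
    log_msg = scores.max(axis=0)                      # mu_{f2 -> z2}

# Backtrack from the root.
z = [int(log_msg.argmax())]
for bp in reversed(back):
    z.insert(0, int(bp[z[0]]))
print([s + 1 for s in z], np.exp(log_msg.max()))      # [1, 2], 3/32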

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Parameter Learning

Assume the graph structure is given

Learning in the exponential family: estimate θ from iid data x1 . . . xn

Principle: maximum likelihood

Distinguish two cases:

    fully observed data: all dimensions of x are observed

    partially observed data: some dimensions of x are unobserved

Fully Observed Data



p_θ (x) = exp( θ^T φ(x) − A(θ) )

Given iid data x1 . . . xn , the log likelihood is

    ℓ(θ) = (1/n) ∑_{i=1}^{n} log p_θ (xi ) = θ^T ( (1/n) ∑_{i=1}^{n} φ(xi ) ) − A(θ) = θ^T μ̂ − A(θ)

where μ̂ = (1/n) ∑_{i=1}^{n} φ(xi ) is the mean parameter of the empirical distribution on x1 . . . xn . Clearly μ̂ ∈ M.

Maximum likelihood: θML = arg sup_θ θ^T μ̂ − A(θ)

The solution is θML = θ(μ̂), the exponential family density whose mean parameter matches μ̂.

When μ̂ ∈ M0 and the family is minimal, there is a unique maximum likelihood solution θML .

Partially Observed Data

Each item is (x, z) where x is observed and z unobserved

Full data (x1 , z1 ) . . . (xn , zn ), but we only observe x1 . . . xn

The incomplete likelihood is ℓ(θ) = (1/n) ∑_{i=1}^{n} log p_θ (xi ) where p_θ (xi ) = ∫ p_θ (xi , z) dz

It can be written as ℓ(θ) = (1/n) ∑_{i=1}^{n} Axi (θ) − A(θ)

with a new log partition function of p_θ (z | xi ), one per item:

    Axi (θ) = log ∫ exp( θ^T φ(xi , z') ) dz'

Expectation-Maximization (EM) algorithm: lower-bound Axi

EM as Variational Lower Bound


Mean parameters realizable by any distribution on z while holding xi fixed:

    Mxi = {μ ∈ R^d | μ = Ep [φ(xi , z)] for some p}

The variational definition: Axi (θ) = sup_{μ∈Mxi} θ^T μ − A*xi (μ)

Trivial variational lower bound:

    Axi (θ) ≥ θ^T μi − A*xi (μi ),   ∀μi ∈ Mxi

Lower bound L on the incomplete log likelihood:

    ℓ(θ) = (1/n) ∑_{i=1}^{n} Axi (θ) − A(θ)
         ≥ (1/n) ∑_{i=1}^{n} ( θ^T μi − A*xi (μi ) ) − A(θ)
         ≡ L(μ1 , . . . , μn , θ)

Exact EM: The E-Step

The EM algorithm is coordinate ascent on L(μ1 , . . . , μn , θ).

In the E-step, maximize over each μi :

    μi ← arg max_{μi ∈Mxi} L(μ1 , . . . , μn , θ)

Equivalently, arg max_{μi ∈Mxi} θ^T μi − A*xi (μi )

This is the variational representation of the mean parameter

    μi (θ) = Eθ [φ(xi , z)]

The E-step is named after this expectation Eθ [·] under the current parameters θ

Exact EM: The M-Step

In the M-step, maximize over θ holding the μ's fixed:

    θ ← arg max_θ L(μ1 , . . . , μn , θ) = arg max_θ θ^T μ̄ − A(θ),   where μ̄ = (1/n) ∑_{i=1}^{n} μi

The solution θ(μ̄) satisfies Eθ(μ̄) [φ(x)] = μ̄

This is the standard fully observed maximum likelihood problem, hence the name M-step

Variational EM
For loopy graphs the E-step is often intractable:

    Can't maximize max_{μi ∈Mxi} θ^T μi − A*xi (μi )

    Improving, but not necessarily maximizing, gives a generalized EM algorithm

The mean field method maximizes

    max_{μi ∈Mxi (F )} θ^T μi − A*xi (μi )

    up to a local maximum

    recall Mxi (F ) is an inner approximation to Mxi

Mean field E-step leads to generalized EM

The sum-product algorithm does not lead to generalized EM

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Score-Based Structure Learning

Let M̄ be the set of all allowed candidate features

Let M ⊆ M̄ be a log-linear model structure:

    P (X | M, θ) = (1/Z) exp( ∑_{i∈M} θi fi (X) )

A score for the model M can be max_θ ln P (Data | M, θ)

The score is always better for a larger M, so it needs regularization

M and θ are treated separately

Structure Learning for Gaussian Random Fields

Consider a p-dimensional multivariate Gaussian N (μ, Σ)

The graphical model has p nodes x1 , . . . , xp

The edge between xi , xj is absent if and only if Ωij = 0, where Ω = Σ^{−1}

Equivalently, xi , xj are conditionally independent given the other variables

[Example graph on x1 , x2 , x3 , x4 ]

Structure Learning for Gaussian Random Fields

Let the data be X(1) , . . . , X(n) ∼ N (μ, Σ)

The log likelihood is

    −(n/2) log |Σ| − (1/2) ∑_{i=1}^{n} (X(i) − μ)^T Σ^{−1} (X(i) − μ)

The maximum likelihood estimate of Σ is the sample covariance

    S = (1/n) ∑_i (X(i) − X̄)(X(i) − X̄)^T

where X̄ is the sample mean

S^{−1} is not a good estimate of Ω when n is small

Structure Learning for Gaussian Random Fields

For centered data, minimize a regularized problem instead:

    −log |Ω| + (1/n) ∑_{i=1}^{n} X(i)^T Ω X(i) + λ ∑_{i≠j} |Ωij |

Known as the graphical lasso (glasso)
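As an illustration (not from the slides), scikit-learn provides a graphical lasso estimator; a minimal sketch on synthetic data from a known sparse precision matrix:

import numpy as np
from sklearn.covariance import GraphicalLasso

# Estimate a sparse precision matrix (graph structure) with the graphical lasso.
rng = np.random.default_rng(0)
Omega_true = np.array([[ 2.0, -1.0,  0.0],
                       [-1.0,  2.5, -1.0],
                       [ 0.0, -1.0,  2.0]])      # chain x1 - x2 - x3
X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(Omega_true), size=500)

model = GraphicalLasso(alpha=0.1).fit(X)          # alpha plays the role of lambda
Omega_hat = model.precision_
edges = [(i, j) for i in range(3) for j in range(i + 1, 3)
         if abs(Omega_hat[i, j]) > 1e-3]
print(edges)  # ideally recovers [(0, 1), (1, 2)]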

Recap

Given GM = joint distribution p(x1 , . . . , xn )

    BN or MRF

    conditional independence

Do inference = p(XQ | XE ), in general XQ ∪ XE ⊆ {x1 . . . xn }

    exact, MCMC, variational

If p(x1 , . . . , xn ) not given, estimate it from data

    parameter and structure learning

Much on-going research!
