
All of Graphical Models

Xiaojin Zhu
Department of Computer Sciences
University of Wisconsin–Madison, USA

Tutorial at ICMLA 2011

The Whole Tutorial in One Slide

Given GM = joint distribution p(x1 , . . . , xn )

Do inference = p(XQ | XE ), in general


XQ ∪ XE ⊆ {x1 . . . xn }

If p(x1 , . . . , xn ) not given, estimate it from data

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Life without Graphical Models


. . . is fine mathematically:

The universe is reduced to a set of random variables x1 , . . . , xn

    e.g., x1 , . . . , xn−1 can be the discrete or continuous features

    e.g., xn ≡ y can be the discrete class label

The joint p(x1 , . . . , xn ) completely describes how the universe works

Machine learning: estimate p(x1 , . . . , xn ) from training data X(1) , . . . , X(N) , where X(i) = (x1(i) , . . . , xn(i) )

Prediction: y = argmax_{xn} p(xn | x1 , . . . , xn−1 ), a.k.a. inference, by the definition of conditional probability

    p(xn | x1 , . . . , xn−1 ) = p(x1 , . . . , xn−1 , xn ) / ∑_v p(x1 , . . . , xn−1 , xn = v)

Conclusion

Life without graphical models is just fine

So why are we still here?

Life can be Better for Computer Scientists

Given GM = joint distribution p(x1 , . . . , xn )

    exponential naïve storage (2^n for binary r.v.)

    hard to interpret (conditional independence)

Do inference = p(XQ | XE ), in general, where XQ ∪ XE ⊆ {x1 . . . xn }

    Often can't do it computationally

If p(x1 , . . . , xn ) not given, estimate it from data

    Can't do it either

Acknowledgments Before We Start

Much of this tutorial is based on:

Koller & Friedman, Probabilistic Graphical Models. MIT Press, 2009.

Wainwright & Jordan, Graphical Models, Exponential Families, and Variational Inference. FTML, 2008.

Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Graphical-Model-Nots

Graphical modeling is the study of probabilistic models

Just because there is a graph with nodes and edges doesn't mean it's a GM

These are not graphical models:

neural network

decision tree

network flow

HMM template

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Bayesian Network
A directed graph has nodes X = (x1 , . . . , xn ), some of them connected by directed edges xi → xj

A cycle is a directed path x1 → . . . → xk where x1 = xk

A directed acyclic graph (DAG) contains no cycles

A Bayesian network on the DAG is a family of distributions satisfying

    {p | p(X) = ∏_i p(xi | Pa(xi ))}

where Pa(xi ) is the set of parents of xi .

p(xi | Pa(xi )) is the conditional probability distribution (CPD) at xi

By specifying the CPDs for all i, we specify a particular distribution p(X)

Example: Alarm
Binary variables
P(B)=0.001

P(E)=0.002

B
P(A | B, E) = 0.95
P(A | B, ~E) = 0.94
P(A | ~B, E) = 0.29
P(A | ~B, ~E) = 0.001

A
J

P(J | A) = 0.9
P(J | ~A) = 0.05

M
P(M | A) = 0.7
P(M | ~A) = 0.01

P (B, ~E, A, J, ~M )
= P (B) P (~E) P (A | B, ~E) P (J | A) P (~M | A)
= 0.001 × (1 − 0.002) × 0.94 × 0.9 × (1 − 0.7)
≈ 0.000253
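As a sanity check, the joint above can be computed directly from the CPDs via the Bayesian network factorization. A minimal Python sketch (the dictionary encoding of the CPDs is mine, not from the slides):

# Joint probability in the alarm network: P(B, ~E, A, J, ~M)
# CPDs copied from the slide above.
P_B = 0.001
P_E = 0.002
P_A_given = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
P_J_given_A = {1: 0.9, 0: 0.05}
P_M_given_A = {1: 0.7, 0: 0.01}

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) = product of the CPDs (BN factorization)."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p_a = P_A_given[(b, e)]
    p *= p_a if a else 1 - p_a
    p *= P_J_given_A[a] if j else 1 - P_J_given_A[a]
    p *= P_M_given_A[a] if m else 1 - P_M_given_A[a]
    return p

print(joint(1, 0, 1, 1, 0))  # ~0.000253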

Example: Naive Bayes

[Graph: y → x1 , . . . , xd , and the equivalent plate representation with x repeated d times]

p(y, x1 , . . . , xd ) = p(y) ∏_{i=1}^{d} p(xi | y)

Used extensively in natural language processing

Plate representation on the right
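To make the model concrete, here is a small Python sketch of naive Bayes prediction, y* = argmax_y p(y) ∏_i p(xi | y); the class prior and CPD numbers are invented for illustration:

import numpy as np

# Naive Bayes prediction: y* = argmax_y p(y) * prod_i p(x_i | y)
# Toy CPDs (made up): 2 classes, 3 binary features.
p_y = np.array([0.6, 0.4])                      # p(y)
p_x_given_y = np.array([[0.8, 0.3],             # p(x_i = 1 | y) for each feature i
                        [0.5, 0.9],
                        [0.1, 0.4]])

def predict(x):
    # Work in log space to avoid underflow with many features.
    log_post = np.log(p_y).copy()
    for i, xi in enumerate(x):
        p1 = p_x_given_y[i]
        log_post += np.log(p1 if xi == 1 else 1 - p1)
    return int(np.argmax(log_post))

print(predict([1, 0, 0]))  # most likely class under these toy CPDs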

No Causality Whatsoever

BN 1 (A → B):
P(A) = a
P(B|A) = b
P(B|~A) = c

BN 2 (B → A):
P(B) = ab + (1−a)c
P(A|B) = ab / (ab + (1−a)c)
P(A|~B) = a(1−b) / (1 − ab − (1−a)c)

The two BNs are equivalent in all respects

Bayesian networks imply no causality at all

They only encode the joint probability distribution (hence correlation)

However, people tend to design BNs based on causal relations

Example: Latent Dirichlet Allocation (LDA)

[Plate diagram: hyperparameters α, β; topics φ; per-document θ; topic assignments z and observed words w, with Nd words per document]

A generative model for p(θ, φ, z, w | α, β):

For each topic t
    φt ∼ Dirichlet(β)
For each document d
    θ ∼ Dirichlet(α)
    For each word position in d
        topic z ∼ Multinomial(θ)
        word w ∼ Multinomial(φz )

Inference goals: p(z | w, α, β), argmax_{θ,φ} p(θ, φ | w, α, β)
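The generative process can be simulated directly. A minimal Python sketch, with toy sizes and hyperparameter values chosen only for illustration:

import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (illustrative): T topics, D documents, V vocabulary words, Nd words per document.
T, D, V, Nd = 3, 5, 20, 50
alpha, beta = 0.5, 0.1

# For each topic t: phi_t ~ Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=T)      # T x V
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))       # theta ~ Dirichlet(alpha)
    words = []
    for _ in range(Nd):
        z = rng.choice(T, p=theta)                 # topic z ~ Multinomial(theta)
        w = rng.choice(V, p=phi[z])                # word w ~ Multinomial(phi_z)
        words.append(w)
    docs.append(words)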

Some Topics by LDA on the Wish Corpus

p(word | topic) shown for three example topics (top words: "troops", "election", "love")

Conditional Independence

Two r.v.s A, B are independent if P (A, B) = P (A)P (B), or equivalently P (A|B) = P (A)

Two r.v.s A, B are conditionally independent given C if P (A, B | C) = P (A | C)P (B | C), or equivalently P (A | B, C) = P (A | C)

This extends to groups of r.v.s

Conditional independence in a BN is precisely specified by d-separation (directed separation)

d-Separation Case 1: Tail-to-Tail

[Graph: A ← C → B]

A, B in general dependent

A, B conditionally independent given C

C is a tail-to-tail node, blocks the undirected path A-B

d-Separation Case 2: Head-to-Tail

A, B in general dependent

A, B conditionally independent given C

C is a head-to-tail node, blocks the path A-B

d-Separation Case 3: Head-to-Head

A, B in general independent

A, B conditionally dependent given C, or any of C's descendants

C is a head-to-head node, unblocks the path A-B

d-Separation

Any groups of nodes A and B are conditionally independent given another group C, if all undirected paths from any node in A to any node in B are blocked

A path is blocked if it includes a node x such that either

    the path is head-to-tail or tail-to-tail at x and x ∈ C, or

    the path is head-to-head at x, and neither x nor any of its descendants is in C.

d-Separation Example 1

The path from A to B is not blocked by either E or F

A, B dependent given C

d-Separation Example 2

The path from A to B is blocked at both E and F

A, B conditionally independent given F

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Markov Random Fields

The efficiency of directed graphical models (acyclic graph, locally normalized CPDs) also makes them restrictive

A clique C in an undirected graph is a fully connected set of nodes (note: full of loops!)

Define a nonnegative potential function ψC : XC → R+

An undirected graphical model (aka Markov Random Field) on the graph is a family of distributions satisfying

    {p | p(X) = (1/Z) ∏_C ψC (XC )}

where Z = ∫ ∏_C ψC (XC ) dX is the partition function

Example: A Tiny Markov Random Field

x1 —— x2 , forming a single clique C

x1 , x2 ∈ {−1, 1}

A single clique potential: ψC (x1 , x2 ) = e^{a x1 x2}

p(x1 , x2 ) = (1/Z) e^{a x1 x2}

Z = 2e^a + 2e^{−a}

p(1, 1) = p(−1, −1) = e^a / (2e^a + 2e^{−a})

p(−1, 1) = p(1, −1) = e^{−a} / (2e^a + 2e^{−a})

When the parameter a > 0, favor homogeneous chains

When the parameter a < 0, favor inhomogeneous chains
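The partition function and the probabilities above can be verified by enumerating the four configurations. A minimal Python sketch (the value of a is arbitrary):

import numpy as np

# Tiny MRF: two nodes x1, x2 in {-1, +1}, one clique with psi(x1, x2) = exp(a * x1 * x2)
a = 1.0  # coupling parameter; a > 0 favors x1 == x2

values = [-1, +1]
psi = {(x1, x2): np.exp(a * x1 * x2) for x1 in values for x2 in values}
Z = sum(psi.values())                      # partition function, here 2e^a + 2e^-a
p = {cfg: v / Z for cfg, v in psi.items()} # normalized joint

for cfg, prob in p.items():
    print(cfg, round(prob, 4))             # homogeneous configurations get higher probability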

Log Linear Models

Real-valued feature functions f1 (X), . . . , fk (X)

Real-valued weights w1 , . . . , wk

    p(X) = (1/Z) exp( ∑_{i=1}^{k} wi fi (X) )

Example: The Ising Model

[Graph: node s with value xs and parameter θs , node t with value xt , edge (s, t) with parameter θst ]

This is an undirected model with each xs ∈ {0, 1}:

    p_θ (x) = (1/Z) exp( ∑_{s∈V} θs xs + ∑_{(s,t)∈E} θst xs xt )

In log linear form: fs (X) = xs , fst (X) = xs xt , ws = θs , wst = θst

Example: Image Denoising

[Figure from Bishop PRML: a noisy binary image and its denoised reconstruction]

Denoising as inference: argmax_X P (X|Y )

Example: Gaussian Random Field

p(X) ∼ N (μ, Σ):

    p(X) = 1 / ( (2π)^{n/2} |Σ|^{1/2} ) · exp( −(1/2) (X − μ)^T Σ^{−1} (X − μ) )

Multivariate Gaussian

The n × n covariance matrix Σ is positive semi-definite

Let Ω = Σ^{−1} be the precision matrix

xi , xj are conditionally independent given all other variables, if and only if Ωij = 0

When Ωij ≠ 0, there is an edge between xi , xj
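A small Python sketch of reading graph structure off the precision matrix; the 3-node chain precision matrix below is a toy example of my own:

import numpy as np

# Gaussian random field structure from the precision matrix Omega = inv(Sigma):
# zero entries of Omega <=> missing edges (conditional independence).
# Toy chain-structured precision matrix: x1 - x2 - x3
Omega = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.5, -1.0],
                  [ 0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(Omega)

# Omega[0, 2] == 0, so x1 and x3 are conditionally independent given x2,
# even though their marginal covariance Sigma[0, 2] is nonzero.
print(Omega[0, 2], Sigma[0, 2])

edges = [(i, j) for i in range(3) for j in range(i + 1, 3) if not np.isclose(Omega[i, j], 0)]
print(edges)  # [(0, 1), (1, 2)]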

Conditional Independence in Markov Random Fields


Two groups of variables A, B are conditionally independent given another group C, if:

    Remove C and all edges involving C

    A, B become disconnected

Factor Graph
I

For both directed and undirected graphical models

Bipartite: edges between a variable node and a factor node

Factors represent computation

[Example: a factor node f connected to variable nodes A, B, C; for an MRF the factor is f (A, B, C) = ψ(A, B, C), while for a BN it is f (A, B, C) = P(A)P(B)P(C|A,B)]

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Inference by Enumeration

Let X = (XQ , XE , XO ) for query, evidence, and other variables.

Infer P (XQ | XE )

By definition

    P (XQ | XE ) = P (XQ , XE ) / P (XE ) = ∑_{XO} P (XQ , XE , XO ) / ∑_{XQ ,XO} P (XQ , XE , XO )

Summing an exponential number of terms: with k variables in XO each taking r values, there are r^k terms

Details of the summing problem

We want ∑_{x1} · · · ∑_{xk} p(X)

There are a bunch of other variables x1 , . . . , xk

We sum over the r values each variable can take: ∑_{xi = v1 ,...,vr}

This is exponential: r^k terms

For a graphical model, the joint probability factors: p(X) = ∏_{j=1}^{m} fj (X(j) )

Each factor fj operates on X(j) ⊆ X

Eliminating a Variable

Obviously equivalent: ∑_{x1 ,...,xk} f1 · · · fl fl+1 · · · fm

Rearrange factors by whether x1 ∈ X(j) :

    ∑_{x2 ,...,xk} f1 · · · fl ( ∑_{x1} fl+1 · · · fm )

Introduce a new factor fm+1 = ∑_{x1} fl+1 · · · fm

fm+1 contains the union of the variables in fl+1 · · · fm except x1

In fact, x1 disappears altogether in ∑_{x2 ,...,xk} f1 · · · fl fm+1

Dynamic programming: compute fm+1 once, use it thereafter

Hope: fm+1 contains very few variables

Recursively eliminate other variables in turn

Example: Chain Graph


A → B → C → D

Binary variables

Say we want P (D) = ∑_{A,B,C} P (A)P (B|A)P (C|B)P (D|C)

Let f1 (A) = P (A). Note f1 is an array of size two:
    P (A = 0)
    P (A = 1)

f2 (A, B) is a table of size four:
    P (B = 0|A = 0)
    P (B = 0|A = 1)
    P (B = 1|A = 0)
    P (B = 1|A = 1)

∑_{A,B,C} f1 (A)f2 (A, B)f3 (B, C)f4 (C, D) = ∑_{B,C} f3 (B, C)f4 (C, D) ( ∑_A f1 (A)f2 (A, B) )

Example: Chain Graph

f1 (A)f2 (A, B) is an array of size four: match A values
    P (A = 0)P (B = 0|A = 0)
    P (A = 1)P (B = 0|A = 1)
    P (A = 0)P (B = 1|A = 0)
    P (A = 1)P (B = 1|A = 1)

f5 (B) ≡ ∑_A f1 (A)f2 (A, B) is an array of size two:
    P (A = 0)P (B = 0|A = 0) + P (A = 1)P (B = 0|A = 1)
    P (A = 0)P (B = 1|A = 0) + P (A = 1)P (B = 1|A = 1)

For this example, f5 (B) happens to be P (B)

∑_{B,C} f3 (B, C)f4 (C, D)f5 (B) = ∑_C f4 (C, D) ( ∑_B f3 (B, C)f5 (B) ), and so on

In the end, f7 (D) = (P (D = 0), P (D = 1))

Example: Chain Graph

Computation for P (D): 12 ×, 6 +

Enumeration: 48 ×, 14 +

Saving depends on the elimination order. Finding the optimal order is NP-hard; there are heuristic methods.

Saving depends more critically on the graph structure (tree width); it can still be intractable
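A minimal Python sketch of this elimination order on the chain A → B → C → D; the CPT values are made up for illustration, and the brute-force check at the end enumerates the full joint:

import numpy as np

# Variable elimination on the chain A -> B -> C -> D (binary variables).
f1 = np.array([0.6, 0.4])                       # f1(A)   = P(A)
f2 = np.array([[0.7, 0.3], [0.2, 0.8]])         # f2(A,B) = P(B|A), rows index A
f3 = np.array([[0.9, 0.1], [0.4, 0.6]])         # f3(B,C) = P(C|B)
f4 = np.array([[0.5, 0.5], [0.1, 0.9]])         # f4(C,D) = P(D|C)

f5 = f1 @ f2          # eliminate A: f5(B) = sum_A f1(A) f2(A,B)
f6 = f5 @ f3          # eliminate B: f6(C) = sum_B f5(B) f3(B,C)
f7 = f6 @ f4          # eliminate C: f7(D) = sum_C f6(C) f4(C,D)
print(f7)             # P(D), sums to 1

# Sanity check against brute-force enumeration of the full joint.
joint = f1[:, None, None, None] * f2[:, :, None, None] * f3[None, :, :, None] * f4[None, None, :, :]
print(joint.sum(axis=(0, 1, 2)))  # same P(D)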

Handling Evidence

For evidence variables XE , simply plug in their value e

Eliminate the variables XO = X − XE − XQ

The final factor will be the joint f (XQ ) = P (XQ , XE = e)

Normalize to answer the query:

    P (XQ | XE = e) = f (XQ ) / ∑_{XQ} f (XQ )

Summary: Exact Inference

Enumeration

Variable elimination

Not covered: junction tree (aka clique tree)

Exact, but intractable for large graphs

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Inference by Monte Carlo


Consider the inference problem p(XQ = cQ | XE ) where XQ ∪ XE ⊆ {x1 . . . xn }:

    p(XQ = cQ | XE ) = ∫ 1_{(xQ =cQ )} p(xQ | XE ) dxQ

If we can draw samples xQ(1) , . . . , xQ(m) ∼ p(xQ | XE ), an unbiased estimator is

    p(XQ = cQ | XE ) ≈ (1/m) ∑_{i=1}^{m} 1_{(xQ(i) =cQ )}

The variance of the estimator decreases as V/m

Inference reduces to sampling from p(xQ | XE )

Forward Sampling Example


P(B)=0.001

P(E)=0.002

B
P(A | B, E) = 0.95
P(A | B, ~E) = 0.94
P(A | ~B, E) = 0.29
P(A | ~B, ~E) = 0.001

A
J

P(J | A) = 0.9
P(J | ~A) = 0.05

M
P(M | A) = 0.7
P(M | ~A) = 0.01

To generate a sample X = (B, E, A, J, M ):

1. Sample B ∼ Ber(0.001): draw r ∼ U (0, 1); if r < 0.001 then B = 1, else B = 0
2. Sample E ∼ Ber(0.002)
3. If B = 1 and E = 1, sample A ∼ Ber(0.95), and so on
4. If A = 1 sample J ∼ Ber(0.9), else J ∼ Ber(0.05)
5. If A = 1 sample M ∼ Ber(0.7), else M ∼ Ber(0.01)

Works for Bayesian networks.
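A minimal Python sketch of forward sampling for this network (the helper names are mine):

import random

# Forward sampling from the alarm network (ancestral order: B, E, A, J, M).
P_A_given = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

def bern(p):
    return 1 if random.random() < p else 0

def forward_sample():
    B = bern(0.001)
    E = bern(0.002)
    A = bern(P_A_given[(B, E)])
    J = bern(0.9 if A else 0.05)
    M = bern(0.7 if A else 0.01)
    return B, E, A, J, M

samples = [forward_sample() for _ in range(100000)]
# Monte Carlo estimate of a marginal, e.g. P(J = 1):
print(sum(s[3] for s in samples) / len(samples))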

Inference with Forward Sampling

Say the inference task is P (B = 1 | E = 1, M = 1)

Throw away all samples except those with (E = 1, M = 1):

    p(B = 1 | E = 1, M = 1) ≈ (1/m) ∑_{i=1}^{m} 1_{(B(i) =1)}

where m is the number of surviving samples

Can be highly inefficient (note P (E = 1) is tiny)

Does not work for Markov Random Fields

Gibbs Sampler Example: P (B = 1 | E = 1, M = 1)


The Gibbs sampler is a Markov Chain Monte Carlo (MCMC) method.

Directly sample from p(xQ | XE )

Works for both directed and undirected graphical models

Initialization: fix the evidence; randomly set the other variables,
e.g. X (0) = (B = 0, E = 1, A = 0, J = 0, M = 1)
P(B)=0.001

P(E)=0.002

B
P(A | B, E) = 0.95
P(A | B, ~E) = 0.94
P(A | ~B, E) = 0.29
P(A | ~B, ~E) = 0.001

A
J

P(J | A) = 0.9
P(J | ~A) = 0.05

M
P(M | A) = 0.7
P(M | ~A) = 0.01

Gibbs Update
For each non-evidence variable xi , fixing all other nodes X−i , resample its value xi ∼ P (xi | X−i )

This is equivalent to xi ∼ P (xi | MarkovBlanket(xi ))

For a Bayesian network, MarkovBlanket(xi ) includes xi 's parents, spouses, and children:

    P (xi | MarkovBlanket(xi )) ∝ P (xi | Pa(xi )) ∏_{y∈C(xi )} P (y | Pa(y))

where Pa(x) are the parents of x, and C(x) the children of x.

For many graphical models the Markov Blanket is small.

For example, B ∼ P (B | E = 1, A = 0) ∝ P (B)P (A = 0 | B, E = 1)
P(B)=0.001

P(E)=0.002

B
P(A | B, E) = 0.95
P(A | B, ~E) = 0.94
P(A | ~B, E) = 0.29
P(A | ~B, ~E) = 0.001

A
J

Gibbs Update
Say we sampled B = 1. Then X (1) = (B = 1, E = 1, A = 0, J = 0, M = 1)

Starting from X (1) , sample A ∼ P (A | B = 1, E = 1, J = 0, M = 1) to get X (2)

Move on to J, then repeat B, A, J, B, A, J . . .

Keep all later samples. P (B = 1 | E = 1, M = 1) is the fraction of samples with B = 1.
P(B)=0.001

P(E)=0.002

B
P(A | B, E) = 0.95
P(A | B, ~E) = 0.94
P(A | ~B, E) = 0.29
P(A | ~B, ~E) = 0.001

A
J

P(J | A) = 0.9
P(J | ~A) = 0.05

M
P(M | A) = 0.7
P(M | ~A) = 0.01

Gibbs Example 2: The Ising Model

[Grid graph: node s with value xs and its neighbors]

This is an undirected model with each xs ∈ {0, 1}:

    p_θ (x) = (1/Z) exp( ∑_{s∈V} θs xs + ∑_{(s,t)∈E} θst xs xt )

Gibbs Example 2: The Ising Model


[Grid graph: node s with neighbors A, B, C, D]

The Markov blanket of xs is {A, B, C, D}

In general, for undirected graphical models

    p(xs | x−s ) = p(xs | xN (s) )

where N (s) is the set of neighbors of s.

The Gibbs update is

    p(xs = 1 | xN (s) ) = 1 / ( exp( −(θs + ∑_{t∈N (s)} θst xt ) ) + 1 )
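A Python sketch of one Gibbs sweep over an n × n grid Ising model using this update; grid size and parameters are placeholders:

import numpy as np

# One sweep of Gibbs sampling for an Ising model on an n x n grid, x in {0, 1}.
n = 10
theta_s = 0.0        # unary parameter, shared by all nodes for simplicity
theta_st = 1.0       # coupling, shared by all edges
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(n, n))

def gibbs_sweep(x):
    for i in range(n):
        for j in range(n):
            # sum over the Markov blanket: the 4 grid neighbors (fewer at the border)
            nb = 0
            if i > 0:     nb += x[i - 1, j]
            if i < n - 1: nb += x[i + 1, j]
            if j > 0:     nb += x[i, j - 1]
            if j < n - 1: nb += x[i, j + 1]
            p1 = 1.0 / (np.exp(-(theta_s + theta_st * nb)) + 1.0)
            x[i, j] = rng.random() < p1
    return x

for _ in range(100):
    x = gibbs_sweep(x)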

Gibbs Sampling as a Markov Chain

A Markov chain is defined by a transition matrix T (X' | X)

Certain Markov chains have a stationary distribution π such that π = T π

The Gibbs sampler is such a Markov chain, with Ti ((X−i , x'i ) | (X−i , xi )) = p(x'i | X−i ) and stationary distribution p(xQ | XE )

But it takes time for the chain to reach its stationary distribution (to mix)

Can be difficult to assert mixing

In practice, burn in: discard X (0) , . . . , X (T )

Use all of X (T +1) , . . . for inference (they are correlated)

Do not thin

Collapsed Gibbs Sampling


In general, Ep [f (X)] ≈ (1/m) ∑_{i=1}^{m} f (X (i) ) if X (i) ∼ p

Sometimes X = (Y, Z) where Z has closed-form operations

If so,

    Ep [f (X)] = Ep(Y ) [ Ep(Z|Y ) [f (Y, Z)] ] ≈ (1/m) ∑_{i=1}^{m} Ep(Z|Y (i) ) [f (Y (i) , Z)]   if Y (i) ∼ p(Y )

No need to sample Z: it is collapsed

Collapsed Gibbs sampler: Ti ((Y−i , y'i ) | (Y−i , yi )) = p(y'i | Y−i )

Note p(y'i | Y−i ) = ∫ p(y'i , Z | Y−i ) dZ

Example: Collapsed Gibbs Sampling for LDA

[Plate diagram as before, with θ and φ collapsed (integrated out)]

Collapse θ, φ. Gibbs update:

    P (zi = j | z−i , w) ∝ [ (n(wi)−i,j + β) / (n(·)−i,j + W β) ] · [ (n(di)−i,j + α) / (n(di)−i,· + T α) ]

n(wi)−i,j : number of times word wi has been assigned to topic j, excluding the current position

n(di)−i,j : number of times a word from document di has been assigned to topic j, excluding the current position

n(·)−i,j : number of times any word has been assigned to topic j, excluding the current position

n(di)−i,· : length of document di , excluding the current position

(W is the vocabulary size and T the number of topics.)
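A Python sketch of this collapsed Gibbs update on a toy corpus (sizes and hyperparameters are placeholders). The document-length denominator is constant in j and cancels after normalization, so the code drops it:

import numpy as np

# Collapsed Gibbs sampling for LDA (sketch). Counts n_wt, n_dt, n_t are kept incrementally.
rng = np.random.default_rng(0)
W, T, alpha, beta = 20, 3, 0.5, 0.1
docs = [rng.integers(0, W, size=30).tolist() for _ in range(5)]    # toy corpus of word ids

words = [(d, w) for d, doc in enumerate(docs) for w in doc]
z = rng.integers(0, T, size=len(words))                            # random initial topics
n_wt = np.zeros((W, T)); n_dt = np.zeros((len(docs), T)); n_t = np.zeros(T)
for (d, w), t in zip(words, z):
    n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1

for sweep in range(50):
    for i, (d, w) in enumerate(words):
        old = z[i]
        n_wt[w, old] -= 1; n_dt[d, old] -= 1; n_t[old] -= 1        # exclude current position
        # P(z_i = j | z_-i, w): document-length denominator dropped (constant in j).
        p = (n_wt[w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
        new = rng.choice(T, p=p / p.sum())
        n_wt[w, new] += 1; n_dt[d, new] += 1; n_t[new] += 1
        z[i] = new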

Summary: Markov Chain Monte Carlo

Gibbs sampling

Not covered: block Gibbs, Metropolis-Hastings

Unbiased (after burn-in), but can have high variance


To learn more, come to Prof. Prasad Tetali's tutorial "Markov Chain Mixing with Applications", 2pm Monday.

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

The Sum-Product Algorithm

Also known as belief propagation (BP)

Exact if the graph is a tree; otherwise known as loopy BP,


approximate

The algorithm involves passing messages on the factor graph

Alternative view: variational approximation (more later)

Example: A Simple HMM


The Hidden Markov Model template (not a graphical model):

[Two hidden states with initial probabilities π1 = π2 = 1/2, self-transition probabilities 1/4 (state 1) and 1/2 (state 2), and emission distributions
P(x | z=1) = (1/2, 1/4, 1/4) over R, G, B
P(x | z=2) = (1/4, 1/2, 1/4) over R, G, B]

Observing x1 = R, x2 = G, the directed graphical model is

    z1 → z2 ,   z1 → x1 = R,   z2 → x2 = G

Factor graph:

    f1 —— z1 —— f2 —— z2
    f1 = P(z1 )P(x1 | z1 ),   f2 = P(z2 | z1 )P(x2 | z2 )

Messages

A message is a vector of length K, where K is the number of values x takes.

There are two types of messages:
1. μf→x : message from a factor node f to a variable node x; μf→x (i) is its ith element, i = 1 . . . K.
2. μx→f : message from a variable node x to a factor node f

Leaf Messages
Assume a tree factor graph. Pick an arbitrary root, say z2

Start messages at the leaves:

    If a leaf is a factor node f , μf→x (x) = f (x)

    If a leaf is a variable node x, μx→f (x) = 1

For the factor graph f1 —— z1 —— f2 —— z2 with f1 = P(z1 )P(x1 | z1 ), f2 = P(z2 | z1 )P(x2 | z2 ):

    μf1→z1 (z1 = 1) = P (z1 = 1)P (R|z1 = 1) = 1/2 × 1/2 = 1/4

    μf1→z1 (z1 = 2) = P (z1 = 2)P (R|z1 = 2) = 1/2 × 1/4 = 1/8

Message from Variable to Factor


A node (factor or variable) can send out a message once all its other incoming messages have arrived

Let x be in factor fs . Then

    μx→fs (x) = ∏_{f ∈ne(x)\fs} μf→x (x)

where ne(x)\fs are the factors connected to x, excluding fs .

In the example:

    μz1→f2 (z1 = 1) = 1/4
    μz1→f2 (z1 = 2) = 1/8

Message from Factor to Variable


Let x be in factor fs , and let the other variables in fs be x1:M . Then

    μfs→x (x) = ∑_{x1} . . . ∑_{xM} fs (x, x1 , . . . , xM ) ∏_{m=1}^{M} μxm→fs (xm )

In the example:

    μf2→z2 (s) = ∑_{s'=1}^{2} μz1→f2 (s') f2 (z1 = s', z2 = s)
               = 1/4 · P (z2 = s|z1 = 1)P (x2 = G|z2 = s) + 1/8 · P (z2 = s|z1 = 2)P (x2 = G|z2 = s)

We get

    μf2→z2 (z2 = 1) = 1/32
    μf2→z2 (z2 = 2) = 1/8

Up to Root, Back Down

The message has reached the root; now pass messages back down:

    μz2→f2 (z2 = 1) = 1
    μz2→f2 (z2 = 2) = 1

Keep Passing Down

    μf2→z1 (s) = ∑_{s'=1}^{2} μz2→f2 (s') f2 (z1 = s, z2 = s')
               = 1 · P (z2 = 1|z1 = s)P (x2 = G|z2 = 1) + 1 · P (z2 = 2|z1 = s)P (x2 = G|z2 = 2)

We get

    μf2→z1 (z1 = 1) = 7/16
    μf2→z1 (z1 = 2) = 3/8

From Messages to Marginals


Once a variable has received all incoming messages, we compute its marginal as

    p(x) ∝ ∏_{f ∈ne(x)} μf→x (x)

In this example:

    P (z1 |x1 , x2 ) ∝ μf1→z1 · μf2→z1 = (1/4 × 7/16, 1/8 × 3/8) = (7/64, 3/64) ∝ (0.7, 0.3)

    P (z2 |x1 , x2 ) ∝ μf2→z2 = (1/32, 1/8) ∝ (0.2, 0.8)

One can also compute the marginal of the set of variables xs involved in a factor fs :

    p(xs ) ∝ fs (xs ) ∏_{x∈ne(fs )} μx→fs (x)
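These marginals can be verified by brute-force enumeration of the tiny joint. A Python sketch, assuming the HMM parameters stated earlier in this example (π1 = π2 = 1/2, self-transitions 1/4 and 1/2, and the two emission rows):

import numpy as np

# Verify the BP marginals on the two-step HMM by brute-force enumeration.
pi = np.array([0.5, 0.5])
A = np.array([[0.25, 0.75],   # P(z2 | z1 = 1)
              [0.50, 0.50]])  # P(z2 | z1 = 2)
E = np.array([[0.5, 0.25, 0.25],   # P(x | z = 1) over (R, G, B)
              [0.25, 0.5, 0.25]])  # P(x | z = 2)
R, G = 0, 1

# joint[z1, z2] = P(z1) P(x1=R | z1) P(z2 | z1) P(x2=G | z2)
joint = pi[:, None] * E[:, R][:, None] * A * E[:, G][None, :]
print(joint.sum(axis=1) / joint.sum())  # P(z1 | x1, x2) = (0.7, 0.3)
print(joint.sum(axis=0) / joint.sum())  # P(z2 | x1, x2) = (0.2, 0.8)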

Handling Evidence
Observing x = v:

    we can absorb it in the factor (as we did); or

    set messages μx→f (x) = 0 for all x ≠ v

Observing XE :

    multiplying the incoming messages to a variable x ∉ XE gives the joint (not p(x|XE )):

        p(x, XE ) ∝ ∏_{f ∈ne(x)} μf→x (x)

    the conditional is easily obtained by normalization:

        p(x|XE ) = p(x, XE ) / ∑_{x'} p(x', XE )

Loopy Belief Propagation

So far, we assumed a tree graph

When the factor graph contains loops, pass messages


indefinitely until convergence

But convergence may not happen

Even so, in many cases loopy BP still works well, empirically

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Example: The Ising Model

[Graph: node s with value xs and parameter θs , node t with value xt , edge (s, t) with parameter θst ]

The random variables x take values in {0, 1}:

    p_θ (x) = (1/Z) exp( ∑_{s∈V} θs xs + ∑_{(s,t)∈E} θst xs xt )

The Conditional
Markovian: the conditional distribution for xs is

    p(xs | x−s ) = p(xs | xN (s) )

where N (s) is the set of neighbors of s.

This reduces to

    p(xs = 1 | xN (s) ) = 1 / ( exp( −(θs + ∑_{t∈N (s)} θst xt ) ) + 1 )

Gibbs sampling would draw xs like this.

The Mean Field Algorithm for Ising Model

p(xs = 1 | xN (s) ) = 1 / ( exp( −(θs + ∑_{t∈N (s)} θst xt ) ) + 1 )

Instead of Gibbs sampling, let μs be the estimated marginal p(xs = 1):

    μs ← 1 / ( exp( −(θs + ∑_{t∈N (s)} θst μt ) ) + 1 )

The μs are updated iteratively

The Mean Field algorithm is coordinate ascent and is guaranteed to converge to a local optimum (more later).

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Exponential Family

Let φ(X) = (φ1 (X), . . . , φd (X))^T be d sufficient statistics, where φi : X → R

Note X is all the nodes in a graphical model

φi (X) is sometimes called a feature function

Let θ = (θ1 , . . . , θd )^T ∈ R^d be the canonical parameters.

The exponential family is a family of probability densities:

    p_θ (x) = exp( θ^T φ(x) − A(θ) )

Exponential Family



p_θ (x) = exp( θ^T φ(x) − A(θ) )

The key is the inner product between the parameters θ and the sufficient statistics φ.

A is the log partition function:

    A(θ) = log ∫ exp( θ^T φ(x) ) ν(dx)

i.e., A = log Z

Minimal vs. Overcomplete Models

The set of parameters for which the density is normalizable:

    Ω = {θ ∈ R^d | A(θ) < ∞}

A minimal exponential family is one where the φ's are linearly independent.

An overcomplete exponential family is one where the φ's are linearly dependent:

    ∃ α ∈ R^d such that α^T φ(x) = constant ∀x

Both minimal and overcomplete representations are useful.

Exponential Family Example 1: Bernoulli

p(x) = γ^x (1 − γ)^{1−x} for x ∈ {0, 1} and γ ∈ (0, 1)

Does not look like an exponential family!

Can be rewritten as

    p(x) = exp( x log γ + (1 − x) log(1 − γ) )

Now in exponential family form with φ1 (x) = x, φ2 (x) = 1 − x, θ1 = log γ, θ2 = log(1 − γ), and A(θ) = 0.

Overcomplete: α1 = α2 = 1 makes α^T φ(x) = 1 for all x

Exponential Family Example 1: Bernoulli

p(x) = exp( x log γ + (1 − x) log(1 − γ) )

Can be further rewritten as

    p(x) = exp( θx − log(1 + exp(θ)) )

A minimal exponential family with φ(x) = x, θ = log( γ / (1 − γ) ), A(θ) = log(1 + exp(θ)).

Many distributions (e.g., Gaussian, exponential, Poisson, Beta) are in the exponential family, but not all (e.g., the Laplace distribution).

Exponential Family Example 2: Ising Model

[Graph: node s with parameter θs , edge (s, t) with parameter θst , node t]

p_θ (x) = exp( ∑_{s∈V} θs xs + ∑_{(s,t)∈E} θst xs xt − A(θ) )

Binary random variables xs ∈ {0, 1}

d = |V | + |E| sufficient statistics: φ(x) = (. . . xs . . . xs xt . . .)^T

This is a regular (Ω = R^d ), minimal exponential family.

Exponential Family Example 3: Potts Model


[Graph: node s with value xs , node t with value xt ]

Similar to the Ising model, but generalizing to xs ∈ {0, . . . , r − 1}.

Indicator functions: fsj (x) = 1 if xs = j and 0 otherwise; fstjk (x) = 1 if xs = j and xt = k, and 0 otherwise.

    p_θ (x) = exp( ∑_{sj} θsj fsj (x) + ∑_{stjk} θstjk fstjk (x) − A(θ) )

d = r|V | + r^2 |E|

Regular but overcomplete, because ∑_{j=0}^{r−1} φsj (x) = 1 for any s ∈ V and all x.

The Potts model is a special case where the parameters are tied: θstkk takes one shared value, and θstjk for j ≠ k takes another.

Important Relation

For sufficient statistics defined by indicator functions,

    e.g., φsj (x) = fsj (x) = 1 if xs = j and 0 otherwise,

the marginal can be obtained via the mean:

    Eθ [φsj (x)] = P (xs = j)

Since inference is about computing marginals, in this case it is equivalent to computing means.

Mean Parameters
Let p be any density (not necessarily in the exponential family).

Given sufficient statistics φ, the mean parameters μ = (μ1 , . . . , μd )^T are

    μi = Ep [φi (x)] = ∫ φi (x) p(x) dx

The set of mean parameters:

    M = {μ ∈ R^d | ∃ p s.t. Ep [φ(x)] = μ}

If μ(1) , μ(2) ∈ M, there must exist corresponding p(1) , p(2)

Convex combinations of p(1) , p(2) lead to other mean parameters in M

Therefore M is convex

Example: The First Two Moments

Let φ1 (x) = x, φ2 (x) = x^2

For any p (not necessarily Gaussian) on x, the mean parameters are μ = (μ1 , μ2 )^T = (E(x), E(x^2 ))^T

Note V(x) = E(x^2 ) − E^2 (x) = μ2 − μ1^2 ≥ 0 for any p

So M is not R^2 but rather the subset μ1 ∈ R, μ2 ≥ μ1^2

The Marginal Polytope

The marginal polytope is defined for discrete random variables

Recall M = {μ ∈ R^d | μ = ∑_x φ(x)p(x) for some p}

p can be a point mass function on a particular x

In fact any p is a convex combination of such point mass functions

M = conv{φ(x), ∀x} is a convex hull, called the marginal polytope

Marginal Polytope Example


Tiny Ising model: two nodes x1 , x2 ∈ {0, 1} connected by an edge

Minimal sufficient statistics: φ(x1 , x2 ) = (x1 , x2 , x1 x2 )^T

Only 4 different x = (x1 , x2 )

The marginal polytope is

    M = conv{(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)}

The convex hull is a polytope inside the unit cube

The three coordinates are the node marginals μ1 = P (x1 = 1), μ2 = P (x2 = 1) and the edge marginal μ12 = P (x1 = x2 = 1), hence the name

The Log Partition Function A

For any regular exponential family, A(θ) is convex in θ

Strictly convex for a minimal exponential family

Nice property:

    ∂A(θ) / ∂θi = Eθ [φi (x)]

Therefore ∇A(θ) = μ, the mean parameters of p_θ

Conjugate Duality
The conjugate dual function A* to A is defined as

    A* (μ) = sup_{θ∈Ω} θ^T μ − A(θ)

Such a definition, where a quantity is expressed as the solution to an optimization problem, is called a variational definition.

For any μ in M's interior, let θ(μ) satisfy

    Eθ(μ) [φ(x)] = ∇A(θ(μ)) = μ

Then A* (μ) = −H(p_{θ(μ)} ), the negative entropy.

The dual of the dual gives back A:

    A(θ) = sup_{μ∈M} θ^T μ − A* (μ)

For all θ ∈ Ω, the supremum is attained uniquely at the μ ∈ M0 given by the moment matching condition μ = Eθ [φ(x)].

Example: Conjugate Dual for Bernoulli

Recall the minimal exponential family for Bernoulli with φ(x) = x, A(θ) = log(1 + exp(θ)), Ω = R.

By definition

    A* (μ) = sup_{θ∈R} θμ − log(1 + exp(θ))

Taking the derivative and solving gives

    A* (μ) = μ log μ + (1 − μ) log(1 − μ)

i.e., the negative entropy.

Inference with Variational Representation

A(θ) = sup_{μ∈M} θ^T μ − A* (μ) is attained by μ = Eθ [φ(x)].

Want to compute the marginals P (xs = j)? They are the mean parameters μsj = Eθ [φsj (x)] under the standard overcomplete representation.

Want to compute the mean parameters μsj ? They are the arg sup of the optimization problem above.

This variational representation is exact, not approximate (we will relax it next to derive loopy BP and mean field).

The Difficulties with Variational Representation

A(θ) = sup_{μ∈M} θ^T μ − A* (μ)

Difficult to solve even though it is a convex problem. Two issues:

    Although the marginal polytope M is convex, it can be quite complex (exponential number of vertices)

    The dual function A* (μ) usually does not admit an explicit form

Variational approximation modifies the optimization problem so that it is tractable, at the price of an approximate solution.

Next, we cast the mean field and sum-product algorithms as variational approximations.

The Mean Field Method as Variational Approximation

The mean field method replaces M with a simpler subset M(F ) on which A* (μ) has a closed form.

Consider the fully disconnected subgraph F = (V, ∅) of the original graph G = (V, E)

Set θi = 0 whenever φi involves edges

The densities in this sub-family are all fully factorized:

    p_θ (x) = ∏_{s∈V} p(xs ; θs )

The Geometry of M(F )


Let M(F ) be the set of mean parameters of the fully factorized sub-family. In general, M(F ) ⊆ M

Recall M is the convex hull of the extreme points {φ(x)}.

It turns out the extreme points {φ(x)} are in M(F ).

Example:

    The tiny Ising model x1 , x2 ∈ {0, 1} with φ = (x1 , x2 , x1 x2 )^T

    The point mass distribution p(x = (0, 1)^T ) = 1 is realized as the limit of the series p(x) = exp(θ1 x1 + θ2 x2 − A(θ)) where θ1 → −∞ and θ2 → ∞

    This series is in F because θ12 = 0

    Hence the extreme point φ(x) = (0, 1, 0) is in M(F )

The Geometry of M(F )

Because the extreme points of M are in M(F ), if M(F ) were convex, we would have M = M(F ).

But in general M(F ) is a true subset of M

Therefore, M(F ) is a nonconvex inner set of M

[Figure: the polytope M with extreme points φ(x), containing the nonconvex set M(F )]

The Mean Field Method as Variational Approximation


Recall the exact variational problem

    A(θ) = sup_{μ∈M} θ^T μ − A* (μ)

attained by the solution to the inference problem, μ = Eθ [φ(x)].

The mean field method simply replaces M with M(F ):

    L(θ) = sup_{μ∈M(F )} θ^T μ − A* (μ)

Obviously L(θ) ≤ A(θ).

The original solution μ may not be in M(F )

Even if μ ∈ M(F ), we may hit a local maximum and not find it

Why bother? Because A* (μ) = −H(p_μ ) has a very simple form for μ ∈ M(F )

Example: Mean Field for Ising Model


The mean parameters for the Ising model are the node and edge marginals: μs = p(xs = 1), μst = p(xs = 1, xt = 1)

A fully factorized member of M(F ) has no edges: μst = μs μt

For μ ∈ M(F ), the dual function A* (μ) has the simple form

    A* (μ) = −∑_{s∈V} H(μs ) = ∑_{s∈V} ( μs log μs + (1 − μs ) log(1 − μs ) )

Thus the mean field problem is

    L(θ) = sup_{μ∈M(F )} θ^T μ − ∑_{s∈V} ( μs log μs + (1 − μs ) log(1 − μs ) )

         = max_{(μ1 ...μm )∈[0,1]^m} ∑_{s∈V} θs μs + ∑_{(s,t)∈E} θst μs μt + ∑_{s∈V} H(μs )

Example: Mean Field for Ising Model

L(θ) = max_{(μ1 ...μm )∈[0,1]^m} ∑_{s∈V} θs μs + ∑_{(s,t)∈E} θst μs μt + ∑_{s∈V} H(μs )

Bilinear in μ, not jointly concave

But concave in a single dimension μs , fixing the others

Iterative coordinate-wise maximization: fix μt for t ≠ s and optimize μs

Setting the partial derivative w.r.t. μs to 0 yields

    μs = 1 / ( 1 + exp( −(θs + ∑_{(s,t)∈E} θst μt ) ) )

as we've seen before.

Caution: mean field converges to a local maximum depending on the initialization of μ1 . . . μm .

The Sum-Product Algorithm as Variational Approximation

A(θ) = sup_{μ∈M} θ^T μ − A* (μ)

The sum-product algorithm makes two approximations:

    it relaxes M to an outer set L

    it replaces the dual A* with an approximation

This yields

    Ã(θ) = sup_{τ∈L} θ^T τ − Ã* (τ)

The Outer Relaxation


For overcomplete exponential families on discrete nodes, the mean parameters are node and edge marginals: μsj = p(xs = j), μstjk = p(xs = j, xt = k).

The marginal polytope is M = {μ | ∃ p with marginals μ}.

Now consider τ ∈ R^d_+ satisfying node normalization and edge–node marginal consistency conditions:

    ∑_{j=0}^{r−1} τsj = 1                ∀s ∈ V

    ∑_{k=0}^{r−1} τstjk = τsj            ∀s, t ∈ V, j = 0 . . . r − 1

    ∑_{j=0}^{r−1} τstjk = τtk            ∀s, t ∈ V, k = 0 . . . r − 1

Define L = {τ satisfying the above conditions}.

The Outer Relaxation


If the graph is a tree then M = L

If the graph has cycles then M ⊊ L

    L is too lax: it misses other constraints that true marginals need to satisfy

Nice property: L is still a polytope, but a much simpler one than M.

[Figure: the polytope M with vertices φ(x), contained in the larger polytope L]

The first approximation in sum-product is to replace M with L.

Approximating A

Recall μ are the node and edge marginals

If the graph is a tree, one can exactly reconstruct the joint probability from them:

    p_μ (x) = ∏_{s∈V} μsxs ∏_{(s,t)∈E} ( μstxs xt / (μsxs μtxt ) )

If the graph is a tree, the entropy of the joint distribution is

    H(p_μ ) = −∑_{s∈V} ∑_{j=0}^{r−1} μsj log μsj − ∑_{(s,t)∈E} ∑_{j,k} μstjk log( μstjk / (μsj μtk ) )

Neither holds for graphs with cycles.

Approximating A

Define the Bethe entropy for τ ∈ L on loopy graphs in the same way:

    HBethe (p_τ ) = −∑_{s∈V} ∑_{j=0}^{r−1} τsj log τsj − ∑_{(s,t)∈E} ∑_{j,k} τstjk log( τstjk / (τsj τtk ) )

Note HBethe is not a true entropy. The second approximation in sum-product is to replace −A* (τ) with HBethe (p_τ ).

The Sum-Product Algorithm as Variational Approximation

With these two approximations, we arrive at the variational problem

    Ã(θ) = sup_{τ∈L} θ^T τ + HBethe (p_τ )

Optimality conditions require the gradients to vanish w.r.t. both τ and the Lagrange multipliers on the constraints τ ∈ L.

The sum-product algorithm can be derived as an iterative fixed-point procedure for satisfying these optimality conditions.

At the solution, Ã(θ) is not guaranteed to be either an upper or a lower bound of A(θ)

τ may not correspond to a true marginal distribution

Summary: Variational Inference

The sum-product algorithm (loopy belief propagation)

The mean field method

Not covered: Expectation Propagation

Efficient computation. But often unknown bias in solution.

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Maximizing Problems
Recall the HMM example

[HMM as before: π1 = π2 = 1/2, emissions P(x | z=1) = (1/2, 1/4, 1/4) and P(x | z=2) = (1/4, 1/2, 1/4) over R, G, B]

There are two senses of "best" states z1:N given x1:N :

1. So far we computed the marginal p(zn |x1:N )

    We can define "best" as zn* = arg max_k p(zn = k|x1:N )

    However z*1:N as a whole may not be the best

    In fact z*1:N can even have zero probability!

2. An alternative is to find

    z*1:N = arg max_{z1:N} p(z1:N |x1:N )

    This finds the most likely state configuration as a whole

    The max-sum algorithm solves this

    It generalizes the Viterbi algorithm for HMMs

Intermediate: The Max-Product Algorithm

Simple modification to the sum-product algorithm: replace ∑ with max in the factor-to-variable messages:

    μfs→x (x) = max_{x1} . . . max_{xM} fs (x, x1 , . . . , xM ) ∏_{m=1}^{M} μxm→fs (xm )

    μxm→fs (xm ) = ∏_{f ∈ne(xm )\fs} μf→xm (xm )

with leaf messages

    leaf factor f : μf→x (x) = f (x)
    leaf variable x: μx→f (x) = 1

Intermediate: The Max-Product Algorithm

As in sum-product, pick an arbitrary variable node x as the root

Pass messages up from the leaves until they reach the root

Unlike sum-product, do not pass messages back from the root to the leaves

At the root, multiply the incoming messages:

    p_max = max_x ∏_{f ∈ne(x)} μf→x (x)

This is the probability of the most likely state configuration

Intermediate: The Max-Product Algorithm

To identify the configuration itself, keep back pointers:

When creating the message

    μfs→x (x) = max_{x1} . . . max_{xM} fs (x, x1 , . . . , xM ) ∏_{m=1}^{M} μxm→fs (xm )

for each value of x, we separately create M pointers back to the values of x1 , . . . , xM that achieve the maximum.

At the root, backtrack the pointers.

Intermediate: The Max-Product Algorithm

Factor graph: f1 —— z1 —— f2 —— z2 , with f1 = P (z1 )P (x1 | z1 ), f2 = P (z2 | z1 )P (x2 | z2 )

Message from leaf f1 :

    μf1→z1 (z1 = 1) = P (z1 = 1)P (R|z1 = 1) = 1/2 × 1/2 = 1/4
    μf1→z1 (z1 = 2) = P (z1 = 2)P (R|z1 = 2) = 1/2 × 1/4 = 1/8

The second message:

    μz1→f2 (z1 = 1) = 1/4
    μz1→f2 (z1 = 2) = 1/8

Intermediate: The Max-Product Algorithm

    μf2→z2 (z2 = 1)
    = max_{z1} f2 (z1 , z2 ) μz1→f2 (z1 )
    = max_{z1} P (z2 = 1 | z1 )P (x2 = G | z2 = 1) μz1→f2 (z1 )
    = max(1/4 × 1/4 × 1/4, 1/2 × 1/4 × 1/8) = 1/64

Back pointer for z2 = 1: either z1 = 1 or z1 = 2

Intermediate: The Max-Product Algorithm


The other element of the same message:

    μf2→z2 (z2 = 2)
    = max_{z1} f2 (z1 , z2 ) μz1→f2 (z1 )
    = max_{z1} P (z2 = 2 | z1 )P (x2 = G | z2 = 2) μz1→f2 (z1 )
    = max(3/4 × 1/2 × 1/4, 1/2 × 1/2 × 1/8) = 3/32

Back pointer for z2 = 2: z1 = 1

Intermediate: The Max-Product Algorithm


So μf2→z2 = (1/64 with back pointer z1 = 1 or 2, 3/32 with back pointer z1 = 1)

At the root z2 :

    max_{s=1,2} μf2→z2 (s) = 3/32, attained at z2 = 2, which backtracks to z1 = 1

    z*1:2 = arg max_{z1:2} p(z1:2 |x1:2 ) = (1, 2)

In this example, sum-product and max-product produce the same best sequence; in general they differ.

From Max-Product to Max-Sum


The max-sum algorithm is equivalent to the max-product algorithm, but works in log space to avoid underflow:

    μfs→x (x) = max_{x1 ...xM} [ log fs (x, x1 , . . . , xM ) + ∑_{m=1}^{M} μxm→fs (xm ) ]

    μxm→fs (xm ) = ∑_{f ∈ne(xm )\fs} μf→xm (xm )

    leaf factor f : μf→x (x) = log f (x)
    leaf variable x: μx→f (x) = 0

When at the root,

    log p_max = max_x ∑_{f ∈ne(x)} μf→x (x)

The back pointers are the same.
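A minimal Python sketch of max-sum (Viterbi) on the two-observation example above, working in log space; under the same assumed HMM parameters it recovers z* = (1, 2) with probability 3/32:

import numpy as np

# Max-sum (Viterbi) on the two-step HMM from the slides, in log space.
pi = np.array([0.5, 0.5])
A = np.array([[0.25, 0.75],   # P(z2 | z1 = 1)
              [0.50, 0.50]])  # P(z2 | z1 = 2)
E = np.array([[0.5, 0.25, 0.25],   # P(x | z = 1) over (R, G, B)
              [0.25, 0.5, 0.25]])  # P(x | z = 2)
obs = [0, 1]  # x1 = R, x2 = G

log_msg = np.log(pi) + np.log(E[:, obs[0]])          # mu_{f1 -> z1} in log space
back = []
for x in obs[1:]:
    scores = log_msg[:, None] + np.log(A) + np.log(E[:, x])[None, :]
    back.append(scores.argmax(axis=0))               # back pointer for each value of the next z
    log_msg = scores.max(axis=0)                      # mu_{f2 -> z2}

# Backtrack from the root.
z = [int(log_msg.argmax())]
for bp in reversed(back):
    z.insert(0, int(bp[z[0]]))
print([s + 1 for s in z], np.exp(log_msg.max()))      # [1, 2], 3/32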

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Parameter Learning

Assume the graph structure is given

Learning in the exponential family: estimate θ from iid data x1 . . . xn

Principle: maximum likelihood

Distinguish two cases:

    fully observed data: all dimensions of x are observed

    partially observed data: some dimensions of x are unobserved

Fully Observed Data



p_θ (x) = exp( θ^T φ(x) − A(θ) )

Given iid data x1 . . . xn , the log likelihood is

    ℓ(θ) = (1/n) ∑_{i=1}^{n} log p_θ (xi ) = θ^T ( (1/n) ∑_{i=1}^{n} φ(xi ) ) − A(θ) = θ^T μ̂ − A(θ)

where μ̂ = (1/n) ∑_{i=1}^{n} φ(xi ) is the mean parameter of the empirical distribution on x1 . . . xn . Clearly μ̂ ∈ M.

Maximum likelihood: θML = arg sup_θ θ^T μ̂ − A(θ)

The solution is θML = θ(μ̂), the exponential family density whose mean parameter matches μ̂.

When μ̂ ∈ M0 and the family is minimal, there is a unique maximum likelihood solution θML .

Partially Observed Data

Each item is (x, z) where x is observed and z unobserved

Full data (x1 , z1 ) . . . (xn , zn ), but we only observe x1 . . . xn

The incomplete likelihood is ℓ(θ) = (1/n) ∑_{i=1}^{n} log p_θ (xi ) where p_θ (xi ) = ∫ p_θ (xi , z) dz

It can be written as ℓ(θ) = (1/n) ∑_{i=1}^{n} Axi (θ) − A(θ)

with a new log partition function of p_θ (z | xi ), one per item:

    Axi (θ) = log ∫ exp( θ^T φ(xi , z') ) dz'

Expectation-Maximization (EM) algorithm: lower-bound Axi

EM as Variational Lower Bound


Mean parameters realizable by any distribution on z while holding xi fixed:

    Mxi = {μ ∈ R^d | μ = Ep [φ(xi , z)] for some p}

The variational definition: Axi (θ) = sup_{μ∈Mxi} θ^T μ − A*xi (μ)

Trivial variational lower bound:

    Axi (θ) ≥ θ^T μi − A*xi (μi ),   ∀μi ∈ Mxi

Lower bound L on the incomplete log likelihood:

    ℓ(θ) = (1/n) ∑_{i=1}^{n} Axi (θ) − A(θ)
         ≥ (1/n) ∑_{i=1}^{n} ( θ^T μi − A*xi (μi ) ) − A(θ)
         ≡ L(μ1 , . . . , μn , θ)

Exact EM: The E-Step

The EM algorithm is coordinate ascent on L(μ1 , . . . , μn , θ).

In the E-step, maximize over each μi :

    μi ← arg max_{μi ∈Mxi} L(μ1 , . . . , μn , θ)

Equivalently, arg max_{μi ∈Mxi} θ^T μi − A*xi (μi )

This is the variational representation of the mean parameter

    μi (θ) = Eθ [φ(xi , z)]

The E-step is named after this expectation Eθ [·] under the current parameters θ

Exact EM: The M-Step

In the M-step, maximize over θ holding the μ's fixed:

    θ ← arg max_θ L(μ1 , . . . , μn , θ) = arg max_θ θ^T μ̄ − A(θ),   where μ̄ = (1/n) ∑_{i=1}^{n} μi

The solution θ(μ̄) satisfies Eθ(μ̄) [φ(x)] = μ̄

This is the standard fully observed maximum likelihood problem, hence the name M-step

Variational EM
For loopy graphs the E-step is often intractable:

    Can't maximize max_{μi ∈Mxi} θ^T μi − A*xi (μi )

    Improving, but not necessarily maximizing, gives a generalized EM algorithm

The mean field method maximizes

    max_{μi ∈Mxi (F )} θ^T μi − A*xi (μi )

    up to a local maximum

    recall Mxi (F ) is an inner approximation to Mxi

Mean field E-step leads to generalized EM

The sum-product algorithm does not lead to generalized EM

Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family

Maximizing Problems
Parameter Learning
Structure Learning

Score-Based Structure Learning

Let M̄ be the set of all allowed candidate features

Let M ⊆ M̄ be a log-linear model structure:

    P (X | M, θ) = (1/Z) exp( ∑_{i∈M} θi fi (X) )

A score for the model M can be max_θ ln P (Data | M, θ)

The score is always better for a larger M, so it needs regularization

M and θ are treated separately

Structure Learning for Gaussian Random Fields

Consider a p-dimensional multivariate Gaussian N (μ, Σ)

The graphical model has p nodes x1 , . . . , xp

The edge between xi , xj is absent if and only if Ωij = 0, where Ω = Σ^{−1}

Equivalently, xi , xj are conditionally independent given the other variables

[Example graph on x1 , x2 , x3 , x4 ]

Structure Learning for Gaussian Random Fields

Let the data be X(1) , . . . , X(n) ∼ N (μ, Σ)

The log likelihood is

    −(n/2) log |Σ| − (1/2) ∑_{i=1}^{n} (X(i) − μ)^T Σ^{−1} (X(i) − μ)

The maximum likelihood estimate of Σ is the sample covariance

    S = (1/n) ∑_i (X(i) − X̄)(X(i) − X̄)^T

where X̄ is the sample mean

S^{−1} is not a good estimate of Ω when n is small

Structure Learning for Gaussian Random Fields

For centered data, minimize a regularized problem instead:

    −log |Ω| + (1/n) ∑_{i=1}^{n} X(i)^T Ω X(i) + λ ∑_{i≠j} |Ωij |

Known as the graphical lasso (glasso)
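As an illustration (not from the slides), scikit-learn provides a graphical lasso estimator; a minimal sketch on synthetic data from a known sparse precision matrix:

import numpy as np
from sklearn.covariance import GraphicalLasso

# Estimate a sparse precision matrix (graph structure) with the graphical lasso.
rng = np.random.default_rng(0)
Omega_true = np.array([[ 2.0, -1.0,  0.0],
                       [-1.0,  2.5, -1.0],
                       [ 0.0, -1.0,  2.0]])      # chain x1 - x2 - x3
X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(Omega_true), size=500)

model = GraphicalLasso(alpha=0.1).fit(X)          # alpha plays the role of lambda
Omega_hat = model.precision_
edges = [(i, j) for i in range(3) for j in range(i + 1, 3)
         if abs(Omega_hat[i, j]) > 1e-3]
print(edges)  # ideally recovers [(0, 1), (1, 2)]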

Recap

Given GM = joint distribution p(x1 , . . . , xn )

    BN or MRF

    conditional independence

Do inference = p(XQ | XE ), in general XQ ∪ XE ⊆ {x1 . . . xn }

    exact, MCMC, variational

If p(x1 , . . . , xn ) not given, estimate it from data

    parameter and structure learning

Much on-going research!
