
Notes on LDA with Gibbs Sampling

Notations

We list related notations as follows.

α and β are hyper-parameters with specified values.

M is the number of documents. K is the number of topics. V is the number of distinct words.

Nm is the length of the m-th document.

wm,n (1 ≤ m ≤ M; 1 ≤ n ≤ Nm) is the n-th word in the m-th document, and words contained in documents are observed variables.

zm,n (1 ≤ m ≤ M; 1 ≤ n ≤ Nm) is the topic assigned to the n-th word in the m-th document, which is a latent (or hidden) variable.

θm (1 ≤ m ≤ M) is the topic proportion of the m-th document, which is a latent (or hidden) K-dimensional vector.

ϕk (1 ≤ k ≤ K) is the word proportion of the k-th topic, which is a latent (or hidden) V-dimensional vector.

Variable θm and Θ

θm is sampled from a Dirichlet distribution with hyper-parameter α.

θm ∼ Dir(α); 1 ≤ m ≤ M

θm,k is the k-th element of θm, which corresponds to the proportion of the topic k in the m-th document.

$$\sum_{k=1}^{K} \theta_{m,k} = 1$$
We can represent θ1 ⋯ θm ⋯ θM as an M × K matrix Θ.

$$\Theta = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_m \\ \vdots \\ \theta_M \end{bmatrix} = \begin{bmatrix} \theta_{1,1} & \cdots & \theta_{1,k} & \cdots & \theta_{1,K} \\ \vdots & & \vdots & & \vdots \\ \theta_{m,1} & \cdots & \theta_{m,k} & \cdots & \theta_{m,K} \\ \vdots & & \vdots & & \vdots \\ \theta_{M,1} & \cdots & \theta_{M,k} & \cdots & \theta_{M,K} \end{bmatrix}$$

Variable ϕk and Φ

ϕk is sampled from a Dirichlet distribution with hyper-parameter β.

ϕk ∼ Dir(β); 1 ≤ k ≤ K

ϕk,v is the v-th element of ϕk, which corresponds to the proportion of the word v in the k-th topic. Each v is an element of the dictionary with V distinct words.

$$\sum_{v=1}^{V} \phi_{k,v} = 1$$

We can represent ϕ1 ⋯ ϕk ⋯ ϕK as a K × V matrix Φ.

$$\Phi = \begin{bmatrix} \phi_1 \\ \vdots \\ \phi_k \\ \vdots \\ \phi_K \end{bmatrix} = \begin{bmatrix} \phi_{1,1} & \cdots & \phi_{1,v} & \cdots & \phi_{1,V} \\ \vdots & & \vdots & & \vdots \\ \phi_{k,1} & \cdots & \phi_{k,v} & \cdots & \phi_{k,V} \\ \vdots & & \vdots & & \vdots \\ \phi_{K,1} & \cdots & \phi_{K,v} & \cdots & \phi_{K,V} \end{bmatrix}$$

Words W

The m-th document Wm can be represented by the words it contains. wm,n is the n-th word in the m-th document, and Nm is the number of words in the m-th document. wm,n in different positions can be instances of the same word v in the dictionary.

Wm = ( wm,1 … wm,n … wm,Nm )

The whole document set can be represented as W .


$$W = \begin{pmatrix} W_1 \\ \vdots \\ W_m \\ \vdots \\ W_M \end{pmatrix} = \begin{pmatrix} w_{1,1} & \cdots & w_{1,n} & \cdots & w_{1,N_1} \\ \vdots & & \vdots & & \vdots \\ w_{m,1} & \cdots & w_{m,n} & \cdots & w_{m,N_m} \\ \vdots & & \vdots & & \vdots \\ w_{M,1} & \cdots & w_{M,n} & \cdots & w_{M,N_M} \end{pmatrix}$$

Please be aware that W is not a matrix, because each of its rows, which corresponds to the m-th document, may have a different number of words, i.e. a different length Nm. For example, N1, Nm and NM may all be different.

Latent Variables Z for Topics assigned to Words

The latent topic variables assigned to words in the m-th document can be represented as Zm .
zm,n is the topic assigned to the n-th word in the m-th document, and Nm is the number of
words in the m-th document.

Zm = ( zm,1 … zm,n … zm,Nm )

The latent topic variables assigned to words in all documents can be represented as Z .

$$Z = \begin{pmatrix} Z_1 \\ \vdots \\ Z_m \\ \vdots \\ Z_M \end{pmatrix} = \begin{pmatrix} z_{1,1} & \cdots & z_{1,n} & \cdots & z_{1,N_1} \\ \vdots & & \vdots & & \vdots \\ z_{m,1} & \cdots & z_{m,n} & \cdots & z_{m,N_m} \\ \vdots & & \vdots & & \vdots \\ z_{M,1} & \cdots & z_{M,n} & \cdots & z_{M,N_M} \end{pmatrix}$$

Please be aware that Z is not a matrix, because each of its rows, which corresponds to the m-th document, may have a different number of words, i.e. a different length Nm. For example, N1, Nm and NM may all be different.

Graphical Models

Based on the generative process of LDA, we can represent the LDA model using plate notation.
For easier understanding, the corresponding expanded model is also shown here.
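To make the generative process concrete, here is a minimal Python sketch of it. The corpus sizes, document lengths, and hyper-parameter values below are arbitrary toy assumptions, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

M, K, V = 5, 3, 20                  # documents, topics, vocabulary size (toy values)
alpha, beta = 0.5, 0.1              # symmetric Dirichlet hyper-parameters (assumed)
N = rng.integers(10, 20, size=M)    # N_m: length of each document

Phi = rng.dirichlet([beta] * V, size=K)      # phi_k ~ Dir(beta); rows of the K x V matrix Phi
Theta = rng.dirichlet([alpha] * K, size=M)   # theta_m ~ Dir(alpha); rows of the M x K matrix Theta

W, Z = [], []                        # ragged: document m has N_m entries
for m in range(M):
    z_m = rng.choice(K, size=N[m], p=Theta[m])               # z_{m,n} ~ Categorical(theta_m)
    w_m = np.array([rng.choice(V, p=Phi[k]) for k in z_m])   # w_{m,n} ~ Categorical(phi_{z_{m,n}})
    Z.append(z_m)
    W.append(w_m)
```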
Joint Distribution

Based on the Bayesian network structure, we can present the joint distribution of LDA as follows.

$$\begin{aligned}
&p(\theta_1 \cdots \theta_m \cdots \theta_M,\; \phi_1 \cdots \phi_k \cdots \phi_K,\; w_{1,1} \cdots w_{m,n} \cdots w_{M,N_M},\; z_{1,1} \cdots z_{m,n} \cdots z_{M,N_M};\; \alpha, \beta) \\
&= p(\theta_1; \alpha) \cdots p(\theta_m; \alpha) \cdots p(\theta_M; \alpha) \\
&\quad \times p(\phi_1; \beta) \cdots p(\phi_k; \beta) \cdots p(\phi_K; \beta) \\
&\quad \times p(w_{1,1} \mid z_{1,1}, \Phi) \cdots p(w_{1,n} \mid z_{1,n}, \Phi) \cdots p(w_{1,N_1} \mid z_{1,N_1}, \Phi) \\
&\qquad \cdots p(w_{m,1} \mid z_{m,1}, \Phi) \cdots p(w_{m,n} \mid z_{m,n}, \Phi) \cdots p(w_{m,N_m} \mid z_{m,N_m}, \Phi) \\
&\qquad \cdots p(w_{M,1} \mid z_{M,1}, \Phi) \cdots p(w_{M,n} \mid z_{M,n}, \Phi) \cdots p(w_{M,N_M} \mid z_{M,N_M}, \Phi) \\
&\quad \times p(z_{1,1} \mid \theta_1) \cdots p(z_{1,n} \mid \theta_1) \cdots p(z_{1,N_1} \mid \theta_1) \\
&\qquad \cdots p(z_{m,1} \mid \theta_m) \cdots p(z_{m,n} \mid \theta_m) \cdots p(z_{m,N_m} \mid \theta_m) \\
&\qquad \cdots p(z_{M,1} \mid \theta_M) \cdots p(z_{M,n} \mid \theta_M) \cdots p(z_{M,N_M} \mid \theta_M)
\end{aligned}$$

The compact representation is

p(Θ, Φ, W , Z; α, β) = p(Θ; α)p(Φ; β)p(Z∣Θ)p(W ∣Z, Φ)


The components are defined as follows.

Calculation of p(Θ; α)

$$p(\Theta; \alpha) = p(\theta_1; \alpha) \cdots p(\theta_m; \alpha) \cdots p(\theta_M; \alpha) = \prod_{m=1}^{M} p(\theta_m; \alpha)$$

Because

$$p(\theta_m; \alpha) = \frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^{K}}\, \theta_{m,1}^{\alpha-1} \cdots \theta_{m,k}^{\alpha-1} \cdots \theta_{m,K}^{\alpha-1} = \frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^{K}} \prod_{k=1}^{K} \theta_{m,k}^{\alpha-1}$$

We get

$$p(\Theta; \alpha) = \prod_{m=1}^{M} \frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^{K}} \prod_{k=1}^{K} \theta_{m,k}^{\alpha-1}$$
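As a quick numerical sanity check (not part of the original notes), the closed-form Dirichlet density above can be compared against scipy; K and α below are assumed toy values.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import dirichlet

K, alpha = 4, 0.5                      # assumed toy values
rng = np.random.default_rng(1)
theta_m = rng.dirichlet([alpha] * K)   # a point on the K-simplex

# log p(theta_m; alpha) from the formula above
log_p = gammaln(K * alpha) - K * gammaln(alpha) + (alpha - 1) * np.log(theta_m).sum()

assert np.isclose(log_p, dirichlet.logpdf(theta_m, [alpha] * K))
```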

Calculation of p(Φ; β)

$$p(\Phi; \beta) = p(\phi_1; \beta) \cdots p(\phi_k; \beta) \cdots p(\phi_K; \beta) = \prod_{k=1}^{K} p(\phi_k; \beta)$$

Because

$$p(\phi_k; \beta) = \frac{\Gamma(V\beta)}{(\Gamma(\beta))^{V}}\, \phi_{k,1}^{\beta-1} \cdots \phi_{k,v}^{\beta-1} \cdots \phi_{k,V}^{\beta-1} = \frac{\Gamma(V\beta)}{(\Gamma(\beta))^{V}} \prod_{v=1}^{V} \phi_{k,v}^{\beta-1}$$

We get

$$p(\Phi; \beta) = \prod_{k=1}^{K} p(\phi_k; \beta) = \prod_{k=1}^{K} \frac{\Gamma(V\beta)}{(\Gamma(\beta))^{V}} \prod_{v=1}^{V} \phi_{k,v}^{\beta-1}$$

Calculation of p(Z ∣ Θ)

$$\begin{aligned}
p(Z \mid \Theta) &= p(z_{1,1} \mid \theta_1) \cdots p(z_{1,n} \mid \theta_1) \cdots p(z_{1,N_1} \mid \theta_1) \\
&\quad \cdots p(z_{m,1} \mid \theta_m) \cdots p(z_{m,n} \mid \theta_m) \cdots p(z_{m,N_m} \mid \theta_m) \\
&\quad \cdots p(z_{M,1} \mid \theta_M) \cdots p(z_{M,n} \mid \theta_M) \cdots p(z_{M,N_M} \mid \theta_M) \\
&= \prod_{m=1}^{M} \prod_{n=1}^{N_m} p(z_{m,n} \mid \theta_m)
\end{aligned}$$

θm is sampled from a Dirichlet distribution, so $\sum_{k=1}^{K} \theta_{m,k} = 1$. And p(zm,n ∣ θm) is a categorical distribution.

p(zm,n = k∣θm ) = θm,k

In the m-th document, we can set the total number of words assigned to the topic k as im,k, with $\sum_{k=1}^{K} i_{m,k} = N_m$. So we can store the values of im,k in an M × K matrix I.

$$p(Z_m \mid \theta_m) = \prod_{n=1}^{N_m} p(z_{m,n} \mid \theta_m) = \prod_{k=1}^{K} \theta_{m,k}^{i_{m,k}}$$

We get

$$\begin{aligned}
p(Z \mid \Theta) &= p(z_{1,1} \mid \theta_1) \cdots p(z_{1,n} \mid \theta_1) \cdots p(z_{1,N_1} \mid \theta_1) \\
&\quad \cdots p(z_{m,1} \mid \theta_m) \cdots p(z_{m,n} \mid \theta_m) \cdots p(z_{m,N_m} \mid \theta_m) \\
&\quad \cdots p(z_{M,1} \mid \theta_M) \cdots p(z_{M,n} \mid \theta_M) \cdots p(z_{M,N_M} \mid \theta_M) \\
&= \prod_{m=1}^{M} p(Z_m \mid \theta_m) = \prod_{m=1}^{M} \prod_{n=1}^{N_m} p(z_{m,n} \mid \theta_m) = \prod_{m=1}^{M} \prod_{k=1}^{K} \theta_{m,k}^{i_{m,k}}
\end{aligned}$$

Calculation of p(W ∣ Z, Φ)

If zm,n = k , then ϕzm,n is ϕk , and p(wm,n ∣zm,n , ϕzm,n ) = p(wm,n ∣ϕk ).

ϕk is sampled from a Dirichlet distribution, so $\sum_{v=1}^{V} \phi_{k,v} = 1$. p(wm,n ∣ ϕk) is a categorical distribution.

p(wm,n = v∣ϕk ) = ϕk,v


We can treat all M documents as one single large document of length $L = \sum_{m=1}^{M} N_m$, where l is the position of a word wl in this large document. We can set the total number of occurrences of word v assigned to the topic k as jk,v, with $\sum_{k=1}^{K} \sum_{v=1}^{V} j_{k,v} = L$. So we can store the values of jk,v in a K × V matrix J.

$$\begin{aligned}
p(W \mid Z, \Phi) &= p(w_{1,1} \mid z_{1,1}, \Phi) \cdots p(w_{1,n} \mid z_{1,n}, \Phi) \cdots p(w_{1,N_1} \mid z_{1,N_1}, \Phi) \\
&\quad \cdots p(w_{m,1} \mid z_{m,1}, \Phi) \cdots p(w_{m,n} \mid z_{m,n}, \Phi) \cdots p(w_{m,N_m} \mid z_{m,N_m}, \Phi) \\
&\quad \cdots p(w_{M,1} \mid z_{M,1}, \Phi) \cdots p(w_{M,n} \mid z_{M,n}, \Phi) \cdots p(w_{M,N_M} \mid z_{M,N_M}, \Phi) \\
&= \prod_{m=1}^{M} \prod_{n=1}^{N_m} p(w_{m,n} \mid z_{m,n}, \Phi) = \prod_{m=1}^{M} \prod_{n=1}^{N_m} p(w_{m,n} \mid z_{m,n}, \phi_{z_{m,n}}) \\
&= \prod_{l=1}^{L} \phi_{(z_l),(w_l)} = \prod_{k=1}^{K} \prod_{v=1}^{V} \phi_{k,v}^{j_{k,v}}
\end{aligned}$$
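The count matrices I (M × K, introduced with p(Z ∣ Θ)) and J (K × V, introduced above) can be tabulated directly from Z and W. A minimal sketch, reusing the toy W, Z, M, K, V from the generative-process snippet earlier:

```python
import numpy as np

# I[m, k] = i_{m,k}: number of words in document m assigned to topic k
# J[k, v] = j_{k,v}: number of occurrences of word v assigned to topic k, over all documents
I = np.zeros((M, K), dtype=int)
J = np.zeros((K, V), dtype=int)
for m in range(M):
    for z_mn, w_mn in zip(Z[m], W[m]):
        I[m, z_mn] += 1
        J[z_mn, w_mn] += 1

assert (I.sum(axis=1) == [len(doc) for doc in W]).all()   # sum_k i_{m,k} = N_m
assert J.sum() == sum(len(doc) for doc in W)              # sum_{k,v} j_{k,v} = L
```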

Calculation of p(Θ, Φ, W, Z; α, β)
$$\begin{aligned}
p(\Theta, \Phi, W, Z; \alpha, \beta) &= p(\Theta; \alpha)\, p(\Phi; \beta)\, p(Z \mid \Theta)\, p(W \mid Z, \Phi) \\
&= \prod_{m=1}^{M} \Big( \frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^{K}} \prod_{k=1}^{K} \theta_{m,k}^{\alpha-1} \Big)
\times \prod_{k=1}^{K} \Big( \frac{\Gamma(V\beta)}{(\Gamma(\beta))^{V}} \prod_{v=1}^{V} \phi_{k,v}^{\beta-1} \Big)
\times \prod_{m=1}^{M} \prod_{k=1}^{K} \theta_{m,k}^{i_{m,k}}
\times \prod_{k=1}^{K} \prod_{v=1}^{V} \phi_{k,v}^{j_{k,v}} \\
&= \underbrace{\prod_{m=1}^{M} \Big( \frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^{K}} \prod_{k=1}^{K} \theta_{m,k}^{i_{m,k}+\alpha-1} \Big)}_{p(Z \mid \Theta)\, p(\Theta;\, \alpha)}
\times \underbrace{\prod_{k=1}^{K} \Big( \frac{\Gamma(V\beta)}{(\Gamma(\beta))^{V}} \prod_{v=1}^{V} \phi_{k,v}^{j_{k,v}+\beta-1} \Big)}_{p(W \mid Z,\, \Phi)\, p(\Phi;\, \beta)}
\end{aligned}$$

Because the hyper-parameters α and β are pre-specified and fixed values, they can be omitted, and the compact representation can be presented as follows.

p(Θ, Φ, W , Z) = p(Z∣Θ)p(Θ)p(W ∣Z, Φ)p(Φ)

Marginal Distribution
$$\begin{aligned}
p(W, Z) &= \int\!\!\int p(Z, W, \Theta, \Phi)\, d\Theta\, d\Phi \\
&= \int\!\!\int p(Z \mid \Theta)\, p(\Theta)\, p(W \mid Z, \Phi)\, p(\Phi)\, d\Theta\, d\Phi \\
&= \int p(Z \mid \Theta)\, p(\Theta)\, d\Theta \times \int p(W \mid Z, \Phi)\, p(\Phi)\, d\Phi \\
&= \prod_{m=1}^{M} \Big( \frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^{K}} \int \prod_{k=1}^{K} \theta_{m,k}^{i_{m,k}+\alpha-1}\, d\theta_m \Big)
\times \prod_{k=1}^{K} \Big( \frac{\Gamma(V\beta)}{(\Gamma(\beta))^{V}} \int \prod_{v=1}^{V} \phi_{k,v}^{j_{k,v}+\beta-1}\, d\phi_k \Big) \\
&= \prod_{m=1}^{M} \frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^{K}} \frac{\prod_{k=1}^{K} \Gamma(i_{m,k}+\alpha)}{\Gamma\big(\sum_{k=1}^{K}(i_{m,k}+\alpha)\big)}
\times \prod_{k=1}^{K} \frac{\Gamma(V\beta)}{(\Gamma(\beta))^{V}} \frac{\prod_{v=1}^{V} \Gamma(j_{k,v}+\beta)}{\Gamma\big(\sum_{v=1}^{V}(j_{k,v}+\beta)\big)}
\end{aligned}$$

Because $\frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^{K}}$ and $\frac{\Gamma(V\beta)}{(\Gamma(\beta))^{V}}$ are constant values, we can cancel them and get

$$p(Z, W) \propto \prod_{m=1}^{M} \frac{\prod_{k=1}^{K} \Gamma(i_{m,k}+\alpha)}{\Gamma\big(\sum_{k=1}^{K}(i_{m,k}+\alpha)\big)} \times \prod_{k=1}^{K} \frac{\prod_{v=1}^{V} \Gamma(j_{k,v}+\beta)}{\Gamma\big(\sum_{v=1}^{V}(j_{k,v}+\beta)\big)}$$
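In practice this Dirichlet-multinomial marginal is evaluated in log space with gammaln to avoid overflow. A minimal sketch, assuming the count matrices I and J from the previous snippet:

```python
import numpy as np
from scipy.special import gammaln

def log_p_zw(I, J, alpha, beta):
    # log p(Z, W) up to the additive constant dropped above
    doc_part = (gammaln(I + alpha).sum(axis=1)               # sum_k log Gamma(i_{m,k} + alpha)
                - gammaln((I + alpha).sum(axis=1))).sum()    # - log Gamma(sum_k (i_{m,k} + alpha))
    topic_part = (gammaln(J + beta).sum(axis=1)              # sum_v log Gamma(j_{k,v} + beta)
                  - gammaln((J + beta).sum(axis=1))).sum()   # - log Gamma(sum_v (j_{k,v} + beta))
    return doc_part + topic_part
```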

Full Conditional Probability

The goal is to sample the topics assigned to all words in all documents. This can be done by iteratively sampling the topic of each single word from its full conditional probability.

If wa,b is the current word, then za,b is the topic assigned to wa,b, and Z¬a,b represents the topic assignments to all other words, excluding za,b.

Z = {za,b , Z¬a,b }

We know
$$p(z_{a,b} \mid Z_{\neg a,b}, W) = \frac{p(z_{a,b}, Z_{\neg a,b}, W)}{p(Z_{\neg a,b}, W)} = \frac{p(z_{a,b}, Z_{\neg a,b}, W)}{\sum_{z_{a,b}} p(z_{a,b}, Z_{\neg a,b}, W)} \propto p(z_{a,b}, Z_{\neg a,b}, W) = p(Z, W)$$

For calculation, we have

$$p(z_{a,b} = k \mid Z_{\neg a,b}, W) \propto p(z_{a,b} = k, Z_{\neg a,b}, W)$$

$$p(z_{a,b} = k \mid Z_{\neg a,b}, W) = \frac{p(z_{a,b} = k, Z_{\neg a,b}, W)}{\sum_{k=1}^{K} p(z_{a,b} = k, Z_{\neg a,b}, W)}$$

MCMC using Gibbs Sampling

Gibbs sampling is a widely used MCMC method, which tries to infer the value of the whole Z by iteratively computing the full conditional probability of each single za,b.

As mentioned in the previous section, if we treat all M documents as one single long document with length L, then Z = z1 ⋯ zl ⋯ zL, and za,b corresponds to a specific zl.

$z_l^{(0)}$ is initialized with a random number between 1 and K. $z_l^{(t)}$ is the t-th round sampled instance of $z_l$, and $z_l^{(t+1)}$ is the (t + 1)-th round sampled instance of $z_l$.

$$p\big(z_l \mid z_1^{(t+1)} \cdots z_{l-1}^{(t+1)}, z_{l+1}^{(t)} \cdots z_L^{(t)},\; w_1 \cdots w_{l-1}, w_l, w_{l+1} \cdots w_L\big) \propto p\big(z_1^{(t+1)} \cdots z_{l-1}^{(t+1)}, z_l, z_{l+1}^{(t)} \cdots z_L^{(t)},\; w_1 \cdots w_{l-1}, w_l, w_{l+1} \cdots w_L\big)$$

Or we can have
$$p\big(z_l = k \mid z_1^{(t+1)} \cdots z_{l-1}^{(t+1)}, z_{l+1}^{(t)} \cdots z_L^{(t)},\; w_1 \cdots w_{l-1}, w_l, w_{l+1} \cdots w_L\big) \propto p\big(z_1^{(t+1)} \cdots z_{l-1}^{(t+1)}, z_l = k, z_{l+1}^{(t)} \cdots z_L^{(t)},\; w_1 \cdots w_{l-1}, w_l, w_{l+1} \cdots w_L\big)$$

We define f (zl = k) as

$$p\big(z_1^{(t+1)} \cdots z_{l-1}^{(t+1)}, z_l = k, z_{l+1}^{(t)} \cdots z_L^{(t)},\; w_1 \cdots w_{l-1}, w_l, w_{l+1} \cdots w_L\big)$$

We can calculate all K values of f(zl = k) (1 ≤ k ≤ K), and then use normalization to get the value of

$$p\big(z_l = k \mid z_1^{(t+1)} \cdots z_{l-1}^{(t+1)}, z_{l+1}^{(t)} \cdots z_L^{(t)},\; w_1 \cdots w_{l-1}, w_l, w_{l+1} \cdots w_L\big) = \frac{f(z_l = k)}{\sum_{k=1}^{K} f(z_l = k)}$$

Based on the above probability distribution, we can sample the value of $z_l^{(t+1)}$. If l corresponds to the b-th word in the a-th document, then

zl = za,b
f (zl = k) = f (za,b = k)

where 1 ≤ k ≤ K .

Notes:
In the literature on Bayesian inference, a large number of samples are generally generated for the high-dimensional variable Z, and these samples are used to calculate the mean (or expectation) of Z, or the expectation of some function q(Z).

Calculation of f(za,b = k)
$$f(z_{a,b} = k) \propto p(Z, W) \propto \prod_{m=1}^{M} \frac{\prod_{k=1}^{K} \Gamma(i_{m,k}+\alpha)}{\Gamma\big(\sum_{k=1}^{K}(i_{m,k}+\alpha)\big)} \times \prod_{k=1}^{K} \frac{\prod_{v=1}^{V} \Gamma(j_{k,v}+\beta)}{\Gamma\big(\sum_{v=1}^{V}(j_{k,v}+\beta)\big)}$$

$$= \underbrace{\prod_{m \ne a} \frac{\prod_{k=1}^{K} \Gamma(i_{m,k}+\alpha)}{\Gamma\big(\sum_{k=1}^{K}(i_{m,k}+\alpha)\big)}}_{\text{excluding the } a\text{-th document}} \times \underbrace{\frac{\prod_{k=1}^{K} \Gamma(i_{a,k}+\alpha)}{\Gamma\big(\sum_{k=1}^{K}(i_{a,k}+\alpha)\big)}}_{\text{only the } a\text{-th document}} \times \prod_{k=1}^{K} \frac{\overbrace{\Big(\prod_{v \ne w_{a,b}} \Gamma(j_{k,v}+\beta)\Big)}^{\text{excluding } v = w_{a,b}} \times \overbrace{\Gamma(j_{k,(w_{a,b})}+\beta)}^{\text{only } v = w_{a,b}}}{\Gamma\big(\sum_{v=1}^{V}(j_{k,v}+\beta)\big)}$$

We can cancel terms which don’t depend on a and b.

$$f(z_{a,b} = k) \propto \frac{\prod_{k=1}^{K} \Gamma(i_{a,k}+\alpha)}{\Gamma\big(\sum_{k=1}^{K}(i_{a,k}+\alpha)\big)} \times \prod_{k=1}^{K} \frac{\Gamma(j_{k,(w_{a,b})}+\beta)}{\Gamma\big(\sum_{v=1}^{V}(j_{k,v}+\beta)\big)}$$

Simplifying $\prod_{k=1}^{K} \Gamma(i_{a,k} + \alpha)$

In the a-th document, the b-th word wa,b is assigned the topic za,b. If the word wa,b is excluded, then the number of words assigned to each topic k in the document is defined as $i_{a,k}^{\neg a,b}$.

When k ≠ za,b , then


$$i_{a,k}^{\neg a,b} = i_{a,k}, \qquad \Gamma(i_{a,k}^{\neg a,b} + \alpha) = \Gamma(i_{a,k} + \alpha)$$

When k = za,b , then

$$i_{a,k} = i_{a,k}^{\neg a,b} + 1 = i_{a,(z_{a,b})}^{\neg a,b} + 1, \qquad \Gamma(i_{a,k} + \alpha) = \Gamma(i_{a,k}^{\neg a,b} + \alpha + 1) = \Gamma(i_{a,(z_{a,b})}^{\neg a,b} + \alpha + 1)$$

Because Γ(x + 1) = x × Γ(x), we can get

$$\Gamma(i_{a,(z_{a,b})}^{\neg a,b} + \alpha + 1) = (i_{a,(z_{a,b})}^{\neg a,b} + \alpha) \times \Gamma(i_{a,(z_{a,b})}^{\neg a,b} + \alpha)$$

We can further get


$$\begin{aligned}
\prod_{k=1}^{K} \Gamma(i_{a,k} + \alpha) &= \Gamma(i_{a,(z_{a,b})} + \alpha) \times \prod_{k \ne z_{a,b}} \Gamma(i_{a,k} + \alpha) \\
&= \Gamma(i_{a,(z_{a,b})}^{\neg a,b} + \alpha + 1) \times \prod_{k \ne z_{a,b}} \Gamma(i_{a,k}^{\neg a,b} + \alpha) \\
&= (i_{a,(z_{a,b})}^{\neg a,b} + \alpha) \times \Gamma(i_{a,(z_{a,b})}^{\neg a,b} + \alpha) \times \prod_{k \ne z_{a,b}} \Gamma(i_{a,k}^{\neg a,b} + \alpha) \\
&= (i_{a,(z_{a,b})}^{\neg a,b} + \alpha) \times \prod_{k=1}^{K} \Gamma(i_{a,k}^{\neg a,b} + \alpha)
\end{aligned}$$

Given the current a and b, because the topics assigned to all words other than wa,b are fixed, for each value of k (1 ≤ k ≤ K) the corresponding value of $i_{a,k}^{\neg a,b}$ is fixed. So $\prod_{k=1}^{K} \Gamma(i_{a,k}^{\neg a,b} + \alpha)$ is a constant value independent of k and can be cancelled.

But the value of $(i_{a,(z_{a,b})}^{\neg a,b} + \alpha)$ depends on the value of za,b (1 ≤ za,b ≤ K), so we get

$$\prod_{k=1}^{K} \Gamma(i_{a,k} + \alpha) \propto (i_{a,(z_{a,b})}^{\neg a,b} + \alpha)$$

Simplifying $\Gamma\big(\sum_{k=1}^{K}(i_{a,k} + \alpha)\big)$

Similarly, m = a (a-th document) is fixed, and we can get

$$\begin{aligned}
\sum_{k=1}^{K}(i_{a,k} + \alpha) &= \underbrace{(i_{a,(z_{a,b})} + \alpha)}_{k = z_{a,b}} + \sum_{k \ne z_{a,b}} (i_{a,k} + \alpha) \\
&= (i_{a,(z_{a,b})}^{\neg a,b} + \alpha + 1) + \sum_{k \ne z_{a,b}} (i_{a,k}^{\neg a,b} + \alpha) \\
&= 1 + (i_{a,(z_{a,b})}^{\neg a,b} + \alpha) + \sum_{k \ne z_{a,b}} (i_{a,k}^{\neg a,b} + \alpha) \\
&= 1 + \sum_{k=1}^{K}(i_{a,k}^{\neg a,b} + \alpha)
\end{aligned}$$

We can further get


$$\begin{aligned}
\Gamma\Big(\sum_{k=1}^{K}(i_{a,k} + \alpha)\Big) &= \Gamma\Big(1 + \sum_{k=1}^{K}(i_{a,k}^{\neg a,b} + \alpha)\Big) \\
&= \sum_{k=1}^{K}(i_{a,k}^{\neg a,b} + \alpha) \times \Gamma\Big(\sum_{k=1}^{K}(i_{a,k}^{\neg a,b} + \alpha)\Big)
\end{aligned}$$

Actually, $\sum_{k=1}^{K} i_{a,k}^{\neg a,b}$ is the total number of words in the a-th document without counting wa,b, and this number can be defined as $N_a^{\neg a,b}$.

Obviously,

$$\sum_{k=1}^{K} i_{a,k}^{\neg a,b} = N_a^{\neg a,b} = N_a - 1$$

So

$$\sum_{k=1}^{K}(i_{a,k}^{\neg a,b} + \alpha) = N_a^{\neg a,b} + K\alpha = N_a + K\alpha - 1$$

We can get

$$\Gamma\Big(\sum_{k=1}^{K}(i_{a,k}^{\neg a,b} + \alpha)\Big) = \Gamma(N_a + K\alpha - 1)$$

Finally

$$\Gamma\Big(\sum_{k=1}^{K}(i_{a,k} + \alpha)\Big) = (N_a + K\alpha - 1) \times \Gamma(N_a + K\alpha - 1)$$

Given the current a and b, when calculating f(za,b = k) with 1 ≤ k ≤ K, the whole term $\Gamma\big(\sum_{k=1}^{K}(i_{a,k} + \alpha)\big)$ is a fixed value independent of k, and can be cancelled.

Simplifying $\prod_{k=1}^{K} \Gamma(j_{k,(w_{a,b})} + \beta)$

Similarly, if we treat all documents as one single large document and exclude wa,b, the b-th word in the a-th document, then we recalculate the number of occurrences of each distinct word v assigned to each topic k in this large document, which is defined as $j_{k,v}^{\neg a,b}$.
When k ≠ za,b , for any v , we have

$$j_{k,v}^{\neg a,b} = j_{k,v}, \qquad \Gamma(j_{k,v}^{\neg a,b} + \beta) = \Gamma(j_{k,v} + \beta)$$

And in particular for v = wa,b, we have

$$j_{k,(w_{a,b})}^{\neg a,b} = j_{k,(w_{a,b})}, \qquad \Gamma(j_{k,(w_{a,b})}^{\neg a,b} + \beta) = \Gamma(j_{k,(w_{a,b})} + \beta)$$

When k = za,b and v = wa,b , we have

$$j_{k,v} = j_{k,v}^{\neg a,b} + 1, \qquad \text{i.e.} \qquad j_{(z_{a,b}),(w_{a,b})} = j_{(z_{a,b}),(w_{a,b})}^{\neg a,b} + 1$$

$$\Gamma(j_{k,v} + \beta) = \Gamma(j_{k,v}^{\neg a,b} + \beta + 1), \qquad \text{i.e.} \qquad \Gamma(j_{(z_{a,b}),(w_{a,b})} + \beta) = \Gamma(j_{(z_{a,b}),(w_{a,b})}^{\neg a,b} + \beta + 1)$$

We get

$$\begin{aligned}
\prod_{k=1}^{K} \Gamma(j_{k,(w_{a,b})} + \beta) &= \Gamma(j_{(z_{a,b}),(w_{a,b})} + \beta) \times \prod_{k \ne z_{a,b}} \Gamma(j_{k,(w_{a,b})} + \beta) \\
&= \Gamma(j_{(z_{a,b}),(w_{a,b})}^{\neg a,b} + \beta + 1) \times \prod_{k \ne z_{a,b}} \Gamma(j_{k,(w_{a,b})}^{\neg a,b} + \beta) \\
&= (j_{(z_{a,b}),(w_{a,b})}^{\neg a,b} + \beta) \times \Gamma(j_{(z_{a,b}),(w_{a,b})}^{\neg a,b} + \beta) \times \prod_{k \ne z_{a,b}} \Gamma(j_{k,(w_{a,b})}^{\neg a,b} + \beta) \\
&= (j_{(z_{a,b}),(w_{a,b})}^{\neg a,b} + \beta) \times \prod_{k=1}^{K} \Gamma(j_{k,(w_{a,b})}^{\neg a,b} + \beta)
\end{aligned}$$

Given the current a and b, for each value of k (1 ≤ k ≤ K), the corresponding value of $(j_{k,(w_{a,b})}^{\neg a,b} + \beta)$ is fixed. So $\prod_{k=1}^{K} \Gamma(j_{k,(w_{a,b})}^{\neg a,b} + \beta)$ is a constant value independent of k and can be cancelled.

But the value of $(j_{(z_{a,b}),(w_{a,b})}^{\neg a,b} + \beta)$ depends on the value of za,b (1 ≤ za,b ≤ K), so we get

$$\prod_{k=1}^{K} \Gamma(j_{k,(w_{a,b})} + \beta) \propto (j_{(z_{a,b}),(w_{a,b})}^{\neg a,b} + \beta)$$
Simplifying $\prod_{k=1}^{K} \Gamma\big(\sum_{v=1}^{V}(j_{k,v} + \beta)\big)$

When k ≠ za,b , for any v , we have

$$j_{k,v}^{\neg a,b} = j_{k,v}, \qquad \sum_{v=1}^{V} j_{k,v}^{\neg a,b} = \sum_{v=1}^{V} j_{k,v}, \qquad \Gamma\Big(\sum_{v=1}^{V}(j_{k,v} + \beta)\Big) = \Gamma\Big(\sum_{v=1}^{V}(j_{k,v}^{\neg a,b} + \beta)\Big)$$

When k = za,b, if v ≠ wa,b, we still have


$$j_{(z_{a,b}),v}^{\neg a,b} = j_{(z_{a,b}),v}, \qquad \sum_{v \ne w_{a,b}} j_{(z_{a,b}),v}^{\neg a,b} = \sum_{v \ne w_{a,b}} j_{(z_{a,b}),v}$$

Only when k = za,b and v = wa,b , we have

$$j_{k,v} = j_{k,v}^{\neg a,b} + 1, \qquad \text{i.e.} \qquad j_{(z_{a,b}),(w_{a,b})} = j_{(z_{a,b}),(w_{a,b})}^{\neg a,b} + 1$$

So we can get

$$\begin{aligned}
\sum_{v=1}^{V} j_{(z_{a,b}),v} &= j_{(z_{a,b}),(w_{a,b})} + \sum_{v \ne w_{a,b}} j_{(z_{a,b}),v} \\
&= j_{(z_{a,b}),(w_{a,b})}^{\neg a,b} + 1 + \sum_{v \ne w_{a,b}} j_{(z_{a,b}),v}^{\neg a,b} \\
&= 1 + \sum_{v=1}^{V} j_{(z_{a,b}),v}^{\neg a,b}
\end{aligned}$$

We further get
$$\sum_{v=1}^{V}(j_{(z_{a,b}),v} + \beta) = 1 + \sum_{v=1}^{V}(j_{(z_{a,b}),v}^{\neg a,b} + \beta)$$

$$\begin{aligned}
\Gamma\Big(\sum_{v=1}^{V}(j_{(z_{a,b}),v} + \beta)\Big) &= \Gamma\Big(1 + \sum_{v=1}^{V}(j_{(z_{a,b}),v}^{\neg a,b} + \beta)\Big) \\
&= \Big(\sum_{v=1}^{V}(j_{(z_{a,b}),v}^{\neg a,b} + \beta)\Big) \times \Gamma\Big(\sum_{v=1}^{V}(j_{(z_{a,b}),v}^{\neg a,b} + \beta)\Big)
\end{aligned}$$

Finally, we can get

$$\begin{aligned}
\prod_{k=1}^{K} \Gamma\Big(\sum_{v=1}^{V}(j_{k,v} + \beta)\Big) &= \underbrace{\Gamma\Big(\sum_{v=1}^{V}(j_{(z_{a,b}),v} + \beta)\Big)}_{k = z_{a,b}} \times \prod_{k \ne z_{a,b}} \Gamma\Big(\sum_{v=1}^{V}(j_{k,v} + \beta)\Big) \\
&= \Big(\sum_{v=1}^{V}(j_{(z_{a,b}),v}^{\neg a,b} + \beta)\Big) \times \Gamma\Big(\sum_{v=1}^{V}(j_{(z_{a,b}),v}^{\neg a,b} + \beta)\Big) \times \prod_{k \ne z_{a,b}} \Gamma\Big(\sum_{v=1}^{V}(j_{k,v}^{\neg a,b} + \beta)\Big) \\
&= \Big(\sum_{v=1}^{V}(j_{(z_{a,b}),v}^{\neg a,b} + \beta)\Big) \times \prod_{k=1}^{K} \Gamma\Big(\sum_{v=1}^{V}(j_{k,v}^{\neg a,b} + \beta)\Big)
\end{aligned}$$

Given the current a and b, for each value of k (1 ≤ k ≤ K), the corresponding value of $\sum_{v=1}^{V}(j_{k,v}^{\neg a,b} + \beta)$ is fixed. So $\prod_{k=1}^{K} \Gamma\big(\sum_{v=1}^{V}(j_{k,v}^{\neg a,b} + \beta)\big)$ is a constant value independent of k and can be cancelled.

But the value of $\sum_{v=1}^{V}(j_{(z_{a,b}),v}^{\neg a,b} + \beta)$ depends on the value of za,b (1 ≤ za,b ≤ K), so we get

$$\prod_{k=1}^{K} \Gamma\Big(\sum_{v=1}^{V}(j_{k,v} + \beta)\Big) \propto \sum_{v=1}^{V}\big(j_{(z_{a,b}),v}^{\neg a,b} + \beta\big)$$

Simplifying f (za,b = k)

Based on the above steps, we can get


$$\begin{aligned}
f(z_{a,b} = k) &\propto \frac{i_{a,(z_{a,b})}^{\neg a,b} + \alpha}{\sum_{k=1}^{K}\big(i_{a,k}^{\neg a,b} + \alpha\big)} \times \frac{j_{(z_{a,b}),(w_{a,b})}^{\neg a,b} + \beta}{\sum_{v=1}^{V}\big(j_{(z_{a,b}),v}^{\neg a,b} + \beta\big)} \\
&\propto \frac{\big(i_{a,(z_{a,b})}^{\neg a,b} + \alpha\big) \times \big(j_{(z_{a,b}),(w_{a,b})}^{\neg a,b} + \beta\big)}{\sum_{v=1}^{V}\big(j_{(z_{a,b}),v}^{\neg a,b} + \beta\big)}
\end{aligned}$$
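A minimal sketch of one Gibbs update implementing this formula, using the decrement/sample/increment pattern on the count matrices I and J from the earlier snippets; the function name and argument layout are illustrative assumptions, not from the original notes.

```python
import numpy as np

def sample_topic(a, w_ab, z_ab, I, J, alpha, beta, rng):
    # Remove the current assignment so the counts become i^{-(a,b)} and j^{-(a,b)}.
    I[a, z_ab] -= 1
    J[z_ab, w_ab] -= 1
    # f(z_{a,b} = k) for every candidate topic k, from the simplified formula above.
    f = (I[a, :] + alpha) * (J[:, w_ab] + beta) / (J.sum(axis=1) + beta * J.shape[1])
    k_new = rng.choice(len(f), p=f / f.sum())   # normalize and sample the new topic
    # Add the word back under the newly sampled topic.
    I[a, k_new] += 1
    J[k_new, w_ab] += 1
    return k_new
```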

Inference of Θ and Φ

Inference of Θ

After checking the Markov blanket of Θ, we have

$$\begin{aligned}
p(\Theta \mid W, Z, \Phi; \alpha, \beta) &= p(\Theta \mid Z; \alpha) \\
&= \frac{p(\Theta, Z; \alpha)}{p(Z; \alpha)} = \frac{p(\Theta, Z; \alpha)}{\int p(\Theta, Z; \alpha)\, d\Theta} = \frac{p(\Theta; \alpha)\, p(Z \mid \Theta)}{\int p(\Theta; \alpha)\, p(Z \mid \Theta)\, d\Theta}
\end{aligned}$$

And

$$p(\Theta \mid Z; \alpha) = \prod_{m=1}^{M} p(\theta_m \mid Z_m; \alpha) = \prod_{m=1}^{M} \frac{p(\theta_m; \alpha)\, p(Z_m \mid \theta_m)}{\int p(\theta_m; \alpha)\, p(Z_m \mid \theta_m)\, d\theta_m}$$

We know

$$p(\theta_m; \alpha) = \frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^{K}} \prod_{k=1}^{K} \theta_{m,k}^{\alpha-1}, \qquad p(Z_m \mid \theta_m) = \prod_{k=1}^{K} \theta_{m,k}^{i_{m,k}}$$

We can get
$$\begin{aligned}
p(\theta_m \mid Z_m; \alpha) &= \frac{p(\theta_m; \alpha)\, p(Z_m \mid \theta_m)}{\int p(\theta_m; \alpha)\, p(Z_m \mid \theta_m)\, d\theta_m} \\
&= \frac{\frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^{K}} \prod_{k=1}^{K} \theta_{m,k}^{\alpha-1} \times \prod_{k=1}^{K} \theta_{m,k}^{i_{m,k}}}{\int \frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^{K}} \prod_{k=1}^{K} \theta_{m,k}^{\alpha-1} \times \prod_{k=1}^{K} \theta_{m,k}^{i_{m,k}}\, d\theta_m} \\
&= \frac{\prod_{k=1}^{K} \theta_{m,k}^{i_{m,k}+\alpha-1}}{\int \prod_{k=1}^{K} \theta_{m,k}^{i_{m,k}+\alpha-1}\, d\theta_m} \\
&= \frac{\Gamma\big(\sum_{k=1}^{K}(\alpha + i_{m,k})\big)}{\prod_{k=1}^{K} \Gamma(\alpha + i_{m,k})} \prod_{k=1}^{K} \theta_{m,k}^{i_{m,k}+\alpha-1}
\end{aligned}$$

The posterior distribution of θm is also a Dirichlet distribution, with hyper-parameters im,k + α, and the corresponding expectation is

$$E(\theta_{m,k}) = \frac{\alpha + i_{m,k}}{\sum_{k=1}^{K}(\alpha + i_{m,k})} = \frac{\alpha + i_{m,k}}{K\alpha + N_m}$$
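This expectation is a one-liner given the count matrix I; a sketch under the assumptions of the earlier snippets:

```python
# E[theta_{m,k}] = (alpha + i_{m,k}) / (K*alpha + N_m), computed row-wise over I
Theta_hat = (I + alpha) / (I.sum(axis=1, keepdims=True) + I.shape[1] * alpha)
```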

Inference of Φ

After checking the Markov blanket of Φ, we have

$$\begin{aligned}
p(\Phi \mid W, Z, \Theta; \alpha, \beta) &= p(\Phi \mid W, Z; \beta) \\
&= \frac{p(\Phi, W, Z; \beta)}{p(W, Z; \beta)} = \frac{p(\Phi, W, Z; \beta)}{\int p(\Phi, W, Z; \beta)\, d\Phi} \\
&= \frac{p(W \mid \Phi, Z)\, p(\Phi, Z)}{\int p(W \mid \Phi, Z)\, p(\Phi, Z)\, d\Phi} = \frac{p(W \mid \Phi, Z)\, p(\Phi)\, p(Z)}{\int p(W \mid \Phi, Z)\, p(\Phi)\, p(Z)\, d\Phi} \\
&= \frac{p(W \mid \Phi, Z)\, p(\Phi)}{\int p(W \mid \Phi, Z)\, p(\Phi)\, d\Phi}
\end{aligned}$$

Notes:
When there is no condition and no observation, Φ and Z are independent of each other, so we have p(Φ, Z) = p(Φ)p(Z).

For a word wl, once we know its assigned topic zl, we can organize all words W into K groups. Each group Wk (1 ≤ k ≤ K) contains all words sampled from ϕk with the topic k. Every word in Wk has the same topic k, and Zk is used to represent all topic assignments to Wk.

$$\begin{aligned}
p(\Phi \mid W, Z; \beta) &= \prod_{k=1}^{K} p(\phi_k \mid W_k, Z_k; \beta) \\
&= \prod_{k=1}^{K} \frac{p(W_k \mid \phi_k, Z_k)\, p(\phi_k)}{\int p(W_k \mid \phi_k, Z_k)\, p(\phi_k)\, d\phi_k} = \prod_{k=1}^{K} \frac{p(W_k \mid \phi_k)\, p(\phi_k)}{\int p(W_k \mid \phi_k)\, p(\phi_k)\, d\phi_k}
\end{aligned}$$

We know

$$p(W_k \mid \phi_k) = \prod_{v=1}^{V} \phi_{k,v}^{j_{k,v}}, \qquad p(\phi_k) = \frac{\Gamma(V\beta)}{(\Gamma(\beta))^{V}} \prod_{v=1}^{V} \phi_{k,v}^{\beta-1}$$

We can get

$$\begin{aligned}
p(\phi_k \mid W_k, Z_k; \beta) &= \frac{p(W_k \mid \phi_k)\, p(\phi_k)}{\int p(W_k \mid \phi_k)\, p(\phi_k)\, d\phi_k} \\
&= \frac{\prod_{v=1}^{V} \phi_{k,v}^{j_{k,v}} \times \frac{\Gamma(V\beta)}{(\Gamma(\beta))^{V}} \prod_{v=1}^{V} \phi_{k,v}^{\beta-1}}{\int \prod_{v=1}^{V} \phi_{k,v}^{j_{k,v}} \times \frac{\Gamma(V\beta)}{(\Gamma(\beta))^{V}} \prod_{v=1}^{V} \phi_{k,v}^{\beta-1}\, d\phi_k} \\
&= \frac{\frac{\Gamma(V\beta)}{(\Gamma(\beta))^{V}} \prod_{v=1}^{V} \phi_{k,v}^{j_{k,v}+\beta-1}}{\int \frac{\Gamma(V\beta)}{(\Gamma(\beta))^{V}} \prod_{v=1}^{V} \phi_{k,v}^{j_{k,v}+\beta-1}\, d\phi_k} \\
&= \frac{\Gamma\big(\sum_{v=1}^{V}(\beta + j_{k,v})\big)}{\prod_{v=1}^{V} \Gamma(\beta + j_{k,v})} \prod_{v=1}^{V} \phi_{k,v}^{j_{k,v}+\beta-1}
\end{aligned}$$

The posterior distribution of ϕk is also a Dirichlet distribution, with hyper-parameters β + jk,v, and the corresponding expectation is
$$E(\phi_{k,v}) = \frac{\beta + j_{k,v}}{\sum_{v=1}^{V}(\beta + j_{k,v})} = \frac{\beta + j_{k,v}}{V\beta + \sum_{v=1}^{V} j_{k,v}}$$
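Analogously to Θ, this expectation can be read off the count matrix J; a sketch under the same assumptions as the earlier snippets:

```python
# E[phi_{k,v}] = (beta + j_{k,v}) / (V*beta + sum_v j_{k,v}), computed row-wise over J
Phi_hat = (J + beta) / (J.sum(axis=1, keepdims=True) + J.shape[1] * beta)
```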

Concrete Implementation of Gibbs Sampling of LDA

The total number of words assigned to the topic k is twk, which is an element of a vector of length K, TW = (tw1 ⋯ twk ⋯ twK).
The total number of words in document m is dwm, which is an element of a vector of length M, DW = (dw1 ⋯ dwm ⋯ dwM).

ITERATIONS is set as the maximal number of iterations for Gibbs sampling. BURNIN is set as the number of iterations for burn-in. SAMPLELAG is set as the number of iterations between successive read-outs of Θ and Φ.
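Putting the pieces together, a compact collapsed Gibbs sampler with the ITERATIONS / BURNIN / SAMPLELAG schedule described above might look as follows. It reuses the sample_topic helper sketched earlier and represents each document as a sequence of word ids; this is a sketch under those assumptions, not a reference implementation from these notes.

```python
import numpy as np

def lda_gibbs(W, K, V, alpha, beta, iterations=1000, burnin=200, samplelag=10, seed=0):
    rng = np.random.default_rng(seed)
    M = len(W)
    # Random initialization of topic assignments and the count matrices I and J.
    Z = [rng.integers(0, K, size=len(doc)) for doc in W]
    I = np.zeros((M, K), dtype=int)
    J = np.zeros((K, V), dtype=int)
    for m, doc in enumerate(W):
        for w, z in zip(doc, Z[m]):
            I[m, z] += 1
            J[z, w] += 1

    theta_sum = np.zeros((M, K))
    phi_sum = np.zeros((K, V))
    n_samples = 0
    for it in range(iterations):
        # One sweep: resample the topic of every word from its full conditional.
        for m, doc in enumerate(W):
            for n, w in enumerate(doc):
                Z[m][n] = sample_topic(m, w, Z[m][n], I, J, alpha, beta, rng)
        # After burn-in, read out Theta and Phi every SAMPLELAG iterations and average.
        if it >= burnin and (it - burnin) % samplelag == 0:
            theta_sum += (I + alpha) / (I.sum(axis=1, keepdims=True) + K * alpha)
            phi_sum += (J + beta) / (J.sum(axis=1, keepdims=True) + V * beta)
            n_samples += 1
    return theta_sum / n_samples, phi_sum / n_samples
```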
