Notations
Variable θm and Θ
θm ∼ Dir(α); 1 ≤ m ≤ M
θm,k is the k-th element of θm, corresponding to the proportion of the topic k in the m-th document.
$$\sum_{k=1}^{K} \theta_{m,k} = 1$$
We can represent θ1 ⋯ θm ⋯ θM as an M × K matrix Θ.
Variable ϕk and Φ
ϕk ∼ Dir(β); 1 ≤ k ≤ K
ϕk,v is the v-th element of ϕk, corresponding to the proportion of the word v in the k-th topic. Each v is an element of the dictionary with V distinct words.
$$\sum_{v=1}^{V} \phi_{k,v} = 1$$
Words W
The m-th document Wm can be represented by the words it contains. wm,n is the n-th word in the m-th document, and Nm is the number of words in the m-th document. wm,n at different positions can be instances of the same dictionary entry v.
Please be aware that W is not a matrix, because each of its rows, corresponding to the m-th document, may have a different number of words (length Nm). For example, N1 ≠ Nm ≠ NM.
The latent topic variables assigned to words in the m-th document can be represented as Zm .
zm,n is the topic assigned to the n-th word in the m-th document, and Nm is the number of
words in the m-th document.
The latent topic variables assigned to words in all documents can be represented as Z. Please be aware that Z is not a matrix either: each of its rows, corresponding to the m-th document, may have a different number of words (length Nm). For example, N1 ≠ Nm ≠ NM.
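To make these shapes concrete, here is a minimal Python sketch. The corpus, dictionary, and hyper-parameter values are hypothetical, chosen only for illustration: W and Z are ragged lists of per-document lengths Nm, while Θ and Φ are proper matrices whose rows are Dirichlet draws.

```python
import numpy as np

# A toy corpus: W is a ragged list (not a matrix), because each
# document has its own length N_m.
dictionary = ["apple", "ball", "cat", "dog"]   # V = 4 distinct words
W = [
    [0, 1, 1, 2],      # document 1: N_1 = 4 indices into the dictionary
    [3, 0],            # document 2: N_2 = 2
    [2, 2, 3, 1, 0],   # document 3: N_3 = 5
]
M, V, K = len(W), len(dictionary), 2

# Z mirrors W's ragged shape: one topic id per word occurrence.
rng = np.random.default_rng(0)
Z = [rng.integers(0, K, size=len(doc)).tolist() for doc in W]

# Theta is a proper M x K matrix, Phi a proper K x V matrix;
# each row sums to 1 because rows are drawn from Dirichlet priors.
alpha, beta = 0.1, 0.01
Theta = rng.dirichlet([alpha] * K, size=M)   # theta_m ~ Dir(alpha)
Phi = rng.dirichlet([beta] * V, size=K)      # phi_k  ~ Dir(beta)
print(Theta.shape, Phi.shape)  # (3, 2) (2, 4)
```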
Graphical Models
Based on the generative process of LDA, we can represent the LDA model using plate notation.
For easier understanding, the corresponding expanded model is also shown here.
Joint Distribution
Based on the Bayesian network structure, we can write the joint distribution of LDA as follows.
$$
\begin{aligned}
& p(\theta_1 \cdots \theta_m \cdots \theta_M,\ \phi_1 \cdots \phi_k \cdots \phi_K,\ w_{1,1} \cdots w_{m,n} \cdots w_{M,N_M},\ z_{1,1} \cdots z_{m,n} \cdots z_{M,N_M};\ \alpha, \beta) \\
={}& p(\theta_1;\alpha) \cdots p(\theta_m;\alpha) \cdots p(\theta_M;\alpha) \\
&\times p(\phi_1;\beta) \cdots p(\phi_k;\beta) \cdots p(\phi_K;\beta) \\
&\times p(w_{1,1}\mid z_{1,1},\Phi) \cdots p(w_{1,n}\mid z_{1,n},\Phi) \cdots p(w_{1,N_1}\mid z_{1,N_1},\Phi) \\
&\quad \cdots p(w_{m,1}\mid z_{m,1},\Phi) \cdots p(w_{m,n}\mid z_{m,n},\Phi) \cdots p(w_{m,N_m}\mid z_{m,N_m},\Phi) \\
&\quad \cdots p(w_{M,1}\mid z_{M,1},\Phi) \cdots p(w_{M,n}\mid z_{M,n},\Phi) \cdots p(w_{M,N_M}\mid z_{M,N_M},\Phi) \\
&\times p(z_{1,1}\mid\theta_1) \cdots p(z_{1,n}\mid\theta_1) \cdots p(z_{1,N_1}\mid\theta_1) \\
&\quad \cdots p(z_{m,1}\mid\theta_m) \cdots p(z_{m,n}\mid\theta_m) \cdots p(z_{m,N_m}\mid\theta_m) \\
&\quad \cdots p(z_{M,1}\mid\theta_M) \cdots p(z_{M,n}\mid\theta_M) \cdots p(z_{M,N_M}\mid\theta_M)
\end{aligned}
$$
Calculation of p(Θ; α)
$$p(\Theta;\alpha) = p(\theta_1;\alpha) \cdots p(\theta_m;\alpha) \cdots p(\theta_M;\alpha) = \prod_{m=1}^{M} p(\theta_m;\alpha)$$
Because
$$p(\theta_m;\alpha) = \frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^K}\,\theta_{m,1}^{\alpha-1} \cdots \theta_{m,k}^{\alpha-1} \cdots \theta_{m,K}^{\alpha-1} = \frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^K}\prod_{k=1}^{K}\theta_{m,k}^{\alpha-1}$$
We get
$$p(\Theta;\alpha) = \prod_{m=1}^{M}\frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^K}\prod_{k=1}^{K}\theta_{m,k}^{\alpha-1}$$
Calculation of p(Φ; β)
$$p(\Phi;\beta) = p(\phi_1;\beta) \cdots p(\phi_k;\beta) \cdots p(\phi_K;\beta) = \prod_{k=1}^{K} p(\phi_k;\beta)$$
Because
$$p(\phi_k;\beta) = \frac{\Gamma(V\beta)}{(\Gamma(\beta))^V}\,\phi_{k,1}^{\beta-1} \cdots \phi_{k,v}^{\beta-1} \cdots \phi_{k,V}^{\beta-1} = \frac{\Gamma(V\beta)}{(\Gamma(\beta))^V}\prod_{v=1}^{V}\phi_{k,v}^{\beta-1}$$
We get
$$p(\Phi;\beta) = \prod_{k=1}^{K} p(\phi_k;\beta) = \prod_{k=1}^{K}\frac{\Gamma(V\beta)}{(\Gamma(\beta))^V}\prod_{v=1}^{V}\phi_{k,v}^{\beta-1}$$
Calculation of p(Z∣Θ)
$$
\begin{aligned}
p(Z\mid\Theta) ={}& p(z_{1,1}\mid\theta_1) \cdots p(z_{1,n}\mid\theta_1) \cdots p(z_{1,N_1}\mid\theta_1) \\
&\cdots p(z_{m,1}\mid\theta_m) \cdots p(z_{m,n}\mid\theta_m) \cdots p(z_{m,N_m}\mid\theta_m) \\
&\cdots p(z_{M,1}\mid\theta_M) \cdots p(z_{M,n}\mid\theta_M) \cdots p(z_{M,N_M}\mid\theta_M) \\
={}& \prod_{m=1}^{M}\prod_{n=1}^{N_m} p(z_{m,n}\mid\theta_m)
\end{aligned}
$$
θm is sampled from a Dirichlet distribution, so $\sum_{k=1}^{K}\theta_{m,k} = 1$, and p(zm,n∣θm) is a categorical distribution.
In the m-th document, let im,k be the total number of words assigned to the topic k, with $\sum_{k=1}^{K} i_{m,k} = N_m$. We can store the values im,k in an M × K matrix I.
$$p(Z_m\mid\theta_m) = \prod_{n=1}^{N_m} p(z_{m,n}\mid\theta_m) = \prod_{k=1}^{K}\theta_{m,k}^{i_{m,k}}$$
We get
$$p(Z\mid\Theta) = \prod_{m=1}^{M}\prod_{k=1}^{K}\theta_{m,k}^{i_{m,k}}$$
Calculation of p(W∣Z,Φ)
ϕk is sampled from a Dirichlet distribution, so $\sum_{v=1}^{V}\phi_{k,v} = 1$, and p(wm,n∣ϕk) is a categorical distribution. Let jk,v be the total number of times the dictionary word v is assigned to the topic k across all documents; these values can be stored in a K × V matrix J. Then
$$p(W\mid Z,\Phi) = \prod_{k=1}^{K}\prod_{v=1}^{V}\phi_{k,v}^{j_{k,v}}$$
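The count statistics im,k (words in document m assigned topic k, an M × K matrix I) and jk,v (occurrences of dictionary word v assigned topic k, a K × V matrix J) can be accumulated in one pass over the corpus. A sketch in Python, using a hypothetical toy corpus of dictionary indices (V = 4) and topic assignments (K = 2):

```python
import numpy as np

K, V = 2, 4
# Hypothetical toy corpus: word indices and current topic assignments.
W = [[0, 1, 1, 2], [3, 0]]
Z = [[0, 1, 1, 0], [1, 0]]

M = len(W)
I = np.zeros((M, K), dtype=int)  # I[m, k]: words in document m with topic k
J = np.zeros((K, V), dtype=int)  # J[k, v]: occurrences of word v with topic k

for m in range(M):
    for w, z in zip(W[m], Z[m]):
        I[m, z] += 1
        J[z, w] += 1

# Row m of I sums to N_m; all of J sums to the total corpus length.
assert I.sum(axis=1).tolist() == [len(doc) for doc in W]
assert J.sum() == sum(len(doc) for doc in W)
```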
Calculation of p(Θ, Φ, W, Z; α, β)
$$
\begin{aligned}
p(\Theta,\Phi,W,Z;\alpha,\beta) &= p(\Theta;\alpha)\,p(\Phi;\beta)\,p(Z\mid\Theta)\,p(W\mid Z,\Phi) \\
&= \prod_{m=1}^{M}\Big(\frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^K}\prod_{k=1}^{K}\theta_{m,k}^{\alpha-1}\Big) \times \prod_{k=1}^{K}\Big(\frac{\Gamma(V\beta)}{(\Gamma(\beta))^V}\prod_{v=1}^{V}\phi_{k,v}^{\beta-1}\Big) \\
&\quad \times \prod_{m=1}^{M}\prod_{k=1}^{K}\theta_{m,k}^{i_{m,k}} \times \prod_{k=1}^{K}\prod_{v=1}^{V}\phi_{k,v}^{j_{k,v}} \\
&= \underbrace{\prod_{m=1}^{M}\Big(\frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^K}\prod_{k=1}^{K}\theta_{m,k}^{i_{m,k}+\alpha-1}\Big)}_{p(Z\mid\Theta)\,p(\Theta;\alpha)} \times \underbrace{\prod_{k=1}^{K}\Big(\frac{\Gamma(V\beta)}{(\Gamma(\beta))^V}\prod_{v=1}^{V}\phi_{k,v}^{j_{k,v}+\beta-1}\Big)}_{p(W\mid Z,\Phi)\,p(\Phi;\beta)}
\end{aligned}
$$
Because the hyper-parameters α and β are pre-specified fixed values, they can be omitted, and the compact representation can be written as follows.
$$p(\Theta,\Phi,W,Z) = \prod_{m=1}^{M}\Big(\frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^K}\prod_{k=1}^{K}\theta_{m,k}^{i_{m,k}+\alpha-1}\Big) \times \prod_{k=1}^{K}\Big(\frac{\Gamma(V\beta)}{(\Gamma(\beta))^V}\prod_{v=1}^{V}\phi_{k,v}^{j_{k,v}+\beta-1}\Big)$$
Marginal Distribution
$$p(W,Z) = \int\!\!\int p(Z,W,\Theta,\Phi)\,d\Theta\,d\Phi$$
Evaluating the two Dirichlet integrals gives
$$p(W,Z) = \prod_{m=1}^{M}\frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^K}\frac{\prod_{k=1}^{K}\Gamma(i_{m,k}+\alpha)}{\Gamma\big(\sum_{k=1}^{K}(i_{m,k}+\alpha)\big)} \times \prod_{k=1}^{K}\frac{\Gamma(V\beta)}{(\Gamma(\beta))^V}\frac{\prod_{v=1}^{V}\Gamma(j_{k,v}+\beta)}{\Gamma\big(\sum_{v=1}^{V}(j_{k,v}+\beta)\big)}$$
Because $\frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^K}$ and $\frac{\Gamma(V\beta)}{(\Gamma(\beta))^V}$ are constant values, we can cancel them and get
$$p(Z,W) \propto \prod_{m=1}^{M}\frac{\prod_{k=1}^{K}\Gamma(i_{m,k}+\alpha)}{\Gamma\big(\sum_{k=1}^{K}(i_{m,k}+\alpha)\big)} \times \prod_{k=1}^{K}\frac{\prod_{v=1}^{V}\Gamma(j_{k,v}+\beta)}{\Gamma\big(\sum_{v=1}^{V}(j_{k,v}+\beta)\big)}$$
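Products of Gamma functions overflow floating point almost immediately, so in practice this quantity is evaluated in log space via `gammaln`. A sketch assuming count matrices I (M × K, with im,k) and J (K × V, with jk,v); the toy values are hypothetical:

```python
import numpy as np
from scipy.special import gammaln  # log of the Gamma function, avoids overflow

def log_joint(I, J, alpha, beta):
    """log p(Z, W) up to an additive constant, per the collapsed formula."""
    doc_part = gammaln(I + alpha).sum() - gammaln((I + alpha).sum(axis=1)).sum()
    topic_part = gammaln(J + beta).sum() - gammaln((J + beta).sum(axis=1)).sum()
    return doc_part + topic_part

I = np.array([[2, 2], [1, 1]])              # hypothetical i_{m,k} counts
J = np.array([[2, 0, 1, 0], [0, 2, 0, 1]])  # hypothetical j_{k,v} counts
lp = log_joint(I, J, alpha=0.1, beta=0.01)
```

This log value is useful for monitoring convergence of the sampler across iterations.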
The goal is to sample the topics assigned to all words in all documents. This can be done by iteratively sampling the topic of each single word from its full conditional probability.
If wa,b is the current word, then za,b is the topic assigned to wa,b, and Z¬a,b represents all topic assignments to all other words, excluding za,b.
$$Z = \{z_{a,b}, Z_{\neg a,b}\}$$
We know
$$p(z_{a,b}\mid Z_{\neg a,b}, W) = \frac{p(z_{a,b}, Z_{\neg a,b}, W)}{p(Z_{\neg a,b}, W)} = \frac{p(z_{a,b}, Z_{\neg a,b}, W)}{\sum_{z_{a,b}} p(z_{a,b}, Z_{\neg a,b}, W)} \propto p(z_{a,b}, Z_{\neg a,b}, W) = p(Z, W)$$
Gibbs sampling is a widely used MCMC method, which tries to infer the value of the whole Z by iteratively calculating the full conditional probability of each single za,b.
As mentioned in the previous section, if we treat all M documents as one single long document with length L, then Z = z1 ⋯ zl ⋯ zL, and za,b corresponds to a specific zl.
$z_l^{(0)}$ is initialized with a random number between 1 and K. $z_l^{(t)}$ is the t-th round sampled instance of zl, and $z_l^{(t+1)}$ is the (t + 1)-th round sampled instance of zl.
Or we can have
$$p(z_l^{(t+1)} = k \mid z_1^{(t+1)}\cdots z_{l-1}^{(t+1)},\, z_{l+1}^{(t)}\cdots z_L^{(t)},\, w_1\cdots w_{l-1}, w_l, w_{l+1}\cdots w_L)$$
$$\propto p(z_1^{(t+1)}\cdots z_{l-1}^{(t+1)},\, z_l = k,\, z_{l+1}^{(t)}\cdots z_L^{(t)},\, w_1\cdots w_{l-1}, w_l, w_{l+1}\cdots w_L)$$
We define f(zl = k) as the unnormalized right-hand side above:
$$f(z_l = k) = p(z_1^{(t+1)}\cdots z_{l-1}^{(t+1)},\, z_l = k,\, z_{l+1}^{(t)}\cdots z_L^{(t)},\, w_1\cdots w_L)$$
We can calculate all K values of f(zl = k) (1 ≤ k ≤ K), and then use normalization to get the value of
$$p(z_l^{(t+1)} = k \mid \cdot) = \frac{f(z_l = k)}{\sum_{k'=1}^{K} f(z_l = k')}$$
Based on the above probability distribution, we can sample the value of $z_l^{(t+1)}$. If l corresponds to the b-th word in the a-th document, then zl = za,b and f(zl = k) = f(za,b = k), where 1 ≤ k ≤ K.
Notes:
In the regular literature on Bayesian inference, generally a large number of samples are generated for the high-dimensional variable Z, and these samples are used to calculate the mean (or expectation) of Z, or the expectation of some function q(Z).
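The normalize-then-sample step described above can be sketched as follows; the weight values are hypothetical:

```python
import numpy as np

def sample_topic(f, rng):
    """Draw a topic index 0..K-1 from unnormalized weights f(z_l = k)."""
    p = np.asarray(f, dtype=float)
    p /= p.sum()                       # normalization step
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
# Hypothetical unnormalized full-conditional values for K = 3 topics.
draws = [sample_topic([0.2, 1.4, 0.4], rng) for _ in range(1000)]
# Topic 1 carries weight 1.4 / 2.0 = 0.7, so it should dominate the draws.
```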
Calculation of f(za,b = k)
$$
\begin{aligned}
f(z_{a,b}=k) = p(Z,W) \propto{}& \prod_{m=1}^{M}\frac{\prod_{k=1}^{K}\Gamma(i_{m,k}+\alpha)}{\Gamma\big(\sum_{k=1}^{K}(i_{m,k}+\alpha)\big)} \times \prod_{k=1}^{K}\frac{\prod_{v=1}^{V}\Gamma(j_{k,v}+\beta)}{\Gamma\big(\sum_{v=1}^{V}(j_{k,v}+\beta)\big)} \\
={}& \underbrace{\prod_{m\neq a}\frac{\prod_{k=1}^{K}\Gamma(i_{m,k}+\alpha)}{\Gamma\big(\sum_{k=1}^{K}(i_{m,k}+\alpha)\big)}}_{\text{excluding the }a\text{-th document}} \times \underbrace{\frac{\prod_{k=1}^{K}\Gamma(i_{a,k}+\alpha)}{\Gamma\big(\sum_{k=1}^{K}(i_{a,k}+\alpha)\big)}}_{\text{only the }a\text{-th document}} \\
&\times \prod_{k=1}^{K}\frac{\overbrace{\Big(\prod_{v\neq w_{a,b}}\Gamma(j_{k,v}+\beta)\Big)}^{\text{excluding }v=w_{a,b}} \times \overbrace{\Gamma(j_{k,w_{a,b}}+\beta)}^{\text{only }v=w_{a,b}}}{\Gamma\big(\sum_{v=1}^{V}(j_{k,v}+\beta)\big)}
\end{aligned}
$$
Dropping the factors that involve neither the a-th document nor the word wa,b, we get
$$f(z_{a,b}=k) = p(Z,W) \propto \frac{\prod_{k=1}^{K}\Gamma(i_{a,k}+\alpha)}{\Gamma\big(\sum_{k=1}^{K}(i_{a,k}+\alpha)\big)} \times \prod_{k=1}^{K}\frac{\Gamma(j_{k,w_{a,b}}+\beta)}{\Gamma\big(\sum_{v=1}^{V}(j_{k,v}+\beta)\big)}$$
Simplifying $\prod_{k=1}^{K}\Gamma(i_{a,k}+\alpha)$
In the a-th document, the b-th word wa,b is assigned the topic za,b. If the word wa,b is excluded, then the number of words assigned to each topic k in the document is defined as $i^{\neg a,b}_{a,k}$.
When k = za,b,
$$i_{a,k} = i^{\neg a,b}_{a,k} + 1 = i^{\neg a,b}_{a,z_{a,b}} + 1$$
$$\Gamma(i_{a,k}+\alpha) = \Gamma(i^{\neg a,b}_{a,k}+\alpha+1) = \Gamma(i^{\neg a,b}_{a,z_{a,b}}+\alpha+1)$$
So, using Γ(x + 1) = xΓ(x),
$$
\begin{aligned}
\prod_{k=1}^{K}\Gamma(i_{a,k}+\alpha) &= \Gamma(i^{\neg a,b}_{a,z_{a,b}}+\alpha+1)\times\prod_{k\neq z_{a,b}}\Gamma(i^{\neg a,b}_{a,k}+\alpha) \\
&= (i^{\neg a,b}_{a,z_{a,b}}+\alpha)\times\prod_{k=1}^{K}\Gamma(i^{\neg a,b}_{a,k}+\alpha)
\end{aligned}
$$
Given the current a and b, because the topics assigned to all words other than wa,b are fixed, for each value of k (1 ≤ k ≤ K) the corresponding value of $i^{\neg a,b}_{a,k}$ is fixed. So $\prod_{k=1}^{K}\Gamma(i^{\neg a,b}_{a,k}+\alpha)$ is a constant independent of k and can be cancelled.
But the value of $(i^{\neg a,b}_{a,z_{a,b}}+\alpha)$ depends on the value of za,b (1 ≤ za,b ≤ K), so we get
$$\prod_{k=1}^{K}\Gamma(i_{a,k}+\alpha) \propto (i^{\neg a,b}_{a,z_{a,b}}+\alpha)$$
Simplifying $\Gamma\big(\sum_{k=1}^{K}(i_{a,k}+\alpha)\big)$
$$
\begin{aligned}
\sum_{k=1}^{K}(i_{a,k}+\alpha) &= (i_{a,z_{a,b}}+\alpha) + \sum_{k\neq z_{a,b}}(i_{a,k}+\alpha) \\
&= (i^{\neg a,b}_{a,z_{a,b}}+\alpha+1) + \sum_{k\neq z_{a,b}}(i^{\neg a,b}_{a,k}+\alpha) \\
&= 1 + \sum_{k=1}^{K}(i^{\neg a,b}_{a,k}+\alpha)
\end{aligned}
$$
So
$$\Gamma\Big(\sum_{k=1}^{K}(i_{a,k}+\alpha)\Big) = \Gamma\Big(1+\sum_{k=1}^{K}(i^{\neg a,b}_{a,k}+\alpha)\Big) = \Big(\sum_{k=1}^{K}(i^{\neg a,b}_{a,k}+\alpha)\Big)\times\Gamma\Big(\sum_{k=1}^{K}(i^{\neg a,b}_{a,k}+\alpha)\Big)$$
Actually, $\sum_{k=1}^{K} i^{\neg a,b}_{a,k}$ is the total number of words in the a-th document without counting wa,b, and this number can be defined as $N_a^{\neg a,b}$.
Obviously,
$$\sum_{k=1}^{K} i^{\neg a,b}_{a,k} = N_a^{\neg a,b} = N_a - 1$$
So
$$\sum_{k=1}^{K}(i^{\neg a,b}_{a,k}+\alpha) = N_a^{\neg a,b} + K\alpha = N_a + K\alpha - 1$$
We can get
$$\Gamma\Big(\sum_{k=1}^{K}(i^{\neg a,b}_{a,k}+\alpha)\Big) = \Gamma(N_a + K\alpha - 1)$$
Finally
$$\Gamma\Big(\sum_{k=1}^{K}(i_{a,k}+\alpha)\Big) = (N_a + K\alpha - 1)\times\Gamma(N_a + K\alpha - 1)$$
Given the current a and b, when calculating f(za,b = k) with 1 ≤ k ≤ K, the whole term $\Gamma\big(\sum_{k=1}^{K}(i_{a,k}+\alpha)\big)$ is a fixed value independent of k and can be cancelled.
Simplifying $\prod_{k=1}^{K}\Gamma(j_{k,w_{a,b}}+\beta)$
Similarly, if we treat all documents as one single large document and exclude wa,b (the b-th word in the a-th document), then we recalculate the number of times each distinct word v is assigned to each topic k in this large document, defined as $j^{\neg a,b}_{k,v}$.
When k ≠ za,b, for any v, we have
$$j_{k,v} = j^{\neg a,b}_{k,v}, \qquad \Gamma(j_{k,v}+\beta) = \Gamma(j^{\neg a,b}_{k,v}+\beta)$$
and in particular
$$j_{k,w_{a,b}} = j^{\neg a,b}_{k,w_{a,b}}, \qquad \Gamma(j_{k,w_{a,b}}+\beta) = \Gamma(j^{\neg a,b}_{k,w_{a,b}}+\beta)$$
When k = za,b and v = wa,b, we have
$$j_{z_{a,b},w_{a,b}} = j^{\neg a,b}_{z_{a,b},w_{a,b}} + 1$$
$$\Gamma(j_{z_{a,b},w_{a,b}}+\beta) = \Gamma(j^{\neg a,b}_{z_{a,b},w_{a,b}}+\beta+1)$$
We get
$$
\begin{aligned}
\prod_{k=1}^{K}\Gamma(j_{k,w_{a,b}}+\beta) &= \Gamma(j_{z_{a,b},w_{a,b}}+\beta)\times\prod_{k\neq z_{a,b}}\Gamma(j_{k,w_{a,b}}+\beta) \\
&= \Gamma(j^{\neg a,b}_{z_{a,b},w_{a,b}}+\beta+1)\times\prod_{k\neq z_{a,b}}\Gamma(j^{\neg a,b}_{k,w_{a,b}}+\beta) \\
&= (j^{\neg a,b}_{z_{a,b},w_{a,b}}+\beta)\times\prod_{k=1}^{K}\Gamma(j^{\neg a,b}_{k,w_{a,b}}+\beta)
\end{aligned}
$$
Given the current a and b, for each value of k (1 ≤ k ≤ K), the corresponding value of $j^{\neg a,b}_{k,w_{a,b}}+\beta$ is fixed. So $\prod_{k=1}^{K}\Gamma(j^{\neg a,b}_{k,w_{a,b}}+\beta)$ is a constant independent of k and can be cancelled.
But the value of $(j^{\neg a,b}_{z_{a,b},w_{a,b}}+\beta)$ depends on the value of za,b (1 ≤ za,b ≤ K), so we get
$$\prod_{k=1}^{K}\Gamma(j_{k,w_{a,b}}+\beta) \propto (j^{\neg a,b}_{z_{a,b},w_{a,b}}+\beta)$$
Simplifying $\prod_{k=1}^{K}\Gamma\big(\sum_{v=1}^{V}(j_{k,v}+\beta)\big)$
When k ≠ za,b, for any v, $j_{k,v} = j^{\neg a,b}_{k,v}$, so
$$\sum_{v=1}^{V} j_{k,v} = \sum_{v=1}^{V} j^{\neg a,b}_{k,v}, \qquad \Gamma\Big(\sum_{v=1}^{V}(j_{k,v}+\beta)\Big) = \Gamma\Big(\sum_{v=1}^{V}(j^{\neg a,b}_{k,v}+\beta)\Big)$$
When k = za,b, we have $j_{z_{a,b},v} = j^{\neg a,b}_{z_{a,b},v}$ for v ≠ wa,b, and $j_{z_{a,b},w_{a,b}} = j^{\neg a,b}_{z_{a,b},w_{a,b}} + 1$.
So we can get
$$\sum_{v=1}^{V} j_{z_{a,b},v} = j_{z_{a,b},w_{a,b}} + \sum_{v\neq w_{a,b}} j_{z_{a,b},v} = 1 + \sum_{v=1}^{V} j^{\neg a,b}_{z_{a,b},v}$$
We further get
$$\sum_{v=1}^{V}(j_{z_{a,b},v}+\beta) = 1 + \sum_{v=1}^{V}(j^{\neg a,b}_{z_{a,b},v}+\beta)$$
$$\Gamma\Big(\sum_{v=1}^{V}(j_{z_{a,b},v}+\beta)\Big) = \Gamma\Big(1+\sum_{v=1}^{V}(j^{\neg a,b}_{z_{a,b},v}+\beta)\Big) = \Big(\sum_{v=1}^{V}(j^{\neg a,b}_{z_{a,b},v}+\beta)\Big)\times\Gamma\Big(\sum_{v=1}^{V}(j^{\neg a,b}_{z_{a,b},v}+\beta)\Big)$$
Separating the k = za,b factor:
$$
\begin{aligned}
\prod_{k=1}^{K}\Gamma\Big(\sum_{v=1}^{V}(j_{k,v}+\beta)\Big) &= \Gamma\Big(\sum_{v=1}^{V}(j_{z_{a,b},v}+\beta)\Big)\times\prod_{k\neq z_{a,b}}\Gamma\Big(\sum_{v=1}^{V}(j_{k,v}+\beta)\Big) \\
&= \Big(\sum_{v=1}^{V}(j^{\neg a,b}_{z_{a,b},v}+\beta)\Big)\times\Gamma\Big(\sum_{v=1}^{V}(j^{\neg a,b}_{z_{a,b},v}+\beta)\Big)\times\prod_{k\neq z_{a,b}}\Gamma\Big(\sum_{v=1}^{V}(j^{\neg a,b}_{k,v}+\beta)\Big) \\
&= \Big(\sum_{v=1}^{V}(j^{\neg a,b}_{z_{a,b},v}+\beta)\Big)\times\prod_{k=1}^{K}\Gamma\Big(\sum_{v=1}^{V}(j^{\neg a,b}_{k,v}+\beta)\Big)
\end{aligned}
$$
Given the current a and b, for each value of k (1 ≤ k ≤ K), the corresponding value of $\sum_{v=1}^{V}(j^{\neg a,b}_{k,v}+\beta)$ is fixed. So $\prod_{k=1}^{K}\Gamma\big(\sum_{v=1}^{V}(j^{\neg a,b}_{k,v}+\beta)\big)$ is a constant independent of k and can be cancelled.
But the value of $\sum_{v=1}^{V}(j^{\neg a,b}_{z_{a,b},v}+\beta)$ depends on the value of za,b (1 ≤ za,b ≤ K), so we get
$$\prod_{k=1}^{K}\Gamma\Big(\sum_{v=1}^{V}(j_{k,v}+\beta)\Big) \propto \sum_{v=1}^{V}(j^{\neg a,b}_{z_{a,b},v}+\beta)$$
Simplifying f(za,b = k)
Combining the three simplifications above and cancelling all factors that do not depend on k, we get (with the ¬a,b counts evaluated at za,b = k)
$$f(z_{a,b}=k) \propto (i^{\neg a,b}_{a,k}+\alpha)\times\frac{j^{\neg a,b}_{k,w_{a,b}}+\beta}{\sum_{v=1}^{V}(j^{\neg a,b}_{k,v}+\beta)}$$
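The simplified full conditional, $f(z_{a,b}=k) \propto (i^{\neg a,b}_{a,k}+\alpha)\,(j^{\neg a,b}_{k,w_{a,b}}+\beta)/\sum_{v}(j^{\neg a,b}_{k,v}+\beta)$, is what a collapsed Gibbs sampler evaluates for every word occurrence. A sketch of one sampling step, assuming count matrices I (M × K) and J (K × V) as defined earlier; the toy corpus values are hypothetical:

```python
import numpy as np

def gibbs_step(a, b, W, Z, I, J, alpha, beta, rng):
    """Resample z_{a,b} from f(k) ∝ (i¬+α)·(j¬+β)/Σ_v(j¬+β), updating counts."""
    w, k = W[a][b], Z[a][b]
    I[a, k] -= 1                       # remove the word: ¬(a,b) counts
    J[k, w] -= 1
    V = J.shape[1]
    f = (I[a] + alpha) * (J[:, w] + beta) / (J.sum(axis=1) + V * beta)
    k = int(rng.choice(len(f), p=f / f.sum()))
    I[a, k] += 1                       # add the word back under the new topic
    J[k, w] += 1
    Z[a][b] = k
    return k

rng = np.random.default_rng(1)
W = [[0, 1, 1, 2], [3, 0]]             # hypothetical word indices, V = 4
Z = [[0, 1, 1, 0], [1, 0]]             # current topic assignments, K = 2
I = np.array([[2, 2], [1, 1]])
J = np.array([[2, 0, 1, 0], [0, 2, 0, 1]])
gibbs_step(0, 0, W, Z, I, J, alpha=0.1, beta=0.01, rng=rng)
assert I.sum() == 6 and J.sum() == 6   # total counts are preserved
```

Note that the denominator $\sum_v(j^{\neg a,b}_{k,v}+\beta)$ becomes `J.sum(axis=1) + V * beta` once the current word has been decremented from the counts.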
Inference of Θ and Φ
Inference of Θ
$$p(\Theta\mid W,Z,\Phi;\alpha,\beta) = p(\Theta\mid Z;\alpha) = \frac{p(\Theta,Z;\alpha)}{p(Z;\alpha)} = \frac{p(\Theta,Z;\alpha)}{\int p(\Theta,Z;\alpha)\,d\Theta} = \frac{p(\Theta;\alpha)\,p(Z\mid\Theta)}{\int p(\Theta;\alpha)\,p(Z\mid\Theta)\,d\Theta}$$
And
$$p(\Theta\mid Z;\alpha) = \prod_{m=1}^{M} p(\theta_m\mid Z_m;\alpha) = \prod_{m=1}^{M}\frac{p(\theta_m;\alpha)\,p(Z_m\mid\theta_m)}{\int p(\theta_m;\alpha)\,p(Z_m\mid\theta_m)\,d\theta_m}$$
We know
$$p(\theta_m;\alpha) = \frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^K}\prod_{k=1}^{K}\theta_{m,k}^{\alpha-1}, \qquad p(Z_m\mid\theta_m) = \prod_{k=1}^{K}\theta_{m,k}^{i_{m,k}}$$
We can get
$$
\begin{aligned}
p(\theta_m\mid Z_m;\alpha) &= \frac{p(\theta_m;\alpha)\,p(Z_m\mid\theta_m)}{\int p(\theta_m;\alpha)\,p(Z_m\mid\theta_m)\,d\theta_m} \\
&= \frac{\frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^K}\prod_{k=1}^{K}\theta_{m,k}^{\alpha-1}\times\prod_{k=1}^{K}\theta_{m,k}^{i_{m,k}}}{\int\frac{\Gamma(K\alpha)}{(\Gamma(\alpha))^K}\prod_{k=1}^{K}\theta_{m,k}^{\alpha-1}\times\prod_{k=1}^{K}\theta_{m,k}^{i_{m,k}}\,d\theta_m} \\
&= \frac{\prod_{k=1}^{K}\theta_{m,k}^{i_{m,k}+\alpha-1}}{\int\prod_{k=1}^{K}\theta_{m,k}^{i_{m,k}+\alpha-1}\,d\theta_m} \\
&= \frac{\Gamma\big(\sum_{k=1}^{K}(\alpha+i_{m,k})\big)}{\prod_{k=1}^{K}\Gamma(\alpha+i_{m,k})}\prod_{k=1}^{K}\theta_{m,k}^{i_{m,k}+\alpha-1}
\end{aligned}
$$
This is a Dirichlet distribution with parameters (α + im,1, …, α + im,K), so its expectation is
$$E(\theta_{m,k}) = \frac{\alpha+i_{m,k}}{\sum_{k=1}^{K}(\alpha+i_{m,k})} = \frac{\alpha+i_{m,k}}{K\alpha+N_m}$$
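Given the counts im,k from one retained sample of Z, this posterior mean is a one-line computation. A sketch with hypothetical counts:

```python
import numpy as np

alpha, K = 0.1, 2
I = np.array([[2, 2], [1, 1]])           # hypothetical i_{m,k} counts
N = I.sum(axis=1, keepdims=True)         # N_m, words per document

Theta_hat = (I + alpha) / (N + K * alpha)  # E(theta_{m,k}) = (α+i)/(Kα+N_m)
assert np.allclose(Theta_hat.sum(axis=1), 1.0)  # each row is a distribution
```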
Inference of Φ
$$
\begin{aligned}
p(\Phi\mid W,Z,\Theta;\alpha,\beta) &= p(\Phi\mid W,Z;\beta) \\
&= \frac{p(\Phi,W,Z;\beta)}{p(W,Z;\beta)} = \frac{p(\Phi,W,Z;\beta)}{\int p(\Phi,W,Z;\beta)\,d\Phi} \\
&= \frac{p(W\mid\Phi,Z)\,p(\Phi,Z)}{\int p(W\mid\Phi,Z)\,p(\Phi,Z)\,d\Phi} = \frac{p(W\mid\Phi,Z)\,p(\Phi)\,p(Z)}{\int p(W\mid\Phi,Z)\,p(\Phi)\,p(Z)\,d\Phi} = \frac{p(W\mid\Phi,Z)\,p(\Phi)}{\int p(W\mid\Phi,Z)\,p(\Phi)\,d\Phi}
\end{aligned}
$$
Notes:
When there is no condition or observation, Φ and Z are independent of each other, so we have p(Φ, Z) = p(Φ)p(Z).
For a word wl, when we already know its assigned topic zl, we can organize all words W into K groups. Each group Wk (1 ≤ k ≤ K) contains all words sampled from ϕk with the topic k. Every word in Wk has the same topic k, and Zk is used to represent all topic assignments to Wk.
$$p(\Phi\mid W,Z;\beta) = \prod_{k=1}^{K} p(\phi_k\mid W_k,Z_k;\beta) = \prod_{k=1}^{K}\frac{p(W_k\mid\phi_k,Z_k)\,p(\phi_k)}{\int p(W_k\mid\phi_k,Z_k)\,p(\phi_k)\,d\phi_k} = \prod_{k=1}^{K}\frac{p(W_k\mid\phi_k)\,p(\phi_k)}{\int p(W_k\mid\phi_k)\,p(\phi_k)\,d\phi_k}$$
We know
$$p(W_k\mid\phi_k) = \prod_{v=1}^{V}\phi_{k,v}^{j_{k,v}}, \qquad p(\phi_k) = \frac{\Gamma(V\beta)}{(\Gamma(\beta))^V}\prod_{v=1}^{V}\phi_{k,v}^{\beta-1}$$
By the same derivation as for θm, we can get
$$p(\phi_k\mid W_k;\beta) = \frac{\Gamma\big(\sum_{v=1}^{V}(\beta+j_{k,v})\big)}{\prod_{v=1}^{V}\Gamma(\beta+j_{k,v})}\prod_{v=1}^{V}\phi_{k,v}^{j_{k,v}+\beta-1}$$
which is a Dirichlet distribution with parameters (β + jk,1, …, β + jk,V), so
$$E(\phi_{k,v}) = \frac{\beta+j_{k,v}}{\sum_{v=1}^{V}(\beta+j_{k,v})}$$
The total number of words assigned to the topic k is twk, which is an element of a vector of length K, TW = (tw1 ⋯ twk ⋯ twK).
The total number of words in document m is dwm, which is an element of a vector of length M, DW = (dw1 ⋯ dwm ⋯ dwM).
ITERATIONS is set as the maximal number of iterations for Gibbs sampling. BURNIN is set as the number of burn-in iterations. SAMPLELAG is set as the number of iterations between successive samplings of Θ and Φ.
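Putting the pieces together, here is a minimal collapsed Gibbs sampler sketch using ITERATIONS, BURNIN, and SAMPLELAG as described above (passed as lowercase parameters). The corpus, dimensions, and hyper-parameter values are hypothetical toys; the sampler averages the Θ and Φ posterior-mean estimates over the retained samples.

```python
import numpy as np

def lda_gibbs(W, K, V, alpha, beta, iterations, burnin, samplelag, seed=0):
    """Collapsed Gibbs sampling for LDA over a ragged corpus W of word ids."""
    rng = np.random.default_rng(seed)
    M = len(W)
    # Random initialization of topic assignments and count matrices I, J.
    Z = [rng.integers(0, K, size=len(doc)).tolist() for doc in W]
    I = np.zeros((M, K))
    J = np.zeros((K, V))
    for m, doc in enumerate(W):
        for w, z in zip(doc, Z[m]):
            I[m, z] += 1
            J[z, w] += 1
    theta_sum = np.zeros((M, K))
    phi_sum = np.zeros((K, V))
    n_samples = 0
    for t in range(iterations):
        for a, doc in enumerate(W):
            for b, w in enumerate(doc):
                k = Z[a][b]
                I[a, k] -= 1            # ¬(a,b) counts
                J[k, w] -= 1
                f = (I[a] + alpha) * (J[:, w] + beta) / (J.sum(axis=1) + V * beta)
                k = int(rng.choice(K, p=f / f.sum()))
                I[a, k] += 1
                J[k, w] += 1
                Z[a][b] = k
        # After burn-in, retain a sample every `samplelag` iterations.
        if t >= burnin and (t - burnin) % samplelag == 0:
            theta_sum += (I + alpha) / (I.sum(axis=1, keepdims=True) + K * alpha)
            phi_sum += (J + beta) / (J.sum(axis=1, keepdims=True) + V * beta)
            n_samples += 1
    return theta_sum / n_samples, phi_sum / n_samples

# Hypothetical toy corpus of word indices into a V = 4 dictionary.
Theta_hat, Phi_hat = lda_gibbs([[0, 1, 1, 2], [3, 0, 3, 3]], K=2, V=4,
                               alpha=0.1, beta=0.01, iterations=50,
                               burnin=20, samplelag=5)
```

Each retained sample contributes one posterior-mean estimate per document and per topic; averaging them implements the sample-based expectation mentioned in the notes above.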