Dynamic Topic Models
David M. Blei BLEI@CS.PRINCETON.EDU
Computer Science Department, Princeton University, Princeton, NJ 08544, USA

John D. Lafferty LAFFERTY@CS.CMU.EDU
School of Computer Science, Carnegie Mellon University, Pittsburgh PA 15213, USA

Abstract

A family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. The approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. Variational approximations based on Kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. In addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into the contents of a large document collection. The models are demonstrated by analyzing the OCR’ed archives of the journal Science from 1880 through 2000.

Appearing in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006. Copyright 2006 by the author(s)/owner(s).

1. Introduction

Managing the explosion of electronic document archives requires new tools for automatically organizing, searching, indexing, and browsing large collections. Recent research in machine learning and statistics has developed new techniques for finding patterns of words in document collections using hierarchical probabilistic models (Blei et al., 2003; McCallum et al., 2004; Rosen-Zvi et al., 2004; Griffiths and Steyvers, 2004; Buntine and Jakulin, 2004; Blei and Lafferty, 2006). These models are called “topic models” because the discovered patterns often reflect the underlying topics which combined to form the documents. Such hierarchical probabilistic models are easily generalized to other kinds of data; for example, topic models have been used to analyze images (Fei-Fei and Perona, 2005; Sivic et al., 2005), biological data (Pritchard et al., 2000), and survey data (Erosheva, 2002).

In an exchangeable topic model, the words of each document are assumed to be independently drawn from a mixture of multinomials. The mixing proportions are randomly drawn for each document; the mixture components, or topics, are shared by all documents. Thus, each document reflects the components with different proportions. These models are a powerful method of dimensionality reduction for large collections of unstructured documents. Moreover, posterior inference at the document level is useful for information retrieval, classification, and topic-directed browsing.

Treating words exchangeably is a simplification that is consistent with the goal of identifying the semantic themes within each document. For many collections of interest, however, the implicit assumption of exchangeable documents is inappropriate. Document collections such as scholarly journals, email, news articles, and search query logs all reflect evolving content. For example, the Science article “The Brain of Professor Laborde” may be on the same scientific path as the article “Reshaping the Cortical Motor Map by Unmasking Latent Intracortical Connections,” but the study of neuroscience looked much different in 1903 than it did in 1991. The themes in a document collection evolve over time, and it is of interest to explicitly model the dynamics of the underlying topics.

In this paper, we develop a dynamic topic model which captures the evolution of topics in a sequentially organized corpus of documents. We demonstrate its applicability by analyzing over 100 years of OCR’ed articles from the journal Science, which was founded in 1880 by Thomas Edison and has been published through the present. Under this model, articles are grouped by year, and each year’s articles arise from a set of topics that have evolved from the last year’s topics.

In the subsequent sections, we extend classical state space models to specify a statistical model of topic evolution. We then develop efficient approximate posterior inference techniques for determining the evolving topics from a sequential collection of documents. Finally, we present qualitative results that demonstrate how dynamic topic models allow the exploration of a large document collection in new ways, and quantitative results that demonstrate greater predictive accuracy when compared with static topic models.

2. Dynamic Topic Models

While traditional time series modeling has focused on continuous data, topic models are designed for categorical data. Our approach is to use state space models on the natural parameter space of the underlying topic multinomials, as well as on the natural parameters for the logistic normal distributions used for modeling the document-specific topic proportions.

First, we review the underlying statistical assumptions of a static topic model, such as latent Dirichlet allocation (LDA) (Blei et al., 2003). Let $\beta_{1:K}$ be $K$ topics, each of which is a distribution over a fixed vocabulary. In a static topic model, each document is assumed drawn from the following generative process:

1. Choose topic proportions $\theta$ from a distribution over the $(K-1)$-simplex, such as a Dirichlet.
2. For each word:
   (a) Choose a topic assignment $Z \sim \mathrm{Mult}(\theta)$.
   (b) Choose a word $W \sim \mathrm{Mult}(\beta_z)$.

This process implicitly assumes that the documents are drawn exchangeably from the same set of topics. For many collections, however, the order of the documents reflects an evolving set of topics. In a dynamic topic model, we suppose that the data is divided by time slice, for example by year. We model the documents of each slice with a $K$-component topic model, where the topics associated with slice $t$ evolve from the topics associated with slice $t-1$.

For a $K$-component model with $V$ terms, let $\beta_{t,k}$ denote the $V$-vector of natural parameters for topic $k$ in slice $t$. The usual representation of a multinomial distribution is by its mean parameterization. If we denote the mean parameter of a $V$-dimensional multinomial by $\pi$, the $i$th component of the natural parameter is given by the mapping $\beta_i = \log(\pi_i / \pi_V)$. In typical language modeling applications, Dirichlet distributions are used to model uncertainty about the distributions over words. However, the Dirichlet is not amenable to sequential modeling. Instead, we chain the natural parameters of each topic $\beta_{t,k}$ in a state space model that evolves with Gaussian noise; the simplest version of such a model is

\[ \beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}(\beta_{t-1,k}, \sigma^2 I). \tag{1} \]

Our approach is thus to model sequences of compositional random variables by chaining Gaussian distributions in a dynamic model and mapping the emitted values to the simplex. This is an extension of the logistic normal distribution (Aitchison, 1982) to time-series simplex data (West and Harrison, 1997).

In LDA, the document-specific topic proportions $\theta$ are drawn from a Dirichlet distribution. In the dynamic topic model, we use a logistic normal with mean $\alpha$ to express uncertainty over proportions. The sequential structure between models is again captured with a simple dynamic model

\[ \alpha_t \mid \alpha_{t-1} \sim \mathcal{N}(\alpha_{t-1}, \delta^2 I). \tag{2} \]

For simplicity, we do not model the dynamics of topic correlation, as was done for static models by Blei and Lafferty (2006).

By chaining together topics and topic proportion distributions, we have sequentially tied a collection of topic models. The generative process for slice $t$ of a sequential corpus is thus as follows:

1. Draw topics $\beta_t \mid \beta_{t-1} \sim \mathcal{N}(\beta_{t-1}, \sigma^2 I)$.
2. Draw $\alpha_t \mid \alpha_{t-1} \sim \mathcal{N}(\alpha_{t-1}, \delta^2 I)$.
3. For each document:
   (a) Draw $\eta \sim \mathcal{N}(\alpha_t, a^2 I)$.
   (b) For each word:
      i. Draw $Z \sim \mathrm{Mult}(\pi(\eta))$.
      ii. Draw $W_{t,d,n} \sim \mathrm{Mult}(\pi(\beta_{t,z}))$.

Note that $\pi$ maps the multinomial natural parameters to the mean parameters,

\[ \pi(\beta_{k,t})_w = \frac{\exp(\beta_{k,t,w})}{\sum_{w} \exp(\beta_{k,t,w})}. \]

The graphical model for this generative process is shown in Figure 1. When the horizontal arrows are removed, breaking the time dynamics, the graphical model reduces to a set of independent topic models. With time dynamics, the $k$th topic at slice $t$ has smoothly evolved from the $k$th topic at slice $t-1$.

Figure 1. Graphical representation of a dynamic topic model (for three time slices). Each topic’s natural parameters $\beta_{t,k}$ evolve over time, together with the mean parameters $\alpha_t$ of the logistic normal distribution for the topic proportions.

For clarity of presentation, we now focus on a model with $K$ dynamic topics evolving as in (1), and where the topic proportion model is fixed at a Dirichlet. The technical issues associated with modeling the topic proportions in a time series as in (2) are essentially the same as those for chaining the topics together.

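The generative process for a sequential corpus given in Section 2 can be sketched as a small simulation. This is only an illustrative sketch, not the authors' code: the function names, the default sizes, and the starting state (topics drawn from a standard normal, $\alpha_0 = 0$) are assumptions, since the text specifies only how slice $t$ follows from slice $t-1$.

```python
import numpy as np


def softmax(x):
    """Map natural parameters to mean parameters (the mapping pi in the text)."""
    shifted = x - np.max(x, axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def simulate_dtm(T=3, K=2, V=6, docs_per_slice=4, words_per_doc=20,
                 sigma=0.1, delta=0.1, a=1.0, seed=0):
    """Draw a toy sequential corpus; returns corpus[t][d] = array of word ids."""
    rng = np.random.default_rng(seed)
    # Assumed starting state: the text does not specify beta_0 or alpha_0.
    beta = rng.normal(size=(K, V))
    alpha = np.zeros(K)
    corpus = []
    for _ in range(T):
        # Step 1: topics take a Gaussian random-walk step, as in (1).
        beta = beta + sigma * rng.normal(size=(K, V))
        # Step 2: the logistic normal mean takes a random-walk step, as in (2).
        alpha = alpha + delta * rng.normal(size=K)
        topics = softmax(beta)  # K rows, each a distribution over V terms
        slice_docs = []
        for _ in range(docs_per_slice):
            # Step 3(a): document-level natural parameters.
            eta = alpha + a * rng.normal(size=K)
            theta = softmax(eta)
            # Step 3(b)i: one topic assignment per word.
            z = rng.choice(K, size=words_per_doc, p=theta)
            # Step 3(b)ii: each word is drawn from its assigned topic.
            words = np.array([rng.choice(V, p=topics[k]) for k in z])
            slice_docs.append(words)
        corpus.append(slice_docs)
    return corpus
```

Calling `simulate_dtm()` returns three slices of four documents each; setting `sigma=0` and `delta=0` freezes the dynamics, so every slice is drawn from one fixed set of topics and one fixed $\alpha$, i.e. a static logistic normal topic model.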
3. Approximate Inference

Working with time series over the natural parameters enables the use of Gaussian models for the time dynamics; however, due to the nonconjugacy of the Gaussian and multinomial models, posterior inference is intractable. In this section, we present a variational method for approximate posterior inference. We use variational methods as deterministic alternatives to stochastic simulation, in order to handle the large data sets typical of text analysis. While Gibbs sampling has been effectively used for static topic models (Griffiths and Steyvers, 2004), nonconjugacy makes sampling methods more difficult for this dynamic model.

The idea behind variational methods is to optimize the free parameters of a distribution over the latent variables so that the distribution is close in Kullback-Leibler (KL) divergence to the true posterior; this distribution can then be used as a substitute for the true posterior. In the dynamic topic model, the latent variables are the topics $\beta_{t,k}$, mixture proportions $\theta_{t,d}$, and topic indicators $z_{t,d,n}$. The variational distribution reflects the group structure of the latent variables. There are variational parameters for each topic’s sequence of multinomial parameters, and variational parameters for each of the document-level latent variables. The approximate variational posterior is

\[ \prod_{k=1}^{K} q(\beta_{k,1}, \ldots, \beta_{k,T} \mid \hat\beta_{k,1}, \ldots, \hat\beta_{k,T}) \times \prod_{t=1}^{T} \left( \prod_{d=1}^{D_t} q(\theta_{t,d} \mid \gamma_{t,d}) \prod_{n=1}^{N_{t,d}} q(z_{t,d,n} \mid \phi_{t,d,n}) \right). \tag{3} \]

In the commonly used mean-field approximation, each latent variable is considered independently of the others. In the variational distribution of $\{\beta_{k,1}, \ldots, \beta_{k,T}\}$, however, we retain the sequential structure of the topic by positing a dynamic model with Gaussian “variational observations” $\{\hat\beta_{k,1}, \ldots, \hat\beta_{k,T}\}$. These parameters are fit to minimize the KL divergence between the resulting posterior, which is Gaussian, and the true posterior, which is not Gaussian. (A similar technique for Gaussian processes is described in Snelson and Ghahramani, 2006.)

The variational distribution of the document-level latent variables follows the same form as in Blei et al. (2003). Each proportion vector $\theta_{t,d}$ is endowed with a free Dirichlet parameter $\gamma_{t,d}$, each topic indicator $z_{t,d,n}$ is endowed with a free multinomial parameter $\phi_{t,d,n}$, and optimization proceeds by coordinate ascent. The updates for the document-level variational parameters have a closed form; we use the conjugate gradient method to optimize the topic-level variational observations. The resulting variational approximation for the natural topic parameters $\{\beta_{k,1}, \ldots, \beta_{k,T}\}$ incorporates the time dynamics; we describe one approximation based on a Kalman filter, and a second based on wavelet regression.

3.1. Variational Kalman Filtering

The view of the variational parameters as outputs is based on the symmetry properties of the Gaussian density, $f_{\mu,\Sigma}(x) = f_{x,\Sigma}(\mu)$, which enables the use of the standard forward-backward calculations for linear state space models. The graphical model and its variational approximation are shown in Figure 2. Here the triangles denote variational parameters; they can be thought of as “hypothetical outputs” of the Kalman filter, to facilitate calculation.

Figure 2. A graphical representation of the variational approximation for the time series topic model of Figure 1. The variational parameters $\hat\beta$ and $\hat\alpha$ are thought of as the outputs of a Kalman filter, or as observed data in a nonparametric regression setting.

To explain the main idea behind this technique in a simpler setting, consider the model where unigram models $\beta_t$ (in the natural parameterization) evolve over time. In this model there are no topics and thus no mixing parameters. The calculations are simpler versions of those we need for the more general latent variable models, but exhibit the essential features. Our state space model is

\[ \beta_t \mid \beta_{t-1} \sim \mathcal{N}(\beta_{t-1}, \sigma^2 I) \]
\[ w_{t,n} \mid \beta_t \sim \mathrm{Mult}(\pi(\beta_t)) \]

and we form the variational state space model where

\[ \hat\beta_t \mid \beta_t \sim \mathcal{N}(\beta_t, \hat\nu_t^2 I). \]

The variational parameters are $\hat\beta_t$ and $\hat\nu_t$. Using standard Kalman filter calculations (Kalman, 1960), the forward mean and variance of the variational posterior are given by

\[ m_t \equiv E(\beta_t \mid \hat\beta_{1:t}) = \left( \frac{\hat\nu_t^2}{V_{t-1} + \sigma^2 + \hat\nu_t^2} \right) m_{t-1} + \left( 1 - \frac{\hat\nu_t^2}{V_{t-1} + \sigma^2 + \hat\nu_t^2} \right) \hat\beta_t \]

\[ V_t \equiv E((\beta_t - m_t)^2 \mid \hat\beta_{1:t}) = \left( \frac{\hat\nu_t^2}{V_{t-1} + \sigma^2 + \hat\nu_t^2} \right) (V_{t-1} + \sigma^2) \]

with initial conditions specified by fixed $m_0$ and $V_0$. The backward recursion then calculates the marginal mean and variance of $\beta_t$ given $\hat\beta_{1:T}$ as

\[ \tilde m_{t-1} \equiv E(\beta_{t-1} \mid \hat\beta_{1:T}) = \left( \frac{\sigma^2}{V_{t-1} + \sigma^2} \right) m_{t-1} + \left( 1 - \frac{\sigma^2}{V_{t-1} + \sigma^2} \right) \tilde m_t \]

\[ \tilde V_{t-1} \equiv E((\beta_{t-1} - \tilde m_{t-1})^2 \mid \hat\beta_{1:T}) = V_{t-1} + \left( \frac{V_{t-1}}{V_{t-1} + \sigma^2} \right)^2 \left( \tilde V_t - (V_{t-1} + \sigma^2) \right) \]

with initial conditions $\tilde m_T = m_T$ and $\tilde V_T = V_T$. We approximate the posterior $p(\beta_{1:T} \mid w_{1:T})$ using the state space posterior $q(\beta_{1:T} \mid \hat\beta_{1:T})$. From Jensen’s inequality, the log-likelihood is bounded from below as

\[ \log p(d_{1:T}) \geq \int q(\beta_{1:T} \mid \hat\beta_{1:T}) \log \left( \frac{p(\beta_{1:T})\, p(d_{1:T} \mid \beta_{1:T})}{q(\beta_{1:T} \mid \hat\beta_{1:T})} \right) d\beta_{1:T} \tag{4} \]
\[ = E_q \log p(\beta_{1:T}) + \sum_{t=1}^{T} E_q \log p(d_t \mid \beta_t) + H(q). \]

Details of optimizing this bound are given in an appendix.

3.2. Variational Wavelet Regression

The variational Kalman filter can be replaced with variational wavelet regression; for a readable introduction to standard wavelet methods, see Wasserman (2006). We rescale time so it is between 0 and 1. For 128 years of Science we take $n = 2^J$ and $J = 7$. To be consistent with our earlier notation, we assume that

\[ \hat\beta_t = \tilde m_t + \hat\nu \epsilon_t \]

where $\epsilon_t \sim \mathcal{N}(0, 1)$. Our variational wavelet regression algorithm estimates $\{\hat\beta_t\}$, which we view as observed data, just as in the Kalman filter method, as well as the noise level $\hat\nu$.

For concreteness, we illustrate the technique using the Haar wavelet basis; Daubechies wavelets are used in our actual examples. The model is then

\[ \hat\beta_t = \alpha \phi(x_t) + \sum_{j=0}^{J-1} \sum_{k=0}^{2^j - 1} D_{jk} \psi_{jk}(x_t) \]

where $x_t = t/n$, $\phi(x) = 1$ for $0 \leq x \leq 1$,

\[ \psi(x) = \begin{cases} -1 & \text{if } 0 \leq x \leq \tfrac{1}{2}, \\ 1 & \text{if } \tfrac{1}{2} < x \leq 1 \end{cases} \]

and $\psi_{jk}(x) = 2^{j/2} \psi(2^j x - k)$. Our variational estimate for the posterior mean becomes

\[ \tilde m_t = \hat\alpha \phi(x_t) + \sum_{j=0}^{J-1} \sum_{k=0}^{2^j - 1} \hat D_{jk} \psi_{jk}(x_t), \]

where $\hat\alpha = n^{-1} \sum_{t=1}^{n} \hat\beta_t$, and $\hat D_{jk}$ are obtained by thresholding the coefficients

\[ Z_{jk} = \frac{1}{n} \sum_{t=1}^{n} \hat\beta_t \psi_{jk}(x_t). \]

To estimate $\hat\beta_t$ we use gradient ascent, as for the Kalman filter approximation, requiring the derivatives $\partial \tilde m_t / \partial \hat\beta_t$. If soft thresholding is used, then we have that

\[ \frac{\partial \tilde m_t}{\partial \hat\beta_s} = \frac{\partial \hat\alpha}{\partial \hat\beta_s} \phi(x_t) + \sum_{j=0}^{J-1} \sum_{k=0}^{2^j - 1} \frac{\partial \hat D_{jk}}{\partial \hat\beta_s} \psi_{jk}(x_t) \]

with $\partial \hat\alpha / \partial \hat\beta_s = n^{-1}$ and

\[ \frac{\partial \hat D_{jk}}{\partial \hat\beta_s} = \begin{cases} \frac{1}{n} \psi_{jk}(x_s) & \text{if } |Z_{jk}| > \lambda \\ 0 & \text{otherwise.} \end{cases} \]

Note also that $|Z_{jk}| > \lambda$ if and only if $|\hat D_{jk}| > 0$. These derivatives can be computed using off-the-shelf software for the wavelet transform in any of the standard wavelet bases.

Sample results of running this and the Kalman variational algorithm to approximate a unigram model are given in Figure 3. Both variational approximations smooth out the
local fluctuations in the unigram counts, while preserving the sharp peaks that may indicate a significant change of content in the journal. While the fit is similar to that obtained using standard wavelet regression to the (normalized) counts, the estimates are obtained by minimizing the KL divergence as in standard variational approximations.

In the dynamic topic model of Section 2, the algorithms are essentially the same as those described above. However, rather than fitting the observations from true observed counts, we fit them from expected counts under the document-level variational distributions in (3).

Figure 3. Comparison of the Kalman filter (top) and wavelet regression (bottom) variational approximations to a unigram model. The variational approximations (red and blue curves) smooth out the local fluctuations in the unigram counts (gray curves) of the words shown, while preserving the sharp peaks that may indicate a significant change of content in the journal. The wavelet regression is able to “superresolve” the double spikes in the occurrence of Einstein in the 1920s. (The spike in the occurrence of Darwin near 1910 may be associated with the centennial of Darwin’s birth in 1809.) Panels, left to right: Darwin, Einstein, moon; horizontal axis: year, 1880 to 2000.

4. Analysis of Science

We analyzed a subset of 30,000 articles from Science, 250 from each of the 120 years between 1881 and 1999. Our data were collected by JSTOR (www.jstor.org), a not-for-profit organization that maintains an online scholarly archive obtained by running an optical character recognition (OCR) engine over the original printed journals. JSTOR indexes the resulting text and provides online access to the scanned images of the original content through keyword search.

Our corpus is made up of approximately 7.5 million words. We pruned the vocabulary by stemming each term to its root, removing function terms, and removing terms that occurred fewer than 25 times. The total vocabulary size is 15,955. To explore the corpus and its themes, we estimated a 20-component dynamic topic model. Posterior inference took approximately 4 hours on a 1.5 GHz PowerPC Macintosh laptop. Two of the resulting topics are illustrated in Figure 4, showing the top several words from those topics in each decade, according to the posterior mean number of occurrences as estimated using the Kalman filter variational approximation. Also shown are example articles which exhibit those topics through the decades. As illustrated, the model captures different scientific themes, and can be used to inspect trends of word usage within them.

To validate the dynamic topic model quantitatively, we consider the task of predicting the next year of Science given all the articles from the previous years. We compare the predictive power of three 20-topic models: the dynamic topic model estimated from all of the previous years, a static topic model estimated from all of the previous years, and a static topic model estimated from the single previous year. All the models are estimated to the same convergence criterion. The topic model estimated from all the previous data and dynamic topic model are initialized at the same point.

The dynamic topic model performs well; it always assigns higher likelihood to the next year’s articles than the other two models (Figure 5). It is interesting that the predictive power of each of the models declines over the years. We can tentatively attribute this to an increase in the rate of specialization in scientific language.

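The frequency-based part of the vocabulary pruning described in this section can be sketched as follows. This is a minimal sketch under stated assumptions: documents are assumed to be already tokenized and stemmed, the list of function terms is supplied by the caller, and the function name and signature are hypothetical rather than taken from the paper.

```python
from collections import Counter


def prune_vocabulary(tokenized_docs, min_count=25, function_words=frozenset()):
    """Keep terms that occur at least `min_count` times across the corpus.

    tokenized_docs : list of documents, each a list of (already stemmed) terms
    function_words : terms to drop regardless of frequency (assumed given)
    Returns the sorted vocabulary and the documents re-encoded as term ids.
    """
    # Count corpus-wide occurrences, skipping function terms.
    counts = Counter(
        term
        for doc in tokenized_docs
        for term in doc
        if term not in function_words
    )
    # "Fewer than min_count occurrences" are removed, so keep count >= min_count.
    vocab = sorted(term for term, c in counts.items() if c >= min_count)
    index = {term: i for i, term in enumerate(vocab)}
    # Re-encode each document, silently dropping pruned terms.
    encoded = [[index[term] for term in doc if term in index]
               for doc in tokenized_docs]
    return vocab, encoded
```

With `min_count=25` this matches the threshold stated in the text; the stemming step itself is left out because the paper does not say which stemmer was used.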
Figure 4. Examples from the posterior analysis of a 20-topic dynamic model estimated from the Science corpus. For two topics, we illustrate: (a) the top ten words from the inferred posterior distribution at ten year lags; (b) the posterior estimate of the frequency as a function of year of several words from the same two topics; (c) example articles throughout the collection which exhibit these topics. Note that the plots are scaled to give an idea of the shape of the trajectory of the words’ posterior probability (i.e., comparisons across words are not meaningful).

5. Discussion

We have developed sequential topic models for discrete data by using Gaussian time series on the natural parameters of the multinomial topics and logistic normal topic proportion models. We derived variational inference algorithms that exploit existing techniques for sequential data; we demonstrated a novel use of Kalman filters and wavelet regression as variational approximations. Dynamic topic models can give a more accurate predictive model, and also offer new ways of browsing large, unstructured document collections.

There are many ways that the work described here can be extended. One direction is to use more sophisticated state space models. We have demonstrated the use of a simple Gaussian model, but it would be natural to include a drift term in a more sophisticated autoregressive model to explicitly capture the rise and fall in popularity of a topic, or in the use of specific terms. Another variant would allow for heteroscedastic time series.

Perhaps the most promising extension to the methods presented here is to incorporate a model of how new topics in the collection appear or disappear over time, rather than assuming a fixed number of topics. One possibility is to use a simple Galton-Watson or birth-death process for the topic population. While the analysis of birth-death or branching processes often centers on extinction probabilities, here a goal would be to find documents that may be responsible for spawning new themes in a collection.

Figure 5. This figure illustrates the performance of using dynamic topic models and static topic models for prediction. For each year between 1900 and 2000 (at 5 year increments), we estimated three models on the articles through that year. We then computed the variational bound on the negative log likelihood of next year’s articles under the resulting model (lower numbers are better). DTM is the dynamic topic model; LDA-prev is a static topic model estimated on just the previous year’s articles; LDA-all is a static topic model estimated on all the previous articles. Vertical axis: negative log likelihood (log scale); horizontal axis: year.

Acknowledgments

This research was supported in part by NSF grants IIS-0312814 and IIS-0427206, the DARPA CALO project, and a grant from Google.

References

Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society, Series B, 44(2):139–177.

Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Blei, D. M. and Lafferty, J. D. (2006). Correlated topic models. In Weiss, Y., Schölkopf, B., and Platt, J., editors, Advances in Neural Information Processing Systems 18. MIT Press, Cambridge, MA.

Buntine, W. and Jakulin, A. (2004). Applying discrete PCA in data analysis. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 59–66. AUAI Press.

Erosheva, E. (2002). Grade of membership and latent structure models with application to disability survey data. PhD thesis, Carnegie Mellon University, Department of Statistics.

Fei-Fei, L. and Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. IEEE Computer Vision and Pattern Recognition.

Griffiths, T. and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235.

Kalman, R. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME: Journal of Basic Engineering, 82:35–45.

McCallum, A., Corrada-Emmanuel, A., and Wang, X. (2004). The author-recipient-topic model for topic and role discovery in social networks: Experiments with Enron and academic email. Technical report, University of Massachusetts, Amherst.

Pritchard, J., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155:945–959.

Rosen-Zvi, M., Griffiths, T., Steyvers, M., and Smith, P. (2004). The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 487–494. AUAI Press.

Sivic, J., Russell, B., Efros, A., Zisserman, A., and Freeman, W. (2005). Discovering objects and their location in images. In International Conference on Computer Vision (ICCV 2005).

Snelson, E. and Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-inputs. In Weiss, Y., Schölkopf, B., and Platt, J., editors, Advances in Neural Information Processing Systems 18. MIT Press, Cambridge, MA.

Wasserman, L. (2006). All of Nonparametric Statistics. Springer.

West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models. Springer.

A. Derivation of Variational Algorithm

In this appendix we give some details of the variational algorithm outlined in Section 3.1, which calculates a distribution $q(\beta_{1:T} \mid \hat\beta_{1:T})$ to maximize the lower bound on
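$\log p(d_{1:T})$. The first term of the right-hand side of (4) is

\[ \sum_{t=1}^{T} E_q \log p(\beta_t \mid \beta_{t-1}) = -\frac{VT}{2} \left( \log \sigma^2 + \log 2\pi \right) - \frac{1}{2\sigma^2} \sum_{t=1}^{T} E_q (\beta_t - \beta_{t-1})^T (\beta_t - \beta_{t-1}) \]
\[ = -\frac{VT}{2} \left( \log \sigma^2 + \log 2\pi \right) - \frac{1}{2\sigma^2} \sum_{t=1}^{T} \| \tilde m_t - \tilde m_{t-1} \|^2 - \frac{1}{\sigma^2} \sum_{t=1}^{T} \mathrm{Tr}(\tilde V_t) + \frac{1}{2\sigma^2} \left( \mathrm{Tr}(\tilde V_0) - \mathrm{Tr}(\tilde V_T) \right) \]

using the Gaussian quadratic form identity

\[ E_{m,V} (x - \mu)^T \Sigma^{-1} (x - \mu) = (m - \mu)^T \Sigma^{-1} (m - \mu) + \mathrm{Tr}(\Sigma^{-1} V). \]

The second term of (4) is

\[ \sum_{t=1}^{T} E_q \log p(d_t \mid \beta_t) = \sum_{t=1}^{T} \sum_{w} n_{tw}\, E_q \left( \beta_{tw} - \log \sum_{w'} \exp(\beta_{tw'}) \right) \]
\[ \geq \sum_{t=1}^{T} \left( \sum_{w} n_{tw} \tilde m_{tw} - n_t \hat\zeta_t^{-1} \sum_{w} \exp(\tilde m_{tw} + \tilde V_{tw}/2) \right) + \sum_{t=1}^{T} \left( n_t - n_t \log \hat\zeta_t \right) \]

where $n_t = \sum_w n_{tw}$, introducing additional variational parameters $\hat\zeta_{1:T}$. The third term of (4) is the entropy

\[ H(q) = \sum_{t=1}^{T} \left( \frac{1}{2} \log |\tilde V_t| + \frac{V}{2} \log 2\pi \right) = \frac{1}{2} \sum_{t=1}^{T} \sum_{w} \log \tilde V_{tw} + \frac{TV}{2} \log 2\pi. \]

To maximize the lower bound as a function of the variational parameters we use a conjugate gradient algorithm. First, we maximize with respect to $\hat\zeta$; the derivative is

\[ \frac{\partial \ell}{\partial \hat\zeta_t} = \frac{n_t}{\hat\zeta_t^2} \sum_{w} \exp(\tilde m_{tw} + \tilde V_{tw}/2) - \frac{n_t}{\hat\zeta_t}. \]

Setting to zero and solving for $\hat\zeta_t$ gives

\[ \hat\zeta_t = \sum_{w} \exp(\tilde m_{tw} + \tilde V_{tw}/2). \]

Next, we maximize with respect to $\hat\beta_s$:

\[ \frac{\partial \ell(\hat\beta, \hat\nu)}{\partial \hat\beta_{sw}} = -\frac{1}{\sigma^2} \sum_{t=1}^{T} (\tilde m_{tw} - \tilde m_{t-1,w}) \left( \frac{\partial \tilde m_{tw}}{\partial \hat\beta_{sw}} - \frac{\partial \tilde m_{t-1,w}}{\partial \hat\beta_{sw}} \right) + \sum_{t=1}^{T} \left( n_{tw} - n_t \hat\zeta_t^{-1} \exp(\tilde m_{tw} + \tilde V_{tw}/2) \right) \frac{\partial \tilde m_{tw}}{\partial \hat\beta_{sw}}. \]

The forward-backward equations for $\tilde m_t$ can be used to derive a recurrence for $\partial \tilde m_t / \partial \hat\beta_s$. The forward recurrence is

\[ \frac{\partial m_t}{\partial \hat\beta_s} = \left( \frac{\hat\nu_t^2}{V_{t-1} + \sigma^2 + \hat\nu_t^2} \right) \frac{\partial m_{t-1}}{\partial \hat\beta_s} + \left( 1 - \frac{\hat\nu_t^2}{V_{t-1} + \sigma^2 + \hat\nu_t^2} \right) \delta_{s,t}, \]

with the initial condition $\partial m_0 / \partial \hat\beta_s = 0$. The backward recurrence is then

\[ \frac{\partial \tilde m_{t-1}}{\partial \hat\beta_s} = \left( \frac{\sigma^2}{V_{t-1} + \sigma^2} \right) \frac{\partial m_{t-1}}{\partial \hat\beta_s} + \left( 1 - \frac{\sigma^2}{V_{t-1} + \sigma^2} \right) \frac{\partial \tilde m_t}{\partial \hat\beta_s}, \]

with the initial condition $\partial \tilde m_T / \partial \hat\beta_s = \partial m_T / \partial \hat\beta_s$.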
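The forward and backward recursions of Section 3.1, which this appendix differentiates, can be sketched for a single vocabulary term as follows. This is a hedged illustration rather than the authors' implementation: the function names, the scalar (per-term) treatment, and the default values of $m_0$ and $V_0$ are assumptions. The helper `zeta_hat` evaluates the closed-form update for $\hat\zeta_t$ derived above for one time slice.

```python
import numpy as np


def variational_kalman_smoother(beta_hat, nu_hat_sq, sigma_sq, m0=0.0, V0=1.0):
    """Forward-backward pass over one term's variational observations.

    beta_hat  : length-T sequence of variational observations
    nu_hat_sq : length-T sequence of variational observation variances
    sigma_sq  : variance of the state transition
    m0, V0    : fixed initial mean and variance (default values are assumptions)
    Returns (m_tilde, V_tilde), arrays of length T + 1 indexed by t = 0..T.
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    nu_hat_sq = np.asarray(nu_hat_sq, dtype=float)
    T = beta_hat.shape[0]

    # Forward recursion: filtered mean m_t and variance V_t.
    m = np.empty(T + 1)
    V = np.empty(T + 1)
    m[0], V[0] = m0, V0
    for t in range(1, T + 1):
        pred_var = V[t - 1] + sigma_sq
        weight = nu_hat_sq[t - 1] / (pred_var + nu_hat_sq[t - 1])
        m[t] = weight * m[t - 1] + (1.0 - weight) * beta_hat[t - 1]
        V[t] = weight * pred_var

    # Backward recursion: smoothed mean and variance given all observations.
    m_tilde = np.empty(T + 1)
    V_tilde = np.empty(T + 1)
    m_tilde[T], V_tilde[T] = m[T], V[T]
    for t in range(T, 0, -1):
        pred_var = V[t - 1] + sigma_sq
        weight = sigma_sq / pred_var
        m_tilde[t - 1] = weight * m[t - 1] + (1.0 - weight) * m_tilde[t]
        V_tilde[t - 1] = V[t - 1] + (V[t - 1] / pred_var) ** 2 * (V_tilde[t] - pred_var)
    return m_tilde, V_tilde


def zeta_hat(m_tilde_t, V_tilde_t):
    """Closed-form zeta for one slice: sum over terms of exp(m + V / 2)."""
    m_tilde_t = np.asarray(m_tilde_t, dtype=float)
    V_tilde_t = np.asarray(V_tilde_t, dtype=float)
    return float(np.sum(np.exp(m_tilde_t + V_tilde_t / 2.0)))
```

Because the transition and variational observation covariances are both multiples of the identity, each vocabulary term can be smoothed independently, which is why a scalar routine suffices in this sketch; a full implementation would run it once per term and then combine the per-slice results when evaluating the bound.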
