
Estimating a Dirichlet distribution

Thomas P. Minka
2000 (revised 2003, 2009, 2012)

Abstract

The Dirichlet distribution and its compound variant, the Dirichlet-multinomial, are two
of the most basic models for proportional data, such as the mix of vocabulary words in a
text document. Yet the maximum-likelihood estimate of these distributions is not available in
closed-form. This paper describes simple and efficient iterative schemes for obtaining parameter
estimates in these models. In each case, a fixed-point iteration and a Newton-Raphson (or
generalized Newton-Raphson) iteration is provided.

1 The Dirichlet distribution

The Dirichlet distribution is a model of how proportions vary. Let p denote a random vector whose
elements sum to 1, so that p_k represents the proportion of item k. Under the Dirichlet model with
parameter vector α, the probability density at p is

    p(\mathbf{p}) \sim \mathcal{D}(\alpha_1, \ldots, \alpha_K) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k p_k^{\alpha_k - 1}    (1)

    \text{where } p_k > 0    (2)

    \sum_k p_k = 1    (3)

The parameters can be estimated from a training set of proportions: D = {p_1, ..., p_N}. The
maximum-likelihood estimate of α maximizes p(D|α) = Π_i p(p_i|α). The log-likelihood can be
written

    \log p(D|\alpha) = N \log \Gamma\Big(\sum_k \alpha_k\Big) - N \sum_k \log \Gamma(\alpha_k) + N \sum_k (\alpha_k - 1)\, \overline{\log p_k}    (4)

    \text{where } \overline{\log p_k} = \frac{1}{N} \sum_i \log p_{ik}    (5)

This objective is convex in α since the Dirichlet is in the exponential family. This implies that the
likelihood is unimodal and the maximum can be found by a simple search. A direct convexity proof
has also been given by Ronning (1989). The gradient of the log-likelihood with respect to one α_k is

    g_k = \frac{d \log p(D|\alpha)}{d\alpha_k} = N \Psi\Big(\sum_k \alpha_k\Big) - N \Psi(\alpha_k) + N\, \overline{\log p_k}    (6)

    \Psi(x) = \frac{d \log \Gamma(x)}{dx}    (7)

is known as the digamma function and is similar to the natural logarithm. As always with the
exponential family, when the gradient is zero, the expected sufficient statistics are equal to the
observed sufficient statistics. In this case, the expected sufficient statistics are

    E[\log p_k] = \Psi(\alpha_k) - \Psi\Big(\sum_k \alpha_k\Big)    (8)

and the observed sufficient statistics are \overline{\log p_k}.

A fixed-point iteration for maximizing the likelihood can be derived as follows. Given an initial guess
for α, we construct a simple lower bound on the likelihood which is tight at that guess. The maximum of this
bound is computed in closed-form and it becomes the new guess. Such an iteration is guaranteed
to converge to a stationary point of the likelihood; in fact it is the same principle behind the EM
algorithm (Minka, 1998). For the Dirichlet, the maximum is the only stationary point.

As shown in appendix A, a bound on Γ(Σ_k α_k) leads to the following fixed-point iteration:

    \Psi(\alpha_k^{new}) = \Psi\Big(\sum_k \alpha_k^{old}\Big) + \overline{\log p_k}    (9)

This algorithm requires inverting the Ψ function, a procedure which is described in appendix C.
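To make the scheme concrete, below is a minimal Python sketch of iteration (9), with the Ψ inversion done by the Newton scheme of appendix C. The function names, the SciPy digamma/trigamma calls, and the data layout (the statistics \overline{\log p_k} passed as a vector) are illustrative assumptions, not part of the original text.

    import numpy as np
    from scipy.special import digamma, polygamma

    def inv_digamma(y, iters=5):
        # Solve digamma(x) = y by Newton's method, initialized as in appendix C.
        x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y - digamma(1.0)))
        for _ in range(iters):
            x = x - (digamma(x) - y) / polygamma(1, x)
        return x

    def dirichlet_mle_fixed_point(logp_bar, alpha0, n_iter=1000, tol=1e-10):
        # Iterate (9): digamma(alpha_k_new) = digamma(sum_k alpha_k_old) + logp_bar_k.
        alpha = np.asarray(alpha0, dtype=float)
        for _ in range(n_iter):
            alpha_new = inv_digamma(digamma(alpha.sum()) + logp_bar)
            if np.max(np.abs(alpha_new - alpha)) < tol:
                break
            alpha = alpha_new
        return alpha_new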

Another approach to finding a stationary point is Newton iteration. The second-derivatives, i.e.
Hessian matrix, of the log-likelihood are given by

    \frac{d^2 \log p(D|\alpha)}{d\alpha_k^2} = N \Psi'\Big(\sum_k \alpha_k\Big) - N \Psi'(\alpha_k)    (10)

    \frac{d^2 \log p(D|\alpha)}{d\alpha_k\, d\alpha_j} = N \Psi'\Big(\sum_k \alpha_k\Big) \quad (k \neq j)    (11)

where Ψ'(x) = dΨ(x)/dx is known as the trigamma function. The Hessian can be written in matrix form as

    H = Q + \mathbf{1}\mathbf{1}^T z    (12)

    q_{jk} = -N \Psi'(\alpha_k)\, \delta(j - k)    (13)

    z = N \Psi'\Big(\sum_k \alpha_k\Big)    (14)

One Newton step is therefore

    \alpha^{new} = \alpha^{old} - H^{-1} g    (15)

    H^{-1} = Q^{-1} - \frac{Q^{-1}\mathbf{1}\mathbf{1}^T Q^{-1}}{1/z + \mathbf{1}^T Q^{-1} \mathbf{1}}    (16)

    (H^{-1} g)_k = \frac{g_k - b}{q_{kk}}    (17)

    \text{where } b = \frac{\mathbf{1}^T Q^{-1} g}{1/z + \mathbf{1}^T Q^{-1} \mathbf{1}} = \frac{\sum_j g_j / q_{jj}}{1/z + \sum_j 1/q_{jj}}    (18)

Unlike some Newton algorithms, this one does not require storing or inverting the Hessian matrix
explicitly. The same Newton algorithm was given by Ronning (1989) and by Naryanan (1991).
Naryanan also derives a stopping rule for the iteration.
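A sketch of one Newton step (15)-(18) in the same Python setting (the names are illustrative); it exploits the diagonal-plus-rank-one structure of H so that the Hessian is never formed explicitly.

    import numpy as np
    from scipy.special import digamma, polygamma

    def dirichlet_newton_step(alpha, logp_bar, N):
        # Gradient (6) and the Hessian pieces (13)-(14).
        g = N * (digamma(alpha.sum()) - digamma(alpha) + logp_bar)
        q = -N * polygamma(1, alpha)              # diagonal of Q
        z = N * polygamma(1, alpha.sum())
        b = (g / q).sum() / (1.0 / z + (1.0 / q).sum())   # eq. (18)
        return alpha - (g - b) / q                # eqs. (15) and (17)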

An approximate MLE, useful for initialization, is given by finding the density which matches the
moments of the data. The first two moments of the density are

    E[p_k] = \frac{\alpha_k}{\sum_k \alpha_k}    (19)

    E[p_k^2] = E[p_k]\, \frac{1 + \alpha_k}{1 + \sum_k \alpha_k}    (20)

    \sum_k \alpha_k = \frac{E[p_1] - E[p_1^2]}{E[p_1^2] - E[p_1]^2}    (21)

Multiplying (21) and (19) gives a formula for α_k in terms of moments. Equation (21) uses p_1, but
any other p_k could also be used to estimate Σ_k α_k. Ronning (1989) suggests instead using all of the
p_k's via

    \mathrm{var}(p_k) = \frac{E[p_k](1 - E[p_k])}{1 + \sum_k \alpha_k}    (22)

    \log \sum_k \alpha_k = \frac{1}{K-1} \sum_{k=1}^{K-1} \log\left[ \frac{E[p_k](1 - E[p_k])}{\mathrm{var}(p_k)} - 1 \right]    (23)
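A small Python sketch of the moment-matching initializer (19)-(21); the data layout (a matrix of N probability vectors) and the function name are assumptions of this sketch.

    import numpy as np

    def dirichlet_moment_match(p):
        # p has shape (N, K); rows are observed proportion vectors.
        Ep = p.mean(axis=0)                # E[p_k]
        Ep2 = (p ** 2).mean(axis=0)        # E[p_k^2]
        s = (Ep[0] - Ep2[0]) / (Ep2[0] - Ep[0] ** 2)   # sum_k alpha_k, from (21)
        return s * Ep                      # alpha_k, by multiplying (21) and (19)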

Another approximate MLE, specifically for the case K = 2, is given by Johnson & Kotz (1970):
    \alpha_1 = \frac{1}{2}\, \frac{1 - \bar{p}_2}{1 - \bar{p}_1 - \bar{p}_2}    (24)

    \alpha_2 = \frac{1}{2}\, \frac{1 - \bar{p}_1}{1 - \bar{p}_1 - \bar{p}_2}    (25)

2 Estimating Dirichlet mean and precision separately

The parameters of the Dirichlet can be understood by considering the following alternative
representation:

    s = \sum_k \alpha_k    (26)

    m_k = E[p_k] = \alpha_k / s    (27)

Here m is the mean of the distribution for p and s can be understood as the precision. When s is
large, p is likely to be near m, and when s is small, p is distributed more diffusely. Interpretation of s
and m suggests situations, such as hierarchical modeling, in which we may want to fix one parameter
and only optimize the other. Additionally, s and m are roughly decoupled in the maximum-likelihood
objective, which means we can get simplifications and speedups by optimizing them alternately.
Thus, in this section we will reparameterize the distribution with (s, m) where

    \alpha_k = s\, m_k    (28)

    \sum_k m_k = 1    (29)

2.1 Estimating Dirichlet precision

The likelihood for s alone is


    p(D|s) \propto \left( \frac{\Gamma(s)\, \exp\!\big(s \sum_k m_k \overline{\log p_k}\big)}{\prod_k \Gamma(s m_k)} \right)^{\!N}    (30)

whose derivatives are

    \frac{d \log p(D|s)}{ds} = N \Psi(s) - N \sum_k m_k \Psi(s m_k) + N \sum_k m_k \overline{\log p_k}    (31)

    \frac{d^2 \log p(D|s)}{ds^2} = N \Psi'(s) - N \sum_k m_k^2 \Psi'(s m_k)    (32)

A convergent fixed-point iteration for s is

    \frac{K-1}{s^{new}} = \frac{K-1}{s} - \Psi(s) + \sum_k m_k \Psi(s m_k) - \sum_k m_k \overline{\log p_k}    (33)

Proof  Use the bound (see Appendix E)

    \frac{\Gamma(s)}{\prod_k \Gamma(s m_k)} \geq \exp\big(s b + (K-1)\log(s) + c\big)    (34)

    b = \Psi(s) - \sum_k m_k \Psi(s m_k) - (K-1)/s    (35)

to get

    \log p(D|s) \geq s \sum_k m_k \overline{\log p_k} + s b + (K-1)\log(s) + \text{(const.)}    (36)

from which (33) follows.

This iteration is only first-order convergent because the bound only matches the first derivative
of the likelihood. We can derive a second-order method using the technique of generalized Newton
iteration (Minka, 2000). The idea is to approximate the likelihood by a simpler function, by matching
the first two derivatives at the current guess:

    \frac{\Gamma(s)}{\prod_k \Gamma(s m_k)} \approx \exp\big(s b + a \log(s) + c\big)    (37)

    a = -s^2 \Big( \Psi'(s) - \sum_k m_k^2 \Psi'(s m_k) \Big)    (38)

    b = \Psi(s) - \sum_k m_k \Psi(s m_k) - a/s    (39)

Maximizing the approximation leads to the update

    \frac{1}{s^{new}} = \frac{1}{s} + \frac{1}{s^2} \left( \frac{d^2 \log p(D|s)}{ds^2} \right)^{\!-1} \frac{d \log p(D|s)}{ds}    (40)

This update resembles Newton-Raphson, but converges faster.
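A minimal Python sketch of update (40), assuming m is held fixed and the data are summarized by \overline{\log p_k}; the function name and SciPy calls are illustrative.

    import numpy as np
    from scipy.special import digamma, polygamma

    def dirichlet_precision_update(s, m, logp_bar, N):
        d1 = N * (digamma(s) - (m * digamma(s * m)).sum() + (m * logp_bar).sum())   # (31)
        d2 = N * (polygamma(1, s) - (m * m * polygamma(1, s * m)).sum())            # (32)
        return 1.0 / (1.0 / s + d1 / (s * s * d2))                                   # (40)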

For initialization of s, it is useful to derive a closed-form approximate MLE. Stirling's approximation
to Γ gives

    \frac{\Gamma(s)\, \exp\!\big(s \sum_k m_k \overline{\log p_k}\big)}{\prod_k \Gamma(s m_k)} \approx \left( \frac{s}{2\pi} \right)^{\!(K-1)/2} \prod_k m_k^{1/2}\; \exp\!\Big( s \sum_k m_k \log \frac{\bar{p}_k}{m_k} \Big)    (41)

    \hat{s} \approx \frac{(K-1)/2}{-\sum_k m_k \log(\bar{p}_k / m_k)}    (42)

where \bar{p}_k = \exp(\overline{\log p_k}).
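The closed-form guess (42) is a one-liner in the same Python setting (assuming m and \overline{\log p_k} are stored as vectors):

    import numpy as np

    def dirichlet_precision_init(m, logp_bar):
        # eq. (42): (K-1)/2 divided by sum_k m_k log(m_k / pbar_k)
        K = len(m)
        return 0.5 * (K - 1) / (m * (np.log(m) - logp_bar)).sum()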

2.2 Estimating Dirichlet mean

Now suppose we fix the precision s and want to estimate the mean m. The likelihood for m alone is
    p(D|m) \propto \prod_k \left( \frac{\exp\!\big(s m_k \overline{\log p_k}\big)}{\Gamma(s m_k)} \right)^{\!N}    (43)

Reparameterize with the unconstrained vector z, to get the gradient:

    m_k = \frac{z_k}{\sum_j z_j}    (44)

    \frac{d \log p(D|m)}{dz_k} = \frac{N s}{\sum_j z_j} \Big( \overline{\log p_k} - \Psi(s m_k) - \sum_j m_j \big( \overline{\log p_j} - \Psi(s m_j) \big) \Big)    (45)

The MLE can be computed by the fixed-point iteration

    \Psi(\alpha_k) = \overline{\log p_k} - \sum_j m_j^{old} \big( \overline{\log p_j} - \Psi(s m_j^{old}) \big)    (46)

    m_k^{new} = \frac{\alpha_k}{\sum_j \alpha_j}    (47)

This update converges very quickly.
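A sketch of one pass of (46)-(47), with the Ψ inversion inlined (appendix C); the function name and vectorized layout are assumptions of this sketch.

    import numpy as np
    from scipy.special import digamma, polygamma

    def dirichlet_mean_update(m, s, logp_bar):
        rhs = logp_bar - (m * (logp_bar - digamma(s * m))).sum()   # right-hand side of (46)
        # invert the digamma function by Newton's method (appendix C)
        a = np.where(rhs >= -2.22, np.exp(rhs) + 0.5, -1.0 / (rhs - digamma(1.0)))
        for _ in range(5):
            a = a - (digamma(a) - rhs) / polygamma(1, a)
        return a / a.sum()                                          # normalize, eq. (47)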

3 The Dirichlet-multinomial/Polya distribution

The Dirichlet-multinomial distribution is a compound distribution where p is drawn from a Dirichlet


and then a sample of discrete outcomes x is drawn from a multinomial with probability vector p.
This compounding is essentially a Polya urn scheme, so the Dirichlet-multinomial is also called the
Polya distribution. Let n_k be the number of times the outcome was k, i.e.

    n_k = \sum_j \delta(x_j - k)    (48)

Then the resulting distribution over x, a vector of outcomes, is

    p(x|\alpha) = \int_p p(x|\mathbf{p})\, p(\mathbf{p}|\alpha)\, d\mathbf{p}    (49)

    = \frac{\Gamma(\sum_k \alpha_k)}{\Gamma(\sum_k n_k + \alpha_k)} \prod_k \frac{\Gamma(n_k + \alpha_k)}{\Gamma(\alpha_k)}    (50)

This distribution is also parameterized by α, which can be estimated from a training set of count
vectors: D = {x_1, ..., x_N}. The likelihood is

    n_i = \sum_k n_{ik}    (51)

    p(D|\alpha) = \prod_i p(x_i|\alpha)    (52)

    = \prod_i \frac{\Gamma(\sum_k \alpha_k)}{\Gamma(n_i + \sum_k \alpha_k)} \prod_k \frac{\Gamma(n_{ik} + \alpha_k)}{\Gamma(\alpha_k)}    (53)

The gradient of the log-likelihood is


    g_k = \frac{d \log p(D|\alpha)}{d\alpha_k} = \sum_i \Big[ \Psi\Big(\sum_k \alpha_k\Big) - \Psi\Big(n_i + \sum_k \alpha_k\Big) + \Psi(n_{ik} + \alpha_k) - \Psi(\alpha_k) \Big]    (54)

The maximum can be computed via the fixed-point iteration

    \alpha_k^{new} = \alpha_k\, \frac{\sum_i \big[ \Psi(n_{ik} + \alpha_k) - \Psi(\alpha_k) \big]}{\sum_i \big[ \Psi(n_i + \sum_k \alpha_k) - \Psi(\sum_k \alpha_k) \big]}    (55)

(see appendix B).
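A direct Python sketch of iteration (55); the count-matrix layout (N rows of K counts) and the names are assumptions, not from the paper.

    import numpy as np
    from scipy.special import digamma

    def polya_fixed_point(counts, alpha0, n_iter=1000, tol=1e-10):
        counts = np.asarray(counts, dtype=float)    # shape (N, K), entries n_ik
        n = counts.sum(axis=1)                      # totals n_i
        alpha = np.asarray(alpha0, dtype=float)
        for _ in range(n_iter):
            num = (digamma(counts + alpha) - digamma(alpha)).sum(axis=0)
            den = (digamma(n + alpha.sum()) - digamma(alpha.sum())).sum()
            alpha_new = alpha * num / den           # update (55)
            if np.max(np.abs(alpha_new - alpha)) < tol:
                break
            alpha = alpha_new
        return alpha_new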

Alternatively, there is a simplified Newton iteration as in the Dirichlet case. The Hessian of the
log-likelihood is

    \frac{d^2 \log p(D|\alpha)}{d\alpha_k^2} = \sum_i \Big[ \Psi'\Big(\sum_k \alpha_k\Big) - \Psi'\Big(n_i + \sum_k \alpha_k\Big) + \Psi'(n_{ik} + \alpha_k) - \Psi'(\alpha_k) \Big]    (56)

    \frac{d^2 \log p(D|\alpha)}{d\alpha_k\, d\alpha_j} = \sum_i \Big[ \Psi'\Big(\sum_k \alpha_k\Big) - \Psi'\Big(n_i + \sum_k \alpha_k\Big) \Big] \quad (k \neq j)    (57)

The Hessian can be written in matrix form as

    H = Q + \mathbf{1}\mathbf{1}^T z    (58)

    q_{jk} = \delta(j - k) \sum_i \big[ \Psi'(n_{ik} + \alpha_k) - \Psi'(\alpha_k) \big]    (59)

    z = \sum_i \Big[ \Psi'\Big(\sum_k \alpha_k\Big) - \Psi'\Big(n_i + \sum_k \alpha_k\Big) \Big]    (60)

from which a Newton step can be computed as before. The search can be initialized with the moment
matching estimate where p_ik is approximated by n_ik / n_i.

Another approach is to reduce this problem to the previous one via EM; see appendix D.

A different method is to maximize the leave-one-out (LOO) likelihood instead of the true likelihood.
The LOO likelihood is the product of the probability of each sample given the remaining data and
the parameters. The LOO log-likelihood is
    f(\alpha) = \sum_{ik} n_{ik} \log \frac{n_{ik} - 1 + \alpha_k}{n_i - 1 + \sum_k \alpha_k} = \sum_{ik} n_{ik} \log(n_{ik} - 1 + \alpha_k) - \sum_i n_i \log\Big(n_i - 1 + \sum_k \alpha_k\Big)    (61)

Note that it doesn't involve any special functions. The derivatives are

    \frac{df(\alpha)}{d\alpha_k} = \sum_i \left[ \frac{n_{ik}}{n_{ik} - 1 + \alpha_k} - \frac{n_i}{n_i - 1 + \sum_k \alpha_k} \right]    (62)

    \frac{d^2 f(\alpha)}{d\alpha_k^2} = \sum_i \left[ -\frac{n_{ik}}{(n_{ik} - 1 + \alpha_k)^2} + \frac{n_i}{(n_i - 1 + \sum_k \alpha_k)^2} \right]    (63)

    \frac{d^2 f(\alpha)}{d\alpha_k\, d\alpha_j} = \sum_i \frac{n_i}{(n_i - 1 + \sum_k \alpha_k)^2}    (64)

A convergent fixed-point iteration is

    \alpha_k^{new} = \alpha_k\, \frac{\sum_i \frac{n_{ik}}{n_{ik} - 1 + \alpha_k}}{\sum_i \frac{n_i}{n_i - 1 + \sum_k \alpha_k}}    (65)

Proof  Use the bounds

    \log(n + x) \geq q \log x + (1 - q) \log n - q \log q - (1 - q) \log(1 - q)    (66)

    q = \frac{\hat{x}}{n + \hat{x}}    (67)

    \log(x) \leq a x - 1 + \log \hat{x}    (68)

    a = 1/\hat{x}    (69)

to get

    f(\alpha) \geq \sum_{ik} n_{ik} q_{ik} \log \alpha_k - \sum_i n_i a_i \sum_k \alpha_k + \text{(const.)}    (70)

leading to (65).

The LOO likelihood can be interpreted as the approximation

    \frac{\Gamma(x + n)}{\Gamma(x)} \approx (x + n - 1)^n    (71)
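A sketch of the LOO iteration (65); as the text notes, no special functions are needed. The layout and names follow the earlier sketches and are likewise assumptions.

    import numpy as np

    def polya_loo_fixed_point(counts, alpha0, n_iter=1000, tol=1e-10):
        counts = np.asarray(counts, dtype=float)
        n = counts.sum(axis=1, keepdims=True)       # n_i
        alpha = np.asarray(alpha0, dtype=float)
        for _ in range(n_iter):
            with np.errstate(divide='ignore', invalid='ignore'):
                ratio = np.where(counts > 0, counts / (counts - 1.0 + alpha), 0.0)
            num = ratio.sum(axis=0)
            den = (n / (n - 1.0 + alpha.sum())).sum()
            alpha_new = alpha * num / den           # update (65)
            if np.max(np.abs(alpha_new - alpha)) < tol:
                break
            alpha = alpha_new
        return alpha_new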

4 Estimating Polya mean and precision separately

The parameters of the Polya distribution can be decomposed into mean m and precision s, just
as in the Dirichlet case, and optimization can be done separately on each part. The decomposition
also leads to an interesting interpretation of the Polya, discussed in the next subsection.

4.1 A novel interpretation of the Polya distribution

The Polya distribution can be interpreted as a multinomial distribution over a modified set of counts,
with a special normalizer. To see why, consider the log-probability of the outcomes x under the Polya
versus multinomial:
    \log p(x|\alpha) = \log \Gamma(s) - \log \Gamma(s + n) + \sum_k \big[ \log \Gamma(n_k + s m_k) - \log \Gamma(s m_k) \big]    (72)

    \log p(x|\mathbf{p}) = \sum_k n_k \log p_k    (73)

The multinomial is an exponential family, and the Polya is not. But we can find an approximate
exponential family representation of the Polya, by considering derivatives. In the multinomial case,
the counts can be recovered from the expression

    n_k = p_k \frac{d \log p(x|\mathbf{p})}{dp_k}    (74)

In the Polya case, the analogous expression is

    \hat{n}_k = m_k \frac{d \log p(x|\alpha)}{dm_k}    (75)

    = \alpha_k \big( \Psi(n_k + \alpha_k) - \Psi(\alpha_k) \big) =: \phi(n_k, \alpha_k)    (76)

The log-probability of x under the Polya can thus be approximated by

    \log p(x|\alpha) \approx \sum_k \hat{n}_k \log m_k    (77)

When s → ∞, the Dirichlet-multinomial becomes an ordinary multinomial with p = m, and therefore
the effective counts are the same as the ordinary counts:

    \phi(n_k, \infty) = n_k    (78)

At the other extreme, as s → 0, the Dirichlet-multinomial favors extreme proportions, and the
effective counts are a binarized version of the original counts:

    \phi(n_k, 0) = \begin{cases} 0 & \text{if } n_k = 0, \\ 1 & \text{if } n_k > 0. \end{cases}    (79)

For intermediate values of α_k, the mapping behaves like a logarithm, reducing the influence of large
counts on the likelihood (see figure 1). Thus the Polya can be understood as a multinomial with
damped counts.
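The damping map (76) is a one-liner; a sketch:

    from scipy.special import digamma

    def effective_count(n, a):
        # phi(n, a) = a * (digamma(n + a) - digamma(a)), eq. (76);
        # tends to n as a -> infinity and to 1{n > 0} as a -> 0.
        return a * (digamma(n + a) - digamma(a))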

[Figure 1 plots φ(n_k, α_k) against n_k (from 0 to 10) for α_k = 0.1, 1, 3, 10, and 100.]

Figure 1: The effective counts for the Polya distribution, as a function of the original count n_k and
the parameter α_k.

This representation of the Polya also arises in the estimation of m when s is fixed (section 5).

4.2 Estimating Polya precision

The likelihood for s alone is


    p(D|s) \propto \prod_i \frac{\Gamma(s)}{\Gamma(n_i + s)} \prod_k \frac{\Gamma(n_{ik} + s m_k)}{\Gamma(s m_k)}    (80)

The derivatives are

    \frac{d \log p(D|s)}{ds} = \sum_i \Big[ \Psi(s) - \Psi(n_i + s) + \sum_k m_k \Psi(n_{ik} + s m_k) - m_k \Psi(s m_k) \Big]    (81)

    \frac{d^2 \log p(D|s)}{ds^2} = \sum_i \Big[ \Psi'(s) - \Psi'(n_i + s) + \sum_k m_k^2 \Psi'(n_{ik} + s m_k) - m_k^2 \Psi'(s m_k) \Big]    (82)

A convergent fixed-point iteration is

    s^{new} = s\, \frac{\sum_{ik} m_k \big[ \Psi(n_{ik} + s m_k) - \Psi(s m_k) \big]}{\sum_i \big[ \Psi(n_i + s) - \Psi(s) \big]}    (83)

(the proof is similar to (55)). However, it is very slow. We can get a fast second-order method as
follows. Write f(s) = log p(D|s). When s is small, i.e. the gradient is positive, use the approximation

    \log p(D|s) \approx a \log(s) + c s + k    (84)

    a = -s_0^2 f''(s_0)    (85)

    c = f'(s_0) - a/s_0    (86)

to get the update

    s^{new} = -a/c = \frac{s}{1 + f'(s)/(s f''(s))}    (87)

except when c \geq 0, in which case the solution is s = ∞. When s is large, i.e. the gradient is negative,
use the approximation

    \log p(D|s) \approx \frac{a}{2 s^2} + \frac{c}{s} + k    (88)

    a = s_0^3 \big( s_0 f''(s_0) + 2 f'(s_0) \big)    (89)

    c = -\big( s_0^2 f'(s_0) + a/s_0 \big)    (90)

to get the update

    s^{new} = -a/c = s\, \frac{f''(s) + 2 f'(s)/s}{f''(s) + 3 f'(s)/s}    (91)

For large s, the value of a tends to be numerically unstable. If s f''(s) + 2 f'(s) is within machine
epsilon of zero, then it is better to substitute the limiting value:

    a \approx \sum_i \frac{n_i (n_i - 1)(2 n_i - 1)}{6} - \sum_{ik} \frac{n_{ik} (n_{ik} - 1)(2 n_{ik} - 1)}{6 m_k^2}    (92)

An even faster update for large s is possible by using a richer approximation:

    \log p(D|s) \approx c \log \frac{s}{s + b} + \frac{e}{s + b}    (93)

    c = \sum_{ik} \delta(n_{ik} > 0) - \sum_i \delta(n_i > 0)    (94)

    e = \frac{s_0 + b}{s_0} \big( c b - s_0 (s_0 + b) f'(s_0) \big)    (95)

    b = \mathrm{RootOf}\big( a_0 b^2 + a_1 b + a_2 \big)    (96)

    a_2 = s_0^3 \big( s_0 f''(s_0) + 2 f'(s_0) \big)    (97)

    a_1 = 2 s_0^2 \big( s_0 f''(s_0) + f'(s_0) \big)    (98)

    a_0 = s_0^2 f''(s_0) + c    (99)

The approximation comes from setting c log s to match log p(D|s) as s → 0 and then choosing (b, e)
to match the first two derivatives of f at the current s. The resulting update is

    s^{new} = \frac{c b^2}{e - c b} = \left( \frac{1}{s} - \frac{f'(s)(s + b)^2}{c b^2} \right)^{\!-1}    (100)

Note that a_2 is equivalent to a above and should be corrected for stability via the same method.
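A sketch of the simpler two-regime update (84)-(91) for the Polya precision; it omits the richer approximation (93)-(100) and the stability correction (92), and its names and data layout are assumptions.

    import numpy as np
    from scipy.special import digamma, polygamma

    def polya_precision_update(s, m, counts):
        counts = np.asarray(counts, dtype=float)
        n = counts.sum(axis=1)
        # first and second derivatives of log p(D|s), eqs. (81)-(82)
        d1 = (digamma(s) - digamma(n + s)).sum() \
             + (m * (digamma(counts + s * m) - digamma(s * m))).sum()
        d2 = (polygamma(1, s) - polygamma(1, n + s)).sum() \
             + (m * m * (polygamma(1, counts + s * m) - polygamma(1, s * m))).sum()
        if d1 > 0:                                    # small-s regime, eqs. (84)-(87)
            a = -s * s * d2
            c = d1 - a / s
            return np.inf if c >= 0 else -a / c
        a = s ** 3 * (s * d2 + 2.0 * d1)              # large-s regime, eqs. (88)-(91)
        c = -(s * s * d1 + a / s)
        return -a / c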

The case of large dimension

An interesting special case arises when K is very large. The precision can be estimated simply by
counting the number of singleton elements in each x. Because the precision acts like a smoothing
parameter on the estimate of m, this result is reminiscent of smoothing methods in document
modeling which are based on counting singletons.

If m_k is roughly uniform and K ≫ 1, then α_k ≪ 1 and we can use the approximations

    \Gamma(n_k + \alpha_k) \approx \Gamma(n_k)    (101)

    \Gamma(\alpha_k) \approx 1/\alpha_k    (102)

    p(x|s) \approx \frac{\Gamma(s)}{\Gamma(n + s)} \prod_{k:\, n_k > 0} s\, m_k\, \Gamma(n_k)    (103)

    \propto \frac{\Gamma(s)}{\Gamma(n + s)}\, s^{\tilde{K}}    (104)

where K̃ is the number of unique observations in x. The approximation does not hold if s is large,
which can happen when m is a good match to the data. But if the dimensionality is large enough,
the data will be too sparse for this to happen. The derivatives become

    \frac{d \log p(D|s)}{ds} \approx \sum_i \big[ \Psi(s) - \Psi(n_i + s) + \tilde{K}_i / s \big]    (105)

    \frac{d^2 \log p(D|s)}{ds^2} \approx \sum_i \big[ \Psi'(s) - \Psi'(n_i + s) - \tilde{K}_i / s^2 \big]    (106)

Newton iteration can be used as long as the maximum for s is not on the boundary of (0, ∞). These
boundary cases occur when K̃ = 1 and K̃ = n.

When the gradient is zero, we have

    \tilde{K} = s \big( \Psi(n + s) - \Psi(s) \big) = E[\tilde{K} \mid s, n]    (107)

A convergent fixed-point iteration is

    s^{new} = \frac{\sum_i \tilde{K}_i}{\sum_i \big[ \Psi(n_i + s^{old}) - \Psi(s^{old}) \big]}    (108)

Proof  Use the bound

    \frac{\Gamma(s)}{\Gamma(n + s)} \geq \frac{\Gamma(\hat{s})}{\Gamma(n + \hat{s})} \exp\big( -(s - \hat{s}) b \big)    (109)

    b = \Psi(n + \hat{s}) - \Psi(\hat{s})    (110)

to get

    \log p(D|s) \geq -s \sum_i b_i + \sum_i \tilde{K}_i \log s + \text{(const.)}    (111)

leading to (108).
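A sketch of iteration (108) for the large-dimension case, where K̃_i is obtained by counting the distinct outcomes in each x_i (names and layout are assumptions):

    import numpy as np
    from scipy.special import digamma

    def polya_precision_large_k(counts, s0, n_iter=100, tol=1e-8):
        counts = np.asarray(counts)
        n = counts.sum(axis=1)              # n_i
        k_tilde = (counts > 0).sum(axis=1)  # number of unique outcomes in each x_i
        s = float(s0)
        for _ in range(n_iter):
            s_new = k_tilde.sum() / (digamma(n + s) - digamma(s)).sum()   # update (108)
            if abs(s_new - s) < tol:
                break
            s = s_new
        return s_new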

Applying the large K approximation to the LOO likelihood gives


    t = \sum_{ik} \delta(n_{ik} = 1) \quad \text{(the number of singletons)}    (112)

    f(s) = t \log s - \sum_i n_i \log(n_i - 1 + s)    (113)

    \frac{df(s)}{ds} = \frac{t}{s} - \sum_i \frac{n_i}{n_i - 1 + s}    (114)

For N = 1:

    s = \frac{t(n - 1)}{n - t}    (115)

    \frac{s}{s + n} = \frac{t(n - 1)}{n^2 - t} \approx \frac{t}{n}    (116)

which is the result we wanted.

5 Estimating Polya mean

The likelihood for m only is


    p(D|m) \propto \prod_{ik} \frac{\Gamma(n_{ik} + s m_k)}{\Gamma(s m_k)}    (117)

The maximum can be computed by the fixed-point iteration

    m_k^{new} \propto \sum_i \phi(n_{ik}, s m_k)    (118)

where φ is defined in (76). This update can be understood intuitively as the maximum-likelihood
estimate of a multinomial distribution from effective counts n̂_ik = φ(n_ik, s m_k). The proof of this
iteration is similar to (55).
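A sketch of one pass of (118), reusing the damping map (76); names and layout are assumptions.

    import numpy as np
    from scipy.special import digamma

    def polya_mean_update(m, s, counts):
        counts = np.asarray(counts, dtype=float)
        eff = (s * m) * (digamma(counts + s * m) - digamma(s * m))   # phi(n_ik, s m_k), eq. (76)
        m_new = eff.sum(axis=0)
        return m_new / m_new.sum()                                    # update (118)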

For a Newton-Raphson iteration, reparameterize to get


    m_K = 1 - \sum_{k=1}^{K-1} m_k    (119)

    g_k = \frac{d \log p(D|m)}{dm_k} = s \sum_i \big[ \Psi(n_{ik} + s m_k) - \Psi(s m_k) - \Psi(n_{iK} + s m_K) + \Psi(s m_K) \big]    (120)

    \frac{d^2 \log p(D|m)}{dm_k^2} = s^2 \sum_i \big[ \Psi'(n_{ik} + s m_k) - \Psi'(s m_k) + \Psi'(n_{iK} + s m_K) - \Psi'(s m_K) \big]    (121)

    \frac{d^2 \log p(D|m)}{dm_k\, dm_j} = s^2 \sum_i \big[ \Psi'(n_{iK} + s m_K) - \Psi'(s m_K) \big]    (122)

The search should be initialized at m_k ∝ Σ_i n_ik, since for large s this is the exact optimum.

References
Johnson, N. L., & Kotz, S. (1970). Distributions in statistics: Continuous univariate distributions.
New York: Houghton Mifflin.

Minka, T. P. (1998). Expectation-Maximization as lower bound maximization.
http://research.microsoft.com/~minka/papers/em.html.

Minka, T. P. (2000). Beyond Newton's method.
http://research.microsoft.com/~minka/papers/newton.html.

Naryanan, A. (1991). Algorithm AS 266: Maximum likelihood estimation of the parameters of the
Dirichlet distribution. Applied Statistics, 40, 365-374.
http://www.psc.edu/~burkardt/dirichlet.html.

Ronning, G. (1989). Maximum-likelihood estimation of Dirichlet distributions. Journal of
Statistical Computation and Simulation, 32, 215-221.

A Proof of (9)

Use the bound

    \Gamma(x) \geq \Gamma(\hat{x}) \exp\big( (x - \hat{x})\, \Psi(\hat{x}) \big)    (123)

to get

    \frac{1}{N} \log p(D|\alpha) \geq \Big( \sum_k \alpha_k \Big) \Psi\Big( \sum_k \alpha_k^{old} \Big) - \sum_k \log \Gamma(\alpha_k) + \sum_k (\alpha_k - 1)\, \overline{\log p_k} + \text{(const.)}    (124)

leading to (9).

B Proof of (55)

Use the bound


    \frac{\Gamma(x)}{\Gamma(n + x)} \geq \frac{\Gamma(\hat{x})}{\Gamma(n + \hat{x})} \exp\big( -(x - \hat{x}) b \big)    (125)

    b = \Psi(n + \hat{x}) - \Psi(\hat{x})    (126)

and the bound

    \frac{\Gamma(n + x)}{\Gamma(x)} \geq c\, x^a \quad \text{if } n \geq 1,\ x \geq 0    (127)

    a = \big( \Psi(n + \hat{x}) - \Psi(\hat{x}) \big) \hat{x}    (128)

    c = \frac{\Gamma(n + \hat{x})}{\Gamma(\hat{x})}\, \hat{x}^{-a}    (129)

to get

    \log p(D|\alpha) \geq -\Big( \sum_k \alpha_k \Big) \sum_i b_i + \sum_{ik} a_{ik} \log \alpha_k + \text{(const.)}    (130)

leading to (55).

Proof of (125): Use Jensen's inequality on the integral definition of the Beta function.

    \frac{\Gamma(n)\Gamma(x)}{\Gamma(n + x)} = \int_0^1 t^{x-1} (1 - t)^{n-1}\, dt    (131)

    \geq \exp\left( \int_0^1 q(t) \log \frac{t^{x-1} (1 - t)^{n-1}}{q(t)}\, dt \right)    (132)

    q(t) = \frac{\Gamma(n + \hat{x})}{\Gamma(n)\Gamma(\hat{x})}\, t^{\hat{x}-1} (1 - t)^{n-1}    (133)

Proof of (127): The bound corresponds to a linear expansion of log Γ(n + x) − log Γ(x) in log(x).
Thus we only need to show that log Γ(n + x) − log Γ(x) is convex in log(x). By taking derivatives,
this amounts to proving that:

    h(x, n) = x \big( \Psi(n + x) - \Psi(x) \big) + x^2 \big( \Psi'(n + x) - \Psi'(x) \big) \geq 0 \quad \text{if } n \geq 1    (134)

The digamma function has the following integral representation:

    \Psi(x) = \int_0^\infty \left( \frac{e^{-t}}{t} - \frac{e^{-xt}}{1 - e^{-t}} \right) dt    (135)

Substituting this into (134) gives:

    h(x, n) = \int_0^\infty x (1 - xt)\, e^{-xt}\, \frac{1 - e^{-nt}}{1 - e^{-t}}\, dt    (136)

Divide the integrand into two parts and apply integration by parts:

    f(t) = x (1 - xt)\, e^{-xt}    (137)

    g(t) = \frac{1 - e^{-nt}}{1 - e^{-t}}    (138)

    F(t) = x t e^{-xt} \quad \text{(anti-derivative of } f\text{)}    (139)

    h(x, n) = \int_0^\infty f(t) g(t)\, dt    (140)

    = 0 - \int_0^\infty F(t) g'(t)\, dt    (141)

Since F(t) ≥ 0 for all t ≥ 0, we only have to show g'(t) ≤ 0 for all t ≥ 0. This amounts to showing
that:

    (1 - e^{-t})(n e^{-nt}) \leq (1 - e^{-nt}) e^{-t}    (142)

    e^t - 1 \leq (e^{nt} - 1)/n    (143)

    t + t^2/2 + t^3/3! + \cdots \leq t + n t^2/2 + n^2 t^3/3! + \cdots    (144)

The last line follows from n ≥ 1, which completes the proof. Note that if n ≤ 1, the last inequality
flips to show that the function is concave, i.e. the right-hand side of (127) becomes an upper
bound.

C Inverting the Ψ function

This section describes how to compute a high-accuracy solution to

    \Psi(x) = y    (145)

for x given y. Given a starting guess for x, Newton's method can be used to find the root of
Ψ(x) − y = 0. The Newton update is

    x^{new} = x^{old} - \frac{\Psi(x) - y}{\Psi'(x)}    (146)

To start the iteration, use the following asymptotic formulas for Ψ(x):

    \Psi(x) \approx \begin{cases} \log(x - 1/2) & \text{if } x \geq 0.6 \\ -\frac{1}{x} - \gamma & \text{if } x < 0.6 \end{cases}    (147)

    \gamma = -\Psi(1)    (148)

to get

    \Psi^{-1}(y) \approx \begin{cases} \exp(y) + 1/2 & \text{if } y \geq -2.22 \\ -\frac{1}{y + \gamma} & \text{if } y < -2.22 \end{cases}    (149)

With this initialization, five Newton iterations are sufficient to reach fourteen digits of precision.

D EM for estimation from counts

Any algorithm for α estimation from probability vectors can be turned into an algorithm for α
estimation from counts, by treating the p_i as hidden variables in EM. The E-step computes a
posterior distribution over p_i:

    q(p_i) = \mathcal{D}(n_{i1} + \alpha_1, \ldots, n_{iK} + \alpha_K)    (150)

and the M-step maximizes

    E\Big[ \sum_i \log p(p_i|\alpha) \Big] = N \log \Gamma\Big( \sum_k \alpha_k \Big) - N \sum_k \log \Gamma(\alpha_k) + N \sum_k (\alpha_k - 1)\, \overline{\log p_k}    (151)

    \text{where } \overline{\log p_k} = \frac{1}{N} \sum_i E[\log p_{ik}]    (152)

    = \frac{1}{N} \sum_i \Big[ \Psi(n_{ik} + \alpha_k^{old}) - \Psi\Big( n_i + \sum_k \alpha_k^{old} \Big) \Big]    (153)

This is the same optimization problem as in section 1, with a new definition for p. It is not
necessary or desirable to reach the exact maximum in the M-step; a single Newton step will do.
The Newton step will end up using the old Hessian (10) but the new gradient (54). Compared to
the exact Newton algorithm, this uses half as much computation per iteration, but usually requires
more than twice the iterations.
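A sketch of the E-step statistic (153), which can then be fed to any of the Dirichlet routines of section 1 (names and layout are assumptions):

    import numpy as np
    from scipy.special import digamma

    def expected_log_p(counts, alpha):
        # eq. (153): average of E[log p_ik] under q(p_i) = Dirichlet(n_i + alpha)
        counts = np.asarray(counts, dtype=float)
        n = counts.sum(axis=1, keepdims=True)
        return (digamma(counts + alpha) - digamma(n + alpha.sum())).mean(axis=0)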

E Proof of (34)

We want to prove the bound


    \frac{\Gamma(s)}{\prod_k \Gamma(s m_k)} \geq \exp\big( s b + (K-1) \log(s) + c \big)    (154)

    b = \Psi(s) - \sum_k m_k \Psi(s m_k) - (K-1)/s    (155)

Define the function f(s) via

    f(s) = \log \Gamma(s + 1) - \sum_k \log \Gamma(s m_k + 1)    (156)

    = \log \Gamma(s) - \sum_k \log \Gamma(s m_k) - (K-1) \log(s) + \text{const.}    (157)

The bound is equivalent to saying that f(s) can be lower bounded by a linear function in s at any
point, i.e. f(s) is convex in s. To show this, take the second derivative of f(s):

    \frac{df}{ds} = \Psi(s + 1) - \sum_k \Psi(s m_k + 1)\, m_k    (158)

    \frac{d^2 f}{ds^2} = \Psi'(s + 1) - \sum_k \Psi'(s m_k + 1)\, m_k^2    (159)

We need to show that (159) is always positive. Since the function g(x) = Ψ'(x + 1) x is increasing,
we know that:

    g(s) \geq g(s m_k)    (160)

    \sum_k m_k\, g(s) \geq \sum_k m_k\, g(s m_k)    (161)

    g(s) \geq \sum_k m_k\, g(s m_k)    (162)

    \Psi'(s + 1) \geq \sum_k \Psi'(s m_k + 1)\, m_k^2    (163)

thus f(s) is convex and the bound follows.

