Overview
Statistical modeling addresses the problem of modeling the behavior of a random process. In constructing this model, we typically have at our disposal a sample of output from the process. From the sample, which constitutes an incomplete state of knowledge about the process, the modeling problem is to parlay this knowledge into a succinct, accurate representation of the process. We can then use this representation to make predictions of the future behavior of the process.
Motivating example
Suppose we wish to model an expert translator's decisions concerning the proper French rendering of the English word in. A model p of the expert's decisions assigns to each French word or phrase f an estimate, p(f), of the probability that the expert would choose f as a translation of in. To develop p, we collect a large sample of instances of the expert's decisions.
Motivating example
Our goal is twofold: first, to extract a set of facts about the decision-making process from the sample (the first task of modeling); second, to construct a model of this process (the second task).
Motivating example
One obvious clue we might glean from the sample is the list of allowed translations:

in → {dans, en, à, au cours de, pendant}
With this information in hand, we can impose our first constraint on our model p:
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
This equation represents our first statistic of the process; we can now proceed to search for a suitable model which obeys it. There are an infinite number of models p for which this identity holds.
Motivating example
One model which satisfies the above equation is p(dans) = 1; in other words, the model always predicts dans. Another model which obeys this constraint predicts pendant with a probability of 1/2, and à with a probability of 1/2. But both of these models offend our sensibilities: knowing only that the expert always chose from among these five French phrases, how can we justify either of these probability distributions?
Motivating example
Knowing only that the expert chose exclusively from among these five French phrases, the most intuitively appealing model is the uniform one:

p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5
Motivating example
We might hope to glean more clues about the expert's decisions from our sample. Suppose we notice that the expert chose either dans or en 30% of the time. We can apply this knowledge as a second constraint:

p(dans) + p(en) = 3/10
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

Once again there are many probability distributions consistent with these two constraints. In the absence of any other knowledge, a reasonable choice for p is again the most uniform, that is, the distribution which allocates its probability as evenly as possible, subject to the constraints:

p(dans) = 3/20
p(en) = 3/20
p(à) = 7/30
p(au cours de) = 7/30
p(pendant) = 7/30
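As a quick check of these numbers, here is a minimal Python sketch (not from the original slides) that allocates probability this way and verifies both constraints:

# Split the 3/10 evenly between dans and en, and the remaining
# 7/10 evenly among the other three phrases.
p = {"dans": (3/10) / 2, "en": (3/10) / 2,
     "à": (7/10) / 3, "au cours de": (7/10) / 3, "pendant": (7/10) / 3}

assert abs(p["dans"] + p["en"] - 3/10) < 1e-12   # second constraint
assert abs(sum(p.values()) - 1.0) < 1e-12        # probabilities sum to 1
for phrase, prob in p.items():
    print(f"p({phrase}) = {prob:.4f}")           # 0.1500, 0.1500, 0.2333, ...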
Motivating example
Say we inspect the data once more, and this time notice another interesting fact: in half the cases, the expert chose either dans or à. We can incorporate this information into our model as a third constraint:

p(dans) + p(en) = 3/10
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
p(dans) + p(à) = 1/2
We can once again look for the most uniform p satisfying these constraints, but now the choice is not as obvious.
Motivating example
As we have added complexity, we have encountered two problems:
First, what exactly is meant by uniform, and how can one measure the uniformity of a model? Second, having determined a suitable answer to these questions, how does one find the most uniform model subject to a set of constraints like those we have described?
Motivating example
The maximum entropy method answers both these questions. Intuitively, the principle is simple:
model all that is known and assume nothing about that which is unknown. In other words, given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible.
This is precisely the approach we took in selecting our model p at each step in the above example.
Maxent Modeling
Consider a random process which produces an output value y, a member of a finite set Y.

y may be any word in the set {dans, en, à, au cours de, pendant}
In generating y, the process may be influenced by some contextual information x, a member of a finite set X.

x could include the words in the English sentence surrounding in.
Our task is to construct a stochastic model that accurately represents the behavior of the random process: given a context x, estimate the conditional probability p(y|x) that the process will output y.
Training data
Collect a large number of samples (x1, y1), (x2, y2), ..., (xN, yN). Each sample consists of a phrase x containing the words surrounding in, together with the translation y of in which the process produced.
Typically, a particular pair (x, y) will either not occur at all in the sample, or will occur at most a few times; in practice, such sparse counts call for smoothing.
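The equations below use the empirical distribution of the training data, p~(x, y) = (number of times (x, y) occurs) / N. A minimal sketch of how it is built from samples (the data here is invented for illustration):

from collections import Counter

# Illustrative samples: x = surrounding phrase, y = chosen translation
samples = [("in April", "en"), ("in the box", "dans"),
           ("in April", "en"), ("in the course of", "au cours de")]

N = len(samples)
p_tilde = {pair: c / N for pair, c in Counter(samples).items()}
print(p_tilde)  # most (x, y) pairs never occur, hence the sparsity above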
Indicator function
f(x, y) = 1 if y = en and April follows in
f(x, y) = 0 otherwise
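A direct transcription of this indicator into Python (representing the context x as a plain string is an assumption made for illustration):

def f(x: str, y: str) -> int:
    """Fires when the translation is 'en' and 'April' follows 'in'."""
    words = x.split()
    april_follows_in = any(a == "in" and b == "April"
                           for a, b in zip(words, words[1:]))
    return 1 if (y == "en" and april_follows_in) else 0

print(f("in April", "en"))      # 1
print(f("in the box", "dans"))  # 0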
The expected value of f with respect to the empirical distribution \tilde{p}(x, y) is

\tilde{p}(f) \equiv \sum_{x,y} \tilde{p}(x,y) f(x,y)    (1)

The expected value of f with respect to the model p(y|x) is

p(f) \equiv \sum_{x,y} \tilde{p}(x) p(y|x) f(x,y)    (2)

We constrain the two expected values to be equal:

p(f) = \tilde{p}(f)    (3)
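A short sketch of both expectations in code; the data and the uniform stand-in for p(y|x) are illustrative assumptions:

ys = ["dans", "en", "à", "au cours de", "pendant"]
p_tilde = {("in April", "en"): 0.5, ("in the box", "dans"): 0.5}  # p~(x, y)
p_tilde_x = {"in April": 0.5, "in the box": 0.5}                  # marginal p~(x)

def f(x, y):  # the indicator feature from the previous slide
    return 1 if (y == "en" and "April" in x.split()) else 0

emp = sum(p * f(x, y) for (x, y), p in p_tilde.items())      # equation (1)
mod = sum(px * (1 / len(ys)) * f(x, y)                       # equation (2),
          for x, px in p_tilde_x.items() for y in ys)        # uniform p(y|x)
print(emp, mod)  # constraint (3) demands these two numbers be equal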
Feature: a binary-valued function of (x, y).
Constraint: an equation between the expected value of the feature function in the model and its expected value in the training data.
C \equiv \{ p \in P \mid p(f_i) = \tilde{p}(f_i) \text{ for } i \in \{1, 2, \ldots, n\} \}    (4)
Figure 1: (a) the space P of all probability distributions; (b) one constraint C1; (c) two consistent constraints, C1 ∩ C2 non-empty; (d) two inconsistent constraints, C1 ∩ C2 = ∅.
If we impose no constraints, then all probability models p ∈ P are allowable. Imposing one linear constraint C1 restricts us to those p ∈ P which lie in the region defined by C1. A second linear constraint could determine p exactly, if the two constraints are satisfiable, that is, if the intersection of C1 and C2 is non-empty: p ∈ C1 ∩ C2. Alternatively, a second linear constraint could be inconsistent with the first (i.e., C1 ∩ C2 = ∅); then no p ∈ P can satisfy them both.
A mathematical measure of the uniformity of a conditional distribution p(y|x) is the conditional entropy

H(p) \equiv -\sum_{x,y} \tilde{p}(x) p(y|x) \log p(y|x)    (5)

The principle of maximum entropy selects the model p* in C with maximum H(p):

p^* = \arg\max_{p \in C} H(p)    (6)
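A minimal sketch of equation (5), computing H(p) for an illustrative model (the p~(x) and p(y|x) tables are invented):

import math

p_tilde_x = {"in April": 0.5, "in the box": 0.5}
p = {"in April": {"en": 0.6, "dans": 0.4},   # p(y|x), illustrative numbers
     "in the box": {"dans": 1.0}}

H = -sum(px * pyx * math.log(pyx)
         for x, px in p_tilde_x.items()
         for pyx in p[x].values() if pyx > 0)
print(H)  # larger H means a more uniform conditional model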
Exponential form
The maximum entropy principle presents us with a problem in constrained optimization: find the p ∈ C which maximizes H(p):

p^* = \arg\max_{p \in C} H(p)    (7)
Exponential form
We refer to this as the primal problem; it is a succinct way of saying that we seek to maximize H(p) subject to the following constraints:
1. p(y|x) ≥ 0 for all x, y.
2. \sum_y p(y|x) = 1 for all x. This and the previous condition guarantee that p is a conditional probability distribution.
3. \sum_{x,y} \tilde{p}(x) p(y|x) f_i(x,y) = \sum_{x,y} \tilde{p}(x,y) f_i(x,y) for i ∈ {1, 2, ..., n}. In other words, p ∈ C, and so satisfies the active constraints C.
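As an illustration (a hedged sketch, not part of the original slides), the primal problem for the motivating example can be solved numerically with scipy; the equality constraints are the 3/10 and 1/2 statistics from the earlier slides:

import numpy as np
from scipy.optimize import minimize

# Outcomes, in order: dans, en, à, au cours de, pendant
def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)           # avoid log(0)
    return float(np.sum(p * np.log(p)))  # minimizing this maximizes H(p)

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},       # normalization
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 3/10},  # p(dans)+p(en)
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 1/2},   # p(dans)+p(à)
]
res = minimize(neg_entropy, np.full(5, 0.2), bounds=[(0, 1)] * 5,
               constraints=constraints, method="SLSQP")
print(res.x)  # the maximum entropy model subject to all three constraints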
Exponential form
To solve this optimization problem, introduce the Lagrangian

\Lambda(p, \lambda, \gamma) \equiv \sum_{x,y} \tilde{p}(x) p(y|x) \log p(y|x)
  + \sum_i \lambda_i \Big( \sum_{x,y} \tilde{p}(x,y) f_i(x,y) - \sum_{x,y} \tilde{p}(x) p(y|x) f_i(x,y) \Big)
  + \gamma \Big( \sum_y p(y|x) - 1 \Big)    (8)
Exponential form
Holding λ and γ fixed, we set the derivative of Λ with respect to p(y|x) to zero:

\frac{\partial \Lambda}{\partial p(y|x)} = \tilde{p}(x) \big( \log p(y|x) + 1 \big) - \sum_i \lambda_i \tilde{p}(x) f_i(x,y) + \gamma = 0    (9)

Solving for p(y|x):

\tilde{p}(x) \log p(y|x) = \sum_i \lambda_i \tilde{p}(x) f_i(x,y) - \gamma - \tilde{p}(x)

p(y|x) = \exp\Big( \sum_i \lambda_i f_i(x,y) \Big) \exp\Big( -\gamma / \tilde{p}(x) - 1 \Big)    (10)
Exponential form
We have thus found the parametric form of p*, and so we now take up the task of solving for the optimal values λ*, γ*. Recognizing that the second factor in this equation is fixed by the second of the constraints listed above (\sum_y p(y|x) = 1), we can rewrite (10) as
p^*(y|x) = \frac{1}{Z(x)} \exp\Big( \sum_i \lambda_i f_i(x,y) \Big)    (11)

where Z(x) is a normalizing constant:

Z(x) = \sum_y \exp\Big( \sum_i \lambda_i f_i(x,y) \Big)    (12)
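A sketch of the parametric form (11)-(12) as code; the single feature and its weight are invented for illustration:

import math

ys = ["dans", "en", "à", "au cours de", "pendant"]
features = [lambda x, y: 1 if (y == "en" and "April" in x.split()) else 0]
lambdas = [1.5]  # one weight per feature, chosen arbitrarily here

def p_star(y, x):
    score = lambda yy: math.exp(sum(l * f(x, yy)
                                    for l, f in zip(lambdas, features)))
    Z = sum(score(yy) for yy in ys)   # equation (12)
    return score(y) / Z               # equation (11)

print(p_star("en", "in April"))    # boosted by the active feature
print(p_star("en", "in the box"))  # uniform 1/5 when no feature fires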
Exponential form
We have found γ* but not yet λ*. Towards this end we introduce some further notation. Define the dual function Ψ(λ) as

\Psi(\lambda) \equiv \Lambda(p^*, \lambda, \gamma^*)    (13)

and the dual problem as

\lambda^* = \arg\max_\lambda \Psi(\lambda)    (14)

Since p* and γ* are fixed, the right-hand side of (14) has only the free variables λ = {λ1, λ2, ..., λn}.
Exponential form
Final result
The maximum entropy model subject to the constraints C has the parametric form p* of (11), where λ* can be determined by maximizing the dual function Ψ(λ).
The log-likelihood L_{\tilde{p}}(p) of the empirical distribution \tilde{p} as predicted by a model p is defined by

L_{\tilde{p}}(p) \equiv \log \prod_{x,y} p(y|x)^{\tilde{p}(x,y)} = \sum_{x,y} \tilde{p}(x,y) \log p(y|x)    (15)
Maximum likelihood
It is easy to check that the dual function Ψ(λ) of the previous section is, in fact, just the log-likelihood for the exponential model p_λ; that is,

\Psi(\lambda) = L_{\tilde{p}}(p_\lambda)    (16)

where p_λ has the parametric form of (11). With this interpretation, the result of the previous section can be rephrased as: the model p* ∈ C with maximum entropy is the model in the parametric family p_λ(y|x) that maximizes the likelihood of the training sample \tilde{p}.
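Equation (15) as code, using a uniform stand-in for p(y|x) (an illustrative assumption); maximizing this quantity over the weights λ is exactly the dual problem (14):

import math

ys = ["dans", "en", "à", "au cours de", "pendant"]
p_tilde = {("in April", "en"): 0.5, ("in the box", "dans"): 0.5}

def log_likelihood(model):
    return sum(pxy * math.log(model(y, x))
               for (x, y), pxy in p_tilde.items())

print(log_likelihood(lambda y, x: 1 / len(ys)))  # L of the uniform model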
Maximum likelihood
To verify (16), substitute the parametric form (11) into the Lagrangian (8). Since p* is normalized, the γ term of (8) vanishes. Substituting \log p^*(y|x) = \sum_i \lambda_i f_i(x,y) - \log Z(x) into the first term, the model expectations \sum_i \lambda_i p^*(f_i) cancel against those in the second term, leaving

\Psi(\lambda) \equiv \Lambda(p^*, \lambda, \gamma^*) = \sum_i \lambda_i \tilde{p}(f_i) - \sum_x \tilde{p}(x) \log Z(x) = \sum_{x,y} \tilde{p}(x,y) \log p^*(y|x) = L_{\tilde{p}}(p_\lambda)
                 | maximum likelihood                     | maximum entropy
  problem        | argmax_λ Ψ(λ)                          | argmax_{p∈C} H(p)
  type of search | unconstrained optimization             | constrained optimization
  search domain  | real-valued vectors {λ1, λ2, ..., λn}  | p ∈ C
  solution       | λ*                                     | p*

By the Kuhn-Tucker theorem, the two problems have the same solution: p* = p_{λ*}.
The λ that maximizes Ψ(λ) can be computed by improved iterative scaling:

1. Start with λ_i = 0 for each i ∈ {1, 2, ..., n}.
2. For each i:
   a. Let Δλ_i be the solution to

      \sum_{x,y} \tilde{p}(x) p_\lambda(y|x) f_i(x,y) \exp\big( \Delta\lambda_i f^{\#}(x,y) \big) = \tilde{p}(f_i)    (18)

      where

      f^{\#}(x,y) \equiv \sum_{i=1}^{n} f_i(x,y)    (19)

   b. Update the value of λ_i according to: λ_i ← λ_i + Δλ_i
3. Go to step 2 if not all the λ_i have converged.
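A hedged sketch of this loop. For simplicity it uses the closed-form update Δλ_i = (1/M) log(p~(f_i) / p_λ(f_i)), which solves (18) exactly only when f#(x, y) equals the same constant M for every (x, y) (the generalized iterative scaling special case); the data and features are illustrative:

import math

ys = ["dans", "en", "à", "au cours de", "pendant"]
features = [lambda x, y: 1.0 if y == "en" and "April" in x.split() else 0.0,
            lambda x, y: 1.0 if y == "dans" else 0.0]
p_tilde = {("in April", "en"): 0.5, ("in the box", "dans"): 0.5}
p_tilde_x = {"in April": 0.5, "in the box": 0.5}

def p_lam(y, x, lam):  # the exponential model (11)-(12)
    w = lambda yy: math.exp(sum(l * f(x, yy) for l, f in zip(lam, features)))
    return w(y) / sum(w(yy) for yy in ys)

M = max(sum(f(x, y) for f in features)                    # bound on f#, (19)
        for x in p_tilde_x for y in ys)
lam = [0.0] * len(features)                               # step 1
for _ in range(200):                                      # steps 2-3
    for i, f in enumerate(features):
        emp = sum(p * f(x, y) for (x, y), p in p_tilde.items())  # p~(f_i)
        mod = sum(px * p_lam(y, x, lam) * f(x, y)                # p_lam(f_i)
                  for x, px in p_tilde_x.items() for y in ys)
        lam[i] += math.log(emp / mod) / M                 # step 2(b)

print(lam)  # weights of the maximum entropy / maximum likelihood model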