Richard Golden
(following the approach of Chapter 9 of Manning and Schütze, 2000)
REVISION DATE: April 15 (Tuesday), 2003
VMM (Visible Markov Model)
[State diagram: start state S0 with initial-probability arcs π1 and π2 to states S1 and S2; transition probabilities a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5.]
HMM Notation
• State Sequence Variables: X1, …, XT+1
• Output Sequence Variables: O1, …, OT
• Set of Hidden States (S1, …, SN)
• Output Alphabet (K1, …, KM)
• Initial State Probabilities (π1, …, πN)
πi = p(X1 = Si), i = 1, …, N
• State Transition Probabilities (aij), i, j ∈ {1, …, N}
aij = p(Xt+1 = Sj | Xt = Si), t = 1, …, T
• Emission Probabilities (bik), i ∈ {1, …, N}, k ∈ {1, …, M}
bik = p(Ot = Kk | Xt = Si), t = 1, …, T
HMM State-Emission Representation
[State diagram: start state S0 with π1 = 1 (arc to S1) and π2 = 0 (arc to S2); transitions a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5; emissions from S1: b11 = 0.6, b12 = 0.1, b13 = 0.3; emissions from S2: b21 = 0.1, b22 = 0.7, b23 = 0.2; output alphabet K1, K2, K3.]
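For concreteness, here is a minimal sketch (the names pi, A, B are mine, not the slides') encoding this example's parameters as NumPy arrays; the later sketches reuse these same arrays.

```python
import numpy as np

# Running example: states S1, S2 (0-indexed as 0, 1); outputs K1, K2, K3.
pi = np.array([1.0, 0.0])            # initial state probabilities pi_1, pi_2
A = np.array([[0.7, 0.3],            # a11, a12
              [0.5, 0.5]])           # a21, a22
B = np.array([[0.6, 0.1, 0.3],       # b11, b12, b13 (emissions from S1)
              [0.1, 0.7, 0.2]])      # b21, b22, b23 (emissions from S2)

assert np.allclose(A.sum(axis=1), 1.0)  # each row of A is a distribution
assert np.allclose(B.sum(axis=1), 1.0)  # each row of B is a distribution
```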
Arc-Emission Representation
• Note that sometimes a Hidden Markov Model is represented by having the emission arrows come off the arcs.
• In this situation you would have many more emission arrows, because there are many more arcs…
• But the transition and emission probabilities are the same… it just takes longer to draw on your PowerPoint presentation (self-conscious presentation).
Fundamental Questions for HMMs
• MODEL FIT
– How can we compute the likelihood of observations and hidden states given known emission and transition probabilities?
Compute:
p("Dog"/NOUN, "is"/VERB, "Good"/ADJ | {aij}, {bkm})
• INFERENCE
– How can we find the most likely hidden-state labels given known emission and transition probabilities?
Maximize:
p("Dog"/?, "is"/?, "Good"/? | {aij}, {bkm})
with respect to the unknown labels
Fundamental Questions for HMMs
• LEARNING
– How can we estimate the emission and transition probabilities given observations, assuming that the hidden states are observable during the learning process?
EXAMPLE:
p("Dog"/NOUN, "is"/VERB, "Good"/ADJ | {aij}, {bkm}) =
p("Dog", "is", "Good" | NOUN, VERB, ADJ, {aij}, {bkm}) × p(NOUN, VERB, ADJ | {aij}, {bkm})
Direct Calculation of Likelihood of Labeled Observations
(note use of “Markov” Assumptions)
Part 2
EXAMPLE:
Compute p(“DOG”/NOUN,”is”/VERB,”good”/ADJ|{aij},{bkm})
[State diagram repeated: π1 = 1, π2 = 0; transitions a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5; emissions b11 = 0.6, b12 = 0.1, b13 = 0.3, b21 = 0.1, b22 = 0.7, b23 = 0.2.]
Note that "good" is the name of the dog, so it is a NOUN!
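The factorization above turns into a simple product. Here is a hedged sketch of the "direct calculation" for one labeled sequence; the word-to-output and tag-to-state encoding at the bottom is hypothetical, purely to make the call concrete.

```python
import numpy as np

pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3], [0.5, 0.5]])
B = np.array([[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]])

def labeled_likelihood(states, outputs):
    """p(outputs, states | pi, A, B): multiply the initial probability,
    the emission at each step, and the transition between steps."""
    p = pi[states[0]] * B[states[0], outputs[0]]
    for t in range(1, len(outputs)):
        p *= A[states[t - 1], states[t]] * B[states[t], outputs[t]]
    return p

# Hypothetical encoding: NOUN = S1, VERB = S2, ADJ = S1;
# "DOG" = K3, "is" = K2, "good" = K1 (indices are 0-based).
print(labeled_likelihood(states=[0, 1, 0], outputs=[2, 1, 0]))  # 0.0189
```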
KILLER EQUATION!!!!!
p(O1, …, OT | {aij}, {bkm}) = Σ over all N^(T+1) hidden-state sequences X1, …, XT+1 of πX1 · Π t=1..T [ bXt(Ot) · aXt,Xt+1 ]
Efficiency of Calculations is Important (e.g., Model-Fit)
• Assume 1 multiplication per microsecond.
• Assume an N = 1000 word vocabulary and a T = 7 word sentence.
• Direct calculation takes (2T+1)·N^(T+1) multiplications: (2(7)+1)·1000^(7+1) = 1.5 × 10^25, which is about 475,000 million years of computer time!!! (checked in the sketch below)
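A two-line check of this arithmetic (the one-microsecond-per-multiplication rate is the slide's assumption):

```python
N, T = 1000, 7
mults = (2 * T + 1) * N ** (T + 1)            # (2T+1) * N^(T+1)
years = mults * 1e-6 / (3600 * 24 * 365)      # at 1 multiplication per microsecond
print(f"{mults:.1e} multiplications, about {years:.2e} years")  # ~4.8e11 years
```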
Forward Calculations – Time 2 (1 word example)
[Trellis diagram: the state machine unrolled over times t = 1 to 4, with start state S0 (π1 = 1, π2 = 0) and copies of S1 and S2 in each time slice; transitions a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5; emissions b11 = 0.6, b12 = 0.1, b13 = 0.3, b21 = 0.1, b22 = 0.7, b23 = 0.2 over outputs K1, K2, K3. The time-2 forward values are highlighted.]
Forward Calculations – Time 3 (2 word example)
[Trellis diagram as above; the time-3 forward value α1(3) is highlighted.]
Forward Calculations – Time 4 (3 word example)
[Trellis diagram as above; the time-4 forward values are highlighted.]
Forward Calculation of Likelihood Function ("emit and jump")
t=1 (0-word), t=2 (1-word), t=3 (2-word), t=4 (3-word)
αi(1) = πi; αj(t+1) = Σi αi(t) · bi(Ot) · aij ("emit and jump"); likelihood = Σi αi(T+1)
[Trellis diagram as above, spanning t = 1 through t = 4.]
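Here is a hedged sketch of this forward pass under the "emit and jump" convention (emit Ot from the current state, then jump), reusing the pi, A, B arrays from earlier; outputs are passed as 0-based indices (K1 = 0, K2 = 1, K3 = 2), which is my encoding, not the slides'.

```python
import numpy as np

pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3], [0.5, 0.5]])
B = np.array([[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]])

def forward_likelihood(outputs):
    """p(O1..OT) via the forward recursion:
    alpha_i(1) = pi_i;  alpha_j(t+1) = sum_i alpha_i(t) * b_i(O_t) * a_ij."""
    alpha = pi.copy()
    for o in outputs:
        alpha = (alpha * B[:, o]) @ A   # emit O_t, then jump
    return alpha.sum()                  # sum over states at time T+1

# 2-word example "K2, K1": prints 0.045, agreeing with the backward slides.
print(forward_likelihood([1, 0]))
```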
Backward Calculations – Time 4
[Trellis diagram at time 4, with emissions b11 = 0.6 and b21 = 0.1 shown; the time-4 backward values are initialized to β1(4) = β2(4) = 1.]
Backward Calculations – Time 3
[Trellis diagram at time 3, with emissions b11 = 0.6 and b21 = 0.1 shown; β1(3) = 0.6 and β2(3) = 0.1.]
Backward Calculations – Time 2
NOTE: π1·β1(2) + π2·β2(2) = β1(2) (since π1 = 1, π2 = 0) is the likelihood of the observation/word sequence "K2, K1" in this "2 word example".
β1(4) = 1, β2(4) = 1
β1(3) = 0.6, β2(3) = 0.1
β1(2) = β1(3)·a11·b12 + β2(3)·a12·b12 = 0.045
[Trellis diagram as above, with a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5, b12 = 0.1, b13 = 0.3, b23 = 0.2 shown.]
Backward Calculations – Time 1
[Trellis diagram as above; the time-1 backward values are highlighted.]
Backward Calculation of Likelihood Function ("EMIT AND JUMP")
βi(T+1) = 1; βi(t) = bi(Ot) · Σj aij · βj(t+1); likelihood = Σi πi · βi(1)
[Two trellis diagrams side by side, each spanning t = 1 through t = 4: the forward pass and the backward pass.]
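And the matching backward pass, a sketch under the same convention; for "K2, K1" it reproduces the slides' values 0.6, 0.1, and 0.045 (up to the slides' time-index offset) and the same 0.045 likelihood as the forward pass.

```python
import numpy as np

pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3], [0.5, 0.5]])
B = np.array([[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]])

def backward_likelihood(outputs):
    """p(O1..OT) via the backward recursion:
    beta_i(T+1) = 1;  beta_i(t) = b_i(O_t) * sum_j a_ij * beta_j(t+1)."""
    beta = np.ones(len(pi))
    for o in reversed(outputs):
        beta = B[:, o] * (A @ beta)     # fold in emission and jump at time t
    return pi @ beta                    # weight the time-1 betas by pi

print(backward_likelihood([1, 0]))      # 0.045, as in the forward calculation
```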
Forward Calculations – Time 2 (1 word example)
δ1(1) = π1 = 1
δ2(1) = π2 = 0
δ1(2) = max{δ1(1)·b13·a11, δ2(1)·b23·a21} = 0.21
δ2(2) = max{δ1(1)·b13·a12, δ2(1)·b23·a22} = 0.09
[Trellis diagram with π1 = 1, π2 = 0, b13 = 0.3, b23 = 0.2, a11 = 0.7, a22 = 0.5 shown.]
Backtracking – Time 2 (1 word example)
[Same trellis and δ values as on the previous slide; the maximizing arcs are highlighted for backtracking.]
Forward Calculations – (2 word example)
[Trellis diagram as above; the time-3 δ values for the 2-word case are computed.]
BACKTRACKING – (2 word example)
[Trellis diagram as above; the maximizing arcs are retraced from time 3.]
Formal Analysis of 2 word case
δ1(3) = max{δ1(2)·b12·a11, δ2(2)·b22·a21}
δ1(3) = max{(0.21)(0.1)(0.7), (0.09)(0.7)(0.5)}
δ1(3) = max{0.0147, 0.0315} = 0.0315
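Here is a sketch of the full Viterbi procedure these slides walk through: the forward recursion with max in place of sum, plus backpointers for the backtracking step. The function name and 0-based encoding are mine.

```python
import numpy as np

pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3], [0.5, 0.5]])
B = np.array([[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]])

def viterbi(outputs):
    """Most likely state sequence: delta_j(t+1) = max_i delta_i(t)*b_i(O_t)*a_ij,
    with backpointers psi_j(t+1) recording the maximizing predecessor."""
    delta = pi.copy()
    backptrs = []
    for o in outputs:
        scores = (delta * B[:, o])[:, None] * A   # scores[i, j] for arc i -> j
        backptrs.append(scores.argmax(axis=0))    # psi_j(t+1)
        delta = scores.max(axis=0)
    path = [int(delta.argmax())]                  # best final state
    for psi in reversed(backptrs):                # retrace maximizing arcs
        path.append(int(psi[path[-1]]))
    return list(reversed(path))                   # 0-based states for t = 1..T+1

# "K3, K2" (indices 2, 1): delta(2) = [0.21, 0.09] and delta_1(3) = 0.0315,
# matching the calculations above.
print(viterbi([2, 1]))
```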
[Trellis diagram as above, extended through time 4 for the 3-word case.]
Backtracking to Obtain Labeling for 3 word case
[Trellis diagram as above; the maximizing arcs are retraced from time 4 back to time 1 to recover the most likely label sequence.]
Formal Analysis of 3 word case
ψ1(4) = 2
ψ2(4) = 2
(the most likely predecessor of both S1 and S2 at time 4 is S2, so backtracking selects S2 at time 3)
Third Fundamental Question:
Parameter Estimation
• Make an initial guess for {aij} and {bkm}.
• Compute the probability that one hidden state follows another, given {aij}, {bkm}, and the sequence of observations (computed using the forward-backward algorithm).
• Compute the probability of an observed output given a hidden state, given {aij}, {bkm}, and the sequence of observations (computed using the forward-backward algorithm).
• Use these computed probabilities to make an improved guess for {aij} and {bkm}.
• Repeat this process until convergence, as sketched below.
• It can be shown that this algorithm (Baum-Welch, an instance of EM) does in fact converge to the correct choice of {aij} and {bkm}, assuming that the initial guess was close enough.
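Finally, a rough sketch of one re-estimation step (Baum-Welch), reusing the forward and backward passes above; all names are mine. Iterating this until the likelihood stops improving implements the loop this slide describes.

```python
import numpy as np

def baum_welch_step(pi, A, B, outputs):
    """One Baum-Welch (EM) step for a single observation sequence, under the
    "emit and jump" convention; real code would pool counts over sequences."""
    outputs = np.asarray(outputs)
    T, N = len(outputs), len(pi)

    alpha = np.zeros((T + 1, N))                   # alpha[t] holds alpha(t+1)
    alpha[0] = pi
    for t in range(T):                             # forward pass
        alpha[t + 1] = (alpha[t] * B[:, outputs[t]]) @ A
    beta = np.ones((T + 1, N))                     # beta(T+1) = 1
    for t in range(T - 1, -1, -1):                 # backward pass
        beta[t] = B[:, outputs[t]] * (A @ beta[t + 1])
    likelihood = alpha[T].sum()

    gamma = alpha * beta / likelihood              # p(X_t = S_i | O)
    # xi[t, i, j] = p(X_t = S_i, X_{t+1} = S_j | O)
    xi = (alpha[:T, :, None] * B[:, outputs].T[:, :, None]
          * A[None, :, :] * beta[1:, None, :]) / likelihood

    new_pi = gamma[0]                              # expected time-1 occupancy
    new_A = xi.sum(axis=0) / gamma[:T].sum(axis=0)[:, None]
    new_B = np.stack([gamma[:T][outputs == k].sum(axis=0)
                      for k in range(B.shape[1])], axis=1)
    new_B /= gamma[:T].sum(axis=0)[:, None]        # expected emissions / visits
    return new_pi, new_A, new_B, likelihood
```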