Language Modeling
Pawan Goyal
CSE, IITKGP
Speech Recognition
P(I saw a van) >> P(eyes awe of an)
Machine Translation
Which sentence is more plausible in the target language?
P(high winds) > P(large winds)
Other Applications
Context-Sensitive Spelling Correction
Natural Language Generation
...
Completion Prediction
Predictive text input systems can guess what you are typing and offer choices for completing it.
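As a toy illustration of the idea, a bigram count table can rank candidate completions. The counts below are hypothetical, purely for demonstration:

```python
from collections import Counter

# Hypothetical bigram counts; a real system would estimate these from data
bigram_counts = Counter({
    ("I", "am"): 50,
    ("I", "will"): 30,
    ("I", "would"): 20,
})

def suggest(prev_word, k=3):
    """Rank candidate next words by their bigram count with prev_word."""
    candidates = [(w2, c) for (w1, w2), c in bigram_counts.items() if w1 == prev_word]
    return [w for w, _ in sorted(candidates, key=lambda x: -x[1])[:k]]

print(suggest("I"))  # -> ['am', 'will', 'would']
```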
Goal: compute the probability of a sentence or sequence of words:
P(W) = P(w_1, w_2, w_3, \ldots, w_n)
Related task: the probability of an upcoming word:
P(w_4 \mid w_1, w_2, w_3)
Computing P(W)
Basic Idea
Rely on the Chain Rule of Probability
Recall the definition of conditional probability:
P(B \mid A) = \frac{P(A, B)}{P(A)}
Rewriting: P(A, B) = P(A) \, P(B \mid A)
More Variables
P(A, B, C, D) = P(A) \, P(B \mid A) \, P(C \mid A, B) \, P(D \mid A, B, C)
The Chain Rule in General
P(x_1, x_2, \ldots, x_n) = P(x_1) \, P(x_2 \mid x_1) \, P(x_3 \mid x_1, x_2) \cdots P(x_n \mid x_1, \ldots, x_{n-1})
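For instance, applied to a concrete sentence (the sentence here is only an illustration):

P(\text{its water is so transparent}) = P(\text{its}) \times P(\text{water} \mid \text{its}) \times P(\text{is} \mid \text{its water}) \times P(\text{so} \mid \text{its water is}) \times P(\text{transparent} \mid \text{its water is so})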
Markov Assumption
Approximate the history by only the most recent words: the probability of the next word depends on just the previous k words,
P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-k}, \ldots, w_{i-1})
N-Gram Models
The simplest case is the unigram model, which treats each word as independent of the others; the bigram model conditions each word on the single previous word:
P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})
The same idea extends to trigram, 4-gram, and 5-gram models.
Estimating Bigram Probabilities
The Maximum Likelihood Estimate:
P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}
An Example
P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}
Estimating bigrams
P(I \mid <s>) = 2/3
P(</s> \mid here) = 1
P(would \mid I) = 1/3
P(here \mid am) = 1/2
P(know \mid like) = 0
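A minimal sketch of this MLE computation in Python, using a hypothetical toy corpus chosen to be consistent with the estimates above:

```python
from collections import Counter

# Hypothetical toy corpus, chosen only so that the bigram estimates
# match the numbers on the slide; it is not the original corpus.
corpus = [
    "<s> I am here </s>",
    "<s> who am I </s>",
    "<s> I would like to know </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """MLE estimate: P(word | prev) = c(prev, word) / c(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("<s>", "I"))      # 2/3
print(bigram_prob("here", "</s>"))  # 1.0
print(bigram_prob("I", "would"))    # 1/3
print(bigram_prob("am", "here"))    # 1/2
print(bigram_prob("like", "know"))  # 0.0
```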
Bigram Probabilities
P(english | want) = .0011
P(chinese | want) = .0065
P(to | want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P(i | <s>) = .25
Practical Issues
We do everything in log space:
it avoids numerical underflow, and adding is faster than multiplying:
\log(p_1 p_2 p_3 p_4) = \log p_1 + \log p_2 + \log p_3 + \log p_4
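A small sketch of the log-space computation, assuming a bigram_prob function like the one sketched earlier:

```python
import math

def sentence_logprob(tokens, bigram_prob):
    """Sum log-probabilities rather than multiplying raw probabilities;
    this avoids numerical underflow on long sequences."""
    logp = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = bigram_prob(prev, word)
        if p == 0.0:
            return float("-inf")  # unseen bigram under pure MLE
        logp += math.log(p)
    return logp
```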
SRILM
The SRI Language Modeling Toolkit, for building and applying statistical language models:
http://www.speech.sri.com/projects/srilm/
Google N-grams
http://googleresearch.blogspot.in/2006/08/all-our-n-gram-are-belong-to-you.html
Perplexity
The best language model is one that best predicts an unseen test set.
Perplexity (PP(W))
Perplexity is the inverse probability of the test data, normalized by the number of words:
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
Applying the chain rule:
PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}
For bigrams:
PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
Example: a sentence of N random digits, where the model assigns probability 1/10 to each digit:
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \left( \left( \frac{1}{10} \right)^{N} \right)^{-\frac{1}{N}} = \left( \frac{1}{10} \right)^{-1} = 10
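A quick numerical check of this example, sketched under the assumption of a uniform model over the ten digits:

```python
import math

def perplexity(tokens, prob):
    """PP = exp(-(1/N) * sum of log P(token)); computed in log space."""
    n = len(tokens)
    logp = sum(math.log(prob(tok)) for tok in tokens)
    return math.exp(-logp / n)

digits = list("0123456789" * 5)  # any sequence of digits works
print(perplexity(digits, lambda tok: 1 / 10))  # -> 10.0 (up to float rounding)
```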
WSJ Corpus
Training: 38 million words
Test: 1.5 million words