[Read Ch. 6]
[Suggested exercises: 6.1, 6.2, 6.6]
Bayes Theorem
MAP, ML hypotheses
MAP learners
Minimum description length principle
Bayes optimal classifier
Naive Bayes learner
Example: Learning over text data
Bayesian belief networks
Expectation Maximization algorithm
Two Roles for Bayesian Methods
Bayes Theorem
$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$
Choosing Hypotheses
$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

Generally we want the most probable hypothesis given the training data, the maximum a posteriori hypothesis $h_{MAP}$:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\, P(h)$$

If we assume every hypothesis in H is equally probable a priori, we may choose the maximum likelihood (ML) hypothesis:

$$h_{ML} = \arg\max_{h \in H} P(D \mid h)$$
Bayes Theorem
A patient takes a lab test for cancer and the result comes back positive (+):

P(cancer) = .008        P(¬cancer) = .992
P(+ | cancer) = .98     P(− | cancer) = .02
P(+ | ¬cancer) = .03    P(− | ¬cancer) = .97
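Plugging these numbers into Bayes theorem for the positive test result (a small worked check; the variable names are mine):

```python
# Known quantities from the example above.
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

# Unnormalized posteriors: P(h|+) is proportional to P(+|h) P(h).
joint_cancer = p_pos_given_cancer * p_cancer   # 0.98 * 0.008 = 0.00784
joint_not    = p_pos_given_not * p_not_cancer  # 0.03 * 0.992 = 0.02976

# Normalizing by P(+), the sum of the two joints, gives the posteriors.
z = joint_cancer + joint_not
print(round(joint_cancer / z, 3), round(joint_not / z, 3))  # 0.209 0.791
# h_MAP = not cancer, even though the test came back positive.
```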
Basic Formulas for Probabilities
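For reference, the basic identities in question are the product rule, the sum rule, and the theorem of total probability:

$$P(A \land B) = P(A \mid B)\, P(B)$$
$$P(A \lor B) = P(A) + P(B) - P(A \land B)$$
$$P(B) = \sum_{i=1}^{n} P(B \mid A_i)\, P(A_i) \quad \text{if } A_1, \ldots, A_n \text{ are mutually exclusive and } \sum_{i=1}^{n} P(A_i) = 1$$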
Brute Force MAP Hypothesis Learner
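A minimal Python sketch of such a learner (the hypothesis representation, `likelihood` function, and all names here are hypothetical illustrations, not from the slides):

```python
def brute_force_map(hypotheses, priors, likelihood, data):
    """Return the MAP hypothesis, maximizing P(D|h) P(h) over H.

    hypotheses -- iterable of candidate hypotheses
    priors     -- dict mapping each hypothesis to its prior P(h)
    likelihood -- function (data, h) -> P(D|h)
    """
    # Step 1: compute each posterior up to the constant factor 1/P(D);
    # dividing by P(D) would not change which h is the argmax.
    scores = {h: likelihood(data, h) * priors[h] for h in hypotheses}
    # Step 2: output the hypothesis with the highest posterior probability.
    return max(scores, key=scores.get)
```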
Relation to Concept Learning
Choose P(D|h):
- P(D|h) = 1 if h is consistent with D
- P(D|h) = 0 otherwise

Choose P(h) to be the uniform distribution: P(h) = 1/|H| for all h in H.
Then,
$$P(h \mid D) = \begin{cases} \dfrac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\[6pt] 0 & \text{otherwise} \end{cases}$$
Evolution of Posterior Probabilities
Characterizing Learning Algorithms by Equivalent MAP Learners

[Figure: an inductive system and the equivalent Bayesian inference system, with the prior assumptions of the inductive system made explicit.]
Learning A Real Valued Function
[Figure: noisy training examples of a target function f and the maximum likelihood hypothesis h_ML.]
$$h_{ML} = \arg\max_{h \in H} p(D \mid h) = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2}$$

Maximizing the logarithm of this product and dropping the terms that do not depend on h:

$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} -\left(d_i - h(x_i)\right)^2 = \arg\min_{h \in H} \sum_{i=1}^{m} \left(d_i - h(x_i)\right)^2$$
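A quick numeric check of this equivalence (a minimal sketch, assuming a toy linear hypothesis class h_w(x) = w·x, σ = 1, and a hypothetical grid of candidate weights; none of this is from the slides):

```python
import random

random.seed(0)
xs = [i / 10 for i in range(20)]
# Noisy observations d_i = f(x_i) + e_i with f(x) = 2x and Gaussian noise.
ds = [2 * x + random.gauss(0, 1) for x in xs]

def log_likelihood(w):
    # Sum of log N(d_i; w*x_i, 1); additive constants don't affect the argmax.
    return sum(-0.5 * (d - w * x) ** 2 for x, d in zip(xs, ds))

def sse(w):
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

candidates = [w / 100 for w in range(100, 300)]
w_ml  = max(candidates, key=log_likelihood)
w_sse = min(candidates, key=sse)
print(w_ml == w_sse)  # True: maximizing likelihood = minimizing squared error
```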
Learning to Predict Probabilities
Minimum Description Length Principle
$$h_{MAP} = \arg\max_{h \in H} P(D \mid h)\, P(h) = \arg\max_{h \in H} \left[ \log_2 P(D \mid h) + \log_2 P(h) \right] = \arg\min_{h \in H} \left[ -\log_2 P(D \mid h) - \log_2 P(h) \right] \quad (1)$$

So interpret (1):
- $-\log_2 P(h)$ is the length of h under the optimal code for H
- $-\log_2 P(D \mid h)$ is the length of D given h under its optimal code
Most Probable Classification of New Instances
Bayes Optimal Classifier
Bayes optimal classification:

$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$$

Example:

$P(h_1 \mid D) = .4$,  $P(- \mid h_1) = 0$,  $P(+ \mid h_1) = 1$
$P(h_2 \mid D) = .3$,  $P(- \mid h_2) = 1$,  $P(+ \mid h_2) = 0$
$P(h_3 \mid D) = .3$,  $P(- \mid h_3) = 1$,  $P(+ \mid h_3) = 0$

therefore

$$\sum_{h_i \in H} P(+ \mid h_i)\, P(h_i \mid D) = .4$$
$$\sum_{h_i \in H} P(- \mid h_i)\, P(h_i \mid D) = .6$$

and

$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D) = -$$
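The same computation as a short Python sketch (the posteriors and per-hypothesis predictions are copied from the example above; the names are mine):

```python
# Posteriors P(h_i|D) and each hypothesis's prediction P(v|h_i).
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(values=("+", "-")):
    # argmax over v_j of sum_i P(v_j|h_i) P(h_i|D)
    def score(v):
        return sum(prediction[h][v] * p for h, p in posteriors.items())
    return max(values, key=score)

print(bayes_optimal())  # '-' (weighted votes: + gets 0.4, - gets 0.6)
```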
Gibbs Classifier
Naive Bayes Classifier
Assume the target function is f: X → V, where each instance x is described by attributes ⟨a₁, a₂, …, a_n⟩. The most probable value of f(x) is:

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)$$

Naive Bayes assumption: $P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$, which gives the

Naive Bayes classifier: $v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$
Naive Bayes Algorithm
Naive Bayes: Example
Naive Bayes: Subtleties
Learning to Classify Text
Why?
Learn which news articles are of interest
Learn to classify web pages by topic
Naive Bayes is among the most effective algorithms

What attributes shall we use to represent text documents?
Learn_naive_Bayes_text(Examples, V)

1. Collect all words and other tokens that occur in Examples
   - Vocabulary ← all distinct words and other tokens in Examples

2. Calculate the required P(v_j) and P(w_k | v_j) probability terms
   - For each target value v_j in V do
     - docs_j ← subset of Examples for which the target value is v_j
     - P(v_j) ← |docs_j| / |Examples|
     - Text_j ← a single document created by concatenating all members of docs_j
     - n ← total number of words in Text_j (counting duplicate words multiple times)
     - For each word w_k in Vocabulary
       - n_k ← number of times word w_k occurs in Text_j
       - P(w_k | v_j) ← (n_k + 1) / (n + |Vocabulary|)
Classify_naive_Bayes_text(Doc)

- positions ← all word positions in Doc that contain tokens found in Vocabulary
- Return v_NB, where

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i \in positions} P(a_i \mid v_j)$$
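A compact Python rendering of the two procedures above; a sketch that assumes whitespace tokenization and that every target class occurs in the training set, and uses log-probabilities to avoid floating-point underflow (a detail the pseudocode leaves implicit):

```python
import math
from collections import Counter, defaultdict

def learn_naive_bayes_text(examples, V):
    """examples: list of (document_text, target_value) pairs; V: target values."""
    docs = defaultdict(list)
    for text, v in examples:
        docs[v].append(text)
    vocabulary = {w for text, _ in examples for w in text.split()}
    priors, cond = {}, {}
    for v in V:
        priors[v] = len(docs[v]) / len(examples)
        counts = Counter(" ".join(docs[v]).split())  # Text_j as one document
        n = sum(counts.values())                     # total words, with duplicates
        # P(w_k|v_j) <- (n_k + 1) / (n + |Vocabulary|)
        cond[v] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
    return vocabulary, priors, cond

def classify_naive_bayes_text(doc, vocabulary, priors, cond):
    positions = [w for w in doc.split() if w in vocabulary]
    # argmax_v  log P(v) + sum over positions of log P(a_i|v)
    return max(priors, key=lambda v: math.log(priors[v]) +
               sum(math.log(cond[v][w]) for w in positions))
```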
Twenty NewsGroups
Article from rec.sport.hockey
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.e
From: xxx@yyy.zzz.edu (John Doe)
Subject: Re: This year's biggest and worst (opinio
Date: 5 Apr 93 09:53:39 GMT
Learning Curve for 20 Newsgroups
[Figure: 20News learning curve; accuracy (0 to 100%) vs. training set size (100 to 10000, log scale), with 1/3 of the data withheld for testing. Curves shown for Bayes, TFIDF, and PRTFIDF.]
Bayesian Belief Networks
Interesting because:
- The Naive Bayes assumption of conditional independence is too restrictive
- But it's intractable without some such assumptions...
- Bayesian belief networks describe conditional independence among subsets of variables
- This allows combining prior knowledge about (in)dependencies among variables with observed training data

(also called Bayes Nets)
Conditional Independence
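The standard definition this slide builds on: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

$$(\forall x_i, y_j, z_k)\; P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)$$

written more compactly as P(X | Y, Z) = P(X | Z). For example, Thunder is conditionally independent of Rain, given Lightning. Naive Bayes uses this notion to justify P(X, Y | Z) = P(X | Z) P(Y | Z).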
Bayesian Belief Network
[Figure: Bayesian belief network relating the variables Storm, BusTourGroup, Campfire, Thunder, and ForestFire.]
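What the network encodes: each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors, so the full joint distribution over the variables factors as

$$P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid Parents(Y_i))$$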
Inference in Bayesian Networks
Learning of Bayesian Networks
Learning Bayes Nets
Gradient Ascent for Bayes Nets
More on Learning Bayes Nets
Summary: Bayesian Belief Networks
Expectation Maximization (EM)
When to use:
Data is only partially observable
Unsupervised clustering (target value
unobservable)
Supervised learning (some instance attributes
unobservable)
Some uses:
Train Bayesian Belief Networks
Unsupervised clustering (AUTOCLASS)
Learning Hidden Markov Models
Generating Data from Mixture of k Gaussians

[Figure: probability density p(x) for data generated by a mixture of k Gaussians.]
EM for Estimating k Means
Given:
Instances from X generated by a mixture of k Gaussian distributions
Unknown means $\langle \mu_1, \ldots, \mu_k \rangle$ of the k Gaussians
E step: Calculate the expected value E[z_ij] of each hidden variable z_ij, assuming the current hypothesis $\langle \mu_1, \ldots, \mu_k \rangle$ holds:

$$E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{k} p(x = x_i \mid \mu = \mu_n)} = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{k} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$$

M step: Calculate a new maximum likelihood hypothesis, replacing each $\mu_j$ by

$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$$
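A minimal Python sketch of these two steps (assumed toy setup: k = 2, known σ = 1, random initialization; the data and names are mine, not from the slides):

```python
import math, random

def em_k_means(xs, k=2, sigma=1.0, iters=50):
    """Estimate k Gaussian means from unlabeled samples xs via EM."""
    mus = random.sample(xs, k)  # random initial hypothesis <mu_1,...,mu_k>
    for _ in range(iters):
        # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)),
        # normalized over the k Gaussians.
        E = []
        for x in xs:
            w = [math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for mu in mus]
            z = sum(w)
            E.append([wj / z for wj in w])
        # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
        mus = [sum(E[i][j] * xs[i] for i in range(len(xs))) /
               sum(E[i][j] for i in range(len(xs)))
               for j in range(k)]
    return mus

random.seed(1)
data = ([random.gauss(0, 1) for _ in range(100)] +
        [random.gauss(5, 1) for _ in range(100)])
print(em_k_means(data))  # two estimated means, near 0 and 5
```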
EM Algorithm
General EM Problem
Given:
Observed data $X = \{x_1, \ldots, x_m\}$
General EM Method