(Notes adapted from Tom Mitchell and Andrew Moore.)
Choosing Hypotheses
Generally we want the most probable hypothesis given the training data, the maximum a posteriori (MAP) hypothesis:

h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h) / P(D) = argmax_{h∈H} P(D|h) P(h)

where Bayes' theorem gives P(h|D) = P(D|h) P(h) / P(D), and the denominator P(D) can be dropped because it does not depend on h. If we assume P(h_i) = P(h_j) for all i, j, then we can instead choose the maximum likelihood (ML) hypothesis:

h_ML = argmax_{h∈H} P(D|h)
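For a finite hypothesis space, both definitions can be evaluated by brute force. A minimal sketch (the hypothesis names, prior, and likelihood values below are invented for illustration):

import numpy as np

# Invented posterior ingredients for three hypotheses.
hypotheses = ["h1", "h2", "h3"]
prior = np.array([0.5, 0.3, 0.2])        # P(h)
likelihood = np.array([0.1, 0.4, 0.3])   # P(D|h) for the observed data D

posterior_unnorm = likelihood * prior    # proportional to P(h|D); P(D) drops out
h_map = hypotheses[int(np.argmax(posterior_unnorm))]  # MAP hypothesis
h_ml = hypotheses[int(np.argmax(likelihood))]         # ML hypothesis
print(h_map, h_ml)                       # here both are "h2"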
Then, for a uniform prior over H and noise-free data (so P(D|h) = 1 if h is consistent with D and 0 otherwise),

P(h|D) = 1/|VS_{H,D}|  if h is consistent with D
P(h|D) = 0             otherwise

where VS_{H,D}, the version space, is the subset of hypotheses in H consistent with D.
If the prior over hypotheses is uniform, the MAP hypothesis is just the maximum-likelihood hypothesis, and there is a standard method for maximum-likelihood parameter learning:
1. Write down an expression for the likelihood of the data as a function of the parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero.
By taking logarithms, we reduce the product to a sum over the data, which is usually easier to maximize. To find the maximum-likelihood value of θ, we differentiate the log likelihood L with respect to θ and set the resulting expression to zero.
Discrete models
Assume the data set d consists of N independent draws from a binomial population with unknown parameter θ; that is,

P(d|θ) = ∏_{j=1}^{N} P(d_j|θ) = θ^c (1 − θ)^{N−c}

where c instances have a positive label, the remaining N − c have a negative label, and we want to learn θ.
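Following the three steps above, the maximization can be carried out in closed form:

L(θ) = ln P(d|θ) = c ln θ + (N − c) ln(1 − θ)
dL/dθ = c/θ − (N − c)/(1 − θ) = 0  ⟹  θ = c/N

so the maximum-likelihood estimate of θ is simply the observed fraction of positively labeled instances.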
Most Probable Classification of New Instances
So far we've sought the most probable hypothesis given the data D (i.e., h_MAP).
Given a new instance x, what is its most probable classification?
• h_MAP(x) is not necessarily the most probable classification! For example, take three hypotheses with P(h_1|D) = 0.4, P(h_2|D) = 0.3, P(h_3|D) = 0.3, where h_1(x) = + but h_2(x) = h_3(x) = −: h_MAP = h_1 labels x positive, yet summed over all hypotheses the most probable classification of x is negative.
Gibbs Classifier
The Bayes optimal classifier,

v_OB = argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D),

provides the best possible result, but can be expensive if there are many hypotheses.
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this to classify the new instance
Surprising fact: assume the target concepts are drawn at random from H according to the priors on H. Then:

E[error_Gibbs] ≤ 2 E[error_BayesOptimal]
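A minimal sketch of the Gibbs classifier (the hypothesis list and posterior values are invented; each hypothesis is represented as a function from instances to labels):

import numpy as np

# Invented finite hypothesis space and posterior P(h|D).
hypotheses = [lambda x: +1, lambda x: -1, lambda x: -1]
posterior = np.array([0.4, 0.3, 0.3])

rng = np.random.default_rng(0)

def gibbs_classify(x):
    # 1. Choose one hypothesis at random, according to P(h|D).
    h = hypotheses[rng.choice(len(hypotheses), p=posterior)]
    # 2. Use it to classify the new instance.
    return h(x)

print(gibbs_classify(0.5))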
Learning A Real Valued Function
Consider a real-valued target function f and noisy training examples ⟨x_i, d_i⟩, where d_i = f(x_i) + e_i and each e_i is drawn independently from a zero-mean Gaussian with variance σ². Then

h_ML = argmax_{h∈H} p(D|h)
     = argmax_{h∈H} ∏_{i=1}^{m} p(d_i|h)
     = argmax_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) e^{−(1/2)((d_i − h(x_i))/σ)²}

Maximizing the logarithm instead and dropping terms that do not depend on h,

h_ML = argmax_{h∈H} Σ_{i=1}^{m} −(1/2)((d_i − h(x_i))/σ)² = argmin_{h∈H} Σ_{i=1}^{m} (d_i − h(x_i))²

Thus the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors.
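To illustrate, a sketch with a synthetic linear target and numpy's least-squares line fit (the target function and noise level are invented):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: d_i = f(x_i) + e_i with f(x) = 2x + 1 and Gaussian noise.
x = np.linspace(0.0, 1.0, 50)
d = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.shape)

# Minimizing the sum of squared errors gives the ML hypothesis
# under the Gaussian-noise assumption.
slope, intercept = np.polyfit(x, d, deg=1)
print(slope, intercept)   # close to 2 and 1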
Naive Bayes Classifier
Assume each instance is described by attributes a_1, a_2, …, a_n and labeled with one of the values v_j ∈ V. The Naive Bayes assumption is

P(a_1, a_2, …, a_n | v_j) = ∏_i P(a_i | v_j)

which gives the Naive Bayes classifier:

v_NB = argmax_{v_j∈V} P(v_j) ∏_i P(a_i|v_j)
Naive Bayes Algorithm
Naive_Bayes_Learn(examples)
  For each target value v_j
    P̂(v_j) ← estimate P(v_j)
    For each attribute value a_i of each attribute a
      P̂(a_i|v_j) ← estimate P(a_i|v_j)
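A compact sketch of this learner and the corresponding classifier for discrete attributes (plain counting estimates, no smoothing; the toy data set is invented):

from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attribute_tuple, label).
    Returns count-based estimates of P(v_j) and P(a_i|v_j)."""
    label_counts = Counter(label for _, label in examples)
    prior = {v: c / len(examples) for v, c in label_counts.items()}
    counts = defaultdict(int)            # (attr_index, value, label) -> count
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            counts[(i, a, v)] += 1
    cond = {k: c / label_counts[k[2]] for k, c in counts.items()}
    return prior, cond

def naive_bayes_classify(prior, cond, attrs):
    # v_NB = argmax_v P(v) * prod_i P(a_i|v); unseen pairs get probability 0.
    def score(v):
        p = prior[v]
        for i, a in enumerate(attrs):
            p *= cond.get((i, a, v), 0.0)
        return p
    return max(prior, key=score)

# Invented toy data: (outlook, windy) -> play?
examples = [(("sunny", "no"), "yes"), (("rain", "yes"), "no"),
            (("sunny", "yes"), "yes"), (("rain", "no"), "no")]
prior, cond = naive_bayes_learn(examples)
print(naive_bayes_classify(prior, cond, ("sunny", "no")))   # "yes"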
Naive Bayes: Subtleties
1. The conditional independence assumption P(a_1, a_2, …, a_n|v_j) = ∏_i P(a_i|v_j) is often violated ...but the classifier works surprisingly well anyway. Note that we don't need the estimated posteriors P̂(v_j|x) to be correct; we need only that

argmax_{v_j∈V} P̂(v_j) ∏_i P̂(a_i|v_j) = argmax_{v_j∈V} P(v_j) P(a_1, …, a_n|v_j)
Bayesian Belief Networks
Recall that X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

(∀ x_i, y_j, z_k) P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)

or more compactly, P(X|Y, Z) = P(X|Z). For example, Thunder is conditionally independent of Rain, given Lightning.
[Figure: Bayesian belief network over Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire; Campfire's immediate predecessors are Storm and BusTourGroup.]
• Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors. The network therefore represents the joint distribution over all of its variables Y_1, …, Y_n:

P(y_1, …, y_n) = ∏_{i=1}^{n} P(y_i | Parents(Y_i))

where Parents(Y_i) denotes the immediate predecessors of Y_i in the graph. So the joint distribution is fully defined by the graph, plus the conditional probability table entries P(y_i | Parents(Y_i)).
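A minimal sketch of this factorization on the Storm/BusTourGroup/Campfire fragment of the example network (all probability values below are invented):

# P(storm, bus, campfire) = P(storm) P(bus) P(campfire | storm, bus)
P_STORM = 0.1                        # P(Storm = True); invented
P_BUS = 0.2                          # P(BusTourGroup = True); invented
P_CAMPFIRE = {(True, True): 0.4,     # P(Campfire = True | Storm, BusTourGroup)
              (True, False): 0.1,
              (False, True): 0.8,
              (False, False): 0.2}

def joint(storm, bus, campfire):
    ps = P_STORM if storm else 1.0 - P_STORM
    pb = P_BUS if bus else 1.0 - P_BUS
    pc = P_CAMPFIRE[(storm, bus)]
    pc = pc if campfire else 1.0 - pc
    return ps * pb * pc

print(joint(True, True, True))       # 0.1 * 0.2 * 0.4 = 0.008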
How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
• In the general case, this inference problem is NP-hard.
Learning the network: suppose the structure is known and the variables are only partially observable; e.g., we observe ForestFire, Storm, BusTourGroup, and Thunder, but not Lightning or Campfire. Let w_ijk denote one entry in the conditional probability table for variable Y_i in the network:

w_ijk = P(Y_i = y_ij | Parents(Y_i) = the list u_ik of values)
Unsupervised Learning: Expectation Maximization (EM)
Given:
• Instances from X generated by a mixture of k Gaussian distributions
• Unknown means ⟨μ_1, …, μ_k⟩ of the k Gaussians
• Don't know which instance x_i was generated by which Gaussian
Determine:
• Maximum likelihood estimates of ⟨μ_1, …, μ_k⟩
Think of the full description of each instance as y_i = ⟨x_i, z_i1, z_i2⟩ (taking k = 2 for concreteness), where
• z_ij is 1 if x_i was generated by the jth Gaussian
• x_i is observable
• z_ij is unobservable
E step: Calculate the expected value E[z_ij] of each hidden variable z_ij, assuming the current hypothesis h = ⟨μ_1, μ_2⟩ holds:

E[z_ij] = p(x = x_i | μ = μ_j) / Σ_{n=1}^{2} p(x = x_i | μ = μ_n)
        = e^{−(x_i − μ_j)²/(2σ²)} / Σ_{n=1}^{2} e^{−(x_i − μ_n)²/(2σ²)}

M step: Calculate a new maximum likelihood hypothesis h′ = ⟨μ′_1, μ′_2⟩, assuming the value taken on by each hidden variable z_ij is its expected value E[z_ij] calculated above. Then replace h = ⟨μ_1, μ_2⟩ by h′ = ⟨μ′_1, μ′_2⟩:

μ_j ← Σ_{i=1}^{m} E[z_ij] x_i / Σ_{i=1}^{m} E[z_ij]
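A compact sketch of these two steps for two 1-D Gaussians with known shared σ (the data, σ, and initial means are invented):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data drawn from a mixture of two Gaussians with known sigma.
sigma = 1.0
x = np.concatenate([rng.normal(-2.0, sigma, 200), rng.normal(3.0, sigma, 200)])

mu = np.array([-1.0, 1.0])              # arbitrary initial means
for _ in range(50):
    # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)),
    # normalized over j.
    w = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)
    # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij].
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)

print(mu)                               # approaches [-2, 3]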
EM Algorithm
Converges to a local maximum of the likelihood h and provides estimates of the hidden variables z_ij. In fact, what it locally maximizes is E[ln P(Y|h)].
General EM Problem
Given:
• Observed data X = {x_1, …, x_m}
• Unobserved data Z = {z_1, …, z_m}
• Parameterized probability distribution P(Y|h), where
  – Y = {y_1, …, y_m} is the full data, y_i = x_i ∪ z_i
  – h are the parameters
Determine:
• h that (locally) maximizes E[ln P(Y|h)]
General EM Method
Define a likelihood function Q(h′|h), which calculates Y = X ∪ Z using the observed X and the current parameters h to estimate Z.
EM Algorithm:
Estimation (E) step: Calculate Q(h′|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:

Q(h′|h) ← E[ln P(Y|h′) | h, X]
Maximization (M) step: Replace hypothesis h by the hypothesis h′ that maximizes this Q function:

h ← argmax_{h′} Q(h′|h)