
Notes on Machine Learning for 16.410 and 16.413
(Notes adapted from Tom Mitchell and Andrew Moore.)

Choosing Hypotheses

Generally want the most probable hypothesis given the training data

$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$

Maximum a posteriori (MAP) hypothesis $h_{MAP}$:

$$h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} \frac{P(D|h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D|h)\,P(h)$$

If we assume all hypotheses are equally probable a priori, i.e., $P(h_i) = P(h_j)$ for all $i, j$, then we can instead choose the maximum likelihood (ML) hypothesis

$$h_{ML} = \arg\max_{h_i \in H} P(D|h_i)$$

Brute Force MAP Hypothesis Learner


1. For each hypothesis h in H, calculate the posterior probability

$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$

2. Output the hypothesis $h_{MAP}$ with the highest posterior probability:

$$h_{MAP} = \arg\max_{h \in H} P(h|D)$$
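
A minimal sketch of this brute-force learner over a finite hypothesis space, assuming we can evaluate the prior P(h) and likelihood P(D|h) for every hypothesis; the `prior` and `likelihood` callables are illustrative, not from the original notes.

    def brute_force_map(hypotheses, prior, likelihood, data):
        """Return (h_MAP, posterior dict) by scoring every hypothesis.

        prior(h)         -> P(h)
        likelihood(h, D) -> P(D | h)
        """
        # Unnormalized posteriors P(D|h) P(h); P(D) is their sum.
        scores = {h: likelihood(h, data) * prior(h) for h in hypotheses}
        p_data = sum(scores.values())
        posterior = {h: s / p_data for h, s in scores.items()}
        h_map = max(posterior, key=posterior.get)
        return h_map, posterior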

Relation to Concept Learning

Consider our usual concept learning task


• instance space X, hypothesis space H, training examples D
• List-then-Eliminate learning algorithm (outputs the set of hypotheses in the version space $VS_{H,D}$)

Choose P(D|h):
• P(D|h) = 1 if h is consistent with D
• P(D|h) = 0 otherwise

Choose P(h) to be the uniform distribution:
• $P(h) = \frac{1}{|H|}$ for all h in H

Then,

$$P(h|D) = \begin{cases} \dfrac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\[4pt] 0 & \text{otherwise} \end{cases}$$

Maximum-likelihood parameter learning:

If the prior over hypotheses is uniform, then there is a standard method for maximum-likelihood parameter learning:

1. Write down an expression for the likelihood of the data as a function of the parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero.

By taking logarithms, we reduce the product over the data to a sum, which is usually easier to maximize. To find the maximum-likelihood value of a parameter θ, we differentiate the log likelihood L with respect to θ and set the resulting expression to zero.

Discrete models

Assume the data set d consists of N independent draws from a binomial population with unknown parameter θ, that is,

$$P(\mathbf{d}\,|\,\theta) = \prod_{j=1}^{N} P(d_j\,|\,\theta) = \theta^{c}\,(1-\theta)^{\ell}$$

where c instances have a positive label, $\ell = N - c$ instances have a negative label, and we want to learn θ.

$$
\begin{aligned}
L(\mathbf{d}\,|\,\theta) &= \log P(\mathbf{d}\,|\,\theta) = \sum_{j=1}^{N} \log P(d_j\,|\,\theta) = c \log\theta + \ell \log(1-\theta) \\
\Rightarrow\ \frac{dL(\mathbf{d}\,|\,\theta)}{d\theta} &= \frac{c}{\theta} - \frac{\ell}{1-\theta} = 0 \\
\Rightarrow\ \theta &= \frac{c}{c+\ell} = \frac{c}{N}
\end{aligned}
$$
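
A quick numerical check of this closed-form result, assuming binary labels encoded as 0/1; the helper name and data are our own example.

    import numpy as np

    def bernoulli_mle(labels):
        """Maximum-likelihood estimate of theta for 0/1 labels: theta = c / N."""
        labels = np.asarray(labels)
        return labels.sum() / labels.size

    # Verify against a brute-force scan of the log likelihood.
    data = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])        # c = 7, N = 10
    thetas = np.linspace(0.01, 0.99, 999)
    c, l = data.sum(), (1 - data).sum()
    log_lik = c * np.log(thetas) + l * np.log(1 - thetas)
    print(bernoulli_mle(data))             # 0.7
    print(thetas[np.argmax(log_lik)])      # ~0.7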

Minimum Description Length Principle

MDL: prefer the hypothesis h that minimizes the combined description length of the hypothesis and of the data given the hypothesis. Starting from the MAP hypothesis:

$$
\begin{aligned}
h_{MDL} &= \arg\max_{h \in H} P(D|h)\,P(h) \\
&= \arg\max_{h \in H} \left[\log_2 P(D|h) + \log_2 P(h)\right] \\
&= \arg\min_{h \in H} \left[-\log_2 P(D|h) - \log_2 P(h)\right]
\end{aligned}
$$

• $-\log_2 P(h)$ is the length of h under an optimal code

• $-\log_2 P(D|h)$ is the length of D given h under an optimal code
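
A minimal sketch of MDL scoring under this coding interpretation, assuming P(h) and P(D|h) are available as probabilities (e.g., via the same `prior` and `likelihood` callables as the brute-force learner above); the function names are ours.

    import math

    def mdl_score(p_h, p_d_given_h):
        """Description length in bits: len(h) + len(D | h) under optimal codes."""
        return -math.log2(p_h) - math.log2(p_d_given_h)

    def h_mdl(hypotheses, prior, likelihood, data):
        """Pick the hypothesis with the shortest total description length."""
        return min(hypotheses,
                   key=lambda h: mdl_score(prior(h), likelihood(h, data)))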

Most Probable Classification of New Instances

So far we have sought the most probable hypothesis given the data D (i.e., $h_{MAP}$).
Given a new instance x, what is its most probable classification?
• $h_{MAP}(x)$ is not necessarily the most probable classification. For example, if three hypotheses have posteriors 0.4, 0.3, and 0.3, and the first classifies x as positive while the other two classify it as negative, then $h_{MAP}$ says positive, but the most probable classification (total posterior 0.6) is negative.

Bayes Optimal Classifier


$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j\,|\,h_i)\,P(h_i\,|\,D)$$
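
A direct implementation sketch of the sum above, assuming posteriors P(h|D) are available (e.g., from the brute-force learner earlier) and that each hypothesis supplies its prediction probabilities; the callable names are ours.

    def bayes_optimal_classify(x, values, posterior, p_value_given_h):
        """Most probable classification of x: argmax_v sum_h P(v|h) P(h|D).

        posterior                : dict mapping hypothesis h -> P(h | D)
        p_value_given_h(v, h, x) : hypothesis h's probability that x has label v
        """
        def score(v):
            return sum(p_value_given_h(v, h, x) * p_h
                       for h, p_h in posterior.items())
        return max(values, key=score)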

Gibbs Classifier

The Bayes optimal classifier provides the best result, but it can be expensive when there are many hypotheses.
Gibbs algorithm:
1. Choose one hypothesis at random, according to P (h|D)
2. Use this hypothesis to classify the new instance
Surprising fact: Assume target concepts are drawn at random from H according to priors on H. Then:

$$E[\mathrm{error}_{Gibbs}] \le 2\,E[\mathrm{error}_{BayesOptimal}]$$

Suppose the correct, uniform prior distribution over H; then:


• Pick any hypothesis from version space with uniform probability
• Its expected error is no worse than twice Bayes optimal

Learning A Real Valued Function

[Figure: training data ⟨x_i, d_i⟩ with the target function f and the maximum likelihood hypothesis $h_{ML}$.]

Consider any real-valued target function f


Training examples ⟨x_i, d_i⟩, where d_i is a noisy training value:
• d_i = f(x_i) + e_i
• e_i is a random variable (noise) drawn independently for each x_i according to a Gaussian distribution with mean 0

$$
\begin{aligned}
h_{ML} &= \arg\max_{h \in H}\, p(D\,|\,h) \\
&= \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i\,|\,h) \\
&= \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2}
\end{aligned}
$$

Maximize the natural log of this instead:

$$
\begin{aligned}
h_{ML} &= \arg\max_{h \in H} \sum_{i=1}^{m} \left[\ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2\right] \\
&= \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 \\
&= \arg\max_{h \in H} \sum_{i=1}^{m} -\left(d_i - h(x_i)\right)^2 \\
&= \arg\min_{h \in H} \sum_{i=1}^{m} \left(d_i - h(x_i)\right)^2
\end{aligned}
$$

Then the maximum likelihood hypothesis $h_{ML}$ is the one that minimizes the sum of squared errors.
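
A small numeric illustration of this result under the Gaussian-noise assumption: for a linear hypothesis class, minimizing the sum of squared errors gives the maximum likelihood hypothesis. The hypothesis class and numbers are our own example, not from the notes.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 50)
    d = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.size)   # d_i = f(x_i) + e_i

    # h_ML for a linear hypothesis class = least-squares fit.
    A = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(A, d, rcond=None)
    print(slope, intercept)   # close to the true (2.0, 1.0)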

Naive Bayes Classifier

Assume data has attributes $a_1, a_2, \ldots, a_n$ and possible labels $v_j$. The Naive Bayes assumption:

$$P(a_1, a_2, \ldots, a_n\,|\,v_j) = \prod_i P(a_i\,|\,v_j)$$

Assume target function $f : X \to V$, where each instance x is described by attributes $\langle a_1, a_2, \ldots, a_n \rangle$.


Most probable value of f(x):

$$
\begin{aligned}
v_{MAP} &= \arg\max_{v_j \in V} P(v_j\,|\,a_1, a_2, \ldots, a_n) \\
&= \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n\,|\,v_j)\,P(v_j)}{P(a_1, a_2, \ldots, a_n)} \\
&= \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n\,|\,v_j)\,P(v_j)
\end{aligned}
$$

which gives the Naive Bayes classifier:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i\,|\,v_j)$$

Naive Bayes Algorithm

Naive_Bayes_Learn(examples)
  For each target value v_j
    P̂(v_j) ← estimate P(v_j)
    For each attribute value a_i of each attribute a
      P̂(a_i|v_j) ← estimate P(a_i|v_j)
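
A compact sketch of Naive_Bayes_Learn and the corresponding classifier for discrete attributes, using plain frequency estimates (the m-estimate discussed below could replace them); the data layout is our assumption.

    from collections import Counter, defaultdict

    def naive_bayes_learn(examples):
        """examples: list of (attribute_tuple, label). Returns (P(v), P(a_i|v))."""
        label_counts = Counter(label for _, label in examples)
        n = len(examples)
        p_v = {v: c / n for v, c in label_counts.items()}

        # cond[(i, a_i, v)] = estimate of P(attribute i has value a_i | label v)
        cond = defaultdict(float)
        for attrs, label in examples:
            for i, a in enumerate(attrs):
                cond[(i, a, label)] += 1.0 / label_counts[label]
        return p_v, cond

    def naive_bayes_classify(p_v, cond, attrs):
        """v_NB = argmax_v P(v) * prod_i P(a_i | v)."""
        def score(v):
            s = p_v[v]
            for i, a in enumerate(attrs):
                s *= cond[(i, a, v)]
            return s
        return max(p_v, key=score)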

Naive Bayes: Subtleties

1. The conditional independence assumption, $P(a_1, a_2, \ldots, a_n\,|\,v_j) = \prod_i P(a_i\,|\,v_j)$, is often violated
...but it works surprisingly well anyway. Note that we don't need the estimated posteriors $\hat P(v_j|x)$ to be correct; we need only that

$$\arg\max_{v_j \in V} \hat P(v_j) \prod_i \hat P(a_i\,|\,v_j) = \arg\max_{v_j \in V} P(v_j)\,P(a_1, \ldots, a_n\,|\,v_j)$$

2. Naive Bayes posteriors often unrealistically close to 1 or 0


3. What if none of the training instances with target value $v_j$ have attribute value $a_i$? Then

$$\hat P(a_i\,|\,v_j) = 0, \quad\text{and so}\quad \hat P(v_j) \prod_i \hat P(a_i\,|\,v_j) = 0$$

The typical solution is a Bayesian estimate for $\hat P(a_i|v_j)$:

$$\hat P(a_i\,|\,v_j) \leftarrow \frac{n_c + m\,p}{n + m}$$
where
• n is the number of training examples for which $v = v_j$,
• $n_c$ is the number of examples for which $v = v_j$ and $a = a_i$,
• p is a prior estimate for $\hat P(a_i|v_j)$, and
• m is the weight given to the prior (i.e., the number of “virtual” examples).
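
A one-line sketch of this m-estimate; the argument names mirror the bullet list above.

    def m_estimate(n_c, n, p, m):
        """Smoothed estimate of P(a_i | v_j): (n_c + m*p) / (n + m)."""
        return (n_c + m * p) / (n + m)

    # e.g. no matching examples observed (n_c = 0) out of n = 10,
    # uniform prior p = 0.5 with m = 2 virtual examples:
    print(m_estimate(0, 10, 0.5, 2))   # 0.0833... instead of 0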

Bayesian Belief Networks

Recall that X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

$$(\forall x_i, y_j, z_k)\quad P(X = x_i\,|\,Y = y_j, Z = z_k) = P(X = x_i\,|\,Z = z_k)$$

More compactly, we write

$$P(X\,|\,Y, Z) = P(X\,|\,Z)$$

Naive Bayes uses conditional independence to justify

$$P(X, Y\,|\,Z) = P(X\,|\,Y, Z)\,P(Y\,|\,Z) = P(X\,|\,Z)\,P(Y\,|\,Z)$$

Bayesian Belief Network

[Figure: a Bayesian belief network over Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire; Storm and BusTourGroup are the parents of Campfire.]

Conditional probability table for Campfire:

        S,B    S,¬B   ¬S,B   ¬S,¬B
  C     0.4    0.1    0.8    0.2
 ¬C     0.6    0.9    0.2    0.8

Network represents a set of conditional independence assertions:

• Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.

Network uses these independence assertions to represent the joint probability distribution over all variables compactly, e.g., P(Storm, BusTourGroup, …, ForestFire):

$$P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P(y_i\,|\,Parents(Y_i))$$

where $Parents(Y_i)$ denotes the immediate predecessors of $Y_i$ in the graph. So the joint distribution is fully defined by the graph plus the tables $P(y_i\,|\,Parents(Y_i))$.
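
A sketch of this factored joint for the Campfire fragment above. The notes give only Campfire's CPT, so the priors P(Storm) = 0.3 and P(BusTourGroup) = 0.4 here are hypothetical values for illustration.

    # Hypothetical priors; the notes only give Campfire's CPT.
    P_STORM = 0.3
    P_BUS = 0.4
    # P(Campfire=T | Storm, BusTourGroup), from the table above.
    P_CAMPFIRE = {(True, True): 0.4, (True, False): 0.1,
                  (False, True): 0.8, (False, False): 0.2}

    def joint(storm, bus, campfire):
        """P(Storm, BusTourGroup, Campfire) = P(S) P(B) P(C | S, B)."""
        p = (P_STORM if storm else 1 - P_STORM) * (P_BUS if bus else 1 - P_BUS)
        p_c = P_CAMPFIRE[(storm, bus)]
        return p * (p_c if campfire else 1 - p_c)

    print(joint(True, False, True))   # P(S, ¬B, C) = 0.3 * 0.6 * 0.1 = 0.018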

Inference in Bayesian Networks

How can one infer the (probabilities of) values of one or more network variables, given observed values of others?

• Bayes net contains all information needed for this inference


• If only one variable has an unknown value, it is easy to infer it

• In the general case, the problem is NP-hard

In practice, can succeed in many cases


• Exact inference methods work well for some network structures
• Monte Carlo methods “simulate” the network randomly to calculate approximate solutions

Learning of Bayesian Networks

Several variants of this learning task

• Network structure might be known or unknown


• Training examples might provide values of all network variables, or just some

If the structure is known and we observe all variables:

• Then it is as easy as training a Naive Bayes classifier

Gradient Ascent for Bayes Nets

Suppose the structure is known and the variables are partially observable: e.g., we observe ForestFire, Storm, BusTourGroup, and Thunder, but not Lightning or Campfire...
Let $w_{ijk}$ denote one entry in the conditional probability table for variable $Y_i$ in the network:

$$w_{ijk} = P(Y_i = y_{ij}\,|\,Parents(Y_i) = \text{the list } u_{ik} \text{ of values})$$

e.g., if $Y_i$ = Campfire, then $u_{ik}$ might be ⟨Storm = T, BusTourGroup = F⟩.


Perform gradient ascent by repeatedly:

1. updating all $w_{ijk}$ using training data D:

$$w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik}\,|\,d)}{w_{ijk}}$$

2. then renormalizing the $w_{ijk}$ to assure that

• $\sum_j w_{ijk} = 1$
• $0 \le w_{ijk} \le 1$
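
A hedged sketch of one update of the rule above for a single node's CPT, stored as an array w[j, k] over values j and parent configurations k. The `posterior_fn` argument stands in for the Bayes-net inference call that computes $P_h(y_{ij}, u_{ik}\,|\,d)$, which the notes leave unspecified.

    import numpy as np

    def gradient_ascent_step(w, data, posterior_fn, eta=0.05):
        """One gradient-ascent update of CPT entries w[j, k] = P(Y_i = y_ij | u_ik).

        posterior_fn(j, k, d) -> P_h(y_ij, u_ik | d), computed by some
        (unspecified here) Bayes-net inference routine.
        """
        grad = np.zeros_like(w)
        for d in data:
            for j in range(w.shape[0]):
                for k in range(w.shape[1]):
                    grad[j, k] += posterior_fn(j, k, d) / w[j, k]
        w = np.clip(w + eta * grad, 1e-12, None)   # keep entries positive
        # Renormalize: for each parent configuration u_ik, sum_j w_ijk = 1.
        return w / w.sum(axis=0, keepdims=True)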

Unsupervised Learning – Expectation Maximization (EM)

[Figure: a mixture density p(x) formed from k Gaussians.]

Each instance x is generated by:


1. Choosing one of the k Gaussians with uniform probability
2. Generating an instance at random according to that Gaussian

EM for Estimating k Means

Given:
• Instances from X generated by mixture of k Gaussian distributions
• Unknown means $\langle \mu_1, \ldots, \mu_k \rangle$ of the k Gaussians
• Don’t know which instance xi was generated by which Gaussian
Determine:
• Maximum likelihood estimates of $\langle \mu_1, \ldots, \mu_k \rangle$
Think of the full description of each instance as $y_i = \langle x_i, z_{i1}, z_{i2} \rangle$, where
• $z_{ij}$ is 1 if $x_i$ was generated by the jth Gaussian
• $x_i$ is observable
• $z_{ij}$ is unobservable

EM for Estimating k Means

EM Algorithm: Pick a random initial $h = \langle \mu_1, \mu_2 \rangle$, then iterate:


E step: Calculate the expected value $E[z_{ij}]$ of each hidden variable $z_{ij}$, assuming the current hypothesis $h = \langle \mu_1, \mu_2 \rangle$ holds:

$$E[z_{ij}] = \frac{p(x = x_i\,|\,\mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i\,|\,\mu = \mu_n)} = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$$

M step: Calculate a new maximum likelihood hypothesis $h' = \langle \mu'_1, \mu'_2 \rangle$, assuming the value taken on by each hidden variable $z_{ij}$ is its expected value $E[z_{ij}]$ calculated above. Then replace $h = \langle \mu_1, \mu_2 \rangle$ by $h' = \langle \mu'_1, \mu'_2 \rangle$:

$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$$
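
A runnable sketch of this E/M loop for two 1-D Gaussians with a known, shared variance σ²; the data generation, σ value, and iteration count are our own example.

    import numpy as np

    rng = np.random.default_rng(1)
    sigma = 1.0
    # Sample data from two Gaussians with true means -2 and 3.
    x = np.concatenate([rng.normal(-2.0, sigma, 100), rng.normal(3.0, sigma, 100)])

    mu = rng.normal(0.0, 1.0, size=2)          # random initial h = <mu_1, mu_2>
    for _ in range(50):
        # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)).
        w = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma**2))
        w /= w.sum(axis=1, keepdims=True)
        # M step: mu_j <- weighted mean of the data under E[z_ij].
        mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)

    print(mu)   # approximately [-2, 3] (order may be swapped)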

EM Algorithm

Converges to a local-maximum-likelihood hypothesis h and provides estimates of the hidden variables $z_{ij}$. In fact, the local maximum is of $E[\ln P(Y|h)]$, where

• Y is the complete (observable plus unobservable variables) data

• the expected value is taken over the possible values of the unobserved variables in Y

General EM Problem

Given:
• Observed data X = {x1 , . . . , xm }
• Unobserved data Z = {z1 , . . . , zm }
• Parameterized probability distribution P(Y|h), where
– $Y = \{y_1, \ldots, y_m\}$ is the full data, with $y_i = x_i \cup z_i$
– h are the parameters
Determine:
• h that (locally) maximizes E[ln P (Y |h)]

General EM Method

Define a likelihood function $Q(h'|h)$, which calculates the likelihood of the full data $Y = X \cup Z$, using the observed X and the current parameters h to estimate Z:

$$Q(h'|h) \leftarrow E[\ln P(Y\,|\,h')\,|\,h, X]$$

EM Algorithm:

Estimation (E) step: Calculate $Q(h'|h)$ using the current hypothesis h and the observed data X to estimate the probability distribution over Y:

$$Q(h'|h) \leftarrow E[\ln P(Y\,|\,h')\,|\,h, X]$$

Maximization (M) step: Replace hypothesis h by the hypothesis h′ that maximizes this Q function:

$$h \leftarrow \arg\max_{h'} Q(h'|h)$$
