
Notes on Machine Learning for 16.410 and 16.413
(Notes adapted from Tom Mitchell and Andrew Moore.)

Choosing Hypotheses

Generally want the most probable hypothesis given the training data

$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$

Maximum a posteriori (MAP) hypothesis $h_{MAP}$:

$$h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} \frac{P(D|h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D|h)\,P(h)$$

If we assume all hypotheses are equally probable a priori, i.e., $P(h_i) = P(h_j)$ for all $i, j$, then we can instead choose the maximum likelihood (ML) hypothesis

$$h_{ML} = \arg\max_{h_i \in H} P(D|h_i)$$

Brute Force MAP Hypothesis Learner


1. For each hypothesis h in H, calculate the posterior probability

$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$

2. Output the hypothesis $h_{MAP}$ with the highest posterior probability:

$$h_{MAP} = \arg\max_{h \in H} P(h|D)$$
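
A minimal sketch of this brute-force learner over a finite hypothesis space, assuming we can evaluate the prior P(h) and likelihood P(D|h) for every hypothesis; the `prior` and `likelihood` callables are illustrative, not from the original notes.

    def brute_force_map(hypotheses, prior, likelihood, data):
        """Return (h_MAP, posterior dict) by scoring every hypothesis.

        prior(h)         -> P(h)
        likelihood(h, D) -> P(D | h)
        """
        # Unnormalized posteriors P(D|h) P(h); P(D) is their sum.
        scores = {h: likelihood(h, data) * prior(h) for h in hypotheses}
        p_data = sum(scores.values())
        posterior = {h: s / p_data for h, s in scores.items()}
        h_map = max(posterior, key=posterior.get)
        return h_map, posterior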

Relation to Concept Learning

Consider our usual concept learning task


• instance space X, hypothesis space H, training examples D
• List-then-Eliminate learning algorithm (outputs the set of hypotheses in the version space $VS_{H,D}$)

Choose P(D|h):
• P(D|h) = 1 if h is consistent with D
• P(D|h) = 0 otherwise

Choose P(h) to be the uniform distribution:
• $P(h) = \frac{1}{|H|}$ for all h in H

Then,

$$P(h|D) = \begin{cases} \dfrac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\[4pt] 0 & \text{otherwise} \end{cases}$$

Maximum-likelihood parameter learning:

If the prior over hypotheses is uniform, then there is a standard method for maximum-likelihood parameter learning:

1. Write down an expression for the likelihood of the data as a function of the parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero.

By taking logarithms, we reduce the product over the data to a sum, which is usually easier to maximize. To find the maximum-likelihood value of a parameter θ, we differentiate the log likelihood L with respect to θ and set the resulting expression to zero.

Discrete models

Assume the data set d consists of N independent draws from a binomial population with unknown parameter θ, that is,

$$P(\mathbf{d}\,|\,\theta) = \prod_{j=1}^{N} P(d_j\,|\,\theta) = \theta^{c}\,(1-\theta)^{\ell}$$

where c instances have a positive label, $\ell = N - c$ instances have a negative label, and we want to learn θ.

$$
\begin{aligned}
L(\mathbf{d}\,|\,\theta) &= \log P(\mathbf{d}\,|\,\theta) = \sum_{j=1}^{N} \log P(d_j\,|\,\theta) = c \log\theta + \ell \log(1-\theta) \\
\Rightarrow\ \frac{dL(\mathbf{d}\,|\,\theta)}{d\theta} &= \frac{c}{\theta} - \frac{\ell}{1-\theta} = 0 \\
\Rightarrow\ \theta &= \frac{c}{c+\ell} = \frac{c}{N}
\end{aligned}
$$
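
A quick numerical check of this closed-form result, assuming binary labels encoded as 0/1; the helper name and data are our own example.

    import numpy as np

    def bernoulli_mle(labels):
        """Maximum-likelihood estimate of theta for 0/1 labels: theta = c / N."""
        labels = np.asarray(labels)
        return labels.sum() / labels.size

    # Verify against a brute-force scan of the log likelihood.
    data = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])        # c = 7, N = 10
    thetas = np.linspace(0.01, 0.99, 999)
    c, l = data.sum(), (1 - data).sum()
    log_lik = c * np.log(thetas) + l * np.log(1 - thetas)
    print(bernoulli_mle(data))             # 0.7
    print(thetas[np.argmax(log_lik)])      # ~0.7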

Minimum Description Length Principle

MDL: prefer the hypothesis h that minimizes the combined description length of the hypothesis and of the data given the hypothesis. Starting from the MAP hypothesis:

$$
\begin{aligned}
h_{MDL} &= \arg\max_{h \in H} P(D|h)\,P(h) \\
&= \arg\max_{h \in H} \left[\log_2 P(D|h) + \log_2 P(h)\right] \\
&= \arg\min_{h \in H} \left[-\log_2 P(D|h) - \log_2 P(h)\right]
\end{aligned}
$$

• $-\log_2 P(h)$ is the length of h under an optimal code

• $-\log_2 P(D|h)$ is the length of D given h under an optimal code
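
A minimal sketch of MDL scoring under this coding interpretation, assuming P(h) and P(D|h) are available as probabilities (e.g., via the same `prior` and `likelihood` callables as the brute-force learner above); the function names are ours.

    import math

    def mdl_score(p_h, p_d_given_h):
        """Description length in bits: len(h) + len(D | h) under optimal codes."""
        return -math.log2(p_h) - math.log2(p_d_given_h)

    def h_mdl(hypotheses, prior, likelihood, data):
        """Pick the hypothesis with the shortest total description length."""
        return min(hypotheses,
                   key=lambda h: mdl_score(prior(h), likelihood(h, data)))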

Most Probable Classification of New Instances

So far we have sought the most probable hypothesis given the data D (i.e., $h_{MAP}$).
Given a new instance x, what is its most probable classification?
• $h_{MAP}(x)$ is not necessarily the most probable classification. For example, if three hypotheses have posteriors 0.4, 0.3, and 0.3, and the first classifies x as positive while the other two classify it as negative, then $h_{MAP}$ says positive, but the most probable classification (total posterior 0.6) is negative.

Bayes Optimal Classifier


$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j\,|\,h_i)\,P(h_i\,|\,D)$$
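
A direct implementation sketch of the sum above, assuming posteriors P(h|D) are available (e.g., from the brute-force learner earlier) and that each hypothesis supplies its prediction probabilities; the callable names are ours.

    def bayes_optimal_classify(x, values, posterior, p_value_given_h):
        """Most probable classification of x: argmax_v sum_h P(v|h) P(h|D).

        posterior                : dict mapping hypothesis h -> P(h | D)
        p_value_given_h(v, h, x) : hypothesis h's probability that x has label v
        """
        def score(v):
            return sum(p_value_given_h(v, h, x) * p_h
                       for h, p_h in posterior.items())
        return max(values, key=score)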

Gibbs Classifier

The Bayes optimal classifier provides the best result, but it can be expensive when there are many hypotheses.
Gibbs algorithm:
1. Choose one hypothesis at random, according to P (h|D)
2. Use this hypothesis to classify the new instance
Surprising fact: Assume target concepts are drawn at random from H according to priors on H. Then:

$$E[\mathrm{error}_{Gibbs}] \le 2\,E[\mathrm{error}_{BayesOptimal}]$$

Suppose the correct, uniform prior distribution over H; then:


• Pick any hypothesis from version space with uniform probability
• Its expected error is no worse than twice Bayes optimal

Learning A Real Valued Function

[Figure: training data ⟨x_i, d_i⟩ with the target function f and the maximum likelihood hypothesis $h_{ML}$.]

Consider any real-valued target function f


Training examples ⟨x_i, d_i⟩, where d_i is a noisy training value:
• d_i = f(x_i) + e_i
• e_i is a random variable (noise) drawn independently for each x_i according to a Gaussian distribution with mean 0

$$
\begin{aligned}
h_{ML} &= \arg\max_{h \in H}\, p(D\,|\,h) \\
&= \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i\,|\,h) \\
&= \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2}
\end{aligned}
$$

Maximize the natural log of this instead:

$$
\begin{aligned}
h_{ML} &= \arg\max_{h \in H} \sum_{i=1}^{m} \left[\ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2\right] \\
&= \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 \\
&= \arg\max_{h \in H} \sum_{i=1}^{m} -\left(d_i - h(x_i)\right)^2 \\
&= \arg\min_{h \in H} \sum_{i=1}^{m} \left(d_i - h(x_i)\right)^2
\end{aligned}
$$

Then the maximum likelihood hypothesis $h_{ML}$ is the one that minimizes the sum of squared errors.
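
A small numeric illustration of this result under the Gaussian-noise assumption: for a linear hypothesis class, minimizing the sum of squared errors gives the maximum likelihood hypothesis. The hypothesis class and numbers are our own example, not from the notes.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 50)
    d = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.size)   # d_i = f(x_i) + e_i

    # h_ML for a linear hypothesis class = least-squares fit.
    A = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(A, d, rcond=None)
    print(slope, intercept)   # close to the true (2.0, 1.0)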

Naive Bayes Classifier

Assume data has attributes $a_1, a_2, \ldots, a_n$ and possible labels $v_j$. The Naive Bayes assumption:

$$P(a_1, a_2, \ldots, a_n\,|\,v_j) = \prod_i P(a_i\,|\,v_j)$$

Assume target function $f : X \to V$, where each instance x is described by attributes $\langle a_1, a_2, \ldots, a_n \rangle$.


Most probable value of f(x):

$$
\begin{aligned}
v_{MAP} &= \arg\max_{v_j \in V} P(v_j\,|\,a_1, a_2, \ldots, a_n) \\
&= \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n\,|\,v_j)\,P(v_j)}{P(a_1, a_2, \ldots, a_n)} \\
&= \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n\,|\,v_j)\,P(v_j)
\end{aligned}
$$

which gives the Naive Bayes classifier:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i\,|\,v_j)$$

Naive Bayes Algorithm

Naive_Bayes_Learn(examples)
  For each target value v_j
    P̂(v_j) ← estimate P(v_j)
    For each attribute value a_i of each attribute a
      P̂(a_i|v_j) ← estimate P(a_i|v_j)
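
A compact sketch of Naive_Bayes_Learn and the corresponding classifier for discrete attributes, using plain frequency estimates (the m-estimate discussed below could replace them); the data layout is our assumption.

    from collections import Counter, defaultdict

    def naive_bayes_learn(examples):
        """examples: list of (attribute_tuple, label). Returns (P(v), P(a_i|v))."""
        label_counts = Counter(label for _, label in examples)
        n = len(examples)
        p_v = {v: c / n for v, c in label_counts.items()}

        # cond[(i, a_i, v)] = estimate of P(attribute i has value a_i | label v)
        cond = defaultdict(float)
        for attrs, label in examples:
            for i, a in enumerate(attrs):
                cond[(i, a, label)] += 1.0 / label_counts[label]
        return p_v, cond

    def naive_bayes_classify(p_v, cond, attrs):
        """v_NB = argmax_v P(v) * prod_i P(a_i | v)."""
        def score(v):
            s = p_v[v]
            for i, a in enumerate(attrs):
                s *= cond[(i, a, v)]
            return s
        return max(p_v, key=score)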

Naive Bayes: Subtleties

1. The conditional independence assumption, $P(a_1, a_2, \ldots, a_n\,|\,v_j) = \prod_i P(a_i\,|\,v_j)$, is often violated
...but it works surprisingly well anyway. Note that we don't need the estimated posteriors $\hat P(v_j|x)$ to be correct; we need only that

$$\arg\max_{v_j \in V} \hat P(v_j) \prod_i \hat P(a_i\,|\,v_j) = \arg\max_{v_j \in V} P(v_j)\,P(a_1, \ldots, a_n\,|\,v_j)$$

2. Naive Bayes posteriors often unrealistically close to 1 or 0


3. What if none of the training instances with target value $v_j$ have attribute value $a_i$? Then

$$\hat P(a_i\,|\,v_j) = 0, \quad\text{and so}\quad \hat P(v_j) \prod_i \hat P(a_i\,|\,v_j) = 0$$

The typical solution is a Bayesian estimate for $\hat P(a_i|v_j)$:

$$\hat P(a_i\,|\,v_j) \leftarrow \frac{n_c + m\,p}{n + m}$$
where
• n is the number of training examples for which $v = v_j$,
• $n_c$ is the number of examples for which $v = v_j$ and $a = a_i$,
• p is a prior estimate for $\hat P(a_i|v_j)$, and
• m is the weight given to the prior (i.e., the number of “virtual” examples).
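
A one-line sketch of this m-estimate; the argument names mirror the bullet list above.

    def m_estimate(n_c, n, p, m):
        """Smoothed estimate of P(a_i | v_j): (n_c + m*p) / (n + m)."""
        return (n_c + m * p) / (n + m)

    # e.g. no matching examples observed (n_c = 0) out of n = 10,
    # uniform prior p = 0.5 with m = 2 virtual examples:
    print(m_estimate(0, 10, 0.5, 2))   # 0.0833... instead of 0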

Bayesian Belief Networks

Recall that X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

$$(\forall x_i, y_j, z_k)\quad P(X = x_i\,|\,Y = y_j, Z = z_k) = P(X = x_i\,|\,Z = z_k)$$

More compactly, we write

$$P(X\,|\,Y, Z) = P(X\,|\,Z)$$

Naive Bayes uses conditional independence to justify

$$P(X, Y\,|\,Z) = P(X\,|\,Y, Z)\,P(Y\,|\,Z) = P(X\,|\,Z)\,P(Y\,|\,Z)$$

Bayesian Belief Network

[Figure: a Bayesian belief network over Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire; Storm and BusTourGroup are the parents of Campfire.]

Conditional probability table for Campfire:

        S,B    S,¬B   ¬S,B   ¬S,¬B
  C     0.4    0.1    0.8    0.2
 ¬C     0.6    0.9    0.2    0.8

Network represents a set of conditional independence assertions:

• Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.

Network uses these independence assertions to represent the joint probability distribution over all variables compactly, e.g., P(Storm, BusTourGroup, …, ForestFire):

$$P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P(y_i\,|\,Parents(Y_i))$$

where $Parents(Y_i)$ denotes the immediate predecessors of $Y_i$ in the graph. So the joint distribution is fully defined by the graph plus the tables $P(y_i\,|\,Parents(Y_i))$.
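
A sketch of this factored joint for the Campfire fragment above. The notes give only Campfire's CPT, so the priors P(Storm) = 0.3 and P(BusTourGroup) = 0.4 here are hypothetical values for illustration.

    # Hypothetical priors; the notes only give Campfire's CPT.
    P_STORM = 0.3
    P_BUS = 0.4
    # P(Campfire=T | Storm, BusTourGroup), from the table above.
    P_CAMPFIRE = {(True, True): 0.4, (True, False): 0.1,
                  (False, True): 0.8, (False, False): 0.2}

    def joint(storm, bus, campfire):
        """P(Storm, BusTourGroup, Campfire) = P(S) P(B) P(C | S, B)."""
        p = (P_STORM if storm else 1 - P_STORM) * (P_BUS if bus else 1 - P_BUS)
        p_c = P_CAMPFIRE[(storm, bus)]
        return p * (p_c if campfire else 1 - p_c)

    print(joint(True, False, True))   # P(S, ¬B, C) = 0.3 * 0.6 * 0.1 = 0.018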

Inference in Bayesian Networks

How can one infer the (probabilities of) values of one or more network variables, given observed values of others?

• Bayes net contains all information needed for this inference


• If only one variable has an unknown value, it is easy to infer it

• In the general case, the problem is NP-hard

In practice, can succeed in many cases


• Exact inference methods work well for some network structures
• Monte Carlo methods “simulate” the network randomly to calculate approximate solutions

Learning of Bayesian Networks

Several variants of this learning task

• Network structure might be known or unknown


• Training examples might provide values of all network variables, or just some

If the structure is known and we observe all variables:

• Then it is as easy as training a Naive Bayes classifier

Gradient Ascent for Bayes Nets

Suppose the structure is known and the variables are partially observable: e.g., we observe ForestFire, Storm, BusTourGroup, and Thunder, but not Lightning or Campfire...
Let $w_{ijk}$ denote one entry in the conditional probability table for variable $Y_i$ in the network:

$$w_{ijk} = P(Y_i = y_{ij}\,|\,Parents(Y_i) = \text{the list } u_{ik} \text{ of values})$$

e.g., if $Y_i$ = Campfire, then $u_{ik}$ might be ⟨Storm = T, BusTourGroup = F⟩.


Perform gradient ascent by repeatedly:

1. updating all $w_{ijk}$ using training data D:

$$w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik}\,|\,d)}{w_{ijk}}$$

2. then renormalizing the $w_{ijk}$ to assure that

• $\sum_j w_{ijk} = 1$
• $0 \le w_{ijk} \le 1$
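
A hedged sketch of one update of the rule above for a single node's CPT, stored as an array w[j, k] over values j and parent configurations k. The `posterior_fn` argument stands in for the Bayes-net inference call that computes $P_h(y_{ij}, u_{ik}\,|\,d)$, which the notes leave unspecified.

    import numpy as np

    def gradient_ascent_step(w, data, posterior_fn, eta=0.05):
        """One gradient-ascent update of CPT entries w[j, k] = P(Y_i = y_ij | u_ik).

        posterior_fn(j, k, d) -> P_h(y_ij, u_ik | d), computed by some
        (unspecified here) Bayes-net inference routine.
        """
        grad = np.zeros_like(w)
        for d in data:
            for j in range(w.shape[0]):
                for k in range(w.shape[1]):
                    grad[j, k] += posterior_fn(j, k, d) / w[j, k]
        w = np.clip(w + eta * grad, 1e-12, None)   # keep entries positive
        # Renormalize: for each parent configuration u_ik, sum_j w_ijk = 1.
        return w / w.sum(axis=0, keepdims=True)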

Unsupervised Learning – Expectation Maximization (EM)

[Figure: a mixture density p(x) formed from k Gaussians.]

Each instance x is generated by:


1. Choosing one of the k Gaussians with uniform probability
2. Generating an instance at random according to that Gaussian

EM for Estimating k Means

Given:
• Instances from X generated by mixture of k Gaussian distributions
• Unknown means $\langle \mu_1, \ldots, \mu_k \rangle$ of the k Gaussians
• Don’t know which instance xi was generated by which Gaussian
Determine:
• Maximum likelihood estimates of $\langle \mu_1, \ldots, \mu_k \rangle$
Think of the full description of each instance as $y_i = \langle x_i, z_{i1}, z_{i2} \rangle$, where
• $z_{ij}$ is 1 if $x_i$ was generated by the jth Gaussian
• $x_i$ is observable
• $z_{ij}$ is unobservable

EM for Estimating k Means

EM Algorithm: Pick a random initial $h = \langle \mu_1, \mu_2 \rangle$, then iterate:


E step: Calculate the expected value $E[z_{ij}]$ of each hidden variable $z_{ij}$, assuming the current hypothesis $h = \langle \mu_1, \mu_2 \rangle$ holds:

$$E[z_{ij}] = \frac{p(x = x_i\,|\,\mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i\,|\,\mu = \mu_n)} = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$$

M step: Calculate a new maximum likelihood hypothesis $h' = \langle \mu'_1, \mu'_2 \rangle$, assuming the value taken on by each hidden variable $z_{ij}$ is its expected value $E[z_{ij}]$ calculated above. Then replace $h = \langle \mu_1, \mu_2 \rangle$ by $h' = \langle \mu'_1, \mu'_2 \rangle$:

$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$$
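
A runnable sketch of this E/M loop for two 1-D Gaussians with a known, shared variance σ²; the data generation, σ value, and iteration count are our own example.

    import numpy as np

    rng = np.random.default_rng(1)
    sigma = 1.0
    # Sample data from two Gaussians with true means -2 and 3.
    x = np.concatenate([rng.normal(-2.0, sigma, 100), rng.normal(3.0, sigma, 100)])

    mu = rng.normal(0.0, 1.0, size=2)          # random initial h = <mu_1, mu_2>
    for _ in range(50):
        # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)).
        w = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma**2))
        w /= w.sum(axis=1, keepdims=True)
        # M step: mu_j <- weighted mean of the data under E[z_ij].
        mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)

    print(mu)   # approximately [-2, 3] (order may be swapped)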

EM Algorithm

Converges to a local-maximum-likelihood hypothesis h and provides estimates of the hidden variables $z_{ij}$. In fact, the local maximum is of $E[\ln P(Y|h)]$, where

• Y is the complete (observable plus unobservable variables) data

• the expected value is taken over the possible values of the unobserved variables in Y

General EM Problem

Given:
• Observed data X = {x1 , . . . , xm }
• Unobserved data Z = {z1 , . . . , zm }
• Parameterized probability distribution P(Y|h), where
– $Y = \{y_1, \ldots, y_m\}$ is the full data, with $y_i = x_i \cup z_i$
– h are the parameters
Determine:
• h that (locally) maximizes E[ln P (Y |h)]

General EM Method

Define a likelihood function $Q(h'|h)$, which calculates the likelihood of the full data $Y = X \cup Z$, using the observed X and the current parameters h to estimate Z:

$$Q(h'|h) \leftarrow E[\ln P(Y\,|\,h')\,|\,h, X]$$

EM Algorithm:

Estimation (E) step: Calculate $Q(h'|h)$ using the current hypothesis h and the observed data X to estimate the probability distribution over Y:

$$Q(h'|h) \leftarrow E[\ln P(Y\,|\,h')\,|\,h, X]$$

Maximization (M) step: Replace hypothesis h by the hypothesis h′ that maximizes this Q function:

$$h \leftarrow \arg\max_{h'} Q(h'|h)$$
