
CSE 575: Statistical Machine Learning


Jingrui He
CIDSE, ASU

Instance-based Learning

1-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric
   Euclidean (and many more)
2. How many nearby neighbors to look at?
   One
3. A weighting function (optional)
   Unused
4. How to fit with the local points?
   Just predict the same output as the nearest neighbor (see the sketch below).
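A minimal sketch of 1-NN prediction under a Euclidean distance metric, using plain NumPy (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x_query):
    """Return the output of the single closest training point (1-NN)."""
    # Euclidean distance from the query to every training example
    dists = np.linalg.norm(X_train - x_query, axis=1)
    return y_train[np.argmin(dists)]

# toy usage
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([0, 1, 1])
print(nearest_neighbor_predict(X_train, y_train, np.array([0.9, 1.2])))  # -> 1
```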

Consistency of 1-NN
Consider an estimator f_n trained on n examples, e.g., 1-NN, regression, ...

The estimator is consistent if the true error goes to zero as the amount of data increases,
e.g., for noise-free data, consistent if: true error of f_n -> 0 as n -> infinity

Regression is not consistent! (Representation bias)
1-NN is consistent (under some mild fine print)

What about variance???

1-NN overfits?

k-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric
   Euclidean (and many more)
2. How many nearby neighbors to look at?
   k
3. A weighting function (optional)
   Unused
4. How to fit with the local points?
   Just predict the average output among the k nearest neighbors (see the sketch below).
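A minimal sketch of k-NN prediction by unweighted averaging, again assuming Euclidean distance and NumPy (names are illustrative; k = 9 matches the example on the next slide):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=9):
    """Average the outputs of the k nearest training points (unweighted k-NN)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                     # indices of the k closest points
    return y_train[nearest].mean()                      # unweighted average of their outputs
```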

k-Nearest Neighbor (here k=9)

k-nearest neighbor for function fitting smooths away noise, but there are clear deficiencies.
What can we do about all the discontinuities that k-NN gives us?

Curse of dimensionality for instance-based learning

Must store and retrieve all data!
Most real work done during testing
For every test sample, must search through the whole dataset: very slow!
There are fast methods for dealing with large datasets, e.g., tree-based methods, hashing methods, ... (see the sketch below)
Instance-based learning is often poor with noisy or irrelevant features
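As one illustration of a tree-based speedup, a KD-tree answers nearest-neighbor queries much faster than a linear scan for low-dimensional data. This sketch uses SciPy's cKDTree; the dataset and query values are made up:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 3))          # 10k stored examples in 3 dimensions

tree = cKDTree(X_train)                         # build the tree once
dists, idx = tree.query([0.1, -0.2, 0.3], k=5)  # 5 nearest neighbors of one query point
print(idx)                                      # indices into X_train
```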

Support Vector Machines

Linear classifiers: Which line is better?

Data: n labeled examples
Example i: features x_i, label y_i in {-1, +1}

w.x = Σ_j w^(j) x^(j)

Pick the one with the largest margin!

w.x = Σ_j w^(j) x^(j)

[Figure: linearly separable data with the maximum-margin line w.x + b = 0]

Maximize the margin

[Figure: separating hyperplane w.x + b = 0 with its margin region]

But there are many such planes

[Figure: separating hyperplane w.x + b = 0]

Review: Normal to a plane


Normalized margin: Canonical hyperplanes

[Figure: canonical hyperplanes w.x + b = +1 and w.x + b = -1 on either side of w.x + b = 0, with support vectors x+ and x-, margin 2γ]

Margin maximization using canonical hyperplanes

[Figure: hyperplanes w.x + b = -1, w.x + b = 0, w.x + b = +1 with margin 2γ]

Support vector machines (SVMs)

Solve efficiently by quadratic programming (QP)
Well-studied solution algorithms
Hyperplane defined by support vectors

[Figure: maximum-margin hyperplane w.x + b = 0, canonical hyperplanes at +1 and -1, margin 2γ]
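As a reference for the two slides above, here is the standard way the margin between the canonical hyperplanes is computed and turned into a QP; this is reconstructed rather than copied from the deck (2γ denotes the full margin, as in the figures):

```latex
% Take x^+ on w.x + b = +1 and x^- on w.x + b = -1 with x^+ = x^- + \lambda w.
% Subtracting the two hyperplane equations gives \lambda\,(w \cdot w) = 2, so
%   2\gamma = \|x^+ - x^-\| = \lambda \|w\| = \frac{2}{\|w\|} .
% Maximizing the margin is therefore the quadratic program
\begin{aligned}
\min_{w,\,b}\quad & \tfrac{1}{2}\, w \cdot w \\
\text{s.t.}\quad & y_i\,(w \cdot x_i + b) \ge 1 \quad \text{for all } i .
\end{aligned}
```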

What if the data is not linearly separable?

Use features of features of features of features.

What if the data is still not linearly separable?

Minimize w.w and the number of training mistakes
Tradeoff two criteria?

Tradeoff #(mistakes) and w.w:
0/1 loss
Slack penalty C
Not QP anymore
Also doesn't distinguish near misses from really bad mistakes

Slack variables: Hinge loss

If margin ≥ 1, don't care
If margin < 1, pay linear penalty
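A sketch of what "pay a linear penalty" means formally, filled in from the standard soft-margin formulation the slide describes (ξ_i are the slack variables, C the slack penalty from the previous slide):

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\, w \cdot w + C \sum_i \xi_i \\
\text{s.t.}\quad & y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 ,
\end{aligned}
\qquad\Longleftrightarrow\qquad
\min_{w,\,b}\ \tfrac{1}{2}\, w \cdot w + C \sum_i \max\bigl(0,\ 1 - y_i\,(w \cdot x_i + b)\bigr).
```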

Side note: What's the difference between SVMs and logistic regression?

SVM: hinge loss
Logistic regression: log loss
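The two losses written out, since the slide's formulas did not survive extraction (standard forms, using the document's w.x + b notation):

```latex
\ell_{\text{hinge}}\bigl(y,\ w \cdot x + b\bigr) = \max\bigl(0,\ 1 - y\,(w \cdot x + b)\bigr),
\qquad
\ell_{\text{log}}\bigl(y,\ w \cdot x + b\bigr) = \ln\bigl(1 + e^{-y\,(w \cdot x + b)}\bigr).
```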

Constrained optimization

Lagrange multipliers: Dual variables

Moving the constraint to the objective function
Lagrangian:
Solve:
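A generic sketch of the construction, since the slide's worked example is missing; this is the standard pattern, not the slide's own problem:

```latex
% Problem:  \min_x f(x)  subject to  g(x) \ge 0
% Lagrangian, with dual variable \alpha \ge 0 moving the constraint into the objective:
L(x, \alpha) = f(x) - \alpha\, g(x)
% Solve the dual:  \max_{\alpha \ge 0}\ \min_x\ L(x, \alpha)
```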

Lagrange multipliers: Dual variables

Solving:

Dual SVM derivation (1): the linearly separable case

Dual SVM derivation (2): the linearly separable case

Dual SVM interpretation

[Figure: decision boundary w.x + b = 0, with the support vectors highlighted]

Dual SVM formulation: the linearly separable case
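The standard dual for the linearly separable case, filled in because the slide's equations are missing (α_i are the dual variables, one per training example):

```latex
\begin{aligned}
\max_{\alpha}\quad & \sum_i \alpha_i \;-\; \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j) \\
\text{s.t.}\quad & \alpha_i \ge 0 \ \ \forall i, \qquad \sum_i \alpha_i\, y_i = 0 ,
\end{aligned}
\qquad\text{with}\qquad
w = \sum_i \alpha_i\, y_i\, x_i .
```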

Dual SVM derivation: the non-separable case

Dual SVM formulation: the non-separable case
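For the non-separable case the dual has the same form; the only change (again a standard result, stated here because the slide's equations are missing) is that each α_i is capped by the slack penalty C:

```latex
\begin{aligned}
\max_{\alpha}\quad & \sum_i \alpha_i \;-\; \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j) \\
\text{s.t.}\quad & 0 \le \alpha_i \le C \ \ \forall i, \qquad \sum_i \alpha_i\, y_i = 0 .
\end{aligned}
```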

Why did we learn about the dual SVM?


There are some quadratic programming algorithms that can solve the dual faster than the primal
But, more importantly, the kernel trick!!! Another little detour

Reminder from last time: What if the data is not linearly separable?

Use features of features of features of features.
Feature space can get really large really quickly!

Higher order polynomials

m = number of input features, d = degree of polynomial

[Figure: number of monomial terms vs. number of input dimensions, for d = 2, 3, 4]

Grows fast! For d = 6, m = 100: about 1.6 billion terms
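A quick sanity check of that count, assuming it refers to the number of monomials of degree exactly d in m variables, i.e. C(m + d - 1, d):

```python
from math import comb

m, d = 100, 6
print(comb(m + d - 1, d))  # 1609344100 -> about 1.6 billion monomial terms
```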

Dual formulation only depends on dot-products, not on w!

Dot-product of polynomials
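A worked instance of the identity this slide is about, for degree 2 in two dimensions (the slide's own algebra is missing; this is the standard calculation):

```latex
\Phi(u) = \bigl(u_1^2,\ \sqrt{2}\,u_1 u_2,\ u_2^2\bigr)
\quad\Rightarrow\quad
\Phi(u) \cdot \Phi(v)
= u_1^2 v_1^2 + 2\,u_1 u_2\, v_1 v_2 + u_2^2 v_2^2
= (u \cdot v)^2 .
```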

Finally: the kernel trick!

Never represent features explicitly
Compute dot products in closed form
Constant-time high-dimensional dot-products for many classes of features
Very interesting theory: Reproducing Kernel Hilbert Spaces

Polynomial kernels
All monomials of degree d in O(d) operations:
How about all monomials of degree up to d?
Solution 0:
Better solution:
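The formulas these bullets point at, in their standard forms; which one the slide labels "Solution 0" versus "Better solution" is my reading of the usual treatment, not recovered from the deck:

```latex
% All monomials of degree exactly d, in O(d) time given u \cdot v:
K_d(u, v) = (u \cdot v)^d
% All monomials of degree up to d (the usual better solution):
K_{\le d}(u, v) = (u \cdot v + 1)^d
```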

Common kernels
Polynomials of degree d

Polynomials of degree up to d
Gaussian kernels
Sigmoid

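The Gaussian and sigmoid kernels in their usual closed forms (the polynomial forms are as above; the parameter names σ, η, ν are conventional, not taken from the slide):

```latex
K_{\text{Gaussian}}(u, v) = \exp\!\Bigl(-\frac{\|u - v\|^2}{2\sigma^2}\Bigr),
\qquad
K_{\text{sigmoid}}(u, v) = \tanh\bigl(\eta\, u \cdot v + \nu\bigr).
```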

Overfitting?

Huge feature space with kernels, what about overfitting???
Maximizing margin leads to a sparse set of support vectors
Some interesting theory says that SVMs search for simple hypotheses with large margin
Often robust to overfitting

What about at classification time?

For a new input x, if we need to represent Φ(x), we are in trouble!
Recall the classifier: sign(w.Φ(x) + b)
Using kernels we are cool!

SVMs with kernels


Choose a set of features and a kernel function
Solve the dual problem to obtain the support vectors α_i
At classification time, compute:
Classify as:
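The computation and decision rule the last two bullets refer to, in their standard kernelized form (reconstructed, since the slide's equations are missing):

```latex
f(x) = \sum_{i \in \text{SV}} \alpha_i\, y_i\, K(x_i, x) + b,
\qquad
\text{classify as}\quad \hat{y} = \operatorname{sign}\bigl(f(x)\bigr).
```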

What's the difference between SVMs and Logistic Regression?

                                          SVMs          Logistic Regression
Loss function                             Hinge loss    Log-loss
High dimensional features with kernels    Yes!          No

Kernels in logistic regression

Define weights in terms of support vectors:
Derive a simple gradient descent rule on α_i
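The substitution the slide describes, written out (standard kernelized logistic regression; unlike the SVM case, these α_i are generally not sparse):

```latex
w = \sum_i \alpha_i\, \Phi(x_i)
\quad\Rightarrow\quad
w \cdot \Phi(x) + b = \sum_i \alpha_i\, K(x_i, x) + b,
\qquad
P(y = +1 \mid x) = \frac{1}{1 + \exp\bigl(-\sum_i \alpha_i\, K(x_i, x) - b\bigr)} .
```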

What's the difference between SVMs and Logistic Regression? (Revisited)

                                          SVMs          Logistic Regression
Loss function                             Hinge loss    Log-loss
High dimensional features with kernels    Yes!          Yes!
Solution sparse                           Often yes!    Almost always no!
Semantics of output                       Margin        Real probabilities
