
COMS 4721: Machine Learning for Data Science

Lecture 1, 1/20/2015
Prof. John Paisley
Columbia University


OVERVIEW

This class will cover techniques for {learning, predicting, classifying, recommending, . . . } from data.

There are a few ways we can divide up the material as we go along, e.g.,
- supervised / unsupervised learning,
- probabilistic / non-probabilistic models,
- modeling approaches / inference techniques.

We'll adopt the first method and work in the second two along the way.


OVERVIEW: SUPERVISED LEARNING

[Figure: (a) Regression, (b) Classification]

Regression: Using a set of inputs, predict a real-valued output.

Classification: Using a set of inputs, predict a discrete label (aka class).


EXAMPLE CLASSIFICATION PROBLEM

Given a set of inputs characterizing an item, assign it a label.

Is this spam?
hi everyone,
i saw that close to my hotel there is a pub with bowling
(its on market between 9th and 10th avenue). meet
there at 8:30?

What about this?

Enter for a chance to win a trip to Universal Orlando to celebrate the arrival of Dr. Seuss's The Lorax on Movies On Demand on August 21st! Click here now!


OVERVIEW: UNSUPERVISED LEARNING

[Figure: (c) Recommendations, (d) Social networks]

With unsupervised learning our goal is to uncover underlying structure in the data. This helps with predictions, recommendations, and efficient data exploration.

EXAMPLE UNSUPERVISED PROBLEM


Goal: Learn the dominant topics from a set of news articles.


DATA MODELING

A good place to start for the remainder of today's lecture is with maximum likelihood estimation for a Gaussian distribution.


GAUSSIAN DISTRIBUTION (UNIVARIATE)

[Figure: Gaussian density in one dimension. Also shown: the cumulative distribution function $\Phi(z) = P(Z \le z) = \int_{-\infty}^{z} p(z')\,dz'$.]

$$p(x\mid\mu,\sigma) := \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

- $\mu$ = expected value of $x$: $E(x) = \int x\, p(x\mid\mu,\sigma)\,dx = \mu$,
- $\sigma^2$ = variance of $x$: $V(x) = \int (x - E(x))^2\, p(x\mid\mu,\sigma)\,dx = \sigma^2$,
- $\sigma$ = standard deviation = $\sqrt{\text{variance}}$.

The quotient $\frac{x-\mu}{\sigma}$ measures the deviation of $x$ from its expected value in units of $\sigma$ (i.e. $\sigma$ defines the length scale).
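Not part of the original slides, but as a quick sanity check of the formula, here is a minimal NumPy sketch; the helper name `gaussian_pdf` is ours, and we compare against `scipy.stats.norm`.

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density p(x | mu, sigma)."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

x = np.linspace(-3.0, 3.0, 7)
# Agrees with the reference implementation in SciPy.
print(np.allclose(gaussian_pdf(x, mu=0.5, sigma=2.0), norm.pdf(x, loc=0.5, scale=2.0)))  # True
```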


GAUSSIAN DISTRIBUTION (MULTIVARIATE)

Gaussian density in d dimensions

The quadratic function from the univariate density,

$$\frac{(x-\mu)^2}{2\sigma^2} = \frac{1}{2}(x-\mu)(\sigma^2)^{-1}(x-\mu),$$

is replaced by a quadratic form:

$$p(x\mid\mu,\Sigma) := \frac{1}{(2\pi)^{d/2}\sqrt{\det(\Sigma)}}\,\exp\!\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

The central moments are:

$$E(x) = \mu, \qquad \mathrm{Cov}(x) = E\big[(x - E(x))(x - E(x))^T\big] = \Sigma.$$
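Again not from the slides: a short sketch (names such as `mvn_pdf` are ours) evaluating this quadratic-form density and checking it against `scipy.stats.multivariate_normal`.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density p(x | mu, Sigma) at a single point x in R^d."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)               # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi)**(d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])
print(np.isclose(mvn_pdf(x, mu, Sigma), multivariate_normal.pdf(x, mean=mu, cov=Sigma)))  # True
```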


PARAMETRIC MODELS (STATISTICS)


Probability Models
A model is a set of probability distributions, $p(x\mid\theta)$. We pick the distribution family $p(\cdot)$. We index each distribution by a parameter value $\theta \in T$. The set $T$ is the parameter space of the model.

Example: Model the n data points, $x_i \in \mathbb{R}^d$, as coming from a Gaussian distribution $p(x_1, \ldots, x_n \mid \theta)$, where $\theta = \{\mu, \Sigma\}$.

Parametric model
A model is parametric if the number of free parameters in $\theta$ is:
(1) finite, and (2) independent of the number of data points.
Intuitively, the complexity of a parametric model doesn't increase with n.


MAXIMUM LIKELIHOOD ESTIMATION (MLE)


Setting
- Given: Data $x_1, \ldots, x_n$, parametric model $p(x\mid\theta)$, $\theta \in T$.
- Objective: Find the distribution in family $p(\cdot)$ which best explains the data. That means we have to choose a "best" parameter value $\theta \in T$.

Maximum Likelihood approach
Maximum likelihood assumes that the data is best explained using the parameter for which the observed data has the highest probability (or highest density value).

Hence, the maximum likelihood estimator is defined as

$$\hat{\theta}_{ML} := \arg\max_{\theta \in T} p(x_1, \ldots, x_n \mid \theta),$$

the parameter which maximizes the joint density of the data.
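To make the arg max concrete, here is a hedged illustration (ours, not from the lecture): a brute-force search over candidate values of the mean of a univariate Gaussian with the standard deviation fixed at 1, showing that the joint density is highest near the sample mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100)      # data drawn with true mean 2.0

# Candidate parameter values and the joint density p(x_1, ..., x_n | mu) for each.
mus = np.linspace(0.0, 4.0, 401)
joint = np.array([np.prod(norm.pdf(x, loc=mu, scale=1.0)) for mu in mus])

mu_ml = mus[np.argmax(joint)]
print(mu_ml)      # close to 2.0 (and to x.mean())
```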



ANALYTIC MAXIMUM LIKELIHOOD


The i.i.d. assumption
The standard assumption of ML methods is that the data is independent and identically distributed (iid). That is, the data $x_1, \ldots, x_n$ are generated by independently sampling n times from the same distribution.

Writing the density as $p(x\mid\theta)$, the joint density then decomposes as

$$p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$

Maximum Likelihood equation
The analytic criterion for this maximum likelihood estimator is:

$$\nabla_\theta \prod_{i=1}^{n} p(x_i \mid \theta) = 0.$$

Simply put, the maximum is at a peak. There is no upward direction.



LOGARITHM TRICK
Logarithm trick
Calculating $\prod_{i=1}^{n} p(x_i \mid \theta)$ can be very ugly. We use the fact that

$$\ln\Big(\prod_i f_i\Big) = \sum_i \ln(f_i).$$

Logarithms and maxima
The logarithm is monotonically increasing on $\mathbb{R}_+$.

Consequence: Application of ln does not change the location of a maximum or minimum:

$$\max_y \ln(g(y)) \neq \max_y g(y) \qquad \text{(the value changes)}$$

$$\arg\max_y \ln(g(y)) = \arg\max_y g(y) \qquad \text{(the location does not change)}$$
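A tiny numerical illustration of this consequence (ours, not the slides'): the maximizing location of $g$ and of $\ln g$ coincide, while the maximum values differ. The test function is an arbitrary choice.

```python
import numpy as np

y = np.linspace(0.01, 5.0, 1000)
g = y * np.exp(-y)                                 # some positive function of y

print(np.max(g), np.max(np.log(g)))                # different values
print(y[np.argmax(g)], y[np.argmax(np.log(g))])    # same location (~1.0)
```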


ANALYTIC MLE
Maximum Likelihood and the logarithm trick

$$\hat{\theta}_{ML} = \arg\max_\theta \prod_{i=1}^{n} p(x_i \mid \theta)
= \arg\max_\theta \ln\Big(\prod_{i=1}^{n} p(x_i \mid \theta)\Big)
= \arg\max_\theta \sum_{i=1}^{n} \ln p(x_i \mid \theta)$$

Analytic maximality criterion
The same criterion applies to the logarithm: Find

$$\nabla_\theta \sum_{i=1}^{n} \ln p(x_i \mid \theta) = \sum_{i=1}^{n} \nabla_\theta \ln p(x_i \mid \theta) = 0.$$

Whether we can solve this analytically, numerically, or only approximately depends on the choice of the model.


EXAMPLE: MULTIVARIATE GAUSSIAN MLE


Model: Multivariate Gaussians
The model is the set of all Gaussian densities on $\mathbb{R}^d$ with unknown parameters $\theta = \{\mu, \Sigma\}$, where $T = \mathbb{R}^d \times S^d_{++}$.
That is, $\mu \in \mathbb{R}^d$ and $\Sigma \in S^d_{++}$ (a positive definite $d \times d$ matrix).

We assume that $x_1, \ldots, x_n$ are iid, written $x_i \overset{iid}{\sim} p(x \mid \mu, \Sigma)$.

MLE equation
We have to solve the equation

$$\nabla_{(\mu,\Sigma)} \sum_{i=1}^{n} \ln p(x_i \mid \mu, \Sigma) = 0$$

for $\mu$ and $\Sigma$. (Try doing this without the log to appreciate its usefulness.)
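Before the analytic solution on the next slides, a hedged sketch of attacking the same maximization numerically; to keep the optimizer unconstrained we fix $\Sigma$ to a known value and optimize over $\mu$ only (the use of `scipy.optimize.minimize` and the fixed $\Sigma$ are our choices, not the lecture's).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])                      # treated as known here
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=Sigma, size=500)

def neg_log_lik(mu):
    """Negative joint log-likelihood, sum_i -ln p(x_i | mu, Sigma)."""
    return -np.sum(multivariate_normal.logpdf(X, mean=mu, cov=Sigma))

res = minimize(neg_log_lik, x0=np.zeros(2))
print(res.x, X.mean(axis=0))   # the numerical maximizer matches the sample mean
```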


EXAMPLE: GAUSSIAN MEAN MLE


First take the gradient with respect to $\mu$:

$$
\begin{aligned}
0 &= \nabla_\mu \sum_{i=1}^{n} \ln\left[\frac{1}{\sqrt{(2\pi)^d |\Sigma|}}
      \exp\!\left(-\frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)\right)\right] \\
  &= \nabla_\mu \sum_{i=1}^{n} \left[-\frac{1}{2}\ln\big((2\pi)^d |\Sigma|\big)
      - \frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)\right] \\
  &= -\frac{1}{2}\sum_{i=1}^{n} \nabla_\mu\left(x_i^T \Sigma^{-1} x_i
      - 2\mu^T \Sigma^{-1} x_i + \mu^T \Sigma^{-1} \mu\right)
   = \Sigma^{-1} \sum_{i=1}^{n} (x_i - \mu)
\end{aligned}
$$

Since $\Sigma$ is positive definite, the only solution is

$$\sum_{i=1}^{n} (x_i - \mu) = 0
\qquad\Longrightarrow\qquad
\hat{\mu}_{ML} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

Since this solution is independent of $\Sigma$, it doesn't depend on $\hat{\Sigma}_{ML}$.
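A quick finite-difference check (ours, not the lecture's) that the gradient of the log-likelihood with respect to $\mu$ really vanishes at the sample mean; the step size and test data are arbitrary choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
Sigma = np.array([[1.5, 0.2],
                  [0.2, 0.8]])
X = rng.multivariate_normal(mean=[0.5, 1.0], cov=Sigma, size=300)
mu_ml = X.mean(axis=0)

def log_lik(mu):
    return np.sum(multivariate_normal.logpdf(X, mean=mu, cov=Sigma))

# Central finite differences of the log-likelihood at mu_ml: ~0 in each coordinate.
eps = 1e-5
grad = np.array([(log_lik(mu_ml + eps * e) - log_lik(mu_ml - eps * e)) / (2 * eps)
                 for e in np.eye(2)])
print(grad)   # approximately [0, 0]
```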

EXAMPLE: GAUSSIAN COVARIANCE MLE


Now take the gradient with respect to $\Sigma$:

$$
\begin{aligned}
0 &= \nabla_\Sigma \sum_{i=1}^{n} \left[-\frac{1}{2}\ln\big((2\pi)^d |\Sigma|\big)
      - \frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)\right] \\
  &= \nabla_\Sigma \left[-\frac{n}{2}\ln|\Sigma|
      - \frac{1}{2}\,\mathrm{trace}\!\Big(\Sigma^{-1}\sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^T\Big)\right] \\
  &= -\frac{n}{2}\Sigma^{-1} + \frac{1}{2}\,\Sigma^{-2}\sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^T
\end{aligned}
$$

Solving for $\Sigma$ and plugging in $\mu = \hat{\mu}_{ML}$,

$$\hat{\Sigma}_{ML} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu}_{ML})(x_i - \hat{\mu}_{ML})^T.$$


EXAMPLE: GAUSSIAN MLE (SUMMARY)

So if we have data $x_1, \ldots, x_n$ that we hypothesize is iid Gaussian, the maximum likelihood values of the mean and covariance matrix are

$$\hat{\mu}_{ML} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\hat{\Sigma}_{ML} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu}_{ML})(x_i - \hat{\mu}_{ML})^T.$$

So are we done? There are many assumptions/issues with this approach that make finding the best parameter values not a complete victory.

- We made a model assumption (multivariate Gaussian).
- We made an iid assumption.
- Often we want to use $\hat{\theta}_{ML}$ to make predictions about $x_{new}$. $\hat{\theta}_{ML}$ describes $x_1, \ldots, x_n$ the best, but how does this generalize to $x_{new}$? If $x_1, \ldots, x_n$ don't capture the space well, $\hat{\theta}_{ML}$ can overfit to the data.
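As a closing sanity check (our sketch, not part of the slides), the two ML formulas above are one line each in NumPy; note that `np.cov` uses the 1/(n-1) convention by default, so we pass `bias=True` to get the 1/n ML estimate.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=1000)

mu_ml = X.mean(axis=0)                                   # (1/n) sum_i x_i
Sigma_ml = (X - mu_ml).T @ (X - mu_ml) / len(X)          # (1/n) sum_i (x_i - mu)(x_i - mu)^T

print(np.allclose(Sigma_ml, np.cov(X.T, bias=True)))     # True
print(mu_ml, "\n", Sigma_ml)
```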

