
# COMS 4721: Machine Learning for Data Science

Lecture 1, 1/20/2015
Prof. John Paisley
Columbia University


## OVERVIEW

This class will cover techniques for {learning, predicting, classifying, recommending, ...} from data.

There are a few ways we can divide up the material as we go along, e.g.,

- supervised / unsupervised learning,
- probabilistic / non-probabilistic models,
- modeling approaches / inference techniques.

We'll adopt the first division and work in the second two along the way.


## OVERVIEW: SUPERVISED LEARNING

*(Figure: (a) Regression, (b) Classification)*

Regression: Using a set of inputs, predict a real-valued output.

Classification: Using a set of inputs, predict a discrete label (aka class).


## Given a set of inputs characterizing an item, assign it a label.

Is this spam?

hi everyone,
i saw that close to my hotel there is a pub with bowling (it's on market between 9th and 10th avenue). meet there at 8:30?

Enter for a chance to win a trip to Universal Orlando to celebrate the arrival of Dr. Seuss's The Lorax on Movies


## OVERVIEW: UNSUPERVISED LEARNING

*(Figure: (c) Recommendations)*

With unsupervised learning our goal is to uncover underlying structure in the data. This helps with prediction, recommendation, and efficient data exploration.
COMS W4721: Machine Learning for Data Science, Spring 2015


## EXAMPLE UNSUPERVISED PROBLEM

Goal: Learn the dominant topics from a set of news articles.



## DATA MODELING

A good place to start for the remainder of today's lecture is maximum likelihood estimation for a Gaussian distribution.


The univariate Gaussian density is

$$p(x \mid \mu, \sigma) := \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

*(Figure: the bell-shaped density curve. Cumulative distribution function: $\Phi(z) = P(Z \le z) = \int_{-\infty}^{z} p(z')\,dz'$.)*

- $\mu$ = expected value of $x$: $\mathbb{E}(x) = \int x\, p(x \mid \mu, \sigma)\, dx = \mu$,
- $\sigma^2$ = variance of $x$: $\mathbb{V}(x) = \int (x - \mathbb{E}(x))^2\, p(x \mid \mu, \sigma)\, dx = \sigma^2$,
- $\sigma$ = standard deviation = $\sqrt{\text{variance}}$.

The quotient $\frac{x-\mu}{\sigma}$ measures the deviation of $x$ from its expected value in units of $\sigma$ (i.e., $\sigma$ defines the length scale).

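The slides give no code, but the density above is easy to check numerically. A minimal Python/numpy sketch (the language choice is mine, not the lecture's), verifying the peak height and unit total mass of the standard normal:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density p(x | mu, sigma) as defined above."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Peak of the standard normal is 1 / sqrt(2*pi).
peak = gaussian_pdf(0.0, 0.0, 1.0)

# Riemann-sum check that the density integrates to 1 over a wide grid.
grid = np.linspace(-10.0, 10.0, 200_001)
total_mass = float(np.sum(gaussian_pdf(grid, 0.0, 1.0)) * (grid[1] - grid[0]))
```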

## Gaussian density in d dimensions

The quadratic function in the univariate density,

$$\frac{(x-\mu)^2}{2\sigma^2} = \frac{1}{2}(x-\mu)(\sigma^2)^{-1}(x-\mu),$$

generalizes to the quadratic form $\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)$, giving

$$p(x \mid \mu, \Sigma) := \frac{1}{(2\pi)^{d/2}\sqrt{\det(\Sigma)}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

## The central moments are:

$$\mathbb{E}(x) = \mu, \qquad \mathrm{Cov}(x) = \mathbb{E}\left[(x - \mathbb{E}(x))(x - \mathbb{E}(x))^T\right] = \Sigma.$$
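As an illustration (numpy, my choice of tool), the $d$-dimensional density can be evaluated directly from the formula above; at $x = \mu$ with $\Sigma = I$ the exponent vanishes and the value reduces to $(2\pi)^{-d/2}$:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density in d dimensions, following the formula above."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * quad) / norm_const)

# With d = 2, Sigma = I and x = mu, the density equals (2*pi)^{-1}.
mu = np.zeros(2)
val_at_mean = mvn_pdf(mu, mu, np.eye(2))
```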


## PARAMETRIC MODELS (STATISTICS)

Probability Models

A model is a set of probability distributions, $p(x \mid \theta)$. We pick the distribution family $p(\cdot)$ and index each distribution by a parameter value $\theta \in \mathcal{T}$. The set $\mathcal{T}$ is the parameter space of the model.

Example: Model the $n$ data points, $x_i \in \mathbb{R}^d$, as coming from a Gaussian distribution $p(x_1, \ldots, x_n \mid \theta)$, where $\theta = \{\mu, \Sigma\}$.

Parametric model

A model is parametric if the number of free parameters in $\theta$ is (1) finite, and (2) independent of the number of data points. Intuitively, the complexity of a parametric model doesn't increase with $n$.


Setting

- Given: Data $x_1, \ldots, x_n$ and a parametric model $p(x \mid \theta)$, $\theta \in \mathcal{T}$.
- Objective: Find the distribution in the family $p(\cdot)$ which best explains the data. That means we have to choose a "best" parameter value $\theta \in \mathcal{T}$.

## Maximum Likelihood approach

Maximum likelihood assumes that the data is best explained by the parameter value for which the observed data has the highest probability (or highest density value). Hence, the maximum likelihood estimator is defined as

$$\theta_{\mathrm{ML}} := \arg\max_{\theta \in \mathcal{T}} p(x_1, \ldots, x_n \mid \theta),$$

the parameter which maximizes the joint density of the data.
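The definition can be made concrete with a brute-force sketch: score a grid of candidate parameter values by the joint density of simulated data and take the argmax (Python/numpy; the true mean 3.0, the known scale, and the grid are all illustrative assumptions of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=50)  # simulated observations

def joint_density(mu, x, sigma=1.0):
    """Joint density p(x_1, ..., x_n | mu) under N(mu, sigma^2)."""
    return np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2))

# theta_ML is the candidate parameter under which the observed data
# has the highest joint density.
candidates = np.linspace(0.0, 6.0, 601)
mu_ml_grid = candidates[np.argmax([joint_density(m, data) for m in candidates])]
```

The grid maximizer sits next to the sample mean of the data, which is the analytic answer derived at the end of the lecture.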



## ANALYTIC MAXIMUM LIKELIHOOD

The i.i.d. assumption

The standard assumption of ML methods is that the data are independent and identically distributed (i.i.d.). That is, the data $x_1, \ldots, x_n$ are generated by independently sampling $n$ times from the same distribution. Writing the density as $p(x \mid \theta)$, the joint density then decomposes as

$$p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$

## Maximum Likelihood equation

The analytic criterion for the maximum likelihood estimator is:

$$\nabla_{\theta} \prod_{i=1}^{n} p(x_i \mid \theta) = 0.$$

Simply put, the maximum is at a peak. There is no upward direction.
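"No upward direction" can be checked numerically. A sketch (numpy; scale fixed at 1 for simplicity, and using the logarithm of the joint density for numerical stability) comparing a finite-difference derivative at the peak versus away from it:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=200)

def log_joint(mu):
    """Logarithm of the joint Gaussian density of the data (scale fixed at 1)."""
    return np.sum(-0.5 * np.log(2 * np.pi) - (x - mu) ** 2 / 2.0)

# Central finite differences: at the maximizer (the sample mean, for this
# model) the derivative vanishes; away from it there is an uphill direction.
h = 1e-5
mu_hat = x.mean()
grad_at_peak = (log_joint(mu_hat + h) - log_joint(mu_hat - h)) / (2 * h)
grad_away = (log_joint(mu_hat - 1.0 + h) - log_joint(mu_hat - 1.0 - h)) / (2 * h)
```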



## LOGARITHM TRICK

Logarithm trick

Working directly with the product $\prod_{i=1}^{n} f_i$ is cumbersome; the logarithm turns products into sums:

$$\ln\left(\prod_i f_i\right) = \sum_i \ln(f_i).$$

## Logarithms and maxima

The logarithm is monotonically increasing on $\mathbb{R}^+$.

Consequence: Applying $\ln$ does not change the location of a maximum or minimum:

- $\max_y \ln(g(y)) \neq \max_y g(y)$: the value changes.
- $\arg\max_y \ln(g(y)) = \arg\max_y g(y)$: the location does not change.
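Both points have a concrete numerical side, sketched below (numpy, my choice): the raw product of many densities underflows double precision to exactly 0.0, while the sum of logs stays an ordinary number and still locates the peak.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=2000)

def pdf(v, mu):
    return np.exp(-(v - mu) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

# 2000 factors, each well below 1: the product underflows to exactly 0.0 ...
raw_product = float(np.prod(pdf(x, 0.0)))

# ... but the equivalent sum of logarithms is a perfectly ordinary number.
log_sum = float(np.sum(np.log(pdf(x, 0.0))))

# Maximizing the summed log-density over a parameter grid locates the peak
# the (un-computable) product has at the same location.
mus = np.linspace(-1.0, 1.0, 201)
argmax_log = mus[np.argmax([np.sum(np.log(pdf(x, m))) for m in mus])]
```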


## ANALYTIC MLE

Maximum Likelihood and the logarithm trick

$$\theta_{\mathrm{ML}} = \arg\max_{\theta} \prod_{i=1}^{n} p(x_i \mid \theta) = \arg\max_{\theta} \ln\left(\prod_{i=1}^{n} p(x_i \mid \theta)\right) = \arg\max_{\theta} \sum_{i=1}^{n} \ln p(x_i \mid \theta)$$

## Analytic maximality criterion

The same criterion applies to the logarithm: find $\theta$ such that

$$\nabla_{\theta} \sum_{i=1}^{n} \ln p(x_i \mid \theta) = \sum_{i=1}^{n} \nabla_{\theta} \ln p(x_i \mid \theta) = 0.$$

Whether this can be solved analytically, numerically, or only approximately depends on the choice of the model.
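When no closed form is at hand, the same criterion can be attacked numerically. A gradient-ascent sketch (numpy; the step size and iteration count are arbitrary choices of mine) for the Gaussian mean with the scale treated as known:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(5.0, 2.0, size=1000)
sigma = 2.0  # treat the scale as known for this sketch

def grad_log_lik(mu):
    """d/d(mu) of sum_i ln p(x_i | mu, sigma) = sum_i (x_i - mu) / sigma^2."""
    return np.sum(x - mu) / sigma ** 2

# Plain gradient ascent on the log-likelihood.
mu = 0.0
for _ in range(500):
    mu += 1e-3 * grad_log_lik(mu)
mu_numeric = mu
```

For this model the analytic solution exists (the sample mean), so the numeric route can be checked against it.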


## EXAMPLE: MULTIVARIATE GAUSSIAN MLE

Model: Multivariate Gaussians

The model is the set of all Gaussian densities on $\mathbb{R}^d$ with unknown parameters $\theta = \{\mu, \Sigma\}$, where $\mathcal{T} = \mathbb{R}^d \times S_{++}^d$. That is, $\mu \in \mathbb{R}^d$ and $\Sigma \in S_{++}^d$ (the positive definite $d \times d$ matrices). We assume that $x_1, \ldots, x_n$ are i.i.d. draws from $p(x \mid \mu, \Sigma)$, written $x_i \overset{\text{iid}}{\sim} p(x \mid \mu, \Sigma)$.

MLE equation

We have to solve the equation

$$\nabla_{(\mu, \Sigma)} \sum_{i=1}^{n} \ln p(x_i \mid \mu, \Sigma) = 0$$

for $\mu$ and $\Sigma$. (Try doing this without the log to appreciate its usefulness.)


## EXAMPLE: GAUSSIAN MEAN MLE

First take the gradient with respect to $\mu$:

$$0 = \nabla_{\mu} \sum_{i=1}^{n} \ln \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-\frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)\right)$$

$$= \nabla_{\mu} \sum_{i=1}^{n} \left[-\frac{1}{2}\ln\left((2\pi)^d |\Sigma|\right) - \frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)\right]$$

$$= -\frac{1}{2} \sum_{i=1}^{n} \nabla_{\mu} \left[x_i^T \Sigma^{-1} x_i - 2\mu^T \Sigma^{-1} x_i + \mu^T \Sigma^{-1} \mu\right] = \Sigma^{-1} \sum_{i=1}^{n} (x_i - \mu)$$

Since $\Sigma$ is positive definite, the only solution is

$$\sum_{i=1}^{n} (x_i - \mu) = 0 \quad\Longrightarrow\quad \mu_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

Since this solution is independent of $\Sigma$, it doesn't depend on $\Sigma_{\mathrm{ML}}$.


## EXAMPLE: GAUSSIAN COVARIANCE MLE

Now take the gradient with respect to $\Sigma$:

$$0 = \nabla_{\Sigma} \sum_{i=1}^{n} \left[-\frac{1}{2}\ln\left((2\pi)^d |\Sigma|\right) - \frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)\right]$$

$$= \nabla_{\Sigma} \left[-\frac{n}{2}\ln|\Sigma| - \frac{1}{2}\,\mathrm{trace}\left(\Sigma^{-1} \sum_{i=1}^{n} (x_i-\mu)(x_i-\mu)^T\right)\right]$$

$$= -\frac{n}{2}\Sigma^{-1} + \frac{1}{2}\,\Sigma^{-1}\left(\sum_{i=1}^{n} (x_i-\mu)(x_i-\mu)^T\right)\Sigma^{-1}$$

Plugging in $\mu_{\mathrm{ML}}$ and solving for $\Sigma$ gives

$$\Sigma_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu_{\mathrm{ML}})(x_i - \mu_{\mathrm{ML}})^T.$$


So if we have data $x_1, \ldots, x_n$ that we hypothesize is i.i.d. Gaussian, the maximum likelihood values of the mean and covariance matrix are

$$\mu_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \Sigma_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu_{\mathrm{ML}})(x_i - \mu_{\mathrm{ML}})^T.$$
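These closed forms are two lines of numpy (a sketch on synthetic data of my choosing). Note the $1/n$ normalizer: it matches `np.cov` with `bias=True`, not numpy's default unbiased $1/(n-1)$.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))  # n = 500 points in d = 3 dimensions

# ML estimates from the derivation above: sample mean and 1/n covariance.
mu_ml = X.mean(axis=0)
centered = X - mu_ml
Sigma_ml = centered.T @ centered / len(X)

# Same quantity via numpy's covariance routine with the biased normalizer.
Sigma_np = np.cov(X, rowvar=False, bias=True)
```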

So are we done? There are several assumptions and issues with this approach that make finding the best parameter values not a complete victory.

- Often we want to use $\theta_{\mathrm{ML}}$ to make predictions about $x_{\text{new}}$. $\theta_{\mathrm{ML}}$ describes $x_1, \ldots, x_n$ the best, but how does this generalize to $x_{\text{new}}$? If $x_1, \ldots, x_n$ don't capture the space well, $\theta_{\mathrm{ML}}$ can overfit to the data.
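One concrete way the overfitting concern bites (my example, not the slide's): with fewer samples than dimensions, $\Sigma_{\mathrm{ML}}$ is singular, so the fitted Gaussian concentrates on a low-dimensional subspace and assigns zero density to almost every new point.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 5, 10  # fewer samples than dimensions
X = rng.normal(size=(n, d))

mu_ml = X.mean(axis=0)
centered = X - mu_ml
Sigma_ml = centered.T @ centered / n

# Centering removes one degree of freedom, so rank(Sigma_ML) <= n - 1 < d:
# the estimate is singular and cannot be inverted in the density formula.
rank = int(np.linalg.matrix_rank(Sigma_ml))
```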
