Introduction to Statistical Machine Learning

Christfried Webers

Statistical Machine Learning Group
NICTA
and
College of Engineering and Computer Science
The Australian National University

Canberra
February to June 2013

© 2013 Christfried Webers, NICTA, The Australian National University

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Neural Networks 1
Neural Networks 2
Kernel Methods
Sparse Kernel Methods
Graphical Models 1
Graphical Models 2
Graphical Models 3
Mixture Models and EM 1
Mixture Models and EM 2
Approximate Inference
Sampling
Principal Component Analysis
Sequential Data 1
Sequential Data 2
Combining Models
Selected Topics
Discussion and Summary

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Part II

Introduction

Polynomial Curve Fitting
Probability Theory
Probability Densities
Expectations and Covariances

Polynomial Curve Fitting

Some artificial data created from the function $\sin(2\pi x)$ + random noise, for $x = 0, \dots, 1$.

Polynomial Curve Fitting - Input Specification

$N = 10$
$\mathbf{x} \equiv (x_1, \dots, x_N)^T$
$\mathbf{t} \equiv (t_1, \dots, t_N)^T$
$x_i \in \mathbb{R}, \quad i = 1, \dots, N$
$t_i \in \mathbb{R}, \quad i = 1, \dots, N$
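A data set of this form could be generated as in the following minimal sketch. It assumes NumPy, evenly spaced inputs on [0, 1], and Gaussian noise with standard deviation 0.3. The slides do not state the noise distribution, noise level, or seed, so `make_data`, `noise_std`, and `seed` are illustrative names and values only.

```python
import numpy as np

def make_data(n_points=10, noise_std=0.3, seed=0):
    """Return inputs x and noisy targets t = sin(2*pi*x) + noise."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)            # x_1, ..., x_N in [0, 1]
    noise = rng.normal(0.0, noise_std, n_points)   # assumed Gaussian noise
    t = np.sin(2.0 * np.pi * x) + noise
    return x, t

x, t = make_data()
print(x.shape, t.shape)
```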

Polynomial Curve Fitting - Model Specification

M: order of the polynomial

$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{m=0}^{M} w_m x^m$$

nonlinear function of x
linear function of the unknown model parameter w
How can we find good parameters $\mathbf{w} = (w_0, \dots, w_M)^T$?
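The model can be evaluated as a matrix-vector product, which makes the linearity in w explicit. This is a minimal sketch assuming NumPy; `design_matrix` and `predict` are illustrative names, not part of the lecture.

```python
import numpy as np

def design_matrix(x, order):
    """Columns are x**0, x**1, ..., x**order (shape N x (order + 1))."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, order + 1, increasing=True)

def predict(x, w):
    """Evaluate y(x, w) = sum_m w[m] * x**m for every input in x."""
    w = np.asarray(w, dtype=float)
    return design_matrix(x, len(w) - 1) @ w

# y(x, w) = 1 + 2x + 3x^2 evaluated at x = 0, 1, 2 gives the values 1, 6, 17
print(predict([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
```

Because the prediction is a product of a fixed matrix with w, adding two parameter vectors adds their predictions, even though each prediction is a nonlinear function of x.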

Learning is Improving Performance

[Figure: target $t_n$ and model prediction $y(x_n, \mathbf{w})$ at input $x_n$; vertical axis t]

Performance measure: error between the target and the prediction of the model for the training data

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right)^2$$

unique minimum of $E(\mathbf{w})$ for argument $\mathbf{w}^\star$
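Since y is linear in w, minimising E(w) is a linear least-squares problem. The sketch below assumes NumPy and uses `np.linalg.lstsq`; the function names and the toy data are illustrative. The minimiser is unique when the design matrix has full column rank, which holds when there are at least M + 1 distinct inputs.

```python
import numpy as np

def fit_polynomial(x, t, order):
    """Return w* minimising E(w) = 0.5 * sum_n (y(x_n, w) - t_n)**2."""
    phi = np.vander(np.asarray(x, dtype=float), order + 1, increasing=True)
    w_star, *_ = np.linalg.lstsq(phi, np.asarray(t, dtype=float), rcond=None)
    return w_star

def sum_of_squares_error(x, t, w):
    """E(w) as defined on the slide."""
    phi = np.vander(np.asarray(x, dtype=float), len(w), increasing=True)
    residual = phi @ w - np.asarray(t, dtype=float)
    return 0.5 * float(residual @ residual)

# Hypothetical toy data lying exactly on t = 1 + 2x
x = np.array([0.0, 0.5, 1.0])
t = 1.0 + 2.0 * x
w_star = fit_polynomial(x, t, order=1)
print(w_star, sum_of_squares_error(x, t, w_star))
```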

Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m \Big|_{M=0} = w_0$$

[Figure: polynomial fit for M = 0; vertical axis t]

Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m \Big|_{M=1} = w_0 + w_1 x$$

[Figure: polynomial fit for M = 1; vertical axis t]

Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m \Big|_{M=3} = w_0 + w_1 x + w_2 x^2 + w_3 x^3$$

[Figure: polynomial fit for M = 3; vertical axis t]

Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m \Big|_{M=9} = w_0 + w_1 x + \dots + w_8 x^8 + w_9 x^9$$

overfitting

[Figure: polynomial fit for M = 9; vertical axis t]

Testing the Model

Train the model and get $\mathbf{w}^\star$
Get 100 new data points
Root-mean-square (RMS) error

$$E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^\star) / N}$$

[Figure: $E_{\mathrm{RMS}}$ for the training set and the test set; curves labelled Training and Test]
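The train-then-test procedure can be sketched as follows, assuming NumPy, Gaussian noise with standard deviation 0.3, uniformly drawn test inputs, and a fixed seed. None of these choices are stated in the slides, and the numbers printed depend on them.

```python
import numpy as np

def rms_error(x, t, w):
    """E_RMS = sqrt(2 E(w) / N), i.e. the root of the mean squared residual."""
    phi = np.vander(x, len(w), increasing=True)
    return float(np.sqrt(np.mean((phi @ w - t) ** 2)))

rng = np.random.default_rng(0)                      # assumed seed
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, 10)
x_test = rng.uniform(0.0, 1.0, 100)                 # 100 new data points
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0.0, 0.3, 100)

results = {}
for order in (0, 1, 3, 9):
    phi = np.vander(x_train, order + 1, increasing=True)
    w_star, *_ = np.linalg.lstsq(phi, t_train, rcond=None)
    results[order] = (rms_error(x_train, t_train, w_star),
                      rms_error(x_test, t_test, w_star))
    print(order, results[order])                    # (training E_RMS, test E_RMS)
```

The training error cannot increase with M because each lower-order polynomial is a special case of the higher-order one. For M = 9 and N = 10 the polynomial passes through every training point, so only the test error reveals the overfitting.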

Testing the Model

          M = 0    M = 1    M = 3          M = 9
w*_0       0.19     0.82     0.31           0.35
w*_1               -1.27     7.99         232.37
w*_2                       -25.43       -5321.83
w*_3                        17.37       48568.31
w*_4                                  -231639.30
w*_5                                   640042.26
w*_6                                 -1061800.52
w*_7                                  1042400.18
w*_8                                  -557682.99
w*_9                                   125201.43

Table: Coefficients $\mathbf{w}^\star$ for polynomials of various order.

More Data

[Figure: polynomial fit with N = 15 data points]

More Data

N = 100
Heuristic: have no fewer than 5 to 10 times as many data points as parameters.
But the number of parameters is not necessarily the most appropriate measure of model complexity!
Later: Bayesian approach

[Figure: polynomial fit with N = 100 data points; vertical axis t]

Regularisation

How to constrain the growth of the coefficients w?
Add a regularisation term to the error function

$$\widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$$

Squared norm of the parameter vector w

$$\|\mathbf{w}\|^2 \equiv \mathbf{w}^T \mathbf{w} = w_0^2 + w_1^2 + \dots + w_M^2$$
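Setting the gradient of the regularised error to zero gives the linear system $(\Phi^T \Phi + \lambda I)\,\mathbf{w} = \Phi^T \mathbf{t}$, where $\Phi$ is the matrix with entries $x_n^m$. The sketch below solves it directly, assuming NumPy; the data, noise level, seed, and the name `fit_regularised` are illustrative assumptions.

```python
import numpy as np

def fit_regularised(x, t, order, lam):
    """Minimise 0.5 * sum_n (y(x_n, w) - t_n)**2 + 0.5 * lam * ||w||**2."""
    phi = np.vander(np.asarray(x, dtype=float), order + 1, increasing=True)
    a = phi.T @ phi + lam * np.eye(order + 1)
    return np.linalg.solve(a, phi.T @ np.asarray(t, dtype=float))

rng = np.random.default_rng(0)                       # assumed seed and noise level
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 10)

w_weak = fit_regularised(x, t, order=9, lam=np.exp(-18.0))   # ln(lambda) = -18
w_strong = fit_regularised(x, t, order=9, lam=1.0)           # ln(lambda) = 0
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```

A larger lambda penalises large coefficients more heavily, so the norm of the solution shrinks as lambda grows.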

Regularisation

[Figure: polynomial fit for M = 9 with $\ln \lambda = -18$]

Regularisation

[Figure: polynomial fit for M = 9 with $\ln \lambda = 0$]

Regularisation

[Figure: $E_{\mathrm{RMS}}$ versus $\ln \lambda$ for M = 9; curves labelled Training and Test]

Probability Theory

[Figure: joint distribution p(X, Y) for Y = 2 and Y = 1]

Probability Theory

Y vs. X    a    b    c    d    e    f    g    h    i    sum
2          0    0    0    1    4    5    8    6    2    26
1          3    6    8    8    5    3    1    0    0    34
sum        3    6    8    9    9    8    9    6    2    60

[Figure: joint distribution p(X, Y) for Y = 2 and Y = 1]

Sum Rule

Y vs. X    a    b    c    d    e    f    g    h    i    sum
2          0    0    0    1    4    5    8    6    2    26
1          3    6    8    8    5    3    1    0    0    34
sum        3    6    8    9    9    8    9    6    2    60

$p(X = d, Y = 1) = 8/60$
$p(X = d) = p(X = d, Y = 2) + p(X = d, Y = 1) = 1/60 + 8/60$
$p(X = d) = \sum_Y p(X = d, Y)$
$p(X) = \sum_Y p(X, Y)$

Sum Rule

$p(X) = \sum_Y p(X, Y)$
$p(Y) = \sum_X p(X, Y)$

[Figures: marginal distributions p(X) and p(Y)]
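The sum rule can be checked numerically on the count table of the example. This is a minimal sketch assuming NumPy; the array layout (rows for Y, columns for X) is a choice made here for illustration.

```python
import numpy as np

# Counts from the table: rows are Y = 2 and Y = 1, columns are X = a, ..., i
x_values = list("abcdefghi")
counts = np.array([[0, 0, 0, 1, 4, 5, 8, 6, 2],    # Y = 2
                   [3, 6, 8, 8, 5, 3, 1, 0, 0]])   # Y = 1
joint = counts / counts.sum()          # p(X, Y); the total count is 60

p_x = joint.sum(axis=0)                # sum rule: p(X) = sum_Y p(X, Y)
p_y = joint.sum(axis=1)                # sum rule: p(Y) = sum_X p(X, Y)

d = x_values.index("d")
print(p_x[d], p_y)                     # 9/60, then 26/60 and 34/60
```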

Product Rule

Conditional Probability

$p(X = d \mid Y = 1) = 8/34$

Calculate p(Y = 1):

$p(Y = 1) = \sum_X p(X, Y = 1) = 34/60$

$p(X = d, Y = 1) = p(X = d \mid Y = 1)\, p(Y = 1)$

$p(X, Y) = p(X \mid Y)\, p(Y)$
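The conditional probability and the product rule can be verified on the same count table. This is a minimal sketch assuming NumPy, with rows indexing Y and columns indexing X as an illustrative layout.

```python
import numpy as np

counts = np.array([[0, 0, 0, 1, 4, 5, 8, 6, 2],    # Y = 2
                   [3, 6, 8, 8, 5, 3, 1, 0, 0]])   # Y = 1
joint = counts / counts.sum()                      # p(X, Y)
p_y = joint.sum(axis=1, keepdims=True)             # p(Y), shape (2, 1)
p_x_given_y = joint / p_y                          # p(X | Y), each row sums to 1

d, y1 = 3, 1                                       # column of X = d, row of Y = 1
print(p_x_given_y[y1, d])                          # equals 8/34
# Product rule: p(X, Y) = p(X | Y) p(Y)
print(np.allclose(p_x_given_y * p_y, joint))
```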

Product Rule

$p(X, Y) = p(X \mid Y)\, p(Y)$

[Figure: conditional distribution p(X | Y = 1)]

Sum Rule and Product Rule

Sum Rule

$$p(X) = \sum_Y p(X, Y)$$

Product Rule

$$p(X, Y) = p(X \mid Y)\, p(Y)$$

Why not use Fractions?

Why not use pairs of numbers (s, t) such that p(X, Y) = s/t (e.g. s = 8, t = 60)?
Why not use pairs of numbers (a, c) instead of sin(alpha) = a/c?

[Figure: angle alpha]

Bayes Theorem

Use the product rule

$$p(X, Y) = p(X \mid Y)\, p(Y) = p(Y \mid X)\, p(X)$$

Bayes Theorem

$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$$

and

$$p(X) = \sum_Y p(X, Y) \quad \text{(sum rule)}$$
$$\phantom{p(X)} = \sum_Y p(X \mid Y)\, p(Y) \quad \text{(product rule)}$$
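Applied to the count table of the earlier example, Bayes' theorem turns p(X | Y) into p(Y | X). This is a minimal sketch assuming NumPy, with rows indexing Y and columns indexing X as an illustrative layout.

```python
import numpy as np

counts = np.array([[0, 0, 0, 1, 4, 5, 8, 6, 2],    # Y = 2
                   [3, 6, 8, 8, 5, 3, 1, 0, 0]])   # Y = 1
joint = counts / counts.sum()

p_y = joint.sum(axis=1)                            # p(Y)
p_x_given_y = joint / p_y[:, None]                 # p(X | Y)

# Denominator via sum and product rule: p(X) = sum_Y p(X | Y) p(Y)
p_x = (p_x_given_y * p_y[:, None]).sum(axis=0)

# Bayes' theorem: p(Y | X) = p(X | Y) p(Y) / p(X)
p_y_given_x = p_x_given_y * p_y[:, None] / p_x

d = 3                                              # column index of X = d
print(p_y_given_x[:, d])                           # p(Y = 2 | X = d) = 1/9, p(Y = 1 | X = d) = 8/9
```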

Probability Densities

Real-valued variable $x \in \mathbb{R}$
The probability of x falling in the interval $(x, x + \delta x)$ is given by $p(x)\, \delta x$ for infinitesimally small $\delta x$.

$$p(x \in (a, b)) = \int_a^b p(x)\, dx.$$

[Figure: density p(x) and cumulative distribution P(x)]

Constraints on p(x)

Nonnegative

$$p(x) \ge 0$$

Normalisation

$$\int_{-\infty}^{\infty} p(x)\, dx = 1.$$

[Figure: density p(x) and cumulative distribution P(x)]

Cumulative distribution function P(x)

$$P(x) = \int_{-\infty}^{x} p(z)\, dz$$

or

$$\frac{d}{dx} P(x) = p(x)$$

[Figure: density p(x) and cumulative distribution P(x)]
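The constraints on p(x) and its relation to P(x) can be checked numerically. The sketch below assumes NumPy and uses a standard Gaussian purely as an example density; the grid limits and the trapezoidal rule are choices made here, not part of the lecture.

```python
import numpy as np

def gaussian_density(x, mean=0.0, std=1.0):
    """Density of a univariate Gaussian, used here only as an example p(x)."""
    z = (x - mean) / std
    return np.exp(-0.5 * z ** 2) / (std * np.sqrt(2.0 * np.pi))

x = np.linspace(-8.0, 8.0, 16001)                  # grid wide enough to hold almost all mass
p = gaussian_density(x)
dx = x[1] - x[0]

# Trapezoidal rule for P(x) = integral of p(z) dz from the left edge up to x
cdf = np.concatenate(([0.0], np.cumsum(0.5 * (p[1:] + p[:-1]) * dx)))

print(p.min() >= 0.0)                              # nonnegativity
print(cdf[-1])                                     # close to 1 (normalisation)
print(np.max(np.abs(np.gradient(cdf, dx) - p)))    # dP/dx is close to p(x)
```

The probability of an interval (a, b) is then the difference of two entries of `cdf`.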

Multivariate Probability Density

Vector $\mathbf{x} \equiv (x_1, \dots, x_D)^T = \begin{pmatrix} x_1 \\ \vdots \\ x_D \end{pmatrix}$

Nonnegative

$$p(\mathbf{x}) \ge 0$$

Normalisation

$$\int p(\mathbf{x})\, d\mathbf{x} = 1.$$

This means

$$\int \dots \int p(\mathbf{x})\, dx_1 \dots dx_D = 1.$$

Sum and Product Rule for Probability Densities

Sum Rule

$$p(x) = \int p(x, y)\, dy$$

Product Rule

$$p(x, y) = p(y \mid x)\, p(x)$$

Expectations

Weighted average of a function f(x) under the probability distribution p(x)

$$\mathbb{E}[f] = \sum_x p(x)\, f(x) \qquad \text{discrete distribution } p(x)$$

$$\mathbb{E}[f] = \int p(x)\, f(x)\, dx \qquad \text{probability density } p(x)$$

How to approximate E[f]

Given a finite number N of points $x_n$ drawn from the probability distribution p(x).
Approximate the expectation by a finite sum:

$$\mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n)$$

How to draw points from a probability distribution p(x)?
A later lecture covers sampling.
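The finite-sum approximation can be sketched as follows, assuming NumPy and a standard Gaussian p(x) for which samples are easy to draw. The distribution, the function f(x) = x^2, the sample size, and the seed are illustrative choices.

```python
import numpy as np

def monte_carlo_expectation(f, samples):
    """Approximate E[f] by (1/N) * sum_n f(x_n) for samples x_n drawn from p(x)."""
    samples = np.asarray(samples, dtype=float)
    return float(np.mean(f(samples)))

rng = np.random.default_rng(0)                     # assumed seed
samples = rng.normal(0.0, 1.0, size=100_000)       # x_n ~ N(0, 1), chosen for illustration
estimate = monte_carlo_expectation(lambda x: x ** 2, samples)
print(estimate)                                    # the exact value is E[x^2] = 1 for N(0, 1)
```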

Expectation of a function of several variables

arbitrary function f(x, y)

$$\mathbb{E}_x[f(x, y)] = \sum_x p(x)\, f(x, y) \qquad \text{discrete distribution } p(x)$$

$$\mathbb{E}_x[f(x, y)] = \int p(x)\, f(x, y)\, dx \qquad \text{probability density } p(x)$$

Note that $\mathbb{E}_x[f(x, y)]$ is a function of y.

Conditional Expectation

arbitrary function f(x)

$$\mathbb{E}_x[f \mid y] = \sum_x p(x \mid y)\, f(x) \qquad \text{discrete distribution } p(x)$$

$$\mathbb{E}_x[f \mid y] = \int p(x \mid y)\, f(x)\, dx \qquad \text{probability density } p(x)$$

Note that $\mathbb{E}_x[f \mid y]$ is a function of y.
Other notation used in the literature: $\mathbb{E}_{x \mid y}[f]$.
What is $\mathbb{E}[\mathbb{E}[f(x) \mid y]]$? Can we simplify it?
This must mean $\mathbb{E}_y[\mathbb{E}_x[f(x) \mid y]]$. (Why?)

$$\mathbb{E}_y[\mathbb{E}_x[f(x) \mid y]] = \sum_y p(y)\, \mathbb{E}_x[f \mid y] = \sum_y p(y) \sum_x p(x \mid y)\, f(x)$$
$$= \sum_{x, y} f(x)\, p(x, y) = \sum_x f(x)\, p(x)$$
$$= \mathbb{E}_x[f(x)]$$
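The identity can be checked on the count table from the probability example. This is a minimal sketch assuming NumPy; the function f, which maps the values a, ..., i of X to the numbers 1, ..., 9, is a hypothetical choice for illustration.

```python
import numpy as np

counts = np.array([[0, 0, 0, 1, 4, 5, 8, 6, 2],    # Y = 2
                   [3, 6, 8, 8, 5, 3, 1, 0, 0]])   # Y = 1
joint = counts / counts.sum()                      # p(x, y)
p_x = joint.sum(axis=0)
p_y = joint.sum(axis=1)
p_x_given_y = joint / p_y[:, None]

f = np.arange(1.0, 10.0)                           # hypothetical f(x): a -> 1, ..., i -> 9

inner = p_x_given_y @ f                            # E_x[f | y], one value per y
outer = float(p_y @ inner)                         # E_y[E_x[f | y]]
direct = float(p_x @ f)                            # E_x[f]
print(inner, outer, direct)
```

`inner` has one entry per value of y, which illustrates that the conditional expectation is a function of y, while `outer` and `direct` agree.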

Variance

arbitrary function f(x)

$$\mathrm{var}[f] = \mathbb{E}\left[ (f(x) - \mathbb{E}[f(x)])^2 \right] = \mathbb{E}\left[ f(x)^2 \right] - \mathbb{E}[f(x)]^2$$

Special case: f(x) = x

$$\mathrm{var}[x] = \mathbb{E}\left[ (x - \mathbb{E}[x])^2 \right] = \mathbb{E}\left[ x^2 \right] - \mathbb{E}[x]^2$$
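Both forms of the variance can be compared numerically. The sketch below assumes NumPy; the four-valued distribution and the function f are hypothetical examples.

```python
import numpy as np

# Hypothetical discrete distribution over four values of x
x = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.1, 0.2, 0.3, 0.4])

def expectation(values, probs):
    """E[values] = sum over outcomes of probability times value."""
    return float(np.sum(probs * values))

def f(v):
    return v ** 2 + 1.0                            # arbitrary example function

fx = f(x)
var_definition = expectation((fx - expectation(fx, p)) ** 2, p)
var_shortcut = expectation(fx ** 2, p) - expectation(fx, p) ** 2
print(var_definition, var_shortcut)
```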

Covariance

Two random variables $x \in \mathbb{R}$ and $y \in \mathbb{R}$

$$\mathrm{cov}[x, y] = \mathbb{E}_{x,y}[(x - \mathbb{E}[x])(y - \mathbb{E}[y])] = \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\, \mathbb{E}[y]$$

With $\mathbb{E}[x] = a$ and $\mathbb{E}[y] = b$

$$\mathrm{cov}[x, y] = \mathbb{E}_{x,y}[(x - a)(y - b)]$$
$$= \mathbb{E}_{x,y}[x y] - \mathbb{E}_{x,y}[x b] - \mathbb{E}_{x,y}[a y] + \mathbb{E}_{x,y}[a b]$$
$$= \mathbb{E}_{x,y}[x y] - b \underbrace{\mathbb{E}_{x,y}[x]}_{= \mathbb{E}_x[x]} - a \underbrace{\mathbb{E}_{x,y}[y]}_{= \mathbb{E}_y[y]} + a b \underbrace{\mathbb{E}_{x,y}[1]}_{= 1}$$
$$= \mathbb{E}_{x,y}[x y] - a b - a b + a b = \mathbb{E}_{x,y}[x y] - a b$$
$$= \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\, \mathbb{E}[y]$$

Expresses how strongly x and y vary together. If x and y are independent, their covariance vanishes.
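The final formula and the statement about independence can be illustrated with two small joint distributions. This is a minimal sketch assuming NumPy; both tables and the name `covariance` are hypothetical examples.

```python
import numpy as np

def covariance(joint, x_vals, y_vals):
    """cov[x, y] = E[x y] - E[x] E[y] for a discrete joint p(x, y).

    joint[i, j] is p(x = x_vals[i], y = y_vals[j]).
    """
    e_x = joint.sum(axis=1) @ x_vals
    e_y = joint.sum(axis=0) @ y_vals
    e_xy = x_vals @ joint @ y_vals
    return float(e_xy - e_x * e_y)

x_vals = np.array([0.0, 1.0])
y_vals = np.array([0.0, 1.0])

independent = np.outer([0.3, 0.7], [0.6, 0.4])     # p(x, y) = p(x) p(y)
dependent = np.array([[0.5, 0.0],                  # x and y always take the same value
                      [0.0, 0.5]])
print(covariance(independent, x_vals, y_vals))     # vanishes up to rounding
print(covariance(dependent, x_vals, y_vals))       # 0.5 - 0.5 * 0.5 = 0.25
```

The converse does not hold in general: a vanishing covariance does not imply independence.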

Covariance for Vector Valued Variables

Two random variables $\mathbf{x} \in \mathbb{R}^D$ and $\mathbf{y} \in \mathbb{R}^D$

$$\mathrm{cov}[\mathbf{x}, \mathbf{y}] = \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ (\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{y}^T - \mathbb{E}[\mathbf{y}^T]) \right]$$
$$= \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ \mathbf{x} \mathbf{y}^T \right] - \mathbb{E}[\mathbf{x}]\, \mathbb{E}[\mathbf{y}^T]$$
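For vectors the covariance is a D x D matrix. The sketch below estimates it from samples using both forms of the definition, assuming NumPy; the synthetic data, the seed, and the name `cross_covariance` are illustrative assumptions, and expectations are replaced by sample averages.

```python
import numpy as np

def cross_covariance(x, y):
    """Sample estimate of cov[x, y] = E[x y^T] - E[x] E[y^T].

    x and y have shape (N, D): one D-dimensional sample per row.
    """
    n = x.shape[0]
    e_xyT = x.T @ y / n
    return e_xyT - np.outer(x.mean(axis=0), y.mean(axis=0))

rng = np.random.default_rng(0)                     # assumed seed
x = rng.normal(size=(1000, 3))
y = 2.0 * x + rng.normal(size=(1000, 3))           # hypothetical dependence of y on x

c = cross_covariance(x, y)
centred = (x - x.mean(axis=0)).T @ (y - y.mean(axis=0)) / x.shape[0]
print(c.shape, np.allclose(c, centred))
```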