
# Introduction to Statistical Machine Learning

Christfried Webers
© 2013 Christfried Webers, NICTA, The Australian National University

Statistical Machine Learning Group
NICTA and College of Engineering and Computer Science
The Australian National University

Canberra, February – June 2013

## Overview

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

- Introduction
- Linear Algebra
- Probability
- Linear Regression 1
- Linear Regression 2
- Linear Classification 1
- Linear Classification 2
- Neural Networks 1
- Neural Networks 2
- Kernel Methods
- Sparse Kernel Methods
- Graphical Models 1
- Graphical Models 2
- Graphical Models 3
- Mixture Models and EM 1
- Mixture Models and EM 2
- Approximate Inference
- Sampling
- Principal Component Analysis
- Sequential Data 1
- Sequential Data 2
- Combining Models
- Selected Topics
- Discussion and Summary

# Part II

- Polynomial Curve Fitting
- Probability Theory
- Probability Densities
- Expectations and Covariances

## Polynomial Curve Fitting

Some artificial data, created from the function $\sin(2\pi x)$ plus random noise, for $x \in [0, 1]$.
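A minimal sketch of generating such artificial data. The noise standard deviation (0.3) and the seed are assumptions for illustration; the slides only say "random noise".

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10
x = rng.uniform(0.0, 1.0, size=N)                           # inputs x_n in [0, 1]
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)   # noisy targets t_n
```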

## Polynomial Curve Fitting - Input Specification

$N = 10$ training points:

$$\mathbf{x} \equiv (x_1, \dots, x_N)^T, \qquad x_i \in \mathbb{R}, \quad i = 1, \dots, N$$

$$\mathbf{t} \equiv (t_1, \dots, t_N)^T, \qquad t_i \in \mathbb{R}, \quad i = 1, \dots, N$$

## Polynomial Curve Fitting - Model Specification

$M$ : order of the polynomial

$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{m=0}^{M} w_m x^m$$

- $y(x, \mathbf{w})$ is a nonlinear function of $x$
- $y(x, \mathbf{w})$ is a linear function of the unknown model parameters $\mathbf{w}$

How can we find good parameters $\mathbf{w} = (w_0, \dots, w_M)^T$?
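The model above is a plain polynomial in $x$; a short sketch of evaluating it (the function name is ours, not from the slides):

```python
def polynomial(x, w):
    """Evaluate y(x, w) = sum_{m=0}^{M} w[m] * x**m for a scalar x."""
    return sum(w_m * x**m for m, w_m in enumerate(w))
```

For example, `polynomial(2.0, [1, 2, 3])` evaluates $1 + 2\cdot 2 + 3\cdot 2^2$.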

## Learning is Improving Performance

[Figure: a training data point $(x_n, t_n)$ and the model prediction $y(x_n, \mathbf{w})$; the vertical distance between $t_n$ and $y(x_n, \mathbf{w})$ is the error.]

## Learning is Improving Performance

Performance measure: error between target and prediction of the model for the training data

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right)^2$$
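The error function above translates directly into code; a sketch using numpy's polynomial evaluation (coefficients ordered low to high, matching $w_0, \dots, w_M$):

```python
import numpy as np

def sum_of_squares_error(w, x, t):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)**2 over the training data."""
    y = np.polynomial.polynomial.polyval(x, w)   # y(x_n, w) = sum_m w_m x_n^m
    return 0.5 * np.sum((y - t) ** 2)
```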

## Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m = w_0 \qquad (M = 0)$$

[Figure: fit of the $M = 0$ polynomial (a constant) to the training data.]

## Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m = w_0 + w_1 x \qquad (M = 1)$$

[Figure: fit of the $M = 1$ polynomial (a straight line) to the training data.]

## Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m = w_0 + w_1 x + w_2 x^2 + w_3 x^3 \qquad (M = 3)$$

[Figure: fit of the $M = 3$ polynomial to the training data.]

## Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m = w_0 + w_1 x + \dots + w_8 x^8 + w_9 x^9 \qquad (M = 9)$$

Overfitting!

[Figure: fit of the $M = 9$ polynomial, passing through every training point but oscillating wildly.]

## Testing the Model

- Train the model and get $\mathbf{w}^\star$.
- Get 100 new data points.
- Root-mean-square (RMS) error:

$$E_{RMS} = \sqrt{2 E(\mathbf{w}^\star)/N}$$

[Figure: training and test $E_{RMS}$ as a function of the polynomial order $M$.]
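Since $2E(\mathbf{w}^\star)/N$ is exactly the mean squared residual, the RMS error can be sketched as:

```python
import numpy as np

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w*) / N): root of the mean squared residual,
    comparable across data sets of different size N."""
    y = np.polynomial.polynomial.polyval(x, w)
    return np.sqrt(np.mean((y - t) ** 2))
```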

## Testing the Model

|             | M = 0 | M = 1 | M = 3  | M = 9       |
|-------------|-------|-------|--------|-------------|
| $w_0^\star$ | 0.19  | 0.82  | 0.31   | 0.35        |
| $w_1^\star$ |       | -1.27 | 7.99   | 232.37      |
| $w_2^\star$ |       |       | -25.43 | -5321.83    |
| $w_3^\star$ |       |       | 17.37  | 48568.31    |
| $w_4^\star$ |       |       |        | -231639.30  |
| $w_5^\star$ |       |       |        | 640042.26   |
| $w_6^\star$ |       |       |        | -1061800.52 |
| $w_7^\star$ |       |       |        | 1042400.18  |
| $w_8^\star$ |       |       |        | -557682.99  |
| $w_9^\star$ |       |       |        | 125201.43   |

Table: Coefficients $\mathbf{w}^\star$ for polynomials of various order.

## More Data

$N = 15$

[Figure: fit to $N = 15$ data points.]

## More Data

$N = 100$

- Heuristic: have no fewer than 5 to 10 times as many data points as parameters.
- But the number of parameters is not necessarily the most appropriate measure of model complexity!
- Later: Bayesian approach.

[Figure: fit to $N = 100$ data points.]

## Regularisation

How can we constrain the growth of the coefficients $\mathbf{w}$?

Add a regularisation term to the error function:

$$\widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$$

Squared norm of the parameter vector $\mathbf{w}$:

$$\|\mathbf{w}\|^2 \equiv \mathbf{w}^T \mathbf{w} = w_0^2 + w_1^2 + \dots + w_M^2$$
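Minimising $\widetilde{E}(\mathbf{w})$ has a closed-form solution (ridge regression). A sketch, assuming a monomial design matrix $\Phi_{nm} = x_n^m$; the function name is ours:

```python
import numpy as np

def fit_ridge(x, t, M, lam):
    """Minimiser of the regularised error: w* = (Phi^T Phi + lam I)^{-1} Phi^T t,
    where Phi[n, m] = x_n ** m."""
    Phi = np.vander(x, M + 1, increasing=True)   # design matrix, columns x^0 .. x^M
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)
```

With `lam = 0` this reduces to ordinary least squares; increasing `lam` shrinks the coefficients toward zero.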

## Regularisation

$M = 9$, $\ln \lambda = -18$

[Figure: regularised fit of the $M = 9$ polynomial.]

## Regularisation

$M = 9$, $\ln \lambda = 0$

[Figure: heavily regularised fit of the $M = 9$ polynomial.]

## Regularisation

$M = 9$

[Figure: training and test $E_{RMS}$ as a function of $\ln \lambda$ (from $-35$ to $-20$).]

## Probability Theory

$p(X, Y)$

[Figure: histogram of the joint distribution $p(X, Y)$ for $Y = 1$ and $Y = 2$.]

## Probability Theory

Counts for each pair $(X, Y)$, with $X \in \{a, \dots, i\}$ and $Y \in \{1, 2\}$:

| Y \ X | a | b | c | d | e | f | g | h | i | sum |
|-------|---|---|---|---|---|---|---|---|---|-----|
| 2     | 0 | 0 | 0 | 1 | 4 | 5 | 8 | 6 | 2 | 26  |
| 1     | 3 | 6 | 8 | 8 | 5 | 3 | 1 | 0 | 0 | 34  |
| sum   | 3 | 6 | 8 | 9 | 9 | 8 | 9 | 6 | 2 | 60  |

[Figure: histogram of the joint distribution $p(X, Y)$ for $Y = 1$ and $Y = 2$.]

## Sum Rule

(Using the count table for $p(X, Y)$ above.)

$$p(X = d, Y = 1) = 8/60$$

$$p(X = d) = p(X = d, Y = 2) + p(X = d, Y = 1) = 1/60 + 8/60 = 9/60$$

In general:

$$p(X = d) = \sum_{Y} p(X = d, Y) \qquad\qquad p(X) = \sum_{Y} p(X, Y)$$
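The sum rule can be checked numerically against the count table above (the array layout is ours):

```python
import numpy as np

# Count table from the slides: rows Y=1, Y=2; columns X = a..i.
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])
p_xy = counts / counts.sum()        # joint p(X, Y), total count 60

p_x = p_xy.sum(axis=0)              # sum rule: p(X) = sum_Y p(X, Y)
p_y = p_xy.sum(axis=1)              # p(Y) = sum_X p(X, Y)

d = 3                               # column index of X = d
print(p_x[d])                       # 9/60 = 0.15
```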

## Sum Rule

$$p(X) = \sum_{Y} p(X, Y) \qquad\qquad p(Y) = \sum_{X} p(X, Y)$$

[Figure: marginal histograms $p(X)$ and $p(Y)$.]

## Product Rule

Conditional probability:

$$p(X = d \mid Y = 1) = 8/34$$

Calculate $p(Y = 1)$:

$$p(Y = 1) = \sum_{X} p(X, Y = 1) = 34/60$$

Then:

$$p(X = d, Y = 1) = p(X = d \mid Y = 1)\, p(Y = 1) = \frac{8}{34} \cdot \frac{34}{60} = \frac{8}{60}$$

In general:

$$p(X, Y) = p(X \mid Y)\, p(Y)$$

## Product Rule

$$p(X, Y) = p(X \mid Y)\, p(Y)$$

[Figure: conditional histogram $p(X \mid Y = 1)$.]

## Sum Rule and Product Rule

Sum Rule:

$$p(X) = \sum_{Y} p(X, Y)$$

Product Rule:

$$p(X, Y) = p(X \mid Y)\, p(Y)$$

## Why not use Fractions?

Why not use pairs of numbers $(s, t)$ such that $p(X, Y) = s/t$ (e.g. $s = 8$, $t = 60$)?

## Why not use Fractions?

- Why not use pairs of numbers $(s, t)$ such that $p(X, Y) = s/t$ (e.g. $s = 8$, $t = 60$)?
- For the same reason we do not use pairs of numbers $(a, c)$ instead of $\sin(\alpha) = a/c$.

[Figure: right triangle with angle $\alpha$.]

## Bayes' Theorem

Use the product rule:

$$p(X, Y) = p(X \mid Y)\, p(Y) = p(Y \mid X)\, p(X)$$

Bayes' theorem:

$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$$

and

$$p(X) = \sum_{Y} p(X, Y) \qquad \text{(sum rule)}$$

$$\phantom{p(X)} = \sum_{Y} p(X \mid Y)\, p(Y) \qquad \text{(product rule)}$$
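Bayes' theorem can be verified on the count table from the earlier slides; a sketch (the array layout and variable names are ours):

```python
import numpy as np

# Joint counts: rows Y=1, Y=2; columns X = a..i.
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])
p_xy = counts / counts.sum()

d = 3                                    # column index of X = d
p_y = p_xy.sum(axis=1)                   # prior p(Y)
p_x_given_y = p_xy[:, d] / p_y           # likelihood p(X = d | Y)
p_x = p_xy[:, d].sum()                   # evidence p(X = d), via the sum rule

# Bayes' theorem: p(Y | X = d) = p(X = d | Y) p(Y) / p(X = d)
p_y_given_x = p_x_given_y * p_y / p_x
print(p_y_given_x)                       # [8/9, 1/9]
```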

## Probability Densities

Real-valued variable $x \in \mathbb{R}$.

The probability of $x$ falling in the interval $(x, x + \delta x)$ is given by $p(x)\, \delta x$ for infinitesimally small $\delta x$:

$$p(x \in (a, b)) = \int_a^b p(x)\, dx$$

[Figure: density $p(x)$ and cumulative distribution $P(x)$.]

## Constraints on p(x)

Nonnegative:

$$p(x) \ge 0$$

Normalisation:

$$\int_{-\infty}^{\infty} p(x)\, dx = 1$$

[Figure: density $p(x)$ and cumulative distribution $P(x)$.]

## Cumulative Distribution Function

$$P(x) = \int_{-\infty}^{x} p(z)\, dz \qquad \text{or equivalently} \qquad \frac{d}{dx} P(x) = p(x)$$

[Figure: density $p(x)$ and cumulative distribution $P(x)$.]

## Multivariate Probability Density

Vector $\mathbf{x} \equiv (x_1, \dots, x_D)^T$.

Nonnegative:

$$p(\mathbf{x}) \ge 0$$

Normalisation:

$$\int p(\mathbf{x})\, d\mathbf{x} = 1$$

This means

$$\int \dots \int p(x_1, \dots, x_D)\, dx_1 \dots dx_D = 1$$

## Sum and Product Rule for Probability Densities

Sum Rule:

$$p(x) = \int p(x, y)\, dy$$

Product Rule:

$$p(x, y) = p(y \mid x)\, p(x)$$

## Expectations

Weighted average of a function $f(x)$ under the probability distribution $p(x)$:

$$\mathbb{E}[f] = \sum_{x} p(x)\, f(x) \qquad \text{(discrete distribution } p(x)\text{)}$$

$$\mathbb{E}[f] = \int p(x)\, f(x)\, dx \qquad \text{(probability density } p(x)\text{)}$$

## How to approximate E[f]

Given a finite number $N$ of points $x_n$ drawn from the probability distribution $p(x)$, approximate the expectation by a finite sum:

$$\mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n)$$
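A small sketch of this sample-mean approximation: estimating $\mathbb{E}[x^2]$ under a standard normal $p(x)$, whose exact value is 1. The choice of $p$, $f$, sample size, and seed are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximate E[f] for f(x) = x^2 under p(x) = N(0, 1); exact value is 1.
N = 100_000
x = rng.standard_normal(N)      # samples x_n drawn from p(x)
estimate = np.mean(x ** 2)      # (1/N) * sum_n f(x_n)
print(estimate)                 # close to 1 for large N
```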

## Expectation of a Function of Several Variables

$$\mathbb{E}_x[f(x, y)] = \sum_{x} p(x)\, f(x, y) \qquad \text{(discrete distribution } p(x)\text{)}$$

$$\mathbb{E}_x[f(x, y)] = \int p(x)\, f(x, y)\, dx \qquad \text{(probability density } p(x)\text{)}$$

Note that $\mathbb{E}_x[f(x, y)]$ is a function of $y$.

## Conditional Expectation

For an arbitrary function $f(x)$:

$$\mathbb{E}_x[f \mid y] = \sum_{x} p(x \mid y)\, f(x) \qquad \text{(discrete distribution } p(x)\text{)}$$

$$\mathbb{E}_x[f \mid y] = \int p(x \mid y)\, f(x)\, dx$$

Note that $\mathbb{E}_x[f \mid y]$ is a function of $y$. Other notation used in the literature: $\mathbb{E}_{x \mid y}[f]$.

What is $\mathbb{E}[\mathbb{E}[f(x) \mid y]]$? Can we simplify it? This must mean $\mathbb{E}_y[\mathbb{E}_x[f(x) \mid y]]$, because the inner expectation is a function of $y$, so the outer expectation can only be over $y$:

$$\mathbb{E}_y[\mathbb{E}_x[f(x) \mid y]] = \sum_{y} p(y)\, \mathbb{E}_x[f \mid y] = \sum_{y} p(y) \sum_{x} p(x \mid y)\, f(x) = \sum_{x, y} f(x)\, p(x, y) = \sum_{x} f(x)\, p(x) = \mathbb{E}_x[f(x)]$$

## Variance

For an arbitrary function $f(x)$:

$$\operatorname{var}[f] = \mathbb{E}\left[(f(x) - \mathbb{E}[f(x)])^2\right] = \mathbb{E}\left[f(x)^2\right] - \mathbb{E}[f(x)]^2$$

Special case $f(x) = x$:

$$\operatorname{var}[x] = \mathbb{E}\left[(x - \mathbb{E}[x])^2\right] = \mathbb{E}\left[x^2\right] - \mathbb{E}[x]^2$$

## Covariance

Two random variables $x \in \mathbb{R}$ and $y \in \mathbb{R}$:

$$\operatorname{cov}[x, y] = \mathbb{E}_{x,y}\left[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\right] = \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\, \mathbb{E}[y]$$

With $\mathbb{E}[x] = a$ and $\mathbb{E}[y] = b$:

$$\begin{aligned}
\operatorname{cov}[x, y] &= \mathbb{E}_{x,y}[(x - a)(y - b)] \\
&= \mathbb{E}_{x,y}[x y] - \mathbb{E}_{x,y}[x b] - \mathbb{E}_{x,y}[a y] + \mathbb{E}_{x,y}[a b] \\
&= \mathbb{E}_{x,y}[x y] - b\, \underbrace{\mathbb{E}_{x,y}[x]}_{=\mathbb{E}_x[x]} - a\, \underbrace{\mathbb{E}_{x,y}[y]}_{=\mathbb{E}_y[y]} + a b\, \underbrace{\mathbb{E}_{x,y}[1]}_{=1} \\
&= \mathbb{E}_{x,y}[x y] - a b - a b + a b = \mathbb{E}_{x,y}[x y] - a b \\
&= \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\, \mathbb{E}[y]
\end{aligned}$$

The covariance expresses how strongly $x$ and $y$ vary together. If $x$ and $y$ are independent, their covariance vanishes.
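The two forms of the covariance in the derivation above agree exactly on any finite sample; a sketch with correlated data of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated samples: y depends on x, so cov[x, y] should be nonzero.
x = rng.standard_normal(100_000)
y = 2.0 * x + rng.standard_normal(100_000)

# cov[x, y] = E[(x - E[x])(y - E[y])] ...
lhs = np.mean((x - x.mean()) * (y - y.mean()))
# ... equals E[x y] - E[x] E[y]
rhs = np.mean(x * y) - x.mean() * y.mean()
print(lhs, rhs)                  # both close to 2 for this construction
```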

## Covariance for Vector Valued Variables

Two random variables $\mathbf{x} \in \mathbb{R}^D$ and $\mathbf{y} \in \mathbb{R}^D$:

$$\operatorname{cov}[\mathbf{x}, \mathbf{y}] = \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{y}^T - \mathbb{E}[\mathbf{y}^T])\right] = \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[\mathbf{x}\, \mathbf{y}^T\right] - \mathbb{E}[\mathbf{x}]\, \mathbb{E}[\mathbf{y}^T]$$