02 Introduction

Introduction to Statistical Machine Learning

Christfried Webers
© 2013 Christfried Webers, NICTA, The Australian National University

Statistical Machine Learning Group
NICTA and College of Engineering and Computer Science
The Australian National University, Canberra
February to June 2013
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Neural Networks 1
Neural Networks 2
Kernel Methods
Sparse Kernel Methods
Graphical Models 1
Graphical Models 2
Graphical Models 3
Mixture Models and EM 1
Mixture Models and EM 2
Approximate Inference
Sampling
Principal Component Analysis
Sequential Data 1
Sequential Data 2
Combining Models
Selected Topics
Discussion and Summary

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Part II

Introduction

Polynomial Curve Fitting
Probability Theory
Probability Densities
Expectations and Covariances

Polynomial Curve Fitting

Some artificial data created from the function
$\sin(2\pi x)$ + random noise, for $x = 0, \dots, 1$
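
A minimal sketch of how such artificial data could be generated. The slides only say "random noise", so the Gaussian noise, its standard deviation of 0.3, the evenly spaced inputs, and the helper name `make_data` are all assumptions made for illustration.

```python
import numpy as np

def make_data(n, noise_std=0.3, seed=0):
    """Return n inputs on [0, 1] and noisy targets sin(2*pi*x) + noise.

    Assumptions (not from the slides): inputs are evenly spaced and the
    noise is Gaussian with standard deviation `noise_std`.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)
    t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, noise_std, size=n)
    return x, t

x, t = make_data(10)
print(x)
print(t)
```

Fixing the seed keeps the example reproducible; a different seed gives a different noisy sample of the same underlying function.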

Polynomial Curve Fitting - Input Specification

N = 10
$\mathbf{x} \equiv (x_1, \dots, x_N)^T$
$\mathbf{t} \equiv (t_1, \dots, t_N)^T$
$x_i \in \mathbb{R}, \quad i = 1, \dots, N$
$t_i \in \mathbb{R}, \quad i = 1, \dots, N$

Polynomial Curve Fitting - Model Specification

M : order of polynomial

$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{m=0}^{M} w_m x^m$$

y(x, w) is a nonlinear function of x
y(x, w) is a linear function of the unknown model parameter w
How can we find good parameters $\mathbf{w} = (w_0, \dots, w_M)^T$?

Learning is Improving Performance

[Figure: target $t_n$ and model prediction $y(x_n, \mathbf{w})$ at input $x_n$.]

Performance measure: error between target and prediction of the model for the training data

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right)^2$$

Unique minimum of $E(\mathbf{w})$ for argument $\mathbf{w}^\star$
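
Because the model is linear in w, minimising E(w) is an ordinary least-squares problem. The sketch below is one way to compute the minimiser with NumPy; the helper names `design_matrix`, `sum_of_squares_error` and `fit_polynomial` are hypothetical, and the noise-free toy data is chosen only so that the result is easy to check.

```python
import numpy as np

def design_matrix(x, M):
    """Matrix whose columns are x**0, x**1, ..., x**M."""
    return np.vander(x, M + 1, increasing=True)

def sum_of_squares_error(w, x, t):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    y = design_matrix(x, len(w) - 1) @ w
    return 0.5 * float(np.sum((y - t) ** 2))

def fit_polynomial(x, t, M):
    """Least-squares coefficients, i.e. the w that minimises E(w)."""
    w_star, *_ = np.linalg.lstsq(design_matrix(x, M), t, rcond=None)
    return w_star

# Noise-free toy data from t = 1 + 2x (an illustrative choice).
x = np.linspace(0.0, 1.0, 10)
t = 1.0 + 2.0 * x
w_star = fit_polynomial(x, t, M=1)
print(w_star, sum_of_squares_error(w_star, x, t))
```

With distinct inputs and M + 1 <= N the design matrix has full column rank, which is what makes the minimiser unique.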

Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m$$

M = 0: $y(x, \mathbf{w}) = w_0$
M = 1: $y(x, \mathbf{w}) = w_0 + w_1 x$
M = 3: $y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$
M = 9: $y(x, \mathbf{w}) = w_0 + w_1 x + \dots + w_8 x^8 + w_9 x^9$ (overfitting)

[Figures: fitted polynomials for M = 0, M = 1, M = 3 and M = 9; vertical axis t.]

Testing the Model

Train the model and get $\mathbf{w}^\star$
Get 100 new data points
Root-mean-square (RMS) error

$$E_{RMS} = \sqrt{2 E(\mathbf{w}^\star) / N}$$

[Figure: $E_{RMS}$ on the training set and on the test set.]
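
The train/test comparison can be sketched as follows. The data generator (Gaussian noise with standard deviation 0.3, uniformly drawn test inputs) and the names `rms_error` and `noisy_sine` are assumptions; only the RMS formula and the "100 new data points" come from the slide.

```python
import numpy as np

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w) / N), i.e. the root of the mean squared residual."""
    y = np.vander(x, len(w), increasing=True) @ w
    return float(np.sqrt(np.mean((y - t) ** 2)))

def noisy_sine(x, rng, noise_std=0.3):
    """Assumed data generator: sin(2*pi*x) plus Gaussian noise."""
    return np.sin(2.0 * np.pi * x) + rng.normal(0.0, noise_std, size=x.shape)

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
t_train = noisy_sine(x_train, rng)
x_test = rng.uniform(0.0, 1.0, 100)   # 100 new data points
t_test = noisy_sine(x_test, rng)

errors = {}
for M in (0, 1, 3, 9):
    Phi = np.vander(x_train, M + 1, increasing=True)
    w_star, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)
    errors[M] = (rms_error(w_star, x_train, t_train),
                 rms_error(w_star, x_test, t_test))
    print(M, errors[M])
```

Dividing by N inside the root makes errors on the 10 training points and the 100 test points comparable on the same scale.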

Testing the Model

          M = 0     M = 1      M = 3            M = 9
w*_0       0.19      0.82       0.31             0.35
w*_1                -1.27       7.99           232.37
w*_2                          -25.43         -5321.83
w*_3                           17.37         48568.31
w*_4                                       -231639.30
w*_5                                        640042.26
w*_6                                      -1061800.52
w*_7                                       1042400.18
w*_8                                       -557682.99
w*_9                                        125201.43

Table: Coefficients w* for polynomials of various order.

More Data

[Figures: fits for N = 15 and N = 100 data points; vertical axis t.]

Heuristic: have at least 5 to 10 times as many data points as parameters.
But the number of parameters is not necessarily the most appropriate measure of model complexity!
Later: Bayesian approach.

Regularisation

How can we constrain the growth of the coefficients w?
Add a regularisation term to the error function:

$$\widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$$

Squared norm of the parameter vector w:

$$\|\mathbf{w}\|^2 \equiv \mathbf{w}^T \mathbf{w} = w_0^2 + w_1^2 + \dots + w_M^2$$
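
For the polynomial model, the regularised error is still quadratic in w. Setting its gradient to zero gives a linear system, which the sketch below solves directly. The function name `fit_regularised`, the symbol `lam` for the regularisation strength, and the noisy sine data are assumptions made for illustration.

```python
import numpy as np

def fit_regularised(x, t, M, lam):
    """Minimise 1/2 * sum_n (y(x_n, w) - t_n)^2 + lam/2 * ||w||^2.

    The gradient vanishes where (Phi^T Phi + lam * I) w = Phi^T t,
    with Phi the matrix of powers x**0, ..., x**M.
    """
    Phi = np.vander(x, M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

# Assumed toy data: noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, size=x.shape)

w_weak = fit_regularised(x, t, M=9, lam=np.exp(-18.0))   # ln(lam) = -18
w_strong = fit_regularised(x, t, M=9, lam=1.0)           # ln(lam) = 0
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```

A larger `lam` shrinks the norm of the coefficient vector, which is the constraint on coefficient growth that the regularisation term is meant to provide.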

Regularisation

[Figures: fits for M = 9 with $\ln \lambda = -18$ and with $\ln \lambda = 0$.]
[Figure: $E_{RMS}$ on the training and test sets against $\ln \lambda$, for M = 9.]

Probability Theory

Y vs. X    a   b   c   d   e   f   g   h   i   sum
Y = 2      0   0   0   1   4   5   8   6   2    26
Y = 1      3   6   8   8   5   3   1   0   0    34
sum        3   6   8   9   9   8   9   6   2    60

[Figure: joint distribution p(X, Y) for Y = 2 and Y = 1.]

Sum Rule

Using the counts in the table of X and Y (60 observations in total):

p(X = d, Y = 1) = 8/60
p(X = d) = p(X = d, Y = 2) + p(X = d, Y = 1) = 1/60 + 8/60 = 9/60

$$p(X = d) = \sum_{Y} p(X = d, Y)$$
$$p(X) = \sum_{Y} p(X, Y)$$
$$p(Y) = \sum_{X} p(X, Y)$$

[Figures: marginals p(X) and p(Y).]
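
The marginals can be reproduced from the table of counts with exact fractions. This is a small sketch; the dictionary layout and the names `counts`, `joint`, `p_x` and `p_y` are choices made here, not part of the slides.

```python
from fractions import Fraction

# Counts from the table: for each value of X, (count with Y = 2, count with Y = 1).
counts = {
    "a": (0, 3), "b": (0, 6), "c": (0, 8), "d": (1, 8), "e": (4, 5),
    "f": (5, 3), "g": (8, 1), "h": (6, 0), "i": (2, 0),
}
total = sum(n2 + n1 for n2, n1 in counts.values())  # 60 observations

# Joint distribution p(X, Y) as exact fractions.
joint = {(x, y): Fraction(n, total)
         for x, (n2, n1) in counts.items()
         for y, n in ((2, n2), (1, n1))}

# Sum rule: sum the joint over the variable that is not of interest.
p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in counts}
p_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (1, 2)}

print(p_x["d"])  # 3/20, i.e. 9/60
print(p_y[1])    # 17/30, i.e. 34/60
```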

Product Rule

Conditional probability:

p(X = d | Y = 1) = 8/34

Calculate p(Y = 1):

$$p(Y = 1) = \sum_{X} p(X, Y = 1) = 34/60$$

p(X = d, Y = 1) = p(X = d | Y = 1) p(Y = 1)

$$p(X, Y) = p(X \mid Y)\, p(Y)$$

[Figure: conditional distribution p(X | Y = 1).]

Sum Rule and Product Rule

Sum Rule

$$p(X) = \sum_{Y} p(X, Y)$$

Product Rule

$$p(X, Y) = p(X \mid Y)\, p(Y)$$

Why Not Use Fractions?

Why not use pairs of numbers (s, t) such that p(X, Y) = s/t (e.g. s = 8, t = 60)?
Why not use pairs of numbers (a, c) instead of $\sin(\alpha) = a/c$?

[Figure: angle $\alpha$.]

Bayes Theorem

Use the product rule:

$$p(X, Y) = p(X \mid Y)\, p(Y) = p(Y \mid X)\, p(X)$$

Bayes Theorem:

$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$$

and

$$p(X) = \sum_{Y} p(X, Y) \;\; \text{(sum rule)} \;\; = \sum_{Y} p(X \mid Y)\, p(Y) \;\; \text{(product rule)}$$
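
As a worked check, Bayes Theorem can be applied to the table of counts for X and Y to obtain p(Y = 1 | X = d). The function names below are hypothetical helpers; the counts are those of the table.

```python
from fractions import Fraction

# Counts from the table: for each value of X, (count with Y = 2, count with Y = 1).
counts = {
    "a": (0, 3), "b": (0, 6), "c": (0, 8), "d": (1, 8), "e": (4, 5),
    "f": (5, 3), "g": (8, 1), "h": (6, 0), "i": (2, 0),
}
total = sum(n2 + n1 for n2, n1 in counts.values())  # 60 observations

def p_joint(x, y):
    """p(X = x, Y = y) read directly from the counts."""
    return Fraction(counts[x][0 if y == 2 else 1], total)

def p_x(x):
    """Sum rule over Y."""
    return p_joint(x, 1) + p_joint(x, 2)

def p_y(y):
    """Sum rule over X."""
    return sum(p_joint(x, y) for x in counts)

def p_x_given_y(x, y):
    """Product rule rearranged: p(X | Y) = p(X, Y) / p(Y)."""
    return p_joint(x, y) / p_y(y)

def p_y_given_x(y, x):
    """Bayes Theorem: p(Y | X) = p(X | Y) p(Y) / p(X)."""
    return p_x_given_y(x, y) * p_y(y) / p_x(x)

print(p_x_given_y("d", 1))  # 4/17, i.e. 8/34
print(p_y_given_x(1, "d"))  # 8/9
```

The result 8/9 agrees with reading the column for X = d directly: 8 of its 9 observations have Y = 1.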

Probability Densities

Real-valued variable $x \in \mathbb{R}$
The probability of x falling in the interval $(x, x + \delta x)$ is given by $p(x)\,\delta x$ for infinitesimally small $\delta x$.

$$p(x \in (a, b)) = \int_a^b p(x)\, dx$$

[Figure: density p(x) and cumulative distribution function P(x).]

Constraints on p(x)

Nonnegative

$$p(x) \ge 0$$

Normalisation

$$\int_{-\infty}^{\infty} p(x)\, dx = 1$$

Cumulative distribution function P(x)

$$P(x) = \int_{-\infty}^{x} p(z)\, dz$$

or

$$\frac{d}{dx} P(x) = p(x)$$
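
These properties can be checked numerically for a concrete density. The sketch below uses a standard normal density and a simple trapezoidal rule; both are illustrative assumptions, as is truncating the infinite integration range to [-10, 10].

```python
import math

def p(x):
    """Standard normal density (an illustrative choice, not from the slides)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def integrate(f, a, b, steps=10_000):
    """Trapezoidal approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    inner = sum(f(a + k * h) for k in range(1, steps))
    return h * (0.5 * (f(a) + f(b)) + inner)

def P(x, lower=-10.0):
    """Cumulative distribution function, with -infinity truncated to `lower`."""
    return integrate(p, lower, x)

total = integrate(p, -10.0, 10.0)   # normalisation: close to 1
prob = integrate(p, -1.0, 1.0)      # probability that x lies in (-1, 1)
slope = (P(0.5 + 1e-4) - P(0.5 - 1e-4)) / 2e-4   # dP/dx at x = 0.5

print(total, prob, slope, p(0.5))
```

The finite-difference slope of P at 0.5 should be close to p(0.5), which is the relation dP/dx = p(x) in numerical form.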

Multivariate Probability Density

Vector $\mathbf{x} \equiv (x_1, \dots, x_D)^T = \begin{pmatrix} x_1 \\ \vdots \\ x_D \end{pmatrix}$

Nonnegative

$$p(\mathbf{x}) \ge 0$$

Normalisation

$$\int p(\mathbf{x})\, d\mathbf{x} = 1$$

This means

$$\int \dots \int p(\mathbf{x})\, dx_1 \dots dx_D = 1$$

Sum and Product Rule for Probability Densities

Sum Rule

$$p(x) = \int p(x, y)\, dy$$

Product Rule

$$p(x, y) = p(y \mid x)\, p(x)$$

Expectations

Weighted average of a function f(x) under the probability distribution p(x)

$$\mathbb{E}[f] = \sum_{x} p(x)\, f(x) \qquad \text{(discrete distribution } p(x)\text{)}$$

$$\mathbb{E}[f] = \int p(x)\, f(x)\, dx \qquad \text{(probability density } p(x)\text{)}$$

How to approximate E[f]

Given a finite number N of points $x_n$ drawn from the probability distribution p(x).
Approximate the expectation by a finite sum:

$$\mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n)$$

How to draw points from a probability distribution p(x)?
A later lecture covers sampling.
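
A minimal sketch of the finite-sum approximation. The choice of p(x) as a standard normal and f(x) = x^2 is an assumption made so that the exact answer, E[f] = 1, is known; the helper name `approximate_expectation` is hypothetical.

```python
import random

def approximate_expectation(f, draw, n):
    """Approximate E[f] by (1/N) * sum_n f(x_n), with each x_n produced by draw()."""
    return sum(f(draw()) for _ in range(n)) / n

rng = random.Random(0)

# Illustrative choice: p(x) is a standard normal and f(x) = x^2, so E[f] = 1.
estimate = approximate_expectation(
    lambda x: x * x,
    lambda: rng.gauss(0.0, 1.0),
    100_000,
)
print(estimate)
```

The estimate fluctuates around the true value, and the fluctuation shrinks as N grows.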

Expectation of a function of several variables

Arbitrary function f(x, y)

$$\mathbb{E}_x[f(x, y)] = \sum_{x} p(x)\, f(x, y) \qquad \text{(discrete distribution } p(x)\text{)}$$

$$\mathbb{E}_x[f(x, y)] = \int p(x)\, f(x, y)\, dx \qquad \text{(probability density } p(x)\text{)}$$

Note that $\mathbb{E}_x[f(x, y)]$ is a function of y.

Conditional Expectation

Arbitrary function f(x)

$$\mathbb{E}_x[f \mid y] = \sum_{x} p(x \mid y)\, f(x) \qquad \text{(discrete distribution)}$$

$$\mathbb{E}_x[f \mid y] = \int p(x \mid y)\, f(x)\, dx \qquad \text{(probability density)}$$

Note that $\mathbb{E}_x[f \mid y]$ is a function of y.
Other notation used in the literature: $\mathbb{E}_{x \mid y}[f]$.
What is $\mathbb{E}[\mathbb{E}[f(x) \mid y]]$? Can we simplify it?
This must mean $\mathbb{E}_y[\mathbb{E}_x[f(x) \mid y]]$. (Why?)

$$\mathbb{E}_y[\mathbb{E}_x[f(x) \mid y]] = \sum_{y} p(y)\, \mathbb{E}_x[f \mid y] = \sum_{y} p(y) \sum_{x} p(x \mid y)\, f(x) = \sum_{x, y} f(x)\, p(x, y) = \sum_{x} f(x)\, p(x) = \mathbb{E}_x[f(x)]$$
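
The simplification can be verified exactly on the table of counts for X and Y. Since X takes the values a to i, some numeric f is needed; the choice below (the position of the letter) is arbitrary and purely illustrative, as are the variable names.

```python
from fractions import Fraction

# Counts from the table: for each value of X, (count with Y = 2, count with Y = 1).
counts = {
    "a": (0, 3), "b": (0, 6), "c": (0, 8), "d": (1, 8), "e": (4, 5),
    "f": (5, 3), "g": (8, 1), "h": (6, 0), "i": (2, 0),
}
total = sum(n2 + n1 for n2, n1 in counts.values())
xs = list(counts)

def f(x):
    """Arbitrary illustrative function of X: a -> 0, b -> 1, ..., i -> 8."""
    return xs.index(x)

p_xy = {(x, y): Fraction(counts[x][i], total)
        for x in xs for i, y in enumerate((2, 1))}
p_y = {y: sum(p_xy[x, y] for x in xs) for y in (1, 2)}
p_x = {x: p_xy[x, 1] + p_xy[x, 2] for x in xs}

def cond_expectation(y):
    """E_x[f | y] = sum_x p(x | y) f(x)."""
    return sum(p_xy[x, y] / p_y[y] * f(x) for x in xs)

outer = sum(p_y[y] * cond_expectation(y) for y in (1, 2))  # E_y[E_x[f | y]]
direct = sum(p_x[x] * f(x) for x in xs)                    # E_x[f]
print(outer, direct)
```

Using exact fractions means the two quantities can be compared for equality rather than only up to rounding.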

Variance

Arbitrary function f(x)

$$\operatorname{var}[f] = \mathbb{E}\left[ (f(x) - \mathbb{E}[f(x)])^2 \right] = \mathbb{E}\left[ f(x)^2 \right] - \mathbb{E}[f(x)]^2$$

Special case: f(x) = x

$$\operatorname{var}[x] = \mathbb{E}\left[ (x - \mathbb{E}[x])^2 \right] = \mathbb{E}\left[ x^2 \right] - \mathbb{E}[x]^2$$
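
Covariance

Two random variables $x \in \mathbb{R}$ and $y \in \mathbb{R}$

$$\operatorname{cov}[x, y] = \mathbb{E}_{x,y}\left[ (x - \mathbb{E}[x])(y - \mathbb{E}[y]) \right] = \mathbb{E}_{x,y}[x\, y] - \mathbb{E}[x]\, \mathbb{E}[y]$$

With $\mathbb{E}[x] = a$ and $\mathbb{E}[y] = b$

$$\begin{aligned}
\operatorname{cov}[x, y] &= \mathbb{E}_{x,y}[(x - a)(y - b)] \\
&= \mathbb{E}_{x,y}[x\, y] - \mathbb{E}_{x,y}[x\, b] - \mathbb{E}_{x,y}[a\, y] + \mathbb{E}_{x,y}[a\, b] \\
&= \mathbb{E}_{x,y}[x\, y] - b \underbrace{\mathbb{E}_{x,y}[x]}_{= \mathbb{E}_x[x]} - a \underbrace{\mathbb{E}_{x,y}[y]}_{= \mathbb{E}_y[y]} + a\, b \underbrace{\mathbb{E}_{x,y}[1]}_{= 1} \\
&= \mathbb{E}_{x,y}[x\, y] - a\, b - a\, b + a\, b = \mathbb{E}_{x,y}[x\, y] - a\, b \\
&= \mathbb{E}_{x,y}[x\, y] - \mathbb{E}[x]\, \mathbb{E}[y]
\end{aligned}$$

Expresses how strongly x and y vary together. If x and y are independent, their covariance vanishes.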
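
Both forms of the covariance, and the statement about independence, can be checked on a small discrete example. The joint distribution below is hypothetical and chosen only for illustration, as are the helper names `expect` and `cov`.

```python
from fractions import Fraction

# Hypothetical joint distribution p(x, y) over x, y in {0, 1}.
joint = {(0, 0): Fraction(3, 8), (0, 1): Fraction(1, 8),
         (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8)}

def expect(g, p):
    """E_{x,y}[g(x, y)] under the discrete joint distribution p."""
    return sum(prob * g(x, y) for (x, y), prob in p.items())

def cov(p):
    """Return the covariance computed by the definition and by the shortcut."""
    a = expect(lambda x, y: x, p)   # E[x]
    b = expect(lambda x, y: y, p)   # E[y]
    by_definition = expect(lambda x, y: (x - a) * (y - b), p)
    by_shortcut = expect(lambda x, y: x * y, p) - a * b
    return by_definition, by_shortcut

# Independent case: the joint is the product of the two marginals.
half = Fraction(1, 2)
independent = {(x, y): half * half for x in (0, 1) for y in (0, 1)}

print(cov(joint))        # both forms give 1/8
print(cov(independent))  # both forms give 0
```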
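
Covariance for Vector Valued Variables

Two random variables $\mathbf{x} \in \mathbb{R}^D$ and $\mathbf{y} \in \mathbb{R}^D$

$$\operatorname{cov}[\mathbf{x}, \mathbf{y}] = \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ (\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{y}^T - \mathbb{E}[\mathbf{y}^T]) \right] = \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ \mathbf{x}\, \mathbf{y}^T \right] - \mathbb{E}[\mathbf{x}]\, \mathbb{E}[\mathbf{y}^T]$$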
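
A sketch of the vector-valued case with expectations replaced by sample averages, which is an assumption of this example rather than something stated on the slide. The synthetic data and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 3

# Rows are samples. Y is built from X so that the two are correlated.
X = rng.normal(size=(N, D))
Y = X @ rng.normal(size=(D, D)) + rng.normal(size=(N, D))

mean_x = X.mean(axis=0)
mean_y = Y.mean(axis=0)

# Definition: average of (x - E[x]) (y - E[y])^T over the samples.
by_definition = (X - mean_x).T @ (Y - mean_y) / N

# Shortcut: E[x y^T] - E[x] E[y^T].
by_shortcut = X.T @ Y / N - np.outer(mean_x, mean_y)

print(by_definition.shape)
print(np.allclose(by_definition, by_shortcut))
```

The result is a D x D matrix whose (i, j) entry is the covariance between component i of x and component j of y.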
