Introduction to Statistical Machine Learning

Christfried Webers
© 2010 Christfried Webers

Statistical Machine Learning Group
NICTA
and
College of Engineering and Computer Science
The Australian National University

Machine Learning Summer School
MLSS 2010, 27 September - 6 October

Overview
Linear Regression
Linear Classification
Neural Networks
Kernel Methods and SVM
Mixture Models and EM
Resources
More Machine Learning

(Figures from C. M. Bishop, "Pattern Recognition and Machine Learning" and
T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning")

Overview

What is Machine Learning?
Definition
Examples of Machine Learning
Related Fields
Fundamental Types of Learning
Basic Probability Theory
Polynomial Curve Fitting

Linear Regression

Linear Basis Function Models
Maximum Likelihood and Least Squares
Regularized Least Squares
Bayesian Regression
Example for Bayesian Regression
Predictive Distribution
Limitations of Linear Basis Function Models

Linear Classification

Classification
Generalised Linear Model
Inference and Decision
Decision Theory
Fisher's Linear Discriminant
The Perceptron Algorithm
Probabilistic Generative Models
Discrete Features
Logistic Regression
Feature Space

Neural Networks

Neural Networks
Parameter Optimisation

Kernel Methods and SVM

Kernel Methods
Maximum Margin Classifiers

Mixture Models and EM

K-means Clustering
Mixture Models and EM
Mixture of Bernoulli Distributions
EM for Gaussian Mixtures - Latent Variables
Convergence of EM

Resources

Sampling from the Uniform Distribution
Sampling from Standard Distributions
Rejection Sampling
Importance Sampling
Markov Chain Monte Carlo - The Idea

More Machine Learning

More Machine Learning

Part I

What is Machine Learning?

What is Machine Learning?

Definition
Machine learning is concerned with the design and development of algorithms that allow computers (machines) to improve their performance over time based on data.

learning from past experience (training data)
generalisation
quantify learning: improve their performance over time
need to quantify performance

Definition (Mitchell, 1998)
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Why Machine Learning?

Machine Learning is essential when
humans are unable to explain their expertise (e.g. speech recognition).
humans are not around to help (e.g. navigation on Mars, underwater robotics).
large amounts of data contain possible hidden relationships and correlations (empirical sciences, e.g. discovering unusual astronomical objects).
the environment changes (fast) over time (e.g. mobile phone networks).
solutions need to be adapted to many particular cases (e.g. junk mail).

Example: It is easier to write a program that learns to play checkers or backgammon well by self-play than to convert the expertise of a master player into a program.

Junk Mail Filtering

Given examples of data (mail), and targets {Junk, NoJunk}.
Learn to identify new incoming mail as Junk or NoJunk.
Continue to learn from the user classifying new mail.

Handwritten Digit Recognition

Given handwritten ZIP codes on letters, money amounts on cheques, etc.
Learn to correctly recognise new handwritten digits.
Nonsense input: "Don't know" is preferred to some wrong digit.

Backgammon

The world's best computer program TD-GAMMON (Tesauro 1992, 1995) played over a million games against itself.
It now plays at the level of the human world champion.

Image Denoising

(Figure: original image, image with noise added, and denoised result.)

McAuley et al., "Learning High-Order MRF Priors of Color Images", ICML 2006

Separating Audio Sources

Cocktail Party Problem (human brains may do it differently ;)

(Figure: audio sources recorded by microphones, and the resulting audio mixtures, shown as waveforms.)

Other applications of Machine Learning

autonomous robotics,
detecting credit card fraud,
detecting network intrusion,
bioinformatics,
neuroscience,
medical diagnosis,
stock market analysis,
...

Related Fields

Artificial Intelligence - AI
Statistics
Game Theory
Neuroscience, Psychology
Data Mining
Computer Science
Adaptive Control Theory
...

Fundamental Types of Learning

Unsupervised Learning
Association
Clustering
Density Estimation
Blind source separation

Supervised Learning
Regression
Classification

Reinforcement Learning
Agents

Others
Active Learning
Semi-Supervised Learning
Transductive Learning
...

Unsupervised Learning

Only input data given, no targets (labels).
Goal: Determine how the data are organised.

Unsupervised Learning - Clustering

Clustering: Group similar instances.

Example applications
Clustering customers in Customer-Relationship-Management
Image compression: color quantisation

Supervised Learning

Given pairs of data and targets (= labels).
Learn a mapping from the data to the targets (training).
Goal: Use the learned mapping to correctly predict the target for new input data.
Need to generalise well from the training data/target pairs.

Reinforcement Learning

Find suitable actions in a given environment with the goal of maximising some reward.
Example: Game playing. There is one reward at the end of the game (negative or positive).
Correct input/output pairs are never presented.
Reward might only come after many actions.
The current action may influence not only the current reward, but future rewards too.

Reinforcement Learning

(Figure: the agent-environment loop. At each step i the agent receives an observation and a reward, then chooses an action.)

Exploration versus Exploitation.
Well suited for problems with a long-term versus short-term reward trade-off.
Naturally focusing on online performance.

Basic Probability Theory

Probability is a way of expressing knowledge or belief that an event will occur or has occurred.

Example: Fair Six-Sided Die
Sample space: Ω = {1, 2, 3, 4, 5, 6}
Events: Even = {2, 4, 6}, Odd = {1, 3, 5}
Probability: P(3) = 1/6, P(Odd) = P(Even) = 1/2
Conditional Probability: P(3 | Odd) = P(3 and Odd) / P(Odd) = (1/6) / (1/2) = 1/3

General Axioms
P({}) = 0 <= P(A) <= P(Ω) = 1
P(A or B) + P(A and B) = P(A) + P(B)
P(A and B) = P(A | B) P(B)

Rules of Probability
Sum rule: P(X) = Σ_Y P(X, Y)
Product rule: P(X, Y) = P(X | Y) P(Y)
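The die example above can be checked exactly; a minimal sketch in plain Python (the event sets mirror the slide):

```python
from fractions import Fraction

# Sample space of a fair six-sided die; each outcome has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}
p = {o: Fraction(1, 6) for o in omega}

def prob(event):
    """P(event) as the sum of outcome probabilities."""
    return sum(p[o] for o in event)

odd = {1, 3, 5}
even = {2, 4, 6}

assert prob({3}) == Fraction(1, 6)
assert prob(odd) == prob(even) == Fraction(1, 2)

# Conditional probability: P(3 | Odd) = P({3} and Odd) / P(Odd) = 1/3
p_3_given_odd = prob({3} & odd) / prob(odd)
assert p_3_given_odd == Fraction(1, 3)

# Axiom check: P(A or B) + P(A and B) = P(A) + P(B)
a, b = {1, 2, 3}, {3, 4}
assert prob(a | b) + prob(a & b) == prob(a) + prob(b)
```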

Probability Jargon

(Un)fair Coin: Ω = {Tail = 0, Head = 1}. P(1) = θ ∈ [0, 1]. Observed sequence: 1101.

Likelihood: P(1101 | θ) = θ·θ·(1 − θ)·θ = θ^3 (1 − θ)
Maximum Likelihood (ML) estimate: θ_ML = arg max_θ P(1101 | θ) = 3/4
Prior: If we are indifferent, then P(θ) = const.
Evidence: P(1101) = Σ_θ P(1101 | θ) P(θ) = 1/20 (actually the integral over θ)
Posterior: P(θ | 1101) = P(1101 | θ) P(θ) / P(1101) ∝ θ^3 (1 − θ)   (Bayes Rule)
Maximum a Posteriori (MAP) estimate: θ_MAP = arg max_θ P(θ | 1101) = 3/4
Predictive Distribution: P(1 | 1101) = P(11011) / P(1101) = 2/3
Expectation: E[f | ...] = Σ_θ f(θ) P(θ | ...), e.g. E[θ | 1101] = 2/3
Variance: var(θ) = E[(θ − E[θ])^2 | 1101] = 2/63
Probability Density: P(θ) = P([θ, θ + dθ]) / dθ for dθ → 0
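With a uniform prior, the posterior after observing 1101 is a Beta(4, 2) distribution, so the quantities above can be verified with the standard Beta-distribution identities; a small sketch (plain Python):

```python
import math

def beta_fn(a, b):
    """Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

# Likelihood of the sequence 1101 is theta^3 * (1 - theta); with a uniform
# prior, the evidence is the integral of theta^3 (1 - theta) = B(4, 2).
evidence = beta_fn(4, 2)
assert math.isclose(evidence, 1 / 20)

# Posterior is Beta(4, 2); standard identities give mean, mode, and variance.
post_mean = 4 / (4 + 2)                              # E[theta | 1101]
post_mode = (4 - 1) / (4 + 2 - 2)                    # MAP estimate
post_var = (4 * 2) / ((4 + 2) ** 2 * (4 + 2 + 1))    # var(theta | 1101)

assert math.isclose(post_mean, 2 / 3)   # also the predictive P(1 | 1101)
assert math.isclose(post_mode, 3 / 4)   # equals the ML estimate here
assert math.isclose(post_var, 2 / 63)
```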

Polynomial Curve Fitting

Some artificial data created from the function sin(2πx) plus random noise, for x ∈ [0, 1].

(Figure: the noisy data points and the underlying sinusoid.)

Polynomial Curve Fitting - Training Data

N = 10
x ≡ (x1, ..., xN)
t ≡ (t1, ..., tN)
xi ∈ R, i = 1, ..., N
ti ∈ R, i = 1, ..., N

Polynomial Curve Fitting - Model Specification

M : order of the polynomial

y(x, w) = w0 + w1 x + w2 x^2 + ... + wM x^M = Σ_{m=0}^{M} wm x^m

nonlinear function of x
linear function of the unknown model parameters w
How can we find good parameters w = (w0, ..., wM)^T ?

Learning is Improving Performance

(Figure: training points (xn, tn) and the model prediction y(xn, w); the vertical bars show the differences y(xn, w) − tn.)

Performance measure: error between target and prediction of the model for the training data

E(w) = (1/2) Σ_{n=1}^{N} ( y(xn, w) − tn )^2

unique minimum of E(w) for the argument w
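Minimising E(w) for the polynomial model is a linear least-squares problem; a minimal sketch (numpy; the sin(2πx)-plus-noise data and the noise level are assumptions matching the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Artificial data: t = sin(2*pi*x) + Gaussian noise, N = 10 points.
N, M = 10, 3
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

# Design matrix with columns x^0, x^1, ..., x^M (the polynomial model).
X = np.vander(x, M + 1, increasing=True)

# w minimises E(w) = 0.5 * sum_n (y(x_n, w) - t_n)^2.
w, *_ = np.linalg.lstsq(X, t, rcond=None)

def E(w):
    return 0.5 * np.sum((X @ w - t) ** 2)

# Any perturbation of the minimiser can only increase the error.
assert E(w) <= E(w + 0.01 * rng.normal(size=M + 1))
```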

Model is Constant Function

y(x, w) = Σ_{m=0}^{0} wm x^m = w0        (M = 0)

(Figure: constant fit to the data, M = 0.)

Model is Linear Function

y(x, w) = Σ_{m=0}^{1} wm x^m = w0 + w1 x        (M = 1)

(Figure: linear fit to the data, M = 1.)

Model is Cubic Polynomial

y(x, w) = Σ_{m=0}^{3} wm x^m = w0 + w1 x + w2 x^2 + w3 x^3        (M = 3)

(Figure: cubic fit to the data, M = 3.)

Model is 9th Order Polynomial

y(x, w) = Σ_{m=0}^{9} wm x^m = w0 + w1 x + ... + w8 x^8 + w9 x^9        (M = 9)

overfitting

(Figure: 9th-order fit passing through every training point, M = 9.)

Testing the Fitted Model

Train the model and get w*.
Get 100 new data points.
Root-mean-square (RMS) error:

E_RMS = sqrt( 2 E(w*) / N )

(Figure: E_RMS on the training and test sets as a function of the model order M.)
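The train/test comparison can be sketched as follows (numpy; the data-generating function, noise level, and seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Artificial data from sin(2*pi*x) plus Gaussian noise (assumed setup)."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

def fit(x, t, M):
    """Least-squares fit of an order-M polynomial."""
    w, *_ = np.linalg.lstsq(np.vander(x, M + 1, increasing=True), t, rcond=None)
    return w

def rms(x, t, w):
    """E_RMS = sqrt(2 E(w) / N), the root-mean-square error."""
    y = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(np.mean((y - t) ** 2))

x_train, t_train = make_data(10)     # N = 10 training points
x_test, t_test = make_data(100)      # 100 new data points

ws = {M: fit(x_train, t_train, M) for M in (0, 1, 3, 9)}
train_rms = {M: rms(x_train, t_train, w) for M, w in ws.items()}
test_rms = {M: rms(x_test, t_test, w) for M, w in ws.items()}

# The training error never increases with M (the models are nested);
# the test error typically does not follow suit (overfitting at M = 9).
assert train_rms[9] <= train_rms[3] <= train_rms[0] + 1e-9
```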

Parameters of the Fitted Model

        M=0     M=1      M=3        M=9
w0      0.19    0.82     0.31       0.35
w1              -1.27    7.99       232.37
w2                       -25.43     -5321.83
w3                       17.37      48568.31
w4                                  -231639.30
w5                                  640042.26
w6                                  -1061800.52
w7                                  1042400.18
w8                                  -557682.99
w9                                  125201.43

Table: Coefficients w for polynomials of various order.

Get More Data

N = 15

(Figure: 9th-order fit with N = 15 data points.)

Get Even More Data

N = 100
Heuristic: have no fewer than 5 to 10 times as many data points as parameters.
But the number of parameters is not necessarily the most appropriate measure of model complexity!
Later: Bayesian approach

(Figure: 9th-order fit with N = 100 data points.)

Regularisation

How to constrain the growing of the coefficients w?
Add a regularisation term to the error function:

E(w) = (1/2) Σ_{n=1}^{N} ( y(xn, w) − tn )^2 + (λ/2) ||w||^2

Squared norm of the parameter vector w:

||w||^2 ≡ w^T w = w0^2 + w1^2 + ... + wM^2
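The effect of the penalty can be sketched directly (numpy; the data, the value of λ, and the seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed setup: N = 10 points from sin(2*pi*x) plus noise, M = 9 polynomial.
N, M, lam = 10, 9, 1e-6
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)
Phi = np.vander(x, M + 1, increasing=True)

# Unregularised least squares versus the minimiser of
# 0.5*||Phi w - t||^2 + 0.5*lam*||w||^2.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
w_reg = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)

# The penalty curbs the huge coefficients seen in the M = 9 column of the table.
assert np.linalg.norm(w_reg) < np.linalg.norm(w_ml)
```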

Regularisation

M = 9, ln λ = −18

(Figure: 9th-order fit with regularisation, ln λ = −18.)

Regularisation

M = 9, ln λ = 0

(Figure: 9th-order fit with regularisation, ln λ = 0.)

Regularisation

M = 9

(Figure: E_RMS on the training and test sets versus ln λ, for ln λ from −35 to −20.)

Part II

Linear Regression

Linear Basis Function Models
Maximum Likelihood and Least Squares
Regularized Least Squares
Bayesian Regression
Example for Bayesian Regression
Predictive Distribution
Limitations of Linear Basis Function Models

Linear Regression Model

input "feature" vector x = (x(0), x(1), ..., x(D))^T ∈ R^(D+1), with the convention x(0) = 1
linear regression model

y(x, w) = Σ_{j=0}^{D} wj x(j) = w^T x

model parameter w = (w0, ..., wD)^T, where w0 is the bias

(Figure: hyperplanes y = w^T x over the (X1, X2) plane for w = {(2, 1, 1), (5, 2, 1), (10, 2, 2)}.)

Linear Regression - Finding the Best Model

Use training data (x1, t1), ..., (xN, tN) and a loss function (performance measure) to find the best w.
Example: Residual sum of squares

Loss(w) = Σ_{n=1}^{N} ( tn − y(xn, w) )^2

Least squares regression:

w* = arg min_w Loss(w)

(Figure from Hastie, Tibshirani & Friedman, "The Elements of Statistical Learning": least-squares fit of a hyperplane to data over (X1, X2).)

Linear Basis Function Models

Linear combination of fixed nonlinear basis functions φj(x) ∈ R:

y(x, w) = Σ_{j=0}^{M−1} wj φj(x) = w^T φ(x)

parameter w = (w0, ..., w_{M−1})^T, where w0 is the bias parameter
basis functions φ = (φ0, ..., φ_{M−1})^T
convention: φ0(x) = 1
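Evaluating such a model is a single dot product; a minimal sketch (numpy; the particular basis functions and parameter values are assumptions):

```python
import numpy as np

def phi(x):
    """Basis function vector, here the assumed example phi(x) = (1, x, x**2)."""
    return np.array([1.0, x, x ** 2])

w = np.array([0.5, -1.0, 2.0])   # hypothetical parameters; w[0] is the bias

def y(x, w):
    # y(x, w) = w^T phi(x)
    return w @ phi(x)

# 0.5 - 1.0*2 + 2.0*4 = 6.5
assert np.isclose(y(2.0, w), 6.5)
```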

Polynomial Basis Functions

Scalar input variable x:

φj(x) = x^j

Limitation: Polynomials are global functions of the input variable x.
Extension: Split the input space into regions and fit a different polynomial to each region (spline functions).

(Figure: polynomial basis functions x^j on [−1, 1].)

Gaussian Basis Functions

Scalar input variable x:

φj(x) = exp( −(x − μj)^2 / (2 s^2) )

Not a probability distribution.
No normalisation required; it is taken care of by the model parameters w.

(Figure: Gaussian basis functions with means μj spread over the input range.)

Sigmoidal Basis Functions

Scalar input variable x:

φj(x) = σ( (x − μj) / s )

where σ(a) is the logistic sigmoid function defined by

σ(a) = 1 / (1 + exp(−a))

σ(a) is related to the hyperbolic tangent tanh(a) by tanh(a) = 2σ(2a) − 1.

(Figure: sigmoidal basis functions with means μj spread over the input range.)
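The sigmoid-tanh relation is easy to confirm numerically; a small sketch (plain Python):

```python
import math

def sigma(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

# Identity from the slide: tanh(a) = 2*sigma(2a) - 1.
for a in (-2.0, -0.5, 0.0, 1.0, 3.0):
    assert math.isclose(math.tanh(a), 2 * sigma(2 * a) - 1)
```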

Other Basis Functions - Wavelets

Wavelets: localised in both space and frequency; mutually orthogonal to simplify application.

(Figure from Hastie, Tibshirani & Friedman: Haar wavelets and Symmlet-8 wavelets at several scales over time.)

Other Basis Functions - 2D Splines

Splines: polynomials restricted to regions of the input space.

(Figure from Hastie, Tibshirani & Friedman: 2D spline basis functions.)

Maximum Likelihood and Least Squares

No special assumption about the basis functions φj(x). In the simplest case, one can think of φj(x) = x^j.
Assume the target t is given by

t = y(x, w) + ε
    (deterministic + noise)

where ε is a zero-mean Gaussian random variable with precision (inverse variance) β.
Thus

p(t | x, w, β) = N(t | y(x, w), β^{−1})

Maximum Likelihood and Least Squares

Likelihood of one target t given the data x:

p(t | x, w, β) = N(t | y(x, w), β^{−1})

Set of inputs X = {x1, ..., xN} with corresponding target values t = (t1, ..., tN)^T.
Assume the data are independent and identically distributed (i.i.d.), meaning they are drawn independently and from the same distribution. The likelihood of the target t is then

p(t | X, w, β) = Π_{n=1}^{N} N(tn | y(xn, w), β^{−1}) = Π_{n=1}^{N} N(tn | w^T φ(xn), β^{−1})

Maximum Likelihood and Least Squares

Consider the logarithm of the likelihood p(t | X, w, β) (the logarithm is a monotonic function!):

ln p(t | X, w, β) = Σ_{n=1}^{N} ln N(tn | w^T φ(xn), β^{−1})
                  = Σ_{n=1}^{N} ln [ sqrt(β/(2π)) exp( −(β/2) (tn − w^T φ(xn))^2 ) ]
                  = (N/2) ln β − (N/2) ln(2π) − β E_D(w)

where the sum-of-squares error function is

E_D(w) = (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2.

Therefore arg max_w ln p(t | X, w, β) = arg min_w E_D(w).
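The decomposition of the log-likelihood into the E_D(w) term can be checked numerically; a small sketch (numpy, with random stand-ins for the targets and model outputs):

```python
import numpy as np

rng = np.random.default_rng(3)
N, beta = 20, 4.0
t = rng.normal(size=N)
mu = rng.normal(size=N)   # stand-ins for the model outputs w^T phi(x_n)

# Direct sum of Gaussian log-densities ln N(t_n | mu_n, 1/beta).
ll_direct = np.sum(0.5 * np.log(beta) - 0.5 * np.log(2 * np.pi)
                   - 0.5 * beta * (t - mu) ** 2)

# Decomposition from the slide: N/2 ln(beta) - N/2 ln(2 pi) - beta * E_D(w).
E_D = 0.5 * np.sum((t - mu) ** 2)
ll_decomp = N / 2 * np.log(beta) - N / 2 * np.log(2 * np.pi) - beta * E_D

assert np.isclose(ll_direct, ll_decomp)
```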

Maximum Likelihood and Least Squares

Rewrite the error function:

E_D(w) = (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2 = (1/2) (t − Φw)^T (t − Φw)

where t = (t1, ..., tN)^T, and

Φ = [ φ0(x1)  φ1(x1)  ...  φ_{M−1}(x1) ]
    [ φ0(x2)  φ1(x2)  ...  φ_{M−1}(x2) ]
    [   ...     ...   ...      ...     ]
    [ φ0(xN)  φ1(xN)  ...  φ_{M−1}(xN) ]

Maximum likelihood estimate:

wML = arg max_w ln p(t | w, β) = arg min_w E_D(w) = (Φ^T Φ)^{−1} Φ^T t = Φ† t

where Φ† is the Moore-Penrose pseudo-inverse of Φ.
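The pseudo-inverse solution and the normal equations agree; a sketch (numpy; the basis choice, data, and seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed setup: N points, polynomial basis phi_j(x) = x^j, j = 0..3 (M = 4).
N, M = 30, 4
x = rng.uniform(-1, 1, N)
t = np.sin(3 * x) + rng.normal(scale=0.1, size=N)

Phi = np.vander(x, M, increasing=True)    # N x M design matrix
w_ml = np.linalg.pinv(Phi) @ t            # w_ML = pinv(Phi) @ t

# Same result from the normal equations (Phi^T Phi) w = Phi^T t.
w_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
assert np.allclose(w_ml, w_ne)
```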

Regularized Least Squares

Add regularisation in order to prevent overfitting:

E_D(w) + λ E_W(w)

with regularisation coefficient λ.
Simple quadratic regulariser:

E_W(w) = (1/2) w^T w

Maximum likelihood solution:

wML = ( λI + Φ^T Φ )^{−1} Φ^T t
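Two sanity checks on this closed form: for λ → 0 it recovers the unregularised pseudo-inverse solution, and larger λ shrinks the weights. A sketch (numpy; data and seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed setup: small polynomial design matrix and targets.
N, M = 30, 4
x = rng.uniform(-1, 1, N)
t = np.sin(3 * x) + rng.normal(scale=0.1, size=N)
Phi = np.vander(x, M, increasing=True)

def w_reg(lam):
    # (lam*I + Phi^T Phi)^(-1) Phi^T t
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# As lam -> 0 the regularised solution approaches the unregularised one.
assert np.allclose(w_reg(1e-12), np.linalg.pinv(Phi) @ t)

# A larger lam shrinks the weight vector.
assert np.linalg.norm(w_reg(10.0)) < np.linalg.norm(w_reg(0.0))
```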

Regularized Least Squares

More general regulariser:

E_W(w) = (1/2) Σ_{j=1}^{M} |wj|^q

q = 1 (lasso) leads to a sparse model if λ is large enough.

(Figure: contours of the regulariser for q = 0.5, q = 1, q = 2, q = 4.)

Comparison of Quadratic and Lasso Regulariser

Assume a sufficiently large regularisation coefficient \lambda.

Quadratic regulariser

\frac{1}{2} \sum_{j=1}^{M} w_j^2

Lasso regulariser

\frac{1}{2} \sum_{j=1}^{M} |w_j|

Figure: Contours of the error function and of the constraint region in (w_1, w_2) space for the quadratic (left) and the lasso (right) regulariser; the lasso can drive some weights exactly to zero.

60of 183

Bayesian Regression

Bayes' Theorem

posterior = (likelihood \times prior) / normalisation

p(w \mid t) = \frac{p(t \mid w)\, p(w)}{p(t)}

Likelihood for i.i.d. data

p(t \mid w) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, w), \beta^{-1}) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1}) = \mathrm{const} \times \exp\{ -\tfrac{\beta}{2} (t - \Phi w)^T (t - \Phi w) \}

where we left out the conditioning on x (always assumed), and \beta, which is assumed to be constant.

61of 183

How to choose a prior?

Can we find a prior for the given likelihood which
- makes sense for the problem at hand,
- allows us to find a posterior in a nice form?

An answer to the second question:

Definition (Conjugate Prior)
A class of prior probability distributions p(w) is conjugate to a class of likelihood functions p(x | w) if the resulting posterior distributions p(w | x) are in the same family as p(w).

62of 183

Examples of Conjugate Prior Distributions

Table: Discrete likelihood distributions

  Likelihood    Conjugate Prior
  Bernoulli     Beta
  Binomial      Beta
  Poisson       Gamma
  Multinomial   Dirichlet

Table: Continuous likelihood distributions

  Likelihood            Conjugate Prior
  Uniform               Pareto
  Exponential           Gamma
  Normal                Normal
  Multivariate normal   Multivariate normal

63of 183
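The first row of the table (Beta prior, Bernoulli likelihood) makes conjugacy concrete: the posterior is again a Beta distribution whose parameters simply accumulate the observed counts. A small illustrative sketch (the prior parameters and data are made up):

```python
import numpy as np

def beta_bernoulli_update(a, b, observations):
    """Beta(a, b) prior on the Bernoulli parameter mu; each observation is 0 or 1.
    Conjugacy: the posterior stays in the Beta family, Beta(a + #ones, b + #zeros)."""
    obs = np.asarray(observations)
    return a + obs.sum(), b + len(obs) - obs.sum()

# Prior Beta(2, 2); observe 7 ones and 3 zeros
a_post, b_post = beta_bernoulli_update(2, 2, [1, 1, 1, 0, 1, 1, 0, 1, 1, 0])
posterior_mean = a_post / (a_post + b_post)   # (2 + 7) / (2 + 2 + 10)
```

The posterior Beta(9, 5) can itself serve as the prior for the next batch of observations, which is the sequential-learning picture used on the following slides.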

Bayesian Regression

- No data point (N = 0): start with the prior p(w).
- Each posterior acts as the prior for the next data/target pair.
- Nicely fits a sequential learning framework.

p(w)  --[Bayes, likelihood p(t_1 | w, x_1)]-->  p(w | t_1, x_1)  --[Bayes, likelihood p(t_2 | w, x_2)]-->  p(w | t_1, x_1, t_2, x_2)  -->  ...

64of 183

Sequential Update of the Posterior

Example of a linear (basis function) model
- Single input x, single output t.
- Linear model y(x, w) = w_0 + w_1 x.
- Data creation:
  1. Choose an x_n from the uniform distribution U(x | -1, 1).
  2. Calculate f(x_n, a) = a_0 + a_1 x_n, where a_0 = -0.3, a_1 = 0.5.
  3. Add Gaussian noise with standard deviation \sigma = 0.2, i.e. t_n \sim \mathcal{N}(t \mid f(x_n, a), 0.04).
- Set the precision of the Gaussian prior to \alpha = 2.0.

65of 183

Sequential Update of the Posterior

Figure: likelihood, prior/posterior over w, and samples of y(x, w) as successive data points are observed.

66of 183

Sequential Update of the Posterior

Figure (continued): the posterior contracts around the true parameter values as more data points arrive.

67of 183
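The sequential update in this example can be sketched directly: with a Gaussian prior and Gaussian likelihood, each new point (x_n, t_n) updates the posterior precision and mean in closed form, and the posterior after n points is the prior for point n + 1. The sketch below follows the slide's setup (a_0 = -0.3, a_1 = 0.5, \sigma = 0.2, \alpha = 2.0); the random seed and number of points are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
a0, a1 = -0.3, 0.5                 # true parameters from the slide's example
alpha = 2.0                        # prior precision
beta = 1.0 / 0.2**2                # noise precision (sigma = 0.2)

# Start from the prior p(w) = N(w | 0, alpha^{-1} I)
S_inv = alpha * np.eye(2)          # posterior precision S_N^{-1}
b = np.zeros(2)                    # accumulates S_N^{-1} m_N

for _ in range(20):
    x = rng.uniform(-1, 1)
    t = a0 + a1 * x + rng.normal(0.0, 0.2)
    phi = np.array([1.0, x])                   # phi(x) = (1, x)^T
    # Conjugate Gaussian update: yesterday's posterior is today's prior
    S_inv = S_inv + beta * np.outer(phi, phi)
    b = b + beta * phi * t

m = np.linalg.solve(S_inv, b)      # posterior mean after all updates
```

After 20 points the posterior mean m is already close to (-0.3, 0.5), mirroring the contraction shown in the figures.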

Predictive Distribution

Definition (The Predictive Distribution)
The predictive distribution is the probability of the test target t given test data x, the training data set X and the training targets t:

p(t \mid x, X, t)

How to calculate the predictive distribution?

p(t \mid x, X, t) = \int p(t, w \mid x, X, t)\, dw    (sum rule)
= \int p(t \mid w, x, X, t)\, p(w \mid x, X, t)\, dw
= \int \underbrace{p(t \mid w, x)}_{\text{testing only}}\; \underbrace{p(w \mid X, t)}_{\text{training only}}\, dw

68of 183

Predictive Distribution - Isotropic Gaussian Prior

(Simplified) isotropic Gaussian prior

p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)

The predictive distribution p(t | x, X, t) is Gaussian; its variance after N data points have been seen is

\sigma_N^2(x) = \underbrace{\frac{1}{\beta}}_{\text{noise of data}} + \underbrace{\phi(x)^T (\alpha I + \beta \Phi^T \Phi)^{-1} \phi(x)}_{\text{uncertainty of } w}

\sigma_{N+1}^2(x) \le \sigma_N^2(x), \quad \text{and} \quad \lim_{N \to \infty} \sigma_N^2(x) = \frac{1}{\beta}

69of 183
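Both properties of the predictive variance (it shrinks as more data arrive, and never drops below the noise floor 1/\beta) can be verified numerically. The basis, query point, and parameter values below are illustrative assumptions:

```python
import numpy as np

def predictive_variance(Phi, phi_query, alpha, beta):
    """sigma_N^2(x) = 1/beta + phi(x)^T (alpha I + beta Phi^T Phi)^{-1} phi(x)."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    return 1.0 / beta + phi_query @ S_N @ phi_query

rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0
phi_q = np.array([1.0, 0.5])               # query point, phi(x) = (1, x)^T at x = 0.5

xs = rng.uniform(-1, 1, size=100)
Phi_small = np.vander(xs[:5], 2, increasing=True)   # 5 training points
Phi_large = np.vander(xs, 2, increasing=True)       # the same 5 plus 95 more

var_small = predictive_variance(Phi_small, phi_q, alpha, beta)
var_large = predictive_variance(Phi_large, phi_q, alpha, beta)
```

var_large is smaller than var_small (more data, less parameter uncertainty), yet still exceeds 1/beta, the irreducible noise of the data.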

Predictive Distribution - Isotropic Gaussian Prior

Figure: Example with artificial sinusoidal data from sin(2\pi x) (green) and added noise. Mean of the predictive distribution (red) and region of one standard deviation from the mean (red shaded).

70of 183

Samples from the Posterior Distribution

Figure: Example with artificial sinusoidal data from sin(2\pi x) (green) and added noise. Samples y(x, w) (red) from the posterior distribution p(w | X, t).

71of 183

Limitations of Linear Basis Function Models

- Basis functions \phi_j(x) are fixed before the training data set is observed.
- Curse of dimensionality: the number of basis functions grows rapidly, often exponentially, with the dimensionality D.
- But typical data sets have two nice properties which can be exploited if the basis functions are not fixed:
  - Data lie close to a nonlinear manifold with intrinsic dimension much smaller than D. We need algorithms which place basis functions only where data are (e.g. radial basis function networks, support vector machines, relevance vector machines, neural networks).
  - Target variables may only depend on a few significant directions within the data manifold. We need algorithms which can exploit this property (neural networks).

72of 183

Curse of Dimensionality

- Linear algebra allows us to operate in n-dimensional vector spaces using the intuition from our 3-dimensional world as a vector space. No surprises as long as n is finite.
- If we add more structure to a vector space (e.g. an inner product or a metric), our intuition gained from the 3-dimensional world around us may be wrong.
- Example: Sphere of radius r = 1. What is the fraction of the volume of the sphere in a D-dimensional space which lies between radius r = 1 - \epsilon and r = 1?
- Volume scales like r^D, therefore the formula for the volume of a sphere is V_D(r) = K_D r^D.

\frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D

73of 183

Curse of Dimensionality

Fraction of the volume of the sphere in a D-dimensional space which lies between radius r = 1 - \epsilon and r = 1:

\frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D

Figure: the volume fraction as a function of \epsilon for D = 1, 2, 5, 20; already for moderate D, most of the volume lies in a thin shell near the surface.

74of 183
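The shell-fraction formula is one line of code, and evaluating it for the dimensions plotted in the figure makes the effect tangible:

```python
def shell_fraction(eps, D):
    """Fraction of a unit D-ball's volume between radius 1 - eps and 1:
    (V_D(1) - V_D(1 - eps)) / V_D(1) = 1 - (1 - eps)**D."""
    return 1.0 - (1.0 - eps) ** D

# A shell of thickness 0.1 contains almost all of the volume once D is large
fractions = {D: shell_fraction(0.1, D) for D in (1, 2, 5, 20)}
```

For D = 1 the shell holds 10% of the volume; for D = 20 it already holds close to 88%.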

Curse of Dimensionality

Probability density with respect to radius r of a Gaussian distribution for various values of the dimensionality D.

Figure: p(r) for D = 1, 2, 20; with increasing D the probability mass concentrates in a thin shell away from the origin.

75of 183

Curse of Dimensionality

Probability density with respect to radius r of a Gaussian distribution for various values of the dimensionality D.

Example: D = 2; assume \mu = 0, \Sigma = I.

\mathcal{N}(x \mid 0, I) = \frac{1}{2\pi} \exp\left( -\frac{1}{2} x^T x \right) = \frac{1}{2\pi} \exp\left( -\frac{1}{2} (x_1^2 + x_2^2) \right)

Coordinate transformation

x_1 = r \cos\theta, \qquad x_2 = r \sin\theta

Probability in the new coordinates

p(r, \theta \mid 0, I) = \mathcal{N}(r(x), \theta(x) \mid 0, I)\, |J|

where |J| = r is the determinant of the Jacobian for the given coordinate transformation. Therefore

p(r, \theta \mid 0, I) = \frac{1}{2\pi}\, r \exp\left( -\frac{1}{2} r^2 \right)

76of 183

Curse of Dimensionality

Probability density with respect to radius r of a Gaussian distribution for D = 2 (and \mu = 0, \Sigma = I)

p(r, \theta \mid 0, I) = \frac{1}{2\pi}\, r \exp\left( -\frac{1}{2} r^2 \right)

Integrate over all angles

p(r \mid 0, I) = \int_0^{2\pi} \frac{1}{2\pi}\, r \exp\left( -\frac{1}{2} r^2 \right) d\theta = r \exp\left( -\frac{1}{2} r^2 \right)

Figure: p(r) for D = 1, 2, 20.

77of 183
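The derived radial density can be checked by Monte Carlo: sample from N(0, I) and look at the distribution of the radius. For D = 2 the mean radius of p(r) = r exp(-r^2/2) is sqrt(pi/2); for large D the radius concentrates near sqrt(D). The sample sizes and seed below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)

def radius_samples(D, n=200_000):
    """Radii of n samples drawn from a D-dimensional standard Gaussian N(0, I)."""
    return np.linalg.norm(rng.normal(size=(n, D)), axis=1)

# D = 2: radial density p(r) = r exp(-r^2/2) has mean sqrt(pi/2) ~ 1.253
mean_r2 = radius_samples(2).mean()
# D = 20: the mass sits in a thin shell around r ~ sqrt(D) ~ 4.47
mean_r20 = radius_samples(20).mean()
```

This is the "thin shell" picture behind the figure: in high dimensions almost no Gaussian samples lie near the mode at the origin.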

Part III

Linear Classification

- Classification
- Generalised Linear Model
- Inference and Decision
- Decision Theory
- Fisher's Linear Discriminant
- The Perceptron Algorithm
- Probabilistic Generative Models
- Discrete Features
- Logistic Regression
- Feature Space

78of 183

Classification

- Goal: Given input data x, assign it to one of K discrete classes C_k, where k = 1, \dots, K.
- Divide the input space into different regions.

Figure: data points in a two-dimensional input space, divided into class regions.

79of 183

How to represent binary class labels?

- Class labels are no longer real values as in regression, but a discrete set.
- Two classes: t \in \{0, 1\} (t = 1 represents class C_1 and t = 0 represents class C_2).
- Can interpret the value of t as the probability of class C_1, with only two values possible for the probability, 0 or 1.
- Note: other conventions to map classes into integers are possible; check the setup.

80of 183

How to represent multi-class labels?

- If there are more than two classes (K > 2), we call it a multi-class setup.
- Often used: the 1-of-K coding scheme, in which t is a vector of length K which has all values 0 except for t_j = 1, where j comes from the membership in class C_j to encode.
- Example: Given 5 classes, \{C_1, \dots, C_5\}, membership in class C_2 will be encoded as the target vector t = (0, 1, 0, 0, 0)^T.
- Note: other conventions to map multi-classes into integers are possible; check the setup.

81of 183
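The 1-of-K coding scheme is a few lines of numpy; this sketch reproduces the slide's example of class C_2 among K = 5 classes:

```python
import numpy as np

def one_of_K(labels, K):
    """Encode integer class labels 1..K as rows of a 1-of-K target matrix."""
    labels = np.asarray(labels)
    T = np.zeros((len(labels), K))
    T[np.arange(len(labels)), labels - 1] = 1.0   # classes are 1-indexed here
    return T

# Membership in classes C2, C5, C1 out of K = 5
T = one_of_K([2, 5, 1], 5)   # first row is (0, 1, 0, 0, 0)
```

Each row has exactly one 1, so row sums are one, a property used later when discussing least squares for classification.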

Linear Model

- Idea: use again a linear model as in regression: y(x, w) is a linear function of the parameters w,

y(x_n, w) = w^T \phi(x_n)

- But generally y(x_n, w) \in \mathbb{R}.
- Example: which class is y(x, w) = 0.71623?

82of 183

Generalised Linear Model

- Apply a mapping f : \mathbb{R} \to \mathbb{Z} to the linear model to get the discrete class labels.
- Generalised linear model

y(x_n, w) = f(w^T \phi(x_n))

- Activation function: f(\cdot)
- Link function: f^{-1}(\cdot)

Figure: Example of an activation function, f(z) = \mathrm{sign}(z).

83of 183

Three Models for Decision Problems

In increasing order of complexity:

- Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
1. Solve the inference problem of determining the posterior class probabilities p(C_k | x).
2. Use decision theory to assign each new x to one of the classes.

Generative Models
1. Solve the inference problem of determining the class-conditional probabilities p(x | C_k).
2. Also, infer the prior class probabilities p(C_k).
3. Use Bayes' theorem to find the posterior p(C_k | x).
4. Alternatively, model the joint distribution p(x, C_k) directly.
5. Use decision theory to assign each new x to one of the classes.

84of 183

Decision Theory - Key Ideas

Probability of a mistake

p(\text{mistake}) = p(x \in R_1, C_2) + p(x \in R_2, C_1) = \int_{R_1} p(x, C_2)\, dx + \int_{R_2} p(x, C_1)\, dx

Goal: minimise p(mistake).

Figure: joint densities p(x, C_1) and p(x, C_2) over x, with decision boundary x_0 separating the regions R_1 and R_2.

85of 183

Minimising the Expected Loss

- Not all mistakes are equally costly.
- Weight each misclassification of x to the wrong class C_j, instead of assigning it to the correct class C_k, by a factor L_{kj}.
- The expected loss is now

E[L] = \sum_k \sum_j \int_{R_j} L_{kj}\, p(x, C_k)\, dx

- Goal: minimise the expected loss E[L].

86of 183
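Minimising the expected loss amounts to choosing, for each x, the class j that minimises \sum_k L_{kj} p(C_k | x). A small sketch with a hypothetical asymmetric loss matrix (the "disease/healthy" framing and all numbers are invented for illustration):

```python
import numpy as np

# Hypothetical loss matrix L[k, j]: cost of deciding class j when the truth is k.
# Missing the "disease" class (k = 0 decided as j = 1) is 100x worse than a false alarm.
L = np.array([[0.0, 100.0],
              [1.0,   0.0]])

def decide(posterior, L):
    """Pick the class j minimising the expected loss sum_k L[k, j] * p(C_k | x)."""
    return int(np.argmin(posterior @ L))

# With p(disease | x) = 0.05, minimising the number of mistakes would say "healthy",
# but under this loss the expected costs are 0.95 vs 5.0, so "disease" (index 0) wins.
choice = decide(np.array([0.05, 0.95]), L)
```

Only when the posterior for the disease class becomes very small (below 1/100-ish) does the decision flip to "healthy", which is exactly how the loss matrix shifts the decision boundary.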

The Reject Region

- Avoid making automated decisions on difficult cases.
- Difficult cases:
  - posterior probabilities p(C_k | x) are very small
  - joint distributions p(x, C_k) have comparable values

Figure: posteriors p(C_1 | x) and p(C_2 | x); inputs fall into the reject region where neither posterior exceeds a chosen threshold.

87of 183

Least Squares for Classification

- Regression with a linear function of the model parameters and minimisation of a sum-of-squares error function resulted in a closed-form solution for the parameter values.
- Is this also possible for classification?
- Given input data x belonging to one of K classes C_k.
- Use the 1-of-K binary coding scheme.
- Each class is described by its own linear model

y_k(x) = w_k^T x + w_{k0}, \qquad k = 1, \dots, K

88of 183

Least Squares for Classification

With the conventions

\tilde{w}_k = \begin{pmatrix} w_{k0} \\ w_k \end{pmatrix} \in \mathbb{R}^{D+1}, \qquad \tilde{x} = \begin{pmatrix} 1 \\ x \end{pmatrix} \in \mathbb{R}^{D+1}, \qquad \tilde{W} = \begin{pmatrix} \tilde{w}_1 & \dots & \tilde{w}_K \end{pmatrix} \in \mathbb{R}^{(D+1) \times K}

we get for the discriminant function (vector valued)

y(x) = \tilde{W}^T \tilde{x} \in \mathbb{R}^K .

For a new input x, the class is then defined by the index of the largest value in the row vector y(x).

89of 183

Determine W

- Given a training set \{x_n, t_n\}, where n = 1, \dots, N, and t_n is the class in the 1-of-K coding scheme.
- Define a matrix T where row n corresponds to t_n^T.
- The sum-of-squares error can now be written as

E_D(\tilde{W}) = \frac{1}{2} \operatorname{tr}\left\{ (\tilde{X}\tilde{W} - T)^T (\tilde{X}\tilde{W} - T) \right\}

- The minimum of E_D(\tilde{W}) will be reached for

\tilde{W} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T T = \tilde{X}^\dagger T

where \tilde{X}^\dagger is the pseudo-inverse of \tilde{X}.

90of 183
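The closed-form solution \tilde{W} = \tilde{X}^\dagger T can be sketched directly. The two synthetic Gaussian clusters below are an illustrative assumption; the code also checks the sum-to-one property of y(x) discussed on the next slide:

```python
import numpy as np

rng = np.random.default_rng(5)

# Two well-separated 2-D classes with 1-of-K targets
X1 = rng.normal([-2, 0], 0.5, size=(50, 2))
X2 = rng.normal([ 2, 0], 0.5, size=(50, 2))
X_tilde = np.hstack([np.ones((100, 1)), np.vstack([X1, X2])])  # prepend bias input 1
T = np.zeros((100, 2))
T[:50, 0] = 1.0    # class 1 targets (1, 0)
T[50:, 1] = 1.0    # class 2 targets (0, 1)

# W_tilde = pseudo-inverse(X_tilde) @ T minimises the sum-of-squares error
W = np.linalg.pinv(X_tilde) @ T

def classify(x):
    y = np.append(1.0, x) @ W    # y(x) = W^T x_tilde
    return int(np.argmax(y))     # class = index of the largest component

preds = [classify(x) for x in np.vstack([X1, X2])]
accuracy = np.mean([p == (0 if i < 50 else 1) for i, p in enumerate(preds)])

# Components of y(x) sum to one for ANY x (but are not probabilities)
y_sum = (np.append(1.0, [0.3, -0.7]) @ W).sum()
```

On this easy data the least-squares discriminant classifies everything correctly; the sum-to-one check illustrates the linear-constraint property of the 1-of-K targets.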

Discriminant Function for Multi-Class

- The discriminant function y(x) is therefore

y(x) = \tilde{W}^T \tilde{x} = T^T (\tilde{X}^\dagger)^T \tilde{x},

where \tilde{X} is given by the training data, and \tilde{x} is the new input.
- Interesting property: if for every t_n the same linear constraint a^T t_n + b = 0 holds, then the prediction y(x) will also obey the same constraint,

a^T y(x) + b = 0.

- For the 1-of-K coding scheme, the sum of all components in t_n is one, and therefore all components of y(x) will sum to one. BUT: the components are not probabilities, as they are not constrained to the interval (0, 1).

91of 183

Deficiencies of the Least Squares Approach

Figure: Magenta curve: decision boundary for the least squares approach. (Green curve: decision boundary for the logistic regression model described later.)

92of 183

Deficiencies of the Least Squares Approach

Figure: Magenta curve: decision boundary for the least squares approach. (Green curve: decision boundary for the logistic regression model described later.)

93of 183

Fisher's Linear Discriminant

- View linear classification as dimensionality reduction:

y(x) = w^T x

- If y \ge -w_0 then class C_1, otherwise C_2.
- But there are many projections from a D-dimensional input space onto one dimension.
- Projection always means loss of information.
- For classification we want to preserve the class separation in one dimension.
- Can we find a projection which maximally preserves the class separation?

94of 183

Fisher's Linear Discriminant

Figure: Samples from two classes in a two-dimensional input space and their histograms when projected onto two different one-dimensional spaces.

95of 183

Fisher's Linear Discriminant - First Try

- Given N_1 input data of class C_1, and N_2 input data of class C_2, calculate the centres of the two classes

m_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n, \qquad m_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n

- Choose w so as to maximise the projection of the class means onto w,

m_1 - m_2 = w^T (m_1 - m_2)

- Problem with non-uniform covariance.

96of 183

Fisher's Linear Discriminant

- Measure also the within-class variance for each class

s_k^2 = \sum_{n \in C_k} (y_n - m_k)^2

where y_n = w^T x_n.
- Maximise the Fisher criterion

J(w) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}

97of 183

Fisher's Linear Discriminant

The Fisher criterion can be rewritten as

J(w) = \frac{w^T S_B w}{w^T S_W w}

- S_B is the between-class covariance

S_B = (m_2 - m_1)(m_2 - m_1)^T

- S_W is the within-class covariance

S_W = \sum_{n \in C_1} (x_n - m_1)(x_n - m_1)^T + \sum_{n \in C_2} (x_n - m_2)(x_n - m_2)^T

98of 183

Fisher's Linear Discriminant

The Fisher criterion

J(w) = \frac{w^T S_B w}{w^T S_W w}

has a maximum for Fisher's linear discriminant

w \propto S_W^{-1} (m_2 - m_1)

Fisher's linear discriminant is NOT a discriminant, but can be used to construct one by choosing a threshold y_0 in the projection space.

99of 183
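Fisher's direction and a simple threshold can be sketched in a few lines. The shared, elongated covariance below is an illustrative assumption, chosen because it is the case where the plain difference-of-means projection does poorly:

```python
import numpy as np

rng = np.random.default_rng(6)

# Two classes with a shared, strongly correlated covariance
cov = np.array([[3.0, 2.0], [2.0, 3.0]])
X1 = rng.multivariate_normal([0, 0], cov, size=200)
X2 = rng.multivariate_normal([2, -2], cov, size=200)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class covariance S_W (sum over both classes)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher's direction: w proportional to S_W^{-1} (m2 - m1)
w = np.linalg.solve(S_W, m2 - m1)

# Turn it into a discriminant: project and threshold midway between projected means
y1, y2 = X1 @ w, X2 @ w
y0 = 0.5 * (y1.mean() + y2.mean())
accuracy = 0.5 * ((y1 < y0).mean() + (y2 >= y0).mean())
```

With this covariance structure the Fisher projection separates the classes well, while projecting onto m2 - m1 directly would mix them.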

The Perceptron Algorithm

The perceptron ("MARK 1", Cornell Univ., 1960) was the first computer which could learn new skills by trial and error.

100of 183

The Perceptron Algorithm

Frank Rosenblatt (1928 - 1971), "Principles of neurodynamics: Perceptrons and the theory of brain mechanisms" (Spartan Books, 1962).

101of 183

The Perceptron Algorithm

- Two class model.
- Create the feature vector \phi(x) by a fixed nonlinear transformation of the input x.
- Generalised linear model

y(x) = f(w^T \phi(x))

with \phi(x) containing some bias element \phi_0(x) = 1.
- Nonlinear activation function

f(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0 \end{cases}

- Target coding for the perceptron

t = \begin{cases} +1, & \text{if } C_1 \\ -1, & \text{if } C_2 \end{cases}

102of 183

The Perceptron Algorithm - Error Function

- Idea: minimise the total number of misclassified patterns.
- Problem: as a function of w, this is piecewise constant and therefore the gradient is zero almost everywhere.
- Better idea: using the (-1, +1) target coding scheme, we want all patterns to satisfy w^T \phi(x_n) t_n > 0.
- Perceptron criterion: add the errors for all patterns belonging to the set of misclassified patterns M

E_P(w) = - \sum_{n \in M} w^T \phi(x_n) t_n

103of 183

Perceptron - Stochastic Gradient Descent

Perceptron criterion (with notation \phi_n = \phi(x_n))

E_P(w) = - \sum_{n \in M} w^T \phi_n t_n

One iteration at step \tau:
1. Choose a training pair (x_n, t_n).
2. Update the weight vector w by

w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_P(w) = w^{(\tau)} + \eta\, \phi_n t_n

As y(x, w) does not depend on the norm of w, one can set \eta = 1:

w^{(\tau+1)} = w^{(\tau)} + \phi_n t_n

104of 183

The Perceptron Algorithm - Update 1

Update of the perceptron weights from a misclassified pattern (green):

w^{(\tau+1)} = w^{(\tau)} + \phi_n t_n

Figure: decision boundary before and after one perceptron update.

105of 183

The Perceptron Algorithm - Update 2

Update of the perceptron weights from a misclassified pattern (green):

w^{(\tau+1)} = w^{(\tau)} + \phi_n t_n

Figure: decision boundary before and after a further perceptron update.

106of 183

The Perceptron Algorithm - Convergence

- Does the algorithm converge?
- For a single update step

-w^{(\tau+1)T} \phi_n t_n = -w^{(\tau)T} \phi_n t_n - (\phi_n t_n)^T \phi_n t_n < -w^{(\tau)T} \phi_n t_n

because (\phi_n t_n)^T \phi_n t_n = \|\phi_n t_n\|^2 > 0.
- BUT: contributions to the error from the other misclassified patterns might have increased.
- AND: some correctly classified patterns might now be misclassified.
- Perceptron Convergence Theorem: if the training set is linearly separable, the perceptron algorithm is guaranteed to find a solution in a finite number of steps.

107of 183
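The full training loop, cycling over patterns and applying the update w <- w + phi_n t_n whenever w^T phi_n t_n <= 0, fits in a few lines. The linearly separable clusters below are an illustrative assumption, so the convergence theorem guarantees termination:

```python
import numpy as np

rng = np.random.default_rng(7)

# Linearly separable data, targets coded +1 / -1
X1 = rng.normal([ 2,  2], 0.5, size=(30, 2))
X2 = rng.normal([-2, -2], 0.5, size=(30, 2))
Phi = np.hstack([np.ones((60, 1)), np.vstack([X1, X2])])  # phi_0(x) = 1 bias element
t = np.concatenate([np.ones(30), -np.ones(30)])

w = np.zeros(3)
for _ in range(100):                        # epoch cap; separable data stops early
    misclassified = False
    for phi_n, t_n in zip(Phi, t):
        if w @ phi_n * t_n <= 0:            # pattern violates w^T phi_n t_n > 0
            w = w + phi_n * t_n             # perceptron update with eta = 1
            misclassified = True
    if not misclassified:
        break                               # every pattern correctly classified

errors = int(np.sum(np.sign(Phi @ w) != t))
```

Because the loop only terminates once no pattern violates the criterion, errors is zero on exit, exactly what the convergence theorem promises for separable data.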

Three Models for Decision Problems


In increasing order of complexity:

1 Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
1 Solve the inference problem of determining the posterior class probabilities p(Ck | x).
2 Use decision theory to assign each new x to one of the classes.

Generative Models
1 Solve the inference problem of determining the class-conditional probabilities p(x | Ck).
2 Also, infer the prior class probabilities p(Ck).
3 Use Bayes' theorem to find the posterior p(Ck | x).
4 Alternatively, model the joint distribution p(x, Ck) directly.
5 Use decision theory to assign each new x to one of the classes.


Probabilistic Generative Models


Generative approach: model class-conditional densities p(x | Ck) and priors p(Ck) to calculate the posterior probability for class C1:

    p(C1 | x) = p(x | C1) p(C1) / ( p(x | C1) p(C1) + p(x | C2) p(C2) )
              = 1 / (1 + exp(−a(x))) = σ(a(x))

where a and the logistic sigmoid function σ(a) are given by

    a(x) = ln [ p(x | C1) p(C1) / ( p(x | C2) p(C2) ) ] = ln [ p(x, C1) / p(x, C2) ]
    σ(a) = 1 / (1 + exp(−a)).


Logistic Sigmoid
The logistic sigmoid function σ(a) = 1 / (1 + exp(−a))

"Squashing function", because it maps the real axis into the finite interval (0, 1).
σ(−a) = 1 − σ(a)
Derivative: (d/da) σ(a) = σ(a) σ(−a) = σ(a) (1 − σ(a))
The inverse is called the logit function: a(σ) = ln( σ / (1 − σ) )
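These identities can be checked numerically; a minimal sketch (the test point a = 0.7 is an arbitrary illustrative choice):

```python
import math

def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def logit(s):
    """Inverse of the sigmoid: a(sigma) = ln(sigma / (1 - sigma))."""
    return math.log(s / (1.0 - s))

a = 0.7
s = sigmoid(a)
assert 0.0 < s < 1.0                          # maps into (0, 1)
assert abs(sigmoid(-a) - (1.0 - s)) < 1e-12   # sigma(-a) = 1 - sigma(a)

# Derivative identity d sigma/da = sigma(a)(1 - sigma(a)), checked numerically.
eps = 1e-6
num_deriv = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
assert abs(num_deriv - s * (1.0 - s)) < 1e-8

assert abs(logit(s) - a) < 1e-12              # logit inverts the sigmoid
```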

[Figures: the logistic sigmoid σ(a) and the logit function a(σ)]

Probabilistic Generative Models - Multiclass

The normalised exponential is given by

    p(Ck | x) = p(x | Ck) p(Ck) / Σj p(x | Cj) p(Cj) = exp(ak) / Σj exp(aj)

where

    ak = ln( p(x | Ck) p(Ck) ).

Also called the softmax function, as it is a smoothed version of the max function.
Example: If ak ≫ aj for all j ≠ k, then p(Ck | x) ≈ 1 and p(Cj | x) ≈ 0.
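A minimal sketch of the softmax (the max-subtraction trick is a standard numerical-stability step, an addition not stated on the slide):

```python
import math

def softmax(a):
    """Normalised exponential p_k = exp(a_k) / sum_j exp(a_j).
    Subtracting max(a) leaves the result unchanged but avoids overflow."""
    m = max(a)
    exps = [math.exp(ak - m) for ak in a]
    z = sum(exps)
    return [e / z for e in exps]

p = softmax([2.0, 1.0, 0.1])
assert abs(sum(p) - 1.0) < 1e-12

# Smoothed max: if a_k >> a_j for all j != k, then p_k -> 1.
p = softmax([100.0, 1.0, 0.1])
assert p[0] > 0.999
```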


General Case - K Classes, Different Covariance


If each class-conditional probability is Gaussian and has a different covariance, the quadratic terms −½ xᵀ Σk⁻¹ x no longer cancel each other out.
We get a quadratic discriminant.


Discrete Features - Naive Bayes

Assume the input space consists of discrete features, in the simplest case xi ∈ {0, 1}.
For a D-dimensional input space, a general distribution would be represented by a table with 2^D entries.
Together with the normalisation constraint, these are 2^D − 1 independent variables.
Grows exponentially with the number of features.
The Naive Bayes assumption is that all features conditioned on the class Ck are independent of each other:

    p(x | Ck) = Π_{i=1}^{D} μki^{xi} (1 − μki)^{1−xi}


Discrete Features - Naive Bayes

With the naive Bayes assumption

    p(x | Ck) = Π_{i=1}^{D} μki^{xi} (1 − μki)^{1−xi}

we can then again find the factors ak in the normalised exponential

    p(Ck | x) = p(x | Ck) p(Ck) / Σj p(x | Cj) p(Cj) = exp(ak) / Σj exp(aj)

as a linear function of the xi:

    ak(x) = Σ_{i=1}^{D} { xi ln μki + (1 − xi) ln(1 − μki) } + ln p(Ck).
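The posterior under the naive Bayes assumption can be sketched directly from this ak(x); the two-class parameters below are illustrative, not from the slides:

```python
import numpy as np

def naive_bayes_posterior(x, mu, prior):
    """Posterior p(C_k | x) for binary features under the naive Bayes
    assumption: a_k(x) = sum_i { x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki) }
    + ln p(C_k), followed by the normalised exponential of the a_k."""
    a = (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + np.log(prior)
    a -= a.max()                       # numerical stability, result unchanged
    p = np.exp(a)
    return p / p.sum()

# Two classes, three binary features (illustrative parameters).
mu = np.array([[0.9, 0.8, 0.1],        # class 0 favours x = (1, 1, 0)
               [0.1, 0.2, 0.9]])       # class 1 favours x = (0, 0, 1)
prior = np.array([0.5, 0.5])
post = naive_bayes_posterior(np.array([1, 1, 0]), mu, prior)
```

For x = (1, 1, 0) the posterior puts most of its mass on class 0, as expected from the parameters above.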


Logistic Regression is Classification

Two classes, where the posterior of class C1 is a logistic sigmoid σ(·) acting on a linear function of the feature vector φ:

    p(C1 | φ) = y(φ) = σ(wᵀφ)
    p(C2 | φ) = 1 − p(C1 | φ)

Model dimension is equal to the dimension of the feature space, M.
Compare this to fitting two Gaussians:

    2M (means) + M(M + 1)/2 (shared covariance) = M(M + 5)/2 parameters

For larger M, the logistic regression model has a clear advantage.


Logistic Regression is Classification


Determine the parameter w via maximum likelihood for data (φn, tn), n = 1, . . . , N, where φn = φ(xn). The class membership is coded as tn ∈ {0, 1}.
Likelihood function

    p(t | w) = Π_{n=1}^{N} yn^{tn} (1 − yn)^{1−tn}

where yn = p(C1 | φn).
Error function: the negative log likelihood, resulting in the cross-entropy error function

    E(w) = −ln p(t | w) = −Σ_{n=1}^{N} { tn ln yn + (1 − tn) ln(1 − yn) }

Logistic Regression is Classification


Error function (cross-entropy error)

    E(w) = −Σ_{n=1}^{N} { tn ln yn + (1 − tn) ln(1 − yn) },    yn = p(C1 | φn) = σ(wᵀφn)

Gradient of the error function (using dσ/da = σ(1 − σ)):

    ∇E(w) = Σ_{n=1}^{N} (yn − tn) φn

The gradient does not contain any sigmoid function:
for each data point, the error is the product of the deviation yn − tn and the basis function φn.
BUT: the maximum likelihood solution can exhibit over-fitting even for many data points; one should then use a regularised error or MAP.
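The gradient above drives a simple gradient-descent fit; a minimal sketch (the toy data, step size, and iteration count are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(Phi, t, eta=0.5, steps=2000):
    """Minimise the cross-entropy error by gradient descent,
    using grad E(w) = sum_n (y_n - t_n) phi_n from the slide."""
    w = np.zeros(Phi.shape[1])
    for _ in range(steps):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t)
        w -= eta * grad / len(t)
    return w

# Toy data with a bias feature phi_0 = 1; class labels t_n in {0, 1}.
Phi = np.array([[1., 0.], [1., 1.], [1., 3.], [1., 4.]])
t = np.array([0., 0., 1., 1.])
w = fit_logistic(Phi, t)
y = sigmoid(Phi @ w)
```

Note the caveat from the slide: on separable data like this, ‖w‖ keeps growing with more iterations, which is exactly the over-fitting that regularisation or MAP would control.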

Original Input versus Feature Space


Used the direct input x until now.
All classification algorithms also work if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions φ(x).
Example: Use two Gaussian basis functions centered at the green crosses in the input space.

[Figure: data in the original input space (x1, x2) with the two Gaussian basis function centres]

Original Input versus Feature Space


Linear decision boundaries in the feature space correspond to nonlinear decision boundaries in the input space.
Classes which are NOT linearly separable in the input space can become linearly separable in the feature space.
BUT: If classes overlap in input space, they will also overlap in feature space.
Nonlinear features φ(x) cannot remove the overlap; but they may increase it!

[Figure: nonlinear decision boundary in the input space (x1, x2)]

Part IV

Neural Networks

Functional Transformations
As before, the biases can be absorbed into the weights by introducing an extra input x0 = 1 and a hidden unit z0 = 1:

    yk(x, w) = g( Σ_{j=0}^{M} w_kj^(2) h( Σ_{i=0}^{D} w_ji^(1) xi ) )

Compare to the Generalised Linear Model

    yk(x, w) = g( Σ_{j=0}^{M} w_kj^(2) φj(x) )

[Figure: two-layer network with inputs x0, ..., xD, hidden units z0, ..., zM, and outputs y1, ..., yK]
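The two-layer formula above maps directly onto a forward pass; a minimal sketch (the tanh hidden activation, linear output, and random weights are illustrative choices):

```python
import numpy as np

def forward(x, W1, W2, h=np.tanh, g=lambda a: a):
    """Two-layer network y_k = g( sum_j W2[k,j] h( sum_i W1[j,i] x_i ) ),
    with the biases absorbed as an extra input x_0 = 1 and hidden unit z_0 = 1."""
    x = np.concatenate(([1.0], x))     # x_0 = 1
    z = h(W1 @ x)                      # hidden activations z_1 ... z_M
    z = np.concatenate(([1.0], z))     # z_0 = 1
    return g(W2 @ z)                   # outputs y_1 ... y_K

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2
W1 = rng.normal(size=(M, D + 1))       # first-layer weights w_ji^(1)
W2 = rng.normal(size=(K, M + 1))       # second-layer weights w_kj^(2)
y = forward(rng.normal(size=D), W1, W2)
```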

Variable Basis Functions in a Neural Network

φ(x) = σ(w0 + w1 x1 + w2 x2) for different parameters w.

[Figure: the basis function over (x1, x2) for w = (0, 1, 0.1), w = (0, 0.1, 1), w = (0, 0.5, 0.5), and w = (10, 0.5, 0.5)]

Approximation Capabilities of Neural Networks

Neural network approximating f(x) = x².

Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (−1, 1). Red: resulting output. Dashed: output of the hidden units.

Approximation Capabilities of Neural Networks

Neural network approximating f(x) = sin(x).

Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (−1, 1). Red: resulting output. Dashed: output of the hidden units.

Approximation Capabilities of Neural Networks

Neural network approximating f(x) = |x|.

Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (−1, 1). Red: resulting output. Dashed: output of the hidden units.

Approximation Capabilities of Neural Networks

Neural network approximating the Heaviside function

    f(x) = 1 for x ≥ 0, and f(x) = 0 for x < 0.

Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (−1, 1). Red: resulting output. Dashed: output of the hidden units.

Approximation Capabilities of Neural Networks

Neural network for two-class classification:
2 inputs, 2 hidden units with tanh activation function, 1 output with logistic sigmoid activation function.

[Figure] Red: y = 0.5 decision boundary. Dashed blue: z = 0.5 hidden unit contours. Green: optimal decision boundary from the known data distribution.

Parameter Optimisation

Nonlinear mapping from input xn to output y(xn, w).
Sum-of-squares error function over all training data:

    E(w) = ½ Σ_{n=1}^{N} ‖y(xn, w) − tn‖²,

where we have N pairs of input vectors xn and target vectors tn.
Find the parameter w* which minimises E(w),

    w* = arg min_w E(w),

by gradient descent.

Error Backpropagation

Given the current errors δk, the activation function h(·), its derivative h′(·), and the output zi of the previous layer:
Error in the previous layer via the backpropagation formula

    δj = h′(aj) Σk wkj δk.

Components of the gradient ∇En are then

    ∂En(w)/∂wji = δj zi.

[Figure: backward flow of the errors δk through the weights wkj to δj]

Efficiency of Error Backpropagation

As the number of weights is usually much larger than the number of units (the network is well connected), the complexity of calculating the gradient ∂En(w)/∂wji via error backpropagation is O(W), where W is the number of weights.
Compare this to numerical differentiation using

    ∂En(w)/∂wji = ( En(wji + ε) − En(wji) ) / ε + O(ε)

or the numerically more stable (fewer round-off errors) symmetric differences

    ∂En(w)/∂wji = ( En(wji + ε) − En(wji − ε) ) / (2ε) + O(ε²),

which both need O(W²) operations.
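Backpropagation can be checked against the symmetric-difference formula above; a minimal sketch for a two-layer tanh network with linear outputs and sum-of-squares error (the network sizes and random inputs are illustrative):

```python
import numpy as np

def loss_and_grad(x, t, W1, W2):
    """Sum-of-squares error E = 0.5 ||y - t||^2 and its gradient via the
    backpropagation formula delta_j = h'(a_j) sum_k w_kj delta_k."""
    a = W1 @ x                                        # hidden pre-activations
    z = np.tanh(a)                                    # hidden outputs, h = tanh
    y = W2 @ z                                        # linear outputs
    E = 0.5 * np.sum((y - t) ** 2)
    delta_out = y - t                                 # output errors
    delta_hid = (1 - z ** 2) * (W2.T @ delta_out)     # backpropagated errors
    return E, np.outer(delta_hid, x), np.outer(delta_out, z)

rng = np.random.default_rng(1)
x, t = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
E, dW1, dW2 = loss_and_grad(x, t, W1, W2)

# Check one weight against symmetric differences (E(w+eps) - E(w-eps)) / (2 eps).
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
num = (loss_and_grad(x, t, Wp, W2)[0] - loss_and_grad(x, t, Wm, W2)[0]) / (2 * eps)
```

Each such check costs a full extra forward pass per weight, which is exactly the O(W²) cost the slide contrasts with the O(W) of backpropagation.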

Regularisation in Neural Networks

Model complexity matters again.

[Figure: examples of two-layer networks with M = 1, M = 3, and M = 10 hidden units]

Regularisation via Early Stopping

Stop training at the minimum of the validation set error.

[Figure: training set error and validation set error versus training iteration]

Part V

Kernel Methods and SVM

Kernel Methods

Keep (some of) the training data and recast prediction as a linear combination of kernel functions, which are evaluated at the kept training data points and the new test point.
Let L(t, y(x)) be any loss function and J(f) be any penalty quadratic in f. Then the minimiser of the penalised loss

    Σ_{n=1}^{N} L(tn, y(xn)) + J(f)

has the form

    f(x) = Σ_{n=1}^{N} αn k(xn, x),

with α minimising

    Σ_{n=1}^{N} L(tn, (Kα)n) + αᵀKα,

and kernel Kij = Kji = k(xi, xj).
Kernel trick, based on Mercer's theorem: any continuous, symmetric, positive semi-definite kernel function k(x, y) can be expressed as a dot product in a high-dimensional (possibly infinite dimensional) space.
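For the special case of squared loss L(t, y) = (t − y)² with penalty λ αᵀKα, the minimising α has a closed form, α = (K + λI)⁻¹ t (kernel ridge regression). A minimal sketch; the Gaussian kernel, its width, λ, and the sine target are all illustrative choices, not from the slides:

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    """k(x, x') = exp(-gamma ||x - x'||^2), elementwise over two point sets."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Squared loss plus penalty lambda * alpha^T K alpha gives the closed-form
# minimiser alpha = (K + lambda I)^{-1} t, with f(x) = sum_n alpha_n k(x_n, x).
X = np.linspace(-1, 1, 20).reshape(-1, 1)
t = np.sin(3 * X[:, 0])
lam = 1e-3
K = gaussian_kernel(X, X, gamma=10.0)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), t)
f = K @ alpha          # fitted values at the training points
```

The prediction is exactly the representer form above: a weighted sum of kernel evaluations at the kept training points.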

Maximum Margin Classifiers


Support Vector Machines choose the decision boundary which maximises the smallest distance to samples in both classes:

    w* = arg max_{w: ‖w‖=1} min_n [ tn (wᵀφ(xn)) ],    tn ∈ {−1, 1}

Linear boundary for φk(x) = x(k).

[Figure: maximum margin linear boundary with contours y = −1, y = 0, y = 1]

Maximum Margin Classifiers

Non-linear boundary for general φ(x):

    w = Σ_{n=1}^{N} αn φ(xn)

for a few αn ≠ 0 and the corresponding xn (support vectors).

    f(x) = wᵀφ(x) = Σ_{n=1}^{N} αn k(xn, x)    with k(xn, x) = φ(xn)ᵀφ(x)

Overlapping Class distributions


Introduce a slack variable ξn ≥ 0 for each data point φn:

    ξn = 0 if the data point is correctly classified and on the margin boundary or beyond,
    ξn = |tn − y(xn)| otherwise.

[Figure: margin contours y = −1, y = 0, y = 1 with slack variables: ξ = 0 on or beyond the margin, ξ < 1 inside the margin, ξ > 1 misclassified]

Overlapping Class distributions

The ν-SVM algorithm using Gaussian kernels exp(−‖x − x′‖²), with parameter 0.45, applied to a nonseparable data set in two dimensions. Support vectors are indicated by circles.

Part VI

Mixture Models and EM

K-means Clustering
Goal: Partition N features xn into K clusters, using the Euclidean distance d(xi, xj) = ‖xi − xj‖, such that each feature belongs to the cluster with the nearest mean.
Distortion measure: J(μ, cl) = Σ_{n=1}^{N} d(xn, μ_cl(xn))², where cl(xi) is the index of the cluster centre closest to xi.
Start with K arbitrary cluster centres μk.
M-step: Minimise J w.r.t. cl(xi): assign each data point xi to the closest cluster, with index cl(xi).
E-step: Minimise J w.r.t. μk: find the new μk as the mean of the points belonging to cluster k.
Iteration over the M/E-steps converges to a local minimum of J.
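The two alternating steps can be sketched in a few lines; the two-blob toy data and the random initialisation from data points are illustrative assumptions:

```python
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    """Alternate the two steps from the slide: assign each point to the
    nearest centre, then recompute each centre as the mean of its points."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]     # K arbitrary cluster centres
    for _ in range(iters):
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        cl = d.argmin(axis=1)                        # nearest-centre assignment
        mu = np.array([X[cl == k].mean(axis=0) for k in range(K)])
    J = ((X - mu[cl]) ** 2).sum()                    # distortion measure
    return mu, cl, J

# Two well-separated blobs (illustrative data).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
mu, cl, J = kmeans(X, K=2)
```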

K-means Clustering - Example

[Figure: panels (a)-(i) showing K-means iterations on a two-dimensional data set, alternating assignment and mean-update steps]

Mixture Models and EM

Mixture of Gaussians:

    p(x | π, μ, Σ) = Σ_{k=1}^{K} πk N(x | μk, Σk)

Maximise the likelihood p(x | π, μ, Σ) w.r.t. π, μ, Σ.
M-step: Minimise J w.r.t. cl(xi): assign each data point xi to the closest cluster, with index cl(xi).
E-step: Minimise J w.r.t. μk: find the new μk as the mean of the points belonging to cluster k.

[Figures: one-dimensional Gaussian mixture densities, and EM iterations L = 1, 2, 5, 20 on a two-dimensional data set]

EM for Gaussian Mixtures


Given a Gaussian mixture and data X, maximise the log likelihood w.r.t. the parameters (π, μ, Σ).

1 Initialise the means μk, covariances Σk and mixing coefficients πk. Evaluate the log likelihood function.
2 E step: Evaluate the γ(znk) using the current parameters

    γ(znk) = πk N(xn | μk, Σk) / Σ_{j=1}^{K} πj N(xn | μj, Σj)

3 M step: Re-estimate the parameters using the current γ(znk)

    μk^new = (1/Nk) Σ_{n=1}^{N} γ(znk) xn
    Σk^new = (1/Nk) Σ_{n=1}^{N} γ(znk) (xn − μk^new)(xn − μk^new)ᵀ
    πk^new = Nk / N,    with Nk = Σ_{n=1}^{N} γ(znk)

4 Evaluate the log likelihood

    ln p(X | π, μ, Σ) = Σ_{n=1}^{N} ln Σ_{k=1}^{K} πk^new N(xn | μk^new, Σk^new);

  if not converged, then go to 2.
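A minimal one-dimensional sketch of these E and M steps; the two-mode toy data, the initialisation at the data extremes, and the fixed iteration count are illustrative assumptions, not from the slides:

```python
import numpy as np

def em_gmm_1d(x, K=2, iters=100):
    """EM for a 1-D Gaussian mixture: E step computes responsibilities
    gamma(z_nk); M step re-estimates means, variances and mixing
    coefficients; repeat."""
    pi = np.full(K, 1.0 / K)
    mu = np.linspace(x.min(), x.max(), K)   # simple illustrative initialisation
    var = np.full(K, x.var())
    for _ in range(iters):
        # E step: gamma_nk proportional to pi_k N(x_n | mu_k, var_k)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        gamma = pi * dens
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the responsibilities
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / len(x)
    return pi, mu, var

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-5, 1, 300), rng.normal(5, 1, 300)])
pi, mu, var = em_gmm_1d(x)
```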

Mixture of Bernoulli Distributions

Set of D binary variables xi, i = 1, . . . , D, each governed by a Bernoulli distribution with parameter μi. Therefore

    p(x | μ) = Π_{i=1}^{D} μi^{xi} (1 − μi)^{1−xi}

Expectation and covariance:

    E[x] = μ
    cov[x] = diag{ μi (1 − μi) }

Mixture of Bernoulli Distributions

Mixture:

    p(x | μ, π) = Σ_{k=1}^{K} πk p(x | μk)

with

    p(x | μk) = Π_{i=1}^{D} μki^{xi} (1 − μki)^{1−xi}

Similar calculation as with the mixture of Gaussians:

    γ(znk) = πk p(xn | μk) / Σ_{j=1}^{K} πj p(xn | μj)
    Nk = Σ_{n=1}^{N} γ(znk)
    x̄k = (1/Nk) Σ_{n=1}^{N} γ(znk) xn
    μk = x̄k
    πk = Nk / N

EM for Mixture of Bernoulli Distributions - Digits

Examples from a digits data set, each pixel taking only binary values.

[Figure: binary digit images; the parameters μki for each component in the mixture; and the fit of a single multivariate Bernoulli distribution]

The Role of Latent Variables


EM finds the maximum likelihood solution for models with latent variables.
Two kinds of variables:
    Observed variables X
    Latent variables Z
plus model parameters θ.
The log likelihood is then

    ln p(X | θ) = ln Σ_Z p(X, Z | θ)

Difficult optimisation problem due to the log-sum.
Assume maximisation of the distribution p(X, Z | θ) over the complete data set {X, Z} is straightforward.
But we only have the incomplete data set {X} and the posterior distribution p(Z | X, θ).

EM - Key Idea

Key idea of EM: As Z is not observed, work with an averaged version Q(θ, θ_old) of the complete log-likelihood ln p(X, Z | θ), averaged over all states of Z:

    Q(θ, θ_old) = Σ_Z p(Z | X, θ_old) ln p(X, Z | θ)

EM Algorithm

1 Choose an initial setting for the parameters θ_old.
2 E step: Evaluate p(Z | X, θ_old).
3 M step: Evaluate θ_new given by

    θ_new = arg max_θ Q(θ, θ_old)

  where

    Q(θ, θ_old) = Σ_Z p(Z | X, θ_old) ln p(X, Z | θ)

4 Check for convergence of the log likelihood or the parameter values. If not yet converged, then set θ_old = θ_new and go to step 2.

EM Algorithm - Convergence

Start with the product rule for the observed variables X, the unobserved variables Z, and the parameters θ:

    ln p(X, Z | θ) = ln p(Z | X, θ) + ln p(X | θ).

Apply Σ_Z q(Z) with arbitrary q(Z) to the formula:

    Σ_Z q(Z) ln p(X, Z | θ) = Σ_Z q(Z) ln p(Z | X, θ) + ln p(X | θ).

Rewrite as

    ln p(X | θ) = Σ_Z q(Z) ln [ p(X, Z | θ) / q(Z) ] − Σ_Z q(Z) ln [ p(Z | X, θ) / q(Z) ]
                = L(q, θ) + KL(q ‖ p),

where KL(q ‖ p) is the Kullback-Leibler divergence.

Kullback-Leibler Divergence

Distance between two distributions p(y) and q(y):

    KL(q ‖ p) = Σ_y q(y) ln [ q(y) / p(y) ] = −Σ_y q(y) ln [ p(y) / q(y) ]        (discrete)
    KL(q ‖ p) = ∫ q(y) ln [ q(y) / p(y) ] dy = −∫ q(y) ln [ p(y) / q(y) ] dy      (continuous)

Properties:
    KL(q ‖ p) ≥ 0
    not symmetric: KL(q ‖ p) ≠ KL(p ‖ q)
    KL(q ‖ p) = 0 iff q = p
    invariant under parameter transformations
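The discrete form and the listed properties can be checked directly; a minimal sketch (the two example distributions are arbitrary illustrative choices):

```python
import math

def kl(q, p):
    """KL(q || p) = sum_y q(y) ln(q(y) / p(y)) for discrete distributions.
    Terms with q(y) = 0 contribute 0 by convention."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]
assert kl(q, q) == 0.0        # KL(q || p) = 0 iff q = p
assert kl(q, p) > 0.0         # KL(q || p) >= 0
assert kl(q, p) != kl(p, q)   # not symmetric
```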

EM Algorithm - Convergence

The two parts of ln p(X | θ):

    ln p(X | θ) = Σ_Z q(Z) ln [ p(X, Z | θ) / q(Z) ] − Σ_Z q(Z) ln [ p(Z | X, θ) / q(Z) ]
                = L(q, θ) + KL(q ‖ p)

[Figure: decomposition of ln p(X | θ) into the lower bound L(q, θ) and KL(q ‖ p)]

EM Algorithm - E Step

Hold θ_old fixed. Maximise the lower bound L(q, θ_old) with respect to q(Z).
L(q, θ_old) is a functional.
ln p(X | θ) does NOT depend on q(Z).
The maximum of L(q, θ_old) will occur when the Kullback-Leibler divergence vanishes.
Therefore, choose q(Z) = p(Z | X, θ_old).

[Figure: after the E step, KL(q ‖ p) = 0 and L(q, θ_old) = ln p(X | θ_old)]

EM Algorithm - M Step

Hold q(Z) = p(Z | X, θ_old) fixed. Maximise the lower bound L(q, θ) with respect to θ:

    θ_new = arg max_θ L(q, θ) = arg max_θ Σ_Z q(Z) ln p(X, Z | θ)

L(q, θ_new) > L(q, θ_old), unless the maximum is already reached.
As q(Z) = p(Z | X, θ_old) is fixed, p(Z | X, θ_new) will not be equal to q(Z), and therefore the Kullback-Leibler distance will be greater than zero (unless converged).

[Figure: after the M step, both L(q, θ_new) and KL(q ‖ p) increase, raising ln p(X | θ_new)]

EM Algorithm - Parameter View

[Figure: EM in parameter space. Red curve: incomplete data likelihood. Blue curve: after the E step. Green curve: after the M step.]

Part VII

Sampling

Sampling from the Uniform Distribution

In a computer, sampling is usually done via a pseudorandom number generator: an algorithm generating a sequence of numbers that approximates the properties of random numbers.
Example: linear congruential generators

    z^(n+1) = (a z^(n) + c) mod m

for modulus m > 0, multiplier 0 < a < m, increment 0 ≤ c < m, and seed z0.
Other classes of pseudorandom number generators:
    Lagged Fibonacci generators
    Linear feedback shift registers
    Generalised feedback shift registers
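A minimal sketch of a linear congruential generator; the particular constants below are the widely used ANSI C ones, an illustrative choice rather than anything from the slides:

```python
def lcg(seed, a=1103515245, c=12345, m=2**31):
    """Linear congruential generator z_{n+1} = (a z_n + c) mod m.
    The default constants are the classic ANSI C rand() parameters."""
    z = seed
    while True:
        z = (a * z + c) % m
        yield z

gen = lcg(seed=1)
samples = [next(gen) for _ in range(1000)]
u = [z / 2**31 for z in samples]   # scaled to approximate U(0, 1)
```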

Example: RANDU Random Number Generator

Used since the 1960s on many machines.
Defined by the recurrence

    z^(n+1) = (2^16 + 3) z^(n) mod 2^31

RANDU looks somehow ok?

Plotting (z^(n+2), z^(n+1), z^(n))ᵀ in 3D . . .

[Figure: RANDU triples in 3D, looking uniform from this viewpoint]

RANDU not really ok

Plotting (z^(n+2), z^(n+1), z^(n))ᵀ in 3D . . . and changing the viewpoint results in 15 planes.

[Figure: RANDU triples in 3D, collapsing onto 15 planes]

A Bad Generator - RANDU

Analyse the recurrence

    z^(n+1) = (2^16 + 3) z^(n) mod 2^31

Assuming every equation to be modulo 2^31, we can correlate three samples:

    z^(n+2) = (2^16 + 3)² z^(n)
            = (2^32 + 6 · 2^16 + 9) z^(n)
            = (6 (2^16 + 3) − 9) z^(n)
            = 6 z^(n+1) − 9 z^(n)

Marsaglia, George, "Random Numbers Fall Mainly in the Planes", Proc. National Academy of Sciences 61, 25-28, 1968.
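The derived correlation can be verified on actual RANDU output; a minimal sketch (seed and sample count are illustrative):

```python
def randu(seed, n):
    """RANDU: z_{n+1} = 65539 * z_n mod 2^31, where 65539 = 2^16 + 3."""
    z, out = seed, []
    for _ in range(n):
        z = (65539 * z) % 2**31
        out.append(z)
    return out

z = randu(seed=1, n=1000)
# The correlation derived above, z_{n+2} = 6 z_{n+1} - 9 z_n (mod 2^31),
# holds exactly for every triple, which is why they fall on 15 planes.
ok = all((z[i + 2] - 6 * z[i + 1] + 9 * z[i]) % 2**31 == 0
         for i in range(len(z) - 2))
```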

Sampling from Standard Distributions

Goal: Sample from p(y) which is given in analytical form.


Suppose uniformly distributed samples of z in the interval
(0, 1) are available.
Calculate the cumulative distribution function

    h(y) = ∫_{-∞}^{y} p(x) dx

Transform the samples from U(z | 0, 1) by


    y = h^{-1}(z)

to obtain samples y distributed according to p(y).


Sampling from Standard Distributions


Goal: Sample from p(y) which is given in analytical form.
If a uniformly distributed random variable z is transformed
using y = h^{-1}(z), then y will be distributed according to p(y).

[Figure: the cumulative distribution function h(y) maps uniformly
distributed samples z in (0, 1) to samples y distributed according to p(y).]

Sampling from the Exponential Distribution


Goal: Sample from the exponential distribution

    p(y) = λ e^{-λy}   for 0 ≤ y
    p(y) = 0           for y < 0

with rate parameter λ > 0.


Suppose uniformly distributed samples of z in the interval
(0, 1) are available.
Calculate the cumulative distribution function

    h(y) = ∫_0^y p(x) dx = ∫_0^y λ e^{-λx} dx = 1 - e^{-λy}


Transform the samples from U(z | 0, 1) by


    y = h^{-1}(z) = -(1/λ) ln(1 - z)

to obtain samples y distributed according to the exponential distribution.
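As a quick sanity check, the transformation can be coded in a few lines of Python (a sketch, not part of the slides; the function name is illustrative). The sample mean should approach 1/λ:

```python
import math
import random

def sample_exponential(lam, n, rng=random.Random(0)):
    """Inverse-CDF sampling: y = -(1/lam) * ln(1 - z) for z ~ U(0, 1)."""
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

samples = sample_exponential(lam=2.0, n=100_000)
mean = sum(samples) / len(samples)   # should be close to 1/lam = 0.5
```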

Sampling the Gaussian Distribution - Box-Muller


Generate pairs of uniformly distributed random numbers
z1, z2 ∈ (-1, 1) (e.g. z_i = 2z - 1 for z from U(z | 0, 1)).

Discard any pair (z1, z2) unless z1^2 + z2^2 ≤ 1. Results in a
uniform distribution inside of the unit circle, p(z1, z2) = 1/π.

Evaluate r^2 = z1^2 + z2^2 and

    y1 = z1 (-2 ln r^2 / r^2)^{1/2}
    y2 = z2 (-2 ln r^2 / r^2)^{1/2}


y1 and y2 are independent with joint distribution

    p(y1, y2) = p(z1, z2) |∂(z1, z2)/∂(y1, y2)|
              = (1/√(2π)) e^{-y1^2/2} · (1/√(2π)) e^{-y2^2/2}
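These steps translate directly into code; a Python sketch (not from the slides, names illustrative) of the polar Box-Muller method:

```python
import math
import random

def box_muller(rng=random.Random(0)):
    """Return two independent N(0, 1) samples via the polar Box-Muller method."""
    while True:
        z1 = 2.0 * rng.random() - 1.0        # uniform on (-1, 1)
        z2 = 2.0 * rng.random() - 1.0
        r2 = z1 * z1 + z2 * z2
        if 0.0 < r2 <= 1.0:                  # keep only pairs inside the unit circle
            factor = math.sqrt(-2.0 * math.log(r2) / r2)
            return z1 * factor, z2 * factor

pairs = [box_muller() for _ in range(50_000)]
ys = [y for pair in pairs for y in pair]
mean = sum(ys) / len(ys)                     # close to 0
var = sum(y * y for y in ys) / len(ys)       # close to 1
```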


Rejection Sampling
Assumption 1: Sampling directly from p(z) is difficult, but
we can evaluate p(z) up to some unknown normalisation
constant Zp:

    p(z) = (1/Zp) p̃(z)

Assumption 2: We can draw samples from a simpler
distribution q(z), and for some constant k and all z it holds that

    k q(z) ≥ p̃(z)

[Figure: the scaled proposal k q(z) envelops p̃(z); a sample z0 from q(z)
and u0 uniform on [0, k q(z0)] give a point under the envelope.]

Rejection Sampling
1. Generate a random number z0 from the distribution q(z).
2. Generate a number u0 from the uniform distribution over [0, k q(z0)].
3. If u0 > p̃(z0), then reject the pair (z0, u0).
4. The remaining pairs have uniform distribution under the curve p̃(z);
   the z values are distributed according to p(z).
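The four steps can be sketched in Python (not from the slides). Here p̃(z) = exp(-z²/2), an unnormalised Gaussian, the proposal q(z) is uniform on (-5, 5), and k is chosen so that k q(z) ≥ p̃(z) on that interval (the tiny tail mass beyond ±5 is neglected); all names are illustrative:

```python
import math
import random

rng = random.Random(0)

def p_tilde(z):
    """Unnormalised target: a standard Gaussian without its normalising constant."""
    return math.exp(-0.5 * z * z)

# Proposal q(z): uniform on (-5, 5), so q(z) = 1/10 there.
# k = 10 gives k * q(z) = 1 >= p_tilde(z) inside the interval.
k, q_density = 10.0, 1.0 / 10.0

def rejection_sample():
    while True:
        z0 = rng.uniform(-5.0, 5.0)           # step 1: draw z0 from q(z)
        u0 = rng.uniform(0.0, k * q_density)  # step 2: draw u0 uniform on [0, k q(z0)]
        if u0 <= p_tilde(z0):                 # step 3: reject if u0 > p_tilde(z0)
            return z0                         # step 4: accepted z0 ~ p(z)

samples = [rejection_sample() for _ in range(20_000)]
```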


Importance Sampling
Provides a framework to directly calculate the expectation
Ep [f (z)] with respect to some distribution p(z).
Does NOT provide p(z).
Again use a proposal distribution q(z) and draw samples z
from it.
Then

    E[f] = ∫ f(z) p(z) dz = ∫ f(z) (p(z)/q(z)) q(z) dz
         ≈ (1/L) Σ_{l=1}^{L} (p(z^{(l)}) / q(z^{(l)})) f(z^{(l)})


Importance Sampling - Unnormalised


Consider both p(z) and q(z) to be not normalised:

    p(z) = p̃(z)/Zp        q(z) = q̃(z)/Zq

It follows then that

    E[f] ≈ (Zq/Zp) (1/L) Σ_{l=1}^{L} r̃_l f(z^{(l)}),    where r̃_l = p̃(z^{(l)}) / q̃(z^{(l)})

Use the same set of samples to calculate

    Zp/Zq ≈ (1/L) Σ_{l=1}^{L} r̃_l

resulting in the formula for unnormalised distributions

    E[f] ≈ Σ_{l=1}^{L} w_l f(z^{(l)}),    where w_l = r̃_l / Σ_{m=1}^{L} r̃_m

Importance Sampling - Key Points

Try to choose sample points in the input space where the
product f(z) p(z) is large, or at least where p(z) is large.
Importance weights r_l correct the bias introduced by
sampling from the proposal distribution q(z) instead of the
wanted distribution p(z).
Success depends on how well q(z) approximates p(z).
Wherever p(z) > 0, q(z) > 0 is necessary.
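A minimal self-normalised importance sampling sketch in Python (illustrative, not from the slides): estimate E_p[z²] = 1 for an unnormalised standard Gaussian target p̃, using a wider uniform proposal q:

```python
import math
import random

rng = random.Random(0)

def p_tilde(z):
    """Unnormalised target: a standard Gaussian without its normalising constant."""
    return math.exp(-0.5 * z * z)

def f(z):
    return z * z                      # E_p[f] = 1 for a standard Gaussian

L = 100_000
zs = [rng.uniform(-5.0, 5.0) for _ in range(L)]  # samples from the proposal q
r = [p_tilde(z) for z in zs]          # r_l = p_tilde / q_tilde (q_tilde constant here)
total = sum(r)
w = [rl / total for rl in r]          # self-normalised weights w_l
estimate = sum(wl * f(zl) for wl, zl in zip(w, zs))  # close to E_p[f] = 1
```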


Markov Chain Monte Carlo

Goal : Generate samples from the distribution p(z).


Idea : Build a machine which uses the current sample to
decide which next sample to produce in such a way that
the overall distribution of the samples will be p(z) .
Current sample z^{(r)} is known. Generate a new sample z*
from a proposal distribution q(z | z^{(r)}) we know how to
sample from.
Accept or reject the new sample according to some
appropriate criterion:

    z^{(l+1)} = z*        if accepted
    z^{(l+1)} = z^{(r)}   if rejected

Proposal distribution depends on the current state.


Metropolis Algorithm

Choose a symmetric proposal distribution
q(zA | zB) = q(zB | zA).
Accept the new sample z* with probability

    A(z*, z^{(r)}) = min( 1, p̃(z*) / p̃(z^{(r)}) )

How? Choose a random number u with uniform distribution
in (0, 1). Accept the new sample if A(z*, z^{(r)}) > u.

    z^{(l+1)} = z*        if accepted
    z^{(l+1)} = z^{(r)}   if rejected

Rejection of a point leads to inclusion of the previous sample.


(Different from rejection sampling.)
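The algorithm can be sketched in a few lines of Python (target, step size, and names are illustrative, not from the slides); note that a rejected proposal still appends the previous sample to the chain:

```python
import math
import random

rng = random.Random(0)

def p_tilde(z):
    """Unnormalised target density (standard Gaussian up to a constant)."""
    return math.exp(-0.5 * z * z)

def metropolis(n_samples, step=1.0, z0=0.0):
    z, chain = z0, []
    for _ in range(n_samples):
        z_star = z + rng.gauss(0.0, step)           # symmetric Gaussian proposal
        A = min(1.0, p_tilde(z_star) / p_tilde(z))  # acceptance probability
        if rng.random() < A:                        # accept with probability A
            z = z_star
        chain.append(z)                             # rejection keeps the previous sample
    return chain

chain = metropolis(50_000)
mean = sum(chain) / len(chain)                      # roughly 0 for this target
```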


Metropolis Algorithm - Illustration


Sampling from a Gaussian Distribution (black contour
shows one standard deviation).
Proposal distribution is isotropic Gaussian with standard
deviation 0.2.
150 candidates generated; 43 rejected.

[Figure: sample path of the Metropolis algorithm over the Gaussian;
accepted steps and rejected steps marked.]



Markov Chain Monte Carlo - Metropolis-Hastings

Generalisation of the Metropolis algorithm for
nonsymmetric proposal distributions q_k.
At step τ, draw a sample z* from the distribution q_k(z | z^{(τ)}),
where k labels the set of possible transitions.
Accept with probability

    A_k(z*, z^{(τ)}) = min( 1, [p̃(z*) q_k(z^{(τ)} | z*)] / [p̃(z^{(τ)}) q_k(z* | z^{(τ)})] )


Choice of proposal distribution is critical.
Common choice: Gaussian centered on the current state.
  small variance → high acceptance rate, but slow walk
  through the state space; samples not independent
  large variance → high rejection rate
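For a nonsymmetric proposal, the acceptance probability just gains the Hastings correction factor. A small Python sketch (illustrative names, not from the slides), where q_density(a, b) evaluates q(a | b):

```python
import math

def p_tilde(z):
    """Unnormalised target (standard Gaussian up to a constant)."""
    return math.exp(-0.5 * z * z)

def accept_prob(z_star, z_tau, p_tilde, q_density):
    """Metropolis-Hastings acceptance A_k(z*, z^(tau)) for a possibly asymmetric q."""
    ratio = (p_tilde(z_star) * q_density(z_tau, z_star)) / \
            (p_tilde(z_tau) * q_density(z_star, z_tau))
    return min(1.0, ratio)

# With a symmetric proposal the q-ratio cancels and the Metropolis rule is recovered:
def q_sym(a, b, step=1.0):
    return math.exp(-0.5 * ((a - b) / step) ** 2)

A = accept_prob(0.5, 0.0, p_tilde, q_sym)
# equals min(1, p_tilde(0.5) / p_tilde(0.0)) = exp(-0.125)
```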


Part VIII

More Machine Learning


More Machine Learning


Graphical Models
Gaussian Processes
Sequential Data
Sequential Decision Theory
Learning Agents
Reinforcement Learning
Theoretical Model Selection
Additive Models, Trees, and Related Methods
Approximate (Variational) Inference
Boosting
Concept Learning
Computational Learning Theory
Genetic Algorithms
Learning Sets of Rules
Analytical Learning
...


Part IX

Resources

Journals
Books
Datasets

Journals

Journal of Machine Learning Research


Machine Learning
IEEE Transactions on Pattern Analysis and Machine
Intelligence
IEEE Transactions on Neural Networks
Neural Computation
Neural Networks
Annals of Statistics
Journal of the American Statistical Association
SIAM Journal on Applied Mathematics (SIAP)
...


Conferences

International Conference on Machine Learning (ICML)


European Conference on Machine Learning (ECML)
Neural Information Processing Systems (NIPS)
Algorithmic Learning Theory (ALT)
Computational Learning Theory (COLT)
Uncertainty in Artificial Intelligence (UAI)
International Joint Conference on Artificial Intelligence
(IJCAI)
International Conference on Artificial Neural Networks
(ICANN)
...


Books

Pattern Recognition and Machine Learning
Christopher M. Bishop

The Elements of Statistical Learning
Trevor Hastie, Robert Tibshirani, Jerome Friedman

Books

Pattern Classification
Richard O. Duda, Peter E. Hart, David G. Stork

Introduction to Machine Learning
Ethem Alpaydin

Datasets

UCI Repository
http://archive.ics.uci.edu/ml/
UCI Knowledge Discovery Database Archive
http://kdd.ics.uci.edu/summary.data.application.html
Statlib
http://lib.stat.cmu.edu/
Delve
http://www.cs.utoronto.ca/~delve/
Time Series Database
http://robjhyndman.com/TSDL
