
Introduction to Statistical Machine Learning

Christfried Webers
© 2010 Christfried Webers
NICTA and College of Engineering and Computer Science
The Australian National University

MLSS 2010, 27 September - 6 October

Overview

- Linear Regression
- Linear Classification
- Neural Networks
- Kernel Methods and SVM
- Mixture Models and EM
- Resources
- More Machine Learning

(T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning")

Outlines

- Overview: Definition, Related Fields
- Linear Regression: Bayesian Regression, Predictive Distribution
- Linear Classification: Classification, Decision Theory, Discrete Features, Logistic Regression, Feature Space
- Neural Networks: Parameter Optimisation
- Kernel Methods and SVM: Kernel Methods
- Mixture Models and EM: K-means Clustering, Convergence of EM
- Resources and More Machine Learning: Rejection Sampling, Importance Sampling

Part I

What is Machine Learning?

- Definition
- Examples of Machine Learning
- Related Fields
- Fundamental Types of Learning
- Basic Probability Theory
- Polynomial Curve Fitting

Definition

Machine learning is concerned with the design and development of algorithms that allow computers (machines) to improve their performance over time based on data.

- generalisation
- quantifying learning ("improve their performance over time") requires quantifying performance

A more formal definition (Tom Mitchell): A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Machine learning is essential when

- humans are unable to explain their expertise (e.g. speech recognition),
- humans are not around to help (e.g. navigation on Mars, underwater robotics),
- large amounts of data may hide relationships and correlations (empirical sciences, e.g. discovering unusual astronomical objects),
- the environment changes (fast) in time (e.g. mobile phone networks),
- solutions need to be adapted to many particular cases (e.g. junk mail).

Example: It is easier to write a program that learns to play checkers or backgammon well by self-play than to convert the expertise of a master player into a program.


Example: Junk Mail Filtering

- Given examples of data (mail) and targets {Junk, NoJunk}.
- Continue to learn from the user classifying new mail.

Example: Handwritten Digit Recognition

- Given handwritten ZIP codes on letters, money amounts on cheques, etc.
- Nonsense input: "don't know" is preferred to some wrong digit.

Example: Backgammon

TD-Gammon (Tesauro, 1992, 1995) played over a million games against itself. It now plays at the level of a human world champion.

Example: Image Denoising

[Figure: original image, image with noise added, denoised image. Source: "... Images", ICML 2006.]

Example: Source Separation

[Figure: audio sources are recorded by microphones as audio mixtures; waveform plots of the sources and the mixtures.]

Further applications:

- autonomous robotics,
- detecting credit card fraud,
- detecting network intrusion,
- bioinformatics,
- neuroscience,
- medical diagnosis,
- stock market analysis,
- ...

Related Fields

- Artificial Intelligence (AI)
- Statistics
- Game Theory
- Neuroscience, Psychology
- Data Mining
- Computer Science
- Adaptive Control Theory
- ...

Fundamental Types of Learning

- Unsupervised Learning: association, clustering, density estimation, blind source separation
- Supervised Learning: regression, classification
- Reinforcement Learning: agents
- Others: active learning, semi-supervised learning, transductive learning, ...

Unsupervised Learning

Goal: Determine how the data are organised.

Clustering: group similar instances.

Example applications:
- clustering customers in Customer-Relationship-Management,
- image compression: colour quantisation.
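K-means (which appears later in the outline) is one concrete clustering algorithm. A minimal sketch on made-up two-blob data; the data, the initialisation, and the cluster count are assumptions for illustration only:

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Plain k-means: alternate nearest-centre assignment and mean updates."""
    centres = X[:k].copy()                      # simple deterministic initialisation
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centre
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(dists, axis=1)
        # update step: move each centre to the mean of its assigned points
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels

# two well-separated synthetic blobs (illustration only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
               rng.normal(5.0, 0.1, size=(50, 2))])
centres, labels = kmeans(X, k=2)
```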

Supervised Learning

- Given pairs of data and targets (= labels).
- Learn a mapping from the data to the targets (training).
- Goal: use the learned mapping to correctly predict the target for new input data.
- Need to generalise well from the training data/target pairs.

Reinforcement Learning

- Find suitable actions in a given environment with the goal of maximising some reward.
- Correct input/output pairs are never presented.
- Reward might only come after many actions, e.g. at the end of the game (negative or positive).
- The current action may influence not only the current reward, but future rewards too.

[Figure: in each round i, the agent receives an observation and a reward, and chooses an action.]

- Well suited for problems with a long-term versus short-term reward trade-off.
- Naturally focused on online performance.

Basic Probability Theory

The probability of an event quantifies how likely it is that the event will occur or has occurred.

Example: Fair Six-Sided Die
- Sample space: Ω = {1, 2, 3, 4, 5, 6}
- Events: Even = {2, 4, 6}, Odd = {1, 3, 5}
- Probability of an outcome: P(3) = 1/6, P(Odd) = P(Even) = 1/2
- Conditional probability: P(3 | Odd) = P(3 and Odd) / P(Odd) = (1/6) / (1/2) = 1/3

General axioms:
- P({}) = 0 ≤ P(A) ≤ P(Ω) = 1
- P(A ∪ B) + P(A ∩ B) = P(A) + P(B)
- P(A ∩ B) = P(A | B) P(B)

Rules of probability:
- Sum rule: P(X) = Σ_Y P(X, Y)
- Product rule: P(X, Y) = P(X | Y) P(Y)
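The die example can be checked by direct enumeration; a small sketch using exact rational arithmetic:

```python
from fractions import Fraction

# sample space of a fair six-sided die, with uniform probabilities
p = {outcome: Fraction(1, 6) for outcome in range(1, 7)}

def P(event):
    """Probability of an event (a set of outcomes)."""
    return sum(p[o] for o in event)

odd, even = {1, 3, 5}, {2, 4, 6}

P3 = P({3})                              # 1/6
P3_given_odd = P({3} & odd) / P(odd)     # conditional probability: (1/6)/(1/2) = 1/3
# one of the general axioms: P(A or B) + P(A and B) = P(A) + P(B)
axiom_holds = (P(odd | even) + P(odd & even) == P(odd) + P(even))
```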

Probability Jargon

Example: observe the binary sequence 1101, modelled as independent flips with P(1 | θ) = θ.

- Likelihood: P(1101 | θ) = θ³ (1 − θ)
- Maximum Likelihood (ML) estimate: θ̂ = arg max_θ P(1101 | θ) = 3/4
- Prior: if we are indifferent, then P(θ) = const.
- Evidence: P(1101) = Σ_θ P(1101 | θ) P(θ) = 1/20 (actually ∫ dθ instead of Σ_θ)
- Posterior: P(θ | 1101) = P(1101 | θ) P(θ) / P(1101) ∝ θ³ (1 − θ)   (Bayes' rule)
- Maximum a Posteriori (MAP) estimate: θ̂ = arg max_θ P(θ | 1101) = 3/4
- Predictive probability: P(1 | 1101) = P(11011) / P(1101) = 2/3
- Expectation: E[f | ...] = Σ_θ f(θ) P(θ | ...), e.g. E[θ | 1101] = 2/3
- Variance: var(θ) = E[(θ − E[θ])² | 1101] = 2/63
- Probability density: P(θ) = P([θ, θ + dθ]) / dθ for dθ → 0
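These numbers can be verified numerically; a sketch that grids θ over [0, 1] (the grid resolution is an arbitrary choice):

```python
import numpy as np

# likelihood of the observed sequence 1101 under P(1 | theta) = theta
def likelihood(theta):
    return theta ** 3 * (1 - theta)

theta = np.linspace(0.0, 1.0, 100001)
dx = theta[1] - theta[0]

theta_ml = theta[np.argmax(likelihood(theta))]          # ML estimate -> 3/4

# uniform prior: evidence P(1101) = integral of the likelihood -> 1/20
evidence = np.sum(likelihood(theta)) * dx

# posterior is proportional to the likelihood (Bayes' rule)
posterior = likelihood(theta) / evidence
post_mean = np.sum(theta * posterior) * dx                     # E[theta | 1101] -> 2/3
post_var = np.sum((theta - post_mean) ** 2 * posterior) * dx   # var -> 2/63
```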

Polynomial Curve Fitting

Training data: values of x between 0 and 1, with targets generated from sin(2πx) plus random noise.

N = 10 data points:

x ≡ (x1, ..., xN), xi ∈ R, i = 1, ..., N
t ≡ (t1, ..., tN), ti ∈ R, i = 1, ..., N

Model: polynomial of order M,

y(x, w) = w0 + w1 x + w2 x² + ... + wM x^M = Σ_{m=0}^{M} wm x^m

- a nonlinear function of x,
- a linear function of the unknown model parameters w.

How can we find good parameters w = (w0, w1, ..., wM)ᵀ?

Measure the misfit between the model prediction y(xn, w) for the training data and the target tn with the error function

E(w) = (1/2) Σ_{n=1}^{N} ( y(xn, w) − tn )²

[Figure: training points (xn, tn) and the prediction y(xn, w); the error sums the squared vertical distances.]
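The curve fitting above can be sketched in code; the sin(2πx) data, the noise level, and the seed are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, N)   # noisy targets

def fit(x, t, M):
    """Least-squares fit of an order-M polynomial, minimising E(w)."""
    Phi = np.vander(x, M + 1, increasing=True)        # columns 1, x, ..., x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def E(w, x, t):
    """Sum-of-squares error E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    y = np.vander(x, len(w), increasing=True) @ w
    return 0.5 * np.sum((y - t) ** 2)

w3, w9 = fit(x, t, 3), fit(x, t, 9)
```

With M = 9 and N = 10 the polynomial can interpolate the training points, so the training error collapses towards zero even though the fit generalises poorly.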

Fitting polynomials of increasing order M:

- M = 0: y(x, w) = w0
- M = 1: y(x, w) = w0 + w1 x
- M = 3: y(x, w) = w0 + w1 x + w2 x² + w3 x³
- M = 9: y(x, w) = w0 + w1 x + ... + w8 x⁸ + w9 x⁹ (overfitting)

Get 100 new data points and compare the root-mean-square (RMS) error

ERMS = sqrt( 2 E(w*) / N )

on the training set and the test set.

[Figure: ERMS on training and test set versus model order M.]

Coefficients w* of the fitted polynomials:

        M = 0    M = 1    M = 3         M = 9
w0*      0.19     0.82     0.31          0.35
w1*              -1.27     7.99        232.37
w2*                      -25.43      -5321.83
w3*                       17.37      48568.31
w4*                                -231639.30
w5*                                 640042.26
w6*                               -1061800.52
w7*                                1042400.18
w8*                                -557682.99
w9*                                 125201.43

More data reduces overfitting: compare the M = 9 fit for N = 15 and for N = 100 data points.

- Heuristic: have no less than 5 to 10 times as many data points as parameters.
- But the number of parameters is not necessarily the most appropriate measure of model complexity!
- Later: the Bayesian approach.

Regularisation

Add a regularisation term to the error function:

Ẽ(w) = (1/2) Σ_{n=1}^{N} ( y(xn, w) − tn )² + (λ/2) ‖w‖²

where ‖w‖² = wᵀw.
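A sketch of the effect of the regulariser on the M = 9 fit; the data, seed, and chosen λ values are assumptions. Since the penalty pulls w towards zero, ‖w‖ shrinks as λ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 9
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, N)
Phi = np.vander(x, M + 1, increasing=True)            # features 1, x, ..., x^9

def fit_regularised(lam):
    """Minimise (1/2)||Phi w - t||^2 + (lam/2) w^T w  (closed form)."""
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

# increasing lambda shrinks the weight vector towards zero
norms = [float(np.linalg.norm(fit_regularised(lam)))
         for lam in (np.exp(-18), np.exp(-5), 1.0)]
```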

Regularisation with M = 9:

[Figure: fit with ln λ = −18.]

[Figure: fit with ln λ = 0.]

[Figure: ERMS on training and test set for M = 9 versus ln λ.]

Part II

Linear Regression

- Linear Basis Function Models
- Maximum Likelihood and Least Squares
- Regularized Least Squares
- Bayesian Regression
- Example for Bayesian Regression
- Predictive Distribution
- Limitations of Linear Basis Function Models

Input "feature" vector x = (1 ≡ x^(0), x^(1), ..., x^(D))ᵀ ∈ R^(D+1).

Linear regression model:

y(x, w) = Σ_{j=0}^{D} wj x^(j) = wᵀ x

[Figure: a plane fitted to data points over the (X1, X2) input space.]

Use training data (x1, t1), ..., (xN, tN) and a loss function (performance measure) to find the best w.

Example: residual sum of squares

Loss(w) = Σ_{n=1}^{N} ( tn − y(xn, w) )²

[Figure from The Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap. 3.]

Linear Basis Function Models

Use basis functions φj(x) ∈ R:

y(x, w) = Σ_{j=0}^{M−1} wj φj(x) = wᵀ φ(x)

- w0 is the bias parameter,
- basis functions φ = (φ0, ..., φ_{M−1})ᵀ,
- convention: φ0(x) = 1.

Polynomial basis functions: φj(x) = x^j.

- Limitation: polynomials are global functions of the input variable x.
- Extension: split the input space into regions and fit a different polynomial to each region (spline functions).

[Figure: polynomial basis functions.]

Gaussian basis functions:

φj(x) = exp( −(x − μj)² / (2 s²) )

No normalisation required; this is taken care of by the model parameters w.

[Figure: Gaussian basis functions.]

Sigmoidal basis functions:

φj(x) = σ( (x − μj) / s ),   where   σ(a) = 1 / (1 + exp(−a))

The tanh function is related to the logistic sigmoid by tanh(a) = 2 σ(2a) − 1.

[Figure: sigmoidal basis functions.]
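A small sketch evaluating the Gaussian and sigmoidal basis functions and checking the tanh relation numerically (the grid of test points is arbitrary):

```python
import numpy as np

def gaussian_basis(x, mu, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))"""
    return np.exp(-((x - mu) ** 2) / (2 * s ** 2))

def sigmoid(a):
    """logistic sigmoid sigma(a) = 1 / (1 + exp(-a))"""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoidal_basis(x, mu, s):
    """phi_j(x) = sigma((x - mu_j) / s)"""
    return sigmoid((x - mu) / s)

a = np.linspace(-3.0, 3.0, 13)
# relation to tanh: tanh(a) = 2 * sigma(2a) - 1
tanh_gap = float(np.max(np.abs(np.tanh(a) - (2 * sigmoid(2 * a) - 1))))
```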

Wavelets (e.g. Haar wavelets, Symmlet-8 wavelets):

- localised in both space and frequency,
- mutually orthogonal to simplify application.

[Figure from The Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap. 5: Haar and Symmlet-8 wavelets.]

[Another figure from The Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap. 5.]

Maximum Likelihood and Least Squares

In the simplest case, one can think of φj(x) = x^j.

Assume the target t is given by

t = y(x, w) + ε

with deterministic y(x, w) and Gaussian noise ε with precision (inverse variance) β. Thus

p(t | x, w, β) = N(t | y(x, w), β⁻¹)

Given a set of inputs X = {x1, ..., xN} with corresponding target values t = (t1, ..., tN)ᵀ.

Assume the data are independent and identically distributed (i.i.d.), meaning the data are drawn independently and from the same distribution. The likelihood of the targets t is then

p(t | X, w, β) = Π_{n=1}^{N} p(tn | xn, w, β) = Π_{n=1}^{N} N(tn | wᵀ φ(xn), β⁻¹)

Consider the logarithm of the likelihood p(t | X, w, β) (the logarithm is a monotonic function!):

ln p(t | X, w, β) = Σ_{n=1}^{N} ln N(tn | wᵀ φ(xn), β⁻¹)
                  = (N/2) ln β − (N/2) ln(2π) − β E_D(w)

with the sum-of-squares error

E_D(w) = (1/2) Σ_{n=1}^{N} ( tn − wᵀ φ(xn) )²

The error can be written in matrix form,

E_D(w) = (1/2) Σ_{n=1}^{N} ( tn − wᵀ φ(xn) )² = (1/2) (t − Φw)ᵀ (t − Φw)

with the design matrix

      ( φ0(x1)   φ1(x1)   ...   φ_{M−1}(x1) )
Φ  =  ( φ0(x2)   φ1(x2)   ...   φ_{M−1}(x2) )
      (   ...      ...    ...       ...     )
      ( φ0(xN)   φ1(xN)   ...   φ_{M−1}(xN) )

Maximising the likelihood is then equivalent to minimising E_D(w):

wML = arg max_w ln p(t | w, β) = arg min_w E_D(w) = (ΦᵀΦ)⁻¹ Φᵀ t ≡ Φ† t

where Φ† is the Moore-Penrose pseudo-inverse of Φ.
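The maximum likelihood solution can be sketched directly from the design matrix; the cubic basis, "true" weights, and noise level below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4
x = rng.uniform(0.0, 1.0, N)
Phi = np.vander(x, M, increasing=True)       # design matrix with phi_j(x) = x^j
w_true = np.array([0.5, -1.0, 2.0, 0.3])     # arbitrary "true" weights for this sketch
beta = 100.0                                 # noise precision
t = Phi @ w_true + rng.normal(0.0, 1.0 / np.sqrt(beta), N)

# maximum likelihood solution via the normal equations
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# equivalent form via the Moore-Penrose pseudo-inverse
w_pinv = np.linalg.pinv(Phi) @ t
```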

Regularized Least Squares

Minimise the total error E_D(w) + λ E_W(w) with the simple quadratic regulariser

E_W(w) = (1/2) wᵀ w

The solution is again available in closed form:

w = ( λI + ΦᵀΦ )⁻¹ Φᵀ t
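A sketch checking the closed form; the data below are assumptions. One design note: the regularised problem is just an ordinary least-squares problem on data augmented with √λ I rows, so any least-squares solver can be reused:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 30, 10
x = rng.uniform(0.0, 1.0, N)
Phi = np.vander(x, M, increasing=True)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, N)
lam = 0.1

# closed-form regularised solution: w = (lam I + Phi^T Phi)^{-1} Phi^T t
w_closed = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# the same w solves an ordinary least-squares problem on "augmented" data:
# append sqrt(lam) * I as extra rows of Phi and zeros as extra targets
Phi_aug = np.vstack([Phi, np.sqrt(lam) * np.eye(M)])
t_aug = np.concatenate([t, np.zeros(M)])
w_aug, *_ = np.linalg.lstsq(Phi_aug, t_aug, rcond=None)
```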

A more general regulariser:

E_W(w) = (1/2) Σ_{j=1}^{M} |wj|^q

[Figure: contours of the regulariser for q = 0.5, q = 1, q = 2, q = 4.]

Quadratic regulariser: (1/2) Σ_{j=1}^{M} wj²        Lasso regulariser: (1/2) Σ_{j=1}^{M} |wj|

[Figure: constraint regions of the two regularisers in the (w1, w2) plane.]

Bayesian Regression

Bayes' theorem: posterior = likelihood × prior / normalisation,

p(w | t) = p(t | w) p(w) / p(t)

with the likelihood

p(t | w) = Π_{n=1}^{N} N(tn | wᵀ φ(xn), β⁻¹) = const × exp{ −(β/2) (t − Φw)ᵀ (t − Φw) }

The notation suppresses the dependence on the inputs X and on β, which is assumed to be constant.

How should we choose the prior? Ideally, it

- makes sense for the problem at hand,
- allows us to find a posterior in a nice form.

Definition (Conjugate Prior)
A class of prior probability distributions p(w) is conjugate to a class of likelihood functions p(x | w) if the resulting posterior distributions p(w | x) are in the same family as p(w).
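A sketch of conjugacy for the Bernoulli likelihood with a Beta prior, where the posterior update is just counting; the flip sequence reuses the 1101 example from the probability jargon slide:

```python
# Beta(a, b) prior for a Bernoulli parameter theta: observing a flip just
# increments a count, and the posterior is again a Beta distribution.

def beta_bernoulli_update(a, b, flips):
    """Posterior Beta parameters after observing 0/1 flips (conjugacy)."""
    for f in flips:
        if f == 1:
            a += 1
        else:
            b += 1
    return a, b

a, b = 1, 1                                        # Beta(1, 1) = uniform prior
a, b = beta_bernoulli_update(a, b, [1, 1, 0, 1])   # the sequence 1101
posterior_mean = a / (a + b)                       # mean of Beta(a, b)
```

The posterior is Beta(4, 2), whose mean 2/3 matches E[θ | 1101] from the jargon slide.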

Examples of conjugate priors:

Likelihood             Conjugate Prior
Bernoulli              Beta
Binomial               Beta
Poisson                Gamma
Multinomial            Dirichlet
Uniform                Pareto
Exponential            Gamma
Normal                 Normal
Multivariate normal    Multivariate normal

Sequential Bayesian learning:

- With no data point (N = 0), start with the prior p(w).
- Combining the prior with the likelihood p(t1 | w, x1) via Bayes' rule gives the posterior p(w | t1, x1).
- Each posterior acts as the prior for the next data/target pair: combine p(w | t1, x1) with p(t2 | w, x2), and so on.
- This nicely fits a sequential learning framework.

Example for Bayesian Regression

- Single input x, single output t.
- Linear model y(x, w) = w0 + w1 x.
- Data creation:
  1. Draw inputs xn.
  2. Calculate f(xn, a) = a0 + a1 xn, where a0 = −0.3, a1 = 0.5.
  3. Add Gaussian noise with standard deviation σ = 0.2: tn ~ N(tn | f(xn, a), 0.04).
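The data-creation recipe and the sequential posterior update can be sketched as follows; the number of points, the seed, the uniform input distribution, and the precisions α and β are assumptions (β matches σ = 0.2):

```python
import numpy as np

rng = np.random.default_rng(0)
a0, a1 = -0.3, 0.5
alpha = 2.0                    # prior precision (an assumption for this sketch)
beta = (1.0 / 0.2) ** 2        # noise precision, sigma = 0.2

x = rng.uniform(-1.0, 1.0, 20)             # inputs (assumed drawn uniformly)
t = a0 + a1 * x + rng.normal(0.0, 0.2, 20)

# prior N(w | 0, alpha^{-1} I); process the data one point at a time,
# each posterior acting as the prior for the next data/target pair
S_inv = alpha * np.eye(2)                  # posterior precision matrix
h = np.zeros(2)                            # precision times posterior mean
for xn, tn in zip(x, t):
    phi = np.array([1.0, xn])              # phi(x) = (1, x)^T
    S_inv = S_inv + beta * np.outer(phi, phi)
    h = h + beta * tn * phi
m_N = np.linalg.solve(S_inv, h)            # posterior mean of w = (w0, w1)
```

Processing the points one by one gives exactly the same posterior as a single batch update, which is the point of the sequential view.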


Predictive Distribution

The predictive distribution is the probability of the test target t given test data x, the training data set X and the training targets t:

p(t | x, X, t) = ∫ p(t, w | x, X, t) dw                      (sum rule)
              = ∫ p(t | w, x, X, t) p(w | x, X, t) dw
              = ∫ p(t | w, x) p(w | X, t) dw

where the first factor in the last integral depends on testing only, and the second on training only.

With the prior p(w | α) = N(w | 0, α⁻¹ I), the predictive variance after N data points have been seen is

σ_N²(x) = 1/β + φ(x)^T (α I + β Φ^T Φ)⁻¹ φ(x)

where the first term represents the noise of the data and the second the uncertainty of w. Moreover,

σ_{N+1}²(x) ≤ σ_N²(x)    and    lim_{N→∞} σ_N²(x) = 1/β.
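The two properties of the predictive variance can be checked numerically. This is a sketch: the precisions `alpha`, `beta` and the polynomial basis functions are assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 2.0, 25.0                          # assumed prior/noise precisions

def design(x, degree=3):
    """Polynomial basis functions phi(x) = (1, x, ..., x^degree)."""
    return np.vander(np.atleast_1d(x), degree + 1, increasing=True)

def predictive_var(x_star, X):
    """sigma_N^2(x) = 1/beta + phi(x)^T (alpha I + beta Phi^T Phi)^-1 phi(x)."""
    Phi = design(X)
    phi = design(x_star)[0]
    A = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    return 1.0 / beta + phi @ np.linalg.solve(A, phi)

X = rng.uniform(-1, 1, size=50)
v_small = predictive_var(0.5, X[:5])    # variance after seeing  5 points
v_large = predictive_var(0.5, X)        # variance after seeing 50 points
```

Adding data points can only shrink the second term, so `v_large <= v_small`, and both stay above the irreducible noise floor 1/β.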

[Figure] Example with artificial sinusoidal data from sin(2πx) (green) and added noise. Mean of the predictive distribution (red) and regions of one standard deviation from mean (red shaded).

[Figure] Example with artificial sinusoidal data from sin(2πx) (green) and added noise. Samples y(x, w) (red) from the posterior distribution p(w | X, t).

Limitations of Linear Basis Function Models

- Basis functions φj(x) are fixed before the training data set is observed.
- Curse of dimensionality: the number of basis functions grows rapidly, often exponentially, with the dimensionality D.
- But typical data sets have two nice properties which can be exploited if the basis functions are not fixed:
  - Data lie close to a nonlinear manifold with intrinsic dimension much smaller than D. Need algorithms which place basis functions only where data are (e.g. radial basis function networks, support vector machines, relevance vector machines, neural networks).
  - Target variables may only depend on a few significant directions within the data manifold. Need algorithms which can exploit this property (neural networks).

Curse of Dimensionality

- Linear algebra allows us to operate in n-dimensional vector spaces using the intuition from our 3-dimensional world as a vector space. No surprises as long as n is finite.
- If we add more structure to a vector space (e.g. inner product, metric), our intuition gained from the 3-dimensional world around us may be wrong.
- Example: sphere of radius r = 1. What is the fraction of the volume of the sphere in a D-dimensional space which lies between radius r = 1 − ε and r = 1?
- Volume scales like r^D, therefore the formula for the volume of a sphere is V_D(r) = K_D r^D, and

  (V_D(1) − V_D(1 − ε)) / V_D(1) = 1 − (1 − ε)^D
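The formula above is a one-liner; even a thin shell (ε = 0.1) contains most of the volume once D = 20:

```python
# Fraction of a unit D-ball's volume in the shell between r = 1 - eps and r = 1:
# (V_D(1) - V_D(1 - eps)) / V_D(1) = 1 - (1 - eps)^D, since V_D(r) = K_D r^D.
def shell_fraction(D, eps):
    return 1.0 - (1.0 - eps) ** D

frac_low = shell_fraction(2, 0.1)     # thin shell, low dimension
frac_high = shell_fraction(20, 0.1)   # same shell thickness, D = 20
```

For D = 2 the shell holds 19% of the volume; for D = 20 it already holds about 88%.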

Curse of Dimensionality

Fraction of the volume of the sphere in a D-dimensional space which lies between radius r = 1 − ε and r = 1:

(V_D(1) − V_D(1 − ε)) / V_D(1) = 1 − (1 − ε)^D

[Figure] Volume fraction as a function of ε for D = 1, 2, 5, 20: for large D almost all of the volume lies in a thin shell near the surface.

Curse of Dimensionality

[Figure] Probability density p(r) with respect to the radius r of a Gaussian distribution for various values of the dimensionality D (D = 1, 2, 20).

Curse of Dimensionality

Probability density with respect to radius r of a Gaussian distribution. Example: D = 2; assume μ = 0, Σ = I:

N(x | 0, I) = (1/(2π)) exp(−½ x^T x) = (1/(2π)) exp(−½ (x1² + x2²))

Coordinate transformation x1 = r cos(θ), x2 = r sin(θ). With |J| the determinant of the Jacobian of the given coordinate transformation,

p(r, θ | 0, I) = N(x(r, θ) | 0, I) |J| = (1/(2π)) r exp(−½ r²)

Curse of Dimensionality

Probability density with respect to radius r of a Gaussian distribution for D = 2 (and μ = 0, Σ = I):

p(r, θ | 0, I) = (1/(2π)) r exp(−½ r²)

Marginalising over θ,

p(r | 0, I) = ∫₀^{2π} (1/(2π)) r exp(−½ r²) dθ = r exp(−½ r²)

[Figure] p(r) for D = 1, 2, 20.
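A quick Monte Carlo check of this density (a sketch; sample sizes are arbitrary): for D = 2 the derived density r·exp(−r²/2) has mean √(π/2) ≈ 1.2533, and the mean radius grows with D even though the density of x peaks at the origin.

```python
import numpy as np

rng = np.random.default_rng(2)

# For a standard D-dimensional Gaussian, estimate E[r] with r = ||x||.
def mean_radius(D, n=100_000):
    x = rng.standard_normal((n, D))
    return np.linalg.norm(x, axis=1).mean()

r1, r2, r20 = mean_radius(1), mean_radius(2), mean_radius(20)
```

`r2` should be close to √(π/2), and `r20` is much larger: in high dimensions the probability mass sits in a shell far from the origin.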

Part III

Linear Classification

- Classification
- Generalised Linear Model
- Inference and Decision
- Decision Theory
- Fisher's Linear Discriminant
- The Perceptron Algorithm
- Probabilistic Generative Models
- Discrete Features
- Logistic Regression
- Feature Space

Classification

- Goal: assign an input x to one of K discrete classes Ck, where k = 1, ..., K.
- Divide the input space into different regions.

[Figure] Example of an input space divided into decision regions.

- Target values t are taken from a discrete set.
- Two classes: t ∈ {0, 1} (t = 1 represents class C1 and t = 0 represents class C2).
- Can interpret the value of t as the probability of class C1, with only two values possible for the probability, 0 or 1.
- Note: other conventions to map classes into integers are possible; check the setup.

- For the multi-class setup, often used: the 1-of-K coding scheme, in which t is a vector of length K which has all values 0 except for tj = 1, where j encodes the membership in class Cj.
- Example: given 5 classes {C1, ..., C5}, membership in class C2 will be encoded as the target vector t = (0, 1, 0, 0, 0)^T.
- Note: other conventions to map multi-classes into integers are possible; check the setup.

Linear Model

- Idea: use again a linear model as in regression: y(x, w) is a linear function of the parameters w,

  y(xn, w) = w^T φ(xn)

- But generally y(xn, w) ∈ R.
- Example: which class is y(x, w) = 0.71623?

Generalised Linear Model

- Apply a mapping f : R → Z to the linear model to get the discrete class labels.
- Generalised linear model:

  y(xn, w) = f(w^T φ(xn))

- Activation function: f(·)
- Link function: f⁻¹(·)

[Figure] The sign function sign(z) as an example of an activation function.

In increasing order of complexity:

1. Find a discriminant function f(x) which maps each input directly onto a class label.

2. Discriminative Models
   - Model the class probabilities p(Ck | x).
   - Use decision theory to assign each new x to one of the classes.

3. Generative Models
   - Model the class-conditional probabilities p(x | Ck).
   - Also, infer the prior class probabilities p(Ck).
   - Use Bayes' theorem to find the posterior p(Ck | x).
   - Alternatively, model the joint distribution p(x, Ck) directly.
   - Use decision theory to assign each new x to one of the classes.

Decision Theory

Minimise the probability of a mistake:

p(mistake) = p(x ∈ R1, C2) + p(x ∈ R2, C1)
           = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx

[Figure] Joint densities p(x, C1) and p(x, C2) with decision boundary x0 separating the decision regions R1 and R2.

Weight each misclassification of x to the wrong class Cj instead of assigning it to the correct class Ck by a factor Lkj. The expected loss is now

E[L] = Σ_k Σ_j ∫_{Rj} L_kj p(x, Ck) dx

Difficult cases:

- posterior probabilities p(Ck | x) are very small
- joint distributions p(x, Ck) have comparable values

[Figure] Posterior probabilities p(C1 | x) and p(C2 | x) with a reject region where neither posterior dominates.

- In regression, a linear model together with minimisation of a sum-of-squares error function resulted in a closed-form solution for the parameter values.
- Is this also possible for classification?
- Given input data x belonging to one of K classes Ck.
- Use the 1-of-K binary coding scheme.
- Each class is described by its own linear model

  yk(x) = wk^T x + wk0,    k = 1, ..., K

With the notation

w̃k = (wk0, wk^T)^T ∈ R^{D+1}
x̃ = (1, x^T)^T ∈ R^{D+1}
W̃ = (w̃1 ... w̃K) ∈ R^{(D+1)×K}

the K linear models combine into y(x) = W̃^T x̃ ∈ R^K. A new input x is assigned to the class corresponding to the largest value in the row vector y(x).

Determine W̃

- Each target vector tn encodes the class in the 1-of-K coding scheme.
- Define a matrix T where row n corresponds to tn^T.
- The sum-of-squares error can now be written as

  ED(W̃) = ½ tr{ (X̃W̃ − T)^T (X̃W̃ − T) }

- Setting the derivative with respect to W̃ to zero gives the solution

  W̃ = (X̃^T X̃)⁻¹ X̃^T T = X̃† T
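The closed-form solution W̃ = X̃† T can be sketched in a few lines. The two Gaussian classes are an assumption for the illustration; the sketch also exercises the property (discussed on the next slide) that the components of y(x) sum to one under 1-of-K coding.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two well-separated Gaussian classes in 2-D, 1-of-K (here K = 2) targets.
X = np.vstack([rng.normal([-2.0, 0.0], 1.0, (50, 2)),
               rng.normal([+2.0, 0.0], 1.0, (50, 2))])
T = np.zeros((100, 2))
T[:50, 0] = 1.0      # class C1 rows: t = (1, 0)
T[50:, 1] = 1.0      # class C2 rows: t = (0, 1)

Xt = np.column_stack([np.ones(100), X])     # augmented inputs x~ = (1, x^T)^T
W = np.linalg.pinv(Xt) @ T                  # W~ = X~_dagger T

Y = Xt @ W                                  # y(x) = W~^T x~ for every input
pred = Y.argmax(axis=1)                     # assign class of the largest value
labels = np.r_[np.zeros(50), np.ones(50)]
accuracy = (pred == labels).mean()
row_sums = Y.sum(axis=1)                    # each should equal exactly 1
```

Because every target satisfies the linear constraint (1, 1)·tn − 1 = 0, every prediction row sums to one as well, even though the individual components are not probabilities.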

The prediction for a new input x is

y(x) = W̃^T x̃ = T^T (X̃†)^T x̃,

where X̃ is given by the training data, and x̃ is the new input.

- Interesting property: if for every tn the same linear constraint a^T tn + b = 0 holds, then the prediction y(x) will also obey the same constraint, a^T y(x) + b = 0.
- For the 1-of-K coding scheme, the sum of the components in tn is one, and therefore all components of y(x) will sum to one. BUT: the components are not probabilities, as they are not constrained to the interval (0, 1).

[Figures] Comparison of decision boundaries on data without and with outliers: least squares is sensitive to outliers, unlike the logistic regression approach (green curve: decision boundary for the logistic regression model described later).

Fisher's Linear Discriminant

- View classification as projecting the input onto one dimension,

  y(x) = w^T x

- But there are many projections from a D-dimensional input space onto one dimension.
- Projection always means loss of information.
- For classification we want to preserve the class separation in one dimension.
- Can we find a projection which maximally preserves the class separation?

[Figure] Samples from two classes and their histograms when projected onto two different one-dimensional spaces.

Given N1 input data of class C1, and N2 input data of class C2, calculate the centres of the two classes:

m1 = (1/N1) Σ_{n∈C1} xn,    m2 = (1/N2) Σ_{n∈C2} xn

A first idea: choose w to maximise the separation of the projected class means onto w,

m̃2 − m̃1 = w^T (m2 − m1)

Problem with non-uniform covariance: the projection maximising the mean separation can still leave the projected classes strongly overlapping.

Measure also the within-class variance for each class,

s_k² = Σ_{n∈Ck} (yn − m̃k)²

where yn = w^T xn. Maximise the Fisher criterion

J(w) = (m̃2 − m̃1)² / (s1² + s2²)

The Fisher criterion can be written as

J(w) = (w^T SB w) / (w^T SW w)

where SB is the between-class covariance

SB = (m2 − m1)(m2 − m1)^T

and SW is the within-class covariance

SW = Σ_{n∈C1} (xn − m1)(xn − m1)^T + Σ_{n∈C2} (xn − m2)(xn − m2)^T

Maximising

J(w) = (w^T SB w) / (w^T SW w)

with respect to w results in Fisher's linear discriminant

w ∝ SW⁻¹ (m2 − m1)

Strictly speaking this is only a projection, but it can be used to construct a discriminant by choosing a threshold y0 in the projection space.
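The solution w ∝ SW⁻¹(m2 − m1) can be sketched directly from the definitions above. The elongated shared covariance is an assumption chosen to show why the naive mean-difference direction is inferior:

```python
import numpy as np

rng = np.random.default_rng(4)

# Two elongated Gaussian classes (strongly non-uniform covariance).
C = np.array([[3.0, 0.0], [0.0, 0.2]])           # assumed shared covariance
X1 = rng.multivariate_normal([0.0, 0.0], C, 200)
X2 = rng.multivariate_normal([2.0, 1.0], C, 200)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
SW = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class cov.
w = np.linalg.solve(SW, m2 - m1)                          # w ∝ SW^-1 (m2 - m1)
w /= np.linalg.norm(w)

def fisher_J(v):
    """J(v) = (m~2 - m~1)^2 / (s1^2 + s2^2) for projection direction v."""
    y1, y2 = X1 @ v, X2 @ v
    return (y2.mean() - y1.mean()) ** 2 / (
        y1.var() * len(y1) + y2.var() * len(y2))

J_fisher = fisher_J(w)
w_naive = (m2 - m1) / np.linalg.norm(m2 - m1)    # mean-difference direction
J_naive = fisher_J(w_naive)
```

Since w = SW⁻¹(m2 − m1) is the exact maximiser of J, `J_fisher` can never be smaller than `J_naive`.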

The Perceptron Algorithm

- Rosenblatt's perceptron: a computer which could learn new skills by trial and error.
- "Principles of neurodynamics: Perceptrons and the theory of brain mechanisms" (Spartan Books, 1962)

Two class model:

- Create feature vector φ(x) by a fixed nonlinear transformation of the input x.
- Generalised linear model

  y(x) = f(w^T φ(x))

  with φ(x) containing some bias element φ0(x) = 1.
- Nonlinear activation function

  f(a) = +1 if a ≥ 0,    −1 if a < 0

- Target coding

  t = +1 if C1,    −1 if C2

- Natural error measure: the number of misclassified patterns. Problem: as a function of w, this is piecewise constant and therefore the gradient is zero almost everywhere.
- Better idea: using the (−1, +1) target coding scheme, we want all patterns to satisfy w^T φ(xn) tn > 0.
- Perceptron criterion: add the errors for all patterns belonging to the set of misclassified patterns M,

  EP(w) = − Σ_{n∈M} w^T φ(xn) tn

Apply stochastic gradient descent to the perceptron criterion

EP(w) = − Σ_{n∈M} w^T φn tn

Update the weight vector w by

w^{(τ+1)} = w^{(τ)} − η ∇EP(w) = w^{(τ)} + η φn tn

Since y(x, w) does not depend on the norm of w, we can set η = 1:

w^{(τ+1)} = w^{(τ)} + φn tn
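The update rule above amounts to a very short training loop. This is a sketch: the data, the φ(x) = (1, x1, x2) features, and the "first misclassified pattern" selection order are assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Linearly separable data with targets t in {-1, +1}; phi(x) = (1, x1, x2).
X = np.vstack([rng.normal([-2.0, -2.0], 0.5, (30, 2)),
               rng.normal([+2.0, +2.0], 0.5, (30, 2))])
t = np.r_[-np.ones(30), np.ones(30)]
Phi = np.column_stack([np.ones(60), X])

w = np.zeros(3)
for _ in range(5000):                       # repeat until no misclassification
    mis = np.flatnonzero(t * (Phi @ w) <= 0)
    if mis.size == 0:
        break
    n = mis[0]                              # pick a misclassified pattern
    w = w + Phi[n] * t[n]                   # w <- w + phi_n t_n  (eta = 1)

errors = int((t * (Phi @ w) <= 0).sum())
```

On linearly separable data like this the loop terminates with `errors == 0`, as guaranteed by the perceptron convergence theorem.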

[Figures] Perceptron learning: after selecting a misclassified pattern (green), the weight vector is updated by w^{(τ+1)} = w^{(τ)} + φn tn and the decision boundary moves accordingly.

For a single update step,

−w^{(τ+1)T} φn tn = −w^{(τ)T} φn tn − (φn tn)^T φn tn < −w^{(τ)T} φn tn

because (φn tn)^T φn tn = ‖φn tn‖² > 0, so the error contribution of the chosen pattern is reduced.

- BUT: contributions to the error from the other misclassified patterns might have increased.
- AND: some correctly classified patterns might now be misclassified.
- Perceptron Convergence Theorem: if the training set is linearly separable, the perceptron algorithm is guaranteed to find a solution in a finite number of steps.


Probabilistic Generative Models

Generative approach: model the class-conditional densities p(x | Ck) and the priors p(Ck) to calculate the posterior probability for class C1,

p(C1 | x) = p(x | C1) p(C1) / ( p(x | C1) p(C1) + p(x | C2) p(C2) )
          = 1 / (1 + exp(−a(x))) = σ(a(x))

where

a(x) = ln [ p(x | C1) p(C1) / ( p(x | C2) p(C2) ) ] = ln [ p(x, C1) / p(x, C2) ]

σ(a) = 1 / (1 + exp(−a))

Logistic Sigmoid

The logistic sigmoid function σ(a) = 1/(1 + exp(−a)):

- maps the whole real axis into the finite interval (0, 1)
- symmetry: σ(−a) = 1 − σ(a)
- derivative: d/da σ(a) = σ(a) σ(−a) = σ(a)(1 − σ(a))
- its inverse is called the logit function, a(σ) = ln( σ / (1 − σ) )

[Figures] The logistic sigmoid σ(a) and the logit a(σ).
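The three identities above are easy to verify numerically (a small self-contained check; the evaluation point a = 1.7 is arbitrary):

```python
import math

def sigma(a):
    return 1.0 / (1.0 + math.exp(-a))

def logit(s):
    return math.log(s / (1.0 - s))

a = 1.7
s = sigma(a)

# sigma(-a) = 1 - sigma(a)
symmetry_gap = abs(sigma(-a) - (1.0 - s))

# d sigma/da = sigma(a) (1 - sigma(a)), checked against a central difference
h = 1e-6
num_deriv = (sigma(a + h) - sigma(a - h)) / (2 * h)
deriv_gap = abs(num_deriv - s * (1.0 - s))

# the logit inverts the sigmoid
inv_gap = abs(logit(s) - a)
```

All three gaps are at numerical-precision level.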

For K > 2 classes,

p(Ck | x) = p(x | Ck) p(Ck) / Σ_j p(x | Cj) p(Cj) = exp(ak) / Σ_j exp(aj)

where

ak = ln( p(x | Ck) p(Ck) )

- Also called the softmax function, as it is a smoothed version of the max function.
- Example: if ak ≫ aj for all j ≠ k, then p(Ck | x) ≈ 1, and p(Cj | x) ≈ 0.
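A minimal softmax sketch, including the "smoothed max" behaviour from the example above (the shift by the maximum is a standard numerical-stability trick, not something the slides prescribe):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))    # shift for numerical stability; result unchanged
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))           # moderate activations
p_peaked = softmax(np.array([100.0, 1.0, 0.1]))  # a_k >> a_j  =>  p_k ~ 1
```

The outputs always form a proper distribution, and when one activation dominates, softmax behaves like a (smoothed) max.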

If each class-conditional probability is Gaussian and has a different covariance, the quadratic terms −½ x^T Σ⁻¹ x no longer cancel each other out. We get a quadratic discriminant.

[Figure] Decision boundaries for Gaussian class-conditional densities with different covariances.

Discrete Features

- Consider discrete features, in the simplest case xi ∈ {0, 1}.
- For a D-dimensional input space, a general distribution would be represented by a table with 2^D entries.
- Together with the normalisation constraint, these are 2^D − 1 independent variables.
- Grows exponentially with the number of features.
- The Naive Bayes assumption is that all features conditioned on the class Ck are independent of each other:

  p(x | Ck) = Π_{i=1}^D μ_ki^{xi} (1 − μ_ki)^{1−xi}

Substituting the Naive Bayes model

p(x | Ck) = Π_{i=1}^D μ_ki^{xi} (1 − μ_ki)^{1−xi}

into the normalised exponential form of the posterior,

p(Ck | x) = p(x | Ck) p(Ck) / Σ_j p(x | Cj) p(Cj) = exp(ak) / Σ_j exp(aj)

gives

ak(x) = Σ_{i=1}^D { xi ln μ_ki + (1 − xi) ln(1 − μ_ki) } + ln p(Ck),

which is a linear function of x.
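The linearity of ak(x) can be made explicit by rewriting it as ak(x) = W x + b. The parameter values μ_ki and the priors below are arbitrary assumptions for the illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

# Bernoulli Naive Bayes with D = 3 binary features and K = 2 classes.
mu = np.array([[0.8, 0.2, 0.6],      # mu_1i for class C1 (assumed values)
               [0.3, 0.7, 0.5]])     # mu_2i for class C2 (assumed values)
prior = np.array([0.4, 0.6])

def a_direct(x):
    """a_k(x) = sum_i [x_i ln mu_ki + (1-x_i) ln(1-mu_ki)] + ln p(C_k)."""
    return (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) \
        + np.log(prior)

# The same quantity written explicitly as a linear function a_k(x) = W x + b.
W = np.log(mu) - np.log(1 - mu)
b = np.log(1 - mu).sum(axis=1) + np.log(prior)

x = rng.integers(0, 2, size=3).astype(float)
gap = np.abs(a_direct(x) - (W @ x + b)).max()
```

The two forms agree to machine precision, confirming that the discriminant is linear in x.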


Logistic Regression

- Model the posterior of class C1 directly as a logistic sigmoid σ(·) acting on a linear function of the feature vector φ,

  p(C1 | φ) = σ(w^T φ),    p(C2 | φ) = 1 − p(C1 | φ)

- Model dimension is equal to the dimension M of the feature space.
- Compare this to fitting two Gaussians:

  2M (means) + M(M + 1)/2 (shared covariance) = M(M + 5)/2 parameters

- For larger M, the logistic regression model has a clear advantage.

Determine the parameters via maximum likelihood for data (φn, tn), n = 1, ..., N, where φn = φ(xn). The class membership is coded as tn ∈ {0, 1}.

Likelihood function:

p(t | w) = Π_{n=1}^N yn^{tn} (1 − yn)^{1−tn}

where yn = p(C1 | φn).

Error function: the negative log likelihood, resulting in the cross-entropy error function

E(w) = − ln p(t | w) = − Σ_{n=1}^N { tn ln yn + (1 − tn) ln(1 − yn) }

Error function (cross-entropy error):

E(w) = − Σ_{n=1}^N { tn ln yn + (1 − tn) ln(1 − yn) },    yn = p(C1 | φn) = σ(w^T φn)

Gradient of the error function (using dσ/da = σ(1 − σ)):

∇E(w) = Σ_{n=1}^N (yn − tn) φn

- For each data point, the gradient contribution is the product of the deviation yn − tn and the basis function φn.
- BUT: the maximum likelihood solution can exhibit over-fitting even for many data points; one should use a regularised error or MAP estimation then.
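Plugging the gradient ∇E(w) = Φ^T(y − t) into plain gradient descent gives a tiny trainer. The 1-D two-class data, step size, and iteration count are assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(7)

# Two overlapping 1-D classes; features phi(x) = (1, x), targets t in {0, 1}.
X = np.vstack([rng.normal(-1.5, 1.0, (40, 1)),
               rng.normal(+1.5, 1.0, (40, 1))])
t = np.r_[np.zeros(40), np.ones(40)]
Phi = np.column_stack([np.ones(80), X])

def cross_entropy(w):
    y = 1.0 / (1.0 + np.exp(-Phi @ w))
    return -(t * np.log(y) + (1 - t) * np.log(1 - y)).sum()

w = np.zeros(2)
eta = 0.005                                  # assumed step size
E0 = cross_entropy(w)
for _ in range(2000):
    y = 1.0 / (1.0 + np.exp(-Phi @ w))
    w = w - eta * Phi.T @ (y - t)            # grad E(w) = Phi^T (y - t)
E1 = cross_entropy(w)

y_final = 1.0 / (1.0 + np.exp(-Phi @ w))
accuracy = ((y_final > 0.5) == t.astype(bool)).mean()
```

Because the two classes overlap, the maximum likelihood weights stay finite here; with perfectly separable data they would diverge, which is one face of the over-fitting caveat above.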

Feature Space

- Used direct input x until now.
- All classification algorithms work also if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions φ(x).
- Example: use two Gaussian basis functions centered at the green crosses in the input space.

[Figure] Input space (x1, x2) with the centres of the two Gaussian basis functions marked by green crosses.

- Linear decision boundaries in the feature space correspond to nonlinear decision boundaries in the input space.
- Classes which are NOT linearly separable in the input space can become linearly separable in the feature space.
- BUT: if classes overlap in input space, they will also overlap in feature space. Nonlinear features φ(x) cannot remove the overlap; but they may increase it!

[Figure] Data and decision boundary in the input space (x1, x2).

Part IV

Neural Networks

Functional Transformations

A two-layer neural network computes

    y_k(x, w) = g( Σ_{j=0}^M w_{kj}^{(2)} h( Σ_{i=0}^D w_{ji}^{(1)} x_i ) ) = g( Σ_{j=0}^M w_{kj}^{(2)} φ_j(x) )

As before, the biases can be absorbed into the weights by introducing an extra input x_0 = 1 and a hidden unit z_0 = 1.

[Figure: network diagram with inputs x_1 … x_D (plus x_0), hidden units z_1 … z_M (plus z_0), outputs y_1 … y_K, first-layer weights w^{(1)} and second-layer weights w^{(2)}.]
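The two-layer mapping above can be sketched as a forward pass; `h` and `g` below stand for the hidden and output activation functions, and all names are illustrative:

```python
import numpy as np

def forward(x, W1, W2, h=np.tanh, g=lambda a: a):
    """Two-layer network y = g(W2 · h(W1 · x)), with biases absorbed
    by prepending a constant 1 to the input and to the hidden vector."""
    x = np.concatenate(([1.0], x))     # extra input x_0 = 1
    z = h(W1 @ x)                      # hidden activations
    z = np.concatenate(([1.0], z))     # extra hidden unit z_0 = 1
    return g(W2 @ z)
```

With D = 2 inputs, M = 3 hidden units and K = 2 outputs, `W1` has shape (3, 3) and `W2` has shape (2, 4) because of the absorbed biases.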

Plots of σ(w0 + w1 x1 + w2 x2) for different parameters w.

[Figure: surface plots over (x1, x2) ∈ (−10, 10)², e.g. for w = (0, 1, 0.1) and w = (0, 0.1, 1).]

Neural network approximating f(x) = x²: tanh hidden units and linear outputs, trained on 50 data points sampled from the interval (−1, 1). Red: resulting output. Dashed: output of the hidden units.

Neural network approximating f(x) = sin(x): same setup as above.

Neural network approximating f(x) = |x|: same setup as above.

Neural network approximating the Heaviside function

    f(x) = 1 for x ≥ 0,  0 for x < 0

same setup as above.

Neural network for two-class classification: 2 inputs, 2 hidden units with tanh activation function, 1 output with logistic sigmoid activation function.

[Figure: data with the network decision boundary, the hidden unit contours, and (green) the optimal decision boundary from the known data distribution.]

Parameter Optimisation

Sum-of-squares error function over all training data:

    E(w) = (1/2) Σ_{n=1}^N || y(x_n, w) − t_n ||²

with target vectors t_n.

Find the parameter w* which minimises E(w),

    w* = arg min_w E(w),

by gradient descent.

Error Backpropagation

The error in the previous layer is computed from the errors δ_k of the following layer, the activation derivative h′(·), via the backpropagation formula

    δ_j = h′(a_j) Σ_k w_{kj} δ_k.

The gradient for one data point is then

    ∂E_n(w)/∂w_{ji} = δ_j z_i,

the product of the error δ_j at the output end of the weight and the output z_i at its input end in the previous layer.

[Figure: unit z_i feeding weight w_{ji} into unit z_j, with the errors δ_k fed back through the weights w_{kj}.]

As the number of weights is usually much larger than the number of units (the network is well connected), the complexity of calculating the gradient ∂E_n(w)/∂w_{ji} via error backpropagation is O(W), where W is the number of weights.

Compare this to numerical differentiation using forward differences

    ∂E_n(w)/∂w_{ji} = ( E_n(w_{ji} + ε) − E_n(w_{ji}) ) / ε + O(ε)

or the numerically more stable (fewer round-off errors) symmetric differences

    ∂E_n(w)/∂w_{ji} = ( E_n(w_{ji} + ε) − E_n(w_{ji} − ε) ) / (2ε) + O(ε²),

which both need O(W²) operations.
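The accuracy gap between the forward and symmetric difference quotients is easy to check numerically; a small illustrative sketch (not from the slides):

```python
def forward_diff(f, x, eps):
    """O(eps) forward difference quotient."""
    return (f(x + eps) - f(x)) / eps

def symmetric_diff(f, x, eps):
    """O(eps^2) symmetric difference quotient."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Example: f(x) = x**3, true derivative at x = 1 is 3.
f = lambda x: x ** 3
err_fwd = abs(forward_diff(f, 1.0, 1e-4) - 3.0)
err_sym = abs(symmetric_diff(f, 1.0, 1e-4) - 3.0)
```

For ε = 1e-4 the forward error is of order 1e-4 while the symmetric error is of order 1e-8, matching the O(ε) versus O(ε²) claim.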

[Figure: network fits with M = 1, M = 3 and M = 10 hidden units.]

[Figure: error curves over training runs.]

Part V

Kernel Methods and Maximum Margin Classifiers

Kernel Methods

The prediction is a linear combination of kernel functions which are evaluated at the kept training data points and the new test point.

Let L(t, y(x)) be any loss function and J(f) be any penalty quadratic in f. Then the minimum of the penalised loss has the form

    f(x) = Σ_{n=1}^N α_n k(x_n, x)

with α minimising

    Σ_{n=1}^N L(t_n, (Kα)_n) + λ αᵀ K α,

and kernel (Gram) matrix K_{ij} = K_{ji} = k(x_i, x_j).

Kernel trick, based on Mercer's theorem: any continuous, symmetric, positive semi-definite kernel function k(x, y) can be expressed as a dot product in a high-dimensional (possibly infinite-dimensional) space.
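For the special case of squared loss, the minimising α has the closed form (K + λI)⁻¹ t, and the predictor takes exactly the kernel expansion above. A small sketch with a Gaussian kernel (all names, and the choice of kernel and loss, are illustrative):

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit_alpha(X, t, lam=1e-3, sigma=0.5):
    """Minimise ||t - K alpha||^2 + lam * alpha^T K alpha,
    whose solution is alpha = (K + lam I)^(-1) t."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), t)

def predict(X_train, alpha, x_new, sigma=0.5):
    """f(x) = sum_n alpha_n k(x_n, x)."""
    return gaussian_kernel(x_new, X_train, sigma) @ alpha
```

On smooth data the fitted f(x) passes close to the targets for small λ.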

Maximum Margin Classifiers

Support Vector Machines choose the decision boundary which maximises the smallest distance to samples in both classes:

    w* = arg max_{w : ||w|| = 1} min_n [ t_n ( wᵀ φ(x_n) ) ],   t_n ∈ {−1, 1}

[Figure: maximum margin boundary with the lines y = −1, y = 0 and y = 1.]

Non-linear boundary for general φ(x). The solution has the form

    w = Σ_{n=1}^N α_n φ(x_n)

so that

    f(x) = wᵀ φ(x) = Σ_{n=1}^N α_n k(x_n, x).

Introduce a slack variable ξ_n ≥ 0 for each data point n:

    ξ_n = 0 for points on the margin boundary or beyond (the correct side),
    ξ_n = |t_n − y(x_n)| otherwise.

[Figure: the boundaries y = −1, y = 0, y = 1 with points at ξ = 0 (on or outside the margin), 0 < ξ < 1 (inside the margin, correctly classified) and ξ > 1 (misclassified).]

[Figure: SVM with Gaussian kernel and ν = 0.45 applied to a nonseparable data set in two dimensions. Support vectors are indicated by circles.]

Part VI

Mixture Models and EM

K-means Clustering

Goal: partition N features x_n into K clusters, using the Euclidean distance d(x_i, x_j) = ||x_i − x_j||, such that each feature belongs to the cluster with the nearest mean.

Distortion measure:

    J(μ, cl) = Σ_{n=1}^N d(x_n, μ_{cl(x_n)})²

where cl(x_i) is the index of the cluster centre closest to x_i.

Start with K arbitrary cluster centres μ_k.

E-step: minimise J w.r.t. cl(x_i): assign each data point x_i to the closest cluster, with index cl(x_i).

M-step: minimise J w.r.t. μ_k: find the new μ_k as the mean of the points belonging to cluster k.

Iteration over the E/M-steps converges to a local minimum of J.
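The two alternating steps can be sketched in a few lines of NumPy (initialisation by random data points is an illustrative choice, not from the slides):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]      # arbitrary initial centres
    cl = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # E-step: assign each point to the nearest centre
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        cl = d2.argmin(axis=1)
        # M-step: move each centre to the mean of its assigned points
        new_mu = np.array([X[cl == k].mean(axis=0) if np.any(cl == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break                                     # local minimum of J reached
        mu = new_mu
    return mu, cl
```

On two well-separated blobs the centres converge to the blob means.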

[Figure: K-means iterations (a)-(i): alternating assignment and mean-update steps on two-dimensional data until convergence.]

Mixture of Gaussians:

    p(x | π, μ, Σ) = Σ_{k=1}^K π_k N(x | μ_k, Σ_k)

Maximise the likelihood p(X | π, μ, Σ) w.r.t. π, μ, Σ.

Compare K-means: there the assignment step puts each data point x_i into the closest cluster cl(x_i), and the update step recomputes each μ_k as the mean of the points belonging to cluster k.

[Figure: one-dimensional mixture density estimates, and EM iterations L = 1, 2, 5, 20 on two-dimensional data.]

EM for Gaussian Mixtures

Given a Gaussian mixture and data X, maximise the log likelihood w.r.t. the parameters (π, μ, Σ).

1. Initialise the means μ_k, covariances Σ_k and mixing coefficients π_k. Evaluate the log likelihood function.
2. E step: evaluate the responsibilities γ(z_k) using the current parameters:

       γ(z_k) = π_k N(x | μ_k, Σ_k) / Σ_{j=1}^K π_j N(x | μ_j, Σ_j)

3. M step: re-estimate the parameters using the current responsibilities:

       N_k = Σ_{n=1}^N γ(z_nk)
       μ_k^new = (1/N_k) Σ_{n=1}^N γ(z_nk) x_n
       Σ_k^new = (1/N_k) Σ_{n=1}^N γ(z_nk) (x_n − μ_k^new)(x_n − μ_k^new)ᵀ
       π_k^new = N_k / N

4. Evaluate the log likelihood

       ln p(X | π, μ, Σ) = Σ_{n=1}^N ln Σ_{k=1}^K π_k^new N(x_n | μ_k^new, Σ_k^new)

   and check for convergence.
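A one-dimensional sketch of these E/M updates; the quantile-based initialisation and all names are illustrative choices, not from the slides:

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, K=2, iters=50):
    mu = np.quantile(x, np.linspace(0, 1, K))   # spread initial means over the data
    var = np.full(K, x.var())
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E step: responsibilities gamma[n, k]
        dens = np.stack([pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate pi, mu, var from the responsibilities
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / len(x)
    return pi, mu, var
```

On data drawn from two well-separated Gaussians the means converge to the component means.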

Mixture of Bernoulli Distributions

Consider a set of D binary variables x_i, each governed by a Bernoulli distribution with parameter μ_i. Therefore

    p(x | μ) = Π_{i=1}^D μ_i^{x_i} (1 − μ_i)^{1 − x_i}

with mean and covariance

    E[x] = μ,   cov[x] = diag{ μ_i (1 − μ_i) }.

Mixture of K Bernoulli distributions:

    p(x | μ, π) = Σ_{k=1}^K π_k p(x | μ_k),  with  p(x | μ_k) = Π_{i=1}^D μ_{ki}^{x_i} (1 − μ_{ki})^{1 − x_i}

EM updates:

    γ(z_nk) = π_k p(x_n | μ_k) / Σ_{j=1}^K π_j p(x_n | μ_j)

    N_k = Σ_{n=1}^N γ(z_nk),   x̄_k = (1/N_k) Σ_{n=1}^N γ(z_nk) x_n

    μ_k = x̄_k,   π_k = N_k / N

Examples from a digits data set, each pixel taking only binary values.

[Figure: examples of binarised digits; the mean of each component in the mixture; and the mean of a single multivariate Bernoulli distribution fit to all of the data.]

EM for Models with Latent Variables

EM finds the maximum likelihood solution for models with latent variables. There are two kinds of variables: observed variables X and latent variables Z.

The log likelihood is then

    ln p(X | θ) = ln Σ_Z p(X, Z | θ)

Assume that maximisation of the distribution p(X, Z | θ) over the complete data set {X, Z} is straightforward. But we only have the incomplete data set {X} and the posterior distribution p(Z | X, θ).

EM - Key Idea

Maximise the averaged version Q(θ, θ_old) of the complete log likelihood ln p(X, Z | θ), averaged over all states of Z:

    Q(θ, θ_old) = Σ_Z p(Z | X, θ_old) ln p(X, Z | θ)

EM Algorithm

1. Choose an initial setting θ_old for the parameters.
2. E step: evaluate p(Z | X, θ_old).
3. M step: evaluate θ_new given by

       θ_new = arg max_θ Q(θ, θ_old)

   where

       Q(θ, θ_old) = Σ_Z p(Z | X, θ_old) ln p(X, Z | θ)

4. Check for convergence of the log likelihood or the parameter values. If not yet converged, set θ_old = θ_new and go to step 2.

EM Algorithm - Convergence

Start with the product rule for the observed variables X, the unobserved variables Z, and the parameters θ:

    ln p(X, Z | θ) = ln p(Z | X, θ) + ln p(X | θ).

Multiply by an arbitrary distribution q(Z) over the latent variables and sum over Z:

    Σ_Z q(Z) ln p(X, Z | θ) = Σ_Z q(Z) ln p(Z | X, θ) + ln p(X | θ)

Rewrite as

    ln p(X | θ) = Σ_Z q(Z) ln [ p(X, Z | θ) / q(Z) ] − Σ_Z q(Z) ln [ p(Z | X, θ) / q(Z) ]
                = L(q, θ) + KL(q ‖ p)

Kullback-Leibler Divergence

For discrete distributions:

    KL(q ‖ p) = Σ_y q(y) ln [ q(y) / p(y) ] = − Σ_y q(y) ln [ p(y) / q(y) ]

For continuous distributions:

    KL(q ‖ p) = ∫ q(y) ln [ q(y) / p(y) ] dy = − ∫ q(y) ln [ p(y) / q(y) ] dy

Properties:

    KL(q ‖ p) ≥ 0
    not symmetric: KL(q ‖ p) ≠ KL(p ‖ q)
    KL(q ‖ p) = 0 iff q = p
    invariant under parameter transformations
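The discrete case and two of the properties can be checked directly; a minimal sketch (the example distributions are illustrative):

```python
import math

def kl(q, p):
    """KL(q || p) = sum_y q(y) ln(q(y) / p(y)) for discrete distributions
    (assumes p(y) > 0 wherever q(y) > 0)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.5]
p = [0.9, 0.1]
```

Here kl(q, q) is exactly 0, kl(q, p) is positive, and kl(q, p) differs from kl(p, q), illustrating non-negativity and asymmetry.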

EM Algorithm - Convergence (cont.)

    ln p(X | θ) = L(q, θ) + KL(q ‖ p)

with

    L(q, θ) = Σ_Z q(Z) ln [ p(X, Z | θ) / q(Z) ],
    KL(q ‖ p) = − Σ_Z q(Z) ln [ p(Z | X, θ) / q(Z) ].

[Figure: decomposition of ln p(X | θ) into the lower bound L(q, θ) and the gap KL(q ‖ p).]

EM Algorithm - E Step

In the E step, maximise L(q, θ_old) with respect to q(Z).

L(q, θ_old) is a functional. ln p(X | θ_old) does NOT depend on q(Z). The maximum of L(q, θ_old) will occur when the Kullback-Leibler divergence vanishes. Therefore, choose

    q(Z) = p(Z | X, θ_old).

[Figure: after the E step, KL(q ‖ p) = 0 and L(q, θ_old) = ln p(X | θ_old).]

EM Algorithm - M Step

In the M step, maximise L(q, θ) with respect to θ:

    θ_new = arg max_θ L(q, θ) = arg max_θ Σ_Z q(Z) ln p(X, Z | θ)

L(q, θ_new) > L(q, θ_old) unless the maximum is already reached. As q(Z) = p(Z | X, θ_old) is fixed, p(Z | X, θ_new) will not be equal to q(Z), and therefore the Kullback-Leibler divergence will be greater than zero (unless converged).

[Figure: after the M step, both L(q, θ_new) and KL(q ‖ p) contribute to the increased ln p(X | θ_new).]

[Figure: EM in parameter space, showing the log likelihood ln p(X | θ) and the bound L(q, θ) as θ moves from θ_old to θ_new. Blue curve: after E step. Green curve: after M step.]

Part VII

Sampling

Pseudorandom number generator: an algorithm generating a sequence of numbers that approximates the properties of random numbers.

Example: linear congruential generators

    z(n+1) = (a z(n) + c) mod m

with multiplier a, increment c (0 ≤ c < m), modulus m, and seed z0.

Other classes of pseudorandom number generators: linear feedback shift registers, generalised feedback shift registers.
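The recurrence above fits in a few lines. The constants below are purely illustrative and far too small for real use:

```python
def lcg(a, c, m, seed):
    """Linear congruential generator: z(n+1) = (a*z(n) + c) mod m."""
    z = seed
    while True:
        z = (a * z + c) % m
        yield z

# Toy parameters for illustration only (period at most m = 16).
gen = lcg(a=5, c=3, m=16, seed=1)
first = [next(gen) for _ in range(5)]
```

Every output stays in the range [0, m), and the sequence is fully determined by the seed.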

A famous bad example (the RANDU generator) is defined by the recurrence

    z(n+1) = (2^16 + 3) z(n) mod 2^31

Plotting successive triples (z(n), z(n+1), z(n+2))ᵀ in 3D ... and changing the viewpoint reveals that all points fall into 15 planes.

[Figure: triples of successive outputs in 3D, viewed from two angles.]

Why do three successive samples correlate? Calculate modulo 2^31:

    z(n+2) = (2^16 + 3)² z(n)
           = (2^32 + 6·2^16 + 9) z(n)
           = ( 6·(2^16 + 3) − 9 ) z(n)
           = 6 z(n+1) − 9 z(n)

G. Marsaglia, "Random Numbers Fall Mainly in the Planes", Proc. National Academy of Sciences 61, 25-28, 1968.
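The derived identity can be verified numerically for every triple the generator produces (variable names are illustrative):

```python
M = 2 ** 31
A = 2 ** 16 + 3      # the RANDU multiplier

def randu(seed, n):
    """Generate n successive values of z(n+1) = A * z(n) mod 2^31."""
    z, out = seed, []
    for _ in range(n):
        z = (A * z) % M
        out.append(z)
    return out

zs = randu(1, 100)
# every triple satisfies z(n+2) = 6 z(n+1) - 9 z(n)  (mod 2^31)
ok = all((zs[i + 2] - 6 * zs[i + 1] + 9 * zs[i]) % M == 0 for i in range(98))
```

This linear relation between any three successive outputs is exactly what confines the 3D triples to a small number of planes.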

Sampling from Standard Distributions

Suppose uniformly distributed samples of z in the interval (0, 1) are available. Calculate the cumulative distribution function

    h(y) = ∫_{−∞}^y p(x) dx

and transform the uniform samples via

    y = h⁻¹(z)

to obtain samples y distributed according to p(y).

Goal: sample from p(y), which is given in analytical form. If a uniformly distributed random variable z is transformed using y = h⁻¹(z), then y will be distributed according to p(y).

[Figure: the cumulative distribution h(y) maps the density p(y) onto the uniform interval (0, 1).]

Goal: sample from the exponential distribution

    p(y) = λ e^{−λy} for 0 ≤ y,  and 0 for y < 0.

Suppose uniformly distributed samples of z in the interval (0, 1) are available. Calculate the cumulative distribution function

    h(y) = ∫_0^y p(x) dx = ∫_0^y λ e^{−λx} dx = 1 − e^{−λy}

Inverting,

    y = h⁻¹(z) = −(1/λ) ln(1 − z)

transforms uniform samples into samples from the exponential distribution.
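The inverse-CDF transform above gives a one-line exponential sampler; a minimal sketch (names are illustrative):

```python
import math
import random

def sample_exponential(lam, rng=random.random):
    """Inverse-CDF sampling: y = -(1/lam) * ln(1 - z), with z ~ U(0, 1)."""
    z = rng()
    return -math.log(1.0 - z) / lam

random.seed(0)
samples = [sample_exponential(2.0) for _ in range(100_000)]
mean = sum(samples) / len(samples)   # should approach 1/lam = 0.5
```

With λ = 2 the sample mean converges to the exponential mean 1/λ = 0.5.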

The Box-Muller method for Gaussian samples:

1. Generate pairs z1, z2 ∈ (−1, 1) (e.g. z_i = 2z − 1 for z from U(z | 0, 1)).
2. Discard any pair (z1, z2) unless z1² + z2² ≤ 1. This results in a uniform distribution inside the unit circle, p(z1, z2) = 1/π.
3. Evaluate r² = z1² + z2² and

       y1 = z1 ( −2 ln r² / r² )^{1/2}
       y2 = z2 ( −2 ln r² / r² )^{1/2}

Then y1 and y2 are independent Gaussian samples:

    p(y1, y2) = p(z1, z2) | ∂(z1, z2)/∂(y1, y2) | = [ (1/√(2π)) e^{−y1²/2} ] [ (1/√(2π)) e^{−y2²/2} ]
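The three steps above can be sketched directly (a minimal version; names are illustrative):

```python
import math
import random

def box_muller(rng=random.random):
    """Polar Box-Muller: returns two independent N(0, 1) samples."""
    while True:
        z1 = 2.0 * rng() - 1.0
        z2 = 2.0 * rng() - 1.0
        r2 = z1 * z1 + z2 * z2
        if 0.0 < r2 <= 1.0:                  # keep only points inside the unit circle
            f = math.sqrt(-2.0 * math.log(r2) / r2)
            return z1 * f, z2 * f

random.seed(0)
pairs = [box_muller() for _ in range(50_000)]
ys = [y for pair in pairs for y in pair]
```

The pooled outputs have mean close to 0 and variance close to 1, as expected for N(0, 1).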

Rejection Sampling

Assumption 1: sampling directly from p(z) is difficult, but we can evaluate p̃(z) up to some unknown normalisation constant Z_p:

    p(z) = (1/Z_p) p̃(z)

Assumption 2: we can draw samples from a simpler distribution q(z), and for some constant k and all z

    k q(z) ≥ p̃(z)

[Figure: the envelope k q(z) over p̃(z); a proposal z0 with u0 drawn uniformly from [0, k q(z0)].]

Rejection Sampling

1. Generate a number z0 from q(z).
2. Generate a number u0 from the uniform distribution over [0, k q(z0)].
3. If u0 > p̃(z0), reject the pair (z0, u0).
4. The remaining pairs have uniform distribution under the curve p̃(z), and the z values are distributed according to p(z).

[Figure: accepted pairs lie under p̃(z); rejected pairs lie between p̃(z) and k q(z).]
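As a minimal sketch of these steps, take the unnormalised target p̃(z) = z(1 − z) on [0, 1] (a Beta(2, 2) shape), a uniform proposal q(z) = U(0, 1), and k = max p̃ = 0.25; the target and constants are illustrative choices, not from the slides:

```python
import random

def p_tilde(z):
    """Unnormalised target density on [0, 1] (Beta(2, 2) shape, max 0.25)."""
    return z * (1.0 - z)

def rejection_sample(n, k=0.25, rng=random.random):
    """Rejection sampling with q(z) = U(0, 1); requires k*q(z) >= p_tilde(z)."""
    out = []
    while len(out) < n:
        z0 = rng()               # step 1: propose z0 from q
        u0 = rng() * k           # step 2: u0 uniform on [0, k*q(z0)]
        if u0 <= p_tilde(z0):    # step 3: reject if u0 > p_tilde(z0)
            out.append(z0)       # step 4: accepted z0 is a sample from p
    return out

random.seed(0)
zs = rejection_sample(50_000)
mean = sum(zs) / len(zs)         # Beta(2, 2) has mean 0.5
```

Note that the acceptance rate is Z_p / k, so a loose envelope wastes proposals.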

Importance Sampling

Provides a framework to directly calculate the expectation E_p[f(z)] with respect to some distribution p(z). Does NOT provide samples from p(z).

Again use a proposal distribution q(z) and draw samples z(l) from it. Then

    E[f] = ∫ f(z) p(z) dz = ∫ f(z) (p(z)/q(z)) q(z) dz ≈ (1/L) Σ_{l=1}^L ( p(z(l)) / q(z(l)) ) f(z(l))

Consider both p(z) and q(z) to be not normalised:

    p(z) = p̃(z)/Z_p,   q(z) = q̃(z)/Z_q.

Then

    E[f] ≈ (Z_q / Z_p) (1/L) Σ_{l=1}^L r_l f(z(l)),   r_l = p̃(z(l)) / q̃(z(l))

with

    Z_p / Z_q ≈ (1/L) Σ_{l=1}^L r_l,

so that

    E[f] ≈ Σ_{l=1}^L w_l f(z(l)),   w_l = r_l / Σ_{m=1}^L r_m.
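A self-normalised sketch of these formulas: estimate E[z] = 0 for a standard normal target, evaluated only up to its normalisation constant, using a wider Gaussian proposal (the target, proposal and names are illustrative):

```python
import math
import random

def p_tilde(z):
    """Unnormalised target: standard normal up to the constant Z_p = sqrt(2*pi)."""
    return math.exp(-z * z / 2.0)

def q_pdf(z, s=3.0):
    """Normalised proposal N(0, s^2), so here q_tilde = q and Z_q = 1."""
    return math.exp(-z * z / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

random.seed(0)
L = 100_000
zs = [random.gauss(0.0, 3.0) for _ in range(L)]    # samples from q
rs = [p_tilde(z) / q_pdf(z) for z in zs]           # importance ratios r_l
total = sum(rs)
ws = [r / total for r in rs]                       # normalised weights w_l
est = sum(w * z for w, z in zip(ws, zs))           # estimate of E[z], true value 0
z_ratio = total / L                                # estimate of Z_p / Z_q = sqrt(2*pi)
```

Both the expectation and the ratio of normalisation constants come out of the same set of ratios r_l.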

Ideally, choose q(z) such that samples fall where the product f(z) p(z) is large, or at least where p(z) is large.

Importance weights r_l correct the bias introduced by sampling from the proposal distribution q(z) instead of the wanted distribution p(z).

Success depends on how well q(z) approximates p(z). If p(z) > 0 in some region, then q(z) > 0 there is necessary.

Markov Chain Monte Carlo - The Idea

Idea: build a machine which uses the current sample to decide which next sample to produce, in such a way that the overall distribution of the samples will be p(z).

1. Propose a new sample z* given the current state z(τ), from a proposal distribution q(z | z(τ)) we know how to sample from.
2. Accept or reject the new sample according to some appropriate criterion:

       z(τ+1) = z* if accepted,  z(τ) if rejected.

Metropolis Algorithm

Assume a symmetric proposal distribution: q(z_A | z_B) = q(z_B | z_A). Accept the new sample z* with probability

    A(z*, z(τ)) = min( 1, p̃(z*) / p̃(z(τ)) )

Implementation: draw u from the uniform distribution in (0, 1) and accept the new sample if A(z*, z(τ)) > u. Then

    z(τ+1) = z* if accepted,  z(τ) if rejected.

If rejected, the current sample is repeated. (Different from rejection sampling.)
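The algorithm fits in a few lines. A minimal sketch targeting an unnormalised standard normal with a symmetric Gaussian proposal (step size and names are illustrative):

```python
import math
import random

def p_tilde(z):
    """Unnormalised target: standard normal up to a constant."""
    return math.exp(-z * z / 2.0)

def metropolis(n, step=1.0, z0=0.0, seed=0):
    """Metropolis with symmetric proposal q(z* | z) = N(z, step^2)."""
    random.seed(seed)
    z, chain = z0, []
    for _ in range(n):
        z_new = z + random.gauss(0.0, step)          # symmetric proposal
        a = min(1.0, p_tilde(z_new) / p_tilde(z))    # acceptance probability
        if random.random() < a:
            z = z_new                                # accept
        chain.append(z)                              # on rejection, repeat z
    return chain

chain = metropolis(100_000)
```

After a short burn-in, the chain's mean and variance match the N(0, 1) target; note that the target is only ever evaluated up to its normalisation constant.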

Example: sampling from a Gaussian distribution (black contour shows one standard deviation). The proposal distribution is an isotropic Gaussian with standard deviation 0.2. 150 candidates generated; 43 rejected.

[Figure: the resulting random walk of accepted and rejected proposal steps over the target contour.]

Metropolis-Hastings Algorithm

Generalisation to nonsymmetric proposal distributions q_k. At step τ, draw a sample z* from the distribution q_k(z | z(τ)), where k labels the set of possible transitions. Accept with probability

    A_k(z*, z(τ)) = min( 1, [ p̃(z*) q_k(z(τ) | z*) ] / [ p̃(z(τ)) q_k(z* | z(τ)) ] )

Common choice: a Gaussian centred on the current state.

    small variance: high acceptance rate, but slow walk through the state space; samples not independent
    large variance: high rejection rate

Part VIII

More Machine Learning

Graphical Models
Gaussian Processes
Sequential Data
Sequential Decision Theory
Learning Agents
Reinforcement Learning
Theoretical Model Selection
Additive Models and Trees and Related Methods
Approximate (Variational) Inference
Boosting
Concept Learning
Computational Learning Theory
Genetic Algorithms
Learning Sets of Rules
Analytical Learning
...

Part IX

Resources

Journals

Machine Learning
IEEE Transactions on Pattern Analysis and Machine Intelligence
IEEE Transactions on Neural Networks
Neural Computation
Neural Networks
Annals of Statistics
Journal of the American Statistical Association
SIAM Journal on Applied Mathematics (SIAP)
...

Conferences

European Conference on Machine Learning (ECML)
Neural Information Processing Systems (NIPS)
Algorithmic Learning Theory (ALT)
Computational Learning Theory (COLT)
Uncertainty in Artificial Intelligence (UAI)
International Joint Conference on Artificial Intelligence (IJCAI)
International Conference on Artificial Neural Networks (ICANN)
...

Books

Pattern Recognition and Machine Learning, Christopher M. Bishop
The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, Jerome Friedman
Pattern Classification, Richard O. Duda, Peter E. Hart, David G. Stork
Introduction to Machine Learning, Ethem Alpaydin

Datasets

UCI Repository
http://archive.ics.uci.edu/ml/

UCI Knowledge Discovery Database Archive
http://kdd.ics.uci.edu/summary.data.application.html

Statlib
http://lib.stat.cmu.edu/

Delve
http://www.cs.utoronto.ca/~delve/

Time Series Data Library
http://robjhyndman.com/TSDL
