
CSIS0270/COMP3270

12b. Statistical Learning – Bayes Classifier



Outline

Learning with Complete Data
Naive Bayes models
Maximum Likelihood Parameter Learning — Discrete Models
Maximum Likelihood Parameter Learning — Continuous Models
Bayes Classifier



Learning with Complete Data

Data and Hypothesis — construct a hypothesis from data.
Data act as evidence to support the hypothesis.
Assume a probability model whose structure is fixed.
We try to learn only the parameters of the model.
Supervised learning: the data consist of input (attribute) vectors x and labels y (used for learning).



Problem formulation

Feature extractors — extract features from the examples and form an input vector x.
Training data — a collection of < x, y > pairs.
Classifier — a parametric probabilistic model.
Training process — learn the parameters of the model.
Test data — data x unseen during training, used to test the performance of the classifier.
Applications: OCR, medical diagnosis, spam detection, etc.



Naive Bayes Model
The model is called naive because we assume the attributes are conditionally independent of each other given the class.
Assume the class label is C and the features are x_1, ..., x_n.
By Bayes' rule, we have the posterior probability
$$
P(C \mid x_1,\dots,x_n) = \frac{P(x_1,\dots,x_n \mid C)\,P(C)}{P(x_1,\dots,x_n)}
= \alpha\,P(x_1,\dots,x_n \mid C)\,P(C)
= \alpha\,P(C)\prod_i P(x_i \mid C)
$$
P(C) is the prior probability distribution.
P(x_i|C) is the class-conditional probability distribution.
Both can be estimated from training data.
If we only need the classification decision based on the posterior probability, we can apply any monotonically increasing function to it, since the class with the largest value is unchanged.
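To make the factorization concrete, here is a minimal sketch in Python; the priors and per-feature conditional probabilities below are made-up illustrative values, not part of the lecture material.

```python
# Hypothetical toy spam filter: priors and per-word conditional probabilities
# are assumed values, not estimates from real data.
priors = {"spam": 0.4, "ham": 0.6}
cond = {
    "spam": {"offer": 0.30, "meeting": 0.05},
    "ham":  {"offer": 0.02, "meeting": 0.20},
}

def posterior(features):
    """Return P(C | x1..xn) for every class, using the naive Bayes factorization."""
    scores = {c: priors[c] for c in priors}
    for c in priors:
        for f in features:
            scores[c] *= cond[c][f]          # P(C) * prod_i P(x_i | C)
    alpha = 1.0 / sum(scores.values())       # normalize so the posteriors sum to 1
    return {c: alpha * s for c, s in scores.items()}

print(posterior(["offer", "meeting"]))
```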
Maximum Likelihood Estimation — Discrete Models

The results of maximum likelihood estimation are usually well known.
Likelihood: the probability of observing the data, given the probabilistic model.
Find the model (parameter values) that maximizes this likelihood.
Procedure:
1 Write down the likelihood of the data as a function of the parameters.
2 In fact, any monotonically increasing function can be applied without changing the maximizer; we use the log likelihood, since it converts the product into a summation.
3 Find the parameters that maximize the (log) likelihood by taking the first derivative and setting it to zero.



Example: Learning the probability I

Given a bag containing only cherries and limes, find the probability of drawing a cherry.
Assume the probability of a cherry is θ. The corresponding hypothesis is h_θ.
If there are N fruits and c of them are cherries, then N − c are limes.
The likelihood of observing this data set d is
$$
P(\mathbf{d} \mid h_\theta) = \prod_{k=1}^{N} P(d_k \mid h_\theta) = \theta^{c}\,(1-\theta)^{N-c}
$$



Example: Learning the probability II
The log likelihood is
$$
L(\mathbf{d} \mid h_\theta) = \log P(\mathbf{d} \mid h_\theta) = \sum_{k=1}^{N} \log P(d_k \mid h_\theta)
= c\log\theta + (N-c)\log(1-\theta)
$$
Taking the first derivative and setting it to 0:
$$
\frac{\partial L(\mathbf{d} \mid h_\theta)}{\partial\theta} = \frac{c}{\theta} - \frac{N-c}{1-\theta} = 0
\;\Rightarrow\; \theta = \frac{c}{N}
$$
The probability can be obtained by counting and taking the ratio!
Similarly simple results can be obtained for the other conditional probabilities using the same procedure.
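As a quick sanity check of θ = c/N, the sketch below (with hypothetical counts N = 10, c = 7) compares the closed-form estimate with a coarse grid search over the log likelihood.

```python
import math

N, c = 10, 7                      # hypothetical counts: 7 cherries out of 10 fruits

def log_likelihood(theta):
    return c * math.log(theta) + (N - c) * math.log(1 - theta)

# Closed-form maximum-likelihood estimate derived above
theta_ml = c / N

# Grid search over (0, 1) should land on (roughly) the same value
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=log_likelihood)

print(theta_ml, theta_grid)       # 0.7 and 0.7
```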
Maximum Likelihood Estimation — Continuous Model I

Usually we assume a Normal distribution (Gaussian density function)
$$
P(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$$
Two parameters for each distribution, µ and σ.
The likelihood function for observing the set of attribute values {x_k, k = 1, ..., N} is
$$
P(\mathbf{x} \mid \mu, \sigma) = \prod_{k=1}^{N} P(x_k \mid \mu, \sigma)
= \prod_{k=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_k-\mu)^2}{2\sigma^2}}
$$



Maximum Likelihood Estimation — Continuous Model II
The log likelihood is
$$
L = \sum_{k=1}^{N} \log \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_k-\mu)^2}{2\sigma^2}}
= N\left(-\log\sqrt{2\pi} - \log\sigma\right) - \sum_{k=1}^{N} \frac{(x_k-\mu)^2}{2\sigma^2}
$$
Find the derivatives and set them to 0:
$$
\frac{\partial L}{\partial\mu} = \frac{1}{\sigma^2}\sum_{k=1}^{N}(x_k-\mu) = 0
\;\Rightarrow\; \mu = \frac{1}{N}\sum_{k=1}^{N} x_k
$$
$$
\frac{\partial L}{\partial\sigma} = -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{k=1}^{N}(x_k-\mu)^2 = 0
\;\Rightarrow\; \sigma = \sqrt{\frac{\sum_{k=1}^{N}(x_k-\mu)^2}{N}}
$$
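A minimal sketch of the resulting estimators: the ML mean is the sample average and the ML σ divides by N (not N − 1); the observations below are made up for illustration.

```python
import math

x = [4.8, 5.1, 5.5, 4.9, 5.3]     # hypothetical observations of one attribute

N = len(x)
mu = sum(x) / N                                            # ML estimate of the mean
sigma = math.sqrt(sum((xk - mu) ** 2 for xk in x) / N)     # ML estimate (divide by N)

print(mu, sigma)
```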
Maximum Likelihood Estimation — Continuous Model III

Again, the results are simple and well known.
There are 2 parameters for each attribute for each class.
For the Naive Bayes model, we need to estimate the prior probabilities (1 per class) and the class-conditional probabilities (2 parameters per attribute per class).
You can do the same for other distributions.



Discriminant Functions I

Classification by discriminant functions: for a c-class classifier, construct a set of c discriminant functions g_i(x) and assign x to class l if its discriminant is the largest, i.e.
$$
g_l(\mathbf{x}) > g_j(\mathbf{x}) \quad \text{for all } j \neq l
$$

We can use the posterior probability as a discriminant function.
Any monotonically increasing function of g_i(x) is also a valid discriminant function.
For the Naive Bayes classifier with Normal densities, where d is the number of attributes,
$$
g_j(\mathbf{x}) = \alpha\, P(j) \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}}
$$



Discriminant Functions II

Since α is a positive constant (independent of j), and log is a monotonically increasing function, we have
$$
g'_j(\mathbf{x}) = \log P(j) + \sum_{i=1}^{d}\left(-\log\sqrt{2\pi} - \log\sigma_i - \frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right)
$$

Removing the constant terms,
$$
g''_j(\mathbf{x}) = \log P(j) - \sum_{i=1}^{d}\left(\log\sigma_i + \frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right)
$$

The actual value of the posterior probability is not needed.
You can treat the negation of the discriminant function as a distance measure; the class with the smallest distance wins.
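Putting the pieces together, here is a sketch of evaluating the simplified discriminant g''_j(x) for a Gaussian Naive Bayes classifier; the per-class priors, means, and standard deviations are assumed to have been estimated already (the values here are purely illustrative).

```python
import math

# Hypothetical per-class parameters for d = 2 attributes
params = {
    "A": {"prior": 0.5, "mu": [1.0, 2.0], "sigma": [0.5, 1.0]},
    "B": {"prior": 0.5, "mu": [3.0, 1.0], "sigma": [0.8, 0.7]},
}

def discriminant(x, p):
    """g''_j(x) = log P(j) - sum_i [ log sigma_i + (x_i - mu_i)^2 / (2 sigma_i^2) ]"""
    g = math.log(p["prior"])
    for xi, mu, sigma in zip(x, p["mu"], p["sigma"]):
        g -= math.log(sigma) + (xi - mu) ** 2 / (2 * sigma ** 2)
    return g

x = [1.2, 1.8]
print(max(params, key=lambda c: discriminant(x, params[c])))   # class with largest g''
```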



Bayes Classifier
The assumption that attributes are independent of each other is too naive.
In general, if they are not independent, we have to use a multivariate distribution, e.g. the multivariate normal, and consider all attributes together (as a vector x):
 
$$
p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{t}\,\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)
$$
where µ is the mean vector and Σ is the covariance matrix.
You can use the same procedure to find the posterior probability; the corresponding discriminant function is
$$
g''_j(\mathbf{x}) = \log P(j) - \tfrac{1}{2}\log|\Sigma_j| - \tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_j)^{t}\,\Sigma_j^{-1}(\mathbf{x}-\boldsymbol{\mu}_j)
$$
The number of parameters for dimension d is:
prior probability P(j): 1
mean vector: d
covariance matrix (symmetric): d(d+1)/2
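A sketch of the multivariate discriminant using NumPy; the class priors, mean vectors, and covariance matrices are assumed given (the values below are illustrative, not from the slides).

```python
import numpy as np

def discriminant(x, prior, mu, cov):
    """g''_j(x) = log P(j) - 0.5 log|Sigma_j| - 0.5 (x - mu_j)^T Sigma_j^{-1} (x - mu_j)"""
    diff = x - mu
    sign, logdet = np.linalg.slogdet(cov)        # numerically safer than log(det(cov))
    maha = diff @ np.linalg.solve(cov, diff)     # (x - mu)^T Sigma^{-1} (x - mu)
    return np.log(prior) - 0.5 * logdet - 0.5 * maha

# Hypothetical 2-class, 2-dimensional example
mu_a, cov_a = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]])
mu_b, cov_b = np.array([2.0, 1.0]), np.array([[0.5, 0.0], [0.0, 0.5]])
x = np.array([1.0, 0.5])

g_a = discriminant(x, 0.5, mu_a, cov_a)
g_b = discriminant(x, 0.5, mu_b, cov_b)
print("class A" if g_a > g_b else "class B")
```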
Problem of Dimensionality

For dimension d, the number of parameters is O(d²).
It is easy to have 1000 attributes in some problems.
To avoid overfitting, if the number of parameters is m, we need several times m training examples to estimate the parameters.
We have to reduce the dimension, or use Naive Bayes instead.
Some dimension-reduction techniques — the Karhunen–Loève (KL) transform, Principal Component Analysis, etc.
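As an illustration of dimension reduction, here is a minimal PCA sketch via the eigendecomposition of the sample covariance matrix (synthetic random data; keep the top-k eigenvectors and project).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))               # 200 samples, 10 attributes (synthetic)

Xc = X - X.mean(axis=0)                      # center the data
cov = np.cov(Xc, rowvar=False)               # 10 x 10 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # symmetric matrix -> eigh; ascending order

k = 3
components = eigvecs[:, ::-1][:, :k]         # top-k principal directions
X_reduced = Xc @ components                  # project onto k dimensions

print(X_reduced.shape)                       # (200, 3)
```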

