
CSIS0270/COMP3270

12b. Statistical Learning – Bayes Classifier



Outline

Learning with Complete Data
Naive Bayes models
Maximum Likelihood Parameter Learning — Discrete Models
Maximum Likelihood Parameter Learning — Continuous Models
Bayes Classifier



Learning with Complete Data

Data and Hypothesis — construct a hypothesis from data.
Data act as evidence to support the hypothesis.
Assume a probability model whose structure is fixed.
We try to learn only the parameters of the model.
Supervised learning: the data consist of input (attribute) vectors x and labels y (used for learning).



Problem formulation

Feature extractors — extract features from the examples and form an input vector x.
Training data — a collection of < x, y > pairs.
Classifier — a parametric probabilistic model.
Training process — learn the parameters of the model.
Test data — data x unseen during training, used to test the performance of the classifier.
Applications: OCR, medical diagnosis, spam detection, etc.



Naive Bayes Model
The model is called naive because we assume the attributes are conditionally independent of each other given the class.
Assume the class label is C and the features are x_1, ..., x_n.
By Bayes' rule, we have the posterior probability
$$
P(C \mid x_1,\dots,x_n) = \frac{P(x_1,\dots,x_n \mid C)\,P(C)}{P(x_1,\dots,x_n)}
= \alpha\,P(x_1,\dots,x_n \mid C)\,P(C)
= \alpha\,P(C)\prod_i P(x_i \mid C)
$$
P(C) is the prior probability distribution.
P(x_i|C) is the class-conditional probability distribution.
Both can be estimated from training data.
If we only need the classification decision based on the posterior probability, we can apply any monotonically increasing function to it, since the class with the largest value is unchanged.
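To make the factorization concrete, here is a minimal sketch in Python; the priors and per-feature conditional probabilities below are made-up illustrative values, not part of the lecture material.

```python
# Hypothetical toy spam filter: priors and per-word conditional probabilities
# are assumed values, not estimates from real data.
priors = {"spam": 0.4, "ham": 0.6}
cond = {
    "spam": {"offer": 0.30, "meeting": 0.05},
    "ham":  {"offer": 0.02, "meeting": 0.20},
}

def posterior(features):
    """Return P(C | x1..xn) for every class, using the naive Bayes factorization."""
    scores = {c: priors[c] for c in priors}
    for c in priors:
        for f in features:
            scores[c] *= cond[c][f]          # P(C) * prod_i P(x_i | C)
    alpha = 1.0 / sum(scores.values())       # normalize so the posteriors sum to 1
    return {c: alpha * s for c, s in scores.items()}

print(posterior(["offer", "meeting"]))
```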
Maximum Likelihood Estimation — Discrete Models

The results of maximum likelihood estimation are usually well known.
Likelihood: the probability of observing the data, given the probabilistic model.
Find the model (parameter values) that maximizes this likelihood.
Procedure:
1 Write down the likelihood of the data as a function of the parameters.
2 In fact, any monotonically increasing function can be applied without changing the maximizer; we use the log likelihood, since it converts the product into a summation.
3 Find the parameters that maximize the (log) likelihood by taking the first derivative and setting it to zero.



Example: Learning the probability I

Given a bag containing only cherries and limes, find the probability of drawing a cherry.
Assume the probability of a cherry is θ. The corresponding hypothesis is h_θ.
If there are N fruits and c of them are cherries, then N − c are limes.
The likelihood of observing this data set d is
$$
P(\mathbf{d} \mid h_\theta) = \prod_{k=1}^{N} P(d_k \mid h_\theta) = \theta^{c}\,(1-\theta)^{N-c}
$$



Example: Learning the probability II
The log likelihood is
$$
L(\mathbf{d} \mid h_\theta) = \log P(\mathbf{d} \mid h_\theta) = \sum_{k=1}^{N} \log P(d_k \mid h_\theta)
= c\log\theta + (N-c)\log(1-\theta)
$$
Taking the first derivative and setting it to 0:
$$
\frac{\partial L(\mathbf{d} \mid h_\theta)}{\partial\theta} = \frac{c}{\theta} - \frac{N-c}{1-\theta} = 0
\;\Rightarrow\; \theta = \frac{c}{N}
$$
The probability can be obtained by counting and taking the ratio!
Similarly simple results can be obtained for the other conditional probabilities using the same procedure.
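As a quick sanity check of θ = c/N, the sketch below (with hypothetical counts N = 10, c = 7) compares the closed-form estimate with a coarse grid search over the log likelihood.

```python
import math

N, c = 10, 7                      # hypothetical counts: 7 cherries out of 10 fruits

def log_likelihood(theta):
    return c * math.log(theta) + (N - c) * math.log(1 - theta)

# Closed-form maximum-likelihood estimate derived above
theta_ml = c / N

# Grid search over (0, 1) should land on (roughly) the same value
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=log_likelihood)

print(theta_ml, theta_grid)       # 0.7 and 0.7
```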
Maximum Likelihood Estimation — Continuous Model I

Usually we assume a Normal distribution (Gaussian density function)
$$
P(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$$
Two parameters for each distribution, µ and σ.
The likelihood function for observing the set of attribute values {x_k, k = 1, ..., N} is
$$
P(\mathbf{x} \mid \mu, \sigma) = \prod_{k=1}^{N} P(x_k \mid \mu, \sigma)
= \prod_{k=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_k-\mu)^2}{2\sigma^2}}
$$



Maximum Likelihood Estimation — Continuous Model II
The log likelihood is
$$
L = \sum_{k=1}^{N} \log \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_k-\mu)^2}{2\sigma^2}}
= N\left(-\log\sqrt{2\pi} - \log\sigma\right) - \sum_{k=1}^{N} \frac{(x_k-\mu)^2}{2\sigma^2}
$$
Find the derivatives and set them to 0:
$$
\frac{\partial L}{\partial\mu} = \frac{1}{\sigma^2}\sum_{k=1}^{N}(x_k-\mu) = 0
\;\Rightarrow\; \mu = \frac{1}{N}\sum_{k=1}^{N} x_k
$$
$$
\frac{\partial L}{\partial\sigma} = -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{k=1}^{N}(x_k-\mu)^2 = 0
\;\Rightarrow\; \sigma = \sqrt{\frac{\sum_{k=1}^{N}(x_k-\mu)^2}{N}}
$$
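A minimal sketch of the resulting estimators: the ML mean is the sample average and the ML σ divides by N (not N − 1); the observations below are made up for illustration.

```python
import math

x = [4.8, 5.1, 5.5, 4.9, 5.3]     # hypothetical observations of one attribute

N = len(x)
mu = sum(x) / N                                            # ML estimate of the mean
sigma = math.sqrt(sum((xk - mu) ** 2 for xk in x) / N)     # ML estimate (divide by N)

print(mu, sigma)
```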
Maximum Likelihood Estimation — Continuous Model III

Again, the results are simple and well known.
There are 2 parameters for each attribute for each class.
For the Naive Bayes model, we need to estimate the prior probabilities (1 per class) and the class-conditional probabilities (2 parameters per attribute per class).
You can do the same for other distributions.



Discriminant Functions I

Classification by discriminant functions: for a c-class classifier, construct a set of c discriminant functions g_i(x) and assign x to class l if its discriminant is the largest, i.e.
$$
g_l(\mathbf{x}) > g_j(\mathbf{x}) \quad \text{for all } j \neq l
$$

We can use the posterior probability as a discriminant function.
Any monotonically increasing function of g_i(x) is also a valid discriminant function.
For the Naive Bayes classifier with Normal densities, where d is the number of attributes,
$$
g_j(\mathbf{x}) = \alpha\, P(j) \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}}
$$



Discriminant Functions II

Since α is a positive constant (independent of j), and log is a monotonically increasing function, we have
$$
g'_j(\mathbf{x}) = \log P(j) + \sum_{i=1}^{d}\left(-\log\sqrt{2\pi} - \log\sigma_i - \frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right)
$$

Removing the constant terms,
$$
g''_j(\mathbf{x}) = \log P(j) - \sum_{i=1}^{d}\left(\log\sigma_i + \frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right)
$$

The actual value of the posterior probability is not needed.
You can treat the negation of the discriminant function as a distance measure; the class with the smallest distance wins.
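Putting the pieces together, here is a sketch of evaluating the simplified discriminant g''_j(x) for a Gaussian Naive Bayes classifier; the per-class priors, means, and standard deviations are assumed to have been estimated already (the values here are purely illustrative).

```python
import math

# Hypothetical per-class parameters for d = 2 attributes
params = {
    "A": {"prior": 0.5, "mu": [1.0, 2.0], "sigma": [0.5, 1.0]},
    "B": {"prior": 0.5, "mu": [3.0, 1.0], "sigma": [0.8, 0.7]},
}

def discriminant(x, p):
    """g''_j(x) = log P(j) - sum_i [ log sigma_i + (x_i - mu_i)^2 / (2 sigma_i^2) ]"""
    g = math.log(p["prior"])
    for xi, mu, sigma in zip(x, p["mu"], p["sigma"]):
        g -= math.log(sigma) + (xi - mu) ** 2 / (2 * sigma ** 2)
    return g

x = [1.2, 1.8]
print(max(params, key=lambda c: discriminant(x, params[c])))   # class with largest g''
```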



Bayes Classifier
The assumption that attributes are independent of each other is too naive.
In general, if they are not independent, we have to use a multivariate distribution, e.g. the multivariate normal, and consider all attributes together (as a vector x):
 
$$
p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{t}\,\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)
$$
where µ is the mean vector and Σ is the covariance matrix.
You can use the same procedure to find the posterior probability; the corresponding discriminant function is
$$
g''_j(\mathbf{x}) = \log P(j) - \tfrac{1}{2}\log|\Sigma_j| - \tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_j)^{t}\,\Sigma_j^{-1}(\mathbf{x}-\boldsymbol{\mu}_j)
$$
The number of parameters for dimension d is:
prior probability P(j): 1
mean vector: d
covariance matrix (symmetric): d(d+1)/2
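A sketch of the multivariate discriminant using NumPy; the class priors, mean vectors, and covariance matrices are assumed given (the values below are illustrative, not from the slides).

```python
import numpy as np

def discriminant(x, prior, mu, cov):
    """g''_j(x) = log P(j) - 0.5 log|Sigma_j| - 0.5 (x - mu_j)^T Sigma_j^{-1} (x - mu_j)"""
    diff = x - mu
    sign, logdet = np.linalg.slogdet(cov)        # numerically safer than log(det(cov))
    maha = diff @ np.linalg.solve(cov, diff)     # (x - mu)^T Sigma^{-1} (x - mu)
    return np.log(prior) - 0.5 * logdet - 0.5 * maha

# Hypothetical 2-class, 2-dimensional example
mu_a, cov_a = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]])
mu_b, cov_b = np.array([2.0, 1.0]), np.array([[0.5, 0.0], [0.0, 0.5]])
x = np.array([1.0, 0.5])

g_a = discriminant(x, 0.5, mu_a, cov_a)
g_b = discriminant(x, 0.5, mu_b, cov_b)
print("class A" if g_a > g_b else "class B")
```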
Problem of Dimensionality

For dimension d, the number of parameters is O(d²).
It is easy to have 1000 attributes in some problems.
To avoid overfitting, if the number of parameters is m, we need several times m training examples to estimate the parameters.
We have to reduce the dimension, or use Naive Bayes instead.
Some dimension-reduction techniques — the Karhunen–Loève (KL) transform, Principal Component Analysis, etc.
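As an illustration of dimension reduction, here is a minimal PCA sketch via the eigendecomposition of the sample covariance matrix (synthetic random data; keep the top-k eigenvectors and project).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))               # 200 samples, 10 attributes (synthetic)

Xc = X - X.mean(axis=0)                      # center the data
cov = np.cov(Xc, rowvar=False)               # 10 x 10 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # symmetric matrix -> eigh; ascending order

k = 3
components = eigvecs[:, ::-1][:, :k]         # top-k principal directions
X_reduced = Xc @ components                  # project onto k dimensions

print(X_reduced.shape)                       # (200, 3)
```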

