SYDE 372 - Winter 2011
Introduction to Pattern Recognition
Probability Measures for Classification: Part II

Alexander Wong
Department of Systems Design Engineering
University of Waterloo
Outline

1. MAP Classifier for Normal Distributions
2. Performance of the Bayes Classifier
3. Error bounds
MAP Classifier for Normal Distributions

By far the most popular conditional class distribution model is the Gaussian distribution:

$p(x|A) = N(\mu_A, \sigma_A^2) = \frac{1}{\sqrt{2\pi}\,\sigma_A} \exp\left[-\frac{1}{2}\left(\frac{x-\mu_A}{\sigma_A}\right)^2\right]$   (1)

and $p(x|B) = N(\mu_B, \sigma_B^2)$.
For the two-class case where both distributions are Gaussian, the following MAP classifier can be defined:

$\frac{N(\mu_A, \sigma_A^2)}{N(\mu_B, \sigma_B^2)} \;\gtrless^{A}_{B}\; \frac{P(B)}{P(A)}$   (2)

$\frac{\exp\left[-\frac{1}{2}\left(\frac{x-\mu_A}{\sigma_A}\right)^2\right]}{\exp\left[-\frac{1}{2}\left(\frac{x-\mu_B}{\sigma_B}\right)^2\right]} \;\gtrless^{A}_{B}\; \frac{\sigma_A P(B)}{\sigma_B P(A)}$   (3)
In log-likelihood form:

$\frac{\exp\left[-\frac{1}{2}\left(\frac{x-\mu_A}{\sigma_A}\right)^2\right]}{\exp\left[-\frac{1}{2}\left(\frac{x-\mu_B}{\sigma_B}\right)^2\right]} \;\gtrless^{A}_{B}\; \frac{\sigma_A P(B)}{\sigma_B P(A)}$   (4)

$-\frac{1}{2}\left(\frac{x-\mu_A}{\sigma_A}\right)^2 + \frac{1}{2}\left(\frac{x-\mu_B}{\sigma_B}\right)^2 \;\gtrless^{A}_{B}\; \ln[\sigma_A P(B)] - \ln[\sigma_B P(A)]$   (5)

$\left(\frac{x-\mu_B}{\sigma_B}\right)^2 - \left(\frac{x-\mu_A}{\sigma_A}\right)^2 \;\gtrless^{A}_{B}\; 2\left[\ln[\sigma_A P(B)] - \ln[\sigma_B P(A)]\right]$   (6)
Giving us the final form:

$\left(\frac{x-\mu_B}{\sigma_B}\right)^2 - \left(\frac{x-\mu_A}{\sigma_A}\right)^2 \;\gtrless^{A}_{B}\; 2\left[\ln[\sigma_A P(B)] - \ln[\sigma_B P(A)]\right]$   (7)

Does this look familiar?
The decision boundary (threshold) for the MAP classifier where $p(x|A)$ and $p(x|B)$ are Gaussian distributions can be found by solving the following expression for x:

$\left(\frac{x-\mu_B}{\sigma_B}\right)^2 - \left(\frac{x-\mu_A}{\sigma_A}\right)^2 = 2\left[\ln[\sigma_A P(B)] - \ln[\sigma_B P(A)]\right]$   (8)

$x^2\left(\frac{1}{\sigma_B^2} - \frac{1}{\sigma_A^2}\right) - 2x\left(\frac{\mu_B}{\sigma_B^2} - \frac{\mu_A}{\sigma_A^2}\right) + \frac{\mu_B^2}{\sigma_B^2} - \frac{\mu_A^2}{\sigma_A^2} = 2\ln\left[\frac{\sigma_A P(B)}{\sigma_B P(A)}\right]$   (9)
For the case where $\sigma_A = \sigma_B = \sigma$ and $P(A) = P(B) = \frac{1}{2}$:

$x^2\left(\frac{1}{\sigma_B^2} - \frac{1}{\sigma_A^2}\right) - 2x\left(\frac{\mu_B}{\sigma_B^2} - \frac{\mu_A}{\sigma_A^2}\right) + \frac{\mu_B^2}{\sigma_B^2} - \frac{\mu_A^2}{\sigma_A^2} = 2\ln\left[\frac{\sigma_A P(B)}{\sigma_B P(A)}\right]$   (10)

$x^2(\sigma_A^2 - \sigma_B^2) - 2x(\mu_B \sigma_A^2 - \mu_A \sigma_B^2) + (\mu_B^2 \sigma_A^2 - \mu_A^2 \sigma_B^2) = 2\ln[1]$   (11)

Since $\ln(1) = 0$ and $\sigma_A = \sigma_B = \sigma$, the quadratic term vanishes and

$x = \frac{\mu_B^2 \sigma^2 - \mu_A^2 \sigma^2}{2(\mu_B \sigma^2 - \mu_A \sigma^2)}$   (12)

$x = \frac{\mu_B^2 - \mu_A^2}{2(\mu_B - \mu_A)}$   (13)
Since $a^2 - b^2 = (a - b)(a + b)$:

$x = \frac{(\mu_B - \mu_A)(\mu_B + \mu_A)}{2(\mu_B - \mu_A)}$   (14)

$x = \frac{\mu_B + \mu_A}{2}$   (15)

Therefore, for the case of equally likely, equi-variance classes, the MAP rule reduces to a threshold midway between the means.
For the case where $P(A) \neq P(B)$ and $\sigma_A \neq \sigma_B$, the threshold shifts and a second threshold appears as the second solution to the quadratic expression.
Example of a 1-D case:

Suppose that, given a pattern x, we wish to classify it as one of two classes: class A or class B. Suppose the two classes have patterns x which are normally distributed as follows:

$p(x|A) = N(\mu_A, \sigma_A^2) = \frac{1}{\sqrt{2\pi}\,\sigma_A} \exp\left[-\frac{1}{2}\left(\frac{x-\mu_A}{\sigma_A}\right)^2\right]$   (16)

$p(x|B) = N(\mu_B, \sigma_B^2) = \frac{1}{\sqrt{2\pi}\,\sigma_B} \exp\left[-\frac{1}{2}\left(\frac{x-\mu_B}{\sigma_B}\right)^2\right]$   (17)

$\mu_A = 130$, $\mu_B = 150$.
Question: If we know from previous cases that 4 patterns belong to class A and 6 patterns belong to class B, and both classes have the same standard deviation of 20, what is the MAP classifier?

For the two-class case where both distributions are Gaussian, the following MAP classifier can be defined:

$\frac{N(\mu_A, \sigma_A^2)}{N(\mu_B, \sigma_B^2)} \;\gtrless^{A}_{B}\; \frac{P(B)}{P(A)}$   (18)

$\frac{\exp\left[-\frac{1}{2}\left(\frac{x-\mu_A}{\sigma_A}\right)^2\right]}{\exp\left[-\frac{1}{2}\left(\frac{x-\mu_B}{\sigma_B}\right)^2\right]} \;\gtrless^{A}_{B}\; \frac{\sigma_A P(B)}{\sigma_B P(A)}$   (19)
Plugging in $\mu_A$, $\mu_B$, and $\sigma_A = \sigma_B = \sigma = 20$ (the equal $\sigma$'s cancel on the right):

$\frac{\exp\left[-\frac{1}{2}\left(\frac{x-130}{20}\right)^2\right]}{\exp\left[-\frac{1}{2}\left(\frac{x-150}{20}\right)^2\right]} \;\gtrless^{A}_{B}\; \frac{P(B)}{P(A)}$   (20)

Taking the log:

$-\frac{1}{2}\left(\frac{x-130}{20}\right)^2 + \frac{1}{2}\left(\frac{x-150}{20}\right)^2 \;\gtrless^{A}_{B}\; \ln[P(B)] - \ln[P(A)]$   (21)

Multiplying both sides by $2(20^2) = 800$:

$(x-150)^2 - (x-130)^2 \;\gtrless^{A}_{B}\; 800\left[\ln[P(B)] - \ln[P(A)]\right]$   (22)
The prior probabilities P(A) and P(B) can be determined as:

$P(A) = \frac{4}{6+4} = 0.4, \quad P(B) = \frac{6}{6+4} = 0.6$   (23)

Plugging in P(A) and P(B):

$(x-150)^2 - (x-130)^2 \;\gtrless^{A}_{B}\; 800\ln[0.6/0.4]$   (24)

$(x-150)^2 - (x-130)^2 \;\gtrless^{A}_{B}\; 800\ln[1.5]$   (25)
Expanding and simplifying:

$(x-150)^2 - (x-130)^2 \;\gtrless^{A}_{B}\; 800\ln[1.5]$   (26)

$(x^2 - 300x + 150^2) - (x^2 - 260x + 130^2) \;\gtrless^{A}_{B}\; 800\ln[1.5]$   (27)

$-40x \;\gtrless^{A}_{B}\; 800\ln[1.5] - 150^2 + 130^2$   (28)
Simplifying further:

$-40x \;\gtrless^{A}_{B}\; 800\ln[1.5] - 150^2 + 130^2$   (29)

Dividing both sides by $-40$ (which flips the direction of the inequality):

$x \;\gtrless^{B}_{A}\; \frac{5600 - 800\ln[1.5]}{40}$   (30)

$x \;\gtrless^{B}_{A}\; 131.9$   (31)
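To sanity-check this result, here is a minimal Python sketch (NumPy only) that evaluates the closed-form threshold for the equal-variance case; the parameter values are the ones from this example:

```python
import numpy as np

# Parameters from the example above
mu_A, mu_B, sigma = 130.0, 150.0, 20.0
P_A, P_B = 0.4, 0.6

# Equal-variance threshold from solving
# (x - mu_B)^2 - (x - mu_A)^2 = 2 sigma^2 ln(P(B)/P(A)) for x
x_star = (mu_B**2 - mu_A**2 - 2.0 * sigma**2 * np.log(P_B / P_A)) / (2.0 * (mu_B - mu_A))
print(f"threshold = {x_star:.1f}")  # -> 131.9; x above this is classified as B
```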
For the n-dimensional case, where $p(x|A) = N(\mu_A, \Sigma_A)$ and $p(x|B) = N(\mu_B, \Sigma_B)$:

$\frac{P(A)\exp\left[-\frac{1}{2}(x-\mu_A)^T \Sigma_A^{-1} (x-\mu_A)\right]}{(2\pi)^{n/2}\,|\Sigma_A|^{1/2}} \;\gtrless^{A}_{B}\; \frac{P(B)\exp\left[-\frac{1}{2}(x-\mu_B)^T \Sigma_B^{-1} (x-\mu_B)\right]}{(2\pi)^{n/2}\,|\Sigma_B|^{1/2}}$   (32)

$\frac{\exp\left[-\frac{1}{2}(x-\mu_A)^T \Sigma_A^{-1} (x-\mu_A)\right]}{\exp\left[-\frac{1}{2}(x-\mu_B)^T \Sigma_B^{-1} (x-\mu_B)\right]} \;\gtrless^{A}_{B}\; \frac{|\Sigma_A|^{1/2}\,P(B)}{|\Sigma_B|^{1/2}\,P(A)}$   (33)
Taking the log and simplifying:

$(x-\mu_B)^T \Sigma_B^{-1} (x-\mu_B) - (x-\mu_A)^T \Sigma_A^{-1} (x-\mu_A) \;\gtrless^{A}_{B}\; 2\ln\left[\frac{|\Sigma_A|^{1/2}\,P(B)}{|\Sigma_B|^{1/2}\,P(A)}\right]$   (34)

$(x-\mu_B)^T \Sigma_B^{-1} (x-\mu_B) - (x-\mu_A)^T \Sigma_A^{-1} (x-\mu_A) \;\gtrless^{A}_{B}\; 2\ln\left[\frac{P(B)}{P(A)}\right] + \ln\left[\frac{|\Sigma_A|}{|\Sigma_B|}\right]$   (35)

Looks familiar?
MAP Decision Boundaries for Normal Distributions

What are the MAP decision boundaries if our classes can be characterized by normal distributions?

$x^T Q_0 x + Q_1 x + Q_2 + 2Q_3 + Q_4 = 0$   (36)

where

$Q_0 = S_A^{-1} - S_B^{-1}$   (37)

$Q_1 = 2\left[m_B^T S_B^{-1} - m_A^T S_A^{-1}\right]$   (38)

$Q_2 = m_A^T S_A^{-1} m_A - m_B^T S_B^{-1} m_B$   (39)

$Q_3 = \ln\left[\frac{P(B)}{P(A)}\right]$   (40)

$Q_4 = \ln\left[\frac{|S_A|}{|S_B|}\right]$   (41)
MAP Classifier: Example

Suppose we are given the following statistical information about the classes:

Class A: $m_A = [0\;\; 0]^T$, $S_A = \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix}$, $P(A) = 0.6$.

Class B: $m_B = [0\;\; 0]^T$, $S_B = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $P(B) = 0.4$.

Suppose we wish to build a MAP classifier. Compute the decision boundary.
Step 1: Compute $S_A^{-1}$ and $S_B^{-1}$:

$S_A^{-1} = \begin{bmatrix} 1/4 & 0 \\ 0 & 1/4 \end{bmatrix}, \quad S_B^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$   (42)

Step 2: Compute $Q_0$, $Q_1$, $Q_2$, $Q_3$, $Q_4$:

$Q_0 = S_A^{-1} - S_B^{-1} = \begin{bmatrix} 1/4 & 0 \\ 0 & 1/4 \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} -3/4 & 0 \\ 0 & -3/4 \end{bmatrix}$   (43)

$Q_1 = 2\left[m_B^T S_B^{-1} - m_A^T S_A^{-1}\right] = 0$   (44)

$Q_2 = m_A^T S_A^{-1} m_A - m_B^T S_B^{-1} m_B = 0$   (45)
$Q_3 = \ln\left[\frac{P(B)}{P(A)}\right] = \ln\left[\frac{0.4}{0.6}\right] = \ln(2/3)$   (46)

$Q_4 = \ln\left[\frac{|S_A|}{|S_B|}\right] = \ln\left[\frac{(4)(4) - (0)(0)}{(1)(1) - (0)(0)}\right] = \ln(16)$   (47)

Step 3: Plugging in $Q_0$, $Q_1$, $Q_2$, $Q_3$, $Q_4$ gives us:

$x^T Q_0 x + Q_1 x + Q_2 + 2Q_3 + Q_4 = 0$   (48)

$x^T \begin{bmatrix} -3/4 & 0 \\ 0 & -3/4 \end{bmatrix} x + 2\ln(2/3) + \ln(16) = 0$   (49)
Simplifying gives us:

$\begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} -3/4 & 0 \\ 0 & -3/4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + 2\ln(2/3) + \ln(16) = 0$   (50)

$-\frac{3}{4}x_1^2 - \frac{3}{4}x_2^2 - 0.811 + 2.773 = 0$   (51)

$-\frac{3}{4}\left(x_1^2 + x_2^2\right) + 1.962 = 0$   (52)

The final MAP decision boundary is:

$x_1^2 + x_2^2 = 2.62$   (53)

This is just a circle centered at $(x_1, x_2) = (0, 0)$ with a radius of $\sqrt{2.62} \approx 1.62$.
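A quick numerical check of this example (a minimal Python/NumPy sketch that just re-evaluates the Q terms defined above):

```python
import numpy as np

# Class statistics from the example (both class means are zero)
S_A, S_B = 4.0 * np.eye(2), 1.0 * np.eye(2)
P_A, P_B = 0.6, 0.4

Q0 = np.linalg.inv(S_A) - np.linalg.inv(S_B)            # diag(-3/4)
Q3 = np.log(P_B / P_A)                                  # ln(2/3)
Q4 = np.log(np.linalg.det(S_A) / np.linalg.det(S_B))    # ln(16)

# With Q0 = -(3/4) I and Q1 = Q2 = 0, the boundary x^T Q0 x + 2 Q3 + Q4 = 0
# reduces to (3/4) r^2 = 2 Q3 + Q4, a circle of radius r about the origin.
r = np.sqrt((2.0 * Q3 + Q4) / 0.75)
print(f"radius = {r:.2f}")  # -> 1.62
```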
Relationship between MICD and MAP Classifiers for Normal Distributions

You will notice that the terms on the left-hand side have the same form as the MICD distance metric!

$(x-\mu_B)^T \Sigma_B^{-1} (x-\mu_B) - (x-\mu_A)^T \Sigma_A^{-1} (x-\mu_A) \;\gtrless^{A}_{B}\; 2\ln\left[\frac{P(B)}{P(A)}\right] + \ln\left[\frac{|\Sigma_A|}{|\Sigma_B|}\right]$   (54)

$d^2_{MICD}(x, \mu_B, \Sigma_B) - d^2_{MICD}(x, \mu_A, \Sigma_A) \;\gtrless^{A}_{B}\; 2\ln\left[\frac{P(B)}{P(A)}\right] + \ln\left[\frac{|\Sigma_A|}{|\Sigma_B|}\right]$   (55)

If $2\ln\left[\frac{P(B)}{P(A)}\right] + \ln\left[\frac{|\Sigma_A|}{|\Sigma_B|}\right] = 0$, then the MAP classifier becomes just the MICD classifier!
Therefore, the MICD classifier is optimal in terms of probability of error only if we have multivariate Normal distributions $N(\mu, \Sigma)$ that have:

Equal a priori probabilities ($P(A) = P(B)$)
Equal volumes ($|\Sigma_A| = |\Sigma_B|$)

If that is not the case, what's so special about $2\ln\left[\frac{P(B)}{P(A)}\right] + \ln\left[\frac{|\Sigma_A|}{|\Sigma_B|}\right]$?

The first term, $2\ln\left[\frac{P(B)}{P(A)}\right]$, biases the decision in favor of the more likely class according to the a priori probabilities.
The second term, $\ln\left[\frac{|\Sigma_A|}{|\Sigma_B|}\right]$, biases the decision in favor of the class with smaller volume ($|\Sigma|$).
So under what circumstances does the MAP classifier perform better than MICD?

Recall the case where we have only one feature (n = 1), equal means (m = 0), and $s_A \neq s_B$.

The MICD classification rule for this case is:

$\left(1/s_B^2 - 1/s_A^2\right)x^2 \;\gtrless^{A}_{B}\; 0$   (56)

$\left(1/s_A^2\right)x^2 \;\gtrless^{B}_{A}\; \left(1/s_B^2\right)x^2$   (57)

$s_A^2 \;\gtrless^{A}_{B}\; s_B^2$   (58)

The MICD classification rule decides in favor of the class with the largest variance, regardless of x.
The MAP classification rule for this case is:

$\left(1/s_B^2 - 1/s_A^2\right)x^2 \;\gtrless^{A}_{B}\; 2\ln\left[\frac{P(B)}{P(A)}\right] + \ln\left[\frac{s_A^2}{s_B^2}\right]$   (59)

If P(A) = P(B):

$\left(1/s_B^2 - 1/s_A^2\right)x^2 \;\gtrless^{A}_{B}\; \ln\left[\frac{s_A^2}{s_B^2}\right]$   (60)
Looking at the MAP classification rule:

$\left(1/s_B^2 - 1/s_A^2\right)x^2 \;\gtrless^{A}_{B}\; \ln\left[\frac{s_A^2}{s_B^2}\right]$   (61)

At the mean m = 0:

$0 \;\gtrless^{A}_{B}\; \ln\left[\frac{s_A^2}{s_B^2}\right]$   (62)

If $s_A^2 < s_B^2$, the log term is negative and the rule favors class A.
If $s_B^2 < s_A^2$, the log term is positive and the rule favors class B.

Therefore, the MAP classification rule decides in favor of the class with the lowest variance close to the mean, and favors the class with the highest variance beyond a certain point in both directions.
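This behavior is easy to see numerically. A minimal sketch of rule (60) (the variances below are arbitrary illustrative values with $s_A < s_B$):

```python
import numpy as np

# Equal means (m = 0), unequal variances, equal priors: only the
# ln(s_A^2 / s_B^2) volume term separates MAP from MICD (eq. (60)).
s_A, s_B = 1.0, 2.0

def map_choice(x):
    # Decide A when (1/s_B^2 - 1/s_A^2) x^2 > ln(s_A^2 / s_B^2)
    lhs = (1.0 / s_B**2 - 1.0 / s_A**2) * x**2
    return "A" if lhs > np.log(s_A**2 / s_B**2) else "B"

print(map_choice(0.0))  # 'A': near the mean, the lower-variance class wins
print(map_choice(5.0))  # 'B': far from the mean, the higher-variance class wins
```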
Performance of the Bayes Classifier

How do we quantify how well the Bayes classifier works?

Since the Bayes classifier minimizes the probability of error, one way to analyze how well it does is to compute the probability of error $P(\varepsilon)$ itself. This allows us to see the theoretical limit on the expected performance, under the assumption of known probability density functions.
Probability of error given pattern

For any pattern x such that $P(A|x) > P(B|x)$:

x is classified as part of class A.
The probability of error in classifying x as A is $P(B|x)$.

Therefore, naturally, for any given x the probability of error $P(\varepsilon|x)$ is:

$P(\varepsilon|x) = \min\left[P(A|x),\; P(B|x)\right]$   (63)

Rationale: Since we always choose the class with the maximum posterior probability, the minimum posterior probability is the probability of choosing incorrectly.
Recall our previous example of a 1-D case:

$p(x|A) = N(\mu_A, \sigma_A^2) = \frac{1}{\sqrt{2\pi}\,\sigma_A} \exp\left[-\frac{1}{2}\left(\frac{x-\mu_A}{\sigma_A}\right)^2\right]$   (64)

$p(x|B) = N(\mu_B, \sigma_B^2) = \frac{1}{\sqrt{2\pi}\,\sigma_B} \exp\left[-\frac{1}{2}\left(\frac{x-\mu_B}{\sigma_B}\right)^2\right]$   (65)

$\mu_A = 130$, $\mu_B = 150$, $P(A) = 0.4$, $P(B) = 0.6$, $\sigma_A = \sigma_B = 20$.

For x = 140, what is the probability of error $P(\varepsilon|x)$?
Recall the MAP classifier for this scenario:

$x \;\gtrless^{B}_{A}\; 131.9$   (66)

Based on this MAP classifier, the pattern x = 140 belongs to class B.

The probability of error $P(\varepsilon|x)$ is:

$P(\varepsilon|x) = \min\left[P(A|x),\; P(B|x)\right]$   (67)

Since B gives the maximum posterior probability, the minimum is $P(A|x)$.
Therefore, $P(\varepsilon|x)$ for x = 140 is:

$P(\varepsilon|x)|_{x=140} = P(A|x)|_{x=140} = \left.\frac{p(x|A)P(A)}{p(x|A)P(A) + p(x|B)P(B)}\right|_{x=140}$   (68)

Since x = 140 is equidistant from the two means and the variances are equal, $p(140|A) = p(140|B) \approx 0.0176$ and the likelihoods cancel:

$P(\varepsilon|x)|_{x=140} = \frac{(0.0176)(0.4)}{(0.0176)(0.4) + (0.0176)(0.6)}$   (69)

$P(\varepsilon|x)|_{x=140} = 0.4$   (70)
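The same computation in a short Python/SciPy sketch (parameters taken from the example):

```python
import numpy as np
from scipy.stats import norm

mu_A, mu_B, sigma = 130.0, 150.0, 20.0
P_A, P_B = 0.4, 0.6
x = 140.0

# Posteriors via Bayes' rule; the probability of error is the smaller posterior.
like_A, like_B = norm.pdf(x, mu_A, sigma), norm.pdf(x, mu_B, sigma)
post_A = like_A * P_A / (like_A * P_A + like_B * P_B)
print(min(post_A, 1.0 - post_A))  # -> 0.4
```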
Expected probability of error

Now that we know the probability of error for a given x, denoted $P(\varepsilon|x)$, the expected probability of error $P(\varepsilon)$ can be found as:

$P(\varepsilon) = \int P(\varepsilon|x)\,p(x)\,dx$   (71)

$P(\varepsilon) = \int \min\left[P(A|x),\; P(B|x)\right] p(x)\,dx$   (72)

In terms of class PDFs:

$P(\varepsilon) = \int \min\left[p(x|A)P(A),\; p(x|B)P(B)\right] dx$   (73)
Now if we were to define decision regions $R_A$ and $R_B$:

$R_A = \{x : P(A|x) > P(B|x)\}$
$R_B = \{x : P(B|x) > P(A|x)\}$

the expected probability of error can be written as:

$P(\varepsilon) = \int_{R_A} p(x|B)P(B)\,dx + \int_{R_B} p(x|A)P(A)\,dx$   (74)

Rationale: For all patterns in $R_A$, the posterior of A is the maximum of the two, so the probability of error for patterns in $R_A$ is just the minimum term (in this case, the one involving B), and vice versa.
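Equations (73)-(74) are straightforward to evaluate numerically when the densities are known. A sketch using the running 1-D example's parameters:

```python
import numpy as np
from scipy.stats import norm

mu_A, mu_B, sigma = 130.0, 150.0, 20.0
P_A, P_B = 0.4, 0.6

# Integrate min[p(x|A)P(A), p(x|B)P(B)] on a fine grid (eq. (73))
x = np.linspace(0.0, 300.0, 100001)
integrand = np.minimum(norm.pdf(x, mu_A, sigma) * P_A,
                       norm.pdf(x, mu_B, sigma) * P_B)
print(np.trapz(integrand, x))  # expected probability of error, ~0.29
```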
Example 1: univariate Normal, equal-variance, equally likely two-class problem:

$n = 1$, $P(A) = P(B) = 0.5$, $\sigma_A = \sigma_B = \sigma$, $\mu_A < \mu_B$

Likelihoods:

$p(x|A) = N(\mu_A, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu_A}{\sigma}\right)^2\right]$   (75)

$p(x|B) = N(\mu_B, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu_B}{\sigma}\right)^2\right]$   (76)

Find $P(\varepsilon)$.
Recall that for the case of equally likely, equi-variance classes, the MAP decision boundary reduces to a threshold midway between the means:

$x = \frac{\mu_B + \mu_A}{2}$   (77)

Since $\mu_A < \mu_B$, this gives us the following decision regions $R_A$ and $R_B$:

$R_A = \left\{x : x < \frac{\mu_A + \mu_B}{2}\right\}$
$R_B = \left\{x : x > \frac{\mu_A + \mu_B}{2}\right\}$
Based on the decision regions $R_A$ and $R_B$, the priors, and the class likelihoods, the expected probability of error $P(\varepsilon)$ becomes:

$P(\varepsilon) = \int_{R_A} P(B)\,p(x|B)\,dx + \int_{R_B} P(A)\,p(x|A)\,dx$   (78)

$P(\varepsilon) = \frac{1}{2}\int_{-\infty}^{\frac{\mu_A+\mu_B}{2}} p(x|B)\,dx + \frac{1}{2}\int_{\frac{\mu_A+\mu_B}{2}}^{\infty} p(x|A)\,dx$   (79)

$P(\varepsilon) = \frac{1}{2}\int_{-\infty}^{\frac{\mu_A+\mu_B}{2}} N(\mu_B, \sigma^2)\,dx + \frac{1}{2}\int_{\frac{\mu_A+\mu_B}{2}}^{\infty} N(\mu_A, \sigma^2)\,dx$   (80)
Since the two classes are symmetric ($P(\varepsilon|A) = P(\varepsilon|B)$):

$P(\varepsilon) = \frac{1}{2}\int_{-\infty}^{\frac{\mu_A+\mu_B}{2}} N(\mu_B, \sigma^2)\,dx + \frac{1}{2}\int_{\frac{\mu_A+\mu_B}{2}}^{\infty} N(\mu_A, \sigma^2)\,dx$   (81)

$P(\varepsilon) = \int_{\frac{\mu_A+\mu_B}{2}}^{\infty} N(\mu_A, \sigma^2)\,dx$   (82)

$P(\varepsilon) = \int_{\frac{\mu_A+\mu_B}{2}}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu_A}{\sigma}\right)^2\right] dx$   (83)
Doing a change of variables, where $y = \frac{x-\mu_A}{\sigma}$, $dx = \sigma\,dy$:

$P(\varepsilon) = \int_{\frac{\mu_B-\mu_A}{2\sigma}}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}y^2\right] dy$   (84)

This corresponds to a tail integral over a normalized ($N(0, 1)$) Normal random variable:

$Q(\theta) = \int_{\theta}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}y^2\right] dy$   (85)

Plugging Q in gives us the final expected probability of error $P(\varepsilon)$:

$P(\varepsilon) = Q\left(\frac{\mu_B - \mu_A}{2\sigma}\right)$   (86)
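In SciPy, $Q(\theta)$ is the standard normal survival function, so (86) is a one-liner (the means and variance below are illustrative values):

```python
from scipy.stats import norm

mu_A, mu_B, sigma = 0.0, 2.0, 1.0
P_err = norm.sf((mu_B - mu_A) / (2.0 * sigma))  # Q((mu_B - mu_A) / (2 sigma))
print(P_err)  # ~0.159: the means sit one sigma either side of the threshold
```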
Visualization of $P(\varepsilon)$: $P(\varepsilon)$ is essentially the shaded area.
Observations:

As the distance between the means increases, the shaded area becomes monotonically smaller and the expected probability of error $P(\varepsilon)$ monotonically decreases.
At $\mu_B - \mu_A = 0$, the distributions completely overlap and $P(\varepsilon) = Q(0) = 1/2$ (makes sense, since you have a 50/50 chance of either class).
$\lim_{(\mu_B - \mu_A) \to \infty} P(\varepsilon) = 0$.
For cases where $P(A) \neq P(B)$ or $\sigma_A \neq \sigma_B$, the decision boundary changes AND an additional boundary is introduced!

Luckily, $P(\varepsilon)$ can still be expressed using the $Q(\theta)$ function with an appropriate change of variables.
Example:

$P(\varepsilon)$ is essentially the shaded area.

$P(\varepsilon) = P(A)Q(\theta_1) + P(B)\left[Q(\theta_3) - Q(\theta_4)\right] + P(A)Q(\theta_2)$
Let's take a look at the multivariate case (n > 1).

For $p(x|A) = N(\mu_A, \Sigma)$, $p(x|B) = N(\mu_B, \Sigma)$, $P(A) = P(B)$, it can be shown that:

$P(\varepsilon) = Q\left(d_M(\mu_A, \mu_B)/2\right)$   (87)

where $d_M(\mu_A, \mu_B)$ is the Mahalanobis distance between the classes:

$d_M(\mu_A, \mu_B) = \left[(\mu_A - \mu_B)^T \Sigma^{-1} (\mu_A - \mu_B)\right]^{1/2}$   (88)
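A minimal sketch of (87)-(88) (the means and the shared covariance below are made-up illustrative values):

```python
import numpy as np
from scipy.stats import norm

mu_A = np.array([0.0, 0.0])
mu_B = np.array([3.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])  # shared covariance

diff = mu_A - mu_B
d_M = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)  # Mahalanobis distance (88)
print(norm.sf(d_M / 2.0))                          # P(error) = Q(d_M / 2)
```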
Why is $P(\varepsilon)$ like that for this case?

Remember that for all cases where the covariance matrices AND the prior probabilities are the same, the decision boundary between the classes is always a straight line in hyperspace that is:

sloped based on $\Sigma$ (since our orthonormal whitening transform is identical for both classes), and
intersects the midpoint of the line segment between $\mu_A$ and $\mu_B$.

The probability of error is just the area under $p(x|A)P(A)$ on the class-B side of this decision boundary PLUS the area under $p(x|B)P(B)$ on the class-A side of this decision boundary.
Example of non-Gaussian density functions:

Suppose two classes have density functions and a priori probabilities:

$p(x|C_1) = \begin{cases} c\,e^{-\theta x} & 0 \leq x \leq 1 \\ 0 & \text{else} \end{cases}$   (89)

$p(x|C_2) = \begin{cases} c\,e^{-\theta(1-x)} & 0 \leq x \leq 1 \\ 0 & \text{else} \end{cases}$   (90)

$P(C_1) = P(C_2) = \frac{1}{2}$   (91)

where $c = \frac{\theta}{1 - e^{-\theta}}$ is just the appropriate constant to normalize the PDF.
Therefore, the expected probability of error is:

$P(\varepsilon) = \int \min\left[p(x|C_1)P(C_1),\; p(x|C_2)P(C_2)\right] dx$   (92)

$P(\varepsilon) = \int_{R_{C_1}} p(x|C_2)P(C_2)\,dx + \int_{R_{C_2}} p(x|C_1)P(C_1)\,dx$   (93)

$P(\varepsilon) = \int_{0}^{0.5} 0.5\,p(x|C_2)\,dx + \int_{0.5}^{1.0} 0.5\,p(x|C_1)\,dx$   (94)

Because of the symmetry between the two classes ($P(\varepsilon|C_1) = P(\varepsilon|C_2)$):

$P(\varepsilon) = \int_{0.5}^{1.0} c\,e^{-\theta x}\,dx$   (95)

$P(\varepsilon) = \frac{c}{\theta}\left[e^{-\theta/2} - e^{-\theta}\right]$   (96)
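The closed form in (96) can be checked against direct numerical integration of (92); a Python/SciPy sketch, with θ = 2 as an arbitrary illustrative value:

```python
import numpy as np
from scipy.integrate import quad

theta = 2.0
c = theta / (1.0 - np.exp(-theta))  # normalizing constant

# Closed form, eq. (96)
closed = (c / theta) * (np.exp(-theta / 2.0) - np.exp(-theta))

# Direct numerical integration of min[0.5 p(x|C1), 0.5 p(x|C2)] over [0, 1];
# points=[0.5] warns quad about the kink at the decision boundary.
integrand = lambda x: 0.5 * min(c * np.exp(-theta * x),
                                c * np.exp(-theta * (1.0 - x)))
numeric, _ = quad(integrand, 0.0, 1.0, points=[0.5])
print(closed, numeric)  # both ~0.269
```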
(b) Find $P(\varepsilon|x)$:

From the decision boundary and decision regions we determined in (a):

$P(\varepsilon|x) = \begin{cases} P(C_2|x) & 0 \leq x \leq 1/2 \\ P(C_1|x) & 1/2 \leq x \leq 1 \end{cases}$   (97)

$P(\varepsilon|x) = \begin{cases} \frac{p(x|C_2)P(C_2)}{p(x)} & 0 \leq x \leq 1/2 \\ \frac{p(x|C_1)P(C_1)}{p(x)} & 1/2 \leq x \leq 1 \end{cases}$   (98)

$P(\varepsilon|x) = \begin{cases} \frac{e^{-\theta(1-x)}}{e^{-\theta x} + e^{-\theta(1-x)}} & 0 \leq x \leq 1/2 \\ \frac{e^{-\theta x}}{e^{-\theta x} + e^{-\theta(1-x)}} & 1/2 \leq x \leq 1 \end{cases}$   (99)
Error bounds

In practice, the exact $P(\varepsilon)$ is only easy to compute for simple cases like those shown before. So how can we quantify the probability of error in the remaining cases?

Instead of finding the exact $P(\varepsilon)$, we determine bounds on $P(\varepsilon)$, which are:

Easier to compute
Lead to estimates of classifier performance
Bhattacharyya bound

Using the following inequality:

$\min[a, b] \leq \sqrt{ab}$   (100)

the following holds true:

$P(\varepsilon) = \int \min\left[p(x|A)P(A),\; p(x|B)P(B)\right] dx$   (101)

$P(\varepsilon) \leq \sqrt{P(A)P(B)} \int \sqrt{p(x|A)\,p(x|B)}\,dx$   (102)

What's so special about this? Answer: You don't need the actual decision regions to compute it!
Since $P(A) + P(B) = 1$ implies $\sqrt{P(A)P(B)} \leq \frac{1}{2}$, and the Bhattacharyya coefficient is defined as:

$\rho = \int \sqrt{p(x|A)\,p(x|B)}\,dx$   (103)

the upper bound (Bhattacharyya bound) on $P(\varepsilon)$ can be written as:

$P(\varepsilon) \leq \frac{1}{2}\rho$   (104)
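The coefficient ρ is a single integral of known densities, so it is easy to evaluate numerically even though the decision regions never appear. A sketch for the earlier 1-D example's Gaussians:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu_A, mu_B, sigma = 130.0, 150.0, 20.0
P_A, P_B = 0.4, 0.6

# rho = integral of sqrt(p(x|A) p(x|B)), eq. (103)
rho, _ = quad(lambda x: np.sqrt(norm.pdf(x, mu_A, sigma) * norm.pdf(x, mu_B, sigma)),
              -np.inf, np.inf)
print(np.sqrt(P_A * P_B) * rho)  # bound ~0.43 (the true P(error) is ~0.29)
```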
Bhattacharyya bound: Example

Consider a classifier for a two-class problem. Both classes are multivariate normal. When both classes are a priori equally likely, the Bhattacharyya bound is $P(\varepsilon) \leq 0.3$.

New information is specified: we are told that the a priori probabilities of the two classes are 0.2 and 0.8, for A and B respectively.

What is the new upper bound on the probability of error?
Step 1: Based on the old bound, compute the Bhattacharyya coefficient:

$0.3 = \sqrt{P(A)P(B)} \int \sqrt{p(x|A)\,p(x|B)}\,dx$   (105)

$0.3 = \sqrt{P(A)P(B)}\;\rho$   (106)

$\rho = \int \sqrt{p(x|A)\,p(x|B)}\,dx = \frac{0.3}{\sqrt{0.5 \times 0.5}} = 0.6$   (107)
Step 2: Based on the Bhattacharyya coefficient and the new priors $P(A) = 0.2$ and $P(B) = 0.8$, the new upper bound can be computed as:

$P(\varepsilon) \leq \sqrt{P(A)P(B)} \int \sqrt{p(x|A)\,p(x|B)}\,dx$   (108)

$P(\varepsilon) \leq \sqrt{0.8 \times 0.2}\;\rho$   (109)

$P(\varepsilon) \leq \sqrt{0.8 \times 0.2} \times 0.6 = 0.24$   (110)
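The same two steps in Python, as a final arithmetic check:

```python
import numpy as np

rho = 0.3 / np.sqrt(0.5 * 0.5)        # Step 1: recover rho = 0.6
new_bound = np.sqrt(0.2 * 0.8) * rho  # Step 2: re-apply with the new priors
print(new_bound)                      # -> 0.24
```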