

p 646 Pattern Recognition

g and

Prof. Hong Man

Department of Electrical and

Computer Engineering
Stevens Institute of Technology
Non-Parametric Classification

Chapter 4 (Sections 4.4-4.6):

Nearest Neighbor Estimation

The Nearest-Neighbor Rule

kn-Nearest Neighbor Estimation

A solution to the problem of the unknown best window function. To estimate p(x) from n training samples or prototypes:
Let the cell volume be a function of the number of training data
Center a cell about x and let it grow until it captures $k_n$ samples ($k_n = f(n)$)
The included samples are called the $k_n$ nearest neighbors of x
The density is given as
$$p_n(x) = \frac{k_n / n}{V_n} \qquad (30)$$
kn-Nearest Neighbor Estimation

Two possibilities can occur:

If the density is high near x, the cell will be small, which provides good resolution
If the density is low around x, the cell will grow large and stop only when higher-density regions are reached
kn-Nearest Neighbor Estimation

It can be proven that
$$\lim_{n \to \infty} k_n = \infty \quad \text{and} \quad \lim_{n \to \infty} k_n / n = 0$$
are necessary and sufficient for $p_n(x)$ to converge to p(x) at all points where p(x) is continuous
If $k_n = \sqrt{n}$ and $p_n(x)$ is a good estimate of p(x), i.e. $p_1(x) = p_n(x) = p(x)$, then from (30) we have
$$V_n = \frac{1}{\sqrt{n}\, p(x)}, \quad \text{and then} \quad V_n = \frac{V_1}{\sqrt{n}}$$
This becomes similar to the Parzen-window approach, except that $V_1$ is not determined arbitrarily.
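The estimate in (30) can be sketched in one dimension as follows. This is a minimal illustration rather than a reference implementation: the function name `knn_density_1d` and the sample values are assumptions made for the example, and the cell is taken to be the interval centered at x whose half-width is the distance to the k-th nearest sample.

```python
def knn_density_1d(x, samples, k):
    """k_n-nearest-neighbor density estimate p_n(x) = (k / n) / V_n in 1D."""
    n = len(samples)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Distances from x to every sample, in ascending order.
    distances = sorted(abs(x - s) for s in samples)
    radius = distances[k - 1]      # the cell grows until it holds k samples
    volume = 2.0 * radius          # in 1D the cell is the interval [x - r, x + r]
    if volume == 0.0:
        return float("inf")        # x coincides with its k-th nearest sample
    return (k / n) / volume


samples = [0.0, 1.0, 2.0, 3.0]     # hypothetical training data
print(knn_density_1d(0.5, samples, k=2))   # (2 / 4) / (2 * 0.5) = 0.5
```

With k chosen as roughly the square root of n, the cell shrinks where samples are dense and widens where they are sparse, which is the behavior described above.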
kn-Nearest Neighbor Estimation

Peaks are at the middle of regions with k prototypes

kn-Nearest Neighbor Estimation

$k_n$ nearest-neighbor illustration
For $k_n = \sqrt{n}$, when n = 1 we have $k_1 = 1$, and the estimate is
$$p_n(x) = \frac{1}{2\,|x - x_1|}$$
where $x_1$ is the first training sample, |·| is a distance measure, and $2|x - x_1|$ is the volume. This is not a good estimate (Figure 4.12)
As n increases, the estimate gets better.
This method will not generate zero p(x) for any x. (If a fixed window, e.g. a Parzen window, is used and no sample falls inside this window, the density estimate for this window will be zero. This will not happen here.)
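The contrast with a fixed window can be checked numerically. The sketch below is illustrative only: both function names, the rectangular window, and the sample values are assumptions chosen for the example. It evaluates a fixed-width count estimate and a k-nearest-neighbor estimate at a point far from all samples.

```python
def fixed_window_density_1d(x, samples, width):
    """Count-based estimate with a fixed interval of the given width centered at x."""
    n = len(samples)
    k = sum(1 for s in samples if abs(x - s) <= width / 2.0)
    return (k / n) / width


def knn_density_1d(x, samples, k):
    """Estimate (k / n) / V where V is twice the distance to the k-th nearest sample."""
    distances = sorted(abs(x - s) for s in samples)
    return (k / len(samples)) / (2.0 * distances[k - 1])


samples = [0.0, 1.0, 2.0, 3.0]    # hypothetical training data
far_point = 10.0                  # well outside the range of the samples

print(fixed_window_density_1d(far_point, samples, width=1.0))  # no sample in the window
print(knn_density_1d(far_point, samples, k=2))                 # small but nonzero
```

The fixed window captures no samples at the far point and returns zero, while the growing cell always captures k samples and so returns a small positive value.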
kn-Nearest Neighbor Estimation

We can obtain a family of estimates by setting $k_n = k_1 \sqrt{n}$ and adjusting the value of $k_1$.
As with the Parzen-window method, the choice of $k_1$ is case dependent.
Usually $k_1$ is selected so that, when the estimated density is used to classify new test samples from the same density, it yields the lowest error rate.
kn-Nearest Neighbor Estimation

Estimation of a posteriori probabilities. We can estimate $P(\omega_i | x)$ from a set of n labeled samples using the window:
We place a cell of volume V around x and capture k samples
If $k_i$ samples among these k samples turn out to be labeled $\omega_i$, then the joint probability $p(x, \omega_i)$ can be estimated as
$$p_n(x, \omega_i) = \frac{k_i / n}{V}$$
kn-Nearest Neighbor Estimation

Then a reasonable estimate of the a posteriori probability is
$$P_n(\omega_i | x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \frac{k_i}{k}$$
$k_i / k$ is the fraction of the samples within the cell that are labeled $\omega_i$, i.e. there are k samples in the cell and $k_i$ out of k are labeled $\omega_i$
For minimum error rate, the most frequently represented category within the cell is selected for this cell, and any test sample that lies in this cell is labeled as this category.
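The ratio k_i / k can be computed directly from the labels of the k nearest samples. The following is a small sketch under stated assumptions: the function name `knn_posteriors`, the Euclidean distance, and the toy data are all choices made for illustration, not part of the slides.

```python
from collections import Counter
import math


def knn_posteriors(x, prototypes, labels, k):
    """Estimate P_n(w_i | x) = k_i / k from the k nearest labeled samples."""
    # Indices of all prototypes, ordered by Euclidean distance to x.
    order = sorted(range(len(prototypes)), key=lambda j: math.dist(x, prototypes[j]))
    counts = Counter(labels[j] for j in order[:k])
    return {label: count / k for label, count in counts.items()}


# Hypothetical one-dimensional samples with two class labels.
prototypes = [(0.0,), (0.2,), (0.4,), (1.0,), (1.2,)]
labels = ["w1", "w1", "w2", "w2", "w2"]

posteriors = knn_posteriors((0.3,), prototypes, labels, k=3)
print(posteriors)                                # fractions k_i / k per label
print(max(posteriors, key=posteriors.get))       # most frequent label in the cell
```

Picking the label with the largest fraction is the minimum-error-rate choice described above, applied to the estimated posteriors.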
The Nearest Neighbor Rule

Let $D_n = \{x_1, x_2, \ldots, x_n\}$ be a set of n labeled prototypes

Let $x' \in D_n$ be the closest prototype to a test point x; then the nearest-neighbor rule for classifying x is to assign it the label associated with x'
The nearest-neighbor rule leads to an error rate greater than the minimum possible, the Bayes rate
If the number of prototypes is large (unlimited), the error rate of the nearest-neighbor classifier is never worse than twice the Bayes rate
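The rule itself takes only a few lines. The sketch below assumes Euclidean distance and uses a hypothetical function name and made-up prototypes; it is meant to show the mechanics, not an efficient search.

```python
import math


def nearest_neighbor_label(x, prototypes, labels):
    """Assign x the label of its closest prototype (Euclidean distance assumed)."""
    best = min(range(len(prototypes)), key=lambda j: math.dist(x, prototypes[j]))
    return labels[best]


# Hypothetical labeled prototypes in two dimensions.
prototypes = [(0.0, 0.0), (1.0, 1.0), (3.0, 3.0)]
labels = ["w1", "w1", "w2"]

print(nearest_neighbor_label((0.4, 0.2), prototypes, labels))   # closest to (0, 0)
print(nearest_neighbor_label((2.4, 2.4), prototypes, labels))   # closest to (3, 3)
```

This linear scan costs one distance computation per prototype for every query, which is acceptable for small prototype sets.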
The Nearest Neighbor Rule

The label $\omega'$ associated with the nearest neighbor x' is a random variable, and the probability that $\omega' = \omega_i$ is the a posteriori probability $P(\omega_i | x')$.
If $n \to \infty$, it is always possible to find x' sufficiently close to x so that $P(\omega_i | x') \approx P(\omega_i | x)$
We define $\omega_m(x)$ such that $P(\omega_m | x) = \max_i P(\omega_i | x)$. Then the Bayes rule always selects $\omega_m$ for x.
The Nearest Neighbor Rule

This rule essentially partitions the feature space into cells, each cell containing a prototype x' and all points closer to it than to any other prototype. All points in a cell are labeled by the category of this x'. This partition is called the Voronoi tessellation of the space.
In each cell,
If $P(\omega_m | x) \approx 1$, then the nearest-neighbor selection is almost always the same as the Bayes selection
If $P(\omega_m | x) \approx P(\omega_i | x) \approx 1/c$, then the nearest-neighbor selection is rarely the same as the Bayes selection, but their error rates are similar (i.e. both amount to random guessing)
The Nearest Neighbor Rule

The average probability of error of the nearest-neighbor rule for infinite samples is
$$P(e) = \int P(e | x)\, p(x)\, dx$$
The Bayes decision rule minimizes P(e) by minimizing P(e|x) for every x, so
$$P^*(e | x) \equiv \min\big(P(e | x)\big) = 1 - P(\omega_m | x)$$
$$P^* \equiv \min\big(P(e)\big) = \int P^*(e | x)\, p(x)\, dx$$
The Nearest Neighbor Rule

When only n samples are used in the nearest-neighbor rule, the conditional probability of error becomes
$$P(e | x) = \int P(e | x, x')\, p(x' | x)\, dx'$$
where x' is the nearest-neighbor prototype of x.
Each time we take n samples, the nearest neighbor x' may be different, i.e. x' is a random variable
As $n \to \infty$, p(x'|x) approaches a delta function centered at x, $p(x' | x) \to \delta(x' - x)$ (i.e. a nearest neighbor x' can always be found very close to x).
The Nearest Neighbor Rule

To solve for $P_n(e | x, x'_n)$:
We have n pairs of random variables $\{(x_1, \theta_1), (x_2, \theta_2), \ldots, (x_n, \theta_n)\}$, where $\theta_j$ is the class label for $x_j$ and $\theta_j \in \{\omega_1, \omega_2, \ldots, \omega_c\}$
Because the state of nature when $x'_n$ (the nearest neighbor of x when the total number of samples is n) is drawn is independent of the state of nature when x is drawn, we have
$$P(\theta, \theta'_n | x, x'_n) = P(\theta | x)\, P(\theta'_n | x'_n)$$
The Nearest Neighbor Rule

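If we use the nearest-neighbor decision rule, an error occurs when $\theta \neq \theta'_n$, therefore
$$P_n(e | x, x'_n) = 1 - \sum_{i=1}^{c} P(\theta = \omega_i, \theta'_n = \omega_i | x, x'_n) = 1 - \sum_{i=1}^{c} P(\omega_i | x)\, P(\omega_i | x'_n)$$
$$\lim_{n \to \infty} P_n(e | x) = \int \Big[ 1 - \sum_{i=1}^{c} P(\omega_i | x)\, P(\omega_i | x'_n) \Big] \delta(x'_n - x)\, dx'_n = 1 - \sum_{i=1}^{c} P^2(\omega_i | x)$$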
The Nearest Neighbor Rule

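The overall asymptotic nearest-neighbor error rate is
$$P = \lim_{n \to \infty} P_n(e) = \lim_{n \to \infty} \int P_n(e | x)\, p(x)\, dx = \int \Big[ 1 - \sum_{i=1}^{c} P^2(\omega_i | x) \Big] p(x)\, dx$$
The error rate is bounded (proof in Sec 4.5.3)
$$P^* \le P \le P^* \Big( 2 - \frac{c}{c - 1} P^* \Big)$$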
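A quick numerical check of these quantities may help. The sketch below makes a simplifying assumption that is not in the slides: the posteriors are taken to be the same at every x, so the integrals over x reduce to the pointwise expressions. The function names and the posterior values are hypothetical.

```python
def bayes_error(posteriors):
    """P* = 1 - max_i P(w_i | x), assuming the posteriors do not depend on x."""
    return 1.0 - max(posteriors)


def nn_asymptotic_error(posteriors):
    """P = 1 - sum_i P(w_i | x)^2, the asymptotic nearest-neighbor error."""
    return 1.0 - sum(p * p for p in posteriors)


def nn_upper_bound(p_star, c):
    """Upper bound P* (2 - c / (c - 1) P*) for a c-class problem."""
    return p_star * (2.0 - c / (c - 1.0) * p_star)


posteriors = [0.7, 0.2, 0.1]                    # hypothetical three-class posteriors
p_star = bayes_error(posteriors)                # about 0.30
p_nn = nn_asymptotic_error(posteriors)          # about 0.46
bound = nn_upper_bound(p_star, len(posteriors)) # about 0.465

print(p_star, p_nn, bound)
print(p_star <= p_nn <= bound)
```

For these values the nearest-neighbor error sits between the Bayes rate and the upper bound, and well below twice the Bayes rate.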
The k-Nearest Neighbor Rule

The k-nearest neighbor rule is an extension of the nearest-neighbor rule.
Classify x by assigning it the label most frequently represented among the k nearest samples, i.e. use a voting scheme
When the total number of prototypes approaches infinity, these k neighbors will all converge to x.
In a two-class case, the k-nearest neighbor rule selects $\omega_m$ if a majority of the k neighbors are labeled $\omega_m$; this event has the probability
$$\sum_{i=(k+1)/2}^{k} \binom{k}{i} P(\omega_m | x)^i \, [1 - P(\omega_m | x)]^{k - i}$$
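The majority probability is a binomial tail sum and is easy to evaluate. The sketch below assumes an odd k, as in the two-class setting above; the function name and the example posterior are hypothetical.

```python
from math import comb


def majority_probability(p_m, k):
    """Probability that more than half of k neighbors carry the label w_m."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k is assumed to be a positive odd integer")
    return sum(
        comb(k, i) * p_m ** i * (1.0 - p_m) ** (k - i)
        for i in range((k + 1) // 2, k + 1)
    )


# With P(w_m | x) = 0.8, a larger k makes the majority vote agree with w_m more often.
for k in (1, 3, 5):
    print(k, majority_probability(0.8, k))
```

For k = 3 the sum is 3(0.8)^2(0.2) + (0.8)^3 = 0.896, compared with 0.8 for the single nearest neighbor.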
The k-Nearest Neighbor Rule

Example: k = 3 (odd value) and x = (0.10, 0.25)^t

Prototypes      Labels
(0.15, 0.35)    $\omega_1$
(0.10, 0.28)    $\omega_2$
(0.09, 0.30)    $\omega_1$
(0.12, 0.20)    $\omega_2$

The 3 closest vectors to x by Euclidean distance (about 0.030, 0.051 and 0.054), with their labels, are:
{(0.10, 0.28; $\omega_2$); (0.09, 0.30; $\omega_1$); (0.12, 0.20; $\omega_2$)}
The majority voting scheme assigns the label $\omega_2$ to x.
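The example can be reproduced with a short script. This is a sketch assuming Euclidean distance and integer class labels 1 and 2 for the two categories; the function name `knn_classify` is hypothetical. It also returns the neighbors it found, so the ranking by distance can be inspected.

```python
from collections import Counter
import math


def knn_classify(x, prototypes, labels, k):
    """Return the majority label among the k nearest prototypes, and those prototypes."""
    order = sorted(range(len(prototypes)), key=lambda j: math.dist(x, prototypes[j]))
    neighbors = order[:k]
    votes = Counter(labels[j] for j in neighbors)
    return votes.most_common(1)[0][0], [prototypes[j] for j in neighbors]


prototypes = [(0.15, 0.35), (0.10, 0.28), (0.09, 0.30), (0.12, 0.20)]
labels = [1, 2, 1, 2]
x = (0.10, 0.25)

label, neighbors = knn_classify(x, prototypes, labels, k=3)
print(neighbors)   # nearest first: distances are about 0.030, 0.051, 0.054
print(label)       # two of the three neighbors carry label 2
```

By Euclidean distance the prototype (0.15, 0.35) is the farthest of the four (about 0.112), so it does not take part in the vote; the assigned label is 2 in any case.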
Metrics and Nearest Neighbor Classification

The nearest-neighbor classifier relies on a distance function, or metric
Frequently we assume the metric is the Euclidean distance in d dimensions, but it can be a generalized scalar distance between two argument patterns, D(·, ·)
A metric must have four properties: for any given vectors a, b and c
Non-negativity: $D(a, b) \ge 0$
Reflexivity: $D(a, b) = 0$ iff a = b
Symmetry: $D(a, b) = D(b, a)$
Triangle inequality: $D(a, b) + D(b, c) \ge D(a, c)$
Metrics and Nearest Neighbor Classification

The Euclidean distance in d dimensions satisfies these properties
$$D(a, b) = \Big( \sum_{k=1}^{d} (a_k - b_k)^2 \Big)^{1/2}$$
Euclidean distance is very sensitive to the scales (units) of the coordinates, which has a negative impact on the performance of nearest-neighbor classifiers
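The effect of units can be seen with two prototypes and one test point. In the sketch below the features (a height and a weight), their values, and the helper names are all made up for illustration; only the first coordinate's unit changes between the two runs.

```python
import math


def nearest_index(x, prototypes):
    """Index of the prototype closest to x in Euclidean distance."""
    return min(range(len(prototypes)), key=lambda j: math.dist(x, prototypes[j]))


def rescale(point, factors):
    """Multiply each coordinate by its factor, e.g. to change measurement units."""
    return tuple(v * f for v, f in zip(point, factors))


# Hypothetical features: (height in meters, weight in kilograms).
prototypes = [(1.60, 70.5), (1.81, 72.0)]
x = (1.80, 70.0)
print(nearest_index(x, prototypes))      # weight dominates: prototype 0 is nearest

# Same data with height expressed in centimeters.
factors = (100.0, 1.0)
prototypes_cm = [rescale(p, factors) for p in prototypes]
x_cm = rescale(x, factors)
print(nearest_index(x_cm, prototypes_cm))  # height dominates: prototype 1 is nearest
```

Nothing about the data changed except the unit of one axis, yet the nearest neighbor, and hence the assigned label, is different.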
Metrics and Nearest Neighbor Classification

Minkowski metric, also referred to as the $L_k$ norm
$$L_k(a, b) = \Big( \sum_{i=1}^{d} |a_i - b_i|^k \Big)^{1/k}$$
Euclidean distance is the $L_2$ norm
The $L_1$ norm is referred to as the Manhattan distance
The $L_\infty$ distance between a and b is the maximum of the projections of |a - b| onto the d coordinate axes.
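The family of norms can be compared on a single pair of points. This is a minimal sketch; the function names `minkowski` and `l_infinity` and the example vectors are assumptions made for illustration.

```python
def minkowski(a, b, k):
    """L_k distance: (sum_i |a_i - b_i|^k)^(1/k), for k >= 1."""
    return sum(abs(ai - bi) ** k for ai, bi in zip(a, b)) ** (1.0 / k)


def l_infinity(a, b):
    """L_infinity distance: the largest coordinate-wise difference."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))


a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, 1))    # Manhattan distance: 3 + 4 = 7
print(minkowski(a, b, 2))    # Euclidean distance: sqrt(9 + 16) = 5
print(minkowski(a, b, 50))   # already very close to the L_infinity value
print(l_infinity(a, b))      # max(3, 4) = 4
```

As k grows, the largest coordinate difference dominates the sum, which is why the L_k distance tends to the L_infinity distance.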
Metrics and Nearest Neighbor Classification

Tanimoto metric, for two sets $S_1$ and $S_2$
$$D_{Tanimoto}(S_1, S_2) = \frac{n_1 + n_2 - 2 n_{12}}{n_1 + n_2 - n_{12}}$$
where $n_1$ and $n_2$ are the numbers of elements in sets $S_1$ and $S_2$, and $n_{12}$ is the number of elements in both sets.
The Tanimoto metric is frequently used in taxonomy
Metrics and Nearest Neighbor Classification

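Tanimoto metric examples:

Consider four words as sets of unordered letters: pattern, pat, stop, pots
$$D(\text{pattern}, \text{pat}) = \frac{7 + 3 - 2 \cdot 3}{7 + 3 - 3} = \frac{4}{7}, \qquad D(\text{pattern}, \text{stop}) = \frac{7 + 4 - 2 \cdot 2}{7 + 4 - 2} = \frac{7}{9}$$
$$D(\text{pattern}, \text{pots}) = \frac{7 + 4 - 2 \cdot 2}{7 + 4 - 2} = \frac{7}{9}, \qquad D(\text{pat}, \text{stop}) = \frac{3 + 4 - 2 \cdot 2}{3 + 4 - 2} = \frac{3}{5}$$
$$D(\text{pat}, \text{pots}) = \frac{3 + 4 - 2 \cdot 2}{3 + 4 - 2} = \frac{3}{5}, \qquad D(\text{stop}, \text{pots}) = \frac{4 + 4 - 2 \cdot 4}{4 + 4 - 4} = 0$$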
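These values can be verified programmatically. One assumption is needed to match the count of 7 elements for "pattern": letters are counted with multiplicity (the repeated "t" counts twice), so the sketch uses multisets via `collections.Counter` rather than plain sets. The function name `tanimoto` is hypothetical, and exact fractions are used so the results can be compared directly.

```python
from collections import Counter
from fractions import Fraction


def tanimoto(word1, word2):
    """Tanimoto distance (n1 + n2 - 2 n12) / (n1 + n2 - n12) between two letter multisets."""
    c1, c2 = Counter(word1), Counter(word2)
    n1 = sum(c1.values())
    n2 = sum(c2.values())
    n12 = sum((c1 & c2).values())   # multiset intersection: minimum count per letter
    return Fraction(n1 + n2 - 2 * n12, n1 + n2 - n12)


words = ["pattern", "pat", "stop", "pots"]
for i, w1 in enumerate(words):
    for w2 in words[i + 1:]:
        print(w1, w2, tanimoto(w1, w2))
```

The distance between "stop" and "pots" is zero because the two words contain exactly the same letters, while "pattern" is closer to "pat" than to either of the other two words.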
Metrics and Nearest Neighbor Classification

Uncritical use of a particular metric in a nearest-neighbor classifier can cause low performance
The metric needs to be invariant to common transforms such as translation, rotation, scaling, etc.
It is very difficult to make a metric invariant to multiple transforms
Typical solutions may include pre-processing two patterns to co-align them, shifting the centers and placing them in the same bounding box, etc. Automatic pre-processing can also be difficult and unreliable.
Metrics and Nearest Neighbor Classification

The tangent distance classifier uses a novel distance measure and a linear approximation to the arbitrary transforms
Assume a classifier needs to handle r transforms, such as horizontal translation, vertical translation, shear, rotation, scale and line thinning
We take each prototype x' and perform each of the transforms $F_i(x'; \alpha_i)$, where $\alpha_i$ is the parameter associated with this transform, such as the angle in a rotation
Metrics and Nearest Neighbor Classification

A tangent vector $TV_i$ is constructed for each transform
$$TV_i = F_i(x'; \alpha_i) - x'$$
For each d-dimensional prototype x', an r × d matrix T is generated, consisting of the tangent vectors at x'. These vectors are linearly independent.
The prototype plus a linear combination of all tangent vectors forms an approximation of an arbitrary transform
Metrics and Nearest Neighbor Classification

The tangent distance from a test point x to a particular stored prototype x' is defined as
$$D_{tan}(x', x) = \min_a \big[ \| (x' + T a) - x \| \big]$$
where T is a matrix consisting of the r tangent vectors at x', a is a vector of parameters for the linear combination, and ||·|| can be the Euclidean distance.
In classification of x, we will first find its tangent distance to x' by finding the optimizing value of a.
This minimization is quadratic, and can be done using iterative gradient descent.
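The minimization over a can be sketched with plain gradient descent on the squared Euclidean norm. Several things here are assumptions made for illustration: the function name `tangent_distance`, passing the tangent vectors as a list of r vectors of length d, the fixed iteration count, and the step size (the reciprocal of the summed squared norms of the tangent vectors, chosen so the iteration cannot diverge).

```python
import math


def tangent_distance(x, prototype, tangents, iterations=500):
    """Approximate min_a ||(prototype + T a) - x|| by gradient descent.

    tangents is a list of r tangent vectors, each of length d.
    Returns the distance and the parameter vector a that was found.
    """
    d, r = len(x), len(tangents)
    a = [0.0] * r

    def residual(params):
        # (prototype + T a) - x, computed coordinate by coordinate.
        return [
            prototype[j] + sum(params[i] * tangents[i][j] for i in range(r)) - x[j]
            for j in range(d)
        ]

    # Sum of squared tangent norms; it bounds the curvature of the objective.
    scale = sum(v * v for t in tangents for v in t)
    if scale > 0.0:
        for _ in range(iterations):
            res = residual(a)
            # Gradient of ||res||^2 with respect to a_i is 2 * <tangent_i, res>.
            grad = [2.0 * sum(tangents[i][j] * res[j] for j in range(d)) for i in range(r)]
            a = [a[i] - grad[i] / (2.0 * scale) for i in range(r)]
    return math.sqrt(sum(v * v for v in residual(a))), a


# Hypothetical case: the prototype may slide along the first axis only.
x = (3.0, 4.0)
prototype = (0.0, 0.0)
tangents = [(1.0, 0.0)]

distance, a = tangent_distance(x, prototype, tangents)
print(distance, a)                 # the prototype slides to (3, 0), leaving distance 4
print(math.dist(x, prototype))     # plain Euclidean distance is 5
```

The tangent distance is never larger than the Euclidean distance, since a = 0 is always an admissible choice; the gap between the two reflects how much of the difference the modeled transforms can explain.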