
# CpE ...g and Classification

Department of Electrical and Computer Engineering
Stevens Institute of Technology

## Non-Parametric Classification

Chapter 4 (Sections 4.4 to 4.6):

- kn-Nearest-Neighbor Estimation
- The Nearest-Neighbor Rule

## kn-Nearest Neighbor Estimation

A solution for the problem of the unknown best window function. To estimate $p(x)$ from n training samples or prototypes:

- Let the cell volume be a function of the number of training data.
- Center a cell about x and let it grow until it captures $k_n$ samples ($k_n = f(n)$).
- The included samples are called the $k_n$ nearest neighbors of x.
- The density is given as

$$p_n(x) = \frac{k_n / n}{V_n} \qquad (30)$$
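The estimate in (30) can be sketched in one dimension, where the cell is an interval centered at x and its volume is the interval length. This is a minimal illustration, not a reference implementation: the function name `knn_density_1d`, the sample values, and the choice $k_n = \sqrt{n}$ (rounded) are assumptions made for the example.

```python
import math

def knn_density_1d(x, samples, k):
    """k-nearest-neighbor density estimate p_n(x) = (k / n) / V_n in 1-D."""
    n = len(samples)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # The cell grows until it captures k samples: its half-width is the
    # distance from x to the k-th nearest sample.
    radius = sorted(abs(x - s) for s in samples)[k - 1]
    volume = 2.0 * radius  # length of the interval [x - radius, x + radius]
    if volume == 0.0:
        return math.inf  # x coincides with at least k samples
    return (k / n) / volume

samples = [0.0, 1.0, 2.0, 4.0]        # hypothetical training samples
k = round(math.sqrt(len(samples)))    # assumed choice k_n = sqrt(n)
estimate = knn_density_1d(1.5, samples, k)
print(estimate)
```

With these four samples and k = 2, the cell around x = 1.5 has half-width 0.5, so the estimate is (2/4) / 1.0 = 0.5.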
Two possibilities can occur:

- If the density is high near x, the cell will be small, which provides good resolution.
- If the density is low around x, the cell will grow large and stop only once higher-density regions are reached.
It can be proven that $\lim_{n\to\infty} k_n = \infty$ and $\lim_{n\to\infty} k_n / n = 0$ are necessary and sufficient for $p_n(x)$ to converge to $p(x)$ at all points where $p(x)$ is continuous.

If $k_n = \sqrt{n}$ and $p_n(x)$ is a good estimate of $p(x)$, i.e. $p_1(x) = p_n(x) = p(x)$, then from (30) we have

$$V_n = \frac{1}{\sqrt{n}\, p(x)} \quad\text{and then}\quad V_n = \frac{V_1}{\sqrt{n}}$$

This becomes similar to the Parzen-window approach, except that $V_1 = 1/p(x)$ is not determined arbitrarily.
Figure: peaks are at the middle of regions with k prototypes.

kn-nearest-neighbor illustration:

For $k_n = \sqrt{n}$, when n = 1 we have $k_1 = 1$, and the estimate is

$$p_n(x) = \frac{1}{2\,|x - x_1|}$$

where $x_1$ is the first training sample and $|\cdot|$ is a distance measure; $2|x - x_1|$ is the volume. This is not a good estimate (Figure 4.12).

As n increases, the estimate gets better.

This method will not generate zero $p(x)$ for any x. (If a fixed window, e.g. a Parzen window, is used and no sample falls inside it, the density estimate for that window will be zero. This will not happen here.)

We can obtain a family of estimates by taking $k_n = k_1 \sqrt{n}$.

- As with the Parzen-window method, the choice of $k_1$ is case dependent.
- Usually $k_1$ is selected so that, when the estimated density is used to classify new test samples from the same density, it yields the lowest error rate.
Estimation of a posteriori probabilities. We can estimate $P(\omega_i | x)$ from a set of n labeled samples using the window methods.

- We place a cell of volume V around x and capture k samples.
- If $k_i$ samples among these k samples turn out to be labeled $\omega_i$, then the joint probability $p(x, \omega_i)$ can be estimated as

$$p_n(x, \omega_i) = \frac{k_i / n}{V}$$
Then a reasonable estimate of the a posteriori probability is

$$P_n(\omega_i | x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \frac{k_i}{k}$$

- $k_i / k$ is the fraction of the samples within the cell that are labeled $\omega_i$, i.e. there are k samples in the cell and $k_i$ out of these k are labeled $\omega_i$.
- For minimum error rate, the most frequently represented category within the cell is selected for this cell, and any test sample that lies in this cell is labeled with this category.
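The ratio $k_i / k$ can be computed directly from the k prototypes nearest to x. The sketch below is one possible illustration under stated assumptions: the function name `knn_posteriors`, the Euclidean distance, the toy prototypes, and the label strings `"w1"` and `"w2"` (standing in for $\omega_1$ and $\omega_2$) are all invented for the example.

```python
from collections import Counter

def knn_posteriors(x, prototypes, labels, k):
    """Estimate P(label | x) as k_i / k from the k nearest prototypes."""
    def sq_dist(p):
        # Squared Euclidean distance; the ordering matches the true distance.
        return sum((pi - xi) ** 2 for pi, xi in zip(p, x))

    order = sorted(range(len(prototypes)), key=lambda j: sq_dist(prototypes[j]))
    counts = Counter(labels[j] for j in order[:k])  # k_i for each label present
    return {label: count / k for label, count in counts.items()}

# Hypothetical labeled prototypes.
prototypes = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (1.0, 1.0), (1.1, 0.9)]
labels = ["w1", "w1", "w2", "w2", "w2"]

post = knn_posteriors((0.05, 0.05), prototypes, labels, k=3)
decision = max(post, key=post.get)  # most frequently represented category
print(post, decision)
```

Here the three nearest prototypes carry the labels w1, w1 and w2, so the estimates are 2/3 and 1/3 and the minimum-error decision is w1.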
## The Nearest-Neighbor Rule

- Let $D_n = \{x_1, x_2, \ldots, x_n\}$ be a set of n labeled prototypes.
- Let $x' \in D_n$ be the closest prototype to a test point x; then the nearest-neighbor rule for classifying x is to assign it the label associated with $x'$.
- The nearest-neighbor rule usually leads to an error rate greater than the minimum possible, the Bayes rate.
- If the number of prototypes is large (unlimited), the error rate of the nearest-neighbor classifier is never worse than twice the Bayes rate.
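As a minimal sketch of the rule, assuming the Euclidean distance and a small hypothetical prototype set (the function name `nearest_neighbor_label` and the labels `"w1"`, `"w2"` are illustrative only):

```python
import math

def nearest_neighbor_label(x, prototypes, labels):
    """Return the label of the prototype closest to x (Euclidean distance)."""
    j = min(range(len(prototypes)), key=lambda i: math.dist(x, prototypes[i]))
    return labels[j]

# Hypothetical labeled prototypes D_n.
prototypes = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
labels = ["w1", "w2", "w1"]

print(nearest_neighbor_label((0.9, 0.8), prototypes, labels))
print(nearest_neighbor_label((0.2, 0.1), prototypes, labels))
```

The first test point is closest to (1.0, 1.0) and receives w2; the second is closest to (0.0, 0.0) and receives w1.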
The label $\theta'$ associated with the nearest neighbor $x'$ is a random variable, and the probability that $\theta' = \omega_i$ is the a posteriori probability $P(\omega_i | x')$.

If $n \to \infty$, it is always possible to find $x'$ sufficiently close to x so that $P(\omega_i | x') \approx P(\omega_i | x)$.

We define $\omega_m(x)$ such that $P(\omega_m | x) = \max_i P(\omega_i | x)$. Then the Bayes rule always selects $\omega_m$ for x.
This rule essentially partitions the feature space into cells, each cell containing a prototype $x'$ and all points closer to it than to any other prototype. All points in a cell are labeled with the category of this $x'$. This partition is called the Voronoi tessellation of the space.

In each cell:

- If $P(\omega_m | x) \approx P(\omega_i | x') \approx 1$, then the nearest-neighbor selection is almost always the same as the Bayes selection.
- If $P(\omega_m | x) \approx P(\omega_i | x') \approx 1/c$, then the nearest-neighbor selection is rarely the same as the Bayes selection, but their error rates are similar (i.e. both amount to a random guess).

The average probability of error of the nearest-neighbor rule for infinite samples is

$$P(e) = \int P(e | x)\, p(x)\, dx$$

The Bayes decision rule minimizes $P(e)$ by minimizing $P(e | x)$ for every x; then

$$P^*(e | x) \equiv \min\big(P(e | x)\big) = 1 - P(\omega_m | x)$$

$$P^* \equiv \min\big(P(e)\big) = \int P^*(e | x)\, p(x)\, dx$$
When only n samples are used in the nearest-neighbor rule, the conditional probability of error becomes

$$P(e | x) = \int P(e | x, x')\, p(x' | x)\, dx'$$

where $x'$ is the nearest-neighbor prototype of x.

- Each time we take n samples, the nearest neighbor $x'$ may be different, i.e. $x'$ is a random variable.
- As $n \to \infty$, $p(x' | x)$ approaches a delta function centered at x, $p(x' | x) \to \delta(x' - x)$ (i.e. a nearest neighbor $x'$ can always be found very close to x).
To solve for $P_n(e | x, x'_n)$:

- We have n pairs of random variables $\{(x_1, \theta_1), (x_2, \theta_2), \ldots, (x_n, \theta_n)\}$, where $\theta_j$ is the class label for $x_j$ and $\theta_j \in \{\omega_1, \omega_2, \ldots, \omega_c\}$.
- Because the state of nature when $x'_n$ (the nearest neighbor of x when the total sample is n) is drawn is independent of the state of nature when x is drawn, we have

$$P(\theta, \theta'_n | x, x'_n) = P(\theta | x)\, P(\theta'_n | x'_n)$$
If we use the nearest-neighbor decision rule, an error occurs when $\theta \neq \theta'_n$; therefore

$$P_n(e | x, x'_n) = 1 - \sum_{i=1}^{c} P(\theta = \omega_i, \theta'_n = \omega_i | x, x'_n) = 1 - \sum_{i=1}^{c} P(\omega_i | x)\, P(\omega_i | x'_n)$$

$$\lim_{n\to\infty} P_n(e | x) = \int \Big[ 1 - \sum_{i=1}^{c} P(\omega_i | x)\, P(\omega_i | x'_n) \Big]\, \delta(x'_n - x)\, dx'_n = 1 - \sum_{i=1}^{c} P^2(\omega_i | x)$$
The overall asymptotic nearest-neighbor error rate is

$$P = \lim_{n\to\infty} P_n(e) = \lim_{n\to\infty} \int P_n(e | x)\, p(x)\, dx = \int \Big[ 1 - \sum_{i=1}^{c} P^2(\omega_i | x) \Big]\, p(x)\, dx$$

The error rate is bounded (proof in Sec 4.5.3):

$$P^* \le P \le P^* \Big( 2 - \frac{c}{c-1}\, P^* \Big)$$
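The bound can be checked numerically in a simplified setting. The sketch below assumes that the posteriors are the same at every x, so that the integrals over $p(x)$ reduce to the pointwise values; the posterior values and the function names are hypothetical and chosen only for illustration.

```python
def bayes_error(posteriors):
    """P* = 1 - max_i P(w_i | x) for a fixed set of posteriors."""
    return 1.0 - max(posteriors)

def nn_asymptotic_error(posteriors):
    """Asymptotic nearest-neighbor error 1 - sum_i P(w_i | x)^2."""
    return 1.0 - sum(p * p for p in posteriors)

def nn_upper_bound(p_star, c):
    """Upper bound P* (2 - c / (c - 1) * P*) for c classes."""
    return p_star * (2.0 - c / (c - 1) * p_star)

posteriors = [0.7, 0.2, 0.1]  # assumed constant over the feature space
p_star = bayes_error(posteriors)
p_nn = nn_asymptotic_error(posteriors)
upper = nn_upper_bound(p_star, len(posteriors))
print(p_star, p_nn, upper)
```

For these values the Bayes rate is 0.3, the asymptotic nearest-neighbor rate is 0.46, and the upper bound is 0.465, so the two inequalities hold.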
## The k-Nearest Neighbor Rule

The k-nearest-neighbor rule is an extension of the nearest-neighbor rule.

- Classify x by assigning it the label most frequently represented among the k nearest samples, i.e. use a voting scheme.
- When the total number of prototypes approaches infinity, these k neighbors will all converge to x.
- In a two-class case, the k-nearest-neighbor rule selects $\omega_m$ if a majority of the k neighbors are labeled $\omega_m$; this event has the probability

$$\sum_{i=(k+1)/2}^{k} \binom{k}{i}\, P(\omega_m | x)^i\, \big[ 1 - P(\omega_m | x) \big]^{k-i}$$
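The sum above is a binomial tail and can be evaluated directly. This is a small sketch assuming k is odd, as the lower limit $(k+1)/2$ implies; the function name `majority_probability` and the value 0.8 for $P(\omega_m | x)$ are illustrative assumptions.

```python
from math import comb

def majority_probability(p_m, k):
    """Probability that a majority of k neighbors are labeled omega_m (k odd)."""
    if k % 2 == 0:
        raise ValueError("k is assumed to be odd")
    return sum(
        comb(k, i) * p_m**i * (1.0 - p_m) ** (k - i)
        for i in range((k + 1) // 2, k + 1)
    )

for k in (1, 3, 5):
    print(k, majority_probability(0.8, k))
```

With $P(\omega_m | x) = 0.8$, k = 1 gives 0.8 and k = 3 gives 3(0.8)²(0.2) + (0.8)³ = 0.896, so a larger k makes the vote agree with $\omega_m$ more often.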
Example: k = 3 (odd value) and $x = (0.10, 0.25)^t$

| Prototypes | Labels |
|---|---|
| (0.15, 0.35) | $\omega_1$ |
| (0.10, 0.28) | $\omega_2$ |
| (0.09, 0.30) | $\omega_1$ |
| (0.12, 0.20) | $\omega_2$ |

The 3 closest vectors to x with their labels are:
{(0.10, 0.28; $\omega_2$); (0.09, 0.30; $\omega_1$); (0.12, 0.20; $\omega_2$)}

The majority voting scheme assigns the label $\omega_2$ to x.
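The example can be recomputed with a few lines of code. The sketch assumes the Euclidean distance (the slide does not state the metric explicitly at this point) and uses the strings `"w1"` and `"w2"` as stand-ins for $\omega_1$ and $\omega_2$.

```python
import math
from collections import Counter

prototypes = [(0.15, 0.35), (0.10, 0.28), (0.09, 0.30), (0.12, 0.20)]
labels = ["w1", "w2", "w1", "w2"]
x = (0.10, 0.25)
k = 3

# Rank the prototypes by Euclidean distance to x and keep the k closest.
ranked = sorted(zip(prototypes, labels), key=lambda pl: math.dist(x, pl[0]))
neighbors = ranked[:k]

# Majority vote among the k nearest labels.
votes = Counter(label for _, label in neighbors)
predicted = votes.most_common(1)[0][0]

for point, label in neighbors:
    print(point, label, round(math.dist(x, point), 4))
print("predicted:", predicted)
```

The distances from x to the four prototypes are about 0.1118, 0.0300, 0.0510 and 0.0539 in table order, and the vote yields w2.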
## Metrics and Nearest Neighbor Classification

The nearest-neighbor classifier relies on a distance function, or metric.

- Frequently we assume the metric is the Euclidean distance in d dimensions, but it can be a generalized scalar distance between two argument patterns, $D(\cdot, \cdot)$.
- A metric must have four properties. For any given vectors a, b and c:
  - Non-negativity: $D(a, b) \ge 0$
  - Reflexivity: $D(a, b) = 0$ iff $a = b$
  - Symmetry: $D(a, b) = D(b, a)$
  - Triangle inequality: $D(a, b) + D(b, c) \ge D(a, c)$
The Euclidean distance in d dimensions satisfies these properties:

$$D(a, b) = \Big( \sum_{k=1}^{d} (a_k - b_k)^2 \Big)^{1/2}$$

The Euclidean distance is very sensitive to the scales (units) of the coordinates, which has a negative impact on the performance of nearest-neighbor classifiers.

Minkowski metric, also referred to as the $L_k$ norm:

$$L_k(a, b) = \Big( \sum_{i=1}^{d} |a_i - b_i|^k \Big)^{1/k}$$

- The Euclidean distance is the $L_2$ norm.
- The $L_1$ norm is referred to as the Manhattan distance.
- The $L_\infty$ distance between a and b is the maximum of the projections of $|a - b|$ onto the d coordinate axes.
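A short sketch of the $L_k$ family, assuming vectors are given as plain tuples of equal length; the function name `minkowski` and the test vectors are illustrative. The $L_\infty$ case is handled separately as the largest coordinate difference.

```python
import math

def minkowski(a, b, k):
    """L_k distance between equal-length vectors a and b (k >= 1 or math.inf)."""
    diffs = [abs(ai - bi) for ai, bi in zip(a, b)]
    if k == math.inf:
        return max(diffs)  # L_infinity: largest projection on a coordinate axis
    return sum(d**k for d in diffs) ** (1.0 / k)

a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, 1))         # Manhattan distance
print(minkowski(a, b, 2))         # Euclidean distance
print(minkowski(a, b, math.inf))  # maximum coordinate difference
```

For these two points the three distances are 7, 5 and 4, which shows how the choice of k changes the ranking of neighbors a classifier would see.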

Tanimoto metric, for two sets $S_1$ and $S_2$:

$$D_{Tanimoto}(S_1, S_2) = \frac{n_1 + n_2 - 2 n_{12}}{n_1 + n_2 - n_{12}}$$

where $n_1$ and $n_2$ are the numbers of elements in sets $S_1$ and $S_2$, respectively, and $n_{12}$ is the number in both sets.

The Tanimoto metric is frequently used in taxonomy.
Tanimoto metric examples. Consider four words as sets of unordered letters: pattern, pat, stop, pots.

$$D(pattern, pat) = \frac{7 + 3 - 2 \cdot 3}{7 + 3 - 3} = \frac{4}{7}, \qquad D(pattern, stop) = \frac{7 + 4 - 2 \cdot 2}{7 + 4 - 2} = \frac{7}{9}$$

$$D(pattern, pots) = \frac{7 + 4 - 2 \cdot 2}{7 + 4 - 2} = \frac{7}{9}, \qquad D(pat, stop) = \frac{3 + 4 - 2 \cdot 2}{3 + 4 - 2} = \frac{3}{5}$$

$$D(pat, pots) = \frac{3 + 4 - 2 \cdot 2}{3 + 4 - 2} = \frac{3}{5}, \qquad D(stop, pots) = \frac{4 + 4 - 2 \cdot 4}{4 + 4 - 4} = 0$$
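These values can be reproduced with a short script. One assumption is needed to match the counts used above: letters are counted with multiplicity, so "pattern" contributes 7 elements even though the letter t repeats. The function name `tanimoto` is illustrative.

```python
from collections import Counter
from fractions import Fraction

def tanimoto(word1, word2):
    """Tanimoto distance between two words treated as multisets of letters."""
    c1, c2 = Counter(word1), Counter(word2)
    n1, n2 = sum(c1.values()), sum(c2.values())
    n12 = sum((c1 & c2).values())  # letters shared by both words
    return Fraction(n1 + n2 - 2 * n12, n1 + n2 - n12)

words = ["pattern", "pat", "stop", "pots"]
for i, w1 in enumerate(words):
    for w2 in words[i + 1:]:
        print(w1, w2, tanimoto(w1, w2))
```

Using exact fractions keeps the output directly comparable with the worked values 4/7, 7/9, 3/5 and 0.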
Uncritical use of a particular metric in a nearest-neighbor classifier can cause low performance.

- The metric needs to be invariant to common transforms such as translation, rotation, scaling, etc.
- It is very difficult to make a metric invariant to multiple transforms.
- Typical solutions may include pre-processing two patterns to co-align them, shifting the centers and placing them in the same bounding box, etc. Automatic pre-processing can also be difficult and unreliable.

The tangent distance classifier uses a novel distance measure and a linear approximation to the arbitrary transforms.

- Assume a classifier needs to handle r transforms, such as horizontal translation, vertical translation, shear, rotation, scale and line thinning.
- We take each prototype $x'$ and perform each of the transforms $F_i(x'; \alpha_i)$, where $\alpha_i$ is the parameter associated with this transform, such as the angle in rotation.
- A tangent vector $TV_i$ is constructed for each transform: $TV_i = F_i(x'; \alpha_i) - x'$
- For each d-dimensional prototype $x'$, an $r \times d$ matrix T is generated, consisting of the tangent vectors at $x'$. These vectors are linearly independent.
- The prototype plus a linear combination of all tangent vectors forms an approximation of an arbitrary transform.
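The linear approximation can be illustrated with a deliberately small case: a single transform (r = 1), a rotation, applied to a two-dimensional point standing in for a prototype. Everything in the sketch (the `rotate` helper, the prototype, the angle 0.1 rad, the coefficient 0.5) is a hypothetical choice for illustration rather than part of the method as presented.

```python
import math

def rotate(point, angle):
    """Rotate a 2-D point counter-clockwise by `angle` radians."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

prototype = (1.0, 0.0)  # hypothetical prototype x'
alpha = 0.1             # small rotation angle used to build the tangent vector

# Tangent vector TV = F(x'; alpha) - x'
tangent = tuple(f - p for f, p in zip(rotate(prototype, alpha), prototype))

# x' + a * TV approximates a rotation by a * alpha for small angles.
a = 0.5
approx = tuple(p + a * t for p, t in zip(prototype, tangent))
exact = rotate(prototype, a * alpha)
error = math.dist(approx, exact)
print(tangent, approx, exact, error)
```

The approximation error here is on the order of 1e-3, which reflects that the tangent vector only captures the transform to first order around the prototype.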
The tangent distance from a test point x to a particular stored prototype $x'$ is defined as

$$D_{tan}(x', x) = \min_a \big[\, \| (x' + Ta) - x \| \,\big]$$

where T is a matrix consisting of the r tangent vectors at $x'$, a is a vector of parameters for the linear combination, and $\|\cdot\|$ can be the Euclidean distance.

- In classification of x, we first find its tangent distance to $x'$ by finding the optimizing value of a.
- This minimization is quadratic, and can be done using