
CSCE  Homework 

Jason Dew
September , 

Abstract
In this homework I will explore the efficacy of different parameters in
the k-nearest-neighbor algorithm, including the value of k, weighting tactics,
and the distance measure used to calculate similarity. The breast cancer
Wisconsin (diagnostic) data¹ from the UCI Machine Learning repository² is
used.

1 Weka installation

This was very straightforward on my platform of choice, Mac OS. I also put
the weka.jar file in a standard location so that it can be used programmatically
via JRuby.

2 Acquisition and preliminary analysis


2.1 Data set

The data set was very easy to find and I was impressed with the organization and
depth of the UCI repository.

2.2 Attributes

All of the attributes in the given data set, except for the ID, are ordinal and range
from 1 to 10. The stacked boxplots in Figure 1 show this as well as the relationships
between the attributes. The means and standard deviations are also given in
Figure 2.

As the ID has no predictive value, it was removed from consideration.
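This preprocessing step can be sketched in plain Ruby. The rows below are made-up illustrations (ID first, then the nine ordinal attributes, then the class), not actual records from the data set:

```ruby
# Each row: [id, clump_thickness, ..., mitosis, class].
# The ID carries no predictive information, so drop the first column.
rows = [
  [101, 5, 1, 1, 1, 2, 1, 3, 1, 1, 2],
  [102, 5, 4, 4, 5, 7, 10, 3, 2, 1, 2],
]

without_id = rows.map { |row| row[1..] }
```

Doing this before training keeps the classifier from treating an arbitrary identifier as a similarity dimension.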

¹ hp://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
² hp://archive.ics.uci.edu/ml/index.html


Figure 1: Boxplots of the attributes in the data set (clump_thickness,
cell_size_uniformity, cell_shape_uniformity, marginal_adhesion,
epithelial_cell_size, bare_nuclei, bland_chromatin, normal_nucleoli,
mitosis), each on a scale from 1 to 10.


attribute               mean    standard deviation
clump_thickness           .       .
cell_size_uniformity      .       .
cell_shape_uniformity     .       .
marginal_adhesion         .       .
epithelial_cell_size      .       .
bare_nuclei               .       .
bland_chromatin           .       .
normal_nucleoli           .       .
mitosis                   .       .

Figure 2: Means and standard deviations for the attributes.

3 Analysis of k-NN classifiers


3.1 Method

In order to train and test given a single data set, -fold cross-validation was used,
and this seems to give good results. However, it is of note that the accuracy
numbers are lower than when using the training set as the test set.
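The cross-validation procedure can be sketched in plain Ruby. This only illustrates what Weka does internally; the majority-class baseline used here is a hypothetical stand-in for the real classifier, just to keep the sketch self-contained:

```ruby
# Estimate accuracy by n-fold cross-validation: split the data into
# folds, hold each fold out as a test set in turn, train on the rest,
# and average the per-fold accuracies.
def cross_validation_accuracy(instances, folds: 10)
  shuffled = instances.shuffle(random: Random.new(42))
  slices = shuffled.each_slice((shuffled.size / folds.to_f).ceil).to_a
  accuracies = slices.each_with_index.map do |test_fold, i|
    train = (slices[0...i] + slices[(i + 1)..]).flatten(1)
    # Hypothetical baseline: predict the training set's majority class.
    majority = train.group_by(&:last).max_by { |_, group| group.size }.first
    correct = test_fold.count { |inst| inst.last == majority }
    correct.to_f / test_fold.size
  end
  accuracies.sum / accuracies.size
end
```

Because every instance is tested exactly once on a model that never saw it, this estimate is typically lower than testing on the training set itself, matching the observation above.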

3.2 Results

In order to learn more about how the k-NN classifier works, I varied several
options in addition to k, including weighting the similarities and varying the
distance measures. The accuracy of a classifier is defined as

    accuracy = (# correct) / (# of instances)
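As a small, made-up illustration of this definition:

```ruby
# accuracy = (# correct) / (# of instances)
predictions = [2, 2, 4, 2, 4, 4]  # hypothetical classifier output
actuals     = [2, 2, 4, 4, 4, 4]  # hypothetical true classes

correct  = predictions.zip(actuals).count { |p, a| p == a }
accuracy = correct.to_f / actuals.size  # 5 of 6 correct
```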
Figure  shows how the accuracy varies in k using the Euclidean distance measure
and no weighting. ere does not seem to be a clear paern here. Figure  shows
how the distance metric used affects the accuracy achieved. e differences are
between these are slight and the Euclidean distance does the best overall. Figure 
shows prey clearly that weighting the results either by using the inverse distance
or the similarity is a good idea. In this case, using the inverse distance does a beer
job.
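A minimal pure-Ruby sketch of the options compared above — the three distance metrics and the weighting schemes. This is not Weka's IBk implementation; the names are illustrative, and the similarity weight (1 − d) assumes distances normalized to [0, 1], as Weka's normalization provides:

```ruby
# The three distance metrics compared in Figure 4.
DISTANCES = {
  euclidean: ->(a, b) { Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 }) },
  manhattan: ->(a, b) { a.zip(b).sum { |x, y| (x - y).abs } },
  chebyshev: ->(a, b) { a.zip(b).map { |x, y| (x - y).abs }.max },
}

# Weighted k-NN vote: each of the k nearest neighbors contributes a
# weight w to its class; the class with the largest total wins.
#   weighting: :none (w = 1), :inverse (w = 1/d), :similarity (w = 1 - d)
def knn_classify(train, query, k: 5, metric: :euclidean, weighting: :none)
  dist = DISTANCES.fetch(metric)
  neighbors = train
    .map { |features, label| [dist.call(features, query), label] }
    .sort_by(&:first)
    .first(k)
  votes = Hash.new(0.0)
  neighbors.each do |d, label|
    w = case weighting
        when :inverse    then 1.0 / (d + 1e-9)  # guard against d == 0
        when :similarity then 1.0 - d
        else 1.0
        end
    votes[label] += w
  end
  votes.max_by { |_, total| total }.first
end
```

Inverse-distance weighting lets very close neighbors dominate the vote, which is one plausible reading of why it edges out the similarity weighting here.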


Figure 3: Graph of the effect of k on the k-NN algorithm, for k ranging from 5
to 10 (accuracy roughly between 96.2% and 96.6%).


Figure 4: Graph of the effect of the distance metric (Euclidean, Manhattan,
Chebyshev) on the k-NN algorithm, for k from 5 to 10 (accuracy roughly between
96.0% and 96.8%).


Figure 5: Graph of the effect of weighting (none, inverse distance, similarity)
on the k-NN algorithm, for k from 5 to 10 (accuracy roughly between 96.2% and
97.2%).