
k-Nearest Neighbor

Instructor: Jessica Wu – Harvey Mudd College

The instructor gratefully acknowledges Eric Eaton (UPenn), David Kauchak (Pomona), and Andrew Moore
(CMU), and the many others who made their course materials freely available online.

Data Representation
Learning Goals

• Describe how to represent complex data
• View data graphically
Data Representation
• Most learning algorithms require data in some numeric representation
  – e.g. each input pattern is a vector
• If data naturally has numeric (real-valued) features, just represent it as a vector of (real) numbers
  – e.g. a 28x28 image becomes a 784x1 vector of pixel intensities
• If data has a non-numeric representation…
  – let's look at some examples
Based on slide by Piyush Rai

Data to Features
• Text document

Based on slide by Piyush Rai
Data to Features
Let's consider a dataset similar to the Tennis Playing example

• Features are categorical (Low/High, Yes/No, Overcast/Rainy/Sunny, etc.)
• Features with only 2 possible values
  – can be represented as 0/1
• Features with more than 2 possible values
  – can we map Sunny = 0, Overcast = 1, Rainy = 2?

Based on slide by Piyush Rai

Data to Features
• Well, we could map Sunny = 0, Overcast = 1, Rainy = 2…
• But such a mapping may not always be appropriate
  – imagine the feature values being red, blue, green
  – red = 0, blue = 1, green = 2 implies red is more similar to blue than to green
• Solution: for a feature with K > 2 possible values, create K binary features, one for each possible value (see the sketch below)
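For concreteness, here is a minimal sketch of this "one binary feature per value" encoding in plain Python; the helper name and example values are my own, not from the slides.

    def one_hot(value, possible_values):
        """Encode a categorical value as K binary features, one per possible value."""
        return [1 if value == v else 0 for v in possible_values]

    outlook_values = ["Sunny", "Overcast", "Rainy"]
    print(one_hot("Sunny", outlook_values))     # [1, 0, 0]
    print(one_hot("Overcast", outlook_values))  # [0, 1, 0]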
Data Visualization
Turn features into numerical values
Let's visualize this data (a simple mapping is used here to keep the visualization simple)

Weight  Color   Label        Weight  Color  Label
4       Red     Apple        4       0      Apple
5       Yellow  Apple        5       1      Apple
6       Yellow  Banana       6       1      Banana
3       Red     Apple        3       0      Apple
7       Yellow  Banana       7       1      Banana
8       Yellow  Banana       8       1      Banana
6       Yellow  Apple        6       1      Apple

[Scatter plot: Weight on the x-axis (0 to 10), Color on the y-axis (0 to 1), with each example plotted as A (Apple) or B (Banana)]

We can view examples as points in a d-dimensional space, where d is the number of features
(see the sketch below)
Based on slide by David Kauchak
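To make this mapping concrete, the following sketch (my own, not from the slides) builds the numeric version of the fruit table and treats each example as a point in 2-D space.

    # Map the categorical Color feature to a number (Red -> 0, Yellow -> 1),
    # matching the simple mapping used on the slide.
    color_map = {"Red": 0, "Yellow": 1}

    data = [
        (4, "Red", "Apple"), (5, "Yellow", "Apple"), (6, "Yellow", "Banana"),
        (3, "Red", "Apple"), (7, "Yellow", "Banana"), (8, "Yellow", "Banana"),
        (6, "Yellow", "Apple"),
    ]

    # Each example becomes a point (weight, color) in 2-dimensional space.
    points = [((weight, color_map[color]), label) for weight, color, label in data]
    for (w, c), label in points:
        print(f"({w}, {c}) -> {label}")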

(This slide intentionally left blank.)


k-Nearest Neighbor
Learning Goals

• Describe the kNN algorithm
• Describe the impact of k in kNN
• Describe kNN variants (optional)

Examples in a Feature Space

[Scatter plot: examples from three classes (label 1, label 2, label 3) shown as points in a 2-D feature space (feature 1 vs. feature 2), along with a test example of unknown class ("what class?")]
Another classification algorithm?
To classify example x:
  Label x with the label of the closest example to x in the training set
Based on slide by David Kauchak
k-Nearest Neighbor (k-NN)
To classify example x:
  – Find the k nearest neighbors of x
  – Choose as label the majority label within the k nearest neighbors

How do we measure "nearest"?
Common approach: the standard Euclidean distance metric (a code sketch follows below)
  two-dimensional: dist(a, b) = sqrt((a1 – b1)² + (a2 – b2)²)
  n-dimensional:   dist(a, b) = sqrt(Σi (ai – bi)²)
  for points a = (a1, a2, …, an) and b = (b1, b2, …, bn)

Based on slide by David Kauchak
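A minimal sketch of k-NN classification with this Euclidean distance, written from the pseudocode above; the function names and the reuse of the fruit data are my own, not from the slides.

    import math
    from collections import Counter

    def euclidean(a, b):
        """n-dimensional Euclidean distance: sqrt(sum_i (a_i - b_i)^2)."""
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def knn_classify(x, training_set, k=3):
        """Label x with the majority label among its k nearest training examples."""
        # training_set is a list of (feature_vector, label) pairs
        neighbors = sorted(training_set, key=lambda ex: euclidean(x, ex[0]))[:k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

    # Example usage with the fruit data from the visualization slide:
    train = [((4, 0), "Apple"), ((5, 1), "Apple"), ((6, 1), "Banana"),
             ((3, 0), "Apple"), ((7, 1), "Banana"), ((8, 1), "Banana"),
             ((6, 1), "Apple")]
    print(knn_classify((7, 1), train, k=3))  # -> "Banana"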

Other Distance Measures
• Binary-valued features
  – Hamming distance: dist(a, b) = Σi I(ai ≠ bi)
    counts the number of features where the two examples disagree

• Mixed feature types (some real, some binary)
  – mixed distance measures
  – e.g. Euclidean for the real part, Hamming for the binary part

• Can also assign weights to features:
  dist(a, b) = Σi wi · d(ai, bi)
  (both Hamming and weighted distances are sketched below)

Based on slide by Piyush Rai
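Illustrative sketches of these two distance measures; the function names are my own, not from the slides.

    def hamming(a, b):
        """Hamming distance: count the features where two examples disagree."""
        return sum(1 for ai, bi in zip(a, b) if ai != bi)

    def weighted_dist(a, b, weights, d):
        """Weighted per-feature distance: sum_i w_i * d(a_i, b_i)."""
        return sum(w * d(ai, bi) for w, ai, bi in zip(weights, a, b))

    print(hamming([0, 1, 1, 0], [0, 0, 1, 1]))        # 2
    print(weighted_dist([1.0, 0], [3.0, 1],
                        weights=[0.5, 2.0],
                        d=lambda x, y: abs(x - y)))   # 0.5*2 + 2.0*1 = 3.0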
k-NN Decision Boundaries

[Figure: training examples from three classes (label 1, label 2, label 3) with the k-NN decision boundaries drawn between them]

Where are the decision boundaries for k-NN?
k-NN gives locally defined decision boundaries between classes
(forms a subset of the Voronoi diagram for the training data)

Based on slide by David Kauchak

k-NN Decision Boundaries
• Can be changed by different distance metrics

  dist(a, b) = (a1 – b1)² + (a2 – b2)²        dist(a, b) = (a1 – b1)² + (3a2 – 3b2)²

• Become more complex as more examples are stored

Based on slide by Eric Eaton
(originally by Andrew Moore)
k-Nearest Neighbor (k-NN)
To classify example x:
  – Find the k nearest neighbors of x
  – Choose as label the majority label within the k nearest neighbors

How do we choose k?

Based on slide by David Kauchak

Impact of k

What is the role of k?

How does it relate to overfitting and underfitting?

How did we control this for decision trees?

Based on slide by David Kauchak
k-Nearest Neighbor (k-NN)
To classify example x:
  – Find the k nearest neighbors of x
  – Choose as label the majority label within the k nearest neighbors

How do we choose k?
• Often data-dependent and heuristic-based
  – common heuristic: choose 3, 5, 7 (an odd number)
• Use validation data (see the sketch below)
• In general, k that is too small or too big is bad

Based on slides by David Kauchak and Piyush Rai
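A hedged sketch of the validation-data approach: it assumes the knn_classify function from the earlier sketch is in scope, and tries a handful of small odd values of k.

    def choose_k(training_set, validation_set, candidate_ks=(1, 3, 5, 7, 9)):
        """Pick the k with the highest accuracy on held-out validation data."""
        def accuracy(k):
            correct = sum(1 for x, y in validation_set
                          if knn_classify(x, training_set, k=k) == y)
            return correct / len(validation_set)
        return max(candidate_ks, key=accuracy)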

k-Nearest Neighbor (k-NN)
To classify example x:
  – Find the k nearest neighbors of x
  – Choose as label the majority label within the k nearest neighbors

Any variants?
• Fixed distance
  – instead of k-NN, count the majority from all examples within a fixed distance (radius-based neighbors)
• Weighted
  – instead of treating all examples equally, weight the "vote" of examples so that closer examples have more vote/weight (often using some sort of exponential decay); both variants are sketched below

Based on slide by David Kauchak
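Minimal sketches of both variants, with my own function names; the exp(-distance) weighting is just one possible choice of exponential decay, not something the slides prescribe.

    import math
    from collections import Counter, defaultdict

    def euclidean(a, b):
        # same distance function as in the earlier k-NN sketch
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def radius_classify(x, training_set, radius=2.0):
        """Majority label over all training examples within a fixed distance of x."""
        labels = [label for features, label in training_set
                  if euclidean(x, features) <= radius]
        return Counter(labels).most_common(1)[0][0] if labels else None

    def weighted_knn_classify(x, training_set, k=3):
        """Weight each neighbor's vote by exp(-distance), so closer examples count more."""
        neighbors = sorted(training_set, key=lambda ex: euclidean(x, ex[0]))[:k]
        votes = defaultdict(float)
        for features, label in neighbors:
            votes[label] += math.exp(-euclidean(x, features))
        return max(votes, key=votes.get)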
kNN Problems and ML Terminology
Learning Goals

• Describe how to speed up kNN
• Define non-parametric and parametric and describe the differences
• Describe the curse of dimensionality

Speeding up k-NN
• k-NN is a "lazy" learning algorithm
  – it does virtually nothing at training time
• But classification/prediction can be costly when the training set is large
  – for n training examples and d features, how many computations are required for each test example?
• Two strategies for alleviating this weakness
  – edited nearest neighbor: do not retain every training instance
  – k-d tree: use a smart data structure to look up nearest neighbors (see the sketch below)
Based on slide by David Page
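A minimal sketch of the k-d tree strategy using SciPy's cKDTree; SciPy and the random illustrative data are assumptions of mine, not mentioned on the slides.

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    X_train = rng.random((5000, 3))   # n = 5000 illustrative training examples, d = 3 features

    # "Training": build the tree once; queries then avoid scanning all n examples.
    tree = cKDTree(X_train)

    # Prediction: find the 5 nearest neighbors of a test example.
    distances, indices = tree.query(rng.random(3), k=5)
    print(distances, indices)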
Aside: Non-parametric vs Parametric
• non-parametric method
  – not based on parameterized families of probability distributions – makes no assumptions about the distributions of the variables being assessed
  – complexity grows with the amount of training data
  – (non-parametric does not mean "no parameters") both DT and kNN are non-parametric
• parametric method
  – makes inferences about the parameters of the underlying data-generating distribution
  – has a fixed number of parameters

Curse of Dimensionality
• Our intuitions about space/distance do not scale with dimensions!
• NN breaks down in high-dimensional spaces because the "neighborhood" becomes very large
• Ex: Suppose we have 5000 points uniformly distributed in the unit hypercube and want to apply 5-NN to a test example at the origin
  – on average, we need to explore 5/5000 = 0.001 of the volume
  – 1-D: must go a distance of 0.001 on average
  – 2-D: must go sqrt(0.001) ≈ 0.0316 to get a square that contains 0.001 of the volume
  – n-D: in n dimensions, must go (0.001)^(1/n) >> 0.001 (checked numerically below)

Based on slides by David Kauchak and David Sontag
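A quick numeric check of that last bullet (my own calculation, not from the slides): the side length needed to capture 0.001 of the unit hypercube's volume grows rapidly with the dimension n.

    fraction = 5 / 5000   # want the expected 5 neighbors out of 5000 points -> 0.001 of the volume
    for n in (1, 2, 3, 10, 100):
        side = fraction ** (1 / n)
        print(f"{n:>3}-D: side length needed = {side:.3f}")
    # 1-D: 0.001, 2-D: ~0.032, 3-D: ~0.100, 10-D: ~0.501, 100-D: ~0.933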
Summary: k-NN
When to consider
• examples map to points in ℝ^d
• small number (< 20) of attributes per instance
• lots of training data
Advantages
• "training" is very fast
• simple to implement
• learns complex target functions
• adapts well to online learning
• does not lose information
• robust to noisy training data (when k > 1)
Disadvantages
• slow prediction
• easily fooled by irrelevant attributes
• sensitive to the range of feature values
• not much insight into the problem domain because there is no explicit model

Based on slides by David Sontag and David Page