Академический Документы
Профессиональный Документы
Культура Документы
edu/stat857)
Home>12.6AgglomerativeClustering
12.6AgglomerativeClustering
Agglomerativeclusteringcanbeusedaslongaswehavepairwisedistancesbetweenany
twoobjects.Themathematicalrepresentationoftheobjectsareirrelevantwhenthepairwise
distancesaregiven.Henceagglomerativeclusteringreadilyappliesfornonvectordata.
Let'sdenotethedatasetasA
= x1 , , xn
Theagglomerativeclusteringmethodisalsocalledabottomupmethodasopposedtok
meansorkcentermethodsthataretopdown.Inatopdownmethod,adatasetisdivided
intomoreandmoreclusters.Inabottomupapproach,allthedatapointsaretreatedas
individualclusterstostartwithandgraduallymergedintobiggerandbiggerclusters.
Inagglomerativeclusteringclustersaregeneratedhierarchically.Westartbytakingevery
datapointasacluster.Thenwemergetwoclustersatatime.Inordertodecidewhichtwo
clusterstomerge,wecomparethepairwisedistancesbetweenanytwoclustersandpicka
pairwiththeminimumdistance.Oncewemergetwoclustersintoabiggerone,anewcluster
iscreated.Thedistancesbetweenthisnewclusterandtheexistingonesarenotgiven.Some
schemehastobeusedtoobtainthesedistancesbasedonthetwomergedclusters.Wecall
thistheupdateofdistances.Variousschemesofupdatingdistanceswillbedescribedshortly.
Wecankeepmergingclustersuntilallthepointsaremergedintoonecluster.Atreecanbe
usedtovisualizethemergingprocess.Thistreeiscalledadendrogram.
Thekeytechnicaldetailhereishowwedefinethebetweenclusterdistance.Atthebottom
levelofthedendrogramwhereeveryclusteronlycontainsasinglepoint,thebetweencluster
distanceissimplytheEuclidiandistancebetweenthepoints.(Insomeapplications,the
betweenpointdistancesarepregiven.)However,oncewemergetwopointsintoacluster
howdowegetthedistancebetweenthenewclusterandtheexistingclusters?
Theideaistosomehowaggregatethedistancesbetweentheobjectscontainedinthe
clusters.
Forclusterscontainingonlyonedatapoint,thebetweenclusterdistanceisthe
betweenobjectdistance.
Forclusterscontainingmultipledatapoints,thebetweenclusterdistanceisan
agglomerativeversionofthebetweenobjectdistances.Thereareanumberofwaysto
accomplishthis.Examplesoftheseaggregateddistancesincludetheminimumor
maximumbetweenobjectdistancesforpairsofobjectsacrossthetwoclusters.
Forexample,supposewehavetwoclustersandeachonehasthreepoints.Onethingwe
coulddoistofinddistancesbetweenthesepoints.Inthiscase,therewouldbenine
distancesbetweenpairsofpointsacrossthetwoclusters.Theminimum(ormaximum,or
average)oftheninedistancescanbeusedasthedistancebetweenthetwoclusters.
Howdowedecidewhichaggregationschemetouse?Dependingonhowweupdatethe
distances,dramaticallydifferentresultsmaycomeup.Therefore,itisalwaysgoodpracticeto
lookattheresultsusingscatterplotsorothervisualizationmethodsinsteadofblindlytaking
theoutputofanyalgorithm.Clusteringisinevitablysubjectivesincethereisnogoldstandard.
Normallytheagglomerativebetweenclusterdistancecanbecomputedrecursively.The
aggregationasexplainedabovesoundscomputationallyintensiveandseeminglyimpractical.
Ifwehavetwoverylargeclusters,wehavetocheckallthebetweenobjectdistancesforall
pairsacrossthetwoclusters.Thegoodnewsisthatformanyoftheseaggregateddistances,
wecanupdatethemrecursivelywithoutcheckingallthepairsofobjects.Theupdate
approachwillbedescribedsoon.
ExampleDistances
Let'sseehowwewouldupdatebetweenclusterdistances.Supposewehavetwoclustersr
andsandthesetwoclusters,notnecessarilysinglepointclusters,aremergedintoanew
clustert.Letkbeanyotherexistingcluster.Wewantthedistancebetweenthenewcluster,t,
andtheexistingcluster,k.
Wewillgetthisdistancebasedonthedistancebetweenkandthetwocomponentclusters,r
ands.
Becauserandshaveexisted,thedistancebetweenrandkandthedistancebetweensand
karealreadycomputed.DenotethedistancesbyD(r, k) andD(s, k) .
WelistbelowvariouswaystogetD(t, k) fromD(r, k) andD(s, k) .
Singlelinkclustering:
D(t, k) = min(D(r, k), D(s, k))
istheminimumdistancebetweentwoobjectsinclustertandk
respectively.Itcanbeshownthattheabovewayofupdatingdistanceis
equivalenttodefiningthebetweenclusterdistanceastheminimumdistance
betweentwoobjectsacrossthetwoclusters.
D(t, k)
Completelinkclustering:
D(t, k) = max(D(r, k), D(s, k))
Here,D(t, k) isthemaximumdistancebetweentwoobjectsinclustertandk.
Averagelinkageclustering:
Therearetwocaseshere,theunweightedcaseandtheweightedcase.
Unweightedcase:
nr
D(t, k) =
ns
D(r, k) +
nr + ns
D(s, k)
nr + ns
Hereweneedtousethenumberofpointsinclusterrandthenumberofpointsin
clusters(thetwoclustersthatarebeingmergedtogetherintoabiggercluster),
andcomputethepercentageofpointsinthetwocomponentclusterswithrespect
tothemergedcluster.Thetwodistances,D(r, k) andD(s, k) ,areaggregated
byaweightedsum.
Weightedcase:
1
D(t, k) =
1
D(r, k) +
D(s, k)
2
Insteadofusingtheweightproportionaltotheclustersize,weusethearithmetic
mean.Whilethismightlookmorelikeanunweightedcase,itisactuallyweighted
intermsofthecontributionfromindividualpointsinthetwoclusters.Whenthe
twoclustersareweightedhalfandhalf,anypoint(i.e.,object)inthesmaller
clusterindividuallycontributesmoretotheaggregateddistancethanapointinthe
largercluster.Incontrast,ifthelargerclusterisgivenproportionallyhigher
weight,anypointineitherclustercontributesequallytotheaggregateddistance.
Centroidclustering:
Acentroidiscomputedforeachclusterandthedistancebetweenclustersis
definedasthedistancebetweentheirrespectivecentroids.
Unweightedcase:
D(t, k) =
nr
ns
D(r, k) +
nr + ns
D(s, k)
nr + ns
nr ns
nr + ns
Weightedcase:
1
D(t, k) =
1
D(r, k) +
1
D(s, k)
D(r, s)
4
D(r, s)
Ward'sclustering:
Weupdatethedistanceusingthefollowingformula:
D(t, k) =
nr + nk
nr + ns + nk
D(r, k) +
ns + nk
nr + ns + nk
D(s, k)
nk
D(r, s)
nr + ns + nk
Thisapproachattemptstomergethetwoclustersforwhichthechangeinthetotal
variationisminimized.Thetotalvariationofaclusteringresultisdefinedasthe
sumofsquarederrorsbetweeneveryobjectandthecentroidoftheclusterit
belongsto.
Thedendrogramgeneratedbysinglelinkageclusteringtendstolooklikeachain.Clusters
generatedbycompletelinkagemaynotbewellseparated.Othermethodsareoften
intermediatebetweenthetwo.
SourceURL:https://onlinecourses.science.psu.edu/stat857/node/107