Вы находитесь на странице: 1из 4

PublishedonSTAT897D(https://onlinecourses.science.psu.

edu/stat857)
Home>12.6AgglomerativeClustering

12.6AgglomerativeClustering
Agglomerativeclusteringcanbeusedaslongaswehavepairwisedistancesbetweenany
twoobjects.Themathematicalrepresentationoftheobjectsareirrelevantwhenthepairwise
distancesaregiven.Henceagglomerativeclusteringreadilyappliesfornonvectordata.
Let'sdenotethedatasetasA

= x1 , , xn

Theagglomerativeclusteringmethodisalsocalledabottomupmethodasopposedtok
meansorkcentermethodsthataretopdown.Inatopdownmethod,adatasetisdivided
intomoreandmoreclusters.Inabottomupapproach,allthedatapointsaretreatedas
individualclusterstostartwithandgraduallymergedintobiggerandbiggerclusters.
Inagglomerativeclusteringclustersaregeneratedhierarchically.Westartbytakingevery
datapointasacluster.Thenwemergetwoclustersatatime.Inordertodecidewhichtwo
clusterstomerge,wecomparethepairwisedistancesbetweenanytwoclustersandpicka
pairwiththeminimumdistance.Oncewemergetwoclustersintoabiggerone,anewcluster
iscreated.Thedistancesbetweenthisnewclusterandtheexistingonesarenotgiven.Some
schemehastobeusedtoobtainthesedistancesbasedonthetwomergedclusters.Wecall
thistheupdateofdistances.Variousschemesofupdatingdistanceswillbedescribedshortly.
Wecankeepmergingclustersuntilallthepointsaremergedintoonecluster.Atreecanbe
usedtovisualizethemergingprocess.Thistreeiscalledadendrogram.
Thekeytechnicaldetailhereishowwedefinethebetweenclusterdistance.Atthebottom
levelofthedendrogramwhereeveryclusteronlycontainsasinglepoint,thebetweencluster
distanceissimplytheEuclidiandistancebetweenthepoints.(Insomeapplications,the
betweenpointdistancesarepregiven.)However,oncewemergetwopointsintoacluster
howdowegetthedistancebetweenthenewclusterandtheexistingclusters?
Theideaistosomehowaggregatethedistancesbetweentheobjectscontainedinthe
clusters.
Forclusterscontainingonlyonedatapoint,thebetweenclusterdistanceisthe
betweenobjectdistance.
Forclusterscontainingmultipledatapoints,thebetweenclusterdistanceisan
agglomerativeversionofthebetweenobjectdistances.Thereareanumberofwaysto
accomplishthis.Examplesoftheseaggregateddistancesincludetheminimumor
maximumbetweenobjectdistancesforpairsofobjectsacrossthetwoclusters.
Forexample,supposewehavetwoclustersandeachonehasthreepoints.Onethingwe
coulddoistofinddistancesbetweenthesepoints.Inthiscase,therewouldbenine

distancesbetweenpairsofpointsacrossthetwoclusters.Theminimum(ormaximum,or
average)oftheninedistancescanbeusedasthedistancebetweenthetwoclusters.

Howdowedecidewhichaggregationschemetouse?Dependingonhowweupdatethe
distances,dramaticallydifferentresultsmaycomeup.Therefore,itisalwaysgoodpracticeto
lookattheresultsusingscatterplotsorothervisualizationmethodsinsteadofblindlytaking
theoutputofanyalgorithm.Clusteringisinevitablysubjectivesincethereisnogoldstandard.
Normallytheagglomerativebetweenclusterdistancecanbecomputedrecursively.The
aggregationasexplainedabovesoundscomputationallyintensiveandseeminglyimpractical.
Ifwehavetwoverylargeclusters,wehavetocheckallthebetweenobjectdistancesforall
pairsacrossthetwoclusters.Thegoodnewsisthatformanyoftheseaggregateddistances,
wecanupdatethemrecursivelywithoutcheckingallthepairsofobjects.Theupdate
approachwillbedescribedsoon.
ExampleDistances
Let'sseehowwewouldupdatebetweenclusterdistances.Supposewehavetwoclustersr
andsandthesetwoclusters,notnecessarilysinglepointclusters,aremergedintoanew
clustert.Letkbeanyotherexistingcluster.Wewantthedistancebetweenthenewcluster,t,
andtheexistingcluster,k.
Wewillgetthisdistancebasedonthedistancebetweenkandthetwocomponentclusters,r
ands.
Becauserandshaveexisted,thedistancebetweenrandkandthedistancebetweensand
karealreadycomputed.DenotethedistancesbyD(r, k) andD(s, k) .
WelistbelowvariouswaystogetD(t, k) fromD(r, k) andD(s, k) .
Singlelinkclustering:
D(t, k) = min(D(r, k), D(s, k))

istheminimumdistancebetweentwoobjectsinclustertandk
respectively.Itcanbeshownthattheabovewayofupdatingdistanceis
equivalenttodefiningthebetweenclusterdistanceastheminimumdistance
betweentwoobjectsacrossthetwoclusters.
D(t, k)

Completelinkclustering:
D(t, k) = max(D(r, k), D(s, k))

D(t, k) = max(D(r, k), D(s, k))

Here,D(t, k) isthemaximumdistancebetweentwoobjectsinclustertandk.

Averagelinkageclustering:
Therearetwocaseshere,theunweightedcaseandtheweightedcase.
Unweightedcase:
nr

D(t, k) =

ns

D(r, k) +

nr + ns

D(s, k)

nr + ns

Hereweneedtousethenumberofpointsinclusterrandthenumberofpointsin
clusters(thetwoclustersthatarebeingmergedtogetherintoabiggercluster),
andcomputethepercentageofpointsinthetwocomponentclusterswithrespect
tothemergedcluster.Thetwodistances,D(r, k) andD(s, k) ,areaggregated
byaweightedsum.
Weightedcase:
1
D(t, k) =

1
D(r, k) +

D(s, k)
2

Insteadofusingtheweightproportionaltotheclustersize,weusethearithmetic
mean.Whilethismightlookmorelikeanunweightedcase,itisactuallyweighted
intermsofthecontributionfromindividualpointsinthetwoclusters.Whenthe
twoclustersareweightedhalfandhalf,anypoint(i.e.,object)inthesmaller
clusterindividuallycontributesmoretotheaggregateddistancethanapointinthe
largercluster.Incontrast,ifthelargerclusterisgivenproportionallyhigher
weight,anypointineitherclustercontributesequallytotheaggregateddistance.
Centroidclustering:
Acentroidiscomputedforeachclusterandthedistancebetweenclustersis
definedasthedistancebetweentheirrespectivecentroids.
Unweightedcase:
D(t, k) =

nr

ns

D(r, k) +

nr + ns

D(s, k)

nr + ns

nr ns
nr + ns

Weightedcase:
1
D(t, k) =

1
D(r, k) +

1
D(s, k)

D(r, s)
4

D(r, s)


Ward'sclustering:
Weupdatethedistanceusingthefollowingformula:
D(t, k) =

nr + nk
nr + ns + nk

D(r, k) +

ns + nk
nr + ns + nk

D(s, k)

nk

D(r, s)

nr + ns + nk

Thisapproachattemptstomergethetwoclustersforwhichthechangeinthetotal
variationisminimized.Thetotalvariationofaclusteringresultisdefinedasthe
sumofsquarederrorsbetweeneveryobjectandthecentroidoftheclusterit
belongsto.
Thedendrogramgeneratedbysinglelinkageclusteringtendstolooklikeachain.Clusters
generatedbycompletelinkagemaynotbewellseparated.Othermethodsareoften
intermediatebetweenthetwo.
SourceURL:https://onlinecourses.science.psu.edu/stat857/node/107

Вам также может понравиться