Chapter 18: Data Analysis and Mining

Database System Concepts
Silberschatz, Korth and Sudarshan. See www.dbbook.com for conditions on reuse.

Chapter 18: Data Analysis and Mining
• Decision Support Systems
• Data Analysis and OLAP
• Data Warehousing
• Data Mining

Database System Concepts, 5th Edition, Aug 26, 2005
Silberschatz, Korth and Sudarshan

Decision Support Systems
• Decision-support systems are used to make business decisions, often based on data collected by online transaction-processing systems.
• Examples of business decisions:
  ◦ What items to stock?
  ◦ What insurance premium to charge?
  ◦ To whom to send advertisements?
• Examples of data used for making decisions:
  ◦ Retail sales transaction details
  ◦ Customer profiles (income, age, gender, etc.)

Decision Support Systems: Overview
• Data analysis tasks are simplified by specialized tools and SQL extensions
  ◦ Example tasks:
    ▪ For each product category and each region, what were the total sales in the last quarter, and how do they compare with the same quarter last year?
    ▪ As above, for each product category and each customer category
• Statistical analysis packages (e.g., S++) can be interfaced with databases
  ◦ Statistical analysis is a large field, but is not covered here
• Data mining seeks to discover knowledge automatically, in the form of statistical rules and patterns, from large databases.
• A data warehouse archives information gathered from multiple sources and stores it under a unified schema, at a single site.
  ◦ Important for large businesses that generate data from multiple divisions, possibly at multiple sites
  ◦ Data may also be purchased externally


Data Analysis and OLAP
• Online Analytical Processing (OLAP)
  ◦ Interactive analysis of data, allowing data to be summarized and viewed in different ways in an online fashion (with negligible delay)
• Data that can be modeled as dimension attributes and measure attributes are called multidimensional data.
  ◦ Measure attributes
    ▪ measure some value
    ▪ can be aggregated upon
    ▪ e.g. the attribute number of the sales relation
  ◦ Dimension attributes
    ▪ define the dimensions on which measure attributes (or aggregates thereof) are viewed
    ▪ e.g. the attributes item_name, color, and size of the sales relation

Cross Tabulation of sales by item name and color
[Table: cross-tabulation of sales by item name and color]
• The table above is an example of a cross-tabulation (cross-tab), also referred to as a pivot table.
  ◦ Values for one of the dimension attributes form the row headers
  ◦ Values for another dimension attribute form the column headers
  ◦ Other dimension attributes are listed on top
  ◦ Values in individual cells are (aggregates of) the values of the measure attribute for the dimension-attribute values that specify the cell.


Relational Representation of Cross-tabs
• Cross-tabs can be represented as relations
• The value all is used to represent aggregates
• The SQL:1999 standard actually uses null values in place of all, despite the risk of confusion with regular null values

Data Cube
• A data cube is a multidimensional generalization of a cross-tab
• Can have n dimensions; we show 3 below
• Cross-tabs can be used as views on a data cube
[Figure: three-dimensional data cube]

Online Analytical Processing
• Pivoting: changing the dimensions used in a cross-tab
• Slicing: creating a cross-tab for fixed values only
  ◦ Sometimes called dicing, particularly when values for multiple dimensions are fixed.
• Rollup: moving from finer-granularity data to a coarser granularity
• Drill down: the opposite operation, that of moving from coarser-granularity data to finer-granularity data

Hierarchies on Dimensions
• Hierarchy on dimension attributes: lets dimensions be viewed at different levels of detail
  ◦ E.g. the dimension DateTime can be used to aggregate by hour of day, date, day of week, month, quarter or year

Cross Tabulation With Hierarchy
• Cross-tabs can be easily extended to deal with hierarchies
  ◦ Can drill down or roll up on a hierarchy

OLAP Implementation
• The earliest OLAP systems used multidimensional arrays in memory to store data cubes, and are referred to as multidimensional OLAP (MOLAP) systems.
• OLAP implementations using only relational database features are called relational OLAP (ROLAP) systems.
• Hybrid systems, which store some summaries in memory and store the base data and other summaries in a relational database, are called hybrid OLAP (HOLAP) systems.

OLAP Implementation (Cont.)
• Early OLAP systems precomputed all possible aggregates in order to provide online response
  ◦ Space and time requirements for doing so can be very high
    ▪ 2^n combinations of group by for n dimension attributes
  ◦ It suffices to precompute some aggregates, and compute others on demand from one of the precomputed aggregates
    ▪ Can compute aggregate on (itemname, color) from an aggregate on (itemname, color, size)
      - For all but a few non-decomposable aggregates such as median
      - This is cheaper than computing it from scratch
• Several optimizations are available for computing multiple aggregates
  ◦ Can compute aggregate on (itemname, color) from an aggregate on (itemname, color, size)
  ◦ Can compute aggregates on (itemname, color, size), (itemname, color) and (itemname) using a single sorting of the base data
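
To illustrate deriving a coarser aggregate from a finer one, here is a minimal Python sketch. It assumes a decomposable aggregate (sum) and a precomputed aggregate held in a dictionary; the sample values and the helper name `roll_up` are made up for illustration and are not from the slides.

```python
from collections import defaultdict

def roll_up(fine_totals, keep):
    """Sum a finer-grained aggregate down to the key positions listed in `keep`."""
    coarse = defaultdict(int)
    for key, total in fine_totals.items():
        coarse[tuple(key[i] for i in keep)] += total
    return dict(coarse)

# Hypothetical precomputed sum(number) grouped by (itemname, color, size).
by_item_color_size = {
    ("skirt", "dark", "small"): 2,
    ("skirt", "dark", "large"): 6,
    ("skirt", "white", "small"): 5,
    ("dress", "dark", "small"): 4,
}

# (itemname, color) is derived from the finer aggregate, not from the base data.
by_item_color = roll_up(by_item_color_size, keep=(0, 1))
# (itemname) is derived in turn from (itemname, color).
by_item = roll_up(by_item_color, keep=(0,))
```

This works because partial sums can be added together; the same shortcut does not apply to a non-decomposable aggregate such as median.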


Extended Aggregation in SQL:1999
• The cube operation computes the union of group by's on every subset of the specified attributes
• E.g. consider the query

    select itemname, color, size, sum(number)
    from sales
    group by cube(itemname, color, size)

  This computes the union of eight different groupings of the sales relation:

    { (itemname, color, size), (itemname, color), (itemname, size), (color, size),
      (itemname), (color), (size), () }

  where () denotes an empty group by list.
• For each grouping, the result contains the null value for attributes not present in the grouping.
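
The behaviour of cube can be sketched outside SQL. The following Python fragment is only an illustration: it enumerates every subset of the grouping attributes and uses None where SQL would produce null. The sample rows and function names are assumptions, and the base data is assumed to contain no real None values (otherwise the "all" marker would be ambiguous, which is exactly the confusion noted for nulls).

```python
from collections import defaultdict
from itertools import combinations

def cube_groupings(attrs):
    """All subsets of attrs, largest first: the groupings produced by cube."""
    return [combo for r in range(len(attrs), -1, -1)
            for combo in combinations(attrs, r)]

def cube(rows, attrs, measure):
    """Union of group-bys over every subset of attrs; None stands for 'all'."""
    result = defaultdict(int)
    for grouping in cube_groupings(attrs):
        for row in rows:
            key = tuple(row[a] if a in grouping else None for a in attrs)
            result[key] += row[measure]
    return dict(result)

# Hypothetical sales rows.
sales = [
    {"itemname": "skirt", "color": "dark", "size": "small", "number": 2},
    {"itemname": "skirt", "color": "white", "size": "large", "number": 5},
    {"itemname": "dress", "color": "dark", "size": "small", "number": 4},
]
totals = cube(sales, ("itemname", "color", "size"), "number")
```

With three attributes there are 2^3 = 8 groupings, and the key (None, None, None) holds the grand total corresponding to the empty group by list.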

Extended Aggregation (Cont.)
• The relational representation of the cross-tab that we saw earlier, but with null in place of all, can be computed by

    select itemname, color, sum(number)
    from sales
    group by cube(itemname, color)

• The function grouping() can be applied on an attribute
  ◦ Returns 1 if the value is a null value representing all, and returns 0 in all other cases.

    select itemname, color, size, sum(number),
           grouping(itemname) as itemnameflag,
           grouping(color) as colorflag,
           grouping(size) as sizeflag
    from sales
    group by cube(itemname, color, size)

• Can use the function decode() in the select clause to replace such nulls by a value such as all
  ◦ E.g. replace itemname in the first query by decode(grouping(itemname), 1, 'all', itemname)

Extended Aggregation (Cont.)
• The rollup construct generates the union of group by's on every prefix of the specified list of attributes
• E.g.

    select itemname, color, size, sum(number)
    from sales
    group by rollup(itemname, color, size)

  Generates the union of four groupings:

    { (itemname, color, size), (itemname, color), (itemname), () }

• Rollup can be used to generate aggregates at multiple levels of a hierarchy.
• E.g., suppose table itemcategory(itemname, category) gives the category of each item. Then

    select category, itemname, sum(number)
    from sales, itemcategory
    where sales.itemname = itemcategory.itemname
    group by rollup(category, itemname)

  would give a hierarchical summary by itemname and by category.

Extended Aggregation (Cont.)
• Multiple rollups and cubes can be used in a single group by clause
  ◦ Each generates a set of group by lists; the cross product of the sets gives the overall set of group by lists
• E.g.,

    select itemname, color, size, sum(number)
    from sales
    group by rollup(itemname), rollup(color, size)

  generates the groupings

    { (itemname), () } × { (color, size), (color), () }
    = { (itemname, color, size), (itemname, color), (itemname),
        (color, size), (color), () }
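
The cross product of grouping sets can be checked mechanically. This is a small Python sketch, with function names chosen here for illustration, that lists the prefixes generated by each rollup and then combines them.

```python
from itertools import product

def rollup_groupings(attrs):
    """Every prefix of attrs, longest first, ending with the empty grouping."""
    return [tuple(attrs[:i]) for i in range(len(attrs), -1, -1)]

def combine(*grouping_sets):
    """Cross product of grouping sets, concatenating one grouping from each."""
    return [sum(choice, ()) for choice in product(*grouping_sets)]

groupings = combine(rollup_groupings(["itemname"]),
                    rollup_groupings(["color", "size"]))
for g in groupings:
    print(g)
```

Two prefixes from the first rollup times three from the second give the six groupings listed above.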

Ranking
• Ranking is done in conjunction with an order by specification.
• Given a relation studentmarks(studentid, marks), find the rank of each student.

    select studentid, rank() over (order by marks desc) as srank
    from studentmarks

• An extra order by clause is needed to get them in sorted order

    select studentid, rank() over (order by marks desc) as srank
    from studentmarks
    order by srank

• Ranking may leave gaps: e.g. if 2 students have the same top mark, both have rank 1, and the next rank is 3
  ◦ dense_rank does not leave gaps, so the next dense rank would be 2
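
The difference between rank and dense_rank can be seen in a few lines of Python. This is an illustrative sketch for descending order, not how a database computes ranks (a real system would sort once rather than compare every pair); the marks are made up.

```python
def rank(values):
    """rank() for descending order: 1 + number of strictly larger values."""
    return [1 + sum(other > v for other in values) for v in values]

def dense_rank(values):
    """dense_rank(): 1 + number of distinct strictly larger values."""
    distinct = set(values)
    return [1 + sum(other > v for other in distinct) for v in values]

marks = [95, 95, 80, 70]
print(rank(marks))        # ties share rank 1, the next rank is 3
print(dense_rank(marks))  # ties share rank 1, the next rank is 2
```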

Ranking (Cont.)
• Ranking can be done within partitions of the data.
• Find the rank of students within each section.

    select studentid, section,
           rank() over (partition by section order by marks desc) as secrank
    from studentmarks, studentsection
    where studentmarks.studentid = studentsection.studentid
    order by section, secrank

• Multiple rank clauses can occur in a single select clause
• Ranking is done after applying the group by clause/aggregation

Ranking (Cont.)
• Other ranking functions:
  ◦ percent_rank (within partition, if partitioning is done)
  ◦ cume_dist (cumulative distribution)
    ▪ fraction of tuples with preceding values
  ◦ row_number (non-deterministic in presence of duplicates)
• SQL:1999 permits the user to specify nulls first or nulls last

    select studentid, rank() over (order by marks desc nulls last) as srank
    from studentmarks

Ranking (Cont.)
• For a given constant n, the ranking function ntile(n) takes the tuples in each partition in the specified order, and divides them into n buckets with equal numbers of tuples.
• E.g.:

    select threetile, sum(salary)
    from (select salary, ntile(3) over (order by salary) as threetile
          from employee) as s
    group by threetile
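
As a rough illustration of what ntile(3) followed by the group by computes, here is a Python sketch. When the number of tuples is not divisible by n, this sketch gives the extra tuples to the lowest-numbered buckets; that tie-breaking choice is an assumption of the sketch, and the salaries are made up.

```python
def ntile(sorted_values, n):
    """Assign bucket numbers 1..n to already-sorted values, as evenly as possible."""
    size, extra = divmod(len(sorted_values), n)
    buckets = []
    for b in range(1, n + 1):
        # The first `extra` buckets receive one additional tuple each.
        buckets.extend([b] * (size + (1 if b <= extra else 0)))
    return buckets

salaries = sorted([30, 10, 70, 50, 20, 60, 40])

# Equivalent of: select threetile, sum(salary) ... group by threetile
totals = {}
for salary, bucket in zip(salaries, ntile(salaries, 3)):
    totals[bucket] = totals.get(bucket, 0) + salary
```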

Windowing
• Used to smooth out random variations.
• E.g.: moving average: given sales values for each date, calculate for each date the average of the sales on that day, the previous day, and the next day
• Window specification in SQL:
  ◦ Given relation sales(date, value)

    select date, avg(value) over (order by date rows between 1 preceding and 1 following)
    from sales

• Examples of other window specifications:
  ◦ rows between unbounded preceding and current row
  ◦ rows unbounded preceding
  ◦ range between 10 preceding and current row
    ▪ All rows with values between current row value - 10 and current row value
  ◦ range interval 10 day preceding
    ▪ Not including current row

Windowing (Cont.)
• Can do windowing within partitions
• E.g. given a relation transaction(accountnumber, datetime, value), where value is positive for a deposit and negative for a withdrawal
  ◦ Find the total balance of each account after each transaction on the account

    select accountnumber, datetime,
           sum(value) over (partition by accountnumber
                            order by datetime
                            rows unbounded preceding) as balance
    from transaction
    order by accountnumber, datetime
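
The effect of partition by with rows unbounded preceding is a running total per account. A minimal Python sketch of the same computation follows; the account numbers, timestamps and amounts are invented for illustration, and timestamps are assumed to be strings that sort chronologically.

```python
from collections import defaultdict

def running_balances(transactions):
    """Return (account, datetime, balance) rows in account, datetime order."""
    balance = defaultdict(int)   # one running total per partition (account)
    out = []
    for account, when, value in sorted(transactions):
        balance[account] += value
        out.append((account, when, balance[account]))
    return out

# Hypothetical transactions: positive = deposit, negative = withdrawal.
transactions = [
    ("A-101", "2005-08-01 09:00", 500),
    ("A-102", "2005-08-01 10:00", 300),
    ("A-101", "2005-08-02 14:30", -200),
    ("A-101", "2005-08-03 11:15", 50),
]
rows = running_balances(transactions)
```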

Data Warehousing
• Data sources often store only current data, not historical data
• Corporate decision making requires a unified view of all organizational data, including historical data
• A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site
  ◦ Greatly simplifies querying, permits study of historical trends
  ◦ Shifts decision support query load away from transaction processing systems

Data Warehousing
[Figure: Data Warehousing]

Design Issues
• When and how to gather data
  ◦ Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g. at night)
  ◦ Destination-driven architecture: the warehouse periodically requests new information from data sources
  ◦ Keeping the warehouse exactly synchronized with data sources (e.g. using two-phase commit) is too expensive
    ▪ Usually OK to have slightly out-of-date data at the warehouse
    ▪ Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
• What schema to use
  ◦ Schema integration

More Warehouse Design Issues
• Data cleansing
  ◦ E.g. correct mistakes in addresses (misspellings, zip code errors)
  ◦ Merge address lists from different sources and purge duplicates
• How to propagate updates
  ◦ Warehouse schema may be a (materialized) view of schema from data sources
• What data to summarize
  ◦ Raw data may be too large to store online
  ◦ Aggregate values (totals/subtotals) often suffice
  ◦ Queries on raw data can often be transformed by the query optimizer to use aggregate values

Warehouse Schemas
• Dimension values are usually encoded using small integers and mapped to full values via dimension tables
• Resultant schema is called a star schema
  ◦ More complicated schema structures
    ▪ Snowflake schema: multiple levels of dimension tables
    ▪ Constellation: multiple fact tables

Data Warehouse Schema
[Figure: Data Warehouse Schema]

Data Mining
• Data mining is the process of semi-automatically analyzing large databases to find useful patterns
• Prediction based on past history
  ◦ Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ..) and past history
  ◦ Predict if a pattern of phone calling card usage is likely to be fraudulent
• Some examples of prediction mechanisms:
  ◦ Classification
    ▪ Given a new item whose class is unknown, predict to which class it belongs
  ◦ Regression formulae
    ▪ Given a set of mappings for an unknown function, predict the function result for a new parameter value

Data Mining (Cont.)
• Descriptive Patterns
  ◦ Associations
    ▪ Find books that are often bought by similar customers. If a new such customer buys one such book, suggest the others too.
  ◦ Associations may be used as a first step in detecting causation
    ▪ E.g. association between exposure to chemical X and cancer
  ◦ Clusters
    ▪ E.g. typhoid cases were clustered in an area surrounding a contaminated well
    ▪ Detection of clusters remains important in detecting epidemics

Classification Rules
• Classification rules help assign new objects to classes.
  ◦ E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
• Classification rules for the above example could use a variety of data, such as educational level, salary, age, etc.
  ◦ ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  ◦ ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
• Rules are not necessarily exact: there may be some misclassifications
• Classification rules can be shown compactly as a decision tree.

Decision Tree
[Figure: Decision Tree]

Construction of Decision Trees
• Training set: a data sample in which the classification is already known.
• Greedy top-down generation of decision trees.
  ◦ Each internal node of the tree partitions the data into groups based on a partitioning attribute, and a partitioning condition for the node
  ◦ Leaf node:
    ▪ all (or most) of the items at the node belong to the same class, or
    ▪ all attributes have been considered, and no further partitioning is possible.

Best Splits
• Pick best attributes and conditions on which to partition
• The purity of a set S of training instances can be measured quantitatively in several ways.
  ◦ Notation: number of classes = k, number of instances = |S|, fraction of instances in class i = p_i.
• The Gini measure of purity is defined as

    $\mathrm{Gini}(S) = 1 - \sum_{i=1}^{k} p_i^2$

  ◦ When all instances are in a single class, the Gini value is 0
  ◦ It reaches its maximum (of 1 - 1/k) if each class has the same number of instances.
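
The two boundary cases of the Gini measure are easy to verify numerically. The following Python sketch computes it from a list of class labels; the function name and sample labels are illustrative only.

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum over classes of (fraction of instances in the class)^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure = gini(["low"] * 4)                  # single class
even = gini(["low", "medium", "high"])    # k = 3 equally sized classes
```

Here `pure` is 0 and `even` equals 1 - 1/3, matching the minimum and maximum stated above.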

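Best Splits (Cont.)
• Another measure of purity is the entropy measure, which is defined as

    $\mathrm{entropy}(S) = -\sum_{i=1}^{k} p_i \log_2 p_i$

• When a set S is split into multiple sets S_i, i = 1, 2, ..., r, we can measure the purity of the resultant set of sets as:

    $\mathrm{purity}(S_1, S_2, \ldots, S_r) = \sum_{i=1}^{r} \frac{|S_i|}{|S|}\, \mathrm{purity}(S_i)$

• The information gain due to a particular split of S into S_i, i = 1, 2, ..., r, is

    $\text{Information-gain}(S, \{S_1, S_2, \ldots, S_r\}) = \mathrm{purity}(S) - \mathrm{purity}(S_1, S_2, \ldots, S_r)$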

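Best Splits (Cont.)
• Measure of cost of a split:

    $\text{Information-content}(S, \{S_1, S_2, \ldots, S_r\}) = -\sum_{i=1}^{r} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$

• Information gain ratio:

    $\text{Information-gain-ratio} = \frac{\text{Information-gain}(S, \{S_1, S_2, \ldots, S_r\})}{\text{Information-content}(S, \{S_1, S_2, \ldots, S_r\})}$

• The best split is the one that gives the maximum information gain ratio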
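
To make the split measures concrete, here is a Python sketch that uses entropy as the purity measure and evaluates information gain, information content and their ratio for one candidate split. The labels are hypothetical; the sketch assumes every subset is non-empty and that the split produces at least two subsets (otherwise the information content is 0 and the ratio is undefined).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """-sum p_i log2 p_i over the classes present in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_purity(subsets):
    """Weighted entropy of the subsets produced by a split."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * entropy(s) for s in subsets)

def information_gain(labels, subsets):
    return entropy(labels) - split_purity(subsets)

def information_content(subsets):
    """Cost of a split: entropy of the subset sizes themselves."""
    total = sum(len(s) for s in subsets)
    return -sum(len(s) / total * log2(len(s) / total) for s in subsets)

def gain_ratio(labels, subsets):
    return information_gain(labels, subsets) / information_content(subsets)

# Hypothetical training labels and a split that separates the classes perfectly.
labels = ["good", "good", "bad", "bad"]
split = [["good", "good"], ["bad", "bad"]]
ratio = gain_ratio(labels, split)
```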

Finding Best Splits
• Categorical attributes (with no meaningful order):
  ◦ Multi-way split, one child for each value
  ◦ Binary split: try all possible breakups of values into two sets, and pick the best
• Continuous-valued attributes (can be sorted in a meaningful order)
  ◦ Binary split:
    ▪ Sort values, try each as a split point
      - E.g. if values are 1, 10, 15, 25, split at ≤ 1, ≤ 10, ≤ 15
    ▪ Pick the value that gives the best split
  ◦ Multi-way split:
    ▪ A series of binary splits on the same attribute has roughly equivalent effect

Decision Tree Construction Algorithm

    Procedure GrowTree(S)
        Partition(S);

    Procedure Partition(S)
        // δp and δs are thresholds on purity and on set size
        if (purity(S) > δp or |S| < δs) then
            return;
        for each attribute A
            evaluate splits on attribute A;
        Use best split found (across all attributes) to partition
            S into S1, S2, ..., Sr;
        for i = 1, 2, ..., r
            Partition(Si);

Other Types of Classifiers
• Neural net classifiers are studied in artificial intelligence and are not covered here
• Bayesian classifiers use Bayes' theorem, which says

    $p(c_j \mid d) = \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$

  where
    p(c_j | d) = probability of instance d being in class c_j,
    p(d | c_j) = probability of generating instance d given class c_j,
    p(c_j) = probability of occurrence of class c_j, and
    p(d) = probability of instance d occurring

Naive Bayesian Classifiers
• Bayesian classifiers require
  ◦ computation of p(d | c_j)
  ◦ precomputation of p(c_j)
  ◦ p(d) can be ignored since it is the same for all classes
• To simplify the task, naive Bayesian classifiers assume attributes have independent distributions, and thereby estimate

    $p(d \mid c_j) = p(d_1 \mid c_j) \cdot p(d_2 \mid c_j) \cdots p(d_n \mid c_j)$

  ◦ Each of the p(d_i | c_j) can be estimated from a histogram on d_i values for each class c_j
    ▪ the histogram is computed from the training instances
  ◦ Histograms on multiple attributes are more expensive to compute and store
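
A naive Bayesian classifier built from per-attribute histograms can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: attributes are categorical, the training data is invented, and no smoothing is applied, so an attribute value never seen with a class gives that class probability 0.

```python
from collections import Counter, defaultdict

def train(instances, labels):
    """Count class frequencies and per-class histograms of each attribute value."""
    class_counts = Counter(labels)
    histograms = defaultdict(Counter)   # (attribute index, class) -> value counts
    for attrs, c in zip(instances, labels):
        for i, v in enumerate(attrs):
            histograms[(i, c)][v] += 1
    return class_counts, histograms

def classify(attrs, class_counts, histograms):
    """Pick the class maximizing p(c) * product of p(d_i | c); p(d) is ignored."""
    total = sum(class_counts.values())

    def score(c):
        p = class_counts[c] / total
        for i, v in enumerate(attrs):
            p *= histograms[(i, c)][v] / class_counts[c]
        return p

    return max(class_counts, key=score)

# Hypothetical training instances: (degree, income band) -> credit rating.
instances = [("masters", "high"), ("masters", "high"),
             ("bachelors", "medium"), ("bachelors", "low")]
labels = ["excellent", "excellent", "good", "good"]
model = train(instances, labels)
```

Calling `classify(("masters", "high"), *model)` then returns the class whose histograms best match the new instance.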

Regression
• Regression deals with the prediction of a value, rather than a class.
  ◦ Given values for a set of variables, X1, X2, ..., Xn, we wish to predict the value of a variable Y.
• One way is to infer coefficients a0, a1, a2, ..., an such that

    Y = a0 + a1 * X1 + a2 * X2 + ... + an * Xn

• Finding such a linear polynomial is called linear regression.
  ◦ In general, the process of finding a curve that fits the data is also called curve fitting.
• The fit may only be approximate
  ◦ because of noise in the data, or
  ◦ because the relationship is not exactly a polynomial
• Regression aims to find coefficients that give the best possible fit.
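
For the single-variable case Y = a0 + a1 * X, the coefficients can be computed in closed form. The Python sketch below assumes that "best possible fit" means minimizing the sum of squared errors (ordinary least squares); the slides do not fix a particular criterion, and the data points are made up.

```python
def fit_line(xs, ys):
    """Least-squares coefficients (a0, a1) for Y = a0 + a1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1

# Noisy points lying near a straight line: the fit is only approximate.
a0, a1 = fit_line([1, 2, 3, 4], [3.1, 4.9, 7.2, 8.8])
prediction = a0 + a1 * 5   # predicted Y for a new value X = 5
```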

Association Rules
• Retail shops are often interested in associations between different items that people buy.
  ◦ Someone who buys bread is quite likely also to buy milk
  ◦ A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
• Association information can be used in several ways.
  ◦ E.g. when a customer buys a particular book, an online shop may suggest associated books.
• Association rules:

    bread ⇒ milk
    DB-Concepts, OS-Concepts ⇒ Networks

  ◦ Left hand side: antecedent, right hand side: consequent
  ◦ An association rule must have an associated population; the population consists of a set of instances
    ▪ E.g. each transaction (sale) at a shop is an instance, and the set of all transactions is the population

Association Rules (Cont.)
• Rules have an associated support, as well as an associated confidence.
• Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
  ◦ E.g. suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule milk ⇒ screwdrivers is low.
• Confidence is a measure of how often the consequent is true when the antecedent is true.
  ◦ E.g. the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.
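
Support and confidence can be computed directly from a list of transactions. The Python sketch below uses a tiny invented set of purchases chosen so that bread ⇒ milk has a confidence of 80 percent; the function names are illustrative.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical purchases: 4 of the 5 that include bread also include milk.
purchases = [
    {"bread", "milk"},
    {"bread", "milk", "cereal"},
    {"bread", "milk"},
    {"bread", "milk", "screwdriver"},
    {"bread"},
]
conf_bread_milk = confidence(purchases, {"bread"}, {"milk"})
supp_milk_screwdriver = support(purchases, {"milk", "screwdriver"})
```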

Finding Association Rules
• We are generally only interested in association rules with reasonably high support (e.g. support of 2% or greater)
• Naive algorithm
  1. Consider all possible sets of relevant items.
  2. For each set find its support (i.e. count how many transactions purchase all items in the set).
     ▪ Large itemsets: sets with sufficiently high support
  3. Use large itemsets to generate association rules.
     1. From itemset A generate the rule A - {b} ⇒ b for each b ∈ A.
        ▪ Support of rule = support(A).
        ▪ Confidence of rule = support(A) / support(A - {b})

Finding Support
• Determine support of itemsets via a single pass on the set of transactions
  ◦ Large itemsets: sets with a high count at the end of the pass
• If memory is not enough to hold all counts for all itemsets, use multiple passes, considering only some itemsets in each pass.
• Optimization: once an itemset is eliminated because its count (support) is too small, none of its supersets needs to be considered.
• The a priori technique to find large itemsets:
  ◦ Pass 1: count support of all sets with just 1 item. Eliminate those items with low support
  ◦ Pass i: candidates: every set of i items such that all its (i-1)-item subsets are large
    ▪ Count support of all candidates
    ▪ Stop if there are no candidates
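
The a priori passes described above can be sketched as follows. This Python version is a simplified illustration, not a tuned implementation: it holds all transactions in memory, measures support as an absolute count (`min_count` is an assumed parameter), and uses invented sample data.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for itemsets found in at least min_count transactions."""
    def count(candidates):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {item for t in transactions for item in t}
    large = count({frozenset([item]) for item in items})   # pass 1
    result = dict(large)
    i = 2
    while large:
        # Pass i: candidates are i-item sets whose (i-1)-item subsets are all large.
        pool = sorted({item for s in large for item in s})
        candidates = {
            frozenset(c) for c in combinations(pool, i)
            if all(frozenset(sub) in large for sub in combinations(c, i - 1))
        }
        if not candidates:
            break                                           # stop: no candidates
        large = count(candidates)
        result.update(large)
        i += 1
    return result

# Hypothetical transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "cereal"},
    {"bread", "cereal"},
    {"milk", "cereal"},
    {"bread", "milk", "cereal", "jam"},
]
large_itemsets = apriori(transactions, min_count=3)
```

In this data "jam" is eliminated in pass 1, so no set containing it is ever counted, which is the superset optimization in action.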

Other Types of Associations
• Basic association rules have several limitations
• Deviations from the expected probability are more interesting
  ◦ E.g. if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both
  ◦ We are interested in positive as well as negative correlations between sets of items
    ▪ Positive correlation: co-occurrence is higher than predicted
    ▪ Negative correlation: co-occurrence is lower than predicted
• Sequence associations/correlations
  ◦ E.g. whenever bonds go up, stock prices go down in 2 days
• Deviations from temporal patterns
  ◦ E.g. deviation from a steady growth
  ◦ E.g. sales of winter wear go down in summer
    ▪ Not surprising, part of a known pattern.
    ▪ Look for deviation from value predicted using past patterns

Clustering
• Clustering: intuitively, finding clusters of points in the given data such that similar points lie in the same cluster
• Can be formalized using distance metrics in several ways
  ◦ Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized
    ▪ Centroid: point defined by taking the average of coordinates in each dimension.
  ◦ Another metric: minimize average distance between every pair of points in a cluster
• Has been studied extensively in statistics, but on small data sets
  ◦ Data mining systems aim at clustering techniques that can handle very large data sets
  ◦ E.g. the Birch clustering algorithm (more shortly)
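
The centroid-based objective can be evaluated for any proposed grouping. The Python sketch below only scores a given clustering; it does not search for the best one. It assumes Euclidean distance (via `math.dist`, Python 3.8+) and uses invented two-dimensional points.

```python
from math import dist

def centroid(points):
    """Average of the coordinates in each dimension."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def average_distance_to_centroid(clusters):
    """Mean distance of every point from the centroid of its assigned cluster."""
    total, count = 0.0, 0
    for points in clusters:
        c = centroid(points)
        total += sum(dist(p, c) for p in points)
        count += len(points)
    return total / count

# Two ways of grouping the same four points into k = 2 clusters.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]
tight_score = average_distance_to_centroid(tight)
loose_score = average_distance_to_centroid(loose)
```

The grouping with the lower score (here `tight`) is the better clustering under this metric.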

Hierarchical Clustering
• Example from biological classification
  ◦ (the word classification here does not mean a prediction mechanism)

                      chordata
          mammalia               reptilia
      leopards  humans       snakes  crocodiles

• Other examples: Internet directory systems (e.g. Yahoo, more on this later)
• Agglomerative clustering algorithms
  ◦ Build small clusters, then cluster small clusters into bigger clusters, and so on
• Divisive clustering algorithms
  ◦ Start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones

Clustering Algorithms
• Clustering algorithms have been designed to handle very large datasets
• E.g. the Birch algorithm
  ◦ Main idea: use an in-memory R-tree to store points that are being clustered
  ◦ Insert points one at a time into the R-tree, merging a new point with an existing cluster if it is less than some distance δ away
  ◦ If there are more leaf nodes than fit in memory, merge existing clusters that are close to each other
  ◦ At the end of the first pass we get a large number of clusters at the leaves of the R-tree
    ▪ Merge clusters to reduce the number of clusters

Collaborative Filtering
• Goal: predict what movies/books/... a person may be interested in, on the basis of
  ◦ Past preferences of the person
  ◦ Other people with similar past preferences
  ◦ The preferences of such people for a new movie/book/...
• One approach based on repeated clustering
  ◦ Cluster people on the basis of preferences for movies
  ◦ Then cluster movies on the basis of being liked by the same clusters of people
  ◦ Again cluster people based on their preferences for (the newly created clusters of) movies
  ◦ Repeat above till equilibrium
• Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest

Other Types of Mining
• Text mining: application of data mining to textual documents
  ◦ cluster Web pages to find related pages
  ◦ cluster pages a user has visited to organize their visit history
  ◦ classify Web pages automatically into a Web directory
• Data visualization systems help users examine large volumes of data and detect patterns visually
  ◦ Can visually encode large amounts of information on a single screen
  ◦ Humans are very good at detecting visual patterns
