Database System Concepts
Silberschatz, Korth and Sudarshan. See www.dbbook.com for conditions on reuse
Chapter 18: Data Analysis and Mining
• Decision Support Systems
• Data Analysis and OLAP
• Data Warehousing
• Data Mining
Database System Concepts, 5th Edition, Aug 26, 2005
Decision Support Systems
• Decision support systems are used to make business decisions, often based on data collected by online transaction processing systems.
• Examples of business decisions:
• Examples of data used for making decisions
Decision Support Systems: Overview
• Data analysis tasks are simplified by specialized tools and SQL extensions
• Statistical analysis packages (e.g., S++) can be interfaced with databases
  ◦ Statistical analysis is a large field, but not covered here
• Data mining seeks to discover knowledge automatically in the form of statistical rules and patterns from large databases.
• A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site.
Data Analysis and OLAP
• Online Analytical Processing (OLAP)
  ◦ Interactive analysis of data, allowing data to be summarized and viewed in different ways in an online fashion (with negligible delay)
• Data that can be modeled as dimension attributes and measure attributes are called multidimensional data.
  ◦ Measure attributes
  ◦ Dimension attributes
Cross Tabulation of sales by itemname and color
[Table: sales by itemname and color]
• The table above is an example of a cross-tabulation (crosstab), also referred to as a pivot table.
Relational Representation of Crosstabs
• Crosstabs can be represented as relations
• The value all is used to represent aggregates
• The SQL:1999 standard uses null values in place of all
Data Cube
• A data cube is a multidimensional generalization of a crosstab
• Can have n dimensions; we show 3 below
• Crosstabs can be used as views on a data cube
Online Analytical Processing
• Pivoting: changing the dimensions used in a crosstab is called pivoting
• Slicing: creating a crosstab for fixed values only
  ◦ Sometimes called dicing, particularly when values for multiple dimensions are fixed.
• Rollup: moving from finer-granularity data to a coarser granularity
• Drill down: the opposite operation, that of moving from coarser-granularity data to finer-granularity data
Hierarchies on Dimensions
• Hierarchy on dimension attributes: lets dimensions be viewed at different levels of detail
    ▪ E.g. the dimension DateTime can be used to aggregate by hour of day, date, day of week, month, quarter or year
Cross Tabulation With Hierarchy
• Crosstabs can be easily extended to deal with hierarchies
    ▪ Can drill down or roll up on a hierarchy
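OLAP Implementation
• The earliest OLAP systems used multidimensional arrays in memory to store data cubes, and are referred to as multidimensional OLAP (MOLAP) systems.
• OLAP implementations using only relational database features are called relational OLAP (ROLAP) systems.
• Hybrid systems, which store some summaries in memory and store the base data and other summaries in a relational database, are called hybrid OLAP (HOLAP) systems.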
OLAP Implementation (Cont.)
• Early OLAP systems precomputed all possible aggregates in order to provide online response
  ◦ [...] is cheaper than computing it from scratch
• Several optimizations available for computing multiple aggregates
Extended Aggregation in SQL:1999
• The cube operation computes the union of group by's on every subset of the specified attributes
• E.g. consider the query
• For each grouping, the result contains the null value for attributes not present in the grouping.
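The effect of the cube operation can be sketched outside SQL. The Python below is only an illustration, not the SQL:1999 implementation: it enumerates every subset of the grouping attributes and sums a measure for each, using None where SQL would return null. The sample rows, the attribute names and the measure name `number` are made-up assumptions.

```python
from itertools import combinations

def cube_groupings(attributes):
    """Return every subset of `attributes` as a tuple, largest subsets first."""
    groupings = []
    for size in range(len(attributes), -1, -1):
        groupings.extend(combinations(attributes, size))
    return groupings

def cube(rows, attributes, measure):
    """Sum `measure` for every grouping; attributes outside the grouping become None."""
    totals = {}
    for grouping in cube_groupings(attributes):
        for row in rows:
            key = tuple(row[a] if a in grouping else None for a in attributes)
            totals[key] = totals.get(key, 0) + row[measure]
    return totals

# Hypothetical sample data, for illustration only.
sales = [
    {"itemname": "skirt", "color": "dark", "number": 8},
    {"itemname": "skirt", "color": "pastel", "number": 35},
    {"itemname": "dress", "color": "dark", "number": 20},
]
totals = cube(sales, ["itemname", "color"], "number")
```

With n attributes there are 2^n groupings, which is why precomputing a full cube becomes expensive as dimensions are added.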
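Extended Aggregation (Cont.)
• Relational representation of crosstab that we saw earlier, but with null in place of all, can be computed by
• The function grouping() can be applied to an attribute
  ◦ Returns 1 if the value is a null value representing all, and returns 0 in all other cases.
• The function decode() can be used to replace such nulls by a value such as all
  ◦ E.g. replace itemname in first query by decode(grouping(itemname), 1, 'all', itemname)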
Extended Aggregation (Cont.)
• The rollup construct generates the union on every prefix of the specified list of attributes
• E.g.
• Rollup can be used to generate aggregates at multiple levels of a hierarchy.
• E.g., suppose table itemcategory(itemname, category) gives the category of each item. Then
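As a rough illustration of "every prefix", the Python sketch below lists the group-by lists that a rollup over a given attribute list would produce. The function name and the attribute names in the example are hypothetical.

```python
def rollup_groupings(attributes):
    """Return every prefix of `attributes`, longest first, ending with the empty grouping."""
    return [tuple(attributes[:n]) for n in range(len(attributes), -1, -1)]

# Illustrative attribute list (an assumption, not taken from a query in the text).
groupings = rollup_groupings(["itemname", "color", "size"])
```

A list of n attributes gives n + 1 groupings, compared with 2^n for cube.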
Extended Aggregation (Cont.)
• Multiple rollups and cubes can be used in a single group by clause
  ◦ Each generates a set of group by lists; the cross product of the sets gives the overall set of group by lists
• E.g.,
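The cross product of group-by lists can be sketched as follows. This is a minimal Python illustration under assumed inputs: two hypothetical rollups, one on (itemname) and one on (color, size), are combined by concatenating one list from each set.

```python
from itertools import product

def rollup_groupings(attributes):
    """Every prefix of `attributes`, longest first."""
    return [tuple(attributes[:n]) for n in range(len(attributes), -1, -1)]

def combine(*grouping_sets):
    """Cross product of several sets of group-by lists, concatenating each combination."""
    return [sum(parts, ()) for parts in product(*grouping_sets)]

# Hypothetical example: rollup(itemname) combined with rollup(color, size).
lists = combine(rollup_groupings(["itemname"]), rollup_groupings(["color", "size"]))
```

Here 2 lists from the first rollup times 3 from the second give 6 group-by lists overall.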
Ranking
• Ranking is done in conjunction with an order by specification.
• Given a relation studentmarks(studentid, marks) find the rank of each student.
    select studentid, rank() over (order by marks desc) as srank
    from studentmarks
• An extra order by clause is needed to get them in sorted order
• Ranking may leave gaps: e.g., if two students tie for the top mark, both have rank 1, and the next rank is 3
  ◦ dense_rank does not leave gaps, so next dense rank would be 2
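The difference between rank and dense_rank can be sketched in a few lines of Python. This is an illustration of the semantics only, assuming a descending order on marks; the function names mirror the SQL functions but are otherwise hypothetical.

```python
def rank(values):
    """Rank in descending order; ties share a rank and leave a gap after them."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def dense_rank(values):
    """Rank in descending order; ties share a rank and no gaps are left."""
    distinct = sorted(set(values), reverse=True)
    return [distinct.index(v) + 1 for v in values]

marks = [90, 90, 85]  # made-up marks with a tie at the top
```

For these marks, rank gives 1, 1, 3 while dense_rank gives 1, 1, 2.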
Ranking (Cont.)
• Ranking can be done within partitions of the data.
• Find the rank of students within each section.
Ranking (Cont.)
• Other ranking functions:
  ◦ percent_rank (within partition, if partitioning is done)
  ◦ cume_dist (cumulative distribution)
    ▪ fraction of tuples with preceding values
  ◦ row_number (nondeterministic in presence of duplicates)
• SQL:1999 permits the user to specify nulls first or nulls last
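Ranking (Cont.)
• For a given constant n, the ranking function ntile(n) takes the tuples in each partition in the specified order and divides them into n buckets with equal numbers of tuples.
• E.g.: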
Windowing
• Used to smooth out random variations.
• E.g.: moving average: Given sales values for each date, calculate for each date the average of the sales on that day, the previous day, and the next day
• Window specification in SQL:
  ◦ Given relation sales(date, value)
  ◦ All rows with values between current row value - 10 to current value
  ◦ range interval 10 day preceding
    ▪ Not including current row
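The moving average described above can be sketched in Python. This is an illustration only, assuming the sales values are already sorted by date with exactly one value per consecutive day; at the two ends the window simply has fewer rows.

```python
def moving_average(values):
    """Average of each value with its previous and next neighbour, where they exist."""
    averages = []
    for i in range(len(values)):
        window = values[max(0, i - 1): i + 2]  # previous, current, next
        averages.append(sum(window) / len(window))
    return averages

daily_sales = [10, 20, 30, 40]  # made-up values, one per day
smoothed = moving_average(daily_sales)
```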
Windowing (Cont.)
• Can do windowing within partitions
• E.g. Given a relation transaction(accountnumber, datetime, value), where value is positive for a deposit and negative for a withdrawal
  ◦ Find total balance of each account after each transaction on the account
      select accountnumber, datetime,
             sum(value) over
                 (partition by accountnumber
                  order by datetime
                  rows unbounded preceding)
             as balance
      from transaction
      order by accountnumber, datetime
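What the partitioned window computes can be mimicked procedurally. The Python sketch below is an assumption-laden illustration, not how a database evaluates the query: transactions are plain (accountnumber, datetime, value) tuples, and datetimes are represented by sortable integers.

```python
from collections import defaultdict

def running_balances(transactions):
    """Return (accountnumber, datetime, balance) rows ordered by account, then time."""
    balance = defaultdict(int)  # one running total per partition (account)
    result = []
    for account, when, value in sorted(transactions, key=lambda t: (t[0], t[1])):
        balance[account] += value
        result.append((account, when, balance[account]))
    return result

# Hypothetical transactions: deposits are positive, withdrawals negative.
rows = running_balances([("A", 2, -30), ("A", 1, 100), ("B", 1, 50)])
```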
Data Warehousing
• Data sources often store only current data, not historical data
• Corporate decision making requires a unified view of all organizational data, including historical data
• A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site
[Figure: Data Warehousing]
Design Issues
• When and how to gather data
• What schema to use
  ◦ Schema integration
More Warehouse Design Issues
• Data cleansing
• How to propagate updates
• What data to summarize
Warehouse Schemas
• Dimension values are usually encoded using small integers and mapped to full values via dimension tables
• Resultant schema is called a star schema
  ◦ More complicated schema structures
    ▪ Snowflake schema: multiple levels of dimension tables
    ▪ Constellation: multiple fact tables
[Figure: Data Warehouse Schema]
Data Mining
• Data mining is the process of semi-automatically analyzing large databases to find useful patterns
• Prediction based on past history
• Some examples of prediction mechanisms:
  ◦ Regression formulae
Data Mining (Cont.)
• Descriptive Patterns
  ◦ Associations
    ▪ Associations may be used as a first step in detecting causation
  ◦ Clusters
Classification Rules
• Classification rules help assign new objects to classes.
  ◦ E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
• Classification rules for above example could use a variety of data, such as educational level, salary, age, etc.
• Rules are not necessarily exact: there may be some misclassifications
• Classification rules can be shown compactly as a decision tree.
[Figure: Decision Tree]
Construction of Decision Trees
• Training set: a data sample in which the classification is already known.
• Greedy top-down generation of decision trees.
Best Splits
• Pick best attributes and conditions on which to partition
• The purity of a set S of training instances can be measured quantitatively in several ways.
  ◦ Notation: number of classes = k, number of instances = |S|, fraction of instances in class i = p_i.
• The Gini measure of purity is defined as
  \[ \mathrm{Gini}(S) = 1 - \sum_{i=1}^{k} p_i^2 \]
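The Gini measure, 1 minus the sum of the squared class fractions, is straightforward to compute. The Python sketch below assumes a non-empty list of class labels as input; the labels in the example are made up.

```python
from collections import Counter

def gini(labels):
    """Gini measure of a non-empty set of class labels: 1 - sum of squared class fractions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure = gini(["low", "low", "low"])   # a single class
mixed = gini(["low", "high"])        # two equally likely classes
```

A set with a single class scores 0, and the value grows towards 1 - 1/k as the k classes become equally frequent.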
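Best Splits (Cont.)
• Another measure of purity is the entropy measure, which is defined as
  \[ \mathrm{entropy}(S) = -\sum_{i=1}^{k} p_i \log_2 p_i \]
• When a set S is split into multiple sets S_i, i = 1, 2, …, r, we can measure the purity of the resultant set of sets as:
  \[ \mathrm{purity}(S_1, S_2, \ldots, S_r) = \sum_{i=1}^{r} \frac{|S_i|}{|S|}\, \mathrm{purity}(S_i) \]
• The information gain due to a particular split of S into S_i, i = 1, 2, …, r, is
  \[ \text{Information gain}(S, \{S_1, S_2, \ldots, S_r\}) = \mathrm{purity}(S) - \mathrm{purity}(S_1, S_2, \ldots, S_r) \]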
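A small Python sketch of the entropy measure, the weighted purity of a split, and the resulting information gain. It assumes each set is a non-empty list of class labels and uses entropy as the purity measure by default; the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a non-empty list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_purity(subsets, purity=entropy):
    """Weighted average of purity(S_i), weighted by |S_i| / |S|."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * purity(s) for s in subsets)

def information_gain(labels, subsets, purity=entropy):
    """purity(S) minus the weighted purity of the subsets S_1..S_r."""
    return purity(labels) - split_purity(subsets, purity)

labels = ["a", "a", "b", "b"]                        # made-up training labels
gain = information_gain(labels, [["a", "a"], ["b", "b"]])
```

A split that separates the two classes perfectly recovers the full 1 bit of entropy of the original set.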
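Best Splits (Cont.)
• Measure of cost of a split:
  \[ \text{Information content}(S, \{S_1, S_2, \ldots, S_r\}) = -\sum_{i=1}^{r} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} \]
• Information gain ratio:
  \[ \text{Information gain ratio} = \frac{\text{Information gain}(S, \{S_1, S_2, \ldots, S_r\})}{\text{Information content}(S, \{S_1, S_2, \ldots, S_r\})} \]
• The best split is the one that gives the maximum information gain ratio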
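The gain ratio can be sketched in Python as the information gain divided by the information content of the split. This is an illustration under assumptions: entropy is used as the purity measure, sets are lists of class labels, and a split with zero information content (a single non-empty subset) is given ratio 0 by convention.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, subsets):
    total = len(labels)
    return entropy(labels) - sum(len(s) / total * entropy(s) for s in subsets if s)

def information_content(subsets):
    """Cost of a split: entropy of the subset sizes themselves."""
    total = sum(len(s) for s in subsets)
    return -sum(len(s) / total * math.log2(len(s) / total) for s in subsets if s)

def gain_ratio(labels, subsets):
    content = information_content(subsets)
    if content == 0:
        return 0.0  # assumed convention: nothing was actually split
    return information_gain(labels, subsets) / content

labels = ["a", "a", "b", "b"]  # made-up training labels
two_way = gain_ratio(labels, [["a", "a"], ["b", "b"]])
four_way = gain_ratio(labels, [["a"], ["a"], ["b"], ["b"]])
```

Both splits have the same information gain, but the four-way split costs twice as much, so the ratio prefers the two-way split.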
Finding Best Splits
• Categorical attributes (with no meaningful order):
• Continuous-valued attributes (can be sorted in a meaningful order)
  ◦ Multiway split:
Decision Tree Construction Algorithm
Procedure GrowTree(S)
    Partition(S);
Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then
        return;
    for each attribute A
        evaluate splits on attribute A;
    Use best split found (across all attributes) to partition S into S1, S2, …, Sr;
    for i = 1, 2, …, r
        Partition(Si);
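The procedure above can be sketched as runnable Python under several simplifying assumptions that are not in the pseudocode: all attributes are categorical and split multiway, the Gini measure is used (lower means purer, so the stop test is written as an upper bound on impurity), and the best split is the one with the lowest weighted Gini rather than the best gain ratio. The names `max_impurity`, `min_size` and `classify` are hypothetical.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(rows, attributes, max_impurity=0.0, min_size=2):
    """rows: list of (features_dict, label). Returns a nested dict tree or a leaf label."""
    labels = [label for _, label in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the set is pure enough, too small, or no attributes remain.
    if gini(labels) <= max_impurity or len(rows) < min_size or not attributes:
        return majority
    best = None
    for attr in attributes:  # evaluate one multiway split per attribute
        groups = {}
        for features, label in rows:
            groups.setdefault(features[attr], []).append((features, label))
        weighted = sum(len(g) / len(rows) * gini([l for _, l in g])
                       for g in groups.values())
        if best is None or weighted < best[0]:
            best = (weighted, attr, groups)
    _, attr, groups = best
    if len(groups) == 1:  # the best attribute does not separate anything
        return majority
    remaining = [a for a in attributes if a != attr]
    return {"attribute": attr,
            "children": {value: partition(g, remaining, max_impurity, min_size)
                         for value, g in groups.items()}}

def classify(tree, features, default=None):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["children"].get(features[tree["attribute"]], default)
    return tree

# Made-up training set in which only income determines the class.
training = [
    ({"degree": "ms", "income": "high"}, "low-risk"),
    ({"degree": "bs", "income": "high"}, "low-risk"),
    ({"degree": "ms", "income": "low"}, "high-risk"),
    ({"degree": "bs", "income": "low"}, "high-risk"),
]
tree = partition(training, ["degree", "income"])
```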
Other Types of Classifiers
• Neural net classifiers are studied in artificial intelligence and are not covered here
• Bayesian classifiers use Bayes' theorem, which says
  \[ p(c_j \mid d) = \frac{p(d \mid c_j)\, p(c_j)}{p(d)} \]
Naïve Bayesian Classifiers
• Bayesian classifiers require
• To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
  p(d | cj) = p(d1 | cj) * p(d2 | cj) * … * p(dn | cj)
  ◦ Each of the p(di | cj) can be estimated from a histogram on di values for each class cj
    ▪ the histogram is computed from the training instances
  ◦ Histograms on multiple attributes are more expensive to compute and store
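The histogram-based estimate can be sketched in Python. This is a minimal illustration under assumptions: attributes are categorical, instances are (attribute tuple, class) pairs, and no smoothing is applied, so a value never seen with a class gives that class probability 0. The function names and training data are made up.

```python
from collections import Counter, defaultdict

def train(instances):
    """Build class counts and one histogram per (class, attribute position)."""
    class_counts = Counter(label for _, label in instances)
    histograms = defaultdict(Counter)
    for attrs, label in instances:
        for i, value in enumerate(attrs):
            histograms[(label, i)][value] += 1
    return class_counts, histograms

def predict(model, attrs):
    """Return the class cj maximizing p(cj) * product of p(di | cj)."""
    class_counts, histograms = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total                                # p(cj)
        for i, value in enumerate(attrs):
            score *= histograms[(label, i)][value] / count   # p(di | cj)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training instances: (income, owns_home) -> credit class.
model = train([
    (("high", "yes"), "good"),
    (("high", "no"), "good"),
    (("low", "no"), "bad"),
    (("low", "yes"), "bad"),
])
```

The denominator p(d) of Bayes' theorem is the same for every class, so it can be ignored when only the most likely class is wanted.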
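Regression
• Regression deals with the prediction of a value, rather than a class.
• One way is to infer coefficients a_0, a_1, a_2, …, a_n such that Y = a_0 + a_1 X_1 + a_2 X_2 + … + a_n X_n, where Y is the value to be predicted and X_1, …, X_n are the given variables
• Finding such a linear polynomial is called linear regression.
• The fit may only be approximate
• Regression aims to find coefficients that give the best possible fit.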
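For the simplest case of one input variable, a best fit in the least-squares sense can be computed in closed form. The Python sketch below assumes the model Y = a0 + a1 * X and at least two distinct X values; "best possible fit" is taken here to mean minimum squared error, which is one common choice.

```python
def fit_line(xs, ys):
    """Least-squares coefficients (a0, a1) for y = a0 + a1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1

# Made-up points lying exactly on y = 1 + 2x.
a0, a1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```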
Association Rules
• Retail shops are often interested in associations between different items that people buy.
• Associations information can be used in several ways.
• Association rules:
    bread ⇒ milk        DBConcepts, OSConcepts ⇒ Networks
  ◦ E.g. each transaction (sale) at a shop is an instance, and the set of all transactions is the population
Association Rules (Cont.)
• Rules have an associated support, as well as an associated confidence.
• Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
  ◦ E.g. suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule milk ⇒ screwdrivers is low.
• Confidence is a measure of how often the consequent is true when the antecedent is true.
  ◦ E.g. the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.
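Support and confidence can be computed directly from a list of transactions. The Python sketch below is an illustration with made-up purchases, each represented as a set of item names; it assumes the antecedent occurs in at least one transaction.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Ten hypothetical purchases: 5 include bread, and 4 of those also include milk.
purchases = [{"bread", "milk"}] * 4 + [{"bread"}] + [{"milk"}] * 5
rule_support = support(purchases, {"bread", "milk"})
rule_confidence = confidence(purchases, {"bread"}, {"milk"})
```

Here the rule bread ⇒ milk has support 0.4 and confidence 0.8.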
Finding Association Rules
3. Use large itemsets to generate association rules.
Finding Support
• Determine support of itemsets via a single pass on the set of transactions
  ◦ Large itemsets: sets with a high count at the end of the pass
• If memory is not enough to hold all counts for all itemsets, use multiple passes, considering only some itemsets in each pass.
• Optimization: Once an itemset is eliminated because its count (support) is too small, none of its supersets needs to be considered.
• The a priori technique to find large itemsets:
  ◦ Count support of all candidates
  ◦ Stop if there are no candidates
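The a priori idea, that no superset of an eliminated itemset needs to be considered, can be sketched in Python. This is a simplified illustration, not a tuned implementation: each pass counts candidates of one size, support is an absolute count with an assumed threshold `min_count`, and items are assumed to be sortable strings.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset contained in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}
    large = {}
    size = 1
    while candidates:  # stop if there are no candidates
        # One pass over the transactions: count support of all candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        large.update(survivors)
        size += 1
        # Next candidates: sets whose every subset one item smaller is large.
        items = sorted({item for c in survivors for item in c})
        candidates = {
            frozenset(combo)
            for combo in combinations(items, size)
            if all(frozenset(sub) in survivors for sub in combinations(combo, size - 1))
        }
    return large

# Made-up transactions for illustration.
baskets = [{"bread", "milk"}, {"bread", "milk", "jam"}, {"bread"}, {"milk", "jam"}]
large_itemsets = apriori(baskets, min_count=2)
```

In this example {bread, jam} appears only once, so {bread, jam, milk} is never counted at all.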
Other Types of Associations
• Basic association rules have several limitations
• Deviations from the expected probability are more interesting
  ◦ Positive correlation: co-occurrence is higher than predicted
  ◦ Negative correlation: co-occurrence is lower than predicted
• Sequence associations / correlations
• Deviations from temporal patterns
  ◦ Not surprising, part of a known pattern.
  ◦ Look for deviation from value predicted using past patterns
Clustering
• Clustering: Intuitively, finding clusters of points in the given data such that similar points lie in the same cluster
• Can be formalized using distance metrics in several ways
  ◦ Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized
    ▪ Centroid: point defined by taking average of coordinates in each dimension.
• Has been studied extensively in statistics, but on small data sets
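The centroid-based objective can be made concrete with a short Python sketch. It only evaluates a given grouping, it does not search for the best one; points are assumed to be equal-length coordinate tuples, Euclidean distance is assumed as the metric, and `math.dist` requires Python 3.8 or later.

```python
import math

def centroid(points):
    """Average of the coordinates in each dimension."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def average_distance(clusters):
    """Average distance of every point from the centroid of its own cluster."""
    total, count = 0.0, 0
    for points in clusters:
        c = centroid(points)
        total += sum(math.dist(p, c) for p in points)
        count += len(points)
    return total / count

# Made-up grouping of three points into k = 2 clusters.
clusters = [[(0, 0), (2, 0)], [(5, 5)]]
score = average_distance(clusters)
```

A clustering algorithm for this formulation would try different groupings and keep the one with the lowest score.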
Hierarchical Clustering
• Example from biological classification
  ◦ (the word classification here does not mean a prediction mechanism)
• Divisive clustering algorithms
Clustering Algorithms
• Clustering algorithms have been designed to handle very large datasets
• E.g. the Birch algorithm
  ◦ Merge clusters to reduce the number of clusters
Collaborative Filtering
• Goal: predict what movies/books/... a person may be interested in, on the basis of
• One approach based on repeated clustering
• Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest
Other Types of Mining
• Text mining: application of data mining to textual documents
• Data visualization systems help users examine large volumes of data and detect patterns visually