Srinath Srinivasa
IIIT Bangalore
sri@iiitb.ac.in
Overview
• Why Data Mining?
• Data Mining concepts
• Data Mining algorithms
– Tabular data mining
– Association, Classification and Clustering
– Sequence data mining
– Streaming data mining
• Data Warehousing concepts
Why Data Mining
From a managerial perspective:
Analyzing trends
Wealth generation
Security
• No Query…
Data + Interestingness criteria = Hidden patterns
Data Mining
Type of data + Type of interestingness criteria = Type of patterns
Type of Data
• Tabular (Ex: Transaction data)
– Relational
– Multi-dimensional
• Spatial (Ex: Remote sensing data)
• Temporal (Ex: Log information)
– Streaming (Ex: multimedia, network traffic)
– Spatio-temporal (Ex: GIS)
• Tree (Ex: XML data)
• Graphs (Ex: WWW, BioMolecular data)
• Sequence (Ex: DNA, activity logs)
• Text, Multimedia …
Type of Interestingness
• Frequency
• Rarity
• Correlation
• Length of occurrence (for sequence and temporal data)
• Consistency
• Repeating / periodicity
• “Abnormal” behavior
• Other patterns of interestingness…
Data Mining vs Statistical Inference
Statistics:
Conceptual Model (Hypothesis) → Statistical Reasoning → “Proof” (Validation of Hypothesis)
Data Mining vs Statistical Inference
Data mining:
Data → Mining Algorithm (based on interestingness) → Pattern (model, rule, hypothesis) discovery
Data Mining Concepts
Associations and Item-sets:
[Figure: weather example with values Cloudy / Sunny / Overcast and Warm / Pleasant / Chilly, outcomes Play / Don’t Play, and the sample tuple (Cloudy, Pleasant, Play)]
Clustering Techniques
General Strategy:
Mining Sequence Data
Some Definitions:
• The order of items within an itemset does not matter, but the order of itemsets does
• A subsequence is a sequence with some itemsets deleted
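The two definitions of sequence and subsequence can be made concrete with a minimal sketch. It assumes each item is a single character and represents an itemset as a `frozenset`, so that order inside an itemset is ignored while the order of itemsets is kept. The helper names `make_sequence` and `is_subsequence` are illustrative and not from the slides. The check follows the definition as stated here (whole itemsets are deleted); some formulations also allow items to be dropped from within an itemset, which would replace the equality test with a subset test.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[str]


def make_sequence(*itemsets: str) -> List[Itemset]:
    """Build a sequence of itemsets; each string lists the items of one itemset."""
    return [frozenset(s) for s in itemsets]


def is_subsequence(candidate: Sequence[Itemset], sequence: Sequence[Itemset]) -> bool:
    """True if `candidate` can be obtained from `sequence` by deleting itemsets."""
    remaining = iter(sequence)
    # Each candidate itemset must equal some later itemset of `sequence`, in order.
    # The shared iterator guarantees that matches never move backwards.
    return all(any(wanted == itemset for itemset in remaining) for wanted in candidate)


seq = make_sequence("ab", "c", "ad")
print(is_subsequence(make_sequence("ba", "da"), seq))  # item order inside an itemset is ignored
print(is_subsequence(make_sequence("ad", "ab"), seq))  # itemset order is not ignored
```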
aabb, ababcac, abbac, …
[Figure: “Most general” state machine; edge labels a, b, c]
abc, aabc, aabbc, abbc
[Figure: “Most specific” state machine]
Mining Sequence Data
“Shortest-run Generalization” (Srinivasa and Spiliopoulou 2000)
Input sequences: aabcb, aac, aabc
[Figure: state machine generalized step by step from the input sequences]
Mining Streaming Data
Characteristics of streaming data:
• No storage
$var = \sum (num - avg)^2$
$A \leftarrow A + num^2$
$B \leftarrow B + 2 \cdot avg \cdot num$
$C \leftarrow C + avg^2$
$var = A - B + C$
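Because a stream cannot be stored, the average is usually not known while values are still arriving. A minimal sketch of one way around this is to keep only a count, a running sum and a running sum of squares, and to expand the squared deviation at the end: $\sum (num - avg)^2 = \sum num^2 - n \cdot avg^2$. The class name `RunningVariance` is illustrative, and dividing by the count (population variance) is an assumption; the slide does not say whether the sum is normalized.

```python
class RunningVariance:
    """One-pass variance over a stream, keeping three numbers and no history."""

    def __init__(self) -> None:
        self.count = 0        # how many values have been seen
        self.total = 0.0      # running sum of the values
        self.total_sq = 0.0   # running sum of the squared values

    def update(self, num: float) -> None:
        """Fold one new stream value into the accumulators."""
        self.count += 1
        self.total += num
        self.total_sq += num * num

    def variance(self) -> float:
        """Population variance: mean of squares minus square of the mean."""
        if self.count == 0:
            raise ValueError("no values seen yet")
        avg = self.total / self.count
        return self.total_sq / self.count - avg * avg


rv = RunningVariance()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    rv.update(value)
print(rv.variance())
```

This form can lose precision when the mean is large relative to the spread, because two nearly equal numbers are subtracted; Welford's incremental update is a common alternative in that case.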
Mining Streaming Data
-Consistency: (Srinivasa and Spiliopoulou, CoopIS 1999)
(1-)
[Figure: Inventory and Sales data pass through Data Cleaning into the Data Warehouse (OLAP)]
OLTP vs OLAP
Transactional Data (OLTP) | Analysis Data (OLAP)
Small or medium size databases | Very large databases
Transient data | Archival data
Frequent insertions and updates | Infrequent updates
Small query shadow | Very large query shadow
Normalization important to handle updates | De-normalization important to handle queries
Data Cleaning
• Performs logical transformation of transactional data to suit the data warehouse
• Model of operations → model of enterprise
• Usually a semi-automatic process
Data Cleaning
Transactional tables:
Orders (Order_id, Price, Cust_id)
Inventory (Prod_id, Price, Price_chng)
Sales (Cust_id, Cust_prof, Tot_sales)
Data Warehouse:
Customers, Products, Orders, Inventory, Price, Time
Multi-dimensional Data Model
[Figure: data cube with dimensions Price, Products, Orders, Customers and Time (Jan’01, Jun’01, Jan’02, Jun’02)]
Some MDBMS Operations
• Roll-up
– Collapse dimensions
• Drill-down
– Add dimensions
• Vector-distance operations (ex: clustering)
• Vector space browsing
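As a rough illustration of roll-up in the usual OLAP sense (aggregating a measure while one dimension is collapsed), the sketch below stores a small cube as a dictionary from coordinate tuples to a numeric measure. The function name `roll_up`, the dimension names and the sample figures are all hypothetical; drill-down is the reverse step and needs the finer-grained cube to be available.

```python
from collections import defaultdict
from typing import Dict, Tuple

Cube = Dict[Tuple[str, ...], float]


def roll_up(cube: Cube, dims: Tuple[str, ...], drop: str) -> Tuple[Cube, Tuple[str, ...]]:
    """Collapse dimension `drop` by summing the measure over all of its values."""
    axis = dims.index(drop)
    rolled: Dict[Tuple[str, ...], float] = defaultdict(float)
    for key, value in cube.items():
        # Remove the collapsed coordinate and accumulate into the coarser cell.
        rolled[key[:axis] + key[axis + 1:]] += value
    return dict(rolled), dims[:axis] + dims[axis + 1:]


# Hypothetical sales cube over (product, customer, time).
dims = ("product", "customer", "time")
cube = {
    ("pen", "alice", "2001-01"): 10.0,
    ("pen", "bob", "2001-01"): 5.0,
    ("pen", "alice", "2001-06"): 7.0,
    ("ink", "bob", "2001-06"): 3.0,
}
by_product_time, coarse_dims = roll_up(cube, dims, "customer")
print(coarse_dims)
print(by_product_time)
```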
Star Schema
Dim Dim
Tbl_1 Tbl_1