
Data Mining, Data Warehousing and Knowledge Discovery
Basic Algorithms and Concepts

Srinath Srinivasa
IIIT Bangalore
sri@iiitb.ac.in
Overview
• Why Data Mining?
• Data Mining concepts
• Data Mining algorithms
– Tabular data mining
– Association, Classification and Clustering
– Sequence data mining
– Streaming data mining
• Data Warehousing concepts
Why Data Mining?
From a managerial perspective:
• Analyzing trends
• Wealth generation
• Security
• Strategic decision making
Data Mining
• Look for hidden patterns and trends in the data that are not
  immediately apparent from summarizing the data

• No Query…

• …But an “Interestingness criteria”


Data Mining

  Data + Interestingness criteria = Hidden patterns

The mining task varies with the type of data, the type of
interestingness criteria, and the type of patterns to be discovered.
Type of Data
• Tabular (Ex: Transaction data)
– Relational
– Multi-dimensional
• Spatial (Ex: Remote sensing data)
• Temporal (Ex: Log information)
– Streaming (Ex: multimedia, network traffic)
– Spatio-temporal (Ex: GIS)
• Tree (Ex: XML data)
• Graphs (Ex: WWW, BioMolecular data)
• Sequence (Ex: DNA, activity logs)
• Text, Multimedia …
Type of Interestingness
• Frequency
• Rarity
• Correlation
• Length of occurrence (for sequence and temporal
data)
• Consistency
• Repeating / periodicity
• “Abnormal” behavior
• Other patterns of interestingness…
Data Mining vs Statistical Inference
Statistics:

  Conceptual Model (Hypothesis) → Statistical Reasoning → “Proof”
  (Validation of the Hypothesis)
Data Mining vs Statistical Inference
Data mining:

  Data → Mining Algorithm (based on interestingness) → Pattern discovery
  (model, rule, hypothesis)
Data Mining Concepts
Associations and Item-sets:

An association is a rule of the form: if X then Y. It is denoted as X → Y.

Example:
If India wins in cricket, sales of sweets go up.

For any rule, if X → Y and Y → X, then X and Y are called an
“interesting item-set”.

Example:
People buying school uniforms in June also buy school bags
(People buying school bags in June also buy school uniforms)
Data Mining Concepts
Support and Confidence:

The support for a rule R is the number of occurrences of R divided by
the total number of transactions.

The confidence of a rule X → Y is the number of occurrences of X and Y
together, divided by the number of occurrences of X.
Data Mining Concepts
Support and Confidence: (Example)

Transactions:
  Bag, Uniform, Crayons
  Books, Bag, Uniform
  Bag, Uniform, Pencil
  Bag, Pencil, Books
  Uniform, Crayons, Bag
  Bag, Pencil, Books
  Crayons, Uniform, Bag
  Books, Crayons, Bag
  Uniform, Crayons, Pencil
  Pencil, Uniform, Books

Support for {Bag, Uniform} = 5/10 = 0.5
Confidence for Bag → Uniform = 5/8 = 0.625
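The figures above can be checked with a short computation. A minimal Python
sketch, assuming the transactions are represented as sets (the helper names
are illustrative, not from the slides):

# Transactions from the example slide.
transactions = [
    {"Bag", "Uniform", "Crayons"}, {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"},  {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"}, {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"}, {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"}, {"Pencil", "Uniform", "Books"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the item-set.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # support(LHS union RHS) / support(LHS)
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"Bag", "Uniform"}, transactions))        # 0.5
print(confidence({"Bag"}, {"Uniform"}, transactions))   # 0.625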
Mining for Frequent Item-sets
The Apriori Algorithm:

Given a minimum required support s as the interestingness criterion:

1. Search for all individual elements (1-element item-sets) that have a
   minimum support of s
2. Repeat
   a. From the results of the previous search for i-element item-sets,
      search for all (i+1)-element item-sets that have a minimum support
      of s
   b. This becomes the set of all frequent (i+1)-element item-sets that
      are interesting
3. Until the item-set size reaches the maximum
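A minimal sketch of this level-wise loop, reusing the transactions and
support() helper from the previous sketch; candidate generation is
simplified to extending each frequent item-set with frequent single items:

def apriori(transactions, minsup):
    # Level 1: frequent individual items.
    items = {i for t in transactions for i in t}
    frequent = [{frozenset([i]) for i in items
                 if support(frozenset([i]), transactions) >= minsup}]
    k = 1
    while frequent[-1]:
        # Candidate (k+1)-item-sets: frequent k-item-sets extended by
        # frequent 1-item-sets.
        candidates = {a | b for a in frequent[-1] for b in frequent[0]
                      if len(a | b) == k + 1}
        frequent.append({c for c in candidates
                         if support(c, transactions) >= minsup})
        k += 1
    return [level for level in frequent if level]

levels = apriori(transactions, minsup=0.3)
# levels[2] contains frozenset({'Bag', 'Uniform', 'Crayons'})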
Mining for Frequent Item-sets
The Apriori Algorithm: (Example)

Let minimum support = 0.3

Transactions:
  Bag, Uniform, Crayons
  Books, Bag, Uniform
  Bag, Uniform, Pencil
  Bag, Pencil, Books
  Uniform, Crayons, Bag
  Bag, Pencil, Books
  Crayons, Uniform, Bag
  Books, Crayons, Bag
  Uniform, Crayons, Pencil
  Pencil, Uniform, Books

Interesting 1-element item-sets:
  {Bag}, {Uniform}, {Crayons}, {Pencil}, {Books}

Interesting 2-element item-sets:
  {Bag, Uniform}, {Bag, Crayons}, {Bag, Pencil}, {Bag, Books},
  {Uniform, Crayons}, {Uniform, Pencil}, {Pencil, Books}
Mining for Frequent Item-sets
The Apriori Algorithm: (Example)

Let minimum support = 0.3

(Transaction data as above.)

Interesting 3-element item-sets:
  {Bag, Uniform, Crayons}
Mining for Association Rules

Association rules are of the form A → B, which are directional.

Association rule mining requires two thresholds: minsup and minconf.

(Transaction data as above.)
Mining for Association Rules
Mining association rules using Apriori

General procedure:
1. Use Apriori to generate frequent item-sets of different sizes
2. At each iteration, divide each frequent item-set X into two parts,
   LHS and RHS. This represents a rule of the form LHS → RHS
3. The confidence of such a rule is support(X) / support(LHS)
4. Discard all rules whose confidence is less than minconf
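A sketch of steps 2-4, building on the apriori() and support() sketches
above: each frequent item-set is split into LHS → RHS pairs, and rules
below minconf are discarded:

from itertools import combinations

def association_rules(transactions, minsup, minconf):
    rules = []
    for level in apriori(transactions, minsup):
        for itemset in level:
            if len(itemset) < 2:
                continue
            for r in range(1, len(itemset)):
                for lhs in combinations(itemset, r):
                    lhs = frozenset(lhs)
                    rhs = itemset - lhs
                    conf = (support(itemset, transactions) /
                            support(lhs, transactions))
                    if conf >= minconf:
                        rules.append((set(lhs), set(rhs), round(conf, 3)))
    return rules

for lhs, rhs, conf in association_rules(transactions, 0.3, 0.7):
    print(lhs, "->", rhs, conf)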
Mining for Association Rules
Mining association rules using Apriori

Example:
The frequent item-set {Bag, Uniform, Crayons} has a support of 0.3.

This can be divided into the following rules:
  {Bag} → {Uniform, Crayons}
  {Bag, Uniform} → {Crayons}
  {Bag, Crayons} → {Uniform}
  {Uniform} → {Bag, Crayons}
  {Uniform, Crayons} → {Bag}
  {Crayons} → {Bag, Uniform}
Mining for Association Rules
Mining association rules using Apriori

Confidence for these rules is as follows:
  {Bag} → {Uniform, Crayons}        0.375
  {Bag, Uniform} → {Crayons}        0.6
  {Bag, Crayons} → {Uniform}        0.75
  {Uniform} → {Bag, Crayons}        0.428
  {Uniform, Crayons} → {Bag}        0.75
  {Crayons} → {Bag, Uniform}        0.75

If minconf is 0.7, then we have discovered the following rules…
Mining for Association Rules
Mining association rules using Apriori

People who buy a school bag and a set of crayons are likely to buy a
school uniform.

People who buy a school uniform and a set of crayons are likely to buy
a school bag.

People who buy just a set of crayons are likely to buy a school bag and
a school uniform as well.
Generalized Association Rules

Since customers can buy any number of items in one transaction, the
transaction relation would be in the form of a list of individual
purchases.

  Bill No.   Date         Item
  15563      23.10.2003   Books
  15563      23.10.2003   Crayons
  15564      23.10.2003   Uniform
  15564      23.10.2003   Crayons
Generalized Association Rules

A transaction for the purposes of data mining is obtained by performing
a GROUP BY of the table over various fields.

  (Same table as above: Bill No., Date, Item)
Generalized Association Rules

A GROUP BY over Bill No. would show frequent buying patterns across
different customers.
A GROUP BY over Date would show frequent buying patterns across
different days.

  (Same table as above: Bill No., Date, Item)
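For illustration, the grouping can be done directly in code before mining.
A sketch assuming the purchase list is a sequence of (bill_no, date, item)
tuples; the field names are illustrative:

from collections import defaultdict

purchases = [
    (15563, "23.10.2003", "Books"),
    (15563, "23.10.2003", "Crayons"),
    (15564, "23.10.2003", "Uniform"),
    (15564, "23.10.2003", "Crayons"),
]

def group_by(purchases, key_index):
    # GROUP BY the chosen column; each group becomes one "transaction".
    groups = defaultdict(set)
    for row in purchases:
        groups[row[key_index]].add(row[2])    # collect the Item column
    return list(groups.values())

by_bill = group_by(purchases, 0)   # patterns across customers (bills)
by_date = group_by(purchases, 1)   # patterns across days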
Classification and Clustering
Given a set of data elements:

Classification maps each data element to one of a set of


pre-determined classes based on the difference among
data elements belonging to different classes

Clustering groups data elements into different groups


based on the similarity between elements within a single
group
Classification Techniques
Decision Tree Identification

Classification problem:  Weather → Play (Yes, No)

  Outlook    Temp   Play?
  Sunny      30     Yes
  Overcast   15     No
  Sunny      16     Yes
  Cloudy     27     Yes
  Overcast   25     Yes
  Overcast   17     No
  Cloudy     17     No
  Cloudy     35     Yes
Classification Techniques
Hunt’s method for decision tree identification:

Given N element types and m decision classes:

1. For i ← 1 to N do
   a. Add element i to the (i-1)-element item-sets from the previous
      iteration
   b. Identify the set of decision classes for each item-set
   c. If an item-set has only one decision class, then that item-set is
      done; remove it from subsequent iterations
2. done
Classification Techniques
Decision Tree Identification Example

  Outlook    Temp       Play?
  Sunny      Warm       Yes
  Overcast   Chilly     No
  Sunny      Chilly     Yes
  Cloudy     Pleasant   Yes
  Overcast   Pleasant   Yes
  Overcast   Chilly     No
  Cloudy     Chilly     No
  Cloudy     Warm       Yes

Splitting on Outlook alone:
  Sunny    → Yes
  Cloudy   → Yes/No
  Overcast → Yes/No
Classification Techniques
Decision Tree Identification Example

(Data as above.)

Expanding the Cloudy branch on Temp:
  Cloudy, Warm     → Yes
  Cloudy, Chilly   → No
  Cloudy, Pleasant → Yes
Classification Techniques
Decision Tree Identification Example

(Data as above.)

Expanding the Overcast branch on Temp:
  Overcast, Warm     → (no examples)
  Overcast, Chilly   → No
  Overcast, Pleasant → Yes
Classification Techniques
Decision Tree Identification Example

Resulting decision tree:

  Outlook = Sunny    → Yes
  Outlook = Cloudy   → Temp: Warm → Yes, Pleasant → Yes, Chilly → No
  Outlook = Overcast → Temp: Pleasant → Yes, Chilly → No
Classification Techniques
Decision Tree Identification Example

• Top-down technique for decision tree identification

• The decision tree created is sensitive to the order in which items are
  considered

• If an N-item-set does not result in a clear decision, classification
  classes have to be modeled by rough sets.
Other Classification Algorithms

Quinlan’s depth-first strategy builds the decision tree in a depth-first
fashion, by considering all possible tests that give a decision and
selecting the test that gives the best information gain. It hence
eliminates tests that are inconclusive.

SLIQ (Supervised Learning in Quest), developed in the QUEST project at
IBM, uses a top-down breadth-first strategy to build a decision tree. At
each level in the tree, an entropy value is calculated for each node, and
the nodes having the lowest entropy values are selected and expanded.
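Both strategies rely on an impurity measure over the class labels at a
node. A sketch of entropy and information gain for a candidate test,
assuming each record is an (attribute-dictionary, label) pair (the data
below is a fragment of the earlier weather example):

from math import log2
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(records, attribute):
    # Gain = entropy before the split minus the weighted entropy after
    # splitting on the given attribute.
    labels = [label for _, label in records]
    before = entropy(labels)
    after = 0.0
    for value in {rec[attribute] for rec, _ in records}:
        subset = [label for rec, label in records if rec[attribute] == value]
        after += len(subset) / len(records) * entropy(subset)
    return before - after

weather = [({"Outlook": "Sunny", "Temp": "Warm"}, "Yes"),
           ({"Outlook": "Overcast", "Temp": "Chilly"}, "No"),
           ({"Outlook": "Sunny", "Temp": "Chilly"}, "Yes"),
           ({"Outlook": "Cloudy", "Temp": "Pleasant"}, "Yes")]
print(information_gain(weather, "Outlook"))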
Clustering Techniques

Clustering partitions the data set into clusters or equivalence classes.

Similarity among members of a class is greater than similarity among
members across classes.

Similarity measures: Euclidean distance or other application-specific
measures.
Euclidean Distance for Tables

(Figure: the rows (Overcast, Chilly, Don’t Play) and (Cloudy, Pleasant,
Play) plotted as points along the Outlook (Sunny/Cloudy/Overcast), Temp
(Warm/Pleasant/Chilly) and Play/Don’t Play axes; the distance between the
two rows is the Euclidean distance between the points.)
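One way to realise such a distance is to map each categorical value to a
position on its axis and apply the usual Euclidean formula. The encodings
below are illustrative assumptions, not from the slides:

from math import sqrt

# Illustrative orderings of the categorical axes.
OUTLOOK = {"Sunny": 0, "Cloudy": 1, "Overcast": 2}
TEMP = {"Chilly": 0, "Pleasant": 1, "Warm": 2}
PLAY = {"Don't Play": 0, "Play": 1}

def encode(row):
    outlook, temp, play = row
    return (OUTLOOK[outlook], TEMP[temp], PLAY[play])

def euclidean(row_a, row_b):
    a, b = encode(row_a), encode(row_b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(("Overcast", "Chilly", "Don't Play"),
                ("Cloudy", "Pleasant", "Play")))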
Clustering Techniques
General Strategy:

1. Draw a graph connecting items which are close to one another with
   edges.

2. Partition the graph into maximally connected subcomponents.
   a. Construct an MST for the graph
   b. Merge items that are connected by the minimum weight of the MST
      into a cluster
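A sketch of steps 1-2 using a distance cutoff: items closer than the
cutoff are connected, and the maximally connected subcomponents (found
with union-find) become the clusters. The cutoff and distance function
are assumptions; an MST-based variant would instead cut the heaviest MST
edges:

def graph_clusters(points, distance, cutoff):
    # Step 1: edges between items closer than the cutoff.
    # Step 2: maximally connected subcomponents via union-find.
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if distance(points[i], points[j]) < cutoff:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(points[i])
    return list(clusters.values())

print(graph_clusters([(0, 0), (0, 1), (5, 5)],
                     lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1]),
                     cutoff=2))     # [[(0, 0), (0, 1)], [(5, 5)]]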
Clustering Techniques
Clustering types:

Hierarchical clustering: Clusters are formed at different


levels by merging clusters at a lower level

Partitional clustering: Clusters are formed at only one level


Clustering Techniques
Nearest Neighbour Clustering Algorithm:

Given n elements x1, x2, …, xn, and a threshold t:

1. j ← 1, k ← 1, Clusters = {}
2. Repeat
   a. Find the nearest neighbour of xj among the elements already
      assigned to clusters
   b. Let the nearest neighbour be in cluster m
   c. If the distance to the nearest neighbour > t, then create a new
      cluster and k ← k+1; else assign xj to cluster m
   d. j ← j+1
3. until j > n
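A sketch of this procedure, assuming a distance function and threshold t
are supplied:

def nearest_neighbour_clustering(elements, distance, t):
    clusters = [[elements[0]]]                 # first element starts cluster 1
    for x in elements[1:]:
        # Nearest already-clustered element and the cluster it belongs to.
        best_dist, best_cluster = min(
            (distance(x, y), c) for c in range(len(clusters))
                                for y in clusters[c])
        if best_dist > t:
            clusters.append([x])               # create a new cluster
        else:
            clusters[best_cluster].append(x)   # assign x to cluster m
    return clusters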
Clustering Techniques
Iterative partitional clustering:

Given n elements x1, x2, …, xn, and k clusters, each with a center:

1. Assign each element to its closest cluster center
2. After all assignments have been made, compute the cluster centroid
   for each of the clusters
3. Repeat the above two steps with the new centroids until the algorithm
   converges
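A sketch of this iterative (k-means style) procedure for numeric vectors,
assuming the initial centres are given and a fixed iteration count stands
in for the convergence test:

def iterative_partitional(points, centres, iterations=20):
    for _ in range(iterations):
        # Step 1: assign each element to its closest centre.
        groups = [[] for _ in centres]
        for p in points:
            idx = min(range(len(centres)),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centres[c])))
            groups[idx].append(p)
        # Step 2: recompute each centroid from its members.
        centres = [tuple(sum(dim) / len(g) for dim in zip(*g)) if g
                   else centres[i]
                   for i, g in enumerate(groups)]
    return centres, groups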
Mining Sequence Data
Characteristics of Sequence Data:

• Collection of data elements which are ordered sequences

• In a sequence, each item has an index associated with it

• A k-sequence is a sequence of length k. The support for a sequence s
  of length j is the number of m-sequences (m ≥ j) which contain s as a
  subsequence

• Sequence data: transaction logs, DNA sequences, patient ailment
  history, …
Mining Sequence Data
Some Definitions:

• A sequence is a list of itemsets of finite length.
  Example: {pen, pencil, ink} {pencil, ink} {ink, eraser} {ruler, pencil}
  … the purchases of a single customer over time …

• The order of items within an itemset does not matter, but the order of
  itemsets matters

• A subsequence is a sequence with some itemsets deleted
Mining Sequence Data
Some Definitions:

• A sequence S’ = {a1, a2, …, am} is said to be contained within another
  sequence S, if S contains a subsequence {b1, b2, …, bm} such that
  a1 ⊆ b1, a2 ⊆ b2, …, am ⊆ bm.

• Hence, {pen} {pencil} {ruler, pencil} is contained in
  {pen, pencil, ink} {pencil, ink} {ink, eraser} {ruler, pencil}
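A sketch of the containment test: greedily match each itemset of S’ to a
later itemset of S that contains it:

def is_contained(s_prime, s):
    # Each itemset of s_prime must be a subset of some itemset of s,
    # in order and without reusing an itemset of s.
    pos = 0
    for a in s_prime:
        while pos < len(s) and not set(a) <= set(s[pos]):
            pos += 1
        if pos == len(s):
            return False
        pos += 1
    return True

S = [{"pen", "pencil", "ink"}, {"pencil", "ink"},
     {"ink", "eraser"}, {"ruler", "pencil"}]
print(is_contained([{"pen"}, {"pencil"}, {"ruler", "pencil"}], S))   # True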
Mining Sequence Data
Apriori Algorithm for Sequences:

1. L1 ← Set of all interesting 1-sequences
2. k ← 1
3. while Lk is not empty do
   a. Generate all candidate (k+1)-sequences
   b. Lk+1 ← Set of all interesting (k+1)-sequences
   c. k ← k+1
4. done
Mining Sequence Data
Generating Candidate Sequences:

Given L1, L2, …, Lk, candidate sequences for Lk+1 are generated as
follows:

For each sequence s in Lk, concatenate s with all new 1-sequences found
while generating Lk-1
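A sketch of the sequence version of Apriori for plain item sequences (as
in the example that follows), extending each interesting k-sequence with
every interesting 1-sequence and keeping those that meet minsup:

def seq_support(pattern, sequences):
    # Fraction of sequences that contain the pattern as a subsequence.
    def contains(seq, pat):
        pos = 0
        for ch in seq:
            if pos < len(pat) and ch == pat[pos]:
                pos += 1
        return pos == len(pat)
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

def apriori_sequences(sequences, minsup):
    ones = sorted({c for s in sequences for c in s})
    level = [p for p in ones if seq_support(p, sequences) >= minsup]
    all_levels = [level]
    while level:
        # Extend each interesting k-sequence with every interesting 1-sequence.
        candidates = {p + c for p in level for c in all_levels[0]}
        level = [p for p in sorted(candidates)
                 if seq_support(p, sequences) >= minsup]
        all_levels.append(level)
    return [l for l in all_levels if l]

data = ["abcde", "bdae", "aebd", "be", "eabda",
        "aaaa", "baaa", "cbdb", "abbab", "abde"]
print(apriori_sequences(data, 0.5))   # [['a', 'b', 'd', 'e'], ['ab', 'bd']]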
Mining Sequence Data
Example: minsup = 0.5

Sequences:
  abcde
  bdae
  aebd
  be
  eabda
  aaaa
  baaa
  cbdb
  abbab
  abde

Interesting 1-sequences: a, b, d, e

Candidate 2-sequences:
  aa, ab, ad, ae
  ba, bb, bd, be
  da, db, dd, de
  ea, eb, ed, ee
Mining Sequence Data
Example: minsup = 0.5

Sequences: (as above)

Interesting 2-sequences: ab, bd

Candidate 3-sequences:
  aba, abb, abd, abe, aab, bab, dab, eab,
  bda, bdb, bdd, bde, bbd, dbd, ebd

Interesting 3-sequences = {}
Mining Sequence Data
Language Inference:

Given a set of sequences, consider each sequence as the behavioural trace
of a machine, and infer the machine that can display the given sequences
as behaviour.

Input set of sequences:          Output: an inferred state machine
  aabb
  ababcac
  abbac
Mining Sequence Data
• Inferring the syntax of a language given
its sentences
• Applications: discerning behavioural
patterns, emergent properties
discovery, collaboration modeling, …
• State machine discovery is the reverse
of state machine construction
• Discovery is “maximalist” in nature…
Mining Sequence Data
“Maximal” nature of language inference:

Input sequences: abc, aabc, aabbc, abbc

(Figure: the “most general” state machine is a single state looping on
a, b and c; the “most specific” state machine accepts exactly the input
sequences and nothing else.)
Mining Sequence Data
“Shortest-run Generalization” (Srinivasa and Spiliopoulou 2000)

Given a set of n sequences:

1. Create a state machine for the first sequence
2. for j ← 2 to n do
   a. Create a state machine for the jth sequence
   b. Merge this state machine into the existing one as follows:
      i.  Merge all halt states in the new state machine with the halt
          state in the existing state machine
      ii. If two or more paths to the halt state share the same suffix,
          merge the suffixes together into a single path
3. Done
Mining Sequence Data
“Shortest-run Generalization” (Srinivasa and Spiliopoulou 2000)

(Figure: the state machines for the sequences aabcb, aac and aabc are
merged one by one, sharing common suffixes, into a single machine.)
Mining Streaming Data
Characteristics of streaming data:

• Large data sequence

• No storage

• Often an infinite sequence

• Examples: Stock market quotes, streaming audio/video,


network traffic
Mining Streaming Data
Running mean:

Let n   = number of items read so far,
    avg = running average calculated so far.

On reading the next number num:
  avg ← (n*avg + num) / (n+1)
  n ← n+1
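A minimal sketch of this update:

class RunningMean:
    def __init__(self):
        self.n, self.avg = 0, 0.0

    def add(self, num):
        # avg <- (n*avg + num) / (n+1);  n <- n+1
        self.avg = (self.n * self.avg + num) / (self.n + 1)
        self.n += 1
        return self.avg

rm = RunningMean()
for x in [4, 8, 6]:
    print(rm.add(x))    # 4.0, 6.0, 6.0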
Mining Streaming Data
Running variance:

For a single number num,
  (num - avg)^2 = num^2 - 2*num*avg + avg^2

Let A   = sum of num^2 over all numbers read so far
    B   = sum of 2*num*avg over all numbers read so far
    C   = sum of avg^2 over all numbers read so far
    avg = average of the numbers read so far
    n   = number of numbers read so far
Mining Streaming Data
Running variance:

On reading the next number num:
  avg ← (avg*n + num) / (n+1)
  n ← n+1
  A ← A + num^2
  B ← B + 2*avg*num
  C ← C + avg^2

  var ← (A - B + C) / n
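A sketch of the accumulator scheme above. Because avg keeps changing while
the stream is read, A - B + C only approximates the sum of squared
deviations; Welford's update, kept alongside for comparison, maintains it
exactly:

class RunningVariance:
    def __init__(self):
        self.n = 0
        self.avg = 0.0
        self.A = self.B = self.C = 0.0   # sums of num^2, 2*num*avg, avg^2
        self.M2 = 0.0                    # Welford's exact sum of squared deviations

    def add(self, num):
        old_avg = self.avg
        self.avg = (self.avg * self.n + num) / (self.n + 1)
        self.n += 1
        # Slide-style accumulators (approximate, since avg drifts over time).
        self.A += num ** 2
        self.B += 2 * self.avg * num
        self.C += self.avg ** 2
        # Welford's update (exact).
        self.M2 += (num - old_avg) * (num - self.avg)

    def variance(self):
        approx = (self.A - self.B + self.C) / self.n
        exact = self.M2 / self.n
        return approx, exact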
Mining Streaming Data
α-Consistency: (Srinivasa and Spiliopoulou, CoopIS 1999)

Let streaming data be in the form of “frames”, where each frame comprises
one or more data elements.

Support for data element k within a frame is defined as
  (#occurrences of k) / (#elements in the frame)

α-Consistency for data element k is the “sustained” support for k over
all frames read so far, with a “leakage” of (1 − α)
Mining Streaming Data
α-Consistency: (Srinivasa and Spiliopoulou, CoopIS 1999)

  level_t(k) = (1 − α) * level_{t-1}(k) + α * sup(k)
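A sketch of this update for a stream of frames, each frame given as a list
of elements; the symbol α and the frame representation are assumptions
read off the formula:

def alpha_consistency(frames, alpha):
    # level_t(k) = (1 - alpha) * level_{t-1}(k) + alpha * sup(k)
    level = {}
    for frame in frames:
        elements = set(frame) | set(level)
        for k in elements:
            sup = frame.count(k) / len(frame)
            level[k] = (1 - alpha) * level.get(k, 0.0) + alpha * sup
    return level

print(alpha_consistency([["a", "a", "b"], ["a", "c", "c"]], alpha=0.3))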


Data Warehousing
• A platform for online analytical processing (OLAP)
• Warehouses collect transactional data from several
transactional databases and organize them in a fashion
amenable to analysis
• Also called “data marts”
• A critical component of the decision support system (DSS) of
enterprises
• Some typical DW queries:
– Which item sells best in each region that has retail outlets
– Which advertising strategy is best for South India?
– Which (age_group/occupation) in South India likes fast
food, and which (age_group/occupation) likes to cook?
Data Warehousing

(Figure: OLTP sources (order processing, inventory and sales systems)
feed their data through a data cleaning step into the data warehouse,
which serves OLAP.)
OLTP vs OLAP

  Transactional Data (OLTP)            Analysis Data (OLAP)
  Small or medium size databases       Very large databases
  Transient data                       Archival data
  Frequent insertions and updates      Infrequent updates
  Small query shadow                   Very large query shadow
  Normalization important to           De-normalization important to
  handle updates                       handle queries
Data Cleaning
• Performs logical transformation of transactional data to suit the data
  warehouse
• Model of operations → model of enterprise
• Usually a semi-automatic process
Data Cleaning

(Figure: operational tables such as Orders (Order_id, Price, Cust_id),
Inventory (Prod_id, Price, Price_chng) and Sales (Cust_id, Cust_prof,
Tot_sales) are cleaned and mapped onto the data warehouse dimensions
Customers, Products, Orders, Inventory, Price and Time.)
Multi-dimensional Data Model

(Figure: a data cube with Price plotted over the dimensions Products,
Orders, Customers and Time (Jan’01, Jun’01, Jan’02, Jun’02).)
Some MDBMS Operations
• Roll-up
  – Collapse (aggregate over) dimensions
• Drill-down
  – Add dimensions (expose finer detail)
• Vector-distance operations (ex: clustering)
• Vector space browsing
Star Schema

(Figure: a central fact table joined to several surrounding dimension
tables.)
WWW Based References
• http://www.kdnuggets.com/
• http://www.megaputer.com/
• http://www.almaden.ibm.com/cs/quest/index.html
• http://fas.sfu.ca/cs/research/groups/DB/sections/publication/kdd/kdd.html
• http://www.cs.su.oz.au/~thierry/ckdd.html
• http://www.dwinfocenter.org/
• http://datawarehouse.itoolbox.com/
• http://www.knowledgestorm.com/
• http://www.bitpipe.com/
• http://www.dw-institute.com/
• http://www.datawarehousing.com/
References
• R. Agrawal, R. Srikant: “Fast Algorithms for Mining Association
  Rules”, Proc. of the 20th Int'l Conference on Very Large Databases,
  Santiago, Chile, Sept. 1994.
• R. Agrawal, R. Srikant, ``Mining Sequential Patterns'', Proc. of the
Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March
1995.
• R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant:
"The Quest Data Mining System", Proc. of the 2nd Int'l Conference
on Knowledge Discovery in Databases and Data Mining, Portland,
Oregon, August, 1996.
• Surajit Chaudhuri, Umeshwar Dayal. An Overview of Data Warehousing
  and OLAP Technology. ACM SIGMOD Record, 26(1), March 1997.
• Jennifer Widom. Research Problems in Data Warehousing. Proc. of
Int’l Conf. On Information and Knowledge Management, 1995.
References
• A. Shoshani. OLAP and Statistical Databases: Similarities and
Differences. Proc. of ACM PODS 1997.
• Panos Vassiliadis, Timos Sellis. A Survey on Logical Models
for OLAP Databases. ACM SIGMOD Record
• M. Gyssens, Laks VS Lakshmanan. A Foundation for Multi-
Dimensional Databases. Proc of VLDB 1997, Athens, Greece.
• Srinath Srinivasa, Myra Spiliopoulou. Modeling Interactions Based on
  Consistent Patterns. Proc. of CoopIS 1999, Edinburgh, UK.
• Srinath Srinivasa, Myra Spiliopoulou. Discerning Behavioral
Patterns By Mining Transaction Logs. Proc. of ACM SAC 2000,
Como, Italy.
