
Data Mining, Data Warehousing and Knowledge Discovery
Basic Algorithms and Concepts

Srinath Srinivasa
IIIT Bangalore
sri@iiitb.ac.in
Overview
• Why Data Mining?
• Data Mining concepts
• Data Mining algorithms
– Tabular data mining
– Association, Classification and Clustering
– Sequence data mining
– Streaming data mining
• Data Warehousing concepts
Why Data Mining?
From a managerial perspective:
• Analyzing trends
• Wealth generation
• Security
• Strategic decision making
Data Mining
• Look for hidden patterns and trends in the data that are not
  immediately apparent from summarizing the data

• No Query…

• …But an “Interestingness criteria”


Data Mining

  Data + Interestingness criteria = Hidden patterns

The mining task varies with the type of data, the type of
interestingness criteria, and the type of patterns to be discovered.
Type of Data
• Tabular (Ex: Transaction data)
– Relational
– Multi-dimensional
• Spatial (Ex: Remote sensing data)
• Temporal (Ex: Log information)
– Streaming (Ex: multimedia, network traffic)
– Spatio-temporal (Ex: GIS)
• Tree (Ex: XML data)
• Graphs (Ex: WWW, BioMolecular data)
• Sequence (Ex: DNA, activity logs)
• Text, Multimedia …
Type of Interestingness
• Frequency
• Rarity
• Correlation
• Length of occurrence (for sequence and temporal
data)
• Consistency
• Repeating / periodicity
• “Abnormal” behavior
• Other patterns of interestingness…
Data Mining vs Statistical Inference
Statistics:

  Conceptual Model (Hypothesis) → Statistical Reasoning → “Proof”
  (Validation of the Hypothesis)
Data Mining vs Statistical Inference
Data mining:

  Data → Mining Algorithm (based on interestingness) → Pattern discovery
  (model, rule, hypothesis)
Data Mining Concepts
Associations and Item-sets:

An association is a rule of the form: if X then Y. It is denoted as X → Y.

Example:
If India wins in cricket, sales of sweets go up.

For any rule, if X → Y and Y → X, then X and Y are called an
“interesting item-set”.

Example:
People buying school uniforms in June also buy school bags
(People buying school bags in June also buy school uniforms)
Data Mining Concepts
Support and Confidence:

The support for a rule R is the number of occurrences of R divided by
the total number of transactions.

The confidence of a rule X → Y is the number of occurrences of X and Y
together, divided by the number of occurrences of X.
Data Mining Concepts
Support and Confidence: (Example)

Transactions:
  Bag, Uniform, Crayons
  Books, Bag, Uniform
  Bag, Uniform, Pencil
  Bag, Pencil, Books
  Uniform, Crayons, Bag
  Bag, Pencil, Books
  Crayons, Uniform, Bag
  Books, Crayons, Bag
  Uniform, Crayons, Pencil
  Pencil, Uniform, Books

Support for {Bag, Uniform} = 5/10 = 0.5
Confidence for Bag → Uniform = 5/8 = 0.625
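The figures above can be checked with a short computation. A minimal Python
sketch, assuming the transactions are represented as sets (the helper names
are illustrative, not from the slides):

# Transactions from the example slide.
transactions = [
    {"Bag", "Uniform", "Crayons"}, {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"},  {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"}, {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"}, {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"}, {"Pencil", "Uniform", "Books"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the item-set.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # support(LHS union RHS) / support(LHS)
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"Bag", "Uniform"}, transactions))        # 0.5
print(confidence({"Bag"}, {"Uniform"}, transactions))   # 0.625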
Mining for Frequent Item-sets
The Apriori Algorithm:

Given a minimum required support s as the interestingness criterion:

1. Search for all individual elements (1-element item-sets) that have a
   minimum support of s
2. Repeat
   a. From the results of the previous search for i-element item-sets,
      search for all (i+1)-element item-sets that have a minimum support
      of s
   b. This becomes the set of all frequent (i+1)-element item-sets that
      are interesting
3. Until the item-set size reaches the maximum
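A minimal sketch of this level-wise loop, reusing the transactions and
support() helper from the previous sketch; candidate generation is
simplified to extending each frequent item-set with frequent single items:

def apriori(transactions, minsup):
    # Level 1: frequent individual items.
    items = {i for t in transactions for i in t}
    frequent = [{frozenset([i]) for i in items
                 if support(frozenset([i]), transactions) >= minsup}]
    k = 1
    while frequent[-1]:
        # Candidate (k+1)-item-sets: frequent k-item-sets extended by
        # frequent 1-item-sets.
        candidates = {a | b for a in frequent[-1] for b in frequent[0]
                      if len(a | b) == k + 1}
        frequent.append({c for c in candidates
                         if support(c, transactions) >= minsup})
        k += 1
    return [level for level in frequent if level]

levels = apriori(transactions, minsup=0.3)
# levels[2] contains frozenset({'Bag', 'Uniform', 'Crayons'})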
Mining for Frequent Item-sets
The Apriori Algorithm: (Example)

Let minimum support = 0.3

Transactions:
  Bag, Uniform, Crayons
  Books, Bag, Uniform
  Bag, Uniform, Pencil
  Bag, Pencil, Books
  Uniform, Crayons, Bag
  Bag, Pencil, Books
  Crayons, Uniform, Bag
  Books, Crayons, Bag
  Uniform, Crayons, Pencil
  Pencil, Uniform, Books

Interesting 1-element item-sets:
  {Bag}, {Uniform}, {Crayons}, {Pencil}, {Books}

Interesting 2-element item-sets:
  {Bag, Uniform}, {Bag, Crayons}, {Bag, Pencil}, {Bag, Books},
  {Uniform, Crayons}, {Uniform, Pencil}, {Pencil, Books}
Mining for Frequent Item-sets
The Apriori Algorithm: (Example)

Let minimum support = 0.3

(Transaction data as above.)

Interesting 3-element item-sets:
  {Bag, Uniform, Crayons}
Mining for Association Rules

Association rules are of the form A → B, which are directional.

Association rule mining requires two thresholds: minsup and minconf.

(Transaction data as above.)
Mining for Association Rules
Mining association rules using Apriori

General procedure:
1. Use Apriori to generate frequent item-sets of different sizes
2. At each iteration, divide each frequent item-set X into two parts,
   LHS and RHS. This represents a rule of the form LHS → RHS
3. The confidence of such a rule is support(X) / support(LHS)
4. Discard all rules whose confidence is less than minconf
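A sketch of steps 2-4, building on the apriori() and support() sketches
above: each frequent item-set is split into LHS → RHS pairs, and rules
below minconf are discarded:

from itertools import combinations

def association_rules(transactions, minsup, minconf):
    rules = []
    for level in apriori(transactions, minsup):
        for itemset in level:
            if len(itemset) < 2:
                continue
            for r in range(1, len(itemset)):
                for lhs in combinations(itemset, r):
                    lhs = frozenset(lhs)
                    rhs = itemset - lhs
                    conf = (support(itemset, transactions) /
                            support(lhs, transactions))
                    if conf >= minconf:
                        rules.append((set(lhs), set(rhs), round(conf, 3)))
    return rules

for lhs, rhs, conf in association_rules(transactions, 0.3, 0.7):
    print(lhs, "->", rhs, conf)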
Mining for Association Rules
Mining association rules using Apriori

Example:
The frequent item-set {Bag, Uniform, Crayons} has a support of 0.3.

This can be divided into the following rules:
  {Bag} → {Uniform, Crayons}
  {Bag, Uniform} → {Crayons}
  {Bag, Crayons} → {Uniform}
  {Uniform} → {Bag, Crayons}
  {Uniform, Crayons} → {Bag}
  {Crayons} → {Bag, Uniform}
Mining for Association Rules
Mining association rules using Apriori

Confidence for these rules is as follows:
  {Bag} → {Uniform, Crayons}        0.375
  {Bag, Uniform} → {Crayons}        0.6
  {Bag, Crayons} → {Uniform}        0.75
  {Uniform} → {Bag, Crayons}        0.428
  {Uniform, Crayons} → {Bag}        0.75
  {Crayons} → {Bag, Uniform}        0.75

If minconf is 0.7, then we have discovered the following rules…
Mining for Association Rules
Mining association rules using Apriori

People who buy a school bag and a set of crayons are likely to buy a
school uniform.

People who buy a school uniform and a set of crayons are likely to buy
a school bag.

People who buy just a set of crayons are likely to buy a school bag and
a school uniform as well.
Generalized Association Rules

Since customers can buy any number of items in one transaction, the
transaction relation would be in the form of a list of individual
purchases.

  Bill No.   Date         Item
  15563      23.10.2003   Books
  15563      23.10.2003   Crayons
  15564      23.10.2003   Uniform
  15564      23.10.2003   Crayons
Generalized Association Rules

A transaction for the purposes of data mining is obtained by performing
a GROUP BY of the table over various fields.

  (Same table as above: Bill No., Date, Item)
Generalized Association Rules

A GROUP BY over Bill No. would show frequent buying patterns across
different customers.
A GROUP BY over Date would show frequent buying patterns across
different days.

  (Same table as above: Bill No., Date, Item)
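For illustration, the grouping can be done directly in code before mining.
A sketch assuming the purchase list is a sequence of (bill_no, date, item)
tuples; the field names are illustrative:

from collections import defaultdict

purchases = [
    (15563, "23.10.2003", "Books"),
    (15563, "23.10.2003", "Crayons"),
    (15564, "23.10.2003", "Uniform"),
    (15564, "23.10.2003", "Crayons"),
]

def group_by(purchases, key_index):
    # GROUP BY the chosen column; each group becomes one "transaction".
    groups = defaultdict(set)
    for row in purchases:
        groups[row[key_index]].add(row[2])    # collect the Item column
    return list(groups.values())

by_bill = group_by(purchases, 0)   # patterns across customers (bills)
by_date = group_by(purchases, 1)   # patterns across days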
Classification and Clustering
Given a set of data elements:

Classification maps each data element to one of a set of


pre-determined classes based on the difference among
data elements belonging to different classes

Clustering groups data elements into different groups


based on the similarity between elements within a single
group
Classification Techniques
Decision Tree Identification

Classification problem:  Weather → Play (Yes, No)

  Outlook    Temp   Play?
  Sunny      30     Yes
  Overcast   15     No
  Sunny      16     Yes
  Cloudy     27     Yes
  Overcast   25     Yes
  Overcast   17     No
  Cloudy     17     No
  Cloudy     35     Yes
Classification Techniques
Hunt’s method for decision tree identification:

Given N element types and m decision classes:

1. For i ← 1 to N do
   a. Add element i to the (i-1)-element item-sets from the previous
      iteration
   b. Identify the set of decision classes for each item-set
   c. If an item-set has only one decision class, then that item-set is
      done; remove it from subsequent iterations
2. done
Classification Techniques
Decision Tree Identification Example

  Outlook    Temp       Play?
  Sunny      Warm       Yes
  Overcast   Chilly     No
  Sunny      Chilly     Yes
  Cloudy     Pleasant   Yes
  Overcast   Pleasant   Yes
  Overcast   Chilly     No
  Cloudy     Chilly     No
  Cloudy     Warm       Yes

Splitting on Outlook alone:
  Sunny    → Yes
  Cloudy   → Yes/No
  Overcast → Yes/No
Classification Techniques
Decision Tree Identification Example

(Data as above.)

Expanding the Cloudy branch on Temp:
  Cloudy, Warm     → Yes
  Cloudy, Chilly   → No
  Cloudy, Pleasant → Yes
Classification Techniques
Decision Tree Identification Example

(Data as above.)

Expanding the Overcast branch on Temp:
  Overcast, Warm     → (no examples)
  Overcast, Chilly   → No
  Overcast, Pleasant → Yes
Classification Techniques
Decision Tree Identification Example

Resulting decision tree:

  Outlook = Sunny    → Yes
  Outlook = Cloudy   → Temp: Warm → Yes, Pleasant → Yes, Chilly → No
  Outlook = Overcast → Temp: Pleasant → Yes, Chilly → No
Classification Techniques
Decision Tree Identification Example

• Top-down technique for decision tree identification

• The decision tree created is sensitive to the order in which items are
  considered

• If an N-item-set does not result in a clear decision, classification
  classes have to be modeled by rough sets.
Other Classification Algorithms

Quinlan’s depth-first strategy builds the decision tree in a depth-first
fashion, by considering all possible tests that give a decision and
selecting the test that gives the best information gain. It hence
eliminates tests that are inconclusive.

SLIQ (Supervised Learning in Quest), developed in the QUEST project at
IBM, uses a top-down breadth-first strategy to build a decision tree. At
each level in the tree, an entropy value is calculated for each node, and
the nodes having the lowest entropy values are selected and expanded.
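Both strategies rely on an impurity measure over the class labels at a
node. A sketch of entropy and information gain for a candidate test,
assuming each record is an (attribute-dictionary, label) pair (the data
below is a fragment of the earlier weather example):

from math import log2
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(records, attribute):
    # Gain = entropy before the split minus the weighted entropy after
    # splitting on the given attribute.
    labels = [label for _, label in records]
    before = entropy(labels)
    after = 0.0
    for value in {rec[attribute] for rec, _ in records}:
        subset = [label for rec, label in records if rec[attribute] == value]
        after += len(subset) / len(records) * entropy(subset)
    return before - after

weather = [({"Outlook": "Sunny", "Temp": "Warm"}, "Yes"),
           ({"Outlook": "Overcast", "Temp": "Chilly"}, "No"),
           ({"Outlook": "Sunny", "Temp": "Chilly"}, "Yes"),
           ({"Outlook": "Cloudy", "Temp": "Pleasant"}, "Yes")]
print(information_gain(weather, "Outlook"))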
Clustering Techniques

Clustering partitions the data set into clusters or equivalence classes.

Similarity among members of a class is greater than similarity among
members across classes.

Similarity measures: Euclidean distance or other application-specific
measures.
Euclidean Distance for Tables

(Figure: the rows (Overcast, Chilly, Don’t Play) and (Cloudy, Pleasant,
Play) plotted as points along the Outlook (Sunny/Cloudy/Overcast), Temp
(Warm/Pleasant/Chilly) and Play/Don’t Play axes; the distance between the
two rows is the Euclidean distance between the points.)
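One way to realise such a distance is to map each categorical value to a
position on its axis and apply the usual Euclidean formula. The encodings
below are illustrative assumptions, not from the slides:

from math import sqrt

# Illustrative orderings of the categorical axes.
OUTLOOK = {"Sunny": 0, "Cloudy": 1, "Overcast": 2}
TEMP = {"Chilly": 0, "Pleasant": 1, "Warm": 2}
PLAY = {"Don't Play": 0, "Play": 1}

def encode(row):
    outlook, temp, play = row
    return (OUTLOOK[outlook], TEMP[temp], PLAY[play])

def euclidean(row_a, row_b):
    a, b = encode(row_a), encode(row_b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(("Overcast", "Chilly", "Don't Play"),
                ("Cloudy", "Pleasant", "Play")))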
Clustering Techniques
General Strategy:

1. Draw a graph connecting items which are close to one another with
   edges.

2. Partition the graph into maximally connected subcomponents.
   a. Construct an MST for the graph
   b. Merge items that are connected by the minimum weight of the MST
      into a cluster
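A sketch of steps 1-2 using a distance cutoff: items closer than the
cutoff are connected, and the maximally connected subcomponents (found
with union-find) become the clusters. The cutoff and distance function
are assumptions; an MST-based variant would instead cut the heaviest MST
edges:

def graph_clusters(points, distance, cutoff):
    # Step 1: edges between items closer than the cutoff.
    # Step 2: maximally connected subcomponents via union-find.
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if distance(points[i], points[j]) < cutoff:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(points[i])
    return list(clusters.values())

print(graph_clusters([(0, 0), (0, 1), (5, 5)],
                     lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1]),
                     cutoff=2))     # [[(0, 0), (0, 1)], [(5, 5)]]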
Clustering Techniques
Clustering types:

Hierarchical clustering: Clusters are formed at different


levels by merging clusters at a lower level

Partitional clustering: Clusters are formed at only one level


Clustering Techniques
Nearest Neighbour Clustering Algorithm:

Given n elements x1, x2, …, xn, and a threshold t:

1. j ← 1, k ← 1, Clusters = {}
2. Repeat
   a. Find the nearest neighbour of xj among the elements already
      assigned to clusters
   b. Let the nearest neighbour be in cluster m
   c. If the distance to the nearest neighbour > t, then create a new
      cluster and k ← k+1; else assign xj to cluster m
   d. j ← j+1
3. until j > n
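A sketch of this procedure, assuming a distance function and threshold t
are supplied:

def nearest_neighbour_clustering(elements, distance, t):
    clusters = [[elements[0]]]                 # first element starts cluster 1
    for x in elements[1:]:
        # Nearest already-clustered element and the cluster it belongs to.
        best_dist, best_cluster = min(
            (distance(x, y), c) for c in range(len(clusters))
                                for y in clusters[c])
        if best_dist > t:
            clusters.append([x])               # create a new cluster
        else:
            clusters[best_cluster].append(x)   # assign x to cluster m
    return clusters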
Clustering Techniques
Iterative partitional clustering:

Given n elements x1, x2, …, xn, and k clusters, each with a center:

1. Assign each element to its closest cluster center
2. After all assignments have been made, compute the cluster centroid
   for each of the clusters
3. Repeat the above two steps with the new centroids until the algorithm
   converges
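A sketch of this iterative (k-means style) procedure for numeric vectors,
assuming the initial centres are given and a fixed iteration count stands
in for the convergence test:

def iterative_partitional(points, centres, iterations=20):
    for _ in range(iterations):
        # Step 1: assign each element to its closest centre.
        groups = [[] for _ in centres]
        for p in points:
            idx = min(range(len(centres)),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centres[c])))
            groups[idx].append(p)
        # Step 2: recompute each centroid from its members.
        centres = [tuple(sum(dim) / len(g) for dim in zip(*g)) if g
                   else centres[i]
                   for i, g in enumerate(groups)]
    return centres, groups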
Mining Sequence Data
Characteristics of Sequence Data:

• Collection of data elements which are ordered sequences

• In a sequence, each item has an index associated with it

• A k-sequence is a sequence of length k. The support for a sequence s
  of length j is the number of m-sequences (m ≥ j) which contain s as a
  subsequence

• Sequence data: transaction logs, DNA sequences, patient ailment
  history, …
Mining Sequence Data
Some Definitions:

• A sequence is a list of itemsets of finite length.
  Example: {pen, pencil, ink} {pencil, ink} {ink, eraser} {ruler, pencil}
  … the purchases of a single customer over time …

• The order of items within an itemset does not matter, but the order of
  itemsets matters

• A subsequence is a sequence with some itemsets deleted
Mining Sequence Data
Some Definitions:

• A sequence S’ = {a1, a2, …, am} is said to be contained within another
  sequence S, if S contains a subsequence {b1, b2, …, bm} such that
  a1 ⊆ b1, a2 ⊆ b2, …, am ⊆ bm.

• Hence, {pen} {pencil} {ruler, pencil} is contained in
  {pen, pencil, ink} {pencil, ink} {ink, eraser} {ruler, pencil}
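A sketch of the containment test: greedily match each itemset of S’ to a
later itemset of S that contains it:

def is_contained(s_prime, s):
    # Each itemset of s_prime must be a subset of some itemset of s,
    # in order and without reusing an itemset of s.
    pos = 0
    for a in s_prime:
        while pos < len(s) and not set(a) <= set(s[pos]):
            pos += 1
        if pos == len(s):
            return False
        pos += 1
    return True

S = [{"pen", "pencil", "ink"}, {"pencil", "ink"},
     {"ink", "eraser"}, {"ruler", "pencil"}]
print(is_contained([{"pen"}, {"pencil"}, {"ruler", "pencil"}], S))   # True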
Mining Sequence Data
Apriori Algorithm for Sequences:

1. L1 ← Set of all interesting 1-sequences
2. k ← 1
3. while Lk is not empty do
   a. Generate all candidate (k+1)-sequences
   b. Lk+1 ← Set of all interesting (k+1)-sequences
   c. k ← k+1
4. done
Mining Sequence Data
Generating Candidate Sequences:

Given L1, L2, …, Lk, candidate sequences for Lk+1 are generated as
follows:

For each sequence s in Lk, concatenate s with all new 1-sequences found
while generating Lk-1
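A sketch of the sequence version of Apriori for plain item sequences (as
in the example that follows), extending each interesting k-sequence with
every interesting 1-sequence and keeping those that meet minsup:

def seq_support(pattern, sequences):
    # Fraction of sequences that contain the pattern as a subsequence.
    def contains(seq, pat):
        pos = 0
        for ch in seq:
            if pos < len(pat) and ch == pat[pos]:
                pos += 1
        return pos == len(pat)
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

def apriori_sequences(sequences, minsup):
    ones = sorted({c for s in sequences for c in s})
    level = [p for p in ones if seq_support(p, sequences) >= minsup]
    all_levels = [level]
    while level:
        # Extend each interesting k-sequence with every interesting 1-sequence.
        candidates = {p + c for p in level for c in all_levels[0]}
        level = [p for p in sorted(candidates)
                 if seq_support(p, sequences) >= minsup]
        all_levels.append(level)
    return [l for l in all_levels if l]

data = ["abcde", "bdae", "aebd", "be", "eabda",
        "aaaa", "baaa", "cbdb", "abbab", "abde"]
print(apriori_sequences(data, 0.5))   # [['a', 'b', 'd', 'e'], ['ab', 'bd']]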
Mining Sequence Data
Example: minsup = 0.5

Sequences:
  abcde
  bdae
  aebd
  be
  eabda
  aaaa
  baaa
  cbdb
  abbab
  abde

Interesting 1-sequences: a, b, d, e

Candidate 2-sequences:
  aa, ab, ad, ae
  ba, bb, bd, be
  da, db, dd, de
  ea, eb, ed, ee
Mining Sequence Data
Example: minsup = 0.5

Sequences: (as above)

Interesting 2-sequences: ab, bd

Candidate 3-sequences:
  aba, abb, abd, abe, aab, bab, dab, eab,
  bda, bdb, bdd, bde, bbd, dbd, ebd

Interesting 3-sequences = {}
Mining Sequence Data
Language Inference:

Given a set of sequences, consider each sequence as the behavioural trace
of a machine, and infer the machine that can display the given sequences
as behaviour.

Input set of sequences:          Output: an inferred state machine
  aabb
  ababcac
  abbac
Mining Sequence Data
• Inferring the syntax of a language given
its sentences
• Applications: discerning behavioural
patterns, emergent properties
discovery, collaboration modeling, …
• State machine discovery is the reverse
of state machine construction
• Discovery is “maximalist” in nature…
Mining Sequence Data
“Maximal” nature of language inference:

Input sequences: abc, aabc, aabbc, abbc

(Figure: the “most general” state machine is a single state looping on
a, b and c; the “most specific” state machine accepts exactly the input
sequences and nothing else.)
Mining Sequence Data
“Shortest-run Generalization” (Srinivasa and Spiliopoulou 2000)

Given a set of n sequences:

1. Create a state machine for the first sequence
2. for j ← 2 to n do
   a. Create a state machine for the jth sequence
   b. Merge this state machine into the existing one as follows:
      i.  Merge all halt states in the new state machine with the halt
          state in the existing state machine
      ii. If two or more paths to the halt state share the same suffix,
          merge the suffixes together into a single path
3. Done
Mining Sequence Data
“Shortest-run Generalization” (Srinivasa and Spiliopoulou 2000)

(Figure: the state machines for the sequences aabcb, aac and aabc are
merged one by one, sharing common suffixes, into a single machine.)
Mining Streaming Data
Characteristics of streaming data:

• Large data sequence

• No storage

• Often an infinite sequence

• Examples: Stock market quotes, streaming audio/video,


network traffic
Mining Streaming Data
Running mean:

Let n   = number of items read so far,
    avg = running average calculated so far.

On reading the next number num:
  avg ← (n*avg + num) / (n+1)
  n ← n+1
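A minimal sketch of this update:

class RunningMean:
    def __init__(self):
        self.n, self.avg = 0, 0.0

    def add(self, num):
        # avg <- (n*avg + num) / (n+1);  n <- n+1
        self.avg = (self.n * self.avg + num) / (self.n + 1)
        self.n += 1
        return self.avg

rm = RunningMean()
for x in [4, 8, 6]:
    print(rm.add(x))    # 4.0, 6.0, 6.0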
Mining Streaming Data
Running variance:

For a single number num,
  (num - avg)^2 = num^2 - 2*num*avg + avg^2

Let A   = sum of num^2 over all numbers read so far
    B   = sum of 2*num*avg over all numbers read so far
    C   = sum of avg^2 over all numbers read so far
    avg = average of the numbers read so far
    n   = number of numbers read so far
Mining Streaming Data
Running variance:

On reading the next number num:
  avg ← (avg*n + num) / (n+1)
  n ← n+1
  A ← A + num^2
  B ← B + 2*avg*num
  C ← C + avg^2

  var ← (A - B + C) / n
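A sketch of the accumulator scheme above. Because avg keeps changing while
the stream is read, A - B + C only approximates the sum of squared
deviations; Welford's update, kept alongside for comparison, maintains it
exactly:

class RunningVariance:
    def __init__(self):
        self.n = 0
        self.avg = 0.0
        self.A = self.B = self.C = 0.0   # sums of num^2, 2*num*avg, avg^2
        self.M2 = 0.0                    # Welford's exact sum of squared deviations

    def add(self, num):
        old_avg = self.avg
        self.avg = (self.avg * self.n + num) / (self.n + 1)
        self.n += 1
        # Slide-style accumulators (approximate, since avg drifts over time).
        self.A += num ** 2
        self.B += 2 * self.avg * num
        self.C += self.avg ** 2
        # Welford's update (exact).
        self.M2 += (num - old_avg) * (num - self.avg)

    def variance(self):
        approx = (self.A - self.B + self.C) / self.n
        exact = self.M2 / self.n
        return approx, exact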
Mining Streaming Data
α-Consistency: (Srinivasa and Spiliopoulou, CoopIS 1999)

Let streaming data be in the form of “frames”, where each frame comprises
one or more data elements.

Support for data element k within a frame is defined as
  (#occurrences of k) / (#elements in the frame)

α-Consistency for data element k is the “sustained” support for k over
all frames read so far, with a “leakage” of (1 − α)
Mining Streaming Data
α-Consistency: (Srinivasa and Spiliopoulou, CoopIS 1999)

  level_t(k) = (1 − α) * level_{t-1}(k) + α * sup(k)
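A sketch of this update for a stream of frames, each frame given as a list
of elements; the symbol α and the frame representation are assumptions
read off the formula:

def alpha_consistency(frames, alpha):
    # level_t(k) = (1 - alpha) * level_{t-1}(k) + alpha * sup(k)
    level = {}
    for frame in frames:
        elements = set(frame) | set(level)
        for k in elements:
            sup = frame.count(k) / len(frame)
            level[k] = (1 - alpha) * level.get(k, 0.0) + alpha * sup
    return level

print(alpha_consistency([["a", "a", "b"], ["a", "c", "c"]], alpha=0.3))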


Data Warehousing
• A platform for online analytical processing (OLAP)
• Warehouses collect transactional data from several
transactional databases and organize them in a fashion
amenable to analysis
• Also called “data marts”
• A critical component of the decision support system (DSS) of
enterprises
• Some typical DW queries:
– Which item sells best in each region that has retail outlets
– Which advertising strategy is best for South India?
– Which (age_group/occupation) in South India likes fast
food, and which (age_group/occupation) likes to cook?
Data Warehousing

(Figure: OLTP sources (order processing, inventory and sales systems)
feed their data through a data cleaning step into the data warehouse,
which serves OLAP.)
OLTP vs OLAP

  Transactional Data (OLTP)            Analysis Data (OLAP)
  Small or medium size databases       Very large databases
  Transient data                       Archival data
  Frequent insertions and updates      Infrequent updates
  Small query shadow                   Very large query shadow
  Normalization important to           De-normalization important to
  handle updates                       handle queries
Data Cleaning
• Performs logical transformation of transactional data to suit the data
  warehouse
• Model of operations → model of enterprise
• Usually a semi-automatic process
Data Cleaning

(Figure: operational tables such as Orders (Order_id, Price, Cust_id),
Inventory (Prod_id, Price, Price_chng) and Sales (Cust_id, Cust_prof,
Tot_sales) are cleaned and mapped onto the data warehouse dimensions
Customers, Products, Orders, Inventory, Price and Time.)
Multi-dimensional Data Model

(Figure: a data cube with Price plotted over the dimensions Products,
Orders, Customers and Time (Jan’01, Jun’01, Jan’02, Jun’02).)
Some MDBMS Operations
• Roll-up
  – Collapse (aggregate over) dimensions
• Drill-down
  – Add dimensions (expose finer detail)
• Vector-distance operations (ex: clustering)
• Vector space browsing
Star Schema

(Figure: a central fact table joined to several surrounding dimension
tables.)
WWW Based References
• http://www.kdnuggets.com/
• http://www.megaputer.com/
• http://www.almaden.ibm.com/cs/quest/index.html
• http://fas.sfu.ca/cs/research/groups/DB/sections/publication/kdd/kdd.html
• http://www.cs.su.oz.au/~thierry/ckdd.html
• http://www.dwinfocenter.org/
• http://datawarehouse.itoolbox.com/
• http://www.knowledgestorm.com/
• http://www.bitpipe.com/
• http://www.dw-institute.com/
• http://www.datawarehousing.com/
References
• R. Agrawal, R. Srikant: “Fast Algorithms for Mining Association
  Rules”, Proc. of the 20th Int'l Conference on Very Large Databases,
  Santiago, Chile, Sept. 1994.
• R. Agrawal, R. Srikant, ``Mining Sequential Patterns'', Proc. of the
Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March
1995.
• R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant:
"The Quest Data Mining System", Proc. of the 2nd Int'l Conference
on Knowledge Discovery in Databases and Data Mining, Portland,
Oregon, August, 1996.
• Surajit Chaudhuri, Umeshwar Dayal. An Overview of Data Warehousing
  and OLAP Technology. ACM SIGMOD Record, 26(1), March 1997.
• Jennifer Widom. Research Problems in Data Warehousing. Proc. of
Int’l Conf. On Information and Knowledge Management, 1995.
References
• A. Shoshani. OLAP and Statistical Databases: Similarities and
Differences. Proc. of ACM PODS 1997.
• Panos Vassiliadis, Timos Sellis. A Survey on Logical Models
for OLAP Databases. ACM SIGMOD Record
• M. Gyssens, Laks VS Lakshmanan. A Foundation for Multi-
Dimensional Databases. Proc of VLDB 1997, Athens, Greece.
• Srinath Srinivasa, Myra Spiliopoulou. Modeling Interactions Based on
  Consistent Patterns. Proc. of CoopIS 1999, Edinburgh, UK.
• Srinath Srinivasa, Myra Spiliopoulou. Discerning Behavioral
Patterns By Mining Transaction Logs. Proc. of ACM SAC 2000,
Como, Italy.
