Data Mining
&
Its Business Applications
MANISH GUPTA
Principal Analytics Consultant
Innovation Labs, 24/7 Customer Pvt. Ltd.
Bangalore-560071
(Email: manish.gupta@247customer.com)
Why Data Mining?
Data explosion problem
– Automated data collection and mature database technology mean far more data is stored than can be analyzed manually
What is Data Mining?
(Knowledge Discovery in Databases)
Definition
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Data Mining vs DBMS-SQL

DBMS-SQL
– Queries based on the data held
– Examples: last month's sales for each product; sales grouped by customer age; list of customers whose policies lapsed

Data Mining
– Infers knowledge from the data to answer queries
– Examples: What characteristics do customers whose policies have lapsed share? Do the sales of this product depend on the sales of some other product?
Data Mining:
Confluence of Multiple Disciplines
[Diagram: Data Mining at the intersection of Database Technology, Statistics, Machine Learning, Visualization, Information Science, and other disciplines]
Data Mining and Business Intelligence
[Pyramid, bottom to top, with increasing potential to support business decisions:]
– Data Sources: Paper, Files, Information Providers, Database Systems
– Data Warehouses / OLAP (DBA)
– Data Exploration: Statistical Analysis, Querying and Reporting
– Making Decisions (End User)
Data Mining: A KDD Process
Databases → Data Cleaning & Integration → Data Warehouse → Selection of task-relevant data → Data Mining → Pattern Evaluation
– Data mining is the core of the knowledge discovery process
Architecture of a Typical Data Mining System
[Diagram: databases and a data warehouse feed the data mining engine, whose discovered patterns pass to a pattern evaluation module]
Applications
Business Domain
– Market-Basket Databases
– Financial Databases
– Insurance Databases
– Telecommunication Databases
– Business Analytics
– CRM
Defence Domain
– MSDF
– ELINT Data Analysis
– Emitter Classification
– Intrusion Detection
Business Applications
Market Analysis & Management
Where are the data sources for analysis?
– Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
Target marketing
– Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
– Conversion of single to a joint bank account: marriage, etc.
Cross-market analysis
– Associations/correlations between product sales
– Prediction based on the association information
Market Analysis & Management (Contd.)
Customer profiling
– Data mining can tell you what types of customers buy what products (clustering or classification)
Data Mining Techniques
Clustering
Classification
Association Rules Mining
(Market Basket Analysis)
Clustering
Clustering: Basic Idea
Clustering
– Grouping a set of data objects into clusters
– Similar objects within the same cluster
– Dissimilar objects in different clusters
Clustering is unsupervised
– No previous categorization known
– Totally data driven
Clustering: Example
[Scatter plot: points forming three clusters, with one isolated point marked as an outlier]
A good clustering method will produce high-quality clusters with
– high intra-class similarity
– low inter-class similarity
Similarity Computation
Distance between objects is used as the metric
The definition of the distance function is usually different for different types of attributes
A distance metric must satisfy the following properties:
– d(i,j) ≥ 0
– d(i,j) = d(j,i)
– d(i,j) ≤ d(i,k) + d(k,j)
Distance Calculation: objects $x_i$ and $x_j$ (p attributes)
Minkowski
$$d(i,j) = \left( |x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q \right)^{1/q}$$
Euclidean (q = 2)
$$d(i,j) = \sqrt{ |x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2 }$$
Manhattan (q = 1)
$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$$
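To make the formulas concrete, here is a minimal sketch of the three distance functions in Python (NumPy assumed available; the function names are mine, not from the slides):

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance of order q between two attribute vectors."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

def euclidean(x, y):
    """Special case q = 2."""
    return minkowski(x, y, 2)

def manhattan(x, y):
    """Special case q = 1."""
    return minkowski(x, y, 1)

# Example: two objects with p = 4 attributes (records L1 and L5 from the
# k-means example later in this deck).
x = np.array([3, 10, 23, 36])
y = np.array([1, 16, 1, 28])
print(euclidean(x, y))   # ~24.25
print(manhattan(x, y))   # 38
```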
Methods
Partition Methods
– Iterative Methods
– Convergence criteria specified by
the user
Hierarchical Methods
– Agglomerative / Divisive
– Use Dendrogram representation
Partitioning Methods
K-Means Clustering
– Decide k – no. of clusters
– Randomly pick k seeds – use as centroids
– Repeat until convergence
Scan the database and assign each object to its nearest centroid
Recompute the centroids
Evaluate the quality of the clustering
(A runnable sketch follows the worked example below.)
Example:
Records Feature1 Feature2 Feature3 Feature4
L1 3 10 23 36
L2 12 6 12 41
L3 5 12 17 24
L4 4 8 7 13
L5 1 16 1 28
L6 18 0 22 51
L7 6 8 6 12
L8 15 5 2 6
L9 0 10 15 18
L10 9 2 24 15
Initialization
We take the number of cluster centers as 3, i.e. K = 3.
Let's take the initial cluster centers as
– L1 (3, 10, 23, 36)
– L5 (1, 16, 1, 28)
– L8 (15, 5, 2, 6)
Pictorial View of the Clusters After First Iteration
– Cluster 1: L1, L2, L3, L6, L10 — centroid (9.4, 6, 19.6, 33.4)
– Cluster 2: L5, L9 — centroid (0.5, 13, 8, 23)
– Cluster 3: L4, L7, L8 — centroid (8.3, 7, 5, 10.3)
Pictorial View of the Clusters After Second Iteration
– Cluster 1: L1, L2, L6, L10 — centroid (10.5, 4.5, 20.3, 35.8)
– Cluster 2: L3, L5, L9 — centroid (2, 12.7, 11, 23.3)
– Cluster 3: L4, L7, L8 — centroid (8.33, 7, 5, 10.3)
Pictorial View of the Clusters After Third Iteration
– Cluster 1: L1, L2, L6, L10 — centroid (10.5, 4.5, 20.3, 35.8)
– Cluster 2: L3, L5, L9 — centroid (2, 12.7, 11, 23.3)
– Cluster 3: L4, L7, L8 — centroid (8.33, 7, 5, 10.3)
The cluster centers remain the same as in the second iteration, so we stop here.
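As a cross-check on the worked example, here is a minimal k-means sketch (plain NumPy, no clustering library) run on the ten records above, with L1, L5 and L8 as the initial seeds:

```python
import numpy as np

# Records L1..L10 (Feature1..Feature4) from the example table.
X = np.array([
    [3, 10, 23, 36], [12, 6, 12, 41], [5, 12, 17, 24], [4, 8, 7, 13],
    [1, 16, 1, 28],  [18, 0, 22, 51], [6, 8, 6, 12],   [15, 5, 2, 6],
    [0, 10, 15, 18], [9, 2, 24, 15],
], dtype=float)

# Initial centroids: records L1, L5, L8.
centroids = X[[0, 4, 7]].copy()

while True:
    # Assign each record to its nearest centroid (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each centroid as the mean of its cluster.
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(3)])
    if np.allclose(new_centroids, centroids):
        break   # centroids unchanged: converged
    centroids = new_centroids

print(labels)     # cluster index for L1..L10
print(centroids)  # should match the second/third-iteration centroids above
```

The loop stops on the third pass, reproducing the clusters {L1, L2, L6, L10}, {L3, L5, L9}, {L4, L7, L8} and the converged centroids shown above.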
Hierarchical Methods
Agglomerative Methods
– Bottom Up approach
Divisive Methods
– Top Down Approach
Agglomerative Approach
Input: a database of objects A–F and the matrix of pairwise distances (d_AB; d_AC, d_BC; d_AD, d_BD, d_CD; d_AE, d_BE, d_CE, d_DE; d_AF, d_BF, d_CF, d_DF, d_EF)
[Dendrogram: leaves A B C D E F; nearby objects merge first, e.g. into {A, B, C} and {E, F}, then into {A, B, C, D}, until one cluster remains]
Divisive Approach
[Dendrogram read top-down: the whole database first splits into {A, B, C, D} and {E, F}; {A, B, C, D} splits further into {A, B, C} and {D}, and so on down to single objects A B C D E F]
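A minimal agglomerative-clustering sketch using SciPy's hierarchical-clustering routines (SciPy assumed available; the six 2-D points standing in for objects A–F are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six illustrative objects A..F as 2-D points (hypothetical data).
points = np.array([[1.0, 1.0], [1.2, 1.1], [1.1, 1.4],   # A, B, C close together
                   [3.0, 2.0],                            # D
                   [8.0, 8.0], [8.2, 8.1]])               # E, F close together

# Bottom-up (agglomerative) merging; 'single' linkage merges the two
# clusters with the smallest nearest-neighbour distance at each step.
Z = linkage(points, method='single')   # Z encodes the dendrogram

# Cut the dendrogram into 2 flat clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)   # A-D end up in one cluster, E-F in the other
```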
Clustering: Applications
Marketing Management
– Discover distinct groups in customer bases, and
then use this knowledge to develop targeted
marketing programs
Banking
– ATM Location identification
Text Mining
– Grouping documents with similar characteristics
Clustering Stock Market Data
– Clustering companies using the Dow Jones Index
– Trading system development
Clustering for Customer Profiling
New Product Line Development
Crime Hot Spot Analysis
Clustering for Medical Diagnostics
– Human Genome Project: finding relationships between diseases, cellular functions, and drugs
– Wisconsin Breast Cancer Study
– Cancer diagnosis and prediction
Classification
[Image slide: a set of sunset photographs] Easy to agree these are sunset pictures!
The classification task
Input: a training set of tuples, each
labelled with one class label
Output: a model (classifier) which assigns
a class label to each tuple based on the
other attributes.
The model can be used to predict the
class of new tuples, for which the class
label is missing or unknown
Training step
[Diagram: training data is fed to a classification algorithm, which produces the classifier model]
Decision Tree Induction
Recursive partitioning of the training set T until a stopping criterion is satisfied (purity of partition, depth of tree, etc.)
Decide the split criterion
– Select the splitting attribute
– Partition the data according to the selected attribute
– Apply the induction method recursively on each partition
Decision tree inducers
ID3 – RJ Quinlan – 1986
– Simple, Uses information gain, no pruning
C4.5 - RJ Quinlan – 1993
– Uses gain ratio, handles numeric attributes and
missing values, error-based pruning
SLIQ -Mehta et al., 1996
– Scalable, one scan of database, uses gini index
CART - Breiman et al., 1984
– constructs binary tree, cost-complexity pruning, can
generate regression trees
Attribute Selection Criteria
Information Gain
– $\mathrm{Entropy}(S) = -\sum_i p_i \log_2 p_i$
– Gain = Entropy (before split) − weighted average entropy (after split)
Gain Ratio
– Information Gain / SplitInfo, where SplitInfo is the entropy of the partition sizes themselves
Gini Index
– Measures divergence between the probability distributions of the classes
– $\mathrm{Gini}(S) = 1 - \sum_i p_i^2$
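A small sketch of these measures in Python (the function names are mine; the demo uses the outlook column of Quinlan's play-tennis table on the next slide, for which the gain of 0.247 bits is the standard result):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Entropy before the split minus the weighted entropy after it."""
    before = entropy(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    after = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return before - after

# Play-tennis data: the outlook attribute and the class column.
outlook = ['sunny','sunny','overcast','rain','rain','rain','overcast',
           'sunny','sunny','rain','sunny','overcast','overcast','rain']
cls = ['N','N','P','P','P','N','P','N','P','P','P','P','P','N']
rows = [[o] for o in outlook]
print(entropy(cls))            # ~0.940 bits (9 P, 5 N)
print(gini(cls))               # ~0.459
print(info_gain(rows, cls, 0)) # ~0.247 bits, the gain of splitting on outlook
```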
Classical example: play tennis?
Training set from Quinlan's book:

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
Information gain - play tennis example
Splitting first on outlook (the highest-gain attribute) yields the tree:
outlook = sunny → test humidity: high → N, normal → P
outlook = overcast → P
outlook = rain → test windy: true → N, false → P
From decision trees to classification rules
One rule is generated for each path in the tree from the root to a leaf
Rules are generally simpler to understand than trees
Example, from the sunny/normal path of the tree above:
IF outlook = sunny AND humidity = normal THEN play tennis (P)
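A minimal sketch inducing this tree with scikit-learn (assuming scikit-learn and pandas are installed; one-hot encoding is my choice, since the library needs numeric inputs, and scikit-learn builds binary splits, so the printed tree is a binary rendering of the multiway tree above rather than the slides' exact ID3 output):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Quinlan's play-tennis training set from the table above.
data = pd.DataFrame({
    'outlook':     ['sunny','sunny','overcast','rain','rain','rain','overcast',
                    'sunny','sunny','rain','sunny','overcast','overcast','rain'],
    'temperature': ['hot','hot','hot','mild','cool','cool','cool',
                    'mild','cool','mild','mild','mild','hot','mild'],
    'humidity':    ['high','high','high','high','normal','normal','normal',
                    'high','normal','normal','normal','high','normal','high'],
    'windy':       [False,True,False,False,False,True,True,
                    False,False,False,True,True,False,True],
    'class':       ['N','N','P','P','P','N','P','N','P','P','P','P','P','N'],
})

X = pd.get_dummies(data.drop(columns='class'))  # one-hot encode categoricals
y = data['class']

# criterion='entropy' corresponds to information gain, as in ID3/C4.5.
clf = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)
print(export_text(clf, feature_names=list(X.columns)))  # tree as text rules
```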
Advantages & Limitations
Advantages
– Self-explanatory
– Handle both numeric and categorical data
– Non-parametric method
Limitations
– Most algorithms predict only categorical attributes
– Overfitting (overtraining)
– Need for pruning
– Trees can grow large
Classification: Application
Bank Loan Granting System
– Decision trees constructed from bank-loan
histories to produce algorithms to decide
whether to grant a loan or not.
Anti-Money Laundering System
– KYC Status
Email Classification System
– Spam or Not
Stock Research Applications
Efficient Prediction of Option Prices using Machine Learning Techniques
– Prediction of both European and American option prices using General Regression Neural Networks and Support Vector Regression
Stock Portfolio Management: Prediction of Risk using Text Classification
– Prediction or classification of the risk of investing in a particular company by text classification, using Naïve Bayes (NB) and K-Nearest Neighbor (KNN)
Prediction of Financial Data Series using the MATLAB GARCH Toolbox
Pattern Recognition:
Artificial Neural Network Application
Emitter Classification
– ELINT data analysis
– Identification of radars and platforms
– DAPR software successfully delivered to the Indian Navy (INTEG)
[Figure: recorded signals of the same text spoken by Speaker 1 and Speaker 2] Although the spoken words are the same, the recorded digital signals are very different!
Pattern Recognition Example
[Figure: a noisy input image and the recognized pattern]
Association Rule Mining
(Market Basket Analysis)
Association Rules
Intra-record links
– Finding associations among sets of objects in transaction databases and relational databases
– Rule form: "Antecedent ⇒ Consequent [support, confidence]"
Examples
– shirt, tie, socks ⇒ shoes [0.5%, 60%]
– white bread, butter ⇒ egg [2.3%, 80%]
Preliminaries
Given: (1) a database of transactions; (2) each transaction is a list of items
Find: all rules that correlate the presence of one set of items with that of another set of items
– E.g., 95% of people who purchase a PC and a color printer also purchase a computer table
Business questions:
– * ⇒ electronic items (what should the store do to boost sales of electronic items?)
– Herbal health products ⇒ * (what other products should the store stock up on?)
Formal Definition
If X and Y are two itemsets such that X ∩ Y = ∅, then for an association rule X ⇒ Y:
– Support is the probability that X and Y occur together: P(X ∪ Y)
– Confidence is the conditional probability that Y occurs in a transaction given that X is present in the same transaction: P(Y | X)
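A minimal sketch of computing support and confidence from a list of transactions (Python; the helper names and the four example baskets are mine):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset: P(X U Y)."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, X, Y):
    """P(Y | X) = support(X u Y) / support(X)."""
    return support(transactions, set(X) | set(Y)) / support(transactions, X)

# Example transactions (hypothetical):
T = [{'shirt', 'tie', 'shoes'}, {'shirt', 'tie', 'socks', 'shoes'},
     {'tie', 'socks'}, {'shirt', 'shoes'}]
print(support(T, {'shirt', 'tie'}))                # 0.5
print(confidence(T, {'shirt', 'tie'}, {'shoes'}))  # 1.0
```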
Itemset and Support
[Venn diagram: itemset A appears in 4 transactions, C in 5, and A and C together in 4]
The strength of a discovered rule is computed as P(X, Y) / P(X):
– A ⇒ C: confidence 4/4
– C ⇒ A: confidence 4/5
Interestingness
– Discovered rules are filtered by user-specified thresholds such as minimum confidence (and, typically, minimum support)
[Transaction table on this slide lost in extraction; one surviving row: Trans ID 50 — Items A, C, E]
The Apriori algorithm
The best known algorithm
Two steps:
– Find all itemsets that have minimum support (frequent
itemsets, also called large itemsets).
– Use frequent itemsets to generate rules.
Apriori Example

TID  Items
1    Bread, Milk
2    Beer, Diaper, Bread, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Bread, Diaper, Milk
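A compact Apriori sketch in Python, run on the five transactions above (the minimum support of 3 out of 5 is my choice for illustration):

```python
from itertools import combinations

transactions = [
    {'Bread', 'Milk'},
    {'Beer', 'Diaper', 'Bread', 'Eggs'},
    {'Beer', 'Coke', 'Diaper', 'Milk'},
    {'Beer', 'Bread', 'Diaper', 'Milk'},
    {'Coke', 'Bread', 'Diaper', 'Milk'},
]
min_count = 3  # minimum support = 3 of 5 transactions (60%)

def count(itemset):
    """Number of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions)

# F1: frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
F = [frozenset([i]) for i in items if count(frozenset([i])) >= min_count]
k = 1
while F:
    print(k, [sorted(s) for s in F])
    # Join step: union pairs of frequent k-itemsets into (k+1)-candidates,
    # then prune any candidate with an infrequent k-subset.
    candidates = {a | b for a in F for b in F if len(a | b) == k + 1}
    candidates = {c for c in candidates
                  if all(frozenset(s) in set(F) for s in combinations(c, k))}
    F = [c for c in candidates if count(c) >= min_count]
    k += 1
```

With these settings the frequent pairs come out as {Beer, Diaper}, {Bread, Diaper}, {Bread, Milk} and {Diaper, Milk}; no 3-itemset survives.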
An example
F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}
After the join step:
– C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}
After pruning:
– C4 = {{1, 2, 3, 4}}
({1, 3, 4, 5} is removed because its 3-subset {1, 4, 5} is not in F3)
Generating rules: an example
Suppose {2, 3, 4} is frequent, with support = 50%
– Proper nonempty subsets: {2, 3}, {2, 4}, {3, 4}, {2}, {3}, {4}, with support 50%, 50%, 75%, 75%, 75%, 75% respectively
– These generate the association rules:
  2, 3 ⇒ 4, confidence = 100%
  2, 4 ⇒ 3, confidence = 100%
  3, 4 ⇒ 2, confidence = 67%
  2 ⇒ 3, 4, confidence = 67%
  3 ⇒ 2, 4, confidence = 67%
  4 ⇒ 2, 3, confidence = 67%
All rules have support = 50%
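The confidences above follow mechanically from the subset supports; a short sketch (supports hard-coded from the example, in Python):

```python
from itertools import combinations

# Supports from the example, as fractions.
support = {
    frozenset({2, 3, 4}): 0.50,
    frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50, frozenset({3, 4}): 0.75,
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
}

full = frozenset({2, 3, 4})
for r in range(1, len(full)):
    for antecedent in map(frozenset, combinations(full, r)):
        consequent = full - antecedent
        conf = support[full] / support[antecedent]  # P(Y | X)
        print(f"{set(antecedent)} => {set(consequent)}: "
              f"confidence {conf:.0%}, support {support[full]:.0%}")
```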
On the Apriori Algorithm
Seems to be very expensive
Level-wise search
– K = the size of the largest itemset
– It makes at most K passes over the data
– In practice, K is bounded (often around 10)
The algorithm is very fast; under some conditions, all rules can be found in linear time
Scales up to large data sets
AR: Applications
Retail Marketing
– Floor Planning, Discounting, Catalogue Design
Medical Diagnosis
– Comparison of the genotype of people with/without a
condition allowed the discovery of a set of genes that
together account for many cases of diabetes.
Geographical Information Systems
– Link Analysis
Walmart Study
Typical Business Decisions
(For Walmart)
What to put on sale?
How to design coupons?
How to place merchandise on shelves to maximise profit?
Walmart Study (Contd.)
IBM's Almaden research group did an exhaustive study on Walmart (the biggest superstore chain in the US) to discover customers' buying patterns.
Interesting rules came out of the study.