
Data Mining

&
Its Business Applications
MANISH GUPTA
Principal Analytics Consultant
Innovation Labs, 24/7 Customer Pvt. Ltd.
Bangalore-560071
(Email: manish.gupta@247customer.com)
Why Data Mining?
Data explosion problem
– Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
We are drowning in data, but starving for knowledge!
The secret of success in business is knowing what nobody else knows.
Solution: data warehousing and data mining
What is Data Mining?
(Knowledge Discovery in Databases)
Definition
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Data Mining vs. DBMS-SQL
DBMS-SQL: queries based on the data held
– Examples: last month's sales for each product; sales grouped by customer age; list of customers whose policies lapsed
Data Mining: infers knowledge from the data to answer queries
– Examples: What characteristics do customers have whose policies have lapsed? Are the sales of this product dependent on the sales of some other product?
Data Mining:
Confluence of Multiple Disciplines
Data mining draws on multiple disciplines: database technology, statistics, machine learning, visualization, information science, and other disciplines.
Data Mining and Business Intelligence
Increasing potential to support business decisions, from bottom to top:
– Data Sources: paper, files, information providers, database systems
– Data Warehouses / OLAP (DBA)
– Data Exploration: statistical analysis, querying and reporting
– Data Mining: information discovery (Data Analyst)
– Data Presentation: visualization techniques (Business Analyst)
– Making Decisions (End User)
Data Mining: A KDD Process
Data mining is the core of the knowledge discovery process:
– Databases → Data Cleaning and Data Integration → Data Warehouse → Selection of Task-relevant Data → Data Mining → Pattern Evaluation
Architecture of a Typical Data
Mining System

– Graphical user interface
– Pattern evaluation
– Data mining engine (supported by a knowledge base)
– Database or data warehouse server
– Data cleaning, data integration and filtering over the underlying databases and data warehouse
Applications
Business Domain
– Market-Basket Databases
– Financial Databases
– Insurance Database
– Telecommunication Database
– Business Analytics
– CRM
Defence Domain
– MSDF
– ELINT Data Analysis
– Emitter Classification
– Intrusion Detection
Business Applications

Database analysis and decision support
– Market analysis and management: target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
– Fraud detection and management
Other applications
– Text mining (news groups, email, documents) and Web analysis
Market Analysis & Management
Where are the data sources for analysis?
– Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
Target marketing
– Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
– Conversion of single to a joint bank account: marriage, etc.
Cross-market analysis
– Associations/correlations between product sales
– Prediction based on the association information
Contd…
Customer profiling
– data mining can tell you what types of customers buy what products (clustering or classification)
Identifying customer requirements
– identifying the best products for different customers
– use prediction to find what factors will attract new customers
Provides summary information
– various multidimensional summary reports
– statistical summary information (data central tendency and variation)
Data Mining Techniques
Clustering
Classification
Association Rules Mining
(Market Basket Analysis)

Clustering

Clustering: Basic Idea
Clustering
– Grouping a set of data objects into clusters
– Similar objects within the same cluster
– Dissimilar objects in different clusters
Clustering is unsupervised
– No previous categorization known
– Totally data driven

Clustering: Example

[Scatter plot: the points form a few dense groups (clusters), with one isolated point marked as an outlier.]

A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
Similarity Computation
Distance between objects is used as the metric
The definition of the distance function usually differs for different types of attributes
A distance must satisfy the following properties:
– d(i,j) ≥ 0
– d(i,j) = d(j,i)
– d(i,j) ≤ d(i,k) + d(k,j)
Distance Calculation: objects Xi and Xj (p attributes)
Minkowski
d(i,j) = ( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + ... + |x_{ip} - x_{jp}|^q )^{1/q}
Euclidean
d(i,j) = sqrt( |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + ... + |x_{ip} - x_{jp}|^2 )
Manhattan
d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + ... + |x_{ip} - x_{jp}|
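As a quick illustration of these formulas, here is a minimal Python sketch (assuming NumPy, which is not part of the slides); the sample vectors are records L1 and L5 from the worked example a few slides later.

import numpy as np

def minkowski(x, y, q):
    # Minkowski distance of order q; q = 1 gives Manhattan, q = 2 gives Euclidean
    return np.sum(np.abs(np.asarray(x, float) - np.asarray(y, float)) ** q) ** (1.0 / q)

x, y = [3, 10, 23, 36], [1, 16, 1, 28]   # records L1 and L5 from the example below
print(minkowski(x, y, 1))   # Manhattan distance
print(minkowski(x, y, 2))   # Euclidean distance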
Methods
Partition Methods
– Iterative Methods
– Convergence criteria specified by
the user
Hierarchical Methods
– Agglomerative / Divisive
– Use Dendrogram representation

Partitioning Methods
K-Means Clustering
– Decide k – no. of clusters
– Randomly pick k seeds – use as centroids
– Repeat until the stopping condition is met
Scan database and assign each object to a cluster
Compute centroids
Evaluate quality of clustering

Example:
Records Feature1 Feature2 Feature3 Feature4

L1 3 10 23 36
L2 12 6 12 41
L3 5 12 17 24
L4 4 8 7 13
L5 1 16 1 28
L6 18 0 22 51
L7 6 8 6 12
L8 15 5 2 6
L9 0 10 15 18
L10 9 2 24 15

Initialization
We take the number of cluster centers as 3, i.e. K = 3.
Let's take the initial cluster centers as
– L1 (3, 10, 23, 36)
– L5 (1, 16, 1, 28)
– L8 (15, 5, 2, 6)

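The iterations shown on the next slides can be reproduced with a short K-means sketch in Python (assuming NumPy; an illustration, not the original course code). It starts from the seeds L1, L5 and L8 and reassigns records until the centroids stop moving.

import numpy as np

# Records L1..L10 from the example table (Feature1..Feature4)
X = np.array([
    [ 3, 10, 23, 36],   # L1
    [12,  6, 12, 41],   # L2
    [ 5, 12, 17, 24],   # L3
    [ 4,  8,  7, 13],   # L4
    [ 1, 16,  1, 28],   # L5
    [18,  0, 22, 51],   # L6
    [ 6,  8,  6, 12],   # L7
    [15,  5,  2,  6],   # L8
    [ 0, 10, 15, 18],   # L9
    [ 9,  2, 24, 15],   # L10
], dtype=float)

centroids = X[[0, 4, 7]]            # initial seeds: L1, L5, L8
while True:
    # assign each record to the nearest centroid (Euclidean distance)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(3)])
    if np.allclose(new_centroids, centroids):   # centroids no longer move: stop
        break
    centroids = new_centroids

print(np.round(centroids, 1))   # matches the centroids of the final iteration shown below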
Pictorial View of the Clusters After the First Iteration

[Figure: Cluster 1 = {L1, L2, L3, L6, L10}, centroid (9.4, 6, 19.6, 33.4); Cluster 2 = {L5, L9}, centroid (0.5, 13, 8, 23); Cluster 3 = {L4, L7, L8}, centroid (8.3, 7, 5, 10.3).]
Pictorial View of the Clusters After the Second Iteration

[Figure: L3 moves to the second cluster. Cluster 1 = {L1, L2, L6, L10}, centroid (10.5, 4.5, 20.3, 35.8); Cluster 2 = {L3, L5, L9}, centroid (2, 12.7, 11, 23.3); Cluster 3 = {L4, L7, L8}, centroid (8.33, 7, 5, 10.3).]
Pictorial View of the Clusters After the Third Iteration

[Figure: the same clusters and centroids as after the second iteration.]
The cluster centers remain the same as in the second iteration, so we stop here.
Hierarchical Methods
Agglomerative Methods
– Bottom Up approach

Divisive Methods
– Top Down Approach

Agglomerative Approach

[Figure: a lower-triangular matrix of pairwise distances d(A,B), d(A,C), d(B,C), ..., d(E,F) over the database objects A-F, and the dendrogram built bottom-up from it: in this example A, B and C merge early, E and F merge, and all objects eventually form a single cluster.]
Divisive Approach

[Figure: the same objects read top-down: the full database is first split into {A, B, C, D} and {E, F}, then {A, B, C, D} is split further (leaving {A, B, C}), and so on down to the single objects A-F.]
Clustering: Applications
Marketing Management
– Discover distinct groups in customer bases, and
then use this knowledge to develop targeted
marketing programs

Banking
– ATM Location identification
Text Mining
– Grouping documents with similar characteristics

Clustering Stock Market Data
Clustering companies using the Dow Jones Index
Trading system development
Clustering for customer profiling
New product line development
Crime Hot Spot Analysis
Clustering for Medical Diagnostics
Human Genome Project:
– Finding relationships between diseases, cellular functions, and drugs
Wisconsin Breast Cancer Study
Cancer diagnosis and prediction

Classification
Easy to agree these are sunset pictures!
These are all A's! (Handwritten characters from the NIST database)
In most cases it is easy for experts to attach class labels
– but difficult to explain why!
Classification
Supervised learning method
Use historical data to construct a model
(Hypothesis Formulation)
Discover relationship between input attributes
and target
Use the model for prediction
Major Classification Methods
– Decision Tree (ID3, CART, C4.5, SLIQ)
– Neural Network (MLP)
– Support Vector Machine
– Bayesian Classifiers (NBC, BBN)
– K-Nearest Neighbor (KNN)
The classification task
Input: a training set of tuples, each
labelled with one class label
Output: a model (classifier) which assigns
a class label to each tuple based on the
other attributes.
The model can be used to predict the
class of new tuples, for which the class
label is missing or unknown
Training step
Training data is fed to the classification algorithm, which produces a classifier (model).

NAME   AGE     INCOME  CREDIT
Mary   20-30   low     poor
James  30-40   low     fair
Bill   30-40   high    good
John   20-30   med     fair
Marc   40-50   high    good
Annie  40-50   high    good

Example rule learned from this data:
IF age = 30-40 OR income = high THEN credit = good
Test step
The classifier (model) is applied to labelled test data to measure its accuracy.

NAME   AGE     INCOME  CREDIT (actual)  CREDIT (predicted)
Paul   20-30   high    good             good
Jenny  40-50   low     fair             fair
Rick   30-40   high    fair             good
Prediction
The classifier (model) assigns a class label to unseen data.

NAME   AGE     INCOME  CREDIT (predicted)
Doc    20-30   high    good
Phil   30-40   low     good
Kate   40-50   med     fair
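A minimal sketch of the training and prediction steps above, assuming scikit-learn and pandas are available (they are not part of the slides). The categorical attributes are one-hot encoded and a decision tree is fitted; the tree learned from only six rows may not be exactly the hand-written rule shown earlier.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "NAME":   ["Mary", "James", "Bill", "John", "Marc", "Annie"],
    "AGE":    ["20-30", "30-40", "30-40", "20-30", "40-50", "40-50"],
    "INCOME": ["low", "low", "high", "med", "high", "high"],
    "CREDIT": ["poor", "fair", "good", "fair", "good", "good"],
})
unseen = pd.DataFrame({
    "NAME":   ["Doc", "Phil", "Kate"],
    "AGE":    ["20-30", "30-40", "40-50"],
    "INCOME": ["high", "low", "med"],
})

# NAME is an identifier, not a predictor; one-hot encode AGE and INCOME
X_train = pd.get_dummies(train[["AGE", "INCOME"]])
model = DecisionTreeClassifier(random_state=0).fit(X_train, train["CREDIT"])

X_unseen = pd.get_dummies(unseen[["AGE", "INCOME"]]).reindex(columns=X_train.columns, fill_value=0)
print(dict(zip(unseen["NAME"], model.predict(X_unseen))))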
Classification: Approaches
Decision Tree Induction
Neural Networks
Support Vector Machine
Bayesian Approach
Rule Induction

Decision Tree Induction
Recursive partitioning of the training set T until a stopping criterion is satisfied (purity of partition, depth of tree, etc.)
Decide the split criterion
– Select the splitting attribute
– Partition the data according to the selected attribute
– Apply induction method recursively on each partition

Decision tree inducers
ID3 – J. R. Quinlan, 1986
– Simple, uses information gain, no pruning
C4.5 – J. R. Quinlan, 1993
– Uses gain ratio, handles numeric attributes and missing values, error-based pruning
SLIQ – Mehta et al., 1996
– Scalable, one scan of the database, uses the Gini index
CART – Breiman et al., 1984
– Constructs binary trees, cost-complexity pruning, can generate regression trees
Attribute Selection Criteria
Information Gain
– Entropy(C, S) = -Σ p_i log(p_i)
– Gain = Entropy(before split) - Entropy(after split)
Gain Ratio
– Information Gain / Split Information (the entropy of the attribute's value distribution)
Gini Index
– Measures divergence between probability distributions
– Gini(C, S) = 1 - Σ p_i^2
Classical example: play tennis?
Training set from Quinlan's book:

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
Information gain - play tennis example
Choosing the best split at the root node (using the training set above):
– gain(outlook) = 0.246
– gain(temperature) = 0.029
– gain(humidity) = 0.151
– gain(windy) = 0.048
Outlook gives the largest gain, so it becomes the root; the resulting tree is shown on the next slide.
The criterion is biased towards attributes with many values; corrections have been proposed (gain ratio).
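These gains can be recomputed with a few lines of Python (a sketch, not part of the original slides), using the play-tennis training set above:

from collections import Counter
from math import log2

# (outlook, temperature, humidity, windy) -> class, as in the training set above
data = [
    ("sunny","hot","high",False,"N"), ("sunny","hot","high",True,"N"),
    ("overcast","hot","high",False,"P"), ("rain","mild","high",False,"P"),
    ("rain","cool","normal",False,"P"), ("rain","cool","normal",True,"N"),
    ("overcast","cool","normal",True,"P"), ("sunny","mild","high",False,"N"),
    ("sunny","cool","normal",False,"P"), ("rain","mild","normal",False,"P"),
    ("sunny","mild","normal",True,"P"), ("overcast","mild","high",True,"P"),
    ("overcast","hot","normal",False,"P"), ("rain","mild","high",True,"N"),
]

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def info_gain(rows, attr):
    # entropy before the split minus the weighted entropy of the partitions
    after = 0.0
    for v in {r[attr] for r in rows}:
        part = [r for r in rows if r[attr] == v]
        after += len(part) / len(rows) * entropy(part)
    return entropy(rows) - after

for i, name in enumerate(["outlook", "temperature", "humidity", "windy"]):
    print(name, round(info_gain(data, i), 3))
# outlook 0.246, temperature 0.029, humidity 0.151, windy 0.048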
Decision tree obtained with ID3 (Quinlan 86)

outlook = sunny    → test humidity: high → N, normal → P
outlook = overcast → P
outlook = rain     → test windy: true → N, false → P
From decision trees to classification rules
One rule is generated for each path in the tree from the root to a leaf
Rules are generally simpler to understand than trees
Example rule from the sunny/normal path:
IF outlook = sunny AND humidity = normal THEN play tennis (P)
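Written out in code, the tree above is just nested rules; a tiny illustrative Python sketch:

def play_tennis(outlook, humidity, windy):
    # one branch per root-to-leaf path of the ID3 tree above
    if outlook == "overcast":
        return "P"
    if outlook == "sunny":
        return "P" if humidity == "normal" else "N"
    if outlook == "rain":
        return "N" if windy else "P"
    return None   # outlook value not seen in training

print(play_tennis("sunny", "normal", False))   # P, the rule highlighted above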
Advantages & Limitations
Advantages
– Self-explanatory
– Handle both numeric and categorical data
– Non-parametric method
Limitations
– Most algorithms predict only categorical attributes
– Overfitting (overtraining)
– Need for pruning
– Trees can grow large
Classification: Application
Bank Loan Granting System
– Decision trees constructed from bank-loan
histories to produce algorithms to decide
whether to grant a loan or not.
Anti – Money Laundering System
– KYC Status
Email Classification System
– Spam or Not

Stock Research Application
Efficient Prediction of Option Prices using
Machine Learning Techniques
Prediction of both European and American Option Prices using
General Regression Neural Network and Support Vector
Regression.
Stock Portfolio Management: Prediction of Risk using Text Classification
Prediction or classification of the risk in investing in a particular company, by text classification using Naïve Bayes (NB) and K-Nearest Neighbor (KNN).
Prediction of Financial Data Series: using the MATLAB GARCH Toolbox
Pattern Recognition:
Artificial Neural Network Applications
Letter Recognition System
– Zip code identification system
– Apple's Newton uses a neural net
Speech Recognition System
– Voice dialing
Image Processing
– Bioinformatics
Emitter Classification
ELINT Data Analysis
Identification of Radars and Platform
Successfully Delivered DAPR software to
Indian Navy (INTEG)

[Figure: recorded waveforms for Speaker 1 and Speaker 2 saying the same text.]
Although the spoken words are the same, the recorded digital signals are very different!
Pattern Recognition Example
[Figure: a noisy input image and the recognized pattern.]
Association Rule Mining
(Market Basket Analysis)

Association Rules
Intra-record links
– Finding associations among sets of objects in transaction databases, relational databases
– Rule form: "Antecedent ⇒ Consequent [support, confidence]"
Examples
– shirt, tie, socks ⇒ shoes [0.5%, 60%]
– white bread, butter ⇒ egg [2.3%, 80%]
Preliminaries
Given: (1) a database of transactions; (2) each transaction is a list of items
Find: all rules that correlate the presence of one set of items with that of another set of items
– E.g., 95% of people who purchase a PC and a color printer also purchase a computer table
Business questions:
– * ⇒ Electronic items (what should the store do to boost sales of electronic items?)
– Herbal health products ⇒ * (what other products should the store stock up on?)
Formal Definition
If X and Y are two itemsets such that X ∩ Y = ∅, then for an association rule X ⇒ Y:
– Support is the probability that X and Y occur together, P(X ∪ Y)
– Confidence is the conditional probability that Y occurs in a transaction given that X is present in the same transaction, P(Y|X)
Itemset and Support

Trans ID  Items
10        A, B, C
20        A, C
30        A, C, D
40        B, C, E
50        A, C, E

[Venn diagram: item A occurs in 4 transactions, item C in 5, A and C together in 4.]
Sup(A): 4 (80%), Sup(AB): 1 (20%), Sup(ABC): 1 (20%), Sup(ABCD): 0, Sup(ABCDE): 0
Confidence
Strength of the discovered rule
Computed as P(X, Y) / P(X)
– A ⇒ C: 4/4 = 100%
– C ⇒ A: 4/5 = 80%
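Support and confidence for this small transaction table can be computed directly; a minimal Python sketch (illustrative, not from the slides):

# Transactions from the table above (TID -> set of items)
transactions = {10: {"A", "B", "C"}, 20: {"A", "C"}, 30: {"A", "C", "D"},
                40: {"B", "C", "E"}, 50: {"A", "C", "E"}}

def support(itemset):
    # fraction of transactions that contain every item of the itemset
    s = set(itemset)
    return sum(s <= t for t in transactions.values()) / len(transactions)

def confidence(antecedent, consequent):
    # estimated P(consequent | antecedent)
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"A"}))             # 0.8
print(confidence({"A"}, {"C"}))   # 1.0  (A => C, 4/4)
print(confidence({"C"}, {"A"}))   # 0.8  (C => A, 4/5)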
Interestingness
(transactions as in the table on the previous slides)
Minimum Support
– User-specified parameter (defines the frequent itemsets)
– For minsup of 50%, F = {A, C, AC}
– For minsup of 30%, F = {A, B, C, AC, E}
Minimum Confidence
– Report rules that satisfy the minimum confidence level
– With minconf of 50%, some of the discovered rules are A ⇒ C [75%], AB ⇒ C [100%], E ⇒ F [100%], etc.
The Apriori algorithm
The best known algorithm
Two steps:
– Find all itemsets that have minimum support (frequent itemsets, also called large itemsets)
– Use the frequent itemsets to generate rules
E.g., a frequent itemset:
{Chicken, Clothes, Milk} [sup = 3/7]
and one rule from the frequent itemset:
Clothes ⇒ Milk, Chicken [sup = 3/7, conf = 3/3]
Apriori Example

TID  Items
1    Bread, Milk
2    Beer, Diaper, Bread, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Bread, Diaper, Milk

[The algorithm proceeds level by level: 1-itemsets, 2-itemsets, 3-itemsets.]
Example: Finding frequent itemsets (minsup = 0.5)

Dataset T:
TID   Items
T100  1, 3, 4
T200  2, 3, 5
T300  1, 2, 3, 5
T400  2, 5

itemset:count
1. scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
          → F1: {1}:2, {2}:3, {3}:3, {5}:3
          → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
          → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
          → C3: {2,3,5}
3. scan T → C3: {2,3,5}:2 → F3: {2,3,5}
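The three passes above can be reproduced with a compact level-wise sketch in Python (an illustration of the join/prune idea, using only the standard library):

from itertools import combinations

T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]   # dataset T from the slide
minsup = 0.5
min_count = minsup * len(T)

def apriori(transactions):
    level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    frequent = {}
    while level:
        counts = {c: sum(c <= t for t in transactions) for c in level}
        fk = [c for c, n in counts.items() if n >= min_count]
        frequent.update({c: counts[c] for c in fk})
        k = len(level[0]) + 1
        # join: union pairs of frequent itemsets; prune: all (k-1)-subsets must be frequent
        candidates = {a | b for a in fk for b in fk if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
    return frequent

for itemset, count in sorted(apriori(T).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
# reproduces F1 = {1},{2},{3},{5}; F2 = {1,3},{2,3},{2,5},{3,5}; F3 = {2,3,5}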
An example
F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}
After the join step:
– C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}
After the prune step:
– C4 = {{1, 2, 3, 4}}
({1, 3, 4, 5} is removed because its subset {1, 4, 5} is not in F3)
Generating rules: an example
Suppose {2,3,4} is frequent, with sup = 50%
– Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with sup = 50%, 50%, 75%, 75%, 75%, 75% respectively
– These generate the following association rules:
  2,3 ⇒ 4, confidence = 100%
  2,4 ⇒ 3, confidence = 100%
  3,4 ⇒ 2, confidence = 67%
  2 ⇒ 3,4, confidence = 67%
  3 ⇒ 2,4, confidence = 67%
  4 ⇒ 2,3, confidence = 67%
All rules have support = 50%
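The rule-generation step is a small loop over the proper nonempty subsets of the frequent itemset; a sketch (the supports below are taken from the slide, and minconf = 60% is just an illustrative threshold):

from itertools import combinations

support = {  # supports of {2,3,4} and of its proper subsets, from the slide
    frozenset({2, 3, 4}): 0.50,
    frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50, frozenset({3, 4}): 0.75,
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
}
itemset, minconf = frozenset({2, 3, 4}), 0.6

for r in range(1, len(itemset)):                       # every proper nonempty antecedent
    for antecedent in map(frozenset, combinations(sorted(itemset), r)):
        conf = support[itemset] / support[antecedent]
        status = "kept" if conf >= minconf else "pruned"
        print(f"{sorted(antecedent)} => {sorted(itemset - antecedent)} conf={conf:.0%} ({status})")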
On the Apriori Algorithm
It seems very expensive, but:
– It is a level-wise search: if K is the size of the largest itemset, it makes at most K passes over the data
– In practice K is small (bounded by about 10)
– The algorithm is very fast; under some conditions all rules can be found in linear time
– It scales up to large data sets
AR: Applications

Retail Marketing
– Floor Planning, Discounting, Catalogue Design
Medical Diagnosis
– Comparison of the genotype of people with/without a
condition allowed the discovery of a set of genes that
together account for many cases of diabetes.
Geographical Information systems
– Link Analysis

Walmart Study

Typical Business Decisions
(For Walmart)
What to put on sale?
How to design coupons?
How to place merchandise etc. on the shelf to maximise profit?
Walmart Study Contd.
The IBM Almaden group did an exhaustive study on Walmart (the biggest superstore chain in the US) to discover customers' buying patterns.
Interesting rules came out of the study:
– Useful: on Fridays, convenience store customers often purchase diapers and beer together.
– Trivial: customers who purchase maintenance agreements are very likely to purchase large appliances.
Defence Applications
Applications in Defence
– Finding associations in terrorist activities (e.g., the 9/11 attack)
– Finding associations when studying the behavior of the enemy during war
– Finding associations which may indicate an intrusion threat at strategic locations
The MINDS Project
MINDS – MINnesota INtrusion Detection System
– Characterization of attacks using association pattern analysis

Tid  SrcIP          Start time  Dest IP          Dest Port  Number of bytes
1    206.163.37.95  11:07:20    160.94.179.223   139        192
2    206.163.37.95  11:13:56    160.94.179.219   139        195
3    206.163.37.95  11:14:29    160.94.179.217   139        180
4    206.163.37.95  11:14:30    160.94.179.255   139        199
5    206.163.37.95  11:14:32    160.94.179.254   139        186
6    206.163.37.95  11:14:35    160.94.179.253   139        177
7    206.163.37.95  11:14:36    160.94.179.252   139        172
8    206.163.37.95  11:14:38    160.94.179.251   139        192
9    206.163.37.95  11:14:41    160.94.179.250   139        195
10   206.163.37.95  11:14:44    160.94.179.249   139        163

Rules Discovered:
{Src IP = 206.163.37.95, Dest Port = 139, Bytes ∈ [150, 200]} --> {SCAN}
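To make the discovered pattern concrete, a toy Python sketch of applying it to new connection records (the third record is a hypothetical non-matching flow added for illustration):

flows = [
    {"src_ip": "206.163.37.95", "dest_port": 139, "bytes": 192},   # from the table above
    {"src_ip": "206.163.37.95", "dest_port": 139, "bytes": 163},   # from the table above
    {"src_ip": "10.0.0.1",      "dest_port": 80,  "bytes": 512},   # hypothetical normal traffic
]

suspected = [f for f in flows
             if f["src_ip"] == "206.163.37.95"
             and f["dest_port"] == 139
             and 150 <= f["bytes"] <= 200]
print(len(suspected), "record(s) match {Src IP, Dest Port, Bytes in [150,200]} => {SCAN}")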
List of Papers
Published
1. Robust Approach for Estimating Probabilities in Naive-Bayes Classifier for
Gene Expression Data, Elsevier: Expert Systems with Applications, 2010, doi:
10.1016/j.eswa.2010.06.076.
2. Best Paper Award: Ranking Police Administration Units on the Basis of Crime
Prevention Measures using Data Envelopment Analysis and Clustering, 6th
International Conference on E-Governance (ICEG 2008), 40-53.
3. Towards situation awareness in integrated air defence using clustering and
case based reasoning, Springer: Lecture Notes in Computer Science, 5909,
579–584, 2009.
4. Adaptive Query Interface for Mining Crime Data, Springer: Lecture Notes in
Computer Science (LNCS), 4777, 2007, 285-296.
5. Robust Approach for Estimating Probabilities in Naive-Bayes Classifier,
Springer: Lecture Notes in Computer Science (LNCS), 4815, 2007, 11-16.
6. A Multivariate Time Series Clustering Approach for Crime Trends Prediction,
Proc. of IEEE System Man & Cybernetics, 2008, 892-896.
7. Crime Data Mining for Indian Police Information System, Proc. 5th International
Conference on E-governance (ICEG 2007), 388-397.
8. Clustering with Varying Weights on Types of Crime, ORSI Conference, 2008
List of Papers (Contd.)
Communicated
1. An Efficient Statistical Feature Selection Approach for Classification of Gene
Expression Data, Journal of Biomedical Informatics, July 2010, Resubmission with
minor modification.
2. Towards A Framework of Intelligent Decision Support System for Indian Police,
Elsevier: Decision Support Systems, May 2010.
3. A Statistical Approach for Feature Selection and Ranking, Elsevier: Pattern
Recognition, June 2010.
4. A Novel Approach for Distance-Based Semi-Supervised Clustering using Functional
Link Neural Network, Springer Soft Computing, June 2010.
5. An Efficient Similarity Measure based Multivariate Time Series Clustering Approach
for Performance Analysis, IEEE System Man and Cybernetics, May 2010.
6. Issues and Challenges for Emitter Classification in the Context of Electronic
Warfare. Defence Science Journal.
7. A Novel Approach for Weighted Clustering using Hyperlink-Induced Topic Search
(HITS) Algorithm, Defence Science Journal.
References
Han and Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann.
Arun Pujari, Data Mining Techniques, University Press.
Hand et al., Principles of Data Mining, PHI.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for
data mining. In Proc. 1996 Int. Conf. Extending Database Technology
(EDBT'96), Avignon, France, March 1996.
R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between
sets of items in large databases. SIGMOD'93, 207-216, Washington, D.C.
R. Agrawal and R. Srikant. Fast algorithms for mining association rules.
VLDB'94 487-499, Santiago, Chile.
J. MacQueen, “Some methods for classification and analysis of multivariate
observations”, Proc. Symp. Math. Statist. And Probability, 5th, Berkeley, 1,
1967, 281–298.
A. K. Jain, M. N. Murty, P. J. Flynn, “Data clustering: a review”, ACM
Computing Surveys, 31(3), 1999, 264–323.
Questions?

