Training Program on Data Mining & Business Analytics using RapidMiner
Boby J
Indian Statistical Institute
Contents
DATA PREPROCESSING
Missing Value Handling
If more than 20% of an attribute's values are missing, the data are not sufficient to develop the model with that attribute.
Ignore the corresponding attribute and proceed.
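The 20% rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the attribute names and the helper names `percent_missing` and `attributes_to_keep` are hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"usage": 3.5, "recharge": 99.4, "promo": None},
    {"usage": 3.0, "recharge": 98.6, "promo": 1},
    {"usage": None, "recharge": 98.5, "promo": None},
    {"usage": 3.9, "recharge": 98.3, "promo": 0},
    {"usage": 3.2, "recharge": 95.3, "promo": 1},
]

def percent_missing(rows, attribute):
    """Return the percentage of rows whose value for `attribute` is None."""
    missing = sum(1 for row in rows if row[attribute] is None)
    return 100.0 * missing / len(rows)

def attributes_to_keep(rows, threshold=20.0):
    """Keep attributes whose missing percentage does not exceed the threshold."""
    return [name for name in rows[0] if percent_missing(rows, name) <= threshold]

for name in records[0]:
    print(name, percent_missing(records, name))
print(attributes_to_keep(records))
```

Here `promo` is 40% missing and would be ignored, while `usage` (exactly 20% missing) is kept because the rule only rejects attributes above 20%.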
Last 3 Month's Usage    A        B        C        Grand Total
Missing                 1        1        0        2
Non Missing             5        6        6        17
Grand Total             6        7        6        19

Average Recharge        A        B        C        Grand Total
Missing                 1        1        0        2
Non Missing             5        6        6        17
Grand Total             6        7        6        19

Conclusion: In none of the cases are 100% of the values missing

Last 3 Month's Usage    A        B        C
Missing                 16.67    14.29    0.00
Non Missing             83.33    85.71    100.00
Grand Total             100      100      100

Average Recharge        A        B        C
Missing                 16.67    14.29    0.00
Non Missing             83.33    85.71    100.00
Grand Total             100      100      100

Projected Growth        A        B        C
Missing                 0        0        0
Non Missing             100      100      100
Grand Total             100      100      100

Conclusion: In none of the cases are 100% of the values missing
Example: 3 Choices
Choice 1: Ignore missing value records
SL No.   Current Month's Usage   Last 3 Month's Usage   Average Recharge   Projected Growth   Circle
1        5.1                     3.5                    99.4               99.2               A
2        4.9                     3                      98.6               99.2               A
4        4.6                     3.1                    98.5               9..2               A
6        5.4                     3.9                    98.3               99.4               A
7        7                       3.2                    95.3               98.4               B
8        6.4                     3.2                    95.5               98.5               B
9        6.9                     3.1                    95.1               98.5               B
11       6.5                     2.8                    95.4               98.5               B
14       6.7                     3.3                    94.3               97.5               C
15       6.7                     3                      94.8               97.3               C
16       6.3                     2.5                    95                 98.9               C
18       6.2                     3.4                    94.6               97.3               C
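Choice 1 amounts to dropping every record that has at least one missing value. A minimal sketch, assuming hypothetical records in which `None` marks a missing value (the helper name `drop_incomplete` is also an assumption):

```python
# Hypothetical records; None marks a missing value.
records = [
    {"sl_no": 1, "usage": 5.1, "recharge": 99.4},
    {"sl_no": 2, "usage": None, "recharge": 98.6},
    {"sl_no": 3, "usage": 4.6, "recharge": None},
    {"sl_no": 4, "usage": 5.4, "recharge": 98.3},
]

def drop_incomplete(rows):
    """Keep only the rows in which every attribute has a value."""
    return [row for row in rows if all(value is not None for value in row.values())]

complete = drop_incomplete(records)
print([row["sl_no"] for row in complete])
```

The surviving serial numbers have gaps, just as in the table above, because the incomplete records are removed rather than repaired.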
MARKET BASKET ANALYSIS
A modeling technique based on the logic that if a customer buys a certain group of
items, he or she is more (or less) likely to buy another group of items
Example:
Those who buy cigarettes are more likely to buy a matchbox as well.
Example
Id Items
1 Milk, Bread
2 Bread, Biscuits, Toys, Eggs
3 Milk, Biscuits, Toys, Fruits
4 Bread, Milk, Toys, Biscuits
5 Milk, Bread, Biscuits, Fruits
Itemset: a collection of one or more items, e.g. {Milk, Bread}
k-itemset: an itemset consisting of k items
Support count: number of transactions that contain the itemset
Example
Support count of {Milk, Bread, Biscuits} = 2
Support: fraction of transactions that contain the itemset
Example
Support of {Milk, Bread, Biscuits} = 2 / 5
Frequent Itemset
An itemset whose support is greater than or equal to the minimum support
Confidence
Conditional probability that an item will appear in transactions that contain other
items
Example
Confidence that Toys will appear in transactions containing Milk & Biscuits
= Support count of {Milk, Biscuits, Toys} / Support count of {Milk, Biscuits} = 2 / 3 = 0.67
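The three measures can be checked against the five example transactions with a short Python sketch. The function names `support_count`, `support` and `confidence` are chosen here for illustration and are not taken from RapidMiner.

```python
# The five example transactions, one set of items per basket.
transactions = [
    {"Milk", "Bread"},
    {"Bread", "Biscuits", "Toys", "Eggs"},
    {"Milk", "Biscuits", "Toys", "Fruits"},
    {"Bread", "Milk", "Toys", "Biscuits"},
    {"Milk", "Bread", "Biscuits", "Fruits"},
]

def support_count(itemset, baskets):
    """Number of baskets that contain every item of the itemset."""
    return sum(1 for basket in baskets if set(itemset) <= basket)

def support(itemset, baskets):
    """Fraction of baskets that contain the itemset."""
    return support_count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Share of baskets containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support_count(both, baskets) / support_count(antecedent, baskets)

print(support_count({"Milk", "Bread", "Biscuits"}, transactions))
print(support({"Milk", "Bread", "Biscuits"}, transactions))
print(round(confidence({"Milk", "Biscuits"}, {"Toys"}, transactions), 2))
```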
2. Rule Generation
Fix minimum confidence value
Generate high confidence rules from each frequent itemset
Step 1:
Generate itemsets of length = 1 & calculate support
Step 2:
eliminate itemsets with support count < minimum support count (2)
Step 3:
generate itemsets of length = 2
Step 4:
eliminate itemsets with support count < minimum support count (2)
Step 5:
generate itemsets of length = 3
Step 6:
generate itemsets of length = 4
Itemset Support Count
A, B, C, E 1
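Steps 1 to 6 can be sketched as a level-wise loop in Python. The transaction table with items A to E used on these slides is not reproduced in the text, so the sketch assumes the five Milk/Bread baskets from the earlier example and a minimum support count of 2. `frequent_itemsets` is a hypothetical helper, and candidates are built by a plain join of surviving itemsets, without the extra subset pruning that fuller Apriori-style implementations add.

```python
from itertools import combinations

transactions = [
    {"Milk", "Bread"},
    {"Bread", "Biscuits", "Toys", "Eggs"},
    {"Milk", "Biscuits", "Toys", "Fruits"},
    {"Bread", "Milk", "Toys", "Biscuits"},
    {"Milk", "Bread", "Biscuits", "Fruits"},
]

def frequent_itemsets(baskets, min_count):
    """Return {itemset: support count} for every itemset meeting min_count."""
    items = sorted({item for basket in baskets for item in basket})
    candidates = [frozenset([item]) for item in items]  # itemsets of length 1
    frequent = {}
    k = 1
    while candidates:
        # Count the support of each candidate of length k.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        # Eliminate candidates below the minimum support count.
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        # Join surviving itemsets to generate candidates of length k + 1.
        k += 1
        candidates = list({a | b for a, b in combinations(survivors, 2)
                           if len(a | b) == k})
    return frequent

result = frequent_itemsets(transactions, min_count=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On these baskets the loop stops after length 4, because every 4-itemset has a support count of 1.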
Result:
Lift
Lift (A , C) = Support (A , C) / (Support (A) x Support (C))
Example
Criteria : Lift ≥ 1
Lift (A , C) = 1.12 > Lift (A , E) indicates that A has a greater impact on the
frequency of C than it has on the frequency of E
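Lift can be computed the same way as support. The sketch below assumes the usual definition, the support of both itemsets together divided by the product of their individual supports, and uses the five Milk/Bread baskets from the earlier example because the A, C and E transactions behind the slide's figure of 1.12 are not reproduced in the text. The function names are illustrative.

```python
transactions = [
    {"Milk", "Bread"},
    {"Bread", "Biscuits", "Toys", "Eggs"},
    {"Milk", "Biscuits", "Toys", "Fruits"},
    {"Bread", "Milk", "Toys", "Biscuits"},
    {"Milk", "Bread", "Biscuits", "Fruits"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item of the itemset."""
    return sum(1 for basket in baskets if set(itemset) <= basket) / len(baskets)

def lift(left, right, baskets):
    """Support of both itemsets together over the product of their supports."""
    both = set(left) | set(right)
    return support(both, baskets) / (support(left, baskets) * support(right, baskets))

print(round(lift({"Toys"}, {"Biscuits"}, transactions), 4))
print(round(lift({"Milk"}, {"Bread"}, transactions), 4))
```

A lift above 1 (Toys with Biscuits here) means the two itemsets occur together more often than independence would suggest; a lift below 1 (Milk with Bread here) means less often.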
Exercise 2:
The market basket Software data set contains the details of transactions at a
software product company.
LINEAR REGRESSION
CORRELATION & REGRESSION
Correlation: Usage
Explore the relationship between the output characteristic and input or process
variable.
Scatter Plot
[Figures: four example scatter plots of Y against X]
Symbol : r
Range : -1 to 1
Sign : Type of correlation
Value : Degree of correlation
Examples:
r = 0.6 , 60 % positive correlation
r = -0.82, 82% negative correlation
r = 0, No correlation
Collect data on x and y: When x is low, y is also low & vice versa
x y
2 5
3 7
1 3
5 11
6 12
7 15
SL No. x y
1 2 5
2 3 7
3 1 3
4 5 11
5 6 12
6 7 15
Mean 4 8.83
After subtracting the means, when an x deviation is negative the y deviation is generally also negative, and vice versa
SL No.   x - mean(x)   y - mean(y)
1        -2            -3.83
2        -1            -1.83
3        -3            -5.83
4         1             2.17
5         2             3.17
6         3             6.17
Then
The products of the x and y deviations will be positive
SL No.   x - mean(x)   y - mean(y)   Product
1        -2            -3.83          7.66
2        -1            -1.83          1.83
3        -3            -5.83         17.49
4         1             2.17          2.17
5         2             3.17          6.34
6         3             6.17         18.51
Sum = Sxy = 54
Collect data on x and y: When x is low then y will be high & vice versa
x y
2 12
3 11
1 15
5 7
6 5
7 3
SL No. x y
1 2 12
2 3 11
3 1 15
4 5 7
5 6 5
6 7 3
Mean 4 8.83
After subtracting the means, when an x deviation is negative the y deviation is generally positive, and vice versa
Then
The products of the x and y deviations will be negative
SL No.   x - mean(x)   y - mean(y)   Product
1        -2             3.17          -6.34
2        -1             2.17          -2.17
3        -3             6.17         -18.51
4         1            -1.83          -1.83
5         2            -3.83          -7.66
6         3            -5.83         -17.49
Sum = Sxy = -54
Coefficient of Correlation:
In Short
If correlation is positive
Sxy will be positive
If correlation is negative
Sxy will be negative
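Coefficient of Correlation:
SL No.   x - mean(x)   y - mean(y)   Product   (x - mean(x))^2   (y - mean(y))^2
1        -2             3.17          -6.34    4                 10.0489
2        -1             2.17          -2.17    1                  4.7089
3        -3             6.17         -18.51    9                 38.0689
4         1            -1.83          -1.83    1                  3.3489
5         2            -3.83          -7.66    4                 14.6689
6         3            -5.83         -17.49    9                 33.9889
Sum      Sxy: -54   Sxx: 28   Syy: 104.83
r = Sxy / √(Sxx x Syy) = -54 / √(28 x 104.83) = -0.997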
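The sums Sxy, Sxx and Syy can be recomputed from the raw x and y values of the second data set. The sketch assumes the standard Pearson coefficient, Sxy divided by the square root of Sxx times Syy; the variable and function names are illustrative.

```python
from math import sqrt

# Second data set: y falls as x rises.
x = [2, 3, 1, 5, 6, 7]
y = [12, 11, 15, 7, 5, 3]

def deviations(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

dx, dy = deviations(x), deviations(y)
sxy = sum(a * b for a, b in zip(dx, dy))  # sum of products of deviations
sxx = sum(a * a for a in dx)              # sum of squared x deviations
syy = sum(b * b for b in dy)              # sum of squared y deviations
r = sxy / sqrt(sxx * syy)

print(round(sxy, 2), round(sxx, 2), round(syy, 2), round(r, 4))
```

Swapping in the first data set (y = 5, 7, 3, 11, 12, 15) gives the same magnitudes with a positive Sxy and a positive r.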
Regression
Correlation helps
To check whether two variables are related
If related
Identify the type & degree of relationship
Regression
Regression helps
• To identify the exact form of the relationship
• To model output in terms of input or process variables
Examples:
Yield = 5 + 3 x Time - 2 x Temperature
Y = 2 - 5x
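For a single input variable, an equation of this form can be fitted by ordinary least squares. The slides do not name the estimation method, so least squares is an assumption here. The sketch reuses the first x and y data set from the correlation slides, where the slope is Sxy / Sxx and the intercept is mean(y) minus the slope times mean(x). `fit_line` is a hypothetical helper name.

```python
# First data set from the correlation slides: y rises with x.
x = [2, 3, 1, 5, 6, 7]
y = [5, 7, 3, 11, 12, 15]

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(xs, ys))
    sxx = sum((a - x_mean) ** 2 for a in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

intercept, slope = fit_line(x, y)
print(f"y = {intercept:.3f} + {slope:.3f} x")
```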
Multiple Regression
Exercise 1: The data on Vendor performance score and the number of On Time,
Complete, Undamaged & Correctly billed shipments from the vendors of a
supply chain management company are given below. Can you develop a
model for Vendor performance score in terms of other variables?
Vendor Id   On-time Shipments   Complete Shipments   Undamaged Shipments   Correctly Billed   Performance Score
1           950                 990                  980                   550                2985
2           1450                1425                 1475                  975                4576
3           1700                1575                 1730                  1320               5435
4           1800                1515                 1890                  1615               5955
5           1675                1420                 1756                  1456               5400
6           1756                1645                 1835                  1489               5590
7           1236                1462                 1335                  1435               4675
8           1100                1523                 1565                  1625               4960
9           1325                1725                 1570                  1520               5325
10          1450                1620                 1463                  1430               5170
11          1570                1458                 1356                  1630               5190
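One possible way to approach the exercise outside RapidMiner is a multiple linear regression fitted by least squares. The sketch below is an assumption about the method, not the course's prescribed solution: it centers the variables, solves the normal equations by Gaussian elimination, and uses the hypothetical helper names `solve` and `fit`.

```python
# (on-time, complete, undamaged, correctly billed, performance score)
vendors = [
    (950, 990, 980, 550, 2985), (1450, 1425, 1475, 975, 4576),
    (1700, 1575, 1730, 1320, 5435), (1800, 1515, 1890, 1615, 5955),
    (1675, 1420, 1756, 1456, 5400), (1756, 1645, 1835, 1489, 5590),
    (1236, 1462, 1335, 1435, 4675), (1100, 1523, 1565, 1625, 4960),
    (1325, 1725, 1570, 1520, 5325), (1450, 1620, 1463, 1430, 5170),
    (1570, 1458, 1356, 1630, 5190),
]
xs = [list(row[:4]) for row in vendors]
ys = [row[4] for row in vendors]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit(rows, targets):
    """Least-squares fit of target = intercept + sum(slope_j * x_j)."""
    n, p = len(rows), len(rows[0])
    x_mean = [sum(row[j] for row in rows) / n for j in range(p)]
    y_mean = sum(targets) / n
    xc = [[row[j] - x_mean[j] for j in range(p)] for row in rows]
    yc = [t - y_mean for t in targets]
    xtx = [[sum(r[i] * r[j] for r in xc) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(xc, yc)) for i in range(p)]
    slopes = solve(xtx, xty)
    intercept = y_mean - sum(b * mean for b, mean in zip(slopes, x_mean))
    return intercept, slopes

intercept, slopes = fit(xs, ys)
print(round(intercept, 2), [round(b, 3) for b in slopes])
```

Centering before solving keeps the normal equations better conditioned than working with the raw shipment counts, which all lie in the hundreds or thousands.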
CLASSIFICATION METHODS
INTRODUCTION
Objective
To develop a mathematical model for an attribute or response metric (Y) in terms of
other available attributes (Xs).
When to Use
Xs : Continuous or discrete
Y : Discrete
Usage:
For classifying future or unknown data
Example:
Attribute 1: x1
Attribute 2: x2
Label y: y1 (Red), y2 (Blue)
x1 x2 Y x1 x2 Y
11.35 23 Blue 11.85 39.9 Red
11.59 22.3 Blue 12.09 39.5 Red
12.19 24.5 Blue 12.69 37.8 Red
13.23 26.4 Blue 13.73 38.2 Red
13.51 30.2 Blue 14.01 37.8 Red
13.68 32 Blue 14.18 36.5 Red
14.78 33.1 Blue 15.28 36 Red
15.11 33 Blue 15.61 37.1 Red
15.55 25.2 Blue 16.05 33.1 Red
16.37 24.1 Blue 16.87 32.4 Red
16.99 22 Blue 17.49 31 Red
18.23 23.5 Blue 18.73 32 Red
18.83 24.1 Blue 19.33 31.8 Red
19.06 25 Blue 19.56 30.9 Red
[Figures: scatter plots of x2 against x1 for the records above, with regions of the plot labeled y1 and y2]
Example: Rules
If x2 > 35 then y = y1
Example: Table 1 gives the profile of customers (Refund, Marital
Status & Taxable Income) who have taken a loan from a bank. The table also
shows how many of them actually cheated the bank.
1. Can you develop a decision rule to classify whether a customer will
cheat or not, based on the values of the 3 attributes (Refund, Marital Status &
Taxable Income)?
2. Validate the model using the test data given in Table 2.
Example:Result
Example:Decision Tree
1. Confusion Matrix
                     Actual Class
Predicted Class      Class = Yes    Class = No
Class = Yes          a              b
Class = No           c              d
2. Accuracy: (a+d) / (a + b + c + d)
3. Precision: a / (a + b)
4. Recall: a / (a + c)
SL No.   Actual   Predicted
1        No       No
2        No       Yes
3        No       No
4        No       No
5        Yes      Yes
1. Confusion Matrix
                     Actual Class
Predicted Class      Cheat = No     Cheat = Yes
Cheat = No           3              0
Cheat = Yes          1              1
2. Accuracy: (3 + 1) / (3 + 0 + 1 + 1) = 4 / 5 = 0.80
3. Precision: 3 / (3 + 0) = 3 / 3 = 1.0
4. Recall: 3 / (3 + 1) = 3 / 4 = 0.75
5. F Measure = 2 x Precision x Recall / (Precision + Recall)
= 2 x 1.0 x 0.75 / (1.00 + 0.75) = 0.86
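The same figures can be reproduced from the five actual and predicted labels. In this sketch, Cheat = No is treated as the class of interest, which matches the cells a = 3, b = 0, c = 1 and d = 1 used above; the function name `metrics` and the argument `positive` are illustrative.

```python
# Actual and predicted Cheat labels for the five test records.
actual = ["No", "No", "No", "No", "Yes"]
predicted = ["No", "Yes", "No", "No", "Yes"]

def metrics(actual, predicted, positive):
    """Return accuracy, precision, recall and F measure for the given class."""
    pairs = list(zip(actual, predicted))
    a = sum(1 for act, pred in pairs if pred == positive and act == positive)
    b = sum(1 for act, pred in pairs if pred == positive and act != positive)
    c = sum(1 for act, pred in pairs if pred != positive and act == positive)
    d = len(pairs) - a - b - c
    accuracy = (a + d) / len(pairs)
    precision = a / (a + b)
    recall = a / (a + c)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

accuracy, precision, recall, f_measure = metrics(actual, predicted, positive="No")
print(accuracy, precision, recall, round(f_measure, 2))
```

The sketch does not guard against a class that is never predicted, in which case the precision denominator would be zero.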
Challenges
How to represent the entire information in the dataset using minimum number
of rules?
How to develop the smallest tree?
Solution
Split Attribute
First Marital Status
Second Refund
Third Taxable Income
CHAID Algorithm
Example: A marketing company wants to optimize its mailing campaign by sending
the brochure only to those customers who responded to previous mail
campaigns. The customer profiles are given below. Can you develop a rule to
identify the profile of customers who are likely to respond?
Number of variables = 4
Exercise 1: A bank wants to know the profile of customers who will buy a Personal
Equity Plan (Pep) after the mailing campaign. The data is given in the
file named bank-data.xls.
CLUSTER ANALYSIS
Objective
To classify the records or items into a smaller number of groups based on the values
of available attributes.
When to Use
When there is no Y attribute
All attributes are considered as Xs only
Methodology to group objects based on many attributes such that objects in a group
Types of Clustering
• K Mean Clustering
• K Medoid Clustering
K Mean Clustering
Methodology to group objects based on many attributes such that objects in a cluster will
be closer (or more similar) to the centroid of the cluster than to the centroid of any other
cluster.
Example:
Euclidean Distance
K Mean Clustering:Algorithm
K Mean Clustering:Example
Cluster the following data with 3 attributes (Spend in 3 quarters) into 2 clusters
SL No. Quarter 1 Quarter 2 Quarter 3
1 1.425172 31.08748 108.5436
2 3.017551 34.17728 103.4577
3 3.803405 34.78973 101.7977
4 4.299151 31.02313 107.3701
5 5.352034 22.80945 109.9353
6 6.038361 22.21948 100.1809
7 6.128493 25.04893 111.0543
8 8.381028 23.6761 106.3302
9 8.989409 27.62143 106.7186
10 9.788646 27.35268 105.7799
Step 1: k = 2
Step 2: Randomly identify 2 centroids
Centroid Quarter 1 Quarter 2 Quarter 3
1 1.5 35 100
2 9.8 22 111
Mean
Centroid Quarter 1 Quarter 2 Quarter 3
1 3.13632 32.7694 105.2923
2 7.446328 24.78801 106.6665
Step 6: Recalculate the centroids and repeat the steps
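The worked example can be reproduced with a short Python sketch. The intermediate steps are not spelled out in the text, so the sketch assumes the usual procedure: assign each record to the nearest centroid by Euclidean distance, recompute each centroid as the mean of its members, and repeat until the centroids stop changing. `k_means` is a hypothetical helper, and the starting centroids are the two given in Step 2.

```python
from math import dist  # Euclidean distance, available from Python 3.8

# Spend in Quarter 1, Quarter 2 and Quarter 3 for the 10 records.
points = [
    (1.425172, 31.08748, 108.5436), (3.017551, 34.17728, 103.4577),
    (3.803405, 34.78973, 101.7977), (4.299151, 31.02313, 107.3701),
    (5.352034, 22.80945, 109.9353), (6.038361, 22.21948, 100.1809),
    (6.128493, 25.04893, 111.0543), (8.381028, 23.6761, 106.3302),
    (8.989409, 27.62143, 106.7186), (9.788646, 27.35268, 105.7799),
]
start = [(1.5, 35, 100), (9.8, 22, 111)]

def k_means(points, centroids, max_iter=100):
    """Return (labels, centroids) once the centroids stop changing."""
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
                  for p in points]
        # Recalculate each centroid as the mean of its members.
        updated = []
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids

labels, centroids = k_means(points, start)
print(labels)
print(centroids)
```

With these starting centroids the first four records form one cluster and the remaining six the other, and the centroids agree with the Mean table above.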
Exercise 1: The data on the % Erlang utilization of mobile towers of a telecom service
provider is given in Erlang_Utilization.xls. Group the towers into 5
clusters based on the utilization.
K Medoid Clustering
Methodology to group objects based on many attributes such that objects in a cluster will
be closer (or more similar) to the most centrally located object of the cluster.