Course Material DMBA

Indian Statistical Institute
Training Program on Data Mining & Business Analytics using Rapid Miner
Boby J
Contents
1. Introduction to Rapid Miner
2. Missing Value Analysis
3. Data Visualization
4. Market Basket Analysis
5. Correlation & Regression
6. Data partitioning & Classification
7. Cluster Analysis
DATA PREPROCESSING
1. Missing Value Handling
Example: Suppose a telecom company wants to introduce a scoring mechanism to rate its circles based on the following parameters:
1. Current Month's Usage
2. Last 3 Months' Usage
3. Average Recharge
4. Projected Growth
The data set is given below. Some values are missing. How should we proceed?
Example: Circle-wise Data ("-" marks a missing value)

SL No.  Current Month's  Last 3 Months'  Average   Projected  Circle
        Usage            Usage           Recharge  Growth
1       5.1              3.5             99.4      99.2       A
2       4.9              3               98.6      99.2       A
3       -                3.2             -         99.2       A
4       4.6              3.1             98.5      9..2       A
5       5                -               98.4      99.2       A
6       5.4              3.9             98.3      99.4       A
7       7                3.2             95.3      98.4.      B
8       6.4              3.2             95.5      98.5       B
9       6.9              3.1             95.1      98.5       B
10      -                2.3             96        98.3       B
11      6.5              2.8             95.4      98.5       B
12      5.7              -               95.5      98.3       B
13      6.3              3.3             -         98.6       B
14      6.7              3.3             94.3      97.5       C
15      6.7              3               94.8      97.3       C
16      6.3              2.5             95        98.9       C
17      -                3               94.8      98         C
18      6.2              3.4             94.6      97.3       C
19      5.9              3               94.9      98.8       C
Step 1: Calculate the % of missing values in each attribute

                Current Month's  Last 3 Months'  Average   Projected  Circle
                Usage            Usage           Recharge  Growth
Missing Values  3                2               2         0          0
Total Records   19               19              19        19         19
% Missing       15.79            10.53           10.53     0.00       0.00

If more than 20% of an attribute's values are missing, the data is not sufficient to develop the model with that attribute. Ignore the attribute and proceed.
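Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not part of the Rapid Miner workflow: the column names, the `percent_missing` helper and the use of `None` for a missing value are assumptions made for the example, and only the three attributes that have missing values are included.

```python
# Circle-wise data from the example; None marks a missing value.
# Column names are illustrative, not taken from the course files.
data = {
    "current_usage": [5.1, 4.9, None, 4.6, 5, 5.4, 7, 6.4, 6.9, None,
                      6.5, 5.7, 6.3, 6.7, 6.7, 6.3, None, 6.2, 5.9],
    "last3_usage": [3.5, 3, 3.2, 3.1, None, 3.9, 3.2, 3.2, 3.1, 2.3,
                    2.8, None, 3.3, 3.3, 3, 2.5, 3, 3.4, 3],
    "avg_recharge": [99.4, 98.6, None, 98.5, 98.4, 98.3, 95.3, 95.5, 95.1, 96,
                     95.4, 95.5, None, 94.3, 94.8, 95, 94.8, 94.6, 94.9],
}

def percent_missing(values):
    """Return the percentage of entries that are missing (None)."""
    missing = sum(1 for v in values if v is None)
    return round(100 * missing / len(values), 2)

pct = {name: percent_missing(col) for name, col in data.items()}

THRESHOLD = 20  # the 20% rule of thumb stated in the slide
usable = [name for name, p in pct.items() if p <= THRESHOLD]
print(pct)
print(usable)
```

All three attributes stay below the 20% threshold, so none of them is dropped.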
Step 3: Prepare Pivot table of attributes (number of records by circle)

Current Month's Usage   A   B   C   Grand Total
Missing                 1   1   1   3
Non Missing             5   6   5   16
Grand Total             6   7   6   19

Last 3 Months' Usage    A   B   C   Grand Total
Missing                 1   1   0   2
Grand Total             6   7   6   19

Average Recharge        A   B   C   Grand Total
Missing                 1   1   0   2

Projected Growth        A   B   C   Grand Total
Missing                 0   0   0   0
Grand Total             6   7   6   19

Conclusion: In none of the cases are 100% of the values missing.
Step 3: Prepare Pivot table of attributes (% of records in each circle)

Current Month's Usage   A       B       C
Missing                 16.67   14.29   16.67
Non Missing             83.33   85.71   83.33
Grand Total             100     100     100

Last 3 Months' Usage    A       B       C
Missing                 16.67   14.29   0.00
Non Missing             83.33   85.71   100.00
Grand Total             100     100     100

Average Recharge        A       B       C
Missing                 16.67   14.29   0.00
Non Missing             83.33   85.71   100.00
Grand Total             100     100     100

Projected Growth        A       B       C
Missing                 0       0       0
Non Missing             100     100     100

Conclusion: In none of the cases are 100% of the values missing.
Example: 3 Choices
Choice 1: Ignore records with missing values

SL No.  Current Month's  Last 3 Months'  Average   Projected  Circle
        Usage            Usage           Recharge  Growth
1       5.1              3.5             99.4      99.2       A
2       4.9              3               98.6      99.2       A
4       4.6              3.1             98.5      9..2       A
6       5.4              3.9             98.3      99.4       A
7       7                3.2             95.3      98.4.      B
8       6.4              3.2             95.5      98.5       B
9       6.9              3.1             95.1      98.5       B
11      6.5              2.8             95.4      98.5       B
14      6.7              3.3             94.3      97.5       C
15      6.7              3               94.8      97.3       C
16      6.3              2.5             95        98.9       C
18      6.2              3.4             94.6      97.3       C

Choice 2: Replace the missing values with the attribute mean, minimum, maximum or mode

Statistics of the non-missing values of each attribute in the circle-wise data:

        Current Month's  Last 3 Months'  Average   Projected
        Usage            Usage           Recharge  Growth
Mean    6.0              3.1             96.1      98.5
Min     4.6              2.3             94.3      97.3
Max     7.0              3.9             99.4      99.4

Choice 2: Replace the missing values with the attribute mean, minimum, maximum or mode
Data after replacing the missing values with the attribute mean:

SL No.  Current Month's  Last 3 Months'  Average   Projected  Circle
        Usage            Usage           Recharge  Growth
1       5.1              3.5             99.4      99.2       A
2       4.9              3               98.6      99.2       A
3       6                3.2             96.1      99.2       A
4       4.6              3.1             98.5      9..2       A
5       5                3.1             98.4      99.2       A
6       5.4              3.9             98.3      99.4       A
7       7                3.2             95.3      98.4.      B
8       6.4              3.2             95.5      98.5       B
9       6.9              3.1             95.1      98.5       B
10      6                2.3             96        98.3       B
11      6.5              2.8             95.4      98.5       B
12      5.7              3.1             95.5      98.3       B
13      6.3              3.3             96.1      98.6       B
14      6.7              3.3             94.3      97.5       C
15      6.7              3               94.8      97.3       C
16      6.3              2.5             95        98.9       C
17      6                3               94.8      98         C
18      6.2              3.4             94.6      97.3       C
19      5.9              3               94.9      98.8       C
Mean    6.0              3.1             96.1      98.5
Min     4.6              2.3             94.3      97.3
Max     7.0              3.9             99.4      99.4

Choice 3: Replace the missing values with the attribute mean of the corresponding circle

Attribute means by circle (non-missing values only):

Circle  Current Month's  Last 3 Months'  Average   Projected
        Usage            Usage           Recharge  Growth
A       5.00             3.34            98.64     99.24
B       6.47             2.98            95.47     98.45
C       6.36             3.03            94.73     97.97

Choice 3: Replace the missing values with the attribute mean of the corresponding circle
Data after replacing the missing values with the circle-wise mean:

SL No.  Current Month's  Last 3 Months'  Average   Projected  Circle
        Usage            Usage           Recharge  Growth
1       5.1              3.5             99.4      99.2       A
2       4.9              3               98.6      99.2       A
3       5                3.2             98.64     99.2       A
4       4.6              3.1             98.5      9..2       A
5       5                3.34            98.4      99.2       A
6       5.4              3.9             98.3      99.4       A
7       7                3.2             95.3      98.4.      B
8       6.4              3.2             95.5      98.5       B
9       6.9              3.1             95.1      98.5       B
10      6.47             2.3             96        98.3       B
11      6.5              2.8             95.4      98.5       B
12      5.7              2.98            95.5      98.3       B
13      6.3              3.3             95.47     98.6       B
14      6.7              3.3             94.3      97.5       C
15      6.7              3               94.8      97.3       C
16      6.3              2.5             95        98.9       C
17      6.36             3               94.8      98         C
18      6.2              3.4             94.6      97.3       C
19      5.9              3               94.9      98.8       C
Mean A  5.00             3.34            98.64     99.24
Mean B  6.47             2.98            95.47     98.45
Mean C  6.36             3.03            94.73     97.97
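Choices 2 and 3 can be sketched in Python for a single attribute (Current Month's Usage). This is an illustrative sketch only: the helper names `impute_overall_mean` and `impute_group_mean` are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Circle label and Current Month's Usage for the 19 records (None = missing).
circles = ["A"] * 6 + ["B"] * 7 + ["C"] * 6
usage = [5.1, 4.9, None, 4.6, 5, 5.4, 7, 6.4, 6.9, None,
         6.5, 5.7, 6.3, 6.7, 6.7, 6.3, None, 6.2, 5.9]

def impute_overall_mean(values):
    """Choice 2: fill gaps with the mean of all non-missing values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def impute_group_mean(values, groups):
    """Choice 3: fill gaps with the mean of the record's own group (circle)."""
    group_means = {}
    for g in set(groups):
        known = [v for v, gg in zip(values, groups) if gg == g and v is not None]
        group_means[g] = mean(known)
    return [group_means[g] if v is None else v for v, g in zip(values, groups)]

choice2 = impute_overall_mean(usage)
choice3 = impute_group_mean(usage, circles)
```

Record 3 (circle A) receives about 6.0 under Choice 2 but 5.0 under Choice 3, which shows how the circle-wise mean keeps the imputed value closer to the other records of the same circle.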
DATA PREPROCESSING: Missing Value Handling
Exercise: The data on 3 modes of transport of a supply chain management company are given below. Handle the missing values.

SL No  Delivery Speed  Vehicles  Extra Handling  Cost  Mode of Transport
1 27.75 3 2 Water
2 3 445 Water
3 28.2 3 1 460 Water
4 8.75 1 0 980 Direct Truck
5 9.25 0 950 Direct Truck
6 9.15 1 1 Direct Truck
7 15.2 3 2 820 LTL Truck
8 16.2 2 2.5 810 LTL Truck
9 3 1.5 835 LTL Truck
LTL Truck: Less-than-truckload
MARKET BASKET ANALYSIS
A modeling technique based on the logic that if a customer buys a certain group of items, he or she is more (or less) likely to buy another group of items.
Example:
Those who buy cigarettes are more likely to buy a matchbox as well.
Association Rule Mining:
Developing rules that predict the occurrence of an item based on the occurrence of other items in the transaction
Example
Id  Items
1   Milk, Bread
2   Bread, Biscuits, Toys, Eggs
3   Milk, Biscuits, Toys, Fruits
4   Bread, Milk, Toys, Biscuits
5   Milk, Bread, Biscuits, Fruits
{Milk, Bread} → {Biscuits} with probability = 2 / 3
Itemset:
A collection of one or more items
k-itemset:
An itemset consisting of k items
Support count:
Frequency of occurrence of an itemset
Example
Support count of {Milk, Bread, Biscuits} = 2
Support:
Proportion (fraction) of transactions that contain an itemset
Example
Support of {Milk, Bread, Biscuits} = 2 / 5
Frequent Itemset
An itemset whose support is greater than or equal to the minimum support
Confidence
Conditional probability that an item will appear in transactions that contain certain other items
Example
Confidence that Toys will appear in transactions containing Milk & Biscuits
= Support count of {Milk, Biscuits, Toys} / Support count of {Milk, Biscuits} = 2 / 3 = 0.67
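The three measures above can be checked with a short Python sketch on the five example transactions. The function names are illustrative choices, not part of any library.

```python
# The five example transactions, each stored as a set of items.
transactions = [
    {"Milk", "Bread"},
    {"Bread", "Biscuits", "Toys", "Eggs"},
    {"Milk", "Biscuits", "Toys", "Fruits"},
    {"Bread", "Milk", "Toys", "Biscuits"},
    {"Milk", "Bread", "Biscuits", "Fruits"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support_count(set(antecedent) | set(consequent), transactions)
    return both / support_count(antecedent, transactions)

print(support_count({"Milk", "Bread", "Biscuits"}, transactions))
print(support({"Milk", "Bread", "Biscuits"}, transactions))
print(confidence({"Milk", "Biscuits"}, {"Toys"}, transactions))
```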
Association Rule Mining

1. Frequent Itemset Generation
Fix minimum support value
Generate all itemsets whose support ≥ minimum support
2. Rule Generation
Fix minimum confidence value
Generate high confidence rules from each frequent itemset
Frequent Itemset Generation: Apriori Algorithm

a. Fix minimum support count
b. Generate all itemsets of length = 1
c. Calculate the support count of each itemset
d. Eliminate all itemsets with support count < minimum support count
e. Build itemsets of length = 2, 3, ... from the surviving itemsets and repeat steps c & d for each length

Example:
Minimum Support count = 2

Id Items
1 A,C,D
2 B,C,E
3 A,B,C,E
4 B,E
5 A,E
6 A,C,E
Step 1: Generate itemsets of length = 1 & calculate support count
Itemset  Support count
A        4
B        3
C        4
D        1
E        5

Step 2: Eliminate itemsets with support count < minimum support count (2)
Itemset  Support count
A        4
B        3
C        4
E        5

Step 3: Generate itemsets of length = 2
Itemset  Support count
A, B     1
A, C     3
A, E     3
B, C     2
B, E     3
C, E     3

Step 4: Eliminate itemsets with support count < minimum support count (2)
Itemset  Support count
A, C     3
A, E     3
B, C     2
B, E     3
C, E     3

Step 5: Generate itemsets of length = 3
Itemset  Support count
A, C, E  2
B, C, E  2

Step 6: Generate itemsets of length = 4
Itemset     Support count
A, B, C, E  1

Result:
Itemset  Support count  Support
A, C, E  2              0.33
B, C, E  2              0.33
A, C     3              0.50
A, E     3              0.50
B, C     2              0.33
B, E     3              0.50
C, E     3              0.50
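The walk-through above can be reproduced with a compact Python sketch of frequent itemset generation. It is an illustration under assumptions rather than a reference implementation: the function name `apriori` and its return format are hypothetical, and the sketch also discards any candidate that has an infrequent subset before counting it, so {A, B, C, E} is never counted here.

```python
from itertools import combinations

# The six example transactions (items are single letters).
transactions = [set("ACD"), set("BCE"), set("ABCE"), set("BE"), set("AE"), set("ACE")]

def apriori(transactions, min_count):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Count the support of each candidate and keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        # Join surviving k-itemsets into (k+1)-itemsets, then keep only those
        # whose every k-item subset survived.
        k += 1
        joined = {a | b for a in survivors for b in survivors if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in survivors for s in combinations(c, k - 1))]
    return frequent

frequent = apriori(transactions, min_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count, round(count / len(transactions), 2))
```

The returned dictionary also contains the frequent single items (A, B, C, E), which the result table above leaves out.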
Association Rule Mining: Apriori Algorithm

Example:
Minimum Support = 0.50
Minimum Confidence = 0.5

Itemset  Support count  Support
A, C, E  2              0.33
B, C, E  2              0.33
A, C     3              0.50
A, E     3              0.50
B, C     2              0.33
B, E     3              0.50
C, E     3              0.50

Rules generated from the itemsets that meet the minimum support:

Rule   Support  Confidence
A → C  0.50     0.75
A → E  0.50     0.75
B → E  0.50     1.00
C → E  0.50     0.75
C → A  0.50     0.75
E → A  0.50     0.60
E → B  0.50     0.60
E → C  0.50     0.60
Association Rule Mining: Other Measures
Lift
Lift (A → C) = Confidence (A → C) / Support (C)
Example
Rule   Confidence  Support     Lift
A → C  0.75        C = 0.67    1.12
A → E  0.75        E = 0.83    0.90
Criteria: Lift ≥ 1
Lift (A → C) = 1.12 > Lift (A → E) = 0.90 indicates that A has a greater impact on the frequency of C than it has on the frequency of E
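Confidence and lift for the example rules can be verified with a few lines of Python. The helper names below are illustrative, and the transactions are the six from the Apriori example.

```python
# The six transactions of the Apriori example.
transactions = [set("ACD"), set("BCE"), set("ABCE"), set("BE"), set("AE"), set("ACE")]

def support(itemset):
    """Fraction of transactions that contain all items of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs."""
    return support(set(lhs) | set(rhs)) / support(lhs)

def lift(lhs, rhs):
    """Lift of the rule lhs -> rhs: confidence divided by support of rhs."""
    return confidence(lhs, rhs) / support(rhs)

for lhs, rhs in [("A", "C"), ("A", "E")]:
    print(lhs, "->", rhs, confidence(lhs, rhs), lift(lhs, rhs))
```

A lift below 1, as for A → E, means E appears slightly less often in transactions containing A than it does overall.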
Exercise 1: The data on transactions from a mobile outlet is given below.
1. Generate frequent itemsets with a support of at least 25%.
2. Generate association rules with a confidence of at least 50%.
3. Estimate the chance that Mobile Slim, Landline and Broadband will be subscribed together.
4. Estimate the chance that the customers who buy Landline will also purchase Broadband & Ring tones.
Exercise 2:
The market basket Software data set contains the details of transactions at a software product company.
1. Identify the frequent product types with a minimum support of 25%.
2. Identify the association rules between products with a minimum confidence of 50%.
3. What is the chance that Operating System and Office Suite will be purchased together?
4. What is the chance that Operating System and Visual Studio will be purchased together?
5. Estimate the chance that the customers who buy Operating System will also purchase Office Suite.
6. Estimate the chance that the customers who buy Operating System will also purchase Visual Studio.
LINEAR REGRESSION
CORRELATION & REGRESSION
Correlation:
Correlation analysis is a technique to identify the relationship between two variables. It describes the type and degree of that relationship.
Correlation: Usage
Explore the relationship between the output characteristic and an input or process variable.
Output variable: Y: Dependent variable
Input / Process variable: X: Independent variable
Positive Correlation: Y increases as X increases & vice versa
[Scatter plot of Y against X]
Negative Correlation: Y decreases as X increases & vice versa
[Scatter plot of Y against X]
No Correlation: Random distribution of points
[Scatter plot of Y against X]
Is there any correlation?
[Scatter plot of Y against X]
Measure of Correlation: Coefficient of Correlation
Symbol: r
Range: -1 to 1
Sign: Type of correlation
Magnitude: Degree of correlation
Examples:
r = 0.6: moderate positive correlation
r = -0.82: strong negative correlation
r = 0: no linear correlation
Coefficient of Correlation: Positive Correlation
Collect data on x and y: when x is low, y is also low & vice versa

Calculate the mean of the x & y values
SL No.  x  y
1       2  5
2       3  7
3       1  3
4       5  11
5       6  12
6       7  15
Mean    4  8.83

Take x – Mean x and y – Mean y. Low values become negative & high values become positive. Generally, when the x deviations are negative, the y deviations are also negative & vice versa, so the product of the x & y deviations is positive.

SL No.  x – Mean x  y – Mean y  Product
1       -2          -3.83       7.66
2       -1          -1.83       1.83
3       -3          -5.83       17.49
4       1           2.17        2.17
5       2           3.17        6.34
6       3           6.17        18.51
Sum = Sxy                       54

The sum of the products of the x & y deviations (Sxy) is therefore positive.
Coefficient of Correlation: Negative Correlation
Collect data on x and y: when x is low, y is high & vice versa

Calculate the mean of the x & y values
SL No.  x  y
1       2  12
2       3  11
3       1  15
4       5  7
5       6  5
6       7  3
Mean    4  8.83

Take x – Mean x and y – Mean y. Low values become negative & high values become positive. Generally, when the x deviations are negative, the y deviations are positive & vice versa, so the product of the x & y deviations is negative.

SL No.  x – Mean x  y – Mean y  Product
1       -2          3.17        -6.34
2       -1          2.17        -2.17
3       -3          6.17        -18.51
4       1           -1.83       -1.83
5       2           -3.83       -7.66
6       3           -5.83       -17.49
Sum = Sxy                       -54

The sum of the products of the x & y deviations (Sxy) is therefore negative.
Coefficient of Correlation:
In Short
If correlation is positive
Sxy will be positive
If correlation is negative
Sxy will be negative
To avoid scale issues, Sxy is divided by √(Sxx · Syy), where $\bar{x}$ = Mean x and $\bar{y}$ = Mean y:

$S_{xy} = \sum (x - \bar{x})(y - \bar{y})$
$S_{xx} = \sum (x - \bar{x})^2$
$S_{yy} = \sum (y - \bar{y})^2$
Correlation Coefficient: $r = S_{xy} / \sqrt{S_{xx} \cdot S_{yy}}$

For the negative correlation example:

SL No.  x – Mean x  y – Mean y  Product  (x – Mean x)²  (y – Mean y)²
1       -2          3.17        -6.34    4              10.0489
2       -1          2.17        -2.17    1              4.7089
3       -3          6.17        -18.51   9              38.0689
4       1           -1.83       -1.83    1              3.3489
5       2           -3.83       -7.66    4              14.6689
6       3           -5.83       -17.49   9              33.9889
Sum                             Sxy: -54  Sxx: 28       Syy: 104.83

r = Sxy / √(Sxx · Syy) = -54 / √(28 × 104.83) = -0.9967
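The calculation of r can be checked with a short Python sketch that follows the Sxy, Sxx and Syy definitions directly. The function name `correlation` is an illustrative choice.

```python
from math import sqrt

def correlation(x, y):
    """Correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [2, 3, 1, 5, 6, 7]
y_pos = [5, 7, 3, 11, 12, 15]   # positive correlation example
y_neg = [12, 11, 15, 7, 5, 3]   # negative correlation example

print(round(correlation(x, y_pos), 4))
print(round(correlation(x, y_neg), 4))
```

Both examples have the same magnitude of r because the two y columns contain the same values and give Sxy of 54 and -54 respectively.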

Regression
Correlation helps
• To check whether two variables are related
• If related, to identify the type & degree of the relationship
Regression
Regression helps
• To identify the exact form of the relationship
• To model the output in terms of input or process variables
Examples:
Yield = 5 + 3 × Time - 2 × Temperature
Y = 2 - 5x
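As an illustration of identifying the form of a relationship, the sketch below fits a straight line Y = a + bX to the positive correlation data used earlier. It assumes ordinary least squares as the fitting method (the slides do not name one), for which b = Sxy / Sxx and a = Mean y - b × Mean x. The function name `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b

x = [2, 3, 1, 5, 6, 7]
y = [5, 7, 3, 11, 12, 15]
a, b = fit_line(x, y)
print(f"Y = {a:.3f} + {b:.3f} X")
```

For this data the slope is 54 / 28, so each unit increase in x raises the fitted y by a little under 2.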
Multiple Regression
To model the output variable Y in terms of two or more variables.

General Form:
$Y = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$
Two variable case:
$Y = a + b_1 X_1 + b_2 X_2$
Exercise 1: The data on Vendor performance score and the number of On Time, Complete, Undamaged & Correctly billed shipments from the vendors of a supply chain management company are given below. Can you develop a model for Vendor performance score in terms of the other variables?

Vendor Id  Ontime    Complete  Undamaged  Correctly  Performance
           Shipment  Shipment  Shipments  billed     Score
1          950       990       980        550        2985
2          1450      1425      1475       975        4576
3          1700      1575      1730       1320       5435
4          1800      1515      1890       1615       5955
5          1675      1420      1756       1456       5400
6          1756      1645      1835       1489       5590
7          1236      1462      1335       1435       4675
8          1100      1523      1565       1625       4960
9          1325      1725      1570       1520       5325
10         1450      1620      1463       1430       5170
11         1570      1458      1356       1630       5190
LINEAR REGRESSION
Exercise 2: A construction company wants to develop a model for concrete compressive strength. The attributes of interest are given in the table below. The training data is given in the file Concrete_Data.xls.
1. Can you develop the model?
2. How closely will it predict the values?

SL No  Attribute
1      Cement (component 1) (kg in a m^3 mixture)
2      Blast Furnace Slag (component 2) (kg in a m^3 mixture)
3      Fly Ash (component 3) (kg in a m^3 mixture)
4      Water (component 4) (kg in a m^3 mixture)
5      Superplasticizer (component 5) (kg in a m^3 mixture)
6      Coarse Aggregate (component 6) (kg in a m^3 mixture)
7      Fine Aggregate (component 7) (kg in a m^3 mixture)
8      Age (day)
9      Concrete compressive strength (MPa, megapascals)
CLASSIFICATION METHODS
INTRODUCTION
Objective
To develop a mathematical model for an attribute or response metric (Y) in terms of other available attributes (Xs).
When to Use
Xs: Continuous or discrete
Y: Discrete
CLASSIFICATION METHODS
Classifies data (develops a model) based on the training data
Each sample is assumed to belong to a predefined class
The sample data set used for building the model is the training set
Usage:
For classifying future or unknown data
Example:
Attribute 1: x1
Attribute 2: x2
Label y: y1 (Red), y2 (Blue)

x1     x2    Y      x1     x2    Y
11.35  23    Blue   11.85  39.9  Red
11.59  22.3  Blue   12.09  39.5  Red
12.19  24.5  Blue   12.69  37.8  Red
13.23  26.4  Blue   13.73  38.2  Red
13.51  30.2  Blue   14.01  37.8  Red
13.68  32    Blue   14.18  36.5  Red
14.78  33.1  Blue   15.28  36    Red
15.11  33    Blue   15.61  37.1  Red
15.55  25.2  Blue   16.05  33.1  Red
16.37  24.1  Blue   16.87  32.4  Red
16.99  22    Blue   17.49  31    Red
18.23  23.5  Blue   18.73  32    Red
18.83  24.1  Blue   19.33  31.8  Red
19.06  25    Blue   19.56  30.9  Red
Example:
[Scatter plot of x2 (20 to 40) against x1 (10 to 20) for the two classes. The decision tree is built up on the plot in steps: the first split is on x2 (x2 > 35: y1; x2 < 28: y2); the remaining points are split on x1 (x1 < 15.5: y2; x1 > 15.5: y1).]
Example: Rules
If x2 > 35, then y = y1
If x2 < 28, then y = y2
If 28 ≤ x2 ≤ 35 & x1 > 15.5, then y = y1
If 28 ≤ x2 ≤ 35 & x1 < 15.5, then y = y2
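The rules translate directly into a small Python function. This is a sketch for illustration: the function name `classify` is hypothetical, and sending the boundary case x1 = 15.5 to Blue is an assumption, since the rules do not cover it.

```python
def classify(x1, x2):
    """Apply the example decision rules; returns "Red" (y1) or "Blue" (y2)."""
    if x2 > 35:
        return "Red"
    if x2 < 28:
        return "Blue"
    # 28 <= x2 <= 35: decide on x1 (x1 == 15.5 is assigned to Blue by assumption)
    return "Red" if x1 > 15.5 else "Blue"

# A few records from the training table: (x1, x2, actual label)
samples = [
    (11.35, 23.0, "Blue"),
    (14.78, 33.1, "Blue"),
    (15.28, 36.0, "Red"),
    (16.05, 33.1, "Red"),
    (19.06, 25.0, "Blue"),
    (19.56, 30.9, "Red"),
]
predictions = [classify(x1, x2) for x1, x2, _ in samples]
print(predictions)
```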
Example: Table 1 gives the profile (Refund, Marital Status & Taxable Income) of customers who have taken a loan from a bank. The table also shows which of them actually cheated the bank.
1. Can you develop a decision rule to classify whether a customer will cheat or not, based on the values of the 3 attributes (Refund, Marital Status & Taxable Income)?
2. Validate the model using the test data given in Table 2.

Table 1: Training Data Set
SL No  Refund  Marital Status  Taxable Income  Cheat
1      Yes     Single          > 80 K          No
6      No      Married         < 80 K          No
7      Yes     Divorced        > 80 K          No
8      No      Single          > 80 K          Yes

Class variable: Cheat
Number of predefined classes: 2 (Cheat = No & Cheat = Yes)

Table 2: Test Data
SL No  Refund  Marital Status  Taxable Income  Cheat
1      Yes     Married         > 80 K          No
2      No      Single          > 80 K          No
3      No      Single          < 80 K          No
4      No      Married         > 80 K          No
5      No      Divorced        > 80 K          Yes
Example: Result
If Marital Status = Married then Cheat: No
If Marital Status = Single & Refund = Yes then Cheat: No
If Marital Status = Single, Refund = No & Taxable Income < 80K then Cheat: No
If Marital Status = Single, Refund = No & Taxable Income > 80K then Cheat: Yes
If Marital Status = Divorced & Refund = Yes then Cheat: No
If Marital Status = Divorced & Refund = No then Cheat: Yes
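The six rules can be written as one Python function and applied to the five test records. The function name `predict_cheat` and the boolean encoding of Taxable Income (`True` for > 80 K) are assumptions made for this sketch.

```python
def predict_cheat(refund, marital_status, income_above_80k):
    """Apply the decision rules; returns "Yes" or "No" for Cheat."""
    if marital_status == "Married":
        return "No"
    if refund == "Yes":
        return "No"                      # Single or Divorced with a refund
    if marital_status == "Single":
        return "Yes" if income_above_80k else "No"
    return "Yes"                         # Divorced with Refund = No

# Test data: (Refund, Marital Status, Taxable Income > 80 K, actual Cheat)
test_data = [
    ("Yes", "Married", True, "No"),
    ("No", "Single", True, "No"),
    ("No", "Single", False, "No"),
    ("No", "Married", True, "No"),
    ("No", "Divorced", True, "Yes"),
]
predicted = [predict_cheat(r, m, inc) for r, m, inc, _ in test_data]
print(predicted)
```

Only the second test record is misclassified: the rules predict Yes while the actual value is No.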
Example: Decision Tree
[Decision tree diagram for the training data set]
Example: Test Data Set

SL No  Refund  Marital Status  Taxable Income  Cheat  Predicted Cheat
1      Yes     Married         > 80 K          No     No
2      No      Single          > 80 K          No     Yes
3      No      Single          < 80 K          No     No
4      No      Married         > 80 K          No     No
5      No      Divorced        > 80 K          Yes    Yes
Performance Evaluation Measures

1. Confusion Matrix
                   Actual Class
Predicted Class    Class = Yes  Class = No
Class = Yes        a            b
Class = No         c            d
2. Accuracy: (a + d) / (a + b + c + d)
3. Precision: a / (a + b)
4. Recall: a / (a + c)
5. F Measure = 2 x Precision x Recall / (Precision + Recall)
Example: Performance Evaluation Measures
SL No  Cheat  Predicted Cheat
1      No     No
2      No     Yes
3      No     No
4      No     No
5      Yes    Yes
1. Confusion Matrix (Cheat = No is taken as the class of interest)
                   Actual Class
Predicted Class    Cheat = No  Cheat = Yes
Cheat = No         3           0
Cheat = Yes        1           1
2. Accuracy: (3 + 1) / (3 + 1 + 0 + 1) = 4 / 5 = 0.8
3. Precision: 3 / (3 + 0) = 3 / 3 = 1.0
4. Recall: 3 / (3 + 1) = 3 / 4 = 0.75
5. F Measure = 2 x Precision x Recall / (Precision + Recall)
= 2 x 1.0 x 0.75 / (1.00 + 0.75) = 0.86
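The same measures can be computed from the actual and predicted labels with a short Python sketch. The function name `evaluate` and its dictionary output are illustrative, and the sketch assumes that at least one record is predicted as, and at least one actually belongs to, the class of interest (otherwise precision or recall would divide by zero).

```python
def evaluate(actual, predicted, positive):
    """Confusion-matrix counts and measures for the class of interest."""
    pairs = list(zip(actual, predicted))
    a = sum(1 for act, pred in pairs if pred == positive and act == positive)
    b = sum(1 for act, pred in pairs if pred == positive and act != positive)
    c = sum(1 for act, pred in pairs if pred != positive and act == positive)
    d = sum(1 for act, pred in pairs if pred != positive and act != positive)
    precision = a / (a + b)
    recall = a / (a + c)
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }

actual = ["No", "No", "No", "No", "Yes"]
predicted = ["No", "Yes", "No", "No", "Yes"]
measures = evaluate(actual, predicted, positive="No")
print(measures)
```

Passing `positive="Yes"` instead would evaluate the same predictions with Cheat = Yes as the class of interest, which gives different precision and recall values.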
Challenges
How to represent the entire information in the dataset using the minimum number of rules?
How to develop the smallest tree?
Solution
Select the attribute with the maximum information for the first split
Split   Attribute
First   Marital Status
Second  Refund
Third   Taxable Income
CHAID Algorithm
Example: A marketing company wants to optimize its mailing campaign by sending the brochure mail only to those customers who responded to previous mail campaigns. The profiles of the customers are given below. Can you develop a rule to identify the profile of customers who are likely to respond?

SL No  District  House Type     Income  Previous_Customer  Outcome
1      Suburban  Detached       High    No                 No Response
2      Suburban  Detached       High    Yes                No Response
3      Rural     Detached       High    No                 Responded
4      Urban     Semi-detached  High    No                 Responded
5      Urban     Semi-detached  Low     No                 Responded
6      Urban     Semi-detached  Low     Yes                No Response
7      Rural     Semi-detached  Low     Yes                Responded
8      Suburban  Terrace        High    No                 No Response
9      Suburban  Semi-detached  Low     No                 Responded
10     Urban     Terrace        Low     No                 Responded
11     Suburban  Terrace        Low     Yes                Responded
12     Rural     Terrace        High    Yes                Responded
13     Rural     Detached       Low     No                 Responded
14     Urban     Terrace        High    Yes                No Response

Number of variables = 4
SL No  Variable Name      Number of values
1      District           3
2      House Type         3
3      Income             2
4      Previous Customer  2
Total combinations of customer profiles = 3 x 3 x 2 x 2 = 36
Exercise 1: A bank wants to know the profile of customers who will buy a Personal Equity Plan (PEP) after the mailing campaign. The data is given in the file named bank-data.xls.
1. Can you develop a decision methodology?
2. How good is your model?
The file contains the following fields.
Id            a unique identification number
Age           age of customer in years (numeric)
Sex           MALE / FEMALE
Region        inner_city / rural / suburban / town
Income        income of customer (numeric)
Married       is the customer married (YES/NO)
Children      number of children (numeric)
Car           does the customer own a car (YES/NO)
Save_acct     does the customer have a saving account (YES/NO)
Current_acct  does the customer have a current account (YES/NO)
Mortgage      does the customer have a mortgage (YES/NO)
Pep           did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)
Exercise 2: The profile of the customers of a telecom service provider in grace period is given in the churn.xls file.
1. Can you develop a model to identify potential churners (disconnections) so that the organization can win back the customers by providing different offers?
2. How good is the decision rule?
The file contains the following fields.
1) Service class
2) Class change in last week
3) Class change in last 15 days
4) Class change in last month
5) Class change in last two months
6) Usage amount in last week
7) Usage amount in last 15 days
8) Usage amount in last month
9) Usage amount in last two months
10) Recharge amount in last week
11) Recharge amount in last 15 days
12) Recharge amount in last month
13) Recharge amount in last two months
14) Recharge count in last week
15) Recharge count in last 15 days
16) Recharge count in last month
17) Recharge count in last two months
18) Closing balance in last week
19) Closing balance in last 15 days
20) Closing balance in last month
21) Closing balance in last two months
CLUSTER ANALYSIS
Objective
To classify the records or items into a smaller number of groups based on the values of the available attributes.
When to Use
When there is no Y attribute
All attributes are considered as Xs only
CLUSTER ANALYSIS
Methodology to group objects based on many attributes such that the objects in a group
• will be similar (or related) to one another
• will be different from (or unrelated to) the objects in other groups
CLUSTER ANALYSIS
Types of Clustering
• K Mean Clustering
• K Medoid Clustering
K Mean Clustering
Methodology to group objects based on many attributes such that objects in a cluster will be closer (or more similar) to the centroid of their own cluster than to the centroid of any other cluster.
1. Each cluster is associated with a centroid
2. Each point is assigned to the cluster with the closest centroid
3. The number of clusters, K, must be specified
4. Initially, centroids are often chosen randomly
5. The centroid is (typically) the mean of the points in the cluster
6. Closeness is measured by Euclidean distance
CLUSTER ANALYSIS
K Mean Clustering: Euclidean Distance
$D(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_k - y_k)^2}$
Example:
SL No  Attribute 1  Attribute 2  Attribute 3  Attribute 4  Attribute 5
1      25.8         15.7         56.0         0.57         5.3
2      23.6         18.1         48.9         0.62         7.2

Euclidean Distance
            Attribute 1  Attribute 2  Attribute 3  Attribute 4  Attribute 5
Difference  2.2          -2.4         7.1          -0.05        -1.9
Square      4.84         5.76         50.41        0.0025       3.61
Sum         64.6225
Sq Root     8.038812101
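The distance calculation above corresponds to the following Python sketch. The function name `euclidean` is an illustrative choice.

```python
from math import sqrt

def euclidean(p, q):
    """Euclidean distance between two records with the same attributes."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

record_1 = [25.8, 15.7, 56.0, 0.57, 5.3]
record_2 = [23.6, 18.1, 48.9, 0.62, 7.2]
print(euclidean(record_1, record_2))
```

Attribute 3 contributes most of the distance here because its values are on a larger scale than the other attributes.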
CLUSTER ANALYSIS
K Mean Clustering: Algorithm
1. Get the number of clusters (k) required from the user
2. Randomly select k centroids
3. Calculate the Euclidean distance of each data record to each & every centroid
4. For each record, identify the cluster with the minimum Euclidean distance
5. Allocate the record to the cluster with the minimum distance
6. Recalculate the centroids
7. Repeat steps 3 to 6 until there is no change in the cluster elements
CLUSTER ANALYSIS
K Mean Clustering: Example
Cluster the following data with 3 attributes (spend in 3 quarters) into 2 clusters
SL No. Quarter 1 Quarter 2 Quarter 3
1 1.425172 31.08748 108.5436
2 3.017551 34.17728 103.4577
3 3.803405 34.78973 101.7977
4 4.299151 31.02313 107.3701
5 5.352034 22.80945 109.9353
6 6.038361 22.21948 100.1809
7 6.128493 25.04893 111.0543
8 8.381028 23.6761 106.3302
9 8.989409 27.62143 106.7186
10 9.788646 27.35268 105.7799
Step 1: k = 2
Step 2: Randomly identify 2 centroids
Centroid  Quarter 1  Quarter 2  Quarter 3
1         1.5        35         100
2         9.8        22         111
Step 3: Calculate the Euclidean distance of each point from centroid 1
        Difference from Centroid 1
SL No.  Quarter 1  Quarter 2  Quarter 3  Sum of Squares  Euclidean Distance
1 -0.07483 -3.91252 8.543587 88.30632 9.397144437
2 1.517551 -0.82272 3.4577 14.93552 3.864650273
3 2.303405 -0.21027 1.797705 8.58163 2.929441858
4 2.799151 -3.97687 7.370058 77.96853 8.829979305
5 3.852034 -12.1906 9.935263 262.1572 16.19127037
6 4.538361 -12.7805 0.180881 183.9713 13.56360037
7 4.628493 -9.95107 11.05433 242.6451 15.57706944
8 6.881028 -11.3239 6.330205 215.6508 14.68505506
9 7.489409 -7.37857 6.71863 155.6745 12.47695732
10 8.288646 -7.64732 5.779929 160.5907 12.67243867
Step 4: Calculate the Euclidean distance of each point from centroid 2
        Difference from Centroid 2
SL No.  Quarter 1  Quarter 2  Quarter 3  Sum of Squares  Euclidean Distance
1 -8.37483 9.087476 -2.45641 158.7539 12.59975882
2 -6.78245 12.17728 -7.5423 251.174 15.84846897
3 -5.99659 12.78973 -9.2023 284.2187 16.85878574
4 -5.50085 9.023126 -3.62994 124.8526 11.17374688
5 -4.44797 0.809445 -1.06474 21.57327 4.644703413
6 -3.76164 0.219475 -10.8191 131.2514 11.45650187
7 -3.67151 3.048927 0.054333 22.77887 4.772721122
8 -1.41897 1.676096 -4.6698 26.62977 5.160403756
9 -0.81059 5.621434 -4.28137 50.58771 7.112503764
10 -0.01135 5.352683 -5.22007 55.90048 7.476662339
Step 5: Allocate each record to the cluster with the minimum distance
SL No.  Quarter 1  Quarter 2  Quarter 3  Distance to Centroid 1  Distance to Centroid 2  Allocation
1 1.425172 31.08748 108.5436 9.397144 12.59975882 1
2 3.017551 34.17728 103.4577 3.86465 15.84846897 1
3 3.803405 34.78973 101.7977 2.929442 16.85878574 1
4 4.299151 31.02313 107.3701 8.829979 11.17374688 1
5 5.352034 22.80945 109.9353 16.19127 4.644703413 2
6 6.038361 22.21948 100.1809 13.5636 11.45650187 2
7 6.128493 25.04893 111.0543 15.57707 4.772721122 2
8 8.381028 23.6761 106.3302 14.68506 5.160403756 2
9 8.989409 27.62143 106.7186 12.47696 7.112503764 2
10 9.788646 27.35268 105.7799 12.67244 7.476662339 2
Step 6: Recalculate the centroids as the mean of the records allocated to each cluster, and repeat the steps
Centroid  Quarter 1  Quarter 2  Quarter 3
1         3.13632    32.7694    105.2923
2         7.446328   24.78801   106.6665
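The whole procedure can be sketched in Python using the example data and the same two starting centroids. This is a minimal illustration, not the Rapid Miner operator: the function name `kmeans` is hypothetical, clusters are numbered 0 and 1 (corresponding to clusters 1 and 2 above), and `math.dist` requires Python 3.8 or later.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

# Spend in 3 quarters for the 10 example records.
points = [
    (1.425172, 31.08748, 108.5436), (3.017551, 34.17728, 103.4577),
    (3.803405, 34.78973, 101.7977), (4.299151, 31.02313, 107.3701),
    (5.352034, 22.80945, 109.9353), (6.038361, 22.21948, 100.1809),
    (6.128493, 25.04893, 111.0543), (8.381028, 23.6761, 106.3302),
    (8.989409, 27.62143, 106.7186), (9.788646, 27.35268, 105.7799),
]
initial_centroids = [(1.5, 35, 100), (9.8, 22, 111)]

def kmeans(points, centroids, max_iter=100):
    """Alternate allocation and centroid update until allocations stop changing."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        # Allocate each record to the closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recalculate each centroid as the mean of its allocated records.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids

clusters, final_centroids = kmeans(points, initial_centroids)
print(clusters)
print(final_centroids)
```

For this data the allocation does not change in the second pass, so the centroids recalculated in Step 6 are also the final ones.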
CLUSTER ANALYSIS
Exercise 1: The data on the % Erlang utilization of mobile towers of a telecom service provider is given in Erlang_Utilization.xls. Group the towers into 5 clusters based on the utilization.
CLUSTER ANALYSIS
K Medoid Clustering
Methodology to group objects based on many attributes such that objects in a cluster will be closer (or more similar) to the most centrally located object of the cluster.
1. The number of clusters, K, must be specified
2. Closeness is measured by Euclidean distance
Exercise: Perform the exercises 1 to 3 using the k medoid clustering method
