
Indian Statistical Institute

Training Program
on
Data Mining & Business Analytics
using

Rapid Miner

Boby J

Contents

1. Introduction to Rapid Miner


2. Missing Value Analysis
3. Data Visualization
4. Market Basket Analysis
5. Correlation & Regression
6. Data partitioning & Classification
7. Cluster Analysis


DATA
PREPROCESSING


1. Missing Value Handling

Missing Value Handling

Example: Suppose a telecom company wants to introduce a scoring mechanism to rate
its circles based on the following parameters:
1. Current Month's Usage
2. Last 3 Months' Usage
3. Average Recharge
4. Projected Growth
The data set is given on the next slide. Some values are missing. How should we
proceed?

Missing Value Handling

Example: Circle wise Data (- = missing value)

        Current   Last 3
        Month's   Months'   Average   Projected
SL No.  Usage     Usage     Recharge  Growth     Circle
1       5.1       3.5       99.4      99.2       A
2       4.9       3         98.6      99.2       A
3       -         3.2       -         99.2       A
4       4.6       3.1       98.5      99.2       A
5       5         -         98.4      99.2       A
6       5.4       3.9       98.3      99.4       A
7       7         3.2       95.3      98.4       B
8       6.4       3.2       95.5      98.5       B
9       6.9       3.1       95.1      98.5       B
10      -         2.3       96        98.3       B
11      6.5       2.8       95.4      98.5       B
12      5.7       -         95.5      98.3       B
13      6.3       3.3       -         98.6       B
14      6.7       3.3       94.3      97.5       C
15      6.7       3         94.8      97.3       C
16      6.3       2.5       95        98.9       C
17      -         3         94.8      98         C
18      6.2       3.4       94.6      97.3       C
19      5.9       3         94.9      98.8       C
Missing Value Handling

Step 1: Calculate the % of missing values in each attribute

                Current   Last 3
                Month's   Months'   Average   Projected
                Usage     Usage     Recharge  Growth     Circle
Missing Values  3         2         2         0          0
Total Records   19        19        19        19         19
% Missing       15.79     10.53     10.53     0.00       0.00

If more than 20% of an attribute's values are missing, the data is not sufficient
to model that attribute: ignore the attribute and proceed.
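
Step 1 can be sketched in a few lines of Python. This is only an illustration, not the Rapid Miner process used in the course: the dictionary layout, the name percent_missing and the use of None for a missing cell are assumptions made for this sketch, and the garbled Projected Growth entry of record 4 is read as 99.2.

```python
# Circle-wise example data, one list per attribute; None marks a missing value.
# Assumption: the garbled "9..2" in record 4 is read as 99.2.
DATA = {
    "Current Month's Usage": [5.1, 4.9, None, 4.6, 5, 5.4, 7, 6.4, 6.9, None,
                              6.5, 5.7, 6.3, 6.7, 6.7, 6.3, None, 6.2, 5.9],
    "Last 3 Months' Usage": [3.5, 3, 3.2, 3.1, None, 3.9, 3.2, 3.2, 3.1, 2.3,
                             2.8, None, 3.3, 3.3, 3, 2.5, 3, 3.4, 3],
    "Average Recharge": [99.4, 98.6, None, 98.5, 98.4, 98.3, 95.3, 95.5, 95.1, 96,
                         95.4, 95.5, None, 94.3, 94.8, 95, 94.8, 94.6, 94.9],
    "Projected Growth": [99.2, 99.2, 99.2, 99.2, 99.2, 99.4, 98.4, 98.5, 98.5, 98.3,
                         98.5, 98.3, 98.6, 97.5, 97.3, 98.9, 98, 97.3, 98.8],
}

THRESHOLD = 20.0  # cut-off (% missing) quoted on the slide


def percent_missing(values):
    """Percentage of entries that are missing (None), rounded to 2 decimals."""
    missing = sum(1 for v in values if v is None)
    return round(100.0 * missing / len(values), 2)


pct = {name: percent_missing(values) for name, values in DATA.items()}
# Attributes that pass the 20% rule and can be kept for modelling.
usable = [name for name, p in pct.items() if p <= THRESHOLD]

for name, p in pct.items():
    print(f"{name}: {p}% missing")
```

All four attributes stay below the 20% cut-off here, so none of them is dropped.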

Missing Value Handling

Step 3: Prepare Pivot table of attributes

Current Month's Usage   A      B      C      Grand Total
Missing                 1      1      1      3
Non Missing             5      6      5      16
Grand Total             6      7      6      19

Last 3 Months' Usage    A      B      C      Grand Total
Missing                 1      1      0      2
Non Missing             5      6      6      17
Grand Total             6      7      6      19

Average Recharge        A      B      C      Grand Total
Missing                 1      1      0      2
Non Missing             5      6      6      17
Grand Total             6      7      6      19

Projected Growth        A      B      C      Grand Total
Missing                 0      0      0      0
Non Missing             6      7      6      19
Grand Total             6      7      6      19

Conclusion: In no circle are 100% of the values of an attribute missing.
Missing Value Handling

Step 3: Prepare Pivot table of attributes (% of records in each circle)

Current Month's Usage   A       B       C
Missing                 16.67   14.29   16.67
Non Missing             83.33   85.71   83.33
Grand Total             100     100     100

Last 3 Months' Usage    A       B       C
Missing                 16.67   14.29   0.00
Non Missing             83.33   85.71   100.00
Grand Total             100     100     100

Average Recharge        A       B       C
Missing                 16.67   14.29   0.00
Non Missing             83.33   85.71   100.00
Grand Total             100     100     100

Projected Growth        A       B       C
Missing                 0       0       0
Non Missing             100     100     100
Grand Total             100     100     100

Conclusion: In no circle are 100% of the values of an attribute missing.
Missing Value Handling

Example: 3 Choices
Choice 1: Ignore the records that contain missing values

        Current   Last 3
        Month's   Months'   Average   Projected
SL No.  Usage     Usage     Recharge  Growth     Circle
1       5.1       3.5       99.4      99.2       A
2       4.9       3         98.6      99.2       A
4       4.6       3.1       98.5      99.2       A
6       5.4       3.9       98.3      99.4       A
7       7         3.2       95.3      98.4       B
8       6.4       3.2       95.5      98.5       B
9       6.9       3.1       95.1      98.5       B
11      6.5       2.8       95.4      98.5       B
14      6.7       3.3       94.3      97.5       C
15      6.7       3         94.8      97.3       C
16      6.3       2.5       95        98.9       C
18      6.2       3.4       94.6      97.3       C
19      5.9       3         94.9      98.8       C

Missing Value Handling

Example: Circle wise Data

Choice 2. Replace the missing values with attribute mean, minimum,
maximum or mode

Before replacement (- = missing value):

        Current   Last 3
        Month's   Months'   Average   Projected
SL No.  Usage     Usage     Recharge  Growth     Circle
1       5.1       3.5       99.4      99.2       A
2       4.9       3         98.6      99.2       A
3       -         3.2       -         99.2       A
4       4.6       3.1       98.5      99.2       A
5       5         -         98.4      99.2       A
6       5.4       3.9       98.3      99.4       A
7       7         3.2       95.3      98.4       B
8       6.4       3.2       95.5      98.5       B
9       6.9       3.1       95.1      98.5       B
10      -         2.3       96        98.3       B
11      6.5       2.8       95.4      98.5       B
12      5.7       -         95.5      98.3       B
13      6.3       3.3       -         98.6       B
14      6.7       3.3       94.3      97.5       C
15      6.7       3         94.8      97.3       C
16      6.3       2.5       95        98.9       C
17      -         3         94.8      98         C
18      6.2       3.4       94.6      97.3       C
19      5.9       3         94.9      98.8       C
Mean    6.0       3.1       96.1      98.5
Min     4.6       2.3       94.3      97.3
Max     7.0       3.9       99.4      99.4
Missing Value Handling

Example: Circle wise Data

Choice 2. Replace the missing values with attribute mean, minimum,
maximum or mode

After replacement with the attribute mean:

        Current   Last 3
        Month's   Months'   Average   Projected
SL No.  Usage     Usage     Recharge  Growth     Circle
1       5.1       3.5       99.4      99.2       A
2       4.9       3         98.6      99.2       A
3       6         3.2       96.1      99.2       A
4       4.6       3.1       98.5      99.2       A
5       5         3.1       98.4      99.2       A
6       5.4       3.9       98.3      99.4       A
7       7         3.2       95.3      98.4       B
8       6.4       3.2       95.5      98.5       B
9       6.9       3.1       95.1      98.5       B
10      6         2.3       96        98.3       B
11      6.5       2.8       95.4      98.5       B
12      5.7       3.1       95.5      98.3       B
13      6.3       3.3       96.1      98.6       B
14      6.7       3.3       94.3      97.5       C
15      6.7       3         94.8      97.3       C
16      6.3       2.5       95        98.9       C
17      6         3         94.8      98         C
18      6.2       3.4       94.6      97.3       C
19      5.9       3         94.9      98.8       C
Mean    6.0       3.1       96.1      98.5
Min     4.6       2.3       94.3      97.3
Max     7.0       3.9       99.4      99.4
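
Choice 2 can be illustrated for a single attribute as follows. This is a minimal sketch and not the Rapid Miner operator itself: the function name impute, its strategy argument and the use of None for a missing cell are assumptions made here, and the mode option mentioned on the slide is left out for brevity.

```python
def impute(values, strategy="mean", ndigits=1):
    """Replace None entries with the mean, min or max of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        # The slide rounds the attribute mean to one decimal before filling.
        fill = round(sum(observed) / len(observed), ndigits)
    elif strategy == "min":
        fill = min(observed)
    elif strategy == "max":
        fill = max(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Current Month's Usage column from the example (None = missing).
current_usage = [5.1, 4.9, None, 4.6, 5, 5.4, 7, 6.4, 6.9, None,
                 6.5, 5.7, 6.3, 6.7, 6.7, 6.3, None, 6.2, 5.9]

filled = impute(current_usage)          # records 3, 10 and 17 receive the mean
filled_min = impute(current_usage, "min")
filled_max = impute(current_usage, "max")
print(filled[2], filled[9], filled[16])
```

The same call can be repeated for Last 3 Months' Usage and Average Recharge.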
Missing Value Handling

Example: Circle wise Data

Choice 3 : Replace the missing values with the attribute mean of the
corresponding circle

Before replacement (- = missing value):

        Current   Last 3
        Month's   Months'   Average   Projected
SL No.  Usage     Usage     Recharge  Growth     Circle
1       5.1       3.5       99.4      99.2       A
2       4.9       3         98.6      99.2       A
3       -         3.2       -         99.2       A
4       4.6       3.1       98.5      99.2       A
5       5         -         98.4      99.2       A
6       5.4       3.9       98.3      99.4       A
7       7         3.2       95.3      98.4       B
8       6.4       3.2       95.5      98.5       B
9       6.9       3.1       95.1      98.5       B
10      -         2.3       96        98.3       B
11      6.5       2.8       95.4      98.5       B
12      5.7       -         95.5      98.3       B
13      6.3       3.3       -         98.6       B
14      6.7       3.3       94.3      97.5       C
15      6.7       3         94.8      97.3       C
16      6.3       2.5       95        98.9       C
17      -         3         94.8      98         C
18      6.2       3.4       94.6      97.3       C
19      5.9       3         94.9      98.8       C
Mean    5.00      3.34      98.64     99.24      A
Mean    6.47      2.98      95.47     98.45      B
Mean    6.36      3.03      94.73     97.97      C
Missing Value Handling

Example: Circle wise Data

Choice 3 : Replace the missing values with the attribute mean of the
corresponding circle

After replacement with the circle-wise mean:

        Current   Last 3
        Month's   Months'   Average   Projected
SL No.  Usage     Usage     Recharge  Growth     Circle
1       5.1       3.5       99.4      99.2       A
2       4.9       3         98.6      99.2       A
3       5         3.2       98.64     99.2       A
4       4.6       3.1       98.5      99.2       A
5       5         3.34      98.4      99.2       A
6       5.4       3.9       98.3      99.4       A
7       7         3.2       95.3      98.4       B
8       6.4       3.2       95.5      98.5       B
9       6.9       3.1       95.1      98.5       B
10      6.47      2.3       96        98.3       B
11      6.5       2.8       95.4      98.5       B
12      5.7       2.98      95.5      98.3       B
13      6.3       3.3       95.47     98.6       B
14      6.7       3.3       94.3      97.5       C
15      6.7       3         94.8      97.3       C
16      6.3       2.5       95        98.9       C
17      6.36      3         94.8      98         C
18      6.2       3.4       94.6      97.3       C
19      5.9       3         94.9      98.8       C
Mean    5.00      3.34      98.64     99.24      A
Mean    6.47      2.98      95.47     98.45      B
Mean    6.36      3.03      94.73     97.97      C
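
Choice 3 can be sketched in the same way, this time computing one mean per circle. The helper name impute_by_group and the data layout are assumptions for illustration only. The sketch relies on the earlier pivot-table check: a circle in which every value of the attribute is missing has no mean and would raise a KeyError.

```python
from collections import defaultdict


def impute_by_group(values, groups):
    """Fill None entries with the mean of the observed values in the same group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(values, groups):
        if value is not None:
            sums[group] += value
            counts[group] += 1
    means = {group: sums[group] / counts[group] for group in counts}
    filled = [means[g] if v is None else v for v, g in zip(values, groups)]
    return filled, means


# Current Month's Usage and Circle columns from the example (None = missing).
current_usage = [5.1, 4.9, None, 4.6, 5, 5.4, 7, 6.4, 6.9, None,
                 6.5, 5.7, 6.3, 6.7, 6.7, 6.3, None, 6.2, 5.9]
circles = ["A"] * 6 + ["B"] * 7 + ["C"] * 6

filled, circle_means = impute_by_group(current_usage, circles)
for circle, mean in sorted(circle_means.items()):
    print(circle, round(mean, 2))
```

Compared with Choice 2, the filled values now follow the level of each circle rather than the overall mean.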

DATA PREPROCESSING: Missing Value Handling

Exercise: The data on 3 modes of transport of a supply chain management company
are given below. How would you handle the missing values?

SL No Delivery Speed Vehicles Extra Handling Cost Mode of Transport


1 27.75 3 2 Water
2 3 445 Water
3 28.2 3 1 460 Water
4 8.75 1 0 980 Direct Truck
5 9.25 0 950 Direct Truck
6 9.15 1 1 Direct Truck
7 15.2 3 2 820 LTL Truck
8 16.2 2 2.5 810 LTL Truck
9 3 1.5 835 LTL Truck

LTL Truck : Less than truck load


MARKET BASKET
ANALYSIS

MARKET BASKET ANALYSIS

A modeling technique based upon the logic that if a customer buys a certain group of
items, he is more (or less) likely to buy another group of items

Example:

Those who buy cigarettes are also more likely to buy a matchbox.

MARKET BASKET ANALYSIS

Association Rule Mining:

Developing rules that predict the occurrence of an item based on the
occurrence of other items in the transaction

Example

Id Items
1 Milk, Bread
2 Bread, Biscuits, Toys, Eggs
3 Milk, Biscuits, Toys, Fruits
4 Bread, Milk, Toys, Biscuits
5 Milk, Bread, Biscuits, Fruits

{Milk, Bread} → {Biscuits} with probability = 2 / 3


MARKET BASKET ANALYSIS

Itemset:

A collection of one or more items

k – itemset
An itemset consisting of k items

Id Items
1 Milk, Bread
2 Bread, Biscuits, Toys, Eggs
3 Milk, Biscuits, Toys, Fruits
4 Bread, Milk, Toys, Biscuits
5 Milk, Bread, Biscuits, Fruits

MARKET BASKET ANALYSIS

Support count:

Frequency of occurrence of an itemset

Example
{Milk, Bread, Biscuits} = 2

Id Items
1 Milk, Bread
2 Bread, Biscuits, Toys, Eggs
3 Milk, Biscuits, Toys, Fruits
4 Bread, Milk, Toys, Biscuits
5 Milk, Bread, Biscuits, Fruits

MARKET BASKET ANALYSIS

Support :

Proportion or fraction of transactions that contain an itemset

Example
{Milk, Bread, Biscuits} = 2 / 5

Id Items
1 Milk, Bread
2 Bread, Biscuits, Toys, Eggs
3 Milk, Biscuits, Toys, Fruits
4 Bread, Milk, Toys, Biscuits
5 Milk, Bread, Biscuits, Fruits

Frequent Itemset
An itemset whose support is greater than or equal to minimum support
MARKET BASKET ANALYSIS

Confidence
Conditional probability that an item will appear in transactions that contain
other items
Example
Confidence that Toys will appear in transactions containing Milk & Biscuits
= Support count {Milk, Biscuits, Toys} / Support count {Milk, Biscuits} = 2 / 3 = 0.67

Id Items
1 Milk, Bread
2 Bread, Biscuits, Toys, Eggs
3 Milk, Biscuits, Toys, Fruits
4 Bread, Milk, Toys, Biscuits
5 Milk, Bread, Biscuits, Fruits
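
The three measures defined so far (support count, support and confidence) can be checked against the five example transactions with a short sketch. The function names are chosen for this illustration and are not part of Rapid Miner.

```python
# The five example transactions, each stored as a set of items.
TRANSACTIONS = [
    {"Milk", "Bread"},
    {"Bread", "Biscuits", "Toys", "Eggs"},
    {"Milk", "Biscuits", "Toys", "Fruits"},
    {"Bread", "Milk", "Toys", "Biscuits"},
    {"Milk", "Bread", "Biscuits", "Fruits"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Share of the transactions containing the antecedent that also contain the consequent."""
    both = support_count(set(antecedent) | set(consequent), transactions)
    return both / support_count(antecedent, transactions)


print(support_count({"Milk", "Bread", "Biscuits"}, TRANSACTIONS))
print(support({"Milk", "Bread", "Biscuits"}, TRANSACTIONS))
print(confidence({"Milk", "Biscuits"}, {"Toys"}, TRANSACTIONS))
```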

MARKET BASKET ANALYSIS

Association Rule Mining


1. Frequent Itemset Generation
Fix minimum support value
Generate all itemsets whose support ≥ minimum support

2. Rule Generation
Fix minimum confidence value
Generate high confidence rules from each frequent itemset

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm


a. Fix minimum support count
b. Generate all itemsets of length = 1
c. Calculate the support for each itemset
d. Eliminate all itemsets with support count < minimum support count
e. Repeat steps c & d for itemsets of length = 2, 3, ...

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm


Example:

Minimum Support count = 2


Id Items
1 A,C,D
2 B,C,E
3 A,B,C,E
4 B,E
5 A,E
6 A,C,E

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm


Example:

Minimum Support count = 2

Step 1:
Generate itemsets of length = 1 & calculate support

Item Support count


A 4
B 3
C 4
D 1
E 5

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm


Example:

Minimum Support count = 2

Step 2:
eliminate itemsets with support count < minimum support count (2)

Item Support count


A 4
B 3
C 4
D 1
E 5

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm


Example:

Minimum Support count = 2

Step 2:
eliminate itemsets with support count < minimum support count (2)

Item Support count


A 4
B 3
C 4
E 5

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm


Example:

Minimum Support count = 2

Step 3:
generate itemsets of length = 2

Item Support count


A, B 1
A, C 3
A,E 3
B, C 2
B, E 3
C,E 3
MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm


Example:

Minimum Support count = 2

Step 4:
eliminate itemsets with support count < minimum support count (2)

Item Support count


A, B 1
A, C 3
A,E 3
B, C 2
B, E 3
C,E 3
MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm


Example:

Minimum Support count = 2

Step 4:
eliminate itemsets with support count < minimum support count (2)

Item Support count


A, C 3
A,E 3
B, C 2
B, E 3
C,E 3

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm


Example:

Minimum Support count = 2

Step 5:
generate itemsets of length = 3

Item Support count


A, C, E 2
B, C, E 2

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm


Example:

Minimum Support count = 2

Step 6:
generate itemsets of length = 4
Itemset Support Count
A, B, C, E 1

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm


Example:

Minimum Support count = 2

Result:

Item Support count Support


A, C, E 2 0.33
B, C, E 2 0.33
A,C 3 0.50
A,E 3 0.50
B,C 2 0.33
B,E 3 0.50
C,E 3 0.50
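
The whole worked example can be reproduced with a compact sketch of the Apriori loop. It is an illustration under stated assumptions, not the Rapid Miner implementation: the function name apriori is chosen here, and candidates of length k+1 are pruned before counting whenever one of their k-item subsets is infrequent. That is why {A, B, C, E} is never counted in the sketch, whereas the slides count it (support count 1) and then eliminate it.

```python
from itertools import combinations

# The six example transactions.
TRANSACTIONS = [set("ACD"), set("BCE"), set("ABCE"), set("BE"), set("AE"), set("ACE")]
MIN_SUPPORT_COUNT = 2


def apriori(transactions, min_count):
    """Return {frozenset(itemset): support count} for every frequent itemset."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Steps c and d: count each candidate, then drop the infrequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        # Step e: join surviving k-itemsets into candidates of length k + 1.
        k += 1
        joined = {a | b for a, b in combinations(survivors, 2) if len(a | b) == k}
        # Keep a candidate only if all of its (k - 1)-item subsets are frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in survivors for s in combinations(c, k - 1))]
    return frequent


frequent = apriori(TRANSACTIONS, MIN_SUPPORT_COUNT)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count, round(count / len(TRANSACTIONS), 2))
```

The output lists the four frequent single items followed by the seven itemsets of length 2 and 3 shown in the Result table.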

MARKET BASKET ANALYSIS

Association Rule Mining: Apriori Algorithm


Example:

Minimum Support = 0.50


Minimum Confidence = 0.5

Item Support count Support


A, C, E 2 0.33
B, C, E 2 0.33
A,C 3 0.50
A,E 3 0.50
B,C 2 0.33
B,E 3 0.50
C,E 3 0.50

MARKET BASKET ANALYSIS

Association Rule Mining: Apriori Algorithm


Example:

Minimum Support = 0.50


Minimum Confidence = 0.5

Rule     Support  Confidence
A → C    0.50     0.75
A → E    0.50     0.75
B → E    0.50     1.00
C → E    0.50     0.75
C → A    0.50     0.75
E → A    0.50     0.60
E → B    0.50     0.60
E → C    0.50     0.60
MARKET BASKET ANALYSIS

Association Rule Mining: Other Measures

Lift

Lift (A → C) = Confidence (A → C) / Support (C)

Example

Rule     Confidence  Support     Lift
A → C    0.75        C = 0.67    1.12
A → E    0.75        E = 0.83    0.90

Criteria : Lift ≥ 1
Lift (A → C) = 1.12 > Lift (A → E) = 0.90 indicates that A has a greater impact on the
frequency of C than it has on the frequency of E
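
The lift values can be recomputed from the six transactions of the Apriori example. The function names below are chosen for this sketch. Working with unrounded supports gives 1.125 and 0.9; the slide's 1.12 comes from dividing 0.75 by the rounded support 0.67.

```python
# The six transactions of the Apriori example.
TRANSACTIONS = [set("ACD"), set("BCE"), set("ABCE"), set("BE"), set("AE"), set("ACE")]


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)


def lift(antecedent, consequent, transactions):
    """Confidence of antecedent -> consequent divided by the support of the consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    confidence = both / support(antecedent, transactions)
    return confidence / support(consequent, transactions)


lift_ac = lift({"A"}, {"C"}, TRANSACTIONS)
lift_ae = lift({"A"}, {"E"}, TRANSACTIONS)
print(lift_ac, lift_ae)
```

A lift above 1 means that the consequent appears more often with the antecedent than it does overall, and a lift below 1 means less often.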

MARKET BASKET ANALYSIS

Exercise 1: The data on transactions from a mobile outlet is given below.

1. Generate frequent itemsets with a support of at least 25%.
2. Generate associations of items with a confidence of at least 50%.
3. Estimate the chance that Mobile Slim, Landline and Broadband will
be subscribed together.
4. Estimate the chance that the customers who buy Landline will also
purchase Broadband & Ring tones.

MARKET BASKET ANALYSIS

Exercise 2:

The market basket Software data set contains the details of transaction at a
software product company.

1. Identify the frequent product types with a support of minimum 25% ?


2. Also identify the association of products with a confidence of minimum 50%
?
3. What is the chance that Operating System and Office Suite will be
purchased together?
4. What is the chance that Operating System and Visual Studio will be
purchased together?
5. Estimate the chance that the customers who buy Operating System will also
purchase Office Suite ?
6. Estimate the chance that the customers who buy Operating System will also
purchase Visual Studio?

LINEAR
REGRESSION

CORRELATION & REGRESSION

Correlation:

Correlation analysis is a technique to identify the relationship between two
variables: the type and the degree of the relationship.

CORRELATION & REGRESSION

Correlation: Usage

Explore the relationship between the output characteristic and input or process
variable.

Output variable : Y : Dependent variable


Input / Process variable : X : Independent variable

CORRELATION & REGRESSION

Positive Correlation: Y increases as X increases & vice versa

[Figure: Scatter Plot of Y against X]

CORRELATION & REGRESSION

Negative Correlation: Y decreases as X increases & vice versa

[Figure: Scatter Plot of Y against X]

CORRELATION & REGRESSION

No Correlation: Random Distribution of points

[Figure: Scatter Plot of Y against X]
CORRELATION & REGRESSION

Is there any correlation ?

[Figure: Scatter Plot of Y against X]
CORRELATION & REGRESSION

Measure of Correlation: Coefficient of Correlation

Symbol : r
Range : -1 to 1
Sign : Type of correlation
Value : Degree of correlation

Examples:
r = 0.6: positive correlation of degree 0.6
r = -0.82: negative correlation of degree 0.82
r = 0: no (linear) correlation

CORRELATION & REGRESSION

Coefficient of Correlation: Positive Correlation

Collect data on x and y: When x is low, y is also low & vice versa

x y
2 5
3 7
1 3
5 11
6 12
7 15
CORRELATION & REGRESSION

Coefficient of Correlation: Positive Correlation

Calculate Mean of x & y values

SL No. x y

1 2 5
2 3 7
3 1 3

4 5 11
5 6 12
6 7 15
Mean 4 8.83

CORRELATION & REGRESSION

Coefficient of Correlation: Positive Correlation

Take x – Mean x and y – Mean y

SL No.  x – Mean x  y – Mean y
1       -2          -3.83
2       -1          -1.83
3       -3          -5.83
4       1           2.17
5       2           3.17
6       3           6.17

Conclusion: Low values will become negative & high values will become positive

CORRELATION & REGRESSION

Coefficient of Correlation: Positive Correlation

Generally when the x deviations (x – Mean x) are negative, the y deviations (y – Mean y) are also negative & vice versa

SL No. x – Mean x y – Mean y

1 -2 -3.83
2 -1 -1.83
3 -3 -5.83

4 1 2.17
5 2 3.17
6 3 6.17
CORRELATION & REGRESSION

Coefficient of Correlation: Positive Correlation

Then
Product of the x & y deviations will be positive

SL No. x – Mean x y – Mean y Product

1 -2 -3.83 7.66
2 -1 -1.83 1.83
3 -3 -5.83 17.49

4 1 2.17 2.17
5 2 3.17 6.34
6 3 6.17 18.51
Sum = Sxy 54
CORRELATION & REGRESSION

Coefficient of Correlation: Positive Correlation

Sum of Product of the x & y deviations (Sxy) will be positive

SL No. x – Mean x y – Mean y Product

1 -2 -3.83 7.66
2 -1 -1.83 1.83
3 -3 -5.83 17.49

4 1 2.17 2.17
5 2 3.17 6.34
6 3 6.17 18.51
Sum = Sxy 54
CORRELATION & REGRESSION

Coefficient of Correlation: Negative Correlation

Collect data on x and y: When x is low then y will be high & vice versa

x y
2 12
3 11
1 15
5 7
6 5
7 3
CORRELATION & REGRESSION

Coefficient of Correlation: Negative Correlation

Calculate Mean of x & y values

SL No. x y

1 2 12
2 3 11
3 1 15

4 5 7
5 6 5
6 7 3
Mean 4 8.83

CORRELATION & REGRESSION

Coefficient of Correlation: Negative Correlation

Take x – Mean x and y – Mean y

SL No.  x – Mean x  y – Mean y
1       -2          3.17
2       -1          2.17
3       -3          6.17
4       1           -1.83
5       2           -3.83
6       3           -5.83

Conclusion: Low values will become negative & high values will become positive

CORRELATION & REGRESSION

Coefficient of Correlation: Negative Correlation

Generally when the x deviations (x – Mean x) are negative, the y deviations (y – Mean y) are positive & vice versa

SL No.  x – Mean x  y – Mean y
1       -2          3.17
2       -1          2.17
3       -3          6.17
4       1           -1.83
5       2           -3.83
6       3           -5.83
CORRELATION & REGRESSION

Coefficient of Correlation: Negative Correlation

Then
Product of the x & y deviations will be negative

SL No.  x – Mean x  y – Mean y  Product
1       -2          3.17        -6.34
2       -1          2.17        -2.17
3       -3          6.17        -18.51
4       1           -1.83       -1.83
5       2           -3.83       -7.66
6       3           -5.83       -17.49
Sum = Sxy                       -54
CORRELATION & REGRESSION

Coefficient of Correlation: Negative Correlation

Sum of Product of the x & y deviations (Sxy) will be negative

SL No.  x – Mean x  y – Mean y  Product
1       -2          3.17        -6.34
2       -1          2.17        -2.17
3       -3          6.17        -18.51
4       1           -1.83       -1.83
5       2           -3.83       -7.66
6       3           -5.83       -17.49
Sum = Sxy                       -54
CORRELATION & REGRESSION

Coefficient of Correlation:

In Short
If correlation is positive
Sxy will be positive
If correlation is negative
Sxy will be negative

CORRELATION & REGRESSION

Coefficient of Correlation:

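To avoid scale issues, Sxy is divided by √(Sxx · Syy)

$S_{xy} = \sum (x - \bar{x})(y - \bar{y})$
$S_{xx} = \sum (x - \bar{x})^2$
$S_{yy} = \sum (y - \bar{y})^2$

Correlation Coefficient $r = S_{xy} / \sqrt{S_{xx} \, S_{yy}}$

(Here $\bar{x}$ = Mean x and $\bar{y}$ = Mean y.)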

CORRELATION & REGRESSION

Coefficient of Correlation:

SL No.  x – Mean x  y – Mean y  Product  (x – Mean x)^2  (y – Mean y)^2
1       -2          3.17        -6.34    4               10.0489
2       -1          2.17        -2.17    1               4.7089
3       -3          6.17        -18.51   9               38.0689
4       1           -1.83       -1.83    1               3.3489
5       2           -3.83       -7.66    4               14.6689
6       3           -5.83       -17.49   9               33.9889
Sum                             Sxy: -54 Sxx: 28         Syy: 104.83

r = Sxy / √(Sxx · Syy) = -54 / √(28 x 104.83) = -0.9967
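
The calculation of r can be verified with a short sketch that follows the definitions of Sxy, Sxx and Syy directly. The function name correlation and its return layout are assumptions for this illustration.

```python
import math


def correlation(xs, ys):
    """Return (r, Sxy, Sxx, Syy) computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy), sxy, sxx, syy


x = [2, 3, 1, 5, 6, 7]
y_negative = [12, 11, 15, 7, 5, 3]   # negative correlation example
y_positive = [5, 7, 3, 11, 12, 15]   # positive correlation example

r_neg, sxy_neg, sxx, syy = correlation(x, y_negative)
r_pos, sxy_pos, _, _ = correlation(x, y_positive)
print(round(r_neg, 4), round(r_pos, 4))
```

Both examples use the same set of y values in a different order, so Sxx and Syy are identical and only the sign of Sxy changes.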


CORRELATION & REGRESSION

Regression

Correlation helps
To check whether two variables are related
If related
Identify the type & degree of relationship

CORRELATION & REGRESSION

Regression

Regression helps
• To identify the exact form of the relationship
• To model output in terms of input or process variables

Examples:
Yield = 5 + 3 x Time - 2 x Temperature
Y = 2 - 5x
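
The slides do not say how the coefficients of such an equation are estimated. A common choice is ordinary least squares, which is assumed in the sketch below for the one-variable case Y = a + bX: the slope is b = Sxy / Sxx and the intercept is a = Mean y - b x Mean x. The function name fit_line is hypothetical, and the data are the x and y values of the positive correlation example.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                # slope
    a = mean_y - b * mean_x      # intercept
    return a, b


x = [2, 3, 1, 5, 6, 7]
y = [5, 7, 3, 11, 12, 15]

a, b = fit_line(x, y)
predicted = [a + b * xi for xi in x]
print(f"y = {a:.3f} + {b:.3f} x")
```

Comparing predicted with y shows how closely the fitted line reproduces the observed values.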

CORRELATION & REGRESSION

Multiple Regression

To model output variable y in terms of two or more variables.

General Form:
$Y = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$

Two variable case:
$Y = a + b_1 X_1 + b_2 X_2$

CORRELATION & REGRESSION

Exercise 1: The data on Vendor performance score and the number of On Time,
Complete, Undamaged & Correctly billed shipments from the vendors of a
supply chain management company are given below. Can you develop a
model for Vendor performance score in terms of other variables?
Vendor Id | Ontime Shipment | Complete Shipment | Undamaged Shipments | Correctly billed | Performance Score
1 950 990 980 550 2985
2 1450 1425 1475 975 4576
3 1700 1575 1730 1320 5435
4 1800 1515 1890 1615 5955
5 1675 1420 1756 1456 5400
6 1756 1645 1835 1489 5590
7 1236 1462 1335 1435 4675
8 1100 1523 1565 1625 4960
9 1325 1725 1570 1520 5325
10 1450 1620 1463 1430 5170
11 1570 1458 1356 1630 5190

LINEAR REGRESSION

Exercise 2: A construction company wants to develop a model for the concrete
compressive strength. The attributes of interest are given in the table
below. The training data is given in the file Concrete_Data.xls .
1. Can you develop the model?
2. How closely will it predict the values?

1 Cement (component 1)(kg in a m^3 mixture)


2 Blast Furnace Slag (component 2)(kg in a m^3 mixture)
3 Fly Ash (component 3)(kg in a m^3 mixture)
4 Water (component 4)(kg in a m^3 mixture)
5 Superplasticizer (component 5)(kg in a m^3 mixture)
6 Coarse Aggregate (component 6)(kg in a m^3 mixture)
7 Fine Aggregate (component 7)(kg in a m^3 mixture)
8 Age (day)
9 Concrete compressive strength(MPa, megapascals)


CLASSIFICATION
METHODS


INTRODUCTION

Objective
To develop a mathematical model for an attribute or response metric (Y) in terms of
other available attributes (Xs).

When to Use
Xs : Continuous or discrete
Y : Discrete

CLASSIFICATION METHODS

Classifies data (develops a model) based on the training data


Each sample is assumed to belong to a predefined class
Sample data set used for building the model is training set

Usage:
For classifying future or unknown data

CLASSIFICATION METHODS

Example:

Attribute 1 x1
Attribute 2 x2
Label : y Y1 (Red) , y2 (Blue)

x1 x2 Y x1 x2 Y
11.35 23 Blue 11.85 39.9 Red
11.59 22.3 Blue 12.09 39.5 Red
12.19 24.5 Blue 12.69 37.8 Red
13.23 26.4 Blue 13.73 38.2 Red
13.51 30.2 Blue 14.01 37.8 Red
13.68 32 Blue 14.18 36.5 Red
14.78 33.1 Blue 15.28 36 Red
15.11 33 Blue 15.61 37.1 Red
15.55 25.2 Blue 16.05 33.1 Red
16.37 24.1 Blue 16.87 32.4 Red
16.99 22 Blue 17.49 31 Red
18.23 23.5 Blue 18.73 32 Red
18.83 24.1 Blue 19.33 31.8 Red
19.06 25 Blue 19.56 30.9 Red
CLASSIFICATION METHODS

Example:

Attribute 1 x1
Attribute 2 x2
Label : y y1 (Red) , y2 (Blue)

[Figure: scatter plot of x2 against x1]

CLASSIFICATION METHODS

Example:

Attribute 1 x1
Attribute 2 x2
Label : y y1 (Red) , y2 (Blue)

[Figure: scatter plot of x2 against x1, with the first branch of the decision tree: x2 > 35 → y1]

CLASSIFICATION METHODS

Example:

Attribute 1 x1
Attribute 2 x2
Label : y y1 (Red) , y2 (Blue)

[Figure: scatter plot of x2 against x1, with two branches of the decision tree: x2 > 35 → y1; x2 < 28 → y2]

CLASSIFICATION METHODS

Example:

Attribute 1 x1
Attribute 2 x2
Label : y y1 (Red) , y2 (Blue)

[Figure: scatter plot of x2 against x1, with the full decision tree: x2 > 35 → y1; x2 < 28 → y2; otherwise split on x1: x1 < 15.5 → y2; x1 > 15.5 → y1]

CLASSIFICATION METHODS

Example: Rules

Attribute 1 x1
Attribute 2 x2
Label : y y1 (Red) , y2 (Blue)

If x2 > 35, then y = y1
If x2 < 28, then y = y2
If 28 ≤ x2 ≤ 35 & x1 > 15.5, then y = y1
If 28 ≤ x2 ≤ 35 & x1 < 15.5, then y = y2

[Figure: decision tree. Root split on x2: > 35 → y1; < 28 → y2; otherwise split on x1: < 15.5 → y2; > 15.5 → y1]
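
The four rules translate directly into a small function, which can then be checked against the 28 example records. The function name classify is chosen for this sketch. The slides do not say what happens exactly at x1 = 15.5, so assigning that boundary to y2 (Blue) is an assumption.

```python
def classify(x1, x2):
    """Apply the decision rules of the example and return 'Red' (y1) or 'Blue' (y2)."""
    if x2 > 35:
        return "Red"
    if x2 < 28:
        return "Blue"
    # 28 <= x2 <= 35: decide on x1 (x1 == 15.5 is sent to Blue by assumption).
    return "Red" if x1 > 15.5 else "Blue"


# (x1, x2) pairs from the example table.
BLUE = [(11.35, 23), (11.59, 22.3), (12.19, 24.5), (13.23, 26.4), (13.51, 30.2),
        (13.68, 32), (14.78, 33.1), (15.11, 33), (15.55, 25.2), (16.37, 24.1),
        (16.99, 22), (18.23, 23.5), (18.83, 24.1), (19.06, 25)]
RED = [(11.85, 39.9), (12.09, 39.5), (12.69, 37.8), (13.73, 38.2), (14.01, 37.8),
       (14.18, 36.5), (15.28, 36), (15.61, 37.1), (16.05, 33.1), (16.87, 32.4),
       (17.49, 31), (18.73, 32), (19.33, 31.8), (19.56, 30.9)]

errors = ([p for p in BLUE if classify(*p) != "Blue"]
          + [p for p in RED if classify(*p) != "Red"])
print("misclassified records:", errors)
```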

CLASSIFICATION METHODS

Example: Table 1 gives the profile of customers (Refund, Marital
Status & Taxable Income) who have taken a loan from a bank. The table also
shows how many of them actually cheated the bank.
1. Can you develop a decision rule to classify whether a customer will
cheat or not, based on the values of the 3 attributes (Refund, Marital Status &
Taxable Income)?
2. Validate the model using the test data given in Table 2

Table 2: Test Data

SL No  Refund  Marital Status  Taxable Income  Cheat
1      Yes     Married         > 80 K          No
2      No      Single          > 80 K          No
3      No      Single          < 80 K          No
4      No      Married         > 80 K          No
5      No      Divorced        > 80 K          Yes
CLASSIFICATION METHODS

Table 1: Training Data Set

SL No  Refund  Marital Status  Taxable Income  Cheat
1      Yes     Single          > 80 K          No
2      No      Married         > 80 K          No
3      No      Single          < 80 K          No
4      Yes     Married         > 80 K          No
5      No      Divorced        > 80 K          Yes
6      No      Married         < 80 K          No
7      Yes     Divorced        > 80 K          No
8      No      Single          > 80 K          Yes
9      No      Married         > 80 K          No
10     No      Single          > 80 K          Yes

Class variable: Cheat
Number of predefined classes: 2 (Cheat = No & Cheat = Yes)
CLASSIFICATION METHODS

Example:Result

If Marital Status = Married then cheat : No


If Marital Status = Single & Refund = Yes then cheat : No
If Marital Status = Single, Refund = No & Taxable Income < 80K then cheat: No
If Marital Status = Single, Refund = No & Taxable Income > 80K then cheat: Yes
If Marital Status = Divorced & Refund = Yes then cheat : No
If Marital Status = Divorced & Refund = No then cheat : Yes
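
The six rules can be written as one function and applied to the five test records of Table 2. The function name predict_cheat and the boolean flag for Taxable Income > 80 K are assumptions made for this sketch.

```python
def predict_cheat(refund, marital_status, income_above_80k):
    """Return the predicted Cheat label ('Yes' or 'No') from the decision rules."""
    if marital_status == "Married":
        return "No"
    if refund == "Yes":
        return "No"                      # Single or Divorced with a refund
    if marital_status == "Single":
        return "Yes" if income_above_80k else "No"
    return "Yes"                         # Divorced without a refund


# Table 2 (test data): Refund, Marital Status, Taxable Income > 80 K, actual Cheat.
TEST_DATA = [
    ("Yes", "Married", True, "No"),
    ("No", "Single", True, "No"),
    ("No", "Single", False, "No"),
    ("No", "Married", True, "No"),
    ("No", "Divorced", True, "Yes"),
]

predicted = [predict_cheat(refund, status, high) for refund, status, high, _ in TEST_DATA]
actual = [row[3] for row in TEST_DATA]
print(list(zip(actual, predicted)))
```

Only test record 2 is misclassified: the rules predict Yes while the actual value is No.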

CLASSIFICATION METHODS

Example:Decision Tree

SL No Refund Marital Status Taxable Income Cheat
1 Yes Single > 80 K No
2 No Married > 80 K No
3 No Single < 80 K No
4 Yes Married > 80 K No
5 No Divorced > 80 K Yes
6 No Married < 80 K No
7 Yes Divorced > 80 K No
8 No Single > 80 K Yes
9 No Married > 80 K No
10 No Single > 80 K Yes
CLASSIFICATION METHODS

Example: Test Data Set


SL No  Refund  Marital Status  Taxable Income  Cheat
1      Yes     Married         > 80 K          No
2      No      Single          > 80 K          No
3      No      Single          < 80 K          No
4      No      Married         > 80 K          No
5      No      Divorced        > 80 K          Yes

SL No  Refund  Marital Status  Taxable Income  Cheat  Predicted Cheat
1      Yes     Married         > 80 K          No     No
2      No      Single          > 80 K          No     Yes
3      No      Single          < 80 K          No     No
4      No      Married         > 80 K          No     No
5      No      Divorced        > 80 K          Yes    Yes
CLASSIFICATION METHODS

Performance Evaluation Measures


1. Confusion Matrix

                    Actual Class
Predicted Class     Class = Yes   Class = No
Class = Yes         a             b
Class = No          c             d

2. Accuracy: (a + d) / (a + b + c + d)

3. Precision: a / (a + b)

4. Recall: a / (a + c)

5. F Measure = 2 x Precision x Recall / (Precision + Recall)

CLASSIFICATION METHODS

Example: Performance Evaluation Measures

SL No Cheat Predicted Cheat

1 No No
2 No Yes
3 No No
4 No No
5 Yes Yes

1. Confusion Matrix

                    Actual Class
Predicted Class     Cheat = No    Cheat = Yes
Cheat = No          3             0
Cheat = Yes         1             1
CLASSIFICATION METHODS

Example: Performance Evaluation Measures

1. Confusion Matrix

                    Actual Class
Predicted Class     Cheat = No    Cheat = Yes
Cheat = No          3             0
Cheat = Yes         1             1

Taking Cheat = No as the positive class: a = 3, b = 0, c = 1, d = 1

2. Accuracy: (3 + 1) / (3 + 0 + 1 + 1) = 4 / 5 = 0.8

3. Precision: 3 / (3 + 0) = 3 / 3 = 1.0

4. Recall: 3 / (3 + 1) = 3 / 4 = 0.75

5. F Measure = 2 x Precision x Recall / (Precision + Recall)
= 2 x 1.0 x 0.75 / (1.00 + 0.75) = 0.86
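
The same figures can be derived from the actual and predicted labels. This sketch assumes that Cheat = No is treated as the positive class, which is what the numbers of the example imply, and that the F measure combines precision and recall. The function name classification_metrics is chosen for illustration.

```python
def classification_metrics(actual, predicted, positive):
    """Confusion-matrix cells (a, b, c, d) and the derived measures."""
    pairs = list(zip(actual, predicted))
    a = sum(1 for act, pred in pairs if pred == positive and act == positive)
    b = sum(1 for act, pred in pairs if pred == positive and act != positive)
    c = sum(1 for act, pred in pairs if pred != positive and act == positive)
    d = sum(1 for act, pred in pairs if pred != positive and act != positive)
    precision = a / (a + b)
    recall = a / (a + c)
    return {
        "cells": (a, b, c, d),
        "accuracy": (a + d) / (a + b + c + d),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }


actual = ["No", "No", "No", "No", "Yes"]
predicted = ["No", "Yes", "No", "No", "Yes"]

metrics = classification_metrics(actual, predicted, positive="No")
print(metrics)
```

The sketch does not guard against division by zero, which occurs when the positive class is never predicted or never present.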
CLASSIFICATION METHODS

Challenges

How to represent the entire information in the dataset using minimum number
of rules?
How to develop the smallest tree?

Solution

Select the attribute with maximum information for first split

Split Attribute
First Marital Status
Second Refund
Third Taxable Income
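
The slides do not state how "maximum information" is measured. One common measure is information gain, the reduction in entropy of the class variable after a split, and it is assumed in the sketch below. CHAID itself selects splits with chi-square tests, so this is an illustration of the idea rather than of that algorithm. Applied to the training data of Table 1, information gain ranks the attributes in the same order as the split table above.

```python
import math
from collections import Counter

ATTRIBUTES = ["Refund", "Marital Status", "Taxable Income"]
# Table 1 (training data): Refund, Marital Status, Taxable Income, Cheat.
TRAINING = [
    ("Yes", "Single", "> 80 K", "No"),
    ("No", "Married", "> 80 K", "No"),
    ("No", "Single", "< 80 K", "No"),
    ("Yes", "Married", "> 80 K", "No"),
    ("No", "Divorced", "> 80 K", "Yes"),
    ("No", "Married", "< 80 K", "No"),
    ("Yes", "Divorced", "> 80 K", "No"),
    ("No", "Single", "> 80 K", "Yes"),
    ("No", "Married", "> 80 K", "No"),
    ("No", "Single", "> 80 K", "Yes"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the class minus the weighted entropy after splitting on one attribute."""
    gain = entropy([row[-1] for row in rows])
    for value in {row[attr_index] for row in rows}:
        subset = [row[-1] for row in rows if row[attr_index] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain


gains = {name: information_gain(TRAINING, i) for i, name in enumerate(ATTRIBUTES)}
ranking = sorted(gains, key=gains.get, reverse=True)
for name in ranking:
    print(f"{name}: {gains[name]:.4f}")
```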


CLASSIFICATION METHODS

CHAID Algorithm

Example: A marketing company wants to optimize its mailing campaign by sending
the brochure mail only to those customers who responded to previous mail
campaigns. The profiles of the customers are given below. Can you develop a rule to
identify the profile of customers who are likely to respond?
SL No District House Type Income Previous_Customer Outcome
1 Suburban Detached High No No Response
2 Suburban Detached High Yes No Response
3 Rural Detached High No Responded
4 Urban Semi-detached High No Responded
5 Urban Semi-detached Low No Responded
6 Urban Semi-detached Low Yes No Response
7 Rural Semi-detached Low Yes Responded
8 Suburban Terrace High No No Response
9 Suburban Semi-detached Low No Responded
10 Urban Terrace Low No Responded
11 Suburban Terrace Low Yes Responded
12 Rural Terrace High Yes Responded
13 Rural Detached Low No Responded
14 Urban Terrace High Yes No Response

CLASSIFICATION METHODS

CHAID Algorithm

Number of variables = 4

SL No Variable Name Number of values


1 District 3
2 House Type 3
3 Income 2
4 Previous Customer 2

Total Combination of Customer Profiles = 3 x 3 x 2 x 2 = 36



CLASSIFICATION METHODS

Exercise 1: A bank wants to know the profile of customers who will buy a Personal
Equity Plan (Pep) after the mailing campaign. The data is given in the
file named bank-data.xls.

1. Can you develop a decision methodology?


2. How good is your model?


CLASSIFICATION METHODS

Exercise 1: The file contains the following fields.

Id a unique identification number


Age age of customer in years (numeric)
Sex MALE / FEMALE
Region inner_city/rural/suburban/town
Income income of customer (numeric)
Married is the customer married (YES/NO)
Children number of children (numeric)
Car does the customer own a car (YES/NO)
Save_acct does the customer have a saving account (YES/NO)
Current_acct does the customer have a current account (YES/NO)
Mortgage does the customer have a mortgage (YES/NO)
Pep did the customer buy a PEP (Personal Equity Plan) after the
last mailing (YES/NO)

CLASSIFICATION METHODS

Exercise 2: The profile of the customers of a telecom service provider in
grace period is given in the churn.xls file.
1. Can you develop a model to identify potential churners
(disconnections) so that the organization can win back the customers by
providing different offers?
2. How good is the decision rule?


CLASSIFICATION METHODS

Exercise 2: The file contains the following fields.

1) Service class
2) Class change in last week
3) Class change in last 15 days
4) Class change in last month
5) Class change in last two months
6) Usage amount in last week
7) Usage amount in last 15 days
8) Usage amount in last month
9) Usage amount in last two months
10) Recharge amount in last week
11) Recharge amount in last 15 days
12) Recharge amount in last month
13) Recharge amount in last two months
14) Recharge count in last week
15) Recharge count in last 15 days
16) Recharge count in last month
17) Recharge count in last two months
18) Closing balance in last week
19) Closing balance in last 15 days
20) Closing balance in last month
21) Closing balance in last two months

CLUSTER ANALYSIS


CLUSTER ANALYSIS

Objective
To classify the records or items into a smaller number of groups based on the values
of available attributes.

When to Use
When there is no Y attribute
All attributes are considered as Xs only


CLUSTER ANALYSIS

Methodology to group objects based on many attributes such that objects in a group
• will be similar (or related) to one another
• will be different from (or unrelated to) the objects in other groups


CLUSTER ANALYSIS

Types of Clustering

• K Mean Clustering
• K Medoid Clustering


CLUSTER ANALYSIS

K Mean Clustering

Methodology to group objects based on many attributes such that objects in a cluster will
be closer (or more similar) to the centroid of the cluster than to the centroid of any other
cluster.

1. Each cluster is associated with a centroid
2. Each point is assigned to the cluster with the closest centroid
3. The number of clusters, K, must be specified
4. The initial centroids are often chosen randomly
5. The centroid is (typically) the mean of the points in the cluster
6. Closeness is measured by Euclidean distance


CLUSTER ANALYSIS

K Mean Clustering:Euclidean Distance

$$D(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_k - y_k)^2}$$

Example:

SL No   Attribute 1   Attribute 2   Attribute 3   Attribute 4   Attribute 5
1       25.8          15.7          56.0          0.57          5.3
2       23.6          18.1          48.9          0.62          7.2

Euclidean Distance

             Attribute 1   Attribute 2   Attribute 3   Attribute 4   Attribute 5
Difference   2.2           -2.4          7.1           -0.05         -1.9
Square       4.84          5.76          50.41         0.0025        3.61
Sum          64.6225
Sq Root      8.038812101
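
The worked example above can be checked with a few lines of Python. This is a minimal sketch; the function name euclidean_distance and the variable names are illustrative and not part of the original material.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences between two records."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# The two records from the example table (Attribute 1 to Attribute 5).
record_1 = [25.8, 15.7, 56.0, 0.57, 5.3]
record_2 = [23.6, 18.1, 48.9, 0.62, 7.2]

distance = euclidean_distance(record_1, record_2)
print(round(distance, 4))
```

The printed value should agree with the "Sq Root" row of the table.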

CLUSTER ANALYSIS

K Mean Clustering:Algorithm

1. Get the number of clusters (k) required from the user


2. Randomly select k centroids
3. Calculate the Euclidean distance of each data record to each & every
centroid
4. For each record, identify the cluster with minimum Euclidean distance
5. Allocate the record to the cluster with minimum distance
6. Recalculate the centroids
7. Repeat steps 3 to 6 until there is no change in the cluster elements
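
The seven steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the procedure used inside Rapid Miner: the function name k_means, the seed and max_iter parameters, the choice of k existing records as starting centroids, and the small points data set are all made up for this sketch.

```python
import math
import random

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_means(records, k, seed=0, max_iter=100):
    """Return (centroids, allocation) following steps 1 to 7."""
    rng = random.Random(seed)            # fixed seed (assumption) for repeatable runs
    centroids = rng.sample(records, k)   # step 2: k random records as starting centroids
    allocation = None
    for _ in range(max_iter):
        # Steps 3 to 5: allocate each record to the centroid at minimum distance.
        new_allocation = [
            min(range(k), key=lambda c: euclidean_distance(r, centroids[c]))
            for r in records
        ]
        if new_allocation == allocation:  # step 7: stop when no record changes cluster
            break
        allocation = new_allocation
        # Step 6: recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [r for r, a in zip(records, allocation) if a == c]
            if members:                   # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, allocation

# Hypothetical data, only to show the call.
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]
centroids, allocation = k_means(points, k=2)
print(allocation)
print(centroids)
```

Because the starting centroids are random, different seeds can lead to different final clusters, so it is worth running the procedure more than once.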


CLUSTER ANALYSIS

K Mean Clustering:Example

Cluster the following data with 3 attributes (Spend in 3 quarters) into 2 clusters
SL No. Quarter 1 Quarter 2 Quarter 3
1 1.425172 31.08748 108.5436
2 3.017551 34.17728 103.4577
3 3.803405 34.78973 101.7977
4 4.299151 31.02313 107.3701
5 5.352034 22.80945 109.9353
6 6.038361 22.21948 100.1809
7 6.128493 25.04893 111.0543
8 8.381028 23.6761 106.3302
9 8.989409 27.62143 106.7186
10 9.788646 27.35268 105.7799

Step 1: k = 2
Step 2: Randomly identify 2 centroids
Centroid Quarter 1 Quarter 2 Quarter 3
1 1.5 35 100
2 9.8 22 111

CLUSTER ANALYSIS

K Mean Clustering:Example

Step 3: Calculate the Euclidean distance of each point from centroid 1


Difference from Centroid 1 (Quarter 1 to Quarter 3)
SL No. Quarter 1 Quarter 2 Quarter 3 Sum of Squares Euclidean Distance
1 -0.07483 -3.91252 8.543587 88.30632 9.397144437
2 1.517551 -0.82272 3.4577 14.93552 3.864650273
3 2.303405 -0.21027 1.797705 8.58163 2.929441858
4 2.799151 -3.97687 7.370058 77.96853 8.829979305
5 3.852034 -12.1906 9.935263 262.1572 16.19127037
6 4.538361 -12.7805 0.180881 183.9713 13.56360037
7 4.628493 -9.95107 11.05433 242.6451 15.57706944
8 6.881028 -11.3239 6.330205 215.6508 14.68505506
9 7.489409 -7.37857 6.71863 155.6745 12.47695732
10 8.288646 -7.64732 5.779929 160.5907 12.67243867


CLUSTER ANALYSIS

K Mean Clustering:Example

Step 4: Calculate the Euclidean distance of each point from centroid 2

Difference from Centroid 2 (Quarter 1 to Quarter 3)
SL No. Quarter 1 Quarter 2 Quarter 3 Sum of Squares Euclidean Distance
1 -8.37483 9.087476 -2.45641 158.7539 12.59975882
2 -6.78245 12.17728 -7.5423 251.174 15.84846897
3 -5.99659 12.78973 -9.2023 284.2187 16.85878574
4 -5.50085 9.023126 -3.62994 124.8526 11.17374688
5 -4.44797 0.809445 -1.06474 21.57327 4.644703413
6 -3.76164 0.219475 -10.8191 131.2514 11.45650187
7 -3.67151 3.048927 0.054333 22.77887 4.772721122
8 -1.41897 1.676096 -4.6698 26.62977 5.160403756
9 -0.81059 5.621434 -4.28137 50.58771 7.112503764
10 -0.01135 5.352683 -5.22007 55.90048 7.476662339


CLUSTER ANALYSIS

K Mean Clustering:Example

Step 5: Allocate records to clusters with minimum distance


SL No. Quarter 1 Quarter 2 Quarter 3 Cluster 1 Cluster 2 Allocation
1 1.425172 31.08748 108.5436 9.397144 12.59975882 1
2 3.017551 34.17728 103.4577 3.86465 15.84846897 1
3 3.803405 34.78973 101.7977 2.929442 16.85878574 1
4 4.299151 31.02313 107.3701 8.829979 11.17374688 1
5 5.352034 22.80945 109.9353 16.19127 4.644703413 2
6 6.038361 22.21948 100.1809 13.5636 11.45650187 2
7 6.128493 25.04893 111.0543 15.57707 4.772721122 2
8 8.381028 23.6761 106.3302 14.68506 5.160403756 2
9 8.989409 27.62143 106.7186 12.47696 7.112503764 2
10 9.788646 27.35268 105.7799 12.67244 7.476662339 2

Step 6: Recalculate the centroids and repeat the steps

Mean
Centroid Quarter 1 Quarter 2 Quarter 3
1 3.13632 32.7694 105.2923
2 7.446328 24.78801 106.6665
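
One pass of Steps 3 to 6 on the example data can be reproduced with the short Python sketch below. The records and starting centroids are taken from the slides; the variable names are illustrative only.

```python
import math

# Spend in Quarter 1, Quarter 2, Quarter 3 for records 1 to 10.
records = [
    (1.425172, 31.08748, 108.5436),
    (3.017551, 34.17728, 103.4577),
    (3.803405, 34.78973, 101.7977),
    (4.299151, 31.02313, 107.3701),
    (5.352034, 22.80945, 109.9353),
    (6.038361, 22.21948, 100.1809),
    (6.128493, 25.04893, 111.0543),
    (8.381028, 23.6761, 106.3302),
    (8.989409, 27.62143, 106.7186),
    (9.788646, 27.35268, 105.7799),
]
centroids = [(1.5, 35.0, 100.0), (9.8, 22.0, 111.0)]  # Step 2 starting centroids

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Steps 3 to 5: distance to each centroid, then the nearer one (labels 1 and 2).
allocation = [
    1 + min(range(len(centroids)), key=lambda c: euclidean_distance(r, centroids[c]))
    for r in records
]

# Step 6: new centroid = column-wise mean of the records allocated to each cluster.
new_centroids = []
for label in (1, 2):
    members = [r for r, a in zip(records, allocation) if a == label]
    new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))

print(allocation)
for label, centroid in enumerate(new_centroids, start=1):
    print(label, [round(v, 4) for v in centroid])
```

The allocation and the recalculated centroids should match the Step 5 allocation column and the Mean table.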

CLUSTER ANALYSIS

Exercise 1: The data on the % Erlang utilization of mobile towers of a telecom service
provider is given in Erlang_Utilization.xls. Group the towers into 5
clusters based on the utilization.


CLUSTER ANALYSIS

K Medoid Clustering

Methodology to group objects based on many attributes such that objects in a cluster will
be closer (or more similar) to the most centrally located object of the cluster.

1. Number of clusters, K must be specified


2. Closeness is measured by Euclidean distance
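
To illustrate how a medoid differs from a mean, the sketch below picks the "most centrally located object" of one cluster. It assumes the usual reading of that phrase, namely the record whose total Euclidean distance to the other records in the cluster is smallest; the function name medoid and the sample cluster are hypothetical.

```python
import math

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def medoid(cluster):
    """Return the record with the smallest total distance to the cluster's records."""
    return min(
        cluster,
        key=lambda candidate: sum(euclidean_distance(candidate, other) for other in cluster),
    )

# Hypothetical cluster with one outlying record.
cluster = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0), (5.0, 1.0), (10.0, 1.0)]

centre = medoid(cluster)
mean = tuple(sum(col) / len(cluster) for col in zip(*cluster))
print(centre)  # an actual record of the cluster
print(mean)    # the mean need not coincide with any record
```

Unlike the mean, the medoid is always one of the observed records, so a single outlying record pulls it less far from the bulk of the cluster.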

Exercise: Perform exercises 1 to 3 using the K Medoid clustering method.
