
DATA MINING

PRESENTED BY:
Cristian Alfonso Celis González – 47161659
Fabián Alejandro León Alméciga – 47151158

WORKSHOP REPORT "CHAPTER 4 - CLASSIFICATION"


Objective: Classify a set of data according to the assigned techniques.
Methodology: This workshop is to be done in pairs. Each pair must solve exercises 2, 3, and 5 of Chapter
4 of the textbook (page 198) and submit a report by March 14 on the Moodle platform.
Exercise 2:
Consider the training examples shown in Table 4.7 for a binary classification problem.

Table N° 1: Table of data necessary to develop exercise 2 (Source: Introduction to Data Mining).

(a) Compute the Gini index for the overall collection of training examples.
Ans// The equation of Gini Index is:

$GINI(t) = 1 - \sum_{j} [p(j \mid t)]^2$

The class distribution of the overall collection of training examples is:



Class   Count
C0      10
C1      10

$Gini = 1 - \left(\frac{10}{20}\right)^2 - \left(\frac{10}{20}\right)^2 = \frac{1}{2}$

The Gini index for the overall collection is 0.5.
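As a quick numerical check, this value can be reproduced with a short Python sketch (the gini helper below is our own illustration, not code from the textbook):

```python
def gini(counts):
    """Gini index of a node, given its class counts, e.g. [10, 10]."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([10, 10]))  # 0.5 for the overall collection (10 C0, 10 C1)
```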
(b) Compute the Gini index for the Customer ID attribute.
Ans// The equation of GINI split is:
$GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\, GINI(i)$

Then:

$Gini_{split} = \frac{1}{20}\left(1 - \left(\frac{1}{1}\right)^2\right) + \frac{1}{20}\left(1 - \left(\frac{1}{1}\right)^2\right) + \cdots + \frac{1}{20}\left(1 - \left(\frac{1}{1}\right)^2\right) = 0$
1 2 20

The Gini index for the Customer ID attribute is zero (0) because each Customer ID value belongs to a single record, so every partition contains exactly one record and is therefore pure.
(c) Compute the Gini index for the Gender attribute.
Ans// The contingency table after splitting on the Gender attribute is:

Attribute = Gender
           M    F
Class  C0  6    4
       C1  4    6
Then:

$Gini = \frac{10}{20}\left(1 - \left(\frac{6}{10}\right)^2 - \left(\frac{4}{10}\right)^2\right) + \frac{10}{20}\left(1 - \left(\frac{4}{10}\right)^2 - \left(\frac{6}{10}\right)^2\right) = 0.48$

The Gini index for the Gender attribute is 0.48.


(d) Compute the Gini index for the Car Type attribute using multiway split.
Ans// The contingency table after splitting on the Car Type attribute is:

Attribute = Car Type
           Family   Sports   Luxury
Class  C0  1        8        1
       C1  3        0        7

Then:

$Gini = \frac{4}{20}\left(1 - \left(\frac{1}{4}\right)^2 - \left(\frac{3}{4}\right)^2\right) + \frac{8}{20}\left(1 - \left(\frac{8}{8}\right)^2 - \left(\frac{0}{8}\right)^2\right) + \frac{8}{20}\left(1 - \left(\frac{1}{8}\right)^2 - \left(\frac{7}{8}\right)^2\right) = 0.1625$

The Gini index for the Car Type attribute is 0.1625.


(e) Compute the Gini index for the Shirt Size attribute using multiway split.
Ans// The contingency table after splitting on the Shirt Size attribute is:

Attribute = Shirt Size
           Small   Medium   Large   Extra Large
Class  C0  3       3        2       2
       C1  2       4        2       2
Then:

$Gini = \frac{5}{20}\left(1 - \left(\frac{3}{5}\right)^2 - \left(\frac{2}{5}\right)^2\right) + \frac{7}{20}\left(1 - \left(\frac{3}{7}\right)^2 - \left(\frac{4}{7}\right)^2\right) + \frac{4}{20}\left(1 - \left(\frac{2}{4}\right)^2 - \left(\frac{2}{4}\right)^2\right) + \frac{4}{20}\left(1 - \left(\frac{2}{4}\right)^2 - \left(\frac{2}{4}\right)^2\right) = 0.4914$

The Gini index for the Shirt Size attribute is 0.4914.


(f) Which attribute is better, Gender, Car Type, or Shirt Size?
Ans//

Gender          Car Type          Shirt Size
Gini = 0.48     Gini = 0.1625     Gini = 0.4914

The Car Type attribute is the best, since it has the lowest Gini index.
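The comparison can be reproduced with a small Python sketch that weights each partition's Gini index by its relative size (the counts come from the contingency tables above; the helper names are our own):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini of a split; partitions is a list of [C0, C1] counts."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

splits = {
    "Gender":     [[6, 4], [4, 6]],
    "Car Type":   [[1, 3], [8, 0], [1, 7]],
    "Shirt Size": [[3, 2], [3, 4], [2, 2], [2, 2]],
}
for name, parts in splits.items():
    print(name, round(gini_split(parts), 4))
# Gender 0.48, Car Type 0.1625, Shirt Size 0.4914 -> Car Type gives the purest split
```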
(g) Explain why Customer ID should not be used as the attribute test condition even
though it has the lowest Gini.
Ans// Because Customer ID is unique to each record, splitting on it produces pure partitions by construction, but the attribute carries no predictive information: any new record will have a Customer ID that was never seen during training, so the test condition cannot generalize and would only overfit the training data.

Exercise 3:
Consider the training examples shown in Table 4.8 for a binary classification problem.

Table N ° 2: Table of data necessary to develop exercise 3 (Source: Introduction to Data Mining).

(a) What is the entropy of this collection of training examples with respect to the positive
class?
Ans// The equation of entropy is:

$E = -\sum_{j} p(j \mid t)\, \log_2 p(j \mid t)$

The class distribution of the collection of training examples is:

Target Class   Count
+              4
-              5
Then:
$Entropy = -\frac{4}{9}\log_2\left(\frac{4}{9}\right) - \frac{5}{9}\log_2\left(\frac{5}{9}\right) = 0.991$

The entropy of this collection with respect to the positive class is 0.991.
(b) What are the information gains of a1 and a2 relative to these training examples?
Ans//

Attribute = a1
                 T    F
Target Class  +  3    1
              -  1    4

$Entropy_T = -\frac{3}{4}\log_2\left(\frac{3}{4}\right) - \frac{1}{4}\log_2\left(\frac{1}{4}\right) = 0.8113$

$Entropy_F = -\frac{1}{5}\log_2\left(\frac{1}{5}\right) - \frac{4}{5}\log_2\left(\frac{4}{5}\right) = 0.7219$

Attribute = a2
                 T    F
Target Class  +  2    2
              -  3    2

$Entropy_T = -\frac{2}{5}\log_2\left(\frac{2}{5}\right) - \frac{3}{5}\log_2\left(\frac{3}{5}\right) = 0.9709$

$Entropy_F = -\frac{2}{4}\log_2\left(\frac{2}{4}\right) - \frac{2}{4}\log_2\left(\frac{2}{4}\right) = 1$

$\Delta_{a1} = 0.991 - \frac{4}{9}(0.8113) - \frac{5}{9}(0.7219) = 0.22936$

$\Delta_{a2} = 0.991 - \frac{5}{9}(0.9709) - \frac{4}{9}(1) = 0.00716$

The information gains of a1 and a2 are 0.22936 and 0.00716, respectively.
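These gains can be verified with the following Python sketch (the entropy/info_gain helpers are our own; the class counts come from the contingency tables above):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def info_gain(parent, partitions):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    children = sum(sum(p) / n * entropy(p) for p in partitions)
    return entropy(parent) - children

parent = [4, 5]                              # 4 positive, 5 negative examples
print(info_gain(parent, [[3, 1], [1, 4]]))   # a1: about 0.229
print(info_gain(parent, [[2, 3], [2, 2]]))   # a2: about 0.007
```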

(c) For a3, which is a continuous attribute, compute the information gain for every
possible split.
Ans//

Sorted a3 values: 1, 3, 4, 5, 6, 7, 8

Split position    + (<=)   - (<=)   + (>)   - (>)
0.5               0        0        4       5
2.0               1        0        3       5
3.5               1        1        3       4
4.5               2        1        2       4
5.5               2        3        2       2
6.5               3        3        1       2
7.5               4        4        0       1
8.5               4        5        0       0

$Entropy_1 = 0 + \frac{9}{9}\left[-\frac{4}{9}\log_2\left(\frac{4}{9}\right) - \frac{5}{9}\log_2\left(\frac{5}{9}\right)\right] = 0.991$

$\Delta Entropy_1 = 0.991 - 0.991 = 0$

$Entropy_2 = \frac{1}{9}\left[-\frac{1}{1}\log_2\left(\frac{1}{1}\right) - \frac{0}{1}\log_2\left(\frac{0}{1}\right)\right] + \frac{8}{9}\left[-\frac{3}{8}\log_2\left(\frac{3}{8}\right) - \frac{5}{8}\log_2\left(\frac{5}{8}\right)\right] = 0.84839$

$\Delta Entropy_2 = 0.991 - 0.84839 = 0.1426$

$Entropy_3 = \frac{2}{9}\left[-\frac{1}{2}\log_2\left(\frac{1}{2}\right) - \frac{1}{2}\log_2\left(\frac{1}{2}\right)\right] + \frac{7}{9}\left[-\frac{3}{7}\log_2\left(\frac{3}{7}\right) - \frac{4}{7}\log_2\left(\frac{4}{7}\right)\right] = 0.98851$

$\Delta Entropy_3 = 0.991 - 0.98851 = 0.00249$

$Entropy_4 = \frac{3}{9}\left[-\frac{2}{3}\log_2\left(\frac{2}{3}\right) - \frac{1}{3}\log_2\left(\frac{1}{3}\right)\right] + \frac{6}{9}\left[-\frac{2}{6}\log_2\left(\frac{2}{6}\right) - \frac{4}{6}\log_2\left(\frac{4}{6}\right)\right] = 0.9183$

$\Delta Entropy_4 = 0.991 - 0.9183 = 0.0727$

$Entropy_5 = \frac{5}{9}\left[-\frac{2}{5}\log_2\left(\frac{2}{5}\right) - \frac{3}{5}\log_2\left(\frac{3}{5}\right)\right] + \frac{4}{9}\left[-\frac{2}{4}\log_2\left(\frac{2}{4}\right) - \frac{2}{4}\log_2\left(\frac{2}{4}\right)\right] = 0.98386$

$\Delta Entropy_5 = 0.991 - 0.98386 = 0.00714$

$Entropy_6 = \frac{6}{9}\left[-\frac{3}{6}\log_2\left(\frac{3}{6}\right) - \frac{3}{6}\log_2\left(\frac{3}{6}\right)\right] + \frac{3}{9}\left[-\frac{1}{3}\log_2\left(\frac{1}{3}\right) - \frac{2}{3}\log_2\left(\frac{2}{3}\right)\right] = 0.97276$

$\Delta Entropy_6 = 0.991 - 0.97276 = 0.0182$

$Entropy_7 = \frac{8}{9}\left[-\frac{4}{8}\log_2\left(\frac{4}{8}\right) - \frac{4}{8}\log_2\left(\frac{4}{8}\right)\right] + \frac{1}{9}\left[-\frac{0}{1}\log_2\left(\frac{0}{1}\right) - \frac{1}{1}\log_2\left(\frac{1}{1}\right)\right] = 0.8889$

$\Delta Entropy_7 = 0.991 - 0.8889 = 0.1021$

$Entropy_8 = \frac{9}{9}\left[-\frac{4}{9}\log_2\left(\frac{4}{9}\right) - \frac{5}{9}\log_2\left(\frac{5}{9}\right)\right] + 0 = 0.991$

$\Delta Entropy_8 = 0.991 - 0.991 = 0$

The resulting information gains for each candidate split position are:

Split position   0.5   2        3.5       4.5      5.5       6.5      7.5      8.5
Gain             0     0.1426   0.00249   0.0727   0.00714   0.0182   0.1021   0
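The whole table can also be generated programmatically. The sketch below (our own illustration) recovers the (a3, class) pairs implied by the split counts above and evaluates the information gain at each candidate split position:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

# (a3 value, class) pairs recovered from the split-count table above
data = [(1, '+'), (3, '-'), (4, '+'), (5, '-'), (5, '-'),
        (6, '+'), (7, '-'), (7, '+'), (8, '-')]
parent = [4, 5]
n = len(data)

for split in (0.5, 2, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5):
    left = [sum(1 for v, c in data if v <= split and c == s) for s in '+-']
    right = [sum(1 for v, c in data if v > split and c == s) for s in '+-']
    children = sum(sum(part) / n * entropy(part)
                   for part in (left, right) if sum(part) > 0)
    print(split, round(entropy(parent) - children, 4))
# Best a3 split: 2.0, with a gain of about 0.143
```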

(d) What is the best split (among a1, a2 and a3) according to the information gain?
Ans//

a1          a2          a3
0.22936     0.00716     0.1426

Since a1 provides the highest information gain, it is the best split.

(e) What is the best split (between a1 and a2) according to the classification error rate?
Ans//

For a1, the majority class of each partition classifies 3 + 4 = 7 of the 9 records correctly; for a2 it classifies only 3 + 2 = 5 correctly. Then:

$error_{a1} = 1 - \frac{7}{9} = 0.222$

$error_{a2} = 1 - \frac{5}{9} = 0.444$

According to the classification error rate, a1 gives the best split.
(f) What is the best split (between a1 and a2) according to the Gini index?
Ans//
$Gini_{a1} = \frac{4}{9}\left(1 - \left(\frac{3}{4}\right)^2 - \left(\frac{1}{4}\right)^2\right) + \frac{5}{9}\left(1 - \left(\frac{1}{5}\right)^2 - \left(\frac{4}{5}\right)^2\right) = 0.3444$

$Gini_{a2} = \frac{5}{9}\left(1 - \left(\frac{2}{5}\right)^2 - \left(\frac{3}{5}\right)^2\right) + \frac{4}{9}\left(1 - \left(\frac{2}{4}\right)^2 - \left(\frac{2}{4}\right)^2\right) = 0.4889$

According to the Gini index, a1 gives the best split.
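A short sketch with our own helper functions confirms both comparisons from parts (e) and (f):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def error(counts):
    """Classification error of a node: records not in the majority class."""
    return 1.0 - max(counts) / sum(counts)

def weighted(metric, partitions):
    """Size-weighted impurity of a split for any node-level metric."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * metric(p) for p in partitions)

a1 = [[3, 1], [1, 4]]   # a1 = T, a1 = F
a2 = [[2, 3], [2, 2]]   # a2 = T, a2 = F
for name, parts in (("a1", a1), ("a2", a2)):
    print(name, "error:", round(weighted(error, parts), 3),
          "gini:", round(weighted(gini, parts), 4))
# a1 error: 0.222, gini: 0.3444 ; a2 error: 0.444, gini: 0.4889 -> a1 wins on both
```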



Exercise 5:
Consider the following data set for a binary class problem.

Table N ° 3: Table of data necessary to develop exercise 5 (Source: Introduction to Data Mining).

(a) Calculate the information gain when splitting on A and B. Which attribute would
the decision tree induction algorithm choose?

Ans// The equation of entropy is:

$E = -\sum_{j} p(j \mid t)\, \log_2 p(j \mid t)$

The equation of “Information Gain” is:

$GAIN_{split} = E(p) - \sum_{i=1}^{k} \frac{n_i}{n}\, E(i)$

The initial entropy before any splitting is:


$E_{initial} = -\frac{4}{10}\log_2\left(\frac{4}{10}\right) - \frac{6}{10}\log_2\left(\frac{6}{10}\right) = 0.97095$
After splitting on attributes A and B, the contingency tables are:

Attribute = A
          A = T    A = F
Class  +  4        0
       -  3        3

Attribute = B
          B = T    B = F
Class  +  3        1
       -  1        5

The information gain after splitting on A is:


$E_{A=T} = -\frac{4}{7}\log_2\left(\frac{4}{7}\right) - \frac{3}{7}\log_2\left(\frac{3}{7}\right) = 0.9852$

$E_{A=F} = -\frac{0}{3}\log_2\left(\frac{0}{3}\right) - \frac{3}{3}\log_2\left(\frac{3}{3}\right) = 0$

$GAIN_{split\,A} = E_{initial} - \frac{7}{10}\, E_{A=T} - \frac{3}{10}\, E_{A=F} = 0.2813$
The information gain after splitting on B is:
$E_{B=T} = -\frac{3}{4}\log_2\left(\frac{3}{4}\right) - \frac{1}{4}\log_2\left(\frac{1}{4}\right) = 0.8113$

$E_{B=F} = -\frac{1}{6}\log_2\left(\frac{1}{6}\right) - \frac{5}{6}\log_2\left(\frac{5}{6}\right) = 0.6500$

$GAIN_{split\,B} = E_{initial} - \frac{4}{10}\, E_{B=T} - \frac{6}{10}\, E_{B=F} = 0.2564$
Based on the information gain of each attribute, attribute A would be chosen to split the node, since its information gain is higher than that of attribute B.
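A minimal Python check of these two gains (the helpers are our own; the counts come from the contingency tables above):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def info_gain(parent, partitions):
    n = sum(parent)
    return entropy(parent) - sum(sum(p) / n * entropy(p) for p in partitions)

parent = [4, 6]                                        # 4 positive, 6 negative records
print(round(info_gain(parent, [[4, 3], [0, 3]]), 4))   # A: 0.2813
print(round(info_gain(parent, [[3, 1], [1, 5]]), 4))   # B: 0.2564
```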
(b) Calculate the gain in the Gini index when splitting on A and B. Which attribute
would the decision tree induction algorithm choose?

Ans// The equation of Gini Index is:

$GINI(t) = 1 - \sum_{j} [p(j \mid t)]^2$

The gain in the Gini index after a split is computed as:

$GAIN_{split} = GINI(p) - \sum_{i=1}^{k} \frac{n_i}{n}\, GINI(i)$

The initial Gini Index before any splitting is:

$GINI_{initial} = 1 - \left(\frac{4}{10}\right)^2 - \left(\frac{6}{10}\right)^2 = 0.48$
After splitting on attributes A and B, the contingency tables are the same as in the previous part.

The gain in Gini after splitting on A is:

$GINI_{A=T} = 1 - \left(\frac{4}{7}\right)^2 - \left(\frac{3}{7}\right)^2 = 0.4898$

$GINI_{A=F} = 1 - \left(\frac{0}{3}\right)^2 - \left(\frac{3}{3}\right)^2 = 0$

$GAIN_{split\,A} = GINI_{initial} - \frac{7}{10}\, GINI_{A=T} - \frac{3}{10}\, GINI_{A=F} = 0.1371$
The gain in Gini after splitting on B is:

$GINI_{B=T} = 1 - \left(\frac{3}{4}\right)^2 - \left(\frac{1}{4}\right)^2 = 0.375$

$GINI_{B=F} = 1 - \left(\frac{1}{6}\right)^2 - \left(\frac{5}{6}\right)^2 = 0.2778$

$GAIN_{split\,B} = GINI_{initial} - \frac{4}{10}\, GINI_{B=T} - \frac{6}{10}\, GINI_{B=F} = 0.1633$
Based on the gain in the Gini index, attribute B would be chosen to split the node, since it yields a larger gain than attribute A.
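Running the same counts through a Gini-based gain (again a sketch with our own helper names) confirms the result and, together with the entropy-based check above, shows the two criteria picking different attributes:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_gain(parent, partitions):
    """Parent Gini minus the size-weighted Gini of the child nodes."""
    n = sum(parent)
    return gini(parent) - sum(sum(p) / n * gini(p) for p in partitions)

parent = [4, 6]
splits = {"A": [[4, 3], [0, 3]], "B": [[3, 1], [1, 5]]}
for name, parts in splits.items():
    print(name, round(gini_gain(parent, parts), 4))
# A: 0.1371, B: 0.1633 -> the Gini criterion prefers B, while information gain prefers A
```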

(c) Figure 4.13 shows that entropy and the Gini index are both monotonously increasing
on the range [0, 0.5] and they are both monotonously decreasing on the range [0.5, 1].
Is it possible that information gain and the gain in the Gini index favor different
attributes? Explain.

Figure N° 1: Comparison of the impurity measures for binary classification problems (Source: Introduction to Data Mining).

Ans// The figure shows that entropy and the Gini index are both monotonically increasing on [0, 0.5] and monotonically decreasing on [0.5, 1]. However, the two measures are on different scales and do not change at the same rate, so the weighted impurity of the child nodes, and therefore the gain, can rank candidate splits differently. It is thus possible for information gain and the gain in the Gini index to favor different attributes, as exercises 5(a) and 5(b) show: information gain selects attribute A, while the gain in the Gini index selects attribute B.
