
Business Analytics

Cluster Analysis

Copyright © Ivy Professional School - All Rights Reserved
Euclidean Distance

dist((x1, y1), (x2, y2)) = √((x1 − x2)² + (y1 − y2)²)
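
As a small illustration, the distance is the square root of the sum of squared coordinate differences. The sketch below uses a hypothetical helper name, `euclidean`, and works for points of any dimension.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points given as equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d = euclidean((1, 2), (4, 6))  # coordinate differences are 3 and 4
print(d)  # 5.0
```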

K-Means clustering – How it works

Fix the number of clusters (e.g. 2) and assign an initial centroid to each.
Measure the distance from every data point to each centroid and assign the point to the nearest one.

Take the average of each group as its new centroid

Reassign the data points to the nearest new centroid

Keep repeating the process till the
centroids don’t change anymore.
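
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the `max_iter` cap, the choice to keep an empty cluster's old centroid, and the toy data are all assumptions made for the example.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Basic k-means: assign, average, repeat until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the average of its group.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # assumption: keep an empty cluster's centroid
        if new_centroids == centroids:  # centroids no longer change
            break
        centroids = new_centroids
    return labels, centroids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(points, centroids=[(1, 1), (8, 8)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The result depends on the initial centroids, so in practice the procedure is usually run from several starting points.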

Check whether the data can be clustered

Hopkins Test

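If the Hopkins statistic H is less than 0.5 (remember the threshold is 0.5, not 0.05: H is a statistic, not a p-value) then it means the data can be clustered. This assumes the convention in which values near 0 indicate clustered data; some implementations report the complement, where values near 1 indicate clustered data.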
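
One way to compute such a statistic is sketched below, under the convention where values below 0.5 suggest clusterable data. It compares nearest-neighbor distances of sampled real points (w) against those of uniformly random points drawn from the data's bounding box (u). The function name `hopkins`, the sample size `m`, the use of plain (unpowered) distances, and the synthetic data are assumptions for illustration; library implementations may differ in these details.

```python
import math
import random

def hopkins(points, m, seed=0):
    """Hopkins-style statistic: sum(w) / (sum(u) + sum(w)); small values suggest clusters."""
    rng = random.Random(seed)
    dims = list(zip(*points))
    lo = [min(d) for d in dims]
    hi = [max(d) for d in dims]
    sample_idx = rng.sample(range(len(points)), m)
    # w: distance from each sampled real point to its nearest other real point.
    w = [
        min(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
        for i in sample_idx
    ]
    # u: distance from each uniform random point to its nearest real point.
    u = []
    for _ in range(m):
        r = [rng.uniform(a, b) for a, b in zip(lo, hi)]
        u.append(min(math.dist(r, q) for q in points))
    return sum(w) / (sum(u) + sum(w))

# Synthetic data: two tight, well-separated blobs.
rng = random.Random(42)
blob_a = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(50)]
blob_b = [(rng.gauss(10, 0.1), rng.gauss(10, 0.1)) for _ in range(50)]
h = hopkins(blob_a + blob_b, m=10)
print(h < 0.5)  # True for this strongly clustered data
```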

How many clusters?

To identify the optimal number of clusters we may use different methods like “silhouette”, “scree plot” etc.
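
A scree (elbow) plot can be approximated by computing the total within-cluster sum of squares for several values of k and looking for the point where the drop flattens. The sketch below is a rough illustration: the helper `kmeans_wss`, the farthest-point seeding (chosen only to keep the demo deterministic), and the toy data are assumptions.

```python
import math

def kmeans_wss(points, k, max_iter=100):
    """Run a basic k-means and return the total within-cluster sum of squares."""
    # Assumption: farthest-point seeding, used here only for a deterministic demo.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        new = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new.append(centroids[j])
        if new == centroids:
            break
        centroids = new
    # Each point contributes its squared distance to the nearest final centroid.
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)

# Toy data with three visible groups.
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (1, 8), (1, 9), (2, 8), (2, 9)]
wss = {k: kmeans_wss(points, k) for k in range(1, 5)}
for k, value in wss.items():
    print(k, round(value, 2))  # the drop flattens after k = 3
```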

Goodness of Fit of the clusters

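Ratio of between-cluster sum of squares to total sum of squares (between_SS / total_SS, the figure shown alongside “Within cluster sum of squares by cluster”, for example in R's kmeans output): the closer to 1, the better the clusters. It ranges from 0 to 1.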
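
The ratio of between-cluster to total sum of squares can be computed directly from a cluster assignment, using the identity total = within + between. The sketch below uses a hypothetical helper, `ss_ratio`, and toy data.

```python
def ss_ratio(points, labels):
    """Between-cluster sum of squares divided by total sum of squares."""
    def mean(pts):
        return [sum(c) / len(pts) for c in zip(*pts)]

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    grand = mean(points)
    total_ss = sum(sq_dist(p, grand) for p in points)
    within_ss = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        centre = mean(members)
        within_ss += sum(sq_dist(p, centre) for p in members)
    return (total_ss - within_ss) / total_ss

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
good = ss_ratio(points, [0, 0, 0, 1, 1, 1])  # groups match the visible structure
poor = ss_ratio(points, [0, 1, 0, 1, 0, 1])  # groups cut across the structure
print(good > poor)  # True
```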

Internal Validation

1. Connectivity - the extent to which items are placed in the same cluster as their nearest
neighbors in the data space. It has a value between 0 and infinity and should
be minimized.

2. Average Silhouette width - It lies between -1 (poorly clustered observations) and 1 (well clustered observations). It should be maximized.

3. Dunn index - It is the ratio between the smallest distance between observations
not in the same cluster to the largest intra-cluster distance. It has a value
between 0 and infinity and should be maximized.
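
Two of these measures are simple enough to sketch directly. The code below is an illustration under assumptions: the function names are hypothetical, singleton clusters are given a silhouette width of 0 by convention, and at least two clusters with distinct points are expected.

```python
import math

def avg_silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items() if key != lab
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

def dunn_index(points, labels):
    """Smallest between-cluster distance divided by largest within-cluster distance."""
    tagged = list(zip(points, labels))
    between, within = [], []
    for i, (p, lp) in enumerate(tagged):
        for q, lq in tagged[i + 1:]:
            (within if lp == lq else between).append(math.dist(p, q))
    return min(between) / max(within)

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = [0, 0, 0, 1, 1, 1]
sil = avg_silhouette(points, labels)
dunn = dunn_index(points, labels)
print(round(sil, 3), round(dunn, 3))
```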

Stability Validation

1. The average proportion of non-overlap (APN)

2. The average distance (AD)

3. The average distance between means (ADM)

4. The figure of merit (FOM)

All values should be minimized

External Validation

1. Corrected Rand Index provides a measure for assessing the similarity between two partitions, adjusted for chance. It equals 1 for perfect agreement, is close to 0 when agreement is no better than chance, and can be negative when agreement is worse than chance. It should be maximized.

2. Variation of Information (VI) is a measure of the distance between two clusterings (partitions of elements). It is closely related to mutual information. It should be minimized.

This validation can only be used when pre-defined (reference) clusters are available.
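
As an illustration of the first measure, the chance-corrected Rand index can be computed from pair counts in the contingency table of the two partitions. The function name `adjusted_rand_index` and the toy labels are assumptions; the sketch does not handle the degenerate case where both partitions put everything in one cluster.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same items."""
    n = len(labels_a)
    # Pairs of items that fall together in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs that fall together within each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_cells - expected) / (max_index - expected)

reference = [0, 0, 1, 1]
same_up_to_renaming = adjusted_rand_index(reference, [1, 1, 0, 0])
crossed = adjusted_rand_index(reference, [0, 1, 0, 1])
print(same_up_to_renaming)  # 1.0
print(crossed)              # -0.5
```

Cluster names do not matter, only which items are grouped together, which is why relabelled but identical partitions score 1.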

