
CLUSTER ANALYSIS

Submitted on: XX August 2019


Contents
1. Introduction
2. Methods
3. Examples of clustering applications
4. Major clustering approaches
5. Quality of clusters
6. Cluster analysis in machine learning
7. K-Means Clustering
8. Hierarchical Clustering
9. Brief Overview Of The Code In R
10. Hierarchical Cluster
10.1 Agglomerative
10.2 Divisive
11. Non-Hierarchical
11.1 K-Means (Relocation Method)
11.2 Leader Method (Single Pass Method)
11.3 Jarvis-Patrick Method (Nearest Neighbour)
1. Introduction
Cluster analysis is a class of techniques used to classify objects or cases
into relatively homogeneous groups called clusters. Cluster analysis is also called
classification analysis or numerical taxonomy. In cluster analysis, there is no prior
information about group or cluster membership for any of the objects.
Cluster analysis has been used in marketing for various purposes. For example,
consumers can be segmented on the basis of the benefits they seek from the
purchase of a product, which makes it possible to identify homogeneous groups of
buyers.

Cluster analysis involves formulating the problem, selecting a distance measure,
selecting a clustering procedure, deciding on the number of clusters, interpreting
the cluster profiles and, finally, assessing the validity of the clustering.

These techniques can be condensed into the two main types of problems that unsupervised learning
tries to solve. These problems are:

• Clustering

• Dimensionality Reduction

2. Methods
1. Hierarchical: A hierarchical procedure in cluster analysis is characterized
by the development of a tree-like structure. A hierarchical procedure can
be agglomerative or divisive. Agglomerative methods in cluster analysis
consist of linkage methods, variance methods, and centroid methods.
Linkage methods comprise single linkage, complete linkage, and average linkage.

2. Non-hierarchical: The non-hierarchical methods in cluster analysis are
frequently referred to as K-means clustering. A two-step procedure can
automatically determine the optimal number of clusters by comparing the
values of model-choice criteria across different clustering solutions. The
choice of clustering procedure and the choice of distance measure are
interrelated, the relative sizes of the clusters should be meaningful, and the
clusters should be interpreted in terms of their centroids.
Conducting a cluster analysis means first calculating proximity measures with paired
comparisons to determine how far apart two objects (in this case, customers) are.
There are two types of proximity measures: similarity measures and distance measures.
Similarity measures describe the similarity between two objects by comparing their content,
while distance measures describe it by measuring the geometric distance between them.
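As a small illustration with made-up numbers (not the report's data), the Euclidean distance between two objects described by numeric attributes can be computed directly or with R's dist() function:

x <- c(2, 4, 3)   # hypothetical object 1
y <- c(5, 1, 3)   # hypothetical object 2
sqrt(sum((x - y)^2))                       # Euclidean distance "by hand"
dist(rbind(x, y), method = "euclidean")    # the same value via dist()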

3. Examples of clustering applications

• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

• Land use: identification of areas of similar land use in an earth observation database

• Insurance: identifying groups of motor insurance policy holders with a high average claim cost

• City planning: identifying groups of houses according to their house type, value, and geographical location

• Earthquake studies: observed earthquake epicentres should be clustered along continental faults

4. Major clustering approaches

Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
Density-based: based on connectivity and density functions.
Grid-based: based on a multiple-level granularity structure.
Model-based: a model is hypothesized for each of the clusters and the idea is to find the best fit of the data to those models.

5. Quality of clusters
A good clustering method will produce high-quality clusters with high intra-class similarity
and low inter-class similarity. The quality of a clustering result depends on both the similarity
measure used by the method and its implementation, and is also measured by the method's ability
to discover some or all of the hidden patterns. Quality is assessed with a dissimilarity/similarity
metric, and similarity is usually expressed in terms of a distance function.
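One common way to put a number on "high intra-class, low inter-class similarity" is the average silhouette width from the cluster package. The snippet below is an illustrative sketch on R's built-in iris data, not part of the original analysis:

library(cluster)
# Cluster the (scaled) iris measurements and compute silhouette widths
km  <- kmeans(scale(iris[, 1:4]), centers = 3)
sil <- silhouette(km$cluster, dist(scale(iris[, 1:4])))
summary(sil)$avg.width   # values closer to 1 indicate better-separated clusters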

6. Cluster analysis in machine learning

Clustering in machine learning allows you to break a population into smaller groups in which each
observation is more similar to the other members of its group than to observations in other groups.
The idea is to group similar observations together and thus break a large, heterogeneous population
down into smaller, homogeneous groups.

7. K-Means Clustering
K-means is an iterative clustering algorithm; each run converges to a local optimum of the
within-cluster sum of squares. The algorithm works in the following steps (a minimal R sketch follows the list):

1. Specify the desired number of clusters K.

2. Randomly assign each data point to a cluster.

3. Compute the cluster centroids.

4. Re-assign each point to the closest cluster centroid.

5. Re-compute the cluster centroids.

6. Repeat steps 4 and 5 until no improvement is possible or the maximum number of iterations has been reached.
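The sketch below is illustrative only, assuming a numeric data matrix X; it is not the report's code, and in practice R's built-in kmeans() function should be preferred:

simple_kmeans <- function(X, k, max_iter = 100) {
  # Step 2: randomly assign each data point to one of the k clusters
  grp <- sample(seq_len(k), nrow(X), replace = TRUE)
  for (iter in seq_len(max_iter)) {
    # Steps 3 and 5: compute the centroid of each cluster
    # (empty clusters are not handled in this simple sketch)
    centroids <- sapply(seq_len(k), function(c) colMeans(X[grp == c, , drop = FALSE]))
    # Step 4: re-assign each point to its nearest centroid (squared Euclidean distance)
    d <- sapply(seq_len(k), function(c)
      rowSums((X - matrix(centroids[, c], nrow(X), ncol(X), byrow = TRUE))^2))
    new_grp <- apply(d, 1, which.min)
    # Step 6: stop once the assignments no longer change
    if (all(new_grp == grp)) break
    grp <- new_grp
  }
  list(cluster = grp, centers = t(centroids))
}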

8. Hierarchical Clustering
Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters.
The (agglomerative) algorithm starts with every data point assigned to a cluster of its own; the two
nearest clusters are then merged, and the process repeats until only a single cluster is left.
The results of hierarchical clustering can be shown using a dendrogram.
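As a toy illustration on made-up one-dimensional data (not the report's data set), hclust() performs exactly this merging and plot() draws the resulting dendrogram:

toy <- matrix(c(1, 2, 1.5, 8, 8.5, 9), ncol = 1)   # six hypothetical points
plot(hclust(dist(toy)))                             # dendrogram of the merge sequence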

###########################################################################

9. Brief Overview Of The Code In R

The data set used in this scenario is as follows:

Wholesale customers data.xlsx

Here the data is from a store: sales, in quantity, of different varieties of products, broken down by region.

10. Hierarchical Cluster

10.1 Agglomerative
This is a way of segregating the data points based on Ward's minimum variance method, which aims at finding compact,
spherical clusters.
In this method we represent the result in the form of a dendrogram:
# Euclidean distance matrix on the scaled data, followed by Ward (ward.D2) linkage
dist = dist(scale.new.shop, method = "euclidean")
shop_clust = hclust(dist, method = 'ward.D2')
plot(shop_clust)
Here the distance is calculated with the 'euclidean' method.
PFA for the dendrogram:

Rplot.pdf

Reading the clusters directly off the dendrogram may not be useful as there are a lot of data points, so we cut the tree into groups instead:
shop.cut = cutree(shop_clust, 3)
table(shop.cut)
Note: we have taken the number of clusters to be 3 as an arbitrary choice.
Output:
shop.cut
1 2 3
153 281 6

Similarly, the method argument offers the following options (a comparison sketch follows this list):

• Single: single linkage tends to chain observations together, exhausting almost all points into a single
cluster. When the above code was run with method = "single" we got the following cluster table:
shop.cut
1 2 3
438 1 1

single.pdf

• Complete: the complete linkage method finds compact clusters of similar objects, whereas the single linkage
method (which is closely related to the minimal spanning tree) adopts a 'friends of friends' clustering
strategy. Here the result is only a little different from the single method above:
shop.cut
1 2 3
429 10 1

cmplete.pdf

• Average: also known as the unweighted pair group method with arithmetic mean (UPGMA),
a bottom-up hierarchical method:
shop.cut
1 2 3
434 5 1

average.pdf

• Mcquitty: also known as the weighted pair group method with arithmetic mean (WPGMA),
another bottom-up approach, attributed to Sokal and Michener. A rooted tree is
constructed that represents the pairwise distance (similarity) matrix. At each step the two
nearest clusters, say i and j, are combined into a higher-level cluster i∪j; the distance from i∪j
to any other cluster k is then simply the average of the distances between k and the members of i and j:
shop.cut
1 2 3
413 26 1

mcquitty.pdf

• Median: also known as weighted centroid clustering (WPGMC). When two clusters are merged, the centroid
of the new cluster is placed at the midpoint between the two old centroids, irrespective of cluster size:
shop.cut
1 2 3
437 2 1
Median.pdf

• Centroid: unweighted centroid clustering (UPGMC) joins the objects or groups with the
highest similarity (or the smallest distance), replacing all the objects of the resulting group
with its centroid. This centroid is then treated as a single object at the next clustering
step:
shop.cut
1 2 3
438 1 1

Rplot01.pdf
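For convenience, the cluster-size tables above can be reproduced for all linkage options in one loop; this is an illustrative sketch reusing the dist object computed earlier:

# Compare the sizes of 3 clusters across the different linkage methods
methods <- c("ward.D2", "single", "complete", "average", "mcquitty", "median", "centroid")
sapply(methods, function(m) table(cutree(hclust(dist, method = m), 3)))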

10.2 Divisive
Divisive hierarchical clustering, also known as DIANA (DIvisive ANAlysis), is the inverse of
agglomerative clustering (a top-down approach).
It starts by including all objects in a single large cluster. At each step of the iteration, the most
heterogeneous cluster is divided into two. The process is repeated until every object is in its own cluster.
# install.packages(c('cluster', 'factoextra'))
library(cluster)      # provides diana()
library(factoextra)
divisive = diana(dist, stand = TRUE)
plot(divisive)
PFA for the plot:

Divisive.pdf
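Since factoextra is loaded, the divisive tree could also be drawn with its fviz_dend() helper; this is a hypothetical alternative visualisation, not part of the original run:

# Hypothetical: highlight 3 groups on the DIANA dendrogram with factoextra
fviz_dend(divisive, k = 3, rect = TRUE)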

11. Non-Hierarchical
Non-hierarchical cluster analysis aims to find a grouping of objects that maximises or minimises
some evaluation criterion.
A few such algorithms are as follows:

11.1 K-Means (Relocation Method)

K-means clustering aims to assign objects to a user-defined number of clusters (k) in such a way that
the separation between clusters is maximised while the intra-cluster distances to each cluster's mean,
or centroid, are minimised. The algorithm typically defaults to Euclidean distances; however, many
implementations accept alternative distance or dissimilarity measures.
To determine the optimal number of clusters:
library(NbClust)
# "ch" = Calinski-Harabasz index, computed for k-means solutions with 2 to 15 clusters
nshp = NbClust(scale.new.shop, diss = dist, distance = NULL,
method = "kmeans", min.nc = 2, max.nc = 15, index = "ch")
nshp
$All.index
       2        3        4        5        6        7        8        9       10       11       12       13       14       15
153.7581 132.7107 137.4171 155.7384 144.7235 151.5944 143.5295 147.0911 162.8123 155.5195 155.7078 153.4131 148.9810 142.1594

$Best.nc
Number_clusters     Value_Index
        10.0000        162.8123
The optimal number of clusters is therefore 10:
kmean_shopclust = kmeans(new.shop, 10)
kmean_shopclust
Output:
K-means clustering with 10 clusters of sizes 2, 30, 69, 96, 42, 5, 18, 93, 84, 1

Cluster means:
Channel Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 1.000000 34782.000 30367.000 16898.000 48701.500 755.5000 26776.0000
2 2.000000 6683.067 17468.033 26658.933 1986.300 11872.9000 2531.2000
3 1.188406 17642.768 2829.594 3949.116 3052.652 936.1594 1231.4638
4 1.041667 2684.062 2979.323 3160.844 1943.073 860.5521 880.9271
5 1.214286 27460.929 5619.833 7137.405 4712.786 1398.0238 2332.6190
6 2.000000 25603.000 43460.600 61472.200 2636.000 29974.2000 2708.8000
7 1.111111 48835.500 3132.056 4672.000 5576.389 843.4444 2272.1111
8 1.118280 9478.882 2368.871 3199.376 3766.194 776.2258 923.9032
9 1.809524 4535.167 8627.143 12882.738 1416.631 5500.5238 1488.4524
10 1.000000 112151.000 29627.000 18148.000 16745.000 4948.0000 8550.0000

Within cluster sum of squares by cluster:
 [1] 1591649631 5004238144 1986214409 2245513564 3622051644 5682449098 2778475310 3530326333 4665510438
[10] 0
(between_SS / total_SS = 80.3 %)

The between-cluster sum of squares accounts for about 80.3% of the total, which is decent.
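That figure is the ratio reported by kmeans itself and can be recomputed from the fitted object:

# Recompute between_SS / total_SS from the fitted kmeans object
kmean_shopclust$betweenss / kmean_shopclust$totss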


K-Means plot is as follows:

kmeans.pdf

11.2 Leader Method (Single Pass Method)

The leader method produces clusters that depend on the order in which the objects are processed.
The leader algorithm is an incremental clustering algorithm, generally used to cluster large data sets.
Because it is order dependent, it may form different clusters depending on the order in which the data
set is supplied to the algorithm.
Since the most appropriate parameter values are not obvious, we have chosen them hypothetically:
library(leaderCluster)
leaderCluster(scale.new.shop, radius = 5, max_iter = 10L, p = 2)
$cluster_id
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
[34] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 3
[67] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 6 2 1 1 1 1 3 7 1 1 1 1 1
[100] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 8 1 1 1 1 1 1
[133] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[166] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 9 1 10 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[199] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[232] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1
[265] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[298] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11 1 1 1 1
[331] 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[364] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[397] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[430] 1 1 1 1 1 1 1 1 1 1 1

$cluster_centroids
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.05640132 -0.1159451 -0.10486245 -0.0757586 -0.09800901 -0.08386718
[2,] 1.80423096 2.0226248 0.74887938 0.5481457 0.01604638 4.93319395
[3,] 0.47364697 3.5621577 3.83386984 0.2524855 4.01033079 0.82010969
[4,] -0.27162239 -0.1105995 6.24494432 -0.6057940 7.38707677 -0.10987901
[5,] 0.32549974 5.4740744 8.92636738 -0.4214355 7.95861267 0.50321852
[6,] 0.86379522 9.1732079 2.54259799 -0.4294690 3.60508212 -0.22051315
[7,] -0.05426424 -0.3666840 -0.61971760 6.5786235 -0.58946707 0.41598776
[8,] 5.07907266 -0.3147896 -0.08936785 2.7738361 -0.44118234 -0.21519420
[9,] 7.91872366 3.2289317 1.07298201 2.8164754 0.43342490 2.49108711
[10,] 1.96458102 5.1696185 1.28575327 6.8927538 -0.55423109 16.45971129
[11,] 1.63802986 1.4887768 0.59714043 11.9054495 -0.33757179 1.44821848

$num_clusters
[1] 11

$iter
[1] 3
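The assignments above can be summarised with table(); for example, storing the result of the same call first (a hypothetical follow-up, not in the original run):

# Tabulate how many observations fall into each leader cluster
lc = leaderCluster(scale.new.shop, radius = 5, max_iter = 10L, p = 2)
table(lc$cluster_id)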

11.3 Jarvis-Patrick Method (Nearest Neighbour)

Algorithm:
Step 1: For each point of the data set, list the k nearest neighbours.
Step 2: Set up an integer label table of length n, with each entry initially set to the first entry of the
corresponding neighbourhood row.
Step 3: Test all possible pairs of neighbourhood rows in the following manner: if both zeroth neighbours
(the points being tested) are found in both neighbourhood rows, and at least kt neighbour matches exist
between the two rows (kt is referred to as the similarity threshold), replace both label entries by the
smaller of the two existing entries, and replace all appearances of the higher label (throughout the
entire label table) with the lower label.
Step 4: The clusters under the chosen k and kt are now indicated by identical labelling of the points
belonging to each cluster.
Parameters:
k: the number of nearest neighbours to consider
kt: the number of neighbours that need to match to put two points in the same cluster
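For illustration only, a rough base-R sketch of this labelling procedure might look as follows (simplified so that each point starts with its own label; not run on the report's data):

# Rough illustrative sketch of Jarvis-Patrick clustering (not the report's code)
# X: numeric data matrix, k: neighbours per point, kt: similarity threshold
jarvis_patrick_sketch <- function(X, k = 5, kt = 3) {
  n <- nrow(X)
  d <- as.matrix(dist(X))
  # Step 1: k nearest neighbours of each point (excluding the point itself)
  nn <- t(apply(d, 1, function(row) order(row)[2:(k + 1)]))
  # Step 2 (simplified): every point starts with its own label
  labels <- seq_len(n)
  # Step 3: merge labels when two points list each other as neighbours
  # and share at least kt common neighbours
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      mutual <- (j %in% nn[i, ]) && (i %in% nn[j, ])
      shared <- length(intersect(nn[i, ], nn[j, ]))
      if (mutual && shared >= kt) {
        lo <- min(labels[i], labels[j]); hi <- max(labels[i], labels[j])
        labels[labels == hi] <- lo   # relabel the higher label everywhere
      }
    }
  }
  # Step 4: identical labels indicate cluster membership
  labels
}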
A full R implementation can be found at the link below:
https://michael.hahsler.net/SMU/EMIS8331/material/jpclust.html#implementing_the_jarvis-patrick_algorithm_in_r
That implementation is somewhat involved, so we did not run it on our data for this report.
References:
https://stat.ethz.ch/R-manual/R-patched/library/stats/html/hclust.html
https://www.google.co.in/search?client=opera&q=WPGMA&sourceid=opera&ie=UTF-8&oe=UTF-8
http://biol09.biol.umontreal.ca/ULaval08/Chapitre_3.pdf
https://mb3is.megx.net/gustame/dissimilarity-based-methods/cluster-analysis/non-hierarchical-cluster-analysis
https://en.wikipedia.org/wiki/Cluster_analysis
https://www.daylight.com/meetings/mug96/barnard/E-MUG95.html
https://rdrr.io/cran/leaderCluster/man/leaderCluster.html
https://stackoverflow.com/questions/36928654/leader-clustering-algorithm-explanation
https://michael.hahsler.net/SMU/EMIS8331/material/jpclust.html#implementing_the_jarvis-patrick_algorithm_in_r
