These techniques can be condensed into two main types of problems that unsupervised learning
tries to solve. These problems are:
Clustering
Dimensionality Reduction
2. Methods
1. Hierarchical: A hierarchical procedure in cluster analysis is characterized
by the development of a tree-like structure. A hierarchical procedure can
be agglomerative or divisive. Agglomerative methods in cluster analysis
consist of linkage methods, variance methods, and centroid
methods. Linkage methods in cluster analysis comprise single
linkage, complete linkage, and average linkage.
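The agglomerative linkage variants named above can be compared directly in R's built-in hclust(); a minimal sketch on toy data (the toy matrix and the choice of 3 clusters are illustrative, not from the text):

```r
# Toy data: 10 points in 2 dimensions
set.seed(42)
toy = matrix(rnorm(20), ncol = 2)

# Euclidean distance matrix shared by all linkage methods
d = dist(toy)

# Agglomerative clustering with three linkage methods
h.single   = hclust(d, method = "single")    # nearest-neighbour distance
h.complete = hclust(d, method = "complete")  # furthest-neighbour distance
h.average  = hclust(d, method = "average")   # mean pairwise distance (UPGMA)

# Cut each tree into 3 clusters and cross-tabulate the memberships
table(cutree(h.single, 3), cutree(h.complete, 3))
```

The distance matrix is computed once; only the rule for measuring cluster-to-cluster distance changes between the methods.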
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost
City-planning: Identifying groups of houses according to their house type, value, and
geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
5. Quality of clusters
A good clustering method will produce high-quality clusters with high intra-class similarity
and low inter-class similarity. The quality of a clustering result depends on both the similarity
measure used by the method and its implementation. The quality of a clustering method is
also measured by its ability to discover some or all of the hidden patterns. Quality
is measured by a dissimilarity/similarity metric, where similarity is expressed in terms of a distance
function.
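Since similarity is expressed through a distance function, the choice of metric directly shapes the clusters; a small sketch of two common metrics (the points are illustrative):

```r
x = c(0, 0)
y = c(3, 4)

# Euclidean distance: sqrt(3^2 + 4^2) = 5
sqrt(sum((x - y)^2))

# Manhattan distance: |3| + |4| = 7
sum(abs(x - y))

# dist() computes the same metrics for a whole data matrix,
# which is what the clustering functions below consume
m = rbind(x, y)
dist(m, method = "euclidean")  # 5
dist(m, method = "manhattan")  # 7
```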
7. K Means Clustering
K-means is an iterative clustering algorithm that aims to find a local optimum of the within-cluster
variation in each iteration. This algorithm works in these steps:
1. Specify the desired number of clusters k.
2. Randomly assign each data point to a cluster.
3. Compute the cluster centroids.
4. Re-assign each point to the closest cluster centroid.
5. Re-compute the cluster centroids.
6. Repeat steps 4 and 5 until no improvements are possible or the maximum
number of iterations has been reached.
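The iterative assign-and-update loop can be sketched with R's built-in kmeans() on toy data (the toy blobs and k = 2 are illustrative):

```r
set.seed(1)
# Two well-separated blobs of 2-D points
toy = rbind(matrix(rnorm(40, mean = 0), ncol = 2),
            matrix(rnorm(40, mean = 5), ncol = 2))

# kmeans() picks k initial centroids, then alternates
# re-assignment (step 4) and centroid update (step 5) until convergence
fit = kmeans(toy, centers = 2, iter.max = 100)

fit$cluster  # cluster assignment of each point
fit$centers  # final centroids
fit$iter     # iterations needed before the assignments stopped changing
```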
8. Hierarchical Clustering
Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters.
This algorithm starts with every data point assigned to a cluster of its own. The two nearest
clusters are then merged into the same cluster. In the end, the algorithm terminates when there is
only a single cluster left. The results of hierarchical clustering can be shown using a dendrogram.
Wholesale
customers data.xlsx
Here the data is from a store: the sales, in quantity, of different varieties of products, by
region.
Rplot.pdf
This process of finding the clusters may not be useful, as there are a lot of data points.
shop.cut = cutree(shop_clust,3)
table(shop.cut)
Note: We have taken the number of clusters to be 3 as an arbitrary choice.
Output:
shop.cut
1 2 3
153 281 6
Single: Single linkage merges clusters based on the minimum distance between their members, which
tends to chain points together until one cluster absorbs nearly everything. When the above
code was run with the single method, we got the cluster table:
shop.cut
1 2 3
438 1 1
single.pdf
Complete: The complete linkage method merges clusters based on the maximum distance between their
members and tends to find compact, similar clusters. (The single linkage method, by contrast, is
closely related to the minimal spanning tree and adopts a ‘friends of friends’ clustering
strategy.) The result here is only a little different from the single method above:
shop.cut
1 2 3
429 10 1
cmplete.pdf
Average: also known as the unweighted pair group method with arithmetic mean (UPGMA),
a bottom-up hierarchical method in which the distance between two clusters is the average of
all pairwise distances between their members.
shop.cut
1 2 3
434 5 1
average.pdf
Mcquitty: also known as the weighted pair group method with arithmetic mean (WPGMA),
a bottom-up approach attributed to Sokal and Michener, in which a rooted tree is
constructed that represents the pairwise distance matrix (similarity matrix). At each step the
nearest two clusters, say i and j, are combined to form a higher-level cluster i⋃j;
the distance from i⋃j to another cluster k is then simply the arithmetic mean of the distances from i to k and from j to k.
shop.cut
1 2 3
413 26 1
mcquitty.pdf
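The WPGMA distance update can be checked by hand on three points; a small sketch (the three 1-D points are illustrative):

```r
# Three 1-D points: i = 0, j = 2, k = 10
d_ik = abs(0 - 10)  # 10
d_jk = abs(2 - 10)  # 8

# After merging i and j, WPGMA sets the distance to k to the
# simple arithmetic mean of the two old distances:
d_ij_k = (d_ik + d_jk) / 2  # 9

# hclust(method = "mcquitty") applies exactly this update rule
h = hclust(dist(c(0, 2, 10)), method = "mcquitty")
h$height  # merge heights: 2 (i with j), then 9 (i∪j with k)
```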
Median: Also known as weighted centroid clustering (WPGMC); the objective of this clustering is
to minimize the within-group sum of squares (i.e. the squared error of ANOVA).
shop.cut
1 2 3
437 2 1
Median.pdf
Centroid: Unweighted centroid clustering (UPGMC) joins the objects or groups that have the
highest similarity (or the smallest distance), replacing all the objects of the resulting group
by its centroid. This centroid is treated as a single object at the next clustering
step.
shop.cut
1 2 3
438 1 1
Rplot01.pdf
10.2 Divisive
Divisive hierarchical clustering, also known as DIANA (DIvisive ANAlysis), is the inverse of
agglomerative clustering (a top-down approach).
It starts by including all objects in a single large cluster. At each iteration, the most
heterogeneous cluster is divided in two. The process is repeated until every object is in its own cluster.
# diana() comes from the 'cluster' package
library(cluster)
divisive = diana(dist, stand = TRUE)
#install.packages('factoextra')
library(factoextra)
plot(divisive)
Please find the plot attached:
Divisive.pdf
11. Non-Hierarchical
Non-hierarchical cluster analysis aims to find a grouping of objects which maximises or minimises
some evaluation criterion.
A few of the algorithms are as follows:
$Best.nc
Number_clusters Value_Index
10.0000 162.8123
Optimal number of clusters would be 10
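Output in the $Best.nc form typically comes from the NbClust package; a hedged sketch of how such a result could be obtained (the data object new.shop and the choice of validity index are assumptions, not confirmed by the text):

```r
# install.packages('NbClust')
library(NbClust)

# Assumed: new.shop is the numeric wholesale-customers data frame.
# NbClust evaluates a cluster-validity index over a range of cluster
# counts and reports the best count under $Best.nc.
res = NbClust(new.shop, distance = "euclidean",
              min.nc = 2, max.nc = 10, method = "kmeans",
              index = "ch")
res$Best.nc
```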
kmean_shopclust = kmeans(new.shop,10)
kmean_shopclust
Output:
K-means clustering with 10 clusters of sizes 2, 30, 69, 96, 42, 5, 18, 93, 84, 1
Cluster means:
Channel Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 1.000000 34782.000 30367.000 16898.000 48701.500 755.5000 26776.0000
2 2.000000 6683.067 17468.033 26658.933 1986.300 11872.9000 2531.2000
3 1.188406 17642.768 2829.594 3949.116 3052.652 936.1594 1231.4638
4 1.041667 2684.062 2979.323 3160.844 1943.073 860.5521 880.9271
5 1.214286 27460.929 5619.833 7137.405 4712.786 1398.0238 2332.6190
6 2.000000 25603.000 43460.600 61472.200 2636.000 29974.2000 2708.8000
7 1.111111 48835.500 3132.056 4672.000 5576.389 843.4444 2272.1111
8 1.118280 9478.882 2368.871 3199.376 3766.194 776.2258 923.9032
9 1.809524 4535.167 8627.143 12882.738 1416.631 5500.5238 1488.4524
10 1.000000 112151.000 29627.000 18148.000 16745.000 4948.0000 8550.0000
kmeans.pdf
$cluster_centroids
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.05640132 -0.1159451 -0.10486245 -0.0757586 -0.09800901 -0.08386718
[2,] 1.80423096 2.0226248 0.74887938 0.5481457 0.01604638 4.93319395
[3,] 0.47364697 3.5621577 3.83386984 0.2524855 4.01033079 0.82010969
[4,] -0.27162239 -0.1105995 6.24494432 -0.6057940 7.38707677 -0.10987901
[5,] 0.32549974 5.4740744 8.92636738 -0.4214355 7.95861267 0.50321852
[6,] 0.86379522 9.1732079 2.54259799 -0.4294690 3.60508212 -0.22051315
[7,] -0.05426424 -0.3666840 -0.61971760 6.5786235 -0.58946707 0.41598776
[8,] 5.07907266 -0.3147896 -0.08936785 2.7738361 -0.44118234 -0.21519420
[9,] 7.91872366 3.2289317 1.07298201 2.8164754 0.43342490 2.49108711
[10,] 1.96458102 5.1696185 1.28575327 6.8927538 -0.55423109 16.45971129
[11,] 1.63802986 1.4887768 0.59714043 11.9054495 -0.33757179 1.44821848
$num_clusters
[1] 11
$iter
[1] 3
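The centroid values above are on a standardized scale (roughly centered near 0), which suggests the data were scaled before this clustering run; a sketch of that preprocessing step (the variable names and the choice of 11 centers echo the output above but are assumptions):

```r
# Standardize each column to mean 0, sd 1 before k-means,
# so that variables on large scales do not dominate the distance
scaled.shop = scale(new.shop)

fit = kmeans(scaled.shop, centers = 11)
fit$centers  # centroids expressed in standardized units
fit$iter     # number of iterations until convergence
```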