Lakshmipathi.T H
Cluster analysis identifies and classifies objects, individuals, or variables on the basis of the similarity of the characteristics they possess. It seeks to minimize within-group variance and maximize between-group variance. The result is a number of heterogeneous groups with homogeneous contents: there are substantial differences between the groups, but the individuals within a single group are similar.
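The variance criterion can be made concrete with a small calculation. The following is a minimal Python sketch, not taken from the slides: the function names and the toy numbers are illustrative assumptions. It splits the total sum of squares of one variable into a within-group and a between-group part, so that a good grouping shows a small within-group share.

```python
from statistics import mean


def sum_of_squares(values, center):
    """Sum of squared deviations of values from a given center."""
    return sum((v - center) ** 2 for v in values)


def variance_decomposition(groups):
    """Return (within_ss, between_ss, total_ss) for a list of 1-D groups."""
    all_values = [v for group in groups for v in group]
    grand_mean = mean(all_values)
    within = sum(sum_of_squares(g, mean(g)) for g in groups)
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    total = sum_of_squares(all_values, grand_mean)
    return within, between, total


# Toy data (hypothetical): the same six values grouped well and grouped poorly.
good = variance_decomposition([[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]])
poor = variance_decomposition([[1.0, 12.0, 3.0], [11.0, 2.0, 13.0]])
print("good grouping (within, between, total):", good)
print("poor grouping (within, between, total):", poor)
```

In both groupings the within-group and between-group parts add up to the same total, so minimizing one is equivalent to maximizing the other.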
Clustering procedures
Hierarchical clustering follows one of two approaches. Agglomerative methods start with each observation as its own cluster and, at each step, combine the closest clusters until only one large cluster remains. Divisive methods begin with one large cluster and repeatedly split off the items that are most dissimilar, producing smaller and smaller clusters.
Hierarchical Clustering
This method does not require the number of clusters k as an input, but it needs a termination condition.
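[Figure: agglomerative versus divisive hierarchical clustering of five objects a, b, c, d, e. Read left to right (Step 0 to Step 4), the agglomerative direction merges a and b into ab, d and e into de, c and de into cde, and finally ab and cde into abcde. Read right to left (Step 0 to Step 4), the divisive direction splits abcde back into the single objects.]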
Single linkage (based on the shortest distance between objects)
Complete linkage (based on the largest distance between objects)
Average linkage (based on the average distance between objects)
Ward's method (based on the sum of squares between the two clusters, summed over all variables)
Centroid method (based on the distance between cluster centroids)
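The linkage rules above can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure used for the slides: the function names, the `stop_at` termination condition, and the toy points are assumptions, and only single, complete, and average linkage are covered (Ward's and the centroid method are left out).

```python
from itertools import combinations
from math import dist


def linkage_distance(a, b, method):
    """Distance between clusters a and b (lists of points) under a linkage rule."""
    pairwise = [dist(p, q) for p in a for q in b]
    if method == "single":      # shortest distance between objects
        return min(pairwise)
    if method == "complete":    # largest distance between objects
        return max(pairwise)
    if method == "average":     # average distance between objects
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {method}")


def agglomerate(points, method="single", stop_at=1):
    """Merge the two closest clusters until `stop_at` clusters remain.

    Returns the final clusters and the merge history as
    (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[p] for p in points]  # start: every observation is a cluster
    history = []
    while len(clusters) > stop_at:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        d = linkage_distance(clusters[i], clusters[j], method)
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, history


# Toy points (hypothetical), stopping when two clusters remain.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)]
final_clusters, merges = agglomerate(points, method="single", stop_at=2)
for a, b, d in merges:
    print(f"merged {a} and {b} at distance {d:.2f}")
```

Here `stop_at` plays the role of the termination condition: the merge history is what a dendrogram draws, and cutting it at a chosen number of clusters gives the final grouping.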
K-Means Method
This algorithm assigns each item to the cluster having the nearest centroid (mean). The process is composed of these three steps:
1. Partition the items into K initial clusters.
2. Proceed through the list of items, assigning each item to the cluster whose centroid (mean) is nearest. Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item.
3. Repeat step 2 until no more reassignments take place.
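The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the round-robin initial partition, the rule that a cluster is never emptied, the `max_passes` cap, and the toy points are all choices made for this sketch.

```python
from math import dist


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_passes=100):
    """Sequential k-means; requires len(points) >= k."""
    # Step 1: initial partition (assumption: deal items round-robin into k clusters).
    labels = [i % k for i in range(len(points))]

    def members(c):
        return [p for p, lab in zip(points, labels) if lab == c]

    centroids = [centroid(members(c)) for c in range(k)]

    for _ in range(max_passes):
        moved = False
        # Step 2: assign each item to the cluster with the nearest centroid.
        for i, p in enumerate(points):
            old = labels[i]
            new = min(range(k), key=lambda c: dist(p, centroids[c]))
            # Assumption: never remove the last item of a cluster.
            if new == old or labels.count(old) == 1:
                continue
            labels[i] = new
            # Recalculate centroids for the losing and the receiving cluster.
            centroids[old] = centroid(members(old))
            centroids[new] = centroid(members(new))
            moved = True
        # Step 3: repeat until a full pass makes no reassignment.
        if not moved:
            break
    return labels, centroids


# Toy points (hypothetical): two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Because the centroids are updated after every single move, the result can depend on the order of the items and on the initial partition, so it is worth trying more than one starting partition.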
Example
[Figure: k-means example. A sequence of scatter plots, each on axes running from 0 to 10, shows points being reassigned to clusters over successive iterations.]
Description of Data
The data give the values of four variables for each of 36 occupations in the USA. The question to be addressed is: can the occupations be classified in some potentially informative way?
Data
prestige rating   suicide rate   income   education
82                23.8           3977     14.4
90                37.5           5509     16+
76                37             4303     13.6
90                20.7           4091     16+
87                10.6           2410     16+
93                14.2           4366     16+
90                45.6           6448     16+
88                31.9           4590     16+
89                24.3           6284     16+
97                31.9           8302     16+
59                16             3176     15.8
73                16.8           3456     16+
81                64.8           4700     12.2
45                47.3           3806     11.6
39                21.9           2828     12.7
34                16.9           3480     12.2
41                32.4           3771     12.7
16                24.1           2543     12.1
33                32.7           2450     8.7
53                30.8           3447     11.1
67                34.2           4648     8.8
57                34.5           3303     9.6
26                24.4           2693     9.4
29                29.4           3353     9.3
10                14.4           1898     10.3
15                41.7           2410     8.2
19                19.2           3424     9.2
10                24.9           2213     8.9
13                17.9           2590     9.6
24                15.7           2915     9.6
20                36             2357     8.8
7                 24.4           1942     9.8
16                42.2           2249     8.7
11                38.2           2551     8.5
8                 20.3           1866     8.2
41                47.6           2866     10.6
Dendrogram
The two groups of occupations might be labelled blue collar and professional. Cluster 1 consists of bookkeepers, cooks, etc., with generally lower prestige, income, and education values but higher suicide rates. Cluster 2 contains occupations such as accountants, architects, and dentists, with high prestige, income, and education values and relatively low suicide rates.
Nonhierarchical Clustering
/* unstandardized variables */
proc fastclus data=train.clus maxc=2 out=train.abc;
  var prestigerating suiciderate income education;
run;

/* standardized variables */
proc fastclus data=train.Zscore1 maxc=2 out=train.abc1;
  var zprestigerating zsuiciderate zincome zeducation;
run;
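The second run works on standardized variables. The slides do not show how the z-scores in train.Zscore1 were produced, so the following Python sketch is only an assumption about the usual approach: subtract the mean and divide by the sample standard deviation. It uses the prestige rating and income columns from the data above and shows why the step matters: the raw income range is far wider than the prestige range, so income would dominate a Euclidean distance on the raw values.

```python
from statistics import mean, stdev

# Prestige rating and income for the 36 occupations, copied from the data above.
prestige = [82, 90, 76, 90, 87, 93, 90, 88, 89, 97, 59, 73,
            81, 45, 39, 34, 41, 16, 33, 53, 67, 57, 26, 29,
            10, 15, 19, 10, 13, 24, 20, 7, 16, 11, 8, 41]
income = [3977, 5509, 4303, 4091, 2410, 4366, 6448, 4590, 6284, 8302, 3176, 3456,
          4700, 3806, 2828, 3480, 3771, 2543, 2450, 3447, 4648, 3303, 2693, 3353,
          1898, 2410, 3424, 2213, 2590, 2915, 2357, 1942, 2249, 2551, 1866, 2866]


def z_scores(values):
    """Standardize to mean 0 and sample standard deviation 1 (an assumed convention)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


z_prestige = z_scores(prestige)
z_income = z_scores(income)

# How much wider the raw income range is than the raw prestige range.
raw_range_ratio = (max(income) - min(income)) / (max(prestige) - min(prestige))
# The same comparison after standardization.
z_range_ratio = (max(z_income) - min(z_income)) / (max(z_prestige) - min(z_prestige))

print(f"raw range ratio (income / prestige): {raw_range_ratio:.1f}")
print(f"standardized range ratio:            {z_range_ratio:.1f}")
```

After standardization the two variables are on comparable scales, so each contributes to the distance on a similar footing.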
A "profile" of a cluster is merely the set of mean values for that cluster. These can be for the internal variables (used to form the clusters) as well as external variables. The external variables may be demographic, psychographic, or consumption-pattern. Once the clusters are formed, other techniques such as profiling or discriminant analysis can be used to see what internal variables account for the clustering.
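Computing a profile amounts to averaging each variable within each cluster. The sketch below is a minimal Python illustration: the function name `cluster_profiles`, the records, and the cluster labels are hypothetical and are not results from the occupations data.

```python
from collections import defaultdict
from statistics import mean


def cluster_profiles(records, labels):
    """Return {cluster label: {variable: mean value}} for a list of dict records."""
    grouped = defaultdict(list)
    for record, label in zip(records, labels):
        grouped[label].append(record)
    return {
        label: {var: mean(r[var] for r in rows) for var in rows[0]}
        for label, rows in grouped.items()
    }


# Hypothetical records and cluster labels, for illustration only.
records = [
    {"prestige": 80, "income": 4000},
    {"prestige": 90, "income": 6000},
    {"prestige": 20, "income": 2000},
    {"prestige": 10, "income": 3000},
]
labels = [2, 2, 1, 1]

profiles = cluster_profiles(records, labels)
for label in sorted(profiles):
    print(label, profiles[label])
```

The same function works for external variables: add them to the records and they are averaged per cluster in the same way, even though they played no part in forming the clusters.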