Cluster Analysis
Cluster analysis is a multivariate interdependence technique that classifies individuals or objects into a small number of mutually exclusive and exhaustive groups, so that there is as much similarity as possible within groups and as much difference as possible among groups. It differs from multiple discriminant analysis in that the groups are not predefined: the purpose of cluster analysis is to determine how many distinct groups exist and to define their composition. It does not predict relationships and is not a dependence technique.
Interval-scaled variables are ideally suited for cluster analysis; ratio-scaled variables can also be used. Standardisation of variables is necessary if their units of measurement differ widely. A common two-stage approach: first, an initial clustering solution is obtained using a hierarchical procedure, such as average linkage or Ward's method. The number of clusters so obtained is then used as input to a non-hierarchical procedure, such as an optimising partitioning method.
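A minimal sketch of the z-score standardisation mentioned above (the helper name and the sample values are illustrative, not from the slides):

```python
# Z-score standardisation: rescale each variable to mean 0 and standard
# deviation 1, so variables measured in very different units get equal
# weight in the distance computations.
def standardise(columns):
    out = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = (sum((x - mean) ** 2 for x in col) / n) ** 0.5
        out.append([(x - mean) / sd for x in col])
    return out

# Illustrative data: income (large scale) vs age (small scale)
income = [25000, 40000, 31000, 58000]
age = [23, 35, 29, 44]
z_income, z_age = standardise([income, age])
```

Without this step, a variable like income would dominate the Euclidean distances simply because of its scale.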
Steps in Analysis
Run the hierarchical clustering programme on the variables (after standardisation, if necessary) and generate the output called the agglomeration schedule. It shows all possible solutions, from 1 to n−1 clusters, where n is the number of respondents or objects. Going up from the bottom of the agglomeration schedule, look at the column called coefficients to decide on the number of clusters: starting from the bottom, calculate the difference in the coefficient value between neighbouring rows. If the maximum value of this difference occurs, say, between the third and fourth rows from the bottom, it indicates there might be 3 clusters (the lower row number). The dendrogram and icicle plot can also be requested and give essentially the same information in graphical form.
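The coefficient-difference rule above can be sketched as a small function (the function name and the toy coefficient list are illustrative, not from the slides):

```python
# Suggest the number of clusters from the agglomeration-schedule
# coefficients (stage 1 first). Row k from the bottom of the schedule
# leaves k clusters; the largest difference between neighbouring rows
# marks where further merging starts to do damage.
def suggest_clusters(coefficients):
    rev = coefficients[::-1]                 # bottom row first
    diffs = [rev[k] - rev[k + 1] for k in range(len(rev) - 1)]
    # diffs[k] is the gap between rows k+1 and k+2 from the bottom;
    # the rule takes the lower row number, i.e. k+1 clusters
    best = max(range(len(diffs)), key=lambda k: diffs[k])
    return best + 1

# Toy schedule: the big jump sits between rows 3 and 4 from the bottom
print(suggest_clusters([1, 2, 3, 10, 11, 12]))   # prints 3
```

This is only a heuristic; as the interpretation slides note, practical considerations (such as avoiding single-member clusters) also enter the decision.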
Steps in Analysis
Once the number of clusters has been identified, a K-means clustering approach can be run on the data, with the number of clusters obtained in the first stage used as input. The output gives the initial and final cluster centres for each variable; the final cluster centres are the best solution. They are used to interpret the average value of each variable within a cluster and thereby describe the clusters.
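A bare-bones K-means loop illustrating this second stage (the function and the seed centres are illustrative; SPSS's optimising partitioning procedure is more elaborate):

```python
# Plain K-means: assign each point to its nearest centre, recompute each
# centre as the mean of its assigned points, and repeat until stable.
def kmeans(points, centres, iters=10):
    for _ in range(iters):
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centres[c])))
            groups[nearest].append(p)
        centres = [[sum(col) / len(col) for col in zip(*g)] if g else centres[i]
                   for i, g in enumerate(groups)]
    return centres, groups

# Two obvious blobs; in practice the seed centres would come from the
# hierarchical stage rather than being chosen by hand as here.
centres, groups = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)],
                         [(0, 0), (10, 10)])
```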
Example
Clustering of consumers based on attitudes towards shopping. Six attitudinal variables were identified, and consumers were asked to express their degree of agreement with the following statements on a 7-point scale (1 = Disagree; 7 = Agree): V1 = Shopping is fun; V2 = Shopping is bad for your budget; V3 = I combine shopping with eating out; V4 = I get the best buys when shopping; V5 = I do not care about shopping; V6 = You can save a lot of money by comparing prices. Data obtained from 20 respondents are shown below; in reality the sample size should be much larger.
Respondent:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
V1:           6  2  7  4  1  6  5  7  2  3  1  5  2  4  6  3  4  3  4  2
V2:           4  3  2  6  3  4  3  3  4  5  3  4  2  6  5  5  4  7  6  3
V3:           7  1  6  4  2  6  6  7  3  3  2  5  1  4  4  4  7  2  3  2
V4:           3  4  4  5  2  3  3  4  3  6  3  4  5  6  2  6  2  6  7  4
V5:           2  5  1  3  6  3  3  1  6  4  5  2  4  4  1  4  2  4  2  7
V6:           3  4  3  6  4  4  4  4  3  6  3  4  4  7  4  7  5  3  7  2
Agglomeration Schedule Using Ward's Procedure

                                              Stage cluster
            Clusters combined                 first appears
Stage   Cluster 1  Cluster 2  Coefficient   Cluster 1  Cluster 2  Next stage
  1        14         16        1.000000        0          0           6
  2         6          7        2.000000        0          0           7
  3         2         13        3.500000        0          0          15
  4         5         11        5.000000        0          0          11
  5         3          8        6.500000        0          0          16
  6        10         14        8.160000        0          1           9
  7         6         12       10.166667        2          0          10
  8         9         20       13.000000        0          0          11
  9         4         10       15.583000        0          6          12
 10         1          6       18.500000        6          7          13
 11         5          9       23.000000        4          8          15
 12         4         19       27.750000        9          0          17
 13         1         17       33.100000       10          0          14
 14         1         15       41.333000       13          0          16
 15         2          5       51.833000        3         11          18
 16         1          3       64.500000       14          5          19
 17         4         18       79.667000       12          0          18
 18         2          4      172.662000       15         17          19
 19         1          2      328.600000       16         18           0
Interpretation
Theoretical or practical considerations may suggest a certain number of clusters; if the purpose is market segmentation, management may want a particular number. Here, the value in the coefficients column suddenly more than doubles between stage 17 (3 clusters) and stage 18 (2 clusters). Likewise, at the last two stages of the dendrogram, clusters are being combined at large distances. It appears that a three-cluster solution is appropriate.
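The "more than doubles" observation can be checked directly against the coefficients column of the agglomeration schedule:

```python
# Coefficients from the agglomeration schedule, stages 1-19.
coef = [1.0, 2.0, 3.5, 5.0, 6.5, 8.16, 10.166667, 13.0, 15.583, 18.5,
        23.0, 27.75, 33.1, 41.333, 51.833, 64.5, 79.667, 172.662, 328.6]
# Ratio of each coefficient to the previous one; a ratio above 2 means
# that merge more than doubled the clustering criterion.
ratios = [coef[i] / coef[i - 1] for i in range(1, len(coef))]
jump_stage = next(i + 2 for i, r in enumerate(ratios) if r > 2)   # stage 18
clusters_before_jump = 20 - (jump_stage - 1)                      # 3 clusters
```

With n = 20 respondents, stopping just before stage 18 leaves 20 − 17 = 3 clusters, the solution adopted in the text.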
Interpretation
A 3-cluster solution results in clusters with 8, 6 and 6 respondents, whereas a 4-cluster solution has 8, 6, 5 and 1. It is not meaningful to have a cluster with only one case, so the 3-cluster solution is preferable. Interpreting and profiling the clusters involves examining the cluster centroids.
Cluster Centroids (Means of Variables)

Cluster No.     V1      V2      V3      V4      V5      V6
    1         5.750   3.625   6.000   3.125   1.750   3.875
    2         1.667   3.000   1.833   3.500   5.500   3.333
    3         3.500   5.833   3.333   6.000   3.500   6.000
Interpretation
Cluster 1 is high on V1 (shopping is fun) and V3 (combine shopping with eating out) and low on V5 (do not care about shopping): FUN-LOVING AND CONCERNED SHOPPERS (respondents 1, 3, 6, 7, 8, 12, 15, 17). Cluster 2 is high on V5 (do not care about shopping) and low on V1 (shopping is fun) and V3 (combine shopping with eating out): APATHETIC SHOPPERS (respondents 2, 5, 9, 11, 13, 20). Cluster 3 is high on V2 (shopping upsets the budget), V4 (try to get best buys) and V6 (can save a lot of money by comparing prices): ECONOMICAL SHOPPERS (respondents 10, 14, 16, 18, 19).
Interpretation
Further profiling can be done on the basis of variables not used for clustering. Demographic, psychographic, product-usage and media-usage variables can be used to target marketing efforts for each cluster. The variables that significantly differentiate between clusters can be identified through discriminant analysis.
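As a rough stand-in for a full discriminant analysis, a between-cluster to within-cluster variance ratio (a one-way ANOVA F-ratio) screens each profiling variable for how strongly it separates the clusters. The helper below is an illustrative sketch, not the SPSS procedure:

```python
# F-ratio for one variable: between-cluster variance over within-cluster
# variance. Larger values mean the variable differentiates the clusters.
def f_ratio(values, labels):
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    grand = sum(values) / len(values)
    k = len(groups)
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                  for g in groups.values()) / (k - 1)
    within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g)
                 for g in groups.values()) / (len(values) - k)
    return between / within

# Toy check: two well-separated groups give a large ratio
r = f_ratio([1, 2, 1, 5, 6, 5], [0, 0, 0, 1, 1, 1])
```

Variables with large ratios are the natural candidates for describing and targeting the segments.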
K-Means Output (SPSS)

[Initial Cluster Centers table: V1–V6 for each of the three clusters]

Iteration History: convergence was achieved in 2 iterations due to no or only a small change in cluster centres; the maximum distance by which any centre changed is 0.000. The minimum distance between initial centres is 7.746.

[Final Cluster Centers table: V1–V6 for each of the three clusters]

Distances between Final Cluster Centers
Cluster      1        2        3
   1                5.568    5.698
   2       5.568             6.928
   3       5.698    6.928
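These distances can be reproduced from the cluster centroids; for example, the Euclidean distance between the centroid-table clusters 2 and 3 works out to about 5.568, one of the reported values (note that K-means may number the clusters differently from the centroid table):

```python
# Euclidean distance between two cluster centres.
def distance(c1, c2):
    return sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5

# Centroids of clusters 2 and 3 from the cluster-centroid table (V1..V6)
c2 = [1.667, 3.000, 1.833, 3.500, 5.500, 3.333]
c3 = [3.500, 5.833, 3.333, 6.000, 3.500, 6.000]
d = distance(c2, c3)    # about 5.568
```

Well-separated final centres, together with fast convergence, indicate a stable clustering solution.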
Clustering Variables
In this instance, the units used for analysis are the variables, and the distance measures are computed for all pairs of variables. Hierarchical clustering of variables can aid in the identification of unique variables, or variables that make a unique contribution to the data. Clustering can also be used to reduce the number of variables. Associated with each cluster is a linear combination of the variables in the cluster, called the cluster component. A large set of variables can often be replaced by the set of cluster components with little loss of information. However, a given number of cluster components does not generally explain as much variance as the same number of principal components.
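For clustering variables, a common choice of distance between two variables is one minus the absolute correlation, so that strongly (positively or negatively) related variables end up close together. A sketch using two of the attitude variables from the example (the helper name is illustrative):

```python
# Pearson correlation between two variables.
def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# V1 (shopping is fun) and V5 (do not care about shopping) from the data
v1 = [6, 2, 7, 4, 1, 6, 5, 7, 2, 3, 1, 5, 2, 4, 6, 3, 4, 3, 4, 2]
v5 = [2, 5, 1, 3, 6, 3, 3, 1, 6, 4, 5, 2, 4, 4, 1, 4, 2, 4, 2, 7]
d = 1 - abs(corr(v1, v5))   # small: V1 and V5 are strongly (negatively) related
```

Under this distance V1 and V5 would merge early, consistent with their opposed wording ("shopping is fun" vs "do not care about shopping").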
SPSS Windows
To select these procedures using SPSS for Windows, click:
Analyze>Classify>Hierarchical Cluster
Analyze>Classify>K-Means Cluster
Cluster analysis is a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters. Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters. Cluster analysis is also called classification analysis, or numerical taxonomy. Both cluster analysis and discriminant analysis are concerned with classification. However, discriminant analysis requires prior knowledge of the cluster or group membership for each object or case included, to develop the classification rule. In contrast, in cluster analysis there is no a priori information about the group or cluster membership for any of the objects. Groups or clusters are suggested by the data, not defined a priori.
[Figure: measuring the distance between Cluster 1 and Cluster 2 on plots of Variable 1 against Variable 2 — Complete Linkage (maximum distance), Average Linkage, and Centroid Method]