Jewel Refran
Cluster analysis is a group of multivariate techniques whose primary
purpose is to group objects (e.g., respondents, products, or other
entities) based on the characteristics they possess.
It is a means of grouping records based upon attributes that make
them similar. If plotted geometrically, the objects within the clusters
will be close together, while the distance between clusters will be
farther apart.
* Cluster Variate
- the mathematical representation of the selected set of variables on which the objects' similarities are compared.
Cluster Analysis vs. Factor Analysis
- Cluster analysis: grouping is based on distance (proximity).
- Factor analysis: grouping is based on patterns of variation (correlation).
Hypothesis Generation
- Cluster analysis is also useful when a researcher
wishes to develop hypotheses concerning the nature of
the data or to examine previously stated hypotheses.
Cluster analysis is descriptive, atheoretical, and noninferential. Cluster analysis has no
statistical basis upon which to draw inferences from a sample to a population, and many
contend that it is only an exploratory technique. Nothing guarantees unique solutions,
because the cluster membership for any number of solutions is dependent upon many
elements of the procedure, and many different solutions can be obtained by varying one or
more elements.
Cluster analysis will always create clusters, regardless of the actual existence of any
structure in the data. When using cluster analysis, the researcher is making an assumption
of some structure among the objects. The researcher should always remember that just
because clusters can be found does not validate their existence. Only with strong
conceptual support and then validation are the clusters potentially meaningful and relevant.
The cluster solution is not generalizable because it is totally dependent upon the variables
used as the basis for the similarity measure. This criticism can be made against any
statistical technique, but cluster analysis is generally considered more dependent on the
measures used to characterize the objects than other multivariate techniques, because the
cluster variate is completely specified by the researcher. As a result, the researcher must be
especially cognizant of the variables used in the analysis, ensuring that they have strong
conceptual support.
Cluster analysis is used for:
◦ Taxonomy Description. Identifying groups within the data.
◦ Data Simplification. The ability to analyze groups of similar
observations instead of all individual observations.
◦ Relationship Identification. The simplified structure from CA
portrays relationships not revealed otherwise.
Theoretical, conceptual and practical considerations must be
observed when selecting clustering variables for CA:
◦ Only variables that relate specifically to objectives of the CA are
included.
◦ Variables selected characterize the individuals (objects) being
clustered.
The primary objective of cluster analysis is to define the
structure of the data by placing the most similar
observations into groups. To accomplish this task, we must
address three basic questions:
◦ Correlational measures.
- Less frequently used; large values of r indicate similarity.
◦ Distance measures.
- Most often used as a measure of similarity, with higher values
representing greater dissimilarity (distance between cases), not
similarity.
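The contrast between the two families of measures can be shown with a small numeric sketch (the two respondent profiles below are invented for illustration; `numpy` is assumed to be available):

```python
import numpy as np

# Hypothetical profiles for two respondents measured on five variables
x = np.array([3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([6.0, 8.0, 10.0, 12.0, 14.0])  # same pattern, larger magnitude

# Distance measure: Euclidean distance (higher value = greater dissimilarity)
distance = np.sqrt(np.sum((x - y) ** 2))

# Correlational measure: Pearson r between the two profiles
# (large r = similar *pattern*, regardless of magnitude)
r = np.corrcoef(x, y)[0, 1]

print(distance)  # large: the profiles are far apart in level
print(r)         # 1.0: the profiles have an identical pattern
```

The two measures disagree here: a distance measure calls these respondents very dissimilar, while a correlational measure calls them identical, because correlation ignores differences in level.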
[Figure: two bar charts ("Graph 1" and "Graph 2") plotting values for Categories 1-4. Graph 1 represents a higher level of similarity.]
    A      B      C      D      E      F      G
A   ---
B   3.162  ---
C   5.099  2.000  ---
D   5.099  2.828  2.000  ---
E   5.000  2.236  2.236  4.123  ---
F   6.403  3.606  3.000  5.000  1.414  ---
G   3.606  2.236  3.606  5.000  2.000  3.162  ---
SIMPLE RULE:
◦ Identify the two most similar (closest) observations not
already in the same cluster and combine them.
In steps 1, 2, 3, and 4, the overall similarity measure (OSM) does not change
substantially, which indicates that we are forming new clusters with essentially
the same heterogeneity as the existing clusters.
When we get to step 5, we see a large increase. This indicates that
joining clusters (B-C-D) and (E-F-G) resulted in a single cluster that
was markedly less homogeneous.
Therefore, the three-cluster solution of Step 4 seems
the most appropriate final cluster solution, with
two equally sized clusters, (B-C-D) and (E-F-G), and a
single outlying observation (A).
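This agglomeration can be reproduced from the proximity matrix above with SciPy's hierarchical clustering (a sketch; `scipy` is assumed to be available, and single linkage is used to mirror the "merge the closest pair" rule):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

labels = list("ABCDEFG")
# Proximity (distance) matrix from the table above
D = np.array([
    [0.000, 3.162, 5.099, 5.099, 5.000, 6.403, 3.606],
    [3.162, 0.000, 2.000, 2.828, 2.236, 3.606, 2.236],
    [5.099, 2.000, 0.000, 2.000, 2.236, 3.000, 3.606],
    [5.099, 2.828, 2.000, 0.000, 4.123, 5.000, 5.000],
    [5.000, 2.236, 2.236, 4.123, 0.000, 1.414, 2.000],
    [6.403, 3.606, 3.000, 5.000, 1.414, 0.000, 3.162],
    [3.606, 2.236, 3.606, 5.000, 2.000, 3.162, 0.000],
])

# "Simple rule": repeatedly merge the two closest clusters (single linkage)
Z = linkage(squareform(D), method="single")

# Cut the dendrogram at three clusters
membership = fcluster(Z, t=3, criterion="maxclust")
print(dict(zip(labels, membership)))
```

The resulting membership groups (B, C, D) together and (E, F, G) together, leaving A on its own, matching the three-cluster solution above.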
◦ K-means Method
In stage 8, the
observations 5 and 7 were
joined. The resulting
cluster next appears in
stage 13.
Table 23.2 is a reformatted table that shows the changes in the coefficients as the
number of clusters increases. The final column, headed 'Change', enables us to
determine the optimum number of clusters. In this case it is 3 clusters, as
subsequent clustering adds very little to distinguishing between cases.
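The 'Change' column can be computed directly from an agglomeration schedule: difference the successive merge coefficients and look for the largest jump. A sketch with invented coefficients (the values below are illustrative, not Table 23.2's):

```python
import numpy as np

# Hypothetical agglomeration coefficients for successive merges
coefficients = np.array([1.41, 2.00, 2.00, 2.00, 2.24, 6.00])

# Change in the coefficient at each successive merge
change = np.diff(coefficients)
print(change)

# The largest jump identifies the merge that sharply reduces homogeneity;
# stopping just before that merge gives the suggested number of clusters.
best_step = int(np.argmax(change))
```

Here the jump occurs at the final merge, so the solution one step earlier would be retained.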
Repeat steps 1 to 3 to place cases into one of three clusters.
The number you place in the box is the number of clusters that seem best to
represent the clustering solution in a parsimonious way.
Finally click OK.
A new variable has been generated at the end of your SPSS data file called clu3_1
(labelled Ward method in variable view). This provides the cluster membership for
each case in your sample.
Multiple lines
Voice mail
Paging service
Internet
Caller ID
Call waiting
Call forwarding
3-way calling
Electronic billing
Maximum number of iterations of recombining different clusters.
Specify the number of clusters.
Convergence criterion: determines when iteration ceases; represents a
proportion of the minimum distance between initial cluster centers.
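The three dialog options above map onto standard K-means parameters. A sketch using scikit-learn (assumed to be available; the data are random blobs invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Invented data: three well-separated blobs in two dimensions
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

km = KMeans(
    n_clusters=3,   # specify the number of clusters
    max_iter=10,    # maximum number of iterations of reassigning cases
    tol=1e-4,       # convergence criterion: stop when centers barely move
    n_init=10,
    random_state=0,
)
labels = km.fit_predict(X)
print(km.n_iter_, km.cluster_centers_)
```

As in the SPSS dialog, the number of clusters must be specified in advance, while `max_iter` and `tol` only control when the reassignment process stops.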
STATISTICS
It will show the information for each group.
The initial cluster centers are the variable values of the k well-spaced observations.
Iteration History
The iteration history shows the progress of the clustering process at
each step.
The ANOVA table indicates which variables contribute the most to your
cluster solution.
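The same comparison can be sketched outside SPSS with a one-way ANOVA per variable across the cluster labels (illustrative data; `scipy` assumed available). Note that because the labels were derived from the same data, these F-tests are descriptive, not inferential:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
# Invented data: var1 separates three clusters strongly, var2 is pure noise
labels = np.repeat([0, 1, 2], 30)
var1 = labels * 3.0 + rng.normal(scale=0.5, size=90)  # differs by cluster
var2 = rng.normal(scale=0.5, size=90)                 # unrelated to clusters

for name, values in [("var1", var1), ("var2", var2)]:
    groups = [values[labels == k] for k in (0, 1, 2)]
    f_stat, p = f_oneway(*groups)
    print(name, f_stat, p)  # var1 gets a far larger F than var2
```

A variable with a large F statistic (here var1) contributes much more to separating the clusters than one with a small F (var2).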
Cluster 2 is approximately
equally similar to clusters
1 and 3.
Number of Cases in Each Cluster
The two steps of the TwoStep Cluster Analysis procedure's algorithm can be summarized as follows:
Step 1. The procedure begins with the construction of a Cluster Features (CF) Tree. The tree begins by
placing the first case at the root of the tree in a leaf node that contains variable information about that
case. Each successive case is then added to an existing node or forms a new node, based upon its
similarity to existing nodes and using the distance measure as the similarity criterion. A node that
contains multiple cases contains a summary of variable information about those cases. Thus, the CF tree
provides a capsule summary of the data file.
Step 2. The leaf nodes of the CF tree are then grouped using an agglomerative clustering algorithm. The
agglomerative clustering can be used to produce a range of solutions. To determine which number of
clusters is "best", each of these cluster solutions is compared using Schwarz's Bayesian Criterion (BIC) or
the Akaike Information Criterion (AIC) as the clustering criterion.
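The TwoStep procedure itself is SPSS-specific, but Step 2's idea of comparing a range of solutions by an information criterion can be sketched with a different model-based clusterer, scikit-learn's GaussianMixture (an illustration of BIC-based selection, not the TwoStep algorithm; the data are invented):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Invented data: two well-separated blobs
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.4, size=(80, 2)),
    rng.normal(loc=(4, 4), scale=0.4, size=(80, 2)),
])

# Fit a range of solutions and score each with BIC (lower is better)
bics = {}
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics[k] = gm.bic(X)

best_k = min(bics, key=bics.get)
print(best_k)
```

BIC trades model fit against the number of parameters, so it penalizes solutions that add clusters without meaningfully improving fit.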
Car manufacturers need to be able to appraise the current market to
determine the likely competition for their vehicles. If cars can be
grouped according to available data, this task can be largely
automated using cluster analysis.