FRIAS
Bukidnon State University
▪ Managers in banks can group their clients according to salary, paying attitude, place, etc.
▪ One can find a particular topic/definition on the web because of clustering.
▪ One can connect to other people with the same hometowns, work places, hobbies, friends, etc.
▪ Suppressing Crimes: police forces use clustering so that patrol vehicles are stationed across areas with high crime rates.
▪ Telecommunication: telephone companies use clustering algorithms to place towers so that all customers receive optimum signal strengths.
▪ Medical Services: using cluster analysis, hospitals can be placed in areas where medical services to the people will be maximized.
1. SCALABILITY
It can deal with large databases.
3. DIMENSIONALITY
It can handle both high- and low-dimensional data.
4. INTERPRETABLE
It is easy to use, comprehend, and interpret.
Cluster the following students using their grades in Physics and Mathematics.
Single linkage:
Dist({3,6},{2,5}) = min(dist(3,2), dist(6,2), dist(3,5), dist(6,5))
= min(0.15, 0.25, 0.28, 0.39)
= 0.15
Complete linkage:
Dist({3,6},{1}) = max(dist(3,1), dist(6,1))
= max(0.22, 0.23)
= 0.23
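The two calculations above can be sketched in Python. The pairwise distances below are the values assumed in the worked example (the underlying grade data are not reproduced here):

```python
# Pairwise distances between students, as given in the worked example above.
d = {
    (3, 2): 0.15, (6, 2): 0.25, (3, 5): 0.28, (6, 5): 0.39,
    (3, 1): 0.22, (6, 1): 0.23,
}

def pair_dist(a, b):
    # distances are symmetric; look up whichever order is stored
    return d.get((a, b), d.get((b, a)))

def single_link(c1, c2):
    # single linkage: cluster distance = MINIMUM pairwise distance
    return min(pair_dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    # complete linkage: cluster distance = MAXIMUM pairwise distance
    return max(pair_dist(a, b) for a in c1 for b in c2)

print(single_link({3, 6}, {2, 5}))   # 0.15, matching the example
print(complete_link({3, 6}, {1}))    # 0.23, matching the example
```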
Figure 6. Complete Link Clustering and Dendrogram
Average Linkage:
Here, the proximity of two clusters is defined as the average pairwise
proximity among all pairs of points in the different clusters. For clusters 𝐶𝑖
and 𝐶𝑗 with sizes 𝑚𝑖 and 𝑚𝑗, respectively:
Dist(𝐶𝑖, 𝐶𝑗) = ( Σ over x in 𝐶𝑖, y in 𝐶𝑗 of dist(x, y) ) / (𝑚𝑖 · 𝑚𝑗)
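A minimal Python sketch of this formula; the toy points and the absolute-difference distance are illustrative, not from the document's data:

```python
def average_link(c1, c2, dist):
    # average linkage: sum of dist(x, y) over all cross-cluster pairs
    # (x in Ci, y in Cj), divided by mi * mj
    total = sum(dist(x, y) for x in c1 for y in c2)
    return total / (len(c1) * len(c2))

# toy 1-D example with absolute difference as the distance
avg = average_link([1.0, 2.0], [4.0, 6.0], lambda x, y: abs(x - y))
print(avg)  # (3 + 5 + 2 + 4) / 4 = 3.5
```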
Mahalanobis Distance: √( (𝑎 − 𝑏)ᵀ 𝑆⁻¹ (𝑎 − 𝑏) ), where 𝑆 is the covariance matrix of the data.
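A small Python sketch of the Mahalanobis distance for two-dimensional points, inverting the 2×2 covariance matrix by hand so the example stays self-contained (the points and matrix here are illustrative):

```python
import math

def mahalanobis_2d(a, b, S):
    # a, b: 2-D points; S: 2x2 covariance matrix [[s11, s12], [s21, s22]]
    d = [a[0] - b[0], a[1] - b[1]]
    # invert the 2x2 matrix directly
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    Sinv = [[ S[1][1] / det, -S[0][1] / det],
            [-S[1][0] / det,  S[0][0] / det]]
    # quadratic form (a - b)^T S^{-1} (a - b)
    q = (d[0] * (Sinv[0][0] * d[0] + Sinv[0][1] * d[1])
       + d[1] * (Sinv[1][0] * d[0] + Sinv[1][1] * d[1]))
    return math.sqrt(q)

# with the identity covariance, this reduces to the Euclidean distance
print(mahalanobis_2d((3, 4), (0, 0), [[1.0, 0.0], [0.0, 1.0]]))  # 5.0
```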
1. Click Analyze > Classify > Hierarchical Cluster
2. Place the input parameters/features into Variable(s)
3. Place the name into Label Cases by (you can skip this step if you have no
labels in your dataset)
4. Click Statistics > check Agglomeration schedule
The main use of this dialog box is to specify the number of clusters. If you have a
hypothesis (prior knowledge) about the number of clusters, you can tell SPSS to
create a set number of clusters, or to create a number of clusters within a range.
5. Choose None from Cluster Membership > click Continue
6. Click Plots > check Dendrogram > Continue
7. Click Method > choose Ward's Method from Cluster Method > choose
Euclidean distance from Measure > choose z-score from Standardize > click
Continue
This dialog box is where you choose the method of creating clusters. By default, SPSS
uses Between-groups linkage (the average method). Underneath the Method section,
there is a series of options depending on your data type (interval, counts, or binary). If you
have interval data, the most common metric is the Euclidean distance. Finally,
standardize your data by choosing z-score. Since we want to cluster cases, we
must standardize by variables.
(Note: if you want to cluster variables, choose to standardize across cases.)
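Standardizing by variables is just the z-score transform applied to each variable's column. A minimal Python sketch, here using the sample standard deviation (n − 1):

```python
import math

def z_scores(values):
    # z-score standardization of one variable: subtract the mean,
    # divide by the sample standard deviation (n - 1 in the denominator)
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

print(z_scores([1.0, 2.0, 3.0]))  # [-1.0, 0.0, 1.0]
```

After standardization, every variable has mean 0 and standard deviation 1, so no single variable dominates the distance calculation just because of its scale.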
8. Click Save > choose None or Single solution (if you want to determine the
cluster of each data point)
This dialog box allows you to save a new variable into the data editor that contains a coding
value representing membership in a cluster. As such, we can use this variable to tell us
which cases fall into the same clusters. In practice, we would normally run the cluster
analysis without selecting cluster membership and then inspect the resulting dendrogram
to establish how many substantive clusters lie within the data. Having done this, we could
re-run the analysis, requesting that SPSS save coding values for the number of clusters
identified.
9. Click Continue
10. Click OK
Imagine we wanted to look at clusters of cases referred for
psychiatric treatment. We measure each subject on four
questionnaires: the Spielberger Trait Anxiety Inventory (STAI), the
Beck Depression Inventory (BDI), a measure of Intrusive Thoughts
and Rumination (IT), and a measure of Impulsive Thoughts and
Actions (Impulse). The rationale behind this analysis is that people
with the same disorder should report a similar pattern of scores
across the measures (so the profiles of their responses should be
similar). To check the analysis, we asked two trained psychologists
to agree on a diagnosis based on the DSM-IV. These data are in
Table 5 and in the file diagnosis.sav.
Table 5. Data in diagnosis.sav
Note the "elbow" in the agglomeration schedule: it suggests how many clusters to retain.
▪ For these data, the fork first splits to separate cases 1, 4, 7, 11, 13, 10, 12, 9, 15, &
2 from cases 5, 14, 6, 8, & 3.
▪ Based on the DSM-IV classification of these cases, the separation has divided
up GAD and Depression from OCD. This is likely to have occurred because
both GAD and Depression patients have low scores on intrusive thoughts and
impulsive thoughts and actions whereas those with OCD score highly on both
measures.
▪ The second major division is to split one branch of this first fork into two
further clusters. This division separates cases 1, 4, 7, 11 & 13 from 10, 12, 9, 15,
& 2. Looking at the DSM classification this second split has separated GAD
from Depression.
▪ In short, the final analysis has revealed 3 major clusters, which seem to be
related to the classifications arising from DSM. As such, we can argue that
using the STAI, BDI, IT and Impulse as diagnostic measures is an accurate
way to classify these three groups of patients (and possibly less time
consuming than a full DSM-IV diagnosis).
3. Check the Data View of your SPSS; a new variable named CLUS3_1 is found.
▪ Now that we've unearthed the number of clusters, it's time to re-run the
analysis and ask SPSS to save a new variable in which cluster codes are
assigned to cases.
▪ The program will start with k random clusters, and then move objects
between those clusters with the goal to:
1) minimize variability WITHIN clusters; and
2) maximize variability BETWEEN clusters.
▪ This is analogous to "ANOVA in reverse" in the sense that the significance
test in ANOVA evaluates the between group variability against the within-
group variability when computing the significance test for the hypothesis
that the means in the groups are different from each other.
▪ In k-means clustering, the program tries to move objects (e.g., cases) in and
out of groups (clusters) to get the most significant ANOVA results.
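The procedure described in these bullets (Lloyd's k-means algorithm) can be sketched in Python for one-dimensional data; the points and k below are illustrative:

```python
import random

def kmeans(points, k, iters=10, seed=0):
    # Lloyd's algorithm sketch: start from k random centers, then
    # alternate (1) assigning each point to its nearest center and
    # (2) moving each center to the mean of its assigned points,
    # which shrinks within-cluster variability at every step.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# two well-separated groups of 1-D points
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))  # centers settle near the two group means
```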
Hit Iterate > type 10 (as shown) > Continue
Hit Save > check Cluster membership > check Distance from cluster center > Continue
Hit Options > check Initial cluster centers > check ANOVA table > check Cluster information for each case > Continue
Step 6: Hit OK
▪ The clustering of schools is mainly determined by their performance in
licensure examinations. Cluster 3 contains the best performing schools while
cluster 1 contains the worst performing schools.
▪ Further, the best performing schools are characterized as those which have
relatively higher tuition rates, are very selective with their students (high
rejection rates), and hire the largest number of Ph.D. faculty.
X2: EDUCATION
1 = ELEMENTARY
2 = HIGH SCHOOL
3 = COLLEGE
X3: MARRIAGE
1 = MONOGAMY
2 = POLYGAMY
X5: NUTRITION
1 = MALNOURISHED TO 5 = PROPERLY NOURISHED
X6: INCOME
0 = SEASONAL
1 = FIXED
X9: RELIGION
0 = ATHEIST
1 = MONOTHEIST
2 = MULTIPLE SUPREME BEINGS
Human Development Index (HDI) is a composite index
measuring average achievement in three basic dimensions of
human development: a long and healthy life, knowledge, and a
decent standard of living. In 2003, a survey was conducted to
determine the HDI of several countries (Fukuda-Parr, 2003). In
this survey, the Philippines was ranked 85th out of 175 countries
in the world. Data for fifteen (15) selected countries are shown on
the next page. Perform a cluster analysis on this data set and
formulate tentative theories about Human Development Indices
of countries worldwide.
Country HDI Malnutrition Rate Literacy Rate Poverty Incidence Political Stability
1. Norway 0.944 0.01 0.99 0.02 0.99
2. Japan 0.933 0.01 0.94 0.04 0.94
3. Germany 0.921 0.03 0.96 0.05 0.92
4. Singapore 0.884 0.04 0.87 0.03 0.95
5. Brunei 0.872 0.07 0.89 0.02 0.99
6. Malaysia 0.790 0.08 0.83 0.06 0.92
7. Thailand 0.768 0.08 0.88 0.05 0.88
8. Philippines 0.751 0.09 0.90 0.10 0.84
9. Vietnam 0.688 0.08 0.83 0.11 0.88
10. Indonesia 0.682 0.09 0.80 0.11 0.85
11. Cambodia 0.556 0.12 0.64 0.15 0.84
12. Myanmar 0.549 0.20 0.72 0.22 0.80
13. Sierra Leone 0.275 0.25 0.41 0.25 0.77
14. USA 0.940 0.01 0.97 0.04 0.98
15. Ethiopia 0.330 0.21 0.45 0.21 0.80