
MARLON S. FRIAS
Bukidnon State University

May 19, 2018


▪ Partitioning the data into subclasses.
▪ Grouping similar objects.
▪ Partitioning the data based on similarity.
Market Research
Managers in banks can group their clients according to salary, paying attitude, place, etc.

Search Engines
One can find a particular topic/definition on the web because of clustering.

Social Media
One can connect to other people with the same hometowns, workplaces, hobbies, friends, etc.

Suppressing Crimes
Police forces use clustering so that patrol vehicles are stationed across areas with high crime rates.

Telecommunication
Telephone companies use clustering algorithms to place towers so that all customers receive optimum signal strength.

Medical Services
Using cluster analysis, hospitals can be placed in areas where medical services to the people will be maximized.
1. SCALABILITY
It can deal with large databases.

2. CAN DEAL WITH DIFFERENT KINDS OF ATTRIBUTES
It can deal with numerical, categorical, and binary data.

3. DIMENSIONALITY
It can handle both high- and low-dimensional data.

4. INTERPRETABLE
It is easy to use, comprehend, and interpret.
Cluster the following students using their grades in Physics and Mathematics.

Student Physics Math


P 15 20
Q 20 15
R 26 21
X 44 52
Y 50 45
Z 57 38
A 80 85
B 90 88
C 98 98

Do all of them appear similar?


Figure 1. Physics Score vs Math Score

• Natural grouping of similar objects, based on input parameters.
• Homogeneous within, heterogeneous across, based on characteristics.
▪ There is no objective function in cluster analysis.
▪ There is no dependent variable that we are trying to predict.
What we have are just input parameters.
▪ Based on the parameters, natural groupings are built up on their own.
▪ It is called subjective segmentation because it has no objective
function.
▪ It is also called unsupervised learning because you don’t set the
pattern; the algorithm finds the pattern on its own.
▪ One decides on a strategy to deal with the segments once the
segments are developed and understood.
d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}

Distance between two points i and j using p input parameters.
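As a quick illustration, here is a minimal Python sketch of this distance, using the Physics/Math scores of students P and Q from the table above:

```python
import numpy as np

def euclidean_distance(x_i, x_j):
    """Euclidean distance between two points described by p input parameters."""
    x_i = np.asarray(x_i, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    return np.sqrt(np.sum((x_i - x_j) ** 2))

# Students P (15, 20) and Q (20, 15) from the Physics/Math example.
print(euclidean_distance([15, 20], [20, 15]))  # about 7.07
```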


A graphical presentation of the clusters of the data points. It also
shows the dissimilarities among the data points and how they are
clustered.

Figure 8. Dendrogram of the Data Points


Single linkage uses the minimum distance between clusters, complete linkage
uses the maximum distance, and average linkage uses an intermediate cluster
distance.

Figure 2. Single Linkage and Complete Linkage    Figure 3. Average Linkage


Use the three linkage functions on the sample data, which consist of 6
two-dimensional points, shown below:

Table 3. Data Points

Table 4. Euclidean Distance of the Data Points

Figure 4. Plot of Data Points


Single Linkage:
First look at the points that seem closest to each other, say 3 and 6. Their
Euclidean distance is 0.11, and that is the height at which they are joined into
one cluster in the dendrogram. For points 2 and 5, their distance is 0.14, which is
also reflected in the dendrogram, so they are joined into one cluster. Next,
let’s look at the distance between two clusters using single linkage:

Dist({3,6},{2,5})
= min(dist(3,2), dist(6,2), dist(3,5), dist(6,5))
= min(0.15, 0.25, 0.28, 0.39)
= 0.15
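The same single-link rule as a one-line Python check, using the four pairwise distances quoted above:

```python
# Single-link (minimum) distance between clusters {3,6} and {2,5},
# using the pairwise Euclidean distances from Table 4.
pairwise = [0.15, 0.25, 0.28, 0.39]   # dist(3,2), dist(6,2), dist(3,5), dist(6,5)
print(min(pairwise))                  # 0.15
```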

Next, cluster {4} is merged with cluster {3,6,2,5}, since its distance to that
cluster is smaller than the distance between {1} and {3,6,2,5}.

The last cluster to be merged with {3,6,2,5,4} is {1}.


Figure 5. Single Link Clustering and Dendrogram
Complete Linkage:
First we cluster points 3 and 6, then 2 and 5, because they are clearly close
to each other. However, {3,6} is merged first with {4}, instead of with {2,5} or {1},
because:

Dist({3,6},{4}) = max(dist(3,4), dist(6,4))
= max(0.15, 0.22)
= 0.22

Dist({3,6},{2,5}) = max(dist(3,2), dist(6,2), dist(3,5), dist(6,5))
= max(0.15, 0.25, 0.28, 0.39)
= 0.39

Dist({3,6},{1}) = max(dist(3,1), dist(6,1))
= max(0.22, 0.23)
= 0.23

Figure 6. Complete Link Clustering and Dendrogram
Average Linkage:
Here, the proximity of two clusters is defined as the average pairwise
proximity among all pairs of points in the different clusters. The proximity
for clusters 𝐶𝑖 and 𝐶𝑗 with sizes 𝑚𝑖 and 𝑚𝑗, respectively, is:

\mathrm{proximity}(C_i, C_j) = \frac{1}{m_i m_j} \sum_{x \in C_i} \sum_{y \in C_j} \mathrm{proximity}(x, y)

Figure 6. Average Link Clustering and Dendrogram
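The three linkage functions can be compared directly with SciPy. A minimal sketch follows; Table 3 is not reproduced in the text, so the six 2-D points below are hypothetical coordinates chosen only for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical 2-D coordinates for six points (stand-ins for Table 3).
points = np.array([
    [0.40, 0.53],  # 1
    [0.22, 0.38],  # 2
    [0.35, 0.32],  # 3
    [0.26, 0.19],  # 4
    [0.08, 0.41],  # 5
    [0.45, 0.30],  # 6
])
labels = ["1", "2", "3", "4", "5", "6"]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, method in zip(axes, ["single", "complete", "average"]):
    Z = linkage(points, method=method, metric="euclidean")
    dendrogram(Z, labels=labels, ax=ax)   # merge heights follow the chosen linkage
    ax.set_title(f"{method} linkage")
plt.tight_layout()
plt.show()
```

The merge heights in the three dendrograms differ because each method uses a different cluster-to-cluster distance (minimum, maximum, or average).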


▪ It is a criterion applied in hierarchical cluster analysis.
▪ Ward’s method says that the distance between two clusters, A and B, is
how much the sum of squares will increase when we merge them:

\Delta(A, B) = \sum_{i \in A \cup B} \lVert x_i - m_{A \cup B} \rVert^2
             - \sum_{i \in A} \lVert x_i - m_A \rVert^2
             - \sum_{i \in B} \lVert x_i - m_B \rVert^2
             = \frac{n_A n_B}{n_A + n_B} \lVert m_A - m_B \rVert^2

where 𝑚𝑗 is the center of cluster j, and 𝑛𝑗 is the number of points in it. Δ is
called the merging cost of combining the clusters A and B. The sum of
squares starts at zero and then grows as we merge clusters. Ward’s method
keeps this growth as small as possible (a small numerical sketch of this
merging cost is given after the list below).
▪ The idea is to build a binary tree of the data that successively merges
similar groups of points.
▪ Visualizing this tree provides a useful summary of the data.
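To make the merging cost concrete, here is a minimal NumPy sketch; the two small clusters are made up purely for illustration:

```python
import numpy as np

def sum_of_squares(points):
    """Sum of squared distances of the points from their centroid."""
    points = np.asarray(points, dtype=float)
    return np.sum((points - points.mean(axis=0)) ** 2)

def ward_merging_cost(A, B):
    """Increase in the within-cluster sum of squares when clusters A and B are merged."""
    merged = np.vstack([A, B])
    return sum_of_squares(merged) - sum_of_squares(A) - sum_of_squares(B)

# Hypothetical clusters, for illustration only.
A = np.array([[1.0, 2.0], [2.0, 1.0]])
B = np.array([[8.0, 9.0], [9.0, 8.0]])
print(ward_merging_cost(A, B))  # large cost: the clusters are far apart
```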
▪ Uses small datasets (usually fewer than 100 observations).
▪ Shows each stage of observation linkage.
▪ Dendrogram and scree plot.
▪ RMS STD is the within-cluster variance.
▪ When the number of clusters = 1, then
RMS STD = total variance within the
data.
▪ The elbow indicates the optimal
number of clusters.

Figure 7. Scree Plot of the Clusters vs Within Variance


Note that in hierarchical clustering:

• Each character/data point starts as its
own cluster.
• Find the dissimilarity among
clusters.
• Uniqueness.

First we need the input parameters/features
that determine dissimilarity/similarity, e.g.
genetic codes and the distance between where they
live.

• Using the dendrogram, we know how
many clusters we have as well as
their order.

• The length of the branches refers to
the measure of dissimilarity between
the clusters.
We will know the number of significant clusters formed if
we set a threshold line.

• Based on the figure, we have 6 data
points, which means that we start with 6
clusters.
• We can group each point according to
its distance to the others (here, we use the
Euclidean distance metric).
• Now, by setting our threshold, we see that we can form 3 clusters.
• General practice in setting the threshold is to use the midpoint of the
longest branch in the dendrogram (see the sketch after this list).
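A minimal SciPy sketch of cutting a dendrogram at a threshold to obtain cluster memberships; the data points and the threshold value below are hypothetical:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D data points, for illustration only.
X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0], [9.2, 1.3]])

Z = linkage(X, method="single", metric="euclidean")
threshold = 2.0                       # e.g. the midpoint of the longest branch
labels = fcluster(Z, t=threshold, criterion="distance")
print(labels)                         # cluster membership of each point, e.g. [1 1 2 2 3 3]
```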
Euclidean Distance: \sqrt{\sum_i (a_i - b_i)^2}

Squared Euclidean Distance: \sum_i (a_i - b_i)^2

Manhattan Distance: \sum_i |a_i - b_i|

Maximum Distance: \max_i |a_i - b_i|

Mahalanobis Distance: \sqrt{(a - b)^T S^{-1} (a - b)}
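All of these metrics are available in SciPy. A minimal sketch, with made-up vectors and a made-up covariance matrix:

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical vectors, for illustration only.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 3.5])

print(distance.euclidean(a, b))    # square root of the sum of squared differences
print(distance.sqeuclidean(a, b))  # sum of squared differences
print(distance.cityblock(a, b))    # Manhattan distance
print(distance.chebyshev(a, b))    # maximum (Chebyshev) distance

# Mahalanobis distance needs the inverse covariance matrix S^{-1}.
S = np.array([[1.0, 0.2, 0.0],
              [0.2, 1.5, 0.3],
              [0.0, 0.3, 2.0]])
print(distance.mahalanobis(a, b, np.linalg.inv(S)))
```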
1. Click Analyze > Classify > Hierarchical Cluster.
2. Place the input parameters/features in Variable(s).
3. Place the name variable in Label Cases by (you can skip this step if you have no
labels in your dataset).
4. Click Statistics > check Agglomeration schedule.
The main use of this dialog box is in specifying the number of clusters. If you have a
hypothesis (prior knowledge) about the number of clusters, then you can tell SPSS to
create a set number of clusters, or to create a number of clusters within a range.
5. Choose None under Cluster Membership > click Continue.
6. Click Plots > check Dendrogram > Continue.
7. Click Method > choose Ward’s Method under Cluster Method > choose
Euclidean distance under Measure > choose Z scores under Standardize > click
Continue.
This dialog box is where you choose the method of creating clusters. By default, SPSS
uses Between-groups linkage (the average method). Underneath the method section,
there is a series of options depending on your data (interval, counts, or binary). If you
have interval data, the most common metric is the Euclidean distance. Finally,
standardize your data by choosing Z scores. Since we want to cluster cases, we
must standardize by variable.
(Note: if you want to cluster variables, then choose standardization across cases.)
8. Click Save > choose None or Single solution (if you want to determine the
cluster of each data point).
This dialog box allows you to save a new variable into the data editor that contains a coding
value representing membership of a cluster. As such, we can use this variable to tell us
which cases fall into the same clusters. In reality, we would normally run the cluster
analysis without selecting cluster membership and then inspect the resulting dendrogram
to establish how many substantive clusters lie within the data. Having done this, we could
re-run the analysis, requesting that SPSS save coding values for the number of clusters
identified.

9. Click Continue
10. Click OK.
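If you prefer to work outside SPSS, the same workflow (z-score standardization, Euclidean distances, Ward's method, inspecting a dendrogram, then saving cluster membership) can be sketched with SciPy. The CSV file name below is a hypothetical stand-in for the diagnosis.sav file, and the column names follow the example that comes next:

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import zscore
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical file; substitute your own dataset exported from SPSS.
df = pd.read_csv("diagnosis.csv")
features = df[["STAI", "BDI", "IT", "IMPULSE"]].apply(zscore)  # standardize by variable (z-scores)

Z = linkage(features, method="ward", metric="euclidean")       # Ward's method on Euclidean distances

dendrogram(Z, labels=df.index.tolist())                        # inspect to choose the number of clusters
plt.show()

df["cluster"] = fcluster(Z, t=3, criterion="maxclust")         # save membership for a 3-cluster solution
```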
Imagine we wanted to look at clusters of cases referred for
psychiatric treatment. We measure each subject on four
questionnaires: the Spielberger Trait Anxiety Inventory (STAI), the Beck
Depression Inventory (BDI), a measure of Intrusive Thoughts and
Rumination (IT), and a measure of Impulsive Thoughts and Actions
(Impulse). The rationale behind this analysis is that people with the
same disorder should report a similar pattern of scores across the
measures (so the profiles of their responses should be similar). To
check the analysis, we asked 2 trained psychologists to agree on a
diagnosis based on the DSM-IV. These data are in Table 5 and in the
file diagnosis.sav.
Table 5. Data in diagnosis.sav

DSM STAI BDI IT IMPULSE


GAD 74 30 20 10
DEPRESSION 50 70 23 5
OCD 70 5 58 29
GAD 76 35 23 12
OCD 68 23 66 37
OCD 62 8 59 39
GAD 71 35 27 17
OCD 67 12 65 35
DEPRESSION 35 60 15 8
DEPRESSION 33 58 11 16
GAD 80 36 30 16
DEPRESSION 30 62 9 13
GAD 65 38 17 10
OCD 78 15 70 40
DEPRESSION 40 55 10 2
This table shows the amount of error created at each clustering stage
when two different objects are brought together to create a new cluster.
To graph the error coefficients of the agglomeration schedule: double-click
> right-click > Create Graph > Line.
Rule of Thumb: Finding the Elbow!

The elbow indicates the optimal number of clusters!

Wow! Elbow!
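Outside SPSS, the agglomeration (merge) distances can be plotted directly from a SciPy linkage matrix to look for the elbow. A minimal sketch, assuming `Z` is a linkage matrix such as the one computed in the earlier sketch:

```python
import matplotlib.pyplot as plt

# Z is a SciPy linkage matrix; its third column holds the merge distances
# (the agglomeration coefficients), from the first merge to the last.
merge_distances = Z[:, 2]
stages = range(1, len(merge_distances) + 1)

plt.plot(stages, merge_distances, marker="o")
plt.xlabel("Agglomeration stage")
plt.ylabel("Merge distance (coefficient)")
plt.title("Look for the elbow")
plt.show()
```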
▪ For these data, the fork first splits to separate cases 1, 4, 7, 11, 13, 10, 12, 9, 15, &
2 from cases 5, 14, 6, 8, & 3.

▪ Based on the DSM-IV classification of these cases, the separation has divided
up GAD and Depression from OCD. This is likely to have occurred because
both GAD and Depression patients have low scores on intrusive thoughts and
impulsive thoughts and actions whereas those with OCD score highly on both
measures.

▪ The second major division is to split one branch of this first fork into two
further clusters. This division separates cases 1, 4, 7, 11 & 13 from 10, 12, 9, 15,
& 2. Looking at the DSM classification this second split has separated GAD
from Depression.
▪ In short, the final analysis has revealed 3 major clusters, which seem to be
related to the classifications arising from DSM. As such, we can argue that
using the STAI, BDI, IT and Impulse as diagnostic measures is an accurate
way to classify these three groups of patients (and possibly less time
consuming than a full DSM-IV diagnosis).
▪ Now that we’ve unearthed the
number of clusters, it’s time to
re-run the analysis and ask
SPSS to save a new variable in
which cluster codes are
assigned to cases.

1. Click Analyze > Classify >
Hierarchical Cluster
Analysis > Save.
2. Click Single solution and
type 3 (the number of clusters
we’ve just identified).
3. Check the Data View of your SPSS; a
new variable named CLUS3_1 is
found.
▪ General logic:
Suppose that you already have
hypotheses concerning the number of
clusters in your cases or variables. You
may want to "tell" the computer to form
exactly 3 clusters that are to be as
distinct as possible.

▪ This is the type of research question
that can be addressed by the k-means
clustering algorithm.
▪ Computationally, you may think of this method as analysis of variance
(ANOVA) "in reverse."

▪ The program will start with k random clusters, and then move objects
between those clusters with the goal to:
1) minimize variability WITHIN clusters; and
2) maximize variability BETWEEN clusters.
▪ This is analogous to "ANOVA in reverse" in the sense that the significance
test in ANOVA evaluates the between-group variability against the within-
group variability when testing the hypothesis that the group means are
different from each other.

▪ In k-means clustering, the program tries to move objects (e.g., cases) in and
out of groups (clusters) to get the most significant ANOVA results.

▪ Usually, as the result of a k-means clustering analysis, we would examine the
means for each cluster on each dimension to assess how distinct our k
clusters are.
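As a concrete sketch, here is a minimal scikit-learn version of this idea. The feature matrix uses the first four cases of Table 5 purely for illustration, and k = 3 is an arbitrary choice here:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Cases x variables (STAI, BDI, IT, IMPULSE); first four rows of Table 5.
X = np.array([[74, 30, 20, 10],
              [50, 70, 23, 5],
              [70, 5, 58, 29],
              [76, 35, 23, 12]], dtype=float)

X_std = StandardScaler().fit_transform(X)              # standardize the variables

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)
print(km.labels_)                                      # cluster membership of each case
print(km.cluster_centers_)                             # examine the cluster means on each dimension
```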
We are looking at “QUALITY OF EDUCATION” and we are getting data
from 15 universities in the Philippines. We decided to look into:
X1 = performance in licensure exams
X2 = average tuition rate per unit
X3 = enrollment size
X4 = Percentage of Ph.D.’s in faculty
X5 = acceptance/rejection rate per hundred
The data are shown on the next page.
▪ Exam Tuition Enrolment Ph.D. rejection
▪ 87 700 6000 60 0.80
▪ 85 620 5500 50 0.75
▪ 83 600 5000 45 0.77
▪ 82 600 5400 48 0.74
▪ 83 610 6250 50 0.80
▪ 78 450 8000 36 0.65
▪ 77 400 7600 32 0.54
▪ 76 410 7700 37 0.45
▪ 78 460 8900 32 0.50
▪ 80 500 9000 30 0.48
▪ 72 200 8000 8 0.20
▪ 75 250 9000 10 0.10
▪ 70 300 9000 7 0.15
▪ 67 260 11000 10 0.15
▪ 68 200 10000 5 0.05
Steps 1 and 2: (shown in the accompanying screenshots)

Step 3: Hit Iterate > type 10 (as shown) > Continue.
Step 4: Hit Save > check Cluster membership > check Distance from cluster center > Continue.
Step 5: Hit Options > check Initial cluster centers > ANOVA table > Cluster information for each case > Continue.

Step 6: Hit OK.
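For readers working outside SPSS, a rough scikit-learn equivalent of these steps can be sketched as follows, assuming the university table above has been saved to a CSV file (the file name and column headers are placeholders):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical file holding the 15-university table: Exam, Tuition, Enrolment, PhD, Rejection.
df = pd.read_csv("universities.csv")
X = StandardScaler().fit_transform(df)                   # put the five variables on a common scale

km = KMeans(n_clusters=3, n_init=10, max_iter=10, random_state=0).fit(X)  # max_iter mirrors "Iterate > 10"

df["cluster"] = km.labels_                               # cluster membership for each school
print(df.groupby("cluster").mean())                      # compare cluster profiles (cf. the ANOVA table)
```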
▪ The clustering of schools is mainly determined by their performance in
licensure examinations. Cluster 3 contains the best performing schools while
cluster 1 contains the worst performing schools.

▪ Further, the best performing schools are characterized as those which have
relatively higher tuition rates, are very selective with their students (high
rejection rates), and hire the largest number of Ph.D. faculty.

▪ Theory: The quality of a higher education institution varies directly with
the institution’s student selectivity, quality of faculty, and level of
investment in education. Quality is, thus, achieved at a certain economic
price.
▪ On the next page is a hypothetical data set about various
indigenous tribes in Mindanao. We look into the socio-economic,
religious, and political situations of these indigenous tribes as
reflected in nine (9) variables.

▪ Perform a cluster analysis on the data set (using three (3) or more clusters).

▪ Formulate your theories based on the cluster analysis. Explain your theories.
▪ TRIBE EDUC. MARRIAGE CHILDREN NUTRITION INCOME CONTACT LAWS RELIGION
▪ 1 1 2 4 2 0 1 1 2
▪ 1 1 2 5 2 0 2 1 2
▪ 1 1 2 3 1 0 2 1 2
▪ 1 2 2 5 2 0 2 1 2
▪ 1 2 2 6 3 0 1 1 2
▪ 1 1 2 4 3 0 2 1 2
▪ 2 1 1 5 2 1 2 0 2
▪ 2 1 1 3 3 1 2 0 2
▪ 2 2 1 3 1 0 1 0 2
▪ 2 2 1 3 3 0 2 0 2
▪ 2 2 1 3 2 0 2 0 2
▪ 3 1 2 2 3 0 2 1 0
▪ 3 2 2 4 2 0 1 1 0
▪ 3 3 2 4 2 1 2 0 0
▪ 3 1 2 5 3 1 1 0 0
▪ 4 1 2 3 4 1 2 1 1
▪ 4 2 2 2 5 1 1 1 1
▪ 4 3 2 8 3 1 3 1 1
▪ 4 2 2 9 3 1 3 1 1
▪ 4 3 2 8 2 1 3 1 1
▪ 4 3 2 9 1 0 3 1 1
▪ 4 1 2 10 3 1 3 1 1
▪ 5 1 2 8 2 1 2 1 1
▪ 5 2 2 9 3 1 2 1 1
▪ 5 3 2 5 2 1 3 1 1
▪ 5 1 2 10 1 0 3 1 1
▪ 5 2 2 8 1 0 3 1 1
▪ 5 3 2 9 1 1 3 1 1
▪ 5 3 2 8 2 1 3 1 1
▪ 5 3 2 7 1 1 3 1 1
X1: TRIBE X7: CONTACT WITH MAINSTREAM SOCIETY
1 = SUBANEN 1 = SELDOM
2 = MANOBO 2 = OFTEN
3 = HIGAONON 3 = ALWAYS
4 = MARANAW
5 = TAUSUG X8: INDIGENOUS LAWS (0 = ABSENT/ 1 = PRESENT)

X2: EDUCATION
1 = ELEMENTARY X9: RELIGION
2 = HIGH SCHOOL 0 = ATHEIST
3 = COLLEGE 1 = MONOTHEIST 2 = MULTIPLE SUPREME BEINGS

X3: MARRIAGE
1 = MONOGAMY
2 = POLYGAMY

X4 = CHILDREN ( NO. OF CHILDREN)

X5 : NUTRITION
1 = MALNOURISHED TO 5 = PROPERLY NOURISHED

X6: INCOME
0 = SEASONAL
1 = FIXED
Human Development Index (HDI) is a composite index
measuring average achievement in three basic dimensions of
human development: a long and healthy life, knowledge, and
a decent standard of living. In 2003, a survey was conducted to
determine the HDI of several countries (Fukuda-Parr, 2003). In
this survey, the Philippines was ranked 85th out of 175 countries
in the world. Data for fifteen (15) selected countries are shown on
the next page. Perform a cluster analysis on this data set and
formulate tentative theories about Human Development Indices
of countries worldwide.
Country HDI Malnutrition Rate Literacy Rate Poverty Incidence Political Stability
1. Norway 0.944 0.01 0.99 0.02 0.99
2. Japan 0.933 0.01 0.94 0.04 0.94
3. Germany 0.921 0.03 0.96 0.05 0.92
4. Singapore 0.884 0.04 0.87 0.03 0.95
5. Brunei 0.872 0.07 0.89 0.02 0.99
6. Malaysia 0.790 0.08 0.83 0.06 0.92
7. Thailand 0.768 0.08 0.88 0.05 0.88
8. Philippines 0.751 0.09 0.90 0.10 0.84
9. Vietnam 0.688 0.08 0.83 0.11 0.88
10. Indonesia 0.682 0.09 0.80 0.11 0.85
11. Cambodia 0.556 0.12 0.64 0.15 0.84
12. Myanmar 0.549 0.20 0.72 0.22 0.80
13. Sierra Leone 0.275 0.25 0.41 0.25 0.77
14. USA 0.940 0.01 0.97 0.04 0.98
15. Ethiopia 0.330 0.21 0.45 0.21 0.80
