Colic Baby

Business Analytics
Today's objective
Clustering Analysis
Indian Institute of Management (IIM), Rohtak

Clustering
Clustering is a data mining technique used to place data elements
into related groups without advance knowledge of the group
definitions.
News.google.com is a website that uses a clustering algorithm to
group each day's news articles together and surface the top
stories of the day.

Clustering
Clustering is the process of grouping observations of
similar kinds into smaller groups within the larger
population. It has widespread application in
business analytics. One of the questions facing
businesses is how to organize the huge amounts of
available data into meaningful structures, or how to
break a large heterogeneous population into smaller
homogeneous groups. Cluster analysis is an
exploratory data analysis tool that aims to sort
objects into groups so that the degree of association
between two objects is maximal if they belong to the
same group and minimal otherwise.

Clustering Analysis
Often the marketer needs to categorize objects into groups
(or clusters) so that the objects in each group are similar,
and the objects in each group are substantially different
from the objects in the other groups. Here are some
examples:
• When Procter & Gamble test markets a new cosmetic, it
may want to group U.S. cities into groups that are similar
on demographic attributes such as percentage of Asians,
percentage of Blacks, and percentage of others.
• A marketing analyst at Coca-Cola wants to segment the
soft drink market based on consumers' price sensitivity,
preference for diet versus regular soda, and preference
for Coke versus Pepsi.

Clustering Analysis
• Flat algorithms
  – Usually start with a random (partial) partitioning
  – Refine it iteratively
    • K-means clustering
    • (Model-based clustering)
• Hierarchical algorithms
  – Bottom-up, agglomerative
  – (Top-down, divisive)

K-means Algorithm
• K-means clustering is a data mining procedure
used to cluster observations into groups of
related observations without any prior
knowledge of those relationships.
• The k-means procedure is one of the simplest
clustering techniques, and it is commonly used
in medical imaging, biometrics, business
solutions, and related fields.

K-means Algorithm
1. Place K points into the space
represented by the objects that
are being clustered. These points
represent initial group centroids.
2. Assign each object to the group
that has the closest centroid.
3. When all objects have been
assigned, recalculate the
positions of the K centroids.
4. Repeat Steps 2 and 3 until the
centroids no longer move. This
produces a separation of the
objects into groups from which
the metric to be minimized can
be calculated.
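
The four steps above can be sketched in a few lines of R. This is a minimal illustration, not the routine used in the slides: the function name `simple_kmeans`, the toy data, and the choice of squared Euclidean distance are all assumptions made for the example. Base R also ships a built-in `kmeans()` function for real use.

```r
# Minimal k-means sketch (squared Euclidean distance); illustrative only.
# `simple_kmeans` is a hypothetical helper name, not a library function.
simple_kmeans <- function(x, centroids, max_iter = 100) {
  x <- as.matrix(x)
  centroids <- as.matrix(centroids)   # Step 1: initial centroids are supplied
  k <- nrow(centroids)
  for (iter in seq_len(max_iter)) {
    # Step 2: squared distance from every object to every centroid,
    # then assign each object to the closest centroid.
    d <- sapply(seq_len(k), function(j) {
      rowSums((x - matrix(centroids[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
    })
    cluster <- apply(d, 1, which.min)
    # Step 3: recalculate each centroid as the mean of its members
    # (a centroid with no members is left where it is).
    new_centroids <- centroids
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) new_centroids[j, ] <- colMeans(members)
    }
    # Step 4: stop once the centroids no longer move.
    if (all(abs(new_centroids - centroids) < 1e-12)) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centers = centroids)
}

# Hypothetical toy data: two well-separated groups of three points.
toy <- rbind(c(1, 1), c(1, 2), c(2, 1), c(8, 8), c(8, 9), c(9, 8))
fit <- simple_kmeans(toy, centroids = toy[c(1, 4), ])
fit$cluster   # cluster label of each point
fit$centers   # final centroids
```

With these toy points the first three rows end up in one cluster and the last three in the other, and each centroid settles at the mean of its three members.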

The K-Means Clustering Method
• Example
[Figure: k-means with K=2 on scatter plots with both axes running from 0 to 10. Panels: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means again.]

Clustering
Hierarchical clustering: This technique operates on a simple
principle: a data point closer to the base point behaves more
similarly to it than a data point that is far from the base point. For
instance, suppose a, b, c, d, e, and f are six students, and we wish to
group them into clusters.

Problem: Cluster the following eight points (with (x, y) representing
locations) into three clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8),
A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9). The initial cluster centers are
A1(2, 10), A4(5, 8), and A7(1, 2). The distance function between two points
a = (x1, y1) and b = (x2, y2) is defined as ρ(a, b) = |x2 – x1| + |y2 – y1|.
Use the k-means algorithm to find the three cluster centers after the second
iteration.

Iteration 1
Point        Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
             (2, 10)       (5, 8)        (1, 2)
A1 (2, 10)
A2 (2, 5)
A3 (8, 4)
A4 (5, 8)
A5 (7, 5)
A6 (6, 4)
A7 (1, 2)
A8 (4, 9)
First we list all points in the first column of the table above. The
initial cluster centers (means) are (2, 10), (5, 8), and (1, 2),
chosen randomly. Next, we calculate the distance from the first
point (2, 10) to each of the three means, using the distance
function:
Point (x1, y1) = (2, 10), mean 1 (x2, y2) = (2, 10):
ρ(point, mean1) = |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10| = 0 + 0 = 0
Point (x1, y1) = (2, 10), mean 2 (x2, y2) = (5, 8):
ρ(point, mean2) = |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10| = 3 + 2 = 5
Point (x1, y1) = (2, 10), mean 3 (x2, y2) = (1, 2):
ρ(point, mean3) = |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10| = 1 + 8 = 9
So, we fill in these values in the table:

Point        Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
             (2, 10)       (5, 8)        (1, 2)
A1 (2, 10)   0             5             9             1
A2 (2, 5)
A3 (8, 4)
A4 (5, 8)
A5 (7, 5)
A6 (6, 4)
A7 (1, 2)
A8 (4, 9)

Cluster 1: (2, 10)
Cluster 2:
Cluster 3:
So, which cluster should the point (2, 10) be placed in? The one, where the point has the shortest distance to the mean –
that is mean 1 (cluster 1), since the distance is 0.
So, we go to the second point (2, 5) and we will calculate the distance to each of the
three means, by using the distance function:
Point (x1, y1) = (2, 5), mean 1 (x2, y2) = (2, 10):
ρ(point, mean1) = |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 5| = 0 + 5 = 5
Point (x1, y1) = (2, 5), mean 2 (x2, y2) = (5, 8):
ρ(point, mean2) = |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 5| = 3 + 3 = 6
Point (x1, y1) = (2, 5), mean 3 (x2, y2) = (1, 2):
ρ(point, mean3) = |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 5| = 1 + 3 = 4
The shortest distance is 4, so the point (2, 5) is placed in cluster 3.

Point        Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
             (2, 10)       (5, 8)        (1, 2)
A1 (2, 10)   0             5             9             1
A2 (2, 5)    5             6             4             3
A3 (8, 4)
A4 (5, 8)
A5 (7, 5)
A6 (6, 4)
A7 (1, 2)
A8 (4, 9)

We fill in the rest of the table in the same way, and place each point in one of the clusters:
Point        Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
             (2, 10)       (5, 8)        (1, 2)
A1 (2, 10)   0             5             9             1
A2 (2, 5)    5             6             4             3
A3 (8, 4)    12            7             9             2
A4 (5, 8)    5             0             10            2
A5 (7, 5)    10            5             9             2
A6 (6, 4)    10            5             7             2
A7 (1, 2)    9             10            0             3
A8 (4, 9)    3             2             10            2

Cluster 1: (2, 10)
Cluster 2: (8, 4), (5, 8), (7, 5), (6, 4), (4, 9)
Cluster 3: (2, 5), (1, 2)
Next, we need to re-compute the new cluster centers (means). We do so, by taking the mean of all
points in each cluster.
For Cluster 1, we only have one point A1(2, 10), which was the old mean, so the cluster center remains
the same.
For Cluster 2, we have ( (8+5+7+6+4)/5, (4+8+5+4+9)/5 ) = (6, 6)
For Cluster 3, we have ( (2+1)/2, (5+2)/2 ) = (1.5, 3.5)
The initial cluster centers are shown as red dots. The new cluster centers are shown as red crosses (x).
That was Iteration 1 (epoch 1). Next, we go to Iteration 2 (epoch 2),
Iteration 3, and so on until the means do not change anymore.
In Iteration 2, we repeat the process from Iteration 1, this time
using the new means we computed.
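
The two iterations asked for in the problem can be checked with a short R sketch. It uses only the points, the initial centers, and the distance function stated in the problem; the helper name `manhattan_to_centers` is hypothetical and introduced just for this example.

```r
# Points A1..A8 from the problem statement.
pts <- matrix(c(2, 10,  2, 5,  8, 4,  5, 8,  7, 5,  6, 4,  1, 2,  4, 9),
              ncol = 2, byrow = TRUE,
              dimnames = list(paste0("A", 1:8), c("x", "y")))
# Initial centers A1, A4, A7.
centers <- pts[c("A1", "A4", "A7"), ]

# Distance rho(a, b) = |x2 - x1| + |y2 - y1| from every point to every center.
# (hypothetical helper name)
manhattan_to_centers <- function(pts, centers) {
  t(apply(pts, 1, function(p) rowSums(abs(sweep(centers, 2, p)))))
}

for (iteration in 1:2) {
  d <- manhattan_to_centers(pts, centers)
  cluster <- apply(d, 1, which.min)   # the closest mean wins
  # New center of each cluster = mean of its member points.
  centers <- t(sapply(1:3, function(j) {
    colMeans(pts[cluster == j, , drop = FALSE])
  }))
  cat("Iteration", iteration, "\n")
  print(cbind(d, cluster))
  print(centers)
}
```

Iteration 1 reproduces the table above and the centers (2, 10), (6, 6), and (1.5, 3.5). In Iteration 2, A8 moves to cluster 1 (its distance to (2, 10) is 3, against 5 to (6, 6)), so the centers after the second iteration are (3, 9.5), (6.5, 5.25), and (1.5, 3.5).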

Clustering Analysis
To illustrate the mechanics of cluster analysis (in Excel), suppose you
want to cluster 49 of America's largest cities. For each city you have
the following demographic data, which will be used as the basis of the
cluster analysis.
• Percentage Black
• Percentage Hispanic
• Percentage Asian
• Median age
• Unemployment rate
• Per capita income
For example, Atlanta's demographic information is as follows: Atlanta
is 67 percent Black, 2 percent Hispanic, 1 percent Asian, has a median
age of 31, a 5 percent unemployment rate, and a per capita income of
$22,000.
Data file: Cluster.xlsx

Clustering Analysis
For now, assume your goal is to group the cities into
four clusters that are demographically similar. Why
four clusters are used is a question taken up later.
The basic idea used to identify the clusters is to
choose a city to “anchor”, or “center”, each cluster.
You assign each city to the “nearest” cluster center.
Your target cell then minimizes the sum of the
squared distances from each city to the closest cluster
anchor.

Clustering Analysis
In the example, if you cluster using the raw attribute levels in the Excel
sheet, the percentages of Blacks and Hispanics in each city will drive
the clusters because these values are more spread out than the other
demographic attributes.
To remedy this problem, you can standardize each demographic
attribute by subtracting the attribute's mean and dividing by the
attribute's standard deviation.
For example, the average city is 24.34 percent Black with a standard
deviation of 18.11 percent. This implies that after standardizing the
percentage of Blacks, Atlanta has 2.35 standard deviations more Blacks
(on a percentage basis) than a typical city. Working with standardized
values for each attribute ensures that your analysis is unit-free and each
attribute has the same effect on your cluster selection. Of course, you
may give a larger weight to any attribute.
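
As a quick check of the standardization, the sketch below computes Atlanta's z-score from the figures quoted above, then standardizes a small made-up column to show that the manual formula matches R's `scale()`. The vector `x` is hypothetical data used only for illustration.

```r
# Figures quoted in the text for the percentage Black attribute.
atlanta_pct <- 67
mean_pct    <- 24.34
sd_pct      <- 18.11

# z-score = (value - mean) / standard deviation
z_atlanta <- (atlanta_pct - mean_pct) / sd_pct
z_atlanta   # close to the 2.35 quoted in the text

# Hypothetical column of percentages (made-up values).
x <- c(67, 3, 12, 40, 25)
z <- (x - mean(x)) / sd(x)

# scale() applies the same formula (it also uses the sample standard deviation).
all.equal(z, as.numeric(scale(x)))
c(mean = mean(z), sd = sd(z))   # mean is 0 up to rounding error, sd is 1
```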

Clustering Analysis
You can use Excel to identify a given
number of clusters. The key in doing so is to
ensure that the cities in each cluster are
demographically similar and cities in different
clusters are demographically different. Using
few clusters enables the marketing analyst to
reduce the 49 U.S. cities to a few (in this
case four) easily interpreted market segments.
To determine the four clusters, follow the
computational steps below.

Clustering Analysis
Compute the Black mean percentage in C1 with the formula
=AVERAGE(C10:C58).

Clustering Analysis
In C2 compute the standard deviation of the Black percentages with the
formula =STDEV(C10:C58).

Clustering Analysis
Copy these formulas to D1:G2 to compute the mean and standard
deviation for each attribute.

Clustering Analysis
In cell I10 compute the standardized percentage of Blacks in
Albuquerque (often called a z-score) with the formula
=STANDARDIZE(C10,C$1,C$2). This formula is equivalent, of
course, to (C10-C$1)/C$2. The reader can verify that for each
demographic attribute the z-scores have a mean of 0 and a standard deviation of 1.

Clustering Analysis
To determine n clusters (in this case n = 4) you define a changing cell
for each cluster to be a city that “anchors” the cluster. For example, if
Memphis is a cluster anchor, each city in the Memphis cluster should be
similar to Memphis demographically, and all cities not in the Memphis
cluster should be different demographically from Memphis. You can
arbitrarily pick four cluster anchors, and for each city in the data set,
you can determine the squared distance (using z-scores) of each city
from each of the four cluster anchors. Then you assign each city the
squared distance to the closest anchor and have your Solver target cell
equal the sum of these squared distances.

Clustering Analysis
In H5:H8 enter “trial values” for cluster anchors. Each of these values
can be any integer between 1 and 49. For simplicity you can let the four
trial anchors be cities 1–4.

Clustering Analysis
After naming A9:N58 as the range Lookup, in G5 look up the name of
the first cluster anchor with the formula =VLOOKUP(H5,Lookup,2).

Clustering Analysis
Copy this formula to G6:G8 to identify the name of each cluster center
candidate.

Clustering Analysis
In I5:N8 identify the z-scores for each cluster anchor candidate by
copying from I5 to I5:N8 the formula =VLOOKUP($H5,Lookup,I$3).

Clustering Analysis
Lookup z score for cluster anchor

Clustering Analysis
You can now compute the squared distance from each city to each
cluster candidate. To compute the distance from city 1 (Albuquerque) to
cluster candidate anchor 1, enter in O10 the formula
=SUMXMY2($I$5:$N$5,$I10:$N10).
This Excel function computes the following:
(I5-I10)^2+(J5-J10)^2+(K5-K10)^2+(L5-L10)^2+(M5-M10)^2+(N5-N10)^2
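
For readers following along in R, the same quantity is a one-line sum of squared differences. The function name `sumxmy2` below simply mirrors the Excel function and is defined here for the example; the two z-score vectors are made-up values.

```r
# Sum of squared differences between two equal-length vectors,
# mirroring what Excel's SUMXMY2 computes.
sumxmy2 <- function(x, y) {
  stopifnot(length(x) == length(y))
  sum((x - y)^2)
}

# Hypothetical z-scores on the six attributes (anchor row vs. city row).
anchor_z <- c(0.5, -0.2, 1.0, 0.0, -1.0, 0.3)
city_z   <- c(0.0, -0.2, 0.0, 0.5, -1.0, 0.3)

# 0.25 + 0 + 1 + 0.25 + 0 + 0 = 1.5
sumxmy2(anchor_z, city_z)
```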

Clustering Analysis
To compute the squared distance of Albuquerque from the second
cluster anchor, enter the same formula in P10 with each 5 changed to a 6.
Similarly, in Q10 change each 5 to a 7, and in R10 change each 5 to an 8.
Copy from O10:R10 to O11:R58 to compute the squared distance of
each city from each cluster anchor.

Clustering Analysis
In S10:S58 compute the squared distance from each city to the “closest” cluster
anchor by entering the formula =MIN(O10:R10) in cell S10 and
copying it to the cell range S11:S58.

Clustering Analysis
In S8 compute the sum of squared distances of all cities from their
cluster anchor with the formula =SUM(S10:S58).

Clustering Analysis
In T10:T58 compute the cluster to which each city is assigned by
entering in T10 the formula =MATCH(S10,O10:R10,0) and copying
this formula to T11:T58. This formula identifies which element in
columns O:R gives the smallest squared distance to the city.
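
The MIN, SUM, and MATCH steps map directly onto row-wise operations in R. The sketch below uses a small made-up matrix of squared distances (three cities by four anchors) standing in for columns O:R; the variable names are hypothetical.

```r
# Hypothetical squared distances: one row per city, one column per anchor
# (the role played by columns O:R in the worksheet).
sq_dist <- rbind(c(0.0, 4.2, 7.1, 3.3),
                 c(5.0, 1.5, 2.5, 9.0),
                 c(6.4, 8.8, 0.9, 2.0))

closest  <- apply(sq_dist, 1, min)        # like =MIN(O10:R10) in column S
total    <- sum(closest)                  # like =SUM(S10:S58) in S8
assigned <- apply(sq_dist, 1, which.min)  # like =MATCH(S10,O10:R10,0) in column T

closest    # 0.0 1.5 0.9
total      # 2.4
assigned   # 1 2 3
```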

Clustering Analysis
Use the Solver window, as shown in this slide, to find the optimal
cluster anchors for the four clusters (those giving the minimum total squared
distance), with S8 as the target cell and H5:H8 as the changing cells.
Choose the Evolutionary Solver. Select Options from the Solver
window and navigate to the Evolutionary tab to set the Mutation rate;
the setting shown there usually improves the performance of the Evolutionary Solver.
Add the following constraints:
H5:H8<=49
H5:H8=INTEGER
H5:H8>=1
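
The objective that Solver minimizes can also be written out in R. The sketch below is not the Evolutionary Solver: as an assumption for illustration it swaps in an exhaustive search over every possible set of anchors, on a tiny made-up data set of eight standardized "cities" and two anchors. The function name `anchor_cost` and the data are hypothetical.

```r
# Total squared distance from every row of z to its closest anchor row.
# `anchors` holds row indices, like the integers entered in H5:H8.
anchor_cost <- function(z, anchors) {
  d <- sapply(anchors, function(a) rowSums(sweep(z, 2, z[a, ])^2))
  sum(apply(d, 1, min))
}

# Hypothetical standardized data: 8 rows, 2 attributes, two obvious groups.
z <- rbind(c(-1.0, -1.1), c(-0.9, -1.0), c(-1.1, -0.9), c(-1.0, -1.0),
           c( 1.0,  1.1), c( 0.9,  1.0), c( 1.1,  0.9), c( 1.0,  1.0))

# Exhaustive search over every pair of anchors (a stand-in for Solver).
candidates <- combn(nrow(z), 2)
costs <- apply(candidates, 2, function(a) anchor_cost(z, a))
best <- candidates[, which.min(costs)]

best         # row indices of the best pair of anchors
min(costs)   # the minimized sum of squared distances
```

On this toy data the search picks rows 4 and 8, the points sitting in the middle of each group, which is the behavior the anchor formulation is meant to produce.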

Clustering Analysis
You can find that the San Francisco cluster consists of rich, older, and
highly Asian cities. The Memphis cluster consists of highly Black cities
with high unemployment rates. The Omaha cluster consists of average
income cities with few minorities. The Los Angeles cluster consists of
highly Hispanic cities with high unemployment rates. From your
clustering of U.S. cities, a company like Procter & Gamble that often
engages in test marketing of a new product could now predict with
confidence that if a new product were successfully marketed in the San
Francisco, Memphis, Los Angeles, and Omaha areas, the product would
succeed in all 49 cities. This is because the demographics of each city in
the data set are fairly similar to the demographics of one of the four
anchor cities.

Clustering Analysis
The file NewMBAdata.xlsx contains
average undergrad GPA, average
GMAT score, percentage acceptance
rate, average starting salary, and
out-of-state tuition and fees for 54 top MBA
programs. Use this data to perform a
cluster analysis with five anchors.

Clustering Analysis in R

Clustering Analysis – Example 1
Consider the scenario of clustering MBA colleges
based on:
• CGPA
• GMAT score
• Acceptance rate
• Placement
• Out-of-state fees
For example, for Harvard, the required CGPA is 3.66, the
GMAT score is 724, the acceptance rate is 11.1, the
average salary bonus is 13973, etc.
Data file: NewMBAdata.csv
SAMPLE DATASET OF MBA COLLEGES

read.csv() usage in R

R IMPLEMENTATION
input <- read.csv("NewMBAdata.csv",header=TRUE)
dim(input)

OUTPUT OF READING THE FILE USING THE read.csv COMMAND;
dim GIVES THE ROWS AND COLUMNS OF THE DATASET

Excluding college names before normalizing, because they are
categorical in nature
# Keep columns 3 to 7, excluding the columns with the college name
mydata <- input[1:54, 3:7]
dim(mydata)

NORMALIZING THE DATA

SAMPLE VIEW OF NORMALIZED DATASET USING View() COMMAND IN R
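
The normalization step on these slides is shown only as a screenshot. A minimal sketch of how it could look in R is given below, assuming the numeric columns sit in a data frame called `mydata`; the small data frame here is made up so that the example runs on its own, and its column names are hypothetical.

```r
# Hypothetical stand-in for the numeric MBA columns (values are made up).
mydata <- data.frame(gpa  = c(3.6, 3.4, 3.5, 3.2, 3.3),
                     gmat = c(720, 690, 700, 640, 660),
                     fees = c(38000, 30000, 35000, 22000, 25000))

# scale() subtracts each column's mean and divides by its standard deviation.
normalized_data <- scale(mydata)

round(colMeans(normalized_data), 10)   # every column mean is 0
apply(normalized_data, 2, sd)          # every column standard deviation is 1

# In an interactive session, View(normalized_data) opens a spreadsheet-style
# view of the result.
```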

K-means clustering usage and parameters

K-means parameter: within sum of squares

plot() function usage

ASCERTAINING NUMBER OF CLUSTERS
[Annotated code screenshot. Labels: within sum of squares; number of clusters; initial seed to start clustering; within sum of squares, i.e., intra-cluster distance; plot option used to join points.]
The within-groups sum of squares is calculated for different
numbers of clusters k = 1, 2, 3, 4, 5, etc. and plotted against k.
The result is an elbow curve; the region of the bend indicates the
optimal number of clusters.
Interpretation of the scree plot
The within-groups sum of squares is to be minimized and the between-groups
sum of squares is to be maximized. The point where the elbow is
attained represents the most suitable number of clusters.

ASCERTAINING NUMBER OF CLUSTERS
The within-groups sum of squares is calculated and plotted for different
cluster values k = 1, 2, 3, 4, 5, etc., and an elbow curve is formed; the region
of the bend represents the optimal number.
The bend indicates optimal k (number of clusters) = 2
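
The elbow curve described above can be produced with a short loop. This is a sketch under stated assumptions: the data are simulated (two made-up groups) rather than the MBA file, and the variable names `raw`, `normalized_data`, `max_k`, and `wss` are chosen for the example. It uses only base R functions (`kmeans()`, `scale()`, `plot()`).

```r
set.seed(42)
# Hypothetical data: two groups of 20 points in two dimensions, standardized.
raw <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))
normalized_data <- scale(raw)

max_k <- 6
wss <- numeric(max_k)
# For k = 1 the within sum of squares equals the total sum of squares.
wss[1] <- (nrow(normalized_data) - 1) * sum(apply(normalized_data, 2, var))
# For k >= 2 take the total within-cluster sum of squares from kmeans().
for (k in 2:max_k) {
  wss[k] <- kmeans(normalized_data, centers = k, nstart = 10)$tot.withinss
}

# Elbow (scree) plot: look for the bend in the curve.
plot(1:max_k, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Within groups sum of squares")
```

With two clearly separated groups, the curve drops sharply from k = 1 to k = 2 and flattens afterwards, so the bend sits at k = 2.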

K-means clustering command:
kmeans(x, centers = 2), where x is the normalized data

Clusters assigned based on K-means

cbind() parameters and usage

Assigning clusters to the dataset: cbind()
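
A minimal sketch of the clustering and assignment steps follows. The data frame and its column names are made up for illustration; `kmeans()` takes the number of clusters through its `centers` argument, and `cbind()` attaches the resulting labels to the original rows.

```r
set.seed(1)
# Hypothetical stand-in data with two numeric columns (made-up values).
mydata <- data.frame(score = c(1, 2, 1.5, 9, 10, 9.5),
                     fee   = c(10, 12, 11, 50, 52, 51))
normalized_data <- scale(mydata)

# K-means with two clusters; nstart repeats the random start several times.
fit <- kmeans(normalized_data, centers = 2, nstart = 10)

# Attach the cluster label of each row to the original (unscaled) data.
clustered <- cbind(mydata, cluster = fit$cluster)
clustered

# Average of each attribute within each cluster, to help interpret them.
aggregate(. ~ cluster, data = clustered, FUN = mean)
```

Which group receives label 1 and which receives label 2 depends on the random start, so only the grouping itself should be interpreted.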

ASSIGNED CLUSTERS TO MBA COLLEGES: A SAMPLE VIEW OF DATASET

plotcluster() function and usage

Plotting the clusters using plotcluster() in the fpc package of R

Cluster plot revealing k=2

Clustering Analysis –Example 2
To illustrate the mechanics of cluster analysis, suppose you want to
cluster 49 of America's largest cities. For each city you have the
following demographic data, which will be used as the basis of the cluster
analysis.
• Percentage Black
• Percentage Hispanic
• Percentage Asian
• Median age
• Unemployment rate
• Per capita income
For example, Atlanta's demographic information is as follows:
Atlanta is 67 percent Black, 2 percent Hispanic, 1 percent Asian, has a
median age of 31, a 5 percent unemployment rate, and a per capita
income of $22,000.
Data file: clust.csv

R IMPLEMENTATION
Reading the input clust.csv file for analysis

Excluding city names before normalizing, because they are
categorical in nature, then click RUN

NORMALIZING THE DATA

SAMPLE VIEW OF NORMALIZED DATASET USING View() COMMAND IN R

ASCERTAINING NUMBER OF CLUSTERS

The bend indicates optimal k (number of clusters) = 4

K-means clustering command:
kmeans(x, centers = 4), where x is the normalized data

Clusters assigned based on K-means

Assigning clusters to the dataset: cbind()

ASSIGNED CLUSTERS TO CITIES: A SAMPLE VIEW OF DATASET

Plotting the clusters using plotcluster() in the fpc package of R

Cluster plot revealing k=4

Can you do your clustering using R for 6-Universities_Clustering?

Hierarchical Clustering
This method creates a hierarchical decomposition of the given set of data
objects. We can classify hierarchical methods on the basis of how the
hierarchical decomposition is formed. There are two approaches:
• Agglomerative approach
• Divisive approach
Agglomerative Approach
This approach is also known as the bottom-up approach. We start
with each object forming a separate group. The method keeps merging the objects or
groups that are close to one another until all of the
groups are merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. We start with
all of the objects in the same cluster. In each successive iteration, a cluster is
split into smaller clusters. This is done until each object is in its own cluster or the
termination condition holds.
Hierarchical methods are rigid, i.e., once a merging or
splitting is done, it can never be undone.
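
The agglomerative approach can be tried on a tiny example before moving to a data file. The sketch below uses made-up two-dimensional coordinates for six objects labeled a to f; the coordinates and the choice of complete linkage are assumptions for illustration. `dist()`, `hclust()`, and `cutree()` are base R functions.

```r
# Hypothetical 2-D coordinates for six objects a..f (made-up values).
pts <- rbind(a = c(1, 1),   b = c(1.5, 1),
             c = c(5, 5),   d = c(5, 5.5),
             e = c(9, 1),   f = c(9.5, 1.2))

# Pairwise Euclidean distances between the objects.
d <- dist(pts, method = "euclidean")

# Bottom-up merging: start with six singleton groups and repeatedly
# merge the two closest groups (complete linkage).
fit <- hclust(d, method = "complete")
fit$merge    # order in which objects and groups were merged
plot(fit)    # dendrogram

# Stop at three groups (the termination condition in this example).
groups <- cutree(fit, k = 3)
groups
```

Cutting the tree at three groups pairs a with b, c with d, and e with f, which matches what the coordinates suggest by eye.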
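6-Universities_Clustering
Hierarchical
--------------------------------
# Read the comma-separated data file chosen interactively
input <- read.table(file.choose(), sep = ",", header = TRUE)
names(input)
mydata <- input[1:25, 1:7]
mydata
# Normalize columns 2 to 7: subtract the mean and divide by the standard deviation
normalized_data <- scale(mydata[, 2:7])
normalized_data
# Euclidean distance matrix between the rows
d <- dist(normalized_data, method = "euclidean")
# Agglomerative hierarchical clustering (hclust defaults to complete linkage)
fit <- hclust(d)
plot(fit)   # dendrogram
# Cut the tree into four clusters and outline them on the dendrogram
groups <- cutree(fit, k = 4)
rect.hclust(fit, k = 4, border = "red")
membership <- as.matrix(groups)
membership
# Attach the cluster membership to the original data
hclustering <- data.frame(mydata, membership)
hclustering

Thank you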
