Part 1
Introduction to Data Mining with Case Studies. Author: G. K. Gupta. Prentice Hall India, 2006.
Cluster Analysis
Unsupervised classification
Aims to decompose or partition a data set into clusters such that each cluster is similar within itself but as dissimilar as possible to other clusters: inter-cluster (between-groups) distance is maximized and intra-cluster (within-group) distance is minimized.
Differs from classification: there are no predefined classes and no training data, and we may not even know how many clusters to expect.
24/04/2014 GKGupta 2
Cluster Analysis
Not a new field: a branch of statistics.
Many old algorithms are available.
Renewed interest due to data mining and machine learning.
New algorithms are being developed, in particular to deal with large amounts of data.
Classification Example

Sore Throat   Fever   Swollen Glands   Congestion   Headache   Diagnosis
Yes           Yes     Yes              Yes          Yes        Strep throat
No            No      No               Yes          Yes        Allergy
Yes           Yes     No               Yes          No         Cold
Yes           No      Yes              No           No         Strep throat
No            Yes     No               Yes          No         Cold
No            No      No               Yes          No         Allergy
No            No      Yes              No           No         Strep throat
Yes           No      No               Yes          Yes        Allergy
No            Yes     No               Yes          Yes        Cold
Yes           Yes     No               Yes          Yes        Cold
Clustering Example
(From text by Roiger and Geatz)

[Table not recovered in this transcript: the patient data with attributes Sore Throat, Swollen Glands, Fever, Congestion and Headache.]
Cluster Analysis
As in classification, each object has several attributes.
Some methods require that the number of clusters be specified by the user and that a seed to start each cluster be given. The objects are then clustered based on their similarity to the seeds.
Questions
How do we define a cluster?
How do we determine whether objects are similar?
How do we define a concept of distance between individual objects, and between sets of objects?
Types of Data
Interval-scaled data - continuous variables on a roughly linear scale, e.g. marks, age, weight. Units can affect the analysis: a distance in metres is obviously a larger number than the same distance in kilometres. May need to scale or normalise the data (how?).
Types of Data
Nominal data - similar to binary but can take more than two states, e.g. colour, staff position. How to compute the distance between two objects with nominal data?
Ordinal data - similar to nominal but the different values are ordered in a meaningful sequence.
Distance
A simple, well-understood concept. Distance has the following properties (assume x and y to be vectors):
distance is never negative
distance from x to x is zero
distance from x to y cannot be greater than the sum of the distances from x to z and from z to y
distance from x to y is the same as from y to x.
Examples: the sum of absolute differences |x - y| (the Manhattan distance) or the square root of the sum of squared differences (the Euclidean distance).
Distance
Three common distance measures are:
1. Manhattan distance, or the sum of absolute differences: D(x, y) = Σi |xi - yi|
2. Euclidean distance: D(x, y) = √( Σi (xi - yi)² )
3. Maximum difference: D(x, y) = maxi |xi - yi|
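The three measures can be sketched in Python as follows (a minimal illustration; the function names are my own):

```python
# Three common distance measures between two attribute vectors x and y.
def manhattan(x, y):
    # sum of absolute differences
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # square root of the sum of squared differences
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def max_difference(x, y):
    # largest difference over any single attribute
    return max(abs(a - b) for a, b in zip(x, y))

# e.g. for x = (0, 0) and y = (3, 4):
# manhattan -> 7, euclidean -> 5.0, max_difference -> 4
```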
Categorical Data
It is difficult to find a sensible distance measure between, say, China and Australia, or between a male and a female, or between a student doing BIT and one doing MIT.
One possibility is to convert all categorical data into numeric data (does that make sense?). Another possibility is to treat the distance between two categorical values as a function that is 0 when the values match and 1 when they do not.
Distance
Using the approach of treating the distance between two categorical values as 0 or 1, it is possible to compare two objects whose attribute values are all categorical.
Essentially it is just counting how many attribute values are the same. Consider the example on the next page and compare one patient with another.
Distance
(From text by Roiger and Geatz)

[Table not recovered in this transcript: patient attributes Sore Throat, Fever, Swollen Glands, Congestion, Headache.]
Types of Data
One possibility:
q = number of attributes for which both objects have value 1 (or yes)
t = number of attributes for which both objects have value 0 (or no)
r, s = counts of attributes for which the values do not match

d = (r + s) / (q + r + s + t)
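For two objects with binary (yes/no) attributes, this formula can be sketched as (a minimal illustration, not code from the text):

```python
# Dissimilarity between two binary attribute vectors:
# q = positions where both are 1, t = positions where both are 0,
# r + s = positions where the values do not match.
def binary_dissimilarity(a, b):
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    mismatch = sum(1 for x, y in zip(a, b) if x != y)  # r + s
    return mismatch / (mismatch + q + t)

# Two patients agreeing on 3 of 5 symptoms:
print(binary_dissimilarity([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # 0.4
```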
Distance Measure
Need to look carefully at the scales of the different attributes. For example, GPA varies from 0 to 4 while the age attribute varies from, say, 20 to 50. The total distance, the sum of the distances between all attribute values, is then dominated by the age attribute and is affected little by variations in GPA. To avoid such problems the data should be scaled or normalized (perhaps so that all attributes vary between -1 and 1).
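One simple way to do this, min-max scaling into the range [-1, 1], can be sketched as (an illustration, not the text's prescription):

```python
# Min-max scale a list of attribute values into [lo, hi] so that no
# attribute dominates the distance merely because of its units.
def scale(values, lo=-1.0, hi=1.0):
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

ages = [20, 35, 50]
print(scale(ages))  # [-1.0, 0.0, 1.0]
```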
Distance Measure
The distance measures we have discussed so far give equal importance to all attributes.
One may want to assign weights to the attributes indicating their importance. The distance measure then needs to use these weights in computing the total distance.
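A weighted variant of, say, the Manhattan distance can be sketched as (the weights below are illustrative):

```python
# Manhattan distance with per-attribute importance weights.
def weighted_manhattan(x, y, weights):
    return sum(w * abs(a - b) for w, a, b in zip(weights, x, y))

# Give the first attribute twice the importance of the second:
print(weighted_manhattan((0, 0), (1, 2), (2.0, 0.5)))  # 3.0
```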
Clustering
It is not always easy to predict how many clusters to expect from a set of data.
One may choose a number of clusters based on some knowledge of the data. The result may be an acceptable set of clusters. This does not, however, mean that a different number of clusters would not provide an even better set. Choosing the number of clusters is a difficult problem.
Types of Methods
Partitioning methods - given n objects, make k (k ≤ n) partitions (clusters) of the data and use an iterative relocation method. It is assumed that each cluster has at least one object and each object belongs to only one cluster.
Hierarchical methods - either start with all objects in one cluster and repeatedly split clusters (divisive), or start with each object in its own cluster and repeatedly merge the nearest clusters (agglomerative).
Types of Methods
Density-based methods - for each data point in a cluster, at least a minimum number of points must exist within a given radius.
Grid-based methods - the object space is divided into a grid.
Model-based methods - a model is assumed, perhaps based on a probability distribution.
Partitional Methods
The K-Means Method: perhaps the most commonly used method. Select the number of clusters k and k seeds to act as initial cluster centres, then divide the given objects among the clusters by assigning each object to its nearest centre. The means of the resulting clusters become the new centres, the objects are reallocated, and the process repeats until the clusters no longer change.
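The whole loop can be sketched as follows (a minimal illustration that, like the example below, takes the first k records as seeds; not the book's code):

```python
# Minimal K-Means sketch: assign each point to the nearest centre
# (squared Euclidean distance), recompute the means, repeat until stable.
def kmeans(points, k, max_iters=100):
    centroids = [list(p) for p in points[:k]]  # first k records as seeds
    clusters = []
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        new_centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no change: stop
            break
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
print(clusters)
```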
Example

Student   Age   Marks1   Marks2   Marks3
S1        18    73       75       57
S2        18    79       85       75
S3        23    70       70       52
S4        20    55       55       55
S5        22    85       86       87
S6        19    91       90       89
S7        20    70       65       65
S8        21    53       56       59
S9        19    82       82       60
S10       40    76       60       77
Example

Let the three seeds be the first three records:

Student   Age   Mark1   Mark2   Mark3
S1        18    73      75      57
S2        18    79      85      75
S3        23    70      70      52
Now we compute the distances between the objects based on the four attributes. K-Means requires Euclidean distances, but we use the sum of absolute differences as well, to show how the distance metric can change the results. The table on the next page shows the distances and the allocation of each object to its nearest seed.
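As a check on the table, the distances from S4 to the three seeds can be recomputed (a sketch; Euclidean distances are rounded to the nearest integer as on the slide):

```python
# Seed records and S4 from the student example.
seeds = {"S1": (18, 73, 75, 57), "S2": (18, 79, 85, 75), "S3": (23, 70, 70, 52)}
s4 = (20, 55, 55, 55)

manhattan = {k: sum(abs(a - b) for a, b in zip(s4, v)) for k, v in seeds.items()}
euclidean = {k: round(sum((a - b) ** 2 for a, b in zip(s4, v)) ** 0.5)
             for k, v in seeds.items()}
print(manhattan)  # {'S1': 42, 'S2': 76, 'S3': 36}: nearest seed is S3
print(euclidean)  # {'S1': 27, 'S2': 43, 'S3': 22}: nearest seed is S3
```

Under both metrics S4 is allocated to the cluster of seed S3, matching the table.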
Distances

Student   Age   Marks1   Marks2   Marks3   Manhattan (to seeds 1/2/3)   Cluster   Euclidean (to seeds 1/2/3)   Cluster
S1        18    73       75       57       Seed 1                       C1        Seed 1                       C1
S2        18    79       85       75       Seed 2                       C2        Seed 2                       C2
S3        23    70       70       52       Seed 3                       C3        Seed 3                       C3
S4        20    55       55       55       42/76/36                     C3        27/43/22                     C3
S5        22    85       86       87       57/23/67                     C2        34/14/41                     C2
S6        19    91       90       89       66/32/82                     C2        40/19/47                     C2
S7        20    70       65       65       23/41/21                     C3        13/24/14                     C1
S8        21    53       56       59       44/74/40                     C3        28/42/23                     C3
S9        19    82       82       60       20/22/36                     C1        12/16/19                     C1
S10       40    76       60       77       60/52/58                     C2        33/34/32                     C3
Example

Note the different allocations produced by the two distance metrics. K-Means uses the Euclidean distance.
The means of the new clusters are also different, as shown on the next slide. We now start with the new means and compute distances again. The process continues until there is no change.
Cluster Means

Manhattan:
Cluster   Age    Mark1   Mark2   Mark3
C1        18.5   77.5    78.5    58.5
C2        24.8   82.8    80.3    82
C3        21     62      61.5    57.8

Original seeds, for comparison:
S1        18     73      75      57
S2        18     79      85      75
S3        23     70      70      52

Euclidean:
Cluster   Age    Mark1   Mark2   Mark3
C1        19.0   75.0    74.0    60.7
C2        19.7   85.0    87.0    83.7
C3        26.0   63.5    60.3    60.8
Distances

Distances are measured from the Euclidean cluster means C1 = (19.0, 75.0, 74.0, 60.7), C2 = (19.7, 85.0, 87.0, 83.7), C3 = (26.0, 63.5, 60.3, 60.8).

Student   Age   Marks1   Marks2   Marks3   Distance (C1/C2/C3)   Cluster
S1        18    73       75       57       8/52/36               C1
S2        18    79       85       75       30/18/63              C2
S3        23    70       70       52       22/67/28              C1
S4        20    55       55       55       46/91/26              C3
S5        22    85       86       87       51/7/78               C2
S6        19    91       90       89       60/15/93              C2
S7        20    70       65       65       19/56/22              C1
S8        21    53       56       59       44/89/22              C3
S9        19    82       82       60       16/32/48              C1
S10       40    76       60       77       52/63/43              C3
Comments
Euclidean distance is used in the last slide since that is what K-Means requires.
The results of using Manhattan distance and Euclidean distance are quite different. The last iteration changed the cluster membership of S3 from C3 to C1. New means are now computed, followed by computation of new distances. The process continues until there is no change.
Distance Metric
Most important issue in methods like the K-Means method is the distance metric although K-Means is married to Euclidean distance. There is no real definition of what metric is a good one but the example shows that different metrics, used on the same data, can produce different results making it very difficult to decide which result is the best.
Whatever metric is used is somewhat arbitrary although extremely important. Euclidean distance is used in the next example.
Example
This example deals with data about countries.
Three seeds, France, Indonesia and Thailand, were chosen randomly. Euclidean distances were computed and nearest neighbours were put together in clusters. The result is shown on the second slide. Another iteration did not change the clusters.
Another Example
[Country data slide not recovered in this transcript.]

Example
[Clustering result slide not recovered in this transcript.]
Notes
The scales of the attributes in the examples are different. This impacts the results.
The attributes are not independent. For example, there is a strong correlation between Olympic medals and income.
The examples are very simple and too small to illustrate the methods fully.
Number of clusters - choosing it can be difficult since there is no generally accepted procedure for determining the number. Maybe use trial and error and then decide based on within-group and between-group distances.
How to interpret the results?
Validating Clusters
A difficult problem.
Statistical testing may be possible: one could test the hypothesis that there are no clusters. Maybe data is available in which the clusters are already known.
Hierarchical Clustering
Quite different from partitioning. Involves gradually merging objects into clusters (called agglomerative) or dividing large clusters into smaller ones (called divisive).
We consider one method.
Agglomerative Clustering
The algorithm normally starts with each cluster consisting of a single data point. Using a measure of distance, the algorithm merges two clusters that are nearest, thus reducing the number of clusters. The process continues until all the data points are in one cluster.
The algorithm requires that we be able to compute distances between two objects and also between two clusters.
Algorithm
1. Assign each object to a cluster of its own.
2. Build a distance matrix by computing the distances between every pair of objects.
3. Find the two nearest clusters.
4. Merge them and remove them from the list.
5. Update the matrix by including the distance of all objects from the new cluster.
6. Continue until all objects belong to one cluster.
The next slide shows an example of the final result of hierarchical clustering.
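These steps can be sketched as follows (a minimal illustration using Manhattan distance between cluster centroids, as the later example does; not the book's code):

```python
# Agglomerative clustering sketch: repeatedly merge the two clusters
# whose centroids are nearest, until one cluster remains.
def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def agglomerate(points):
    clusters = [[p] for p in points]  # start: one cluster per object
    merges = []                       # record each merge in order
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # find the pair of clusters with the nearest centroids
        i, j = min(pairs, key=lambda ij: manhattan(
            centroid(clusters[ij[0]]), centroid(clusters[ij[1]])))
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# The two nearest objects are merged first:
print(agglomerate([(0, 0), (1, 0), (10, 0)])[0])  # [(0, 0), (1, 0)]
```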
Hierarchical Clustering
http://www.resample.com/xlminer/help/HClst/HClst_ex.htm
Example
We now consider the simple example about student marks, using the Manhattan distance between two objects and the centroid method for the distance between clusters. We first calculate a matrix of distances.
Example

Student   Age   Marks1   Marks2   Marks3
971234    18    73       75       57
994321    18    79       85       75
965438    23    70       70       52
987654    20    55       55       55
968765    22    85       86       87
978765    19    91       90       89
984567    20    70       65       60
985555    21    53       56       59
998888    19    82       82       60
995544    47    75       76       77
Distance Matrix
[Matrix not recovered in this transcript.]
Example

The matrix gives the distance of each object from every other object. The smallest distance, 8, is between object 4 and object 8. They are combined into a cluster, which is placed where object 4 was. We then compute distances from this cluster and update the distance matrix.
Updated Matrix
[Matrix not recovered in this transcript.]
Note
The smallest distance is now 15, between objects 5 and 6. They are combined into a cluster, and objects 5 and 6 are removed. We then compute distances from this cluster and update the distance matrix.
Updated Matrix
[Matrix not recovered in this transcript.]
Next
Looking at the shortest distance again: S3 and S7 are a distance of 16 apart. We merge them and put C3 where S3 was. The updated distance matrix is given on the next slide. It shows that C2 and S1 now have the smallest distance, so they are merged in the next step. We stop short of finishing the example.
Updated Matrix
[Matrix not recovered in this transcript.]
Another Example
[Country distance matrix not recovered in this transcript.]
Next steps
Malaysia and Philippines have the shortest distance (France and UK are also very close). In the next step we combine Malaysia and Philippines and replace the two countries by this cluster. Then update the distance matrix. Then choose the shortest distance and so on until all countries are in one large cluster.
Divisive Clustering
Opposite to the last method, and less commonly used.
We start with one cluster containing all the objects and seek to split it into components, which are themselves then split into even smaller components. The divisive method may be based on using one variable at a time or on using all the variables together. The algorithm terminates when a termination condition is met or when each cluster has only one object.
Hierarchical Clustering
The ordering produced can be useful in gaining some insight into the data.
The major difficulty is that once an object is allocated to a cluster it cannot be moved to another cluster, even if the initial allocation was incorrect.
Different distance metrics can produce different results.
Challenges
Scalability - would the method work with millions of objects (e.g. Cluster analysis of Web pages)?
Ability to deal with different types of attributes.
Discovery of clusters with arbitrary shape - often algorithms discover spherical clusters, which are not always suitable.
Challenges
Minimal requirements for domain knowledge to determine input parameters - often input parameters are required e.g. how many clusters. Algorithms can be sensitive to inputs.
Ability to deal with noisy or missing data.
Insensitivity to the order of input records - the order of input should not influence the result of the algorithm.
Challenges
High dimensionality - data can have many attributes e.g. Web pages
Constraint-based clustering - sometimes clusters may need to satisfy constraints.