

Clustering: An Introduction

What is Clustering?

Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding a structure in a collection of unlabeled data.

A loose definition of clustering could be the process of organizing objects into groups whose members are similar in some way.

A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters. We can show this with a simple graphical example:

In this case we easily identify the 4 clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are close according to a given distance (in this case geometrical distance). This is called distance-based clustering.


Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if the cluster defines a concept common to all of those objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.

The Goals of Clustering


So, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute best criterion which would be independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering will suit their needs. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding natural clusters and describing their unknown properties (natural data types), in finding useful and suitable groupings (useful data classes), or in finding unusual data objects (outlier detection).

Possible Applications

Clustering algorithms can be applied in many fields, for instance:


Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records;

Biology: classification of plants and animals given their features;

Libraries: book ordering;

Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;

City-planning: identifying groups of houses according to their house type, value and geographical location;

Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;

WWW: document classification; clustering weblog data to discover groups of similar access patterns.

Requirements

The main requirements that a clustering algorithm should satisfy are:


scalability;
dealing with different types of attributes;
discovering clusters with arbitrary shape;
minimal requirements for domain knowledge to determine input parameters;
ability to deal with noise and outliers;
insensitivity to the order of input records;
ability to handle high dimensionality;
interpretability and usability.

Problems

There are a number of problems with clustering. Among them:

current clustering techniques do not address all the requirements adequately (and concurrently);

dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity;

the effectiveness of the method depends on the definition of distance (for distance-based clustering);

if an obvious distance measure doesn't exist, we must define it, which is not always easy, especially in multi-dimensional spaces;

the result of the clustering algorithm (which in many cases can itself be arbitrary) can be interpreted in different ways.

Clustering Algorithms


Classification

Clustering algorithms may be classified as listed below:


Exclusive Clustering
Overlapping Clustering
Hierarchical Clustering
Probabilistic Clustering

In the first case data are grouped in an exclusive way, so that if a certain datum belongs to one cluster it cannot be included in another. A simple example of this is shown in the figure below, where the separation of points is achieved by a straight line on a two-dimensional plane. The second type, overlapping clustering, instead uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership. In this case, each datum is associated with an appropriate membership value.

A hierarchical clustering algorithm, on the other hand, is based on repeatedly merging the two nearest clusters.

The starting condition is realized by treating every datum as its own cluster; after a number of iterations the desired final clusters are reached. Finally, the last kind of clustering uses a completely probabilistic approach.

In this tutorial we present four of the most commonly used clustering algorithms:


K-means
Fuzzy C-means
Hierarchical clustering
Mixture of Gaussians

Distance Measure

An important component of a clustering algorithm is the distance measure between data points. If the components of the data instance vectors are all in the same physical units, then it is possible that the simple Euclidean distance metric is sufficient to successfully group similar data instances. However, even in this case the Euclidean distance can sometimes be misleading. The figure below illustrates this with an example based on the width and height measurements of an object. Despite both measurements being taken in the same physical units, an informed decision has to be made as to the relative scaling. As the figure shows, different scalings can lead to different clusterings.


Notice, however, that this is not only a graphical issue: the problem arises from the mathematical formula used to combine the distances between the single components of the data feature vectors into a unique distance measure that can be used for clustering purposes: different formulas lead to different clusterings.
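To make the scaling issue concrete, the short sketch below (with made-up width/height numbers) shows that simply expressing one component in different units can change which point is considered nearest; the data, the helper function name and the rescaling factor are illustrative assumptions, not part of the original example.

```python
import numpy as np

# Made-up width/height measurements of three objects (same physical units).
X = np.array([[0.0, 0.0],
              [1.0, 8.0],
              [9.0, 1.0]])

def nearest_to_first(data):
    # Index of the point closest to data[0] under Euclidean distance.
    d = np.linalg.norm(data[1:] - data[0], axis=1)
    return 1 + int(np.argmin(d))

print(nearest_to_first(X))             # -> 1: height differences dominate

X_rescaled = X * np.array([0.1, 1.0])  # express the width in different units
print(nearest_to_first(X_rescaled))    # -> 2: rescaling changes the grouping
```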

Minkowski Metric
For higher-dimensional data, a popular measure is the Minkowski metric

    d_p(x_i, x_j) = ( sum_{k=1..d} |x_{i,k} - x_{j,k}|^p )^(1/p)

where d is the dimensionality of the data. The Euclidean distance is a special case where p=2, while the Manhattan metric has p=1. However, there are no general theoretical guidelines for selecting a measure for any given application. It is often the case that the components of the data feature vectors are not immediately comparable. It can be that the components are not continuous variables, like length, but nominal categories, such as the days of the week. In these cases again, domain knowledge must be used to formulate an appropriate measure.
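As a minimal illustration, the following sketch computes the Minkowski distance for a chosen p; the function name and the sample vectors are just for demonstration.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance between two d-dimensional vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

a, b = [0, 0, 0], [1, 2, 2]
print(minkowski(a, b, p=2))   # Euclidean case: 3.0
print(minkowski(a, b, p=1))   # Manhattan case: 5.0
```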


Details of Clustering Algorithms


Non-hierarchical clustering methods

Single-pass methods

Similarity: Computed between input and all representatives of existing clusters

Example --- cover coefficient algorithm of Can et al.: Select set of documents as cluster seeds; assign each document to cluster that maximally covers it

Time: O(N log N)
Space: O(M)
Advantages: Simple, requiring only one pass through data; may be useful as starting point for reallocation methods

Disadvantages: Produces large clusters early in process; clusters formed are dependent on order of input data

Algorithm
1. Assign the first document D1 as the representative of cluster C1.
2. Calculate the similarity Sj between document Di and the representative of each existing cluster, keeping track of the largest, Smax.
3. If Smax is greater than Sthreshold, add the document to the appropriate cluster; else create a new cluster with centroid Di.
4. If documents remain, repeat from step 2.
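A minimal sketch of this single-pass scheme is shown below. It assumes documents arrive as (roughly unit-length) numeric vectors and uses a cosine-style similarity against a running-mean centroid; the function name and the threshold parameter are illustrative, and this is not the cover-coefficient algorithm itself.

```python
import numpy as np

def single_pass_cluster(docs, threshold):
    """Single-pass clustering sketch: one pass over the documents,
    comparing each to the current cluster representatives (centroids)."""
    centroids, members = [], []
    for i, d in enumerate(docs):
        d = np.asarray(d, dtype=float)
        if not centroids:
            centroids.append(d.copy())          # step 1: D1 starts cluster C1
            members.append([i])
            continue
        # step 2: similarity of Di to every cluster representative
        sims = [float(d @ c) / (np.linalg.norm(c) or 1.0) for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] > threshold:              # step 3: Smax vs. Sthreshold
            members[best].append(i)
            # keep the representative as the running mean of its members
            centroids[best] += (d - centroids[best]) / len(members[best])
        else:
            centroids.append(d.copy())
            members.append([i])
    return members                              # clusters depend on input order
```

Calling single_pass_cluster(vectors, threshold=0.9) returns each cluster's member indices in input order, which makes the order-dependence noted above easy to observe.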


Reallocation methods

Similarity: Allows various definitions of similarity / cluster cohesion
Type of clusters: Refinements of initial input clustering
Time: O(MN)
Space: O(M + N)
Advantages: Allows processing of larger data sets than other methods
Disadvantages: Can take a long time to converge to a solution, depending on the appropriateness of the reallocation criteria to the structure of the data

Algorithm
1. Select M cluster representatives or centroids.
2. Assign each document to the most similar centroid.
3. Recalculate the centroid for each cluster.
4. Repeat steps 2 and 3 until there is little change in cluster membership from pass to pass.
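A rough numpy sketch of this reallocation loop is given below. It assumes documents are numeric vectors, uses Euclidean distance as the (dis)similarity, and expects the caller to supply the M initial representatives; the names and the simple convergence test are illustrative.

```python
import numpy as np

def reallocate(docs, centroids, max_passes=100):
    """Reallocation sketch: reassign documents to the nearest centroid and
    recompute centroids until membership stops changing."""
    docs = np.asarray(docs, dtype=float)
    centroids = np.array(centroids, dtype=float)   # step 1: M representatives
    assignment = np.full(len(docs), -1)
    for _ in range(max_passes):
        # step 2: assign each document to the most similar (nearest) centroid
        dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            break                                   # step 4: little change -> stop
        assignment = new_assignment
        # step 3: recalculate each centroid as the mean of its documents
        for k in range(len(centroids)):
            if np.any(assignment == k):
                centroids[k] = docs[assignment == k].mean(axis=0)
    return assignment, centroids
```

For example, reallocate(X, X[np.random.choice(len(X), 5, replace=False)]) refines five randomly seeded clusters.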

Hierarchical clustering methods


single link method


Similarity: Join the most similar pair of objects that are not yet in the same cluster. Distance between 2 clusters is the distance between the closest pair of points, each of which is in one of the two clusters.

Type of clusters: Long straggly clusters, chains, ellipsoidal
Time: Usually O(N**2), though can range from O(N log N) to O(N**5)
Space: O(N)
Advantages: Theoretical properties, efficient implementations, widely used. No cluster centroid or representative required, so no need arises to recalculate the similarity matrix.

Disadvantages: Unsuitable for isolating spherical or poorly separated clusters

SLINK Algorithm

Arrays record cluster pointer and distance information.
Process input documents one by one. For each:
Compute and store a row of the distance matrix
Find the nearest other point, using the matrix
Relabel clusters, as needed
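Rather than reproducing the SLINK pointer bookkeeping, here is a usage sketch with SciPy's single-link implementation; the random data and the choice of four flat clusters are arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                        # 20 points in 2-D
Z = linkage(X, method='single')                  # single-link hierarchy
labels = fcluster(Z, t=4, criterion='maxclust')  # cut into at most 4 flat clusters
print(labels)
```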

MST Algorithms (see Prim, Kruskal algorithm animations)


MST has all info needed to generate single link hierarchy in O(N**2) operations, or the single link hierarchy can be built at the same time as MST.

The Prim-Dijkstra algorithm operates as follows:

Place an arbitrary document in the MST and connect its nearest neighbour (NN) to it, to create an initial MST fragment.

Find the document not in the MST that is closest to any point in the MST, and add it to the current MST fragment.

If a document remains that is not in the MST fragment, repeat the previous step.
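A compact sketch of this Prim-style construction is shown below, under the assumption that documents are Euclidean vectors; the function name is illustrative. Sorting the returned edges by weight gives the sequence of single-link merges.

```python
import numpy as np

def prim_mst(points):
    """Prim-style MST sketch: grow the tree one nearest document at a time.
    Returns the MST edges as (tree_point, new_point, distance) tuples."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True                                 # start from an arbitrary document
    best_dist = np.linalg.norm(pts - pts[0], axis=1)  # distance of each point to the tree
    best_edge = np.zeros(n, dtype=int)                # tree point realising that distance
    edges = []
    for _ in range(n - 1):
        best_dist[in_tree] = np.inf
        j = int(np.argmin(best_dist))                 # closest document not yet in the MST
        edges.append((int(best_edge[j]), j, float(best_dist[j])))
        in_tree[j] = True
        d = np.linalg.norm(pts - pts[j], axis=1)      # distances to the newly added document
        closer = d < best_dist
        best_dist[closer] = d[closer]
        best_edge[closer] = j
    return edges
```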

complete link method


Similarity: Join least similar pair between each of two clusters (i.e. the distance between two clusters is the distance between their most dissimilar pair of points, one from each cluster)
Type of clusters: All entries in a cluster are linked to one another within some minimum similarity, so have small, tightly bound clusters.

Time: Voorhees alg. worst case is O(N**3) but sparse matrices require much less

Space: Voorhees alg. worst case is O(N**2) but sparse matrices require much less

Advantages: Good results from (Voorhees) comparative studies.
Disadvantages: Difficult to apply to large data sets since most efficient algorithm is general HACM using stored data or stored matrix approach.

Voorhees Algorithm


Variation on the sorted matrix HACM approach.
Based on the fact that if similarities between all pairs of documents are processed in descending order, two clusters of size mi and mj can be merged as soon as the (mi x mj)-th similarity between documents in those clusters is reached.

Requires a sorted list of doc-doc similarities
Requires a way to count the number of similarities seen between any 2 active clusters

group average link method

Similarity: Use the average value of the pairwise links within a cluster, based upon all objects in the cluster. Space and time can be saved by using the inner product of 2 (appropriately weighted) vectors --- see the Voorhees algorithm below.

Type of clusters: Intermediate in tightness between single link and complete link

Time: O(N**2)
Space: O(N)
Advantages: Ranked well in evaluation studies
Disadvantages: Expensive for large collections

Algorithm

The similarity between a cluster centroid and any document equals the mean similarity between that document and all documents in the cluster.

Hence, centroids alone can be used to compute cluster-cluster similarities, in O(N) space.
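The following small numpy check illustrates why this works when similarity is an inner product (e.g. cosine on suitably normalised/weighted vectors): the mean of all pairwise similarities between two clusters equals the inner product of their centroids. The random vectors are just for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 8))    # 5 documents in cluster A (8-dimensional vectors)
B = rng.random((3, 8))    # 3 documents in cluster B

# Mean of all pairwise inner-product similarities between the two clusters...
pairwise_mean = np.mean([a @ b for a in A for b in B])
# ...equals the inner product of the two cluster centroids.
centroid_sim = A.mean(axis=0) @ B.mean(axis=0)

print(np.isclose(pairwise_mean, centroid_sim))   # True
```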


Ward's method (minimum variance method)

Similarity: Join the cluster pair whose merger minimizes the increase in the total within-group error sum of squares, based on Euclidean distance between centroids

Type of clusters: Homogeneous clusters in a symmetric hierarchy; each cluster is nicely characterized by its centroid (center of gravity)

Time: O(N**2)
Space: O(N) --- carries out agglomerations in restricted spatial regions
Advantages: Good at recovering cluster structure, yields unique and exact hierarchy

Disadvantages: Sensitive to outliers, poor at recovering elongated clusters

RNN: We can apply a reciprocal nearest neighbor (RNN) algorithm, since for any point or cluster there exists a chain of nearest neighbors (NNs) that ends with a pair that are each other's NN.

Algorithm - Single Cluster:
1. Select an arbitrary doc.
2. Follow the NN chain from it till an RNN pair is found.
3. Merge the 2 points in that pair, replacing them with a single new point.
4. If there is only one point, stop. Otherwise, if there is a point in the NN chain preceding the merged points, return to step 2; else return to step 1.
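Below is a hedged sketch of this NN-chain/RNN agglomeration for Ward's method. It exploits the fact that a Ward cluster can be represented exactly by its centroid and size, with merge cost n_i*n_j/(n_i+n_j) times the squared centroid distance; the function names and the (ignored) tie-breaking behaviour are simplifications.

```python
import numpy as np

def ward_cost(c1, n1, c2, n2):
    # Increase in the within-group error sum of squares if the two clusters merge.
    return (n1 * n2) / (n1 + n2) * float(np.sum((c1 - c2) ** 2))

def rnn_ward(points):
    """NN-chain / RNN agglomeration sketch for Ward's method.
    Each cluster is represented by its centroid and size."""
    clusters = {i: (np.asarray(p, dtype=float), 1) for i, p in enumerate(points)}
    next_id = len(clusters)
    chain, merges = [], []
    while len(clusters) > 1:
        if not chain:
            chain.append(next(iter(clusters)))         # step 1: pick an arbitrary cluster
        top = chain[-1]
        c_top, n_top = clusters[top]
        nn, nn_cost = None, np.inf                     # step 2: follow the NN chain
        for cid, (c, n) in clusters.items():
            if cid == top:
                continue
            cost = ward_cost(c_top, n_top, c, n)
            if cost < nn_cost:
                nn, nn_cost = cid, cost
        if len(chain) >= 2 and nn == chain[-2]:        # step 3: an RNN pair -> merge it
            a, b = chain.pop(), chain.pop()
            (ca, na), (cb, nb) = clusters.pop(a), clusters.pop(b)
            clusters[next_id] = ((na * ca + nb * cb) / (na + nb), na + nb)
            merges.append((a, b, nn_cost))
            next_id += 1                               # step 4: continue from the remaining chain
        else:
            chain.append(nn)
    return merges
```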


Centroid and median methods

Similarity: At each stage join the pair of clusters with the most similar centroids.

Type of clusters: Cluster is represented by the coordinates of the group centroid or median.

Disadvantages: Newly formed clusters may differ too much from their constituent points, meaning that updates may cause large changes throughout the cluster hierarchy.

General algorithm for HACM

All hierarchical agglomerative clustering methods (HACMs) can be described by the following general algorithm.

Algorithm
1. Identify the 2 closest points and combine them into a cluster.
2. Identify and combine the next 2 closest points (treating existing clusters as points too).
3. If more than one cluster remains, return to step 2.

How to apply the general algorithm: Lance and Williams proposed the Lance-Williams dissimilarity update formula, which calculates the dissimilarity between a newly formed cluster and every other point or cluster from the dissimilarities that held before the merge:

    d(k, i U j) = alpha_i * d(k, i) + alpha_j * d(k, j) + beta * d(i, j) + gamma * |d(k, i) - d(k, j)|

Each HACM is characterized by its own set of Lance-Williams parameters (alpha_i, alpha_j, beta, gamma; see Table 16.1 in the text). The algorithm above can then be applied, using the appropriate Lance-Williams dissimilarities.

Implementations of the general algorithm:

Stored matrix approach: Use matrix, and then apply Lance-Williams to recalculate dissimilarities between cluster centers. Storage is therefore O(N**2) and time is at least O(N**2), but will be O(N**3) if matrix is scanned linearly.

Stored data approach: O(N) space for data but recompute pairwise dissimilarities so need O(N**3) time

Sorted matrix approach: O(N**2) to calculate dissimilarity matrix, O(N**2 log N**2) to sort it, O(N**2) to construct hierarchy, but one need not store the data set, and the matrix can be processed linearly, which reduces disk accesses.
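As a hedged illustration of the stored matrix approach, the sketch below drives the general algorithm with the Lance-Williams update; the coefficient table covers single link, complete link, group average and Ward (gamma is zero for the latter two, and for 'ward' the matrix is conventionally squared Euclidean distances). The dictionary layout and function name are illustrative, and clarity is favoured over efficiency; the input matrix could come from, e.g., scipy.spatial.distance (squareform(pdist(X))).

```python
import numpy as np

# Lance-Williams coefficients (alpha_i, alpha_j, beta, gamma) as functions of
# the sizes of clusters i, j (being merged) and k (any other cluster).
LW = {
    'single':   lambda ni, nj, nk: (0.5, 0.5, 0.0, -0.5),
    'complete': lambda ni, nj, nk: (0.5, 0.5, 0.0,  0.5),
    'average':  lambda ni, nj, nk: (ni / (ni + nj), nj / (ni + nj), 0.0, 0.0),
    'ward':     lambda ni, nj, nk: ((ni + nk) / (ni + nj + nk),
                                    (nj + nk) / (ni + nj + nk),
                                    -nk / (ni + nj + nk), 0.0),
}

def hacm(D, method='single'):
    """Stored-matrix HACM sketch: repeatedly merge the closest pair of active
    clusters and update the dissimilarity matrix with the Lance-Williams formula."""
    D = np.array(D, dtype=float)
    np.fill_diagonal(D, np.inf)
    size = {i: 1 for i in range(len(D))}
    active = set(range(len(D)))
    merges = []
    while len(active) > 1:
        idx = sorted(active)
        sub = D[np.ix_(idx, idx)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = idx[a], idx[b]                      # closest pair of clusters
        merges.append((i, j, float(D[i, j])))
        for k in active - {i, j}:                  # update d(k, i u j)
            ai, aj, beta, gamma = LW[method](size[i], size[j], size[k])
            D[i, k] = D[k, i] = (ai * D[i, k] + aj * D[j, k]
                                 + beta * D[i, j] + gamma * abs(D[i, k] - D[j, k]))
        size[i] += size[j]                         # cluster j is absorbed into i
        active.remove(j)
        D[j, :] = D[:, j] = np.inf
    return merges
```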

K-Means Clustering

Overview

K-Means clustering generates a specific number of disjoint, flat (non-hierarchical) clusters. It is well suited to generating globular clusters. The K-Means method is numerical, unsupervised, non-deterministic and iterative.

K-Means Algorithm Properties

There are always K clusters.
There is always at least one item in each cluster.
The clusters are non-hierarchical and they do not overlap.


Every member of a cluster is closer to its own cluster than to any other cluster (note that closeness does not always involve the 'center' of the clusters).

The K-Means Algorithm Process

The dataset is partitioned into K clusters and the data points are randomly assigned to the clusters, resulting in clusters that have roughly the same number of data points.

For each data point, calculate the distance from the data point to each cluster. If the data point is closest to its own cluster, leave it where it is; if it is not, move it into the closest cluster.

Repeat the above step until a complete pass through all the data points results in no data point moving from one cluster to another. At this point the clusters are stable and the clustering process ends.

The choice of initial partition can greatly affect the final clusters that result, in terms of inter-cluster and intracluster distances and cohesion.
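For the conventional (centroid-based) algorithm described above, a typical library call looks like the sketch below; the random data are placeholders, and n_init re-runs the algorithm from several initial partitions and keeps the best result, which addresses the sensitivity just mentioned.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 4)            # 100 items described by 4 attributes

# n_init re-runs K-Means from several random initial partitions and keeps
# the solution with the lowest within-cluster sum of squares.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_[:10])                # cluster membership of the first 10 items
print(km.inertia_)                    # within-cluster sum of squared distances
```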

K-Means Clustering in GeneLinker

The version of the K-Means algorithm used in GeneLinker differs from the conventional K-Means algorithm in that GeneLinker does not compute the centroid of the clusters to measure the distance from a data point to a cluster. Instead, the algorithm uses a specified linkage distance metric. The use of the Average Linkage distance metric most closely corresponds to conventional K-Means, but can produce different results in many cases.


Advantages to Using this Technique

With a large number of variables, K-Means may be computationally faster than hierarchical clustering (if K is small).

K-Means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.

Disadvantages to Using this Technique

It is difficult to compare the quality of the clusters produced (e.g. different initial partitions or values of K affect the outcome).

Having a fixed number of clusters can make it difficult to predict what K should be.

Does not work well with non-globular clusters.

Different initial partitions can result in different final clusters. It is helpful to rerun the program using the same as well as different K values, to compare the results achieved.
