
Clustering

Clustering is a kind of unsupervised learning: a method of grouping data that share similar trends and patterns, by which a large set of data is divided into clusters of smaller sets of similar data.

Example: after clustering, the data are grouped. Thus, clustering means grouping data, i.e. dividing a large data set into smaller data sets that share some similarity.

Clustering (cont.)

Clustering: the process of grouping a set of physical or abstract objects into classes of similar objects.

Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.

Cluster analysis: grouping a set of data objects into clusters.

What Is Good Clustering?


A good clustering method will produce high-quality clusters with:
- high intra-class similarity;
- low inter-class similarity.

The quality of a clustering result depends on both the similarity measure used by the method and its implementation.

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Requirements of a Good Clustering in DM

- Scalability: ability of the algorithm to perform well with a large number of data objects.
- Ability to deal with different types of attributes: ability to analyze data sets with a mixture of attribute types.
- Discovery of clusters with arbitrary shapes: it is important to develop algorithms that can detect clusters of arbitrary shape.
- Minimal requirements for domain knowledge to determine input parameters.
- Ability to deal with noise and outliers.

General Applications of Clustering


- Pattern recognition.
- Spatial data analysis: create thematic maps in GIS by clustering feature spaces; detect spatial clusters and explain them in spatial data mining.
- Image processing.
- Economic science (especially market research).
- WWW: document classification; clustering Weblog data to discover groups of similar access patterns.

Examples of Clustering Applications


- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
- Land use: identification of areas of similar land use in an earth observation database.
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost.
- City planning: identifying groups of houses according to their house type, value, and geographical location.
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults.

Different ways of representing clusters


[Figure: four ways of representing clusters of the instances a-k: (a) a flat partition into disjoint groups; (b) groups that may overlap; (c) a table of per-instance cluster-membership probabilities; (d) a dendrogram over the instances.]

Features of Clustering
- Insensitive to order of input records: the same data set, when presented to certain algorithms in different orders, may lead to dramatically different clusterings. Thus it is important that algorithms be insensitive to the order of input.
- High dimensionality: the number of attributes or dimensions in many data sets is large, and many clustering algorithms can produce meaningful results only when the number of dimensions is small. Thus it is important that algorithms can produce results even when the number of dimensions is high.
- Incorporation of user-specified constraints: real applications may need to perform clustering under various kinds of constraints. Thus a good algorithm has to produce results even when the data must satisfy various constraints.
- Interpretability and usability: the clustering results should be interpretable, comprehensible and usable.

Clustering process

Partition of a set ( data set )


A partition of a set X with n objects (size(X) = n) is a collection of subsets P_j, j = 1, 2, ..., s, where each P_j is a non-empty subset of X such that every element x in X is in exactly one of these subsets. The subsets satisfy the following two conditions:

1) The union of the subsets equals X: P_1 ∪ P_2 ∪ ... ∪ P_s = X. (We say the subsets P_j cover X.)

2) The intersection of any two distinct subsets is empty: P_i ∩ P_j = ∅ for i ≠ j, with i, j = 1, 2, ..., s. (We say the subsets are pairwise disjoint.)

Example of partition
For example, the set X = {a, b, c} with n = 3 objects can be partitioned:

- into three subsets {P_j}, j = 1..3, in one way: { {a}, {b}, {c} }, where P_1 = {a}, P_2 = {b}, P_3 = {c}, and s = 3;
- into two subsets {P_j}, j = 1..2, in three ways: { {a, b}, {c} }, { {a, c}, {b} }, { {a}, {b, c} }, where P_1 = {a, b} or {a, c} or {b, c}, P_2 = {c} or {b} or {a}, and s = 2;
- into one subset {P_j}, j = 1, in one way: { {a, b, c} }, where P_1 = {a, b, c}, and s = 1.
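As a minimal illustration of the two conditions above, the following Python sketch (function name is illustrative, not from the slides) checks whether a family of subsets is a partition of X:

```python
def is_partition(x, subsets):
    """Check the two partition conditions for a set x and candidate subsets."""
    # 1) the subsets must cover x: their union equals x
    covers = set().union(*subsets) == set(x)
    # 2) the subsets must be non-empty and pairwise disjoint
    non_empty = all(len(s) > 0 for s in subsets)
    disjoint = sum(len(s) for s in subsets) == len(set().union(*subsets))
    return covers and non_empty and disjoint

X = {"a", "b", "c"}
print(is_partition(X, [{"a"}, {"b"}, {"c"}]))      # True  (s = 3)
print(is_partition(X, [{"a", "b"}, {"c"}]))        # True  (s = 2)
print(is_partition(X, [{"a", "b"}, {"b", "c"}]))   # False (not disjoint)
```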

Data are a collection of objects and their attributes.

- An attribute is a property or characteristic of an object. Examples: eye color of a person, temperature, etc. An attribute is also known as a variable, field, characteristic, or feature.
- A collection of attributes describes an object. An object is also known as a record, point, case, sample, entity, or instance.

In the example data set below, the rows are the objects and the columns (Refund, Marital Status, Taxable Income, Cheat) are the attributes.

Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes

Clustering: Data Typing

There are different types of attributes:

- Binary: only two states, the variable is absent or present. Examples: male and female, on and off.
- Nominal: a generalization of the binary variable in that it can take more than 2 states. Examples: ID numbers, eye color, zip codes.
- Ordinal: an ordinal variable q can be discrete or continuous, and its M_q values can be mapped to a ranking 1, ..., M_q. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}.
- Ratio-scaled: a positive measurement on a scale, such as an exponential scale Ae^(Bt) or Ae^(-Bt). Examples: temperature in Kelvin, length, time, counts.
- Variables of mixed types: a collection of all the previous variable types. Examples: temperature in Kelvin, grades, ID numbers, counts.
- Categorical: there is no inherent distance measure between the data values. Example: a relation that stores information about movies, where a movie is an object (tuple) characterized by the attribute values director, actor/actress, and genre.

Type of data in clustering analysis


Interval-scaled variables:

Binary variables:
Nominal, ordinal, and ratio variables:

Variables of mixed types:

Interval-valued variables
Standardize data

Calculate the mean absolute deviation:

  s_f = (1/n) ( |x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f| )

where

  m_f = (1/n) ( x_1f + x_2f + ... + x_nf )

Calculate the standardized measurement (z-score):

  z_if = ( x_if - m_f ) / s_f

Using the mean absolute deviation is more robust than using the standard deviation.
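As a minimal illustration (variable names are my own, not from the slides), the following NumPy sketch standardizes each attribute f by computing the mean m_f, the mean absolute deviation s_f, and the z-scores z_if:

```python
import numpy as np

def standardize(X):
    """Standardize each column of X using the mean absolute deviation.

    X is an (n objects x p attributes) array; returns the z-scores z_if.
    """
    m = X.mean(axis=0)               # m_f: per-attribute mean
    s = np.abs(X - m).mean(axis=0)   # s_f: mean absolute deviation
    return (X - m) / s               # z_if = (x_if - m_f) / s_f

# Example: three objects described by two interval-scaled attributes
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [6.0, 400.0]])
print(standardize(X))
```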

Binary Variables
A contingency table for binary data (object i vs. object j):

                  object j
                  1       0       sum
  object i   1    a       b       a+b
             0    c       d       c+d
  sum            a+c     b+d       p

Simple matching coefficient (invariant, if the binary variable is symmetric):

  d(i, j) = (b + c) / (a + b + c + d)

Jaccard coefficient (noninvariant, if the binary variable is asymmetric):

  d(i, j) = (b + c) / (a + b + c)
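A small illustrative sketch (the helper name is an assumption, not from the slides) that derives the counts a, b, c, d from two binary vectors and evaluates both coefficients:

```python
def binary_dissimilarity(x, y):
    """Return (simple matching, Jaccard) dissimilarities for binary vectors x, y."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    simple_matching = (b + c) / (a + b + c + d)   # symmetric binary variables
    jaccard = (b + c) / (a + b + c)               # asymmetric binary variables
    return simple_matching, jaccard

# Example: two objects described by 6 binary attributes
print(binary_dissimilarity([1, 0, 1, 1, 0, 0], [1, 1, 0, 1, 0, 0]))
```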

Nominal Variables
A generalization of the binary variable in that it can

take more than 2 states, e.g., red, yellow, blue, green


Method 1: simple matching, where m is the number of matches and p is the total number of variables:

  d(i, j) = (p - m) / p

Method 2: use a large number of binary variables (one binary variable for each of the nominal states).
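A one-function sketch of Method 1 (illustrative names; the attribute tuples are made-up examples):

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects described by p nominal attributes."""
    p = len(obj_i)                                      # total number of variables
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matching states
    return (p - m) / p

# Example: objects described by (eye color, zip code, favorite color)
print(nominal_dissimilarity(("blue", "10115", "red"),
                            ("blue", "20095", "green")))  # 2/3
```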

Ordinal Variables
- An ordinal variable can be discrete or continuous.
- Order is important, e.g., rank.
- It can be treated like an interval-scaled variable.

Ratio-Scaled Variables
A ratio-scaled variable is a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(-Bt).

Methods:
- Treat them like interval-scaled variables; not a good choice! (Why? The scale can be distorted.)
- Apply a logarithmic transformation: y_if = log(x_if).
- Treat them as continuous ordinal data and treat their rank as interval-scaled.

Variables of Mixed Types


A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.

Major Clustering Approaches


- Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
- Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
- Density-based: based on connectivity and density functions.
- Grid-based: based on a multiple-level granularity structure.
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model.

Partitioning Algorithms
Construct a partition of a database D of n objects into a set of k clusters.

- k-means: each cluster is represented by the center (mean) of the cluster. K-means clustering is an effective algorithm to extract a given number of clusters of patterns from a training set. Once done, the cluster locations can be used to classify patterns into distinct classes.
- k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster.

Algorithm ( K-means )
Input: k, the number of clusters; a database D of n objects.
Output: a set of k clusters which minimizes the squared-error function E.
Algorithm:
1) Choose k objects as the initial cluster centers.
2) Assign each object to the cluster whose mean point (centroid) is closest under the squared Euclidean distance metric.
3) When all objects have been assigned, recalculate the positions of the k mean points (centroids).
4) Repeat steps 2) and 3) until the centroids no longer change.

Pseudo code:

Input:
  k                      // desired number of clusters
  D = {x1, x2, ..., xn}  // set of elements
Output:
  K = {C1, C2, ..., Ck}  // set of k clusters which minimizes the squared-error function E

K-means algorithm:
1) assign initial values for the mean points mu_1, mu_2, ..., mu_k   // k seeds
   repeat
2.1)   assign each item xi to the cluster whose mean is closest;
2.2)   calculate a new mean for each cluster;
   until the convergence criterion is met;
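As a concrete illustration of the pseudo code above, here is a minimal NumPy sketch of k-means (not an optimized or authoritative implementation; the random-seed initialization is an assumption):

```python
import numpy as np

def k_means(D, k, max_iter=100, seed=0):
    """Cluster the rows of D into k clusters; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = D[rng.choice(len(D), size=k, replace=False)]  # step 1: k seeds
    for _ in range(max_iter):
        # step 2: assign each object to the closest centroid (squared Euclidean)
        dists = ((D[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its assigned objects
        new_centroids = np.array([D[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: two well-separated groups of 2-D points
D = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1], [8, 8], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = k_means(D, k=2)
print(labels, centroids, sep="\n")
```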

Partitioning methods: K-means

Strength & Weakness


Strength:
- The K-means method is relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.

Weakness:
- Not applicable to categorical data.
- Need to specify k, the number of clusters, in advance.
- Unable to handle noisy data and outliers.
- Not suitable for discovering clusters with non-convex shapes.

To overcome some of these problems, the k-medoids (PAM) method was introduced.

K-Medoids
Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in the cluster.

Algorithm ( K-medoid or PAM)


Input: k, the number of clusters; a database D of n objects.
Output: a set of k clusters which minimizes the sum of the dissimilarities of all n objects to their nearest q-th medoid (q = 1, 2, ..., k).
Algorithm:
1) Randomly choose k objects from the data set to be the cluster medoids at the initial state.
2) For each pair of a non-selected object h and a selected object i, calculate the total swapping cost T_ih.
3) For each pair of i and h: if T_ih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object.
4) Repeat steps 2) and 3) until no change happens.

Calculate T_ih, the total swap contribution for the pair of objects (i, h), as

  T_ih = sum over j of C_jih

where C_jih is the contribution of object j to swapping the pair (i <-> h), defined below. There are four possibilities to consider when calculating C_jih; see Tab. 1 in the Appendix.
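One straightforward way to compute T_ih in code is as the change in total clustering cost caused by swapping medoid i with non-medoid h, which equals the sum of the per-object contributions C_jih. A minimal sketch, assuming Euclidean dissimilarity and illustrative names:

```python
import numpy as np

def total_cost(D, medoids):
    """Sum of distances of every object to its nearest medoid."""
    dists = np.linalg.norm(D[:, None, :] - D[medoids][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def swap_cost(D, medoids, i, h):
    """T_ih: change in total cost if medoid i is replaced by non-medoid h.

    Equivalently, the sum over all objects j of the contributions C_jih.
    """
    swapped = [h if m == i else m for m in medoids]
    return total_cost(D, swapped) - total_cost(D, medoids)

# Example: if T_ih < 0, the swap improves the clustering and is accepted
D = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0], [5.1, 4.9]])
medoids = [0, 4]
print(swap_cost(D, medoids, i=4, h=3))
```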

Partitioning methods: K-medoid or PAM

Strength & Weakness


PAM is more robust than K-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.

Weakness: PAM works efficiently for small data sets but does not scale well to large data sets. In fact, each iteration costs O(k(n - k)^2), where n is the number of objects and k the number of clusters.

To overcome these problems, CLARA (Clustering LARge Applications), a sampling-based method, was introduced.

Partitioning methods: CLARA


CLARA (Clustering LARge Applications) is a method that, instead of taking the whole data set into consideration, chooses only a small portion of the real data (in a random manner) as a representative of the data, and medoids are chosen from this sample using PAM. It deals with larger data sets than PAM.

Weakness: efficiency depends on the sample size. A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.
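A sketch of the CLARA idea (illustrative, with a deliberately tiny PAM inside; the sample size 40 + 2k is a commonly cited heuristic, assumed here rather than taken from the slides): PAM is run on several random samples, and the medoids that score best on the whole data set are kept.

```python
import numpy as np

def total_cost(D, medoids):
    """Sum of distances of every object to its nearest medoid."""
    dists = np.linalg.norm(D[:, None, :] - D[medoids][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def pam(D, k, seed=0):
    """Very small PAM: keep applying the best improving medoid swap."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(D), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        for i in list(medoids):
            for h in range(len(D)):
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                if total_cost(D, candidate) < total_cost(D, medoids):
                    medoids, improved = candidate, True
    return medoids

def clara(D, k, n_samples=5, sample_size=None, seed=0):
    """CLARA: run PAM on random samples, keep the medoids that score best
    on the whole data set."""
    rng = np.random.default_rng(seed)
    sample_size = sample_size or min(len(D), 40 + 2 * k)  # heuristic (assumption)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(D), size=sample_size, replace=False)
        sample_medoids = [idx[m] for m in pam(D[idx], k, seed=seed)]
        cost = total_cost(D, sample_medoids)   # evaluate on the full data set
        if cost < best_cost:
            best_medoids, best_cost = sample_medoids, cost
    return best_medoids, best_cost
```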

Partitioning methods: CLARANS


CLARANS (Randomized CLARA) is a method that draws a sample of neighbors dynamically. The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids. If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum. It is more efficient and scalable than both PAM and CLARA.

Hierarchical methods: Introduction


Hierarchical clustering methods work by grouping data objects into a tree of clusters and use a distance matrix as the clustering criterion. These methods do not require the number of clusters k as an input, but need only the number of objects n and a termination condition. There are two principal types of hierarchical methods:

- Agglomerative (bottom-up): merge clusters iteratively. Start by placing each object in its own cluster; merge these atomic clusters into larger and larger clusters, until all objects are in a single cluster. Most hierarchical methods belong to this category; they differ only in their definition of between-cluster similarity. An example is AGNES (AGglomerative NESting) [8].
- Divisive (top-down): split a cluster iteratively. It does the reverse, by starting with all objects in one cluster and subdividing them into smaller pieces. Divisive methods are not generally available and have rarely been applied. An example is DIANA (DIvisive ANAlysis).

Hierarchical methods: Distance between clusters


Merging of clusters is based on the distance between clusters:

- Single-linkage: the shortest distance from any member p of one cluster Ci to any member p' of the other cluster Cj.
- Complete-linkage: the greatest distance from any member p of one cluster Ci to any member p' of the other cluster Cj.
- Average-linkage: the average distance from any member p of one cluster Ci to any member p' of the other cluster Cj.
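A small NumPy sketch of these three between-cluster distances (Euclidean point distance and the function name are assumptions for illustration):

```python
import numpy as np

def linkage_distance(Ci, Cj, kind="single"):
    """Distance between clusters Ci and Cj (arrays of points, one row per object)."""
    # pairwise Euclidean distances between every p in Ci and every p' in Cj
    pairwise = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)
    if kind == "single":      # shortest pairwise distance
        return pairwise.min()
    if kind == "complete":    # greatest pairwise distance
        return pairwise.max()
    if kind == "average":     # average pairwise distance
        return pairwise.mean()
    raise ValueError("kind must be 'single', 'complete' or 'average'")

Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[4.0, 0.0], [6.0, 0.0]])
for kind in ("single", "complete", "average"):
    print(kind, linkage_distance(Ci, Cj, kind))   # 3.0, 6.0, 4.5
```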

Hierarchical methods: Agglomerative Algorithm


Input:
  D = {x1, x2, ..., xn}   // set of elements
  A = [d(i, j)]           // n x n proximity (adjacency) matrix giving the distance between xi and xj
  Cr                      // the r-th cluster, with 1 <= r <= n
  d[Cr, Cs]               // proximity between clusters Cr and Cs
  k                       // sequence number, with k = 0, 1, ..., n-1
  L(k)                    // distance level of the k-th clustering
Output: a dendrogram.
Algorithm:
1. Begin with n clusters, each containing one object, with level L(0) = 0 and sequence number k = 0.
2. Find the least dissimilar pair (Cr, Cs) in the current clustering, according to d[Cr, Cs] = min(d[Ci, Cj]), where the minimum is over all pairs of clusters (Ci, Cj) in the current clustering.
3. Increment the sequence number: k = k + 1. Merge clusters Cr and Cs into a single cluster to form the next clustering k. Set the level of this clustering to L(k) = d[Cr, Cs].
4. Update the proximity matrix by deleting the rows and columns corresponding to Cr and Cs and adding a row and column for the newly formed cluster; then repeat from step 2 until only a single cluster remains.
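A minimal Python sketch of this agglomerative loop (single linkage by default; it records the merge level L(k) at each step, which is the information a dendrogram is drawn from):

```python
import numpy as np

def agglomerative(D, linkage=min):
    """Agglomerative clustering over the points D; returns the list of merges
    as (level L(k), merged cluster) tuples."""
    clusters = [{i} for i in range(len(D))]   # start with n singleton clusters
    dist = lambda a, b: np.linalg.norm(D[a] - D[b])
    merges = []
    while len(clusters) > 1:
        # step 2: find the least dissimilar pair of clusters
        best = None
        for r in range(len(clusters)):
            for s in range(r + 1, len(clusters)):
                d_rs = linkage(dist(a, b) for a in clusters[r] for b in clusters[s])
                if best is None or d_rs < best[0]:
                    best = (d_rs, r, s)
        # step 3: merge the pair and record the level L(k)
        level, r, s = best
        merged = clusters[r] | clusters[s]
        clusters = [c for t, c in enumerate(clusters) if t not in (r, s)] + [merged]
        merges.append((level, merged))
    return merges

D = np.array([[0.0], [1.0], [5.0], [6.0]])
for level, cluster in agglomerative(D):
    print(f"L(k) = {level:.1f}, merged -> {sorted(cluster)}")
```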

Hierarchical methods: Dendrogram


The agglomerative algorithm decomposes the data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.


Other Hierarchical methods


- AGNES: uses the single-link method and the dissimilarity matrix.
- DIANA: inverse order of AGNES; eventually each node forms a cluster on its own.
- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): introduces two concepts, the clustering feature and the CF tree (Clustering Feature tree).
- CURE (Clustering Using REpresentatives): integrates hierarchical and partitioning algorithms to favor clusters with arbitrary shape.
- CHAMELEON (hierarchical clustering using dynamic modeling): explores dynamic modeling in hierarchical clustering.
