
Clustering

Clustering is a kind of unsupervised learning: a method of grouping data that share similar trends and patterns, by which a large set of data is divided into clusters of smaller sets of similar data.

Example: after clustering, the data are grouped. Thus, clustering means grouping data, i.e. dividing a large data set into smaller data sets that share some similarity.

Clustering (cont.)

Clustering: the process of grouping a set of physical or abstract objects into classes of similar objects.

Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.

Cluster analysis: grouping a set of data objects into clusters.

What Is Good Clustering?


A good clustering method will produce high-quality clusters with:
- high intra-class similarity;
- low inter-class similarity.

The quality of a clustering result depends on both the similarity measure used by the method and its implementation.

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Requirements of a Good Clustering in DM

- Scalability: ability of the algorithm to perform well with a large number of data objects.
- Ability to deal with different types of attributes: ability to analyze data sets with a mixture of attribute types.
- Discovery of clusters with arbitrary shapes: it is important to develop algorithms that can detect clusters of arbitrary shape.
- Minimal requirements for domain knowledge to determine input parameters.
- Ability to deal with noise and outliers.

General Applications of Clustering


- Pattern recognition.
- Spatial data analysis: create thematic maps in GIS by clustering feature spaces; detect spatial clusters and explain them in spatial data mining.
- Image processing.
- Economic science (especially market research).
- WWW: document classification; clustering Weblog data to discover groups of similar access patterns.

Examples of Clustering Applications


- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
- Land use: identification of areas of similar land use in an earth observation database.
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost.
- City planning: identifying groups of houses according to their house type, value, and geographical location.
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults.

Different ways of representing clusters


[Figure: four ways of representing clusters of the instances a-k: (a) a flat partition into disjoint groups; (b) groups that may overlap; (c) a table of per-instance cluster-membership probabilities; (d) a dendrogram over the instances.]

Features of Clustering
- Insensitive to order of input records: the same data set, when presented to certain algorithms in different orders, may lead to dramatically different clusterings. Thus it is important that algorithms be insensitive to the order of input.
- High dimensionality: the number of attributes or dimensions in many data sets is large, and many clustering algorithms can produce meaningful results only when the number of dimensions is small. Thus it is important that algorithms can produce results even when the number of dimensions is high.
- Incorporation of user-specified constraints: real applications may need to perform clustering under various kinds of constraints. Thus a good algorithm has to produce results even when the data must satisfy various constraints.
- Interpretability and usability: the clustering results should be interpretable, comprehensible and usable.

Clustering process

Partition of a set ( data set )


A partition of a set X with n objects (size(X) = n) is a collection of subsets P_j, j = 1, 2, ..., s, where each P_j is a non-empty subset of X such that every element x in X is in exactly one of these subsets. The subsets satisfy the following two conditions:

1) The union of the subsets equals X: P_1 ∪ P_2 ∪ ... ∪ P_s = X. (We say the subsets P_j cover X.)

2) The intersection of any two distinct subsets is empty: P_i ∩ P_j = ∅ for i ≠ j, with i, j = 1, 2, ..., s. (We say the subsets are pairwise disjoint.)

Example of partition
For example, the set X = {a, b, c} with n = 3 objects can be partitioned:

- into three subsets {P_j}, j = 1..3, in one way: { {a}, {b}, {c} }, where P_1 = {a}, P_2 = {b}, P_3 = {c}, and s = 3;
- into two subsets {P_j}, j = 1..2, in three ways: { {a, b}, {c} }, { {a, c}, {b} }, { {a}, {b, c} }, where P_1 = {a, b} or {a, c} or {b, c}, P_2 = {c} or {b} or {a}, and s = 2;
- into one subset {P_j}, j = 1, in one way: { {a, b, c} }, where P_1 = {a, b, c}, and s = 1.
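As a minimal illustration of the two conditions above, the following Python sketch (function name is illustrative, not from the slides) checks whether a family of subsets is a partition of X:

```python
def is_partition(x, subsets):
    """Check the two partition conditions for a set x and candidate subsets."""
    # 1) the subsets must cover x: their union equals x
    covers = set().union(*subsets) == set(x)
    # 2) the subsets must be non-empty and pairwise disjoint
    non_empty = all(len(s) > 0 for s in subsets)
    disjoint = sum(len(s) for s in subsets) == len(set().union(*subsets))
    return covers and non_empty and disjoint

X = {"a", "b", "c"}
print(is_partition(X, [{"a"}, {"b"}, {"c"}]))      # True  (s = 3)
print(is_partition(X, [{"a", "b"}, {"c"}]))        # True  (s = 2)
print(is_partition(X, [{"a", "b"}, {"b", "c"}]))   # False (not disjoint)
```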

Data are a collection of objects and their attributes.

- An attribute is a property or characteristic of an object. Examples: eye color of a person, temperature, etc. An attribute is also known as a variable, field, characteristic, or feature.
- A collection of attributes describes an object. An object is also known as a record, point, case, sample, entity, or instance.

In the example data set below, the rows are the objects and the columns (Refund, Marital Status, Taxable Income, Cheat) are the attributes.

Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes

Clustering: Data Typing

There are different types of attributes:

- Binary: only two states, the variable is absent or present. Examples: male and female, on and off.
- Nominal: a generalization of the binary variable in that it can take more than 2 states. Examples: ID numbers, eye color, zip codes.
- Ordinal: an ordinal variable q can be discrete or continuous, and its M_q values can be mapped to a ranking 1, ..., M_q. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}.
- Ratio-scaled: a positive measurement on a scale, such as an exponential scale Ae^(Bt) or Ae^(-Bt). Examples: temperature in Kelvin, length, time, counts.
- Variables of mixed types: a collection of all the previous variable types. Examples: temperature in Kelvin, grades, ID numbers, counts.
- Categorical: there is no inherent distance measure between the data values. Example: a relation that stores information about movies, where a movie is an object (tuple) characterized by the attribute values director, actor/actress, and genre.

Type of data in clustering analysis


Interval-scaled variables:

Binary variables:
Nominal, ordinal, and ratio variables:

Variables of mixed types:

Interval-valued variables
Standardize data

Calculate the mean absolute deviation:

  s_f = (1/n) ( |x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f| )

where

  m_f = (1/n) ( x_1f + x_2f + ... + x_nf )

Calculate the standardized measurement (z-score):

  z_if = ( x_if - m_f ) / s_f

Using the mean absolute deviation is more robust than using the standard deviation.
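As a minimal illustration (variable names are my own, not from the slides), the following NumPy sketch standardizes each attribute f by computing the mean m_f, the mean absolute deviation s_f, and the z-scores z_if:

```python
import numpy as np

def standardize(X):
    """Standardize each column of X using the mean absolute deviation.

    X is an (n objects x p attributes) array; returns the z-scores z_if.
    """
    m = X.mean(axis=0)               # m_f: per-attribute mean
    s = np.abs(X - m).mean(axis=0)   # s_f: mean absolute deviation
    return (X - m) / s               # z_if = (x_if - m_f) / s_f

# Example: three objects described by two interval-scaled attributes
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [6.0, 400.0]])
print(standardize(X))
```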

Binary Variables
A contingency table for binary data (object i vs. object j):

                  object j
                  1       0       sum
  object i   1    a       b       a+b
             0    c       d       c+d
  sum            a+c     b+d       p

Simple matching coefficient (invariant, if the binary variable is symmetric):

  d(i, j) = (b + c) / (a + b + c + d)

Jaccard coefficient (noninvariant, if the binary variable is asymmetric):

  d(i, j) = (b + c) / (a + b + c)
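A small illustrative sketch (the helper name is an assumption, not from the slides) that derives the counts a, b, c, d from two binary vectors and evaluates both coefficients:

```python
def binary_dissimilarity(x, y):
    """Return (simple matching, Jaccard) dissimilarities for binary vectors x, y."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    simple_matching = (b + c) / (a + b + c + d)   # symmetric binary variables
    jaccard = (b + c) / (a + b + c)               # asymmetric binary variables
    return simple_matching, jaccard

# Example: two objects described by 6 binary attributes
print(binary_dissimilarity([1, 0, 1, 1, 0, 0], [1, 1, 0, 1, 0, 0]))
```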

Nominal Variables
A generalization of the binary variable in that it can

take more than 2 states, e.g., red, yellow, blue, green


Method 1: simple matching, where m is the number of matches and p is the total number of variables:

  d(i, j) = (p - m) / p

Method 2: use a large number of binary variables (one binary variable for each of the nominal states).
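A one-function sketch of Method 1 (illustrative names; the attribute tuples are made-up examples):

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects described by p nominal attributes."""
    p = len(obj_i)                                      # total number of variables
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matching states
    return (p - m) / p

# Example: objects described by (eye color, zip code, favorite color)
print(nominal_dissimilarity(("blue", "10115", "red"),
                            ("blue", "20095", "green")))  # 2/3
```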

Ordinal Variables
- An ordinal variable can be discrete or continuous.
- Order is important, e.g., rank.
- It can be treated like an interval-scaled variable.

Ratio-Scaled Variables
A ratio-scaled variable is a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(-Bt).

Methods:
- Treat them like interval-scaled variables; not a good choice! (Why? The scale can be distorted.)
- Apply a logarithmic transformation: y_if = log(x_if).
- Treat them as continuous ordinal data and treat their rank as interval-scaled.

Variables of Mixed Types


A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.

Major Clustering Approaches


- Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
- Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
- Density-based: based on connectivity and density functions.
- Grid-based: based on a multiple-level granularity structure.
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model.

Partitioning Algorithms
Construct a partition of a database D of n objects into a set of k clusters.

- k-means: each cluster is represented by the center (mean) of the cluster. K-means clustering is an effective algorithm to extract a given number of clusters of patterns from a training set. Once done, the cluster locations can be used to classify patterns into distinct classes.
- k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster.

Algorithm ( K-means )
Input: k, the number of clusters; a database D of n objects.
Output: a set of k clusters which minimizes the squared-error function E.
Algorithm:
1) Choose k objects as the initial cluster centers.
2) Assign each object to the cluster whose mean point (centroid) is closest under the squared Euclidean distance metric.
3) When all objects have been assigned, recalculate the positions of the k mean points (centroids).
4) Repeat steps 2) and 3) until the centroids no longer change.

Pseudo code:

Input:
  k                      // desired number of clusters
  D = {x1, x2, ..., xn}  // set of elements
Output:
  K = {C1, C2, ..., Ck}  // set of k clusters which minimizes the squared-error function E

K-means algorithm:
1) assign initial values for the mean points mu_1, mu_2, ..., mu_k   // k seeds
   repeat
2.1)   assign each item xi to the cluster whose mean is closest;
2.2)   calculate a new mean for each cluster;
   until the convergence criterion is met;
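As a concrete illustration of the pseudo code above, here is a minimal NumPy sketch of k-means (not an optimized or authoritative implementation; the random-seed initialization is an assumption):

```python
import numpy as np

def k_means(D, k, max_iter=100, seed=0):
    """Cluster the rows of D into k clusters; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = D[rng.choice(len(D), size=k, replace=False)]  # step 1: k seeds
    for _ in range(max_iter):
        # step 2: assign each object to the closest centroid (squared Euclidean)
        dists = ((D[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its assigned objects
        new_centroids = np.array([D[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: two well-separated groups of 2-D points
D = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1], [8, 8], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = k_means(D, k=2)
print(labels, centroids, sep="\n")
```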

Partitioning methods: K-means

Strength & Weakness


Strength:
- The K-means method is relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.

Weakness:
- Not applicable to categorical data.
- Need to specify k, the number of clusters, in advance.
- Unable to handle noisy data and outliers.
- Not suitable for discovering clusters with non-convex shapes.

To overcome some of these problems, the k-medoids (PAM) method was introduced.

K-Medoids
Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in the cluster.

Algorithm ( K-medoid or PAM)


Input: k, the number of clusters; a database D of n objects.
Output: a set of k clusters which minimizes the sum of the dissimilarities of all n objects to their nearest q-th medoid (q = 1, 2, ..., k).
Algorithm:
1) Randomly choose k objects from the data set to be the cluster medoids at the initial state.
2) For each pair of a non-selected object h and a selected object i, calculate the total swapping cost T_ih.
3) For each pair of i and h: if T_ih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object.
4) Repeat steps 2) and 3) until no change happens.

Calculate T_ih, the total swap contribution for the pair of objects (i, h), as

  T_ih = sum over j of C_jih

where C_jih is the contribution of object j to swapping the pair (i <-> h), defined below. There are four possibilities to consider when calculating C_jih; see Tab. 1 in the Appendix.
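One straightforward way to compute T_ih in code is as the change in total clustering cost caused by swapping medoid i with non-medoid h, which equals the sum of the per-object contributions C_jih. A minimal sketch, assuming Euclidean dissimilarity and illustrative names:

```python
import numpy as np

def total_cost(D, medoids):
    """Sum of distances of every object to its nearest medoid."""
    dists = np.linalg.norm(D[:, None, :] - D[medoids][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def swap_cost(D, medoids, i, h):
    """T_ih: change in total cost if medoid i is replaced by non-medoid h.

    Equivalently, the sum over all objects j of the contributions C_jih.
    """
    swapped = [h if m == i else m for m in medoids]
    return total_cost(D, swapped) - total_cost(D, medoids)

# Example: if T_ih < 0, the swap improves the clustering and is accepted
D = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0], [5.1, 4.9]])
medoids = [0, 4]
print(swap_cost(D, medoids, i=4, h=3))
```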

Partitioning methods: K-medoid or PAM

Strength & Weakness


PAM is more robust than K-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.

Weakness: PAM works efficiently for small data sets but does not scale well to large data sets. In fact, each iteration costs O(k(n - k)^2), where n is the number of objects and k the number of clusters.

To overcome these problems, CLARA (Clustering LARge Applications), a sampling-based method, was introduced.

Partitioning methods: CLARA


CLARA (Clustering LARge Applications) is a method that, instead of taking the whole data set into consideration, chooses only a small portion of the real data (in a random manner) as a representative of the data, and medoids are chosen from this sample using PAM. It deals with larger data sets than PAM.

Weakness: efficiency depends on the sample size. A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.
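A sketch of the CLARA idea (illustrative, with a deliberately tiny PAM inside; the sample size 40 + 2k is a commonly cited heuristic, assumed here rather than taken from the slides): PAM is run on several random samples, and the medoids that score best on the whole data set are kept.

```python
import numpy as np

def total_cost(D, medoids):
    """Sum of distances of every object to its nearest medoid."""
    dists = np.linalg.norm(D[:, None, :] - D[medoids][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def pam(D, k, seed=0):
    """Very small PAM: keep applying the best improving medoid swap."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(D), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        for i in list(medoids):
            for h in range(len(D)):
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                if total_cost(D, candidate) < total_cost(D, medoids):
                    medoids, improved = candidate, True
    return medoids

def clara(D, k, n_samples=5, sample_size=None, seed=0):
    """CLARA: run PAM on random samples, keep the medoids that score best
    on the whole data set."""
    rng = np.random.default_rng(seed)
    sample_size = sample_size or min(len(D), 40 + 2 * k)  # heuristic (assumption)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(D), size=sample_size, replace=False)
        sample_medoids = [idx[m] for m in pam(D[idx], k, seed=seed)]
        cost = total_cost(D, sample_medoids)   # evaluate on the full data set
        if cost < best_cost:
            best_medoids, best_cost = sample_medoids, cost
    return best_medoids, best_cost
```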

Partitioning methods: CLARANS


CLARANS (Randomized CLARA) is a method that draws a sample of neighbors dynamically. The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids. If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum. It is more efficient and scalable than both PAM and CLARA.

Hierarchical methods: Introduction


Hierarchical clustering methods work by grouping data objects into a tree of clusters and use a distance matrix as the clustering criterion. These methods do not require the number of clusters k as an input, but need only the number of objects n and a termination condition. There are two principal types of hierarchical methods:

- Agglomerative (bottom-up): merge clusters iteratively. Start by placing each object in its own cluster; merge these atomic clusters into larger and larger clusters, until all objects are in a single cluster. Most hierarchical methods belong to this category; they differ only in their definition of between-cluster similarity. An example is AGNES (AGglomerative NESting) [8].
- Divisive (top-down): split a cluster iteratively. It does the reverse, by starting with all objects in one cluster and subdividing them into smaller pieces. Divisive methods are not generally available and have rarely been applied. An example is DIANA (DIvisive ANAlysis).

Hierarchical methods: Distance between clusters


Merging of clusters is based on the distance between clusters:

- Single-linkage: the shortest distance from any member p of one cluster Ci to any member p' of the other cluster Cj.
- Complete-linkage: the greatest distance from any member p of one cluster Ci to any member p' of the other cluster Cj.
- Average-linkage: the average distance from any member p of one cluster Ci to any member p' of the other cluster Cj.
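A small NumPy sketch of these three between-cluster distances (Euclidean point distance and the function name are assumptions for illustration):

```python
import numpy as np

def linkage_distance(Ci, Cj, kind="single"):
    """Distance between clusters Ci and Cj (arrays of points, one row per object)."""
    # pairwise Euclidean distances between every p in Ci and every p' in Cj
    pairwise = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)
    if kind == "single":      # shortest pairwise distance
        return pairwise.min()
    if kind == "complete":    # greatest pairwise distance
        return pairwise.max()
    if kind == "average":     # average pairwise distance
        return pairwise.mean()
    raise ValueError("kind must be 'single', 'complete' or 'average'")

Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[4.0, 0.0], [6.0, 0.0]])
for kind in ("single", "complete", "average"):
    print(kind, linkage_distance(Ci, Cj, kind))   # 3.0, 6.0, 4.5
```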

Hierarchical methods: Agglomerative Algorithm


Input:
  D = {x1, x2, ..., xn}   // set of elements
  A = [d(i, j)]           // n x n proximity (adjacency) matrix giving the distance between xi and xj
  Cr                      // the r-th cluster, with 1 <= r <= n
  d[Cr, Cs]               // proximity between clusters Cr and Cs
  k                       // sequence number, with k = 0, 1, ..., n-1
  L(k)                    // distance level of the k-th clustering
Output: a dendrogram.
Algorithm:
1. Begin with n clusters, each containing one object, with level L(0) = 0 and sequence number k = 0.
2. Find the least dissimilar pair (Cr, Cs) in the current clustering, according to d[Cr, Cs] = min(d[Ci, Cj]), where the minimum is over all pairs of clusters (Ci, Cj) in the current clustering.
3. Increment the sequence number: k = k + 1. Merge clusters Cr and Cs into a single cluster to form the next clustering k. Set the level of this clustering to L(k) = d[Cr, Cs].
4. Update the proximity matrix by deleting the rows and columns corresponding to Cr and Cs and adding a row and column for the newly formed cluster; then repeat from step 2 until only a single cluster remains.
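A minimal Python sketch of this agglomerative loop (single linkage by default; it records the merge level L(k) at each step, which is the information a dendrogram is drawn from):

```python
import numpy as np

def agglomerative(D, linkage=min):
    """Agglomerative clustering over the points D; returns the list of merges
    as (level L(k), merged cluster) tuples."""
    clusters = [{i} for i in range(len(D))]   # start with n singleton clusters
    dist = lambda a, b: np.linalg.norm(D[a] - D[b])
    merges = []
    while len(clusters) > 1:
        # step 2: find the least dissimilar pair of clusters
        best = None
        for r in range(len(clusters)):
            for s in range(r + 1, len(clusters)):
                d_rs = linkage(dist(a, b) for a in clusters[r] for b in clusters[s])
                if best is None or d_rs < best[0]:
                    best = (d_rs, r, s)
        # step 3: merge the pair and record the level L(k)
        level, r, s = best
        merged = clusters[r] | clusters[s]
        clusters = [c for t, c in enumerate(clusters) if t not in (r, s)] + [merged]
        merges.append((level, merged))
    return merges

D = np.array([[0.0], [1.0], [5.0], [6.0]])
for level, cluster in agglomerative(D):
    print(f"L(k) = {level:.1f}, merged -> {sorted(cluster)}")
```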

Hierarchical methods: Dendrogram


The agglomerative algorithm decomposes the data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.


Other Hierarchical methods


- AGNES: uses the single-link method and the dissimilarity matrix.
- DIANA: inverse order of AGNES; eventually each node forms a cluster on its own.
- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): introduces two concepts, the clustering feature and the CF tree (Clustering Feature tree).
- CURE (Clustering Using REpresentatives): integrates hierarchical and partitioning algorithms to favor clusters with arbitrary shape.
- CHAMELEON (hierarchical clustering using dynamic modeling): explores dynamic modeling in hierarchical clustering.
