Clustering is a kind of unsupervised learning. Clustering is a method of grouping data that share
similar trends and patterns. Clustering of data is a method by which large sets of data are grouped into clusters of smaller sets of similar data.
Example
After clustering
Thus, we see that clustering means grouping data, or dividing a large data set into smaller data sets with some similarity.
Cont
Clustering: the process of grouping a set of physical or abstract objects into classes of similar objects.
Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.
Cluster analysis: grouping a set of data objects into clusters.
A good clustering method will produce clusters with high intra-class similarity and low inter-class similarity.
Scalability: ability to deal well with a large number of data objects.
Ability to deal with different types of attributes: ability to analyze data sets with a mixture of attribute types.
Discovery of clusters with arbitrary shape: it is important to develop algorithms that can detect clusters of arbitrary shape.
Minimal requirements for domain knowledge to determine input parameters.
Ability to deal with noise and outliers.
Spatial data analysis: clustering feature spaces to detect spatial clusters and explain them in spatial data mining.
Image processing.
Economic science (especially market research).
WWW: document classification; clustering weblog data to discover groups of similar access patterns.
Marketing: discovering distinct groups in customer bases, and then using this knowledge to develop targeted marketing programs.
Land use: identification of areas of similar land use in an earth observation database.
Insurance: identifying groups of motor insurance policy holders with a high average claim cost.
City planning: identifying groups of houses according to their house type, value, and geographical location.
Earthquake studies: observed earthquake epicenters should be clustered along continent faults.
[Figure: clustering example over objects a–k; panels (c) and (d).]
Features of Clustering
Insensitive to order of input records: the same data set, when presented to certain algorithms in different orders, may lead to dramatically different clusterings; it is therefore important that algorithms be insensitive to the order of the input.
High dimensionality: the number of attributes or dimensions in many data sets is large, and many clustering algorithms can produce meaningful results only when the number of dimensions is small; it is therefore important that algorithms can produce results even when the number of dimensions is high.
Incorporation of user-specified constraints: real applications may need to perform clustering under various kinds of constraints, so a good algorithm has to produce results even when the data must satisfy such constraints.
Interpretability and usability: the clustering results should be interpretable, comprehensible, and usable.
Clustering process
A partition of a data set X of n objects is a collection of subsets $\{P_j\}$, $j = 1, 2, \ldots, s$, where each $P_j$ is a non-empty subset of X such that every element x in X is in exactly one of these subsets. The subsets satisfy the following two conditions: 1) the union of the $P_j$ is equal to X, $\bigcup_{j=1}^{s} P_j = X$ (we say the $P_j$ cover X); 2) the intersection of any two distinct subsets is empty, $P_i \cap P_j = \emptyset$ for $i \neq j$.
Example of partition
For example, the set X = {a, b, c} with n = 3 objects can be partitioned:
into three subsets $\{P_j\}$, $j = 1, \ldots, 3$, in one way: { {a}, {b}, {c} }, where P1 = {a}, P2 = {b}, P3 = {c}, and s = 3;
into two subsets $\{P_j\}$, $j = 1, 2$, in three ways: { {a, b}, {c} }, { {a, c}, {b} }, { {a}, {b, c} }, where P1 = {a, b} or {a, c} or {b, c}, P2 = {c} or {b} or {a}, and s = 2;
into one subset $\{P_j\}$, $j = 1$, in one way: { {a, b, c} }, where P1 = {a, b, c} and s = 1.
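As a quick check on these counts, here is a small plain-Python sketch (the helper `partitions` is illustrative, not from the lecture) that enumerates every set partition of {a, b, c}:

```python
def partitions(elements):
    """Recursively enumerate all set partitions of a list of elements."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for smaller in partitions(rest):
        # put `first` into each existing block in turn
        for i, block in enumerate(smaller):
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        # or put `first` into its own new block
        yield [[first]] + smaller

for p in partitions(['a', 'b', 'c']):
    print(p)
# 5 partitions in total: one with three blocks, three with two blocks, one with one block
```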
A data set is a collection of objects and their attributes. An attribute is a property or characteristic of an object. Examples: eye color of a person, temperature, etc. An attribute is also known as a variable, field, characteristic, or feature. A collection of attributes describes an object. An object is also known as a record, point, case, sample, entity, or instance.
Attributes (columns) and objects (rows):

Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         | 70K            | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | ...            | ...            | Yes
 6  | No     | ...            | ...            | No
 7  | Yes    | ...            | ...            | No
 8  | No     | ...            | ...            | Yes
 9  | No     | ...            | ...            | No
 10 | No     | ...            | ...            | Yes
There are different types of attributes:
Binary: only two states, the variable is absent or present. Examples: male and female, on and off.
Nominal: a generalization of the binary variable in that it can take more than two states. Examples: ID numbers, eye color, zip codes.
Ordinal: an ordinal variable can be discrete or continuous, and its $M_q$ values can be mapped to a ranking $1, \ldots, M_q$. Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}.
Ratio-scaled: makes a positive measurement on a scale, such as an exponential scale $Ae^{Bt}$ or $Ae^{-Bt}$. Examples: temperature in Kelvin, length, time, counts.
Variables of mixed types: a collection of all the previous variable types. Examples: temperature in Kelvin, grades, ID numbers, counts.
Categorical: when there is no inherent distance measure between data values. Example: consider a relation that stores information about movies; a movie is an object (tuple) characterized by the attribute values director, actor/actress, and genre.
Binary variables:
Nominal, ordinal, and ratio variables:
Interval-valued variables
Standardize data
Calculate the mean absolute deviation:
$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$,
where
$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$.
Calculate the standardized measurement (z-score):
$z_{if} = \frac{x_{if} - m_f}{s_f}$.
Using the mean absolute deviation is more robust than using the standard deviation.
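A minimal sketch of this standardization step in plain Python (the helper name `standardize` and the sample column are illustrative): it computes the mean m_f, the mean absolute deviation s_f, and the z-scores z_if for one attribute.

```python
def standardize(values):
    """Standardize one attribute column using the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                        # mean of attribute f
    s_f = sum(abs(x - m_f) for x in values) / n  # mean absolute deviation
    return [(x - m_f) / s_f for x in values]     # standardized measurements z_if

print(standardize([2.0, 4.0, 4.0, 6.0]))  # example attribute column -> [-2.0, 0.0, 0.0, 2.0]
```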
Binary Variables
A contingency table for binary data
              object j
              1      0      sum
object i  1   a      b      a+b
          0   c      d      c+d
          sum a+c    b+d    p
Simple matching coefficient (invariant if the binary variable is symmetric): $d(i, j) = \frac{b + c}{a + b + c + d}$.
Jaccard coefficient (noninvariant if the binary variable is asymmetric): $d(i, j) = \frac{b + c}{a + b + c}$.
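Both coefficients can be computed directly from the a, b, c, d counts of the contingency table above; the sketch below (illustrative helper name and data, plain Python) does exactly that for two binary objects:

```python
def binary_dissimilarities(x, y):
    """Return (simple matching, Jaccard) dissimilarities for two binary objects."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    simple_matching = (b + c) / (a + b + c + d)   # symmetric binary variables
    jaccard = (b + c) / (a + b + c)               # asymmetric binary variables
    return simple_matching, jaccard

print(binary_dissimilarities([1, 0, 1, 0, 0, 0], [1, 0, 0, 1, 0, 0]))
```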
Nominal Variables
A generalization of the binary variable in that it can take more than two states.
Method 1: simple matching, where m is the number of attributes on which objects i and j match and p is the total number of attributes:
$d(i, j) = \frac{p - m}{p}$.
Method 2: use a large number of binary variables
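A small sketch of Method 1 (simple matching) for nominal attributes, assuming each object is given as a tuple of attribute values (the movie values below are purely illustrative):

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for nominal attributes (simple matching)."""
    p = len(obj_i)                                        # total number of attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)    # number of matches
    return (p - m) / p

# e.g. two movies described by (director, actor, genre); values are made up
print(nominal_dissimilarity(("Coppola", "Pacino", "crime"),
                            ("Coppola", "De Niro", "crime")))   # -> 1/3
```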
Ordinal Variables
An ordinal variable can be discrete or continuous. Order is important, e.g., rank. It can be treated like an interval-scaled variable by replacing each value with its rank.
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a scale, such as an approximately exponential scale $Ae^{Bt}$ or $Ae^{-Bt}$. Methods to handle such variables:
apply a logarithmic transformation, $y_{if} = \log(x_{if})$;
treat them as continuous ordinal data and treat their rank as interval-scaled.
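For illustration, a tiny plain-Python example of the logarithmic transformation (the base-10 logarithm and the sample values are assumptions chosen only for readability):

```python
import math

counts = [1, 10, 100, 1_000, 10_000]          # ratio-scaled measurements (illustrative)
log_counts = [math.log10(x) for x in counts]  # y_if = log(x_if) -> [0.0, 1.0, 2.0, 3.0, 4.0]
print(log_counts)
```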
Density-based: based on connectivity and density functions.
Grid-based: based on a multiple-level granularity structure.
Model-based: a model is hypothesized for each cluster, and the idea is to find the best fit of the data to the given model.
Partitioning Algorithms
Construct a partition of a database D of n objects into a set of k clusters.
k-means: each cluster is represented by the center (mean) of the cluster. K-means clustering is an effective algorithm to extract a given number of clusters of patterns from a training set. Once done, the cluster locations can be used to classify patterns into distinct classes.
k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster.
Algorithm ( K-means )
Input: k clusters, n objects of database D.
Output: a set of k clusters which minimizes the squared-error function E.
Algorithm:
1) Choose k objects as the initial cluster centers.
2) Assign each object to the cluster which has the closest mean point (centroid) under the squared Euclidean distance metric.
3) When all objects have been assigned, recalculate the positions of the k mean points (centroids).
4) Repeat steps 2) and 3) until the centroids no longer change.
Pseudo code
Input:
  k                       // desired number of clusters
  D = {x1, x2, ..., xn}   // set of elements
Output:
  K = {C1, C2, ..., Ck}   // set of k clusters which minimizes the squared-error function E
K-means algorithm:
  1) assign initial values for the mean points μ1, μ2, ..., μk   // k seeds
  repeat
    2.1) assign each item xi to the cluster which has the closest mean;
    2.2) calculate the new mean for each cluster;
  until the convergence criterion is met;
The complexity is O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
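For concreteness, here is a compact plain-Python sketch of the algorithm above (the function name, 2-D sample data, and stopping rule are illustrative choices, not the lecture's reference implementation):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)             # step 1: pick k initial centers
    assignment = None
    for _ in range(max_iter):
        # step 2: assign each object to the closest centroid (squared Euclidean)
        new_assignment = [
            min(range(k), key=lambda c: sum((p - q) ** 2
                                            for p, q in zip(x, centroids[c])))
            for x in points
        ]
        if new_assignment == assignment:          # step 4: stop when assignments are stable
            break
        assignment = new_assignment
        # step 3: recompute each centroid as the mean of its members
        for c in range(k):
            members = [x for x, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5), (1.0, 0.5)]
print(kmeans(data, k=2))
```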
Weakness
Not applicable to categorical data.
Need to specify k, the number of clusters, in advance.
Unable to handle noisy data and outliers.
Not suitable for discovering clusters with non-convex shapes.
To overcome some of these problems, the k-medoids (PAM) method was introduced.
K-Medoids
Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
Calculate $T_{ih}$, the total swap contribution for the pair of objects (i, h), as
$T_{ih} = \sum_{j=1}^{n} C_{jih}$,
where $C_{jih}$ is the contribution to swapping the pair of objects (i <-> h) from object j, defined below. There are four possibilities to consider when calculating $C_{jih}$; see Tab. 1 in the Appendix.
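As a hedged sketch, the total swap contribution T_ih can also be computed directly as the change in total dissimilarity when medoid i is replaced by candidate h, without enumerating the four cases for C_jih (the function name and the tiny distance matrix below are illustrative):

```python
def swap_cost(dist, medoids, i, h):
    """dist[a][b]: pairwise dissimilarities; medoids: current medoid indices;
    i: medoid to remove; h: non-medoid candidate to add."""
    new_medoids = [m for m in medoids if m != i] + [h]
    t_ih = 0.0
    for j in range(len(dist)):                    # contribution C_jih from every object j
        before = min(dist[j][m] for m in medoids)
        after = min(dist[j][m] for m in new_medoids)
        t_ih += after - before
    return t_ih                                   # negative => the swap improves the clustering

# tiny illustrative distance matrix over 4 objects, current medoids {0, 2}
D = [[0, 1, 4, 5],
     [1, 0, 3, 6],
     [4, 3, 0, 2],
     [5, 6, 2, 0]]
print(swap_cost(D, [0, 2], i=2, h=3))
```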
Strength: PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
Weakness: PAM works efficiently for small data sets but does not scale well to large data sets. In fact, each iteration costs O(k(n-k)^2), where n is the number of objects and k the number of clusters.
CLARA (Clustering LARge Applications): a method that, instead of taking the whole data set into consideration, chooses only a small portion of the real data (in a random manner) as a representative of the data; medoids are then chosen from this sample using PAM. It deals with larger data sets than PAM.
Weakness: efficiency depends on the sample size. A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.
CLARANS (Clustering Large Applications based upon RANdomized Search): a method that draws a sample of neighbors dynamically. The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids. If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum. It is more efficient and scalable than both PAM and CLARA.
Hierarchical methods group data objects into a tree of clusters and use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs only the number of objects n and a termination condition. There are two principal types of hierarchical methods:
Agglomerative (bottom-up): merge clusters iteratively.
Start by placing each object in its own cluster, then merge these atomic clusters into larger and larger clusters until all objects are in a single cluster.
Most hierarchical methods belong to this category. They differ only in their definition of between-cluster similarity. An example is AGNES (AGglomerative NESting) [8].
Divisive (top-down): split a cluster iteratively.
It does the reverse by starting with all objects in one cluster and subdividing them into smaller pieces. Divisive methods are not generally available and have rarely been applied. An example is DIANA (DIvisive ANAlysis).
In the single-link method, the distance between two clusters is the smallest distance from any member P of one cluster Ci to any member P' of the other cluster Cj.
Input:
  A          // n x n proximity (adjacency) matrix A = [d(i, j)] showing the distance between xi and xj;
  Cr         // r-th cluster, with 1 <= r <= n;
  d[Cr, Cs]  // proximity between clusters Cr and Cs;
  k          // sequence number, with k = 0, 1, ..., n-1;
  L(k)       // distance level of the k-th clustering;
Output:      // dendrogram
Algorithm:
1. Begin with n clusters, each containing one object, and set the level L(0) = 0 and the sequence number k = 0.
2. Find the least dissimilar pair (Cr, Cs) in the current clustering, according to d[Cr, Cs] = min(d[Ci, Cj]), where the minimum is over all pairs of clusters (Ci, Cj) in the current clustering.
3. Increment the sequence number: k = k + 1. Merge clusters Cr and Cs into a single cluster to form the next clustering k. Set the level of this clustering to L(k) = d[Cr, Cs].
4. Update the proximity matrix by deleting the rows and columns corresponding to Cr and Cs and adding a row and column for the newly formed cluster; then repeat from step 2 until all objects are in a single cluster.
Clusters are obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
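The following plain-Python sketch (illustrative only; single-link cluster distances, naive pairwise search) follows the agglomerative steps above on a small dissimilarity matrix and records the level L(k) of each merge:

```python
def agglomerative_single_link(dist):
    """Naive single-link agglomerative clustering over a symmetric dissimilarity matrix."""
    n = len(dist)
    clusters = [[i] for i in range(n)]            # step 1: one object per cluster
    merges = []                                   # list of (level L(k), merged members)
    while len(clusters) > 1:
        # step 2: find the least dissimilar pair of clusters (single link = min distance)
        best = None
        for r in range(len(clusters)):
            for s in range(r + 1, len(clusters)):
                d_rs = min(dist[a][b] for a in clusters[r] for b in clusters[s])
                if best is None or d_rs < best[0]:
                    best = (d_rs, r, s)
        level, r, s = best
        # steps 3-4: merge Cr and Cs and record the level of this clustering
        merged = clusters[r] + clusters[s]
        clusters = [c for t, c in enumerate(clusters) if t not in (r, s)] + [merged]
        merges.append((level, merged))
    return merges

D = [[0, 2, 6, 10],
     [2, 0, 5, 9],
     [6, 5, 0, 4],
     [10, 9, 4, 0]]
print(agglomerative_single_link(D))
```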
Hierarchical methods, cont.: use the Single-Link method and the dissimilarity matrix.
Hierarchical methods: DIANA: inverse order of AGNES; eventually each node forms a cluster on its own.
Hierarchical methods: BIRCH: BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a method that introduces two concepts: the clustering feature and the CF tree (Clustering Feature tree).
Hierarchical methods: CURE: the CURE method (Clustering Using REpresentatives) integrates hierarchical and partitioning algorithms to favor clusters with arbitrary shape.
Hierarchical methods: CHAMELEON: the CHAMELEON (hierarchical clustering using dynamic modeling) algorithm explores dynamic modeling in hierarchical clustering.