
K-MEANS DATA CLUSTERING APPROACH
SEMINAR REPORT

Submitted by
SHAIKH FAIZAN AHMED (8CS65)
Atir Kahn (8CS23)
IN

BACHELOR OF TECHNOLOGY
(COMPUTER SCIENCE & ENGINEERING)

SUBMITTED TO

SCHOOL OF ENGINEERING AND TECHNOLOGY
JAIPUR NATIONAL UNIVERSITY


Abstract

In data mining, clustering is a technique in which a set of objects is assigned to groups called clusters. Clustering is an essential part of data mining. K-means clustering is the most basic and most widely used clustering algorithm; it partitions a data set into a given number of clusters by assigning each object to its nearest cluster centre. Numerous efforts have been made to improve the performance of the K-means clustering algorithm. In this paper we briefly review, in the form of a survey, the work carried out by different researchers using K-means clustering, and we also discuss the limitations and applications of the K-means clustering algorithm. This paper presents a current review of the K-means clustering algorithm.

CONTENTS
1. Introduction
   1.1 History
2. Definitions and Notation
3. Literature Review
4. Clustering Techniques
   4.1 Hierarchical Clustering Algorithms
   4.2 k-Means Clustering Algorithm
   4.3 Fuzzy Clustering
   4.4 Representation of Clusters
5. Applications
   5.1 Image Segmentation Using Clustering
   5.2 Object and Character Recognition
   5.3 Information Retrieval
   5.4 Data Mining
6. Summary

List of tables:

1. Table 1 (Section 5.4.1): The seven smallest clusters found in the document set. Page 29.

List of diagrams/figures:

1. Figure 1 (Section 4): A taxonomy of clustering approaches. Page 09.
2. Figure 2 (Section 4.1): Monothetic partitional clustering. Page 10.
3. Figure 3 (Section 4.1): Points falling in three clusters. Page 11.
4. Figure 4: The dendrogram obtained using the single-link algorithm. Page 11.
5. Figure 5: Two concentric clusters. Page 11.
6. Figure 6 (Section 4.2): The k-means algorithm is sensitive to the initial partition. Page 12.
7. Figure 7 (Section 4.3): Fuzzy clusters. Page 13.
8. Figure 8 (Section 4.3): Representation of clusters by points. Page 14.
9. Figure 9 (Section 4.4): Representation of clusters by a classification tree or by conjunctive statements. Page 15.
10. Figure 10 (Section 5.1): Feature representation for clustering. Image measurements and positions are transformed to features; clusters in feature space correspond to image segments. Page 19.
11. Figure 11 (Section 5.1.1): Binarization via thresholding. (a): Original grayscale image. (b): Gray-level histogram. (c): Results of thresholding. Page 18.
12. Figure 12 (Section 5.1.1): Range image segmentation using clustering. (a): Input range image. (b): Surface normals for selected image pixels. (c): Initial segmentation (19-cluster solution) returned by CLUSTER using 1000 six-dimensional samples from the image as a pattern set. (d): Final segmentation (8 segments) produced by postprocessing. Page 20.
13. Figure 13 (Section 5.1.1): Texture image segmentation results. (a): Four-class texture mosaic. (b): Four-cluster solution produced by CLUSTER with pixel coordinates included in the feature set. Page 21.
14. Figure 14 (Section 5.1.3): Multispectral medical image segmentation. (a): A single channel of the input image. (b): 9-cluster segmentation. Page 23.
15. Figure 15 (Section 5.1.3): LANDSAT image segmentation. (a): Original image (ESA/EURIMAGE/Sattelitbild). (b): Clustered scene. Page 23.
16. Figure 16: A dendrogram corresponding to 100 books. Page 28.

I. INTRODUCTION

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of the algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided; data points are clustered on the basis of feature similarity. The results of the K-means clustering algorithm are:

The centroids of the K clusters, which can be used to label new data.
Labels for the training data (each data point is assigned to a single cluster).

Rather than defining groups before looking at the data, clustering allows you to find and analyze the groups that have formed organically. The number of groups K has to be chosen in advance; approaches for determining it are discussed in the literature review (Section 3). Each centroid of a cluster is a collection of feature values which defines the resulting group, and examining the centroid feature values can be used to qualitatively interpret what kind of group each cluster represents.

This introduction to the K-means clustering algorithm covers:

Common business cases where K-means is used.
The steps involved in running the algorithm.
A Python example using delivery fleet data.

1.1 History

Even though there is an increasing interest in the use of clustering methods in pattern recognition and image processing, clustering has a rich history in other disciplines such as biology, psychiatry, psychology, archaeology, geology, geography, and marketing. Other terms more or less synonymous with clustering include unsupervised learning, numerical taxonomy [Sneath and Sokal 1973], vector quantization [Oehler and Gray 1995], and learning by observation [Michalski and Stepp 1983]. The field of spatial analysis of point patterns [Ripley 1988] is also related to cluster analysis. The importance and interdisciplinary nature of clustering is evident through its vast literature.

A number of books on clustering have been published [Jain and Dubes 1988; Anderberg 1973; Hartigan 1975; Spath 1980; Duran and Odell 1974; Everitt 1993; Backer 1995], in addition to some useful and influential review papers. A survey of the state of the art in clustering circa 1978 was reported in Dubes and Jain [1980].

A comparison of various clustering algorithms for constructing the minimal spanning tree and the short spanning path was given in Lee [1981]. Cluster analysis was also surveyed in Jain [1986]. A review of image segmentation by clustering was reported in Jain and Flynn [1996]. Comparisons of various combinatorial optimization schemes, based on experiments, have been reported in Mishra and Raghavan [1994] and Al-Sultan and Khan [1996].

2. DEFINITIONS AND NOTATION

The following terms and notation are used throughout this paper.
A pattern (or feature vector, observation, or datum) x is a single data item used by the clustering algorithm. It typically consists of a vector of d measurements: x = (x1, ..., xd).

The individual scalar components x_i of a pattern x are called features (or attributes).

d is the dimensionality of the pattern or of the pattern space.

A pattern set is denoted X = {x1, ..., xn}. The ith pattern in X is denoted x_i = (x_i,1, ..., x_i,d). In many cases a pattern set to be clustered is viewed as an n × d pattern matrix.

A class, in the abstract, refers to a state of nature that governs the pattern generation process in some cases. More concretely, a class can be viewed as a source of patterns whose distribution in feature space is governed by a probability density specific to the class. Clustering techniques attempt to group patterns so that the classes thereby obtained reflect the different pattern generation processes represented in the pattern set.

Fuzzy clustering procedures assign to each input pattern x_i a fractional degree of membership f_ij in each output cluster j.
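To make this notation concrete, the short Python sketch below (an illustration added for this report, not taken from the cited literature; the data values are hypothetical) builds an n × d pattern matrix and runs K-means on it with scikit-learn, producing the two outputs described in the Introduction: a label for each pattern and the K cluster centroids.

```python
import numpy as np
from sklearn.cluster import KMeans

# An n x d pattern matrix: n = 6 patterns, d = 2 features (hypothetical values).
X = np.array([
    [1.0, 1.1], [0.9, 0.8], [1.2, 1.0],   # patterns near (1, 1)
    [8.0, 8.2], [7.8, 8.1], [8.3, 7.9],   # patterns near (8, 8)
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("labels for the training data:", kmeans.labels_)        # one cluster index per pattern
print("centroids of the K clusters:\n", kmeans.cluster_centers_)
```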
3. Literature Review

K-means is the most popular partitioning method of clustering. MacQueen first proposed this technique in 1967, though the idea goes back to Hugo Steinhaus in 1957. The standard algorithm was first proposed by Stuart Lloyd in 1957 as a technique for pulse-code modulation. It is sometimes referred to as Lloyd-Forgy because, in 1965, E. W. Forgy published essentially the same method.

K. A. Abdul Nazeer discusses one major drawback of the k-means algorithm: for different sets of initial centroids it produces different clusters, so the final cluster quality depends on the selection of the initial centroids. The original k-means algorithm includes two phases, the first for determining the initial centroids and the second for assigning data points to the nearest clusters and then recalculating the cluster means. An enhanced clustering method is discussed in that paper, in which both phases of the original k-means algorithm are modified to improve accuracy and efficiency. The algorithm combines a systematic method for finding initial centroids with an efficient way of assigning data points to clusters. A remaining limitation of this enhanced algorithm is that the value of k, the number of desired clusters, is still required as an input, regardless of the distribution of the data points.

Y. S. Thakare discusses the performance of k-means evaluated on various databases, such as the Iris, Wine, Vowel, Ionosphere and Crude Oil data sets, and with various distance metrics. It is concluded that the performance of k-means clustering depends on the data set used as well as on the distance metric. The k-means clustering algorithm is also evaluated for recognition rate for different numbers of clusters. This work helps in choosing a better distance metric for a particular application.

Soumi Ghosh presents a comparative discussion of two clustering algorithms, namely centroid-based K-Means and representative-object-based FCM (Fuzzy C-Means) clustering. The discussion is based on a performance evaluation of the efficiency of the clustering output obtained by applying these algorithms. The factors analysed, upon which the behaviour of both algorithms depends, are the number of data points and the number of clusters. The result of this comparative study is that FCM produces results close to those of K-means, but its computation time is higher than that of k-means because of the fuzzy measure calculations involved.

Sakthi discusses that, due to the increase in the amount of data across the world, analysis of the data turns out to be a very difficult task. To understand and learn from the data, it has to be classified into remarkable collections; hence there is a need for data mining techniques.

Shafeeq presents a modified K-means algorithm to improve the cluster quality and to fix the optimal number of clusters. The number of clusters (K) is given to the K-means algorithm by the user as an input, but in a practical scenario it is very difficult to fix the number of clusters in advance. The method proposed in that paper works both when the number of clusters is known in advance and when it is unknown. The user has the flexibility either to fix the number of clusters or to input the minimum number of clusters required. The new cluster centres are computed by the algorithm by incrementing the cluster counter by one in each iteration until the cluster-quality validity criterion is satisfied, so the algorithm finds the optimal number of clusters on the run. The proposed approach takes more computational time than K-means for larger data sets, which is its major drawback.

Libao Zhang proposes a simple and qualitative methodology using the k-means clustering algorithm to classify NBA guards, with the Euclidean distance used as the measure of similarity. The work is demonstrated using the k-means clustering algorithm and the data of 120 NBA guards; manual classification by traditional methods is improved using this model. According to the existing statistical data, the NBA players are classified so that the classification and evaluation are objective and scientific. The work shows that this is a very effective and reasonable methodology: based on the classification result, the guards' type can be defined properly, and the guards' function in the team can be evaluated in a fair and objective manner.

R. Amutha discusses that when two or more algorithms of the same category of clustering technique are used, better results are acquired. In that paper, various clustering algorithms are discussed. Two k-means algorithms discussed are Parallel k/h-Means Clustering for Large Data Sets and A Novel K-Means Based Clustering Algorithm for High Dimensional Data Sets. The Parallel k/h-Means algorithm is designed to deal with very large data sets; its application results have shown 90% efficiency on a distributed computing environment, which indicates that the algorithm is scalable. The Novel K-Means Based Clustering algorithm provides the advantages of using both HC and K-Means; using these two algorithms, the space and similarity between the data sets present at each node is extended.
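Several of the works reviewed above are concerned with choosing the number of clusters k. As a hedged illustration of that general idea (not the procedure of any specific paper above), the sketch below scores candidate values of k with the silhouette coefficient on synthetic data and keeps the best one.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)   # synthetic data

best_k, best_score = None, -1.0
for k in range(2, 11):                       # increment the cluster counter, as in the text
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)      # one possible cluster-quality criterion
    if score > best_score:
        best_k, best_score = k, score

print(f"selected k = {best_k} (silhouette = {best_score:.3f})")
```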
4. CLUSTERING TECHNIQUES

Different approaches to clustering data can be described with the help of the hierarchy shown in Figure 1 (other taxonometric representations of clustering methodology are possible). At the top level, there is a distinction between hierarchical and partitional approaches: hierarchical methods produce a nested series of partitions, while partitional methods produce only one.

The taxonomy shown in Figure 1 must be supplemented by a discussion of cross-cutting issues that may (in principle) affect all of the different approaches regardless of their placement in the taxonomy.

Figure 1. A taxonomy of clustering approaches. Clustering methods divide into hierarchical approaches (single link, complete link) and partitional approaches (squared error, e.g. k-means; graph theoretic; mixture resolving, e.g. expectation maximization; and mode seeking).


4.1 Hierarchical Clustering Algorithms

The operation of a hierarchical clustering algorithm is illustrated using the two-dimensional data set in Figure 3. This figure depicts seven patterns labeled A, B, C, D, E, F, and G in three clusters. A hierarchical algorithm yields a dendrogram representing the nested grouping of patterns and the similarity levels at which groupings change. The dendrogram corresponding to the seven points in Figure 3, obtained from the single-link algorithm, is shown in Figure 4. The dendrogram can be broken at different levels to yield different clusterings of the data.

Figure 2. Monothetic partitional clustering. (The figure shows points labeled 1-4 in the X1-X2 plane, separated by the thresholds H1 and H2.)

Most hierarchical clustering algorithms are variants of the single-link, complete-link, and minimum-variance algorithms. Of these, the single-link and complete-link algorithms are the most popular. These two algorithms differ in the way they characterize the similarity between a pair of clusters. In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of patterns drawn from the two clusters (one pattern from the first cluster, the other from the second). In the complete-link algorithm, the distance between two clusters is the maximum of all pairwise distances between patterns in the two clusters. In either case, two clusters are merged to form a larger cluster based on minimum distance criteria. The complete-link algorithm produces tightly bound or compact clusters. The single-link algorithm, by contrast, suffers from a chaining effect: it has a tendency to produce clusters that are straggly or elongated.

The single-link algorithm is otherwise more versatile than the complete-link algorithm. For example, the single-link algorithm can extract the concentric clusters shown in Figure 5, but the complete-link algorithm cannot. However, from a pragmatic viewpoint, it has been observed that the complete-link algorithm produces more useful hierarchies in many applications than the single-link algorithm.
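The single-link and complete-link inter-cluster distances defined above translate directly into code; the sketch below (illustrative only, with made-up points) computes both for two small clusters.

```python
import numpy as np

def single_link(A, B):
    """Minimum distance over all pairs (one pattern from each cluster)."""
    return min(np.linalg.norm(a - b) for a in A for b in B)

def complete_link(A, B):
    """Maximum distance over all pairs of patterns from the two clusters."""
    return max(np.linalg.norm(a - b) for a in A for b in B)

cluster1 = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster2 = np.array([[4.0, 0.0], [5.0, 0.0]])

print("single-link distance: ", single_link(cluster1, cluster2))    # 3.0
print("complete-link distance:", complete_link(cluster1, cluster2)) # 5.0
```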
Figure 3. Points falling in three clusters.

Figure 4. The dendrogram obtained using the single-link algorithm (similarity plotted against the patterns A-G).

Figure 5. Two concentric clusters.

The k-means algorithm is the simplest and most commonly used algorithm employing a squared error criterion. It starts with a random initial partition and keeps reassigning the patterns to clusters based on the similarity between the pattern and the cluster centers until a convergence criterion is met (e.g., there is no reassignment of any pattern from one cluster to another, or the squared error ceases to decrease significantly after some number of iterations).

The k-means algorithm is popular because it is easy to implement, and its time complexity is O(n), where n is the number of patterns. A major problem with this algorithm is that it is sensitive to the selection of the initial partition and may converge to a local minimum of the criterion function value if the initial partition is not properly chosen. Figure 6 illustrates this sensitivity. If we start with patterns A, B, and C as the initial means around which the three clusters are built, then we end up with the partition {{A}, {B, C}, {D, E, F, G}} shown by ellipses. The squared error criterion value is much larger for this partition than for the best partition {{A, B, C}, {D, E}, {F, G}} shown by rectangles, which yields the global minimum value of the squared error criterion function for a clustering containing three clusters. The correct three-cluster solution is obtained by choosing, for example, A, D, and F as the initial cluster means.

Figure 6. The k-means algorithm is sensitive to the initial partition.
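The sensitivity to the initial partition described above can be reproduced in a few lines. The coordinates below are hypothetical stand-ins for the seven patterns A-G of Figure 3; initializing with A, B, C leads to a partition with a much larger squared error (inertia) than initializing with A, D, F.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical coordinates standing in for patterns A..G of Figure 3.
points = {"A": (0, 0), "B": (1, 0), "C": (2, 0),
          "D": (5, 0), "E": (6, 0), "F": (5, 4), "G": (6, 4)}
X = np.array(list(points.values()), dtype=float)

def run(init_names):
    init = np.array([points[name] for name in init_names], dtype=float)
    km = KMeans(n_clusters=3, init=init, n_init=1).fit(X)
    return km.labels_, km.inertia_

labels_bad, err_bad = run(["A", "B", "C"])    # poor initial means
labels_good, err_good = run(["A", "D", "F"])  # good initial means

print("init A,B,C -> labels", labels_bad, "squared error", round(err_bad, 2))
print("init A,D,F -> labels", labels_good, "squared error", round(err_good, 2))
```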
4.2 k-Means Clustering Algorithm

(1) Choose k cluster centers to coincide with k randomly chosen patterns or k randomly defined points inside the hypervolume containing the pattern set.

(2) Assign each pattern to the closest cluster center.

(3) Recompute the cluster centers using the current cluster memberships.

(4) If a convergence criterion is not met, go to step 2. Typical convergence criteria are: no (or minimal) reassignment of patterns to new cluster centers, or a minimal decrease in squared error.

Several variants of the k-means algorithm have been reported in the literature. Some of them attempt to select a good initial partition so that the algorithm is more likely to find the global minimum value.

Another variation is to permit splitting and merging of the resulting clusters. Typically, a cluster is split when its variance is above a pre-specified threshold, and two clusters are merged when the distance between their centroids is below another pre-specified threshold. Using this variant, it is possible to obtain the optimal partition starting from any arbitrary initial partition, provided proper threshold values are specified. The well-known ISODATA algorithm employs this technique of merging and splitting clusters. If ISODATA is given the "ellipse" partitioning as an initial partitioning, it will produce the optimal three-cluster partitioning: it will first merge the clusters {A} and {B, C} into one cluster, because the distance between their centroids is small, and then split the cluster {D, E, F, G}, which has a large variance, into the two clusters {D, E} and {F, G}.

Another variation of the k-means algorithm involves selecting a different criterion function altogether. The dynamic clustering algorithm (which permits representations other than the centroid for each cluster) was proposed in Diday [1973] and Symon [1977], and describes a dynamic clustering approach obtained by formulating the clustering problem in the framework of maximum-likelihood estimation.
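A minimal from-scratch sketch of steps (1)-(4) in NumPy (illustrative, not production code) is given below.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # (1) Choose k cluster centers to coincide with k randomly chosen patterns.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # (2) Assign each pattern to the closest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (3) Recompute the cluster centers using the current memberships.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # (4) Stop when the centers no longer move; otherwise repeat from step 2.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical test data: three well-separated blobs.
X = np.vstack([np.random.default_rng(1).normal(loc=c, size=(20, 2)) for c in (0, 5, 10)])
labels, centers = k_means(X, k=3)
print(centers)
```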
4.3 Fuzzy Clustering

Traditional clustering approaches generate partitions; in a partition, each pattern belongs to one and only one cluster, so the clusters in a hard clustering are disjoint. Fuzzy clustering extends this notion and associates each pattern with every cluster using a membership function. The output of such algorithms is a clustering, but not a partition. We give a high-level partitional fuzzy clustering algorithm below.
Fuzzy Clustering Algorithm

(1) Select an initial fuzzy partition of the N objects into K clusters by selecting the N × K membership matrix U. An element u_ij of this matrix represents the grade of membership of object x_i in cluster c_j. Typically, u_ij is in [0,1].

(2) Using U, find the value of a fuzzy criterion function, e.g., a weighted squared error criterion function, associated with the corresponding partition. One possible fuzzy criterion function is

E²(X, U) = Σ_{i=1}^{N} Σ_{k=1}^{K} u_ik ||x_i − c_k||²,

where c_k = Σ_{i=1}^{N} u_ik x_i is the kth fuzzy cluster center. Reassign patterns to clusters to reduce this criterion function value and recompute U.

(3) Repeat step 2 until entries in U do not change significantly.

In fuzzy clustering, each cluster is a fuzzy set of all the patterns. Figure 7 illustrates the idea. The rectangles enclose two "hard" clusters in the data: H1 = {1,2,3,4,5} and H2 = {6,7,8,9}. A fuzzy clustering algorithm might produce the two fuzzy clusters F1 and F2 depicted by ellipses. The patterns have membership values in [0,1] for each cluster. For example, fuzzy cluster F2 could be described as

{(1,0.0), (2,0.0), (3,0.0), (4,0.1), (5,0.15), (6,0.4), (7,0.35), (8,1.0), (9,0.9)}.

The ordered pairs (i, μ_i) in each cluster represent the ith pattern and its membership value μ_i in the cluster. Larger membership values indicate higher confidence in the assignment of the pattern to the cluster. A hard clustering can be obtained from a fuzzy partition by thresholding the membership values.

Figure 7. Fuzzy clusters (patterns 1-9, hard clusters H1 and H2, fuzzy clusters F1 and F2).

The most popular fuzzy clustering algorithm is the fuzzy c-means (FCM) algorithm. Even though it is better than the hard k-means algorithm at avoiding local minima, FCM can still converge to local minima of the squared error criterion. The design of membership functions is the most important problem in fuzzy clustering; different choices include those based on similarity decomposition and on centroids of clusters. A generalization of the FCM algorithm was proposed by Bezdek [1981] through a family of objective functions. A fuzzy c-shell algorithm and an adaptive variant for detecting circular and elliptical boundaries was presented in Dave [1992].
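The fuzzy criterion above can be turned into a small working example. The sketch below is a simplified FCM-style illustration: it uses the standard fuzzifier m = 2 and normalizes the membership weights when computing the fuzzy centres, which is an assumption made for the example rather than the exact formulation quoted above.

```python
import numpy as np

def fuzzy_c_means(X, K, m=2.0, n_iter=50, seed=0):
    """Simplified fuzzy clustering: alternate center and membership updates."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), K))
    U /= U.sum(axis=1, keepdims=True)            # initial fuzzy partition, rows sum to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]           # weighted fuzzy centers
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))
        U /= U.sum(axis=1, keepdims=True)        # updated memberships in [0, 1]
    return U, centers

X = np.array([[0.0, 0], [0, 1], [1, 0], [8, 8], [9, 8], [8, 9]])   # made-up patterns
U, centers = fuzzy_c_means(X, K=2)
print(np.round(U, 2))   # fractional memberships; thresholding them gives a hard clustering
```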
Figure 8. Representation of a cluster by points: by the centroid, or by three distant points.

4.4 Representation of Clusters

In applications where the number of classes or clusters in a data set must be discovered, a partition of the data set is the end product. Here, a partition gives an idea about the separability of the data points into clusters and about whether it is meaningful to employ a supervised classifier that assumes a given number of classes in the data set. However, in many other applications that involve decision making, the resulting clusters have to be represented or described in a compact form to achieve data abstraction. Even though the construction of a cluster representation is an important step in decision making, it has not been examined closely by researchers. The notion of cluster representation was introduced in Duran and Odell [1974] and was subsequently studied in Diday and Simon [1976] and Michalski [1981]. They suggested the following representation schemes:

(1) Represent a cluster of points by its centroid or by a set of distant points in the cluster.

(2) Represent clusters using nodes in a classification tree. This is illustrated in Figure 9.

(3) Represent clusters by using conjunctive logical expressions. For example, the expression [X1 > 3][X2 < 2] in Figure 9 stands for the logical statement "X1 is greater than 3 and X2 is less than 2".

Use of the centroid to represent a cluster is the most popular scheme. It works well when the clusters are compact or isotropic. However, when the clusters are elongated or non-isotropic, this scheme fails to represent them properly. In such a case, the use of a collection of boundary points in a cluster captures its shape well. The number of points used to represent a cluster should increase as the complexity of its shape increases. The two different representations illustrated in Figure 9 are equivalent: every path in a classification tree from the root node to a leaf node corresponds to a conjunctive statement. An important limitation of the typical use of simple conjunctive concept representations is that they can describe only rectangular or isotropic clusters in the feature space.
Figure 9. Representation of clusters by a classification tree or by conjunctive statements. (The tree first splits on X1 < 3 versus X1 > 3, then on X2 < 2 versus X2 > 2; the equivalent conjunctive statements are 1: [X1 < 3]; 2: [X1 > 3][X2 < 2]; 3: [X1 > 3][X2 > 2].)
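One practical way to obtain conjunctive descriptions like those in Figure 9 is to fit a shallow decision tree to the cluster labels produced by k-means and read the rules off the root-to-leaf paths. This is only an illustrative sketch of the idea (with synthetic data), not the method of Diday and Simon [1976] or Michalski [1981].

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((1, 1), 0.5, (30, 2)),    # roughly: X1 < 3
               rng.normal((5, 1), 0.5, (30, 2)),    # roughly: X1 > 3 and X2 < 2
               rng.normal((5, 4), 0.5, (30, 2))])   # roughly: X1 > 3 and X2 > 2

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels)
# Each root-to-leaf path printed below is a conjunctive statement describing a cluster.
print(export_text(tree, feature_names=["X1", "X2"]))
```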
Data abstraction is useful in decision making because of the following:

It gives a simple and intuitive description of clusters which is easy for human comprehension. In both conceptual clustering [Michalski and Stepp 1983] and symbolic clustering this representation is obtained without using an additional step; these algorithms generate the clusters as well as their descriptions. A set of fuzzy rules can be obtained from fuzzy clusters of a data set, and these rules can be used to build fuzzy classifiers and fuzzy controllers.

It helps in achieving data compression. A partitional clustering like the k-means algorithm cannot properly separate structures such as the two concentric clusters shown in Figure 5. The single-link algorithm works well on such data, but it is computationally expensive. A hybrid approach may therefore be used to exploit the desirable properties of both algorithms: we first obtain, say, 8 subclusters of the data using the (computationally efficient) k-means algorithm, and the single-link algorithm is then applied to these subcluster centroids alone to cluster them into 2 groups. By representing the subclusters by their centroids, the data are compressed.

It increases the efficiency of the decision-making task. In a cluster-based document retrieval technique, a large collection of documents is clustered and each of the clusters is represented using its centroid. In order to retrieve documents relevant to a query, the query is matched with the cluster centroids rather than with all the documents. This helps in retrieving relevant documents efficiently. Also, in several applications involving large data sets, clustering is used to perform indexing, which helps in efficient decision making.

5. APPLICATIONS

Clustering algorithms have been used in a large variety of applications. In this section, we describe several applications where clustering has been employed as an essential step. These areas are:

(1) image segmentation,
(2) object and character recognition,
(3) document retrieval, and
(4) data mining.

5.1 Image Segmentation Using Clustering
Figure 10. Feature representation for clustering. Image measurements and positions are transformed to features. Clusters in feature space correspond to image segments.
Image segmentation is a fundamental component in many computer vision applications, and can be addressed as a clustering problem. The segmentation of the image(s) presented to an image analysis system is critically dependent on the scene to be sensed, the imaging geometry, configuration, and sensor used to transduce the scene into a digital image, and ultimately the desired output (goal) of the system.

The applicability of clustering methodology to the image segmentation problem was recognized over three decades ago, and the paradigms underlying the initial pioneering efforts are still in use today. A recurring theme is to define feature vectors at every image location (pixel) composed of both functions of image intensity and functions of the pixel location itself.

5.1.1 Segmentation. An image segmentation is typically defined as an exhaustive partitioning of an input image into regions, each of which is considered to be homogeneous with respect to some image property of interest. If the input image is I = {x_ij, i = 1, ..., N_r; j = 1, ..., N_c}, with N_r rows and N_c columns and measurement value x_ij at pixel (i, j), then the segmentation can be expressed as S = {S_1, ..., S_k}, with the lth segment S_l = {(i_l1, j_l1), ..., (i_lN_l, j_lN_l)} consisting of a connected subset of the pixel coordinates. No two segments share any pixel locations (S_i ∩ S_j = ∅ for all i ≠ j), and the union of all segments covers the entire image (∪_{i=1}^{k} S_i = {1, ..., N_r} × {1, ..., N_c}). Jain and Dubes [1988], after Fu and Mui [1981], identified three techniques for producing segmentations from input imagery: region-based, edge-based, or cluster-based.

Consider the use of simple gray level thresholding to segment a high-contrast intensity image. Figure 11(a) shows a grayscale image of a textbook's bar code scanned on a flatbed scanner. Part (b) shows the gray-level histogram, and part (c) shows the results of a simple thresholding operation designed to separate the dark and light regions in the bar code area. Binarization steps like this are often performed in character recognition systems.
Figure 11. Binarization via thresholding. (a): Original grayscale image. (b): Gray-level histogram. (c): Results of thresholding.
Thresholding in effect "clusters" the image pixels into two groups based on the one-dimensional intensity measurement. While simple gray level thresholding is adequate in some carefully controlled image acquisition environments, and much research has been devoted to appropriate methods for thresholding, complex images require more elaborate segmentation techniques.

Many segmenters use measurements which are both spectral (e.g., the multispectral scanner used in remote sensing) and spatial (based on the pixel's location in the image plane). The measurement at each pixel hence corresponds directly to our concept of a pattern.
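Treating binarization as a two-cluster problem on the one-dimensional intensity measurement, as described above, can be sketched as follows (the "image" values are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical grayscale image: dark bars on a light background.
image = np.array([[200, 210, 40, 45, 205],
                  [198, 215, 38, 42, 208],
                  [202, 207, 35, 44, 211]], dtype=float)

intensities = image.reshape(-1, 1)            # one-dimensional intensity measurement per pixel
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(intensities)

binary = km.labels_.reshape(image.shape)      # pixels clustered into two groups
print("cluster centers (dark/light gray levels):", km.cluster_centers_.ravel())
print(binary)
```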

The CLUSTER algorithm was used to obtain segment labels for each pixel. CLUSTER is an enhancement of the k-means algorithm; it has the ability to identify several clusterings of a data set, each with a different number of clusters. Hoffman and Jain [1987] also experimented with other clustering techniques (e.g., complete-link, single-link, graph-theoretic, and other squared error algorithms) and found CLUSTER to provide the best combination of performance and accuracy. An additional advantage of CLUSTER is that it produces a sequence of output clusterings (i.e., a 2-cluster solution up through a Kmax-cluster solution, where Kmax is specified by the user and is typically 20 or so); each clustering in this sequence yields a clustering statistic which combines between-cluster separation and within-cluster scatter. The clustering that optimizes this statistic is chosen as the best one. Each pixel in the range image is assigned the segment label of the nearest cluster center. This minimum distance classification step is not guaranteed to produce segments which are connected in the image plane; therefore, a connected components labeling algorithm allocates new labels for disjoint regions that were placed in the same cluster. Subsequent operations include surface type tests, merging of adjacent patches using a test for the presence of crease or jump edges between adjacent segments, and surface parameter estimation.

Figure 12 shows this processing applied to a range image. Part (a) of the figure shows the input range image; part (b) shows the distribution of surface normals. In part (c), the initial segmentation returned by CLUSTER and modified to guarantee connected segments is shown. Part (d) shows the final segmentation produced by merging adjacent patches which do not have a significant crease edge between them. The final clusters reasonably represent distinct surfaces present in this complex object.
Figure 12. Range image segmentation using clustering. (a): Input range image. (b): Surface normals for selected image pixels. (c): Initial segmentation (19-cluster solution) returned by CLUSTER using 1000 six-dimensional samples from the image as a pattern set. (d): Final segmentation (8 segments) produced by postprocessing.

The analysis of textured images has been of interest to researchers for several years. Texture segmentation techniques have been developed using a variety of texture models and image operations. In Nguyen and Cohen [1993], texture image segmentation was addressed by modeling the image as a hierarchy of two Markov Random Fields, obtaining some simple statistics from each image block to form a feature vector, and clustering these blocks using a fuzzy K-means clustering method. The clustering procedure is modified to jointly estimate the number of clusters as well as the fuzzy membership of each feature vector in the various clusters.

A system for segmenting texture images was described in Jain and Farrokhnia [1991]; there, Gabor filters were used to obtain a set of 28 orientation- and scale-selective features that characterize the texture in the neighborhood of each pixel. These 28 features are reduced to a smaller number through a feature selection procedure, and the resulting features are preprocessed and then clustered using the CLUSTER program.

Figure 13. Texture image segmentation results. (a): Four-class texture mosaic. (b): Four-cluster solution produced by CLUSTER with pixel coordinates included in the feature set.

An index statistic is used to select the best clustering. Minimum distance classification is used to label each of the original image pixels. This technique was tested on several texture mosaics including the natural Brodatz textures and synthetic images. Figure 13(a) shows an input texture mosaic consisting of four of the popular Brodatz textures [Brodatz 1966]. Part (b) shows the segmentation produced when the Gabor filter features are augmented to contain spatial information (pixel coordinates). This Gabor-filter-based technique has proven very powerful and has been extended to the automatic segmentation of text in documents [Jain and Bhattacharjee 1992] and segmentation of objects in complex backgrounds [Jain et al.].

Clustering can be used as a preprocessing stage to identify pattern classes for subsequent supervised classification. Taxt and Lundervold [1994] and Lundervold et al. [1996] describe a partitional clustering algorithm and a manual labeling technique to identify material classes (e.g., cerebrospinal fluid, white matter, striated muscle, tumor) in registered images of a human head obtained at five different magnetic resonance imaging channels (yielding a five-dimensional feature vector at each pixel). A number of clusterings were obtained and combined with domain knowledge (human expertise) to identify the different classes. Decision rules for supervised classification were based on these obtained classes. Figure 14(a) shows one channel of an input multispectral image; part (b) shows the 9-cluster result.

The k-means algorithm was applied to the segmentation of LANDSAT imagery in Solberg et al. [1996]. Initial cluster centers were chosen interactively by a trained operator, and correspond to land-use classes such as urban areas, soil (vegetation-free) areas, forest, grassland, and water. Figure 15(a) shows the input image rendered as grayscale; part (b) shows the result of the clustering procedure.

5.1.3 Summary. In this section, the application of clustering methodology to image segmentation problems has been motivated and surveyed. The historical record shows that clustering is a powerful tool for obtaining classifications of image pixels.
Figure 14. Multispectral medical image segmentation. (a): A single channel of the input image. (b): 9-cluster segmentation.

Figure 15. LANDSAT image segmentation. (a): Original image (ESA/EURIMAGE/Sattelitbild). (b): Clustered scene.

Key issues in the design of any clustering-based segmenter are the choice of pixel measurements (features) and the dimensionality of the feature vector (i.e., should the feature vector contain intensities, pixel positions, model parameters, filter outputs?), a measure of similarity which is appropriate for the selected features and the application domain, the identification of a clustering algorithm, the development of strategies for feature and data reduction (to avoid the "curse of dimensionality" and the computational burden of classifying large numbers of patterns and/or features), and the identification of necessary pre- and post-processing techniques. The use of clustering for segmentation dates back to the 1960s, and new variations continue to emerge in the literature. Challenges to the more successful use of clustering include the high computational complexity of many clustering algorithms and their incorporation of strong assumptions (often multivariate Gaussian) about the multidimensional shape of the clusters to be obtained. The ability of new clustering procedures to handle concepts and semantics in classification (in addition to numerical measurements) will be important for certain applications [Michalski and Stepp 1983; Murty and Jain 1995].
Stepp 1983; Murty and Jain 1995]. Object views were grouped into
classes based on the similarity of shape
spectral features. Each input image of
5.2 Object and Character an object viewed in isolation yields a
Recognition feature vector which characterizes that
view. The feature vector contains the
6.2.1 Object Recognition. The use of first ten central moments of a normal-
clustering to group views of 3D objects ized shape spectral distribution, # ~ !
H
for the purposes of object recognition ,
in range data was described in Dorai h
and Jain [1995]. The term view refers to of an object view. The shape spectrum
a range image of an unoccluded object of an object view is obtained from its
obtained from any arbitrary viewpoint. range data by constructing a histogram
The system under consideration em- of
ployed a viewpoint dependent (or view- shape index values (which are related
to surface curvature values) and
centered) approach to the object recog- accumu-lating all the object pixels
nition problem; each object to be that fall into each bin. By normalizing
the spectrum with respect to the total
recognized was represented in terms of object area, the scale (size)
a library of range images of that object. differences that may exist between
different objects are removed.
There are many possible views of a The first moment m1 is computed as
3D object and one goal of that work the
was to avoid matching an unknown #~ !
input view against each image of each weighted mean of H h :
object. A common theme in the object #
recognition literature is indexing, m1 5 O~h!H~h!. (1)
wherein the un-known view is used to h
select a subset of views of a subset of The other central moments, mp, 2 # p #
the objects in the database for further 10 are defined as:
comparison, and rejects all other views
p# (2) present in the model database, }D. The
mp 5 O~h 2 m1! H~h!.
h ith view of the jth object, Oij in the
database is represented by ^Lij, Rij&,
Then, the feature vector is denoted as where Lij is the object label and R ij is
R 5 ~m1, m2, ´ ´ ´, m10!, with the range the feature vector. Given a set of object
of each of these moments being representations 5i 5 $^Li1, Ri1&, ´ ´ ´,
@21,1#. ^ i i
L m, R m&% that describes m
Let 2 5 $O1, O2, ´ ´ ´, On% be a col- views of the ith object, the goal is to
lection of n 3D objects whose views are derive a par-
tition of the views, 3i 5 $Ci1, Ci2, ´ ´ ´, tains those views of the ith object that
Ciki%. Each cluster in 3i con- have been adjudged similar based on
the dissimilarity between the corre-dissimilarity, between Rij and Rik, is de-
sponding moment features of the shapefined as:
spectra of the views. The measure of $~Ri, Ri ! 5 O10~Ri 2 Ri !2. (3)
j k jl
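Equations (1)-(3) translate directly into code. In the hedged sketch below, H is a hypothetical normalized shape-spectrum histogram over bin centres h; it is not data from Dorai and Jain's experiments.

```python
import numpy as np

def moment_features(h, H):
    """First ten central moments of a normalized shape spectrum H(h), eqs. (1)-(2)."""
    m1 = np.sum(h * H)                                   # (1) weighted mean
    return np.array([m1] + [np.sum((h - m1) ** p * H) for p in range(2, 11)])

def dissimilarity(R_j, R_k):
    """Eq. (3): squared Euclidean distance between two moment feature vectors."""
    return np.sum((R_j - R_k) ** 2)

h = np.linspace(-1, 1, 64)                               # shape-index bin centres (hypothetical)
H1 = np.exp(-((h - 0.3) ** 2) / 0.02); H1 /= H1.sum()    # two made-up object views
H2 = np.exp(-((h - 0.5) ** 2) / 0.02); H2 /= H2.sum()

R1, R2 = moment_features(h, H1), moment_features(h, H2)
print(dissimilarity(R1, R2))
```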

5.3 Information Retrieval

Information retrieval (IR) is concerned with the automatic storage and retrieval of documents [Rasmussen 1992]. Many university libraries use IR systems to provide access to books, journals, and other documents. Libraries use the Library of Congress Classification (LCC) scheme for efficient storage and retrieval of books. The LCC scheme consists of classes labeled A to Z [LC Classification Outline 1990] which are used to characterize books belonging to different subjects. For example, label Q corresponds to books in the area of science, and the subclass QA is assigned to mathematics. Labels QA76 to QA76.8 are used for classifying books related to computers and other areas of computer science.

There are several problems associated with the classification of books using the LCC scheme. Some of these are listed below:

(1) When a user is searching for books in a library which deal with a topic of interest to him, the LCC number alone may not be able to retrieve all the relevant books. This is because the classification number assigned to the books, or the subject categories that are typically entered in the database, do not contain sufficient information regarding all the topics covered in a book. To illustrate this point, let us consider the book Algorithms for Clustering Data by Jain and Dubes [1988]. Its LCC number is 'QA 278.J35'. In this LCC number, QA 278 corresponds to the topic 'cluster analysis', J corresponds to the first author's name, and 35 is the serial number assigned by the Library of Congress. The subject categories for this book provided by the publisher (which are typically entered in a database to facilitate search) are cluster analysis, data processing, and algorithms. There is a chapter in this book [Jain and Dubes 1988] that deals with computer vision, image processing, and image segmentation. So a user looking for literature on computer vision and, in particular, image segmentation will not be able to access this book by searching the database with the help of either the LCC number or the subject categories provided in the database. The LCC number for computer vision books is TA 1632 [LC Classification 1990], which is very different from the number QA 278.J35 assigned to this book.

(2) There is an inherent problem in assigning LCC numbers to books in a rapidly developing area. For example, let us consider the area of neural networks. Initially, category 'QP' in the LCC scheme was used to label books and conference proceedings in this area. For example, Proceedings of the International Joint Conference on Neural Networks [IJCNN'91] was assigned the number 'QP 363.3'. But most of the recent books on neural networks are given a number using the category label 'QA'; Proceedings of the IJCNN'92 [IJCNN'92] is assigned the number 'QA 76.87'. Multiple labels for books dealing with the same topic will force them to be placed on different stacks in a library. Hence, there is a need to update the classification labels from time to time in an emerging discipline.

(3) Assigning a number to a new book is a difficult problem. A book may deal with topics corresponding to two or more LCC numbers, and therefore assigning a unique number to such a book is difficult.

Murty and Jain describe a knowledge-based clustering scheme to group representations of books, which are obtained using the ACM CR (Association for Computing Machinery Computing Reviews) classification tree [ACM CR Classifications 1994]. This tree is used by the authors contributing to various ACM publications to provide keywords in the form of ACM CR category labels. The tree consists of 11 nodes at the first level, labeled A to K. Each node in this tree has a label that is a string of one or more symbols. These symbols are alphanumeric characters. For example, I515 is the label of a fourth-level node in the tree.

5.3.1 Pattern Representation. Each book is represented as a generalized list [Sangal 1991] of these strings using the ACM CR classification tree. For the sake of brevity in representation, the fourth-level nodes in the ACM CR classification tree are labeled using numerals 1 to 9 and characters A to Z. For example, the children nodes of I.5.1 (models) are labeled I.5.1.1 to I.5.1.6. Here, I.5.1.1 corresponds to the node labeled deterministic, and I.5.1.6 stands for the node labeled structural. In a similar fashion, all the fourth-level nodes in the tree can be labeled as necessary. From now on, the dots between successive symbols will be omitted to simplify the representation. For example, I.5.1.1 will be denoted as I511.

5.4 Data Mining

In recent years we have seen ever increasing volumes of collected data of all sorts. With so much data available, it is necessary to develop algorithms which can extract meaningful information from the vast stores. Searching for useful nuggets of information among huge amounts of data has become known as the field of data mining.

Data mining can be applied to relational, transaction, and spatial databases, as well as large stores of unstructured data such as the World Wide Web.
There are many data mining systems in use today, and applications include the U.S. Treasury detecting money laundering, National Basketball Association coaches detecting trends and patterns of play for individual players and teams, and categorizing patterns of children in the foster care system [Hedberg 1996]. Several journals have had recent special issues on data mining [Cohen 1996, Cross 1996, Wah 1996].

5.4.1 Data Mining Approaches. Data mining, like clustering, is an exploratory activity, so clustering methods are well suited for data mining. Clustering is often an important initial step of several in the data mining process [Fayyad 1996]. Some of the data mining approaches which use clustering are database segmentation, predictive modeling, and visualization of large databases.

Segmentation. Clustering methods are used in data mining to segment databases into homogeneous groups. This can serve the purposes of data compression (working with the clusters rather than individual items), or of identifying characteristics of subpopulations which can be targeted for specific purposes (e.g., marketing aimed at senior citizens).

A continuous k-means clustering algorithm [Faber 1994] has been used to cluster pixels in Landsat images [Faber et al. 1994]. Each pixel originally has 7 values from different satellite bands, including infra-red. These 7 values are difficult for humans to assimilate and analyze without assistance. Pixels with the 7 feature values are clustered into 256 groups; then each pixel is assigned the value of its cluster centroid. The image can then be displayed with the spatial information intact. Human viewers can look at a single picture and identify a region of interest (e.g., highway or forest) and label it as a concept. The system then identifies other pixels in the same cluster as instances of that concept.
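The Landsat pixel-clustering scheme described above amounts to vector quantization of the 7-band pixel vectors. The sketch below mimics it on random data; it uses fewer than the 256 groups mentioned in the text purely to keep the toy example fast.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pixels = rng.random((64 * 64, 7))                  # hypothetical 64x64 image, 7 spectral bands per pixel

n_groups = 32                                      # the text uses 256 groups; fewer here for a quick toy run
km = KMeans(n_clusters=n_groups, n_init=4, random_state=0).fit(pixels)

compressed = km.cluster_centers_[km.labels_]       # each pixel replaced by its cluster centroid
compressed_image = compressed.reshape(64, 64, 7)   # spatial layout kept intact for display
print(compressed_image.shape)
```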
Figure 16. A dendrogram corresponding to 100 books.

Predictive Modeling. Statistical methods of data analysis usually involve hypothesis testing of a model the analyst already has in mind. Data mining can aid the user in discovering potential hypotheses prior to using statistical tools. Predictive modeling uses clustering to group items, then infers rules to characterize the groups and suggest models. For example, magazine subscribers can be clustered based on a number of factors (age, sex, income, etc.), and the resulting groups characterized in an attempt to find a model which will distinguish those subscribers that will renew their subscriptions from those that will not [Simoudis 1996].

Visualization. Clusters in large databases can be used for visualization, in order to aid human analysts in identifying groups and subgroups that have similar characteristics. WinViz [Lee and Ong 1996] is a data mining visualization tool in which derived clusters can be exported as new attributes which can then be characterized by the system. For example, breakfast cereals are clustered according to calories, protein, fat, sodium, fiber, carbohydrate, sugar, potassium, and vitamin content per serving. Upon seeing the resulting clusters, the user can export the clusters to WinViz as attributes. The system shows that one of the clusters is characterized by high potassium content, and the human analyst recognizes the individuals in the cluster as belonging to the "bran" cereal family, leading to the generalization that "bran cereals are high in potassium."
Table 1 The seven smallest clusters found in the document set. These are
stemmed words.
5.4.2 Mining Large Unstructured Databases. Data mining has often been performed on transaction and relational databases which have well-defined fields that can be used as features, but there has been recent research on large unstructured databases such as the World Wide Web [Etzioni 1996].

Examples of recent attempts to classify Web documents using words or functions of words as features include Maarek and Shaul [1996] and Chekuri et al. [1999]. However, relatively small sets of labeled training samples and very large dimensionality limit the ultimate success of automatic Web document categorization based on words as features.

Rather than grouping documents in a word feature space, Wulfekuhler and Punch [1997] cluster the words from a small collection of World Wide Web documents in the document space. The sample data set consisted of 85 documents from the manufacturing domain in 4 different user-defined categories (labor, legal, government, and design). These 85 documents contained 5190 distinct word stems after common words (the, and, of) were removed. Since the words are certainly not uncorrelated, they should fall into clusters where words used in a consistent way across the document set have similar values of frequency in each document.

K-means clustering was used to group the 5190 words into 10 groups. One surprising result was that an average of 92% of the words fell into a single cluster, which could then be discarded for data mining purposes. The smallest clusters contained terms which to a human seem semantically related.
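A small sketch of clustering words in the document space, in the spirit of the experiment described above (a tiny made-up corpus, not the 85-document data set of Wulfekuhler and Punch):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

docs = ["patent filing invent legal claim",           # hypothetical mini-corpus
        "labor union leave medical family",
        "patent claim invent legal",
        "union labor leave family"]

# Rows of the document-term matrix are documents; transposing gives one
# frequency vector per word across the document set.
counts = CountVectorizer().fit(docs)
word_vectors = counts.transform(docs).toarray().T.astype(float)
words = counts.get_feature_names_out()

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(word_vectors)
for k in range(3):
    print(k, [w for w, l in zip(words, labels) if l == k])
```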

Terms which are used in ordinary contexts, or unique terms which do not occur often across the training document set, will tend to cluster into the single large cluster. This takes care of spelling errors, proper names which are infrequent, and terms which are used in the same manner throughout the entire document set. Terms used in specific contexts (such as "file" in the context of filing a patent, rather than a computer file) will appear in the documents consistently with other terms appropriate to that context (patent, invent) and thus will tend to cluster together. Among the groups of words, unique contexts stand out from the crowd.

After discarding the largest cluster, the smaller set of features can be used to construct queries for seeking out other relevant documents on the Web using standard Web searching tools (e.g., Lycos, Alta Vista, Open Text). Searching the Web with terms taken from the word clusters allows discovery of finer-grained topics (e.g., family medical leave) within the broadly defined categories (e.g., labor).
5.4.3 Data Mining in Geological Databases. Database mining is a critical resource in oil exploration and production. It is common knowledge in the oil industry that the typical cost of drilling a new offshore well is in the range of $30-40 million, but the chance of that site being an economic success is 1 in 10. More informed and systematic drilling decisions can significantly reduce overall production costs.

Advances in drilling technology and data collection methods have led to oil companies and their ancillaries collecting large amounts of geophysical/geological data from production wells and exploration sites, and then organizing them into large databases. Data mining techniques have recently been used to derive precise analytic relations between observed phenomena and parameters. These relations can then be used to quantify oil and gas reserves.

In qualitative terms, good recoverable reserves have high hydrocarbon saturation trapped by highly porous sediments (reservoir porosity) and surrounded by hard bulk rocks that prevent the hydrocarbon from leaking away. A large volume of porous sediments is crucial to finding good recoverable reserves; therefore, developing reliable and accurate methods for the estimation of sediment porosities from the collected data is key to estimating hydrocarbon potential.

The general rule of thumb experts use for porosity computation is that it is a quasi-exponential function of depth:

Porosity = K · e^(−F(x1, x2, ..., xm) · Depth).   (4)

A number of factors, such as rock types, structure, and cementation, entering as parameters of the function F, confound this relationship. This necessitates the definition of proper contexts in which to attempt discovery of porosity formulas. Geological contexts are expressed in terms of geological phenomena, such as geometry, lithology, compaction, and subsidence, associated with a region. It is well known that geological context changes from basin to basin (different geographical areas in the world) and also from region to region within a basin [Allen and Allen 1990; Biswas 1995]. Furthermore, the underlying features of contexts may vary greatly. Simple model matching techniques, which work in engineering domains where behavior is constrained by man-made systems and well-established laws of physics, may not apply in the hydrocarbon exploration domain.
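Equation (4) is straightforward to evaluate once a context-specific function F has been chosen; the sketch below uses an entirely hypothetical F and constant K just to show the quasi-exponential decay of porosity with depth.

```python
import numpy as np

def porosity(depth, K=0.45, rock_factor=0.3, cementation=0.2):
    """Eq. (4): Porosity = K * exp(-F(x1, ..., xm) * Depth), with a made-up F."""
    F = 1e-4 * (1.0 + rock_factor + cementation)    # hypothetical confounding-factor function
    return K * np.exp(-F * depth)

for depth in (500, 1500, 3000):                     # depths in metres (illustrative)
    print(depth, round(porosity(depth), 3))
```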
6. SUMMARY

There are several applications where decision making and exploratory pattern analysis have to be performed on large data sets. For example, in document retrieval, a set of relevant documents has to be found among several millions of documents of dimensionality of more than 1000. It is possible to handle these problems if some useful abstraction of the data is obtained and is used in decision making, rather than directly using the entire data set. By data abstraction, we mean a simple and compact representation of the data. This simplicity helps the machine in efficient processing, and helps a human in comprehending the structure in the data easily. Clustering algorithms are ideally suited for achieving data abstraction.

Clustering is a process of grouping data items based on a measure of similarity. Clustering is a subjective process; the same set of data items often needs to be partitioned differently for different applications. This subjectivity makes the process of clustering difficult, because a single algorithm or approach is not adequate to solve every clustering problem. A possible solution lies in reflecting this subjectivity in the form of knowledge. This knowledge is used either implicitly or explicitly in one or more phases of clustering. Knowledge-based clustering algorithms use domain knowledge explicitly.

The most challenging step in clustering is feature extraction or pattern representation. Pattern recognition researchers conveniently avoid this step by assuming that the pattern representations are available as input to the clustering algorithm. In small data sets, pattern representations can be obtained based on the previous experience of the user with the problem. However, in the case of large data sets, it is difficult for the user to keep track of the importance of each feature in clustering. A solution is to make as many measurements on the patterns as possible and use them in pattern representation. But it is not possible to use a large collection of measurements directly in clustering because of computational costs, so several feature extraction/selection approaches have been designed to obtain linear or nonlinear combinations of these measurements which can be used to represent patterns. Most of the schemes proposed for feature extraction/selection are typically iterative in nature and cannot be used on large data sets due to prohibitive computational costs.
The second step in clustering is similarity computation. A variety of schemes have been used to compute similarity between two patterns. They use knowledge either implicitly or explicitly. Most of the knowledge-based clustering algorithms use explicit knowledge in similarity computation. However, if patterns are not represented using proper features, then it is not possible to get a meaningful partition, irrespective of the quality and quantity of knowledge used in similarity computation. There is no universally acceptable scheme for computing similarity between patterns represented using a mixture of both qualitative and quantitative features. Dissimilarity between a pair of patterns is represented using a distance measure that may or may not be a metric.
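As one deliberately simple illustration (not a universal scheme), the sketch below scores dissimilarity between mixed-type patterns by combining a normalized absolute difference for quantitative features with a 0/1 mismatch for qualitative ones; the per-feature ranges are assumed known, and whether the result behaves as a metric depends on how the features are coded and weighted.

    def mixed_dissimilarity(x, y, is_qualitative, ranges):
        """Average per-feature dissimilarity for mixed-type patterns.

        x, y           -- two patterns (sequences of feature values)
        is_qualitative -- per-feature flags, True for qualitative features
        ranges         -- per-feature value ranges used to normalize the
                          quantitative differences into [0, 1]
        """
        total = 0.0
        for xi, yi, qualitative, rng in zip(x, y, is_qualitative, ranges):
            if qualitative:
                total += 0.0 if xi == yi else 1.0   # simple mismatch score
            else:
                total += abs(xi - yi) / rng         # normalized numeric difference
        return total / len(x)

    # Example with one quantitative feature (height in cm) and one qualitative one (colour):
    d = mixed_dissimilarity((180.0, "red"), (170.0, "blue"),
                            is_qualitative=(False, True), ranges=(50.0, 1.0))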
The next step in clustering is the grouping step. There are broadly two grouping schemes: hierarchical and partitional schemes. The hierarchical schemes are more versatile, and the partitional schemes are less expensive. The partitional algorithms aim at minimizing the squared error criterion function. Motivated by the failure of the squared error partitional clustering algorithms in finding the optimal solution to this problem, a large collection of approaches has been proposed and used to obtain the global optimal solution to this problem. However, these schemes are computationally prohibitive on large data sets. ANN-based clustering schemes are neural implementations of the clustering algorithms, and they share the undesired properties of these algorithms. However, ANNs have the capability to automatically normalize the data and extract features. An important observation is that even if a scheme can find the optimal solution to the squared error partitioning problem, it may still fall short of the requirements because of the possible non-isotropic nature of the clusters.
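The criterion in question is the familiar squared error. For a pattern set X partitioned into K clusters with centroids c_j, it can be written as

    e^2(X, L) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \left\| x_i^{(j)} - c_j \right\|^2

where x_i^{(j)} is the i-th pattern assigned to the j-th cluster and n_j is the number of patterns in that cluster. Partitional algorithms such as k-means try to make this quantity as small as possible, which is also why elongated or otherwise non-isotropic clusters can defeat them even at the global optimum.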
In some applications, for example in document retrieval, it may be useful to have a clustering that is not a partition. This means clusters are overlapping. Fuzzy clustering and functional clustering are ideally suited for this purpose. Also, fuzzy clustering algorithms can handle mixed data types. However, a major problem with fuzzy clustering is that it is difficult to obtain the membership values. A general approach may not work because of the subjective nature of clustering.
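One standard way of obtaining membership values, used for example by the fuzzy c-means algorithm, derives them from the distances d_{ij} = \| x_i - c_j \| between pattern i and the K cluster centers, using a fuzzification exponent m > 1:

    u_{ij} = \left[ \sum_{k=1}^{K} \left( \frac{d_{ij}}{d_{ik}} \right)^{2/(m-1)} \right]^{-1}

so that each pattern's memberships across the K clusters sum to one. Choosing m and interpreting the resulting soft assignments is exactly where the difficulty mentioned above arises.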
It is required to represent the clusters obtained in a suitable form to help the decision maker. Knowledge-based clustering schemes generate intuitively appealing descriptions of clusters. They can be used even when the patterns are represented using a combination of qualitative and quantitative features, provided that knowledge linking a concept and the mixed features is available. However, implementations of the conceptual clustering schemes are computationally expensive and are not suitable for grouping large data sets.

The k-means algorithm and its neural implementation, the Kohonen net, are most successfully used on large data sets. This is because the k-means algorithm is simple to implement and computationally attractive because of its linear time complexity. However, it is not feasible to use even this linear time algorithm on very large data sets. Incremental algorithms such as the leader algorithm and its neural implementation, the ART network, can be used to cluster large data sets, but they tend to be order-dependent. Divide and conquer is a heuristic that has been rightly exploited by computer algorithm designers to reduce computational costs. However, it should be judiciously used in clustering to achieve meaningful results.
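Because the k-means algorithm is so central to this report, a minimal sketch of it (Lloyd's iteration) is given below; the data matrix, the number of clusters, and the random initialization are assumptions of the example, and each pass costs time linear in the number of patterns, which is the property referred to above. In practice one would usually rely on a library implementation and a more careful seeding such as k-means++ [Arthur and Vassilvitskii 2007].

    import numpy as np

    def k_means(X, k, n_iter=100, seed=0):
        """Minimal k-means (Lloyd's algorithm).

        Each iteration assigns every pattern to its nearest centroid and then
        moves each centroid to the mean of its assigned patterns, so the cost
        per iteration is linear in the number of patterns.
        """
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial seeds
        for _ in range(n_iter):
            # Assignment step: index of the nearest centroid for every pattern.
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # Update step: recompute each centroid; keep it if its cluster went empty.
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(new_centroids, centroids):
                break                      # converged: centroids stopped moving
            centroids = new_centroids
        return labels, centroids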
In summary, clustering is an interesting, useful, and challenging problem. It has great potential in applications like object recognition, image segmentation, and information filtering and retrieval. However, it is possible to exploit this potential only after making several design choices carefully.

REFERENCES

1. AARTS, E. AND KORST, J. 1989. Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing. Wiley-Interscience Series in Discrete Mathematics and Optimization. John Wiley and Sons, Inc., New York, NY.
2. ACM, 1994. ACM CR Classifications. ACM Computing Surveys 35, 5–16.
3. AL-SULTAN, K. S. 1995. A tabu search approach to clustering problems. Pattern Recogn. 28, 1443–1451.
4. AL-SULTAN, K. S. AND KHAN, M. M. 1996. Computational experience on four algorithms for the hard clustering problem. Pattern Recogn. Lett. 17, 3, 295–308.
5. ALLEN, P. A. AND ALLEN, J. R. 1990. Basin Analysis: Principles and Applications. Blackwell Scientific Publications, Inc., Cambridge, MA.
6. ALTA VISTA, 1999. http://altavista.digital.com.
AMADASUN, M. AND KING, R. A. 1988. Low-level segmentation of multispectral images via agglomerative clustering of uniform neighbourhoods. Pattern Recogn. 21, 3, 261–268.
7. ANDERBERG, M. R. 1973. Cluster Analysis for Applications. Academic Press, Inc., New York, NY.
8. AUGUSTSON, J. G. AND MINKER, J. 1970. An analysis of some graph theoretical clustering techniques. J. ACM 17, 4 (Oct. 1970), 571–588.
9. BABU, G. P. AND MURTY, M. N. 1993. A near-optimal initial seed value selection in K-means algorithm using a genetic algorithm. Pattern Recogn. Lett. 14, 10 (Oct. 1993), 763–769.
10. BABU, G. P. AND MURTY, M. N. 1994. Clustering with evolution strategies. Pattern Recogn. 27, 321–329.
11. BACKER, E. 1995. Computer-Assisted Reasoning in Cluster Analysis. Prentice Hall International (UK) Ltd., Hertfordshire, UK.
12. BRODATZ, P. 1966. Textures: A Photographic Album for Artists and Designers. Dover Publications, Inc., Mineola, NY.
13. CAN, F. 1993. Incremental clustering for dynamic information processing. ACM Trans. Inf. Syst. 11, 2 (Apr. 1993), 143–164.
14. CARPENTER, G. AND GROSSBERG, S. 1990. ART3: Hierarchical search using chemical transmitters in self-organizing pattern recognition architectures. Neural Networks 3, 129–152.
16. CHENG, C. H. 1995. A branch-and-bound clustering algorithm. IEEE Trans. Syst. Man Cybern. 25, 895–898.
17. CHENG, Y. AND FU, K. S. 1985. Conceptual clustering in knowledge organization. IEEE Trans. Pattern Anal. Mach. Intell. 7, 592–598.
18. CHENG, Y. 1995. Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell. 17, 7 (July), 790–799.
19. CHIEN, Y. T. 1978. Interactive Pattern Recognition. Marcel Dekker, Inc., New York, NY.
20. CHOUDHURY, S. AND MURTY, M. N. 1990. A divisive scheme for constructing minimal spanning trees in coordinate space. Pattern Recogn. Lett. 11, 6 (Jun. 1990), 385–389.
21. COLEMAN, G. B. AND ANDREWS, H. 1979. Image segmentation by clustering. Proc. IEEE 67, 5, 773–785.
22. ARTHUR, D. AND VASSILVITSKII, S. 2006. How slow is the k-means method?
23. ARTHUR, D. AND VASSILVITSKII, S. 2007. k-means++: The advantages of careful seeding. In Proceedings of the 2007 Symposium on Discrete Algorithms (SODA).
24. DALE, M. B. 1985. On the comparison of conceptual clustering and numerical taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 7, 241–244.
25. DAVE, R. N. 1992. Generalized fuzzy C-shells clustering and detection of circular and elliptic boundaries. Pattern Recogn. 25, 713–722.
26. DAY, W. H. E. 1992. Complexity theory: An introduction for practitioners of classification. In Clustering and Classification, P. Arabie and L. Hubert, Eds. World Scientific Publishing Co., Inc., River Edge, NJ.
27. CHANDA, B. AND DUTTA MAJUMDAR, D. Digital Image Processing and Analysis.
28. Hierarchical, mixture of Gaussians) + some interactive demos (Java applets).
29. ZHA, H., DING, C., GU, M., HE, X., AND SIMON, H. D. 2001. Spectral relaxation for K-means clustering. In Neural Information Processing Systems, vol. 14 (NIPS 2001), 1057–1064, Vancouver, Canada, Dec. 2001.
30. HARTIGAN, J. A. 1975. Clustering Algorithms. Wiley.
31. HARTIGAN, J. A. AND WONG, M. A. 1979. A K-means clustering algorithm. Applied Statistics 28, 1, 100–108.
32. www.wikipedia.com