K-MEANS DATA CLUSTERING APPROACH

SEMINAR REPORT

Submitted by
SHAIKH FAIZAN AHMED (8CS65)
ATIR KAHN (8CS23)

IN
BACHELOR OF TECHNOLOGY
(COMPUTER SCIENCE & ENGINEERING)

SUBMITTED TO
Abstract

CONTENTS

1. Introduction
   1.1 History
2. Definitions and Notation
3. Literature Review
4. Clustering Techniques
   4.1 Hierarchical Clustering Algorithms
   4.2 k-Means Clustering Algorithm
   4.3 Fuzzy Clustering
   4.4 Representation of Clusters
5. Applications
   5.1 Image Segmentation Using Clustering
   5.2 Object and Character Recognition
   5.3 Information Retrieval
   5.4 Data Mining
6. Summary

List of tables:
List of diagrams/figures:
I. INTRODUCTION

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:

The centroids of the K clusters, which can be used to label new data

Labels for the training data (each data point is assigned to a single cluster)

Rather than defining groups before looking at the data, clustering allows you to find and analyze the groups that have formed organically. The "Choosing K" section below describes how the number of groups can be determined.

Each centroid of a cluster is a collection of feature values which define the resulting groups. Examining the centroid feature weights can be used to qualitatively interpret what kind of group each cluster represents.

This introduction to the K-means clustering algorithm covers:

Common business cases where K-means is used

The steps involved in running the algorithm

A Python example using delivery fleet data

1.1 History

Even though there is an increasing interest in the use of clustering methods in pattern recognition and image processing, clustering has a rich history in other disciplines such as biology, psychiatry, psychology, archaeology, geology, geography, and marketing. Other terms more or less synonymous with clustering include unsupervised learning, numerical taxonomy [Sneath and Sokal 1973], vector quantization [Oehler and Gray 1995], and learning by observation [Michalski and Stepp 1983]. The field of spatial analysis of point patterns [Ripley 1988] is also related to cluster analysis. The importance and interdisciplinary nature of clustering is evident through its vast literature.

A number of books on clustering have been published [Jain and Dubes 1988; Anderberg 1973; Hartigan 1975; Spath 1980; Duran and Odell 1974; Everitt 1993; Backer 1995], in addition to some useful and influential review papers. A survey of the state of the art in clustering circa 1978 was reported in Dubes and Jain [1980].
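The assign-and-update iteration described in the introduction can be sketched in plain Python. This is a minimal illustration rather than the report's own code; the toy two-dimensional points, the random seed, and K = 2 are all assumptions made for the example.

```python
import random

def nearest(p, centroids):
    """Index of the centroid closest to point p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)

def kmeans(points, k, iters=100, seed=0):
    """Iteratively assign points to k centroids, then move each centroid to its cluster mean."""
    centroids = random.Random(seed).sample(points, k)   # initial centroids: k random data points
    for _ in range(iters):
        # Assignment step: every point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl
            else centroids[j]                           # leave an empty cluster's centroid alone
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:                  # converged: centroids stopped moving
            break
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Two well-separated toy groups (hypothetical data), clustered with K = 2.
pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
cents, labels = kmeans(pts, 2)
```

After convergence, each returned centroid is the feature-wise mean of its group, which is exactly the "collection of feature values" that can be inspected to interpret what a cluster represents.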
[Figure: taxonomy of clustering techniques — Clustering divides into Hierarchical and Partitional; Partitional methods include k-means and Expectation Maximization.]

Figure 4. The dendrogram obtained using the single-link algorithm.
Figure 8. Representation of a cluster by points: by the centroid, and by three distant points.
Figure 10. Feature representation for clustering. Image measurements and positions are transformed to features. Clusters in feature space correspond to image segments.
applications, and can be addressed as a clustering problem. The segmentation of the image(s) presented to an image analysis system is critically dependent on the scene to be sensed, the imaging geometry, configuration, and sensor used to transduce the scene into a digital image, and ultimately the desired output (goal) of the system. The applicability of clustering methodology to the image segmentation problem was recognized over three decades ago, and the paradigms underlying the initial pioneering efforts are still in use today. A recurring theme is to define feature vectors at every image location (pixel) composed of both functions of image intensity and functions of the pixel location itself.

5.1.1 Segmentation. An image segmentation is typically defined as an exhaustive partitioning of an input image into regions, each of which is considered to be homogeneous with respect to some image property of interest. If

X = {x_ij, i = 1 ... N_r, j = 1 ... N_c}

is the input image with N_r rows and N_c columns and measurement value x_ij at pixel (i, j), then the segmentation can be expressed as S = {S_1, ..., S_k}, with the lth segment

S_l = {(i_l1, j_l1), ..., (i_lN_l, j_lN_l)}

consisting of a connected subset of the pixel coordinates. No two segments share any pixel locations (S_i ∩ S_j = ∅ for all i ≠ j), and the union of all segments covers the entire image (∪_{i=1}^{k} S_i = {1 ... N_r} × {1 ... N_c}). Jain and Dubes [1988], after Fu and Mui [1981], identified three techniques for producing segmentations from input imagery: region-based, edge-based, or cluster-based.

Consider the use of simple gray level thresholding to segment a high-contrast intensity image. Figure 11(a) shows a grayscale image of a textbook's bar code scanned on a flatbed scanner. Part b shows the results of a simple thresholding operation designed to separate the dark and light regions in the bar code area. Binarization steps like this are often performed in character recognition systems.
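The two conditions in the segmentation definition (pairwise disjoint segments whose union covers the pixel grid) can be checked mechanically. The sketch below is only an illustration of the definition; the 2×2 image and the hand-made segments are invented for the example.

```python
def is_valid_segmentation(segments, n_rows, n_cols):
    """True iff the segments are pairwise disjoint and together cover every pixel (i, j)."""
    seen = set()
    for seg in segments:
        if seen & seg:           # S_i ∩ S_j must be empty for i != j
            return False
        seen |= seg
    grid = {(i, j) for i in range(1, n_rows + 1) for j in range(1, n_cols + 1)}
    return seen == grid          # the union of all segments must cover the image

# Hypothetical 2x2 image split into two connected row segments.
S1 = {(1, 1), (1, 2)}
S2 = {(2, 1), (2, 2)}
```

Here `is_valid_segmentation([S1, S2], 2, 2)` holds, while reusing `S1` twice, or omitting `S2`, violates disjointness or coverage respectively.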
Figure 11. Binarization via thresholding. (a): Original grayscale image. (b): Gray-level histogram. (c): Results of thresholding.
Thresholding in effect "clusters" the image pixels into two groups based on the one-dimensional intensity measurement. While simple gray level thresholding is adequate in some carefully controlled image acquisition environments, and much research has been devoted to appropriate methods for thresholding, complex images require more elaborate segmentation techniques.

Many segmenters use measurements which are both spectral (e.g., the multispectral scanner used in remote sensing) and spatial (based on the pixel's location in the image plane). The measurement at each pixel hence corresponds directly to our concept of a pattern.

…racy. An additional advantage of CLUSTER is that it produces a sequence of output clusterings (i.e., a 2-cluster solution up through a Kmax-cluster solution, where Kmax is specified by the user and is typically 20 or so); each clustering in this sequence yields a clustering statistic which combines between-cluster separation and within-cluster scatter. The clustering that optimizes this statistic is chosen as the best one. Each pixel in the range image is assigned the segment label of the nearest cluster center. This minimum distance classification step is not guaranteed to produce segments which are connected in the image plane; therefore, a connected components labeling algorithm allocates new labels for disjoint regions that were placed in the same cluster. Subsequent operations include surface type tests, merging of adjacent patches using a test for the presence of crease or jump edges between adjacent segments, and surface parameter estimation.
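The view of thresholding as two-cluster clustering of intensities can be made concrete: running the same assign/update iteration on the one-dimensional gray levels with K = 2 yields a threshold halfway between the two centroids. The pixel intensities below are invented for illustration of a high-contrast (bar-code-like) image.

```python
def two_means_threshold(intensities, iters=50):
    """1-D k-means with k = 2 on gray levels; returns (threshold, binarized labels)."""
    c_dark, c_light = min(intensities), max(intensities)   # initial centroids at the extremes
    for _ in range(iters):
        # Assignment step: split values by the nearer centroid.
        dark = [v for v in intensities if abs(v - c_dark) <= abs(v - c_light)]
        light = [v for v in intensities if abs(v - c_dark) > abs(v - c_light)]
        # Update step: centroids move to their cluster means.
        new_dark, new_light = sum(dark) / len(dark), sum(light) / len(light)
        if (new_dark, new_light) == (c_dark, c_light):     # converged
            break
        c_dark, c_light = new_dark, new_light
    threshold = (c_dark + c_light) / 2                     # midpoint separates the two clusters
    return threshold, [int(v > threshold) for v in intensities]

# Hypothetical gray levels: dark bar-code strokes near 10, light background near 200.
pixels = [12, 15, 10, 200, 210, 14, 205, 198, 11]
t, binary = two_means_threshold(pixels)
```

With this data the threshold lands between the dark and light populations, binarizing the "image" exactly as the simple thresholding operation of Figure 11 would.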
Figure 12. Range image segmentation using clustering. (a): Input range image. (b): Surface normals for selected image pixels. (c): Initial segmentation (19 cluster solution) returned by CLUSTER using 1000 six-dimensional samples from the image as a pattern set. (d): Final segmentation (8 segments) produced by postprocessing.

Figure 13. Texture image segmentation results. (a): Four-class texture mosaic. (b): Four-cluster solution produced by CLUSTER with pixel coordinates included in the feature set.

Figure 15. LANDSAT image segmentation. (a): Original image (ESA/EURIMAGE/Sattelitbild). (b): Clustered scene.
27. Digital Image Processing and Analysis, by B. Chanda and D. Dutta Majumdar.

28. Hierarchical clustering, mixture of Gaussians, and some interactive demos (Java applets).

29. H. Zha, C. Ding, M. Gu, X. He, and H. D. Simon. "Spectral Relaxation for K-means Clustering", Neural Information Processing Systems, vol. 14 (NIPS 2001), pp. 1057-1064, Vancouver, Canada, Dec. 2001.

30. J. A. Hartigan (1975). "Clustering Algorithms". Wiley.

31. J. A. Hartigan and M. A. Wong (1979). "A K-Means Clustering Algorithm", Applied Statistics, Vol. 28, No. 1, pp. 100-108.

32. www.wikipedia.com