ON
DOCUMENT CLUSTERING
Submitted by
Pooja Kumari
Asst. Professor
MAY, 2016
DEPARTMENT OF COMPUTER SCIENCE
Date: ………………….
CERTIFICATE
This is to certify that the project work entitled “DOCUMENT
CLUSTERING”, submitted by Pooja Kumari (Roll: 101611, No.:
02220120), is hereby recommended to be accepted for the partial
fulfillment of the requirements for the M.Sc. (Computer Science) degree
from Assam University.
(Mrs.Sunita Sarkar)
Assistant Professor
Department of Computer Science
Assam University, Silchar
Pin – 788011
DECLARATION
Date………………….
Pooja Kumari
Roll: 101611, No.: 02220120
Regn. No.: 22-110021751
Department of Computer Science
ACKNOWLEDGEMENT
I also declare to the best of my knowledge and belief that the Project Work has
not been submitted anywhere else.
ABSTRACT
The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering. Clustering is the most common form of unsupervised learning.
Document clustering is a more specific technique for unsupervised document organization,
automatic topic extraction and fast information retrieval. Fast and high-quality document
clustering algorithms play an important role in helping users to effectively navigate,
summarize and organize information. We encounter a large number of documents every day,
from a vast number of major and minor portals around the globe. The K-means
algorithm is the most commonly used partitional clustering algorithm. It aims to partition a
dataset into k clusters in which each object belongs to the cluster with the nearest mean
(highest similarity). The present work focuses on clustering documents using the Self-Organizing
Map (SOM), Fuzzy C-Means and K-means algorithms.
CONTENTS
Certificates ...........................................................................................................i
Declaration .........................................................................................................iii
Acknowledgement...............................................................................................iv
Abstract................................................................................................................v
1. Introduction......................................................................................................1
1.4 Requirements.........................................................................................4
1.5 Applications...........................................................................................4
2. Literature Survey.............................................................................................6
3. Document Clustering.......................................................................................7
4.4 Clustering Techniques........................................................................14
5.2 Screenshots..........................................................................................27
5.3 Results................................................................................................29
References ...............................................................................................34
Appendices.........................................................................................................35
CHAPTER 1
INTRODUCTION
Clustering is an automatic learning technique aimed at grouping a set of objects into subsets
or clusters. Since there is no standard text classification criterion, it is very difficult for
people to use massive text information sources effectively. Therefore, the management
and analysis of text data have become very important. A database management system gives
access to stored data, but this is only a small part of what can be gained from the data.
Analyzing the data with various techniques helps to gain further knowledge beyond what is
explicitly stored. This is where data mining, or knowledge discovery, comes
into existence.
With the exponential growth of information and the quickly growing number of text and
hypertext documents managed in organizational intranets, the accumulated knowledge of an
organization becomes more and more critical to its success in today's information society.
The management and analysis of text data have therefore become very important; fields such
as text mining, information filtering and information retrieval have attracted great attention
from both domestic and foreign experts.
Document clustering aims to automatically group related documents into clusters. It is one of
the most important tasks in machine learning and artificial intelligence and has received
much attention in recent years. The main emphasis is to cluster with as high an accuracy as
possible. Document clustering has many important applications in the areas of data mining and
information retrieval. In clustering analysis, we first partition the set of data
into groups based on data similarity and then assign labels to the groups. Different
algorithms are used to cluster the documents and can improve the quality of the clustering to
a great extent.
1.1 AIM OF THE PROJECT
To cluster the set of data using K-means, Fuzzy C means and SOM algorithm.
Data clustering is a popular data analysis task that involves the distribution of unannotated
data (i.e., with no a priori class information) into a finite set of categories or
clusters, such that objects within a cluster are similar in some respects and dissimilar from
those in other clusters. Besides the term data clustering (or just clustering), there are a number
of terms with similar meanings, including cluster analysis, automatic classification, numerical
taxonomy and typological analysis.
There are different kinds of clusters, such as compact, linear and circular. When clusters are
formed based on distance, that is, when points which are close together are placed in the same
group, this is called distance-based clustering.
Naturally it is assumed that vectors in a cluster Ci are in some way “more similar” to
each other than to the vectors in other clusters.
1.4 REQUIREMENTS
Scalability
High dimensionality
Interpretability
Usability
1.5 APPLICATIONS
Clustering is the most common form of unsupervised learning and is a major tool in a
number of applications in many fields of business and science. Here, we summarize the
basic directions in which clustering is used.
Finding Similar Documents: This feature is often used when the user has spotted
one “good” document in a search result and wants more-like-this. Clustering is able
to discover documents that share many of the same words.
Duplicate Content Detection: In many applications, there is need to find duplicates
or near-duplicates in a large number of documents.
Search Optimization: Clustering helps a lot in improving the quality and efficiency
of search engines as the user query can be first compared to the clusters instead of
comparing it directly to the documents and the search results can also be arranged
easily.
CHAPTER 2
LITERATURE SURVEY
1. Michael Steinbach from the University of Minnesota presents in his paper the results of
an experimental study of some common document clustering techniques. In particular,
comparison of the two main approaches to document clustering, agglomerative
hierarchical clustering and K-means is done. Hierarchical clustering is often portrayed
as the better quality clustering approach, but is limited because of its quadratic time
complexity. In contrast, K-means and its variants have a time complexity which is
linear in the number of documents, but are thought to produce inferior clusters.
2. Charu C.Aggarwal and Chengxiang Zhai, in their paper provided a detailed survey
of the problems of text clustering. Clustering is a widely studied data mining problem
in the text domains. The problem finds numerous applications in customer
segmentation, classification, collaborative filtering, visualization, document
organization and indexing.
3. Ashish Moon and T.Raju, in their paper used corelation similarity and cosine
similarity to measure the similarity between objects in the same cluster and
dissimilarity between objects in the different cluster groups.
4. Yiheng Chen and Bing Qin in their paper compared the SOM and K-means algorithms. K
means is easy to realize and it usually has low computation cost, so it has become a
well-known text clustering method. The shortcoming of K means is that the value of k
must be determined before and initial documents points seeds need to be selected
randomly. If the neuron number is less than the class number, it will not be sufficient
to separate all the classes, the documents from some closely related class may be
merged into one class. If the neuron number is more than the class number,the
clustering results may be too fine. And the clustering efficiency and the clustering
quality may also be adversely affected.
5. Anton V. Leouski and W. Bruce Croft in their paper compared classification methods
for clustering search results. Problems related to document representation, clustering
algorithms and cluster representation are discussed. The paper concludes that keeping
the 50-100 most frequent terms is sufficient for document clustering representation.
6. Bezdek introduced the Fuzzy C-Means clustering method in 1981, extending the Hard
C-Means clustering method.
CHAPTER 3
DOCUMENT CLUSTERING
Document clustering is the automatic grouping of text documents into clusters (groups) so that
documents within a cluster have a high degree of similarity but are dissimilar to documents in
other clusters. The goal of a document clustering scheme is to minimize intra-cluster distances
(using an appropriate distance measure between documents). A distance measure (or
similarity measure) thus lies at the core of document clustering. The large variety of
documents makes it almost impossible to create a general algorithm which can work best for
all kinds of data sets.
Document clustering has been used in a number of different areas of text mining and
information retrieval. It is done to improve precision or recall in
information retrieval systems and as an efficient way of finding the nearest neighbors of a
document.
Document clustering is a powerful technique that has been widely used for organizing data
into smaller, manageable information kernels. Ambiguity and synonymy are two of the
major problems that document clustering regularly fails to tackle.
To overcome the problems, we are describing various document clustering techniques and
evaluating their applications on the data set (set of documents).
Document clustering has been studied for many decades, but it is still far from a trivial,
solved problem. The challenges are:
1. Selecting appropriate features of the documents that should be used for clustering.
2. Selecting an appropriate similarity measure between documents.
3. Selecting an appropriate clustering method utilizing the above similarity measure.
4. Implementing the clustering algorithm in an efficient way that makes it feasible in terms of
required memory and CPU resources.
5. Finding ways of assessing the quality of the performed clustering.
Furthermore, with medium to large document collections (10,000+ documents), the number
of term-document relations is fairly high (millions+), and the computational complexity of
the algorithm applied is thus a central factor in whether it is feasible for real-life applications.
If a dense matrix is constructed to represent term-document relations, this matrix could easily
become too large to keep in memory: e.g., 100,000 documents × 100,000 terms = 10^10 entries,
or about 40 GB using 32-bit floating point values. If a vector model is applied, the dimensionality
of the resulting vector space will likewise be quite high (10,000+). This means that simple
operations, like finding the Euclidean distance between two documents in the vector
space, become time-consuming tasks.
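The arithmetic above can be checked directly, and it also shows why real systems prefer sparse representations. This is a Python sketch; the figure of roughly 500 distinct terms per document is an illustrative assumption, not a number from this project.

```python
# Memory estimate for a dense term-document matrix of 32-bit floats.
docs, terms = 100_000, 100_000
entries = docs * terms                 # 10**10 entries
dense_bytes = entries * 4              # 4 bytes per 32-bit float
print(dense_bytes / 2**30)             # about 37.25 GiB, i.e. roughly 40 GB

# If each document contains only ~500 distinct terms (assumed here), a sparse
# coordinate representation stores one (row, col, value) triple per non-zero
# entry instead of the full grid.
nonzeros = docs * 500
sparse_bytes = nonzeros * (4 + 4 + 4)  # int32 row + int32 col + float32 value
print(sparse_bytes / 2**30)            # about 0.56 GiB
```

The two printed figures illustrate the gap: the dense matrix is infeasible to hold in memory, while the sparse form of the same data easily fits.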
CHAPTER 4
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent or lacking in certain
behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven
method of resolving such issues; it prepares raw data for further processing.
Data in the real world is dirty: incomplete, i.e. lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data; noisy, i.e. containing
errors or outliers; and inconsistent, i.e. containing discrepancies in codes or names. Without
quality data there can be no quality mining results: quality decisions must be based on quality
data, and a data warehouse needs consistent integration of quality data.
Collection of data
Preprocessing
Document Clustering
Post processing
Collection of Data:
It includes processes like crawling, indexing and filtering, which are used to collect the
documents that need to be clustered, index them so they can be stored and retrieved in a
better way, and filter them to remove extra data, for example stopwords.
Preprocessing:
It is done to represent the data in a form that can be used for clustering.
There are many ways of representing the documents, such as the vector model, the graphical
model, etc. Many measures are also used for weighting the documents and their similarities.
Document Clustering:
It is the application of cluster analysis to textual documents. Document clustering is an
automatic grouping of text documents into clusters (groups) so that documents within a cluster
have a higher degree of similarity, but are dissimilar to documents in other clusters.
Post processing:
It includes the major applications in which the document clustering is used. For example, the
recommendation application which uses the results of clustering for recommending news
articles to the users.
4.3 PREPROCESSING TECHNIQUES
4.3.1 Stopwords:
This is the first step in preprocessing, which will generate a list of terms that describes the
document satisfactorily. Stopwords are words which are filtered out before or after
processing of natural language data (text). The document is parsed to find the list
of all its words. Stop words are removed from each document by comparing against a
stop word list. This process reduces the number of words in the document. Stop words can
come from a pre-specified list of words, or they can depend on the context of the corpus.
This technique helps in improving the effectiveness and efficiency of text processing, as it
reduces the indexing file size. There is no definite list of stop words. An example
is given below:
The phrase query “I love to play” contains two stop words (I, to); when these are removed
from the sentence, only “love” and “play” remain. This is how the technique works for large
documents. These words do not contribute to the content of the texts, and their appearance in
one document does not distinguish it from other documents.
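A minimal Python sketch of this filtering step follows; the stopword list here is a tiny illustrative sample, not the full list used in the project.

```python
# A tiny illustrative stopword list; real systems use lists of hundreds of words.
STOPWORDS = {"i", "me", "to", "a", "an", "the", "is", "of", "and", "in"}

def remove_stopwords(text):
    """Return the words of `text` with stopwords filtered out."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

print(remove_stopwords("I love to play"))   # ['love', 'play']
```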
4.3.2 Stemming:
Stemming is the process of reducing variant forms of a word to their root form (common
form). Stemming is one technique that provides ways of finding morphological variants of
search terms. It is used to improve retrieval effectiveness and to reduce the size of indexing
files. Stemming is usually done by removing any attached suffixes and prefixes (affixes) from
index terms before the actual assignment of the term to the index. Since the meaning is the same
but the word form is different, it is necessary to identify each word form with its base form.
To do this, a variety of stemming algorithms have been developed.
For example, the word “connect” has various forms like connection, connective,
connecting and connected.
Porter's stemming algorithm, proposed in 1980, is as of now one of the most popular
stemming methods, and many modifications and enhancements of the basic algorithm have
been suggested. It is based on the idea that the suffixes in the English language are mostly
made up of a combination of smaller and simpler suffixes. There are rules and conditions for
this process: if a rule is accepted, the suffix is removed accordingly, and the next step is
performed. The resultant stem at the end of the steps is returned.
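A heavily simplified suffix-stripping stemmer in the spirit of Porter's algorithm can be sketched in Python. The real algorithm applies its rules in ordered steps with measure conditions; the suffix list and minimum-stem rule below are illustrative assumptions only.

```python
# Illustrative suffix list; the real Porter algorithm uses ordered rule steps.
SUFFIXES = ["ional", "ation", "ness", "ment", "ing", "ive", "ion", "ed", "s"]

def crude_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

for w in ["connection", "connective", "connecting", "connected"]:
    print(w, "->", crude_stem(w))   # all reduce to "connect"
```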
4.3.3 Document Representation:
A document is represented by a set of keywords/terms extracted from the document. For our
clustering algorithms, documents are represented using the vector space model. In this
model, each document d is considered to be a vector in the term space (the set of document
“words”). Each document is represented by the term-frequency (tf) vector
d_tf = (tf1, tf2, ……, tfn),
where tfi is the frequency of the ith term in the document.
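The d_tf representation above can be sketched as follows; the vocabulary and the miniature document are illustrative values.

```python
from collections import Counter

def tf_vector(tokens, vocabulary):
    """Build the term-frequency vector (tf1, tf2, ..., tfn) over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vocab = ["cluster", "document", "similarity"]
doc = "document cluster cluster similarity".split()
print(tf_vector(doc, vocab))   # [2, 1, 1]
```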
4.4 DOCUMENT CLUSTERING TECHNIQUES
K-means algorithm
SOM algorithm
Fuzzy-C means algorithm
4.4.1 K-means
This algorithm was proposed in 1957 by Stuart Lloyd. K-means is a method of
partitional cluster analysis. The K-means clustering algorithm is an effective algorithm for
extracting a given number of clusters of patterns from a training set. Once done, the cluster
locations can be used to classify patterns into distinct classes. The k-means clustering
algorithm is known to be efficient in clustering large data sets. It is one of the simplest
and best known unsupervised learning algorithms that solve the well-known clustering
problem. The K-means algorithm aims to partition a set of objects, based on their
attributes/features, into k clusters, where k is a predefined or user-defined constant. The main
idea is to define k centroids, one for each cluster. The centroid of a cluster is formed in such a
way that it is closely related to all objects in that cluster.
Example: The data set has three dimensions and the cluster has two points, X = (x1, x2, x3) and
Y = (y1, y2, y3). Then the centroid Z becomes Z = (z1, z2, z3), where z1 = (x1+y1)/2,
z2 = (x2+y2)/2 and z3 = (x3+y3)/2.
Let X = {x1, x2, x3, ……, xn} be the set of data points and V = {v1, v2, ……, vc} be the set of
centers.
1) Randomly select c cluster centers.
2) Calculate the distance between each data point and the cluster centers.
3) Assign each data point to the cluster center whose distance from it is the minimum over
all the cluster centers.
4) Recalculate the new cluster centers as the means of the points assigned to them.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned then stop; otherwise repeat from step 3.
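The steps above can be sketched in plain Python (the project itself was implemented in MATLAB; the data points and k below are illustrative toy values, not the project's data):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                  # step 1: pick k initial centers
    for _ in range(max_iter):
        # steps 2-3: assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [math.dist(p, c) for c in centers]
            clusters[d.index(min(d))].append(p)
        # step 4: recompute each center as the mean of its cluster
        new_centers = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:                   # step 6: no change, converged
            break
        centers = new_centers
    return centers, clusters

pts = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 8.1)]
centers, clusters = kmeans(pts, k=2)
```

On this toy data the two well-separated pairs of points end up in separate clusters, illustrating the remark below that K-means works best on distinct, well-separated data.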
Advantages:
1. Gives the best result when the data sets are distinct and well separated from each other.
Limitations:
1. The learning algorithm requires a priori specification of the number of cluster centers.
2. The learning algorithm is not invariant to non-linear transformations, i.e. with different
representations of the data we get different results (data represented in the form of Cartesian
co-ordinates and polar co-ordinates will give different results).
3. Euclidean distance is used as a metric and variance is used as a measure of cluster scatter.
4. The learning algorithm provides only a local optimum of the squared error function.
5. Randomly choosing the cluster centers may not lead to a fruitful result.
6. It is applicable only when the mean is defined, i.e. it fails for categorical data.
Fig: A non-linear data set on which the k-means algorithm fails
4.4.2 Self Organizing Map (SOM)
When sufficient training has been done, the output layer of the SOM network will be separated
into different regions, and different neurons will respond to different input samples. As this
process is automatic, all the input documents will be clustered. Text documents written in
natural language are high-dimensional and have strong semantic features, and it is hard to
navigate many documents in a high-dimensional space. SOM, however, can map all these
high-dimensional documents onto a 2- or 1-dimensional space while preserving their relations
in the original space. In addition, SOM is not very sensitive to noisy documents, and the
clustering quality can also be assured. Due to these merits, SOM is a suitable technology for
text clustering, and it has been used in fields such as digital libraries.
Network Architecture
Fig: SOM network architecture, with n input units (X1 … Xn) fully connected to m output
units (Y1 … Ym).
1) Initialization: Assign random numbers to all the neurons in the output layer and
normalize them; the dimensionality of each neuron vector is the same as the dimensionality
of the document vectors.
2) Input a sample: Choose one document randomly from the document collection and
send it to the SOM network.
3) Find the winner neuron: Calculate the similarity between the input document vector and
each neuron vector; the neuron with the highest similarity is the winner.
4) Adapt the vectors of the winner and its neighbors: The adaptation can use the following
formula:
n_new = n_old + η · (i_l − n_old),
where
i_l = input sample (the winner is found by computing the squared Euclidean distance to i_l)
η = learning rate
After the adaptation, the winner and its neighbors are nearer to the input document vector, so
these neurons will be more competitive when similar documents are input again. Through
full training of the SOM network with sufficient input samples, the neurons on the output
layer will become sensitive only to the documents of a particular topic, and their vectors will
become the mean vectors of those topics.
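A minimal sketch of this training loop for a 1-dimensional map of a few neurons follows; the map size, learning rate, epoch count and toy samples are illustrative assumptions, not values from this project.

```python
import random

def train_som(samples, n_neurons=4, epochs=50, eta=0.3, seed=1):
    rng = random.Random(seed)
    dim = len(samples[0])
    # step 1: random initialization of the neuron vectors
    neurons = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    for _ in range(epochs):
        x = rng.choice(samples)                        # step 2: pick a sample
        # step 3: winner = neuron with smallest squared Euclidean distance
        d = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in neurons]
        win = d.index(min(d))
        # step 4: adapt the winner and its immediate 1-D neighbors
        for j in (win - 1, win, win + 1):
            if 0 <= j < n_neurons:
                neurons[j] = [wi + eta * (xi - wi) for xi, wi in zip(x, neurons[j])]
    return neurons

samples = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 1.0)]
neurons = train_som(samples)
```

Because each update is a convex combination of the old weight and the input, the neuron vectors stay within the range of the data, gradually settling near the sample clusters.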
Advantages:
1. The data is easily interpreted and understood.
2. The reduction and grid clustering makes it easy to observe similarities in the data.
3. Capable of handling several types of classification problems while providing a useful,
interactive and intelligible summary of the data.
4. SOMs are fully capable of clustering large, complex data sets.
Limitations:
1. It requires necessary and sufficient data in order to develop meaningful clusters.
2. Lack of data or extraneous data in the weight vectors will add randomness to the
groupings.
3. It is difficult to obtain a perfect mapping where groupings are unique within the map.
4. It requires that nearby data points behave similarly.
4.4.3 Fuzzy C-Means (FCM)
The most well-known fuzzy clustering algorithm is Fuzzy C-Means, which was proposed by
Bezdek in 1981 to improve the partitioning performance of the previously existing K-means
algorithm by extending the membership logic. FCM is an unsupervised clustering algorithm
that is applied to a wide range of problems connected with feature analysis, clustering and
classifier design.
Fuzzy C-Means (FCM) is a method of clustering which allows one piece of data to belong to
two or more clusters. The advantage of FCM is that it assigns each pattern to each cluster with
some degree of membership (i.e. fuzzy clustering). This is more suitable for real applications
where there are some overlaps between the clusters in the data set.
Fig. Fuzzy C-means clustering
1. Initialize the prototype vectors V.
2. Repeat:
3. V_previous ← V; compute the membership degree of each data point in each cluster.
4. Update the prototypes vi in V.
5. Until ||V − V_previous|| < ε.
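A compact sketch of this loop on toy 1-D data follows, using the standard FCM membership and prototype-update formulas. The fuzzifier m = 2, the convergence threshold and the simple spread initialization are illustrative choices, not taken from this project.

```python
def fcm(points, c=2, m=2.0, eps=1e-4, max_iter=100):
    pts = sorted(points)
    # simple spread initialization: prototypes evenly spaced over the sorted data
    V = [pts[round(j * (len(pts) - 1) / (c - 1))] for j in range(c)]
    for _ in range(max_iter):
        V_prev = V                                     # step 3: remember V
        # step 3: membership u[i][j] of point j in cluster i,
        # u_ij = 1 / sum_k (d_ij / d_kj) ** (2 / (m - 1))
        U = []
        for vi in V:
            row = []
            for x in pts:
                d_i = abs(x - vi) or 1e-12
                denom = sum((d_i / (abs(x - vk) or 1e-12)) ** (2 / (m - 1)) for vk in V)
                row.append(1.0 / denom)
            U.append(row)
        # step 4: update each prototype as a membership-weighted mean
        V = [
            sum(u ** m * x for u, x in zip(row, pts)) / sum(u ** m for u in row)
            for row in U
        ]
        # step 5: stop when the prototypes barely move
        if max(abs(a - b) for a, b in zip(V, V_prev)) < eps:
            break
    return sorted(V)

centers = fcm([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
```

On these two well-separated toy groups the prototypes settle near 0.1 and 5.1, while every point keeps a (small) nonzero membership in the far cluster, which is the defining property of fuzzy clustering.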
Advantages:
1. Gives the best result for overlapped data sets; comparatively better than the K-means
algorithm.
2. Data points are assigned a membership in each cluster center, but at the expense of a
larger number of iterations.
Limitations:
1. Sensitive to noise: one expects a low (or even zero) membership degree for outliers
(noisy points).
CHAPTER 5
5.1 Data Set
The data sets are simple text documents from different domains. Given below are a few of the
documents; many such documents were used for clustering. The following newsgroup data
has been used for clustering:
Doc1:
Path:cantaloupe.srv.cs.cmu.edu!das-
news.harvard.edu!ogicse!uwm.edu!wupost!uunet!world!edwards
Newsgroups: rec.autos
Message-ID: <C51Hn0.2JI@world.std.com>
Article-I.D.: world.C51Hn0.2JI
Lines: 18
>
>Any reason you are limited to the two mentioned? They aren't really at
>the same point along the SUV spectrum - not to mention price range.
>How about the Explorer, Trooper, Blazer, Montero, and if the budget
>allows, the Land Cruiser?
Any advice on HOW to buy a Land Cruiser? My local Toyota dealer says they
get two a year, and if I want one I can just get on the waiting list.
And if they are that rare, I doubt there is much of a parts inventory on hand.
--
Doc2:
Path:cantaloupe.srv.cs.cmu.edu!rochester!news.bbn.com!noc.near.net!uunet!olivea!decwrl!pa
.dec.com!e2big.mko.dec.com!nntpd.lkg.dec.com!xqzmoi.enet.dec.com!kenyon
Newsgroups: rec.autos
Message-ID: <1993Apr5.212645.15988@nntpd.lkg.dec.com>
References: <1993Mar26.183635.4100@cactus.org>
<Mar26.221807.19843@engr.washington.edu> <1993Mar27.084837.8005@cactus.org>
<Mar27.172247.28078@engr.washington.edu>
Lines: 13
It's great that all these other cars can out-handle, out-corner, and out-accelerate an Integra.
But, you've got to ask yourself one question: do all these other cars have
sliding roofs that are opaque. A moonroof that can be opened to the air,
closed to let just light in, or shaded so that nothing comes in.
-Doug
'93 Integra GS
Doc3:
Newsgroups: rec.autos
Path:
cantaloupe.srv.cs.cmu.edu!rochester!cornell!batcomputer!caen!uunet!boulder!ucsu!dunnjj
Message-ID: <1993Apr6.001927.16308@ucsu.Colorado.EDU>
Lines: 24
This is a good idea - so you can carry your (non-alcoholic) drinks without
Fax machines, yes. Cellular phones: Why not get a hands-free model?
Seemingly unique to American luxury cars. The Big Three haven't yet realized
>Jon Dunn<
There are many such documents which we took for clustering. After applying the
preprocessing techniques, i.e. removing the stopwords and stemming, we calculate the tf-idf of
the terms in the documents. The final generated tf-idf matrix is taken as input for clustering,
as shown in Fig: i, ii, iii, iv.
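The tf-idf matrix generation can be sketched as follows, using log(N/df) as the idf weighting (one common variant; the exact weighting used in the project is not specified here). The three miniature "documents" stand in for the preprocessed newsgroup texts.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, matrix) where matrix[d][t] = tf * log(N / df)."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    matrix = []
    for d in docs:
        tf = Counter(d)
        matrix.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, matrix

docs = [
    ["car", "engine", "car"],     # stand-ins for stemmed, stopword-free texts
    ["car", "dealer"],
    ["phone", "fax"],
]
vocab, M = tfidf_matrix(docs)
```

Note how "car", which appears in two of the three documents, receives a lower weight than the rarer "engine" even though it occurs twice in the first document: idf down-weights terms that fail to distinguish documents.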
5.2 Screenshots
Fig: iii) Document after stemming (Stemmed Document)
5.3 Result of different algorithms
K-means
SOM:
Fuzzy-C means:
CHAPTER 7
CONCLUSION AND FUTURE WORK
In this chapter, we conclude that it is hardly possible to obtain a general algorithm which can
work best for clustering all types of datasets. Thus we tried to implement algorithms which
work well on different types of datasets. The main contributions of this project are
preprocessing the data, generating the TF-IDF matrix for a set of documents, and clustering
with K-means, SOM and Fuzzy C-Means. Document clustering is very useful in information
retrieval applications in order to reduce the time consumed and obtain high precision.
For future work, the mentioned clustering algorithms can be used for clustering and
compared to find out which gives the best result.
REFERENCES
[1] Arun K. Pujari, “Data Mining Techniques”, University Press, second edition, 2009.
[2] http://www.tutorialpoint.com/data_mining/dm_cluster_analysis.htm
[3] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ:
Prentice-Hall, 1988.
[4] C. C. Aggarwal and C. X. Zhai, Eds., “Chapter 4: A Survey of Text Clustering
Algorithms,” in Mining Text Data. New York: Springer, 2012.
[5] Yiheng Chen and Bing Qin “The Comparison of SOM and K-means of text clustering”
School of Computer Science and Technology, Harbin Institute of Technology.
[6] Porter M.F. “Snowball: A language for stemming algorithms”, 2001.
[7] K. Premalatha and A. M. Natarajan, “A Literature Review on Document Clustering,”
Information Technology Journal, 2010.
[9] Deepika Sharma,” Stemming Algorithms: A Comparative Study and their Analysis,”
International Journal of Applied Information Systems (IJAIS)”,Volume 4, September 2012.
[11] B.Nageswara Rao, P.Keerthi, V.T. Sree Ramya, S.Santhosh Kumar, T.Monish,
”Implementation on Document Clustering using Correlation Preserving Indexing,”
International Journal of Computer Science and Information Technologies, Vol. 5 (1), 2014.
[12] Manjula K. S., Sarvar Begum, D. Venkata Swetha Ramana, “Extracting Summary from
Documents Using K-Means Clustering Algorithm,” International Journal of Advanced
Research in Computer and Communication Engineering, Vol. 2, August 2013.
[13] file:///F:/project/project%20pdf/fcm.pdf
APPENDICES
Software Requirements
Hardware Requirements
MATLAB can be used in a wide range of applications including signal and image
processing, communications, control design, and test and measurement. For millions of
engineers and scientists in industry and academia, MATLAB is the language of technical
computing. The MATLAB application is built around the MATLAB language, and most use
of MATLAB involves the Command Window or executing text files containing MATLAB
code and functions.