ON
DOCUMENT CLUSTERING
Submitted by
Pooja Kumari
Asst. Professor
MAY, 2016
DEPARTMENT OF COMPUTER SCIENCE
Date: ………………….
CERTIFICATE
This is to certify that the project work entitled “DOCUMENT
CLUSTERING”, submitted by Pooja Kumari (Roll: 101611, No.:
02220120), is hereby recommended to be accepted for the partial
fulfillment of the requirements for the M.Sc. (Computer Science) degree
from Assam University.
(Mrs.Sunita Sarkar)
Assistant Professor
Department of Computer Science
Assam University, Silchar
Pin – 788011
DECLARATION
Date………………….
Pooja Kumari
Roll: 101611, No.: 02220120
Regn. No.: 22-110021751
Department of Computer Science
ACKNOWLEDGEMENT
I also declare to the best of my knowledge and belief that the Project Work has
not been submitted anywhere else.
ABSTRACT
The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering. Clustering is the most common form of unsupervised learning.
Document clustering is a more specific technique for unsupervised document organization,
automatic topic extraction and fast information retrieval. Fast and high-quality document
clustering algorithms play an important role in helping users to effectively navigate,
summarize and organize information. We encounter a large number of documents every day,
from a vast number of major and minor portals around the globe. The K-means
algorithm is the most commonly used partitional clustering algorithm. It aims to partition a
dataset into k clusters in which each object belongs to the cluster with the nearest mean
(highest similarity). The present work focuses on clustering documents using the Self-Organizing
Map (SOM), Fuzzy C-Means and K-means algorithms.
CONTENTS
Certificates ...........................................................................................................i
Declaration .........................................................................................................iii
Acknowledgement...............................................................................................iv
Abstract................................................................................................................v
1. Introduction......................................................................................................1
1.4 Requirements.........................................................................................4
1.5 Applications...........................................................................................4
2. Literature Survey.............................................................................................6
3. Document Clustering.......................................................................................7
4.4 Clustering Techniques........................................................................14
5.2 Screenshots..........................................................................................27
5.3 Results................................................................................................29
References ...............................................................................................34
Appendices.........................................................................................................35
CHAPTER 1
INTRODUCTION
Clustering is an automatic learning technique aimed at grouping a set of objects into subsets
or clusters. Since there is no standard text classification criterion, it is very difficult for
people to use massive text information sources effectively. Therefore, the management
and analysis of text data have become very important. A database management system gives
access to stored data, but this is only a small part of what can be gained from the data.
Analyzing the data with various techniques helps to gain further knowledge beyond what is
explicitly stored. This is where data mining, or knowledge discovery, comes
into existence.
With the exponential growth of information and the quickly growing number of text and
hypertext documents managed in organizational intranets, the accumulated knowledge of an
organization becomes more and more critical to its success in today's information society.
The management and analysis of text data have therefore become very important; fields such
as text mining, information filtering and information retrieval have attracted great attention
from both domestic and foreign experts.
Document clustering aims to automatically group related documents into clusters. It is one of
the most important tasks in machine learning and artificial intelligence and has received
much attention in recent years. The main emphasis is to cluster with as high an accuracy as
possible. Document clustering has many important applications in the areas of data mining and
information retrieval. In clustering analysis, we first partition the set of data
into groups based on data similarity and then assign labels to the groups. Different
algorithms are used to cluster the documents and can improve the quality of the clustering to
a great extent.
1.1 AIM OF THE PROJECT
To cluster the set of data using K-means, Fuzzy C means and SOM algorithm.
Data clustering is a popular data analysis task that involves the distribution of unannotated
data (i.e., with no a priori class information) into a finite set of categories or
clusters, such that objects within a cluster are similar in some respects and dissimilar from
those in other clusters. Besides the term data clustering (or just clustering), there are a number
of terms with similar meanings, including cluster analysis, automatic classification, numerical
taxonomy and typological analysis.
There are different kinds of clusters, such as compact, linear and circular. When clusters are
formed based on distance, that is, when points which are close together are placed in the same
group, this is called distance-based clustering.
Naturally it is assumed that vectors in a cluster Ci are in some way “more similar” to
each other than to the vectors in other clusters.
1.4 REQUIREMENTS
Scalability
High dimensionality
Interpretability
Usability
1.5 APPLICATIONS
Clustering is the most common form of unsupervised learning and is a major tool in a
number of applications in many fields of business and science. Here, we summarize the
basic directions in which clustering is used.
Finding Similar Documents: This feature is often used when the user has spotted
one “good” document in a search result and wants more-like-this. Clustering is able
to discover documents that share many of the same words.
Duplicate Content Detection: In many applications, there is need to find duplicates
or near-duplicates in a large number of documents.
Search Optimization: Clustering helps a lot in improving the quality and efficiency
of search engines as the user query can be first compared to the clusters instead of
comparing it directly to the documents and the search results can also be arranged
easily.
CHAPTER 2
LITERATURE SURVEY
1. Michael Steinbach from the University of Minnesota presents in his paper the results of
an experimental study of some common document clustering techniques. In particular,
comparison of the two main approaches to document clustering, agglomerative
hierarchical clustering and K-means is done. Hierarchical clustering is often portrayed
as the better quality clustering approach, but is limited because of its quadratic time
complexity. In contrast, K-means and its variants have a time complexity which is
linear in the number of documents, but are thought to produce inferior clusters.
2. Charu C.Aggarwal and Chengxiang Zhai, in their paper provided a detailed survey
of the problems of text clustering. Clustering is a widely studied data mining problem
in the text domains. The problem finds numerous applications in customer
segmentation, classification, collaborative filtering, visualization, document
organization and indexing.
3. Ashish Moon and T.Raju, in their paper used corelation similarity and cosine
similarity to measure the similarity between objects in the same cluster and
dissimilarity between objects in the different cluster groups.
4. Yiheng Chen and Bing Qin in their paper compared the SOM and K-means algorithms. K
means is easy to realize and it usually has low computation cost, so it has become a
well-known text clustering method. The shortcoming of K means is that the value of k
must be determined before and initial documents points seeds need to be selected
randomly. If the neuron number is less than the class number, it will not be sufficient
to separate all the classes, the documents from some closely related class may be
merged into one class. If the neuron number is more than the class number,the
clustering results may be too fine. And the clustering efficiency and the clustering
quality may also be adversely affected.
5. Anton V. Leouski and W. Bruce Croft in their paper compared classification methods
for clustering search results. Problems related to document representation, clustering
algorithms and cluster representation are discussed. The paper concludes that keeping
the 50-100 most frequent terms is sufficient for document clustering representation.
6. Bezdek introduced the Fuzzy C-Means clustering method in 1981, extending the Hard
C-Means clustering method.
CHAPTER 3
DOCUMENT CLUSTERING
Document clustering is the automatic grouping of text documents into clusters (groups) so that
documents within a cluster have a high degree of similarity but are dissimilar to documents in
other clusters. The goal of a document clustering scheme is to minimize intra-cluster distances
(using an appropriate distance measure between documents). A distance measure (or
similarity measure) thus lies at the core of document clustering. The large variety of
documents makes it almost impossible to create a general algorithm which can work best for
all kinds of data sets.
Document clustering has been used in a number of different areas of text mining and
information retrieval. It is done to improve precision or recall in
information retrieval systems and as an efficient way of finding the nearest neighbors of a
document.
Document clustering is a powerful technique that has been widely used for organizing data
into smaller, manageable information kernels. Ambiguity and synonymy are two of the
major problems that document clustering regularly fails to tackle.
To overcome the problems, we are describing various document clustering techniques and
evaluating their applications on the data set (set of documents).
Document clustering has been studied for many decades, but it is still far from a trivial,
solved problem. The challenges are:
1. Selecting appropriate features of the documents that should be used for clustering.
2. Selecting an appropriate similarity measure between documents.
3. Selecting an appropriate clustering method utilizing the above similarity measure.
4. Implementing the clustering algorithm in an efficient way that makes it feasible in terms of
required memory and CPU resources.
5. Finding ways of assessing the quality of the performed clustering.
Furthermore, with medium to large document collections (10,000+ documents), the number
of term-document relations is fairly high (millions+), and the computational complexity of
the algorithm applied is thus a central factor in whether it is feasible for real-life applications.
If a dense matrix is constructed to represent term-document relations, this matrix could easily
become too large to keep in memory: e.g., 100,000 documents × 100,000 terms = 10^10 entries,
or about 40 GB using 32-bit floating point values. If a vector model is applied, the dimensionality
of the resulting vector space will likewise be quite high (10,000+). This means that simple
operations, like finding the Euclidean distance between two documents in the vector
space, become time-consuming tasks.
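The arithmetic above can be checked directly, and it also shows why real systems prefer sparse representations. This is a Python sketch; the figure of roughly 500 distinct terms per document is an illustrative assumption, not a number from this project.

```python
# Memory estimate for a dense term-document matrix of 32-bit floats.
docs, terms = 100_000, 100_000
entries = docs * terms                 # 10**10 entries
dense_bytes = entries * 4              # 4 bytes per 32-bit float
print(dense_bytes / 2**30)             # about 37.25 GiB, i.e. roughly 40 GB

# If each document contains only ~500 distinct terms (assumed here), a sparse
# coordinate representation stores one (row, col, value) triple per non-zero
# entry instead of the full grid.
nonzeros = docs * 500
sparse_bytes = nonzeros * (4 + 4 + 4)  # int32 row + int32 col + float32 value
print(sparse_bytes / 2**30)            # about 0.56 GiB
```

The two printed figures illustrate the gap: the dense matrix is infeasible to hold in memory, while the sparse form of the same data easily fits.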
CHAPTER 4
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent or lacking in certain
behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven
method of resolving such issues; it prepares raw data for further processing.
Data in the real world is dirty: incomplete, i.e. lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data; noisy, i.e. containing
errors or outliers; and inconsistent, i.e. containing discrepancies in codes or names. Without
quality data there can be no quality mining results: quality decisions must be based on quality
data, and a data warehouse needs consistent integration of quality data.
Collection of data
Preprocessing
Document Clustering
Post processing
Collection of Data:
It includes processes like crawling, indexing and filtering, which are used to collect the
documents that need to be clustered, index them so they can be stored and retrieved in a
better way, and filter them to remove extra data, for example stopwords.
Preprocessing:
It is done to represent the data in a form that can be used for clustering.
There are many ways of representing the documents, such as the vector model, the graphical
model, etc. Many measures are also used for weighting the documents and their similarities.
Document Clustering:
It is the application of cluster analysis to textual documents. Document clustering is an
automatic grouping of text documents into clusters (groups) so that documents within a cluster
have a higher degree of similarity, but are dissimilar to documents in other clusters.
Post processing:
It includes the major applications in which the document clustering is used. For example, the
recommendation application which uses the results of clustering for recommending news
articles to the users.
4.3 PREPROCESSING TECHNIQUES
4.3.1 Stopwords:
This is the first step in preprocessing, which will generate a list of terms that describes the
document satisfactorily. Stopwords are words which are filtered out before or after
processing of natural language data (text). The document is parsed to find the list
of all its words. Stop words are removed from each document by comparing against a
stop word list. This process reduces the number of words in the document. Stop words can
come from a pre-specified list of words, or they can depend on the context of the corpus.
This technique helps in improving the effectiveness and efficiency of text processing, as it
reduces the indexing file size. There is no definite list of stop words. An example
is given below:
The phrase query “I love to play” contains two stop words (I, to); when these are removed
from the sentence, only “love” and “play” remain. This is how the technique works for large
documents. These words do not contribute to the content of the texts, and their appearance in
one document does not distinguish it from other documents.
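A minimal Python sketch of this filtering step follows; the stopword list here is a tiny illustrative sample, not the full list used in the project.

```python
# A tiny illustrative stopword list; real systems use lists of hundreds of words.
STOPWORDS = {"i", "me", "to", "a", "an", "the", "is", "of", "and", "in"}

def remove_stopwords(text):
    """Return the words of `text` with stopwords filtered out."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

print(remove_stopwords("I love to play"))   # ['love', 'play']
```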
4.3.2 Stemming:
Stemming is the process of reducing variant forms of a word to their root form (common
form). Stemming is one technique that provides ways of finding morphological variants of
search terms. It is used to improve retrieval effectiveness and to reduce the size of indexing
files. Stemming is usually done by removing any attached suffixes and prefixes (affixes) from
index terms before the actual assignment of the term to the index. Since the meaning is the same
but the word form is different, it is necessary to identify each word form with its base form.
To do this, a variety of stemming algorithms have been developed.
For example, the word “connect” has various forms like connection, connective,
connecting and connected.
Porter's stemming algorithm, proposed in 1980, is as of now one of the most popular
stemming methods, and many modifications and enhancements of the basic algorithm have
been suggested. It is based on the idea that the suffixes in the English language are mostly
made up of a combination of smaller and simpler suffixes. There are rules and conditions for
this process: if a rule is accepted, the suffix is removed accordingly, and the next step is
performed. The resultant stem at the end of the steps is returned.
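A heavily simplified suffix-stripping stemmer in the spirit of Porter's algorithm can be sketched in Python. The real algorithm applies its rules in ordered steps with measure conditions; the suffix list and minimum-stem rule below are illustrative assumptions only.

```python
# Illustrative suffix list; the real Porter algorithm uses ordered rule steps.
SUFFIXES = ["ional", "ation", "ness", "ment", "ing", "ive", "ion", "ed", "s"]

def crude_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

for w in ["connection", "connective", "connecting", "connected"]:
    print(w, "->", crude_stem(w))   # all reduce to "connect"
```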
4.3.3 Document Representation:
A document is represented by a set of keywords/terms extracted from the document. For our
clustering algorithms, documents are represented using the vector space model. In this
model, each document d is considered to be a vector in the term space (the set of document
“words”). Each document is represented by the term-frequency (tf) vector
d_tf = (tf1, tf2, ……, tfn),
where tfi is the frequency of the ith term in the document.
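The d_tf representation above can be sketched as follows; the vocabulary and the miniature document are illustrative values.

```python
from collections import Counter

def tf_vector(tokens, vocabulary):
    """Build the term-frequency vector (tf1, tf2, ..., tfn) over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vocab = ["cluster", "document", "similarity"]
doc = "document cluster cluster similarity".split()
print(tf_vector(doc, vocab))   # [2, 1, 1]
```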
4.4 DOCUMENT CLUSTERING TECHNIQUES
K-means algorithm
SOM algorithm
Fuzzy-C means algorithm
4.4.1 K-means
This algorithm was proposed in 1957 by Stuart Lloyd. K-means is a method of
partitional cluster analysis. The K-means clustering algorithm is an effective algorithm for
extracting a given number of clusters of patterns from a training set. Once done, the cluster
locations can be used to classify patterns into distinct classes. The k-means clustering
algorithm is known to be efficient in clustering large data sets. It is one of the simplest
and best known unsupervised learning algorithms that solve the well-known clustering
problem. The K-means algorithm aims to partition a set of objects, based on their
attributes/features, into k clusters, where k is a predefined or user-defined constant. The main
idea is to define k centroids, one for each cluster. The centroid of a cluster is formed in such a
way that it is closely related to all objects in that cluster.
Example: The data set has three dimensions and the cluster has two points, X = (x1, x2, x3) and
Y = (y1, y2, y3). Then the centroid Z becomes Z = (z1, z2, z3), where z1 = (x1+y1)/2,
z2 = (x2+y2)/2 and z3 = (x3+y3)/2.
Let X = {x1, x2, x3, ……, xn} be the set of data points and V = {v1, v2, ……, vc} be the set of
centers.
1) Randomly select c cluster centers.
2) Calculate the distance between each data point and the cluster centers.
3) Assign each data point to the cluster center whose distance from it is the minimum over
all the cluster centers.
4) Recalculate the new cluster centers as the means of the points assigned to them.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned then stop; otherwise repeat from step 3.
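The steps above can be sketched in plain Python (the project itself was implemented in MATLAB; the data points and k below are illustrative toy values, not the project's data):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                  # step 1: pick k initial centers
    for _ in range(max_iter):
        # steps 2-3: assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [math.dist(p, c) for c in centers]
            clusters[d.index(min(d))].append(p)
        # step 4: recompute each center as the mean of its cluster
        new_centers = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:                   # step 6: no change, converged
            break
        centers = new_centers
    return centers, clusters

pts = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 8.1)]
centers, clusters = kmeans(pts, k=2)
```

On this toy data the two well-separated pairs of points end up in separate clusters, illustrating the remark below that K-means works best on distinct, well-separated data.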
Advantages:
1. Gives the best result when the data sets are distinct and well separated from each other.
Limitations:
1. The learning algorithm requires a priori specification of the number of cluster centers.
2. The learning algorithm is not invariant to non-linear transformations, i.e. with different
representations of the data we get different results (data represented in the form of Cartesian
co-ordinates and polar co-ordinates will give different results).
3. Euclidean distance is used as a metric and variance is used as a measure of cluster scatter.
4. The learning algorithm provides only a local optimum of the squared error function.
5. Randomly choosing the cluster centers may not lead to a fruitful result.
6. It is applicable only when the mean is defined, i.e. it fails for categorical data.
Fig: A non-linear data set on which the k-means algorithm fails
4.4.2 Self Organizing Map (SOM)
When sufficient training has been done, the output layer of the SOM network will be separated
into different regions, and different neurons will respond to different input samples. As this
process is automatic, all the input documents will be clustered. Text documents written in
natural language are high-dimensional and have strong semantic features, and it is hard to
navigate many documents in a high-dimensional space. SOM, however, can map all these
high-dimensional documents onto a 2- or 1-dimensional space while preserving their relations
in the original space. In addition, SOM is not very sensitive to noisy documents, and the
clustering quality can also be assured. Due to these merits, SOM is a suitable technology for
text clustering, and it has been used in fields such as digital libraries.
Network Architecture
Fig: SOM network architecture, with n input units (X1 … Xn) fully connected to m output
units (Y1 … Ym).
1) Initialization: Assign random numbers to all the neurons in the output layer and
normalize them; the dimensionality of each neuron vector is the same as the dimensionality
of the document vectors.
2) Input a sample: Choose one document randomly from the document collection and
send it to the SOM network.
3) Find the winner neuron: Calculate the similarity between the input document vector and
each neuron vector; the neuron with the highest similarity is the winner.
4) Adapt the vectors of the winner and its neighbors: The adaptation can use the following
formula:
n_new = n_old + η · (i_l − n_old),
where
i_l = input sample (the winner is found by computing the squared Euclidean distance to i_l)
η = learning rate
After the adaptation, the winner and its neighbors are nearer to the input document vector, so
these neurons will be more competitive when similar documents are input again. Through
full training of the SOM network with sufficient input samples, the neurons on the output
layer will become sensitive only to the documents of a particular topic, and their vectors will
become the mean vectors of those topics.
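A minimal sketch of this training loop for a 1-dimensional map of a few neurons follows; the map size, learning rate, epoch count and toy samples are illustrative assumptions, not values from this project.

```python
import random

def train_som(samples, n_neurons=4, epochs=50, eta=0.3, seed=1):
    rng = random.Random(seed)
    dim = len(samples[0])
    # step 1: random initialization of the neuron vectors
    neurons = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    for _ in range(epochs):
        x = rng.choice(samples)                        # step 2: pick a sample
        # step 3: winner = neuron with smallest squared Euclidean distance
        d = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in neurons]
        win = d.index(min(d))
        # step 4: adapt the winner and its immediate 1-D neighbors
        for j in (win - 1, win, win + 1):
            if 0 <= j < n_neurons:
                neurons[j] = [wi + eta * (xi - wi) for xi, wi in zip(x, neurons[j])]
    return neurons

samples = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 1.0)]
neurons = train_som(samples)
```

Because each update is a convex combination of the old weight and the input, the neuron vectors stay within the range of the data, gradually settling near the sample clusters.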
Advantages:
1. The data is easily interpreted and understood.
2. The reduction and grid clustering makes it easy to observe similarities in the data.
3. Capable of handling several types of classification problems while providing a useful,
interactive and intelligible summary of the data.
4. SOMs are fully capable of clustering large, complex data sets.
Limitations:
1. It requires necessary and sufficient data in order to develop meaningful clusters.
2. Lack of data or extraneous data in the weight vectors will add randomness to the
groupings.
3. It is difficult to obtain a perfect mapping where groupings are unique within the map.
4. It requires that nearby data points behave similarly.
4.4.3 Fuzzy C-Means (FCM)
The most well-known fuzzy clustering algorithm is Fuzzy C-Means, which was proposed by
Bezdek in 1981 to improve the partitioning performance of the previously existing K-means
algorithm by extending the membership logic. FCM is an unsupervised clustering algorithm
that is applied to a wide range of problems connected with feature analysis, clustering and
classifier design.
Fuzzy C-Means (FCM) is a method of clustering which allows one piece of data to belong to
two or more clusters. The advantage of FCM is that it assigns each pattern to each cluster with
some degree of membership (i.e. fuzzy clustering). This is more suitable for real applications
where there are some overlaps between the clusters in the data set.
Fig. Fuzzy C-means clustering
1. Initialize the prototype vectors V.
2. Repeat:
3. V_previous ← V; compute the membership degree of each data point in each cluster.
4. Update the prototypes vi in V.
5. Until ||V − V_previous|| < ε.
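A compact sketch of this loop on toy 1-D data follows, using the standard FCM membership and prototype-update formulas. The fuzzifier m = 2, the convergence threshold and the simple spread initialization are illustrative choices, not taken from this project.

```python
def fcm(points, c=2, m=2.0, eps=1e-4, max_iter=100):
    pts = sorted(points)
    # simple spread initialization: prototypes evenly spaced over the sorted data
    V = [pts[round(j * (len(pts) - 1) / (c - 1))] for j in range(c)]
    for _ in range(max_iter):
        V_prev = V                                     # step 3: remember V
        # step 3: membership u[i][j] of point j in cluster i,
        # u_ij = 1 / sum_k (d_ij / d_kj) ** (2 / (m - 1))
        U = []
        for vi in V:
            row = []
            for x in pts:
                d_i = abs(x - vi) or 1e-12
                denom = sum((d_i / (abs(x - vk) or 1e-12)) ** (2 / (m - 1)) for vk in V)
                row.append(1.0 / denom)
            U.append(row)
        # step 4: update each prototype as a membership-weighted mean
        V = [
            sum(u ** m * x for u, x in zip(row, pts)) / sum(u ** m for u in row)
            for row in U
        ]
        # step 5: stop when the prototypes barely move
        if max(abs(a - b) for a, b in zip(V, V_prev)) < eps:
            break
    return sorted(V)

centers = fcm([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
```

On these two well-separated toy groups the prototypes settle near 0.1 and 5.1, while every point keeps a (small) nonzero membership in the far cluster, which is the defining property of fuzzy clustering.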
Advantages:
1. Gives the best result for overlapped data sets; comparatively better than the K-means
algorithm.
2. Data points are assigned a membership in each cluster center, but at the expense of a
larger number of iterations.
Limitations:
1. Sensitive to noise: one expects a low (or even zero) membership degree for outliers
(noisy points).
CHAPTER 5
5.1 Data Set
The data sets are simple text documents from different domains. Given below are a few of the
documents; many such documents were used for clustering. The following newsgroup data
has been used for clustering:
Doc1:
Path:cantaloupe.srv.cs.cmu.edu!das-
news.harvard.edu!ogicse!uwm.edu!wupost!uunet!world!edwards
Newsgroups: rec.autos
Message-ID: <C51Hn0.2JI@world.std.com>
Article-I.D.: world.C51Hn0.2JI
Lines: 18
>
>Any reason you are limited to the two mentioned? They aren't really at
>the same point along the SUV spectrum - not to mention price range.
>How about the Explorer, Trooper, Blazer, Montero, and if the budget
>allows, the Land Cruiser?
Any advice on HOW to buy a Land Cruiser? My local Toyota dealer says they
get two a year, and if I want one I can just get on the waiting list.
And if they are that rare, I doubt there is much of a parts inventory on hand.
--
Doc2:
Path:cantaloupe.srv.cs.cmu.edu!rochester!news.bbn.com!noc.near.net!uunet!olivea!decwrl!pa
.dec.com!e2big.mko.dec.com!nntpd.lkg.dec.com!xqzmoi.enet.dec.com!kenyon
Newsgroups: rec.autos
Message-ID: <1993Apr5.212645.15988@nntpd.lkg.dec.com>
References: <1993Mar26.183635.4100@cactus.org>
<Mar26.221807.19843@engr.washington.edu> <1993Mar27.084837.8005@cactus.org>
<Mar27.172247.28078@engr.washington.edu>
Lines: 13
It's great that all these other cars can out-handle, out-corner, and out-accelerate an Integra.
But, you've got to ask yourself one question: do all these other cars have
sliding roofs that are opaque. A moonroof that can be opened to the air,
closed to let just light in, or shaded so that nothing comes in.
-Doug
'93 Integra GS
Doc3:
Newsgroups: rec.autos
Path:
cantaloupe.srv.cs.cmu.edu!rochester!cornell!batcomputer!caen!uunet!boulder!ucsu!dunnjj
Message-ID: <1993Apr6.001927.16308@ucsu.Colorado.EDU>
Lines: 24
This is a good idea - so you can carry your (non-alcoholic) drinks without
Fax machines, yes. Cellular phones: Why not get a hands-free model?
Seemingly unique to American luxury cars. The Big Three haven't yet realized
>Jon Dunn<
There are many such documents which we took for clustering. After applying the
preprocessing techniques, i.e. removing the stopwords and stemming, we calculate the tf-idf of
the terms in the documents. The final generated tf-idf matrix is taken as input for clustering,
as shown in Fig: i, ii, iii, iv.
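The tf-idf matrix generation can be sketched as follows, using log(N/df) as the idf weighting (one common variant; the exact weighting used in the project is not specified here). The three miniature "documents" stand in for the preprocessed newsgroup texts.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, matrix) where matrix[d][t] = tf * log(N / df)."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    matrix = []
    for d in docs:
        tf = Counter(d)
        matrix.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, matrix

docs = [
    ["car", "engine", "car"],     # stand-ins for stemmed, stopword-free texts
    ["car", "dealer"],
    ["phone", "fax"],
]
vocab, M = tfidf_matrix(docs)
```

Note how "car", which appears in two of the three documents, receives a lower weight than the rarer "engine" even though it occurs twice in the first document: idf down-weights terms that fail to distinguish documents.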
5.2 Screenshots
Fig: iii) Document after stemming (Stemmed Document)
5.3 Result of different algorithms
K-means
SOM:
Fuzzy-C means:
CHAPTER 7
CONCLUSION AND FUTURE WORK
In this chapter, we conclude that it is hardly possible to obtain a general algorithm which can
work best for clustering all types of datasets. Thus we tried to implement algorithms which
work well on different types of datasets. The main contributions of this project are
preprocessing the data, generating the TF-IDF matrix for a set of documents, and clustering
with K-means, SOM and Fuzzy C-Means. Document clustering is very useful in information
retrieval applications in order to reduce the time consumed and obtain high precision.
For future work, the mentioned clustering algorithms can be used for clustering and
compared to find out which gives the best result.
REFERENCES
[1] Arun K. Pujari, “Data Mining Techniques”, University Press, second edition, 2009.
[2] http://www.tutorialpoint.com/data_mining/dm_cluster_analysis.htm
[3] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ:
Prentice-Hall, 1988.
[4] C. C. Aggarwal and C. X. Zhai, Eds., “Chapter 4: A Survey of Text Clustering
Algorithms,” in Mining Text Data. New York: Springer, 2012.
[5] Yiheng Chen and Bing Qin “The Comparison of SOM and K-means of text clustering”
School of Computer Science and Technology, Harbin Institute of Technology.
[6] Porter M.F. “Snowball: A language for stemming algorithms”, 2001.
[7] K. Premalatha and A. M. Natarajan, “A Literature Review on Document Clustering,”
Information Technology Journal, 2010.
[9] Deepika Sharma,” Stemming Algorithms: A Comparative Study and their Analysis,”
International Journal of Applied Information Systems (IJAIS)”,Volume 4, September 2012.
[11] B.Nageswara Rao, P.Keerthi, V.T. Sree Ramya, S.Santhosh Kumar, T.Monish,
”Implementation on Document Clustering using Correlation Preserving Indexing,”
International Journal of Computer Science and Information Technologies, Vol. 5 (1), 2014.
[12] Manjula K. S., Sarvar Begum, D. Venkata Swetha Ramana, “Extracting Summary from
Documents Using K-Means Clustering Algorithm,” International Journal of Advanced
Research in Computer and Communication Engineering, Vol. 2, August 2013.
[13] file:///F:/project/project%20pdf/fcm.pdf
APPENDICES
Software Requirements
Hardware Requirements
MATLAB can be used in a wide range of applications including signal and image
processing, communications, control design, and test and measurement. For millions of
engineers and scientists in industry and academia, MATLAB is the language of technical
computing. The MATLAB application is built around the MATLAB language, and most use
of MATLAB involves the Command Window or executing text files containing MATLAB
code and functions.