JOURNAL OF COMPUTING, VOLUME 3, ISSUE 2, FEBRUARY 2011, ISSN 2151-9617
HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG
Abstract—Nowadays most paper documents are also kept in electronic form, because of quick access and smaller storage requirements, so retrieving relevant documents from a large database is a major issue. Clustering documents into relevant groups is an active field of research with applications in text mining, topic tracking systems, intelligent web search engines and question answering systems. Unlike document classification, where a set of labels or terms is predefined for each class, the document sets grouped by a clustering algorithm have no such predefined labels for convenient recognition of the content of each set. Each of them requires the assignment of a concise and descriptive title to help analysts interpret the result; hence, cluster labeling methods are essential. Most existing techniques assign labels to clusters based on the terms that the clustered documents contain. This paper presents a survey of existing document clustering algorithms with topic discovery and proposes a framework for comparing them. Through this framework, the details of the algorithms, their capabilities, evaluation metrics, data sets and performance are analyzed.
—————————— ——————————
1 INTRODUCTION
3 TOPIC DISCOVERY IN CLUSTERING ALGORITHMS

This section gives a detailed survey of recent work on topic discovery used in document clustering.

Benjamin C. M. Fung et al [1] elaborated the Frequent Itemset-based Hierarchical Clustering (FIHC) technique [14] for addressing the problems in hierarchical document clustering. This technique uses frequent itemsets to construct clusters and to organize the clusters into a topic hierarchy. The intuition behind this clustering criterion is that there are some frequent itemsets for each cluster (topic) in the document set, and that different clusters share few frequent itemsets. A frequent itemset is a set of words that occur together in some minimum fraction of the documents in a cluster; it therefore describes something common to many documents in that cluster. The authors experimentally evaluated FIHC against Hierarchical Frequent Term-based Clustering (HFTC), proposed by Beil, Ester & Xu [15], and the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), proposed by Kaufman & Rousseeuw [16], and conclude that FIHC outperforms both with respect to time, accuracy and scalability.

Jiangtao Qiu et al [2] proposed a topic-oriented semantic annotation algorithm using a semi-supervised document clustering approach. The user's need is represented as a multi-attribute topic structure. Using this topic structure, the next step is to compute a topic semantic annotation for each document and the topic semantic similarity between documents. The Ontology-based Document topic Semantic Annotation (ODSA) algorithm was proposed for computing the topic semantic annotation of each document. The authors also proposed a dissimilarity function from which a dissimilarity matrix is constructed; the dissimilarity of two documents is their topic-semantic dissimilarity in topic-oriented document clustering. The documents are then clustered using the proposed optimizing hierarchical clustering algorithm, which is based on agglomerative hierarchical document clustering, together with a new cluster evaluation function, DistanceSum. A comparison between ODSA and the Term Frequency-Inverse Document Frequency (TFIDF) representation showed that ODSA takes less time to construct the dissimilarity matrix; the quality of the clusters produced by both representations was also assessed using the F-measure as the performance metric.

The discovery of topics by clustering keywords without any prior knowledge was proposed by Christian Wartena and Rogier Brussee [3]. The most informative keywords are identified using natural probability distributions, and the documents are clustered according to various similarity functions, such as cosine similarity and the Jensen-Shannon divergence between document distributions and term distributions. A collection of 8 Wikipedia topics is used as the data set. With the F-measure as the performance metric, the experiments have shown that the distribution of terms associated with a keyword gives better results.

Document Clustering Description Extraction and its Applications [4] addresses the problem of labeling clustered documents. This work focuses on automatic labeling based on machine learning: the clustering problem is transformed into a classification problem for labeling the clusters. Two benchmark models (Baseline1 and Baseline2) and three machine learning approaches, support vector machines (SVM), Multiple Linear Regression (MLR) and logistic regression (Logit), are considered for labeling. More than 2,000 academic documents from the Information Centre for Social Sciences of RUC were taken as the dataset for the experiments. With precision, recall and F1 value as performance metrics, the author shows that SVM outperforms the other models, while Baseline1 behaves worst among the models compared.

Yanjun Li et al [5] propose two clustering algorithms, Clustering based on Frequent Word Sequences (CFWS) and Clustering based on Frequent Word Meaning Sequences (CFWMS), instead of the bag-of-words representation. A word is simply the form that appears in the document, whereas a word meaning is the concept expressed by synonymous word forms; a word meaning sequence is a sequence of frequent word meanings occurring in the documents. The authors address many issues, such as document representation, dimensionality reduction, labeling of clusters and overlapping clusters. They use a generalized suffix tree to represent the frequent word sequences of each document, and dimensionality is reduced because word meanings are represented instead of words. CFWS is evaluated to be a better algorithm than bisecting k-means and FIHC with respect to accuracy, overlapping cluster quality and self-labeling features, and CFWMS achieves better accuracy than CFWS and than modified Bisecting k-means using Background Knowledge (BBK) [17].

Anaya-Sánchez et al [6] proposed a new document clustering algorithm for topic discovery and labeling which relies on both probable term pairs generated from the collection and an estimation of the topic homogeneity associated with term-pair clusters. Initially, the support set of the most probable pair of terms generated from the collection of documents C (i.e. the set of documents in C that contain both terms) is built. If this set is homogeneous in content, a cluster consisting of the set of documents relevant to the content labeled by the pair is created. To measure the homogeneity of a document collection C, entropy is applied over the vocabulary of the collection at hand. The approach is restricted to term pairs in order to simplify the search for cluster labels. A document clustering algorithm for discovering and describing topics [7] is an extension of their previous work [6]. It provides more suitable and descriptive topic labels
instead of a simple term pair. Experiments carried out over the AFP Spanish collection, the TDT2 English corpus and Reuters-21578 show significant improvements over existing methods in terms of the standard macro- and micro-averaged F1 measures.

Hei-Chia Wang et al [8] proposed a topic detection method based on bibliographic structures (Title, Keyword and Abstract) and semantic properties to extract important words and cluster scholarly literature. Based on the lexical chain method combined with the WordNet lexical database, the proposed method clusters documents by semantic similarity and extracts the important topics of each cluster. To take semantic features into account, a novel method to calculate semantic similarity is proposed: the similarities in the Title, Keyword and Abstract of journal papers are calculated separately, and these bibliographic structures are combined with different weights given to each. After the semantic similarity calculation, a Hierarchical Clustering Algorithm (HCA) is used to cluster the documents. After collecting the key phrases and their respective frequencies from the documents of a cluster, the key phrases with the highest frequencies can be viewed as topics of the cluster. Because users cannot easily extrapolate from separate words, a Phrase Frequency (PF) method is proposed to extract topics. The F-measure is used to compare the performance of the proposed method with the TF-IDF method. Three datasets were collected from the SDOS and ISI online databases; each dataset contains 100 documents, and each document includes its Title, Keyword and Abstract. The experimental results show that the proposed method is better than the traditional TF-IDF method. Its key contribution is the ability to extract topics by semantic features, taking the influence of bibliographic structures into account, and to recommend clusters to users.

A cluster labeling algorithm for creating generic titles, based on external resources such as WordNet, is proposed by Yuen-Hsien Tseng [9]. In this method, category-specific terms are first extracted from the documents as cluster descriptors; these descriptors are then mapped to generic terms by a hypernym search algorithm. The correlation coefficient (CC) method, the square root of the chi-square, is used for selecting specific terms. As the algorithm uses only the depth and occurrence information of the hypernyms, it is general enough to adopt other hierarchical knowledge systems without much revision. The proposed method was compared with InfoMap (Information Mapping Project, 2006), an online tool which finds a set of taxonomic classes for a list of given words.

Fang Li et al [10] proposed a method based on Non-negative Matrix Factorization (NMF) and Testor theory for topic discovery in research literature. NMF is well suited to reducing the high dimensionality of text data while clustering it at the same time, but it does not reveal what each cluster really means; to let a reader determine at a glance whether the contents of a cluster are of interest, the authors imported Testor theory to tag each cluster with a distinct and representative topic, completing the topic discovery process for research literature. Article abstracts published in the proceedings of the IEEE ICC 2008 China Forum were used as the data collection; its 107 papers had been classified manually into five categories. After preprocessing, word segmentation, clustering with NMF and labeling with Testor theory, seven categories were obtained. The accuracy of the clustering is assessed using the AC metric [18].

Huijie Yang et al [11] proposed a novel personal topic detection approach based on clustering emails. First the emails are pre-processed and an improved email VSM (vector space model) combining the body and subject in a new way is constructed; then an advanced k-means algorithm with a kernel-selection algorithm based on the lowest similarity is adopted to cluster the emails; finally, appropriate keywords are extracted to label the topic of each cluster. The F-measure is used to evaluate cluster quality. Experiments on the 20Newsgroups email dataset have shown the validity of this approach, and the resulting cluster labels also match those produced by human judgment.

4 FRAMEWORK FOR ANALYZING THE EXISTING TOPIC DISCOVERY TECHNIQUES IN DOCUMENT CLUSTERING

This study mainly highlights recent research in the field of topic discovery in document clustering. This section presents the proposed framework, shown in Table 1, for comparing the various topic discovery methods used in document clustering. The comparative study is based on the survey and analyzes the existing algorithms by factors such as the document clustering technique, the data set used for the experiments, the performance metrics, and a summary of the performance of the proposed technique. Column 1 of the framework states the title of the referred paper; column 2 states the algorithms and techniques discussed in the existing papers; column 3 gives details of the data set used for conducting the experiments; column 4 lists the metrics considered by the authors for performance evaluation; and column 5 gives concise details about the performance of the algorithms.
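The generic pipeline that most of the surveyed methods share — represent documents as term vectors, cluster them under a similarity measure, then label each cluster with its most prominent terms — can be sketched in Python. This is a minimal illustration, not an implementation of any specific algorithm surveyed above: the function names, the TF-IDF weighting and the spherical k-means variant are our own assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight) for tokenised documents."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans(vectors, k, iters=10):
    """Spherical k-means: assign each vector to the most similar centroid,
    then recompute each centroid as the average of its members."""
    centroids = [dict(vectors[i * len(vectors) // k]) for i in range(k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                total = Counter()
                for v in members:
                    total.update(v)         # sums weights term-by-term
                centroids[c] = {t: w / len(members) for t, w in total.items()}
    return assign, centroids

def label_cluster(centroid, top=3):
    """Label a cluster by its highest-weighted centroid terms."""
    return [t for t, _ in sorted(centroid.items(), key=lambda x: -x[1])[:top]]
```

Here the top-weighted centroid terms serve as the cluster label, in the spirit of the term-based labeling techniques discussed above; the frequent-itemset, word-sequence and ontology-based methods surveyed replace one or more of these stages with richer representations.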
TABLE 1
COMPARISON OF TOPIC DISCOVERY METHODS IN DOCUMENT CLUSTERING

Title: Generic Title Labeling for Clustered Documents - 2010
Technique: Clustered documents using different options and parameters to get various views of the collection; specifically, analyzed the collection based on document clustering as well as term clustering
Data set: 612 patent documents from the USPTO's website and a subset of the Reuters-21578 collection
Metrics: Evaluation of descriptor selection by TFC, CC and CC × TFC
Performance: Compared to InfoMap (Information Mapping Project, 2006); about 70% of the cases lead to reasonable results

Title: Automatically Detecting Personal Topics by Clustering Emails - 2010
Technique: Advanced K-means for clustering emails with a kernel-selection algorithm, after which keywords to label the clusters are extracted
Data set: 20Newsgroups email dataset
Metrics: F-measure
Performance: Experimental results match the human-labeled clustering result

From the analysis it is inferred that more than 70% of the surveyed document clustering algorithms are based on partitional or hierarchical methods. About 40% of the document clustering algorithms use ontology-based methods for topic discovery, and recent research mainly concentrates on ontology-based document clustering techniques; there is broad scope in topic detection and in ontology-based document clustering. From the framework it can also be seen that most topic-oriented document clustering algorithms use the F-measure as the performance metric. The data sets used for the experiments include Reuters, 20 Newsgroups, Wikipedia, TDT2, documents from the ISI and SDOS online databases, the USPTO's website, etc.

5 CONCLUSION

In this paper an overview of existing work on topic discovery in document clustering algorithms is presented, summarizing recent research on the various methods of document clustering for topic discovery. The survey precisely states the details of the algorithms, data sets, metrics and performance. Partitional clustering algorithms using cosine similarity produce good clustering results, whereas hierarchical clustering algorithms using semantic similarity measures produce even better results. Ontology-based and concept-based text clustering and topic discovery provide better solutions than the traditional term-based approaches. Topic discovery in document clustering has scope for further work on various issues, such as generic topic detection, topic description, topic detection in incremental document clustering, etc.

REFERENCES

[1] Benjamin C. M. Fung, Ke Wang, Martin Ester, "Hierarchical Document Clustering", The Encyclopedia of Data Warehousing and Mining, 2005.
[2] Jiangtao Qiu, Changjie Tang, "Topic Oriented Semi-Supervised Document Clustering", SIGMOD 2007 Ph.D. Workshop on Innovative Database Research (IDAR 2007), 2007.
[3] Christian Wartena, Rogier Brussee, "Topic Detection by Clustering Keywords", 19th International Conference on Database and Expert Systems Applications, IEEE, 2008.
[4] Chenzhi Zhang, Huilin Wang, Yao Liu, Hongjiao Xu, "Document Clustering Description Extraction and its Applications", Computer Processing of Oriental Languages: Language Technology for the Knowledge-based Economy, 22nd International Conference, ICCPOL 2009, Springer, 2009.
[5] Yanjun Li, Soon M. Chung, John D. Holt, "Text Document Clustering based on Frequent Word Meaning Sequences", Data & Knowledge Engineering, Elsevier, 2007.
[6] Henry Anaya-Sánchez, Aurora Pons-Porrata, Rafael Berlanga-Llavori, "A New Document Clustering Algorithm for Topic Discovering and Labeling", Proceedings of CIARP'08, Lecture Notes in Computer Science, vol. 5197, Springer, pp. 161-168, 2008.
[7] Henry Anaya-Sánchez, Aurora Pons-Porrata, Rafael Berlanga-Llavori, "A document clustering algorithm for discovering and describing topics", Pattern Recognition Letters, vol. 31, Elsevier Science Inc., pp. 502-510, April 2010.
[8] Hei-Chia Wang, Tian-Hsiang Huang, Jiunn-Liang Guo, Shu-Chuan Li, "Journal Article Topic Detection Based on Semantic Features", Proceedings of the 22nd International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Lecture Notes in Artificial Intelligence, vol. 5579, Springer, 2009.
[9] Yuen-Hsien Tseng, "A generic title labeling for clustered documents", Expert Systems with Applications, vol. 37, no. 3, pp. 2247-2254, 2010.
[10] Fang Li, Qunxiong Zhu, Xiaoyong Lin, "Topic Discovery in Research Literature Based on Non-negative Matrix Factorization and Testor Theory", IEEE Asia-Pacific Conference on Information Processing, 2009.
[11] Huijie Yang, Junyong Luo, Meijuan Yin, Yan Liu, "Automatically Detecting Personal Topics by Clustering Emails", IEEE Second International Workshop on Education Technology and Computer Science, 2010.
[12] A. A. Kogilavani, P. Balasubramanie, "Ontology Enhanced Clustering Based Summarization of Medical Documents", International Journal of Recent Trends in Engineering, 2009.
[13] M. Steinbach, G. Karypis, V. Kumar, "A Comparison of Document Clustering Techniques", Proceedings of the Workshop on Text Mining, 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'00), pp. 109-110, 2000.
[14] B. Fung, K. Wang, M. Ester, "Hierarchical Document Clustering using Frequent Itemsets", SIAM International Conference on Data Mining (SDM'03), pp. 59-70, 2003.
[15] F. Beil, M. Ester, X. Xu, "Frequent Term-Based Text Clustering", International Conference on Knowledge Discovery and Data Mining (KDD'02), pp. 436-442, 2002.
[16] L. Kaufman, P. J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis", John Wiley & Sons, New York, 1990.
[17] Khaled B. Shaban, "A Semantic Approach for Document Clustering", Journal of Software, 2009.
[18] W. Xu, X. Liu, Y. Gong, "Document Clustering Based on Non-negative Matrix Factorization", Proceedings of SIGIR'03, Toronto, Canada, pp. 267-273, 2003.
Dr. S. Kanmani received her B.E (CSE) and M.E (CSE) from Bharathiar University, Coimbatore, India and her Ph.D from Anna University, Chennai, India. She is working as a Professor in the Department of Information Technology at Pondicherry Engineering College. She has published nearly 63 research papers and is currently supervising 8 Ph.D scholars. She is an expert in software testing, and her areas of interest include software engineering, genetic algorithms and data mining.