JOURNAL OF COMPUTING, VOLUME 3, ISSUE 2, FEBRUARY 2011, ISSN 2151-9617
HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG
Abstract—Nowadays most paper documents are also kept in electronic form, because of quick access and smaller storage requirements, so retrieving relevant documents from a large database is a major issue. Clustering documents into relevant groups is an active field of research with applications in text mining, topic tracking systems, intelligent web search engines and question answering systems. Unlike document classification, where a set of labels or terms is predefined for each class, the document sets grouped by a clustering algorithm have no such predefined labels for convenient recognition of the content of each set. Each of them requires the assignment of a concise and descriptive title to help analysts interpret the result; hence, cluster labeling methods are essential. Most existing techniques assign labels to clusters based on the terms that the clustered documents contain. This paper presents a survey of existing document clustering algorithms with topic discovery and proposes a framework for comparing them. Through this framework, the details of the algorithms, their capabilities, evaluation metrics, data sets and performance are analyzed.
—————————— ——————————
1 INTRODUCTION
3 TOPIC DISCOVERY IN CLUSTERING ALGORITHMS

This section gives a detailed survey of recent work on topic discovery used in document clustering.

Benjamin C. M. Fung et al [1] elaborated the Frequent Itemset-based Hierarchical Clustering (FIHC) technique [14] for addressing the problems in hierarchical document clustering. This technique uses frequent itemsets to construct clusters and to organize the clusters into a topic hierarchy. The intuition behind this clustering criterion is that there are some frequent itemsets for each cluster (topic) in the document set, and that different clusters share few frequent itemsets. A frequent itemset is a set of words that occur together in some minimum fraction of the documents in a cluster; it therefore describes something common to many documents in that cluster. The authors experimentally evaluated FIHC against Hierarchical Frequent Term-based Clustering (HFTC), proposed by Beil, Ester & Xu [15], and the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), proposed by Kaufman & Rousseeuw [16], and conclude that FIHC outperforms both with respect to time, accuracy and scalability.

Jiangtao Qiu et al [2] proposed a topic-oriented semantic annotation algorithm using a semi-supervised document clustering approach. The user's need is represented as a multi-attribute topic structure. Using this topic structure, the next step is to compute a topic semantic annotation for each document and the topic semantic similarity between documents. The Ontology-based Document topic Semantic Annotation (ODSA) algorithm was proposed for computing the topic semantic annotation of each document. The authors also proposed a dissimilarity function from which a dissimilarity matrix is constructed; the dissimilarity of two documents is their topic-semantic dissimilarity in topic-oriented document clustering. The documents are then clustered using the proposed optimizing hierarchical clustering algorithm, which is based on agglomerative hierarchical document clustering, together with a new cluster evaluation function, DistanceSum. A comparison between ODSA and the Term Frequency-Inverse Document Frequency (TFIDF) representation showed that ODSA takes less time to construct the dissimilarity matrix; the quality of the clusters produced by both representations was also assessed using the F-measure as the performance metric.

The discovery of topics by clustering keywords without any prior knowledge was proposed by Christian Wartena and Rogier Brussee [3]. The most informative keywords are identified using natural probability distributions, and the documents are clustered according to various similarity functions, such as cosine similarity and the Jensen-Shannon divergence between document distributions and term distributions. A collection of 8 Wikipedia topics is used as the data set. With the F-measure as the performance metric, the experiments have shown that the distribution of terms associated with a keyword gives better results.

Document Clustering Description Extraction and its Applications [4] addresses the problem of labeling clustered documents. This work focuses on automatic labeling based on machine learning: the clustering problem is transformed into a classification problem for labeling the clusters. Two benchmark models (Baseline1 and Baseline2) and three machine learning approaches, support vector machines (SVM), Multiple Linear Regression (MLR) and logistic regression (Logit), are considered for labeling. More than 2,000 academic documents from the Information Centre for Social Sciences of RUC were taken as the dataset for the experiments. With precision, recall and F1 value as performance metrics, the author shows that SVM outperforms the other models, while Baseline1 behaves worst among the models compared.

Yanjun Li et al [5] propose two clustering algorithms, Clustering based on Frequent Word Sequences (CFWS) and Clustering based on Frequent Word Meaning Sequences (CFWMS), instead of the bag-of-words representation. A word is simply the form that appears in the document, whereas a word meaning is the concept expressed by synonymous word forms; a word meaning sequence is a sequence of frequent word meanings occurring in the documents. The authors address many issues, such as document representation, dimensionality reduction, labeling of clusters and overlapping clusters. They use a generalized suffix tree to represent the frequent word sequences of each document, and dimensionality is reduced because word meanings are represented instead of words. CFWS is evaluated to be a better algorithm than bisecting k-means and FIHC with respect to accuracy, overlapping cluster quality and self-labeling features, and CFWMS achieves better accuracy than CFWS and than modified Bisecting k-means using Background Knowledge (BBK) [17].

Anaya-Sánchez et al [6] proposed a new document clustering algorithm for topic discovery and labeling which relies on both probable term pairs generated from the collection and an estimation of the topic homogeneity associated with term-pair clusters. Initially, the support set of the most probable pair of terms generated from the collection of documents C (i.e. the set of documents in C that contain both terms) is built. If this set is homogeneous in content, a cluster consisting of the set of documents relevant to the content labeled by the pair is created. To measure the homogeneity of a document collection C, entropy is applied over the vocabulary of the collection at hand. The approach is restricted to term pairs in order to simplify the search for cluster labels. A document clustering algorithm for discovering and describing topics [7] is an extension of their previous work [6]. It provides more suitable and descriptive topic labels
instead of a simple term pair. Experiments carried out over the AFP Spanish collection, the TDT2 English corpus and Reuters-21578 show significant improvements over existing methods in terms of the standard macro- and micro-averaged F1 measures.

Hei-Chia Wang et al [8] proposed a topic detection method based on bibliographic structures (Title, Keyword and Abstract) and semantic properties to extract important words and cluster scholarly literature. Based on the lexical chain method combined with the WordNet lexical database, the proposed method clusters documents by semantic similarity and extracts the important topics of each cluster. To take semantic features into account, a novel method to calculate semantic similarity is proposed: the similarities in the Title, Keyword and Abstract of journal papers are calculated separately, and these bibliographic structures are combined with different weights given to each. After the semantic similarity calculation, a Hierarchical Clustering Algorithm (HCA) is used to cluster the documents. After collecting the key phrases and their respective frequencies from the documents of a cluster, the key phrases with the highest frequencies can be viewed as topics of the cluster. Because users cannot easily extrapolate from separate words, a Phrase Frequency (PF) method is proposed to extract topics. The F-measure is used to compare the performance of the proposed method with the TF-IDF method. Three datasets were collected from the SDOS and ISI online databases; each dataset contains 100 documents, and each document includes its Title, Keyword and Abstract. The experimental results show that the proposed method is better than the traditional TF-IDF method. Its key contribution is the ability to extract topics by semantic features, taking the influence of bibliographic structures into account, and to recommend clusters to users.

A cluster labeling algorithm for creating generic titles, based on external resources such as WordNet, is proposed by Yuen-Hsien Tseng [9]. In this method, category-specific terms are first extracted from the documents as cluster descriptors; these descriptors are then mapped to generic terms by a hypernym search algorithm. The correlation coefficient (CC) method, the square root of the chi-square, is used for selecting specific terms. As the algorithm uses only the depth and occurrence information of the hypernyms, it is general enough to adopt other hierarchical knowledge systems without much revision. The proposed method was compared with InfoMap (Information Mapping Project, 2006), an online tool which finds a set of taxonomic classes for a list of given words.

Fang Li et al [10] proposed a method based on Non-negative Matrix Factorization (NMF) and Testor theory for topic discovery in research literature. NMF is well suited to reducing the high dimensionality of text data while clustering it at the same time, but it does not reveal what each cluster really means; to let a reader determine at a glance whether the contents of a cluster are of interest, the authors imported Testor theory to tag each cluster with a distinct and representative topic, completing the topic discovery process for research literature. Article abstracts published in the proceedings of the IEEE ICC 2008 China Forum were used as the data collection; its 107 papers had been classified manually into five categories. After preprocessing, word segmentation, clustering with NMF and labeling with Testor theory, seven categories were obtained. The accuracy of the clustering is assessed using the AC metric [18].

Huijie Yang et al [11] proposed a novel personal topic detection approach based on clustering emails. First the emails are pre-processed and an improved email VSM (vector space model) combining the body and subject in a new way is constructed; then an advanced k-means algorithm with a kernel-selection algorithm based on the lowest similarity is adopted to cluster the emails; finally, appropriate keywords are extracted to label the topic of each cluster. The F-measure is used to evaluate cluster quality. Experiments on the 20Newsgroups email dataset have shown the validity of this approach, and the resulting cluster labels also match those produced by human judgment.

4 FRAMEWORK FOR ANALYZING THE EXISTING TOPIC DISCOVERY TECHNIQUES IN DOCUMENT CLUSTERING

This study mainly highlights recent research in the field of topic discovery in document clustering. This section presents the proposed framework, shown in Table 1, for comparing the various topic discovery methods used in document clustering. The comparative study is based on the survey and analyzes the existing algorithms by factors such as the document clustering technique, the data set used for the experiments, the performance metrics, and a summary of the performance of the proposed technique. Column 1 of the framework states the title of the referred paper; column 2 states the algorithms and techniques discussed in the existing papers; column 3 gives details of the data set used for conducting the experiments; column 4 lists the metrics considered by the authors for performance evaluation; and column 5 gives concise details about the performance of the algorithms.
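The generic pipeline that most of the surveyed methods share — represent documents as term vectors, cluster them under a similarity measure, then label each cluster with its most prominent terms — can be sketched in Python. This is a minimal illustration, not an implementation of any specific algorithm surveyed above: the function names, the TF-IDF weighting and the spherical k-means variant are our own assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight) for tokenised documents."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans(vectors, k, iters=10):
    """Spherical k-means: assign each vector to the most similar centroid,
    then recompute each centroid as the average of its members."""
    centroids = [dict(vectors[i * len(vectors) // k]) for i in range(k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                total = Counter()
                for v in members:
                    total.update(v)         # sums weights term-by-term
                centroids[c] = {t: w / len(members) for t, w in total.items()}
    return assign, centroids

def label_cluster(centroid, top=3):
    """Label a cluster by its highest-weighted centroid terms."""
    return [t for t, _ in sorted(centroid.items(), key=lambda x: -x[1])[:top]]
```

Here the top-weighted centroid terms serve as the cluster label, in the spirit of the term-based labeling techniques discussed above; the frequent-itemset, word-sequence and ontology-based methods surveyed replace one or more of these stages with richer representations.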
TABLE 1
COMPARISON OF TOPIC DISCOVERY METHODS IN DOCUMENT CLUSTERING

Title: Generic Title Labeling for Clustered Documents - 2010
Technique: Clustered documents using different options and parameters to get various views of the collection; specifically, analyzed the collection based on document clustering as well as term clustering
Data set: 612 patent documents from the USPTO's website and a subset of the Reuters-21578 collection
Metrics: Evaluation of descriptor selection by TFC, CC and CC × TFC
Performance: Compared to InfoMap (Information Mapping Project, 2006); about 70% of the cases lead to reasonable results

Title: Automatically Detecting Personal Topics by Clustering Emails - 2010
Technique: Advanced K-means for clustering emails with a kernel-selection algorithm, after which keywords to label the clusters are extracted
Data set: 20Newsgroups email dataset
Metrics: F-measure
Performance: Experimental results match the human-labeled clustering result

From the analysis it is inferred that more than 70% of the surveyed document clustering algorithms are based on partitional or hierarchical methods. About 40% of the document clustering algorithms use ontology-based methods for topic discovery, and recent research mainly concentrates on ontology-based document clustering techniques; there is broad scope in topic detection and in ontology-based document clustering. From the framework it can also be seen that most topic-oriented document clustering algorithms use the F-measure as the performance metric. The data sets used for the experiments include Reuters, 20 Newsgroups, Wikipedia, TDT2, documents from the ISI and SDOS online databases, the USPTO's website, etc.

5 CONCLUSION

In this paper an overview of existing work on topic discovery in document clustering algorithms is presented, summarizing recent research on the various methods of document clustering for topic discovery. The survey precisely states the details of the algorithms, data sets, metrics and performance. Partitional clustering algorithms using cosine similarity produce good clustering results, whereas hierarchical clustering algorithms using semantic similarity measures produce even better results. Ontology-based and concept-based text clustering and topic discovery provide better solutions than the traditional term-based approaches. Topic discovery in document clustering has scope for further work on various issues, such as generic topic detection, topic description, topic detection in incremental document clustering, etc.

REFERENCES

[1] Benjamin C. M. Fung, Ke Wang, Martin Ester, "Hierarchical Document Clustering", The Encyclopedia of Data Warehousing and Mining, 2005.
[2] Jiangtao Qiu, Changjie Tang, "Topic Oriented Semi-Supervised Document Clustering", SIGMOD 2007 Ph.D. Workshop on Innovative Database Research (IDAR 2007), 2007.
[3] Christian Wartena, Rogier Brussee, "Topic Detection by Clustering Keywords", 19th International Conference on Database and Expert Systems Applications, IEEE, 2008.
[4] Chenzhi Zhang, Huilin Wang, Yao Liu, Hongjiao Xu, "Document Clustering Description Extraction and its Applications", Computer Processing of Oriental Languages: Language Technology for the Knowledge-based Economy, 22nd International Conference, ICCPOL 2009, Springer, 2009.
[5] Yanjun Li, Soon M. Chung, John D. Holt, "Text Document Clustering based on Frequent Word Meaning Sequences", Data & Knowledge Engineering, Elsevier, 2007.
[6] Henry Anaya-Sánchez, Aurora Pons-Porrata, Rafael Berlanga-Llavori, "A New Document Clustering Algorithm for Topic Discovering and Labeling", Proceedings of CIARP'08, Lecture Notes in Computer Science, vol. 5197, Springer, pp. 161-168, 2008.
[7] Henry Anaya-Sánchez, Aurora Pons-Porrata, Rafael Berlanga-Llavori, "A document clustering algorithm for discovering and describing topics", Pattern Recognition Letters, vol. 31, Elsevier Science Inc., pp. 502-510, April 2010.
[8] Hei-Chia Wang, Tian-Hsiang Huang, Jiunn-Liang Guo, Shu-Chuan Li, "Journal Article Topic Detection Based on Semantic Features", Proceedings of the 22nd International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Lecture Notes in Artificial Intelligence, vol. 5579, Springer, 2009.
[9] Yuen-Hsien Tseng, "A generic title labeling for clustered documents", Expert Systems with Applications, vol. 37, no. 3, pp. 2247-2254, 2010.
[10] Fang Li, Qunxiong Zhu, Xiaoyong Lin, "Topic Discovery in Research Literature Based on Non-negative Matrix Factorization and Testor Theory", IEEE Asia-Pacific Conference on Information Processing, 2009.
[11] Huijie Yang, Junyong Luo, Meijuan Yin, Yan Liu, "Automatically Detecting Personal Topics by Clustering Emails", IEEE Second International Workshop on Education Technology and Computer Science, 2010.
[12] A. A. Kogilavani, P. Balasubramanie, "Ontology Enhanced Clustering Based Summarization of Medical Documents", International Journal of Recent Trends in Engineering, 2009.
[13] M. Steinbach, G. Karypis, V. Kumar, "A Comparison of Document Clustering Techniques", Proceedings of the Workshop on Text Mining, 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'00), pp. 109-110, 2000.
[14] B. Fung, K. Wang, M. Ester, "Hierarchical Document Clustering using Frequent Itemsets", SIAM International Conference on Data Mining (SDM'03), pp. 59-70, 2003.
[15] F. Beil, M. Ester, X. Xu, "Frequent Term-Based Text Clustering", International Conference on Knowledge Discovery and Data Mining (KDD'02), pp. 436-442, 2002.
[16] L. Kaufman, P. J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis", John Wiley & Sons, New York, 1990.
[17] Khaled B. Shaban, "A Semantic Approach for Document Clustering", Journal of Software, 2009.
[18] W. Xu, X. Liu, Y. Gong, "Document Clustering Based on Non-negative Matrix Factorization", Proceedings of SIGIR'03, Toronto, Canada, pp. 267-273, 2003.
Dr. S. Kanmani received her B.E (CSE) and M.E (CSE) from Bharathiar University, Coimbatore, India and her Ph.D from Anna University, Chennai, India. She is working as a Professor in the Department of Information Technology at Pondicherry Engineering College. She has published nearly 63 research papers and is currently supervising 8 Ph.D scholars. She is an expert in software testing, and her areas of interest include software engineering, genetic algorithms and data mining.