
International Journal of Computational Science, Mathematics and

Engineering
IJCSME-Volume-3-Issue-1-January-2016
E-ISSN-2349-8439
___________________________________________________________________________________
Study and Analysis of Ontology based Document Clustering
Using Map Reduce Technique
S.Sudha Lakshmi1, R.C.Saritha2 and M.Usha Rani3
_____________________________________________________________________________________________
ABSTRACT: Search engines and social platforms such as Google, Yahoo, Twitter and Facebook produce huge amounts of data every day in different forms, such as text, images, audio and video. As the number of text documents grows, efficient techniques are needed to support searching and retrieval of information. Document clustering is one such technique: it automatically arranges text documents into meaningful groups. The most common method for representing documents is the vector space model, which represents document features as a bag of words. This model does not capture semantic relations between words, assumes all words are independent, and does not work well for short queries. Incorporating semantic knowledge from an ontology into document clustering is therefore an important but challenging problem. Another challenge faced by current document clustering algorithms is that they do not consider the semantics of the features selected for clustering. To address these problems, ontology-based clustering techniques are widely used. An ontology-based approach allows the data analyst to represent the complex structure of objects, to encode knowledge about the hierarchical structure of categories, and to exploit information about relationships between categories and individual objects. In this paper we describe the bisecting k-means algorithm using the MapReduce programming model in a distributed environment, together with the WordNet ontology, in order to use the semantic relations between words to improve document clustering results.
KEYWORDS: Document clustering, Ontology, Text Mining, Distributed Computing, MapReduce.
___________________________________________________________________________________
1. INTRODUCTION
The document clustering technique is the process of grouping related documents. It also helps in organizing documents and improves the way search-engine results are displayed. The vector space model is widely used for representing documents in most document clustering approaches. In the vector space model the terms are represented by means of a bag of words. The main drawback of this model is that it ignores the semantic relations between terms. Ontology has been adopted in order to enhance document clustering, and it has further been used to address the problem of the large number of document features [1]. In this paper the use of the WordNet ontology [2] with the bisecting k-means algorithm under the MapReduce approach helps in representing semantic relations between terms and in reducing document features. For example, the words "cheese" and "bread" have the same lexical noun category, noun.food, so they map to the same document feature, which drastically reduces the number of dimensions.
MapReduce is a programming model introduced by Google's team for processing huge datasets in a distributed environment for large-scale computations. In addition, it enables programmers who have no experience with parallel and distributed systems to utilize the resources of a large distributed system in an easy and efficient way [3].

1 Research Scholar, Dept. of Computer Science, SPMVV, Tirupati. s_sudhamca@yahoo.com
2 Senior Technical Officer, C-DAC Knowledge Park, Bangalore. saritharc@yahoo.com
3 Professor, Dept. of Computer Science, SPMVV, Tirupati. musha_rohan@yahoo.co.in

2. LITERATURE SURVEY
Ontology is usually referred to as a method for obtaining an agreement on similar activities in a specific domain [4]. Many ontologies have been developed for various domains, such as a food ontology [5], the Gene Ontology [6], and an agriculture ontology [7]. We now briefly review some research efforts that explored the use of ontology for enhancing the document clustering process. In [8], the authors integrated the WordNet ontology with document clustering. They considered synsets (sets of synonyms) as concepts and extended the bag-of-words model by including the parent concepts (hypernyms) of synsets up to five levels. For example, they utilized the WordNet ontology to find the similarity between related terms such as "rice" and "wheat"; these two terms have the same parent concept, "cereals". Therefore, a document containing the term "rice" will be related to a document in which the term "wheat" appears; this approach is shown to enhance clustering performance. In [8] the authors also used the ontology to reduce the number of features, and showed that text document clustering is improved by the use of noun features alone. Recupero [1] coped with problems of the vector space model, which include the high dimensionality of the data and the ignoring of semantic relationships between terms. Recupero [1] used the WordNet ontology to find the lexical category of each term, and replaced every term with its corresponding lexical category.

228 | P a g e
Copyright-IJCSME DOI: 10.18645/IJCSME.SPC.0051
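The term-to-category replacement described above can be illustrated with a small sketch. The mini-lexicon below is invented for illustration (a real implementation would query WordNet itself, e.g. through a WordNet API); only the category names follow WordNet's lexicographer-file naming such as noun.food:

```python
# Toy sketch of lexical-category feature reduction: replace each term with a
# WordNet-style lexical category. LEXICON is a hand-made stand-in for WordNet.
from collections import Counter

LEXICON = {
    "rice": "noun.food", "wheat": "noun.food", "corn": "noun.food",
    "cheese": "noun.food", "bread": "noun.food",
    "harvest": "noun.act",
}

def bag_of_categories(doc_words):
    """Map every word to its lexical category (Uncategorized if unknown)
    and count the occurrences of each category."""
    return Counter(LEXICON.get(w, "Uncategorized") for w in doc_words)

doc = ["rice", "wheat", "corn", "harvest"]
print(bag_of_categories(doc))  # four distinct words collapse to two categories
```

In the paper's setting this collapse is exactly what shrinks the feature space from the full vocabulary down to at most the fixed number of lexical categories.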
Using the fact that WordNet has forty-one lexical categories for representing nouns and verbs, all document dimensions were reduced to just 41. Recupero used ANNIE, an information extraction system [9], to help determine whether two words or compound words refer to the same entity. For example, the entities "Obama, Mr. Obama, President Obama, and President" will all be represented by one entity. K-means clustering and its variant, bisecting k-means, are the focus of our discussion due to their wide use and adoption in many applications.
2.1. K-means and Bisecting k-means
K-means is a widely known partitioning clustering technique. The input of k-means is N objects, M dimensions, and k, the targeted number of clusters. The output of k-means is k clusters represented by the cluster centers. The main idea behind the traditional iterative k-means algorithm is to maximize intra-cluster similarity and minimize the similarity with other clusters [10][11]. The steps of the k-means algorithm are repeated until convergence is achieved, which may be declared when there is no change in the cluster centers or after a specific number of iterations. Bisecting k-means is a variant of k-means developed by Steinbach et al. [12]; it has been applied in document clustering. In [12] the authors showed that bisecting k-means produces better document clustering results than the basic k-means algorithm. Briefly, the bisecting k-means algorithm is as follows. Given a desired number of clusters denoted by k:
1- Use the basic k-means algorithm to get two sub-clusters of the input dataset.
2- Select the sub-cluster that satisfies the maximum overall similarity to be the new input dataset.
3- Repeat steps 1 and 2 until reaching the desired number of clusters.
Selecting which cluster to split may be done by choosing either the largest cluster or the cluster with the least overall similarity. In [12] the authors concluded that the difference between the two methods is small.
3. ONTOLOGY BASED DOCUMENT CLUSTERING
The document clustering process faces mainly two issues: the huge volume of documents and the large size of document features. In this paper we discuss a novel approach to enhance the efficiency of document clustering. The approach is divided into two phases. In the first phase, the very large number of document terms (features) is reduced through the WordNet ontology. The objective of the second phase is to manage the large volume of documents by employing the MapReduce parallel programming paradigm to enhance system performance. In the following we discuss each phase in more detail.
3.1. Phase I: Document Features Reduction by Using WordNet Ontology
Ontology plays a major role in the document clustering process by reducing the huge number of document features. The feature reduction process uses ontology characteristics, which include semantic relations between words, such as synonymy and hierarchical relations. We can obtain a word's parent from the hierarchy relations and use it to represent document features. For example, the words corn, wheat and rice can be represented by the single word "cereals". Exploiting semantic relations between words helps in placing documents that contain words such as rice, wheat and corn in the same cluster. This paper uses the WordNet ontology with the semantic relations concept as follows: WordNet organizes word synsets (sets of synonyms) into forty-five lexicographer files, divided into twenty-six for nouns, fifteen for verbs, and four for adverbs and adjectives. We extend the approach of [1], which finds and replaces every term by its corresponding WordNet lexical category; we use only the lexical categories corresponding to noun terms.
Figure 1: Bag-of-lexical-categories representation and bag-of-words format
Figure 1 represents the detailed steps of replacing the traditional representation of document terms as a bag of words by a bag of lexical categories, and displays the format of the resulting document-words file [13]. For a given set of documents, the first step in this phase is the Extract-Words process. This process extracts words after eliminating stopwords. It generates two files: the vocabulary file, which contains the list of all words, and the document-words file, which stores the associations between words and documents, i.e. the bag of words. The next process is "Get Lexical Categories"; it converts the bag of words into a bag of WordNet lexical categories. This process involves the following steps:
1- Mapping each word to its WordNet lexical category. This step generates a WordID-CategoryID file. If a word does not have a corresponding WordNet lexical category, it is mapped to the Uncategorized category.
2- Generating a bag of lexical categories that replaces each "docID wordID count" entry with "docID CategoryID count".
The third process is "Optimize"; its input is the bag-of-words file or the bag of lexical categories, and its output is an optimized representation. In the case of the bag of words, each document is represented by one line as follows: "docID word1ID:count word2ID:count ... wordnID:count". In the case of the bag of lexical categories, each document is likewise represented by one line: "docID Lexical1ID:count Lexical2ID:count ... LexicalnID:count".
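The one-line-per-document output of the "Optimize" process can be sketched as follows; the triples-to-line conversion is an illustrative guess at the intent, since the paper does not give the exact code:

```python
# Sketch of the "Optimize" step: turn (docID, featureID, count) triples into
# one line per document of the form "docID f1:c1 f2:c2 ...".
from collections import defaultdict

def optimize(triples):
    """triples: iterable of (doc_id, feature_id, count) associations."""
    per_doc = defaultdict(list)
    for doc_id, feat_id, count in triples:
        per_doc[doc_id].append(f"{feat_id}:{count}")
    # Emit one compact line per document, in docID order.
    return [f"{d} " + " ".join(pairs) for d, pairs in sorted(per_doc.items())]

lines = optimize([(1, 10, 2), (1, 11, 1), (2, 10, 5)])
print(lines)  # -> ['1 10:2 11:1', '2 10:5']
```

The same routine works whether the feature IDs are word IDs or lexical-category IDs, which is why the one format serves both representations.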

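For reference, the serial bisecting procedure of Section 2.1 can be written in a few lines on a single machine. This is a minimal sketch under stated assumptions: Euclidean distance, a deterministic initialization, the split-the-largest-cluster policy, and splits that always yield two non-empty clusters; it is not the paper's distributed implementation:

```python
# Single-machine sketch of bisecting k-means (Section 2.1).

def kmeans2(points, iters=20):
    """Basic k-means with k=2. Deterministic init: the lexicographically
    smallest and largest points serve as the initial centers."""
    centers = [min(points), max(points)]
    for _ in range(iters):
        clusters = ([], [])
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Recompute each center as the mean of its cluster (keep old if empty).
        centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return [cl for cl in clusters if cl]

def bisecting_kmeans(points, k):
    """Repeatedly split the largest cluster until k clusters are obtained."""
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(clusters, key=len)
        clusters.remove(largest)
        clusters.extend(kmeans2(largest))
    return clusters

data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (9.0, 0.1), (9.1, 0.0)]
parts = bisecting_kmeans(data, 3)
print(sorted(len(c) for c in parts))  # -> [2, 2, 2]
```

The MapReduce version described next distributes exactly the inner assignment and center-update steps of `kmeans2` across workers.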
The optimized representation of the bag of words reduces file size dramatically. For example, for the "PubMed" bag-of-words dataset [13] it reduced the bag-of-words file from 6.3 gigabytes to 928.475 megabytes.
3.2. Phase II: Bisecting k-means Algorithm Implementation over the MapReduce Framework
In this phase the MapReduce approach is used to run the bisecting k-means algorithm, motivated by the continuous growth in document collection size. To implement bisecting k-means over the MapReduce framework, the traditional k-means algorithm is first implemented to generate two clusters; we adapt the method presented in [14] as follows:
1. Initialize the centers.
2. In the Map function each document vector is assigned to the nearest center. The key of the map function is the document ID and the value is the document vector; the map function emits the cluster index and the associated document vector.
3. In the Reduce function the new centers are calculated. The key is the cluster index and the value is a document vector. Using the cluster index instead of the cluster centroid as the reduce key reduces the amount of data aggregated by the reduce worker nodes.
4. In the cleanup function of the reducer, the new centers are saved to a global file, to be used in the next iteration of the k-means algorithm.
5. Finally, if convergence is achieved then stop; otherwise go to step 2.
Figure 2: Bisecting k-means algorithm over MapReduce
The key idea behind implementing the bisecting k-means algorithm on MapReduce is controlling the Hadoop Distributed File System (HDFS) file paths of the dataset, the cluster centers, and the output clusters. Controlling HDFS paths aids in implementing algorithms that use multiple MapReduce iterations, such as the bisecting k-means algorithm. Figure 3 displays the bisecting k-means algorithm running on MapReduce. The inputs of the algorithm are the path of the input dataset and cluster centroids, the output path, the number of dimensions, and the required number of clusters; the outputs are the documents' clusters and the clusters' centroids. The algorithm proceeds as follows. First, it calls the basic k-means with the following parameters: path of the dataset, path of the cluster centroids, output path, and number of features (dimensions). In this step, the basic k-means algorithm is adjusted to generate two clusters at each call. After that, the largest cluster is selected to be the new input dataset. Also, the string representing the output path is concatenated with "1", and the string representing the path of the clusters' centroids is concatenated with "1". Every time we call the basic k-means, the number of obtained clusters increases by one, except at the last call. This process of calling the basic k-means algorithm is repeated until we get the desired number of clusters. In the pseudo-code of the bisecting k-means algorithm, let PI represent the path of the input dataset, k the desired number of clusters, Nd the number of dimensions, PCC the path of the clusters' centroids, and PO the output path; let Basic-K-means be the procedure that computes the traditional k-means algorithm and returns k clusters.
4. EVALUATION OF DOCUMENT CLUSTERING
In the document clustering approach, two types of evaluation are performed: external and internal. External evaluation measures such as purity, entropy, and the F-measure [15], [16], [12] are applied to labeled documents (i.e., documents whose true clusters are known a priori). For unlabeled documents, internal evaluation is applied; the goal is to maximize the cosine similarity between each document and its associated center [17]. The value of this measure ranges from 0 to 1; the higher the value, the better the clustering results. In this section we discuss the experimental results for evaluating the performance of the presented clustering approach. The first part works on a labeled dataset in order to evaluate the applicability of the system. The second part works on an unlabeled dataset in order to measure the scalability of the proposed system and the efficiency of applying it to large datasets. For the labeled dataset, the above-mentioned approach is compared against applying bisecting k-means clustering using the Hotho method [11], Recupero's lexical categories method [1], the lexical nouns method, and the stemmed "without ontology" method. For the Hotho method the "add" concept strategy is applied: for each term in the vector representation of the dataset, we check whether it appears in WordNet [2] or not; if it appears, it is extended by the WordNet hypernym synset concept up to 5 levels.

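The external measures mentioned above can be computed directly from the cluster/label counts. The following is a minimal sketch of purity and entropy for labeled documents, using the standard definitions (purity = fraction of documents assigned to their cluster's majority label; entropy = size-weighted average of the per-cluster label entropies); it is an illustration, not the paper's evaluation code:

```python
# Minimal sketch of two external clustering measures for labeled documents.
from collections import Counter
from math import log2

def purity(clusters):
    """clusters: list of lists of true labels, one inner list per cluster.
    Purity = (sum of per-cluster majority-label counts) / total documents."""
    total = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / total

def entropy(clusters):
    """Size-weighted average of per-cluster label entropies (lower is better)."""
    total = sum(len(c) for c in clusters)
    h = 0.0
    for c in clusters:
        probs = [n / len(c) for n in Counter(c).values()]
        h += (len(c) / total) * -sum(p * log2(p) for p in probs)
    return h

clusters = [["sport", "sport", "news"], ["news", "news", "news"]]
print(purity(clusters))  # -> 0.8333... (5 of 6 documents match their majority)
```

Higher purity and lower entropy both indicate clusters that agree better with the true labels.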
The results of F-measure, purity and entropy show that utilizing the WordNet ontology for representing document terms, either as lexical categories or as lexical nouns only, enhances document clustering results. The reason for this result is the representation of semantic relations between document terms; for example, documents that contain words such as "rice" or "wheat" will be related to a document in which the term "corn" appears.
5. CONCLUSION
This paper presents a detailed study of enhancing the clustering of huge numbers of documents through the WordNet ontology, using the bisecting k-means algorithm with the MapReduce programming model. We conclude that using WordNet lexical categories to represent document terms has many advantages: first, it represents semantic relations between words; second, it reduces the feature dimensions; and finally, it makes clustering big data feasible by dealing with a reduced number of dimensions. This study should provide new ways for researchers to enhance document clustering results. For future work, this approach can be extended to other lexical categories, and possible enhancements of the bisecting k-means implementation over the MapReduce paradigm can be further investigated.
REFERENCES
[1] Recupero, D. R. (2007). A new unsupervised method for document clustering by using WordNet lexical and conceptual relations. Information Retrieval, 10(6), 563-579.
[2] Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11).
[3] Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
[4] Gruber, T. R. (1991). The role of common ontology in achieving sharable, reusable knowledge bases. KR, 91, 601-602.
[5] Cantais, J., Dominguez, D., Gigante, V., Laera, L., & Tamma, V. (2005). An example of food ontology for diabetes control. In Proceedings of the International Semantic Web Conference 2005 workshop on Ontology Patterns for the Semantic Web.
[6] Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... & Sherlock, G. (2000). Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1), 25-29.
[7] Lauser, B., Sini, M., Liang, A., Keizer, J., & Katz, S. (2006). From AGROVOC to the Agricultural Ontology Service/Concept Server: An OWL model for creating ontologies in the agricultural domain. In Dublin Core Conference Proceedings. Dublin Core DCMI.
[8] Fodeh, S., Punch, B., & Tan, P. N. (2011). On ontology-driven document clustering using core semantic features. Knowledge and Information Systems, 28(2), 395-421.
[9] Cunningham, H., Maynard, D., Bontcheva, K., & Tablan, V. (2002, July). A framework and graphical development environment for robust NLP tools and applications. In ACL (pp. 168-175).
[10] MacQueen, J. (1967, June). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, No. 14, pp. 281-297).
[11] Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Applied Statistics, 100-108.
[12] Steinbach, M., Karypis, G., & Kumar, V. (2000, August). A comparison of document clustering techniques. In KDD Workshop on Text Mining (Vol. 400, No. 1, pp. 525-526).
[13] Bache, K., & Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
[14] Zhao, W., Ma, H., & He, Q. (2009). Parallel k-means clustering based on MapReduce. In Cloud Computing (pp. 674-679). Springer Berlin Heidelberg.
[15] Huang, A. (2008, April). Similarity measures for text document clustering. In Proceedings of the 6th New Zealand Computer Science Research Student Conference (NZCSRSC 2008), Christchurch, New Zealand (pp. 49-56).
[16] Rényi, A. (1961). On measures of entropy and information. In Fourth Berkeley Symposium on Mathematical Statistics and Probability (pp. 547-561).
[17] Zhao, Y., & Karypis, G. (2002). Comparison of agglomerative and partitional document clustering algorithms (No. TR-02-014). Minnesota Univ Minneapolis Dept of Computer Science.
