International Journal on Recent and Innovation Trends in Computing and Communication

Volume: 2 Issue: 12

ISSN: 2321-8169
4060 - 4062


Survey on Enhancing Clustering Output using Side Information by the Textual Extraction Mechanism
Mr. Mahadik Amol B.
Department of Computer Engineering
P.V.P.I.T. College, Bavdhan

Mr. Gurav Yogesh B.
Department of Computer Engineering
P.V.P.I.T. College, Bavdhan

Abstract: In text mining, most operations are based on the statistical analysis of a term, word or phrase. Clustering is a popular technique for automatically organizing a large collection of text; it is also used for text classification. Many text mining applications contain side information along with the text documents, in the form of web documents, user-access web logs, and links attached to text files. This side information is helpful for clustering, but it is sometimes risky to use because it may add noise to the procedure. We therefore need a better technique for text mining to improve the quality of presentation. In this paper, we use different algorithms to enhance clustering quality with document-based, sentence-based, corpus-based, and combined-approach concept analysis, so as to maximize the benefits from using side information.
Keywords: Text classification, clustering, side information, concept analysis, document-based, sentence-based.

Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. Text mining attempts to discover new, previously unknown information by applying techniques from natural language processing and data mining. Text mining is a variation on data mining, which tries to find interesting patterns in large databases. Text mining, also known as Intelligent Text Analysis or Text Data Mining, refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. It is an interdisciplinary field which draws on machine learning, information retrieval, data mining and statistics. A huge amount of practical work has been done in past years on the problem of clustering in text collections [1], [4] in the database and information retrieval communities, but this work primarily addresses clustering of the text content alone. In text mining, several problems arise in application domains such as the web, networks, and other digital collections. In many of these applications a large amount of side information is available in the form of database attributes and meta-information, which can be used for clustering.
Side information is present in different forms. Some examples are:

Web documents: The user-access behavior is captured in the form of web logs. For every document, the meta-information may contain the browsing history of the different users. Such logs can be used to improve the quality of the mining process in a way which is more meaningful to the user and also application-sensitive, because the logs can often pick up subtle correlations in content which cannot be picked up from the raw text alone.


Text documents having links: Many text documents contain links among them, which sometimes carry provenance information and several attributes. Such attributes may often provide insights about the relations among documents in a way which is difficult to obtain from the raw content.

Meta-data: Data such as location, ownership or even temporal information may be informative for mining.
The side information above can help, but it is risky to use when it is noisy. We will therefore use approaches that handle side information carefully for clustering purposes. These approaches must handle two types of data: one in which the text attributes and the side information give the same hints, and another in which they conflict and create complexity. We will also extend the approach to the classification problem.
Many clustering problems have been studied by the database community [7], and several reviews of clustering algorithms can be found in [6]. In text mining, various methods are based on the statistical analysis of a word or term [2]. Such analysis uses term frequency to capture the importance of a term within a document. However, when two or more terms have the same frequency in their documents, one of them may contribute more to the meaning of its sentences than the others.
Thus, the basic text mining model should identify terms that capture the semantics of the text. In this case, the mining model can capture terms that express the concepts of the sentence, which helps to identify the topic of the document.

IJRITCC | December 2014, Available @ http://www.ijritcc.org

One of the traditional data mining techniques is clustering, an unsupervised learning paradigm in which clustering methods try to identify inherent groupings of the text documents, so that a set of clusters is produced in which clusters exhibit high intra-cluster similarity and low inter-cluster similarity. Generally, text document clustering methods attempt to segregate the documents into groups where each group represents a topic that is different from the topics represented by the other groups. Most current document clustering methods are based on the Vector Space Model (VSM), which is a widely used data representation for text classification and clustering [8].
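As a minimal sketch of the Vector Space Model, the following Python code represents each document as a sparse term-frequency vector and compares documents with the cosine measure. The whitespace tokenizer, the sample documents, and the function names are assumptions made for illustration; practical systems usually add term weighting and proper tokenization.

```python
import math
from collections import Counter

def tf_vector(text):
    """Map a document to a sparse term-frequency vector (term -> count)."""
    # Assumption: lowercase whitespace tokenization is enough for the demo.
    return Counter(text.lower().split())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical sample documents.
docs = [
    "text clustering groups similar documents",
    "document clustering groups similar text",
    "web logs record user browsing history",
]
vectors = [tf_vector(d) for d in docs]
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(sim_01, sim_02)
```

In this toy data the first two documents share four of their five terms, so their similarity is about 0.8, while the third document shares no terms with the first and scores 0.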
Document clustering has been investigated for use in a number of different areas of text mining and information retrieval [5]. Initially, document clustering was investigated for improving the precision or recall in information retrieval systems and as an efficient way of finding the nearest neighbors of a document. More recently, clustering has been proposed for use in browsing a collection of documents or in organizing the results returned by a search engine in response to a user's query. Document clustering has also been used to automatically generate hierarchical clusters of documents. A somewhat different approach finds the natural clusters in an already existing document taxonomy, and then uses these clusters to produce an effective document classifier for new documents.

Agglomerative hierarchical clustering and K-means are two clustering techniques that are commonly used for document clustering. Agglomerative hierarchical clustering is often portrayed as better than K-means, although slower. A widely known study, discussed in [9], indicated that agglomerative hierarchical clustering is superior to K-means, although we stress that these results were obtained with non-document data. In the document domain, Scatter/Gather, a document browsing system based on clustering, uses a hybrid approach involving both K-means and agglomerative hierarchical clustering: K-means is used because of its efficiency, and agglomerative hierarchical clustering is used because of its quality. Recent work on generating document hierarchies uses some of these clustering techniques and presents a result indicating that agglomerative hierarchical clustering is better than K-means, although this result is for a single data set and is not one of the major results of that paper.

Initially we also believed that agglomerative hierarchical clustering was superior to K-means clustering, especially for building document hierarchies, and we sought to find new and better hierarchical clustering algorithms. During the course of our experiments we discovered that a simple and efficient variant of K-means, bisecting K-means, can produce clusters of documents that are better than those produced by regular K-means and as good as or better than those produced by agglomerative hierarchical clustering techniques. We have also found what we think is a reasonable explanation for this behavior.
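The idea of bisecting K-means can be illustrated with a short Python sketch: start with one cluster holding all points and repeatedly split one cluster in two with basic K-means (K = 2) until the desired number of clusters is reached. This is a simplified illustration and not the exact procedure of any cited study; the deterministic seeding, the rule of always splitting the largest cluster, the fixed iteration count, and the sample points are all assumptions.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(points, iterations=10):
    """Split points into two groups with basic K-means (K = 2)."""
    # Deterministic seeding (an assumption): the point farthest from the
    # overall centroid, and the point farthest from that first seed.
    center = mean(points)
    seed_a = max(points, key=lambda p: sq_dist(p, center))
    seed_b = max(points, key=lambda p: sq_dist(p, seed_a))
    centroids = [seed_a, seed_b]
    groups = [points, []]
    for _ in range(iterations):
        new_groups = [[], []]
        for p in points:
            closer_to_a = sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1])
            new_groups[0 if closer_to_a else 1].append(p)
        if not new_groups[0] or not new_groups[1]:
            break  # degenerate split: keep the previous grouping
        groups = new_groups
        centroids = [mean(g) for g in groups]
    return groups

def bisecting_kmeans(points, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(clusters, key=len)
        if len(largest) < 2:
            break
        left, right = two_means(largest)
        if not left or not right:
            break  # identical points cannot be split further
        clusters.remove(largest)
        clusters.extend([left, right])
    return clusters

# Hypothetical 2-D points standing in for document vectors.
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
          [5.0, 5.0], [5.1, 5.2], [9.0, 0.0], [9.2, 0.1]]
clusters = bisecting_kmeans(points, 3)
print(clusters)
```

Fuller variants typically try several trial bisections per step and keep the best one; that refinement is left out here to keep the sketch short.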

Co-clustering is a technique for tapping the rich meta-information of multimedia web documents [3], including category, annotation and description, for associative discovery. Most co-clustering methods proposed for heterogeneous data ignore the representation problem of short and noisy text, and their performance is limited by the empirical weighting of the multi-modal features.
The selected method consists of a concept-based similarity measure together with sentence-based, document-based, and corpus-based concept analysis. Many forms of text databases contain a large amount of side information, which is the input to the proposed model.

A web document will be the input to the proposed model.


Text preprocessing is done by:
1. Separating sentences
2. Labeling terms
3. Removing stop words
4. Stemming words
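The preprocessing steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, the suffix stripper is a crude stand-in for a real stemmer, and the term labeling step is omitted because it requires a linguistic parser that is outside the scope of a short example.

```python
import re

# Assumption: a very small stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}
# Assumption: naive suffix stripping instead of a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def split_sentences(text):
    """Split text after sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Return one list of stemmed, stop-word-free tokens per sentence."""
    result = []
    for sentence in split_sentences(text):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        result.append([stem(t) for t in tokens if t not in STOP_WORDS])
    return result

sample = "Clustering groups the documents. Side information is used in mining."
processed = preprocess(sample)
print(processed)
```

The crude stemmer turns "mining" into "min", which shows why a proper stemming algorithm is preferable in practice.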

Concept-based analysis will be implemented by calculating:
1. Conceptual term frequency
2. Term frequency
3. Document frequency

First, the analyzed labeled terms are the concepts that capture the semantic structure of each sentence. Second, the frequency of a concept will be used to measure the contribution of the concept to the meaning of the sentence and to the main topics of the document. Finally, the number of documents containing the analyzed concepts will be used to discriminate among documents when calculating the similarity.
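The three frequency measures can be sketched in Python on a toy corpus. The sketch assumes that the labeling step has already produced, for every sentence, a list of concept labels; it counts occurrences within a sentence as a simple stand-in for the sentence-level conceptual term frequency. The corpus contents and the function names are hypothetical.

```python
# Assumption: each document is a list of sentences, and each sentence is a
# list of concept labels produced by an earlier labeling step.
corpus = [
    [["cluster", "text"], ["cluster", "side", "information"]],
    [["text", "mining"], ["mining", "web", "log"]],
]

def ctf(concept, sentence):
    """Sentence-level measure: occurrences of the concept in one sentence."""
    return sentence.count(concept)

def tf(concept, document):
    """Document-level measure: occurrences of the concept in one document."""
    return sum(sentence.count(concept) for sentence in document)

def df(concept, documents):
    """Corpus-level measure: number of documents containing the concept."""
    return sum(
        1 for document in documents
        if any(concept in sentence for sentence in document)
    )

print(ctf("cluster", corpus[0][0]), tf("cluster", corpus[0]), df("text", corpus))
```

A concept with a high sentence-level and document-level count but a low document count, such as "cluster" here, is the kind of concept that helps discriminate one document from the others.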


To test the effect of the concept-based similarity on clustering, Hierarchical Agglomerative Clustering is used. By exploiting the semantic structure of the sentences in documents, a better text clustering result is achieved.
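A minimal sketch of hierarchical agglomerative clustering in Python is given below. It starts with one cluster per document and repeatedly merges the two most similar clusters. The Jaccard overlap between concept sets is used only as a stand-in for the concept-based similarity measure, and the average-link merge rule, the sample concept sets, and the function names are assumptions.

```python
def jaccard(a, b):
    """Overlap of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def average_link(c1, c2, items):
    """Mean pairwise similarity between the members of two clusters."""
    total = sum(jaccard(items[i], items[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerative(items, k):
    """Merge the most similar pair of clusters until k clusters remain."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > max(k, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = average_link(clusters[a], clusters[b], items)
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical concept sets, one per document.
concept_sets = [
    {"cluster", "text", "document"},
    {"cluster", "text", "similarity"},
    {"web", "log", "user"},
    {"web", "log", "browsing"},
]
groups = agglomerative(concept_sets, 2)
print(groups)
```

The result lists document indices per cluster. Recomputing every pairwise score in each round keeps the code short but is slow on large collections, which is the usual cost of agglomerative methods.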

To extend the approach to classification, we will extend our earlier clustering techniques in order to incorporate supervision, and implement a model which summarizes the class distribution in the data in terms of the clusters. Then, we will explain how to use the summarized model for effective and efficient classification. We will introduce some notations and terms which are related to the classification problem.
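One possible reading of this idea is sketched below in Python: each cluster is summarized by its centroid and by the counts of the class labels of its members, and a new document is given the majority class of the nearest cluster. This is an illustrative assumption, not the model of any cited work; the vectors, labels, and function names are hypothetical.

```python
from collections import Counter

def summarize(clusters):
    """Summarize each cluster as (centroid, class-label counts).

    clusters: list of non-empty lists of (vector, label) pairs.
    """
    summary = []
    for members in clusters:
        dim = len(members[0][0])
        centroid = [
            sum(vec[i] for vec, _ in members) / len(members)
            for i in range(dim)
        ]
        label_counts = Counter(label for _, label in members)
        summary.append((centroid, label_counts))
    return summary

def classify(vector, summary):
    """Return the majority label of the cluster with the nearest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label_counts = min(summary, key=lambda entry: sq_dist(vector, entry[0]))
    return label_counts.most_common(1)[0][0]

# Hypothetical labeled training documents, already grouped into clusters.
labeled_clusters = [
    [([1.0, 0.0], "sports"), ([0.9, 0.1], "sports"), ([0.8, 0.3], "politics")],
    [([0.0, 1.0], "politics"), ([0.1, 0.9], "politics")],
]
model = summarize(labeled_clusters)
predicted = classify([0.95, 0.05], model)
print(predicted)
```

Because only one centroid and one label histogram are stored per cluster, classification compares a new document against a few summaries rather than against every training document.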

A. The COATES Algorithm:

The name COATES reflects the fact that it is a COntent and Auxiliary attribute based TExt cluStering algorithm.
The algorithm requires two phases:
1. Initialization: A lightweight initialization is used. It is chosen because it is efficient and provides a practical starting point. This phase uses only the text content and no auxiliary information. Its outputs are the centroids and the partitions, which provide a good starting point for the clustering process.
2. Main phase: The output of the initialization phase is the input to the main phase, which starts with these initial groups. The clusters are then refined using both the text content and the auxiliary information. This phase uses alternating iterations, which help to improve the quality of the clustering.
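The two-phase structure can be illustrated with a heavily simplified Python sketch. It is not the COATES algorithm itself: the scoring rules (cosine similarity to text centroids, and the fraction of a document's side attributes already seen in a cluster) are assumptions chosen to keep the example short, and the documents, seeds, and function names are hypothetical. Only the overall shape, a text-only initialization followed by alternating content and auxiliary iterations, follows the description above.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    total = Counter()
    for vec in vectors:
        total.update(vec)  # adds term weights
    return {t: w / len(vectors) for t, w in total.items()}

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def text_scores(doc, assignment, docs, k):
    """Cosine similarity of the document text to each cluster centroid."""
    scores = []
    for c in range(k):
        members = [d["text"] for d, a in zip(docs, assignment) if a == c]
        scores.append(cosine(doc["text"], centroid(members)) if members else 0.0)
    return scores

def aux_scores(doc, assignment, docs, k):
    """Fraction of the document's side attributes seen in each cluster."""
    scores = []
    for c in range(k):
        seen = set()
        for d, a in zip(docs, assignment):
            if a == c and d is not doc:
                seen |= d["aux"]
        scores.append(len(doc["aux"] & seen) / len(doc["aux"]) if doc["aux"] else 0.0)
    return scores

def initialize(docs, seeds):
    """Initialization phase: text only, nearest seed document."""
    return [argmax([cosine(d["text"], docs[s]["text"]) for s in seeds]) for d in docs]

def main_phase(docs, assignment, k, major_iterations=3):
    for _ in range(major_iterations):
        # Content iteration: reassign using text similarity alone.
        assignment = [argmax(text_scores(d, assignment, docs, k)) for d in docs]
        # Auxiliary iteration: add the side-information score (assumed rule).
        assignment = [
            argmax([t + a for t, a in zip(text_scores(d, assignment, docs, k),
                                          aux_scores(d, assignment, docs, k))])
            for d in docs
        ]
    return assignment

docs = [
    {"text": {"cluster": 2, "text": 1}, "aux": {"userA"}},
    {"text": {"cluster": 1, "text": 2}, "aux": {"userA"}},
    {"text": {"web": 2, "log": 1}, "aux": {"userB"}},
    {"text": {"web": 1, "log": 2}, "aux": {"userB"}},
    {"text": {"text": 1, "log": 1}, "aux": {"userB"}},  # ambiguous text
]
seeds = [0, 2]
initial = initialize(docs, seeds)
final = main_phase(docs, initial, k=len(seeds))
print(initial, final)
```

In this toy data the last document is equally similar to both seeds on text alone and starts in the first cluster; its side attribute then moves it to the second cluster during the auxiliary iteration.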

Each major iteration consists of two minor iterations, a content iteration and an auxiliary iteration, corresponding to the text-based and auxiliary-attribute-based methods respectively. The algorithm thus makes use of alternating minor iterations of content-based and auxiliary attribute-based clustering.

Methodology Comparison:

1) Methods of Data collection:
Since this work is specifically proposed for clustering web documents, the World Wide Web is the best source for collecting dynamic, linked and varied collections of web pages. We will retrieve the list of results of the search engine for a given query (side-crawler). The URLs of various web sites and snapshot features of the web pages will be taken as reference to prepare the datasets.
2) Probable Methods of data analysis:
Clustering is a very complex procedure, as it depends on the collection to which it is applied as well as on the choice of the various parameter values. Hence, a careful selection of these is crucial to the success of the clustering. We will compare our results with other clustering algorithms such as k-means and single-pass clustering. We will also compare our results with other document clustering methods such as the Vector Space Model, where the cosine and Jaccard measures are used. By using a hybrid approach in web document clustering, we expect to obtain better results than the traditional methods.

In this survey paper, we have discussed how better text clustering results can be achieved for mining text data with the use of side information. Many text databases contain a large amount of side information or meta-information, which can be used to improve the clustering process. The clustering method is designed by combining an iterative partitioning technique with a probability estimation process that computes the importance of different kinds of side information. General approaches will be used to improve the clustering and classification algorithms. In this paper we have studied different techniques to improve clustering and data mining, which gives a brief overview of clustering in text mining. Each method aims to increase the accuracy and quality of the clusters by using side information.






[1] C. C. Aggarwal and P. S. Yu, "A framework for clustering massive text and categorical data streams," in Proc. SIAM Conf. Data Mining, 2006, pp. 477-481.
[2] Shady Shehata and Fakhri Karray, "An Efficient Concept-based Mining Model for Enhancing Text Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1360-1371, 2010.
[3] Lei Meng, Ah-Hwee Tan, and Dong Xu, "Semi-Supervised Heterogeneous Fusion for Multimedia Data Co-Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 9, pp. 2293-2306, 2014.
[4] M. Steinbach, G. Karypis, and V. Kumar, "A Comparison of document clustering techniques," in Proc. Text Mining Workshop KDD, 2000, pp. 109-110.
[5] U. Y. Nahm and R. J. Mooney, "A Mutually Beneficial Integration of Data Mining and Information Extraction," Proc. 17th Nat'l Conf. Artificial Intelligence (AAAI '00), pp. 627-632, 2000.
[6] Jain and R. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ, USA: Prentice-Hall, Inc., 1988.
[7] S. Guha, R. Rastogi, and K. Shim, "CURE: An efficient clustering algorithm for large databases," in Proc. ACM SIGMOD Conf., New York, NY, USA, 1998, pp. 73-84.
[8] G. Salton, A. Wong, and C. S. Yang, "A Vector Space Model for Automatic Indexing," Comm. ACM, vol. 18, no. 11, pp. 112-117, 1975.
[9] "Unsupervised Learning of Topic Hierarchies from Text Data," Proc. 16th Int'l Joint Conf. Artificial Intelligence (IJCAI '99), pp. 682-687, 1999.
