
International Journal of Advanced Engineering Research and Technology (IJAERT)

Volume 4, Issue 2, February 2016, ISSN No.: 2348-8190

SIMILARITY-BASED TECHNIQUES FOR TEXT DOCUMENT CLASSIFICATION

SWEETI, NIVEDITA MISHRA
Department of Computer Science, BIT, Kanpur, AKTU Lucknow
Department of Electronics Engineering, PSIT, Kanpur, AKTU Lucknow

Abstract
This paper presents similarity-based techniques for text document classification and formally examines the similarity between documents. Document classification is the automatic assignment of a set of documents to predefined classes by a learning system (classifier) that has been trained on similar data sets of test documents. Within this context, document indexing is the activity of mapping a document into a form that can be consumed by a classification system. Several document indexing models exist, many of which rely on feature extraction, dimensionality reduction, or both. In feature extraction, the associated document is typically represented as a feature vector encoding the presence of words, syntactic entities, or semantically linked tags, and a term weight is computed for each such feature.

We introduce document classification and its main techniques: the naive Bayes classifier, TF-IDF, latent semantic indexing, support vector machines, and concept mining. Our goal was to develop a text categorization system that uses fewer training examples to achieve a given level of performance, using a similarity-based learning algorithm and threshold strategies. Experimental results show that the proposed model is quite useful for building document categorization systems. The approach can be extended to larger text document sets, and its efficiency can be evaluated against the performance of presently available methods such as SVM and naive Bayes. On the whole, the approach concentrates on categorizing small document collections and completing the assigned task.

Keywords: Text classification, information retrieval, language modeling, TF-IDF, support vector machines, concept mining.

1. Introduction
Document classification (or categorization) is a problem in information science. The task is to assign an electronic document to one or more categories based on its contents. Within this context, document indexing is the activity of mapping a document into a form that can be consumed by a classification system. Several document indexing models exist, many of which rely on feature extraction, dimensionality reduction, or both. In feature extraction, the associated document is typically represented as a feature vector encoding the presence of words, syntactic entities, or semantically linked tags, and a term weight is computed for each such feature.

State-of-the-art systems for text categorization use induction algorithms in conjunction with word-based features (bag of words). The bag-of-words (BOW) approach is inherently limited, as it can only use pieces of information that are explicitly mentioned in the documents, and even then only if the same vocabulary is used consistently. Prior to text categorization, a feature generator can analyze the documents and map them onto concepts, which in turn yield a set of generated features that augment the standard bag of words. Although several approaches have been proposed to address this critical issue over the past decade, even the best classification systems have shown only marginal improvements, as Sebastiani observed in his critical survey of the subject in 2002.
In the face of these underlying limitations, one possible approach would be to deviate from inductive learning and seek a substantially deeper understanding of document structure using NLP techniques. However, before trying to develop such a complete understanding of natural language text, we try to find out how much improvement in performance we can achieve by extending the document representation beyond the contents of the document itself using non-probabilistic approaches. We thus restrict ourselves to inductive methods and try to enrich the TF-IDF vector with extra words (in the vector space) that are in the context of the document. These words are acquired by using a neural-net component that has been trained on (document, word) pairs to determine whether the word in the pair is in the context of the document. In the next section, we discuss some of the related work and prior art on document classification, followed by our approach. Then, we present some preliminary results and comparisons and finally conclude the paper with a discussion of research pointers for future work.
Document classification is a well-known task in the information retrieval domain and relies upon various indexing schemes to map documents into a form that can be consumed by a classification system.


Term Frequency-Inverse Document Frequency (TF-IDF) is one such class of term-weighting functions used extensively for document representation. One of the major drawbacks of this scheme is that it ignores key semantic links between words and/or word meanings and compares documents based solely on word frequencies. The majority of current approaches that try to address this issue either rely on alternative representation schemes or are based upon probabilistic models. We utilize a non-probabilistic approach to build a robust document classification system, which essentially relies upon enriching the classical TF-IDF scheme with context-sensitive semantics using a neural-net based learning component.
Similarity-based methods measure the proximity between sentences, most often by using the cosine of the angle between the vectors representing the sentences. Graphical methods represent term frequencies graphically and use these representations to identify topical segments (which appear as dense dot clouds on the graphic); the dot-plotting algorithm is the most common example of a graphical approach to text segmentation. Lexical-chain methods link multiple occurrences of a term and consider a chain broken when there are too many sentences between two occurrences of that term. The Segmenter system uses this method for text segmentation with a subtle adjustment, as it determines the number of sentences necessary to break a chain as a function of the syntactic category of the term.
Many data mining (i.e. knowledge discovery) techniques are used for ontology learning: text mining, Web mining, graph mining, network analysis, link analysis, relational data mining, and stream mining. In the current state of the art, the combined mining of software code and its associated documentation is not explicitly addressed. With the growing amount of software, especially open-source software libraries, such data is worth considering as the basis of a new methodology. We use the term software mining to refer to this methodology; it denotes the process of extracting knowledge (i.e. useful information) out of the data sources that typically accompany an open-source software library. The motivation for software mining comes from the fact that the discovery of reusable software artifacts is just as important as the discovery of documents and multimedia contents. According to recent Semantic Web trends, contents need to be semantically annotated with concepts from a domain ontology in order to be discoverable by intelligent agents. Because legacy content repositories are relatively large, cheaper semi-automatic means for semantic annotation and domain ontology construction are preferred to expensive manual labor. GATE is an open-source software library for natural language processing written in the Java programming language. We interpret software mining as a combination of methods for structure mining and for content mining; more specifically, we approach the software mining task with the techniques used for text mining and link analysis, and the GATE case study serves as a perfect example in this perspective. On concrete examples we discuss how each instance (i.e. a programming construct such as a class or a method) can be represented as a feature vector that combines the information about how the instance is interlinked with other instances with the information about its (textual) content. The so-obtained feature vectors serve as the basis for the construction of the domain ontology with OntoGen, a system for semi-automatic, data-driven ontology construction, or by using traditional machine learning algorithms such as clustering, classification, regression, or active learning.
2. Similarity
The notion of similarity rests on either exact or approximate repetitions of patterns in the compared items. In the case of approximate repetitions we talk about statistical similarity, as found in a fractal and its parts. Finding similarities or distinguishing dissimilarities depends on the faculties of pattern recognition and disambiguation, respectively. Text mining techniques are applied to large collections of documents from various sources such as news articles, research papers, books, digital libraries, email messages, and web pages.
The main classification objective, particularly with respect to knowledge management, is to simplify access to and processing of explicit knowledge. Classification supports the following knowledge-related activities:
1. Retrieval
2. Organization
3. Visualization
4. Development
5. Exchange of knowledge
There are three classes of objects that could be used for querying an information retrieval system:
A set of concepts from a knowledge representation that describes the situation of the knowledge worker, for example the current actions a person performs or the competencies he or she acquires. These concepts stem from the formal models that are used to represent the context of the knowledge worker.
A set of documents that are related to the current situation of the knowledge worker, for example the document template he or she is currently interacting with, or the process documentation the person is reading. In our approach, documents are related to concepts from the task and the domain model.


This enables us to infer which documents are associated with the current task, and vice versa.
A set of terms which are related to his or her current situation; examples of such terms would be parts of documents the person currently views or text he or she currently types.
The set of terms is not related to any of the models that span our context model. Nevertheless, we consider it a vital addition to our approach to retrieval.

3. Document categorization
In general, document categorization simply means assigning documents to a fixed set of categories. In the domain of text mining, however, document categorization also involves the preliminary process of automatically learning categorization patterns so that the categorization of new (uncategorized) documents is straightforward. Major categorization approaches are decision trees, decision rules, k-nearest neighbors, Bayesian approaches, neural networks, regression-based methods, and vector-based methods.
The document classification task can be divided into two sorts: supervised document classification, which relies on external information such as human feedback to define the categories, and unsupervised document classification, which must be done entirely without reference to external information.
Document classification techniques include:
1. Naive Bayes classifier
2. TF-IDF
3. Latent semantic indexing
4. Concept mining
5. SVM (Support Vector Machine)
3.1 Naive Bayes classifier
A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even though these features may depend on one another, a naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without believing in Bayesian probability or using any Bayesian methods.

In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers often work much better in many complex real-world situations than one might expect. Recently, careful analysis of the Bayesian classification problem has shown that there are sound theoretical reasons for the apparently unreasonable efficacy of naive Bayes classifiers.
An advantage of the naive Bayes classifier is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because the variables are assumed to be independent, only the variances of the variables for each class need to be determined, and not the entire covariance matrix.
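As a small aside (not code from the paper), the following Python/NumPy sketch shows that, under the independence assumption, only a per-feature mean and variance need to be estimated for each class, never a full covariance matrix; the toy data is purely illustrative.

import numpy as np

# Toy feature matrix (two continuous features) and class labels.
X = np.array([[5.0, 1.2], [4.8, 1.0], [6.5, 3.1], [6.7, 3.3]])
y = np.array([0, 0, 1, 1])

params = {}
for c in np.unique(y):
    Xc = X[y == c]
    # One mean and one variance per feature per class is sufficient.
    params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)

print(params)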
A naive Bayes classifier is a simple probabilistic
classifier based on applying Bayes' theorem (from
Bayesian statistics) with strong (naive) independence
assumptions. A more descriptive term for the underlying
probability model would be "independent feature
model".
3.1.1 The naive Bayes probabilistic model
Abstractly, the probability model for a classifier is a conditional model

p(C | F_1, ..., F_n)   (3.1)

over a dependent class variable C with a small number of outcomes or classes, conditional on several feature variables F_1 through F_n. The problem is that if the number of features n is large, or when a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, we write

p(C | F_1, ..., F_n) = p(C) p(F_1, ..., F_n | C) / p(F_1, ..., F_n)   (3.2)

In plain English the above equation can be written as

posterior = (prior × likelihood) / evidence   (3.3)

In practice we are only interested in the numerator of that fraction, since the denominator does not depend on C and the values of the features F_i are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model

p(C, F_1, ..., F_n)   (3.4)

which can be rewritten as follows, using repeated applications of the definition of conditional probability:

p(C, F_1, ..., F_n) = p(C) p(F_1 | C) p(F_2 | C, F_1) p(F_3 | C, F_1, F_2) ...

and so forth. Now the "naive" conditional independence assumptions come into play: assume that each feature F_i is conditionally independent of every other feature F_j for j ≠ i. This means that

p(F_i | C, F_j) = p(F_i | C)   (3.5)

and so the joint model can be expressed as

p(C, F_1, ..., F_n) = p(C) ∏_{i=1}^{n} p(F_i | C)   (3.6)

This means that, under the above independence assumptions, the conditional distribution over the class variable C can be expressed as

p(C | F_1, ..., F_n) = (1/Z) p(C) ∏_{i=1}^{n} p(F_i | C)   (3.7)

where Z is a scaling factor dependent only on F_1, ..., F_n, i.e., a constant if the values of the feature variables are known.
Models of this form are much more manageable, since they factor into a so-called class prior p(C) and independent probability distributions p(F_i | C). If there are k classes and if a model for each p(F_i | C) can be expressed in terms of r parameters, then the corresponding naive Bayes model has (k − 1) + n r k parameters. In practice, k = 2 (binary classification) and r = 1 (Bernoulli variables as features) are common, and so the total number of parameters of the naive Bayes model is 2n + 1, where n is the number of binary features used for prediction.
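To connect equations (3.6) and (3.7) to working code, here is a minimal Python sketch of a Bernoulli naive Bayes classifier; the toy corpus, the add-one smoothing, and all names are illustrative assumptions rather than the paper's experimental setup.

import math
from collections import Counter, defaultdict

def train_bernoulli_nb(docs, labels, vocab):
    """Estimate p(C) and p(F_i = 1 | C) with add-one smoothing."""
    prior = Counter(labels)
    cond = defaultdict(dict)
    for c in prior:
        docs_c = [set(d) for d, y in zip(docs, labels) if y == c]
        for w in vocab:
            cond[c][w] = (sum(w in d for d in docs_c) + 1) / (len(docs_c) + 2)
    return prior, cond

def classify(doc, prior, cond, vocab, n_docs):
    words = set(doc)
    scores = {}
    for c in prior:                       # log p(C) + sum_i log p(F_i | C), eq. (3.6)
        s = math.log(prior[c] / n_docs)
        for w in vocab:
            p = cond[c][w]
            s += math.log(p if w in words else 1 - p)
        scores[c] = s
    return max(scores, key=scores.get)    # argmax over C; the factor Z in eq. (3.7) cancels

docs = [["cheap", "pills"], ["meeting", "agenda"], ["cheap", "offer"], ["project", "agenda"]]
labels = ["spam", "ham", "spam", "ham"]
vocab = sorted({w for d in docs for w in d})
prior, cond = train_bernoulli_nb(docs, labels, vocab)
print(classify(["cheap", "agenda"], prior, cond, vocab, len(docs)))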
3.2 TF-IDF
The TF-IDF weight (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the TF-IDF weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. One of the simplest ranking functions is computed by summing the TF-IDF weights for each query term; many more sophisticated ranking functions are variants of this simple model.

3.2.1 Mathematical Details
The term count in the given document is simply the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term count regardless of the actual importance of that term in the document) and to give a measure of the importance of the term t_i within the particular document d_j. Thus we have the term frequency, defined as follows:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}   (3.8)

where n_{i,j} is the number of occurrences of the considered term t_i in document d_j, and the denominator is the sum of the numbers of occurrences of all terms in document d_j.
The inverse document frequency is a measure of the general importance of the term, obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient:

idf_i = log ( |D| / |{d : t_i ∈ d}| )   (3.9)

where |D| is the total number of documents in the corpus and |{d : t_i ∈ d}| is the number of documents in which the term t_i appears (that is, n_{i,j} ≠ 0). If the term is not in the corpus, this will lead to a division by zero; it is therefore common to use 1 + |{d : t_i ∈ d}| instead. Then

tf-idf_{i,j} = tf_{i,j} × idf_i   (3.10)

A high TF-IDF weight is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. The TF-IDF value for a term is always greater than or equal to zero.
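The following short Python sketch implements equations (3.8)-(3.10) directly and then ranks a toy corpus by summing the TF-IDF weights of the query terms, as described in Section 3.2; the corpus, the query, and the smoothing of the document-frequency denominator are illustrative assumptions.

import math
from collections import Counter

def tf(term, doc_tokens):
    counts = Counter(doc_tokens)                       # n_{i,j} over all terms in d_j
    return counts[term] / len(doc_tokens)              # eq. (3.8)

def idf(term, corpus):
    df = sum(term in doc for doc in corpus)            # |{d : t_i in d}|
    return math.log(len(corpus) / (1 + df))            # eq. (3.9), with the 1 + df smoothing

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)    # eq. (3.10)

corpus = [doc.split() for doc in
          ["the cat sat on the mat", "the dog chased the cat", "dogs and cats are pets"]]

# Simple ranking: score each document by summing tf-idf over the query terms.
query = "cat mat".split()
scores = [sum(tf_idf(t, doc, corpus) for t in query) for doc in corpus]
print(scores)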
3.3 LATENT SEMANTIC INDEXING
Latent Semantic Indexing (LSI) is an indexing and
retrieval method that uses a mathematical technique
called Singular Value Decomposition (SVD) to identify
patterns in the relationships between the terms and
concepts contained in an unstructured collection of text.
LSI is based on the principle that words that are used in
the same contexts tend to have similar meanings. A key
feature of LSI is its ability to extract the conceptual
content of a body of text by establishing associations
between those terms that occur in similar contexts. It is called Latent Semantic Indexing because of its ability to correlate semantically related terms that are latent in a collection of text; it was first applied to text at Bell Laboratories in the late 1980s.


The method, also called Latent Semantic Analysis (LSA), uncovers the underlying latent semantic structure in the usage of words in a body of text and how it can be used to extract the meaning of the text in response to user queries, commonly referred to as concept searches. Queries, or concept searches, against a set of documents that have undergone LSI will return results that are conceptually similar in meaning to the search criteria even if the results don't share a specific word or words with the search criteria.
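To illustrate how SVD exposes this latent structure, the following Python/NumPy sketch factorizes a tiny hand-made term-document matrix, keeps the two largest singular values, folds a one-term query into the latent space, and ranks the documents by cosine similarity; the matrix, the terms, and the chosen dimensionality are purely illustrative assumptions.

import numpy as np

# Toy term-document matrix A (rows = terms, columns = documents).
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 1]], dtype=float)

# Singular Value Decomposition, keeping the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Documents expressed in the k-dimensional latent concept space.
doc_vectors = (np.diag(s_k) @ Vt_k).T

# Fold a query into the same space: q_k = q U_k diag(1/s_k).
query = np.array([1, 0, 0, 0, 0], dtype=float)          # the single term "car"
q_k = query @ U_k @ np.diag(1.0 / s_k)

# Rank documents by cosine similarity in latent space.
def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

print([round(cos(q_k, d), 3) for d in doc_vectors])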
3.4 SUPPORT VECTOR MACHINES
A Support Vector Machine (SVM) performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories. SVM models are closely related to neural networks; in fact, an SVM model using a sigmoid kernel function is equivalent to a two-layer perceptron neural network.
SVM models are a close cousin of classical multilayer perceptron networks. Using a kernel function, SVMs provide an alternative training method for polynomial, radial basis function, and multi-layer perceptron classifiers in which the weights of the network are found by solving a quadratic programming problem with linear constraints, rather than by solving a non-convex, unconstrained minimization problem as in standard neural network training.
A predictor variable is called an attribute, and a transformed attribute that is used to define the hyperplane is called a feature. The task of choosing the most suitable representation is known as feature selection. A set of features that describes one case is called a vector. The goal of SVM modeling is therefore to find the optimal hyperplane that separates clusters of vectors in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side. The vectors near the hyperplane are the support vectors. The figure below presents an overview of the SVM process.
3.4.1 Two-Dimensional Example
Consider a simple two-dimensional example. Assume we want to perform a classification, and our data has a categorical target variable with two categories. Also assume that there are two predictor variables with continuous values. If we plot the data points using the value of one predictor on the X axis and the other on the Y axis, we might end up with an image such as the one shown below, where one category of the target variable is represented by rectangles and the other category is represented by ovals.
In this idealized example, the cases with one category are in the lower left corner and the cases with the other category are in the upper right corner; the cases are completely separated.
The SVM analysis attempts to find a 1-dimensional
hyperplane (i.e. a line) that separates the cases based on
their target categories. There are an infinite number of
possible lines; two candidate lines are shown above. The
question is which line is better, and how do we define
the optimal line.
The dashed lines drawn parallel to the separating line
mark the distance between the dividing line and the
closest vectors to the line. The distance between the
dashed lines is called the margin. The vectors (points)
that constrain the width of the margin are the support
vectors. The following figure illustrates this.
An SVM analysis finds the line (or, in general,
hyperplane) that is oriented so that the margin between
the support vectors is maximized. In the figure above,
the line in the right panel is superior to the line in the left
panel.
If all analyses consisted of two-category target variables
with two predictor variables, and the cluster of points
could be divided by a straight line, life would be easy.
Unfortunately, this is not generally the case, so SVM
must deal with
(a) More than two predictor variables,
(b) Separating the points with non-linear curves,
(c) Handling the cases where clusters cannot be
completely separated, and
(d) Handling classifications with more than two
categories.

Figure 3.6: Small margin and large margin (support vectors)
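To make the two-dimensional example concrete, here is a brief Python sketch that fits a linear SVM to two synthetic clusters and inspects the support vectors that constrain the margin; it assumes scikit-learn and NumPy are available and is not the implementation used in the paper.

import numpy as np
from sklearn.svm import SVC

# Two separable clusters of points in 2-D (the "rectangles" and "ovals" of the example above).
rng = np.random.default_rng(0)
lower_left = rng.normal(loc=[1.0, 1.0], scale=0.3, size=(20, 2))
upper_right = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(20, 2))
X = np.vstack([lower_left, upper_right])
y = np.array([0] * 20 + [1] * 20)

# A linear SVM finds the separating line that maximises the margin.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("support vectors:", clf.support_vectors_)   # the points that constrain the margin width
print("prediction for (2, 2):", clf.predict([[2.0, 2.0]]))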


3.5. CONCEPT MINING
It can be argued that humans don't think in words; they think in concepts. It's true that the voice in your head, that by-product of consciousness, uses words, but behind the words is the flow of concepts. Whenever you think rapidly and intuitively the words get left behind; at least, that is the experience of most of us, and we think in a string of concepts.
These mental concepts are not accessible to others; they are our own private language of thought. But because we share the same world and because we share common languages, these concepts are heavily influenced by the outside world. They have order, and they have structure.
Anyone who has used a thesaurus has seen part of this structure. A thesaurus has two elements: a dictionary that acts as a look-up table, and a massive tree of meanings. One uses a thesaurus by looking up a word in the dictionary section, finding the possible meanings as tree locations, and then finding other words in the same location of the tree that have the same meaning. On the other hand, language has many ambiguities, so there will be times when a system selects the wrong concept and has to backtrack when a later part of the text makes it clear that it confused the sound of a word, or one of several meanings attached to it. This re-adjustment is one of the mechanisms of humor.
It is also in this area that the most rapid growth of languages occurs. New words are frequently invented, but we even more rapidly give new meanings to existing words, especially in technical jargon. It is the computational complexity of this that has put off text mining researchers, who have concentrated on words, word frequencies, and word co-occurrences as the raw material of their analyses. A thesaurus is organized using one kind of tree, built on "is a kind of" relationships (known as hyponymy), where concepts further down the tree are examples of objects further up. But there are other relationships that linguists use that add more structure, such as "is a part of" relationships (meronymy) or "is the opposite of" relationships (antonymy).
The key insight of concept mining is that the benefits we get from being able to manipulate these structures are far greater than the deficit we incur from the ambiguity of text. In short, concept mining has the potential to be much more powerful than text mining, and to be a much more fruitful area of research. At Scientio we have put a great deal of work into this area. We built a commercial system several years ago that used concept mining to find similar documents in large corpora in O(log n) time. This is still the fastest algorithm in existence for this task.

We have built up a toolkit of concept mining tools in our product Concept Mine that enables the user to join us in researching new products based on this exciting new technology.
3.6. Document classification
The approach comprises training a Neural Network for Text Representation (NNTR) and a Neural Network for Document Classification (NNDC), followed by the actual classification task.
3.6.1 TRAINING THE NNTR MODEL
The first step is to train a neural net on (document, word) pairs for a training collection of documents and a dictionary R of chosen words. The target for the network is high/low depending on whether the word is in the context of the document. This context for the training set is decided by a domain expert, or by using one of the approaches mentioned earlier. The simplest way is to define context as a containment relation and output high if the word is contained in the document.
This network is based on the Neural Network for Text Representation model explained in [1]. It uses a one-hot encoding scheme for the word vectors and TF-IDF for the document vectors. Further, it uses three multi-layer perceptrons (MLPs): one each for the word and document vectors, and one to combine the outputs of the first two MLPs. The intent is to relate a distributed and rich representation of words to that of the documents, whenever the word lies within the context of the corresponding document.
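As a rough sketch of the architecture just described, and not the exact network of [1], the following PyTorch code wires one MLP for the one-hot word vector, one MLP for the TF-IDF document vector, and a third MLP that combines their outputs into a context score; all layer sizes, the optimizer, and the dummy data are our own assumptions.

import torch
import torch.nn as nn

class NNTR(nn.Module):
    """Three-MLP text-representation network: word MLP + document MLP + combiner.
    Outputs a score in (0, 1): high if the word is in the context of the document."""
    def __init__(self, vocab_size, doc_dim, hidden=64):
        super().__init__()
        self.word_mlp = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU())
        self.doc_mlp = nn.Sequential(nn.Linear(doc_dim, hidden), nn.ReLU())
        self.combiner = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 1))

    def forward(self, word_onehot, doc_tfidf):
        w = self.word_mlp(word_onehot)
        d = self.doc_mlp(doc_tfidf)
        return torch.sigmoid(self.combiner(torch.cat([w, d], dim=-1)))

# Single training step on dummy (document, word) pairs.
model = NNTR(vocab_size=1000, doc_dim=1000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()
word = torch.zeros(8, 1000)
word[torch.arange(8), torch.randint(0, 1000, (8,))] = 1.0   # one-hot word vectors
doc = torch.rand(8, 1000)                                   # stand-in TF-IDF document vectors
target = torch.randint(0, 2, (8, 1)).float()                # high/low context label
loss = loss_fn(model(word, doc), target)
loss.backward()
optimizer.step()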
3.6.2 TRAINING THE NNDC MODEL
Next, we train another neural net, which is used for the actual document classification: the Neural Network for Document Classification (NNDC).


It is the creation of the input vector (both for training and classification) to this network that makes our system unique when compared with other similar implementations, as explained below.
Let V denote the vector instance created by applying TF-IDF to an input document T. Then, as before:

V = {(g(w_i), i) | i ∈ D}, ∀ unique w_i ∈ T.   (3.12)

Now, we find all the semantically relevant words in R for T by invoking NNTR(w_j, T), ∀ w_j ∈ R. We denote the set of all such relevant words by Z and extend D by Z, i.e., D = D ∪ Z. Further, we modify V to reflect all the words in the set Z, with the g(·) value equal to a parameter s that is adjusted during the training phase. Let us denote this modified vector instance as V'. Finally, we train the NNDC on the modified vectors V' over all the test documents.
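A minimal Python sketch of how the enriched vector V' could be assembled from the TF-IDF weights and the NNTR scores is given below; the helper names, the threshold, and the default value of s are hypothetical and only illustrate the construction described above.

# Sketch of constructing the enriched input vector V' for the NNDC, following eq. (3.12).
def build_enriched_vector(doc_tokens, dictionary_R, tfidf_weights, nntr_score,
                          s=0.1, threshold=0.5):
    # V: TF-IDF weight g(w_i) for every unique word of the document T.
    v = {w: tfidf_weights[w] for w in set(doc_tokens)}
    # Z: words of R that NNTR judges to be in the context of the document.
    z = {w for w in dictionary_R if w not in v and nntr_score(w, doc_tokens) > threshold}
    # Extend D by Z, assigning the tunable weight s to each added context word.
    v.update({w: s for w in z})
    return v  # this is V', the input to the NNDC

# Usage with stand-in components (a real system would plug in the trained NNTR).
doc = "neural networks classify text documents".split()
R = ["classification", "deep", "learning", "documents"]
weights = {w: 0.5 for w in doc}
fake_nntr = lambda w, d: 0.9 if w.startswith("class") else 0.1
print(build_enriched_vector(doc, R, weights, fake_nntr))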

Figure 3.11: Similarity-Based Categorization

4. Conclusion and future work


The similarity-based techniques basically utilize the document library. The system uses the core functionality for computing the similarity of documents, extends it to create the NNTR and NNDC networks, and finally hooks everything up in a text document classifier. We are still working on some of the document classification aspects concerning the TF-IDF filter (word vector) and the parser. A brief architectural overview is provided in the accompanying diagrams. Similarity measures are the core of such diverse techniques as similarity-based classification, clustering, and case-based reasoning.
The performance of these techniques depends heavily on the quality of the similarity measure: with a better measure it would be possible to extract more elaborate representations from the documents and train the classifier more effectively to classify according to the context. The clustering is based on expression-level shape rather than magnitude. The shape information is captured by the first-order time difference. However, since the gene expression profiles were obscured by the varying levels manifested in the data, the time difference must be obtained on expression levels with the same scale and dynamic range. Motivated by these observations, the proposed algorithm has three steps. In the first step, the expression data is rescaled. In the second step, the signal shape information is captured by calculating the first-order time difference. In the last step, clustering is performed on the time-difference data using a Variational Bayes Expectation Maximization (VBEM) algorithm. A minimal sketch of these three steps is given below. It also remains to be seen how well this approach would scale to larger and more complicated text document sets.
There are various directions for future work on the similarity-based techniques, which are given below.
A method for document classification based on an enriched version of the TF-IDF model was suggested in this paper.
The key idea was to add some context-specific words to the existing document representation during the training and classification phases.
These context-specific words were derived from the NNTR model explained above.
The overall model is amenable to many enhancements and investigations. It would be interesting to see the effect of training NNTR on larger clusters of words along with documents, or on words paired with subsets of documents (sentences, for example).
It is possible to extract more elaborate representations from the documents and train more effectively to classify according to the context. It also remains to be seen how well this approach would scale to larger and more complicated text document sets.

References
[1] Mikaela Keller and Samy Bengio. A Neural Network for Text Representation. In Artificial Neural Networks: Formal Models and Their Applications - ICANN 2005.
[2] Caropreso, Maria Fernanda, Stan Matwin, and Fabrizio Sebastiani. A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization. In Amita G. Chin, editor, Text Databases and Document Management: Theory and Practice. Idea Group Publishing, Hershey, US, 2001.
[3] Peng, Fuchun, Dale Schuurmans, and Shaojun Wang. Augmenting naive Bayes classifiers with statistical language models. Information Retrieval, 7, 2004.
[4] Hofmann, T. Unsupervised learning by Probabilistic Latent Semantic Analysis. Machine Learning 42 (2001).
[5] Evgeniy Gabrilovich and Shaul Markovitch. Feature Generation for Text Categorization Using World Knowledge.
[6] Weka 3: Data Mining Software in Java. http://www.cs.waikato.ac.nz/ml/weka/
[7] Jucheng Yang, Chonbuk National University. Implementation of Information Retrieval System with Binary Tree.
[8] Peter Scheir and Stefanie N. Lindstaedt, Know-Center, Inffeldgasse 21a, 8010 Graz, Austria. A network model approach to document retrieval taking into account domain knowledge.
[9] Jae-Ho Kim, Jin-Xia Huang, Ha-Yong Jung, and Key-Sun Choi, Korea Advanced Institute of Science and Technology (KAIST) / National Language Resource Research Center (BOLA). Patent Document Retrieval and Classification at KAIST.
[10] Quinlan, J.R. C4.5: Programs for Machine Learning (Morgan Kaufmann Publishers, San Mateo, CA, USA, 1993).
[11] Breiman, L., Friedman, J., Olshen, R. & Stone, C. Classification and Regression Trees (Wadsworth International Group, Belmont, CA, USA, 1984).
[12] Caruana, R. & Niculescu-Mizil, A. An empirical comparison of supervised learning algorithms. In Machine Learning, Proceedings of the Twenty-Third International Conference (eds. Cohen, W.W. & Moore, A.) 161-168 (ACM, New York, 2006).
[13] Zadrozny, B. & Elkan, C. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the 18th International Conference on Machine Learning (eds. Brodley, C.E. & Danyluk, A.P.) 609-616 (Morgan Kaufmann, San Francisco, 2001).
[14] Murthy, S.K., Kasif, S. & Salzberg, S. A system for induction of oblique decision trees. J. Artif. Intell. Res. 2, 1-32 (1994).
[15] MacKay, D.J.C. Information Theory, Inference and Learning Algorithms (Cambridge University Press, Cambridge, UK, 2003).
[16] Quinlan, J.R. & Rivest, R.L. Inferring decision trees using the Minimum Description Length Principle. Inf. Comput. 80, 227-248 (1989).
[17] Breiman, L. Random forests. Mach. Learn. 45, 5-32 (2001).
[18] Heath, D., Kasif, S. & Salzberg, S. Committees of decision trees. In Cognitive Technology: In Search of a Humane Interface (eds. Gorayska, B. & Mey, J.) 305-317 (Elsevier Science, Amsterdam, The Netherlands, 1996).
[19] Schapire, R.E. The boosting approach to machine learning: an overview. In Nonlinear Estimation and Classification (eds. Denison, D.D., Hansen, M.H., Holmes, C.C., Mallick, B. & Yu, B.) 141-171 (Springer, New York, 2003).
[20] Freund, Y. & Mason, L. The alternating decision tree learning algorithm. In Proceedings of the 16th International Conference on Machine Learning (eds. Bratko, I. & Džeroski, S.) 124-133 (Morgan Kaufmann, San Francisco, 1999).
