[Figure: documents classified into example categories such as Academic, Professional, and Culture]
Abstract
This paper presents similarity-based techniques for
text document classification and formally examines the
similarity between documents. Document classification
is the automatic assignment of a set of documents to
predefined classes by a learning system (classifier) that
has been trained on similar sets of training documents.
Within this context, document indexing is the activity of
mapping a document into a form that can be consumed
by a classification system. Several document indexing
models exist, many of which rely on feature extraction,
dimensionality reduction, or both. In feature extraction,
the associated document is typically represented as a
feature vector encoding the presence of words, syntactic
entities, or semantically linked tags, and a term weight is
computed for each such feature.
We introduce document classification and its techniques:
the naive Bayes classifier, TF-IDF, latent semantic
indexing, support vector machines, and concept mining.
Our goal was to develop a text categorization system
that needs fewer training examples to achieve a given
level of performance, using a similarity-based learning
algorithm and threshold strategies. Experimental results
show that the proposed model is quite useful for building
document categorization systems. The approach can be
extended to larger text document sets, and its efficiency
can be evaluated against presently available methods
such as SVM and naive Bayes. On the whole, this
approach concentrates on categorizing small-scale
documents and completing the assigned task.
Keywords: text classification, information retrieval,
language modeling, TF-IDF, support vector machines,
concept mining.
1. Introduction
Document classification/categorization is a problem in
information science. The task is to assign an electronic
document to one or more categories based on its
contents. Within this context, document indexing is the
activity of mapping a document into a form that can be
consumed by a classification system. Several document
indexing models exist, many of which rely on feature
extraction, dimensionality reduction, or both. In feature
extraction, the associated document is typically
represented as a feature vector encoding the presence of
words, syntactic entities, or semantically linked tags, and
a term weight is computed for each such feature.
3. Document categorization
In general, document categorization only means
assigning documents to a fixed set of categories. But in
the domain of text mining, document categorization also
involves the preliminary process of automatically
learning categorization patterns so that the categorization
of new (uncategorized) documents is straightforward.
Major categorization approaches are decision trees,
decision rules, k-nearest neighbors, Bayesian
approaches, neural networks, regression-based methods,
and vector-based methods.
The document classification task can be divided into two
sorts: supervised document classification, in which some
external mechanism (such as human feedback) provides
information on the correct classification, and unsupervised
document classification, which must be done entirely
without reference to external information.
Document classification techniques include
1. Naive Bayes classifier
2. TF-IDF
3. Latent semantic indexing
4. Concept mining
5. SVM (support vector machines)
3.1 Naive Bayes classifier: A naive Bayes classifier
assumes that the presence (or absence) of a particular
feature of a class is unrelated to the presence (or
absence) of any other feature.
For example, a fruit may be considered to be an
apple if it is red, round, and about 4" in diameter.
Even if these features depend on each other or on
the existence of the other features, a naive Bayes
classifier considers all of these properties to
contribute independently to the probability that
this fruit is an apple.
Depending on the precise nature of the probability
model, naive Bayes classifiers can be trained very
efficiently in a supervised learning setting. In many
practical applications, parameter estimation for naive
Bayes models uses the method of maximum likelihood;
in other words, one can work with the naive Bayes
model without believing in Bayesian probability or
using any Bayesian methods.
In spite of their naive design and apparently oversimplified assumptions, naive Bayes classifiers often
work much better in many complex real-world situations
than one might expect. Recently, careful analysis of the
Bayesian classification problem has shown that there are
some theoretical reasons for the apparently unreasonable
efficacy of naive Bayes classifiers.
An advantage of the naive Bayes classifier is that it
requires a small amount of training data to estimate the
parameters (means and variances of the variables)
necessary for classification. Because independent
variables are assumed, only the variances of the
variables for each class need to be determined and not
the entire covariance matrix.
A naive Bayes classifier is a simple probabilistic
classifier based on applying Bayes' theorem (from
Bayesian statistics) with strong (naive) independence
assumptions. A more descriptive term for the underlying
probability model would be "independent feature
model".
3.1.1 The naive Bayes probabilistic model
Abstractly, the probability model for a classifier is a
conditional model

$p(C \mid F_1, \dots, F_n)$  (3.1)

over a dependent class variable C with a small number
of outcomes or classes, conditional on several feature
variables $F_1$ through $F_n$. The problem is that if the
number of features n is large, or when a feature can take
on a large number of values, then basing such a model
on probability tables is infeasible: with n binary features
alone, a full conditional table would require on the order
of $2^n$ entries per class. We therefore reformulate the
model to make it more tractable.
Using Bayes' theorem, we write

$p(C \mid F_1, \dots, F_n) = \dfrac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)}$  (3.2)
In plain English the above equation can be written as

$\text{posterior} = \dfrac{\text{prior} \times \text{likelihood}}{\text{evidence}}$  (3.3)
In practice we are only interested in the numerator of
that fraction, since the denominator does not depend on
C and the values of the features $F_i$ are given, so the
denominator is effectively constant. The numerator is
equivalent to the joint probability model

$p(C, F_1, \dots, F_n)$  (3.4)
which can be rewritten as follows, using repeated
applications of the definition of conditional probability:

$p(C, F_1, \dots, F_n) = p(C)\, p(F_1 \mid C)\, p(F_2 \mid C, F_1) \cdots p(F_n \mid C, F_1, \dots, F_{n-1})$  (3.5)

Now the "naive" conditional independence assumption
comes into play: assume that each feature $F_i$ is
conditionally independent of every other feature $F_j$
(for $j \neq i$) given the class C, so that
$p(F_i \mid C, F_j) = p(F_i \mid C)$. The joint model
then simplifies to

$p(C, F_1, \dots, F_n) = p(C) \prod_{i=1}^{n} p(F_i \mid C)$  (3.6)
This means that under the above independence
assumptions, the conditional distribution over the class
variable C can be expressed as

$p(C \mid F_1, \dots, F_n) = \dfrac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C)$  (3.7)

where $Z = p(F_1, \dots, F_n)$ is a scaling factor
dependent only on $F_1, \dots, F_n$, i.e., a constant if
the values of the feature variables are known.
Models of this form are much more manageable, since
they factor into a so-called class prior $p(C)$ and
independent probability distributions $p(F_i \mid C)$.
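As a small worked instance of (3.7) (the numbers here are
purely illustrative, not taken from any experiment in this
paper): suppose two classes, spam and ham, with prior
$p(\text{spam}) = 0.4$, and two binary features with
$p(F_1 = 1 \mid \text{spam}) = 0.8$, $p(F_2 = 1 \mid \text{spam}) = 0.5$,
$p(F_1 = 1 \mid \text{ham}) = 0.1$, and $p(F_2 = 1 \mid \text{ham}) = 0.4$.
For a document with $F_1 = 1$ and $F_2 = 1$, the unnormalized
scores are $0.4 \times 0.8 \times 0.5 = 0.16$ for spam and
$0.6 \times 0.1 \times 0.4 = 0.024$ for ham, so $Z = 0.184$ and
$p(\text{spam} \mid F_1, F_2) = 0.16 / 0.184 \approx 0.87$.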
If there are k classes and if a model for $p(F_i \mid C)$
can be expressed in terms of r parameters, then the
corresponding naive Bayes model has $(k - 1) + n r k$
parameters. In practice, binary classification ($k = 2$)
with Bernoulli feature variables ($r = 1$) is common, in
which case the total number of parameters of the naive
Bayes model is $2n + 1$, where n is the number of binary
features used for prediction.
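To make the factored model of (3.5)-(3.7) concrete, the
following is a minimal sketch of a Bernoulli naive Bayes text
classifier in Python (not the system described in this paper;
the corpus, vocabulary, and class labels are hypothetical).
It estimates the class prior and the per-class Bernoulli
feature probabilities by maximum likelihood with add-one
smoothing, and classifies by maximizing the numerator of (3.7):

import math

def train(docs, labels, vocab):
    # Estimate the class prior p(C) and the Bernoulli feature
    # probabilities p(F_i = 1 | C) with add-one (Laplace) smoothing.
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    cond = {c: {} for c in classes}
    for c in classes:
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        for w in vocab:
            present = sum(1 for d in class_docs if w in d)
            cond[c][w] = (present + 1) / (len(class_docs) + 2)
    return prior, cond

def classify(doc, prior, cond, vocab):
    # argmax over C of p(C) * prod_i p(F_i | C), i.e. the numerator
    # of (3.7); the constant Z is dropped since it does not affect
    # the argmax. Log-space sums avoid numerical underflow.
    words = set(doc)
    best, best_score = None, float("-inf")
    for c in prior:
        score = math.log(prior[c])
        for w in vocab:
            p = cond[c][w]
            score += math.log(p if w in words else 1.0 - p)
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical toy corpus: each document is its set of words.
docs = [{"cheap", "pills"}, {"meeting", "agenda"},
        {"cheap", "meeting"}, {"pills", "cheap", "buy"}]
labels = ["spam", "ham", "ham", "spam"]
vocab = {"cheap", "pills", "meeting", "agenda", "buy"}

prior, cond = train(docs, labels, vocab)
print(classify({"cheap", "buy"}, prior, cond, vocab))  # -> spam

Note that with n = |vocab| binary features and two classes,
this sketch indeed stores $2n + 1$ independent parameters: one
for the prior and n per class, as derived above.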
3.2 TF-IDF
The TF-IDF weight (term frequency-inverse document
frequency) is a weight often used in information retrieval
and text mining. This weight is a statistical measure used
to evaluate how important a word is to a document in a
collection or corpus.
The importance increases proportionally to the number
of times a word appears in the document but is offset by
the frequency of the word in the corpus. Variations of
the tf-idf weighting scheme are often used by search
engines as a central tool in scoring and ranking a
document's relevance given a user query.
One of the simplest ranking functions is computed by
summing the tf-idf for each query term; many more
sophisticated ranking functions are variants of this
simple model.
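The snippet below is a minimal sketch of this weighting and
ranking scheme in Python (the corpus and query are hypothetical,
and it uses one common choice of tf and idf among several
variants): it computes per-document tf-idf weights and ranks
documents by summing the tf-idf of each query term.

import math
from collections import Counter

def tf_idf(docs):
    # docs: list of token lists. tf = term count / document length;
    # idf = log(N / number of documents containing the term).
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))
    weights = []
    for d in docs:
        counts = Counter(d)
        weights.append({t: (counts[t] / len(d)) * math.log(n / df[t])
                        for t in counts})
    return weights

def rank(query, docs):
    # Simplest ranking function: sum the tf-idf weight of each
    # query term in each document, then sort by score, best first.
    weights = tf_idf(docs)
    scores = [sum(w.get(t, 0.0) for t in query) for w in weights]
    return sorted(range(len(docs)), key=lambda i: scores[i],
                  reverse=True)

corpus = [["text", "mining", "and", "text", "classification"],
          ["information", "retrieval", "systems"],
          ["classification", "of", "documents"]]
print(rank(["text", "classification"], corpus))  # -> [0, 2, 1]

A term that appears in every document gets idf = log(1) = 0,
so ubiquitous words contribute nothing to the score, which is
exactly the offsetting effect described above.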
[Figure 3.6: small margin vs. large margin separating hyperplanes, with support vectors marked]