
Application of an ant colony algorithm for text indexing

Nadia Lachetar (1)
(1) Computer Science Department, University 20 aout 1955 Skikda, Skikda, Algeria
Email: nadia_ishak2002@yahoo.fr

Halima Bahi (2)
(2) LabGED Laboratory, Computer Science Department, University Badji Mokhtar Annaba, Annaba, Algeria
Email: bahi@labged.net


Abstract—Every day, the mass of information available to us increases. This information would be irrelevant if our ability to access it efficiently did not increase as well. For maximum benefit, we need tools that allow us to search, sort, index, store, and analyze the available data. We also need tools that help us find the desired information in a reasonable time by performing certain tasks for us. One of the promising areas is automatic text categorization. Imagine ourselves in the presence of a considerable number of texts, which are more easily accessible if they are organized into categories according to their theme. Of course, one could ask a human to read the texts and classify them manually, but this task is hard if done for hundreds or even thousands of texts. So it seems necessary to have an automated application, which would consist of indexing text databases. In this article, we present our experiments in automated text categorization, where we suggest the use of an ant colony algorithm. A naive Bayes algorithm is used as a baseline in our tests.
Keywords: Information retrieval; Text categorization; Naive Bayes algorithm; Ant colony algorithm.
I. INTRODUCTION
Research in the field of automatic categorization remains relevant today since the results are still subject to improvement. For some tasks, automatic classifiers perform almost as well as humans, but for others the gap is still wide. At first glance, the main problem is easy to grasp: on one hand, we are dealing with a bank of text documents, and on the other with a set of categories. The goal is to build a computer application which can determine to which category a text belongs based on its contents [2].
Despite this simplified definition, the solution is not straightforward and several factors must be considered. First, we need to select an adequate representation of the texts to be treated; this is an essential step in machine learning. We should opt for consistent and sensible attributes to abstract the data before submitting them to an algorithm. Subsequently, we discuss the selection of attributes, which is almost always involved in automated text categorization and eliminates attributes considered unnecessary for classification [2]. Once this preprocessing is completed, we perform classification using both the naive Bayes algorithm [1] and our proposed ant colony algorithm. The remainder of the paper is organized as follows:
In Section II, we present the various aspects of automatic text categorization; in particular, it addresses the main modes of document representation. Then, Section III introduces the naive Bayes algorithm. In Section IV, we present our approach, which is the application of an ant colony algorithm to text categorization. Section V presents the obtained results and a discussion.
II. TEXT CATEGORIZATION
The purpose of automatic text categorization is to teach a machine to classify a text into the correct category based on its content; the categories refer to topics (subjects). We may wish that the same text is associated with only one category, or it may belong to a number of categories. The set of categories is determined in advance. The problem is to group the texts by their similarity. In text categorization, the classification is similar to the problem of extracting the semantics of the texts, since the membership of a text in a category is closely related to the meaning of the text. This is partly what makes the task difficult, since the treatment of the semantics of words written in natural language is not yet a solved problem.
A. How to categorize a text?
The categorization process includes the construction of a prediction model that receives the text as input and associates one or more labels with it as output. To identify the category associated with a text, the following steps are required:

1) Learning includes several steps and leads to a prediction model.
a) We have a set of labeled texts (for every text we know its class).
b) From this corpus, we extract the k descriptors (t_1, ..., t_k) which are most relevant in the sense of the problem to solve.
c) We then have a "descriptors x individuals" table, and for every text we know the values of the descriptors and its label.

2) The classification of a new text d_x includes two stages:


a) Search for and weight the instances t_1, ..., t_k of terms in the text d_x to classify.
b) Apply a learning algorithm to these instances and the previous table to predict the labels of the text d_x [1].
Note that the k most relevant descriptors (t_1, ..., t_k) are extracted during the first phase by analyzing the texts of the training corpus. In the second phase, the classification of a new text, we simply seek the frequency of these k descriptors (t_1, ..., t_k) in the text to be classified.
B. Representation and coding of a text
Prior coding of the text is necessary because there is currently no learning method that can directly handle unstructured data, either in the model construction stage or when used for classification.
For most learning methods, we must convert all texts into an "individuals-variables" table.
An individual is a text d_j; it is labeled during the learning stage and will be classified in the prediction phase.
The variables are descriptors (terms) t_k which are extracted from the data of the text.
The content of cell w_kj represents the weight of term k in document j.
Different methods are proposed for the selection of the descriptors and of the weights associated with these descriptors. Some researchers use words as descriptors, while others prefer to use lemmas (lexical roots) or even stems (deletion of affixes) [1].
C. Approaches for text representation
Learning algorithms are not able to treat texts and, more generally, unstructured data such as images, sounds and video clips. Therefore a preliminary step called representation is required. This step aims to represent each document by a vector whose components are, for instance, the words of the text, in order to make it usable by the learning algorithms. A collection of texts can then be represented by a matrix whose columns are the documents [1].
Many researchers have chosen to use a vector representation in which each text is represented by a vector of n weighted terms. The n terms are simply the n different words occurring in the texts.
1) Choice of terms: In text categorization, we transform the document into a vector d_j = (w_1j, w_2j, ..., w_|T|j), where T is the set of terms (descriptors) that appear at least once in the learning corpus (the collection). The weight w_kj corresponds to the contribution of term t_k to the semantics of text d_j [1].
2) Bag-of-words representation: The simplest representation of a text is a vector model called "bag of words". The idea is to transform the text into a vector in which each component corresponds to a word. Words have the advantage of having an explicit sense. However, several problems arise. We must first define what a "word" is in order to process it automatically. A word can be regarded as a sequence of characters belonging to a dictionary or, more practically, as a sequence of non-delimiter characters framed by delimiter characters. The components of the vector are a function of the occurrence of the words in the text. This representation excludes any grammatical analysis and any notion of distance between words, which is why it is called "bag of words"; other authors speak of a "set of words" when the weights are binary [1].
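As a minimal illustration of this bag-of-words construction (our own sketch, not part of the paper), the following Python lines take a "word" to be a maximal sequence of letters and use raw occurrence counts as vector components:

import re
from collections import Counter

def bag_of_words(text):
    # A "word" is taken here as a maximal sequence of letters (non-delimiter characters)
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    return Counter(tokens)   # component = number of occurrences of each word

print(bag_of_words("The market grows; the market moves."))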
3) Representation of texts by sentences: Despite the simplicity of using words as units of representation, some authors propose to use sentences as units. Sentences are more informative than words, because they preserve information on the position of the word in the sentence: logically, such a representation should get better results than those obtained with words. However, if the semantic qualities are preserved, the statistical qualities are largely degraded [1].
4) Representation of texts by lexical roots and lemmas: In the "bag of words" representation model, each form of a word is considered a different descriptor. For example, the words "movers", "removals", "move", etc. are considered different descriptors although they share the same root "move". Suffix-stripping (or stemming) techniques, which search for the lexical roots, may resolve this difficulty. For the detection of lexical roots, several algorithms have been proposed; the best known for the English language is Porter's algorithm [7]. Lemmatization consists in replacing verbs by their infinitive form and nouns by their singular form. The TreeTagger algorithm was developed for English, French, German and Italian [6].
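As an illustration of suffix stripping, the short sketch below applies Porter's algorithm [7] to a few surface forms. It assumes the NLTK library and its PorterStemmer class, which are not used in the paper; it is only a minimal sketch of the idea.

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["movers", "removals", "moving", "moved"]:
    # Each surface form is reduced toward a shorter stem, so closely related
    # forms are more likely to end up sharing the same descriptor.
    print(word, "->", stemmer.stem(word))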
5) Coding terms: Once we have chosen the components of the vector representing the text j, we must decide how to encode each coordinate of the vector d_j. There are different methods to calculate the weight w_kj. These methods are based on two observations:
a) The more frequently the term t_k appears in a document d_j, the more it is relevant to the subject of this document.
b) The more often the term t_k appears in the whole collection, the less it discriminates between documents.
We note:
#(t_k, d_j): the number of occurrences of term t_k in the text d_j;
|Tr|: the number of documents in the training corpus;
#Tr(t_k): the number of documents of this set in which the term t_k appears at least once.
According to the two previous observations, a term t_k is assigned a weight that is stronger the more frequently it appears in the document. The vector component is coded f(#(t_k, d_j)), where the function f remains to be determined [1].
Two approaches can be used. The first is to take the weight equal to the number of occurrences of the term in the document:
w_kj = #(t_k, d_j)    (1)
The second approach is simply to assign a binary value: 1 if the word appears in the text, 0 otherwise:
w_kj = 1 if #(t_k, d_j) >= 1, and w_kj = 0 otherwise    (2)
6) Coding by term frequency x inverse document frequency: The two functions (1) and (2) above are rarely used because they impoverish the encoded information:
Function (2) does not take into account the frequency of occurrence of the word in the text (which can often be an important clue).
Function (1) does not take into account the frequency of the term in the other texts [1].
The TF x IDF encoding was introduced in the vector model; it gives much importance to words that appear often within the same text, which corresponds to the intuitive idea that these words are more representative of the document. Its particularity is that it also gives less weight to words that belong to several documents, to reflect the fact that these words have little ability to discriminate between classes [2]. The weight of term t_k in document d_j is calculated as:
TFxIDF(t_k, d_j) = #(t_k, d_j) x log( |Tr| / #Tr(t_k) )    (3)
where:
#(t_k, d_j): number of occurrences of term t_k in document d_j;
|Tr|: number of documents in the training corpus;
#Tr(t_k): number of documents of this set in which the term t_k appears at least once.
7) TFC coding: The TF x IDF encoding does not correct for the length of documents. For this purpose, the TFC coding is similar to TF x IDF, but it corrects for the length of the texts by a cosine normalization, in order not to favor long documents [1]:
TFC(t_k, d_j) = TFxIDF(t_k, d_j) / sqrt( sum_{s=1..|T|} TFxIDF(t_s, d_j)^2 )    (4)
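To make the codings of equations (1)-(4) concrete, the following minimal Python sketch computes TF x IDF and TFC weights from token lists; the toy corpus and the helper names are ours, not taken from the paper.

import math
from collections import Counter

def tfidf_weights(doc_tokens, corpus_tokens):
    # doc_tokens: terms of one document d_j; corpus_tokens: list of token lists (the corpus Tr)
    tf = Counter(doc_tokens)                                            # #(t_k, d_j)
    n_docs = len(corpus_tokens)                                         # |Tr|
    df = {t: sum(1 for d in corpus_tokens if t in d) for t in tf}       # #Tr(t_k)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}            # equation (3)

def tfc_weights(doc_tokens, corpus_tokens):
    w = tfidf_weights(doc_tokens, corpus_tokens)
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0             # cosine normalization
    return {t: v / norm for t, v in w.items()}                          # equation (4)

corpus = [["economy", "market", "growth"], ["school", "education"], ["market", "price"]]
print(tfc_weights(corpus[0], corpus))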
III. NAIVE BAYES ALGORITHM
In machine learning, different types of classifiers have been developed to achieve maximum precision and efficiency, each with its advantages and disadvantages, but they share common characteristics [8].
Among the learning algorithms we can cite: naive Bayes, which is the best-known algorithm, the Rocchio method, neural networks, the k nearest neighbors method, decision trees and the support vector method [8].
The naive Bayes classifier is the most commonly used algorithm; it is based on Bayes' theorem for calculating conditional probabilities. In a general context, this theorem provides a way to calculate the probability of a cause knowing the presence of an effect.
When we apply naive Bayes to a text categorization task, we look for the class that maximizes the probability of observing the words of the document.
During the training phase, the classifier estimates the probability that a new document belongs to a category from the proportion of training documents belonging to this category. It also estimates the probability that a given word is present in a text, knowing that this text belongs to this category. Then, when a new document is to be classified, we calculate the probability that it belongs to each class using Bayes' rule and the probabilities calculated in the previous step.
The probability to be estimated is p(c_j | a_1, a_2, a_3, ..., a_n), where c_j is a category and the a_i are attributes.
Using Bayes' theorem, we obtain:
p(c_j | a_1, a_2, ..., a_n) = p(a_1, a_2, ..., a_n | c_j) . p(c_j) / p(a_1, a_2, ..., a_n)    (5)
The naive independence assumption then gives:
p(a_1, a_2, ..., a_n | c_j) = prod_{i=1..n} p(a_i | c_j)    (6)
The naive assumption is that the probability that a word appears in a text is independent of the presence of the other words in the text. This is of course not true in general: for example, the probability of occurrence of the word "artificial" depends partly on the presence of the word "intelligence". However, this assumption does not prevent such a classifier from providing satisfactory results and, more importantly, it greatly reduces the necessary calculations. Without it, we would have to consider all possible combinations of words in a text, which on the one hand involves a large number of calculations, and on the other hand reduces the quality of the statistical estimation, since the frequency of occurrence of each combination would be much lower than the frequency of occurrence of the words alone [1].
To estimate the probability P(a_i | c_j), we could directly compute, among the training documents belonging to class c_j, the proportion of those that contain the word a_i. In the extreme case where a word is never met in a class, its probability of 0 dominates the others in the above product and would cancel the overall probability. To overcome this problem, a good way is to use the m-estimate, calculated as:

P(a_i | c_j) = (n_k + 1) / (n + |Vocabulary|)    (7)
where:
n_k is the number of occurrences of the word a_i in class c_j;
n is the total count of words in the training corpus;
|Vocabulary| is the number of keywords.
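A minimal Python sketch of this classifier is given below. It follows equations (5)-(7), reading n in equation (7) as the word count of the class, as in the usual m-estimate; the toy classes and training documents are invented for illustration and do not come from the paper's corpus.

import math
from collections import Counter, defaultdict

def train(labeled_docs):
    # labeled_docs: list of (tokens, label) pairs
    priors, word_counts, class_totals = Counter(), defaultdict(Counter), Counter()
    vocabulary = set()
    for tokens, label in labeled_docs:
        priors[label] += 1
        word_counts[label].update(tokens)
        class_totals[label] += len(tokens)
        vocabulary.update(tokens)
    return priors, word_counts, class_totals, vocabulary

def classify(tokens, model):
    priors, word_counts, class_totals, vocabulary = model
    n_docs = sum(priors.values())
    best_class, best_score = None, None
    for c in priors:
        # log p(c_j) + sum_i log p(a_i | c_j), with p(a_i | c_j) = (n_k + 1) / (n + |Vocabulary|)
        score = math.log(priors[c] / n_docs)
        for t in tokens:
            score += math.log((word_counts[c][t] + 1) / (class_totals[c] + len(vocabulary)))
        if best_score is None or score > best_score:
            best_class, best_score = c, score
    return best_class

docs = [(["match", "goal", "team"], "Sport"), (["market", "price", "growth"], "Economy")]
print(classify(["goal", "match"], train(docs)))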


IV. APPLICATION OF AN ANT COLONY ALGORITHM FOR TEXT CATEGORIZATION
A. Introduction
The originality of our approach lies in adapting an ant colony algorithm to text categorization.
The ant colony optimization algorithm is inspired by the behavior of ants searching for food. Its principle is based on the behavior of individual ants: they are able to determine the shortest path between their nest and a food source using pheromone, a substance that ants lay on the ground when they move. When an ant has to choose between two directions, it chooses the one carrying more pheromone with higher probability [9].
Strongly inspired by the movement of groups of ants, this method aims to build the best solutions from the elements that have been explored by others. When an individual discovers a solution, good or bad, it enriches the collective knowledge of the colony. Thus, whenever a new individual has to make choices, it can rely on this collective knowledge to assess them.
The ant colony algorithm reformulates the problem to solve as the search for a best path in a graph and uses artificial ants to find good paths in this graph.
At each cycle of the algorithm, each ant of the colony builds a path in the graph at random, and an amount of pheromone is deposited on the best paths found during this cycle. In subsequent cycles, the ants build new paths with a probability depending on the pheromone deposited during previous cycles and on a heuristic of the considered problem. The ant colony thus converges gradually toward the best solutions [9]. A small generic sketch of these two mechanisms is given below.
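The following minimal Python sketch illustrates the two generic mechanisms just described: a probabilistic choice biased by pheromone levels, and a pheromone update that evaporates old trails and reinforces a path according to the quality of its solution. The graph, bounds and parameters are illustrative only and are not taken from the paper.

import random

pheromone = {("nest", "A"): 1.0, ("nest", "B"): 1.0}   # initial trails on two edges
MIN_PH, MAX_PH, EVAPORATION = 0.1, 5.0, 0.9

def choose(edges):
    # Pick an edge with probability proportional to its pheromone level
    total = sum(pheromone[e] for e in edges)
    r = random.uniform(0, total)
    for e in edges:
        r -= pheromone[e]
        if r <= 0:
            return e
    return edges[-1]

def reinforce(path, quality):
    # Evaporate all trails, then deposit pheromone on the path, more if the solution is good
    for e in pheromone:
        pheromone[e] = max(MIN_PH, pheromone[e] * EVAPORATION)
    for e in path:
        pheromone[e] = min(MAX_PH, pheromone[e] + quality)

edge = choose([("nest", "A"), ("nest", "B")])
reinforce([edge], quality=1.0)   # a shorter (better) path would receive a larger deposit
print(edge, pheromone)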
B. Principles of the algorithm
The algorithm relies on the specific behavior of ants: determining the shortest route between the nest and a food source drives the overall progress of the algorithm.
1) Iterations and ant movement: An iteration corresponds to one movement of the ants. To get from one node of the graph to another, each ant needs a number of iterations depending on the length of the edge to traverse. This mode of iteration also emphasizes the shortest path, as the ants need fewer iterations to reach the end.
2) Life of an ant: Each ant must know the list of nodes it has visited and the nodes still to visit. In addition, it must measure the time it spends exploring the solution. At each node, the ant considers the possible edges, observing their corresponding pheromone levels. It then chooses randomly, favoring strongly pheromone-marked arcs. Once at its destination, the ant knows the total length of the solution; it can then retrace the path in reverse to mark it with its pheromone, which increases the collective knowledge of the colony. We define three states for the ants:

- just created and looking for its first node;
- looking for a solution, and therefore already committed to the graph;
- moving back toward the starting point and marking the path with pheromone.
3) The deposition of pheromone: Pheromone is a substance that ants lay on the ground when they move. The pheromone-deposition heuristic can significantly change the convergence of the algorithm. From a naive point of view, we can simply deposit the same amount of pheromone on each path. Ants engaged on long paths will have deposited less pheromone because they can try fewer paths; conversely, ants engaged on the shortest paths will soon try other paths, so more pheromone will be found on the shortest paths than on the others. We may also use other methods of depositing pheromone; an interesting idea is to deposit more pheromone the better the solution is.
C. For text categorization
For the construction of the graph, the nodes represent documents. The pheromone is a measure of similarity between documents, which may be the distance between these documents. The choice of the distance is an important parameter.
D. Calculating the distance between the document to classify and the documents constituting the graph
For our approach we use the cosine similarity between two documents a and b, defined by:
sim(a, b) = sum_{t in T} p_t(a) . p_t(b) / ( sqrt(sum_{t in T} p_t(a)^2) x sqrt(sum_{t in T} p_t(b)^2) )    (8)
where:
T is the set of attributes;
p_t(a) is the weight of term t in document a;
p_t(b) is the weight of term t in document b.
This measure allows comparing texts of different lengths by normalizing their vectors, and it focuses on the presence or absence of words (the presence of words is probably more representative of the class of a text than the absence of words).
We use the cosine similarity between each document a of the graph of documents and the input document b to be classified.
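A minimal Python sketch of equation (8) is shown below; documents are represented as {term: weight} dictionaries (for instance the TFC weights of Section II), and the example vectors are invented for illustration.

import math

def cosine_similarity(pa, pb):
    # pa, pb: term -> weight mappings for documents a and b
    dot = sum(pa[t] * pb[t] for t in set(pa) & set(pb))
    norm_a = math.sqrt(sum(v * v for v in pa.values()))
    norm_b = math.sqrt(sum(v * v for v in pb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

a = {"economy": 0.8, "market": 0.5}
b = {"market": 0.7, "price": 0.4}
print(cosine_similarity(a, b))   # documents sharing more weighted terms score closer to 1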
The following algorithm (Figure 1) computes the cosine similarity, based on the relevant attributes, between the input document and each node of the graph of documents. It takes as input the graph of documents and the document to classify, and returns as output a similarity matrix based on the relevant attributes.
















Algorithme_Cosine_Similarity
Input: doc_Graph, doc_class   // graph of documents, document to classify
Output: Mat_Sim               // similarity matrix based on the relevant attributes
begin
  Mat_Sim <- 0;
  For each node of doc_Graph
    /* extract the set of attributes of the graph node */
    Sim = Calcul_Sim(node, doc_class);
    Mat_Sim = Mat_Sim + Sim(node, doc_class);
  Return Mat_Sim
End.

Figure 1. Cosine similarity algorithm
E. Ant colony optimization
To find the text category, we adopt the ant colony optimization (ACO) algorithm proposed in [5]. Although the ant colony algorithm was originally designed for the traveling salesman problem, it offers great flexibility. Our choice is motivated by the flexibility of this metaheuristic, which makes its application possible to different problems that are commonly NP-hard. Moreover, the use of a parallel model (colonies of ants) reduces the computing time and improves the quality of the solutions for categorization.
Formalization of the problem: In our context, the problem of classifying a text reduces to a subset selection problem [5], which we can formalize as a pair (S, f) such that:
S contains all the cosine similarities calculated between the documents of the graph and the text to classify; it is the "similarity matrix" mat_sim.
f is defined by the score function, which is defined in [5] by the formula:
score(S_i) = (doc_graph U doc_class) - g(splits(S_i))
where splits(S_i) is the set of nodes of the graph which are most similar to the document to classify. The result is thus a consistent subset S' of nodes such that the score function is maximized.
F. Description of the algorithm
At each cycle of the algorithm, each ant constructs a subset. Starting from an empty subset, at each iteration the ant adds a pair of nodes from the similarity matrix to S_k, chosen among all the pairs not yet selected. The pair of nodes to add to S_k is chosen with a probability which depends on the pheromone trail and on a heuristic: one aims to favor the pairs that have the greatest similarity, the other to favor the pairs that most increase the score function. Once each ant has built its subset, a local search procedure starts to improve the quality of the best subset found during this cycle. Pheromone trails are subsequently updated based on the improved subset. Ants stop their construction when all pairs of candidate nodes decrease the score of the subset, or when the three latest additions have failed to increase the score.
Construction of a solution by an ant: The pseudocode in Figure 2 describes the procedure followed by the ants to construct a subset. The first object is selected randomly; the following items are selected among the candidates.




















Procedure Construction-subset
Input: a subset selection problem (S, f) and an associated heuristic function: S x P(S) -> IR+;
       a pheromone strategy and a pheromone factor.
Output: a subset of objects
Initialize pheromone trails to max
begin
Repeat
  For each ant k in 1..nbAnts, construct a solution S_k as follows:
    1. Randomly select the first node o_i
    2. S_k <- {o_i}
    3. Candidates <- {o_j in S such that S_k U {o_j} is consistent}
    4. While Candidates is not empty do
    5.   Choose a node o_i from Candidates with probability p(o_i)
    6.   p(o_i) = sum_{t in T} p_t(a) . p_t(b) / ( sqrt(sum_{t in T} p_t(a)^2) x sqrt(sum_{t in T} p_t(b)^2) )
         // where T is the set of attributes,
         // p_t(a) is the weight of term t in the document of the graph node,
         // p_t(b) is the weight of term t in the document to be classified
    7.   S_k <- S_k U {o_i}
    8.   Remove o_i from Candidates
    9.   Remove from Candidates each node o_j such that S_k U {o_j} is not consistent
    10. End while
  End for
  Update pheromone trails according to {S_1, ..., S_nbAnts}
  If a pheromone trail is less than min then set it to min
  Else if a pheromone trail is greater than max then set it to max
Until the maximum number of cycles is reached or a solution is found.

Figure 2. Construction of a solution by an ant
V. RESULTS AND DISCUSSION
To evaluate the performance of our proposal, we ran experiments using two corpora, one for training and the other for testing. We also use the naive Bayes classifier as the baseline.
TABLE I. CLASSES OF THE CORPUS

Classes   | # of documents in training stage | # of documents in test stage
Economy   | 29                               | 18
Education | 10                               | 9
Religion  | 19                               | 17
Sociology | 30                               | 14
Sport     | 4                                | 2

The results of the classification stage are reported below for the ant colony algorithm and the naive Bayes algorithm.
TABLE II. RESULTS OF TESTS WITH THE ANT COLONY ALGORITHM

Class  | Eco. | Educ. | Relig. | Socio. | Sport | Total
Eco.   | 17   | 0     | 0      | 1      | 0     | 18
Educ.  | 0    | 7     | 1      | 1      | 0     | 9
Relig. | 0    | 1     | 16     | 0      | 0     | 17
Socio. | 0    | 3     | 8      | 3      | 0     | 14
Sport  | 0    | 0     | 0      | 0      | 2     | 2
TABLE III. RESULTS OF TESTS WITH THE NAIVE BAYES ALGORITHM

Class  | Eco. | Educ. | Relig. | Socio. | Sport | Total
Eco.   | 14   | 2     | 0      | 2      | 0     | 18
Educ.  | 0    | 7     | 1      | 1      | 0     | 9
Relig. | 0    | 2     | 14     | 0      | 0     | 17
Socio. | 0    | 4     | 8      | 2      | 0     | 14
Sport  | 0    | 0     | 0      | 0      | 2     | 2

Precision and recall are the most used measurements to evaluate information retrieval systems; they are defined as follows:

TABLE IV. CONTINGENCY TABLE BASED EVALUATION OF THE CLASSIFIERS

                                                    | Document belonging to the category | Document not belonging to the category
Document assigned to the class by the classifier   | a                                  | b
Document rejected from the class by the classifier | c                                  | d

According to this table, we define:
Precision = a/(a+b), the number of correct assignments over the total number of assignments.
Recall = a/(a+c), the number of correct assignments over the number of assignments that should have been made.
When evaluating the performance of a classifier, precision and recall are not considered separately. So the F1 measure, which is used extensively, is defined by the formula F1 = 2*r*p/(p + r), where r is the recall and p is the precision. It is a function which is maximized when the recall and precision are close.
Table V and Table VI present the performances of the ant colony and naive Bayes classifiers in terms of recall, precision and F1.

TABLE V. RECALL, PRECISION, F1 FOR EACH CLASS (ANT COLONY)

Class     | Recall (%) | Precision (%) | F1
Economy   | 94.44      | 100           | 97.14
Education | 77.77      | 63.63         | 69.99
Religion  | 94.11      | 64            | 76.18
Sociology | 21.42      | 60            | 31.56
Sport     | 100        | 100           | 100

TABLE VI. RECALL, PRECISION, F1 FOR EACH CLASS (NAIVE BAYES)

Class     | Recall (%) | Precision (%) | F1
Economy   | 77.77      | 100           | 87.49
Education | 77.77      | 46.66         | 58.32
Religion  | 82.35      | 60.86         | 69.99
Sociology | 14.28      | 40            | 21.04
Sport     | 100        | 100           | 100
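The F1 values reported in Tables V and VI can be reproduced from the definitions above; the minimal Python sketch below does so for the Economy row of the ant colony classifier (a = 17, b = 0, c = 1, read from Table II).

def evaluate(a, b, c):
    # a: correct assignments, b: wrong assignments, c: missed assignments (see Table IV)
    precision = a / (a + b)
    recall = a / (a + c)
    f1 = 2 * recall * precision / (precision + recall)
    return precision, recall, f1

print(evaluate(a=17, b=0, c=1))   # -> (1.0, 0.944..., 0.971...), i.e. 100, 94.44, 97.14 percent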
Figure 3. Classification rates (F1) for each category and for both classifiers
[Figure 3 is a histogram of the F1 values of the ant colony and naive Bayes classifiers for the classes economy, education, religion, sociology and sport.]

While the results of several classes seem to be acceptable, the results of the Sociology class are dramatic; this is due to the small size of the learning corpus.
The histogram shows that the suggested ant colony algorithm outperforms the naive Bayes algorithm in terms of recall and precision. This is not a surprise, since the graphical representation of the problem handles the relationships between similar documents better than naive Bayes.

REFERENCES
[1] Jalame R. Machine learning and multilingual text classification, Lumière Lyon 2 University, June 2003.
[2] Rhel S. Automatic text categorization and co-occurrence of words from unlabeled documents, submission to the Graduate Faculty, Laval University, Quebec, January 2005.
[3] Hacid H. and Zighed D. An effective method for locally neighborhood graphs updating, pp. 930-939, in DEXA 2005.
[4] Valette M. Application of classification algorithms for automatic detection of racist content on the Internet, June 2003.
[5] Solnon C. Contributions to practical combinatorial problem solving: graphs and ants, habilitation thesis, University Claude Bernard Lyon 1, December 2005.
[6] Schmid H. Probabilistic part-of-speech tagging using decision trees, in Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK, 1994.
[7] Porter M. An algorithm for suffix stripping, Program 14(3), pp. 130-137, 1980.
[8] Sebastiani F. Automated text categorization: tools, techniques and applications, Rennes, France, April 3, 2002.