Abstract
Arabic information retrieval has become a focus of research and commercial development due to the vital necessity of such tools for people in the electronic age. The number of Arabic-speaking Internet users is expected to reach 43 million during this year1; however, few full search engines are available to Arabic-speaking users. This dissertation focuses on three naturally related areas of research: information retrieval, document clustering, and dimensionality reduction.
In information retrieval, we propose an Arabic information retrieval system based on light stemming in the pre-processing phase, and on the Okapi BM-25 weighting scheme and the latent semantic analysis model in the processing phase. This system was designed after performing and analyzing many experiments dealing with Arabic natural language processing and with the different weighting schemes found in the literature. Moreover, it has been compared with another proposed system based on noun phrase indexing.
In clustering, we propose to use the diffusion map space based on the cosine kernel and the singular value decomposition (which we denote the cosine diffusion map space) for clustering documents. We illustrate experimentally, using the k-means clustering algorithm, the robustness of document indexing in this space compared to Salton's space. We discuss the problems of determining the reduced dimension associated with the singular value decomposition method and of choosing the number of clusters, and we provide some solutions for these issues. We provide some statistical results and discuss how the k-means algorithm performs better in the latent semantic analysis model space than in the cosine diffusion map space in the case of two clusters, but not in the multi-cluster case. We also propose a new approach for on-line clustering, based on the cosine diffusion map and the SVD-updating method.
Concerning dimensionality reduction, we use the singular value decomposition technique for feature transformation, and we propose to supplement this reduction with a generic term extracting algorithm for feature selection in the context of information retrieval.
Dedication
Acknowledgements
Table of Contents
List of Tables ................................................................................................................ V
List of Figures ............................................................................................................ VII
List of Abbreviations ...................................................................................................IX
Chapter 1 Introduction ...................................................................................................1
1. 1. Research Contributions..........................................................................................2
1. 2. Thesis Layout & Brief Overview of Chapters .......................................................3
Chapter 2 Literature Review..........................................................................................5
2. 1. Introduction............................................................................................................5
2. 2. Document Retrieval ...............................................................................................5
2.2.1. DOCUMENT RETRIEVAL MODELS .................................................................................... 5
2.2.1.1. Set-theoretic Models..................................................................................................... 6
2.2.1.2. Algebraic Models ......................................................................................................... 7
2.2.1.3. Probabilistic Models..................................................................................................... 7
2.2.1.4. Hybrid Models.............................................................................................................. 8
2.2.2. INTRODUCTION TO VECTOR SPACE MODELS .................................................................. 8
2.5.4.1. Arabic Morphology .................................................................................................... 24
2.5.4.2. Word-form Structures................................................................................................. 25
2.5.5. ANOMALIES ................................................................................................................... 27
2.5.5.1. Agglutination.............................................................................................................. 27
2.5.5.2. The Vowelless Nature of the Arabic Language.......................................................... 27
2.5.6. EARLY WORK ................................................................................................................ 28
2.5.6.1. Full-form-based IR ..................................................................................................... 28
2.5.6.2. Morphology-based IR................................................................................................. 29
2.5.6.3. Statistical Stemmers ................................................................................................... 30
2. 7. Summary ..............................................................................................................33
Chapter 3 Latent Semantic Model ...............................................................................34
3. 1. Introduction..........................................................................................................34
3. 2. Model Description ...............................................................................................34
3.2.1. TERM-DOCUMENT REPRESENTATION............................................................................ 35
3.2.2. WEIGHTING .................................................................................................................... 35
3.2.3. COMPUTING THE SVD ................................................................................................... 39
3.2.4. QUERY PROJECTION AND MATCHING ............................................................................ 41
3. 4. Summary ..............................................................................................................48
Chapter 4 Document Clustering based on Diffusion Map...........................................49
4. 1. Introduction..........................................................................................................49
4. 2. Construction of the Diffusion Map ......................................................................49
4.2.1. DIFFUSION SPACE .......................................................................................................... 49
4.2.2. DIFFUSION KERNELS...................................................................................................... 51
4.2.3. DIMENSIONALITY REDUCTION ...................................................................................... 51
4.2.3.1. Singular Value Decomposition................................................................................... 52
4.2.3.2. SVD-Updating............................................................................................................ 54
4. 3. Clustering Algorithms..........................................................................................56
4.3.1. K-MEANS ALGORITHM ................................................................................................... 56
4.3.2. SINGLE-PASS CLUSTERING ALGORITHM ....................................................................... 57
4.3.3. THE OSPDM ALGORITHM ............................................................................................. 58
4. 5. Summary ..............................................................................................................81
Chapter 5 Term Selection ............................................................................................83
5. 1. Introduction..........................................................................................................83
5. 2. Generic Terms Definition ....................................................................................83
5. 3. Generic Terms Extraction ....................................................................................83
5.3.1. SPHERICAL K-MEANS ..................................................................................................... 87
5.3.2. GENERIC TERM EXTRACTING ALGORITHM ................................................................... 87
6. 4. Summary ............................................................................................................111
Chapter 7 Conclusion and Future Work ....................................................................113
7. 1. Conclusion .........................................................................................................113
7. 2. Limitations .........................................................................................................113
7. 3. Prospects ............................................................................................................114
Appendix A Natural Language Processing................................................................115
A.1. Introduction........................................................................................................115
A.2. Basic Techniques ...............................................................................................115
A.2.1. N-GRAMS .................................................................................................................... 115
A.2.2. TOKENIZATION............................................................................................................ 115
A.2.3. TRANSLITERATION ...................................................................................................... 116
A.2.4. STEMMING .................................................................................................................. 117
A.2.5. STOP WORDS............................................................................................................... 118
List of Tables
Table 2.1. Arabic letters...............................................................................................22
Table 2.2. Different shapes of the letter غ gh (Ghayn). ........................22
Table 2.3. Ambiguity caused by the absence of vowels in the words ktb and
mdrsp. ..................................................................................................23
Table 2.4. Some templates generated from roots with examples from the root (كتب ktb). ........24
Table 2.5. Derivations from a borrowed word. ...........................................................25
Table 3.1. Comparison between Different Versions of the Standard Query Method. .42
Table 3.2. Size of collections........................................................................................43
Table 3.3. Result of weighting schemes in increasing order for Cisi corpus. .............44
Table 3.4. Result of weighting schemes in increasing order for Cran corpus.............45
Table 3.5. Result of weighting schemes in increasing order for Med corpus..............45
Table 3.6. Result of weighting schemes in increasing order for Cisi-Med corpus......46
Table 3.7. The best reduced dimension for each weighting scheme in the case of four
corpuses. ..............................................................................................................47
Table 4.1. Performance of different embedding representations using k-means for the
set Cisi and Med...................................................................................................61
Table 4.2. The process running time for the cosine and the Gaussian kernels. ..........61
Table 4.3. Performance of k-means in cosine diffusion, Salton and LSA spaces for the
set Cisi and Med...................................................................................................64
Table 4.4. Measure of the difference between the approximated and the histogram
distributions. ........................................................................................................66
Table 4.5. Performances of different embedding representations using k-means for the
set Cran, Cisi and Med. .......................................................................................67
Table 4.6. Performance of k-means in cosine diffusion, Salton, and LSA spaces for the
set Cran, Cisi and Med. .......................................................................................68
Table 4.7. Measure of the difference between the approximated and the histogram
distributions. ........................................................................................................70
Table 4.8. Performance of different embedding cosine diffusion and LSA
representations using k-means for the set Cran, Cisi, Med and Reuters_1.........72
Table 4.9. Performance of k-means in Cosine diffusion, Salton and LSA spaces for the
set Cran, Cisi, Med and Reuters_1. .....................................................................72
Table 4.10. The confusion matrix for the set Cran-Cisi-Med-Reuters_1 ....................73
Table 4.11. The confusion matrix for the set S in the 2-dimensional cosine diffusion space. ........74
Table 4.12. The resultant confusion matrix. ................................................................74
Table 4.13. Mutual information of different embedding cosine diffusion
representations using k-means to exclude the cluster C2 from the set Cran, Cisi,
Med and Reuters_1. .............................................................................................75
Table 4.14. Performance of different embedded cosine diffusion representations using
k-means for the set S. ...........................................................................................75
Table 4.15. The confusion matrix for the set Cran-Cisi-Med-Reuters_1 clustered into
4 clusters in the 4-dimensional cosine diffusion space. ..........................75
Table 4.16. Performance of different embedding cosine diffusion and LSA
representations using k-means for the set Cran, Cisi, Med and Reuters_2.........76
Table 4.17. Performance of k-means in Cosine diffusion, Salton and LSA spaces for
the set Cran, Cisi, Med and Reuters_2. ...............................................................77
Table 4.18. Performance of different embedding cosine diffusion and LSA
representations using k-means for Reuters..........................................................77
Table 4.19. Performance of k-means in Cosine diffusion, Salton and LSA spaces for
Reuters. ................................................................................................................77
Table 4.20. The statistical results for the performance of k-means algorithm in cosine
diffusion and LSA spaces. ....................................................................................80
Table 4.21. Performances of the single-pass clustering. .............................................81
Table 5.1. Index size in the native and Noun phrase spaces........................................90
Table 5.2. The MIAP measure for the collection Cisi in different indexes..................90
Table 5.3. The MIAP measure for the collection Cran in different indexes. ...............91
Table 5.4. The MIAP measure for the collection Med in different indexes. ................91
Table 5.5. LSA performance in the native and Noun phrase spaces. ..........................92
Table 6.1. [AR-ENV] Corpus Statistics. ......................................................................96
Table 6.2. An example illustrating the typical approach to query term selection. ......96
Table 6.3. Token-to-type ratios for fragments of different lengths, from various
corpora.................................................................................................................98
Table A.1. Buckwalter Transliteration. .....................................................................117
Table A.2. Prefixes and suffixes list...........................................................................118
Table B.1. List of term weighting components. .........................................................123
List of Figures
Figure 2.1. A taxonomy of clustering approaches. ......................................................13
Figure 3.1. A pictorial representation of the SVD. The shaded areas of U and V, as
well as the diagonal line in S, represent Ak, the reduced representation of the
original term-document matrix A.........................................................................40
Figure 3.2. The interpolated recall-precision curves of the LSA and the VSM models.
..............................................................................................................................48
Figure 4.1. Average cosine of the principal angles between 64 concept subspace and
various singular subspaces for the CLASSIC data set.........................................53
Figure 4.2. Average cosine of the principal angles between 64 concept subspace and
various singular subspaces for the NSF data set.................................................53
Figure 4.3. Representation of our data set in various diffusion spaces.......................60
Figure 4.4. Representation of our data set in Cosine and Gaussian diffusion spaces
for various t time iterations..................................................................................63
Figure 4.5. Representation of the first 100 singular values of the cosine diffusion map
on the set Cisi and Med........................................................................................64
Figure 4.6. Representation of the first 100 singular values of the Cisi and Med term-document matrix. ........65
Figure 4.7. Histogram representation of the cluster C1 documents. ...........................66
Figure 4.8. Histogram representation of the cluster C2 documents. ...........................66
Figure 4.9. Representation of the first 100 singular values of the cosine diffusion map
on the cluster C1. .................................................................................................67
Figure 4.10. Representation of the first 100 singular values of the cosine diffusion
map on the cluster C2 ..........................................................................................67
Figure 4.11. Representation of the first 100 singular values of the cosine diffusion
space on the set Cran, Cisi and Med. ..................................................................68
Figure 4.12. Representation of the first 100 singular values of the Cran, Cisi and Med
term-document matrix..........................................................................................68
Figure 4.13. Histogram representation of the cluster C1 documents. .........................69
Figure 4.14. Histogram representation of the cluster C2 documents. .........................69
Figure 4.15. Histogram representation of the cluster C3 documents. .........................70
Figure 4.16. Representation of the first 100 singular values of the cosine diffusion
map on cluster C1. ...............................................................................................70
Figure 4.17. Representation of the first 100 singular values of the cosine diffusion
map on cluster C2. ...............................................................................................71
Figure 4.18. Representation of the first 100 singular values of the cosine diffusion
map on cluster C3. ...............................................................................................71
Figure 4.19. Representation of the first 100 singular values of the cosine diffusion
map on the set Cran, Cisi, Med and Reuters_1. ..................................................72
Figure 4.20. Representation of the first clusters of the hierarchical clustering. .........73
Figure 4.21. Representation of the first 100 singular values of the cosine diffusion map
on the data set S ...................................................................................................73
Figure 4.22. Representation of the Set S clusters.........................................................74
Figure 4.23. Representation of the first 100 singular values of the cosine diffusion
map on the set Cran, Cisi, Med and Reuters_2. ..................................................76
Figure 4.24. Representation of the first 100 singular values of the cosine diffusion
map on Reuters. ...................................................................................................77
Figure 4.25. The LSA and Diffusion Map processes....................................................79
Figure 5.1. Top-Level Flowchart of GTE Algorithm. ..................................................89
Figure 6.1. Zipf law and word frequency versus rank in the [AR-ENV] collection. ..98
Figure 6.2. Token-to-type ratios (TTR) for the [AR-ENV] collection..........................99
Figure 6.3. A standardized information retrieval system...........................................100
Figure 6.4. An information retrieval system for Arabic language.............................101
Figure 6.5. Comparison between the performances of the LSA model for five
weighting schemes. ............................................................................................104
Figure 6.6. Language processing benefit...................................................................105
Figure 6.7. A new information retrieval system suggested for Arabic language.......106
Figure 6.8. A comparison between the performances of the VMS and the LSA models.
............................................................................................................................107
Figure 6.9. Weighting queries impact.......................................................................108
Figure 6.10. Arabic Information Retrieval System based on NP Extraction. ............109
Figure 6.11. Influence of the NP and the singles terms indexations on the IRS
performance. ......................................................................................................110
Figure C.1. The computation of Recall and Precision...............................................124
Figure C.2. The Precision Recall trade-off................................................................125
Figure C.3. Interpolated Recall Precision Curve. .....................................................127
List of Abbreviations
Acc: Accuracy
AFN: Affinity Set
AFP: Agence France Presse
AIR: Arabic Information Retrieval
AIRS: Arabic Information Retrieval System
AP: Average Precision
BNS: Bi-Normal Separation
CCA: Corpus of Contemporary Arabic
CHI: χ²-test
CQ: Characteristic Quotient
DF: Document Frequency
DM: Diffusion Map
ELRA: European Language Resources Association
GPLVM: Gaussian Process Latent Variable Model
GTE: Generic Term Extracting
HPSG: Head-driven Phrase Structure Grammar
ICA: Independent Component Analysis
ICA: International Corpus of Arabic
ICE: International Corpus of English
IG: Information Gain
IR: Information Retrieval
IRP: Interpolated Recall-Precision
IRS: Information Retrieval System
ISOMAPS: ISOmetric MAPS
LLE: Locally Linear Embedding
LSA: Latent Semantic Analysis
LTSA: Local Tangent Space Alignment
MDS: Multidimensional Scaling
MI: Mutual Information
MIAP: Mean Interpolated Average Precision
NLP: Natural Language Processing
nonrel: non-relevant
NP: Noun Phrase
OSPDM: On-line Single-Pass Clustering based on Diffusion Map
P2P: Peer-To-Peer
PCA: Principal Component Analysis
POS: Part Of Speech
Pr: Probability
R&D: Research and Development
rel: relevant
RSV: Retrieval Status Value
SOM: Self-Organizing Maps
SVD: Singular Value Decomposition
SVM: Support Vector Machine
TREC: Text REtrieval Conference
TS: Term Strength
TTR: Token-to-Type Ratio
TDT: Topic Detection and Tracking
VSM: Vector-Space Model
Chapter 1 Introduction
The advent of the World Wide Web has increased the importance of information retrieval. Instead of going to the local library to look for information, people search the Web. Thus, the relative number of manual versus computer-assisted searches for information has shifted dramatically in the past few years. This has accentuated the need for automated information retrieval over extremely large document collections, in order to help in reading, understanding, indexing and tracking the available literature. For this reason, researchers in document retrieval, computational linguistics and textual data mining are working on the development of methods to process these data and present them in a usable and suitable format for many written languages, of which Arabic is one.
Known as the second2 most widely spoken language in the world, Arabic has seen an important increase in its number of Internet users: the figure was about 4.4 million in 2002 [ACS04] and 16 million in 2004, while research commissioned from the Dubai-based Internet research firm Madar shows that it could jump to 43 million in 20083. However, at present relatively few standard Arabic search engines are known, and according to Hermann Havermann (managing director of the German Internet technology firm Seekport, and a founding member of the Arabic search engine project SAWAFI), the available ones are not considered full Arabic engines. As reported in a Reuters news article4, Havermann confirmed that "There is no [full] Arabic internet search engine on the market. You find so-called search engines, but they involve a directory search, not a local search."
The fact that any improved access to Arabic text will have profound implications for cross-cultural communication, economic development, and international security encourages us to take a particular interest in this language.
The limited body of research in Arabic document retrieval over the past 20 years, beginning with the Arabization of the MINISIS system [Alg87] and followed by the development of the Micro-AIRS system [Alka91], has been dominated by the use of statistical methods to automatically match natural language user queries against records. There has been interest in using natural language processing to enhance term matching by means of roots, stems, and n-grams, as highlighted in the Text REtrieval Conference TREC-2001 [GeO01]. However, up to 2005, the effect of stemming upon stopwords had not been studied; the Latent Semantic Analysis (LSA) model, developed in the early 1990s [DDF90] and known for its high capacity to resolve the synonymy and polysemy problems, had not been utilized; nor had indexing by phrases been used.
2 http://encarta.msn.com/media_701500404/Languages_Spoken_by_More_Than_10_Million_People.html, Microsoft.
Arabic search engine may boost content, by Andrew Hammond, Reuters, April 26, 2006. Retrieved on 10-05-2007.
We are motivated by the fact that the LSA model, by attempting to discover hidden structure and implicit meanings, may meet the challenge posed by the wide use of synonyms in Arabic. The employment of several weighting schemes that take into account term importance within both documents and queries, together with the use of Arabic natural language processing based on spelling mutation, stemming, stopword removal and noun phrase extraction, makes the study more interesting.
The first objective of our study is the improvement of the similarity score computed between documents and a query for Arabic documents; however, the study has been extended to consider other aspects. Many studies have shown that clustering is an important tool in information retrieval for constructing a taxonomy of a document collection by forming groups of closely related documents [FrB92, FaO95, HeP96, Leu01]. Based on the Cluster Hypothesis, "closely associated documents tend to be relevant to the same requests" [Van79], clustering is used to accelerate query processing by considering only a small number of cluster representatives rather than the entire corpus. We likewise expect that reducing the corpus dimension by using feature selection methods may help a user find relevant information more quickly. Thus, we have been interested in developing new clustering methods for both the off-line and on-line cases, and in extending the generic term extraction method to reduce the storage capacity required for the retrieval task.
1. 1. Research Contributions
With the objective of improving the performance and reducing the complexity of document retrieval systems, ten major contributions are proposed in this thesis:
- Studying the weighting schemes found in the current text retrieval literature to discover the best one when the latent semantic model is used.
- Utilizing the diffusion map for off-line document clustering, and improving its performance by using the cosine distance.
- Comparing the performance of the k-means algorithm in the Salton, LSA and cosine diffusion spaces.
- Proposing two postulates indicating the appropriate reduced dimension to use for clustering, and the optimal number of clusters.
- Developing a new method for on-line clustering, based on the diffusion map and the SVD-updating method.
- Analyzing the benefit of extracting generic terms in decreasing the data storage capacity required for document retrieval.
- Creating an Arabic retrieval test collection, whose documents belong to a scientific field specialized in the environment, and whose queries are structured into two categories in order to examine the performance difference between short (2- or 3-word) queries and long (sentence) queries.
- Applying the latent semantic model to the Arabic language in an attempt to meet the challenge of the wide use of synonyms in this language.
- Analyzing the influence of weighting schemes on the use of some Arabic language processing techniques.
- Studying the effect of representing Arabic document content by noun phrases on the improvement of the proposed automatic document retrieval system based on the two previous contributions.
1. 2. Thesis Layout & Brief Overview of Chapters
effectiveness of different index terms on these collections.
Chapter 7 summarizes the research and concludes with its major achievements and possible
directions that could be considered for future research.
Appendix A presents all natural language processing used and mentioned in this work.
Appendix B reviews the weighting schemes notations.
Appendix C outlines the evaluation metrics commonly used in retrieval and clustering evaluation tasks, more specifically those used in this thesis.
Appendix D recalls the quantities known as principal angles, used to measure the closeness of
subspaces.
2. 2. Document Retrieval
The problem of finding relevant information is not new. Early systems tried to classify knowledge into a set of known, fixed categories. The first of these was completed in 1668 by the English philosopher John Wilkins [Sub92]. The problem with this approach is that categorizers commonly do not place documents into the categories where searchers expect to find them: no matter what categories the categorizer thinks of, they will not match the ones a searcher has in mind. For example, users of e-mail systems place mails in folders or categories, only to spend countless hours trying to find the same documents because they cannot remember what category they used, or because the category they are sure they used does not contain the relevant document. Effective and efficient search techniques are needed to help users quickly find the information they are looking for. Another approach is to try to understand the content of the documents, ideally by loading them into the computer for reading and understanding before users ask any questions; this involves the use of a document retrieval system.
The elementary definition of document retrieval is the matching of some stated user query against
useful parts of free-text records. These records could be any type of mainly unstructured text, such as
bibliographic records, newspaper articles, or paragraphs in a manual. User queries could range from
multi-sentence full descriptions of an information need to a few words. However, this definition is not
informative enough, because a document can be relevant even though it does not use the same words as
those provided in the query. The user is not generally interested in retrieving documents with exactly the
same words, but with the concepts that those words represent. To this end, many models are discussed.
create an interest in accelerating research to produce more effective search methodologies, including
more use of natural language processing techniques.
A great variety of document retrieval models is described in the information retrieval literature. From a mathematical point of view, the techniques currently in use can be classified into four types: Boolean or set-theoretic, vector or algebraic, probabilistic, and hybrid models.
A model is characterized by four parameters:
- representations for the documents;
- representations for the queries;
- a framework for modeling these representations;
- a ranking function that orders the documents with respect to a query.
In the following paragraphs, we describe instances of each type in the context of these model parameters.
The strict Boolean and fuzzy-set models are preferable to other models in terms of computational
requirements, which are low in terms of both the disk space required for storing document
representations and the algorithmic complexity of indexing and computing query-document similarities.
Literature Review
This model includes: Binary independence retrieval [RoS76], Uncertain inference [CLR98],
Language models [PoC98], Divergence from randomness models [AmR02].
To the best of our knowledge, the most recent model used for the Arabic language before our work [BoA05], in which the latent semantic model is utilized, was the standard vector space model [SaM83]. For this reason, we have been interested in the algebraic models, more particularly those based on vectors, as the starting point of our study.
In the vector space model, each document is represented as a vector d = (t_1, t_2, ..., t_m), where term t_i (1 ≤ i ≤ m) is a non-negative value denoting the number of occurrences of t_i in the document. Thus, each unique term in the document collection corresponds to a dimension in the space. Similarly, a query is represented as a vector q = (t'_1, t'_2, ..., t'_m), where term t'_i (1 ≤ i ≤ m) is a non-negative value denoting the number of occurrences of t'_i (or, merely a 1 to signify the occurrence of term t'_i) in the query [BeC87]. Both the document vectors and the query vector provide the locations of the objects in the term-document space. By computing the distance between the query and other objects in the space, objects with similar semantic content to the query will presumably be retrieved.
Vector-space models that do not attempt to collapse the dimensions of the space treat each term independently, essentially mimicking an inverted index [FrB92]. However, vector-space models are more flexible than inverted indices, since each term can be individually weighted, allowing it to become more or less important within a document or within the document collection as a whole. Also, by applying different similarity measures to compare queries to terms and documents, properties of the document collection can be emphasized or de-emphasized.
For example, the dot product similarity measure M(q, d) = q · d measures the closeness of the query and a document in the space, where the operation · is the inner product

q · d = Σ_{i=1}^{m} q_i d_i .

The inner product, or dot product, favors long documents over short ones, since they contain more terms and hence their product increases.
On the other hand, by computing the angle between the query and a document rather than the distance, the cosine similarity measure

cos(q, d) = (q · d) / (‖q‖ ‖d‖)

compensates for document length, where X · Y is the inner product defined above, and ‖X‖ is the Euclidean length of the vector X:

‖X‖ = ( Σ_{i=1}^{m} x_i² )^{1/2} .

In some cases, the directions of the vectors are a more reliable indication of the semantic similarities of the objects than the distances between the objects in the term-document space [FrB92].
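To make the two measures above concrete, here is a minimal NumPy sketch; the toy vocabulary, counts, and query are invented for illustration and are not taken from the thesis's test collections.

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
# A[i, j] = number of occurrences of term i in document j.
vocab = ["house", "home", "ownership", "page", "web"]
A = np.array([[2, 0, 1, 0],   # house
              [0, 2, 1, 2],   # home
              [1, 1, 1, 0],   # ownership
              [0, 0, 0, 2],   # page
              [0, 0, 0, 1]],  # web
             dtype=float)

# Query "home ownership" over the same vocabulary.
q = np.array([0, 1, 1, 0, 0], dtype=float)

# Dot-product similarity M(q, d) = q . d: favors long documents.
dot_scores = A.T @ q

# Cosine similarity: the same inner product, normalized by Euclidean lengths.
cos_scores = dot_scores / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

print("dot:", dot_scores)                   # raw inner products
print("cos:", np.round(cos_scores, 3))      # length-normalized scores
print("ranking:", np.argsort(-cos_scores))  # best document first
```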
Vector-space models, by placing documents and queries in a term-document space and computing
similarities between the queries and the documents, allow the results of a query to be ranked according
to the similarity measure used. Unlike lexical matching techniques that provide no ranking or a very
crude ranking scheme (for example, ranking one document before another document because it contains
more occurrences of the search terms), the vector-space models, by basing their rankings on the
Euclidean distance or the angle measure between the query and documents in the space, are able to
automatically guide the user to documents that might be more conceptually similar and of greater use
than other documents.
Vector-space models, specifically the latent semantic model, were developed to eliminate many of
the problems associated with exact, lexical matching techniques. In particular, since words often have
multiple meanings (polysemy), it is difficult for a lexical matching technique to differentiate between
two documents that share a given word, but use it differently, without understanding the context in
which the word was used. Also, since there are many ways to describe a given concept (synonymy),
related documents may not use the same terminology to describe their shared concepts. A query using
the terminology of one document will not retrieve the other related documents. In the worst case, a
query using terminology different from that used by related documents in the collection may not retrieve
any documents using lexical matching, even though the collection contains related documents [BeC87].
For example, suppose a text collection contains documents on house ownership and on web home pages, with some documents using only the word "house", some using only the word "home", and some using both words. For a query on "home ownership", traditional lexical matching methods fail to retrieve documents using only the word "house", which are obviously related to the query; for the same query, lexical matching methods will also retrieve irrelevant documents about web home pages.
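To illustrate how a rank-reduced representation can bridge exactly this house/home mismatch, the following sketch applies a truncated SVD, the core operation of the LSA model presented in Chapter 3, to an invented toy collection; the projection of the query by q̂ = qᵀ U_k Σ_k⁻¹ is the standard LSA query folding, and k = 2 is an arbitrary illustrative choice.

```python
import numpy as np

# Toy collection: doc 0 uses "house" only, doc 1 uses "home" only,
# doc 2 uses both, doc 3 is about web home pages.
vocab = ["house", "home", "ownership", "page", "web"]
A = np.array([[2, 0, 1, 0],
              [0, 2, 1, 2],
              [1, 1, 1, 0],
              [0, 0, 0, 2],
              [0, 0, 0, 1]], dtype=float)

# Truncated SVD: A ~ U_k S_k V_k^T with k latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk = U[:, :k], s[:k]

# Fold the query "home ownership" into the latent space.
q = np.array([0, 1, 1, 0, 0], dtype=float)
q_hat = (q @ Uk) / sk

# Document coordinates in the latent space: rows of V_k S_k.
docs_hat = Vt[:k, :].T * sk

# Cosine similarity in the reduced space: the "house"-only document now
# scores close to the query, which exact lexical matching would miss.
cos = docs_hat @ q_hat / (np.linalg.norm(docs_hat, axis=1) * np.linalg.norm(q_hat))
print(np.round(cos, 3))
```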
2. 3. Document Clustering
Document clustering has been studied in the field of document retrieval for several decades. With the aim of reducing search time, the first approaches were attempted by Salton [Sal68], Litofsky [Lit69], Crouch [Cro72], Van Rijsbergen [Van72], Prywes & Smith [PrS72], and Fritzche [Fri73]. Based on these studies, Van Rijsbergen specifies in his book [Van79] that, when choosing a cluster method to use in experimental document retrieval, two often conflicting criteria are frequently applied.
The first criterion, and from his point of view the most important, is the theoretical soundness of the method, meaning that the method should satisfy certain criteria of adequacy. Below, we list some of the most important of these criteria:
1) The method produces a clustering which is unlikely to be altered drastically when further objects
are incorporated, i.e. it is stable under growth;
2) The method is stable in the sense that small errors in the description of the objects lead to small
changes in the clustering;
3) The method is independent of the initial ordering of the objects.
These conditions have been adapted from Jardine and Sibson [JaS71]. The point is that any cluster
method which does not satisfy these conditions is unlikely to produce any meaningful experimental
results.
The second criterion for choice, considered as the overriding consideration in the majority of
document retrieval experimental works, is the efficiency of the clustering process in terms of speed and
storage requirements. Efficiency is really a property of the algorithm implementing the cluster method.
It is sometimes useful to distinguish the cluster method from its algorithm, but in the context of
document retrieval this distinction becomes slightly less useful, since many cluster methods are defined
by their algorithm, so no explicit mathematical formulation exists.
The current information explosion, fueled by the availability of hypermedia and the World-Wide
Web, has led to the generation of an ever-increasing volume of data, posing a growing challenge for
information retrieval systems to efficiently store and retrieve this information [WMB94]. A major issue
that document databases are now facing is the extremely high rate of update. Several practitioners have
complained that existing clustering algorithms are not suitable for maintaining clusters in such a
dynamic environment, and they have been struggling with the problem of updating clusters without
frequently performing complete re-clustering [CaD90, Can93, Cha94]. To overcome this problem, online clustering approaches have been proposed.
In the following, we explain the clustering procedure in the context of document retrieval, we survey a taxonomy of clustering methods focusing on the categories needed here, and we give an overview of some recent studies in both classical and on-line clustering, after specifying the definition of clustering by comparing this approach to other classification approaches.
2.3.1. Definition
In supervised classification, or discriminant analysis, a collection of labeled (pre-classified) patterns
is provided; the problem is to label a newly encountered, yet unlabeled, pattern. Typically, the given
labeled (training) patterns are used to learn the descriptions of classes which in turn are used to label a
new pattern. In the case of clustering (unsupervised classification), the problem is to group a given
collection of unlabeled patterns into meaningful clusters. In a sense, labels are associated with clusters
also, but these category labels are data driven; that is, they are obtained solely from the data.
Iterative methods that are more efficient and proceed directly from the document vectors.
b- Iterative Methods
This class consists of methods that operate in less than quadratic time (that is, O(n log n) or O(n²/log n)) on the average [FaO95]. These methods are based directly on the item (document) descriptions and do not require the similarity matrix to be computed in advance. The price for the increased efficiency is the sacrifice of theoretical soundness; the final classification depends on the order in which the documents are processed, or else on the existence of a set of seed points around which the classes are to be constructed.
Although some experimental evidence exists indicating that iterative methods can be effective for
information retrieval purposes [Dat71], specifically in on-line clustering [KWX01, KlJ04, KJR06], most
researchers prefer to work with the theoretically more attractive hierarchical grouping methods, while
attempting, at the same time, to save computation time. This can be done in various ways by applying
the expensive clustering process to a subset of the documents only and then assigning the remaining unclustered items to the resulting classes; or by using only a subset of the properties for clustering
purposes instead of the full keyword vectors; or finally by utilizing an initial classification and applying
the hierarchical grouping process within each of the initial classes only [Did73, Cro77, Van79].
- Agglomerative vs. divisive [JaD88, KaR90]: An agglomerative clustering (bottom-up) starts with one-point (singleton) clusters and recursively merges the most appropriate clusters. A divisive clustering (top-down) starts with one cluster of all data points and recursively splits the most appropriate cluster. The process continues until a stopping criterion (frequently, the requested number k of clusters) is achieved.
- Monothetic vs. polythetic [Bec59]: A monothetic class is defined in terms of characteristics that are both necessary and sufficient in order to identify members of that class; this way of defining a class is also termed the Aristotelian definition of a class [Van79]. A polythetic class is defined in terms of a broad set of criteria that are neither necessary nor sufficient: each member of the category must possess a certain minimal number of defining characteristics, but none of the features has to be found in every member of the category. This way of defining classes is associated with Wittgenstein's concept of "family resemblances" [Van79]. In short, a monothetic type is one in which all members are identical on all characteristics, whereas a polythetic type is one in which all members are similar, but not identical.
- Hard vs. fuzzy: A hard clustering algorithm allocates each pattern to a single cluster during its operation and in its output. A fuzzy clustering method assigns to each input pattern degrees of membership in several clusters that do not have hierarchical relations with each other. A fuzzy clustering can be converted to a hard clustering by assigning each pattern to the cluster with the largest measure of membership (a minimal sketch follows this list).
- Deterministic vs. stochastic: This issue is most relevant to partitional approaches designed to optimize a squared error function. This optimization can be accomplished using traditional techniques or through a random search of the state space consisting of all possible labelings.
- Incremental vs. non-incremental: This issue arises when the pattern set to be clustered is large, and constraints on execution time or memory space affect the architecture of the algorithm. The early history of clustering methodology does not contain many examples of clustering algorithms designed to work with large data sets, but the advent of data mining has fostered the development of clustering algorithms that minimize the number of scans through the pattern set, reduce the number of patterns examined during execution, or reduce the size of the data structures used in the algorithm's operations [JMF99].
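As promised in the hard-vs-fuzzy item above, here is a minimal NumPy sketch of hardening a fuzzy clustering; the membership values are invented for illustration.

```python
import numpy as np

# Fuzzy memberships: U[i, c] = degree of membership of pattern i in cluster c.
U = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.4, 0.4, 0.2]])   # an ambiguous pattern; argmax breaks the tie

# Hard clustering: assign each pattern to its largest-membership cluster.
hard_labels = U.argmax(axis=1)
print(hard_labels)  # -> [0 1 0]
```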
(see Section 4.3.1 for more details concerning this algorithm). Several variants of the k-means algorithm have been reported in the literature. One of them will be studied in Chapter 5.
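For reference, here is a minimal NumPy sketch of the standard k-means iteration (assignment step, then centroid update); it is only an illustration and does not reproduce the initialization or convergence settings used in the experiments of this thesis.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means. X: (n_samples, n_features). Returns labels, centroids."""
    rng = np.random.default_rng(seed)
    # Initialize centroids on k distinct randomly chosen samples.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array(
            [X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
             for c in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```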
iterative and therefore their time requirements are also small.
2. 4. Dimensionality Reduction
As storage technologies evolve, the amount of available data explodes in both directions: the number of samples and the dimension of the input space. Therefore, one needs dimension reduction techniques to explore and analyze such huge data sets. In high-dimensional data, many dimensions are often irrelevant; these irrelevant dimensions can confuse analysis algorithms by hiding useful information in noisy data. Moreover, as the number of dimensions in a dataset increases, distance measures become increasingly meaningless: additional dimensions spread out the points until, in very high dimensions, they are almost equidistant from each other.
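This concentration of distances is easy to verify empirically; in the following sketch (sample size and dimensions are arbitrary choices), the relative spread of pairwise distances shrinks as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
for d in (2, 10, 100, 1000):
    X = rng.random((n, d))                         # n random points in [0, 1]^d
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared pairwise distances
    dists = np.sqrt(np.maximum(d2[np.triu_indices(n, k=1)], 0.0))
    # (max - min) / min shrinks with d: points become almost equidistant.
    print(d, round(float((dists.max() - dists.min()) / dists.min()), 3))
```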
Various dimensionality reduction methods have been proposed, including both term transformation and term selection techniques. Feature transformation techniques attempt to generate a reduced set of synthetic terms by creating combinations of the original terms. These techniques are very successful in uncovering latent structure in datasets. However, since they preserve the relative distances between documents, they are less effective when there are large numbers of irrelevant terms that hide the differences between sets of similar documents in a sea of noise. In addition, seeing that the synthetic terms are combinations of the originals, it may be very difficult to interpret them in the context of the domain. Term selection methods, on the other hand, have the advantage of selecting the most relevant dimensions from a dataset, and of revealing groups of documents that are similar within a subset of their terms.
Non-linear methods are themselves subdivided into two groups: those providing a mapping and those giving a visualization. The non-linear mapping methods include techniques such as kernel PCA [SSM99] and Gaussian process latent variable models (GPLVM) [Law03]. The non-linear visualization methods, which are based on proximity data (that is, on distance measurements), include Locally Linear Embedding (LLE) [RoS00], Hessian LLE [DoG03], Laplacian Eigenmaps [BeN03], Multidimensional Scaling (MDS) [BoG97], Isometric Maps (ISOMAPS) [TSL00], and Local Tangent Space Alignment (LTSA) [ZhZ02].
The transformations generally preserve the original, relative distances between documents. Term transformation is often a preprocessing step, allowing the analysis algorithm to use just a few of the newly created synthetic terms. A few algorithms have incorporated such transformations to identify important terms and iteratively improve their performance [HiK99, DHZ02]. While often very useful, these techniques do not actually remove any of the original terms from consideration. Thus, information from irrelevant dimensions is preserved, making these techniques ineffective at revealing sets of similar documents when there are large numbers of irrelevant terms that mask the sets. Another disadvantage of using combinations of terms is that they are difficult to interpret, often making the algorithm results less useful. Because of this, term transformations are best suited to datasets where most of the dimensions are relevant, while many are highly correlated or redundant.
branch and bound method.
Koller and Sahami [KoS96] examined a method for feature subset selection based on Information Theory: they presented a theoretically justified model for optimal feature selection based on using cross-entropy to minimize the amount of predictive information lost during feature elimination.
Jain and Zongker [JaZ97] considered various feature subset selection algorithms and found that the
sequential forward floating selection algorithm, proposed by Pudil et al. [PNK94], dominated the other
algorithms tested.
Dash and Liu [DaL97] gave a survey of feature selection methods for classification.
In a comparative study of feature selection methods in statistical learning of text categorization (with a focus on aggressive dimensionality reduction), Yang and Pedersen [YaP97] evaluated document frequency (DF), information gain (IG), mutual information (MI), the χ²-test (CHI) and term strength (TS), and found IG and CHI to be the most effective.
Blum and Langley [BlL97] focused on two key issues: the problem of selecting relevant features and
the problem of selecting relevant examples.
Kohavi and John [KoJ97] introduced wrappers for feature subset selection. Their approach searches
for an optimal feature subset tailored to a particular learning algorithm and a particular training set.
Yang and Honavar [YaH98] used a genetic algorithm for feature subset selection.
Liu and Motoda [LiM98] wrote a book on feature selection that offers an overview of the methods developed since the 1970s and provides a general framework for examining and categorizing these methods.
Vesanto and Ahola [VeA99] proposed to detect correlations visually using a self-organizing map (SOM) based approach.
Makarenkov and Legendre [MaL01] tried to approximate an ultra-metric in Euclidean space or to preserve the set of the k-nearest neighbors.
Weston et al. [WMC01] introduced a method of feature selection for SVMs based upon finding those features which minimize bounds on the leave-one-out error. The method was shown to be superior to some standard feature selection algorithms on the data sets tested.
Xing et al. [XJK01] successfully applied feature selection methods (using a hybrid of filter and
wrapper approaches) to a classification problem in molecular biology involving only 72 data points in a
7130 dimensional space. They also investigated regularization methods as an alternative to feature
selection, and showed that feature selection methods were preferable in the problem they tackled.
Mitra et al. [MMP02] use a similarity measure that corresponds to the lowest eigenvalue of the correlation matrix between two features.
See Miller [Mil02] for a book on subset selection in regression.
Forman [For03] presented an empirical comparison of twelve feature selection methods. Results
revealed the surprising performance of a new feature selection metric, Bi-Normal Separation (BNS).
Dhillon et al. [DKN03] present two term selection techniques, the first based on the term variance
quality measure, while the second is based on co-occurrence of similar terms in the same context.
Guyon and Elisseeff [GuE03] gave an introduction to variable and feature selection. They recommend using a linear predictor of your choice (e.g. a linear SVM) and selecting variables in two alternative ways: (1) with a nested subset selection method performing forward or backward selection, or with multiplicative updates; (2) with a variable ranking method using a correlation coefficient or mutual information.
Guérif et al. [GBJ05] used an idea similar to Vesanto and Ahola's work [VeA99] and integrated a weighting mechanism into the SOM training algorithm to reduce the redundancy side effects.
More recently, some approaches have been proposed to address the difficult issue of eliminating irrelevant features in the unsupervised learning context [Bla06, GuB06]. These approaches use partition quality measures such as the Davies-Bouldin index [DaB79, GuB06], the Wemmert and Gancarski index, or the entropy [Bla06]. In addition, Guérif and Bennani [GuB07] extended the w-k-means algorithm proposed by Huang et al. [HNR05] to the SOM framework, basing their feature selection approach on the weighting coefficients learned during the optimization process.
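As a concrete instance of the simplest criterion surveyed above, document frequency (DF) thresholding evaluated in [YaP97], the following sketch drops terms occurring in fewer than a given number of documents; the matrix and threshold are invented for illustration.

```python
import numpy as np

def df_select(A, min_df=2):
    """Keep terms (rows of A) that occur in at least min_df documents.

    A: (n_terms, n_docs) count matrix.
    Returns the reduced matrix and the indices of the retained terms.
    """
    df = (A > 0).sum(axis=1)           # document frequency of each term
    keep = np.flatnonzero(df >= min_df)
    return A[keep, :], keep

# Example: terms occurring in a single document are removed.
A = np.array([[3, 0, 0],
              [1, 2, 0],
              [0, 1, 1],
              [0, 0, 1]])
A_reduced, kept = df_select(A, min_df=2)
print(kept)  # -> [1 2]
```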
2. 5. Studied Languages
2.5.1. English Language
English is a West Germanic language originating in England. It is the second5 most widely spoken language in the world, and is used extensively as a second language and as an official language throughout the world, especially in Commonwealth countries and in many international organizations. English is the dominant international language in communication, science, business, aviation, entertainment, radio and diplomacy. The influence of the British Empire is the primary reason for the initial spread of the language far beyond the British Isles. Following World War II, the growing economic and cultural influence of the United States significantly accelerated the spread of the language.
Hence, many studies have been devoted to this language, and it possesses a very rich base of freely available corpora, which helped us evaluate a number of our studies.
5 http://encarta.msn.com/media_701500404/Languages_Spoken_by_More_Than_10_Million_People.html, Microsoft.
www.asharqalawsat.com.
Literature Review
Table 2.2. The Arabic alphabet: each letter (Alif, Baa, Taa, Thaa, Jiim, Haa, Kha, Daal, Thaal, Raa, Zaayn, Siin, Shiin, Saad, Daad, Taa, Thaa, Ayn, Ghayn, Faa, Qaaf, Kaaf, Laam, Miim, Nuun, Haa, Waaw, Yaa) with its corresponding Latin transliteration and pronunciation, and its isolated, beginning, middle, and end written forms.
In Arabic, there are three long vowels, aa, ii and uu, represented by the letters A (alif) [a:], y (yaa) [i:], and w (waaw) [u:] respectively. Diacritics, called respectively fatha [a], kasra [i], damma [u], fathatayn [an], kasratayn [in], dammatayn [un], sukuun (no vowel), and shaddah, are placed above or below a consonant to mark short vowels and gemination or tashdeed (consonant doubling), and to distinguish between words having the same representation. For example, if we consider the English words some, sum and same, non-diacritization (vowellessness) would reduce these words to sm. For Arabic examples see Table 2.3 (refer to Table A.1 for the mapping between the Arabic letters and their Buckwalter transliteration). Diacritics appear in the Holy Quran, and with less consistency in other religious texts, classical poetry, and textbooks for children and foreign learners; otherwise they appear only occasionally, in complex texts, to avoid ambiguity. In everyday writing they are absent, and the reader recognizes the words as a result of experience as well as the context.
Word      1st interpretation     2nd interpretation          3rd interpretation
ktb       kataba (wrote)         kutiba (has been written)   kutubN (books)
mdrsp     madorasapN (school)    mudar~isapN (teacher)       mudarsapN (taught)
Table 2.3. Ambiguity caused by the absence of vowels in the words ktb and mdrsp.
In addition to singular and plural constructs, Arabic has a form called the dual that indicates precisely two of something. For example, a pen is qalam, two pens are qalamayn, and pens are >aqlaam. As in French, Spanish, and many other languages, Arabic nouns are either feminine or masculine, and the verbs and adjectives that refer to them must agree in gender. In written Arabic, case endings are used to designate parts of speech (subject, object, prepositional phrase, etc.), in a similar fashion to Latin and German.
English imposes a large number of constraints on word order. Arabic, however, is distinguished by its high syntactic flexibility. This flexibility includes: the omission of some prepositional phrases associated with verbs; the possibility of using several prepositions with the same verb while preserving the meaning; allowing more than one matching case between the verb and the verbal subject, and between the adjective and its broken plural; and the sharpness of pronominalization phenomena, where the pronouns usually indicate the original positions of words before their extra-positioning, fronting and omission. In other words, Arabic allows a great deal of freedom in the ordering of words in a sentence. Thus, the syntax of the sentence can vary according to transformational mechanisms such as extra-position, fronting and omission, or according to syntactic replacement such as an agent noun in place of a verb.
Template     Example      Meaning
CaCaCa       kataba       wrote
CaACiC       kaAtib       writer
CaCuwC       katuwb       skilled writer
CiCaAC       kitaAb       book
CuCay~iC     kutay~ib     handbook
maCOCaC      makOtab      desk
maCOCaCap    makOtabap    library
CaCaACiyC    kataAtiyb    Quranic schools
C stands for a letter that is part of the root. An underlined C stands for a letter that is doubled. a, i, u designate vowels, and m represents a derivation consonant.
Table 2.4. Some templates generated from roots, with examples from the root ktb (the writing notion).
Word         Template     Meaning
>akOsada     faEOlala     oxidize
mu&akOsad    mufaEOlal    oxidized
>aksadah     faEOlalah    oxidation
ta>aksud     tafaEOlul    oxidation
Indeed, many words in Arabic do start with one of these consonants. It is true that many Arabic words are composed of three consonants, but this is not always the case. It might be possible to identify those prefixes by comparing prefixed words to a huge database of lexical forms in order to define which words contain prefixes and which do not. The outcome of this process, however, is not clear at all.
Moreover, the above-mentioned prefixes also occur in combination, which means that in practice two or three prefixes can be attached to a word. The three most frequently used combinations of prefixes are: (1) a combination of a connective and a preposition (for instance: wa-bi, written wb, meaning 'and with'); (2) a combination of a preposition and the article (for instance: bi-Al, written bAl, meaning 'with the'); and (3) a combination of three particles, most commonly a connective, a preposition and the article (for instance: wa-bi-Al, written wbAl, meaning 'and with the').
Suffixes
In Arabic, suffixes are sets of letters, articles, and pronouns attached to the end of the word and written as part of it. Seventeen of them are used as possessive suffixes. Besides, there is the suffix A (alif), which is used as an undefined accusative, and there is the suffix of the energetic, n (nna). The possessive suffixes consist of one or two consonants. Obviously, a one-consonant suffix is more difficult to identify than a two-consonant one. Moreover, there will always remain combinations which are ambiguous. The suffix hA of the third person singular feminine, for instance, can easily be mixed up with the undefined accusative A (alif) of a word ending with the consonant h.
The representation below shows a possible word structure. Note that an Arabic word is written and read from right to left.
Postfix | Suffix | Scheme | Prefix | Antefix
The prefixes and suffixes express grammatical features and indicate functions: noun case, verb mood, and modalities (number, gender, person, ...).
For example, a word may be decomposed as follows:
Scheme: atata*akr.
Suffix: wn, a verbal suffix.
Postfix: naA, a pronoun suffix.
2.5.5. Anomalies
As is generally known, the Arabic language is complicated for natural language processing because
of the combination of two main language characteristics. The first is the agglutinative nature of the
language and the second is the aspect of the vowellessness of the language which causes problems of
ambiguity at different levels, and complicates the identification of words.
2.5.5.1. Agglutination
The first problem is the identification of words in sentences. As in most European languages, Arabic
words can, to a certain degree, be identified in computer terms as a string of characters between blanks.
Two blanks in a text serve as a marker for the separation of strings of characters, but those strings of
characters do not always coincide with words. Some Arabic grammatical categories which are
considered words in other languages appear to be affixes. Those affixes are directly linked to the words
in Arabic (as is explicated in Section 2.5.4.2), which means that a string of characters between two
blanks can contain more than one word so that multiword combinations are found which are not
separated by blanks.
The string is ambiguous, as the affixes could be an attached particle or a part of the word. Thus a form such as flky can be read as falaky, meaning 'astronomer'; or falikay, which means 'then for'; or falikayyi, which means 'then for ironing or burning'. Note that there is no deterministic way to tell whether the first letter is part of the word or a prefix.
2.5.5.2. The Vowelless Nature of the Arabic Language
The second problem in Arabic is the vowellessness of the words in sentences. This causes problems not only for the previously mentioned multiword combinations, but also at the word level. The vowellessness affects the meaning of words. As an example, we take the string of characters consisting of the consonants k (kaf) and l (lam). A reader could identify these two consonants as the noun kul~ (kullo), which means 'all', or the verb kul (the form of the verb >akala for the second person singular masculine), which means 'eat'. Another example of such ambiguity is the string mdAfE, which, depending on its vocalization, could mean mudaAfiE, 'defender', or madaAfiE, 'cannons'.
The vowellessness also affects the grammatical labeling of words, which is especially the case for verbs. The different persons of a verb form, both in the present and past tenses, are in most cases only identifiable by means of vowels, which are omitted. The verb form ktbt, for example, can refer to four possible persons: katabOtu for the first person singular, katabOta for the second person singular masculine, katabOti for the second person singular feminine, and katabatO for the third person singular feminine. It is almost impossible for a computer program to determine the subject of these verbs. Only the context can help in defining the correct person of a verb form. In this respect some help might be expected from a minimal form of text categorization. Indeed, in newspaper text the first person singular is less likely to occur, whereas in literature this person might occur more abundantly. Nevertheless, it seems quite difficult to tag texts automatically when they are not vocalized or when the larger context cannot be taken into account.
2.5.6.2. Morphology-based IR
The efforts that have been made in the academic environment to evaluate more sophisticated systems
give an idea about the next generation of the Arabic search engines. Evaluation has been performed on
systems using multiple approaches of incorporating morphology. Different proposed classifications of
Arabic morphological analysis techniques, found in literature, are reviewed in the work of Al-Sughaiyer
and Al-Kharashi [AlA04a]. However, in this work we adopt the Larkey et al. classification [LBC02],
where they proposed classifying Arabic stemmers into four different classes, namely, manually
constructed dictionaries, algorithmic light stemmers which remove prefixes and suffixes, morphological
analyses which attempt to find roots, and statistical stemmers.
Constructed Dictionaries: Manually constructed dictionaries of words with stemming information
are in surprisingly wide use. Al-Kharashi and Evens worked with small text collections, for which they
manually built dictionaries of roots and stems for each word to be indexed [AlE94]. Tim Buckwalter developed a set of lexicons of Arabic stems, prefixes, and suffixes, with truth tables indicating their legal combinations. The BBN group used this table-based stemmer in TREC-2001 [XFW01].
Algorithmic Light Stemmers: Light stemming refers to a process of stripping off a small set of
prefixes and/or suffixes, without trying to deal with infixes, or recognize patterns and find roots
[LBC02, Dar03]. Although light stemming can correctly conflate many variants of words into large stem
classes, it can fail to conflate other forms that should go together. For example, broken (irregular)
plurals for nouns and adjectives do not get conflated with their singular forms, and past tense verbs do
not get conflated with their present tense forms, because they retain some affixes and internal
differences, like the noun soduud, the plural of sad, which means 'dam'.
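To make the affix-stripping idea concrete, the following is a minimal Python sketch of a light stemmer operating on Buckwalter-transliterated words; the prefix and suffix lists are purely illustrative and do not reproduce the exact lists of any published stemmer such as light10 or Al-Stem.

# A minimal sketch of algorithmic light stemming in Buckwalter transliteration.
# The affix lists are illustrative assumptions, not any published stemmer's lists.
PREFIXES = ["wAl", "fAl", "bAl", "kAl", "Al", "wa", "fa", "bi", "li"]
SUFFIXES = ["hA", "An", "At", "wn", "yn", "yp", "p", "h", "y"]

def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping the stem >= min_len."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word

print(light_stem("wAlktAb"))  # -> "ktAb" ("and-the-book" reduced to "book")

Note how such a stemmer makes no attempt to deal with infixes, which is exactly why broken plurals and their singulars end up in different stem classes.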
Morphological Analyzers: Several morphological analyzers have been developed for Arabic [AlA89, Als96, Bee96, KhG99, DDJ01, GPD04, TEC05], but few have received a standard IR evaluation. Such analyzers find the root, or any number of possible roots, for each word. Since most verbs and nouns in Arabic are derived from triliteral (or, rarely, quadriliteral) roots, identifying the underlying root of each word theoretically retrieves most of the documents containing a given search term regardless of form. However, there are some significant challenges with this approach. Determining the root for a given word is extremely difficult, since it requires a detailed morphological, syntactic and semantic analysis of the text to fully disambiguate the root forms. The issue is complicated further by the fact that not all words are derived from roots. For example, loan words (words borrowed from another language) are not based on root forms, although there are even exceptions to this rule. For example, some loans that have a structure similar to triliteral roots, such as the English word film, are handled grammatically as if they were root-based, adding to the complexity of this type of search.
Finally, the root can serve as the foundation for a wide variety of words with related meanings. The root ktb is used for many words related to writing, including kataba, which means 'to write'; kitaab, which means 'book'; maktab, which means 'office'; and kaatib, which means 'author'. But the same root is also used for regiment/battalion: katyba. As a result, searching based on root forms results in very high recall, but precision is usually quite low.
2.5.6.3. Statistical Stemmers
Within the statistical stemmer class, we distinguish two kinds of stemmers: those grouping word variants using clustering techniques, and n-gram stemmers. The former model consists in grouping words that result in a common root, after applying a specific algorithm, into a conflation or equivalence class. These equivalence classes do not overlap: each word belongs to exactly one class.
Based on co-occurrence analysis and a variant of EMIM (expected mutual information) [Van79, ChH89], which measures the proportion of word co-occurrences that are over and above what would be expected by chance, statistical stemmers for the Arabic language were used to refine stem-based and root-based stemmers [LBC02]; they were also applied to n-gram stemmers for the English and Spanish languages [XuC98].
Statistical stemming applied to the best Arabic stemmers (the Darwish light stemmer modified by Larkey [LBC02], and the Khoja root-based stemmer [KhG99]) changes the classes a great deal, but does not improve (or hurt) overall retrieval performance. This may be attributed to the clustering method having a high bias against low-frequency variants.
The second statistical model, the n-gram model, generates a document vector by moving a window of n characters in length through the text, enabling a statistical description of the language by learning the occurrence probability of each group of n characters.
N-gram stemmers face different challenges, primarily caused by the significantly larger number of unique terms in an Arabic corpus, and by the peculiarities of the Arabic infix structure, which reduces the rate of correct n-gram matching.
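As an illustration of the n-gram representation, the following Python sketch builds digram frequency vectors for Buckwalter-transliterated strings; the choice to drop blanks, and the example words, are assumptions made for the sketch.

from collections import Counter

def char_ngrams(text: str, n: int = 2) -> Counter:
    """Represent a text by the frequencies of its character n-grams (digrams by default)."""
    text = text.replace(" ", "")          # one simple choice: word-internal n-grams only
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Two transliterated variants of the same stem share most of their digrams:
print(char_ngrams("ktAb"))    # Counter({'kt': 1, 'tA': 1, 'Ab': 1})
print(char_ngrams("AlktAb"))  # Counter({'Al': 1, 'lk': 1, 'kt': 1, 'tA': 1, 'Ab': 1})

An infixed form, by contrast, breaks the shared digrams apart, which is the matching problem mentioned above.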
Published comparison studies of using stems against using roots for information retrieval are discrepant. Older studies revealed that words sharing a root are semantically related, and root indexing was reported to outperform stem and word indexing on retrieval performance [HKE97, AAE99, Dar02]. However, later works on the TREC collection showed two different results. Darwish (as cited by Larkey et al. [LBC02]) found no consistent difference between roots and stems, while Al-Jlayl & Frieder, Goweder et al. and Taghva et al. [AlF02, GPD04, TEC05] showed that stem-based retrieval is more effective than root-based retrieval. The older studies showing the superiority of roots over stems are based on small and nonstandard test collections, making their results difficult to justify.
Similarly, the work of Larkey et al. [LBC02] showed that the statistical stemmer based on co-occurrence is still inferior to good light stemming and morphological analysis. In addition, the work of Mustafa and Al-Radaideh [MuA04] indicated that the digram method offers better performance than the trigram method with respect to conflation precision and conflation recall ratios; but in either case, the n-gram approach does not appear to provide good performance compared to the light stemming approach.
Hence, we can conclude that Al-Stem (the Darwish stem-based stemmer, modified by Larkey) was, up to the day this study was carried out, the best known and published stemmer.
2. 6. Arabic Corpus
In attempts to study and evaluate IR systems, morphological analyzers, and machine translation systems for the Arabic language, researchers initiated the creation of corpora. Among these Arabic corpora, some are available, such as the AFP corpus, the Al-Hayat newspaper corpus, the Arabic Gigaword, treebanks, and the ICA. However, with the exception of the initial version of the ICA, comprising about 448 files of a total size of approximately 13.5 MB in uncompressed form, which has been made available for free, all the other corpora are not free.
The Al-Hayat corpus was created for Information Retrieval applications development purposes. The data have been distributed into seven subject-specific databases, following the Al-Hayat subject tags: General, Car, Computer, News, Economics, Science, and Sport. Mark-up, numbers, special characters and punctuation have been removed. The size of the total file is 268 MB. The dataset contains 18,639,264 distinct tokens in 42,591 articles, organized in 7 domains.
2.6.4. Treebanks
A treebank is a text corpus in which each sentence has been annotated with syntactic structure.
Syntactic structure is commonly represented as a tree structure, hence the name treebank. Treebanks can
be used in corpus linguistics for studying syntactic phenomena or in computational linguistics for
training or testing parsers.
Treebanks are often created on top of a corpus that has already been annotated with part-of-speech
tags. In turn, treebanks are sometimes enhanced with semantic or other linguistic information.
Treebanks can be created completely manually, where linguists annotate each sentence with
syntactic structure, or semi-automatically, where a parser assigns some syntactic structure which
linguists then check and, if necessary, correct.
Some treebanks follow a specific linguistic theory in their syntactic annotation (e.g. the BulTreeBank follows head-driven phrase structure grammar (HPSG)), but most try to be less theory-specific. However, two main groups can be distinguished: treebanks that annotate phrase structure, such as the Penn Arabic Treebank, and those that annotate dependency structure, such as the Prague Arabic Dependency Treebank.
2. 7. Summary
In this chapter, we highlight the different models of information retrieval, especially the vector space model. We give a taxonomy of clustering algorithms, and explain the usefulness and the use of clustering in the information retrieval process. We introduce dimension reduction techniques, and chronologically review the feature selection methods used for clustering. Moreover, we present the Arabic language characteristics, and underline previous work undertaken with the aim of improving Arabic retrieval. Finally, we present the existing available Arabic corpora.
3. 2. Model Description
The latent semantic document retrieval model builds upon prior research in document retrieval and, using the singular value decomposition (SVD) [GoV89] to reduce the dimensions of the term-document space, attempts to solve the synonymy and polysemy problems (Section 2.2.2) that plague automatic document retrieval systems. LSA explicitly represents terms and documents in a rich, high-dimensional space, allowing the underlying (latent) semantic relationships between terms and documents to be exploited during searching.
3.2.2. Weighting
The benefits of weighting are well-known in the document retrieval community [Jon72, Dum91,
Dum92]. LSA typically uses both a local and global weighting scheme to increase or decrease the
relative importance of terms within documents and across the entire document collection, respectively.
A combination of the local and global weighting functions is applied to each non-zero element of A:
$a_{ij} = L(i,j)\,G(i)$  (1)   or   $a_{ij} = \frac{L(i,j)}{G(i)}$  (2)
where L(i,j) is the local weighting function for term i, indicating its importance in document j, and G(i) is the global weighting function for term i. When document length is also taken into account, the weight becomes
$a_{ij} = \frac{L(i,j)\,G(i)}{N(j)}$  (3)
where N(j), the document length normalization, is used to penalize the term weight for document j in accordance with its length. Such weighting functions are used to differentially treat terms and documents, so as to reflect knowledge that is beyond the collection of documents.
Some popular local weighting schemes include [Dum91, Dum92]:
- Term Frequency: tf or $f_{ij}$ is the integer representing the number of times term i appears in document j.
- Binary: $a_{ij} = 1$ if $f_{ij} \geq 1$, and 0 otherwise.
- log(Term Frequency + 1): used to damp the effects of large differences in frequencies, such that an additional occurrence of term i in document j is considered more important at smaller term frequency levels than at larger levels.
Four well-known global weightings are: Normal, GfIdf, Idf, and Entropy. Each is defined in terms of the term frequency $f_{ij}$; the document frequency $df_i$, which is the number of documents in which term i occurs; and the global frequency $gf_i$, which is the total number of times term i occurs in the whole collection. N is the number of documents, and M is the number of terms in the collection.
- Normal: $\sqrt{1 / \sum_j f_{ij}^2}$
It normalizes the length of each row (term) to 1. This has the effect of giving high weight to infrequent terms. However, it depends only on the sum of the squared frequencies and not on the distribution of those frequencies.
- GfIdf: $gf_i / df_i$
- Idf: $\log_2 \frac{N}{df_i}$
GfIdf and Idf are closely related. Both of them weight terms inversely by the number of different documents in which they appear.
- Entropy: $1 + \sum_{j=1}^{N} \frac{p_{ij} \log p_{ij}}{\log N}$, where $p_{ij} = \frac{f_{ij}}{gf_i}$.
Entropy is a sophisticated weighting scheme that takes into account the distribution of terms over documents. Subtracting the normalized entropy of this distribution from a constant assigns minimum weight to terms that are equally distributed over documents (i.e. where $p_{ij} = 1/N$).
Furthermore, there are other global weighting schemes, such as the entropy variants that appear in the result tables below. In general, all global weighting schemes give a weaker weight to frequent terms or to those occurring in a lot of documents.
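As an illustration of combining a local and a global weight, the following numpy sketch computes the log-entropy weighting under the definitions above; the toy count matrix is only an example.

import numpy as np

def log_entropy(A: np.ndarray) -> np.ndarray:
    """Weight a term-document count matrix A (terms x documents) with the
    local log(tf + 1) weight and the global entropy weight
    1 + sum_j p_ij log p_ij / log N, where p_ij = f_ij / gf_i."""
    n_docs = A.shape[1]
    gf = A.sum(axis=1, keepdims=True)                    # global frequency of each term
    p = np.divide(A, gf, out=np.zeros_like(A, dtype=float), where=gf > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    g = 1.0 + plogp.sum(axis=1) / np.log(n_docs)         # global weight per term
    return np.log(A + 1.0) * g[:, np.newaxis]            # local weight times global weight

A = np.array([[2, 0, 1], [1, 1, 1]], dtype=float)        # toy 2-term, 3-document counts
print(log_entropy(A))

On this toy matrix, the second term is equally distributed over the documents, so its global weight (and hence its final weight) is zero, exactly as the entropy definition prescribes.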
Two main reasons make the use of normalization necessary:
- Higher Term Frequencies: Long documents usually use the same terms repeatedly. As a result, the term frequency factors may be large for long documents, increasing the average contribution of their terms towards the query-document similarity.
- More Terms: Generally, vocabulary is richer and more varied in long documents than in shorter ones. This increases the number of matches between a query and a long document, increasing the query-document similarity and the chances of retrieval of long documents in preference over shorter documents.
The normalization can be either explicit, or implicit as effectuated by the cosine-based measure (the angular distance between query q and document D):
$\cos(D, q) = \frac{D \cdot q}{\|D\| \, \|q\|}$
Various normalization techniques are used in document retrieval systems. The following is a review of some commonly used normalization techniques [SBM96]:
- Cosine Normalization: $1 / \sqrt{\sum_{i=1}^{M} w_i^2}$, where $w_i = w_{local}(i,j)\, w_{global}(i)$.
Cosine normalization is the most commonly used normalization technique in the vector-space model. It attacks both normalization reasons in one step: higher individual term frequencies increase the individual weighting values $w_i$, increasing the penalty on the term weights. Also, if a document is rich, the number of individual weights in the cosine factor (M in the above formula) increases, yielding a higher normalization factor.
- Maximum tf Normalization: Another technique normalizes the individual tf weights for a document by the maximum tf in the document. The Smart system's augmented tf, $0.4 + 0.6 \cdot \frac{tf}{\max tf}$, is an example of such normalization.
By restricting the tf factors to a maximum value of 1.0, this technique only compensates for the first normalization reason (higher tfs), while it does not make any correction for the second reason (more terms). Hence, the technique turns out to be a weak form of normalization and favors the retrieval of long documents.
-
Byte Length Normalization: More recently, a length normalization scheme based on the byte
size of documents has been used in the Okapi system. This normalization factor attacks both
normalization reasons in one shot.
Other classic weighting schemes are used in the literature, such as the Tfc and Ltc weightings [AaE99], and the Okapi BM-25 weighting [RWH94, Dar03].
- Tfc: $\frac{f_{ij} \cdot idf_i}{\sqrt{\sum_{k=1}^{M} (f_{kj} \cdot idf_k)^2}}$
The tfxidf weighting, even though widely used, does not take into account that documents may be of different lengths. The tfc weighting is similar to the tfxidf weighting, except for the fact that length normalization is used as part of the word weighting formula.
- Ltc: $\frac{\log(f_{ij} + 1) \cdot idf_i}{\sqrt{\sum_{k=1}^{M} \left(\log(f_{kj} + 1) \cdot idf_k\right)^2}}$
A slightly different approach uses the logarithm of the word frequency instead of the raw word frequency, thus reducing the effects of large differences in frequencies.
- Okapi BM-25: $a_{ij} = \frac{3\, f_{ij}}{2\left(0.25 + 0.75 \cdot \frac{N \cdot dl_j}{\sum_{k=1}^{N} dl_k}\right) + f_{ij}} \cdot \left(\log N - \log df_i\right)$
where $dl_j$ is the length of document j.
Among the weighting schemes tried with LSA (Section 3.3.2.1), we find that the Okapi BM-25 scheme provides a 7.9%-27.7% advantage over the term frequency scheme on all the English corpora used in this chapter.
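The following numpy sketch applies the Okapi BM-25 weighting, in the form reconstructed above, to a term-document count matrix; the toy matrix is illustrative only.

import numpy as np

def okapi_bm25_weights(F: np.ndarray) -> np.ndarray:
    """Apply the simplified Okapi BM-25 weighting above to a term-document
    count matrix F (terms x documents). The constants 3, 2, 0.25 and 0.75
    correspond to the usual parameters k1 = 2 and b = 0.75."""
    n_terms, n_docs = F.shape
    df = (F > 0).sum(axis=1)                      # document frequency of each term
    idf = np.log(n_docs / np.maximum(df, 1))      # log N - log df_i
    dl = F.sum(axis=0)                            # document lengths
    norm = 2.0 * (0.25 + 0.75 * dl / dl.mean())   # length normalization per document
    return (3.0 * F / (norm[np.newaxis, :] + F)) * idf[:, np.newaxis]

F = np.array([[3, 0, 1], [1, 2, 0], [0, 1, 1]], dtype=float)  # toy counts
print(okapi_bm25_weights(F))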
where the columns of U and V are the left and right singular vectors, respectively, corresponding to the monotonically decreasing positive diagonal elements of S, which are called the singular values of the matrix A. As illustrated in Figure 3.1, the first k columns of the U and V matrices and the first (largest) k singular values of A are used to construct a rank-k approximation to A via $A_k = U_k S_k V_k^T$. The columns of U and V are orthogonal, such that $U^T U = V^T V = I_r$, where r is the rank of the matrix A. A theorem due to Eckart and Young [GoR71] states that $A_k$, constructed from the k-largest singular triplets20 of A, is the closest rank-k approximation (in the least squares sense) to A [BeC87].
20 The triplet $\{U_i, \sigma_i, V_i\}$, where $S = diag(\sigma_0, \sigma_1, \ldots, \sigma_{r-1})$ and $U_i$, $V_i$ are the $i$-th columns of U and V, is called the $i$-th singular triplet of the matrix A.
Figure 3.1. A pictorial representation of the SVD, $A_k = U_k S_k V_k^T$, with A of size m x n, U of size m x r, S of size r x r, and $V^T$ of size r x n; the columns of $U_k$ are the term vectors and the columns of $V_k$ the document vectors. The shaded areas of U and V, as well as the diagonal line in S, represent $A_k$, the reduced representation of the original term-document matrix A.
With regard to LSA, $A_k$ is the closest k-dimensional approximation to the original term-document space represented by the incidence matrix A. As stated previously, by reducing the dimensionality of A, much of the noise that causes poor retrieval performance is thought to be eliminated. Thus, although a high-dimensional representation appears to be required for good retrieval performance, care must be taken not to reconstruct A. If A is nearly reconstructed, the noise caused by variability of word choice and by terms that span or nearly span the document collection will not be eliminated, resulting in poor retrieval performance [BeC87]. Generally, the choice of the reduced dimension is empirical: it depends on the nature of the corpus, on the type of the queries used (whether they are long, or short and represented by keywords), and on the weighting scheme performed. This is experimentally shown in [ABE08] and in Section 3.3.2.2.
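The rank-k construction described above can be sketched in a few lines of Python; the random matrix stands in for a real term-document matrix.

import numpy as np

def lsa_rank_k(A: np.ndarray, k: int):
    """Rank-k approximation A_k = U_k S_k V_k^T via the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :]

A = np.random.rand(100, 30)          # toy term-document matrix
Uk, Sk, Vtk = lsa_rank_k(A, k=5)
Ak = Uk @ Sk @ Vtk                   # closest rank-5 approximation (least squares)
print(np.linalg.norm(A - Ak))        # residual of the Eckart-Young optimum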
It is worthwhile to point out that, in the context of text retrieval, document vectors can refer either to the columns of A or to the columns of $V^T$, and term vectors can refer either to the rows of A or to the rows of U. The same nomenclature applies to the dimension-reduced model, only with the subscripts dropped. It is important to differentiate between these two kinds of document vectors (or term vectors) to avoid confusion. Also note that the term vectors and document vectors in A may be referred to as the initial/original (term or document) vectors, since they have not been subjected to dimension reduction; while in $A_k$ they may be referred to as the (term or document) concepts, because the term-document matrix reduction captures semantic structure (i.e. concepts) while rejecting the noise that results from term usage variations.
In addition to the fact that the left and right singular vectors specify the locations of the terms and documents respectively, the singular values are often used to scale the term and document vectors, allowing clusters of terms and documents to be more readily identified. Within the reduced space, semantically related terms and documents presumably lie near each other, since the SVD attempts to capture the underlying semantic structure of the collection.
(I) Version A
Underlying Philosophy: The column vectors $(V^T(:,1), \ldots, V^T(:,n))$ in matrix $V^T$ are k-dimensional document vectors, their dimension having been reduced from m. The dimensionally reduced $V^T(:,i)$ should carry some kind of latent semantic information captured from the original model, and may be used for querying purposes. However, since $V^T(:,i)$ is k-dimensional while q is m-dimensional, q needs to be translated into some proper form in order to compare it with $V^T(:,i)$. Observe that the equation $A = USV^T$ leads to $(A(:,1), \ldots, A(:,n)) = US\,(V^T(:,1), \ldots, V^T(:,n))$. Thus, for any individual column vector in A, $A(:,i) = USV^T(:,i)$ for $1 \le i \le n$, which implies that $V^T(:,i) = S^{-1}U^T A(:,i)$ for $1 \le i \le n$.
Query Method: First, use the formula $q'_a = S^{-1}U^T q$ to translate the original query q into a form comparable with any column vector $V^T(:,i)$ in matrix $V^T$. Then compute the cosine between $q'_a$ and each $V^T(:,i)$ for $1 \le i \le n$.
(II) Version B
Underlying Philosophy: As mentioned earlier, document vectors can mean two different things: either the column vectors $(V^T(:,1), \ldots, V^T(:,n))$ in $V^T$, or the column vectors $(A(:,1), \ldots, A(:,n))$ in A. In fact, the latter might be a better choice for serving as document vectors, because they are rescaled from the dimensionally reduced U and V by a factor of S after the SVD process. To utilize $(A(:,1), \ldots, A(:,n))$ for querying purposes, only one further step beyond version A needs to be taken, which is to rescale the k-dimensional folded-in query by S and map it back to m dimensions.
Query Method: First, use the formula $q'_b = UU^T q$ (note that, because of the dimensional reduction, $U^T U = I$ while $UU^T \neq I$) to translate the original query q into a folded-in-plus-rescaled form comparable with any column vector $A(:,i)$ in matrix A. Then compute the cosine between $q'_b$ and each $A(:,i)$ for $1 \le i \le n$.
(III) Version B'
Underlying Philosophy: All the reasoning behind version B sounds good except for one thing: since the m-dimensional column vectors $(A(:,1), \ldots, A(:,n))$ will be used as document vectors, there is no need to fold in q and then rescale it back to m dimensions: the original query q (which is already m-dimensional) can be used directly for comparison with each m-dimensional $A(:,i)$ for $1 \le i \le n$.
Query Method: Compute the cosine between q and each $A(:,i)$ for $1 \le i \le n$.
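For concreteness, the following Python sketch implements the three query versions side by side; the random A and q are placeholders for a real term-document matrix and query vector.

import numpy as np

def query_scores(A, q, k):
    """Score documents against query q with the three standard LSA query
    versions (A, B, B') described above."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    Ak = Uk @ Sk @ Vtk

    def cos(u, M):                       # cosine of u against every column of M
        return (u @ M) / (np.linalg.norm(u) * np.linalg.norm(M, axis=0))

    qa = np.linalg.inv(Sk) @ Uk.T @ q    # version A: fold q into the k-dim space
    qb = Uk @ Uk.T @ q                   # version B: fold in, then rescale to m dims
    return cos(qa, Vtk), cos(qb, Ak), cos(q, Ak)   # versions A, B, B'

A = np.random.rand(50, 20)
q = np.random.rand(50)
sA, sB, sBp = query_scores(A, q, k=5)

Note that versions B and B' return the same document ranking: the columns of $A_k$ lie in the span of $U_k$, so $q \cdot A_k(:,i) = (UU^Tq) \cdot A_k(:,i)$, and the two scores differ only by a constant factor, consistent with the equivalence stated below.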
The above three different versions of query method are summarized in Table 3.1, along with the
conventional technique of lexical matching.
                     Lexical Matching    Version A           Version B         Version B'
Document Vectors     m-dim column        k-dim column        m-dim column      m-dim column
                     vectors in A        vectors in V^T      vectors in A      vectors in A
Query Vector         m-dim original      S^{-1}U^T q         UU^T q            m-dim original
                     query vector q                                            query vector q
Applicable           Many                [BeF96] [FiB02]     [BDO95] [DDF89]   [BCB92] [BDJ99]
Literature                               [Jia97] [Jia98]     [DDF90]           [Din99] [Din01]
                                         [LeB97] [Let96]                       [HSD00] [KoO96]
                                         [Wit97]                               [KoO98] [Zha98]
                                                                               [ZhG02]
Table 3.1. Comparison between Different Versions of the Standard Query Method.
Based on the analysis of the three standard versions, it has been proved that version B and version B' are essentially equivalent. On the other hand, the task of seeking the best version of the standard query method has brought out a marked difference for version B compared to version A [Yan08]. However, this latter is
3.3.1. Data
The English test data used in our experiments, in this chapter and in some of the following chapters, are formed by mixing documents from multiple topics arbitrarily selected from standard information science test collections. The text objects in these collections are bibliographic citations (consisting of the full text of document titles and abstracts) or the full text of short articles. Table 3.2 gives a brief description and summarizes the sizes of the datasets used.
Cisi: document abstracts in library science and related areas, published between 1969 and 1977 and extracted from the Social Science Citation Index by the Institute for Scientific Information.
Cran: document abstracts in aeronautics and related areas, originally used for tests at the Cranfield Institute of Technology in Bedford, England.
Med: document abstracts in biomedicine, received from the National Library of Medicine.
Reuters-21578: short articles belonging to the Reuters-21578 collection. This collection consists of news stories that appeared in the Reuters newswire in 1987, mostly concerning business and the economy. It contains multiple overlapping categories.
Table 3.2. Description and sizes of the datasets used: Cisi, Cran, Med, and Reuters-21578.
3.3.2. Experiments
In these experiments, we are interested in evaluating the effectiveness of the weighting schemes,
after which we compare the performances of the latent semantic analysis and the vector-space models.
Table 3.3. Results of the weighting schemes in increasing order of MIAP for the Cisi corpus; MIAP ranges from 0.14 for the weakest schemes up to 0.32 for Okapi BM-25.
Table 3.4. Results of the weighting schemes in increasing order of MIAP for the Cran corpus; MIAP ranges from 0.10 for the weakest schemes up to 0.47 for Okapi BM-25.
Table 3.5. Results of the weighting schemes in increasing order of MIAP for the Med corpus; MIAP ranges from 0.09 for the weakest schemes up to 0.26 for Okapi BM-25.
Table 3.6. Results of the weighting schemes in increasing order of MIAP for the Cisi-Med corpus; MIAP ranges from 0.14 for the weakest schemes up to 0.41 for Okapi BM-25.
The preceding experiments show that the choice of a weighting scheme is very important, because some schemes destroy the mean interpolated average precision (MIAP) (see Section C.2.3). As we can see, for example in Table 3.5, the plain term frequency indexation (nnn) gives a better result than several of the weighted schemes. The experiments also show that the Okapi BM-25 weighting scheme gives the best results, over all the other schemes, in all the examples. Moreover, we remark that there is a marked difference in the rank of the other weighting schemes from one corpus to another. For example, in the evaluation of the system's performance, the well-known and widely used TfxIdf scheme (ntn) is ranked between 18th and 21st.
Table 3.7. The best reduced dimension for each weighting scheme in the case of the four corpora.
Figure 3.2. The interpolated recall-precision curves of the LSA and VSM models for the Cisi, Cran, Med, and Cisi-Med corpora.
3. 4. Summary
In this chapter, we first recalled the traditional technique of document retrieval, the Vector-Space Model (VSM). Then we described one of its extended models, Latent Semantic Analysis (LSA), which shows a substantial advance over the traditional VSM, even when only version A is used to model the data and queries. We also presented three combination methods, found in the current weighting literature, for the local and global weighting functions and the length normalization. Through some experiments, we compared the application of twenty-five weighting schemes, and the comparison showed advantages for the Okapi BM-25 weighting scheme. The first advantage of this scheme is the large improvement it brings to the performance of the information retrieval system, while the second is that it yields the smallest best reduced dimension k for the LSA model.
Chapter 4 Document Clustering based on Diffusion Map
4. 1. Introduction
A great challenge of text mining arises from the increasingly large text datasets and the high dimensionality associated with natural language. In this chapter, a systematic study is conducted, for the first time, in the context of the document clustering problem, using the recently introduced diffusion framework and some characteristics of the singular value decomposition.
This study has two major folds: classical clustering and on-line clustering. In the first fold, we propose to construct a diffusion kernel based on the cosine distance, we discuss the problem of the choice of the reduced dimension, and we compare the performances of the k-means algorithm in four different vector spaces: Salton's vector space, the latent semantic analysis space, the diffusion space based on the cosine distance, and the one based on the Euclidean distance. We also propose two postulates indicating the optimal dimension to use for clustering, as well as the optimal number of clusters to use in that dimension.
In the second fold, we consider single-pass clustering, one of the most popular methods used in online applications such as peer-to-peer information retrieval (P2P) [KWX01, KlJ04, KJR06] and topic detection and tracking (TDT) [HGM00, MAS03]. We present a new version of the classical single-pass clustering algorithm, called On-line Single-Pass clustering based on Diffusion Map (OSPDM).
k is symmetric: $k(d_i, d_j) = k(d_j, d_i)$, and k is positive semi-definite: for all $(c_i)_{i = 1, \ldots, N}$, we have $\sum_{i=1}^{N} \sum_{j=1}^{N} c_i c_j \, k(d_i, d_j) \geq 0$.
This kernel represents some notion of affinity or similarity between the documents of D, as it describes the relationship between pairs of documents in the corpus. In this sense, one can think of the documents as being the nodes of a symmetric graph whose weight function is specified by k. The kernel measures the local connectivity of the documents, and hence captures the local geometry of the corpus D. The idea behind the diffusion map is to construct the global geometry of the data set from the local information contained in the kernel k. The construction of the diffusion map involves the following steps. First, assuming that the transition probability $m_1$, in one time step, between documents $d_i$ and $d_j$ is proportional to $k(d_i, d_j)$, we construct an $N \times N$ Markov matrix M defined by
$M(i,j) = m_1(d_i, d_j) = \frac{k(d_i, d_j)}{p(d_i)}$, where $p(d_i) = \sum_j k(d_i, d_j)$.
The Markov matrix M reflects the first-order neighborhood structure of the graph. To capture information on larger neighborhoods, powers of the Markov matrix M are taken, inducing a forward running in time of the random walk and constructing a Markov chain. Thus, considering $M^t$, the t-th power of M, the entry $m_t(d_i, d_j)$ represents the probability of going from document $d_i$ to $d_j$ in t time steps.
Increasing t corresponds to propagating the local influence of each node to its neighbors. In other words, the quantity $M^t$ reflects the intrinsic geometry of the data set defined via the connectivity of the graph in a diffusion process, and the time t of the diffusion plays the role of a scale parameter in the analysis. When the graph is connected, we have that [Chu97]:
$\lim_{t \to +\infty} m_t(d_i, d_j) = \phi_0(d_j)$, where $\phi_0$ is the unique stationary distribution $\phi_0(d_i) = \frac{p(d_i)}{\sum_l p(d_l)}$.
Using a dimensionality reduction function (the SVD in our approach), the Markov matrix M will have a sequence of r (where r is the matrix rank) eigenvalues in non-increasing order, $\lambda_0 \geq \lambda_1 \geq \ldots \geq \lambda_l \geq \ldots \geq \lambda_{r-1}$, with corresponding right eigenvectors $\psi_l$.
The stochastic matrix $M^t$ naturally induces a distance between any two documents. Thus, we define the diffusion distance as $D^t_{Diff}(d_i, d_j)^2 = \sum_l \lambda_l^{2t} \left(\psi_l(d_i) - \psi_l(d_j)\right)^2$, and the diffusion map as the mapping from
the vector d, representing a document, to the vector
$\Psi_t(d) = \left(\lambda_0^t \psi_0(d),\; \lambda_1^t \psi_1(d),\; \ldots,\; \lambda_{n-1}^t \psi_{n-1}(d)\right)^T$,
for a value n. By retaining only the first n eigenvectors, we embed the corpus D in an n-dimensional Euclidean diffusion space, where $\{\psi_0, \psi_1, \ldots, \psi_{n-1}\}$ are the coordinate axes of the documents in this space. Note that typically n << N, and hence we obtain a dimensionality reduction of the original corpus.
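The construction above can be sketched as follows in Python. This is a simplified rendering, not the exact implementation used in the experiments: the kernel scale is fixed to 1, the right singular vectors of M are used as estimates of its right eigenvectors (the thesis likewise obtains the spectrum through the SVD), and the leading, near-constant coordinate is dropped.

import numpy as np

def cosine_diffusion_map(X: np.ndarray, n_dims: int, t: int = 1):
    """Embed documents (rows of X) with a diffusion map built on a cosine kernel."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    K = np.exp(-(1.0 - Xn @ Xn.T))          # kernel from the cosine distance (scale = 1 assumed)
    M = K / K.sum(axis=1, keepdims=True)    # row-stochastic Markov matrix
    U, s, Vt = np.linalg.svd(M)             # spectrum via the SVD, as in the approach above
    # drop the leading (near-stationary) coordinate; scale by eigenvalue^t
    return (s[1:n_dims + 1] ** t) * Vt[1:n_dims + 1].T

X = np.abs(np.random.rand(8, 100))          # toy corpus: 8 documents, 100 terms
print(cosine_diffusion_map(X, n_dims=2).shape)   # (8, 2)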
The Gaussian kernel, $k(d_i, d_j) = \exp\left(-\frac{\|d_i - d_j\|^2}{\varepsilon}\right)$, is a popular choice, with the parameter $\varepsilon$ controlling the size of the neighborhoods defining the local geometry of the data. The smaller the parameter $\varepsilon$, the faster the exponential decreases, and hence the weight function k becomes numerically insignificant more quickly as we move away from the center.
However, as the experiments in Section 4.4.1 show, there are strong indications that this kernel is not the right choice for document clustering. For this reason, in addition to the fact that the cosine distance has emerged as an effective distance for measuring document similarity [Sal71, SGM00], we propose to use a kernel based on what is known as the cosine distance: $D_{Cos}(d_i, d_j) = 1 - \frac{d_i \cdot d_j}{\|d_i\| \, \|d_j\|}$. We define this kernel as $k(d_i, d_j) = \exp\left(-\frac{D_{Cos}(d_i, d_j)}{\varepsilon}\right)$.
However, in the case where the vectors $d_i$ and $d_j$ are normalized, the two kernels are related, as shown by the equation $D_{Cos}(d_i, d_j) = 1 - d_i^T d_j = \frac{1}{2}\|d_i - d_j\|^2$, and the distinction between them disappears.
One of the most important theorems of the SVD, the Eckart and Young theorem [GoR71], states that the matrix $A_k$ (built as a linear combination of the k leading singular triplets) is the best approximation to the original matrix that uses k degrees of freedom. The technique of approximating a data set with another one having fewer degrees of freedom works well, because the leading singular triplets capture the strongest, most meaningful regularities of the data. The later triplets represent less important, possibly spurious, patterns. Ignoring them actually improves the analysis, though there is the danger that by keeping too few degrees of freedom, or dimensions of the abstract vector space, some of the important patterns will be lost [LFL98].
In [DhM99, DhM01], Dhillon and Modha compared the closeness between the subspaces spanned
by the spherical k-means concept vectors and the singular vectors by using principal angles [BjG73,
GoV89, Arg03] (for more details, see Appendix D). Seeing that the concept vectors constitute an
approximation matrix comparable in quality to the SVD, they were interested in comparing a sequence
of singular subspaces to a sequence of concept subspaces, but since it is hard to directly compare the
sequences, they compared a fixed concept subspace to various singular subspaces, and vice-versa.
Figure 4.1. Average cosine of the principal angles between 64 concept subspace and various singular
subspaces for the CLASSIC data set.
Figure 4.2. Average cosine of the principal angles between 64 concept subspace and various singular
subspaces for the NSF data set.
This fact means that the concept subspace is completely contained in the singular subspace constituted by the first six singular vectors. Thus the minimum number k of independent variables required to describe the approximate behavior of the underlying system in the truncated SVD matrix $M_{n-1}$, where $M_{n-1} = U_{n-1} S_{n-1} V_{n-1}^T$, is reduced by a factor of 10 compared to the one needed and used for information retrieval. One procedure consists in retaining the singular values $\sigma_{k-1}$ greater than a given threshold $f_0$; while in Lerman's procedure, based on the plot of the singular values in decreasing order and on the break or discontinuity in the slope, she shows that the number of degrees of freedom k is equal to the number of points on the left side of the discontinuity.
After reducing the dimension, documents are represented as k-dimensional vectors in the diffusion space, and can be clustered using a standard clustering algorithm, such as k-means or single-pass.
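One simple way to operationalize the discontinuity criterion is sketched below in Python; the exact procedure used in our experiments may differ in its details.

import numpy as np

def dimension_by_gap(singular_values: np.ndarray) -> int:
    """Pick the reduced dimension as the number of singular values to the left
    of the largest drop in the sorted singular value curve (Lerman-style)."""
    s = np.sort(singular_values)[::-1]
    drops = s[:-1] - s[1:]             # differences between successive singular values
    return int(np.argmax(drops)) + 1   # points on the left side of the largest gap

s = np.array([9.5, 7.2, 6.9, 1.1, 0.9, 0.8])
print(dimension_by_gap(s))             # -> 3: the break falls after the third value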
4.2.3.2. SVD-Updating
Suppose an $m \times n$ matrix A has been generated from a set of data in a specific space, and its SVD, denoted by SVD(A) and defined as
$A = USV^T$  (1)
has been computed. If more data (represented by rows or columns) must be added, three alternatives for incorporating them currently exist: recomputing the SVD of the updated matrix, folding-in the new rows and columns, or using the SVD-updating method developed in [Obr94].
Recomputing the SVD of a larger matrix requires more computation time and, for large problems, may be impossible due to memory constraints. Recomputing the SVD allows the new p rows and q columns to directly affect the structure of the resultant matrix, by creating a new matrix $A_{(m+p) \times (n+q)}$, computing the SVD of the new matrix, and generating a different rank-k approximation matrix
$A_k = U_k S_k V_k^T$.  (2)
In contrast, folding-in, which is essentially the process described in Section 3.2.4 for query representation versions A and B, is based on the existing structure, the current $A_k$; hence new rows and columns have no effect on the representation of the pre-existing rows and columns. Folding-in requires less time and memory but, following the study undertaken in [BDO95], has deteriorating effects on the representation, unless the SVD of $A_{(m+p) \times (n+q)}$ is explicitly computed.
The process of SVD-updating requires two steps, which involve adding new columns and new rows.
a- Overview
Let D denote the p new columns to process; then D is an $m \times p$ matrix. D is appended to the columns of the rank-k approximation of the $m \times n$ matrix A, i.e., from Equation (2), $A_k$, so that the k largest singular values and corresponding singular vectors of
$B = (A_k \;\; D)$  (3)
are computed. This is almost the same process as recomputing the SVD, only A is replaced by $A_k$. Let T denote a collection of q rows for SVD-updating. Then T is a $q \times n$ matrix. T is then appended to the rows of $A_k$ so that the k largest singular values and corresponding singular vectors of
$C = \begin{pmatrix} A_k \\ T \end{pmatrix}$  (4)
are computed.
b- SVD-Updating Procedures
In this section, we detail the mathematical computations required in each phase of the SVD-updating process. SVD-updating incorporates new row or column information into an existing structured model ($A_k$ from Equation (2)) using the matrices D and T discussed above. SVD-updating exploits the previous singular values and singular vectors of the original matrix A as an alternative to recomputing the SVD of $A_{(m+p) \times (n+q)}$.
Updating columns. Let $B = (A_k \;\; D)$ from Equation (3), and define $SVD(B) = U_B S_B V_B^T$. Then
$U_k^T B \begin{pmatrix} V_k & 0 \\ 0 & I_p \end{pmatrix} = (S_k \;\; U_k^T D)$, since $A_k = U_k S_k V_k^T$. If $F = (S_k \;\; U_k^T D)$ and $SVD(F) = U_F S_F V_F^T$, then it follows that $U_B = U_k U_F$, $V_B = \begin{pmatrix} V_k & 0 \\ 0 & I_p \end{pmatrix} V_F$, and $S_B = S_F$.
Updating rows. Let $C = \begin{pmatrix} A_k \\ T \end{pmatrix}$ from Equation (4), and define $SVD(C) = U_C S_C V_C^T$. Then
$\begin{pmatrix} U_k^T & 0 \\ 0 & I_q \end{pmatrix} C V_k = \begin{pmatrix} S_k \\ T V_k \end{pmatrix}$. If $H = \begin{pmatrix} S_k \\ T V_k \end{pmatrix}$ and $SVD(H) = U_H S_H V_H^T$, then it follows that $U_C = \begin{pmatrix} U_k & 0 \\ 0 & I_q \end{pmatrix} U_H$, $V_C = V_k V_H$, and $S_C = S_H$.
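The column-updating step can be sketched as follows in Python; the function name and the toy data are illustrative, and the sketch follows the update rule stated above.

import numpy as np

def svd_update_columns(Uk, sk, Vk, D):
    """SVD-updating for p new columns D: with B = (A_k  D), form
    F = (S_k  U_k^T D), take SVD(F) = U_F S_F V_F^T, and assemble U_B, S_B, V_B."""
    k, p = sk.size, D.shape[1]
    F = np.hstack([np.diag(sk), Uk.T @ D])           # k x (k + p)
    UF, sF, VFt = np.linalg.svd(F, full_matrices=False)
    UB = Uk @ UF                                     # updated left singular vectors
    W = np.block([[Vk, np.zeros((Vk.shape[0], p))],  # [[V_k, 0], [0, I_p]]
                  [np.zeros((p, k)), np.eye(p)]])
    VB = W @ VFt.T                                   # updated right singular vectors
    return UB, sF, VB

A = np.random.rand(20, 10)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 4
UB, sB, VB = svd_update_columns(U[:, :k], s[:k], Vt[:k].T, np.random.rand(20, 3))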
4. 3. Clustering Algorithms
In this section, we present the clustering algorithms that we use in this chapter: the k-means algorithm, the single-pass algorithm, and finally the on-line single-pass clustering based on diffusion map (OSPDM).
The k-means algorithm aims at minimizing the objective function $\sum_{j=1}^{k} \sum_{i} \left\| x_i^{(j)} - c_j \right\|^2$, where $\left\| x_i^{(j)} - c_j \right\|^2$ is the distance between a point $x_i^{(j)}$ and the cluster centroid $c_j$, which is the mean point of all the points $x_i^{(j)}$ of cluster j.
The algorithm starts by partitioning the input points into k initial sets, either at random or using some heuristic. It then calculates the centroid of each set, and constructs a new partition by associating each point with the nearest centroid. The centroids are then recalculated for the new clusters, and the algorithm is repeated by alternate application of these two steps until convergence, which is obtained when the points no longer switch clusters (or, alternatively, when the centroids no longer change).
The k-means algorithm is composed of the following steps:
1- Place k points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2- Assign each object to the group that has the closest centroid.
3- When all objects have been assigned, recalculate the positions of the k centroids.
4- Repeat steps 2 and 3 until the centroids no longer move.
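A plain Python rendering of these steps, with random initial centroids and Euclidean distances, might look as follows; the toy data set is illustrative.

import numpy as np

def kmeans(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """Plain k-means following the steps above: pick initial centroids, assign
    points to the nearest centroid, recompute centroids, repeat to convergence."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

X = np.random.rand(200, 2)
labels, centroids = kmeans(X, k=3)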
Algorithm
Single-pass clustering, as the name suggests, requires a single, sequential pass over the set of documents it attempts to cluster. The algorithm classifies the next document in the sequence according to a condition on the similarity function employed. At every stage, based on the comparison between a certain threshold and the similarity between a document and a defined cluster, the algorithm decides whether a newly seen document should become a member of an already defined cluster or the center of a new one. Usually, the description of a cluster is its centroid (the average vector of the document representations included in the cluster in question), and a document representation consists of a term-frequency vector.
Basically, the single-pass algorithm operates as follows:
1- Treat the first document as the centroid of a new cluster.
2- For each subsequent document, compute its similarity to the centroid of every existing cluster.
3- If the highest similarity exceeds the threshold, assign the document to the corresponding cluster and recompute that cluster's centroid; otherwise, let the document form a new cluster.
4- Repeat from step 2 until no documents remain.
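A minimal Python sketch of this procedure, using cosine similarity and an illustrative threshold, might look as follows.

import numpy as np

def single_pass(docs: np.ndarray, threshold: float = 0.5):
    """Single-pass clustering as described above: each document joins the most
    similar existing cluster, or seeds a new one. Clusters are represented by
    centroid vectors; similarity is the cosine."""
    centroids, members = [], []
    for d in docs:
        sims = [d @ c / (np.linalg.norm(d) * np.linalg.norm(c)) for c in centroids]
        if sims and max(sims) >= threshold:
            j = int(np.argmax(sims))
            members[j].append(d)
            centroids[j] = np.mean(members[j], axis=0)   # update the centroid
        else:
            centroids.append(d.astype(float))            # seed a new cluster
            members.append([d])
    return centroids, members

docs = np.random.rand(10, 5)
centroids, members = single_pass(docs, threshold=0.8)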
7- Apply SVD-updating for $T = \begin{pmatrix} M_k \\ R_M \end{pmatrix}$:
a. Put $H = \begin{pmatrix} S_k \\ R_M V_k \end{pmatrix}$ and compute $SVD(H) = U_H S_H V_H^T$.
b. Compute $U_T = \begin{pmatrix} U_k & 0 \\ 0 & 1 \end{pmatrix} U_H$.
Example 1. (Cisi and Med) In this example, the data set contains all the documents of the Cisi and Med collections. Figure 4.3 shows the two collections in the diffusion space at power t = 1, (a, c, e) for the cosine kernel and (b, d, f) for the Gaussian kernel, in 1, 2 and 3 dimensions respectively. From this figure, it appears clearly that the collections are better represented in the embedding space using the cosine kernel.
Figure 4.3. Representation of the Cisi and Med collections in the cosine diffusion space (a, c, e) and in the Gaussian diffusion space (b, d, f), in 1, 2, and 3 dimensions.
Table 4.1. Performance (accuracy, Acc, and mutual information, MI) of the different embedding representations using k-means on the Cisi and Med set, for embedding dimensions ranging from 2 to 100. The cosine diffusion and LSA representations peak in low dimension (98.60% and 98.63% accuracy, respectively) and degrade as the dimension grows, while the Gaussian diffusion representation remains near 58.6% accuracy at every dimension.
From the results of Table 4.1, we can see that the embedding diffusion one obtains is very sensitive to the choice of the diffusion kernel, and that the data representation in higher dimensions produces worse results, confirming the Dhillon and Modha results [DhM01] discussed in Section 4.2.3.1. Moreover, by comparing in Table 4.2 the running time of the diffusion process, at t = 1, using the two kernels, we find that the process needs just about 36 seconds to build the 2-dimensional diffusion space based on the cosine kernel, while for the Gaussian kernel it takes about 31 minutes, indicating that the cosine kernel takes advantage of the sparsity of the word-document matrix in the computation of the Markov matrix and of the SVD.
Step             Cosine kernel   Gaussian kernel
Kernel           9 s             7 s
Markov matrix    2 s             14 s
SVD              25 s            31 min
Table 4.2. The process running time for the cosine and the Gaussian kernels.
In Figure 4.4, we plot the first two coordinates of some powers of the Markov matrix M, (a, c, e, g, i) for the cosine kernel and (b, d, f, h, j) for the Gaussian kernel.
Figure 4.4. Representation of our data set in the cosine and Gaussian diffusion spaces for various time iterations t.
From this figure, we remark that when the value of the power t increases, the data of the two
collections gets more merged. Effectively, because in this case the data point get connected by a larger
number of paths. Moreover, we remark that the dependency changing rate of data in the case Gaussian
diffusion space is larger than in the cosine space, showing by this that the cosine distance is stable than
the Euclidean distance.
On the basis of these results, and the fact that we are using un-normalized data, we have decided to
exclude from our succeeding experiments the diffusion space based on the Euclidian distance (Gaussian
diffusion space) and the use of the Markov matrix power. Thus, we will restrict our comparisons to the
cosine diffusion space for t equal to 1, LSA, and Salton spaces.
Spaces            Acc    MI
Cosine diffusion  98.60  89.18
Salton            95.72  83.61
LSA               98.63  90.40
Table 4.3. Performance of k-means in cosine diffusion, Salton and LSA spaces for the set Cisi and Med.
On the other hand, despite the fact that Table 4.3 shows that the cosine diffusion representation is only incrementally better than the Salton representation in both accuracy and mutual information, we should not forget the excellent gain in computation time. Running the k-means algorithm in Matlab sometimes takes more than two hours when documents are represented in the Salton space, where the length of a document vector is determined by the number of collection terms, usually in the thousands; with the cosine diffusion representation, the running time is just a few seconds, since the length of a document vector in the embedded space is very small, reduced by a factor that may be larger than 1000. In the case of the LSA representation, we remark that for this set of documents k-means performs almost as accurately as with the cosine diffusion representation.
To pick the number of dimensions for the embedding space, as shown in Figure 4.5, we plot the first
100 singular values of the cosine diffusion map in the bottom curve. To help identify the discontinuity in
the slope of the singular value curve, we plot on the top part of the figure the difference between each
successive pair of singular values, magnified by a factor of 10 and displaced by 1 from the origin for
emphasis.
Figure 4.5. Representation of the first 100 singular values of the cosine diffusion map on the set Cisi
and Med.
64 ____________________________________________________________________________________________
Fadoua Ataa Allahs Thesis
Figure 4.6. Representation of the first 100 singular values of the Cisi and Med term-document matrix.
Using the same technique, to pick the reduced dimension for the LSA space, we remark that for this
set of documents the discontinuity point in Figure 4.6 corresponds to the best reduced dimension found
in Table 4.1.
After clustering the set of documents into two clusters, in this step we use the diffusion process and the Buchanan-Wollaston and Hodgeson method to verify that the documents of each resulting cluster, referenced by C1 and C2, need not be further partitioned.
Given that, for well separated data, the number of empirical histogram peaks is known to equal the number of components, the Buchanan-Wollaston and Hodgeson method [BuH29] consists in fitting each peak to a distribution. Based on this, and using the Kullback-Leibler (KL) divergence [KuL51], the Jensen-Shannon divergence [FuT04], and the accumulation function [Rom90] to compare the approximating distribution with the histogram distribution of the data, we determine, as shown in Table 4.4, that the histograms of clusters C1 and C2, represented respectively in Figure 4.7 and Figure 4.8, are each very well approximated by a single normal distribution.
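As a minimal illustration of this test, the following sketch, assuming the 1-dimension diffusion coordinates of one cluster are given as an array x, fits a single normal distribution and computes the KL divergence between the empirical histogram and the fit; the bin count and the names are illustrative.

import numpy as np
from scipy.stats import norm

def kl_to_normal_fit(x, bins=50):
    # Empirical bin probabilities of the data.
    hist, edges = np.histogram(x, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]
    p = hist * width
    # Maximum-likelihood normal fit and its bin probabilities.
    mu, sigma = norm.fit(x)
    q = norm.pdf(centers, mu, sigma) * width
    mask = p > 0                     # KL divergence is summed where p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

A divergence close to zero, as in Table 4.4, indicates that one component suffices and the cluster need not be split further.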
                   Cluster C1   Cluster C2
Kullback-Leibler   8e-15        2e-15
Jensen-Shannon     -4e-17       8e-17
Accumulation       1e-16        3e-17

Table 4.4. Measure of the difference between the approximated and the histogram distributions.
In Figure 4.9 and Figure 4.10, we represent the first hundred singular values of the documents from clusters C1 and C2, respectively, in the cosine diffusion space. We remark that the discontinuity point of the slope coincides with the largest singular value, which means that the other singular values are meaningless. Thus, we could represent a document of the embedded cosine diffusion space in 1 dimension.
Figure 4.9. Representation of the first 100 singular values of the cosine diffusion map on the cluster C1.
Figure 4.10. Representation of the first 100 singular values of the cosine diffusion map on the cluster C2.
Example 2. (Cran, Cisi, and Med) Here, we mix all documents of the three collections: Cran, Cisi, and Med. In Table 4.5, we present the results of the k-means algorithm running in five different dimensions for the LSA and the cosine diffusion spaces. Table 4.6 shows the optimal performance of k-means in the cosine diffusion, Salton and LSA spaces.
Spaces            Dim1           Dim2           Dim3           Dim4           Dim5
                  Acc    MI      Acc    MI      Acc    MI      Acc    MI      Acc    MI
Cosine diffusion  93.21  78.72   98.45  92.14   97.05  90.29   94.38  87.26   92.67  86.34
LSA               89.67  72.84   86.22  81.27   86.74  82.11   92.32  86.44   79.05  67.32
Table 4.5. Performances of different embedding representations using k-means for the set Cran, Cisi
and Med.
Spaces            Acc    MI
Cosine diffusion  98.45  92.14
Salton            73.03  62.35
LSA               92.32  86.44
Table 4.6. Performance of k-means in cosine diffusion, Salton, and LSA spaces for the set Cran, Cisi
and Med.
From Tables 4.3 and 4.6, we remark that k-means performs much better in the cosine diffusion space than in the Salton space, and better than in the LSA space. However, for this set of documents, the singular-value discontinuity technique does not work for the LSA space, because the marked slope discontinuity (shown in Figure 4.12) is around the 3rd singular value, indicating an optimal dimension equal to 2, while the best dimension found in Table 4.5 is the 4th.
Figure 4.11. Representation of the first 100 singular values of the cosine diffusion space on the set
Cran, Cisi and Med.
Figure 4.12. Representation of the first 100 singular values of the Cran, Cisi and Med term-document matrix.
                   Cluster C1   Cluster C2   Cluster C3
Kullback-Leibler   6e-17        1e-16        2e-16
Jensen-Shannon     2e-16        1e-16        1e-16
Accumulation       4e-17        4e-17        1e-17
Table 4.7. Measure of the difference between the approximated and the histogram distributions.
Figure 4.16. Representation of the first 100 singular values of the cosine diffusion map on cluster C1.
Figure 4.17. Representation of the first 100 singular values of the cosine diffusion map on cluster C2.
Figure 4.18. Representation of the first 100 singular values of the cosine diffusion map on cluster C3.
The previous results suggest two postulates for the cosine diffusion space:
Dimension Postulate: The optimal dimension of the embedding for the cosine diffusion space is
equal to the number, d, of the singular values on the left side of the discontinuity point after excluding
the largest (first) singular value. When d is equal to zero, the data will be represented in 1-dimension.
Cluster Postulate: The optimal number of clusters in a hierarchical step is equal to d+1, where d is
the optimal dimension provided by the dimension postulate of the same step.
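The two postulates translate directly into a simple procedure. The following sketch, in which the gap-detection threshold jump_factor is an illustrative heuristic and not part of the postulates themselves, locates the slope discontinuity in a decreasing sequence of singular values and returns the dimension and the cluster number they prescribe.

import numpy as np

def apply_postulates(singular_values, jump_factor=3.0):
    # singular_values: first singular values of the cosine diffusion map, in decreasing order.
    s = np.asarray(singular_values, dtype=float)
    gaps = s[:-1] - s[1:]                       # differences between successive values
    typical = np.median(gaps[1:]) + 1e-12       # typical gap, ignoring the first one
    jumps = np.where(gaps[1:] > jump_factor * typical)[0]
    # d = number of values left of the discontinuity, excluding the largest (first) one
    d = int(jumps[0]) + 1 if len(jumps) else 0
    dim = max(d, 1)                             # dimension postulate (1-dimension when d = 0)
    n_clusters = d + 1                          # cluster postulate
    return dim, n_clusters

For the Cisi and Med set, for instance, the first gap after the trivial singular value yields d = 1, hence a 1-dimension embedding and 2 clusters, in agreement with Table 4.1.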
Example 3. (Cran, Cisi, Med, and Reuters_1) For this example, we use just 500 documents from
each collection of Cran, Cisi, and Med, mixed with 425 documents from the Reuters collection. From
Table 4.8, representing the results of the k-means algorithm running in five different dimensions for the
LSA and the cosine diffusion spaces, and Table 4.9, representing its optimal performance in the cosine
diffusion, Salton and LSA spaces, it appears that k-means performs better in the cosine diffusion space
compared to both of the other spaces. However, we are interested in more than that.
___________________________________________________________________________________________
Fadoua Ataa Allahs Thesis
71
Spaces            Dim1           Dim2           Dim3           Dim4           Dim5
                  Acc    MI      Acc    MI      Acc    MI      Acc    MI      Acc    MI
Cosine diffusion  77.87  66.02   84.73  78.88   95.93  93.90   99.22  96.66   98.04  95.61
LSA               66.85  66.42   87.74  81.41   82.28  83.30   70.82  67.44   62.83  57.28
Table 4.8. Performance of different embedding cosine diffusion and LSA representations using k-means
for the set Cran, Cisi, Med and Reuters_1.
Spaces            Acc    MI
Cosine diffusion  99.22  96.66
Salton            71.68  71.62
LSA               87.74  83.30
Table 4.9. Performance of k-means in Cosine diffusion, Salton and LSA spaces for the set Cran, Cisi,
Med and Reuters_1.
Effectively, in what follows we are concerned with validating our postulates in the cosine diffusion space. Based on the dimension and cluster postulates, Figure 4.19, representing the first hundred singular values for the chosen set of documents, indicates that the embedding dimension for this set of data should be equal to 2, and the number of clusters equal to 3.
Figure 4.19. Representation of the first 100 singular values of the cosine diffusion map on the set Cran,
Cisi, Med and Reuters_1.
Thus, we run the 3-means program in the 2-dimension cosine diffusion space, and we present the
generated confusion matrix in Table 4.10.
      Cran   Cisi   Med   Reuters
C1    493           59
C2           499
C3                  441   425

Table 4.10. The confusion matrix for the set Cran, Cisi, Med and Reuters_1 clustered into 3 clusters in the 2-dimension cosine diffusion space.
[Figure: k-means clustering of the set in the 2-dimension cosine diffusion space, with the three resulting clusters C1, C2 and C3.]
Figure 4.21. Representation of the first 100 singular values of the cosine diffusion map on the data set S.
      Cran   Med   Reuters
C1    493
C2           493
C3                 425

Table 4.11. The confusion matrix for the set S in the 2-dimension cosine diffusion space.
      Cran   Cisi   Med   Reuters
C1    493
C2           499
C3                  493
C4                        425

Table 4.12. The confusion matrix obtained by combining the confusion matrices of the hierarchical steps.
Spaces            1-Dim   2-Dim   3-Dim   4-Dim
Cosine diffusion  18.89   84.69   76.4    69.47
Table 4.13. Mutual information of different embedding cosine diffusion representations using k-means
to exclude the cluster C2 from the set Cran, Cisi, Med and Reuters_1.
Spaces            1-Dim          2-Dim          3-Dim          4-Dim
                  Acc    MI      Acc    MI      Acc    MI      Acc    MI
Cosine diffusion  90.87  72.32   97.55  93.76   99.08  95.07   86.51  79.79
Table 4.14. Performance of different embedded cosine diffusion representations using k-means for the
set S.
In order to verify the results of the hierarchical clustering suggested by the two postulates, we run 4-means in the 4-dimension cosine diffusion space, which is indicated in Table 4.8 as the best reduced dimension for clustering the Cran-Cisi-Med-Reuters_1 set in one step.
Presenting, in Table 4.15, the confusion matrix generated by partitioning the entire collection into 4 clusters in the 4-dimension cosine diffusion space, we remark that this matrix indicates the existence of 15 misclassified documents, which is identical to the number of misclassified documents in the confusion matrix resulting from combining the confusion matrices of the hierarchical steps, presented in Table 4.12.
      Cran   Cisi   Med   Reuters
C1    492
C2           500
C3                  493
C4                        425

Table 4.15. The confusion matrix for the set Cran-Cisi-Med-Reuters_1 clustered into 4 clusters in the 4-dimension cosine diffusion space.
In this example, we have not only validated our postulates, but we have moreover established that the relation between them is mutual in each step of the hierarchical process. The results of Table 4.8 show that the reduced dimension deduced graphically for a set of data depends on the number of clusters to be identified.
Example 4. (Cran, Cisi, Med, and Reuters_2) To make sure that the need for several hierarchical clustering steps does not depend on the number of clusters, especially when this number is larger than 3 as in Example 3, we have chosen 500 documents from each of the collections Cran, Cisi, and Med, different from those used in Example 3, and mixed them with the 425 Reuters documents used in Example 3.
From Figure 4.23, we can see that the marked slope discontinuity around the 4th singular value indicates the optimal dimension shown in Table 4.16 and the correct number of clusters from the first hierarchical step.
Figure 4.23. Representation of the first 100 singular values of the cosine diffusion map on the set Cran,
Cisi, Med and Reuters_2.
Spaces            Dim1           Dim2           Dim3           Dim4           Dim5
                  Acc    MI      Acc    MI      Acc    MI      Acc    MI      Acc    MI
Cosine diffusion  72.05  57.74   86.37  79.06   98.16  96.08   97.92  95.39   96.97  94.69
LSA               80.16  71.04   88.94  83.98   86.82  86.74   72.04  66.99   67.52  60.67

Table 4.16. Performance of different embedding cosine diffusion and LSA representations using k-means for the set Cran, Cisi, Med and Reuters_2.
Spaces            Acc    MI
Cosine diffusion  98.16  96.08
Salton            71.44  69.44
LSA               88.94  86.74
Table 4.17. Performance of k-means in Cosine diffusion, Salton and LSA spaces for the set Cran, Cisi,
Med and Reuters_2.
Example 5. (Reuters) In this example, we perform a small experiment to assess how our approach to clustering documents, based on the cosine diffusion map and our proposed postulates, responds to non-separated data. To this end, we have mixed documents of four Reuters categories.
Spaces            1-Dim          2-Dim          3-Dim          4-Dim          5-Dim
                  Acc    MI      Acc    MI      Acc    MI      Acc    MI      Acc    MI
Cosine diffusion  38.99  8.68    50.95  28.88   66.26  49.33   66.44  56.75   62.57  49.08
LSA               36.13  7.62    40.46  12.06   39.33  10.92   38.81  10.49   38.64  9.25

Table 4.18. Performance of different embedding cosine diffusion and LSA representations using k-means for Reuters.
Spaces            Acc    MI
Cosine diffusion  66.44  56.75
Salton            46.59  35.22
LSA               40.46  12.06
Table 4.19. Performance of k-means in Cosine diffusion, Salton and LSA spaces for Reuters.
[Figure: the first 100 singular values (bottom curve) and the slope of the singular-value curve (top curve).]
Figure 4.24. Representation of the first 100 singular values of the cosine diffusion map on Reuters.
[Figure: comparison of the two processes. Diffusion map process: term-document matrix -> document-document matrix -> Markov matrix -> singular value decomposition -> dimension reduction. LSA process: term-document matrix -> singular value decomposition -> dimension reduction.]
To compare the performance of k-means in the two spaces statistically, we use the paired t-test statistic

t = (X̄_D · sqrt(N)) / S_D,

where each component d_i of the vector D is the difference between a pair of the accuracies, or of the mutual information values, obtained in the cosine diffusion and the LSA spaces for a given set of data; N is the length of the vector D, or explicitly the number of data sets; X̄_D is the mean of D; and S_D is its standard deviation, defined by:

S_D = sqrt( (1 / (N - 1)) · Σ_{i=1}^{N} (d_i - X̄_D)² ).
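As a minimal sketch of this computation, assuming d is the array of per-data-set differences described above:

import numpy as np
from scipy import stats

def paired_t(d):
    # t = mean(d) * sqrt(N) / std(d), with the (N - 1)-denominator standard deviation.
    d = np.asarray(d, dtype=float)
    return d.mean() * np.sqrt(len(d)) / d.std(ddof=1)

# Equivalently, SciPy's one-sample t-test on the differences returns the same
# statistic together with a p-value: stats.ttest_1samp(d, 0.0)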
              Two clusters    Three clusters   Four clusters
              Acc     MI      Acc     MI       Acc     MI
T-test value  -3.02   -3.44   0.56    0.62     1.76    3.08
Table 4.20. The statistical results for the performance of k-means algorithm in cosine diffusion and LSA
spaces.
From the results of Table 4.20, we can conclude that the k-means clustering algorithm performs very differently in the cosine diffusion space than in the LSA space, since the absolute values of the t-test statistics exceed the statistical significance threshold, usually set at 0.05 [All07], in all three cases. Moreover, the results show that when there are only 2 topics, which implies that the term distributions of these 2 topics are disjoint, the k-means algorithm performs better in the LSA space than in the cosine diffusion space; for multiple topics (the cases of 3 and 4 clusters), when documents on different topics may use overlapping vocabulary, k-means performs better in the cosine diffusion space. Furthermore, the performance difference between the two spaces becomes larger as the number of clusters increases. On the other hand, we remark that these results conform to the ones shown in Tables 4.3, 4.6, 4.9 and 4.17.
If we take into consideration that, in real-world clustering environments, data sets usually contain more than two clusters, we can conclude that the k-means algorithm performs well in the cosine diffusion space.
[Table: accuracy (ACC) and mutual information (MI) of k-means in the Salton, diffusion map (DM) and updating diffusion map (Upd-DM) spaces; only the entries 80.5, 26.07 and 0.24 are legible.]
4. 5. Summary
In this chapter, we have proposed a process, based on the cosine diffusion map and the singular value decomposition, for clustering documents both on-line and off-line. The experimental evaluation of our approach for classical clustering has not only shown its effectiveness, but has furthermore helped to formulate two postulates, based on the slope discontinuity of the singular values, for choosing the appropriate reduced dimension in the cosine diffusion space and for finding the optimal number of clusters for well separated data. Thus, our approach has shown many advantages compared to other clustering methods in the literature. Firstly, the use of the cosine distance to construct the kernel has experimentally given a better representation of the un-normalized data in the diffusion space than the Gaussian kernel, and has minimized the computational cost by taking advantage of the sparsity of the word-document matrix. Secondly, the running time of the k-means algorithm in the reduced dimension of the diffusion map space is far lower than in the Salton space. Thirdly, we have formulated a simple way to find the right reduced dimension, with no learning phase needed. Fourthly, the estimation of the optimal number of clusters is immediate, unlike approaches where some criterion is optimized as a function of the number of clusters; in addition, our approach indicates this number even when there is just one cluster. Finally, data representation in the cosine diffusion space has shown a non-trivial statistical improvement in the case of multi-topic clustering compared to the representation in the LSA space.
Definition 1: The Concept Vector c of a set of n term vectors term_i (1 ≤ i ≤ n) is their normalized mean (this definition is adapted from [DhM01]).
Given that the n term vectors do not diverge from each other too much, their Concept Vector can be
seen as a normalized representative vector for these n term vectors.
Mathematically, following Definition 1, we have:
c = (1/n) Σ_{i=1}^{n} term_i / || (1/n) Σ_{i=1}^{n} term_i || = Σ_{i=1}^{n} term_i / || Σ_{i=1}^{n} term_i ||        (1)
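As a one-line illustration of Equation (1), assuming the term vectors are given as the rows of a NumPy array:

import numpy as np

def concept_vector(term_vectors):
    # Normalized mean of the term vectors; by Equation (1) this equals the normalized sum.
    c = np.asarray(term_vectors, dtype=float).sum(axis=0)
    return c / np.linalg.norm(c)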
Given that a certain document collection has t terms (keywords), we can use the spherical k-means clustering algorithm to partition these t terms into k clusters. Mathematically, we have:

{term_i : 1 ≤ i ≤ t} = ∪_{j=1}^{k} Cluster_j        (2)

and

Cluster_i ∩ Cluster_j = ∅ for i ≠ j        (3)
Definition 2: The Affinity between a term vector and a cluster of term vectors is the cosine of
the term vector and the Concept Vector of the cluster.
The Affinity between a term and a cluster of terms, with a range of values between -1 and 1 inclusive, indicates how closely (through the absolute value of the Affinity) and in which manner (the Affinity being positive or negative) this term is related to this cluster.
Mathematically, given a term vector term and a cluster Cluster = {term(1), term(2), ..., term(w)}, their Affinity is defined as

affinity(term, Cluster) = (term · c) / (||term|| · ||c||) = term · c,        (4)

since ||term|| = ||c|| = 1, where c = Σ_{i=1}^{w} term(i) / || Σ_{i=1}^{w} term(i) ||.
Definition 3: The Affinity Set between a term vector and a partition of all terms in a document
collection is the set of Affinity values between the said term vector and each cluster of the said partition.
The Affinity Set records a number of Affinity values for a particular term across all the clusters of a
certain partition.
Mathematically, given a term vector term and a partition having k clusters {Cluster_j}_{j=1}^{k}, the Affinity Set AFN between this term and this partition is defined as:

AFN = {affinity(term, Cluster_j) : 1 ≤ j ≤ k}        (5)
Definition 4: The Characteristic Quotient (or CQ) of a term vector with respect to a partition
of all terms in a document collection is the standard deviation of the Affinity Set defined between this
term vector and this partition over the mean of all the members in the said Affinity Set.
The Characteristic Quotient of a term vector with respect to a partition provides a sensible estimate
(educated guess) on how evenly (or unevenly) the meaning of this term participates across all the
clusters of this partition.
Mathematically, given a term vector term, a partition having k clusters {Cluster_j}_{j=1}^{k}, and their Affinity Set AFN = {x_1, ..., x_n}, the Characteristic Quotient of this term vector with respect to this partition is defined as:

CQ = stdv(AFN) / mean(AFN)        (6)

where mean(AFN) = (1/n) Σ_{i=1}^{n} x_i and stdv(AFN) = sqrt( (1 / (n - 1)) Σ_{i=1}^{n} (x_i - x̄)² ).
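A minimal sketch of Definitions 2 through 4, assuming unit-length term vectors as NumPy arrays; the function names are illustrative:

import numpy as np

def affinity(term, cluster_terms):
    # Cosine between the term vector and the cluster's concept vector (Equation (4)).
    c = np.asarray(cluster_terms, dtype=float).sum(axis=0)
    c /= np.linalg.norm(c)
    return float(term @ c) / np.linalg.norm(term)

def characteristic_quotient(term, partition):
    # CQ = stdv(AFN) / mean(AFN) over all clusters of the partition (Equation (6)).
    afn = np.array([affinity(term, cluster) for cluster in partition])
    return afn.std(ddof=1) / afn.mean()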
Now we may formally define Generic Terms and Domain Specific Terms.
Definition 5: For a particular document collection, given all the terms and a meaningful partition
of these terms, the Generic Terms are those terms whose Characteristic Quotients are below the value
of GEN_CUTOFF.
Definition 6: For a particular document collection, given all the terms and a meaningful partition
of these terms, the Domain Specific Terms are those terms whose Characteristic Quotients are above or
equal to the value of GEN_CUTOFF.
The following two points clarify Definition 5 and Definition 6:
(I) The phrase meaningful partition refers to a partition that groups terms in such a way that terms of similar meanings are most likely located in the same cluster of the partition.
(II)
Comparing Definition 5 to the intuitive definition (characterization) of Generic Terms given at the beginning of the current section, we have the following observations:
(I) The new definition has the same spirit as the old one: a meaningful partition of terms into many clusters, stated in the new definition, resembles the sensible grouping of documents into many topics implied by the old one. The old definition was based on the distribution pattern of Generic Terms over a range of document topics; the new one is based on the participation (Affinity) pattern of Generic Terms among a number of term clusters.
(II) The new definition has an advantage over the old one: Definition 5 is a working definition of Generic Terms, which makes it possible to devise an algorithm identifying all the Generic Terms in a given document collection. In the new definition, Characteristic Quotients are mathematically well-defined; a meaningful partition of terms is obtainable through the spherical k-means clustering algorithm; and the value of GEN_CUTOFF can be determined experimentally through trial and error.
The rationale behind the new definition of Generic Terms and Domain Specific Terms is as follows. In a meaningful partition of terms, terms of similar meanings are grouped together cluster by cluster. The Characteristic Quotient of a term vector with respect to this partition indicates how evenly (or unevenly) the meaning of this term relates to the clusters of the partition. The bigger the CQ, the more uneven the relationship, and the stronger the tendency for this term to be categorized as a Domain Specific Term. Conversely, the smaller the CQ, the more even the relationship, and the stronger the tendency for this term to be categorized as a Generic Term. Therefore, the value of CQ may be used to identify a term as a Generic Term or a Domain Specific Term.
It is worth noting that a limited number of terms may sit on the borderline between Generic Terms and Domain Specific Terms, whatever the actual value of GEN_CUTOFF is. Therefore, increasing the value of GEN_CUTOFF may allow some previously categorized borderline-case Domain Specific Terms to be newly identified as Generic Terms, and vice versa.
Practically, Yan used a simpler but equally effective method to avoid the process of determining the actual value of GEN_CUTOFF. He set the goal of identifying a fixed number (say n_g) of generic terms, so that the terms whose Characteristic Quotients are among the lowest n_g of all terms are automatically identified as generic terms, with the rest of the terms simultaneously identified as domain specific ones. In this way, he eliminated the GEN_CUTOFF value without compromising the validity of the generic term identification process.
Initialization (Step (I)): (i) normalize all term vectors term_i, where i ∈ [1, t]; (ii) set t = t0 and loop = 1; (iii) randomly assign the t terms to k clusters, thereby having {Cluster_{0,j}}_{j=1}^{k}; (iv) compute the concept vectors {c_{0,j}}_{j=1}^{k}.

Step (II): for each term_i (i ∈ [1, t]), find the concept vector closest to it,

c_i* = arg max_j (term_i · c_{loop,j}^T) / (||term_i|| · ||c_{loop,j}||),

and assign term_i to the corresponding cluster Cluster_{loop+1,j}.

Step (V): recompute the concept vectors,

c_{loop+1,j} = Σ_{term ∈ Cluster_{loop+1,j}} term / || Σ_{term ∈ Cluster_{loop+1,j}} term ||.

Step (VI): compute improvement, the change of the clustering objective Σ_{j=1}^{k} Σ_{term ∈ Cluster_{loop+1,j}} term · c_{loop,j}^T with respect to the previous iteration.

Step (IX): compute i* = arg min_i stdv(AFN_i) / mean(AFN_i), where i ∈ [1, t].

Step (X): extract term_{i*}, the term with the smallest Characteristic Quotient, as a generic term.
Figure 5.1 represents the extraction of a generic term (i.e., its identification and removal) from the pool of all terms. The old partition of terms is further adjusted by re-running the clustering sub-algorithm before the next generic term is identified; this measure prevents the earlier-identified generic terms from exerting compounding effects on the ones still to be identified. Therefore, there are as many rounds of running the clustering sub-algorithm as there are generic terms to be identified.
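The following is a compact sketch of this loop, assuming the normalized term vectors are the rows of a NumPy array X; the iteration count, the seeding, and the numerical safeguards are illustrative choices, not Yan's exact settings.

import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    # Minimal spherical k-means over unit-length term vectors (rows of X).
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))
    for _ in range(iters):
        C = np.zeros((k, X.shape[1]))
        for j in range(k):
            s = X[labels == j].sum(axis=0)
            C[j] = s / max(np.linalg.norm(s), 1e-12)   # concept vector of cluster j
        labels = np.argmax(X @ C.T, axis=1)            # reassign to the closest concept vector
    return labels, C

def extract_generic_terms(X, k, n_generic):
    # Repeatedly re-cluster, then remove the term with the smallest CQ (Steps (IX)-(X)).
    X, idx, generic = X.copy(), np.arange(len(X)), []
    for _ in range(n_generic):
        _, C = spherical_kmeans(X, k)
        afn = X @ C.T                                  # affinity of every term to every cluster
        cq = afn.std(axis=1, ddof=1) / afn.mean(axis=1)
        i_star = int(np.argmin(cq))
        generic.append(int(idx[i_star]))
        X, idx = np.delete(X, i_star, axis=0), np.delete(idx, i_star)
    return generic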
[Figure 5.1. Flowchart of the generic term extraction algorithm: run the spherical k-means clustering sub-algorithm, extract a generic term, and repeat until the termination criterion is satisfied.]
(I) The spherical k-means clustering algorithm (Step (I) through Step (VI)) is known to converge [DhM01].
(II) Step (VII), Step (VIII) and Step (IX) Case A are sequential procedures.
(III) The termination criterion in Step (IX) Case B guarantees that the steps mentioned in the above two points are iterated no more than a maximum of MAX_GEN_TERM times.
Collection   Native space   Noun phrase space
Cisi         9161           2087
Cran         6633           1914
Med          12173          2769

Table 5.1. Index size in the native and Noun phrase spaces.
In Table 5.2, Table 5.3, and Table 5.4, we report, respectively for the Cisi, Cran, and Med collections, the mean interpolated average precision (MIAP) for several indexes; the number of generic terms excluded from each index is indicated in the tables. We note that all these results are for the best reduced dimension in the training phase of the LSA model.
Extracted generic terms   MIAP   Extracted generic terms   MIAP
0                         0.28   1300                      0.28
100                       0.28   1400                      0.28
200                       0.28   1450                      0.28
300                       0.28   1460                      0.28
400                       0.28   1470                      0.28
500                       0.28   1480                      0.28
600                       0.28   1490                      0.28
700                       0.28   1495                      0.27
800                       0.28   1500                      0.27
900                       0.28   2000                      0.27
1000                      0.28   3000                      0.27
1100                      0.28   3560                      0.28
1200                      0.28   3570                      0.25
Table 5.2. The MIAP measure for the collection Cisi in different indexes.
Extracted generic terms   MIAP   Extracted generic terms   MIAP
0                         0.51   600                       0.50
100                       0.52   700                       0.50
200                       0.51   800                       0.50
300                       0.51   900                       0.50
400                       0.51   1000                      0.50
500                       0.51   1100                      0.50
550                       0.51   1200                      0.50
560                       0.51   1300                      0.50
570                       0.51   1400                      0.50
575                       0.50   1500                      0.48
580                       0.50
Table 5.3. The MIAP measure for the collection Cran in different indexes.
Extracted generic terms   MIAP   Extracted generic terms   MIAP
0                         0.66   1200                      0.66
100                       0.66   1300                      0.66
200                       0.66   1400                      0.66
300                       0.66   1410                      0.66
400                       0.66   1420                      0.66
500                       0.66   1425                      0.65
600                       0.66   1430                      0.65
700                       0.66   1450                      0.65
800                       0.66   1500                      0.65
900                       0.66   2000                      0.65
1000                      0.66   3000                      0.64
1100                      0.66   3500                      0.63
Table 5.4. The MIAP measure for the collection Med in different indexes.
By analyzing these tables, we remark that there is no apparent improvement in the LSA performance: the existing improvement is very small, affecting the third decimal place for the Cisi and Med collections, and the second decimal place, for an improvement of less than 1%, for the Cran collection. Moreover, this improvement is observed only when no more than the first six hundred generic terms are excluded. For this reason, we propose to use the generic term extracting algorithm as a dimensionality reduction technique, keeping the performance achieved in the native space while excluding larger numbers of generic terms. As the tables indicate, with this approach we can exclude respectively 3560, 570, and 1420 terms from the Cisi, Cran, and Med collections, which represent respectively about 38.8%, 8.6%, and 11.7% of the index size in the native space.
On the other hand, by comparing these results to those achieved in Section 3.3.2.1, recalled in Table 5.5, we remark a large trade-off between the index sizes indicated in Table 5.1 and the performances indicated in Table 5.5, especially for the Cran and Med collections.
Collection   Native space   Noun phrase space
Cisi         0.28           0.32
Cran         0.51           0.47
Med          0.66           0.26

Table 5.5. LSA performance in the native and Noun phrase spaces.
The reduction of the dimension may lead to significant savings of computer resources and processing time. However, poor feature selection may dramatically degrade the performance of an information retrieval system. This is clearly observed when NP indexation is used, or when a large number of terms is excluded by the GTE algorithm: for instance, when excluding 3570, 1500, and 3500 generic terms respectively from the Cisi, Cran, and Med collections, we get a performance degradation of 3%. Thus, by removing many terms, the risk of removing potentially useful information about the meaning of the documents becomes larger. It is then clear that, in order to obtain optimal (cost-)effectiveness, the reduction process must be performed with care.
performance has been decreased by 1% before. This remark shows that the exclusion of a specific number of terms, using the GTE algorithm, can positively or negatively affect the LSA concepts because of the interactions between terms.
Although the GTE algorithm has this advantage, it also has a limitation: the elimination of terms cannot proceed fully automatically, because the performance is not monotone.
5. 6. Summary
By proposing to supplement, in the context of information retrieval, the feature transformation method based on singular value decomposition with term selection, we have used Yan's approach. Initially, this approach, which consists in extracting generic terms, was proposed to improve the performance of the LSA model; we have used it instead for reducing the index size. In fact, the exclusion of generic terms has not only reduced the storage requirements but has also shown the capability of influencing a large number of LSA concepts in an unpredictable way.
Chapter 6 Information Retrieval in Arabic Language
6. 1. Introduction
Arabic texts are becoming widely available, but, owing to the specific challenges of the Arabic language, freely available corpora, automatic processing tools, and established standard IR-oriented algorithms for this language are still missing.
In order to develop an Arabic IR system, we believe that improving former systems may yield a predictive model that accelerates their processing and obtains reliable results. So, with the objective of a specific study and a possible performance improvement of Arabic information retrieval systems, we have created an analysis corpus and a reference one, specialized in the environment field, and we have proposed to use the latent semantic analysis method to remedy the problems arising from the vector-space model. We have also studied how linguistic processing and weighting schemes could improve the LSA method, and we have compared the performance of the vector-space model and the LSA approach for the Arabic language.
As is generally known, the Arabic language is complicated for natural language processing due to two main characteristics. The first is the agglutinative nature of the language, and the second is its vowellessness, which causes ambiguity problems at different levels. In this work, we are especially interested in the agglutination problem.
6.2.1. Motivation
In recent work within the framework of information retrieval and automatic processing of the Arabic language, some sizeable newspaper corpora (cf. Section 2.6) have started to become available. However, they are not free, and the topics treated by these corpora remain of a general nature, without covering a specialized scientific field such as the environment. For these two reasons, we have been interested in creating our own corpora.
http://www.google.com/intl/ar/.
[AR-ENV] Corpus
Document Number   1 060
Query Number      30
Token Number      475 148
Type Number       54 705

Table 6.2. Statistics of the [AR-ENV] corpus.
The relevance assessment was also performed manually, by having native Arabic-speaking reviewers read and check the whole document collection. After picking all documents specified as relevant for a given query, a document is admitted as relevant to that particular query by majority rule: i.e., a document is defined as relevant to a particular query only if at least three of the five reviewers agree on its relevance.
The size of this corpus, used in our study as a reference corpus, although still modest, can guarantee that the articles discuss a wide range of subjects and that their content is, to some extent, heterogeneous.
Zipf's law
According to Zipf's law, if we count how often each word occurs in a corpus and then list these words in order of their frequency of occurrence, then the relationship between the frequency f of a given word and its position in the list (its rank r) is governed by a constant k such that f · r = k.
Ideally, a simple graph of the above equation on logarithmic scales shows a straight line with a slope of -1. The situation in the corpus was checked by starting with one file, incrementally adding more files to the corpus, and checking the behavior of the relation between rank and frequency. An enhanced theory of Zipf's law is the Mandelbrot distribution. Mandelbrot notes that although Zipf's formula gives the general shape of the curves, it is poor at reflecting the details [MaS99]. To achieve a closer fit to the empirical distribution of words, Mandelbrot derived the following relation between frequency and rank:

f = P (r + ρ)^(-b)

where P, b, and ρ are parameters of the text that collectively measure the richness of the text's use of words. The common factor is that there is still a hyperbolic relation between rank and frequency, as in the original equation of Zipf's law. If this formula is graphed on doubly logarithmic axes, it closely approximates a straight line descending with slope -b, just as Zipf's law describes (see Figure 6.1).
The graph shows rank on the X-axis versus frequency on the Y-axis, on logarithmic scales. The magenta line corresponds to the ranks and frequencies of the words in all the documents of our corpus. The straight cyan line shows the relationship between rank and frequency predicted by Zipf's formula f · r = k.
Figure 6.1. Zipf's law and word frequency versus rank in the [AR-ENV] collection.
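As a small sketch of how such a rank-frequency check can be computed, assuming a tokenized corpus; the Mandelbrot parameters would then be fitted on the log-log data, e.g. with scipy.optimize.curve_fit:

from collections import Counter
import numpy as np

def rank_frequency(tokens):
    # Word frequencies sorted in decreasing order, paired with ranks 1..V.
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    return ranks, freqs

def mandelbrot(r, P, rho, b):
    # f = P * (r + rho) ** (-b); reduces to Zipf's law when rho = 0 and b = 1.
    return P * (r + rho) ** (-b)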
Fragment length   Bengali (CILL)   English (Brown)   Arabic (Al-Hayat)
100               1.204            1.449             1.19
1 600             2.288            2.576             1.774
6 400             3.309            4.702             2.357
16 000            4.663            5.928             2.771
20 000            5.209            6.341             2.875
1 000 000         10.811           20.408            8.252
Table 6.3. Token-to-type ratios for fragments of different lengths, from various corpora.
The measure is obtained by dividing the number of tokens (the text length) by the number of distinct words (types). It is sensitive to sample size, with lower ratios (i.e. a higher proportion of new words) expected for smaller (and therefore sparser) samples. A 1 000-word article might have a TTR of 2.5; a shorter one might reach 1.3; 4 million words will probably give a token-to-type ratio of about 50, and so on. The factors that influence the TTR of raw textual data include various morphosyntactic features and orthographic conventions (see Table 6.3).
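The measure itself is a one-liner; a minimal sketch:

def token_type_ratio(tokens):
    # Token-to-type ratio: text length divided by vocabulary size.
    return len(tokens) / len(set(tokens))

# For example, token_type_ratio("the cat sat on the mat".split()) gives 6 / 5 = 1.2.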
Other Measures
To measure the lexical richness of the [AR-ENV] corpus, we have also used, in the context of lexical categories, the lexical coverage and the grammatical category distribution measures. For further details on these two metrics, consult Boulaknadel's dissertation [Bou08].
6. 3. Experimental Protocol
In contrast to other languages such as French and English, Arabic is an agglutinative language, in which words take prefixes and suffixes. Moreover, diacritic marks are commonly used in this language. Therefore, an adaptation of the standard information retrieval system, presented in Figure 6.3, is needed, specifically in the preprocessing phase.
[Figure 6.3. The standard information retrieval system: the corpus and the user query undergo natural language processing (tokenization and stemming) before vector-model (VM) indexing and query matching.]
[Figure: the information retrieval system adapted to Arabic: the corpus and the user query undergo moving of diacritics, mutation, tokenization, stop-word removal, and stemming, before vector-model (VM) indexing and query matching.]
6.3.2. Evaluations
With the objective of improving the performance of the proposed system, we evaluate its effectiveness, together with some other suggestions described and discussed below, on both created corpora: the analysis corpus [AR-RS] and the reference corpus [AR-ENV].
Figure 6.5. Comparison between the performances of the LSA model for five weighting schemes.
[Figure: the no-weighting case, panels (a) to (d).]
[Figure: the corpus and the user query undergo moving of diacritics, mutation, tokenization, and stemming, before vector-model (VM) indexing and query matching.]
Figure 6.7. A new information retrieval system suggested for Arabic language.
Figure 6.8. A comparison between the performances of the VSM and the LSA models.
For the reference corpus, the curves of Figure 6.8-a, representing the results for short queries, show a statistically significant difference of 15.90% in favor of the LSA model over the standard VSM, while the curves of Figure 6.8-b, representing the results for long queries, show a gain of 16.30%. Similarly, the experiments performed on the analysis corpus showed an improvement of 12.77% for the LSA model for short queries, and of 13.73% for long queries.
[Figure: the noun-phrase indexing system for Arabic: the corpus and the user query undergo moving of diacritics, transliteration, tokenization, POS tagging, NP extraction, and stemming, before LSA indexing and query matching.]
Buckwalter Transliteration
Given that, at the time this work was carried out, no Arabic POS tagging tool operated directly on Arabic script, we applied the Buckwalter transliteration, which converts Arabic characters into Latin ones.
Part Of Speech Tagging
The part-of-speech (POS) tagging task consists of analyzing texts in order to assign an appropriate syntactic category to each word (noun, verb, adjective, preposition, etc.).
For the Arabic language, many part-of-speech taggers have been developed, which we can classify into different categories. The techniques of the first class are based on tagsets derived from an Indo-European tagset, while the tagsets used in the second category have been derived from traditional Arabic grammatical theory. The taggers of the third class are hybrids based on statistical and rule-based techniques, while in the fourth category machine learning is used.
The works on Arabic tagging that we are aware of are Diab's POS tagger [DHJ04], combining techniques of the first and fourth classes; Arabic Brill's POS tagger [Fre01], using techniques of the first and third categories; and APT [Kho01], based on the second and third class techniques.
In this study, simply to conform to the Base Phrase (BP) Chunker [DHJ04] (see Section A.3.3 for the chunker definition), we have chosen to use Diab's tagger. In this tagger, a large set of Arabic tags has been mapped (by the Linguistic Data Consortium) to a small subset of the English tagset introduced with the English Penn Treebank.
Figure 6.11. Influence of the NP and single-term indexations on the IRS performance.
By comparing the curves of Figure 6.11, we remark that the use of noun phrases in the indexation process degrades the performance relative to the system based on single terms. However, in the third strategy, when we have attempted to remedy the situation by combining single terms and noun phrases, we remark that
c- Discussion
In this section, we discuss the results presented in the previous subsection, and we also attempt to reason about some of their properties. We do so by first giving a small overview of the studies undertaken in the field of NP indexation.
Previous studies for the English, French, and Chinese languages showed that the use of noun phrases for representing document content could improve the effectiveness of an automatic information retrieval system. Mitra et al. [MBS97] showed that re-indexing with noun phrases the first 100 documents retrieved by the SMART system gives a benefit at low recall. However, the TREC campaigns showed that noun phrase indexing approaches do not necessarily enhance retrieval performance, and that any improvement can depend on the size of the collection and on the query topic [Fag87, EGH91, ZTM96]. The results of the PRISE system [SLP97], based on noun phrase extraction by the Tagged Text Parser, are a good example of the difficulty of evaluating the effect of syntagmatic analysis on an IRS, since the performances obtained were not significant.
In line with this, our experimental results show that, for the Arabic language, NP-based indexing decreases the retrieval performance compared to single-term-based indexing. We could explain this drop by the noun phrase size and the lack of normalization: for example, the phrases for air pollution disaster and air pollution should be normalized under air pollution. It could also be explained by the use of a morpho-syntactic parser and chunker based on supervised learning, depending on an annotated corpus rather than on specific syntactic pattern rules.
We think that the use of a morpho-syntactic parser and chunker based on syntactic pattern rules, the use of a part-of-speech tagger based on statistical and rule-based techniques, or a tagset derived from Arabic grammatical theory could resolve the specified problems and be more effective. With this aim, a deep study was undertaken by Boulaknadel [Bou08].
6. 4. Summary
In this chapter, we have presented an evaluation of the vector space model and the LSA method, while performing linguistic processing and using weighting schemes, on the Arabic analysis and reference corpora that we have created for this aim.
The undertaken experiments showed that light stemming increases the performance of the Arabic information retrieval system, especially when the Okapi BM-25 scheme is used, thus confirming the
7. 2. Limitations
Experimental research inherently has limitations. The work presented in this dissertation is limited in
the following main ways:
7. 3. Prospects
There are several directions in which this research can proceed. These directions can be categorized into the following broad areas:
- Automating the generic term extraction algorithm.
- Adapting the generic term extraction algorithm to other ranges of data.
- Applying the diffusion map approach and its results to multimedia data.
- Extending our Arabic reference corpus, and trying to classify its content into non-overlapping groups; this way, it could serve for both retrieval and clustering evaluation.
- Improving our system performance by using the results of the noun phrase study undertaken by Boulaknadel [Bou08], and semantic query expansion.
- Implementing a full Arabic search engine based on the studies undertaken in this dissertation and those planned as further work.
A.2.1. N-grams
An n-gram is a sub-sequence of n items from a given sequence. It is a popular technique in statistical natural language processing. For parsing, words are modeled such that each n-gram is composed of n words. For a sequence of words (for example, "the dog smelled like a skunk"), the 3-grams would be: "the dog smelled", "dog smelled like", "smelled like a", and "like a skunk". For sequences of characters, the 3-grams that can be generated from "good morning" are "goo", "ood", "od ", "d m", " mo", "mor", and so forth. Some practitioners preprocess strings to remove spaces, others do not; in almost all cases, punctuation is removed by preprocessing. N-grams can be used for sequences of words or, in fact, for almost any type of data.
By converting an original sequence of items to n-grams, it can be embedded in a vector space, thus allowing the sequence to be compared to other sequences in an efficient manner.
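A minimal sketch of the extraction, working equally for character and word sequences:

def ngrams(seq, n):
    # All contiguous n-grams of a sequence (a string, a list of words, etc.).
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

# ngrams("good morning", 3) -> ['goo', 'ood', 'od ', 'd m', ' mo', 'mor', ...]
# ngrams("the dog smelled like a skunk".split(), 3) -> word 3-grams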
A.2.2. Tokenization
Tokenization, or word segmentation, is a fundamental task of almost all NLP systems. In languages
A.2.3. Transliteration
Transliteration is the practice of transcribing a word or text written in one writing system into
another writing system. It is also the system of rules for that practice.
Technically, from a linguistic point of view, it is a mapping from one system of writing into another.
Transliteration attempts to be exact, so that an informed reader should be able to reconstruct the original
spelling of unknown transliterated words. To achieve this objective transliteration may define complex
conventions for dealing with letters in a source script which do not correspond with letters in a goal
script.
This is opposed to transcription, which maps the sounds of one language to the script of another
language. Still, most transliterations map the letters of the source script to letters pronounced similarly in
the goal script, for some specific pair of source and goal language.
It is not to be confused with translation, which involves a change in language while preserving
meaning. Here we have a mapping from one alphabet into another.
Specifically for the Arabic language, many transliteration systems are utilized, such as: Deutsche Morgenländische Gesellschaft, adopted by the International Convention of Orientalist Scholars in Rome; ISO/R 233, replaced by ISO 233 in 1984; BS 4280, developed by the British Standards Institute; and SATTS, a one-to-one mapping to Latin Morse equivalents, used by the US military. However, in our work we have used the Buckwalter transliteration.
The Buckwalter Arabic transliteration was developed at Xerox by Tim Buckwalter in the 1990s. It is an ASCII-only transliteration scheme, representing Arabic orthography strictly one-to-one, unlike the more common romanization schemes that add morphological information not expressed in Arabic script. Thus, for example, a waaw will be transliterated as w regardless of whether it is realized as a vowel [u:] or a consonant [w]; only when the waaw is modified by a hamza does the transliteration differ.
[Table: the Buckwalter transliteration codes (Arabic letters to ASCII); the surviving codes include * Z n K z g w u E h a v s f j $ q Y ~ H S k p O x D l F | m N > < & _ }.]
A.2.4. Stemming
A stemmer is a computer program or algorithm which determines a stem form of a given inflected (or, sometimes, derived) word form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.
A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty", etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". English stemmers are fairly trivial (with only occasional problems, such as "dries" being the third-person singular present form of the verb "dry", or "axes" being the plural of "axe" as well as of "axis"); but stemmers become harder to design as the morphology, orthography, and character encoding of the target language become more complex. For example, an Italian stemmer is more complex than an English one (because of more possible verb inflections), a Russian one is more complex still (more possible noun declensions), an Arabic one is even more complex (due to nonconcatenative morphology and a writing system without vowels), and so on.
[Table A.2. Prefixes and suffixes list; the Arabic-script entries are illegible in this copy. Among the surviving transliterations are prefixes such as Al, wAl, fAl, bAl, ll, wA, fA, lA, bA and suffixes such as At, An, hA, hm, hn, km, lm, wm, fm, tm, bm, nA, wn, wh, wy, ly, yp, yn, yh, tA, tk, ty, th.]
The Arabic light stemmer used in this work, Darwish's stemmer as modified by Larkey [LBC02], identifies 3 three-character, 23 two-character, and 5 one-character prefixes, together with 18 two-character and 4 one-character suffixes, that should be removed in stemming. The prefixes and suffixes to be removed are shown in Table A.2.
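As a rough sketch of how such a light stemmer operates, here written over Buckwalter-transliterated strings with a small illustrative subset of the affixes (not the full inventory of Table A.2):

def light_stem(word,
               prefixes=("wAl", "fAl", "bAl", "Al", "wa", "b", "l"),
               suffixes=("hA", "At", "An", "wn", "yn", "yp", "p", "h")):
    # Strip at most one prefix and one suffix, longest first,
    # keeping a remainder of at least two characters.
    for p in sorted(prefixes, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= 2:
            word = word[len(p):]
            break
    for s in sorted(suffixes, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 2:
            word = word[:-len(s)]
            break
    return word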
A.3.1. Root
The root is the primary lexical unit of a word, which carries the most significant aspects of semantic
content and cannot be reduced into smaller constituents. Content words in nearly all languages contain,
and may consist only of, root morphemes. However, sometimes the term root is also used to describe the
word minus its inflectional endings, but with its lexical endings in place. For example, chatters has the
inflectional root or lemma chatter, but the lexical root chat. Inflectional roots are often called stems, and
a root in the stricter sense may be thought of as a monomorphemic stem.
Roots can be either free morphemes or bound morphemes. Root morphemes are essential for
affixation and compounds.
The root of a word is a unit of meaning (morpheme) and, as such, it is an abstraction, though it can
usually be represented in writing as a word would be. For example, it can be said that the root of the
English verb form running is run, or the root of the French verb accordera is accorder, since those words
are clearly derived from the root forms by simple suffixes that do not alter the roots in any way. In
particular, English has very little inflection, and hence a tendency to have words that are identical to
their roots. But more complicated inflection, as well as other processes, can obscure the root; for
example, the root of mice is mouse (still a valid word), and the root of interrupt is, arguably, rupt, which
is not a word in English and appears only in derivational forms (such as disrupt, corrupt, rupture, etc.). The root rupt is written as if it were a word, but it is not one.
This distinction between the word as a unit of speech and the root as a unit of meaning is even more
important in the case of languages where roots have many different forms when used in actual words, as
it is based on rules which determine when an ambiguous word should have a given tag. Like the
stochastic taggers, it has a machine-learning component: the rules are automatically induced from a
previously tagged training corpus.
A.3.3. Chunking
Text chunking (light parsing) is an analysis of a sentence which subsumes a range of tasks. The
simplest is finding noun groups or base NPs. More ambitious systems may add additional chunk
types, such as verb groups, or may seek a complete partitioning of the sentence into chunks of different
types. But they do not specify the internal structure of these chunks, nor their role in the main sentence.
The following example identifies the constituent groups of the sentence He reckons the current
noun; adjectives (the red ball); or complements, in the form of an adpositional phrase (such as: the man
with a black hat), or a relative clause (the books that I bought yesterday).
With f_ij the frequency of term i in document j, gf_i the global frequency of term i in the collection, df_i its document frequency, N the number of documents, and p_ij = f_ij / gf_i, the weighting schemes combine a local factor, a global factor, and a normalization factor:

Code   Description             Expression

Local weight w_local(i, j):
  None (no change)            f_ij
  Binary                      1 if f_ij > 0, 0 otherwise
  Natural log                 log(f_ij + 1)

Global weight w_global(i):
  Normal                      1 / sqrt( Σ_j f_ij² )
  GfIdf                       gf_i / df_i
  Idf                         log2( N / df_i )
  Entropy                     1 + Σ_j (p_ij log p_ij) / log(N)
  EG (Global Entropy)
  ES (Shannon Entropy)        - Σ_{j=1}^{N} p_ij log p_ij
  E1 (1 - Entropy)

Normalization w_norm(j):
  None (no normalization)
  Cosine                      1 / sqrt( Σ_{i=1}^{M} ( w_local(i, j) · w_global(i) )² )
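As an illustration of one common combination from this table, the following is a sketch of the log-entropy weighting of a raw count matrix; the function name and the dense-array assumption are illustrative:

import numpy as np

def log_entropy(F):
    # F: term-document count matrix (terms x documents).
    F = np.asarray(F, dtype=float)
    n_docs = F.shape[1]
    gf = np.maximum(F.sum(axis=1, keepdims=True), 1e-12)     # global frequency gf_i
    p = np.where(F > 0, F / gf, 1.0)                          # p_ij = f_ij / gf_i; log(1) = 0 elsewhere
    g = 1.0 + (p * np.log(p)).sum(axis=1) / np.log(n_docs)    # global Entropy weight
    return np.log(F + 1.0) * g[:, None]                       # local log weight times global weight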
[Figure: the set of retrieved documents and the set of relevant documents within the document collection.]
C.2.1. Precision
Precision is the ratio of the number of relevant documents retrieved to the total number retrieved.
Evaluation Metrics
Precision provides an indication of the quality of the answer set. However, it does not consider the total number of relevant documents. A system might achieve good precision by retrieving ten documents and finding that nine are relevant (a 0.9 precision), but the total number of relevant documents also matters. If there were only nine relevant documents, the system would be a huge success; however, if millions of documents were relevant and desired, this would be a poor result set.
    Precision = (number of relevant documents retrieved) / (total number of documents retrieved).
C.2.2. Recall
Recall considers the total number of relevant documents. It is the ratio of the number of relevant documents retrieved to the total number of documents in the collection that are believed to be relevant.
When the total number of relevant documents in the collection is unknown, an approximation of the
number is obtained.
    Recall = (number of relevant documents retrieved) / (total number of relevant documents in the collection).
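As a minimal illustration, in Python, with the retrieved and relevant documents represented as sets of hypothetical document identifiers:

    # Precision and recall over sets of document identifiers.
    def precision(retrieved, relevant):
        return len(retrieved & relevant) / len(retrieved)

    def recall(retrieved, relevant):
        return len(retrieved & relevant) / len(relevant)

    retrieved = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}   # 10 documents returned
    relevant = {2, 3, 5, 7, 11}                   # 5 relevant documents in the collection
    print(precision(retrieved, relevant))         # 4 relevant retrieved / 10 retrieved = 0.4
    print(recall(retrieved, relevant))            # 4 relevant retrieved / 5 relevant = 0.8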
Precision and Recall are two complementary measures of retrieval performance. For a particular query, it is usually possible to sacrifice one in order to boost the other. For example, lowering the retrieval criteria so that more documents are retrieved will most likely increase Recall; at the same time, however, this strategy will probably also admit many more non-relevant documents into the result, with the likely consequence of decreasing Precision, and vice versa, as illustrated in Figure C.2. Therefore, a balance between these two measures is usually sought to best serve users' needs.
Figure C.2: The trade-off between precision and recall: a restrictive system returns relevant documents but misses many useful ones too, while the ideal system achieves both high precision and high recall.
Precision at n measures the precision after a fixed number n of documents has been retrieved. Alternatively, precision at specific recall levels gives the precision after a given fraction of the relevant documents has been retrieved. The most commonly reported measure is the interpolated precision-recall (IRP) curve, which shows the interaction between precision and recall.
    (R_i, P_i) = ( i / (N + 1), max_{1 ≤ j ≤ m, Recall(j) ≥ R_i} Precision(j) ),        (1)

where m is the total number of retrieved documents, R_i = i / (N + 1) is the given recall at the i-th level, and P_i is the interpolated precision based on this given recall, representing the maximum value of the function Precision(j), with j ranging from 1 to m, such that the function Recall(j) is no less than the given recall R_i. The functions Recall(j) and Precision(j) are defined as follows:
Given a list of the m retrieved documents ranked in descending order according to their relevance scores, let n be the total number of relevant documents in the collection, and let Relevant(j) denote the number of relevant documents among the top j ranked documents. Then

    Recall(j) = Relevant(j) / n    and    Precision(j) = Relevant(j) / j.
Note that, while precision is not defined at a recall of 0, this interpolation rule does define an interpolated value for the recall level 0.
Derived from the IRP curve, a single numerical value T, denoting the area between this curve and the horizontal axis (the recall axis), may be used to roughly estimate the overall retrieval performance for a particular query (Figure C.3). In other words, this single value T, called the Average Precision (AP)35, indicates the average interpolated precision over the full range of recall (i.e. between 0 and 1) for that query.

35 http://trec.nist.gov/pubs/trec10/appendices/measures.pdf
Figure C.3: The average precision T, given by the area between the interpolated precision-recall curve and the recall axis.
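A minimal Python sketch of the interpolation and of the resulting average precision follows; the binary relevance judgements of the ranked list are hypothetical, and the recall levels are fixed at the usual eleven points 0.0, 0.1, ..., 1.0 rather than at the i/(N+1) levels of equation (1).

    def interpolated_points(ranked_rels, n_relevant, levels=11):
        """Interpolated (recall, precision) points for a ranked list of
        binary relevance judgements (n_relevant > 0 assumed)."""
        hits, rec, prec = 0, [], []
        for j, is_rel in enumerate(ranked_rels, start=1):
            hits += is_rel
            rec.append(hits / n_relevant)   # Recall(j) = Relevant(j) / n
            prec.append(hits / j)           # Precision(j) = Relevant(j) / j
        points = []
        for i in range(levels):
            r = i / (levels - 1)            # recall levels 0.0, 0.1, ..., 1.0
            at_least_r = [p for p, rc in zip(prec, rec) if rc >= r]
            points.append((r, max(at_least_r) if at_least_r else 0.0))
        return points

    ranked = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]          # judgements of the top-10 ranks
    points = interpolated_points(ranked, n_relevant=4)
    ap = sum(p for _, p in points) / len(points)     # the average interpolated precision T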
C.3.1. Accuracy
Accuracy (Acc) is the degree of veracity. It is closely related to precision, also called reproducibility or repeatability, which is the degree to which further measurements or calculations will show the same or similar results. The result of a calculation or a measurement can be accurate but not precise, precise but not accurate, neither, or both; a result is called valid if it is both accurate and precise. Mathematically, accuracy is defined as follows:
Let l_i be the label assigned to document d_i by the clustering algorithm, and α_i be d_i's actual label in the corpus. Then, accuracy is defined as

    Acc = ( Σ_{i=1}^{n} δ(map(l_i), α_i) ) / n,

where δ(x, y) = 1 if x = y, δ(x, y) = 0 otherwise, and
map(l_i) is the function that maps the output label set of the clustering algorithm to the actual label set of
the corpus. Given the confusion matrix of the output, a best such mapping function can be efficiently
found by Munkres's algorithm [Mun57].
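As an illustration, here is a minimal Python sketch of the accuracy computation, assuming integer labels in 0..k-1 and using scipy's linear_sum_assignment, an implementation of the Hungarian (Munkres) method, to find the best mapping; the label arrays are hypothetical.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def clustering_accuracy(pred, true):
        """Accuracy of cluster labels `pred` against actual labels `true`,
        both sequences of integers in 0..k-1."""
        pred, true = np.asarray(pred), np.asarray(true)
        k = int(max(pred.max(), true.max())) + 1
        confusion = np.zeros((k, k), dtype=int)   # confusion[l, a] = |{i : l_i = l, alpha_i = a}|
        for l, a in zip(pred, true):
            confusion[l, a] += 1
        rows, cols = linear_sum_assignment(-confusion)   # maximize the total agreement
        return confusion[rows, cols].sum() / len(pred)

    print(clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))   # 1.0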
A second metric is the mutual information between the output label set L of the clustering algorithm and the actual label set A of the corpus:

    MI(L, A) = Σ_{l_i ∈ L, α_j ∈ A} P(l_i, α_j) · log₂( P(l_i, α_j) / (P(l_i) · P(α_j)) ),
where P(l_i) and P(α_j) are the probabilities that a document is labeled as l_i by the algorithm and as α_j in the actual corpus, respectively, and P(l_i, α_j) is the probability that these two events occur together. These values can be derived from the confusion matrix. We map the MI metric to the [0, 1] interval by normalizing it with the maximum possible MI that can be achieved with the corpus; the normalized MI is defined as

    NMI(L, A) = MI(L, A) / MI(A, A).
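A minimal Python sketch of these two quantities, estimating the probabilities by simple counting over the label arrays, is given below; it is an illustration of the definitions, not the exact implementation used here, and it assumes the corpus contains more than one actual class (so that MI(A, A) > 0).

    import numpy as np

    def mutual_information(pred, true):
        """MI(L, A), with probabilities estimated from label counts."""
        pred, true = np.asarray(pred), np.asarray(true)
        mi = 0.0
        for l in np.unique(pred):
            for a in np.unique(true):
                p_la = np.mean((pred == l) & (true == a))   # P(l_i, alpha_j)
                if p_la > 0:
                    p_l, p_a = np.mean(pred == l), np.mean(true == a)
                    mi += p_la * np.log2(p_la / (p_l * p_a))
        return mi

    def normalized_mi(pred, true):
        return mutual_information(pred, true) / mutual_information(true, true)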
The cosine of the angle between two vectors x and y is given by cos θ(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖).
Consider F and G, two subspaces of R^d. A set of angles between these two subspaces, known as the principal (or canonical) angles, can be defined recursively. Let two real-valued matrices F and G be given, each with d rows, and let F and G denote their corresponding column spaces, which are subspaces of R^d. Assume that

    p = dim(F) ≥ dim(G) = q ≥ 1.

Then the principal angles θ_l ∈ [0, π/2] between F and G may be defined recursively for l = 1, 2, ..., q by

    cos θ_l = max_{f ∈ F} max_{g ∈ G} f^T g,

subject to the constraints ‖f‖ = 1, ‖g‖ = 1, f^T f_j = 0, and g^T g_j = 0 for j = 1, 2, ..., l-1. The vectors (f_1, ..., f_q) and (g_1, ..., g_q) are called the principal vectors of the pair of subspaces. Intuitively, θ_1 is the angle between the two closest unit vectors f_1 ∈ F and g_1 ∈ G; θ_2 is the angle between the two closest unit vectors f_2 ∈ F and g_2 ∈ G such that f_2 and g_2 are orthogonal to f_1 and g_1, respectively. Continuing in this manner, always searching in the subspaces orthogonal to the principal vectors already found, the complete set of principal angles and principal vectors is obtained. The average cosine of the principal angles between the subspaces F and G is written as

    (1/q) Σ_{l=1}^{q} cos θ_l.
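Numerically, the principal angles are usually obtained from orthonormal bases and the singular value decomposition, following Bjorck and Golub [BjG73]: the singular values of Q_F^T Q_G, where Q_F and Q_G are orthonormal bases of F and G, are the cosines of the principal angles. A minimal Python sketch, in which the two matrices are hypothetical full-column-rank inputs:

    import numpy as np

    def principal_angle_cosines(F, G):
        """Cosines of the principal angles between the column spaces of
        F and G (both assumed to have full column rank)."""
        Qf, _ = np.linalg.qr(F)                          # orthonormal basis of span(F)
        Qg, _ = np.linalg.qr(G)                          # orthonormal basis of span(G)
        s = np.linalg.svd(Qf.T @ Qg, compute_uv=False)   # cos(theta_1) >= ... >= cos(theta_q)
        return np.clip(s, 0.0, 1.0)                      # guard against rounding errors

    F = np.random.rand(10, 3)                            # span(F), a subspace of R^10 with p = 3
    G = np.random.rand(10, 2)                            # span(G), a subspace of R^10 with q = 2
    average_cosine = principal_angle_cosines(F, G).mean()   # (1/q) * sum_l cos(theta_l)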
References
[AaE99]
K. Aas, and L. Eikvil, Text Categorisation: A Survey, Technical Report, June 1999, Norwegian
Computing Center.
[Abd04]
A. Abdelali, Localization in Modern Standard Arabic, Journal of the American Society for
Information Science and Technology, Vol. 55, No.1 (2004), pp. 23-28.
[ACS04]
[Abd87]
[AAE99]
H. Abu-Salem, M. Al-omari, and M. Evens, Stemming methodologies over individual query words
for an Arabic information retrieval system, Journal of the American Society for Information Science,
Vol. 50, No. 6 (1999), pp. 524-529.
[AMC05] Z. Abu Bakar, M. Mat Deris, and A. Che Alhadi, Performance Analysis of Partitional and
Incremental Clustering, Seminar Nasional Aplikasi Teknologi Informasi 2005, Yogyakarta,
Indonesia, June, 2005.
[AGG98]
[Ala90]
M.A. Al-Atram, Effectiveness of Natural Language in Indexing and Retrieving Arabic Documents,
King Abdulaziz City for Science and Technology Project number AR-8-47, Riyadh, Saudi Arabia,
1990.
[AlA89]
[Alg87]
M. Al-Gasimi, Arabization of the MINISIS System, In Proceedings of the First King Saud University
Symposium on Computer Arabization, Riyadh, Saudi Arabia, April, 1987, pp. 13-26.
[AlF02]
M. Al-Jlayl, and O. Frieder, On Arabic search: Improving the retrieval effectiveness via light
stemming approach, In Proceedings of the 11th ACM International Conference on Information and
Knowledge Management, 2002, pp. 340-347.
[AlE94]
I.A. Al-Kharashi, and M. W. Evens, Comparing Words, Stems, and Roots as Index Terms in an
Arabic Information Retrieval System, Journal of the American Society for Information Science, Vol.
45, No. 8 (1994), pp. 548-560.
[Als96]
M. Al-Saeedi, Awdah Almasalik ila Alfiyat Ibn Malek, Dar ihyaa al oloom, Beirut, Lebanon, 1999.
[AlA04a]
[AlA04b] L. Al-Sulaiti, and E. Atwell, Designing and Developing a Corpus of Contemporary Arabic, In
Proceedings of the 6th Teaching and Language Corpora Conference, Granada, Spain, 2004, p. 92.
[AlA05] L. Al-Sulaiti, and E. Atwell, Extending the Corpus of Contemporary Arabic, In Proceedings of
Corpus Linguistics Conference, Vol. 1, No. 1 (2005), pp. 15-24.
[All07]
M. P. Allen, The t test for the simple regression coefficient, Chapter in Understanding Regression
Analysis, Springer US, 1997, pp. 66-70.
[Ama02]
M. Amar, Les Fondements théoriques de l'indexation : une approche linguistique, ADBS éditions,
Paris, France, 2000.
[AmR02]
[Arg03]
M.E. Argentati, Principal Angles between Subspaces as Related to Rayleigh Quotient and Rayleigh-Ritz Inequalities with Applications to Eigenvalue Accuracy and an Eigenvalue Solver, Ph.D. Dissertation, University of Colorado, USA, 2003.
[Ars04]
[ABE05]
[ABE06]
F. Ataa Allah, S. Boulaknadel, A. El Qadi, and D. Aboutajdine, Arabic Information Retrieval System
Based on Noun Phrases, Information and Communication Technologies, Vol. 1, Damascus, Syria, April 24-28, 2006, pp. 1720-1725.
[ABE08]
[Att00]
[BaH76]
[BCB92]
B. T. Bartell, G. W. Cottrell, and R. K. Belew, Latent Semantic Indexing is an Optimal Special Case
of Multidimensional Scaling, Proceedings of the 15th Annual International ACM SIGIR Conference
on Research and Development in Information retrieval, 1992, pp. 161-167.
[BeK03]
J. Becker, and D. Kuropka, Topic-based Vector Space Model, In Proceedings of the 6th
International Conference on Business Information Systems, Colorado Springs, June, 2003, pp. 7-12.
[Bec59]
M. Beckner, The Biological Way of Thought, Columbia University Press, New York, 1959.
[Bee96]
K. R. Beesley, Arabic Finite-State Morphological Analysis and Generation, In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), Vol. 1, 1996, pp. 89-94.
[BeC87]
[BeN03]
M. Belkin and P. Niyogi, Laplacian Eigenmaps for Dimensionality Reduction and Data Representation, Neural Computation, Vol. 15, No. 6 (2003), pp. 1373-1396.
[BeB99]
M. W. Berry, and M. Browne, Understanding Search Engines: Mathematical Modeling and Text
Retrieval, Siam Book Series: Software, Philadelphia, 1999.
[BDJ99]
M. W. Berry, Z. Drmac, and E. R. Jessup, Matrices, Vector Spaces, and Information Retrieval,
Society for Industrial and Applied Mathematics Review, Vol. 41, No. 2 (1999), pp. 335-362.
[BDO95]
M. W. Berry, S. T. Dumais, and G. W. O'Brien, Using Linear Algebra for Intelligent Information
Retrieval, Society for Industrial and Applied Mathematics Review, Vol. 37, No. 4 (1995), pp. 573-595.
[BeF96]
[BjG73]
A. Bjorck, and G. Golub, Numerical Methods for Computing Angles between Linear Subspaces,
Journal of Mathematics of Computation, Vol. 27, No. 123 (1973), pp. 579-594.
[Bla06]
A. Blansché, Classification non Supervisée avec Pondération d'Attributs par des Méthodes Évolutionnaires, Ph.D. Dissertation, Louis Pasteur University - Strasbourg I, September, 2006.
[BlL97]
A. Blum, and P. Langley, Selection of Relevant Features and Examples in Machine Learning,
Journal of Artificial Intelligence, Vol. 97 (1997), pp. 245-271.
[Boo80]
[BoG97]
I. Borg, and P. Groenen, Modern Multidimensional Scaling: Theory and Applications, Springer-Verlag, New York, USA, 1997.
[BoA05]
S. Boulaknadel, and F. Ataa Allah, Recherche d'Information en Langue Arabe : Influence des Paramètres Linguistiques et de Pondération de LSA, In Actes des Rencontres des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL), Paris Dourdan, Vol. 1 (2005), pp. 643-648.
[Bou08]
[BuH29]
[BuK81]
D. Buell, and D. H. Kraft, Threshold Values and Boolean Retrieval Systems, Journal of Information
Processing and Management, Vol. 17, No. 3 (1981), pp. 127-136.
[Can93]
[CaD90]
F. Can, and N.D. Drochak II, Incremental Clustering for Dynamic Document Databases, In
Proceedings of the 1990 Symposium on Applied Computing, 1990, pp. 61-67.
[Cha94]
B.B. Chaudhuri, Dynamic Clustering for Time Incremental Data, Pattern Recognition Letters, Vol.
15, No. 1 (1994), pp. 27-34.
[Chu97]
F.R.K. Chung, Spectral Graph Theory, Conference Board of the Mathematical Sciences, Regional Conference Series in Mathematics, No. 92, May, 1997.
[ChH89]
K. Church and P. Hanks, Word Association Norms, Mutual Information, and Lexicography, In
Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, 1989, pp.
76-83.
[CRJ03]
J. Clech, R. Rakotomalala, and R. Jalam, Sélection multivariée de termes, In Proceedings of the 35th Journées de Statistique, Lyon, France, 2003, pp. 933-936.
[CoL06a]
R.R. Coifman and S. Lafon, Diffusion Maps, Applied and Computational Harmonic Analysis, Vol.
21, No. 1 (2006), pp. 6-30.
[CoL06b] R.R. Coifman and S. Lafon, Geometric Harmonics: A Novel Tool for Multiscale Out-of-Sample
Extension of Empirical Functions, Applied and Computational Harmonic Analysis, Vol. 21, No. 1
(2006), pp. 31-52.
[CLL05]
R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. Zucker, Geometric
Diffusions as a Tool for Harmonic Analysis and Structure Definition of Data: Diffusion Maps,
Proceedings of the National Academy of Sciences, Vol. 102, No. 21 (2005), pp. 7426-7431.
[Com94]
[CLR98]
[Cro77]
W.B. Croft, Clustering large files of documents using the single link method, Journal of the
American Society for Information Science, Vol. 28 (1977), pp. 341-344.
[Cro72]
D. Crouch, A clustering algorithm for large and dynamic document collections, Ph.D. Dissertation,
Southern Methodist University, 1972.
[CuW85]
J.K. Cullum, and R.A. Willoughby, Lanczos algorithms for large symmetric eigenvalue computations
Vol. 1: Theory (Chapter 5: Real Rectangular Matrices), Birkhäuser, Boston, 1985.
[DaL97]
M. Dash, and H. Liu, Feature Selection for Classification, Journal of Intelligent Data Analysis, Vol.
1, No. 1-4 (1997), pp. 131-156.
[Dar02]
K. Darwish, Building a Shallow Arabic Morphological Analyzer in One Day, In Proceedings of the
Association for Computational Linguistics, 2002, pp. 47-54.
[Dar03]
K. Darwish, Probabilistic Methods for Searching OCR-Degraded Arabic Text, Doctoral Dissertation,
University of Maryland, College Park, Maryland, 2003.
[DDJ01]
[Dat71]
R.T. Dattola, Experiments with a fast clustering algorithm for automatic classification, In The
SMART Retrieval System-Experiments in Automatic Document Processing, G. Salton Edition,
Prentice-Hall, Englewood Cliffs, New Jersey, 1971, Chap. 12.
[DaB79]
D.L. Davies, and D.W. Bouldin, A Cluster Separation Measure, IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol. 1, No. 2 (1979), pp. 224-227.
[DDF89]
S. Deerwester, S.T. Dumais, G.W. Furnas, R.A. Harshman, T.K. Landauer, K.E. Lochbaum, and L.A. Streeter, Computer information retrieval using latent semantic structure, U.S. Patent No. 4,839,853, 1989.
[DDF90]
S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman, Indexing by Latent
Semantic Analysis, Journal of the American Society for Information Science, Vol. 41, No. 6 (1990),
pp. 391-407.
[DKN03]
I.S. Dhillon, J. Kogan, and M. Nicholas, Feature Selection and Document Clustering, In M.W.
Berry, editor, Survey of Text Mining, Springer-Verlag, 2003.
[DhM99]
I.S. Dhillon, and D.S. Modha, Concept Decompositions for Large Sparse Text Data using
Clustering, Technical Report RJ 10147 (95022), IBM Almaden Research Center, 1999.
[DhM00]
I.S. Dhillon and D.S. Modha, A parallel data-clustering algorithm for distributed memory
multiprocessors, In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence, Vol.
1759 (2000), pp. 245-260.
[DhM01]
I.S. Dhillon and D.S. Modha, Concept Decompositions for Large Sparse Text Data using
Clustering, Machine Learning, Vol. 42, No. 1-2 (2001), pp. 143-175.
[DHJ04]
M. Diab, K. Hacioglu, and D. Jurafsky, Automatic Tagging of Arabic Text: from Raw Text to Base
Phrase Chunks, In Proceedings of the Human Language Technology conference and the North
American Chapter of the Association for Computational Linguistics, Boston, Massachusetts, USA,
May, 2004, pp. 149-152.
[Did73]
[DHZ01]
C. Ding, X. He, H. Zha, M. Gu, and H. Simon, A min-max cut algorithm for graph partitioning and data clustering, In Proceedings of the IEEE International Conference on Data Mining, 2001, pp. 107-114.
[DHZ02]
C. Ding, X. He, H. Zha, M. Gu, and H. Simon, Adaptive Dimension Reduction for Clustering High
Dimensional Data, In Proceedings of the 2nd International IEEE Conference on Data Mining,
December, 2002, pp. 147-154.
[Din99]
C. H. Ding, A Similarity-based Probability Model for Latent Semantic Indexing, Proceedings of the
22nd ACM SIGIR Conference, August, 1999, pp. 59-65.
[Din01]
[DoG03]
D.L. Donoho, and C. Grimes, Hessian Eigenmaps: New Locally Linear Embedding Techniques for
High-Dimensional Data, Proceedings of the National Academy of Sciences, Vol. 100, No. 10 (2003), pp.
5591-5596.
[DuH73]
R.O. Duda, and P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, NY, USA,
1973.
[Dum91]
S. Dumais, Improving the Retrieval of Information from External Sources, Behavior Research
Methods, Instruments, & Computers, Vol. 23, No. 2 (1991), pp. 229-236.
[Dum92]
[Dum94]
S. Dumais, Latent Semantic Indexing (LSI) and TREC-2, Technical Memorandum TM-ARH-023878, Bellcore, 1994.
[Dun03]
M. H. Dunham, Data Mining: Introductory And Advanced Topics, New Jersey: Prentice Hall, 2003.
[Dun89]
G. H. Dunteman, Principal Component Analysis, Sage Publications, Newbury Park, California, USA,
1989.
[Egg04]
L. Egghe, Vector Retrieval, Fuzzy Retrieval and the Universal Fuzzy IR Surface for IR Evaluation,
Journal of Information Processing and Management, Vol. 40, No. 4 (2004), pp. 603-618.
[EGH91]
[Fag87]
[FaO95]
C. Faloutsos, and D.W. Oard, A Survey of Information Retrieval and Filtering Methods, Technical
Report CS-TR-3514, Department of Computer Science, University of Maryland, College Park, 1995.
[FiB02]
R.D. Fierro and M.W. Berry, Efficient Computation of the Riemannian SVD in Total Least Squares
Problems in Information Retrieval, in S. Van Huffel and P. Lemmerling (Eds.), Total Least Squares
and Errors-in-Variables Modeling: Analysis, Algorithms, and Applications, Kluwer Academic
Publishers, 2002, pp. 349-360.
[For03]
G. Forman, An Extensive Empirical Study of Feature Selection Metrics for Text Classification,
Journal of Machine Learning Research, Vol. 3 (2003), pp. 1289-1305.
[FrB92]
W. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures & Algorithms,
Prentice Hall, Englewood Cliffs, New Jersey, 1992.
[Fre01]
A. Freeman, Brill's POS Tagger and a Morphology Parser for Arabic, In Proceedings of the 39th
Annual Meeting of Association for Computational Linguistics and the 10th Conference of the
European Chapter, Workshop on Arabic Language Processing: Status and Prospects, Toulouse,
France, July, 2001.
[Fri73]
[FuT04]
B. Fuglede, and F. Topsoe, Jensen-Shannon Divergence and Hilbert Space Embedding, In IEEE
International Symposium on Information Theory, July, 2004, pp. 31.
[GeO01]
F.C. Gey, and D.W. Oard, The TREC-2001 Cross-Language Information Retrieval Track: Searching
Arabic Using English, French or Arabic Queries, In Proceedings of the 2001 Text Retrieval
Conference, National Institute of Standards and Technology, November, 2001, pp. 16-26.
[GoR71]
G. Golub, and C. Reinsch, Handbook for Automatic Computation II, Linear Algebra, Springer-Verlag, New York, 1971.
[GoV89]
G. Golub, and C. Van Loan, Matrix Computations, Johns-Hopkins, Baltimore, Maryland, 2nd Edition,
1989.
[GoD01]
[GPD04]
[GoR69]
J. C. Gower, and G. J. S. Ross, Minimum Spanning Trees and Single-Linkage Cluster Analysis, Applied Statistics, Vol. 18, No. 1 (1969), pp. 54-64.
[GRG97]
[GuB06]
S. Guérif, and Y. Bennani, Selection of Clusters Number and Features Subset during a two-levels Clustering Task, In Proceedings of the 10th IASTED International Conference on Artificial Intelligence
and Soft Computing, August, 2006, pp. 28-33.
[GuB07]
[GBJ05]
S. Guérif, Y. Bennani, and E. Janvier, -som: Weighting Features During Clustering, In Proceedings
of the 5th Workshop On Self-Organizing Maps, September, 2005, pp. 397-404.
[GuE03]
I. Guyon, and A. Elisseeff, An Introduction to Variable and Feature Selection, Journal of Machine
Learning Research, Vol. 3 (2003), pp. 1157-1182.
[HaK03]
K.M. Hammouda, and M.S. Kamel, Incremental Document Clustering Using Cluster Similarity
Histograms, In Proceedings of the IEEE International Conference on Web Intelligence, June, 2003,
pp. 597-601.
[HaK92]
L. Hagen, and A.B. Kahng, New Spectral Methods for Ratio Cut Partitioning and Clustering, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 11, No. 9, September, 1992, pp. 1074-1085.
Clustering Algorithms for Topical Document Clustering, In Proceedings of the 23rd Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval,
Athens, Greece, July, 2000, pp. 224-231.
[HeP96]
[HBL94]
[HiK99]
[HKE97]
I. Hmeidi, K. Kanaan, and M. Evens, Design and Implementation of Automatic Indexing for
Information Retrieval with Arabic Documents, Journal of the American Society for Information
Science, Vol. 48, No. 10 (1997), pp. 867-881.
[Yan05]
H. Yan, Techniques for Improved LSI Text Retrieval, Ph.D. Dissertation, Wayne State University,
Detroit, Michigan, USA, 2005.
[Yan08]
H. Yan, W. I. Grosky, and F. Fotouhi, Augmenting the power of LSI in text retrieval: Singular value
rescaling, Journal of Data and Knowledge Engineering, Vol. 65 (2008), pp. 108-125.
[HNR05]
J.Z. Huang, M.K. Ng, H. Rong, and Z. Li, Automated Variable Weighting in k-means Type
Clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 5 (2005),
pp. 657-668.
[HSD00] P. Husbands, H. Simon, and C. H. Ding, On the Use of the Singular Value Decomposition for Text Retrieval, In M. W. Berry, editor, Computational Information Retrieval, Society for Industrial and Applied Mathematics, Philadelphia, PA, 2001, pp. 145-156.
[Ibn90]
[JaD88]
A. Jain, and R. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, N.J., 1988.
[JMF99]
A.K. Jain, M.N. Murty, and P.J. Flynn, Data clustering: a review, ACM Computing Surveys, Vol.
31, No. 3 (1999), pp. 264-323.
[JaZ97]
A. Jain, and D. Zongker, Feature Selection: Evaluation, Application, and Small Sample
Performance, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 2
(1997), pp. 153-158.
[JaS71]
N. Jardine, and R. Sibson, Mathematical Taxonomy, Wiley, London and New York, 1971.
[Jia97]
J. Jiang, Using Latent Semantic Indexing for Data Mining, MS Thesis, Department of Computer
Science, University of Tennessee, December, 1997.
[Jia98]
E.P. Jiang, Information retrieval and Filtering Using the Riemannian SVD, Ph.D. Dissertation,
Department of Computer Science, University of Tennessee, August, 1998.
[JKP94]
G.H. John, R. Kohavi, and K. Pfleger, Irrelevant features and the subset selection problem, In
Proceedings of the 11th International Conference on Machine Learning, San Francisco, CA, USA,
1994, pp. 121-129.
[Jon72]
K.S. Jones, A Statistical Interpretation of Term Specificity and its Application in Retrieval, Journal
of Documentation, Vol. 28, No. 1 (1972), pp. 11-21.
[KaR90]
[Kin67]
B. King, Step-wise clustering procedures, Journal of the American Statistical Association, Vol. 62
(1967), pp. 86-101.
[KlJ04]
I.A. Klampanos, and J.M. Jose, An Architecture for Information Retrieval over Semi-Collaborating
Peer-to-Peer Networks, In Proceedings of the 2004 ACM Symposium on Applied Computing, Vol. 2
(2004), Nicosia, Cyprus, March, pp. 1078-1083.
[KJR06]
I.A. Klampanos, J.M. Jose, and C.J.K. van Rijsbergen, Single-Pass Clustering for Peer-to-Peer
Information Retrieval: The Effect of Document Ordering, Proceedings of the 1st International
Conference on Scalable information Systems, Hong Kong, May, 2006, Article 36.
[Kho01]
S. Khoja, APT: Arabic Part-of-speech Tagger, In Proceedings of the Student Workshop at the 2nd
Meeting of the North American Chapter of the Association for Computational Linguistics, 2001, pp.
20-25.
[KhG99]
S. Khoja, and R. Garside, Stemming Arabic text, Technical Report, Computing Department,
Lancaster University, Lancaster, September, 1999.
[KiR92]
K. Kira, and L. A. Rendell, A Practical Approach to Feature Selection, In Proceedings of the 9th
International Conference on Machine Learning, San Francisco, CA, USA, 1992, pp. 249-256.
[KoJ97]
R. Kohavi, and G. H. John, Wrappers for feature subset selection, Journal of Artificial Intelligence,
Vol. 97, No. 1-2 (1997), pp. 273-324.
[KoO96]
T.G. Kolda, and D.P. O'Leary, Large Latent Semantic Indexing via a Semi-Discrete Matrix
Decomposition, Technical Report, No. UMCP-CSD CS-TR-3713, Department of Computer Science,
Univ. of Maryland, 1996.
[KoO98]
T.G. Kolda, and D.P. O'Leary, A Semi-Discrete Matrix Decomposition for Latent Semantic Indexing
in Information Retrieval, ACM Transactions on Information Systems, Vol. 16, No. 4 (1998), pp. 322-346.
[KoS96]
D. Koller, and M. Sahami, Toward Optimal Feature Selection, In Proceedings of the 13th
International Conference on Machine Learning, 1996, pp. 284-292.
[KWX01] B. Krishnamurthy, J. Wang, and Y. Xie, Early Measurements of a Cluster-Based Architecture for
P2P Systems, Internet Measurement Workshop, ACM SIGCOMM, San Francisco, USA, November,
2001.
[KuL51]
[LaL06]
S. Lafon, and A.B. Lee, Diffusion Maps and Coarse-Graining: A Unified Framework for
Dimensionality Reduction, Graph Partitioning, and Data Set Parameterization, IEEE Transactions on
Pattern Analysis and Machine Intelligence, Vol. 28, No. 9 (2006), pp. 1393-1403.
[LFL98]
T.K. Landauer, P. W. Foltz, and D. Laham, An Introduction to Latent Semantic Analysis, Discourse
Processes, Vol. 25 (1998), pp. 259-284.
[Lap00]
E. Laporte, Mot et niveau lexical, Ingénierie des langues, 2000, pp. 25-46.
[LBC02]
L. S. Larkey, L. Ballesteros, and M. Connell, Improving Stemming for Arabic Information Retrieval
: Light Stemming and Cooccurrence Analysis, In Proceedings of the 25th Annual International
Conference on Research and Development in Information Retrieval (SIGIR 2002), Tampere, Finland,
August 2002, pp. 275-282.
[LaM71]
D. N. Lawley, and A. E. Maxwell, Factor Analysis as a Statistical Method, 2nd edition, American
Elsevier Publication, New York, USA, 1971.
[Law03]
N. D. Lawrence, Gaussian Process Latent Variable Models for Visualisation of High Dimensional
Data, In Proceedings of Neural Information Processing Systems, December, 2003.
[Lee94]
J.H. Lee, Properties of Extended Boolean Models in Information Retrieval, Proceedings of the 17th
Annual International ACM SIGIR Conference, Dublin, Ireland, 1994, pp. 182-190.
[Ler99]
[Let96]
T.A. Letsche, Toward Large-Scale Information Retrieval Using Latent Semantic Indexing, MS
Thesis, Department of Computer Science, University of Tennessee, August 1996.
[LeB97]
T.A. Letsche, and M.W. Berry, Large-Scale Information Retrieval with Latent Semantic Indexing,
Information Sciences, Vol. 100, No. 1-4 (1997), pp. 105-137.
[Leu01]
[Lit69]
B. Litofsky, Utility of automatic classification systems for information storage and retrieval, Ph.D.
Dissertation, University of Pennsylvania, 1969.
[LiM98]
H. Liu, and H. Motoda, Feature Selection for Knowledge Discovery & Data Mining, The Kluwer
International Series in Engineering and Computer Science, Kluwer Academic Publishers, Boston,
USA, 1998.
[Mac67]
[MaL01]
V. Makarenkov, and P. Legendre, Optimal Variable Weighting for Ultrametric and Additive Trees
and k-means Partitioning: Methods and Software, Journal of Classification, Vol. 18 (2001), pp.
245-271.
[MAS03]
J. Makkonen, H. Ahonen-Myka, and M. Salmenkivi, Topic Detection and Tracking with Spatio-Temporal Evidence, In Proceedings of the 25th European Conference on Information Retrieval
Research, 2003, pp. 251-265.
[MaS99]
C. Manning, and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press,
Cambridge, MA, 1999.
[MeS01]
[Mil02]
A. Miller, Subset Selection in Regression, 2nd edition, Chapman & Hall/CRC, 2002.
[MBS97]
M. Mitra, C. Buckley, A. Singhal, and C. Cardie, An Analysis of Statistical and Syntactic Phrases, In Proceedings of the 5ème Conférence de Recherche d'Information Assistée par Ordinateur, Montreal, Canada, June, 1997, pp. 200-214.
[MMP02] P. Mitra, C.A. Murthy, and S.K. Pal, Unsupervised Feature Selection Using Feature Similarity,
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 3 (2002), pp. 301-312.
[Mun57]
J. Munkres, Algorithms for the Assignment and Transportation Problems, Journal of the Society for
Industrial and Applied Mathematics, Vol. 5, No. 1 (1957), pp. 32-38.
[MuA04]
S. H. Mustafa, and Q. A. Al-Radaideh, Using N-grams for Arabic Text Searching, Journal of the
American Society for Information Science and Technology, Vol. 55, No. 11, September, 2004, pp.
1002-1007.
[NJW02]
[NiC05]
M. Nikkhou, and K. Choukri, Report on Survey on Arabic Language Resources and Tools in
Mediterranean Countries, ELDA, NEMLAR, 2005.
[Obr94]
G. W. O'Brien, Information Management Tools for Updating an SVD-Encoded Indexing Scheme, Master's Thesis, The University of Tennessee, Knoxville, TN, 1994.
[PLL01]
J.M. Pena, J.A. Lozano, P. Larranaga, and I. Inza, Dimensionality Reduction in Unsupervised
Learning of Conditional Gaussian Networks, IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. 23, No. 6, June, 2001, pp. 590-603.
[PoC98]
J.M. Ponte, and W.B. Croft, A Language Modeling Approach to information retrieval, Proceedings
of the 21st Annual International ACM SIGIR Conference, 1998, pp. 275-281.
[PTS92]
W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling, Numerical Recipes in C: The Art of
Scientific Computing, 2nd edition, Cambridge University Press, 1992, pp. 994.
[PrS72]
N.S. Prywes, and D.P. Smith, Organization of Information, Annual Review of Information Science
and Technology, Vol. 7 (1972), pp. 103-158.
[PNK94]
P. Pudil, J. Novovicova, and J. Kittler, Floating Search Methods in Feature Selection, Journal of
Pattern Recognition Letters, Vol. 15, No. 11 (1994), pp. 1119-1125.
[Rad79]
[Ras92]
[RoS76]
S.E. Robertson, and K. Sparck Jones, Relevance Weighting of Search Terms, Journal of American
Society for Information Sciences, Vol. 27, No. 3 (1976), pp. 129-146.
[Rom90]
P. M. Romer, Endogenous Technological Change, Journal of Political Economy, Vol. 98, No. 5 (1990), pp. S71-S102.
[RoS00]
S.T. Roweis, and L.K. Saul, Nonlinear Dimensionality Reduction by Locally Linear Embedding,
Science, Vol. 290, No. 5500 (2000), pp. 2323-2326.
[SaB90]
G. Salton, and C. Buckley, Improving retrieval performance by relevance feedback, Journal of the
American Society for Information Science, Vol. 41, No. 4 (1990), pp. 288-297.
[Sal68]
G. Salton, Automatic Information Organization and Retrieval, McGraw-Hill, New York, 1968.
[Sal71]
[SaM83]
G. Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw Hill Publishing
Company, New York, 1983.
[SaW78]
G. Salton, and A. Wong, Generation and Search of Clustered Files, ACM Transactions on Database Systems, Vol. 3, No. 4, December, 1978, pp. 321-346.
[Sav02]
[SSM99]
[Sch94]
[Sch02]
N. Schmitt, Using corpora to teach and assess vocabulary, Chapter in Corpus Studies in Language
Education, Melinda Tan Edition, IELE Press, 2002, pp. 31-44.
[SAS04]
Y. Seo, A. Ankolekar, and K. Sycara, Feature Selection for Extracting Semantically Rich Words,
Technical Report CMU-RI-TR-04-18, Robotics Institute, Carnegie Mellon University, March, 2004.
[ShM00]
J. Shi, and J. Malik, Normalized Cuts and Image Segmentation, IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol. 22, No. 8 (2000), pp. 888-905.
[SBM96]
A. Singhal, C. Buckley, and M. Mitra, Pivoted document length normalization, Proceedings of the
19th Annual International ACM SIGIR Conference, Zurich, Switzerland, August, 1996, pp. 21-29.
[SnS73]
[SGM00]
[SLP97]
[Sub92]
J.L. Subbiondo, John Wilkins' Theory of Meaning and the Development of a Semantic Model, In
John Wilkins and 17th-Century British Linguistics, Chap. 5: Wilkins' Classification of Reality,
Joseph L. Subbiondo edition, Amsterdam, 1992, pp. 291-308.
[TEC05]
[TSL00]
J.B. Tenenbaum, V. de Silva, and J. C. Langford, A Global Geometric Framework for Nonlinear
Dimensionality Reduction, Science, Vol. 290 (2000), pp. 2319-2323.
[TuC91]
[VHL05]
using Diffusion Maps, Proceedings of the 44th IEEE Conference on Decision and Control and the
European Control Conference, Seville, Spain, December, 2005, pp. 7931-7936.
[Van72]
C.J. Van Rijsbergen, Automatic information structuring and retrieval, Ph.D. Dissertation, University
of Cambridge, 1972.
[Van79]
C.J. Van Rijsbergen, Information Retrieval, Second Edition, Butterworths Publishing Company,
London, 1979.
[Von06]
U. Von Luxburg, A Tutorial on Spectral Clustering, Technical Report TR-149, Max Planck Institute for Biological Cybernetics, 2006.
[WaK79]
[Wei99]
[VeA99]
J. Vesanto, and J. Ahola, Hunting for Correlations in Data Using the Self-Organizing Map, In
Proceedings of the International ICSC Congress on Computational Intelligence Methods and
Applications, 1999, pp. 279-285.
[WMC01] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, Feature Selection for
SVMs, Journal of Advances in Neural Information Processing Systems, Vol. 13 (2001), pp. 668-674.
[Wit97]
D. I. Witter, Downdating the Latent Semantic Indexing Model for Information retrieval, MS Thesis,
Department of Computer Science, University of Tennessee, 1997.
[WMB94] I.H. Witten, A. Moffat, and T.C. Bell, Managing Gigabytes: Compressing and Indexing Documents
and Images. Van Nostrand Reinhold, New York, NY, 1994.
[WoF00]
W. Wong, and A. Fu, Incremental Document Clustering for Web Page Classification, In Proceedings
of the International Conference on Information Society, Japan, 2000.
[WZW85] S.K.M. Wong, W. Ziarko, and P.C.N. Wong, Generalized Vector Spaces Model in Information
Retrieval, In Proceedings of the 8th Annual International ACM SIGIR Conference, Montreal, Quebec,
Canada, 1985, pp. 18-25.
[XJK01]
E.P. Xing, M.I. Jordan, and R.M. Karp, Feature Selection for High-Dimensional Genomic
Microarray Data, In Proceedings of the 18th International Conference on Machine Learning, San
Francisco, CA, USA, 2001, pp. 601-608.
[XuC98]
J. Xu, and W.B. Croft, Corpus-Based Stemming using Co-occurrence of Word Variants, In ACM
Transactions on Information Systems, Vol. 16, No. 1 (1998), pp. 61-81.
[XFW01]
J. Xu, A. Fraser, and R. Weischedel, TREC 2001 Crosslingual Retrieval at BBN, In TREC 2001,
Gaithersburg: NIST, 2001.
[Yah89]
A. H. Yahya, On the Complexity of the Initial Stages of Arabic Text Processing, First Great Lakes
Computer Science Conference, Kalamazoo, Michigan, U.S.A., October, 1989, pp. 18-20.
[YaH98]
J. Yang, and V. Honavar, Feature Subset Selection Using a Genetic Algorithm, IEEE
Intelligent Systems, Vol. 13, No. 2 (1998), pp. 44-49.
[YaP97]
Y. Yang, and J.O. Pedersen, A Comparative Study of Feature Selection in Text Categorization, In
Proceedings of the 14th International Conference on Machine Learning, San Francisco, CA, USA,
1997, pp. 412-420.
[YuL03]
L. Yu, and H. Liu, Feature Selection for High-Dimensional Data: A Fast Correlation-based Filter
Solution, In Proceedings of the twentieth International Conference on Machine Learning, 2003, pp.
856-863.
[Zah71]
C. T. Zahn, Graph-Theoretic Methods for Detecting and Describing Gestalt Clusters, IEEE
Transactions on Computers, Vol. 20, No. 1 (1971), pp. 68-86.
[ZaE98]
[ZeH01]
S. Zelikovitz and H. Hirsh, Using LSI for Text Classification in the Presence of Background Text,
In Proceedings of the ACM 10th International Conference on Information and Knowledge
Management (CIKM'01), Atlanta, Georgia, November, 2001, pp. 113-118.
[ZTM96]
C. Zhai, X. Tong, N. Milic-Frayling, and D. A. Evans, Evaluation of Syntactic Phrase Indexing - CLARIT NLP Track Report, In Proceedings of the 5th Text Retrieval Conference, Gaithersburg, MD,
November, 1996, pp. 347-358.
[ZhZ02]
Z. Zhang, and H. Zha, Principal Manifolds and Nonlinear Dimension Reduction via Local Tangent
Space Alignment, Technical Report CSE-02-019, Dept. of Computer Science and Eng.,
Pennsylvania State University, Pennsylvania, USA, 2002.
[ZhG02]
R. Zhao, and W. I. Grosky, Negotiating the Semantic Gap: from Feature Maps to Semantic
Landscapes, Pattern Recognition, Vol. 35, No. 3 (2002), pp. 593-600.