
Information Retrieval: Applications

to English and Arabic Documents


by

Fadoua Ataa Allah

Dissertation submitted to the Faculty of Science - Rabat of the


University of Mohamed V - Agdal in fulfillment
of the requirements for the degree of
Doctor of Philosophy
2008

Abstract
Arabic information retrieval has become a focus of research and commercial development
due to the vital necessity of such tools for people in the electronic age. The number of Arabic-speaking
Internet users is expected to reach 43 million during this year1; however, few full search engines are
available to Arabic-speaking users. This dissertation focuses on three naturally related areas of research:
information retrieval, document clustering, and dimensionality reduction.
In information retrieval, we propose an Arabic information retrieval system based on light
stemming in the pre-processing phase, and on the Okapi BM-25 weighting scheme and the
latent semantic analysis model in the processing phase. This system has been suggested after
performing and analyzing many experiments dealing with Arabic natural language processing
and the different weighting schemes found in the literature. Moreover, it has been compared with
another proposed system based on noun phrase indexation.
In clustering, we propose to use the diffusion map space based on the cosine kernel and the
singular value decomposition (which we denote the cosine diffusion map space) for clustering
documents. Using the k-means clustering algorithm, we illustrate experimentally the robustness of
document indexation in this space compared to the Salton space. We discuss the problems of
determining the reduced dimension related to the singular value decomposition method and of
choosing the number of clusters, and we provide some solutions for these issues. We also provide
some statistical results and discuss how the k-means algorithm performs better in the latent
semantic analysis model space than in the cosine diffusion map space in the case of two clusters,
but not in the case of multiple clusters. Finally, we propose a new approach for on-line clustering,
based on the cosine diffusion map and the updating singular value decomposition method.
Concerning dimensionality reduction, we use the singular value decomposition technique
for feature transformation, and we propose to supplement this reduction with a
generic term extraction algorithm for feature selection in the context of information
retrieval.

1 http://www.abc.net.au/science/news/stories/s1623945.htm, Retrieved on 10-05-2007.

Dedication

Acknowledgements

Table of Contents
List of Tables ................................................................................................................ V
List of Figures ............................................................................................................ VII
List of Abbreviations ...................................................................................................IX
Chapter 1 Introduction ...................................................................................................1
1. 1. Research Contributions..........................................................................................2
1. 2. Thesis Layout & Brief Overview of Chapters .......................................................3
Chapter 2 Literature Review..........................................................................................5
2. 1. Introduction............................................................................................................5
2. 2. Document Retrieval ...............................................................................................5
2.2.1. DOCUMENT RETRIEVAL MODELS .................................................................................... 5
2.2.1.1. Set-theoretic Models..................................................................................................... 6
2.2.1.2. Algebraic Models ......................................................................................................... 7
2.2.1.3. Probabilistic Models..................................................................................................... 7
2.2.1.4. Hybrid Models.............................................................................................................. 8
2.2.2. INTRODUCTION TO VECTOR SPACE MODELS .................................................................. 8

2. 3. Document Clustering ...........................................................................................10


2.3.1. DEFINITION .................................................................................................................... 11
2.3.2. CLUSTERING DOCUMENT IN THE CONTEXT OF DOCUMENT RETRIEVAL ...................... 11
2.3.2.1. Cluster Generation...................................................................................................... 11
2.3.2.2. Cluster Search............................................................................................................. 12
2.3.3. CLUSTERING METHODS TAXONOMY............................................................................ 12
2.3.3.1. Hierarchical Clustering............................................................................................... 14
2.3.3.2. Partitional Clustering.................................................................................................. 14
2.3.3.3. Graph-Theoretic Clustering........................................................................................ 15
2.3.3.4. Incremental Clustering ............................................................................................... 15
2.3.4. DOCUMENT CLUSTERING METHODS USED FOR IR........................................................ 16

2. 4. Dimensionality Reduction ...................................................................................16


2.4.1. TERM TRANSFORMATION .............................................................................................. 17
2.4.2. TERM SELECTION ........................................................................................................... 18
2.4.2.1. Definition.................................................................................................................... 18
2.4.2.2. Feature Selection Methods ......................................................................................... 18

2. 5. Studied Languages ...............................................................................................20


2.5.1. ENGLISH LANGUAGE ..................................................................................................... 20
2.5.2. ARABIC LANGUAGE ....................................................................................................... 21
2.5.3. ARABIC FORMS .............................................................................................................. 21
2.5.4. ARABIC LANGUAGE CHARACTERISTICS ........................................................................ 22

2.5.4.1. Arabic Morphology .................................................................................................... 24
2.5.4.2. Word-form Structures................................................................................................. 25
2.5.5. ANOMALIES ................................................................................................................... 27
2.5.5.1. Agglutination.............................................................................................................. 27
2.5.5.2. The Vowelless Nature of the Arabic Language.......................................................... 27
2.5.6. EARLY WORK ................................................................................................................ 28
2.5.6.1. Full-form-based IR ..................................................................................................... 28
2.5.6.2. Morphology-based IR................................................................................................. 29
2.5.6.3. Statistical Stemmers ................................................................................................... 30

2. 6. Arabic Corpus ......................................................................................................31


2.6.1. AFP CORPUS .................................................................................................................. 31
2.6.2. AL-HAYAT NEWSPAPER ................................................................................................ 31
2.6.3. ARABIC GIGAWORD ....................................................................................................... 32
2.6.4. TREEBANKS ................................................................................................................... 32
2.6.5. OTHER EFFORTS............................................................................................................. 33

2. 7. Summary ..............................................................................................................33
Chapter 3 Latent Semantic Model ...............................................................................34
3. 1. Introduction..........................................................................................................34
3. 2. Model Description ...............................................................................................34
3.2.1. TERM-DOCUMENT REPRESENTATION............................................................................ 35
3.2.2. WEIGHTING .................................................................................................................... 35
3.2.3. COMPUTING THE SVD ................................................................................................... 39
3.2.4. QUERY PROJECTION AND MATCHING ............................................................................ 41

3. 3. Applications and Results......................................................................................43


3.3.1. DATA.............................................................................................................................. 43
3.3.2. EXPERIMENTS ................................................................................................................ 44
3.3.2.1. Weighting Schemes Impact........................................................................................ 44
3.3.2.2. Reduced Dimension k................................................................................................. 46
3.3.2.3. Latent Semantic Model Effectiveness ........................................................................ 47

3. 4. Summary ..............................................................................................................48
Chapter 4 Document Clustering based on Diffusion Map...........................................49
4. 1. Introduction..........................................................................................................49
4. 2. Construction of the Diffusion Map ......................................................................49
4.2.1. DIFFUSION SPACE .......................................................................................................... 49
4.2.2. DIFFUSION KERNELS...................................................................................................... 51
4.2.3. DIMENSIONALITY REDUCTION ...................................................................................... 51
4.2.3.1. Singular Value Decomposition................................................................................... 52
4.2.3.2. SVD-Updating............................................................................................................ 54

4. 3. Clustering Algorithms..........................................................................................56
4.3.1. K-MEANS ALGORITHM ................................................................................................... 56
4.3.2. SINGLE-PASS CLUSTERING ALGORITHM ....................................................................... 57
4.3.3. THE OSPDM ALGORITHM ............................................................................................. 58

4. 4. Experiments and Results......................................................................................59


4.4.1. CLASSICAL CLUSTERING ............................................................................................... 59
4.4.2. ON-LINE CLUSTERING.................................................................................................... 80

4. 5. Summary ..............................................................................................................81
Chapter 5 Term Selection ............................................................................................83
5. 1. Introduction..........................................................................................................83
5. 2. Generic Terms Definition ....................................................................................83
5. 3. Generic Terms Extraction ....................................................................................83
5.3.1. SPHERICAL K-MEANS ..................................................................................................... 87
5.3.2. GENERIC TERM EXTRACTING ALGORITHM ................................................................... 87

5. 4. Experiments and Results......................................................................................89


5. 5. The GTE Algorithm Advantage and Limitation..................................................92
5. 6. Summary ..............................................................................................................93
Chapter 6 Information Retrieval in Arabic Language .................................................94
6. 1. Introduction..........................................................................................................94
6. 2. Creating the Test Set............................................................................................94
6.2.1. MOTIVATION .................................................................................................................. 94
6.2.2. REFERENCE CORPUS ...................................................................................................... 95
6.2.2.1. Description ................................................................................................................. 95
6.2.2.2. Corpus Assessments ................................................................................................... 97
6.2.3. ANALYSIS CORPUS ........................................................................................................ 99

6. 3. Experimental Protocol .......................................................................................100


6.3.1. CORPUS PROCESSING ................................................................................................... 100
6.3.1.1. Arabic Corpus Pre-processing.................................................................................. 100
6.3.1.2. Processing Stage....................................................................................................... 103
6.3.2. EVALUATIONS .............................................................................................................. 103
6.3.2.1. Weighting Schemes Impact..................................................................................... 103
6.3.2.2. Basic Language Processing Usefulness.................................................................... 104
6.3.2.3. The LSA Model Benefit ........................................................................................... 106
6.3.2.4. The Impact of Weighting Query............................................................................... 107
6.3.2.5. Non Phrase Indexation ............................................................................................. 108

6. 4. Summary ............................................................................................................111
Chapter 7 Conclusion and Future Work ....................................................................113
7. 1. Conclusion .........................................................................................................113

7. 2. Limitations .........................................................................................................113
7. 3. Prospects ............................................................................................................114
Appendix A Natural Language Processing................................................................115
A.1. Introduction........................................................................................................115
A.2. Basic Techniques ...............................................................................................115
A.2.1. N-GRAMS .................................................................................................................... 115
A.2.2. TOKENIZATION............................................................................................................ 115
A.2.3. TRANSLITERATION ...................................................................................................... 116
A.2.4. STEMMING .................................................................................................................. 117
A.2.5. STOP WORDS............................................................................................................... 118

A.3. Advanced Techniques ........................................................................................119


A.3.1. ROOT ........................................................................................................................... 119
A.3.2. POS TAGGING ............................................................................................................. 120
A.3.3. CHUNKING .................................................................................................................. 120
A.3.4. NOUN PHRASE EXTRACTION....................................................................................... 121

Appendix B Weighting Schemes Notations .............................................................122


Appendix C Evaluation Metrics.................................................................................124
C.1. Introduction ........................................................................................................124
C.2. IR Evaluation Metrics ........................................................................................124
C.2.1. PRECISION ................................................................................................................... 124
C.2.2. RECALL ....................................................................................................................... 125
C.2.3. INTERPOLATED RECALL-PRECISION CURVE ............................................................... 126

C.3. Clustering Evaluation.........................................................................................127


C.3.1. ACCURACY .................................................................................................................. 127
C.3.2. MUTUAL INFORMATION .............................................................................................. 128

Appendix D Principal Angles ....................................................................................129


References..................................................................................................................130


List of Tables
Table 2.1. Arabic letters...............................................................................................22
Table 2.2. Different shapes of the letter gh (Ghayn). ........................................22
Table 2.3. Ambiguity caused by the absence of vowels in the words ktb and mdrsp. ..........................23
Table 2.4. Some templates generated from roots with examples from the root (ktb). .........................24
Table 2.5. Derivations from a borrowed word. ...........................................................25
Table 3.1. Comparison between Different Versions of the Standard Query Method. .42
Table 3.2. Size of collections........................................................................................43
Table 3.3. Result of weighting schemes in increasing order for Cisi corpus. .............44
Table 3.4. Result of weighting schemes in increasing order for Cran corpus.............45
Table 3.5. Result of weighting schemes in increasing order for Med corpus..............45
Table 3.6. Result of weighting schemes in increasing order for Cisi-Med corpus......46
Table 3.7. The best reduced dimension for each weighting scheme in the case of four
corpuses. ..............................................................................................................47
Table 4.1. Performance of different embedding representations using k-means for the
set Cisi and Med...................................................................................................61
Table 4.2. The process running time for the cosine and the Gaussian kernels. ..........61
Table 4.3. Performance of k-means in cosine diffusion, Salton and LSA spaces for the
set Cisi and Med...................................................................................................64
Table 4.4. Measure of the difference between the approximated and the histogram
distributions. ........................................................................................................66
Table 4.5. Performances of different embedding representations using k-means for the
set Cran, Cisi and Med. .......................................................................................67
Table 4.6. Performance of k-means in cosine diffusion, Salton, and LSA spaces for the
set Cran, Cisi and Med. .......................................................................................68
Table 4.7. Measure of the difference between the approximated and the histogram
distributions. ........................................................................................................70
Table 4.8. Performance of different embedding cosine diffusion and LSA
representations using k-means for the set Cran, Cisi, Med and Reuters_1.........72
Table 4.9. Performance of k-means in Cosine diffusion, Salton and LSA spaces for the
set Cran, Cisi, Med and Reuters_1. .....................................................................72
Table 4.10. The confusion matrix for the set Cran-Cisi-Med-Reuters_1 ....................73
Table 4.11. The confusion matrix for the set S in 2-dimension cosine diffusion space. ........................74
Table 4.12. The resultant confusion matrix. ................................................................74
Table 4.13. Mutual information of different embedding cosine diffusion
representations using k-means to exclude the cluster C2 from the set Cran, Cisi,
Med and Reuters_1. .............................................................................................75
Table 4.14. Performance of different embedded cosine diffusion representations using
k-means for the set S. ...........................................................................................75
Table 4.15. The confusion matrix for the set Cran-Cisi-Med-Reuters_1 clustered into
4 clusters in the 4-dimension cosine diffusion space. ..........................................75
Table 4.16. Performance of different embedding cosine diffusion and LSA
representations using k-means for the set Cran, Cisi, Med and Reuters_2.........76
Table 4.17. Performance of k-means in Cosine diffusion, Salton and LSA spaces for
the set Cran, Cisi, Med and Reuters_2. ...............................................................77
Table 4.18. Performance of different embedding cosine diffusion and LSA
representations using k-means for Reuters..........................................................77
Table 4.19. Performance of k-means in Cosine diffusion, Salton and LSA spaces for
Reuters. ................................................................................................................77
Table 4.20. The statistical results for the performance of k-means algorithm in cosine
diffusion and LSA spaces. ....................................................................................80
Table 4.21. Performances of the single-pass clustering. .............................................81
Table 5.1. Index size in the native and Noun phrase spaces........................................90
Table 5.2. The MIAP measure for the collection Cisi in different indexes..................90
Table 5.3. The MIAP measure for the collection Cran in different indexes. ...............91
Table 5.4. The MIAP measure for the collection Med in different indexes. ................91
Table 5.5. LSA performance in the native and Noun phrase spaces. ..........................92
Table 6.1. [AR-ENV] Corpus Statistics. ......................................................................96
Table 6.2. An example illustrating the typical approach to query term selection. ......96
Table 6.3. Token-to-type ratios for fragments of different lengths, from various
corpora.................................................................................................................98
Table A.1. Buckwalter Transliteration. .....................................................................117
Table A.2. Prefixes and suffixes list...........................................................................118
Table B.1. List of term weighting components. .........................................................123


List of Figures
Figure 2.1. A taxonomy of clustering approaches. ......................................................13
Figure 3.1. A pictorial representation of the SVD. The shaded areas of U and V, as
well as the diagonal line in S, represent Ak, the reduced representation of the
original term-document matrix A.........................................................................40
Figure 3.2. The interpolated recall-precision curves of the LSA and the VSM models. ........................48
Figure 4.1. Average cosine of the principal angles between 64 concept subspace and
various singular subspaces for the CLASSIC data set.........................................53
Figure 4.2. Average cosine of the principal angles between 64 concept subspace and
various singular subspaces for the NSF data set.................................................53
Figure 4.3. Representation of our data set in various diffusion spaces.......................60
Figure 4.4. Representation of our data set in Cosine and Gaussian diffusion spaces
for various t time iterations..................................................................................63
Figure 4.5. Representation of the first 100 singular values of the cosine diffusion map
on the set Cisi and Med........................................................................................64
Figure 4.6. Representation of the first 100 singular values of the Cisi and Med term-document matrix. .....65
Figure 4.7. Histogram representation of the cluster C1 documents. ...........................66
Figure 4.8. Histogram representation of the cluster C2 documents. ...........................66
Figure 4.9. Representation of the first 100 singular values of the cosine diffusion map
on the cluster C1. .................................................................................................67
Figure 4.10. Representation of the first 100 singular values of the cosine diffusion
map on the cluster C2 ..........................................................................................67
Figure 4.11. Representation of the first 100 singular values of the cosine diffusion
space on the set Cran, Cisi and Med. ..................................................................68
Figure 4.12. Representation of the first 100 singular values of the Cran, Cisi and Med
term-document matrix..........................................................................................68
Figure 4.13. Histogram representation of the cluster C1 documents. .........................69
Figure 4.14. Histogram representation of the cluster C2 documents. .........................69
Figure 4.15. Histogram representation of the cluster C3 documents. .........................70
Figure 4.16. Representation of the first 100 singular values of the cosine diffusion
map on cluster C1. ...............................................................................................70
Figure 4.17. Representation of the first 100 singular values of the cosine diffusion
map on cluster C2. ...............................................................................................71
Figure 4.18. Representation of the first 100 singular values of the cosine diffusion
map on cluster C3. ...............................................................................................71
Figure 4.19. Representation of the first 100 singular values of the cosine diffusion
map on the set Cran, Cisi, Med and Reuters_1. ..................................................72
Figure 4.20. Representation of the first clusters of the hierarchical clustering. .........73
Figure 4.21. Representation of the first 100 singular values of the cosine diffusion map
on the data set S ...................................................................................................73
Figure 4.22. Representation of the Set S clusters.........................................................74
Figure 4.23. Representation of the first 100 singular values of the cosine diffusion
map on the set Cran, Cisi, Med and Reuters_2. ..................................................76
Figure 4.24. Representation of the first 100 singular values of the cosine diffusion
map on Reuters. ...................................................................................................77
Figure 4.25. The LSA and Diffusion Map processes....................................................79
Figure 5.1. Top-Level Flowchart of GTE Algorithm. ..................................................89
Figure 6.1. Zipf law and word frequency versus rank in the [AR-ENV] collection. ..98
Figure 6.2. Token-to-type ratios (TTR) for the [AR-ENV] collection..........................99
Figure 6.3. A standardized information retrieval system...........................................100
Figure 6.4. An information retrieval system for Arabic language.............................101
Figure 6.5. Comparison between the performances of the LSA model for five
weighting schemes. ............................................................................................104
Figure 6.6. Language processing benefit...................................................................105
Figure 6.7. A new information retrieval system suggested for Arabic language.......106
Figure 6.8. A comparison between the performances of the VSM and the LSA models. ....................107
Figure 6.9. Weighting queries impact.......................................................................108
Figure 6.10. Arabic Information Retrieval System based on NP Extraction. ............109
Figure 6.11. Influence of the NP and the singles terms indexations on the IRS
performance. ......................................................................................................110
Figure C.1. The computation of Recall and Precision...............................................124
Figure C.2. The Precision Recall trade-off................................................................125
Figure C.3. Interpolated Recall Precision Curve. .....................................................127


List of Abbreviations
Acc: Accuracy
AFN: Affinity Set
AFP: Agence France Presse
AIR: Arabic Information Retrieval
AIRS: Arabic Information Retrieval System
AP: Average Precision
BNS: Bi-Normal Separation
CCA: Corpus of Contemporary Arabic
CHI: χ2-test (Chi-square test)
CQ: Characteristic Quotient
DF: Document Frequency
DM: Diffusion Map
ELRA: European Language Resources Association
GPLVM: Gaussian Process Latent Variable Model
GTE: Generic Term Extracting
HPSG: Head-driven Phrase Structure Grammar
ICA: Independent Component Analysis
ICA: International Corpus of Arabic
ICE: International Corpus of English
IG: Information Gain
IR: Information Retrieval
IRP: Interpolated Recall-Precision
IRS: Information Retrieval System
ISOMAPS: ISOmetric MAPS
LLE: Locally Linear Embedding
LSA: Latent Semantic Analysis
LTSA: Local Tangent Space Alignment
MDS: Multidimensional Scaling
MI: Mutual Information
MIAP: Mean Interpolated Average Precision
NLP: Natural Language Processing
nonrel: non-relevant
NP: Noun Phrase
OSPDM: On-line Single-Pass Clustering based on Diffusion Map
P2P: Peer-To-Peer
PCA: Principal Component Analysis
POS: Part Of Speech
Pr: Probability
R&D: Research and Development
rel: relevant
RSV: Retrieval Status Value
SOM: Self-Organizing Maps
SVD: Singular Value Decomposition
SVM: Support Vector Machine
TREC: Text REtrieval Conference
TS: Term Strength
TTR: Token-to-Type Ratio
TDT: Topic Detection and Tracking
VSM: Vector-Space Model

Chapter 1 Introduction
The advent of the World Wide Web has increased the importance of information retrieval. Instead of
going to the local library to look for information, people search the Web. Thus, the relative number of
manual versus computer-assisted searches for information has shifted dramatically in the past few years.
This has accentuated the need for automated information retrieval for extremely large document
collections, in order to help in reading, understanding, indexing and tracking the available literature. For
this reason, researchers in document retrieval, computational linguistics and textual data mining are
working on the development of methods to process these data and present them in a usable and suitable
format for many written languages, of which Arabic is one.
Known as the second2 most widely spoken language in the world, Arabic has seen a significant increase
in the number of Arabic-speaking Internet users: about 4.4 million in 2002 [ACS04] and 16 million in
2004, while research commissioned from the Dubai-based Internet research firm Madar shows that this
number could jump to 43 million in 20083. However, at present relatively few standard Arabic search
engines are known. Moreover, according to Hermann Havermann (managing director of the German
Internet technology firm Seekport, and a founding member of the SAWAFI Arabic search engine project),
those that are available are not considered full Arabic engines. As reported in a Reuters news article4,
Hermann Havermann stated that "There is no [full] Arabic internet search engine on the market.
You find so-called search engines, but they involve a directory search, not a local search."
The fact that any improved access to Arabic text will have profound implications for cross-cultural
communication, economic development, and international security encourages us to take a particular
interest in this language.
The limited body of research in the Arabic document retrieval area over the past 20 years, which began
with the Arabization of the MINISIS system [Alg87] and then the development of the Micro-AIRS system
[Alka91], has been dominated by the use of statistical methods to automatically match natural language
user queries against records. There has been interest in using natural language processing to enhance term
matching by using roots, stems, and n-grams, as highlighted in the Text REtrieval Conference TREC-2001
[GeO01]. However, up to 2005, the effect of stemming upon stopwords had not been studied; the Latent
Semantic Analysis model (LSA), developed in the early 1990s [DDF90] and known for its high capacity
for resolving the synonymy and polysemy problems, had not been utilized; and indexation by phrases had
not been used.

2 http://encarta.msn.com/media_701500404/Languages_Spoken_by_More_Than_10_Million_People.html, Microsoft Encarta 2006, Retrieved on 10-05-2007.
3 http://www.abc.net.au/science/news/stories/s1623945.htm, Retrieved on 10-05-2007.
4 "Arabic search engine may boost content", by Andrew Hammond, Reuters, April 26th, 2006. Retrieved on 10-05-2007.
We are motivated by the fact that the use of the LSA model, in an attempt to discover hidden structure and
implicit meanings, may meet the challenge posed by the wide use of synonyms in Arabic.
The employment of several weighting schemes, taking into account term importance within both
documents and the query, together with the use of Arabic natural language processing based on spelling
mutation, stemming, stopword removal and noun phrase extraction, makes the study more interesting.
The first objective of our study is to improve the computation of the similarity score between
documents and a query for Arabic documents; however, this study has been extended to consider other
aspects. Many studies have shown that clustering is an important tool in information retrieval for
constructing a taxonomy of a document collection, by forming groups of closely related documents
[FrB92, FaO95, HeP96, Leu01]. Based on the Cluster Hypothesis, "closely associated documents tend
to be relevant to the same requests" [Van79], clustering is used to accelerate query processing by
considering only a small number of cluster representatives, rather than the entire corpus. Similarly, we
believe that reducing the corpus dimension by using some feature selection methods may also help a user
find relevant information more quickly. Thus, we have been interested in developing new clustering
methods for both the off-line and on-line cases, and in extending the generic term extraction method to
reduce the storage capacity required for the retrieval task.

1. 1. Research Contributions
With the objective of improving the performance and reducing the complexity of document retrieval
systems, ten major contributions are proposed in this thesis:
- Studying the Weighting Schemes found in the current text retrieval literature to discover the best one when the Latent Semantic model is used.
- Utilizing the Diffusion map for off-line document clustering, and improving its performance by using the Cosine distance.
- Comparing the performance of the k-means algorithm in the Salton, LSA and cosine diffusion spaces.
- Proposing two postulates indicating the appropriate reduced dimension to use for clustering, and the optimal number of clusters.
- Developing a new method for on-line clustering, based on the Diffusion map and the updating singular value decomposition.
- Analyzing the benefit of extracting Generic Terms in decreasing the Data Storage capacity required for document retrieval.
- Creating an Arabic Retrieval Test Collection, in which the documents concern a scientific field specialized in the environment, and the queries are structured into two categories in order to examine the performance difference between short (2 or 3 words) and long (sentence) queries.
- Applying the Latent Semantic model to the Arabic language in an attempt to meet the challenges of the wide use of synonyms offered by this language.
- Analyzing the Weighting Schemes influence on the use of some Arabic language processing.
- Studying the effect of representing the Arabic document content by Noun Phrases on the improvement of the proposed automatic document retrieval system based on the two previous contributions.

1. 2. Thesis Layout & Brief Overview of Chapters


This thesis comprises seven chapters and four appendices, briefly described as follows:
Chapter 2 reviews document retrieval and document clustering. It surveys prior research on
dimensionality reduction techniques, especially feature selection methods. It focuses on
Arabic language characteristics, earlier vector space retrieval models, and corpora used for
this language.
Chapter 3 describes the latent semantic analysis model by outlining the term-document
representation and analyzing the weighting schemes found in the current text retrieval
literature. It explains the singular value decomposition method, and reviews the three
standard LSA query methods. It introduces the English test data collections
used in this work, and evaluates the different weighting schemes presented before. It
compares the performances of the LSA and the standard vector space models.
Chapter 4 presents the diffusion map approach, and shows its efficiency on the off-line document
clustering task when a cosine kernel is used. It validates two postulates indicating the
appropriate reduced dimension to use for clustering, as well as the optimal number of
clusters to use in that dimension. Furthermore, it proposes a new single-pass approach for
on-line document clustering, based on the diffusion map and the updating singular
value decomposition.
Chapter 5 introduces the generic term extraction method, and analyzes the impact of using this
method in reducing the storage capacity in the case of document retrieval.
Chapter 6 describes the development of Arabic retrieval text collections. It studies the existing
Arabic natural language processing techniques, and implements them in a new Arabic document
retrieval system based on the latent semantic analysis model. It examines and discusses the
effectiveness of different index terms on these collections.
Chapter 7 summarizes the research and concludes with its major achievements and possible
directions that could be considered for future research.
Appendix A presents all the natural language processing techniques used and mentioned in this work.
Appendix B reviews the weighting schemes notations.
Appendix C outlines the evaluation metrics commonly used in retrieval and clustering tasks,
more specifically those used in this thesis.
Appendix D recalls the quantities known as principal angles, used to measure the closeness of
subspaces.


Chapter 2 Literature Review


2. 1. Introduction
In an attempt to build an Arabic document retrieval system, we have been interested in studying
some specific and elementary tools and tasks contributing to the development of the system components.
These tools include document retrieval models, document clustering algorithms, and dimension
reduction techniques, in addition to Arabic language characteristics. In this chapter, we introduce these
elements, and survey some of their prior research.

2. 2. Document Retrieval
The problem of finding relevant information is not new. Early systems tried to classify knowledge
into a set of known fixed categories. The first of these was completed in 1668 by the English
philosopher John Wilkins [Sub92]. The problem with this approach is that categorizers commonly do
not place documents into the categories where searchers expect to find them. No matter what categories
a user thinks of, these categories will not match what someone searching will find. For example, users of
e-mail systems place mails in folders or categories only to spend countless hours trying to find the same
documents because they cannot remember what category they used, or the category they are sure they
used does not contain the relevant document. Effective and efficient search techniques are needed to
help users quickly find the information they are looking for. Another approach is to try to understand the
content of the documents, ideally by loading them into the computer for reading and understanding
before users ask any questions; this involves the use of a document retrieval system.
The elementary definition of document retrieval is the matching of some stated user query against
useful parts of free-text records. These records could be any type of mainly unstructured text, such as
bibliographic records, newspaper articles, or paragraphs in a manual. User queries could range from
multi-sentence full descriptions of an information need to a few words. However, this definition is not
informative enough, because a document can be relevant even though it does not use the same words as
those provided in the query. The user is not generally interested in retrieving documents with exactly the
same words, but with the concepts that those words represent. To this end, many models are discussed.

2.2.1. Document Retrieval Models


Several events have recently occurred that have had a major effect on the progress of document retrieval
research. First, the evolution of computer hardware has made running sophisticated search algorithms
against massive amounts of data with acceptable response times more realistic. Second, Internet access
has created a requirement for effective text searching systems. These two events have contributed to
creating an interest in accelerating research to produce more effective search methodologies, including
more use of natural language processing techniques.
A great variety of document retrieval models is described in the information retrieval literature.
From a mathematical point of view, the techniques currently in use can be classified into four types: Boolean
or set-theoretic, vector or algebraic, probabilistic, and hybrid models.
A model is characterized by four parameters:
- Representations for documents and queries.
- Matching strategies for assessing the relevance of documents to a user query.
- Methods for ranking query output.
- Mechanisms for acquiring user-relevance feedback.
In the following paragraphs, we describe instances of each type in the context of the model
parameters.

2.2.1.1. Set-theoretic Models


The standard Boolean model [WaK79, BuK81, SaM83] represents documents by a set of index
terms, each of which is viewed as a Boolean variable and valued as True if it is present in a document.
No term weighting is allowed. Queries are specified as arbitrary Boolean expressions formed by linking
terms through the standard logical operators: AND, OR, and NOT. Retrieval status value (RSV) is a
measure of the query-document similarity. In the Boolean model, RSV equals 1 if the query expression
evaluates to True; RSV is 0 otherwise. All documents whose RSV equals 1 are considered relevant to
the query.
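As a simple illustration of how such a Boolean RSV may be evaluated, the following sketch (in Python; the two-document collection and the query are hypothetical and serve only to make the definition concrete) represents each document by the set of its index terms and evaluates the query expression recursively:

    # Minimal Boolean retrieval sketch (hypothetical collection and query).
    docs = {
        "d1": {"information", "retrieval", "arabic"},
        "d2": {"information", "clustering"},
    }

    def rsv(query, doc_terms):
        # A query is either a term (string) or a tuple ("AND" | "OR" | "NOT", ...).
        if isinstance(query, str):
            return 1 if query in doc_terms else 0
        op, *args = query
        if op == "AND":
            return 1 if all(rsv(a, doc_terms) for a in args) else 0
        if op == "OR":
            return 1 if any(rsv(a, doc_terms) for a in args) else 0
        if op == "NOT":
            return 1 - rsv(args[0], doc_terms)

    query = ("AND", "information", ("OR", "retrieval", "clustering"))
    print([d for d, terms in docs.items() if rsv(query, terms) == 1])  # ['d1', 'd2']

Both example documents receive the same RSV of 1: the query expression either holds for a document or it does not.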
Although this model is simple, and user queries can employ arbitrarily complex expressions, retrieval
performance tends to be poor. It is not possible to rank the output, since all retrieved documents
have the same RSV, and weights cannot be assigned to query terms. The results are often counter-intuitive.
For example, if the user query specifies 10 terms linked by the logical connective AND, a
document that contains 9 of these terms is not retrieved. User relevance feedback is often used in IR systems
to improve retrieval effectiveness [SaB90]. Typically, a user is asked to indicate the relevance or
irrelevance of a few documents placed at the top of the output. Since the output is not ranked, however,
the selection of documents for relevance feedback elicitation is difficult.
The fuzzy-set model [Rad79, Boo80, Egg04] is based on fuzzy-set theory which allows partial
membership in a set, as compared with conventional set theory which does not. It redefines logical
operators appropriately to include partial set membership, and processes user queries in a manner similar
to the case of the Boolean model. Nevertheless, IR systems based on the fuzzy-set model have proved
nearly as incapable of discriminating among the retrieved output as systems based on the Boolean
model.
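To make the redefined operators concrete, one standard choice (a sketch only, assuming Zadeh's minimum/maximum/complement operators and hypothetical membership degrees; other definitions of the fuzzy operators exist) replaces the Boolean evaluation above by graded values:

    # Fuzzy-set retrieval sketch: term membership degrees in one document (hypothetical values).
    membership = {"information": 0.9, "retrieval": 0.4, "clustering": 0.0}

    def fuzzy_and(*vals):   # conjunction as the minimum membership
        return min(vals)

    def fuzzy_or(*vals):    # disjunction as the maximum membership
        return max(vals)

    def fuzzy_not(v):       # negation as the complement
        return 1.0 - v

    # RSV for the query: information AND (retrieval OR clustering)
    score = fuzzy_and(membership["information"],
                      fuzzy_or(membership["retrieval"], membership["clustering"]))
    print(score)  # 0.4 -- a graded score instead of a strict 0/1 decision

The graded score allows partial matches, yet, as noted above, it still provides little basis for discriminating among the retrieved documents.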
The strict Boolean and fuzzy-set models are preferable to other models in terms of computational
requirements, which are low in terms of both the disk space required for storing document
representations and the algorithmic complexity of indexing and computing query-document similarities.

2.2.1.2. Algebraic Models


The algebraic model represents documents and queries usually as vectors, matrices or tuples. These
vectors, matrices or tuples are transformed, by the use of a finite number of algebraic operations, into a
one-dimensional similarity measurement that indicates the query-document RSV. The higher the RSV,
the greater the document's relevance to the query.
The strength of this model lies in its simplicity and its allowance for term weighting. Relevance feedback
can be easily incorporated into it. However, the rich expressiveness of query specification inherent in the
Boolean model is sacrificed.
This kind of model includes: the standard vector-space model, known as the Salton model (highlighted in
Section 2.2.2) [SaM83]; the generalized vector space model [WZW85]; the latent semantic model (detailed
in Chapter 3) [DDF90]; and the topic-based vector space model [BeK03].

2.2.1.3. Probabilistic Models


The probabilistic model, introduced by Robertson and Sparck Jones [RoS76], attempts to capture the
IR problem within a probabilistic framework. To that end, the model takes the term dependencies and
relationships into account; and tries to estimate the probability of finding a document interesting for a
user, by specifying the major parameters such as the weights of the query terms and the form of the
query-document similarity.
The model is based on two main parameters Pr(rel) and Pr(nonrel), the probabilities of relevance and
non-relevance of a document to a user query. These parameters are computed using the probabilistic
term weights [RoS76, GRG97], and the actual terms present in the document. Relevance is assumed to
be a binary property so that Pr(rel) = 1-Pr(nonrel). In addition, the model uses two cost parameters, a1
and a2, to represent the loss associated with the retrieval of an irrelevant document and non-retrieval of a
relevant document, respectively.
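To make the role of these cost parameters concrete, the usual expected-cost argument (given here only as an illustrative formulation; it is not spelled out in the paragraph above) retrieves a document d when the expected loss of retrieving it does not exceed the expected loss of rejecting it, that is, when

    a1 · Pr(nonrel | d) ≤ a2 · Pr(rel | d).

With equal costs a1 = a2, this reduces to retrieving d whenever Pr(rel | d) ≥ 0.5, since Pr(rel | d) = 1 − Pr(nonrel | d).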
The model may use an interaction with a user to improve its estimation, and requires term-occurrence probabilities in the relevant and irrelevant parts of the document collection, which are
difficult to estimate. However, the model serves an important function for characterizing retrieval
processes and provides a theoretical justification for practices previously used on an empirical basis (for
example, the introduction of certain term-weighting systems).
This model includes: Binary independence retrieval [RoS76], Uncertain inference [CLR98],
Language models [PoC98], Divergence from randomness models [AmR02].

2.2.1.4. Hybrid Models


Many techniques are considered hybrid models. These are combinations of models belonging to the three
classes presented above, for example: the extended Boolean model (set-theoretic & algebraic) [Lee94],
and inference network retrieval (set-theoretic & probabilistic) [TuC91].

To the best of our knowledge, the most recently used model for the Arabic language, before our work
[BoA05] in which the latent semantic model is utilized, was the standard vector space model [SaM83].
For this reason, we have been interested in algebraic models, more particularly those based on vectors,
as the starting point of our study.

2.2.2. Introduction to Vector Space Models


Based on the assumption that the meaning of a document can be derived from the document's
constituent terms, vector-space models represent documents as vectors of terms d = (t_1, t_2, ..., t_m),
where t_i (1 ≤ i ≤ m) is a non-negative value denoting the single or multiple occurrences of term i in
document d. Thus, each unique term in the document collection corresponds to a dimension in the space.
Similarly, a query is represented as a vector q = (t'_1, t'_2, ..., t'_m), where t'_i (1 ≤ i ≤ m) is a
non-negative value denoting the number of occurrences of term t'_i (or merely a 1 to signify its
occurrence) in the query [BeC87]. Both the document vectors and the query vector provide the locations
of the objects in the term-document space. By computing the distance between the query and other
objects in the space, objects with semantic content similar to the query will presumably be retrieved.
Vector-space models that do not attempt to collapse the dimensions of the space treat each term
independently, essentially mimicking an inverted index [FrB92]. However, vector-space models are
more flexible than inverted indices since each term can be individually weighted, allowing that term to
become more or less important within a document or the entire document collection as a whole. Also, by
applying different similarity measures to compare queries to terms and documents, properties of the
document collection can be emphasized or de-emphasized.
For example, the dot product similarity measure M(q, d) = q · d finds the distance between the query
and a document in the space, where the operation "·" is the inner product multiplication, with the inner
product of two m-vectors X = <x_i> and Y = <y_i> defined to be X · Y = Σ_{i=1..m} x_i y_i.

The inner product, or dot product, favors long documents over short ones, since they contain more
terms and hence their product increases.
On the other hand, by computing the angle between the query and a document rather than the
distance, the cosine similarity measure

    cos(q, d) = (q · d) / (||q|| ||d||)

de-emphasizes the lengths of the vectors. Here q · d is the inner product defined above, and
||X|| = sqrt(Σ_{i=1..m} x_i^2) is the Euclidean length of the vector X.

In some cases, the directions of the vectors are a more reliable indication of the semantic similarities of
the objects than the distance between the objects in the term-document space [FrB92].
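The following short sketch (with hypothetical term-count vectors, added here only for illustration) contrasts the two measures: the dot product rewards the longer document, whereas the cosine measure compares directions only and ranks the exactly matching short document first.

    import math

    # Hypothetical term-count vectors over the same four term dimensions.
    query     = [1, 1, 0, 0]
    short_doc = [1, 1, 0, 0]
    long_doc  = [3, 3, 2, 2]   # longer document: same query terms plus extra terms

    def dot(x, y):
        return sum(xi * yi for xi, yi in zip(x, y))

    def norm(x):
        return math.sqrt(dot(x, x))

    def cosine(x, y):
        return dot(x, y) / (norm(x) * norm(y))

    print(dot(query, short_doc), dot(query, long_doc))    # 2 vs 6: the dot product favors the long document
    print(round(cosine(query, short_doc), 3),
          round(cosine(query, long_doc), 3))              # 1.0 vs 0.832: cosine de-emphasizes length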
Vector-space models, by placing documents and queries in a term-document space and computing
similarities between the queries and the documents, allow the results of a query to be ranked according
to the similarity measure used. Unlike lexical matching techniques that provide no ranking or a very
crude ranking scheme (for example, ranking one document before another document because it contains
more occurrences of the search terms), the vector-space models, by basing their rankings on the
Euclidean distance or the angle measure between the query and documents in the space, are able to
automatically guide the user to documents that might be more conceptually similar and of greater use
than other documents.
Vector-space models, specifically the latent semantic model, were developed to eliminate many of
the problems associated with exact, lexical matching techniques. In particular, since words often have
multiple meanings (polysemy), it is difficult for a lexical matching technique to differentiate between
two documents that share a given word, but use it differently, without understanding the context in
which the word was used. Also, since there are many ways to describe a given concept (synonymy),
related documents may not use the same terminology to describe their shared concepts. A query using
the terminology of one document will not retrieve the other related documents. In the worst case, a
query using terminology different from that used by related documents in the collection may not retrieve
any documents using lexical matching, even though the collection contains related documents [BeC87].
For example, suppose a text collection contains documents on house ownership and web home pages: some documents use the word house only, some use the word home only, and some use both words. For a
query on home ownership, traditional lexical matching methods fail to retrieve documents using the
word house only, which are obviously related to the query. For the same query on home ownership,
lexical matching methods will also retrieve irrelevant documents about web home pages.


2. 3. Document Clustering
Document clustering has been studied in the field of document retrieval for several decades. With the aim of reducing search time, the first approaches were attempted by Salton [Sal68], Litofsky [Lit69], Crouch [Cro72], Van Rijsbergen [Van72], Prywes & Smith [PrS72], and Fritzche [Fri73]. Based on these studies, Van Rijsbergen notes in his book [Van79] that, when choosing a cluster method to use in experimental document retrieval, two often conflicting criteria are frequently used.
The first one, and the most important in his view, is the theoretical soundness of the method, meaning that the method should satisfy certain criteria of adequacy. Below, we list some of the
most important of these criteria:
1) The method produces a clustering which is unlikely to be altered drastically when further objects
are incorporated, i.e. it is stable under growth;
2) The method is stable in the sense that small errors in the description of the objects lead to small
changes in the clustering;
3) The method is independent of the initial ordering of the objects.
These conditions have been adapted from Jardine and Sibson [JaS71]. The point is that any cluster
method which does not satisfy these conditions is unlikely to produce any meaningful experimental
results.
The second criterion for choice, considered as the overriding consideration in the majority of
document retrieval experimental works, is the efficiency of the clustering process in terms of speed and
storage requirements. Efficiency is really a property of the algorithm implementing the cluster method.
It is sometimes useful to distinguish the cluster method from its algorithm, but in the context of
document retrieval this distinction becomes slightly less useful, since many cluster methods are defined
by their algorithm, so no explicit mathematical formulation exists.
The current information explosion, fueled by the availability of hypermedia and the World-Wide
Web, has led to the generation of an ever-increasing volume of data, posing a growing challenge for
information retrieval systems to efficiently store and retrieve this information [WMB94]. A major issue
that document databases are now facing is the extremely high rate of update. Several practitioners have
complained that existing clustering algorithms are not suitable for maintaining clusters in such a
dynamic environment, and they have been struggling with the problem of updating clusters without
frequently performing complete re-clustering [CaD90, Can93, Cha94]. To overcome this problem, online clustering approaches have been proposed.
In the following, we explain the clustering procedure in the context of document retrieval, survey a taxonomy of clustering methods focusing on the categories we need, and give an overview of some recent studies in both classical and on-line clustering, after specifying the definition of clustering by comparing this approach to other classification approaches.

2.3.1. Definition
In supervised classification, or discriminant analysis, a collection of labeled (pre-classified) patterns
is provided; the problem is to label a newly encountered, yet unlabeled, pattern. Typically, the given
labeled (training) patterns are used to learn the descriptions of classes which in turn are used to label a
new pattern. In the case of clustering (unsupervised classification), the problem is to group a given
collection of unlabeled patterns into meaningful clusters. In a sense, labels are associated with clusters
also, but these category labels are data driven; that is, they are obtained solely from the data.

2.3.2. Clustering Document in the Context of Document Retrieval


The basic idea of clustering is that similar documents are grouped together to form clusters; the so-called cluster hypothesis states that closely associated documents tend to be relevant to the same requests. Since grouping similar documents accelerates searching, especially if hierarchies of clusters are created by grouping clusters into super-clusters and so on, we have been interested in surveying and studying this approach.
On the other hand, even though clustering is a traditional approach in the text retrieval context [FaO95], the knowledge of traditional methods is useful background for newer developments, and variations or extensions of these methods are at the heart of newer methods; we therefore consider this study to be of potential value.
To this end, two document clustering procedures will be involved: cluster generation and cluster
search [SaW78].

2.3.2.1. Cluster Generation


A cluster generation method first consists of indexing the documents, then partitioning them into groups. Many cluster generation methods have been proposed. Unfortunately, no single method meets both requirements of soundness and efficiency. Thus, there are two classes of methods:
- Sound methods, which are based on the document-document similarity matrix.
- Iterative methods, which are more efficient and proceed directly from the document vectors.

a- Methods based on the Similarity matrix


These methods usually require O(n²) time (or more), where n is the number of documents, and apply graph-theoretic techniques (see Section 2.3.3). A document-to-document similarity function has to be chosen to measure how closely two documents are related.
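As a rough illustration of the O(n²) cost involved, the sketch below (our own toy example, assuming NumPy is available and arbitrarily choosing cosine similarity as the document-to-document function) builds the full similarity matrix from a small set of document vectors.

import numpy as np

def similarity_matrix(doc_vectors):
    # doc_vectors: (n_docs, n_terms) array of term weights
    X = np.asarray(doc_vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for empty documents
    Xn = X / norms
    return Xn @ Xn.T                 # n x n matrix of pairwise cosine similarities

docs = [[2, 0, 1], [0, 3, 1], [1, 1, 0]]
print(similarity_matrix(docs))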


b- Iterative Methods
This class consists of methods that operate in less than quadratic time (that is, O(n log n) or O(n²/log n)) on average [FaO95]. These methods are based directly on the item (document) descriptions and do not require the similarity matrix to be computed in advance. The price for the increased efficiency is the sacrifice of theoretical soundness: the final classification depends on the order in which the documents are processed, or else on the existence of a set of seed points around which the classes are to be constructed.
Although some experimental evidence exists indicating that iterative methods can be effective for
information retrieval purposes [Dat71], specifically in on-line clustering [KWX01, KlJ04, KJR06], most
researchers prefer to work with the theoretically more attractive hierarchical grouping methods, while
attempting, at the same time, to save computation time. This can be done in various ways by applying
the expensive clustering process to a subset of the documents only and then assigning the remaining unclustered items to the resulting classes; or by using only a subset of the properties for clustering
purposes instead of the full keyword vectors; or finally by utilizing an initial classification and applying
the hierarchical grouping process within each of the initial classes only [Did73, Cro77, Van79].

2.3.2.2. Cluster Search


A cluster search is conducted by identifying the clusters that appear most similar to a given query. It is carried out by first comparing the query formulation with the cluster centroids. This may then be followed by a comparison between the query and the documents of those clusters whose query-centroid similarity was found to be sufficiently large in the earlier comparison. Thus, searches can be conducted rapidly, because a large portion of the documents is immediately rejected and the search is concentrated in areas where substantial similarities are detectable between the query and the cluster centroids.
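A minimal sketch of this two-stage search, under the assumption that clusters are represented by their centroids and that cosine similarity is used at both stages (the threshold value and the data structures below are illustrative, not those of any particular system), could look as follows.

import numpy as np

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b) / (na * nb) if na and nb else 0.0

def cluster_search(query, centroids, clusters, docs, threshold=0.3):
    # centroids: list of centroid vectors; clusters: list of lists of document ids;
    # docs: mapping from document id to document vector
    results = []
    for centroid, members in zip(centroids, clusters):
        if cosine(query, centroid) >= threshold:      # stage 1: query vs. centroids
            for doc_id in members:                    # stage 2: query vs. cluster members
                results.append((doc_id, cosine(query, docs[doc_id])))
    return sorted(results, key=lambda pair: pair[1], reverse=True)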

2.3.3. Clustering Methods Taxonomy


Many taxonomic representations of clustering methodology are possible. Based on the discussion in Jain et al. [JMF99], data clustering methods can be divided into hierarchical and partitional approaches. Hierarchical algorithms produce a nested series of partitions, by finding successive clusters using previously established ones, whereas partitional algorithms produce only one partition, by determining all
clusters at once. But this taxonomy, represented in Figure 2.1, must be supplemented by a specification
of cross-cutting issues that may (in principle) affect all of the different approaches regardless of their
placement in the taxonomy.


Figure 2.1. A taxonomy of clustering approaches.


- Agglomerative vs. divisive [JaD88, KaR90]: An agglomerative clustering (bottom-up) starts with
one-point (singleton) clusters and recursively merges two or more most appropriate clusters. A
divisive clustering (top-down) starts with one cluster of all data points and recursively splits the
most appropriate cluster. The process continues until a stopping criterion (frequently, the
requested number k of clusters) is achieved.

- Monothetic vs. polythetic [Bec59]: A monothetic class is defined in terms of characteristics that
are both necessary and sufficient in order to identify members of that class. This way of defining
a class is also termed the Aristotelian definition of a class [Van79]. A polythetic class is defined
in terms of a broad set of criteria that are neither necessary nor sufficient. Each member of the
category must possess a certain minimal number of defining characteristics, but none of the
features has to be found in each member of the category. This way of defining classes is
associated with Wittgenstein's concept of family resemblances [Van79]. In short, a monothetic type is one in which all members are identical on all defining characteristics, whereas a polythetic type is one in which all members are similar, but not identical.

- Hard vs. fuzzy: A hard clustering algorithm allocates each pattern to a single cluster during its
operation and in its output. A fuzzy clustering method assigns degrees of membership in several
clusters, that do not have hierarchical relations with each other, to each input pattern. A fuzzy
clustering can be converted to a hard clustering by assigning each pattern to the cluster with the
largest measure of membership.

- Deterministic vs. stochastic: This issue is most relevant to partitional approaches designed to
optimize a squared error function. This optimization can be accomplished using traditional

techniques or through a random search of the state space consisting of all possible labelings.
- Incremental vs. non-incremental: This issue arises when the pattern set to be clustered is large,
and constraints on execution time or memory space affect the architecture of the algorithm. The
early history of clustering methodology does not contain many examples of clustering algorithms
designed to work with large data sets, but the advent of data mining has fostered the
development of clustering algorithms that minimize the number of scans through the pattern set,
reduce the number of patterns examined during execution, or reduce the size of data structures
used in the algorithm's operations [JMF99].

2.3.3.1. Hierarchical Clustering


Hierarchical clustering builds a tree of clusters, also known as a dendrogram. Every cluster node
contains child clusters; sibling clusters partition the points or items covered by their common parent.
Such an approach allows exploring data on different levels of granularity, easy handling of any
similarity or distance forms, and application to any attribute types. However, it has disadvantages related
to the vagueness of termination criteria, and the fact that its algorithms do not revisit already constructed (intermediate) clusters with the purpose of improving them.
Most hierarchical clustering algorithms are variants of the single-link [SnS73], where each item in a
class is linked to at least one other point in the class; and complete-link algorithms [Kin67], where each
item is linked to all other points in the class.
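As a hedged illustration, both variants can be run on a small document matrix with SciPy's standard hierarchical clustering routines (assuming SciPy is available; the toy data and the number of flat clusters are arbitrary).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

docs = np.array([[2, 0, 1], [1, 0, 1], [0, 3, 2], [0, 2, 2]], dtype=float)
dists = pdist(docs, metric='cosine')          # condensed pairwise distance matrix

single = linkage(dists, method='single')      # single-link dendrogram
complete = linkage(dists, method='complete')  # complete-link dendrogram

# cut each dendrogram into two flat clusters
print(fcluster(single, t=2, criterion='maxclust'))
print(fcluster(complete, t=2, criterion='maxclust'))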

2.3.3.2. Partitional Clustering


A partitional clustering algorithm obtains a single partition of the data instead of a clustering
structure, such as the dendrogram. Partitional methods have advantages in applications involving large
data sets for which the construction of a dendrogram is computationally prohibitive. A problem
accompanying the use of a partitional algorithm is the choice of the number of desired output clusters.
The partitional techniques usually produce clusters by optimizing a criterion function defined either
locally (on a subset of the feature vectors) or globally (defined over all of the feature vectors).
Combinatorial search of the set of possible labelings for an optimum value of a criterion is clearly
computationally prohibitive. In practice, therefore, the algorithm is typically run multiple times with
different starting states, and the best configuration obtained from all of the runs is used as the output
clustering.
The most intuitive and frequently used criterion function in partitional clustering techniques is the
squared error criterion, which tends to work well with isolated and compact clusters. The k-means is the
simplest and most commonly used algorithm employing a squared error criterion [Mac67] (see Section
4.3.1 for more details concerning this algorithm). Several variants of the k-means algorithm have been
reported in the literature. One of them will be studied in Chapter 5.
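A compact sketch of the k-means procedure built around the squared error criterion (a minimal implementation of our own, not the variant studied later in this thesis) is given below.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # X: (n_samples, n_features) array; returns cluster labels and centroids
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to the nearest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster became empty
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
print(kmeans(X, k=2)[0])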

2.3.3.3. Graph-Theoretic Clustering


Graph-theoretic clustering is basically a subclass of the partitional taxonomy, but even hierarchical approaches are related to this category of algorithms, given that single-link clusters are subgraphs of the minimum spanning tree of the data [GoR69, Zah71], and complete-link clusters are maximal complete sub-graphs related to the node colorability of graphs [BaH76]. In graph-theoretic
algorithms, the data is represented as nodes in a graph and the dissimilarity between two objects is the
length of the edge between the corresponding nodes. In several methods, a cluster is a sub-graph that
remains connected after the removal of the longest edges of the graph [JaD88]; for example, in [Zah71]
the minimal spanning tree of the original graph is built and then the longest edges are deleted. However,
some other graph-theoretic methods rely on the extraction of cliques [AGG98], and are then more
related to squared error methods.
Based on graph-theoretic clustering, there has been significant interest recently in spectral clustering
using kernel methods [NJW02]. Spectral clustering techniques make use of the spectrum of the
similarity matrix of the data to cluster the points, instead of the distances between these points. The
implementation of a spectral clustering algorithm is formulated as a graph partitioning problem, where the weight of each edge is the similarity between the points corresponding to the vertices connected by that edge, and the goal is to find minimum-weight cuts in the graph. This problem can be addressed by means of linear algebra methods, in particular eigenvalue decomposition techniques, from which the term spectral derives. These methods can roughly be divided into two main categories: spectral graph cuts [Wei99], including ratio-cut [HaK92], normalized cut [ShM00], and min-max cut [DHZ01]; and eigenmap methods [RoS00, ZhZ02], such as Laplacian eigenmaps [BeN03] and Hessian eigenmaps [DoG03].
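A minimal sketch of the spectral idea, in the spirit of [NJW02], is given below; it assumes NumPy, SciPy, and scikit-learn are available, takes a precomputed similarity matrix, embeds the points using the eigenvectors of the normalized graph Laplacian, and then applies k-means. The details are simplified and are not tied to any particular method cited above.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clusters(W, k):
    # W: symmetric (n, n) similarity matrix with non-negative weights
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized Laplacian
    # eigenvectors associated with the k smallest eigenvalues
    _, vecs = eigh(L_sym, subset_by_index=[0, k - 1])
    rows = vecs / np.maximum(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(rows)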

2.3.3.4. Incremental Clustering


Incremental clustering is based on the assumption that it is possible to consider data points one at a time
and assign them to existing clusters. A new data point is assigned to a cluster without affecting the
existing clusters significantly. This kind of algorithm is employed to improve the chances of finding the
global optimum. Data are stored in the secondary memory and data points are transferred to the main
memory one at a time for clustering. Only the cluster representations are stored permanently in the main
memory to alleviate space limitations [Dun03, AMC05]. Therefore, the space requirements of an incremental algorithm are very small, being needed only for the cluster centroids; moreover, such algorithms are typically non-iterative, so their time requirements are also small.
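A toy sketch of this single-pass, centroid-only behaviour (the distance threshold and the leader-style assignment rule are our own illustrative choices) is given below.

import numpy as np

def incremental_cluster(stream, threshold=1.0):
    # stream: iterable of feature vectors arriving one at a time
    centroids, counts, labels = [], [], []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if centroids:
            dists = [np.linalg.norm(x - c) for c in centroids]
            j = int(np.argmin(dists))
        if not centroids or dists[j] > threshold:
            centroids.append(x.copy()); counts.append(1)   # open a new cluster
            labels.append(len(centroids) - 1)
        else:
            counts[j] += 1                                  # update the running centroid only
            centroids[j] += (x - centroids[j]) / counts[j]
            labels.append(j)
    return labels, centroids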

2.3.4. Document Clustering Methods Used for IR


Many sound document clustering methods have been proposed in the context of information retrieval. Single-link is one of the first methods used for this purpose [Van79]. However, a disadvantage of this method, and probably of every cluster generation method, is that it requires (at least) one empirically decided constant: a threshold on the similarity measure or a desired number of clusters.
This constant greatly affects the final partitioning.
The method proposed by Zahn [Zah71] is an attempt to circumvent this problem. He suggests
finding a minimum spanning tree for the given set of points (documents) and then deleting the
inconsistent edges. An edge is inconsistent if its length $l$ is much larger than the average length $l_{avg}$ of its incident edges. The connected components of the resulting graph are the suggested clusters. Again, the method is based on an empirically defined constant (the threshold in the definition of an inconsistent edge). However, the results of the method are not very sensitive to the value of this constant.
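A rough sketch of Zahn's procedure, assuming SciPy is available and using "longer than a constant factor times the average incident MST edge" as the empirical inconsistency test, might look like this.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import squareform, pdist

def zahn_clusters(points, factor=2.0):
    # points: (n, d) array; factor: empirical constant in the inconsistency test
    D = squareform(pdist(points))
    mst = minimum_spanning_tree(D).toarray()
    edges = [(i, j, mst[i, j]) for i in range(len(points))
             for j in range(len(points)) if mst[i, j] > 0]
    kept = np.zeros_like(mst)
    for i, j, length in edges:
        incident = [l for (a, b, l) in edges
                    if (a in (i, j) or b in (i, j)) and (a, b) != (i, j)]
        avg = np.mean(incident) if incident else length
        if length <= factor * avg:          # keep consistent edges only
            kept[i, j] = kept[j, i] = 1
    n_comp, labels = connected_components(kept, directed=False)
    return labels

pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]])
print(zahn_clusters(pts))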
Many iterative methods have appeared in the literature. The simplest and fastest one seems to be the
single pass method [SaW78].
Hybrid methods may be used. Salton and McGill [SaM83] suggest using an iterative method to
create a rough partition of the documents into clusters and then applying a graph-theoretic method to
subdivide each of the previous clusters. Another hybrid approach is mentioned by Van-Rijsbergen
[Van79]. Some documents are sampled from the document collection and a core clustering is constructed using an O(n²) method for this sample. The remainder of the documents are assigned to the
existing clusters using a fast assignment strategy.

2. 4. Dimensionality Reduction
As storage technologies evolve, the amount of available data explodes in both dimensions: the number of samples and the dimension of the input space. Therefore, dimension reduction techniques are needed to explore and analyze such huge data sets. In high-dimensional data, many dimensions are often irrelevant. These irrelevant dimensions can confuse analysis algorithms by hiding useful information in noisy data. As the number of dimensions in a dataset increases, distance measures become increasingly meaningless: additional dimensions spread out the points until, in very high dimensions, they are almost equidistant from each other.
Various dimensionality reduction methods have been proposed including both term transformation
and term selection techniques. Feature transformation techniques attempt to generate a smaller set of synthetic terms by creating combinations of the original terms. These techniques are
very successful in uncovering latent structure in datasets. However, since they preserve the relative
distances between documents, they are less effective when there are large numbers of irrelevant terms
that hide the difference between sets of similar documents in a sea of noise. In addition, seeing that the
synthetic terms are combinations of the originals, it may be very difficult to interpret the synthetic terms
in the context of the domain. Term selection methods, on the other hand, have the advantage of selecting the most relevant dimensions from a dataset, and can reveal groups of documents that are similar within a subset of their terms.

2.4.1. Term Transformation


Term transformation techniques, also known as term extraction, apply a mapping of the multidimensional space into a space of fewer dimensions. This means that the original term space is transformed by applying algebraic transformation methods. These methods can be broadly classified into two groups: linear and non-linear methods.
- Linear techniques include independent component analysis (ICA) [Com94], principal component analysis (PCA) [Dun89], factor analysis [LaM71], and singular value decomposition (SVD, detailed in Section 3.2.3) [GoV89].
- Non-linear methods are themselves subdivided into two groups: those providing a mapping and those giving a visualization. The non-linear mapping methods include techniques such as kernel PCA [SSM99] and Gaussian process latent variable models (GPLVM) [Law03]. Non-linear visualization methods, which are based on proximity (that is, distance) data, include Locally Linear Embedding (LLE) [RoS00], Hessian LLE [DoG03], Laplacian Eigenmaps [BeN03], Multidimensional Scaling (MDS) [BoG97], Isometric Maps (ISOMAP) [TSL00], and Local Tangent Space Alignment (LTSA) [ZhZ02].

The transformations generally preserve the original, relative distances between documents. Term
transformation is often a preprocessing step, allowing the analysis algorithm to use just a few of the newly
created synthetic terms. A few algorithms have incorporated the use of such transformations to identify
important terms and iteratively improve their performance [HiK99, DHZ02]. While often very useful,
these techniques do not actually remove any of the original terms from consideration. Thus, information
from irrelevant dimensions is preserved, making these techniques ineffective at revealing sets of similar
documents when there are large numbers of irrelevant terms that mask the sets. Another disadvantage of
using combinations of terms is that they are difficult to interpret, often making the algorithm results less
useful. Because of this, term transformations are best suited to datasets where most of the dimensions
are relevant, while many are highly correlated or redundant.
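As a brief illustration of such a linear transformation, a rank-k truncated SVD of a term-document matrix can be computed as follows (the numbers of terms, documents, and retained dimensions are arbitrary, and NumPy is assumed to be available).

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((1000, 200))          # toy term-document matrix: 1000 terms x 200 documents
k = 50                               # number of synthetic dimensions to keep

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

docs_reduced = (np.diag(sk) @ Vtk).T   # each document as a k-dimensional vector
A_approx = Uk @ np.diag(sk) @ Vtk      # rank-k approximation of the original matrix
print(docs_reduced.shape, np.linalg.norm(A - A_approx))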

2.4.2. Term Selection


2.4.2.1. Definition
Term selection (also known as subset selection) generally refers to selecting a set of feature terms that is more informative for a given machine learning task, while removing irrelevant or redundant terms. This process ultimately leads to the reduction of dimensionality of the
original term space, but the selected term set should contain sufficient or more reliable information
about the original data set. To this end, many criteria are used [BlL97, LiM98, PLL01, YuL03].
There are two approaches for term selection:
Forward selection starts with no terms and adds them one by one, at each step adding the one that decreases the error the most, until no further addition significantly decreases the error.
Backward selection starts with all the terms and removes them one by one, at each step removing the one whose removal decreases the error the most (or increases it only slightly), until any further removal would increase the error significantly.
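A generic sketch of forward selection is given below; it assumes a user-supplied error(subset) function that trains and evaluates a model on the given subset of terms, and the stopping tolerance is an illustrative parameter. Backward selection is the mirror image, starting from the full term set.

def forward_selection(all_terms, error, tolerance=1e-4):
    # all_terms: iterable of candidate term identifiers
    # error(subset): returns the evaluation error for a given term subset (user supplied)
    selected = []
    best_error = error(selected)
    improved = True
    while improved:
        improved = False
        candidates = [t for t in all_terms if t not in selected]
        scored = [(error(selected + [t]), t) for t in candidates]
        if not scored:
            break
        new_error, best_term = min(scored)
        if best_error - new_error > tolerance:       # keep only significant improvements
            selected.append(best_term)
            best_error = new_error
            improved = True
    return selected, best_error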

2.4.2.2. Feature Selection Methods


Term selection methods have relied heavily on the analysis of the characteristics of a given data set
through statistical or information-theoretical measures. For text learning tasks, they primarily count on
the vocabulary-specific characteristics of a given textual data set to identify good term features. Although the statistics themselves do not capture the meaning of the text, these methods have proved to be useful for text learning tasks (e.g., classification and clustering) [SAS04].
Many feature selection approaches have been proposed. We review some of these approaches below, in chronological order.
Kira and Rendell [KiR92] described a statistical feature selection algorithm called RELIEF that uses
instance based learning to assign a relevance weight to each feature.
John et al. [JKP94] addressed the problem of irrelevant features and the subset selection problem.
They presented definitions for irrelevance and for two degrees of relevance (weak and strong). They also
state that features selected should depend not only on the features and the target concept, but also on the
induction algorithm. Further, they claim that the filter model approach to subset selection should be
replaced with the wrapper model.
Pudil et al. [PNK94] presented floating search methods in feature selection. These are sequential
search methods characterized by a dynamically changing number of features included or eliminated at
each step. They were shown to give very good results and to be computationally more effective than the
branch and bound method.
Koller and Sahami [KoS96] examined a method for feature subset selection based on Information
Theory: they presented a theoretically justified model for optimal feature selection based on using cross-entropy to minimize the amount of predictive information lost during feature elimination.
Jain and Zongker [JaZ97] considered various feature subset selection algorithms and found that the
sequential forward floating selection algorithm, proposed by Pudil et al. [PNK94], dominated the other
algorithms tested.
Dash and Liu [DaL97] gave a survey of feature selection methods for classification.
In a comparative study of feature selection methods in statistical learning of text categorization (with
a focus on aggressive dimensionality reduction), Yang and Pedersen [YaP97] evaluated document
frequency (DF), information gain (IG), mutual information (MI), a χ²-test (CHI), and term strength (TS); they found IG and CHI to be the most effective.
Blum and Langley [BlL97] focused on two key issues: the problem of selecting relevant features and
the problem of selecting relevant examples.
Kohavi and John [KoJ97] introduced wrappers for feature subset selection. Their approach searches
for an optimal feature subset tailored to a particular learning algorithm and a particular training set.
Yang and Honavar [YaH98] used a genetic algorithm for feature subset selection.
Liu and Motoda [LiM98] wrote their book on feature selection which offers an overview of the
methods developed since the 1970s and provides a general framework in order to examine these
methods and categorize them.
Vesanto and Ahola [VeA99] proposed to visually detect correlation using a self-organizing maps
based approach (SOM).
Makarenkov and Legendre [MaL01] try to approximate an ultra-metric in the Euclidean space or to
preserve the set of the k-nearest neighbors.
Weston et al. [WMC01] introduced a method of feature selection for SVMs which is based upon
finding those features which minimize bounds on the leave-one-out error. The method was shown to be
superior to some standard feature selection algorithms on the data sets tested.
Xing et al. [XJK01] successfully applied feature selection methods (using a hybrid of filter and
wrapper approaches) to a classification problem in molecular biology involving only 72 data points in a
7130 dimensional space. They also investigated regularization methods as an alternative to feature
selection, and showed that feature selection methods were preferable in the problem they tackled.
Mitra et al. [MMP02] use a similarity measure that corresponds to the lowest eigenvalue of
correlation matrix between two features.
See Miller [Mil02] for a book on subset selection in regression.
Forman [For03] presented an empirical comparison of twelve feature selection methods. Results
revealed the surprising performance of a new feature selection metric, Bi-Normal Separation (BNS).
Dhillon et al. [DKN03] present two term selection techniques, the first based on the term variance

quality measure, while the second is based on co-occurrence of similar terms in the same context.
Guyon and Elisseeff [GuE03] gave an introduction to variable and feature selection. They
recommend using a linear predictor of your choice (e.g. a linear SVM) and select variables in two
alternate ways: (1) with a nested subset selection method performing forward or backward selection or
with multiplicative updates; (2) with a variable ranking method using correlation coefficient or mutual
information.
Guérif et al. [GBJ05] used a similar idea to Vesanto and Ahola's work [VeA99] and integrated a
weighting mechanism in the SOM training algorithm to reduce the redundancy side effects.
More recently, some approaches have been proposed to address the difficult issue of eliminating irrelevant features in the unsupervised learning context [Bla06, GuB06]. These approaches use partition quality measures such as the Davies-Bouldin index [DaB79, GuB06], the Wemmert and Gancarski index, or the entropy [Bla06]. In addition, Guérif and Bennani [GuB07] extended the w-k-means algorithm proposed by Huang et al. [HNR05] to the SOM framework, basing their feature selection approach on the weighting coefficients learned during the optimization process.

2. 5. Studied Languages
2.5.1. English Language
English is a West Germanic language originating in England. It was ranked the second most widely spoken language in the world5, and is used extensively as a second language and as an official language
throughout the world, especially in Commonwealth countries, and in many international organizations.
English is the dominant international language in communication, science, business, aviation,
entertainment, radio and diplomacy. The influence of the British Empire is the primary reason for the
initial spread of the language far beyond the British Isles. Following World War II, the growing
economic and cultural influence of the United States has significantly accelerated the spread of the
language.
Hence, many studies have been interested in this language, and it possesses a very rich base of freely available corpora, which helped us to evaluate several of our studies.

5 http://www.photius.com/rankings/languages2.html, Ethnologue, 13th Edition, Barbara F. Grimes, Editor, 1996, Summer Institute of Linguistics, Retrieved on 10-05-2007.


2.5.2. Arabic Language


Arabic is currently the second most widely spoken language in the world, with an estimated number
of native speakers larger than 422 million6. Arabic is the official language in more than 24 countries7.
Since it is also the language of religious instruction in Islam, many more speakers have at least a passive
knowledge of the language. Until the advent of Islam in the seventh century CE, Arabic was primarily a
regional language. The Quran, Islam's holy book, was revealed to the Prophet Muhammad (Peace be
upon him) in Arabic, thereby giving the language great religious significance.
Muslims believe that to fully understand the message of the Quran, it must be read in its original
language: Arabic. Thus, the importance of the Arabic language extends well beyond the borders of the
Arab world. There are over 1.5 billion Muslims worldwide, and they all strive to learn Arabic in order to
read and pray in the language of revelation. Hence, Arabic has seen a very rapid growth. Statistics show
that since 1995, when the first Arabic newspaper Asharq Alawsat (Middle East) was launched online8,
the number of Arabic websites has been growing exponentially. By 2000 there were about 20 thousand
Arabic sites on the web, and by 2006 the number was estimated at around 100 million.

2.5.3. Arabic Forms


There are three forms of Arabic: Classical, Modern Standard, and Colloquial. The Quran became the fixed standard for Arabic, particularly for the written form of the language. Arabs consider
the Classical Arabic of the Quran as the ultimate in linguistic beauty and perfection. The
contemporary Modern Standard Arabic, based on the classical form of the language, is used in
literature, print media, and formal communication such as news broadcasts; while, the Colloquial
Arabic or locally spoken dialect varies from country to country and region to region throughout the
Arab world.
The written Arabic has changed comparatively little since the seventh century; spoken Arabic has
assumed many local and regional variations. It has also incorporated foreign words; for example, in the
twentieth century, many new non-Arabic words have found their way into the language, particularly
terms relating to modern technology. Although there are Modern Standard Arabic equivalents for
computer, telephone, television, and radio, most Arabs, in speaking, will use the English or
French versions of these words.

6 http://encarta.msn.com/media_701500404/Languages_Spoken_by_More_Than_10_Million_People.html, Microsoft Encarta 2006, Retrieved on 10-05-2007.
7 http://en.wikipedia.org/wiki/List_of_official_languages, Retrieved on 10-05-2007.
8 www.asharqalawsat.com.

2.5.4. Arabic Language Characteristics


Arabic is a Semitic language, like Hebrew, Aramaic, and Amharic. Unlike Latin-based alphabets, the
orientation of writing in Arabic is from right-to-left. The Arabic alphabet consists of 28 letters, many of
which parallel letters in the Roman alphabet (see Table 2.1). The letters are strung together to form
words in one way only; there is no distinction between printing and cursive as there is in English. Nor are there capital and lowercase letters; all the letters are written the same way.

The 28 letters of the alphabet, in their traditional order, are: Alif (a*), Baa, Taa, Thaa (th), Jiim, Haa, Kha (kh), Daal, Thaal (dh), Raa, Zaayn, Siin, Shiin (sh), Saad, Daad, Taa, Thaa, Ayn, Ghayn (gh), Faa, Qaaf, Kaaf, Laam, Miim, Nuun, Haa, Waaw (w*), and Yaa (y*).

* when Alif, waaw, or yaa is used as a consonant

Table 2.1. Arabic letters.


The shape of the letter, however, changes depending on neighboring characters and their placement
within a word. Table 2.2 shows the four different shapes of the letter gh (Ghayn). In general, all the letters are connected to one another, except for a small set of letters that cannot be attached on the left.

Isolated    End    Middle    Beginning

Table 2.2. Different shapes of the letter gh (Ghayn).


In Arabic, there are three long vowels, aa, ii, and uu, represented by the letters a (alif) [a:], y (yaa) [i:], and w (waaw) [u:] respectively. Diacritics, called respectively fatha [a], kasra [i], damma [u], fathatayn [an], kasratayn [in], dammatayn [un], sukuun (no vowel), and shaddah, are placed above or below a consonant to mark short vowels and gemination or tashdeed (consonant doubling), and distinguish between words having the same written representation. For example, if we consider the English words some, sum, and same, non-diacritization (vowellessness) would reduce these words to sm. For Arabic examples see Table 2.3 (refer to Table A.1 for the mapping between the Arabic letters and their Buckwalter transliteration). Diacritics appear in the Holy Quran, and with less consistency in other religious texts, classical poetry, and textbooks for children and foreign learners. They also appear occasionally in complex texts to avoid ambiguity. In everyday writing, however, diacritics are omitted, and the reader recognizes the words as a result of experience as well as the context.

Word     1st interpretation       2nd interpretation           3rd interpretation
ktb      kataba (wrote)           kutiba (has been written)    kutubN (books)
mdrsp    madorasapN (school)      mudar~isapN (teacher)        mudar~asapN (taught)

Table 2.3. Ambiguity caused by the absence of vowels in the words ktb and mdrsp.
In addition to singular and plural constructs, Arabic has a form called dual that indicates precisely
two of something. For example, a pen is qalam, two pens are qalamayn, and pens are aqlaam. As in French, Spanish, and many other languages, Arabic nouns are either feminine or
masculine, and the verbs and adjectives that refer to them must agree in gender. In written Arabic, case
endings are used to designate parts of speech (subject, object, prepositional phrase, etc.), in a similar
fashion to Latin and German.
English imposes a large number of constraints on word order. Arabic, however, is distinguished by
its high syntactical flexibility. This flexibility includes: the omission of some prepositional phrases
associated with verbs; the possibility of using several prepositions with the same verb while preserving
the meaning; allowing more than one matching case between the verb and the verbal subject, and the
adjective and its broken plural, and the sharpness of pronominalization phenomena where the pronouns
usually indicate the original positions of words before their extra-positioning, fronting and omission. In
other words, Arabic allows a great deal of freedom in the ordering of words in a sentence. Thus, the
syntax of the sentence can vary according to transformational mechanisms such as extra-position, fronting, and omission, or according to syntactic replacement such as an agent noun in place of a verb.

2.5.4.1. Arabic Morphology


Arabic words are divided into three types: noun, verb, and particle [Abd87, Als99]. Nouns and verbs
are derived from a closed set of around 10,000 roots [Ibn90]. Generally speaking, in English, the root is
sometimes called the word base or stem; it is the part of the word that remains after the removal of
affixes [Alku91]. In Arabic, however, the base or stem is different from the root [Ala90]. In Arabic, the
root is the original form of the word before any transformation process. The roots are commonly three or
four letters and are rarely five letters. Most Arabic words are derived from roots according to specific
rules, by applying templates and schemes, in order to construct groups of words whose meanings relate
to each other. Extensive word families are constructed by adding prefixes, infixes (letters added inside
the word), and suffixes to a root. There are about 900 patterns [AlA04a]; some of them are more
complex when the gemination is used. Table 2.4 shows an example of 3 letter roots templates. Those
words that are not derived from roots do not seem to follow a similar set of well-defined rules. They are
divided into two kinds: primitive (not derived from a verbal root), like lion (Asd), or borrowed from foreign languages, like oxide (Aksyd). Instead, they may have a group of related forms showing their family resemblances (for an example see Table 2.5).
Pattern        Example (root ktb)    English meaning
CCC            ktb                   writing notion
CaCaCa         kataba                wrote
CaACiC         kaAtib                writer
CaCuwC         katuwb                skilled writer
CiCaAC         kitaAb                book
CuCay~iC       kutay~ib              handbook
maCOCaC        makOtab               desk
maCOCaCap      makOtabap             library
CaCaACiyC      kataAtiyb             Quran school

C stands for a letter that is part of the root; an underlined C stands for a letter that is doubled; a, i, and u designate vowels; and m represents a derivation consonant.

Table 2.4. Some templates generated from roots, with examples from the root ktb.


Transliteration   Pattern     English word
>akOsada          faEOlala    Oxidize
Mu&akOsad         mufaEOlal   Oxidized
>aksadah          faEOlalah   Oxidation
ta>aksud          tafaEOlul   Oxidation

Table 2.5. Derivations from a borrowed word.


When using an Arabic dictionary, one does not simply look up words alphabetically. The three letter
root must first be determined, and then it can be located alphabetically in the dictionary. Under the root,
the different words that belong to that word family are listed. The number of unique Arabic words (or
surface forms) is estimated to be 6 x 10^10 words [Att00].
In this thesis, a word is any Arabic surface form, a stem is a word without any prefixes or suffixes,
and a root is a linguistic unit of meaning, which has no prefix, suffix, or infix (for more details see the
Appendix A). However, often irregular roots, which contain double or weak letters, lead to stems and
words that have letters from the root that are deleted or replaced.

2.5.4.2. Word-form Structures


Arabic word-forms can be roughly considered as equivalent to graphic words. In the Arabic language, a word can signify a full sentence, due to its compound structure based on the agglutination of grammatical elements, where prefixes and suffixes contribute to its form.
Prefixes
Arabic prefixes are sets of letters and articles attached to the beginning of the lexical word and
written as part of it. A small inventory of the prefixes in Arabic yields the following grammatical
categories: the definite article Al (the), the connectives fa and wa (and), the prepositions bi (with), li (for, to), and ka (as), the future particle sa used with verbs, the conjunctive particle li (in order to), the negation lA, the conditional particle la, and the interrogative particle > (alif-hamza). We must take into consideration that in written language the vowel is omitted. This means that, in practice, those grammatical categories are reduced to no more than one consonant which is written onto the word, with the exception of the definite article, the negation, and the interrogative particle alif-hamza. The one-consonant prefixes are called consonant particles. The fact that they consist of one sole consonant complicates their identification in a text.
Indeed, many words in Arabic do start with one of these consonants. It is true that many Arabic words are
composed of three consonants, but this is not always the case. It might be possible to identify those
prefixes by comparing prefixed words to a huge database of lexical forms in order to define which
words contain prefixes and which do not. The outcome of this process, however, is not clear at all.
However, the above mentioned prefixes do also occur in combination. This means that in practice
two or three prefixes can be linked to a word. The three most frequently used combinations of prefixes
are: (1) a combination between a connective and a preposition (for instance: wa-bi, in written
language wb, meaning: and with), (2) a combination between a preposition and the article (for
instance: bi-Al, in written language bAl, meaning: with the) and (3) a combination of three particles, which is most commonly the combination of a connective, a preposition, and the article (for instance: wa-bi-Al, in written language wbAl, meaning: and with the).
Suffixes
In Arabic, suffixes are sets of letters, articles, and pronouns attached to the end of the word and
written as part of it. There are 17 suffixes used as possessive suffixes. Besides, there is the suffix A (alif), which is used as an indefinite accusative, and there is the suffix of the energetic, n (nna). The possessive suffixes consist of one or two consonants; obviously, a one-consonant suffix is more difficult to identify than a two-consonant one. Moreover, there will always remain combinations which are ambiguous. The suffix hA of the third person singular feminine, for instance, can easily be mixed up with the indefinite accusative A (alif) of a word ending with the consonant h.
The representation below shows a possible word structure. Note that the writing and reading of an
Arabic word are from right to left.
Postfix | Suffix | Scheme | Prefix | Antefix

- The antefixes are prepositions or conjunctions.
- The prefixes and suffixes express grammatical features and indicate the functions: noun case, verb mood, and modalities (number, gender, person, ...).
- The postfixes are personal pronouns.
- The scheme is a stem.


Example: the word >atata*akrunanaA expresses in English the sentence "Do you remember us?". The segmentation of this word gives the following constituents:
- Antefix: > (interrogative conjunction).
- Prefix: ta (verbal prefix).
- Scheme: ta*akr.
- Suffix: wn (verbal suffix).
- Postfix: naA (pronoun suffix).

2.5.5. Anomalies
As is generally known, the Arabic language is complicated for natural language processing because
of the combination of two main language characteristics. The first is the agglutinative nature of the
language and the second is the aspect of the vowellessness of the language which causes problems of
ambiguity at different levels, and complicates the identification of words.
2.5.5.1. Agglutination
The first problem is the identification of words in sentences. As in most European languages, Arabic
words can, to a certain degree, be identified in computer terms as a string of characters between blanks.
Two blanks in a text serve as a marker for the separation of strings of characters, but those strings of
characters do not always coincide with words. Some Arabic grammatical categories which are
considered words in other languages appear to be affixes. Those affixes are directly linked to the words
in Arabic (as is explicated in Section 2.5.4.2), which means that a string of characters between two
blanks can contain more than one word so that multiword combinations are found which are not
separated by blanks.
The string is ambiguous, as the affixes could be an attached particle or a part of the word. Thus a form such as flky can be read as falaky, meaning astronomer, or falikay, which means then for, or falikayyi, which means then for ironing or burning. Note that there is no deterministic way to tell whether the first letter is part of the word or a prefix.
2.5.5.2. The Vowelless Nature of the Arabic Language
The second problem in Arabic is the vowellessness of the words in sentences. This causes problems
not only on the previous mentioned multiword combinations, but also on word level. The vowellessness
affects the meaning of words. As an example, we take the string of characters consisting of the consonants kaf and lam. A reader could identify these two consonants as the noun kullo, which means all, or the verb kul (the form of the verb akala for the second person singular masculine), which means eat. Another example of such ambiguity is the string mdAfE, which, according to its vocalization, could mean either mudaAfiE (defender) or madaAfiE (cannons).
Also it affects the grammatical labeling of words, which is especially the case for verbs. The
different persons of the verb form, both in the present and past tenses, are in most cases only identifiable
by means of vowels which are omitted. The verb form ktbt, for example, can refer to four possible persons: katabOtu for the first person singular, katabOta for the second person singular masculine, katabOti for the second person singular feminine, and katabatO for the third person singular feminine. It is almost impossible for a computer program to
determine the subject of these verbs. Only the context can help in defining the correct persons of a verb
form. In this respect some help might be expected from a minimal form of text categorization. Indeed, in
newspaper text, the first person singular is less likely to occur, whereas in literature this person might
occur more abundantly. Nevertheless, it seems quite difficult to tag texts automatically when they are
not vocalized or when the larger context cannot be taken into account.

2.5.6. Early Work


Although academia has made significant achievements in the Arabic text retrieval field, the complex
morphological structure of the Arabic language provides many challenges. Hence, research and development (R&D) in Arabic text retrieval still has a long way to go.
Existing Arabic text retrieval systems can be classified into three groups: Full-form-based IR, Morphology-based IR, and Non-rule-based IR.
2.5.6.1. Full-form-based IR
Despite academic research, most of the commercial Arabic IR systems presented by search engines
are very primitive, all using a very basic string matching search. This includes what is being classified as
native Arabic search engines, which means owned or managed in or by Arab companies or institutions,
such as Sakhr web engine Johaina-sakhr and ayna; those considered as Unicode multilingual engines
such as AltaVista and Google; and web directories, where documents are classified based on the subject
categorization, such as Naseej the first Arabic portal site on the Internet launched in early 1997 to serve
the growing number of Arabs on the Internet, Al-Murshid, Art Arab, and Yasalaam9.
The issue with these types of search engines, where the search is literal, is their limitations. Although they are based on the simplest search and retrieval method, which has the advantage that all the returned documents undoubtedly contain the exact term for which the user is looking, they also have the major disadvantage that many, if not most, of the documents containing the terms in different forms will be missed. Given the many ambiguities of written Arabic, the success rate of these engines is quite low. For example, if the user searches for kitaab, which means book in English, he or she will not find documents that only contain al-kitaab, which means the book.

9 All these links were retrieved on 10-05-2007.

2.5.6.2. Morphology-based IR
The efforts that have been made in the academic environment to evaluate more sophisticated systems
give an idea about the next generation of the Arabic search engines. Evaluation has been performed on
systems using multiple approaches of incorporating morphology. Different proposed classifications of
Arabic morphological analysis techniques, found in literature, are reviewed in the work of Al-Sughaiyer
and Al-Kharashi [AlA04a]. However, in this work we adapt the Larkey et al. classification [LBC02],
where they proposed classifying Arabic stemmers into four different classes, namely, manually
constructed dictionaries, algorithmic light stemmers which remove prefixes and suffixes, morphological
analyses which attempt to find roots, and statistical stemmers.
Constructed Dictionaries: Manually constructed dictionaries of words with stemming information
are in surprisingly wide use. Al-Kharashi and Evens worked with small text collections, for which they
manually built dictionaries of roots and stems for each word to be indexed [AlE94]. Tim Buckwalter10
developed a set of lexicons of Arabic stems, prefixes, and suffixes, with truth tables indicating legal
combinations. The BBN group used this table-based stemmer in TREC - 2001 [XFW01].
Algorithmic Light Stemmers: Light stemming refers to a process of stripping off a small set of
prefixes and/or suffixes, without trying to deal with infixes, or recognize patterns and find roots
[LBC02, Dar03]. Although light stemming can correctly conflate many variants of words into large stem
classes, it can fail to conflate other forms that should go together. For example, broken (irregular)
plurals for nouns and adjectives do not get conflated with their singular forms, and past tense verbs do
not get conflated with their present tense forms, because they retain some affixes and internal
differences, like the noun soduud, the plural of sad, which means dam.
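To give a flavour of what light stemming involves, the sketch below strips a few common prefixes and suffixes from Buckwalter-transliterated word forms; the affix lists and the minimum stem length are deliberately simplified and do not reproduce Al-Stem or any other published stemmer.

# Simplified light stemmer over Buckwalter transliteration (illustrative affix lists only).
PREFIXES = ["wAl", "fAl", "bAl", "Al", "wa", "fa", "bi", "li", "w", "f", "b", "l"]
SUFFIXES = ["hmA", "hm", "hn", "hA", "At", "wn", "yn", "An", "p", "h", "y"]

def light_stem(word, min_stem_len=3):
    stem = word
    for prefix in PREFIXES:                       # strip at most one prefix
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_stem_len:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:                       # strip at most one suffix
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem_len:
            stem = stem[:-len(suffix)]
            break
    return stem

print(light_stem("AlktAb"))   # -> ktAb
print(light_stem("ktAbhm"))   # -> ktAb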
Morphological Analyzers: Several morphological analyzers have been developed for Arabic
[AlA89, Als96, Bee96, KhG9911, DDJ01, GPD04, TEC05] but few have received a standard IR
evaluation. Such analyzers find the root, or any number of possible roots for each word. Since most
verbs and nouns in Arabic are derived from triliteral (or, rarely, quadriliteral) roots, identifying the
underlying root of each word theoretically retrieves most of the documents containing a given search
term regardless of form. However, there are some significant challenges with this approach.
Determining the root for a given word is extremely difficult, since it requires a detailed morphological,
syntactic and semantic analysis of the text to fully disambiguate the root forms. The issue is complicated
further by the fact that not all words are derived from roots. For example, loan words (words borrowed
from another language) are not based on root forms, although there are even exceptions to this rule. For

10 Buckwalter, T. Qamus: Arabic lexicography, http://www.qamus.org/lexicon.htm, Retrieved on 10-10-2007.
11 http://zeus.cs.pacificu.edu/shereen/research.htm#stemming, Retrieved on 10-10-2007.


example, some loans that have a structure similar to triliteral roots, such as the English word film, are handled grammatically as if they were root-based, adding to the complexity of this type of search. Finally, the root can serve as the foundation for a wide variety of words with related meanings. The root ktb is used for many words related to writing, including kataba, which means to write; kitaab, which means book; maktab, which means office; and kaatib, which means author. But the same root is also used for regiment/battalion: katyba. As a result, searching based on root forms results in very high recall, but precision is usually quite low.
2.5.6.3. Statistical Stemmers
In the statistical stemmer class, we distinguish between two kinds of stemmers: those grouping word variants using clustering techniques, and n-gram stemmers. The former model consists in grouping words that result in a common root after applying a specific algorithm into a conflation or equivalence class. These equivalence classes are not overlapping: each word belongs to exactly one class.
Based on the co-occurrence analysis and a variant of EMIM (expected mutual information) [Van79,
ChH89], which measures the proportion of word co-occurrences that are over and above what would be
expected by chance, statistical stemmers for Arabic language were used to refine stem-based and rootbased stemmers [LBC02]; whereas, they were applied also to n-gram stemmer for English and Spanish
languages [XuC98].
Statistical stemming applied to the best Arabic stemmers (Darwish light stemmer modified by
Larkey [LBC02], and Khoja root-based stemmer [KhG99]12) changes classes a great deal, but does not
improve (or hurt) overall retrieval performance. This may be attributed to the clustering method having a high bias against low-frequency variants.
The second statistical model, the n-gram model, generates a document vector by moving a window of n
characters in length through the text, enabling a statistical description of the language by learning the
occurrence probability of each group of these n characters.
N-gram stemmers have different challenges primarily caused by the significantly larger number of
unique terms in an Arabic corpus, and the peculiarities imposed by the Arabic infix structure that
reduces the rate of correct n-gram matching.
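As an illustration of the mechanism only (not the digram or trigram stemmers evaluated in the studies cited below), the following sketch extracts character n-grams from transliterated words and compares two variants by the overlap of their n-gram sets; the sample words and any conflation threshold are hypothetical:

    # Illustrative sketch only: character n-gram extraction and Dice overlap between two words;
    # the sample words and any conflation threshold are hypothetical, not taken from the cited studies.
    def char_ngrams(word, n=2):
        """Return the set of character n-grams (here digrams) of a word."""
        return {word[i:i + n] for i in range(len(word) - n + 1)}

    def dice_similarity(w1, w2, n=2):
        """Dice coefficient between the n-gram sets of two words."""
        a, b = char_ngrams(w1, n), char_ngrams(w2, n)
        if not a or not b:
            return 0.0
        return 2.0 * len(a & b) / (len(a) + len(b))

    print(dice_similarity("kitaab", "kitaabuhum"))   # variants sharing many digrams score high
    print(dice_similarity("kitaab", "maktab"))       # infix changes reduce the digram overlap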

Published studies comparing the use of stems against the use of roots for information retrieval are
inconsistent. Older studies revealed that words sharing a root are semantically related, and root indexing is

reported to outperform stem and word indexing on retrieval performance [HKE97, AAE99, Dar02].
However, later works on the TREC collection showed two different results. Darwish (as cited by Larkey
et al. [LBC02]) found no consistent difference between root and stem while Al-Jlayl & Frieder,
Goweder et al. and Thagva et al. [AlF02, GPD04, TEC05] showed that stem-based retrieval is more
effective than root-based retrieval. The older studies showing the superiority of roots over stems are
based on small, nonstandard test collections, which makes their results difficult to generalize.
Similarly, the work of Larkey et al. [LBC02] showed that the statistical stemmer based on
co-occurrence is still inferior to good light stemming and morphological analysis. In addition, the work of
Mustafa and Al-Radaideh [MuA04] indicated that the digram method offers better performance than the
trigram method with respect to conflation precision and conflation recall ratios, but in either case the
n-gram approach does not appear to perform well compared to the light stemming approach.
Hence, we can conclude that Al-Stem (the Darwish stem-based stemmer, modified by Larkey) was, up to
the day this study was carried out, the best known published stemmer.

2. 6. Arabic Corpus
In attempts to study and evaluate IR systems, morphological analyzers, and machine translation
systems for the Arabic language, researchers initiated the creation of corpora. Among these Arabic corpora,
some are available, such as the AFP corpus, the Al-Hayat newspaper, the Arabic Gigaword, treebanks, and
the ICA. However, with the exception of the initial version of the ICA, comprising about 448 files of a
total size of approximately 13.5 MB in uncompressed form, which has been made available for free, all
the other corpora are not free.

2.6.1. AFP Corpus


In 2001 LDC released the Arabic Newswire catalog number LDC2001T55, a corpus composed of
articles from the Agence France Presse (AFP) Arabic Newswire. The corpus size is 869 megabytes
divided over 383,872 documents. The corpus was tagged using SGML and was trans-coded to Unicode
(UTF-8). The corpus includes articles from May 13th 1994 to December 20th 2000 with approximately
76 million tokens and 666 094 unique words.

2.6.2. Al-Hayat Newspaper


Al-Hayat newspaper is a collection from the European Language Resources Distribution Agency
(ELRA) distributed under the catalog reference ELRA W0030 Arabic Data Set. The corpus was
developed in the course of a research project at the University of Essex, in collaboration with the Open
University.
The corpus contains Al-Hayat newspaper articles with value added for Language Engineering and
Information Retrieval applications development purposes.
The data have been distributed into seven subject-specific databases, thus following the Al-Hayat
subject tags: General, Car, Computer, News, Economics, Science, and Sport.
Mark-up, numbers, special characters and punctuation have been removed. The size of the total file
is 268 MB. The dataset contains 18 639 264 distinct tokens in 42 591 articles, organized in 7 domains.

2.6.3. Arabic Gigaword


In 2003 LDC also released the Arabic Gigaword, catalog number LDC2003T12, a bigger and richer
corpus compiled from different sources that include Agence France Presse, Al Hayat News Agency,
Al Nahar News Agency and Xinhua News Agency. There are 319 files, totalling approximately 1.1 GB
in compressed form (4348 MB uncompressed, and 391 619 words).
Besides this technical information, little is known about investigation of these collections and their
limitations in terms of richness and representativeness.

2.6.4. Treebanks
A treebank is a text corpus in which each sentence has been annotated with syntactic structure.
Syntactic structure is commonly represented as a tree structure, hence the name treebank. Treebanks can
be used in corpus linguistics for studying syntactic phenomena or in computational linguistics for
training or testing parsers.
Treebanks are often created on top of a corpus that has already been annotated with part-of-speech
tags. In turn, treebanks are sometimes enhanced with semantic or other linguistic information.
Treebanks can be created completely manually, where linguists annotate each sentence with
syntactic structure, or semi-automatically, where a parser assigns some syntactic structure which
linguists then check and, if necessary, correct.
Some treebanks follow a specific linguistic theory in their syntactic annotation (e.g. the
BulTreeBank13 follows head-driven phrase structure grammar (HPSG)), but most try to be less
theory-specific. However, two main groups can be distinguished: treebanks that annotate phrase structure,
such as the Penn Arabic Treebank14, and those that annotate dependency structure, such as the Prague
Arabic Dependency Treebank15.

13 http://www.bultreebank.org/, Retrieved on 10-8-2007.
14 http://www.ircs.upenn.edu/arabic/, Retrieved on 10-8-2007.
15 http://ufal.mff.cuni.cz/padt/PADT_1.0/index.html, Retrieved on 10-8-2007.


2.6.5. Other Efforts


The International Corpus of Arabic (ICA) by Al-Sulaiti and Atwell [AlA04b, AlA05] is an example
of the efforts to build suitable resources that could be made available to researchers in areas related
to Arabic. The ICA is designed on the principles of the ICE (International Corpus of English). This project
aims at a new corpus that includes a wide range of sources representing contemporary Arabic. An initial
version of the ICA, the CCA (Corpus of Contemporary Arabic), has been made available for free use at
http://www.comp.leeds.ac.uk/eric/latifa/research.htm16.
Other corpora are listed on the Latifa Al-Sulaiti Web site17, and in the NEMLAR survey on Arabic
language resources and tools of 2005 [NiC05].

2. 7. Summary
In this chapter, we highlight the different models of information retrieval, especially the vector space
model. We give a taxonomy of clustering algorithms, and explain the usefulness and the use of clustering
in the information retrieval process. We introduce dimension reduction techniques, and review
chronologically the feature selection methods used for clustering. Moreover, we present the Arabic
language characteristics, and underline previous work undertaken with the aim of improving Arabic
retrieval. Finally, we present the existing available Arabic corpora.

16 Retrieved on 10-8-2007.
17 http://www.comp.leeds.ac.uk/eric/latifa/arabic_corpora.htm, Retrieved on 10-8-2007.



Chapter 3 Latent Semantic Model


3. 1. Introduction
As storage becomes more plentiful and less expensive, the amount of electronic information is
growing at an exponential rate, and our ability to search that information and derive useful facts is
becoming more cumbersome unless new techniques are developed. Traditional lexical (or Boolean)
document retrieval techniques become less useful. Large, heterogeneous collections are difficult to
search since the sheer volume of unranked documents returned in response to a query is overwhelming
the user. Vector-space approaches to document retrieval, on the other hand, allow the user to search for
concepts rather than specific words and rank the results of the search according to their relative
similarity to the query. One vector-space approach, Latent Semantic Analysis (LSA), is capable of
achieving significant retrieval performance gains over standard lexical retrieval techniques (see
[Dum91]) by employing a reduced-rank model of the term-document space.
LSA [DDF90], owing to the way it represents terms and documents in a term-document space and
models the implicit higher-order structure in the association of terms and documents, is considered a
vector-space approach to conceptual document retrieval. It is useful in situations where traditional
lexical document retrieval approaches fail. LSA estimates the semantic content of the documents in a
collection and uses that estimate to rank the documents in order of decreasing relevance to a user's query.
Since the search is based on the concepts contained in the documents rather than the documents'
constituent terms, LSA can retrieve documents related to a user's query even when the query and the
documents do not share any common terms.
In the following, we describe the latent semantic analysis model components, including the term-document
representation, the weighting scheme phase, the singular value decomposition method, and the
standard query methods used in LSA. Moreover, we evaluate the impact of the weighting schemes, and
we compare the LSA performance to that of the standard Vector-Space Model (VSM).

3. 2. Model Description
The latent semantic document retrieval model builds upon prior research in document retrieval
and, using the singular value decomposition (SVD) [GoV89] to reduce the dimensions of the term-document
space, attempts to solve the synonymy and polysemy problems (Section 2.2.2) that plague
automatic document retrieval systems. LSA explicitly represents terms and documents in a rich, high-dimensional
space, allowing the underlying (latent) semantic relationships between terms and
documents to be exploited during searching.


LSA relies on the constituent terms of a document to suggest the document's semantic content.
However, the LSA model views the terms in a document as somewhat unreliable indicators of the
concepts contained in the document. It assumes that the variability of word choice partially obscures the
semantic structure of the document. By reducing the dimensionality of the term-document space, the
underlying, semantic relationships between documents are revealed, and much of the noise
(differences in word usage, terms that do not help distinguish documents, etc.) is eliminated. LSA
statistically analyses the patterns of word usage across the entire document collection, placing
documents with similar word usage patterns near each other in the term-document space, and allowing
semantically-related documents to be near each other even though they may not share terms [Dum91].
Compared to other document retrieval techniques, LSA performs surprisingly well. In one test,
Dumais [Dum91] found LSA provided 30% more related documents than standard word-based retrieval
techniques when searching the standard Med collection (see Section 3.3.1). Over five standard document
collections, the same study indicated LSA performed an average of 20% better than lexical retrieval
techniques. In addition, LSA is fully automatic and easy to use, requiring no complex expressions or
syntax to represent the query.
The following sections detail the LSA model steps.

3.2.1. Term-Document Representation


In the LSA model, terms and documents are first represented by an m x n incidence matrix
A = [a_ij]. Each of the m unique terms in the document collection is assigned a row in the matrix, while
each of the n documents in the collection is assigned a column in the matrix. The non-zero element a_ij
indicates not only that term i occurs in document j, but also the number of times the term appears in that
document. Since the number of terms in a given document is typically far less than the number of terms
in the entire document collection, A is usually very sparse [BeC87].
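As a small, hypothetical illustration of this representation (the toy corpus and the whitespace tokenization are assumptions of the sketch, not part of the collections used later):

    # Minimal sketch of building the m x n term-frequency matrix A; the toy corpus and the
    # whitespace tokenization are illustrative assumptions only.
    import numpy as np

    docs = ["human machine interface",
            "machine learning for interface design",
            "survey of user interface"]
    terms = sorted({t for d in docs for t in d.split()})      # the m unique terms
    A = np.zeros((len(terms), len(docs)))                     # m x n, very sparse in practice
    for j, d in enumerate(docs):
        for t in d.split():
            A[terms.index(t), j] += 1                         # a_ij = frequency of term i in document j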

3.2.2. Weighting
The benefits of weighting are well-known in the document retrieval community [Jon72, Dum91,
Dum92]. LSA typically uses both a local and global weighting scheme to increase or decrease the
relative importance of terms within documents and across the entire document collection, respectively.
A combination of the local and global weighting functions is applied to each non-zero element of A,
    a_ij = L(i, j) * G(i),                                        (1)
or
    a_ij = L(i, j) / G(i),                                        (2)

where L(i,j) is the local weighting function for term i indicating its importance in the document j, and


G(i) is the global weighting function for term i indicating its overall importance in the collection.
In addition to the formula (1) [Dum91, Dum92] and the formula (2), in this dissertation the formula (3)
[BeB99] is also utilized,

    a_ij = L(i, j) * G(i) / N(j)                                  (3)

where N(j), the document length normalization, is used to penalize the term weight for the document j
in accordance with its length. Such weighting functions are used to differentially treat terms and
documents to reflect knowledge that is beyond the collection of the documents.
Some popular local weighting schemes include [Dum91, Dum92]:

- Term Frequency: tf or f_ij is the integer representing the number of times term i appears in
  document j.

- Binary Weighting: is equal to 1 if a term occurs in the document and 0 otherwise,

      a_ij = 1 if f_ij >= 1, and 0 otherwise.

- log(Term Frequency + 1): is used to damp the effects of large differences in frequencies, such
  that an additional occurrence of term i in document j is considered more important at smaller
  term frequency levels than at larger levels.

Four well-known global weightings are: Normal, GfIdf, Idf, and Entropy. Each is defined in terms of
the term frequency f ij , the document frequency df i , which is the number of documents in which term i
occurs, and the global frequency gfi , which is the total number of times term i occurs in the whole
collection. N is the number of documents, and M is the number of terms in the collection.
- Normal:  1 / sqrt( Σ_j f_ij² )

  It normalizes the length of each row (term) to 1. This has the effect of giving high weight to
  infrequent terms. However, it depends only on the sum of the squared frequencies and not the
  distribution of those frequencies.

- GfIdf:  gf_i / df_i

- Idf:  log2( N / df_i )
GfIdf and Idf are closely related. Both of them weight terms inversely by the number of different


documents in which they appear, moreover, GfIdf increases the weight of frequently occurring
terms. However, neither method depends on the distribution of terms in documents. They depend
only on the number of different documents in which a term occurs.
- Entropy:  1 + Σ_j ( p_ij log(p_ij) / log(N) ),   where p_ij = f_ij / gf_i

  Entropy is a sophisticated weighting scheme that takes into account the distribution of terms over
  documents. The average uncertainty or entropy of a term is given by ( −Σ_j p_ij log(p_ij) / log(N) ).
  Subtracting this quantity from a constant assigns minimum weight to terms that are equally
  distributed over documents (i.e. where p_ij = 1/N), and maximum weight to terms which are
  concentrated in a few documents.
Furthermore, there are other global weighting schemes, such as:

- Global Entropy:  log( 1 + Σ_{j=1}^{N} p_ij log(p_ij) / log(N) )

- Shannon Entropy:  Σ_{j=1}^{N} p_ij log(p_ij)   [LFL98];

- and Entropy:  Σ_{j=1}^{N} ( 1 + p_ij log(p_ij) ) / log(N) 19

In general, all global weighting schemes give a weaker weight to frequent terms or to those occurring
in a lot of documents.

19 http://lsa.colorado.edu/~quesadaj/pdf/LSATutorial.pdf, Retrieved on 10-28-2007.
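As a hedged sketch of how such a combination can be computed in practice, the following function applies the log(tf + 1) local weight together with the entropy global weight defined above to a raw term-frequency matrix (terms in rows, documents in columns); it assumes more than one document so that log(N) is non-zero:

    # Sketch of the log(tf + 1) local weight combined with the entropy global weight defined above;
    # A holds raw term frequencies (terms in rows, documents in columns), with N > 1 documents assumed.
    import numpy as np

    def log_entropy_weighting(A):
        A = np.asarray(A, dtype=float)
        n_docs = A.shape[1]
        gf = A.sum(axis=1, keepdims=True)                      # global frequency gf_i of each term
        p = np.divide(A, np.where(gf == 0, 1.0, gf))           # p_ij = f_ij / gf_i
        with np.errstate(divide="ignore", invalid="ignore"):
            plogp = np.where(p > 0, p * np.log(p), 0.0)
        G = 1.0 + plogp.sum(axis=1) / np.log(n_docs)           # entropy global weight G(i)
        L = np.log(A + 1.0)                                    # local weight L(i, j)
        return L * G[:, np.newaxis]                            # a_ij = L(i, j) * G(i)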
Two main reasons make the use of normalization necessary:

- Higher Term Frequencies: Long documents usually use the same terms repeatedly. As a result,
  the term frequency factors may be large for long documents, increasing the average contribution of
  their terms towards the query-document similarity.

- More Terms: Generally, vocabulary is richer and more varied in long documents than in shorter
  ones. This enhances the number of matches between a query and a long document, increasing the
  query-document similarity and the chances of long documents being retrieved in preference to shorter
  documents.
The normalization could be either explicit, or implicit as effectuated by the cosine-based measure
(the angular distance between a query q and a document D):

    cos(D, q) = (D · q) / ( ||D|| * ||q|| )

Various normalization techniques are used in document retrieval systems. Following is a review of some
commonly used normalization techniques [SBM96]:
- Cosine Normalization:  ( Σ_{i=1}^{M} w_i² )^{1/2},   where w_i = w_local(i, j) * w_global(i).
Cosine normalization is the most commonly used normalization technique in the vector-space

model. It attacks both normalization reasons in one step: higher individual term frequencies augment
individual weighting values wi , increasing the penalty on the term weights. Also, if a document is
rich, the number of individual weights in the cosine factor (M in the above formula) increases,
yielding a higher normalization factor.
- Maximum tf Normalization: Another popular normalization technique is normalization of
  individual tf weights for a document by the maximum tf in the document. The Smart system's
  augmented tf factor, 0.5 + 0.5 * (tf / max tf), and the tf weights used in the INQUERY system,
  0.4 + 0.6 * (tf / max tf), are examples of such normalization.

By restricting the tf factors to a maximum value of 1.0, this technique only compensates for the first
normalization reason (higher tf s), while it does not make any correction for the second reason
(more terms). Hence, the technique turns out to be a weak form of normalization and favors the
retrieval of long documents.
- Byte Length Normalization: More recently, a length normalization scheme based on the byte
  size of documents has been used in the Okapi system. This normalization factor attacks both
  normalization reasons in one shot.
Other classic weighting schemes are used in the literature such as: the Tfc, Ltc weighting [AaE99],
and the Okapi BM-25 weighting [RWH94, Dar03].
- Tfc:  ( f_ij * idf_i ) / sqrt( Σ_{k=1}^{M} ( f_kj * idf_k )² )

  The tfxidf weighting, even though widely used, does not take into account that documents may be of
  different lengths. The tfc weighting is similar to the tfxidf weighting except for the fact that length
  normalization is used as part of the word weighting formula.


- Ltc:  ( log(f_ij + 1) * idf_i ) / sqrt( Σ_{k=1}^{M} ( log(f_kj + 1) * idf_k )² )
A slightly different approach uses the logarithm of the word frequency instead of the raw word
frequency, thus reducing the effects of large differences in frequencies.
- Okapi BM-25:

      ( 3 * ( log(N) − log(df_i) ) * f_ij ) / ( 2 * ( 0.25 + 0.75 * ( N * dl / Σ_{k=1}^{N} dl_k ) ) + f_ij )

  where dl is the document length.
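A small sketch of this weighting, in the form given above, applied to a raw term-frequency matrix (terms in rows, documents in columns); the document length is measured here in tokens, which is an assumption of the sketch:

    # Sketch of the Okapi BM-25 weighting in the form given above; document length is measured in
    # tokens here (an assumption of this sketch) rather than in bytes.
    import numpy as np

    def okapi_bm25_weighting(A):
        A = np.asarray(A, dtype=float)
        n_docs = A.shape[1]
        df = np.maximum((A > 0).sum(axis=1), 1)                # document frequency df_i
        idf = np.log(n_docs) - np.log(df)                      # log N - log df_i
        dl = A.sum(axis=0)                                     # document lengths dl
        norm = 0.25 + 0.75 * dl / dl.mean()                    # N * dl / sum_k dl_k
        return (3.0 * idf[:, None] * A) / (2.0 * norm[None, :] + A)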

Among the weighting schemes tried with LSA (Section 3.3.2.1), we find that the Okapi BM-25
scheme provides a 7.9% - 27.7% advantage over the term frequency scheme on all the English corpora
used in this chapter.

3.2.3. Computing the SVD


Once the m x n matrix A has been created and properly weighted, a rank-k approximation
( k << min(m, n) ) to A, A_k, is computed using an orthogonal decomposition known as the singular value
decomposition (SVD) [GoV89] to reduce the redundant information of this matrix. The SVD, a
technique closely related to eigenvector decomposition and factor analysis [CuW85], is defined for a
matrix A as the product of three matrices:
    A = U S V^T,

where the columns of U and V are the left and right singular vectors, respectively, corresponding to the
monotonically decreasing (in value) positive diagonal elements of S, which are called the singular
values of the matrix A. As illustrated in Figure 3.1, the first k columns of the U and V matrices and the
first (largest) k singular values of A are used to construct a rank-k approximation to A via A_k = U_k S_k V_k^T.
The columns of U and V are orthogonal, such that U^T U = V^T V = I_r, where r is the rank of the matrix A.
A theorem due to Eckart and Young [GoR71] states that A_k, constructed from the k largest singular
triplets20 of A, is the closest rank-k approximation (in the least squares sense) to A [BeC87].

20 The triple {U_i, σ_i, V_i}, where S = diag(σ_0, σ_1, ..., σ_{k−1}), is called the i-th singular triplet.
   U_i and V_i are the left and right singular vectors, respectively, corresponding to the i-th largest
   singular value, σ_i, of the matrix A.


Figure 3.1. A pictorial representation of the SVD: A_k (m x n) = U_k (m x r) S_k (r x r) V_k^T (r x n), with
the term vectors in U_k and the document vectors in V_k. The shaded areas of U and V, as well as the
diagonal line in S, represent A_k, the reduced representation of the original term-document matrix A.
With regard to LSA, Ak is the closest k-dimensional approximation to the original term-document
space represented by the incidence matrix A. As stated previously, by reducing the dimensionality of A,
much of the noise that causes poor retrieval performance is thought to be eliminated. Thus, although a
high-dimensional representation appears to be required for good retrieval performance, care must be
taken not to reconstruct A. If A is nearly reconstructed, the noise caused by variability of word choice
and terms that span or nearly span the document collection will not be eliminated, resulting in poor
retrieval performance [BeC87]. Generally, the choice of the reduced dimension is empirical; it depends on
the nature of the corpus, on the type of the queries used (whether they are long, or short and represented
by keywords), and on the weighting scheme performed. This is experimentally proved in [ABE08] and in
Section 3.3.2.2.
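A minimal sketch of the truncation itself, assuming the weighted matrix A is dense enough to be handled by a standard SVD routine (large sparse collections would call for a partial or sparse SVD instead):

    # Minimal sketch of the rank-k truncation A_k = U_k S_k V_k^T; a dense SVD is assumed here,
    # whereas large sparse term-document matrices would call for a partial/sparse SVD.
    import numpy as np

    def truncated_svd(A, k):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)    # A = U S V^T, singular values in decreasing order
        return U[:, :k], s[:k], Vt[:k, :]                   # the k largest singular triplets

    # The rank-k reconstruction, if ever needed explicitly:
    # Uk, sk, Vtk = truncated_svd(A, k); Ak = Uk @ np.diag(sk) @ Vtk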
It is worthwhile to point out that, in the context of text retrieval, document vectors could either refer
to the columns in A or the columns in V^T, and term vectors could either refer to the rows in A or the
rows in U. The same nomenclature applies to the dimension-reduced model, only with the subscripts
dropped off. It is important to differentiate between these two kinds of document vectors (or term
vectors) to avoid confusion. Also note that the term vectors and document vectors in A may be referred
to as the initial/original (term or document) vectors, since they have not been subjected to dimension
reduction; while in A_k they may be referred to as the (term or document) concepts, because the term-
document matrix reduction captures semantic structure (i.e. concepts) while it rejects the noise that
results from term usage variations.
In addition to the fact that the left and right singular vectors specify the locations of the terms and
documents respectively, the singular values are often used to scale the term and document vectors,
allowing clusters of terms and documents to be more readily identified. Within the reduced space,
semantically-related terms and documents presumably lie near each other since the SVD attempts to

derive the underlying semantic structure of the term-document space [BeC87].

3.2.4. Query Projection and Matching


In the LSA model, queries are formed into pseudo-documents that specify the location of the query
in the reduced term-document space. Given q, a vector whose non-zero elements contain the weighted
term-frequency counts of the terms that appear in the query (using the same weighting schemes applied
to the document collection being searched), the pseudo-document q' can be represented according to
three standard philosophies described in [Yan08].
These three different philosophies on how to conduct the query q in the dimension-reduced model
give rise to the following versions of the query method:

(I) Version A

Underlying Philosophy: Column vectors (V^T(:,1), ..., V^T(:,n)) in matrix V^T are k-dimensional
document vectors, their dimension having been reduced from m. The dimensionally-reduced V^T(:,i)
should carry some kind of latent semantic information captured from the original model and may
be used for querying purposes. However, since V^T(:,i) is k-dimensional while q is m-dimensional,
q needs to be translated into some proper form in order to compare it with V^T(:,i).
Observing that A = (A(:,1), ..., A(:,n)) and V^T = (V^T(:,1), ..., V^T(:,n)), the equation
A = U S V^T leads to (A(:,1), ..., A(:,n)) = U S (V^T(:,1), ..., V^T(:,n)). Thus, for any individual column
vector in A, A(:,i) = U S V^T(:,i) for (1 <= i <= n), which implies that V^T(:,i) = S^{-1} U^T A(:,i) for
(1 <= i <= n). Treating q like a normal document vector A(:,i), q will be transformed to
q'_a = S^{-1} U^T q and will have the same dimension as V^T(:,i).

Query Method: First, use the formula q'_a = S^{-1} U^T q to translate the original query q into a form
comparable with any column vector V^T(:,i) in matrix V^T. Then compute the cosine between q'_a
and each V^T(:,i) for (1 <= i <= n).
(II) Version B

Underlying Philosophy: As mentioned earlier, document vectors can mean two different things:
either column vectors (V^T(:,1), ..., V^T(:,n)) in V^T or column vectors (A(:,1), ..., A(:,n)) in A. In
fact, the latter ones might be a better choice for serving as document vectors because they are
rescaled from the dimensionally reduced U and V by a factor of S after the SVD process. To
utilize (A(:,1), ..., A(:,n)) for querying purposes, only one further step on the basis of version A
needs to be taken, which is to scale the k-dimensional q'_a back to m dimensions:

    q'_b = U S q'_a = U S (S^{-1} U^T q) = U U^T q 21

Query Method: First, use the formula q'_b = U U^T q to translate the original query q into a folded-in-
plus-rescaled form comparable with any column vector A(:,i) in matrix A. Then compute the
cosine between q'_b and each A(:,i) for (1 <= i <= n).
(III) Version B'

Underlying Philosophy: All the reasoning behind version B sounds good except for one thing:
since m-dimensional column vectors (A(:,1), ..., A(:,n)) will be used as document vectors, it is not
needed to fold in q and then rescale it back to m dimensions: just the original query q could be
used (which is already m-dimensional) for comparing with each m-dimensional A(:,i) for
(1 <= i <= n).

Query Method: Compute the cosine between q and each A(:,i) for (1 <= i <= n).
The above three different versions of query method are summarized in Table 3.1, along with the
conventional technique of lexical matching.

                    Lexical Matching     Version A            Version B                Version B'

Document            m-dim column         k-dim column         m-dim column             m-dim column
Vectors             vectors in A         vectors in V^T       vectors in A             vectors in A

Query               m-dim original       k-dim folded-in      m-dim folded-in-plus-    m-dim original
Vector              query vector q       query vector         rescaled vector          query vector q
                                         S^{-1} U^T q         U U^T q

Applicable          Many                 [BeF96] [FiB02]      [BDO95] [DDF89]          [BCB92] [BDJ99]
Literature                               [Jia97] [Jia98]      [DDF90]                  [Din99] [Din01]
                                         [LeB97] [Let96]                               [HSD00] [KoO96]
                                         [Wit97]                                       [KoO98] [Zha98]
                                                                                       [ZhG02]

Table 3.1. Comparison between Different Versions of the Standard Query Method.
Based on the analysis of the three standard versions, it was proved that version B and version B' are
essentially equivalent. On the other hand, the task of seeking the best version of the standard query
method has revealed a marked advantage of version B compared to version A [Yan08]. However, the
latter is still considered valuable due to its important role in the conservation of space.

21 It should be pointed out that, because of the dimensional reduction, U^T U = I while U U^T ≠ I.
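The following sketch illustrates the version A query method, which is the one used in the experiments of Section 3.3: the query is folded into the reduced space with q'_a = S_k^{-1} U_k^T q and compared by cosine against the k-dimensional document vectors; the variable names are assumptions of the sketch:

    # Sketch of the version A query method: q'_a = S_k^{-1} U_k^T q, ranked by cosine against the
    # k-dimensional document vectors (columns of V_k^T). Variable names are illustrative.
    import numpy as np

    def rank_documents_version_a(Uk, sk, Vtk, q):
        q_a = (Uk.T @ q) / sk                                  # fold the m-dim query into k dimensions
        doc_norms = np.linalg.norm(Vtk, axis=0)                # norms of the columns of V_k^T
        sims = (Vtk.T @ q_a) / (doc_norms * np.linalg.norm(q_a) + 1e-12)
        return np.argsort(-sims)                               # document indices by decreasing cosine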

3. 3. Applications and Results


As a starting point for any application, we create a vector-space model for our data [SaM83].
Documents are typically represented by a term-frequency vector whose dimension equals the
number of unique words in the corpus, and each of its components indicates how many times a
particular word occurs in the document. To further improve the effectiveness of our systems applied to
the English language, we use the TreeTagger part-of-speech tagger [Sch94], and we remove stopwords22.
The tagging process was done without training and the results of the tagging are used as-is. In that
respect, the results we obtain from subsequent modules could only be better if the output of the tagger
was corrected and the software trained.

3.3.1. Data
The English test data used in our experiments, in this chapter and in some of the following
chapters, are formed by mixing documents from multiple topics arbitrarily selected from standard
information science test collections. The text objects in these collections are bibliographic citations
(consisting of the full text of document titles and abstracts), or the full text of short articles. Table 3.2
gives a brief description and summarizes the sizes of the datasets used.

Cisi: document abstracts in library science and related areas published between 1969 and 1977 and
extracted from Social Science Citation Index by the Institute for Scientific Information.

Cran: document abstracts in aeronautics and related areas, originally used for tests at the Cranfield
Institute of Technology in Bedford, England.

Med: document abstracts in biomedicine received from the National Library of Medicine.
Reuters-21578: short articles belonging to the Reuters-21578 collection23. This collection consists
of news stories that appeared on the Reuters newswire in 1987 and mostly concern business and the
economy. It contains multiple overlapping categories.

Collection name     Cisi    Cran    Med     Reuters-21578
Document number     1460    1400    1033    21578

Table 3.2. Size of collections.


For the document retrieval task, we have picked 30 queries from each of the first three collections,
where 15 are used in the training phase to identify the best reduced dimension of the LSA model and the
other 15 are used in the test phase.

22 ftp://ftp.cs.cornell.edu/pub/smart/english.stop, Retrieved on 10-28-2007.
23 http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html, Retrieved on 10-28-2007.

3.3.2. Experiments
In these experiments, we are interested in evaluating the effectiveness of the weighting schemes,
after which we compare the performances of the latent semantic analysis and the vector-space models.

3.3.2.1. Weighting Schemes Impact


As an extension to our previous work [ABE05], we study the effectiveness of 22 weighting schemes,
formed as combinations of the global and local weighting schemes defined in Section 3.2.2, in addition to
the TFC, LTC, and Okapi BM-25 schemes.
To conform to the notations in the literature, we denote a weighting scheme by a three-letter code in which
the first letter corresponds to the local factor, the second letter to the global factor, and the third letter to
the normalization component. For example, using the weighting scheme nnn leaves the term frequency
vector unchanged, whereas the weighting schemes ntn and ntc produce respectively the well-known
tfxidf and Tfc weights. To indicate the inverse of a weighting scheme, we use a bar over the letter. All the
notations are listed in Appendix B.

[Table 3.3 summary: the MIAP of the 25 weighting schemes on the Cisi corpus, listed in increasing order;
the values range from 0.14 to 0.32, with the Okapi BM-25 scheme giving the best result (0.32).]

Table 3.3. Result of weighting schemes in increasing order for Cisi corpus.


[Table 3.4 summary: the MIAP of the 25 weighting schemes on the Cran corpus, listed in increasing order;
the values range from 0.10 to 0.47, with the Okapi BM-25 scheme giving the best result (0.47).]

Table 3.4. Result of weighting schemes in increasing order for Cran corpus.

[Table 3.5 summary: the MIAP of the 25 weighting schemes on the Med corpus, listed in increasing order;
the values range from 0.09 to 0.26, with the Okapi BM-25 scheme giving the best result (0.26).]

Table 3.5. Result of weighting schemes in increasing order for Med corpus.



[Table 3.6 summary: the MIAP of the 25 weighting schemes on the Cisi-Med corpus, listed in increasing
order; the values range from 0.14 to 0.41, with the Okapi BM-25 scheme giving the best result (0.41).]

Table 3.6. Result of weighting schemes in increasing order for Cisi-Med corpus.
The preceding experiments show that the choice of a weighting scheme is very important, because
some schemes destroy the mean interpolated average precision (MIAP) (see Section C.2.3). As we can
see, for example, in Table 3.5, the term frequency indexation (nnn) gives a better result than each of the
first ten schemes of that table.
The experiments also show that the Okapi BM-25 weighting scheme gives the best results over all the
other schemes in all the examples. Moreover, we note that the rank of the other weighting schemes varies
markedly from one corpus to another. For example, when evaluating the performance of the system, the
well-known and widely used TfxIdf scheme (ntn) is ranked between 18th and 21st.

3.3.2.2. Reduced Dimension k


We would like to highlight that the best reduced dimension k is known to be an empirical value that varies,
typically, between 100 and 300 [Dum94], depending on the characteristics of each corpus. However, we
can tell that this dependency is not limited to the corpus size and sparsity characteristics; it is also related
to the choice of the weighting scheme, as experimentally proved in Table 3.7 and in [ABE08].
Furthermore, from this table and the results of [ABE08], we can conclude that the Okapi BM-25
weighting scheme has another advantage besides outperforming all the studied schemes in retrieval.
This advantage is that the Okapi BM-25 weighting scheme has the smallest best reduced dimension k
compared to the other schemes in all experiments, even in the case of the Arabic language [ABE08].


[Table 3.7 summary: the best reduced dimension k for each of the 25 weighting schemes on the Cisi, Cran,
Med and Cisi-Med corpora; the Okapi BM-25 scheme yields the smallest best reduced dimension on every
corpus.]

Table 3.7. The best reduced dimension for each weighting scheme in the case of four corpuses.

3.3.2.3. Latent Semantic Model Effectiveness


In this section, given that we have used version A to model the data (see Section 3.2.4) in our
experiments, we are interested in evaluating the performance improvement offered by the LSA model
over the VSM.
After identifying the best weighting scheme and the most effective reduced dimension k for each
data set in the training phase, we compare in Figure 3.2 the test-phase results of the LSA model to
the VSM results.
The interpolated recall-precision curves of the four experiments strengthen what was known about
the LSA model. By computing the LSA and VSM MIAPs, we note that LSA provides a 2% - 10%
advantage over the VSM, even when using just version A for modeling the data.

[Figure 3.2: four panels, one for each of the Cisi, Cran, Med and Cisi-Med collections.]
Figure 3.2. The interpolated recall-precision curves of the LSA and the VSM models.

3. 4. Summary
In this chapter, we have first recalled the traditional technique of document retrieval, which is the
Vector-Space Model (VSM). Then, we have described one of its extended models, Latent Semantic
Analysis (LSA), which shows a substantial advancement over the traditional VSM, even when only
version A is used to model the data and queries. Also, we have presented three combination
methods, found in the current weighting literature, for the local and global weighting functions and the
length normalization. Through some experiments, we have juxtaposed the application of twenty-five
weighting schemes, and the comparison has shown advantages in favour of the Okapi BM-25 weighting
scheme. The first advantage of this scheme is its high performance improvement of the information
retrieval system, while the second is that it yields the smallest best reduced dimension k for the LSA
model when this scheme is used.


Chapter 4 Document Clustering based on Diffusion Map

4. 1. Introduction
A great challenge of text mining arises from the increasingly large text datasets and the high
dimensionality associated with natural language. In this chapter, a systematic study is conducted, for the
first time, in the context of the document clustering problem, using the recently introduced diffusion
framework and some characteristics of the singular value decomposition.
This study has two major folds: classical clustering and on-line clustering. In the first fold, we propose
to construct a diffusion kernel based on the cosine distance, we discuss the problem of choosing the
reduced dimension, and we compare the performance of the k-means algorithm in four different vector
spaces: Salton's vector space, the latent semantic analysis space, the diffusion space based on the cosine
distance, and another based on the Euclidean distance. We also propose two postulates indicating the
optimal dimension to use for clustering as well as the optimal number of clusters to use in that
dimension.
In the second fold, we introduce single-pass clustering, one of the most popular methods used
in online applications such as peer-to-peer information retrieval (P2P) [KWX01, KlJ04, KJR06] and topic
detection and tracking (TDT) [HGM00, MAS03]. We present a new version of the classical single-pass
clustering algorithm, called On-line Single-Pass clustering based on Diffusion Map (OSPDM).

4. 2. Construction of the Diffusion Map


Related to spectral graph cuts [Wei99, ShM00, MeS01] and eigenmaps [RoS00, ZhZ02, BeN03,
DoG03] methodologies, the diffusion map first appeared in [CLL05]. In this section, we briefly describe its
construction in the case of finite data.

4.2.1. Diffusion Space


Given a corpus, D, of documents, construct a weight function k(d_i, d_j), for d_i, d_j ∈ D and
1 <= i, j <= N, with N = |D|. k(d_i, d_j) is also known as the kernel and satisfies the following properties:

- k is symmetric: k(d_i, d_j) = k(d_j, d_i)

- k is positivity preserving: for all d_i and d_j in the corpus D, k(d_i, d_j) >= 0

- k is positive semi-definite: for any choice of real numbers α_1, ..., α_N, we have

      Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j k(d_i, d_j) >= 0.

This kernel represents some notion of affinity or similarity between the documents of D, as it
describes the relationship between pairs of documents in the corpus. In this sense, one can think of the
documents as being the nodes of a symmetric graph whose weight function is specified by k . The kernel
measures the local connectivity of the documents, and hence captures the local geometry of the
corpus, D . The idea behind the diffusion map is to construct the global geometry of the data set from the
local information contained in the kernel k . The construction of the diffusion map involves the
following steps. First, assuming that the transition probability m1 , in one time step, between documents
d i and d j is proportional to k (d i , d j ) we construct an N N
M (i, j ) = m1 ( d i , d j ) =

k (d i , d j )
p(d i )

where

Markov matrix by defining

is the required normalization constant, given by

p (d i ) = k (d i , d j ).
j

The Markov matrix M reflects the first-order neighborhood structure of the graph. However to
capture information on larger neighborhoods, powers of the Markov matrix M are taken, inducing a
forward running in time of the random walk and constructing a Markov Chain. Thus considering M t the
tth power of M, the entry mt (d i , d j ) represents the probability of going from document d i to d j in t time
steps.
Increasing t, corresponds to propagating the local influence of each node with its neighbors. In other
words, the quantity M t reflects the intrinsic geometry of the data set defined via the connectivity of the
graph in a diffusion process and the time t of the diffusion plays the role of a scale parameter in the
analysis. When the graph is connected, we have that [Chu97]:
lim mt (d i , d j ) = 0 (d j ) , where 0 is the unique stationary distribution 0 (d i ) =

t +

p(d i )
.
l p(d l )

Using a dimensionality reduction function (the SVD in our approach), the Markov matrix M will
have a sequence of r (where r is the matrix rank) eigenvalues in non-increasing order
λ_0 >= λ_1 >= ... >= λ_l >= ... >= λ_{r-1}, with corresponding right eigenvectors ψ_l.
The stochastic matrix M^t naturally induces a distance between any two documents. Thus, we define the
diffusion distance as

    D_Diff^t(d_i, d_j)² = Σ_l λ_l^{2t} ( ψ_l(d_i) − ψ_l(d_j) )²

and the diffusion map as the mapping from the vector d, representing a document, to the vector

    Ψ_t(d) = ( λ_0^t ψ_0(d), λ_1^t ψ_1(d), ..., λ_{n−1}^t ψ_{n−1}(d) )^T,

for a value n. By retaining only the first n eigenvectors, we embed the corpus D in an n-dimensional
Euclidean diffusion space, where { ψ_0, ψ_1, ..., ψ_{n−1} } are the coordinate axes of the documents in this
space. Note that typically n << N, and hence we obtain a dimensionality reduction of the original corpus.
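A simplified sketch of this construction, assuming a precomputed symmetric kernel matrix K over the N documents; it uses the left singular vectors of M in place of the right eigenvectors ψ_l, which is a simplification of the formulation above:

    # Simplified sketch of the construction above: kernel K -> Markov matrix M -> diffusion coordinates.
    # Using the singular vectors of M instead of its right eigenvectors is a simplification of this sketch.
    import numpy as np

    def diffusion_coordinates(K, n_dims=3, t=1):
        p = K.sum(axis=1)                                  # p(d_i) = sum_j k(d_i, d_j)
        M = K / p[:, None]                                 # Markov matrix, each row sums to 1
        U, s, Vt = np.linalg.svd(M)                        # spectral factors in decreasing order of s
        return (s[:n_dims] ** t) * U[:, :n_dims]           # N x n_dims coordinates, scaled by lambda^t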

4.2.2. Diffusion Kernels


Following what is described in the preceding subsection, several choices for the kernel k are possible,
all leading to different analyses of the data. Inspired by the work of [LaL06] on word clustering, we
decided first to use the Gaussian kernel (a kernel based on the Euclidean distance) for document
clustering, which is defined as

    k(d_i, d_j) = exp( −||d_i − d_j||² / σ ),

where the parameter σ specifies the size of the neighborhoods defining the local geometry of the data.
The smaller the parameter σ, the faster the exponential decreases, and hence the weight function k becomes
numerically insignificant more quickly as we move away from the center.
However, as the experiments in Section 4.4.1 show, there are strong indications that this kernel is not the
right choice for document clustering. For this reason, in addition to the fact that the cosine distance
has emerged as an effective distance for measuring document similarity [Sal71, SGM00], we propose to
use a kernel based on what is known as the cosine distance:

    D_Cos(d_i, d_j) = 1 − (d_i · d_j) / ( ||d_i|| ||d_j|| ).

We define this kernel as

    k(d_i, d_j) = exp( −D_Cos(d_i, d_j) / σ ).

However, in the case where the vectors d_i and d_j are normalized, due to the fact that the two kernels
are related, as shown by the equation D_Cos(d_i, d_j) = 1 − d_i^T d_j = (1/2) ||d_i − d_j||², the distinction
between these two kernels could be ignored.
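A small sketch of the cosine diffusion kernel just described; X holds one document vector per row, and sigma stands for the scale parameter (the symbol and its default value are assumptions of the sketch):

    # Sketch of the cosine-distance kernel described above; X holds one document vector per row,
    # and sigma plays the role of the neighborhood scale parameter.
    import numpy as np

    def cosine_kernel(X, sigma=1.0):
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        Xn = X / np.where(norms == 0, 1.0, norms)          # normalize the document vectors
        d_cos = 1.0 - Xn @ Xn.T                            # D_Cos(d_i, d_j) = 1 - cosine similarity
        return np.exp(-d_cos / sigma)                      # k(d_i, d_j) = exp(-D_Cos / sigma)

    # The resulting matrix can be passed directly to the diffusion coordinates sketch of Section 4.2.1.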

4.2.3. Dimensionality Reduction


Reduction of the data dimensionality, thereby reducing the complexity of data representation and

speeding up similarity computation times, may lead to significant savings of computer resources and
processing time. However the selection of fewer dimensions may cause a significant loss of the
document local neighborhood information.
Different methods for reducing the dimensionality of the diffusion space have been investigated, such
as the graph Laplacian [CLL05, LaL06], the Laplace-Beltrami operator [CLL05, CoL06a, CoL06b], the
Fokker-Planck operator [CLL05, CoL06a], and the singular value decomposition [VHL05]. However, in this
work, we have chosen to embed a low-dimensional representation of the approximate diffusion map
using the singular value decomposition for classical clustering, and the SVD-updating method [Obr94,
BDO95] for the on-line clustering, taking advantage of the results concerning these methods in the field
of document clustering [DhM01, Ler99].

4.2.3.1. Singular Value Decomposition


Singular value decomposition is used to rewrite an arbitrary rectangular matrix, such as a Markov
matrix, as a product of three other matrices: M = USV T , where U is a matrix of left singular vectors, S
is a diagonal matrix of singular values, and V is a matrix of right singular vectors (for more details see
Section 3.2.3). As the Markov matrix is symmetric, both left and right singular vectors provide a
mapping from the document space to a newly generated abstract vector space. The elements
(σ_0, σ_1, ..., σ_{r−1}) of the diagonal matrix, the singular values, appear in a magnitude decreasing order. One
of the most important theorems of the SVD, the Eckart and Young theorem [GoR71], states that a matrix
formed from the first k singular triplets {U_i, σ_i, V_i} of the SVD (left vector, singular value, right vector

combination) is the best approximation to the original matrix that uses k degrees of freedom. The
technique of approximating a data set with another one having fewer degrees of freedom works well,
because the leading singular triplets capture the strongest, most meaningful, regularities of the data. The
latter triplets represent less important, possibly spurious, patterns. Ignoring them actually improves
analysis, though there is the danger that by keeping too few degrees of freedom, or dimensions of the
abstract vector space, some of the important patterns will be lost [LFL98].
In [DhM99, DhM01], Dhillon and Modha compared the closeness between the subspaces spanned
by the spherical k-means concept vectors and the singular vectors by using principal angles [BjG73,
GoV89, Arg03] (for more details, see Appendix D). Seeing that the concept vectors constitute an
approximation matrix comparable in quality to the SVD, they were interested in comparing a sequence
of singular subspaces to a sequence of concept subspaces, but since it is hard to directly compare the
sequences, they compared a fixed concept subspace to various singular subspaces, and vice-versa.


By focusing on the average cosine of the principal angles between the concept subspace of 64
dimensions and various singular subspaces, plotted in the two following figures for two different data
sets, we note that the average cosine comes closest to 1 when k, the number of singular vectors
constituting the singular subspace, is very small, appearing in the figures to be approximately equal to 6.

Figure 4.1. Average cosine of the principal angles between 64 concept subspace and various singular
subspaces for the CLASSIC data set.

Figure 4.2. Average cosine of the principal angles between 64 concept subspace and various singular
subspaces for the NSF data set.
This fact means that the concept subspace is completely contained in the singular subspace constituted
of the first six singular vectors. Thus the minimum number k of independent variables, required to
describe the approximate behavior of the underlying system in the truncated SVD matrix M_{n−1}, where
M_{n−1} = U_{n−1} S_{n−1} V_{n−1}^T, is reduced by a factor of 10 compared to the one needed and used for information

retrieval (considered to be between 100-300).


On the other hand, Lerman in [Ler99] presented a procedure, in clustering context, for determining
the appropriate number of dimensions for the subspace. This procedure could be considered as a visual
inspection of the thresholding method used in [VHL05, CoL06a, LaL06] and proposed by Weiss
[Wei99], for an affinity matrix, such as the Markov matrix, because in this method, the number k of the
eigenvectors used for parametrizing the data is equal to the number of eigenvalues that have a magnitude
greater than a given threshold f_0; while in Lerman's procedure, based on the plot of the singular values
in decreasing order and the break or discontinuity in the slope, she shows that the number of degrees of
freedom k is equal to the number of points on the left side of the discontinuity.
After reducing the dimension, documents are represented as k-dimensional vectors in the diffusion
space, and could be clustered by using a standard clustering algorithm, such as k-means and single pass.

4.2.3.2. SVD-Updating
Suppose an m n matrix A has been generated from a set of data in a specific space, and its SVD,
denoted by SVD(A) and defined as:
    A = U S V^T                                                    (1)

has been computed. If more data (represented by rows or columns) must be added, three alternatives for
incorporating them currently exist: recomputing the SVD of the updated matrix, folding-in the new rows
and columns, or using the SVD-updating method developed in [Obr94].
Recomputing the SVD of a larger matrix requires more computation time and, for large problems,
may be impossible due to memory constraints. Recomputing the SVD allows the new p rows and q
columns to directly affect the structure of the resultant matrix by creating a new matrix A_{(m+p) x (n+q)},
computing the SVD of the new matrix, and generating a different rank-k approximation matrix A_k,
where

    A_k = U_k S_k V_k^T,  and  k << min(m, n).                     (2)

In contrast, folding-in, which is essentially the process described in Section 3.2.4 for query
representation versions A and B, is based on the existing structure, the current A_k, and hence new rows
and columns have no effect on the representation of the pre-existing rows and columns. Folding-in
requires less time and memory but, following the study undertaken in [BDO95], has deteriorating effects

on the representation of the new rows and columns. On the other hand, as discussed in [Obr94, BDO95],
the accuracy of the SVD-updating approach can be easily compared to that obtained when the SVD of

A( m+ p )( n+ q ) is explicitly computed.
The process of SVD-updating requires two steps, which involve adding new columns and new rows.

a- Overview
Let D denote the p new columns to process; then D is an m x p matrix. D is appended to the
columns of the rank-k approximation of the m x n matrix A, i.e., from Equation (2), A_k, so that the
k largest singular values and corresponding singular vectors of

    B = ( A_k  D )                                                 (3)

are computed. This is almost the same process as recomputing the SVD, only A is replaced by A_k. Let T
denote a collection of q rows for SVD-updating. Then T is a q x n matrix. T is then appended to the
rows of A_k so that the k largest singular values and corresponding singular vectors of

    C = [ A_k ]
        [  T  ]                                                    (4)

are computed.

b- SVD-Updating Procedures
In this section, we detail the mathematical computations required in each phase of the SVD-updating
process. SVD-updating incorporates new row or column information into an existing structured model
( Ak from Equation (2)) using the matrices D and T discussed in Section 4.2.3.1. SVD-updating exploits
the previous singular values and singular vectors of the original matrix A as an alternative to
recomputing the SVD of A( m+ p )( n+ q ) .

Updating Column. Let B = ( A_k  D ) from Equation (3) and define SVD(B) = U_B S_B V_B^T. Then

    U_k^T B [ V_k  0  ]  =  ( S_k   U_k^T D ),     since A_k = U_k S_k V_k^T.
            [  0  I_p ]

If F = ( S_k  U_k^T D ) and SVD(F) = U_F S_F V_F^T, then it follows that

    U_B = U_k U_F,      V_B = [ V_k  0  ] V_F,      and  S_B = S_F.
                              [  0  I_p ]

Hence U_B and V_B are m x k and (n + p) x (k + p) matrices, respectively.


Updating Row. Let C = [ A_k ; T ] from Equation (4) and define SVD(C) = U_C S_C V_C^T. Then

    [ U_k^T  0  ] C V_k  =  [ S_k   ]
    [   0   I_q ]           [ T V_k ]

If H = [ S_k ; T V_k ] and SVD(H) = U_H S_H V_H^T, then it follows that

    U_C = [ U_k  0  ] U_H,      V_C = V_k V_H,      and  S_C = S_H.
          [  0  I_q ]

Hence U_C and V_C are (m + q) x (k + q) and n x k matrices, respectively.
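A sketch of the column-updating step above, assuming the current rank-k factors U_k, s_k, V_k and the p new columns D are given; only the leading k singular triplets of F are retained here, so the returned V_B has k columns rather than the full k + p of the derivation:

    # Sketch of the column-updating step: append p new columns D to the rank-k model (U_k, s_k, V_k)
    # without recomputing the SVD of the whole matrix. Only the leading k triplets of F are kept here.
    import numpy as np

    def svd_update_columns(Uk, sk, Vk, D):
        k, p, n = sk.shape[0], D.shape[1], Vk.shape[0]
        F = np.hstack([np.diag(sk), Uk.T @ D])             # F = ( S_k  U_k^T D ),  k x (k + p)
        UF, sF, VFt = np.linalg.svd(F, full_matrices=False)
        UB = Uk @ UF                                        # updated left singular vectors, m x k
        Vblock = np.zeros((n + p, k + p))                   # block-diagonal [ V_k 0 ; 0 I_p ]
        Vblock[:n, :k] = Vk
        Vblock[n:, k:] = np.eye(p)
        VB = Vblock @ VFt.T                                 # updated right singular vectors, (n + p) x k
        return UB, sF, VB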

4. 3. Clustering Algorithms
In this section, we present the clustering algorithms that we use in this chapter, which are the k-means
algorithm, the single-pass algorithm and, finally, the on-line single-pass clustering based on
diffusion map (OSPDM) algorithm.

4.3.1. k-means Algorithm


k-means [Mac67] is one of the simplest unsupervised learning algorithms that solve the clustering
problem. The procedure follows a simple and easy way to classify a given data set through a certain
number of clusters (assume k clusters) fixed a priori. The main idea is to find the centers of natural
clusters in the data by minimizing the total intra-cluster variance, or, the squared error function
    Σ_{j=1}^{k} Σ_{i} || x_i^{(j)} − c_j ||²,

where || x_i^{(j)} − c_j || is a chosen distance measure between a data point x_i^{(j)} and the
centroid c_j, which is the mean point of all the points x_i^{(j)} of the cluster j.
The algorithm starts by partitioning the input points into k initial sets, either at random or using some
heuristic data. It then calculates the centroid of each set. It constructs a new partition by associating each
point to the nearest centroid. Then the centroids are recalculated for the new clusters, and the algorithm
is repeated by alternate application of these two steps until convergence, which is reached when the points
no longer switch clusters (or, alternatively, the centroids no longer change).
k-means algorithm is composed of the following steps:
1- Place k points into the space represented by the objects that are being clustered. These points
represent initial group centroids.
2- Assign each object to the group that has the closest centroid.
3- Recalculate the positions of the k centroids, when all objects have been assigned.
4- Repeat Steps 2 and 3 until the centroids no longer move.
This produces a separation of the objects into groups from which the metric to be minimized can be
calculated.
Although it can be proved that the procedure will always terminate, the k-means algorithm does not necessarily find the optimal configuration, corresponding to the global minimum of the objective function. The algorithm is also significantly sensitive to the initial randomly selected cluster centers, and the quality of its final solution may, in practice, be much poorer than the global optimum. But, since the algorithm is considered fast, a common method is to run the k-means algorithm several times and return the best clustering found.
Another main drawback of the k-means algorithm is that it has to be told the number of clusters (i.e. k) to find. If the data is not naturally clustered, some strange results may be obtained.
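As an illustration of the four steps above, here is a minimal, self-contained Python/numpy sketch of the plain Euclidean k-means loop (our own illustrative code, not the Matlab implementation used for the experiments):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means on the rows of X; returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # step 1: initial centroids
    for _ in range(n_iter):
        # Step 2: assign each point to the group with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate the positions of the k centroids.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: repeat until the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: two well-separated Gaussian blobs are recovered as two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
labels, centroids = kmeans(X, k=2)
```

Because the result depends on the random initial centroids, in practice the loop is run several times and the partition with the smallest squared error is kept, as noted above.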

4.3.2. Single-Pass Clustering Algorithm


Incremental clustering algorithms are always preferred to traditional clustering techniques, since
they can be applied in a dynamic environment such as the web [WoF00, ZaE98]. Indeed, in addition to
the traditional clustering objective, the incremental clustering has the ability to process new data as they
are added to the data collection [JaD88]. This fact allows dynamic tracking of the ever-increasing large
scale information being put on the web everyday without having to perform complete re-clustering.
Thus, various approaches, including a single-pass clustering algorithm, have been proposed [HaK03].

Algorithm
Single-pass clustering, as the name suggests, requires a single, sequential pass over the set of
documents it attempts to cluster. The algorithm classifies the next document in the sequence according
to a condition on the similarity function employed. At every stage, based on the comparison of a certain
threshold and the similarity between a document and a defined cluster, the algorithm decides on whether
a newly seen document should become a member of an already defined cluster or the center of a new
one. Usually, the description of a cluster is the centroid (average vectors of the document representations
included in the cluster in question), and a document representation consists of a term-frequency vector.
Basically, the single-pass algorithm operates as follows:

For each document d in the sequence loop
1- find a cluster C that minimizes the distance D(C, d);
2- if D(C, d) < t then include d in C;
3- else create a new cluster whose only document is d;
End loop.
where t is the similarity threshold value, which is often derived experimentally.
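A minimal sketch of this loop, using centroids as cluster descriptions and the cosine distance between term-frequency vectors (the function and variable names are ours, for illustration only), could look as follows:

```python
import numpy as np

def single_pass(documents, t):
    """Single-pass clustering of term-frequency vectors with threshold t.

    documents: iterable of 1-D numpy arrays; each cluster is described by the
    centroid (average vector) of the documents it contains."""
    def cosine_distance(a, b):
        return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    clusters = []                                   # list of {'centroid', 'members'}
    for d in documents:
        d = d.astype(float)
        if clusters:
            dists = [cosine_distance(c['centroid'], d) for c in clusters]
            best = int(np.argmin(dists))            # 1- find the closest cluster
            if dists[best] < t:                     # 2- close enough: include d in it
                c = clusters[best]
                c['members'].append(d)
                c['centroid'] = np.mean(c['members'], axis=0)
                continue
        clusters.append({'centroid': d, 'members': [d]})   # 3- otherwise open a new cluster
    return clusters
```

The threshold t plays the same role as above and is typically tuned experimentally on held-out data.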

4.3.3. The OSPDM Algorithm


In our approach, we are interested in taking advantage of the semantic structure and the documents
dependencies created due to the diffusion map, in addition to the resulting reduced dimensionality by
using the SVD-updating, which leads to significant savings of computer resources and processing time.
More specifically, we take into consideration the studies in [DhM01, Ler99], from which we have established that the best reduced dimension related to the SVD method for document clustering is restricted to the first tens of dimensions (for more details see Section 4.2.3.1).
Hence, our approach in developing the OSPDM algorithm is summarized as follows:
Given a collection D of n documents, a new document d that should be added to the existing
collection D, and a set C of m clusters.
1- Generate the term-document matrix A from the set D.
2- Compute the Markov matrix M for the n documents.
3- Generate $SVD(M) = U_M S_M V_M^T$.
4- Choose the best reduced dimension k for the clustering task, $M_k = U_k S_k V_k^T$.
5- Update the term-document matrix A by adding the column representing the document d and the
needed rows, if the new document contains some new terms.
6- Update the Markov matrix M (as M is symmetric, one can update just rows RM ).

7- Apply SVD-updating for $T = \begin{pmatrix} M_k \\ R_M \end{pmatrix}$:
a. Put $H = \begin{pmatrix} S_k \\ R_M V_k \end{pmatrix}$, and generate $SVD(H) = U_H S_H V_H^T$.
b. Compute $U_T = \begin{pmatrix} U_k & 0 \\ 0 & 1 \end{pmatrix} U_H$.
c. Compute $V_T = V_k V_H$, and $S_T = S_H$ (for the next iteration).


8- Update the centroids of the m clusters, by using the reduced dimension k of the matrix U T .
9- Apply a step of the single-pass clustering:
a. Find a cluster Ci that minimizes the distance D(Ci, UT(n+1, 1:k)).
b. If D(Ci, UT(n+1, 1:k)) < t, then include d in Ci, with t a specified threshold, and set n = n + 1.
c. Else create a new cluster Cm+1 represented by UT(n+1, 1:k), and set m = m + 1.
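The following fragment sketches, in Python/numpy, the off-line part that the above steps rely on (steps 1-4) and the single-pass decision of step 9. It assumes that the Markov matrix is obtained by row-normalising the document-document cosine-similarity matrix; all names are illustrative rather than the thesis code, and the SVD-updating of step 7 can reuse a routine such as the column-update sketch given in the SVD-updating section, applied to rows.

```python
import numpy as np

def cosine_markov(A):
    """Steps 1-2: Markov matrix from the term-document matrix A (terms x documents),
    built here by row-normalising the cosine-similarity matrix of the documents."""
    cols = A / (np.linalg.norm(A, axis=0, keepdims=True) + 1e-12)
    K = cols.T @ cols                        # document-document cosine kernel
    return K / K.sum(axis=1, keepdims=True)  # each row sums to one

def diffusion_coordinates(M, k):
    """Steps 3-4: rank-k SVD of the Markov matrix; row i of U_k gives the reduced
    coordinates of document i."""
    U, s, VT = np.linalg.svd(M)
    return U[:, :k], s[:k], VT[:k].T

def single_pass_step(x, centroids, t):
    """Step 9: decide whether the embedded new document x joins an existing cluster
    (returns its index) or opens a new one (returns -1)."""
    dists = np.linalg.norm(centroids - x, axis=1)
    j = int(dists.argmin())
    return j if dists[j] < t else -1
```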

4. 4. Experiments and Results


The testing data used, for evaluating the effective power of our algorithms, are formed by mixing
documents from multiple topics arbitrarily selected from our evaluation database, presented in Section
3.3.1. At each run of the test, documents from a selected number, k, of topics are mixed, and the mixed
document set along with the cluster number, k, are provided to the clustering process.

4.4.1. Classical Clustering


We have applied the diffusion process to several examples where we evaluate the results of the k-means algorithm applied in four different vector spaces: the Salton space; the LSA space, where SVD is applied to the term-document matrix; the diffusion space based on the Euclidean distance; and the diffusion space based on the cosine distance.
(Acc) and mutual information (MI), defined in Section C.3. , resulting from thirty k-means runs.

Example 1. (Cisi and Med) In this example, the data set contains all documents of the collections
Cisi and Med. Figure 4.3 shows the two collections in the diffusion space at power t = 1, (a, c, e) for the
cosine kernel, and (b, d, f) for the Gaussian kernel, respectively, in 1, 2 and 3 dimensions. From this
figure, it appears clearly that the collections are better represented in the embedding space using the
cosine kernel.

Figure 4.3. Representation of our data set in various diffusion spaces: panels (a, c, e) use the cosine kernel and panels (b, d, f) the Gaussian kernel, respectively in 1, 2 and 3 dimensions. Magenta points represent documents of the Med collection; cyan points represent documents of the Cisi collection.


However, we still do not know how they will be represented in other dimensions. To answer this question, we cluster the embedded data set into k = 2 clusters. Table 4.1 recalls the results of running the k-means program for several dimensions in the two diffusion spaces, as well as the LSA space; the best result over all dimensions for each space is marked with an asterisk. Table 4.3 shows the best results of k-means in the Salton space, cosine diffusion space, and LSA space.
Spaces               1-Dim             2-Dim             3-Dim             4-Dim
                     Acc      MI       Acc      MI       Acc      MI       Acc      MI
Gaussian diffusion   58.60    0.03     58.68    0.15     58.78*   0.21*    58.68    0.10
Cosine diffusion     98.60*   89.18*   90.11    69.33    77.96    41.80    75.82    34.47
LSA                  98.63*   90.40*   94.61    78.41    93.78    77.04    90.76    67.08

Spaces               5-Dim             10-Dim            20-Dim            100-Dim
                     Acc      MI       Acc      MI       Acc      MI       Acc      MI
Gaussian diffusion   58.60    0.05     58.60    0.05     58.60    0.05     58.60    0.05
Cosine diffusion     76.11    33.80    75.76    32.25    71.63    29.75    59.49    8.12
LSA                  83.72    46.4     73.31    26.82    65.95    12.64    61.16    5.497

Table 4.1. Performance of different embedding representations using k-means for the set Cisi and Med (the best result over all dimensions for each space is marked with *).
From the results of Table 4.1, we can see that the diffusion embedding one obtains is very sensitive to the choice of the diffusion kernel, and that the data representation in higher dimensions produces worse results, confirming the Dhillon and Modha results [DhM01] discussed in Section 4.2.3.1. Moreover, by comparing in Table 4.2 the running time of the diffusion process, at t = 1, using the two kernels, we find that the process needs just about 36 seconds to build the 2-dimension diffusion space based on the cosine kernel, while for the Gaussian kernel it takes about 31 minutes, indicating that the cosine kernel takes advantage of the word × document matrix sparsity in the computation of the Markov matrix and the SVD.

                 Cosine kernel   Gaussian kernel
Distance          9 s             7 s
Markov matrix     2 s            14 s
SVD              25 s            31 min

Table 4.2. The process running time for the cosine and the Gaussian kernels.
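The gain comes from the fact that the cosine kernel can be computed directly on the sparse term-document matrix. A small scipy sketch of this computation (illustrative code under our own naming, not the thesis implementation) is:

```python
import numpy as np
from scipy import sparse

def sparse_cosine_kernel(A):
    """Document-document cosine similarities from a sparse term-document matrix A
    (terms x documents), without ever densifying A itself."""
    norms = np.sqrt(A.power(2).sum(axis=0)).A1 + 1e-12   # column (document) norms
    A_normalized = A @ sparse.diags(1.0 / norms)          # scale each column to unit norm
    return (A_normalized.T @ A_normalized).toarray()      # dense only in the document dimension

# Usage with a random sparse 5000-term x 300-document matrix.
A = sparse.random(5000, 300, density=0.01, format="csc", random_state=0)
K = sparse_cosine_kernel(A)
```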
In Figure 4.4, we plot the first two coordinates of some powers of the Markov matrix M (a, c, e, g, i)

for the cosine kernel, and (b, d, f, h, j) for the Gaussian kernel, respectively, for t equal to 2, 4, 10, 100
and 1000.

Figure 4.4. Representation of our data set in cosine and Gaussian diffusion spaces for various t time iterations: panels (a, c, e, g, i) use the cosine kernel and panels (b, d, f, h, j) the Gaussian kernel, respectively for t equal to 2, 4, 10, 100 and 1000. Magenta points represent documents of the Med collection; cyan points represent documents of the Cisi collection.
From this figure, we remark that when the value of the power t increases, the data of the two collections get more merged; this is because, in this case, the data points get connected by a larger number of paths. Moreover, we remark that the rate at which the data dependencies change is larger in the Gaussian diffusion space than in the cosine space, showing by this that the cosine distance is more stable than the Euclidean distance.
On the basis of these results, and the fact that we are using un-normalized data, we have decided to exclude from our succeeding experiments the diffusion space based on the Euclidean distance (the Gaussian diffusion space) and the use of the Markov matrix powers. Thus, we will restrict our comparisons to the cosine diffusion space for t equal to 1, the LSA space, and the Salton space.

Spaces             Acc     MI
Cosine diffusion   98.60   89.18
Salton             95.72   83.61
LSA                98.63   90.40

Table 4.3. Performance of k-means in cosine diffusion, Salton and LSA spaces for the set Cisi and Med.
On the other hand, despite the fact that Table 4.3 shows that the cosine diffusion representation is only incrementally better in both accuracy and mutual information compared to the Salton representation, we should not forget the considerable gain in computation time. A k-means run in Matlab sometimes takes more than two hours when documents are represented in the Salton space, where the length of a document vector is determined by the number of collection terms, usually in the thousands; with the cosine diffusion representation, the running time is just a few seconds, in view of the fact that the length of a document vector in the embedded space is very small, reduced by a factor that may be larger than 1000. However, in the case of the LSA representation, we remark that for this set of documents, k-means performs almost as accurately as in the case of the cosine diffusion representation.
To pick the number of dimensions for the embedding space, as shown in Figure 4.5, we plot the first
100 singular values of the cosine diffusion map in the bottom curve. To help identify the discontinuity in
the slope of the singular value curve, we plot on the top part of the figure the difference between each
successive pair of singular values, magnified by a factor of 10 and displaced by 1 from the origin for
emphasis.

Figure 4.5. Representation of the first 100 singular values of the cosine diffusion map on the set Cisi
and Med.


The first singular value is always unity, corresponding to the flat singular vector. The next singular value is greater than the rest of the singular values; following Lerman's method [Ler99], this indicates that the optimal dimension is approximately 1. This is also confirmed by the results of Table 4.1.

Figure 4.6. Representation of the first 100 singular values of the Cisi and Med term-document matrix.
Using the same technique, to pick the reduced dimension for the LSA space, we remark that for this
set of documents the discontinuity point in Figure 4.6 corresponds to the best reduced dimension found
in Table 4.1.
After clustering the set of documents into two clusters, in this step we use the diffusion process and the Buchaman-Wollaston & Hodgeson method to make sure that the documents of each resulting cluster, referenced by C1 and C2, should not be further partitioned.
Given that it is well known that for well separated data, the number of empirical histogram peaks is
equal to the number of components, the Buchaman-Wollaston and Hodgeson method [BuH29] consists
in fitting each peak to a distribution. Based on this, and on the Kullback-Leibler (KL) divergence
[KuL51], the Jensen-Shannon divergence [FuT04], and the accumulation function [Rom90] to compare
between the approximation distribution and the data histogram distribution, we determine, as is shown in
Table 4.4, that we have very good approximations of the histograms of clusters C1 and C2, each
represented, respectively, in Figure 4.7 and Figure 4.8, by only one normal distribution.

Figure 4.7. Histogram representation of the cluster C1 documents.

Figure 4.8. Histogram representation of the cluster C2 documents.

                     Cluster C1   Cluster C2
Kullback-Leibler      8e-15        2e-15
Jensen-Shannon       -4e-17        8e-17
Accumulation          1e-16        3e-17

Table 4.4. Measure of the difference between the approximated and the histogram distributions.
In Figure 4.9 and Figure 4.10, we represent the first hundred singular values of documents from clusters C1 and C2, respectively, in the cosine diffusion space. We remark that the discontinuity point of the slope coincides with the largest singular value, which means that the other singular values are meaningless. Thus, we could represent a document in the embedded cosine diffusion space in 1 dimension.

Figure 4.9. Representation of the first 100 singular values of the cosine diffusion map on the cluster C1.

Figure 4.10. Representation of the first 100 singular values of the cosine diffusion map on the cluster C2.
Example 2. (Cran, Cisi, and Med) Here, we mix all documents of the three collections: Cran, Cisi, and Med. In Table 4.5, we present the results of the k-means algorithm running in five different dimensions for the LSA and the cosine diffusion spaces. Table 4.6 shows the optimal performance of k-means in the cosine diffusion, Salton and LSA spaces.

Spaces             Dim1             Dim2             Dim3             Dim4             Dim5
                   Acc     MI       Acc     MI       Acc     MI       Acc     MI       Acc     MI
Cosine diffusion   93.21   78.72    98.45   92.14    97.05   90.29    94.38   87.26    92.67   86.34
LSA                89.67   72.84    86.22   81.27    86.74   82.11    92.32   86.44    79.05   67.32

Table 4.5. Performance of different embedding representations using k-means for the set Cran, Cisi and Med.

Spaces             Acc     MI
Cosine diffusion   98.45   92.14
Salton             73.03   62.35
LSA                92.32   86.44

Table 4.6. Performance of k-means in cosine diffusion, Salton, and LSA spaces for the set Cran, Cisi and Med.
From Tables 4.3 and 4.6, we remark that k-means performs much better in the cosine diffusion space compared to the Salton space, and better than in the LSA space. However, for this set of documents, the singular value discontinuity technique does not work for the LSA space, because the marked slope discontinuity (shown in Figure 4.12) is around the 3rd singular value, indicating an optimal dimension equal to 2, while the best dimension found in Table 4.5 is the 4th.

Figure 4.11. Representation of the first 100 singular values of the cosine diffusion space on the set
Cran, Cisi and Med.

Figure 4.12. Representation of the first 100 singular values of the Cran, Cisi and Med term-document matrix.


With the objective of re-clustering the documents of the three resulting clusters C1, C2 and C3, we conclude from the results of Figures 4.13 to 4.15 and Table 4.7 that these sets of documents could not be further refined. Also, from Figures 4.16 to 4.18, we establish that the slope discontinuity point of the singular values for a set of documents representing one cluster in the cosine diffusion space always coincides with the largest singular value.

Figure 4.13. Histogram representation of the cluster C1 documents.

Figure 4.14. Histogram representation of the cluster C2 documents.


Figure 4.15. Histogram representation of the cluster C3 documents.

                     Cluster C1   Cluster C2   Cluster C3
Kullback-Leibler      6e-17        1e-16        2e-16
Jensen-Shannon        2e-16        1e-16        1e-16
Accumulation          4e-17        4e-17        1e-17

Table 4.7. Measure of the difference between the approximated and the histogram distributions.

Figure 4.16. Representation of the first 100 singular values of the cosine diffusion map on cluster C1.


Figure 4.17. Representation of the first 100 singular values of the cosine diffusion map on cluster C2.

Figure 4.18. Representation of the first 100 singular values of the cosine diffusion map on cluster C3.
The previous results suggest two postulates for the cosine diffusion space:

Dimension Postulate: The optimal dimension of the embedding for the cosine diffusion space is
equal to the number, d, of the singular values on the left side of the discontinuity point after excluding
the largest (first) singular value. When d is equal to zero, the data will be represented in 1-dimension.

Cluster Postulate: The optimal number of clusters in a hierarchical step is equal to d+1, where d is
the optimal dimension provided by the dimension postulate of the same step.
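As a rough illustration, the two postulates can be read off a vector of singular values automatically, for instance by taking the discontinuity as the largest drop between successive values. This is only a heuristic sketch of the graphical procedure used in this chapter, with names of our own choosing:

```python
import numpy as np

def read_postulates(singular_values):
    """Return (embedding dimension, number of clusters) from the singular values of
    the cosine diffusion map, following the dimension and cluster postulates.

    The discontinuity is approximated here by the largest drop between successive
    singular values; d counts the values on its left, excluding the first (flat) one."""
    s = np.asarray(singular_values, dtype=float)
    drops = s[:-1] - s[1:]                     # s_i - s_{i+1}
    left_count = int(drops.argmax()) + 1       # values on the left of the discontinuity
    d = left_count - 1                         # exclude the largest (first) singular value
    dimension = d if d > 0 else 1              # dimension postulate
    return dimension, d + 1                    # cluster postulate
```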

Example 3. (Cran, Cisi, Med, and Reuters_1) For this example, we use just 500 documents from
each collection of Cran, Cisi, and Med, mixed with 425 documents from the Reuters collection. From
Table 4.8, representing the results of the k-means algorithm running in five different dimensions for the
LSA and the cosine diffusion spaces, and Table 4.9, representing its optimal performance in the cosine
diffusion, Salton and LSA spaces, it appears that k-means performs better in the cosine diffusion space
compared to both of the other spaces. However, we are interested in more than that.

Spaces             Dim1             Dim2             Dim3             Dim4             Dim5
                   Acc     MI       Acc     MI       Acc     MI       Acc     MI       Acc     MI
Cosine diffusion   77.87   66.02    84.73   78.88    95.93   93.90    99.22   96.66    98.04   95.61
LSA                66.85   66.42    87.74   81.41    82.28   83.30    70.82   67.44    62.83   57.28

Table 4.8. Performance of different embedding cosine diffusion and LSA representations using k-means for the set Cran, Cisi, Med and Reuters_1.

Spaces             Acc     MI
Cosine diffusion   99.22   96.66
Salton             71.68   71.62
LSA                87.74   83.30

Table 4.9. Performance of k-means in cosine diffusion, Salton and LSA spaces for the set Cran, Cisi, Med and Reuters_1.
Effectively, in the following we are concerned with validating our postulates in the cosine diffusion space. Based on the dimension and cluster postulates, Figure 4.19, representing the first hundred singular values for the chosen set of documents, indicates that the embedding dimension for this set of data should be equal to 2, and the number of clusters should be equal to 3.

Figure 4.19. Representation of the first 100 singular values of the cosine diffusion map on the set Cran,
Cisi, Med and Reuters_1.
Thus, we run the 3-means program in the 2-dimension cosine diffusion space, and we present the
generated confusion matrix in Table 4.10.

       Cran   Cisi   Med   Reuters
C1     493           59
C2            499
C3                   441   425

Table 4.10. The confusion matrix for the set Cran-Cisi-Med-Reuters_1 clustered into 3 clusters in the 2-dimension cosine diffusion space.

Figure 4.20. Representation of the first clusters (C1, C2 and C3) of the hierarchical clustering, obtained by k-means in the 2-dimension cosine diffusion space.


From Figure 4.20, representing the first clusters of the hierarchical clustering of the set of collections
Cran-Cisi-Med-Reuters_1, we choose to exclude the documents belonging to the cluster C2 from further
decomposition, based on the fact that it is sufficiently distant from the other clusters. We then rerun the
k-means algorithm on the rest of the document set, which we call S.
By executing the cosine diffusion map process in S, we get the singular values presented in Figure
4.21.
Figure 4.21. Representation of the first 100 singular values of the cosine diffusion map on the data set S (the upper curve shows the slope, i.e. the differences between successive singular values).


From this figure, the cluster postulate suggests that this set of documents should be clustered into 3
clusters in 2 dimensions. The generated confusion matrix in Table 4.11 and the clusters shown in Figure
4.22 present the results of this experiment.
       Cran   Cisi   Med   Reuters
C1     493
C2                   493
C3                         425

Table 4.11. The confusion matrix for the set S in the 2-dimension cosine diffusion space.

Figure 4.22. Representation of the set S clusters.


By combining the confusion matrices in Tables 4.10 and 4.11, we get Table 4.12, which shows that the number of misclassified documents is equal to 15.
       Cran   Cisi   Med   Reuters
C1     493
C2            499
C3                   493
C4                         425

Table 4.12. The resultant confusion matrix.


To verify the validity of the dimension postulate, and to argue for our choice of running 3-means in the 2-dimensional cosine diffusion space, we evaluate the k-means performance in multiple dimensions for the data sets C2 and S. In the case of the C2 set, we restrict the performance computation to the mutual information and omit the calculation of the accuracy, as it is a multi-class metric. The results of Tables 4.13 and 4.14 show that the best embedding dimension to partition these three

clusters is equal to two, as indicated by the slope discontinuity shown in Figures 4.19 and 4.21.
Spaces             1-Dim   2-Dim   3-Dim   4-Dim
Cosine diffusion   18.89   84.69   76.4    69.47

Table 4.13. Mutual information of different embedding cosine diffusion representations using k-means to exclude the cluster C2 from the set Cran, Cisi, Med and Reuters_1.

Spaces             1-Dim            2-Dim            3-Dim            4-Dim
                   Acc     MI       Acc     MI       Acc     MI       Acc     MI
Cosine diffusion   90.87   72.32    97.55   93.76    99.08   95.07    86.51   79.79

Table 4.14. Performance of different embedded cosine diffusion representations using k-means for the set S.
In order to verify the results of the hierarchical clustering suggested by the two postulates, we run 4-means in the 4-dimension cosine diffusion space, which is indicated in Table 4.8 as the best reduced dimension for clustering the Cran-Cisi-Med-Reuters_1 set in one step.
By presenting, in Table 4.15, the confusion matrix generated from partitioning the entire collection into 4 clusters in the 4-dimensional cosine diffusion space, we remark that this matrix indicates the existence of 15 misclassified documents, which is identical to the number of misclassified documents in the confusion matrix resulting from combining the confusion matrices of the hierarchical steps, presented in Table 4.12.
       Cran   Cisi   Med   Reuters
C1     492
C2            500
C3                   493
C4                         425

Table 4.15. The confusion matrix for the set Cran-Cisi-Med-Reuters_1 clustered into 4 clusters in the 4-dimension cosine diffusion space.
In this example, we have not just validated our postulates; moreover, we have established that the relation between them is mutual in each step of the hierarchical process. The results of Table 4.8 show that the reduced dimension deduced graphically for a set of data depends on the number of clusters into which the data will be partitioned. Thus, when we considered the number of clusters of the set Cran, Cisi, Med, and Reuters_1 to be known (equal to 4), Table 4.8 indicates that the resulting reduced dimension, equal to 4, is different from the one deduced graphically from Figure 4.19.

Example 4. (Cran, Cisi, Med, and Reuters_2) To make sure that the need for many hierarchical clustering steps does not depend on the number of clusters, especially when this number is larger than 3, as in the case of Example 3, we have chosen 500 documents from each of the collections Cran, Cisi, and Med, different from those used in Example 3, and then mixed them with the 425 Reuters documents used in Example 3.
From Figure 4.23, we can see that the marked slope discontinuity around the 4th singular value indicates the optimal dimension shown in Table 4.16 and the correct number of clusters, from the first hierarchical step.

Figure 4.23. Representation of the first 100 singular values of the cosine diffusion map on the set Cran,
Cisi, Med and Reuters_2.

Spaces             Dim1             Dim2             Dim3             Dim4             Dim5
                   Acc     MI       Acc     MI       Acc     MI       Acc     MI       Acc     MI
Cosine diffusion   72.05   57.74    86.37   79.06    98.16   96.08    97.92   95.39    96.97   94.69
LSA                80.16   71.04    88.94   83.98    86.82   86.74    72.04   66.99    67.52   60.67

Table 4.16. Performance of different embedding cosine diffusion and LSA representations using k-means for the set Cran, Cisi, Med and Reuters_2.

Spaces             Acc     MI
Cosine diffusion   98.16   96.08
Salton             71.44   69.44
LSA                88.94   86.74

Table 4.17. Performance of k-means in cosine diffusion, Salton and LSA spaces for the set Cran, Cisi, Med and Reuters_2.
Example 5. (Reuters) In this example, we do a small experiment to assess how our approach to clustering documents based on the cosine diffusion map and our proposed postulates responds to non-separated data. To this end, we have mixed documents of four Reuters categories.
Spaces             1-Dim            2-Dim            3-Dim            4-Dim            5-Dim
                   Acc     MI       Acc     MI       Acc     MI       Acc     MI       Acc     MI
Cosine diffusion   38.99   8.68     50.95   28.88    66.26   49.33    66.44   56.75    62.57   49.08
LSA                36.13   7.62     40.46   12.06    39.33   10.92    38.81   10.49    38.64   9.25

Table 4.18. Performance of different embedding cosine diffusion and LSA representations using k-means for Reuters.

Spaces             Acc     MI
Cosine diffusion   66.44   56.75
Salton             46.59   35.22
LSA                40.46   12.06

Table 4.19. Performance of k-means in cosine diffusion, Salton and LSA spaces for Reuters.

Figure 4.24. Representation of the first 100 singular values of the cosine diffusion map on Reuters (the upper curve shows the slope, i.e. the differences between successive singular values).


From Tables 4.18 and 4.19, we see that the k-means algorithm works much better in the cosine diffusion space compared to both the Salton and LSA spaces. However, we remark that in this example, where documents are overlapping, the performance of k-means in the cosine diffusion space is not as high as in the previous examples, where the documents are well separated. On the other hand, the results exhibited in Figure 4.24 and Table 4.18 show that the discontinuity point corresponds to the best reduced dimension, which means that our first proposed postulate is still valid, whereas the second is not adequate for the case of overlapping clustering.
Examples 1-4 strongly suggest that the proposed postulates produce good results at identifying well separated clusters. However, when data are not well separated, the notion of a cluster is no longer well defined in the literature.
Comparing our results to those stated for spectral clustering [Von06], we found that our cluster postulate conforms to spectral clustering algorithms, where, to construct k clusters, the first k eigenvectors are used. Our cluster postulate indicates that the number of clusters is one more than the best dimension; this difference is due to the fact that we normalize the eigenvectors by the one corresponding to the largest eigenvalue, which means that, by including the information of the first eigenvector in the other eigenvectors, we have excluded the use of this vector.
Seeing that the k-means clustering algorithm gave us similar results in both the LSA and the cosine
spaces for the data set of Example 1, we have decided to go in greater depth into the comparison
between these two spaces, and to undertake a statistical study.
By comparing the LSA and the diffusion map process in the flowcharts of Figure 4.25, we remark
that the SVD in the LSA method is applied to a term-document matrix, while in the DM approach, it is
applied to a document-document matrix. Thus, in LSA, the singular vector of a document gives its
relationship with the collection terms, while in the DM approach, a singular vector informs one about a
relationship between the collection documents.

Figure 4.25. The LSA and Diffusion Map processes: in the LSA process, the singular value decomposition followed by the dimension reduction is applied directly to the term-document matrix, whereas in the diffusion map process the term-document matrix is first turned into a document-document matrix and then into a Markov matrix, to which the singular value decomposition and the dimension reduction are applied.


Even though each topic has its own specific terms to describe it, this does not negate the fact that
close topics could have some specific terms in common. Thus, when we classify documents based on
their terms, as the number of topics in the same collection decreases, clusters become more separated.
Reciprocally, when we have a large number of topics, we could get overlapping clusters due to the terms
in common between these topics. However, if clustering is based on the relationship between
documents, this problem will be minimized.
For the statistical study, we have used 45 sets of data formed by documents from the Reuters,
Ohsumed [HBL94] and/or Cran and Cisi collections. These sets were formed such as each 15 of them
contain 2, 3 or 4 clusters.
The statistical study we undertake is based on the dependent t-test [PTS92], because we have repeated measures. The t-test equation in this case is defined by

$$t = \sqrt{N}\,\frac{\bar{X}_D}{S_D},$$

where each component of the vector D is the difference between a pair of the accuracies or the mutual information values in the cosine diffusion and the LSA spaces, for a set of data. N is the length of the vector D, or explicitly, the number of data sets. $\bar{X}_D$ is the mean of D, and $S_D$ is its standard deviation, defined by $S_D = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(d_i - \bar{X}_D)^2}$.
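For concreteness, the paired t value can be computed as follows (a small Python sketch with placeholder numbers, not the thesis measurements); scipy's ttest_rel gives the same statistic:

```python
import numpy as np
from scipy import stats

# Placeholder accuracies of the two spaces on the same five data sets.
acc_diffusion = np.array([98.4, 91.2, 86.7, 93.5, 88.1])
acc_lsa       = np.array([92.3, 88.9, 84.1, 94.0, 86.2])

D = acc_diffusion - acc_lsa
t_manual = np.sqrt(len(D)) * D.mean() / D.std(ddof=1)   # t = sqrt(N) * mean(D) / S_D
t_scipy, p_value = stats.ttest_rel(acc_diffusion, acc_lsa)
print(round(t_manual, 3), round(t_scipy, 3))             # identical values
```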


Clusters number    Two              Three            Four
                   Acc     MI       Acc     MI       Acc     MI
T-test value      -3.02   -3.44     0.56    0.62     1.76    3.08

Table 4.20. The statistical results for the performance of the k-means algorithm in cosine diffusion and LSA spaces.
From the results of Table 4.20, we can conclude that the k-means clustering algorithm performs very differently in the cosine diffusion space than in the LSA space, because the absolute value of the t-test values is much larger than the statistical significance threshold, which is usually equal to 0.05 [All07], for the three cases. Moreover, the results show that when there are only 2 topics, which implies that the term distributions for these 2 topics are disjoint, the k-means algorithm performs better in the LSA space than in the cosine diffusion space; while for multiple topics (in the cases of 3 and 4 clusters), when documents on different topics may use overlapping vocabulary, the k-means performance is better in the cosine diffusion space. Furthermore, the performance difference of the k-means algorithm in the two spaces becomes larger when the number of clusters increases. On the other hand, we remark that these results conform to the ones shown in Tables 4.3, 4.6, 4.9 and 4.17.
If we take into consideration that, in a real-world clustering environment, the data sets usually contain more than two clusters, we can conclude that the k-means algorithm performs well in the cosine diffusion space.

4.4.2. On-line Clustering


In the following, we evaluate the results of the on-line single-pass clustering based on diffusion map (OSPDM) algorithm applied in three different vector spaces: the Salton space, the diffusion space and the updated diffusion space, for three data sets. The evaluation is done by comparing the accuracy and the mutual information.
The data set Cisi-Med contains all documents of the collections Cisi and Med. In Cran-Cisi-Med, we mix all documents of the three collections: Cran, Cisi, and Med. Finally, in Cran-Cisi-Med-Reuters, we use 500 documents from each of the collections Cran, Cisi, and Med, mixed with the 425 Reuters documents.


Set                        Salton          DM              Upd-DM
                           ACC     MI      ACC     MI      ACC     MI
Cisi-Med                   87.16   65.52   91.29   72.12   91.41   72.56
Cran-Cisi-Med              60.82   37.21   80.5    69.29   79.83   68.25
Cran-Cisi-Med-Reuters      26.07   0.24    81.61   84.08   77.87   83.89

Table 4.21. Performances of the single-pass clustering.


From the results of Table 4.21, we can see that the performance of the single-pass clustering algorithm in the diffusion space is better than in the Salton space, while it is almost similar to the performance in the updated diffusion space. More precisely, the slight performance decrease in the updated diffusion space is due to the updating process, while the dramatic drop in the mutual information measure in the Salton space, when the Reuters collection is incorporated, is due to the inexactitude of the cluster number in this case, even after trying a variety of threshold values.
On the other hand, given that the embedded space is restricted to the first ten dimensions, the single-pass algorithm requires less computation time in both the diffusion and the updated diffusion spaces than in the Salton space, which more than compensates for the runtime of the updating process.

4. 5. Summary
In this chapter, we have proposed a process, based on the cosine diffusion map and the singular
value decomposition, to cluster on-line and off-line documents. The experimental evaluation of our
approach for classical clustering has not only shown its effectiveness, but has furthermore helped to
formulate two postulates, based on the slope discontinuity of the singular values, for choosing the
appropriate reduced dimension in the cosine diffusion space, and finding the optimal number of clusters
for well separated data. Thus, our approach has shown many advantages compared to other clustering methods in the literature. Firstly, the use of the cosine distance to construct the kernel has experimentally indicated a better representation of the un-normalized data in the diffusion space than the Gaussian kernel, and minimized the computational cost by taking advantage of the word × document matrix sparsity. Secondly, the running time of the k-means algorithm in the reduced dimension of the diffusion map space is much lower than in the Salton space. Thirdly, we formulated a simple way to find the right reduced dimension, where a learning phase is not needed. Fourthly, the estimation of the optimal number of clusters is immediate, unlike other approaches where some criteria are optimized as a function of the number of clusters; in addition, our approach indicates this number even when there is just one cluster. Finally, data representation in the cosine diffusion space has shown a non-trivial statistical improvement in the case of multi-topic clustering compared to the representation in LSA space.
Similarly, the on-line clustering algorithm, based on mapping the data into low-dimensional feature
space and maintaining an up-to-date clustering structure by using the singular value decomposition
updating method, has enhanced efficiency, specifically for stationary text data.


Chapter 5 Term Selection


5. 1. Introduction
As storage technologies evolve, the amount of available data explodes in both dimensions: the number of samples and the input space dimension. Therefore, one needs dimension reduction techniques to explore and analyze such huge data sets, which may lead to significant savings of computer resources and processing time.
Many feature transformation approaches have been proposed in the information retrieval context, while feature selection methods are generally used in machine learning tasks (see Section 2. 4). In this chapter, we propose to supplement, in the context of information retrieval, the feature transformation method, based in our work on singular value decomposition (SVD), with term selection.
While latent semantic analysis, based on the SVD, has shown an effective improvement in information retrieval, it has a major difficulty in the learning phase, which is the high dimensionality of the feature space. To address this issue, we propose to use Yan's approach, which consists in extracting the generic terms [Yan05]. This approach was first proposed by Yan to improve the LSA performance; however, by studying the benefit of the generic term extraction on the English data collection (introduced in Section 3.3.1), we have remarked that this technique could be used as a dimensionality reduction method.

5. 2. Generic Terms Definition


Generic Terms are an obvious minority among all the terms in the context of text retrieval; they have a relatively well-balanced and consistent occurrence across the majority of (if not all of) the document collection topics. Due to their contrary distribution features compared to their majority counterparts, the Domain Specific Terms, which have a relatively concentrated occurrence in very few topics (in the extreme case, just one topic), it is found that these terms affect the information retrieval performance [Yan05]. While in Yan's work the extraction of generic terms was implemented with the aim of improving the retrieval performance, our approach consists in using the same algorithm, but this time with the objective of keeping the same performance while reducing the feature space.

5. 3. Generic Terms Extraction


In order to present the generic term extracting algorithm, we first need to state some definitions [Yan05], besides recalling the spherical k-means algorithm [DhM01].

Definition 1: The Concept Vector c of a set of n term vectors term_i (1 ≤ i ≤ n) is their normalized mean (this definition is adapted from [DhM01]).
Given that the n term vectors do not diverge from each other too much, their Concept Vector can be
seen as a normalized representative vector for these n term vectors.
Mathematically, following Definition 1, we have:

$$c = \frac{\frac{1}{n}\sum_{i=1}^{n} term_i}{\left\|\frac{1}{n}\sum_{i=1}^{n} term_i\right\|} = \frac{\sum_{i=1}^{n} term_i}{\left\|\sum_{i=1}^{n} term_i\right\|} \qquad (1)$$

Given that a certain document collection has t terms (keywords), we can use the spherical k-means clustering algorithm to partition these t terms into k clusters. Mathematically, we have:

$$\bigcup_{j=1}^{k} Cluster_j = \{term_i : 1 \le i \le t\} \qquad (2)$$

and

$$Cluster_i \cap Cluster_j = \emptyset \quad (i, j \in [1, k] \text{ and } i \ne j) \qquad (3)$$

Definition 2: The Affinity between a term vector and a cluster of term vectors is the cosine of the term vector and the Concept Vector of the cluster.
The Affinity between a term and a cluster of terms, with a range of values between -1 and 1 inclusive, indicates how closely (in terms of the absolute value of the Affinity) and in which manner (the Affinity being positive or negative) this term is related to this cluster.
Mathematically, given a term vector term and a cluster $Cluster = \{term^{(1)}, term^{(2)}, \ldots, term^{(w)}\}$, their Affinity is defined as

$$\textit{affinity}(term, Cluster) = \frac{term \cdot c}{\|term\|\,\|c\|} = \frac{term \cdot c}{\|term\| \cdot 1} = \frac{term \cdot c}{\|term\|}, \quad \text{where } c = \frac{\sum_{i=1}^{w} term^{(i)}}{\left\|\sum_{i=1}^{w} term^{(i)}\right\|} \qquad (4)$$

Definition 3: The Affinity Set between a term vector and a partition of all terms in a document
collection is the set of Affinity values between the said term vector and each cluster of the said partition.
The Affinity Set records a number of Affinity values for a particular term across all the clusters of a
certain partition.
Mathematically, given a term vector term and a partition having k clusters $\{Cluster_j\}_{j=1}^{k}$, the Affinity Set AFN between this term and this partition is defined as:

$$AFN = \{\textit{affinity}(term, Cluster_j) : 1 \le j \le k\} \qquad (5)$$

Definition 4: The Characteristic Quotient (or CQ) of a term vector with respect to a partition
of all terms in a document collection is the standard deviation of the Affinity Set defined between this
term vector and this partition over the mean of all the members in the said Affinity Set.
The Characteristic Quotient of a term vector with respect to a partition provides a sensible estimate
(educated guess) on how evenly (or unevenly) the meaning of this term participates across all the
clusters of this partition.
Mathematically, given a term vector term, a partition having k clusters $\{Cluster_j\}_{j=1}^{k}$, and their Affinity Set AFN, the Characteristic Quotient of this term vector with respect to this partition is defined as:

$$CQ = \frac{stdv(AFN)}{mean(AFN)} \qquad (6)$$

where stdv() and mean() are defined as:

$$mean(\{x_i : 1 \le i \le n\}) = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$stdv(\{x_i : 1 \le i \le n\}) = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
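Definitions 1-4 translate directly into code; the following short Python sketch (with illustrative names of our own) computes the Characteristic Quotient of one unit-norm term vector with respect to a given partition of the terms:

```python
import numpy as np

def concept_vector(term_vectors):
    """Definition 1: normalized mean of a set of term vectors (rows of a 2-D array)."""
    m = np.mean(term_vectors, axis=0)
    return m / np.linalg.norm(m)

def characteristic_quotient(term, clusters):
    """Definitions 2-4: CQ of a unit-norm term vector w.r.t. a partition of all terms.

    clusters: list of 2-D arrays, one per cluster, whose rows are term vectors."""
    afn = np.array([term @ concept_vector(c) for c in clusters])   # Affinity Set
    return afn.std(ddof=1) / afn.mean()
```

Terms whose CQ is among the lowest values would then be retained as Generic Term candidates, as formalised by Definitions 5 and 6 below.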

Now we may formally define Generic Terms and Domain Specific Terms.

Definition 5: For a particular document collection, given all the terms and a meaningful partition
of these terms, the Generic Terms are those terms whose Characteristic Quotients are below the value
of GEN_CUTOFF.

Definition 6: For a particular document collection, given all the terms and a meaningful partition
of these terms, the Domain Specific Terms are those terms whose Characteristic Quotients are above or
equal to the value of GEN_CUTOFF.
The following two points shall clarify Definition 5 and Definition 6:
(I)

The phrase meaningful partition refers to a partition that groups terms in such a way so
that terms of similar meanings are most likely located in the same cluster of the partition.

(II)

GEN_CUTOFF is a small constant chosen to differentiate between Generic Terms and


Domain Specific Terms. More discussions on GEN_CUTOFF shall follow shortly.

Comparing Definition 5 to the intuitive definition (characterization) of Generic Terms at the
beginning of the current Section, we have the following observations:
(I)

The new definition has the same spirit of the old one: A meaningful partition of terms into
many clusters stated in the new definition resembles a sensible grouping of documents
into many topics implied in the old one. The old definition was based on the distribution
pattern of Generic Terms over a range of document topics; the new one is based on the
participation (Affinity) pattern of Generic Terms among a number of term clusters.

(II)

The new definition has an advantage over the old one: Definition 5 is a working
definition on Generic Terms which makes it possible for devising an algorithm to identify
all the Generic Terms in a given document collection. In the new definition:
Characteristic Quotients are mathematically well-defined; a meaningful partition of terms
is obtainable through a clustering algorithm called Spherical k-means; and the value of
GEN_CUTOFF can be determined experimentally through trial and error.

The rationale behind the new definition of Generic Terms and Domain Specific Terms is as follows:
In a meaningful partition of terms, terms of similar meanings are grouped together cluster by cluster.
The Characteristic Quotient of a term vector with respect to this partition indicates how evenly (or
unevenly) the meaning of this term relates to all the clusters of this partition. The bigger the CQ is, then
the more unevenly the relationship becomes, and the stronger the tendency is for this term to be
categorized as a Domain Specific Term. On the other hand, the smaller the CQ is, then the more evenly
the relationship becomes, and the stronger the tendency is for this term to be categorized as a Generic
Term. Therefore, the value of CQ may be used to identify a term as a Generic Term or a Domain
Specific Term for that matter.
It is worth noting that a limited number of terms may sit on the borderline between Generic Terms
and Domain Specific Terms, whatever the actual value of GEN_CUTOFF is. Therefore increasing the
value of GEN_CUTOFF may allow some previously categorized borderline-case Domain Specific
Terms to be newly identified as Generic Terms, and vice versa.
Practically, Yan used a simpler but equally effective method to avoid the process of determining the
actual value of GEN_CUTOFF. He set up a goal to identify a fixed number (say ng) of generic terms so
that those terms whose Characteristic Quotients are among the lowest ng of all terms are automatically
identified as generic terms with the rest of the terms simultaneously being identified as domain specific
ones. In this way, he eliminated the GEN_CUTOFF value without any compromise of the validity of the
generic term identification process.


5.3.1. Spherical k-means


Spherical k-means [DhM01] is a variant of the well-known Euclidean k-means algorithm [DuH73] that uses cosine similarity [Ras92]. This algorithm partitions the high-dimensional unit sphere using a collection of great hypercircles, and hence Dhillon and Modha refer to it as the spherical k-means algorithm. The algorithm computes a disjoint partitioning of the document vectors, and, for each partition, computes a centroid normalized to have unit Euclidean norm. The normalized centroids contain valuable semantic information about the clusters, and, hence, they refer to them as concept vectors. The spherical k-means algorithm has a number of advantages from a computational perspective: it can exploit the sparsity of the text data, it can be efficiently parallelized [DhM00], and it converges quickly (to a local maximum). Furthermore, from a statistical perspective, the algorithm generates concept vectors that serve as a model which may be used to classify future documents. An adapted version of the algorithm for term clustering is given in the top level of the GTE algorithm.
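A compact Python sketch of the spherical k-means iteration, adapted to term vectors as in the GTE algorithm below (illustrative code under our own naming, not Dhillon and Modha's implementation), is:

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Spherical k-means on the rows of X (term vectors), using cosine similarity;
    the concept vectors are the normalized means of the clusters."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)       # work on the unit sphere
    labels = rng.integers(0, k, size=len(X))                # random initial partition
    for _ in range(n_iter):
        concepts = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                             else X[rng.integers(len(X))] for j in range(k)])
        concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)   # concept vectors
        new_labels = (X @ concepts.T).argmax(axis=1)        # most similar concept vector
        if np.array_equal(new_labels, labels):              # objective no longer improves
            break
        labels = new_labels
    return labels, concepts
```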

5.3.2. Generic Term Extracting Algorithm


Based on the previous definitions and discussions, we present the generic term extracting (GTE) algorithm. Step (I) through Step (VI) are the spherical k-means sub-algorithm (adapted from [DhM01]) for achieving a meaningful partition of all terms; while it was originally used to analyze documents, here it is used to analyze terms. The remaining steps are procedures for extracting generic terms one at a time. A
top-level flowchart of the GTE algorithm is represented in Figure 5.1.
Step (I)    Initialization: (i) normalize all term vectors term_i, where i ∈ [1, t];
            (ii) t = t_0, loop = 1;
            (iii) randomly assign the t terms to k clusters, thereby having: $\{Cluster_{0,j}\}_{j=1}^{k}$;
            (iv) compute the concept vectors: $\{c_{0,j}\}_{j=1}^{k}$.

Step (II)   For each term_i (i ∈ [1, t]), compute $c_i^* = \arg\max_{j \in [1,k]} \; term_i \cdot c_{loop,j}^T$. Note that if there are two or more concept vectors $c_{loop,1}, \ldots, c_{loop,n}$ attaining this maximum, then randomly assign one of $c_{loop,1}, \ldots, c_{loop,n}$ to $c_i^*$.

Step (III)  For each j ∈ [1, k], compute the new partition: $Cluster_{loop+1,j} = \{term_i : c_{loop,j} = c_i^*,\ 1 \le i \le t\}$.

Step (IV)   For each j ∈ [1, k], compute the new concept vectors:
            $$c_{loop+1,j} = \frac{\sum_{term \in Cluster_{loop+1,j}} term}{\left\|\sum_{term \in Cluster_{loop+1,j}} term\right\|}.$$

Step (V)    Compute the quality of the new partition: $\sum_{j=1}^{k} \sum_{term \in Cluster_{loop+1,j}} term \cdot c_{loop+1,j}^T$.

Step (VI)   Compute:
            $$improvement = \frac{\sum_{j=1}^{k} \sum_{term \in Cluster_{loop+1,j}} term \cdot c_{loop+1,j}^T \;-\; \sum_{j=1}^{k} \sum_{term \in Cluster_{loop+1,j}} term \cdot c_{loop,j}^T}{\sum_{j=1}^{k} \sum_{term \in Cluster_{loop+1,j}} term \cdot c_{loop,j}^T}.$$

Step (VII)  Case A: IF improvement ≥ ε (a small fixed threshold), then
            (i) loop = loop + 1;
            (ii) go to Step (II);
            Case B: IF improvement < ε, then
            continue with the next step.

Step (VIII) For each i ∈ [1, t], compute: $AFN_i = \{affinity_{i,j} : j \in [1, k]\} = \{term_i \cdot c_{loop+1,j}^T : j \in [1, k]\}$.

Step (IX)   Compute $i^* = \arg\min_{i \in [1,t]} \frac{stdv(AFN_i)}{mean(AFN_i)}$.

Step (X)    Case A: IF num_gen_term < MAX_GEN_TERM, then
            (i) delete term_i* from the term pool {term_i : i ∈ [1, t]};
            (ii) add term_i* to gen_term_lst, the Generic Term list;
            (iii) t = t - 1, loop = 1, num_gen_term = num_gen_term + 1;
            (iv) go to Step (II);
            Case B: IF num_gen_term = MAX_GEN_TERM, then
            continue with the next step.

Step (XI)   Stop.

Figure 5.1 represents the extraction of a generic term (i.e., its identification and removal) from the pool of all terms. The old partition of terms is further adjusted by re-running the clustering sub-algorithm before the next generic term is identified; this measure prevents the earlier-identified generic terms from exerting compounding effects on the ones still to be identified. Therefore, there are as many rounds of running the clustering sub-algorithm as there are generic terms to be identified.

Figure 5.1. Top-level flowchart of the GTE algorithm: starting from the term pool, the spherical k-means clustering sub-algorithm and the extraction of a generic term are repeated until the termination criterion is satisfied.


The GTE Algorithm is guaranteed to terminate, for the following three reasons:
(I) The spherical k-means clustering sub-algorithm (Step (I) through Step (VI)) is known to converge [DhM01].
(II) Step (VIII), Step (IX) and Step (X) Case A are sequential procedures.
(III) The termination criterion in Step (X) Case B guarantees that the steps mentioned in the above two points are iterated for no more than a maximum of MAX_GEN_TERM times.

5. 4. Experiments and Results


With the aim of studying Yan's approach, which consists in improving the LSA performance by extracting generic terms, we propose first to index the Cisi, Cran, and Med collections (introduced in Section 3.3.1) in the native space, composed of the unique terms occurring in the documents, and then to apply the GTE algorithm, extracting several numbers of terms.
In Table 5.1, we give the index size of each collection in the native space and in the noun phrase space (used for indexation in Chapter 3); we remark that, even for these moderate-sized text collections, the index size in the native space reaches tens of thousands.

Collection   Native space   NP space
Cisi          9161           2087
Cran          6633           1914
Med          12173           2769

Table 5.1. Index size in the native and noun phrase spaces.
In Tables 5.2, 5.3, and 5.4, we give, respectively for the Cisi, Cran, and Med collections, the mean interpolated average precision (MIAP) for several indexes, where the number of generic terms excluded from each index is indicated in the tables. We would like to note that all these results are for the best reduced dimension in the training phase of the LSA model.

Number of extracted   MIAP    Number of extracted   MIAP
generic terms                 generic terms
0                     0.28    1300                  0.28
100                   0.28    1400                  0.28
200                   0.28    1450                  0.28
300                   0.28    1460                  0.28
400                   0.28    1470                  0.28
500                   0.28    1480                  0.28
600                   0.28    1490                  0.28
700                   0.28    1495                  0.27
800                   0.28    1500                  0.27
900                   0.28    2000                  0.27
1000                  0.28    3000                  0.27
1100                  0.28    3560*                 0.28
1200                  0.28    3570                  0.25

Table 5.2. The MIAP measure for the collection Cisi in different indexes (* marks the largest number of excluded terms preserving the native-space performance).

Number of extracted   MIAP    Number of extracted   MIAP
generic terms                 generic terms
0                     0.51    600                   0.50
100                   0.52    700                   0.50
200                   0.51    800                   0.50
300                   0.51    900                   0.50
400                   0.51    1000                  0.50
500                   0.51    1100                  0.50
550                   0.51    1200                  0.50
560                   0.51    1300                  0.50
570*                  0.51    1400                  0.50
575                   0.50    1500                  0.48
580                   0.50

Table 5.3. The MIAP measure for the collection Cran in different indexes (* marks the largest number of excluded terms preserving the native-space performance).
Number of extracted   MIAP    Number of extracted   MIAP
generic terms                 generic terms
0                     0.66    1200                  0.66
100                   0.66    1300                  0.66
200                   0.66    1400                  0.66
300                   0.66    1410                  0.66
400                   0.66    1420*                 0.66
500                   0.66    1425                  0.65
600                   0.66    1430                  0.65
700                   0.66    1450                  0.65
800                   0.66    1500                  0.65
900                   0.66    2000                  0.65
1000                  0.66    3000                  0.64
1100                  0.66    3500                  0.63

Table 5.4. The MIAP measure for the collection Med in different indexes (* marks the largest number of excluded terms preserving the native-space performance).
By analyzing these tables, we remark that there is no apparent improvement in the LSA performance:
the existing improvement is very small, affecting only the third decimal place for the Cisi and Med
collections, and the second decimal place for the Cran collection, with a gain of less than 1%. Moreover,
this improvement is observed only when no more than the first six hundred generic terms are excluded.
For this reason, we propose to use the generic term extracting algorithm as a dimensionality reduction
technique, keeping the same performance achieved in the native space while excluding larger numbers of
generic terms. By this approach, we can exclude respectively 3560, 570, and 1420 terms from the Cisi,
Cran, and Med collections, which represent about 38.8%, 8.6%, and 11.7% of the index size in the native
space.
On the other hand, by comparing these results to those achieved in Section 3.3.2.1, and recalled in
Table 5.5, we remark a large trade-off between the index sizes indicated in Table 5.1 and the performances
indicated in Table 5.5, especially for the Cran and Med collections.

Collection    Native space    NP space
Cisi          0.28            0.32
Cran          0.51            0.47
Med           0.66            0.26
Table 5.5. LSA performance in the native and noun phrase spaces.
The reduction of the dimension may lead to significant savings of computer resources and
processing time. However, poor feature selection may dramatically degrade the information retrieval
system's performance. This is clearly observed when NP indexation is used, or when a large number of
terms is excluded by using the GTE algorithm, as in the case of excluding 3570, 1500, and 3500
generic terms respectively from the Cisi, Cran, and Med collections, where we get a performance degradation of
3%. Thus, by removing many terms, the risk of removing potentially useful information on the meaning of
the documents becomes larger. It is then clear that, in order to obtain optimal (cost-)effectiveness, the
reduction process must be performed with care.

5. 5. The GTE Algorithm Advantage and Limitation


Few term selection methods take into account the interactions between terms [CRJ03]; usually, the
role of each term is evaluated independently of the others. By analysing the results of the Cisi collection
in Table 5.2, we remark that the LSA performance after excluding 3560 terms reaches the same
performance achieved in the native space, even though this performance had already decreased by 1%
for intermediate numbers of excluded terms. This observation shows that the exclusion of a specific
number of terms, using the GTE algorithm, can positively or negatively affect the LSA concepts
because of the interactions between terms.
Although the GTE algorithm has this advantage, it has a limitation: it does not proceed automatically
in the elimination of terms, which is due to the fact that the performance is not monotone.

5. 6. Summary
By proposing to supplement, in the context of information retrieval, the feature transformation
method based on singular value decomposition with term selection, we have used Yan's approach.
Initially, this approach, consisting in extracting generic terms, was proposed to improve the performance
of the LSA model; however, we have used it for reducing the index size. In fact, the exclusion of generic
terms not only reduces the required storage capacity, but also removes terms capable of influencing a
large number of LSA concepts in an unpredictable way.


Chapter 6 Information Retrieval in Arabic Language

6. 1. Introduction
Arabic texts are becoming widely available; however, due to the characteristic challenges of Arabic,
freely available corpora, automatic processing tools, and established standard IR-oriented algorithms are
still lacking for this language.
In order to develop an Arabic IR system, we think that the improvement of former systems may
yield a predictive model to accelerate their processing and to obtain reliable results. So, with the objective
of a specific study and a possible performance improvement of Arabic information retrieval systems,
we have created an analysis corpus and a reference one, specialized in the environment field, and we
have proposed to use the latent semantic analysis method to cure the problems arising from the vector-space
model. We have also studied how linguistic processing and weighting schemes could improve the
LSA method, and we have compared the performance of the vector-space model and the LSA approach
for the Arabic language.
As is generally known, the Arabic language is complicated for natural language processing due to
two main language characteristics. The first is the agglutinative nature of the language and the second is
its vowellessness, causing ambiguity problems at different levels. In this work, we are especially
interested in the agglutination problem.

6. 2. Creating the Test Set


A corpus, or a set of textual documents, can be seen as a language sample. For this reason, corpora
are used for the automatic processing of natural language. The more extensive and varied the corpus, the
more representative the sample [Lap00].
A text corpus represents a real usage of a language, and provides an objective reference to analyze or
even obtain formal descriptions of a language. A reference corpus must satisfy two requirements: one is
to be sufficiently large; the other is the diversity of usages, such as training and testing.

6.2.1. Motivation
In recent work within the framework of information retrieval and Arabic language automatic
processing, some sizeable newspaper corpora (cf. Section 2.6) have started to become available. However,
they are not free, and the topics treated by these corpora remain of a general nature, without covering a
specialized scientific field such as the environment. For these two reasons, we have been interested in
building our own corpus.

6.2.2. Reference Corpus


6.2.2.1. Description
The development of the corpus proceeded in two stages: Web harvesting and text normalization.
These steps were executed by native speakers.
As preliminary processing, the source of our corpus was chosen from archived articles cited on the
Web sites Al-Khat Alakhdar24 and Akhbar Albiae25 , whose subjects cover various environmental
topics such as pollution, noise effects, water purification, soil degradation, forest preservation, climate
change and natural disasters. Thus, we have chosen for this corpus the appellation [AR-ENV],
designating by AR the Arabic language and by ENV the environmental theme of the corpus.
The search for the chosen topics was carried out through the use of keywords, which must be
precise and allow finding a wide spectrum of documents in terms of genres. Two search strategies are
possible: the first, in breadth, reviews most of the documents returned by a single query; the second,
in depth, examines only the first documents and explores their links. We conducted a thorough
search on the top twenty results using combinations of keywords such as environment, pollution, and noise.
To ensure proper coverage of the subject, we have expanded the search using synonyms in the search
engine of both Web sites and in the Arabic version of the Google search engine26, and relying on the
terms found in the visited pages, such as noise pollution or degradation. We note that, while gathering the
corpus, we have focused on the variety of the parameter settings involved in the creation of a document,
such as the document producer(s), the context of production or usage, the date of production, and the
document size, which varies between one paragraph and forty pages.
On the other hand, for each article, we have saved its URL, converted it from its original HTML form
to a text file in Unicode format, then unified its content as explained in the mutation process in Section
6.3.1.1 of this chapter.
The corpus in its primary phase [ABE08] consists of 1 060 documents, containing 475 148 tokens
of which 54 705 are distinct, and 30 queries, 15 of them for the training phase of the choice of the best
reduced dimension of the LSA model, and the other 15 for testing. These statistics are summarized in Table 6.1.

24 http://www.greenline.com.kw/home.asp, Retrieved on 10-22-2007.
25 http://www.4eco.com/, Retrieved on 10-22-2007.
26 http://www.google.com/intl/ar/.


Statistics                 [AR-ENV] Corpus
Document Number            1 060
Query Number               30
Token Number               475 148
Distinct Word Number       54 705
Table 6.1. [AR-ENV] Corpus Statistics.


The creation of the queries was adjusted and calibrated manually to the collection documents. The first
queries were inspired by a first reading of the corpus documents, after which general and ambiguous
queries were excluded. Each query is subdivided into three parts, containing a short title, a descriptive
sentence, and a narration part specifying the relevance criteria, as illustrated by the example in Table
6.2. The example includes the original Arabic and an English version of the query text. The average length of
queries containing just the title of the information needed is limited to approximately 2.70 tokens per
query; but when the logical parts description or narration are taken into account, the average length
of the queries becomes approximately 16.17 tokens per query.
[Original Arabic query text]
<title> The forest preservation </title>
<desc> Look for articles on preservation of the forest. </desc>
<narr> Relevant documents take up the prosecution of the forest destroyers, the ways of promoting
reforestation, and the deployment of green-space area. </narr>

Table 6.2. An example illustrating the typical approach to query term selection.
The relevance assessment was also performed manually, by reading and checking the whole document
collection, by Arabic native-speaker reviewers. After collecting all documents specified as relevant
for a given query, a document is admitted as relevant to that particular query following the majority
rule: i.e., a document is defined as relevant to a particular query only if at least three out of five
reviewers agree on its relevancy.
The size of this corpus, used in our study as a reference corpus, although still modest, can guarantee
that the articles discuss a wide range of subjects and that their content is, to some extent, heterogeneous.


The selected articles were published over a period of three years, from 2003 to 2006. We consider that, due
to this period of time, most of the topics covered by the corpus are well represented.

6.2.2.2. Corpus Assessments


The characteristics of any corpus determine the performance of the IR and NLP techniques that use that
corpus as a resource or dataset. Therefore, linguists carry out a variety of tests to evaluate the
appropriateness of the data. These measures and evaluations vary with the task, the language, and the
techniques. Assessment tools can reorganize such a corpus so that various observations can be
made. Using corpus assessment tools, we first validate the collections by applying statistical and
probability tests, such as Zipf's law and the token-to-type ratio. These tests address the main
problem closely linked with corpus size and representativeness. They are useful for describing the
frequency distribution of the words in the corpus. They are also well-known tests for gauging data
sparseness and providing evidence of any imbalance in the dataset.

Zipf's law
According to Zipf's law, if we count how often each word occurs in a corpus and then list these
words in the order of their frequency of occurrence, then the product of the frequency of a
given word f and its position in the list (its rank r) will be approximately a constant k, such that f · r = k.
Ideally, a simple graph of the above equation using logarithmic scales will show a straight line with a
slope of -1. The situation in the corpus was checked by starting with one file and increasingly adding
more files to the corpus and checking the behavior of the relation between the rank and the frequency. An
enhanced theory of Zipf's law is the Mandelbrot distribution. Mandelbrot notes that although Zipf's
formula gives the general shape of the curves, it is very bad at reflecting the details [MaS99]. So, to
achieve a closer fit to the empirical distribution of words, Mandelbrot derived the following formula for
the relation between the frequency and the rank:
f = P (r + ρ)^(-B)
where P, B, and ρ are parameters of the text that collectively measure the richness of the text's use of
words. The common factor is that there is still a hyperbolic relation between the rank and the frequency,
as in the original equation of Zipf's law. If this formula is graphed on doubly logarithmic axes, it closely
approximates a straight line descending with a slope of -B, just as Zipf's law describes (see Figure 6.1).
The graph shows the rank on the X-axis versus the frequency on the Y-axis, using logarithmic scales. The
magenta line corresponds to the ranks and frequencies of the words in all the documents of our
corpus. The straight cyan line shows the relationship between rank and frequency predicted by
Zipf's formula f · r = k.


Figure 6.1. Zipf's law and word frequency versus rank in the [AR-ENV] collection.
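As an illustration of how such a rank-frequency curve can be computed, the following Python sketch counts word frequencies in a text, ranks them, and compares the empirical frequencies with the values predicted by Zipf's formula f = k/r. The whitespace tokenization, the hypothetical corpus file name, and the choice of k (the frequency of the most frequent word) are simplifying assumptions, not the exact procedure used to produce Figure 6.1.

    from collections import Counter

    def rank_frequency(text):
        """Return (rank, frequency) pairs for the words of a text, most frequent first."""
        counts = Counter(text.split())                  # naive whitespace tokenization
        freqs = sorted(counts.values(), reverse=True)
        return list(enumerate(freqs, start=1))          # ranks start at 1

    text = open('corpus.txt', encoding='utf-8').read()  # hypothetical corpus file
    pairs = rank_frequency(text)
    k = pairs[0][1]                                     # simple choice: k = frequency of the rank-1 word
    for r, f in pairs[:10]:
        print(r, f, round(k / r, 1))                    # empirical frequency vs. Zipf's prediction k / r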

Token-to-Type Ratio (TTR)


Token-to-type ratio is another measure used to evaluate a collection or a dataset for its
appropriateness to be used in an IR or NLP task. The measure reflects mainly the sparseness of the data
[Sch02, Ars04].
Text length    Bengali (CILL)    English (Brown)    Arabic (Al-Hayat)
100            1.204             1.449              1.19
1 600          2.288             2.576              1.774
6 400          3.309             4.702              2.357
16 000         4.663             5.928              2.771
20 000         5.209             6.341              2.875
1 000 000      10.811            20.408             8.252
Table 6.3. Token-to-type ratios for fragments of different lengths, from various corpora.
The measure is obtained by dividing the number of tokens (text length) by the number of distinct
words (types). It is sensitive to sample size, with lower ratios (i.e. a higher proportion of new words)
expected for smaller (and therefore sparser) samples. A 1 000 word article might have a TTR of 2.5; a
shorter one might reach 1.3; 4 million words will probably give a ratio of about 50, and so
on. The factors that influence the TTR for raw textual data include various morphosyntactic features and
orthographic conventions (see Table 6.3).
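Following the definition above, the token-to-type ratio of a text can be computed in a few lines of Python; the whitespace tokenization used here is an assumption made for the sake of the example.

    def token_type_ratio(text):
        """Token-to-type ratio: number of tokens divided by the number of distinct word forms."""
        tokens = text.split()                 # naive whitespace tokenization
        types = set(tokens)
        return len(tokens) / len(types) if types else 0.0

    print(token_type_ratio("the dog saw the cat and the cat saw the dog"))   # 11 tokens / 5 types = 2.2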


For instance, the presence of a case system in a language will lead to a comparatively lower token-to-type
ratio. Arabic, a language with a highly inflective morphology, has a very low token-to-type ratio
compared to English [Yah89]. Figure 6.2 shows the TTR for our collection. The results confirm the
former findings of Yahya [Yah89], Goweder & De Roeck [GoD01], and Abdelali [Abd04].

Figure 6.2. Token-to-type ratios (TTR) for the [AR-ENV] collection.

Other Measures
To measure the lexical richness of the [AR-ENV] corpus, we have also used, in the context of lexical
categories, the lexical coverage and the grammatical category distribution measures. For further details
on these two metrics, consult Boulaknadel's dissertation [Bou08].

6.2.3. Analysis Corpus


The process undertaken in the work presented in this chapter is based on the opposition of two
corpora with different properties. The first corpus is an analysis set [AR-RS] and the second is a
reference set [AR-ENV]. As a first step of our study, we apply our experimental protocol to an
analysis corpus. This approach has the advantage of verifying whether the observed results are stable,
independent of a particular corpus, and not coincidental.
Our analysis corpus is a set of articles retrieved from the Web27, dealing with the Royal Speech,
published between March 2002 and December 2006. The total size of this corpus is 101 072
occurrences, corresponding roughly to 20 107 types; it also comprises 10 queries for the training of the best
dimension choice of the LSA model, and 10 others for testing. Each query is consistent with those described in
Table 6.2.
27 http://www.maec.gov.ma/arabe, Retrieved on 12-26-2007.



6. 3. Experimental Protocol
In contrast to other languages such as French and English, Arabic is an agglutinative language, in which
words are preceded and followed by prefixes and suffixes. Moreover, diacritic marks are commonly
used in this language. Therefore, an adaptation of the standardized information retrieval system,
presented in Figure 6.3, is needed, specifically in the preprocessing phase.

[Figure: pipeline in which the corpus and the user query pass through natural language processing (tokenization, stop word removal using a stop word list, stemming), are indexed in the vector model (VM), and are matched through query-index correspondence]
Figure 6.3. A standardized information retrieval system.

6.3.1. Corpus Processing


6.3.1.1. Arabic Corpus Pre-processing
The pre-processing of a corpus helps to format textual data and prepare it for subsequent processing.
Generally, tokenization, stop word removal and stemming are the basic natural
language processes used in an information retrieval system (see Figure 6.3). However, particular
languages, such as Arabic, need other natural language processes to address their specific requirements.
The absence of a standardized Arabic information retrieval system was the first challenge
confronting our study. To deal with this problem, in addition to having a good knowledge of the Arabic
language characteristics and anomalies (introduced in Section 2.5.5), we have investigated many Arabic
studies, research works and preprocessing tools (referenced in Section 2.5.6). Thus, we managed to propose
the system presented in Figure 6.4, and tried to improve its performance further.


[Figure: pipeline in which the corpus and the user query undergo diacritic removal, mutation, tokenization, stop word removal, stemming, and a second stop word removal, before indexing in the vector model (VM) and query-index correspondence]
Figure 6.4. An information retrieval system for Arabic language.


As visualized in Figure 6.4, the removal of diacritics and the mutation of the document and query
content are the preliminary processes of an Arabic information retrieval system (AIRS), after which a
tokenization process takes place. Stop words are removed before and after a stemming phase, and finally
the remaining tokens are indexed to be ready for query evaluation.
All the mentioned pre-processes are described below according to their order of appearance in
Figure 6.4.
Diacritic Elimination
Our corpus contains hardly any diacritized text, apart from some words whose vowels are detected and eliminated.
The process removes all the diacritics except the diacritic shaddah, since shaddah is placed above
a consonant letter as a sign of the duplication of that consonant; thus, it acts like a letter. In modern
Arabic writing, people rely on their knowledge of the language and of the context while writing Arabic
text. The Arabic surface form can be fully diacritized, partially diacritized, or entirely free of diacritics.
The incompleteness of the surface orthography in most standard written Arabic makes written Arabic words
ambiguous. Thus, removing diacritics is of great importance for normalizing the queries and the
collection.
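A minimal sketch of this step in Python, using the Unicode range of the Arabic short-vowel marks; keeping the shaddah (U+0651) while removing the other harakat and the sukun follows the description above, but the exact set of code points handled by the system is an assumption.

    import re

    # Arabic diacritics fathatan..kasra (U+064B-U+0650) and sukun (U+0652);
    # the shaddah (U+0651) is kept, since it marks consonant doubling and acts like a letter
    DIACRITICS = re.compile('[\u064B-\u0650\u0652]')

    def remove_diacritics(text):
        """Strip the short-vowel marks and the sukun, keeping the shaddah."""
        return DIACRITICS.sub('', text)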
Mutation
The mutation consists in normalizing letters that appear in several distinct forms into a single form.
The process changes the letters hamza-above-alif, hamza-under-alif and alif-madah to plain alif. The
reason behind this conversion is that most people do not formally write the appropriate form of alif at
the beginning of a word; thus, the letter alif is a source of ambiguity. For example, the verb meaning
take in English and the plural noun meaning letters in English may be written either with hamza-above-alif
or with plain alif; the normalization process keeps the word sense intact. The same holds for
words containing hamza-under-alif, such as the word meaning human in English. Similarly, the letter
ta-marbotah, which occurs at the end of Arabic words and mostly indicates a feminine noun, is in many
cases written as ha, which makes the word ambiguous. To resolve this ambiguity, we normalize the two
letters to a single form at the end of words. We further replace the sequence of an alif-maksoura
positioned before the last letter of a word followed by a hamza positioned at the end of the word with
alif-maksoura-mahmozah, and we apply a similar replacement to the corresponding sequence containing ya.
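The letter normalization can be sketched as a small mapping over Unicode code points, as below. The alif normalization follows the description above; the ta-marbotah/ha pair is mapped here to a single form in one direction chosen only as an assumption, since the direction of that rule cannot be read unambiguously from the original text.

    # hypothetical normalization table (Unicode code points):
    # alif-madah (U+0622), hamza-above-alif (U+0623), hamza-under-alif (U+0625) -> plain alif (U+0627);
    # ta-marbotah (U+0629) -> ha (U+0647), so that both spellings of a word fall together
    MUTATION = str.maketrans({'\u0622': '\u0627',
                              '\u0623': '\u0627',
                              '\u0625': '\u0627',
                              '\u0629': '\u0647'})

    def mutate(text):
        """Normalize the alif variants and the ta-marbotah/ha alternation to a single form."""
        return text.translate(MUTATION)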
Tokenization
In natural language processing, the basic tokenization process consists in recognizing words by relying
on white-space characters and punctuation marks as explicit delimiters. But due to its complicated
morphology and agglutinative characteristic, the Arabic language needs an enhanced, purpose-designed
tokenizer to detect and separate words. However, in this system, only a basic tokenizer is used.
Stop Word Removal
By comparing the flowcharts of Figure 6.3 and Figure 6.4, it clearly appears that the Arabic stemming
process is preceded and followed by a stop-word removal step; this is done for two reasons. The first
reason is that some prefixes are part of some stop-words. For example, in the demonstrative pronouns
meaning those and that (for a masculine or a feminine person or thing), the definite article meaning
the, considered as a prefix in Arabic, is an integral part of these pronouns and should not be eliminated.
The second reason is that when the stemming process is applied without being followed by a stop-word
removal step, the majority of stop-words might not be eliminated. For example, the token meaning
before him becomes, after stemming, the token meaning before, which is a stop-word.
Stemming
As reviewed in Section 2.5.6, up to the time when this study was undertaken, the light stemming of
Darwish modified by Larkey [LBC02] was the best-performing language process for information
retrieval in Arabic [AlF02, GPD04, TEC05]. For this reason, we have chosen to use this
approach in the stemming phase of the proposed AIR system.
The chosen approach is used not to produce the linguistic root of a given Arabic surface form, but to
remove the most frequent suffixes and prefixes, including duals and plurals for masculine and feminine,
possessive forms, definite articles, and pronouns. A detailed list of these prefixes and suffixes is
presented in Section A.2.4.

6.3.1.2. Processing Stage


In this stage, a retrieval process is performed. To the best of our knowledge, the vector-space model (VSM),
known as Salton's model [SaM83], was the only model used for retrieval in Arabic before our
work [BoA05], where the latent semantic analysis (LSA) model is used. The first model is highlighted in
Section 2.2.2, while the second is detailed in Chapter 3.
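For reference, LSA retrieval can be sketched with a truncated singular value decomposition of the term-document matrix, as in the standard textbook formulation below; this is only an illustration and not necessarily the exact implementation used in the thesis.

    import numpy as np

    def lsa_rank(A, q, k):
        """Rank the documents (columns of A) against the query vector q in a k-dimensional LSA space."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        Uk = U[:, :k]
        docs = Vt[:k, :].T * s[:k]          # document coordinates: rows of V_k * Sigma_k
        q_hat = q @ Uk                      # query projected into the same latent space
        sims = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat) + 1e-12)
        return np.argsort(-sims)            # document indices sorted by decreasing cosine similarity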

6.3.2. Evaluations
With the objective of improving the performance of the proposed system, we evaluate the effectiveness
of the latter, together with some other suggestions described and discussed below, on both created corpora: the
analysis corpus [AR-RS] and the reference corpus [AR-ENV].

6.3.2.1. Weighting Schemes Impact


In the case of the extended vector space model, the latent semantic analysis, we have explored the effect
of five weighting schemes in two study cases: (a) short queries and (b) long queries. Based on the four
best-performing weighting schemes for the LSA model, found in the work of Ataa Allah et al. [ABE05], in
addition to the Okapi BM-25, we compare in Figure 6.5 the performance of the four weighting schemes
log(tf+1)xIdf, TfxIdf, Ltc, and Tfc on the reference corpus [AR-ENV]. We remark that the
log(tf+1)xIdf scheme improves the model, while the other three show a mixture of improvements
and degradations across the recall rates. However, the Okapi BM-25 weighting scheme
improves the LSA method further, reaching a gain of 6.17% and 6.15%, respectively for short and long
queries, over log(tf+1)xIdf, and of 16.75% and 33.33% when the data are not weighted. This gain is due
to the normalization factor that characterizes the Okapi BM-25 weighting scheme compared to the other
schemes (see Section 3.2.2).
Likewise, we have seen a similar behavior of the five weighting schemes on the analysis corpus. The
Okapi BM-25 weighting scheme has increased the model performance by 4.48% and 3.08%, respectively
for short and long queries, over log(tf+1)xIdf, and by 11.01% and 49.46% when no weighting
scheme is used.
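For reference, the Okapi BM-25 weight of a term in a document is commonly written as in the sketch below; the default values of the free parameters k1 and b are assumptions made for the example, the exact parameterization used in the experiments being the one given in Section 3.2.2.

    import math

    def bm25_weight(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
        """Okapi BM-25 weight of a term occurring tf times in a document of length dl.

        df: number of documents containing the term, N: number of documents in the collection,
        avgdl: average document length; k1 and b are the usual free parameters."""
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)       # smoothed inverse document frequency
        norm = k1 * (1.0 - b + b * dl / avgdl)                  # document length normalization factor
        return idf * tf * (k1 + 1.0) / (tf + norm)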


Figure 6.5. Comparison between the performances of the LSA model for five weighting schemes: (a) short queries, (b) long queries.

6.3.2.2. Basic Language Processing Usefulness


In order to conflate related tokens, we have stemmed the tokens of our corpora. The primary benefit of this
preprocessing is the reduction of the database index size. While the tokens x documents matrix of the
reference corpus [AR-ENV] has 54 705 tokens, after performing the stemming process this number is
reduced to 22 553, and to 22 491 tokens after the elimination of stop-words.
We have carried out some experiments to evaluate the usefulness of this preprocessing for Arabic
information retrieval in two study cases. In the first case, no weighting was applied, while in the second
the Okapi BM-25 weighting is used.
The results of Figure 6.6-a (reference corpus, short queries, no weighting) show that the
improvement made by the use of the stop-word list is not very significant; however, the improvement given by the
combination of the stemming process and the elimination of stop-words is more interesting, reaching
a gain of 3.45%.
Also on the analysis corpus, the results showed that the use of a stop-word list is not necessary,
while a significant improvement, equal to 5.01%, was observed for the combination of the stemming
process and the elimination of stop-words.
For long queries, the use of the stop-word list brings a gain of 3.27% compared to the case where no
preprocessing is used, and of 21.06% when the stemming is applied.
Similarly, in the case of the reference corpus (long queries, Figure 6.6-b), we find that the elimination of
stop-words presents a benefit of 3.22% compared to the case where all preprocessing approaches are
ignored, and a benefit of 12.47% compared with the stemming approach. These results reflect the
importance of implementing a second stage of the stop-word removal process in an Arabic retrieval system
(the process placed after the stemming in Figure 6.4). We also note that the combination of stemming
and the elimination of stop-words in this case is more interesting and gives a gain of 17.47%.
Effectively, removing stop words has a significantly positive effect for stemmed Arabic, but not for
unstemmed Arabic. This difference is due to the fact that stem classes for stop words contain larger
numbers of unrelated word variants than stem classes for other words.
By comparing the results of short and long queries, we note that the effect of the stop-word removal
process depends mainly on the query model.

Figure 6.6. Language processing benefit: (a, b) no weighting case; (c, d) Okapi BM-25 weighting case.


In Figure 6.6-c (reference corpus, short queries, Okapi BM-25 weighting), the results show that the
performance provided by the stemming reaches a gain of 4.80%, while for long queries (Figure 6.6-d) it
gives a gain of 3.50%. Similarly, the same observations were made on the analysis corpus, as the
stemming brings a benefit of 6.31% in the case of short queries and of 4.65% in the case of long queries.
However, the elimination of stop-words is not significant for the two corpora, for both query models,
long and short.
We note that, to avoid the cost of testing each corpus token recursively in order to eliminate those of
the stop-word list, we can use weighting schemes such as Okapi BM-25, log(tf+1)xIdf, TfxIdf, Ltc and
Tfc, which minimize the effect of these tokens in particular and the effect of high frequencies in general.
This is confirmed by the results of one of our previous works [BoA05].
Based on the latter conclusion, we propose a new system for Arabic information retrieval (see Figure
6.7), in which the stop-word removal phase is excluded. Thus, we choose to use a light-stemmed corpus
weighted by the Okapi BM-25 scheme for the remaining experiments, unless otherwise mentioned.

[Figure: pipeline in which the corpus and the user query undergo diacritic removal, mutation, tokenization, and stemming, before indexing in the vector model (VM) and query-index correspondence]
Figure 6.7. A new information retrieval system suggested for Arabic language.

6.3.2.3. The LSA Model Benefit


In this section, we want to compare the performances of the standard vector space model (VSM) and of
the latent semantic analysis model (LSA) on our Arabic corpora [AR-RS] and [AR-ENV].


Figure 6.8. A comparison between the performances of the VSM and the LSA models: (a) short queries, (b) long queries.
For the reference corpus, the curves of Figure 6.8-a, representing the results in the case of short queries,
show a statistically significant difference of 15.90% for the LSA model over the standard VSM, while the
curves in Figure 6.8-b, representing the results in the case of long queries, show a gain of 16.30%.
Similarly, the experiments performed on the analysis corpus showed an improvement of 12.77% for the
LSA model in the case of short queries, and of 13.73% in the case of long queries.

6.3.2.4. The Impact of Query Weighting


We want to point out that we have used weighted queries in the former experiments. However, in this
part we are interested in studying the contribution of the weighting for two query models: short and
long.
We have remarked that query weighting gives, in the case of the analysis corpus [AR-RS], an
improvement of 3.52% for short queries over the long ones, and a benefit of 2.70%, presented in Figure
6.9-a, in the case of the validation corpus [AR-ENV]. The fact that these results do not comply with
what is found in the literature [Sav02] has pushed us to seek the cause of this difference.
Figure 6.9. Query weighting impact: (a) weighted queries, (b) un-weighted queries.


To this end, we have decided to evaluate the performance of the information retrieval system with
un-weighted queries. This experiment shows that, compared to short queries, long queries increase the
performance by 5.03% for the analysis corpus and by 4.30% for the reference corpus, while short
queries decrease the performance by 3.72% and 3.60%, respectively, for the analysis and reference corpora.
Given that short queries, containing just the keywords of the information needed, better reflect
reality, especially that of the Internet, while long queries containing description or narration parts
move the user away from the reality of the Web, we suggest using an information retrieval system
in which queries are weighted.

6.3.2.5. Noun Phrase Indexation


Still with the aim of improving the performance of the Arabic information retrieval system, we have
chosen to study the effect of indexation by noun phrases (NP) [ABE06], which appear to be more
suited to indicating semantic entities than single terms [Ama02]. To this end, we need to adapt our system
to the new approach.

a- Arabic Information Retrieval System based on NP Extraction


In this approach, the corpus preprocessing differs from that used in the suggested Arabic
information retrieval system presented in Figure 6.7, first in the use of the Buckwalter transliteration as a second
step of the system, and second in the use of part-of-speech (POS) tagging and NP extraction processes
performed before stemming. These new processes are defined in Section A.3 and commented on below.


[Figure: pipeline in which the corpus and the user query undergo diacritic removal, Buckwalter transliteration, tokenization, POS tagging, NP extraction, and stemming, before indexing in the LSA model and query-index correspondence]
Figure 6.10. Arabic Information Retrieval System based on NP Extraction.

Buckwalter Transliteration
Taking into account that, up to the time when this work was done, no Arabic POS tagging tool operated
directly on Arabic script, we have applied the Buckwalter transliteration, which consists in
converting Arabic characters into Latin ones.
Part Of Speech Tagging
The part-of-speech (POS) tagging task consists in analyzing texts in order to assign an appropriate syntactical
category to each word (noun, verb, adjective, preposition, etc.).
For the Arabic language, many part-of-speech taggers have been developed, which we can classify into
different categories. The techniques of the first class are based on tagsets derived from an Indo-European
based tagset, whereas the tagsets used in the second category have been derived from
traditional Arabic grammatical theory. The taggers in the third class are considered hybrid, based on
statistical and rule-based techniques, while in the fourth category machine learning is used.
The works on Arabic tagging that we are aware of are Diab's POS tagger [DHJ04], which combines
techniques of the first and the fourth classes; the Arabic Brill's POS tagger [Fre01], using
techniques of the first and the third categories; and APT [Kho01], based on techniques of the second and the third
classes.
In this study, simply to conform to the Base Phrase (BP) chunker [DHJ04] (see Section A.3.3 for the
chunker definition), we have chosen to use Diab's tagger. In this tagger, a large
set of Arabic tags has been mapped (by the Linguistic Data Consortium) to a small subset of the English
tagset that was introduced with the English Penn Treebank.


Noun Phrase Extraction
In this step, we are interested in noun phrases (NP) at the syntagmatic level of the linguistic analysis.
For that, we adapted the SVM-BP chunker, based on a supervised machine learning perspective using
Support Vector Machines (SVMs) trained on the Arabic TreeBank, and consisting in creating non-recursive
base phrases such as noun phrases, adjectival phrases, verb phrases, preposition phrases, etc.

b- Noun Phrase Indexation Effect


Assuming that complex term indexation could constitute a better representation of the text content
than single terms, we have adopted a noun phrase (NP) indexing method in which the text is processed by
keeping the information related to the document syntagmatic relations.
To this end, we have tested the performance of the AIRS based on NP extraction by comparing the
behavior of the system under three indexation strategies. Strategy 1 designates indexation based on single
terms, performed by the AIRS based on NP extraction after dropping the part-of-speech tagging and noun
phrase extraction steps; this makes the system equivalent to the suggested AIRS presented in Figure 6.7,
with the only difference being the transliteration process. Strategy 2 is based on noun phrase indexation,
where a new vector is created corresponding to the noun phrases extracted from each document.
Strategy 3 is based on indexation by single terms supplemented with noun phrases.

Figure 6.11. Influence of the NP and the single term indexations on the IRS performance.
By comparing the curves of Figure 6.11, we remark that the use of noun phrases in the indexation
process lowers the performance with respect to the system based on single terms. However, in the third
strategy, where we have attempted to remedy the situation by combining single terms and noun phrases,
we remark that this approach outperforms the second strategy but not the first one. We conclude that the
system based on single terms always yields the best performance at the lower recall rates, which are the
most important for a user, since a user is more interested in the relevant documents at the top of the
returned list.

c- Discussion
In this section, we discuss the results presented in the previous subsection, while also attempting to
reason about some of their properties. We do so by first giving a brief overview of the studies
undertaken in the field of NP indexation.
Previous studies for the English, French, and Chinese languages showed that the use of noun phrases in
representing the document content could improve the effectiveness of an automatic information retrieval
system. Mitra et al. [MBS97] showed that reindexing with noun phrases the first 100 documents
retrieved by the SMART system gives a benefit at low recall. However, the TREC campaigns showed that
noun phrase indexing approaches do not necessarily enhance the retrieval performance, and that any
improvement can depend on the size of the collection and on the query topic [Fag87, EGH91, ZTM96]. The
results obtained with the PRISE system [SLP97], based on noun phrase extraction by the Tagged Text Parser,
are a good example of the difficulty of evaluating the effect of syntagmatic analysis on an IRS, since the
performances obtained were not significant.
In line with this, our experimental results show that, for Arabic, NP-based indexing
decreases the retrieval performance compared to single-term-based indexing. We could explain this
drop by the noun phrase size and the lack of normalization; for example, the phrases meaning air
pollution disaster and air pollution should be normalized under air pollution. It could
also be explained by the use of a morpho-syntactic parser and chunker based on supervised learning,
depending on an annotated corpus and not on specific syntactic pattern rules.
We think that the use of a morpho-syntactic parser and chunker based on syntactic pattern rules, the
use of a part-of-speech tagger based on statistical and rule-based techniques, or a tagset derived from
Arabic grammatical theory could resolve the specified problems and be more effective. To this aim, a
deep study was undertaken by Boulaknadel [Bou08].

6. 4. Summary
In this chapter, we have presented an evaluation of the vector space model and of the LSA method,
while performing linguistic processing and using weighting schemes, on Arabic analysis and
reference corpora that we have created for this aim.
The undertaken experiments showed that light stemming increases the performance of the Arabic
information retrieval system, especially when the Okapi BM-25 scheme is used, thus confirming the
fact that linguistic preprocessing is an essential step in the information retrieval process. The study
also showed that the elimination of stop-words in retrieval, for the Arabic language, known for its
agglutinative characteristic, could be avoided by applying some weighting schemes that address the issue
of high-frequency words in a corpus. However, noun phrase based indexing, even when supplemented by
single term based indexing, decreases this performance.
On the other hand, by comparing the performance of the vector space model to that of LSA, we
remark an important improvement in favor of the latter. By evaluating the influence of weighting
schemes on the query models, we ascertain the usefulness of short queries, which represent the reality of the
Web.
Similarly to the results of Chapter 3, the experiments of this chapter also showed that the Okapi BM-25
is the best weighting scheme among those evaluated in this work.
To conclude the study carried out in Arabic retrieval, we can state that the suggested system,
based on light stemming, the Okapi BM-25 scheme, short weighted queries, and the LSA model, could be used
as a standardized system.


Chapter 7 Conclusion and Future Work


7. 1. Conclusion
This dissertation advances a review of information retrieval models and a taxonomy of clustering
algorithms, while explaining the utility of clustering in the information retrieval context. Besides, it has
studied the state of the art in dimensionality reduction techniques, especially feature extraction
methods, and in Arabic information retrieval techniques, after recalling the Arabic language characteristics
and the available Arabic corpora.
The key contribution of this work lies in providing an Arabic information retrieval system based on
light stemming, the Okapi BM-25 weighting scheme, and the latent semantic analysis model, by building
analysis and reference Arabic corpora, improving prior models addressing Arabic document retrieval problems,
and comparing specific weighting schemes. In addition, other approaches have been proposed in
document clustering and dimensionality reduction.
In clustering, we have proposed to use the diffusion map space based on the cosine kernel, where the
results of the k-means clustering algorithm have shown that the indexation of documents in this space is
more effective than in the diffusion map space based on the Gaussian kernel, the Salton space and the
LSA space, especially in the multi-cluster case. Moreover, the use of the k-means algorithm in this space
has met the requirements of soundness and efficiency. We have also provided, when the singular value
decomposition method is used in the construction of the diffusion space, a technique for resolving the
problem of specifying the number of clusters, and another for the choice of the cosine diffusion space
dimension. Furthermore, we have improved the single-pass algorithm by using the diffusion approach
based on the updating singular value decomposition technique, which is potentially useful in widespread
on-line applications that require real-time updating, such as peer-to-peer information retrieval.
In dimensionality reduction, we have supplemented the singular value decomposition, used in
feature transformation, with a term selection method based on the generic term extracting algorithm.
This dissertation thoroughly addressed the impact of term weighting in retrieval, based on the latent
semantic model, for varying combinations of the local and global weighting functions in addition to the
normalization function. The effectiveness of 25 different term weighting schemes was explored, and the
best one, the Okapi BM-25, was identified.

7. 2. Limitations
Experimental research inherently has limitations. The work presented in this dissertation is limited in
the following main ways:


- The non-availability of the free Arabic corpora needed for information retrieval and clustering
evaluation, and the considerable human effort required for reconstructing static test collections,
such as those used in TREC, are the reasons that confined us to using an Arabic reference corpus of 1 060
documents, with a total size of 5.34 megabytes, for retrieval, and English corpora for clustering.
- Hardware limitation: most of the time, we only had access to a machine with a 1.80 gigahertz dual-core
processor and 1 gigabyte of random access memory.

7. 3. Prospects
There are several directions in which this research can proceed. These directions can be categorized
into the following broad areas:
- Automating the generic term extraction algorithm.
- Adapting the generic term extraction algorithm to other ranges of data.
- Applying the results of the diffusion map approach to multimedia data.
- Extending our Arabic reference corpus, and trying to classify its content into non-overlapping groups;
this way, it could serve for both retrieval and clustering evaluation.
- Improving our system performance by using the results of the noun phrase study undertaken by
Boulaknadel [Bou08], and semantic query expansion.
- Implementing a full Arabic search engine based on the studies undertaken in this
dissertation and those planned as further work.


Appendix A Natural Language Processing


A.1. Introduction
Document Retrieval is essentially a matter of deciding which documents in a collection should be
retrieved to satisfy a user's need for information. The user's information need is represented by a query
or profile, and contains one or more search terms, in addition perhaps to some additional information
such as importance weights. Hence, the retrieval decision is based on the comparison between the query
terms and the index terms (important words or phrases) appearing in the document itself. The decision
may be binary (retrieve/reject), or it may involve estimating the degree of relevance that the document
has to the query.
Unfortunately, the words that appear in documents and in queries often have many morphological
variants. Thus, pairs of terms such as computing and computation will not be recognized as
equivalent without some form of natural language processing (NLP).
In this appendix, we introduce some of these processes, more precisely those used and mentioned in
the thesis. NLP techniques can be classified into two categories: basic techniques and advanced ones.

A.2. Basic Techniques


Because N-grams, tokenization, transliteration, stemming, and the removal of stop words are less
sophisticated than the other natural language processes, they are considered basic techniques.

A.2.1. N-grams
An n-gram is a sub-sequence of n items from a given sequence. It is a popular technique in statistical
natural language processing. For parsing, words are modeled such that each n-gram is composed of n
words. For a sequence of words (for example, "the dog smelled like a skunk"), the 3-grams would be
"the dog smelled", "dog smelled like", "smelled like a", and "like a skunk". For sequences of characters,
the 3-grams that can be generated from "good morning" are "goo", "ood", "od ", "d m", " mo", "mor",
and so forth. Some practitioners preprocess strings to remove spaces, others do not. In almost all cases,
punctuation is removed by preprocessing. N-grams can also be used for sequences of words or, in fact,
for almost any type of data.
By converting an original sequence of items to n-grams, it can be embedded in a vector space, thus
allowing the sequence to be compared to other sequences in an efficient manner.
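A short Python illustration reproducing the examples above; the whitespace tokenization of the word sequence is the naive choice already mentioned.

    def char_ngrams(s, n=3):
        """Character n-grams of a string, e.g. 'good morning' -> 'goo', 'ood', 'od ', ..."""
        return [s[i:i + n] for i in range(len(s) - n + 1)]

    def word_ngrams(words, n=3):
        """Word n-grams of a token list, e.g. the 3-grams of 'the dog smelled like a skunk'."""
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    print(char_ngrams("good morning")[:4])                    # ['goo', 'ood', 'od ', 'd m']
    print(word_ngrams("the dog smelled like a skunk".split()))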

A.2.2. Tokenization
Tokenization, or word segmentation, is a fundamental task of almost all NLP systems. In languages
that use word separators in their writing, tokenization seems easy: every sequence of characters between
two white spaces or punctuation marks is a word. This works reasonably well, but exceptions are
handled in a cumbersome way. On the other hand, there are languages without such simple word
boundaries, as in the case of Arabic, where clitics are agglutinated to words; these need much more
complicated processing, closer to morphological analysis or part-of-speech tagging. Tokenizers designed
for those languages are generally very tied to a given system and language.
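A minimal whitespace-and-punctuation tokenizer of the kind described above can be written in a couple of lines of Python; this is only the naive baseline, not the enhanced tokenizer that an agglutinative language such as Arabic requires.

    import re

    def basic_tokenize(text):
        """Split on white space and punctuation, returning the remaining word tokens."""
        return re.findall(r'\w+', text)

    print(basic_tokenize("Tokenization, or word segmentation, is a fundamental task."))
    # ['Tokenization', 'or', 'word', 'segmentation', 'is', 'a', 'fundamental', 'task']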

A.2.3. Transliteration
Transliteration is the practice of transcribing a word or text written in one writing system into
another writing system. It is also the system of rules for that practice.
Technically, from a linguistic point of view, it is a mapping from one system of writing into another.
Transliteration attempts to be exact, so that an informed reader should be able to reconstruct the original
spelling of unknown transliterated words. To achieve this objective transliteration may define complex
conventions for dealing with letters in a source script which do not correspond with letters in a goal
script.
This is opposed to transcription, which maps the sounds of one language to the script of another
language. Still, most transliterations map the letters of the source script to letters pronounced similarly in
the goal script, for some specific pair of source and goal language.
It is not to be confused with translation, which involves a change in language while preserving
meaning. Here we have a mapping from one alphabet into another.
Specifically for the Arabic language, many transliteration systems are utilized, such as: Deutsche
Morgenländische Gesellschaft, adopted by the International Convention of Orientalist Scholars in
Rome28; ISO/R 233, replaced by ISO 233 in 1984; BS 4280, developed by the British Standards
Institute29; and SATTS, a one-to-one mapping to Latin Morse equivalents, used by the US military. However,
in our work we have used the Buckwalter transliteration30,31.
The Buckwalter Arabic transliteration was developed at Xerox by Tim Buckwalter in the 1990s. It is
an ASCII-only transliteration scheme, representing Arabic orthography strictly one-to-one, unlike the
more common romanization schemes that add morphological information not expressed in Arabic script.
Thus, for example, the letter waaw will be transliterated as w regardless of whether it is realized as a
vowel [u:] or a consonant [w]; only when the waaw carries a hamza does the transliteration change to &.
28 http://www.dmg-web.de, Retrieved on 10-7-2007.
29 http://www.bsi-global.com/index.xalter, Retrieved on 10-7-2007.
30 http://www.qamus.org/transliteration.htm, Retrieved on 10-7-2007.
31 http://www.xrce.xerox.com/competencies/content-analysis/arabic/info/buckwalter-about.html, Retrieved on 10-7-2007.



The unmodified letters are straightforward to read (except perhaps * for thaal, E for ayn, and v for thaa),
but the transliterations of letters with diacritics and the harakat take some time to get used to: for example,
the nunated i'rab [un], [an], [in] appear as N, F, K, the sukun (no vowel) as o, and ta marbouta as p.

[Table body not reproducible here: the Arabic letters and diacritics with their Buckwalter ASCII equivalents, including the symbols *, $, E, H, S, D, Z, g, v, p, Y, ~, N, F, K, >, <, &, |, } alongside the plain Latin letters]
Table A.1. Buckwalter Transliteration.
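As an illustration, a few entries of this mapping can be expressed as a Python dictionary keyed by Unicode code points; only a small, hand-picked subset is shown here, and it is not a substitute for the full published table.

    # a small illustrative subset of the Buckwalter mapping (Arabic code point -> ASCII symbol)
    BUCKWALTER = {
        '\u0627': 'A',   # alif
        '\u0628': 'b',   # ba
        '\u062A': 't',   # ta
        '\u062B': 'v',   # thaa
        '\u0630': '*',   # thaal
        '\u0639': 'E',   # ayn
        '\u0648': 'w',   # waaw
        '\u0624': '&',   # waaw with hamza
        '\u0629': 'p',   # ta marbouta
    }

    def transliterate(text):
        """Map Arabic characters to their Buckwalter symbols, leaving other characters unchanged."""
        return ''.join(BUCKWALTER.get(ch, ch) for ch in text)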

A.2.4. Stemming
A stemmer is a computer program or algorithm which determines a stem form of a given inflected
(or, sometimes, derived) word form, generally a written word form. The stem need not be identical to
the morphological root of the word; it is usually sufficient that related words map to the same stem, even
if this stem is not in itself a valid root.
A stemmer for English, for example, should identify the string cats (and possibly catlike, catty
etc.) as based on the root cat, and stemmer, stemming, stemmed as based on stem. English
stemmers are fairly trivial (with only occasional problems, such as dries being the third-person
singular present form of the verb dry, axes being the plural of axe as well as axis); but
stemmers become harder to design as the morphology, orthography, and character encoding of the target
language becomes more complex. For example, an Italian stemmer is more complex than an English one
(because of more possible verb inflections), a Russian one is more complex (more possible noun
declensions), an Arabic one is even more complex (due to nonconcatenative morphology and a writing
system without vowels), and so on.



[Table body not fully reproducible here: the one-, two-, and three-character prefixes and the one- and two-character suffixes removed by the light stemmer, given in Arabic script with their Buckwalter transliterations; the legible entries include, among the prefixes, Al, wAl, fAl, bAl and ll, and, among the suffixes, p, At, An, hA, wn and yn]
Table A.2. Prefixes and suffixes list.

The Arabic light stemmer32, Darwish's stemmer modified by Larkey [LBC02], used in this work,
identifies 3 three-character, 23 two-character and 5 one-character prefixes, and 18 two-character and 4
one-character suffixes that should be removed in stemming. The prefixes and suffixes to be removed are
shown in Table A.2.
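A hedged sketch of such a light stemmer in Python, using only a small illustrative subset of the affixes of Table A.2 (written in Buckwalter transliteration); the order of removal and the minimum stem length kept are assumptions made for the example.

    # illustrative subsets of the affixes removed by the light stemmer (Buckwalter transliteration)
    PREFIXES = ['wAl', 'fAl', 'bAl', 'Al', 'll', 'w']                 # tried longest first
    SUFFIXES = ['hA', 'An', 'At', 'wn', 'yn', 'yp', 'p', 'h', 'y']

    def light_stem(word, min_len=2):
        """Strip one matching prefix, then matching suffixes, keeping at least min_len characters."""
        for p in PREFIXES:
            if word.startswith(p) and len(word) - len(p) >= min_len:
                word = word[len(p):]
                break
        stripped = True
        while stripped:
            stripped = False
            for s in SUFFIXES:
                if word.endswith(s) and len(word) - len(s) >= min_len:
                    word = word[:-len(s)]
                    stripped = True
                    break
        return word

    print(light_stem('AlktAb'))   # -> 'ktAb'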

A.2.5. Stop Words


Stop words are those words which are so common that they are useless to index or to use in search engines or other search indexes.

32 http://www.glue.umd.edu/~kareem/research/, Retrieved on 4-15-2005.



Usually articles, adverbials or adpositions are stop words. In Arabic, some obvious stop words would be
min (from), ailaa (to), huwa (he), and hiya (she).
It should be noted that there is no definitive list of stop words, as they can depend on the purpose of
the search. Full-phrase searches, for example, would not want words removed. Also, if the search uses a
stemming algorithm, then many words may not be needed in that search's stop list.

A.3. Advanced Techniques


Because anaphoric resolution, chunking, lexical acquisition, lemmatization, noun phrase (NP) extraction,
part-of-speech (POS) tagging, phrase name identification, root extraction, sentence parsing, synonym
expansion, and word sense disambiguation require a text structure analysis, they are considered advanced
NLP techniques. In the following, we recall the definitions of the root, POS tagging, chunking, and NP extraction.

A.3.1. Root
The root is the primary lexical unit of a word, which carries the most significant aspects of semantic
content and cannot be reduced into smaller constituents. Content words in nearly all languages contain,
and may consist only of, root morphemes. However, sometimes the term root is also used to describe the
word minus its inflectional endings, but with its lexical endings in place. For example, chatters has the
inflectional root or lemma chatter, but the lexical root chat. Inflectional roots are often called stems, and
a root in the stricter sense may be thought of as a monomorphemic stem.
Roots can be either free morphemes or bound morphemes. Root morphemes are essential for
affixation and compounds.
The root of a word is a unit of meaning (morpheme) and, as such, it is an abstraction, though it can
usually be represented in writing as a word would be. For example, it can be said that the root of the
English verb form running is run, or the root of the French verb accordera is accorder, since those words
are clearly derived from the root forms by simple suffixes that do not alter the roots in any way. In
particular, English has very little inflection, and hence a tendency to have words that are identical to
their roots. But more complicated inflection, as well as other processes, can obscure the root; for
example, the root of mice is mouse (still a valid word), and the root of interrupt is, arguably, rupt, which
is not a word in English and only appears in derivational forms (such as disrupt, corrupt, rupture, etc.).
The root rupt is written as if it were a word, but it's not.
This distinction between the word as a unit of speech and the root as a unit of meaning is even more
important in the case of languages where roots have many different forms when used in actual words, as
is the case in Semitic languages. In these, roots are formed by consonants alone, and different words
(belonging to different parts of speech) are derived from the same root by inserting vowels. For
example, in Arabic, the root كتب (ktb) represents the idea of writing, and from it we have كَتَبَ (kataba)
"he wrote" and كُتِبَ (kutiba) "has been written", along with other words such as كُتُبٌ (kutubN) "books".

A.3.2. POS Tagging


Part-of-speech (POS) tagging is the annotation of words with the appropriate POS tags based on the
context in which they appear. POS tags divide words into categories based on the role they play in the
sentence in which they appear. POS tags provide information about the semantic content of a word
("Did he cross the desert?" vs. "Did he desert the army?"). Nouns usually denote tangible and
intangible things, whereas prepositions express relationships between things. Most POS tag sets
make use of the same basic categories. The most common set of tags contains seven different tags
(Article, Noun, Verb, Adjective, Preposition, Number, and Proper Noun). Currently the most widely
used tag sets are those for the Penn Tree Bank33 (45 tags) and for the British National Corpus34 (BNC
Enriched Tagset also known as the C7 Tagset).
Most tagging algorithms fall into one of two classes: Rule-based taggers and Stochastic taggers.
Rule-based taggers generally involve a large database of hand-written disambiguation rules which
specify, for example, that an ambiguous word is a noun rather than a verb if it follows a determiner;
while, stochastic taggers generally resolve tagging ambiguities by using a training corpus, to compute
the probability of a given word having a given tag in a given context. However, the transformation-based
tagger, also known as the Brill tagger, shares features of both tagging architectures. Like the rule-based tagger,

it is based on rules which determine when an ambiguous word should have a given tag. Like the
stochastic taggers, it has a machine-learning component: the rules are automatically induced from a
previously tagged training corpus.
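
As an illustration of the stochastic approach, the following Python sketch builds a most-frequent-tag
(unigram) tagger from a previously tagged corpus; the tiny training sample and the tag names are invented
for the example.

from collections import Counter, defaultdict

# Toy tagged corpus (invented): list of (word, tag) pairs.
training = [("the", "ART"), ("man", "N"), ("can", "V"), ("can", "N"),
            ("cross", "V"), ("the", "ART"), ("desert", "N"), ("desert", "V")]

# Count how often each word receives each tag in the training corpus.
counts = defaultdict(Counter)
for word, tag in training:
    counts[word][tag] += 1

def tag(sentence, default="N"):
    """Assign each word its most frequent training tag; unseen words get the default tag."""
    return [(w, counts[w].most_common(1)[0][0] if w in counts else default)
            for w in sentence]

print(tag(["the", "man", "can", "cross", "the", "desert"]))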

A.3.3. Chunking
Text chunking (light parsing) is an analysis of a sentence which subsumes a range of tasks. The
simplest is finding noun groups or base NPs. More ambitious systems may add additional chunk
types, such as verb groups, or may seek a complete partitioning of the sentence into chunks of different
types. But they do not specify the internal structure of these chunks, nor their role in the main sentence.
The following example identifies the constituent groups of the sentence "He reckons the current
account deficit will narrow to only $1.8 billion in September": [NP He] [VP reckons] [NP the current
account deficit] [VP will narrow] [PP to] [NP only $1.8 billion] [PP in] [NP September].

33 http://www.scs.leeds.ac.uk/amalgam/tagsets/upenn.html, Retrieved on 10-1-2007.
34 http://www.natcorp.ox.ac.uk/docs/c7spec.html, Retrieved on 10-1-2007.
Researchers focused on the chunking task apply grammar-based methods, combining lexical data with
finite-state or other grammar constraints, while others work on inducing statistical models, either directly
from the words or from automatically assigned part-of-speech classes.
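
A grammar-based chunker of the simplest kind can be sketched as a pattern over POS tags; the following
Python fragment groups maximal sequences of noun-phrase tags into base NP chunks (the tag set and the
tagged sentence are illustrative, not the output of a real tagger).

# Group maximal runs of NP-internal tags of a POS-tagged sentence into base NP chunks.
NP_TAGS = {"PRON", "ART", "ADJ", "N"}

def chunk_nps(tagged):
    chunks, current = [], []
    for word, tag in tagged:
        if tag in NP_TAGS:
            current.append(word)            # extend the current NP candidate
        else:
            if current:
                chunks.append(("NP", current))
                current = []
            chunks.append((tag, [word]))    # non-NP material is kept as a singleton chunk
    if current:
        chunks.append(("NP", current))
    return chunks

tagged = [("He", "PRON"), ("reckons", "V"), ("the", "ART"), ("current", "ADJ"),
          ("account", "N"), ("deficit", "N"), ("will", "V"), ("narrow", "V")]
print(chunk_nps(tagged))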

A.3.4. Noun Phrase Extraction


The noun phrase extraction is the continuation of a chain of finite-state tools which includes a
tokenizer, a part-of-speech tagger, and a chunker. This step is used in order to identify phrases whose head
is a noun or a pronoun, optionally accompanied by a set of modifiers. The modifiers may be:
determiners: articles (the, a), demonstratives (this, that), numerals (two, five, etc.), possessives (my,
their, etc.), and quantifiers (some, many, etc.), which in English are usually placed before the noun;
adjectives (the red ball); or complements, in the form of an adpositional phrase (such as: the man
with a black hat) or a relative clause (the books that I bought yesterday).


Appendix B Weighting Schemes Notations


Each weighting scheme can be decomposed into three steps: a local, a global and a normalization
step. For all measures we use the following symbols:

fij : Term frequency, the number of times term i appears in document j.
dfi : Document frequency, the number of documents in which term i occurs.
gfi : Global frequency, the total number of times term i occurs in the whole collection.
pij = fij / gfi.
N : the number of documents in the collection.
M : the number of terms in the collection.

Code   Description                        Expression

       Local component for term i in document j, w_local(i, j):
       None, no change                    fij
       Binary                             1 if fij > 0, 0 otherwise
       Natural log                        log(fij) + 1

       Global component for term i, w_global(i):
       None, no global change             1
       Idf, inverse document frequency    log2(N / dfi)
       Normal                             1 / sqrt(Σ_{j=1..N} fij^2)
       GfIdf                              gfi / dfi
       Entropy                            1 + Σ_{j=1..N} (pij log pij) / log(N)
EG     Global Entropy                     log(1 + Σ_{j=1..N} (pij log pij) / log(N))
ES     Shannon Entropy                    - Σ_{j=1..N} pij log pij
E1     1 - Entropy                        1 - Σ_{j=1..N} (pij log pij) / log(N)

       Normalization component for document j, w_norm(j):
       None, no normalization             1
       Cosine                             1 / sqrt(Σ_{i=1..M} (w_local(i, j) w_global(i))^2)

Table B.1. List of term weighting components.
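
As an illustration of how these components combine, the following Python sketch computes one possible
combination from Table B.1, the natural-log local weight with the entropy global weight and cosine
normalization; the small frequency matrix is invented for the example, and this is a sketch rather than the
exact code used in the experiments.

import numpy as np

# Toy term-by-document frequency matrix F (M terms x N documents), invented for illustration.
F = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.]])
M, N = F.shape

# Local component (natural log): w_local(i, j) = log(f_ij) + 1 when f_ij > 0, and 0 otherwise.
local = np.where(F > 0, np.log(np.where(F > 0, F, 1.0)) + 1.0, 0.0)

# Global component (entropy): 1 + sum_j (p_ij log p_ij) / log(N), with p_ij = f_ij / gf_i.
gf = F.sum(axis=1, keepdims=True)
P = F / gf
plogp = np.where(P > 0, P * np.log(np.where(P > 0, P, 1.0)), 0.0)
entropy = 1.0 + plogp.sum(axis=1) / np.log(N)

# Combine the two components, then apply cosine normalization to each document (column) vector.
W = local * entropy[:, np.newaxis]
W = W / np.linalg.norm(W, axis=0, keepdims=True)
print(np.round(W, 3))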


Appendix C Evaluation Metrics


C.1. Introduction
Evaluation metrics are effective tools for evaluating many criteria, such as the execution efficiency,
the storage efficiency, the performance or the effectiveness of a system.
Execution efficiency measures the time taken by a system to perform a computation. Storage
efficiency is measured by the number of bytes needed to store data. However, effectiveness is the most
common measure used in experiments.
In the following, we introduce the qualitative measures commonly used in retrieval and clustering
evaluation tasks that we have utilized in our work.

C.2. IR Evaluation Metrics


Information retrieval is devoted to finding relevant documents, not finding simple matches to
patterns. Yet, often when information retrieval systems are evaluated, they are found to miss numerous
relevant documents. Moreover, users have become complacent in their expectation of accuracy of
information retrieval systems.
In Figure C.1, we illustrate the critical document categories that correspond to any issued query.
Namely, in the collection there are documents which are retrieved, and there are those documents which
are relevant. In a perfect system, these two sets would be equivalent: we would only retrieve relevant
documents. In reality, systems retrieve many non-relevant documents. To measure effectiveness, two
ratios are used: precision and recall, denoting respectively purity and completeness.

Figure C.1. The computation of Recall and Precision: within the document collection, the set of retrieved
documents and the set of relevant documents overlap in the "retrieved & relevant" region.

C.2.1. Precision
Precision is the ratio of the number of relevant documents retrieved to the total number retrieved.
Precision provides an indication of the quality of the answer set. However, it does not consider the
total number of relevant documents. A system might have good precision by retrieving ten documents
and finding that nine are relevant (a 0.9 precision), but the total number of relevant documents also
matters. If there were only nine relevant documents, the system would be a huge success; however, if
millions of documents were relevant and desired, this would not be a good result set.
    Precision = number of the relevant and retrieved documents / total number of the retrieved documents

C.2.2. Recall
Recall considers the total number of relevant documents. It is the ratio of the number of relevant
documents retrieved to the total number of documents in the collection that are believed to be relevant.
When the total number of relevant documents in the collection is unknown, an approximation of this
number is obtained.

    Recall = number of the relevant and retrieved documents / total number of the relevant documents in the collection
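
A minimal sketch of the two ratios in Python, assuming the retrieved and relevant documents are given as
sets of document identifiers (the sets below are invented for the example):

def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {1, 2, 3, 4, 5}        # document identifiers returned by the system
relevant = {2, 3, 5, 8, 9, 10}     # documents judged relevant for the query
print(precision(retrieved, relevant), recall(retrieved, relevant))   # 0.6 0.5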

Precision and Recall are two complementary measures of retrieval performance. For a particular
query, it is usually possible to sacrifice one so as to boost the other. For example, lowering the retrieval
criteria so that more documents are retrieved will most likely increase the Recall rate; however, in the
meantime, this strategy will also probably admit many more non-relevant documents into the retrieval
result, with the likely consequence of decreasing the Precision rate, and vice versa, as illustrated in
Figure C.2. Therefore it is usually recommended that a balance between these two measures be sought
for users' best needs.
Figure C.2. The Precision-Recall trade-off: favouring precision returns relevant documents but misses
many useful ones too, favouring recall returns most relevant documents but includes lots of junk, and the
ideal lies at high values of both.


For IR effectiveness, precision and recall are used together but in different ways. For example,

Precision at n measures the precision after a fixed number of documents have been retrieved, while
Precision at specific recall levels gives the precision after a given fraction of the relevant documents has
been retrieved. Another, and the most commonly reported, measure is the interpolated Recall-Precision
curve, showing the interaction between precision and recall.

C.2.3. Interpolated Recall-Precision Curve


The Interpolated Recall-Precision curve (IRP curve) is one of the standard TREC performance
evaluation measures35, developed to enable averaging and performance comparison between different
systems [SaM83]. This measure combines precision and recall to produce a single metric of retrieval
effectiveness. It depicts how precision changes over a range of recall values (usually from 0 to 1 in
increments of 0.1).
Mathematically, an N-point IRP curve is drawn by connecting the points generated by the following
formula, in the order of i (0 <= i <= N-1):

    (Ri, Pi) = ( i / (N-1) , max{ Precision(j) : 1 <= j <= m, Recall(j) >= Ri } )        (1)

where m is the total number of retrieved documents, Ri = i / (N-1) is the given recall at the ith rank, and
Pi is the interpolated precision based on the given recall, i.e. the maximum value of the function
Precision(j), with j ranging from 1 to m, ensuring that Recall(j) is no less than the given recall Ri = i / (N-1).
The functions Recall(j) and Precision(j) are defined as follows: given the list of the m retrieved documents
ranked in descending order according to their relevancy scores, and considering n the total number of
relevant documents in the collection and Relevant(j) the function giving the number of relevant documents
among the top j ranked documents,

    Recall(j) = Relevant(j) / n    and    Precision(j) = Relevant(j) / j.

Note that while precision is not defined at a recall of 0, this interpolation rule does define an interpolated
value for the recall level 0.
Derived from the IRP curve, a single numerical value, T, denoting the area covered between this curve
and the horizontal axis (the axis of recall), may be used to crudely estimate the overall retrieval
performance of a particular query (Figure C.3). In other words, this single value T (called Average
Precision, AP) indicates the average interpolated precision over the full range (i.e. between 0 and 1) of
recall for a particular query.

35 http://trec.nist.gov/pubs/trec10/appendices/measures.pdf.

Figure C.3. Interpolated Recall-Precision curve (precision plotted against recall; the area between the
curve and the recall axis gives the average precision T).


The average precisions of multiple query results are combined by taking their mean; the resulting value is
called the Mean Interpolated Average Precision (MIAP).
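
The following Python sketch, assuming binary relevance judgements for ranked result lists, computes the
11-point interpolated precisions of formula (1), averages them into the AP of a query, and averages the APs
of several queries into the MIAP; the toy queries are invented for the example.

def interpolated_precisions(ranked_rel, n_relevant, n_points=11):
    """ranked_rel[j] is True when the (j+1)-th retrieved document is relevant."""
    recalls, precisions, hits = [], [], 0
    for j, rel in enumerate(ranked_rel, start=1):
        hits += rel
        recalls.append(hits / n_relevant)    # Recall(j)    = Relevant(j) / n
        precisions.append(hits / j)          # Precision(j) = Relevant(j) / j
    points = []
    for i in range(n_points):
        r_i = i / (n_points - 1)             # R_i = i / (N - 1)
        candidates = [p for p, r in zip(precisions, recalls) if r >= r_i]
        points.append(max(candidates) if candidates else 0.0)
    return points

def average_precision(ranked_rel, n_relevant):
    points = interpolated_precisions(ranked_rel, n_relevant)
    return sum(points) / len(points)

# Two toy queries: each pair is (relevance flags of the ranked list, total number of relevant documents).
queries = [([True, False, True, False, True], 5),
           ([False, True, True, False, False], 3)]
miap = sum(average_precision(rel, n) for rel, n in queries) / len(queries)
print(round(miap, 3))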

C.3. Clustering Evaluation


As for IR systems, the evaluation of a document clustering algorithm usually measures its effectiveness
rather than its efficiency, by comparing the clusters it produces with a ground truth consisting
of classes assigned to the patterns by manual means, or by some other means in whose veracity there is
confidence. Generally, to evaluate a single cluster, purity and entropy are used, while accuracy and
mutual information are used for the entire clustering [SGM00], [Erk06].

C.3.1. Accuracy
Accuracy (Acc), the degree of veracity, is closely related to precision, which is also called
reproducibility or repeatability because it is the degree to which further measurements or calculations
will show the same or similar results.
The results of calculations or a measurement can be accurate but not precise; precise but not
accurate; neither; or both. A result is called valid if it is both accurate and precise.
Mathematically, the accuracy is defined as follows:
let li be the label assigned to di by the clustering algorithm, and let αi be di's actual label in the corpus.
Then, accuracy is defined as

    Acc = ( Σ_{i=1..n} δ(map(li), αi) ) / n,

where δ(x, y) = 1 if x = y, and δ(x, y) = 0 otherwise, and
map(li) is the function that maps the output label set of the clustering algorithm to the actual label set of
the corpus. Given the confusion matrix of the output, a best such mapping function can be efficiently
found by Munkres's algorithm [Mun57].
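
A sketch of this computation in Python, assuming the clustering output and the true classes are given as
integer label arrays; the optimal mapping is obtained from the confusion matrix with the Hungarian
(Munkres) algorithm as provided by scipy, and the toy labels are invented for the example.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(cluster_labels, true_labels):
    """Acc = (1/n) * sum_i delta(map(l_i), alpha_i), with map found by Munkres's algorithm."""
    cluster_labels = np.asarray(cluster_labels)
    true_labels = np.asarray(true_labels)
    k = max(cluster_labels.max(), true_labels.max()) + 1
    # Confusion matrix: confusion[c, t] = number of documents put in cluster c whose true class is t.
    confusion = np.zeros((k, k), dtype=int)
    for c, t in zip(cluster_labels, true_labels):
        confusion[c, t] += 1
    # The Hungarian algorithm maximizes the number of correctly mapped documents.
    rows, cols = linear_sum_assignment(-confusion)
    return confusion[rows, cols].sum() / len(true_labels)

print(clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 1]))   # 5/6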

C.3.2. Mutual Information


Mutual Information (MI) is a symmetric measure of the degree of dependency between the
clustering and the categorization. If the cluster and the class are independent, neither of them contains
information about the other, and their mutual information is equal to zero. Formally, the MI metric
does not require a mapping function, and it is generally used because it successfully captures how
related the labelings and categorizations are, without a bias towards smaller clusters.
If L = {l1, l2, ..., lk} is the output label set of the clustering algorithm, and A = {α1, α2, ..., αk} is the
categorization set of the corpus, with the underlying assignments of documents to these sets, the MI of
these two sets is defined as:

    MI(L, A) = Σ_{li ∈ L, αj ∈ A} P(li, αj) log2 [ P(li, αj) / ( P(li) P(αj) ) ]

where P(li) and P(αj) are the probabilities that a document is labeled as li and αj by the algorithm
and in the actual corpus, respectively, and P(li, αj) is the probability that these two events occur
together. These values can be derived from the confusion matrix. We map the MI metric to the [0, 1]
interval by normalizing it with the maximum possible MI that can be achieved with the corpus. The
normalized MI is defined as MI_norm(L, A) = MI(L, A) / MI(A, A).
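
A sketch of the normalized MI in Python, computed from the joint distribution of the two labelings
(equivalently, from the confusion matrix); labels are assumed to be integers starting at 0, and the toy labels
are invented for the example.

import numpy as np

def mutual_information(labels_a, labels_b):
    """MI between two labelings, computed from their joint distribution."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    n = len(labels_a)
    joint = np.zeros((labels_a.max() + 1, labels_b.max() + 1))
    for a, b in zip(labels_a, labels_b):
        joint[a, b] += 1
    joint /= n                                # P(l_i, alpha_j)
    pa = joint.sum(axis=1, keepdims=True)     # P(l_i)
    pb = joint.sum(axis=0, keepdims=True)     # P(alpha_j)
    mask = joint > 0
    return (joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])).sum()

def normalized_mi(cluster_labels, true_labels):
    """MI(L, A) normalized by MI(A, A), mapping the metric to the [0, 1] interval."""
    return mutual_information(cluster_labels, true_labels) / mutual_information(true_labels, true_labels)

print(round(normalized_mi([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 1]), 3))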


Appendix D Principal Angles


The concept of principal angles (sometimes denoted as canonical angles) allows one to characterize or
measure, in a natural way, how two subspaces differ, by generalizing the notion of the angle between two
lines to higher-dimensional subspaces of R^d [BjG73, GoV89, Arg03].
Recall that for two non-zero vectors x, y ∈ R^d, the acute angle between them is defined by

    θ(x, y) = arccos( (x, y) / (||x|| ||y||) ),

and by definition 0 <= θ(x, y) <= π/2.

Consider F and G, two subspaces of R^d. Recursively, a set of angles between these two
subspaces can be defined, which are denoted principal or canonical angles. Let two real-valued
matrices F and G be given, each with d rows, and let F and G denote their corresponding column spaces,
which are subspaces of R^d. Assume that

    p = dim(F) >= dim(G) = q >= 1.

Then the principal angles θl ∈ [0, π/2] between F and G may be defined recursively for l = 1, 2, ..., q by

    cos θl = max_{f ∈ F} max_{g ∈ G} f^T g,

subject to the constraints ||f|| = 1, ||g|| = 1, f^T fj = 0, g^T gj = 0, for j = 1, 2, ..., l-1.
The vectors (f1, ..., fq) and (g1, ..., gq) are called the principal vectors of the pair of subspaces.
Intuitively, θ1 is the angle between the two closest unit vectors f1 ∈ F and g1 ∈ G, and θ2 is the angle
between the two closest unit vectors f2 ∈ F and g2 ∈ G such that f2 and g2 are, respectively, orthogonal to
f1 and g1. Continuing in this manner, always searching in subspaces orthogonal to the principal vectors
that have already been found, the complete set of principal angles and principal vectors is obtained.
The average cosine of the principal angles between the subspaces F and G is written as
(1/q) Σ_{l=1..q} cos θl. For algorithms to compute the principal angles, see [BjG73, Arg03].
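
A sketch of the SVD-based computation along the lines of [BjG73] in Python: orthonormal bases of the
two column spaces are obtained by QR factorization, and the singular values of the product of the bases
are the cosines of the principal angles (the two matrices are invented for the example).

import numpy as np

def principal_angles(F, G):
    """Principal angles (in radians, ascending) between the column spaces of F and G."""
    QF, _ = np.linalg.qr(F)                        # orthonormal basis of span(F)
    QG, _ = np.linalg.qr(G)                        # orthonormal basis of span(G)
    cosines = np.linalg.svd(QF.T @ QG, compute_uv=False)
    cosines = np.clip(cosines, -1.0, 1.0)          # guard against rounding slightly above 1
    return np.arccos(cosines)

# Toy example: two 2-dimensional subspaces of R^3.
F = np.array([[1., 0.], [0., 1.], [0., 0.]])
G = np.array([[1., 0.], [0., 1.], [0., 1.]])
angles = principal_angles(F, G)
print(angles, np.cos(angles).mean())               # the angles and their average cosine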


References
[AaE99]

K. Aas, and L. Eikvil, Text Categorisation: A Survey, Technical Report, June 1999, Norwegian
Computing Center.

[Abd04]

A. Abdelali, Localization in Modern Standard Arabic, Journal of the American Society for
Information Science and Technology, Vol. 55, No.1 (2004), pp. 23-28.

[ACS04]

A. Abdelali, J. Cowie, and H. Soliman, Arabic Information Retrieval Perspectives, In Proceedings


of JEP-TALN 2004 Arabic Language Processing, Fez , Morocco, April, 2004.

[Abd87]

A. Abdul-Al-Aal, An-Nahw Ashamil, Maktabat Annahda Al-Masriya, Cairo, Egypt, 1987.

[AAE99]

H. Abu-Salem, M. Al-omari, and M. Evens, Stemming methodologies over individual query words
for an Arabic information retrieval system, Journal of the American Society for Information Science,
Vol. 50, No. 6 (1999), pp. 524-529.

[AMC05] Z. Abu Bakar, M. Mat Deris, and A. Che Alhadi, Performance Analysis of Partitional and
Incremental Clustering, Seminar Nasional Aplikasi Teknologi Informasi 2005, Yogyakarta,
Indonesia, June, 2005.
[AGG98]

R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, Automatic Subspace Clustering of High


Dimensional Data for Data Mining Applications, In Proceedings of the ACM International
Conference on the Management of Data, June, 1998, pp. 94-105.

[Ala90]

M.A. Al-Atram, Effectiveness of Natural Language in Indexing and Retrieving Arabic Documents,
King Abdulaziz City for Science and Technology Project number AR-8-47, Riyadh, Saudi Arabia,
1990.

[AlA89]

S. S. Al-Fedaghi, and F. S. Al-Anzi, A New Algorithm to Generate Arabic Root-Pattern Forms, In


Proceedings of the 11th National Computer Conference, King Fahd University of Petroleum &
Minerals, Dhahran, Saudi Arabia, 1989, pp. 391-400.

[Alg87]

M. Al-Gasimi, Arabization of the MINISIS System, In Proceedings of the First King Saud University
Symposium on Computer Arabization, Riyadh. Saudi Arabia, April, 1987, pp. 13-26.

[AlF02]

M. Al-Jlayl, and O. Frieder, On Arabic search: Improving the retrieval effectiveness via light
stemming approach, In Proceedings of the 11th ACM International Conference on Information and
Knowledge Management, 2002, pp. 340-347.

[Alka91] I.A. Al-Kharashi, MICRO-AIRS: A Microcomputer-based Arabic Information Retrieval System


Comparing Words, Stems, and Roots as Index Terms, Ph.D. Dissertation, Illinois Institute of
Technology, Illinois, USA, 1991.

[AlE94]

I.A. Al-Kharashi, and M. W. Evens, Comparing Words, Stems, and Roots as Index Terms in an
Arabic Information Retrieval System, Journal of the American Society for Information Science, Vol.
45, No. 8 (1994), pp. 548-560.

[Alku91] M. Al-Khuli, A dictionary of theoretical linguistics: English-Arabic with an Arabic-English glossary,


Library of Lebanon, Beirut, Lebanon, 1991.
[Als99]

M. Al-Saeedi, Awdah Almasalik ila Alfiyat Ibn Malek, Dar ihyaa al oloom, Beirut, Lebanon, 1999.

[Als96]

R. Al-Shalabi, Design and Implementation of an Arabic Morphological System to Support Natural


Language Processing, Ph.D. Dissertation, Computer Science, Illinois Institute of Technology, Chicago,
1996.

[AlA04a]

I.A. Al-Sughaiyer, and I.A. Al-Kharashi, Arabic Morphological Analysis Techniques: a


Comprehensive Survey, Journal of the American Society for Information Science and Technology,
Vol.55, No. 3, February, 2004, pp.189-213.

[AlA04b] L. Al-Sulaiti, and E. Atwell, Designing and Developing a Corpus of Contemporary Arabic, In
Proceedings of the 6th Teaching and Language Corpora Conference, Granada, Spain, 2004, pp.92.
[AlA05] L. Al-Sulaiti, and E. Atwell, Extending the Corpus of Contemporary Arabic, In Proceedings of
Corpus Linguistics Conference, Vol. 1, No. 1 (2005), pp. 15-24.
[All07]

M. P. Allen, The t test for the simple regression coefficient, Chapter in Understanding Regression
Analysis, Springer US, 1997, pp. 66-70.

[Ama02]

M. Amar, Les Fondements théoriques de l'indexation : une approche linguistique, ADBS éditions,
Paris, France, 2000.

[AmR02]

G. Amati, and C. J. Van Rijsbergen, Probabilistic Models of Information Retrieval based on


Measuring the Divergence from Randomness, ACM Transactions on Information Systems (TOIS),
Vol. 20, No. 4 (2002), pp. 357-389.

[Arg03]

M.E. Argentati, Principal Angles between Subspaces as Related to Rayleigh Quotient and Rayleigh
Ritz Inequalities with Applications to Eigenvalue Accuracy and an Eigenvalue Solver, Ph.D.
Dissertation, University of Colorado, USA, 2003.

[Ars04]

A. Arshad, Beyond Concordance Lines: Using Concordances to Investigating Language


Development, Internet Journal of e-Language Learning and Teaching, Vol. 1, No. 1 (2004), pp. 4351.

[ABE05]

F. Ataa Allah, S. Boulaknadel, A. El Qadi, and D. Aboutajdine, Amélioration de la performance de
l'analyse sémantique latente pour des corpus de petite taille, Revue des Nouvelles Technologies de
l'Information (RNTI), Vol. 1 (2005), pp. 317.

[ABE06]

F. Ataa Allah, S. Boulaknadel, A. El Qadi, and D. Aboutajdine, Arabic Information Retrieval System
Based on Noun Phrases, Information and Communication Technologies, Vol. 1, No. 24-28, Damask,

Syria, April, 2006, pp. 1720 - 1725.
[ABE08]

F. Ataa Allah, S. Boulaknadel, A. El Qadi, and D. Aboutajdine, Évaluation de l'Analyse Sémantique
Latente et du Modèle Vectoriel Standard Appliqués à la Langue Arabe, Revue Technique et
Science Informatiques, sent on February 2006, accepted on January 2007, and to appear in 2008.

[Att00]

A. M. Attia, A Large-Scale Computational Processor of the Arabic Morphology, and Applications,


A Masters Thesis, Faculty of Engineering, Cairo University, Cairo, Egypt, 2000.

[BaH76]

F. B. Backer, and L. J. Hubert, A Graphtheoretic Approach to Goodness-of-Fit in Complete-Link


Hierarchical Clustering, Journal of the American Statistical Association, Vol. 71, 1976, pp. 870-878.

[BCB92]

B. T. Bartell, G. W. Cottrell, and R. K. Belew, Latent Semantic Indexing is an Optimal Special Case
of Multidimensional Scaling, Proceedings of the 15th Annual International ACM SIGIR Conference
on Research and Development in Information retrieval, 1992, pp. 161-167.

[BeK03]

J. Becker, and D. Kuropka, Topic-based Vector Space Model, In Proceedings of the 6th
International Conference on Business Information Systems, Colorado Springs, June, 2003, pp. 7-12.

[Bec59]

M. Beckner, The Biological Way of Thought, Columbia University Press, New York, 1959.

[Bee96]

K. R. Beesley, Arabic finite-state Morphological Analysis and Generation In Proceedings of the 16th
International Conference on Computational Linguistics (COLING-96), Vol. 1, pp. 89-94, 1996.

[BeC87]

N. Belkin and W. B. Croft, Retrieval Techniques, In M. Williams, editor, Annual Review of


Information Science and Technology (ARIST), Vol. 22, Chap. 4. Elsevier Science Publishers B.V.,
1987, pp. 109-145.

[BeN03]

M. Belkin and P. Niyogi, Laplacian Eigenmaps for Dimensionality Reduction and Data
Representation, Neural Computation, Vol. 6, No. 15, 2003, pp. 1373-1396.

[BeB99]

M. W. Berry, and M. Browne, Understanding Search Engines: Mathematical Modeling and Text
Retrieval, Siam Book Series: Software, Philadelphia, 1999.

[BDJ99]

M. W. Berry, Z. Drmac, and E. R. Jessup, Matrices, Vector Spaces, and Information Retrieval,
Society for Industrial and Applied Mathematics Review, Vol. 41, No. 2 (1999), pp. 335-362.

[BDO95]

M. W. Berry, S. T. Dumais, and G. W. O'Brien, Using Linear Algebra for Intelligent Information
Retrieval, Society for Industrial and Applied Mathematics Review, Vol. 37, No. 4 (1995), pp. 573595.

[BeF96]

M. W. Berry, and R. D. Fierro, Low-Rank Orthogonal Decompositions for Information Retrieval


Applications, Numerical Linear Algebra with Applications, Vol. 3, No. 4 (1996), pp. 301-328.

[BjG73]

A. Bjorck, and G. Golub, Numerical Methods for Computing Angles between Linear Subspaces,
Journal of Mathematics of Computation, Vol. 27, No. 123 (1973), pp. 579-594.

[Bla06]

A. Blansché, Classification non Supervisée avec Pondération d'Attributs par des Méthodes
Évolutionnaires, Ph.D. Dissertation, Louis Pasteur University - Strasbourg I, September, 2006.
[BlL97]

A. Blum, and P. Langley, Selection of Relevant Features and Examples in Machine Learning,
Journal of Artificial Intelligence, Vol. 97 (1997), pp. 245-271.

[Boo80]

A. Bookstein, Fuzzy Requests: An Approach to Weighted Boolean Searches, Journal of the


American Society for Information Science, Vol. 31 (1980), pp. 240-247.

[BoG97]

I. Borg, and P. Groenen, Modern Multidimensional Scaling: Theory and Applications, SpringerVerlag, New York, USA, 1997.

[BoA05]

S. Boulaknadel, and F. Ataa Allah, Recherche d'Information en Langue Arabe : Influence des
Paramètres Linguistiques et de Pondération de LSA, In Actes des Rencontres des Étudiants
Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL), Paris
Dourdan, Vol. 1 (2005), pp. 643-648.

[Bou08]

S. Boulaknadel, Recherche d'Information en Langue Arabe, Ph.D. Dissertation, Mohamed V


University, Morocco, 2008.

[BuH29]

H. G. Buchaman-Wollaston, and W. G. Hodgeson, A New Method of Treating Frequency Curves in


Fishery Statistics, with some Results, Journal of the International Council for the Exploration of the
Sea, Vol. 4 (1929), pp. 207-225.

[BuK81]

D. Buell, and D. H. Kraft, Threshold Values and Boolean Retrieval Systems, Journal of Information
Processing and Management, Vol. 17, No. 3 (1981), pp. 127-36.

[Can93]

F. Can, Incremental Clustering for Dynamic Information Processing, ACM Transactions on


Information Processing Systems, Vol. 11 (1993), pp. 143-164.

[CaD90]

F. Can, and N.D. Drochak II, Incremental Clustering for Dynamic Document Databases, In
Proceeding of the 1990 Symposium on Applied Computing, 1990, pp. 61-67.

[Cha94]

B.B. Chaudhri, Dynamic Clustering for Time Incremental Data, Pattern Recognition Letters, Vol.
15, No. 1 (1994), pp. 27-34.

[Chu97]

F.R.K. Chung, Spectral Graph Theory, Conference Board of the Mathematical Sciences Conference
Regional Conference Series in Mathematics, May, 1997, No. 92.

[ChH89]

K. Church and P. Hanks, Word Association Norms, Mutual Information, and Lexicography, In
Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, 1989, pp.
76-83.

[CRJ03]

J. Clech, R. Rakotomalala, and R. Jalam, Sélection multivariée de termes, In Proceedings of the 35th
Journées de Statistiques, Lyon, France, 2003, pp. 933-936.

[CoL06a]

R.R. Coifman and S. Lafon, Diffusion Maps, Applied and Computational Harmonic Analysis, Vol.
21, No. 1 (2006), pp. 6-30.

[CoL06b] R.R. Coifman and S. Lafon, Geometric Harmonics: A Novel Tool for Multiscale Out-of-Sample
Extension of Empirical Functions, Applied and Computational Harmonic Analysis, Vol. 21, No. 1
(2006), pp. 31-52.
[CLL05]

R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. Zucker, Geometric
Diffusions as a Tool for Harmonics Analysis and Structure Definition of Data: Diffusion Maps,
Proceedings of the National Academy of Sciences, Vol. 102, No. 21 (2005), pp. 7426-7431.

[Com94]

P. Comon, Independent component analysis, A new concept?, In Proceeding of Signal Processing,


Vol. 36, No. 3 (1994), pp. 287-314.

[CLR98]

F. Crestani, M. Lalmas, C. J. Van Rijsbergen, and I. Campbell, Is this Document Relevant?


Probably: A Survey of Probabilistic Models in Information Retrieval, ACM Computing Surveys
(CSUR), Vol. 30, No. 4 (1998), pp. 528-552.

[Cro77]

W.B. Croft, Clustering large files of documents using the single link method, Journal of the
American Society for Information Science, Vol. 28 (1977), pp. 341-344.

[Cro72]

D. Crouch, A clustering algorithm for large and dynamic document collections, Ph.D. Dissertation,
Southern Methodist University, 1972.

[CuW85]

J.K. Cullum, and R.A. Willoughby, Lanczos algorithms for large symmetric eigenvalue computations
Vol. 1 Theory, (Chapter 5: Real rectangular matrices), Birkhäuser, Boston, 1985.

[DaL97]

M. Dash, and H. Liu, Feature Selection for Classification, Journal of Intelligent Data Analysis, Vol.
1, No. 1-4 (1997), pp. 131-156.

[Dar02]

K. Darwish, Building a Shallow Arabic Morphological Analyzer in One Day, In Proceedings of the
Association for Computational Linguistics, 2002, pp. 47-54.

[Dar03]

K. Darwish, Probabilistic Methods for Searching OCR-Degraded Arabic Text, Doctoral Dissertation,
University of Maryland, College Park, Maryland, 2003.

[DDJ01]

K. Darwish, D. Doermann, R. Jones, D. Oard, and M. Rautiainen, TREC-10 Experiments at


Maryland: CLIR and Video, In Proceedings of the 2001 Text Retrieval Conference National Institute
of Standards and Technology, November, 2001, pp. 552.

[Dat71]

R.T. Dattola, Experiments with a fast clustering algorithm for automatic classification, In The
SMART Retrieval System-Experiments in Automatic Document Processing, G. Salton Edition,
Prentice-Hall, Englewood Cliffs, New Jersey, 1971, Chap. 12.

[DaB79]

D.L. Davies, and D.W. Bouldin, A Cluster Separation Measure, IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol. 1, No. 2 (1979), pp. 224-227.

[DDF89]

S. Deerwester, S. Dumais, G.W. Furnas, R.A. Harshman, T.K. Landauer, K.E. Lochbaum,
and L.A. Streeter, Computer information retrieval using latent semantic structure, U.S. Patent
No. 4,839,853, 1989.

[DDF90]

S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman, Indexing by Latent
Semantic Analysis, Journal of the American Society for Information Science, Vol. 41, No. 6 (1990),
pp. 391-407.

[DKN03]

I.S. Dhillon, J. Kogan, and M. Nicholas, Feature Selection and Document Clustering, In M.W.
Berry, editor, A Comprehensive Survey of Text mining, Springer-Verlag, 2003.

[DhM99]

I.S. Dhillon, and D.S. Modha, Concept Decompositions for Large Sparse Text Data using
Clustering, Technical Report RJ 10147 (95022), IBM Almaden Research Center, 1999.

[DhM00]

I.S. Dhillon and D.S. Modha, A parallel data-clustering algorithm for distributed memory
multiprocessors, In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence, Vol.
1759 (2000), pp. 245-260.

[DhM01]

I.S. Dhillon and D.S. Modha, Concept Decompositions for Large Sparse Text Data using
Clustering, Machine Learning, Vol. 42, No. 1-2 (2001), pp. 143-175.

[DHJ04]

M. Diab, K. Hacioglu, and D. Jurafsky, Automatic Tagging of Arabic Text: from Raw Text to Base
Phrase Chunks, In Proceedings of the Human Language Technology conference and the North
American Chapter of the Association for Computational Linguistics, Boston, Massachusetts, USA,
May, 2004, pp. 149-152.

[Did73]

E. Diday, The Dynamic Cluster Method and Sequentialization in Nonbierarchical Clustering,


International Journal of Computer and Information Science, Vol. 2, No. 1(1973), pp. 63-69.

[DHZ01]

C. Ding, X. He, H. Zha, M. Gu, and H. Simon, A minmax cut algorithm for graph partitioning and
data clustering, In Proceedings of IEEE International Conference on Data Mining, 2001, pp. 107114.

[DHZ02]

C. Ding, X. He, H. Zha, M. Gu, and H. Simon, Adaptive Dimension Reduction for Clustering High
Dimensional Data, In Proceedings of the 2nd International IEEE Conference on Data Mining,
December, 2002, pp. 147-154.

[Din99]

C. H. Ding, A Similarity-based Probability Model for Latent Semantic Indexing, Proceedings of the
22nd ACM SIGIR Conference, August, 1999, pp. 59-65.

[Din01]

C. H. Ding, A Probabilistic Model for Dimensionality Reduction in Information Retrieval and


Filtering, In Proceedings of the 1st SIAM Computational Information Retrieval Workshop, 2000.

[DoG03]

D.L. Donoho, and C. Grimes, Hessian Eigenmaps: New Locally Linear Embedding Techniques for
High-Dimensional Data, In Proceedings of Natl Academy of Sciences, Vol. 100, No. 10 (2003), pp.
5591-5596.

[DuH73]

R.O. Duda, and P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, NY, USA,
1973.

[Dum91]

S. Dumais, Improving the Retrieval of Information from External Sources, Behavior Research
Methods, Instruments, & Computers, Vol. 23, No. 2 (1991), pp. 229-236.

[Dum92]

S. Dumais, Enhancing Performance in Latent Semantic Indexing (LSI) Retrieval, Technical


Memorandum Tm-ARH-017527, Bellcore, 1992.

[Dum94]

S. Dumais, Latent Semantic Indexing (LSI) and TREC-2, Technical Memorandum Tm-ARH023878, Bellcore, 1994.

[Dun03]

M. H. Dunham, Data Mining: Introductory And Advanced Topics, New Jersey: Prentice Hall, 2003.

[Dun89]

G. H. Dunteman, Principal Component Analysis, Sage Publications, Newbury Park, California, USA,
1989.

[Egg04]

L. Egghe, Vector Retrieval, Fuzzy Retrieval and the Universal Fuzzy IR Surface for IR Evaluation,
Journal of Information Processing and Management, Vol. 40, No. 4 (2004), pp. 603-618.

[EGH91]

D. A. Evans, K. Ginther-Webster, M. Hart, R. G. Lefferts, and I. Monarch, Automatic Indexing


using Selective NLP and First-Order Thesauri, In Proceedings of the Conference on Intelligent Text
and Image Handling, Barcelona, Spain, 1991, pp. 394-401.

[Fag87]

J. Fagan, Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of


Syntactic and Non-syntactic methods, Doctoral dissertation, Cornell University, 1987.

[FaO95]

C. Faloutsos, and D.W. Oard, A Survey of Information Retrieval and Filtering Methods, Technical
Report CS-TR-3514, Department of Computer Science, University of Maryland, College Park, 1995.

[FiB02]

R.D. Fierro and M.W. Berry, Efficient Computation of the Riemannian SVD in Total Least Squares
Problems in Information Retrieval, in S. Van Huffel and P. Lemmerling (Eds.), Total Least Squares
and Errors-in-Variables Modeling: Analysis, Algorithms, and Applications, Kluwer Academic
Publishers, 2002, pp. 349-360.

[For03]

G. Forman, An Extensive Empirical Study of Feature Selection Metrics for Text Classification,
Journal of Machine Learning Research, Vol. 3 (2003), pp. 1289-1305.

[FrB92]

W. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures & Algorithms,
Prentice Hall, Englewood Cliffs, New Jersey, 1992.

[Fre01]

A. Freeman, Brill's POS Tagger and a Morphology Parser for Arabic, In Proceedings of the 39th
Annual Meeting of Association for Computational Linguistics and the 10th Conference of the
European Chapter, Workshop on Arabic Language Processing: Status and Prospects, Toulouse,
France, July, 2001.

[Fri73]

M. Fritzche, Automatic Clustering Techniques in Information Retrieval, Diplomarbeit, Institut für
Informatik der Universität Stuttgart, 1973.

[FuT04]

B. Fuglede, and F. Topsoe, Jensen-Shannon Divergence and Hilbert Space Embedding, In IEEE
International Symposium on Information Theory, July, 2004, pp. 31.

[GeO01]

F.C. Gey, and D.W. Oard, The TREC-2001 Cross-Language Information Retrieval Track: Searching
Arabic Using English, French or Arabic Querie, In Proceedings of the 2001 Text Retrieval
Conference, National Institute of Standards and Technology, November, 2001, pp. 16-26.

[GoR71]

G. Golub, and C. Reinsch, Handbook for Automatic Computation II, Linear Algebra, SpringerVerlag, New York, 1971.

[GoV89]

G. Golub, and C. Van Loan, Matrix Computations, Johns-Hopkins, Baltimore, Maryland, 2nd Edition,
1989.

[GoD01]

A. Goweder, and A. De Roeck, Assessment of a Significant Arabic Corpus, In Proceedings of the


39th Annual Meeting of the Association for Computational Linguistics, Arabic language Processing,
Toulouse, France, 2001, pp. 73-79.

[GPD04]

A. Goweder, M. Poesio, A. De Roeck, and J. Reynolds, Identifying Broken Plurals in Unvowelised


Arabic Text, In Proceedings of Empirical Methods In Natural Language Processing, Geneva, July,
2004, pp. 246-253.

[GoR69]

J. C. Gower, and G. J. S. Ross, Minimum Spanning Trees and Single-Linkage Cluster Analysis,
Applied Statistics, Vol. 18, No. 1 (1969), pp. 5464.

[GRG97]

V. N. Gudivada, V. V. Raghavan, W. I. Grosky, and R. Kasanagottu, Information Retrieval on the


World-Wide Web, IEEE Internet Computing, Vol. 1, No. 5 (1997), pp. 58-68.

[GuB06]

S. Guérif, and Y. Bennani, Selection of Clusters Number and Features Subset during a two-levels
Clustering Task, In Proceeding of the 10th IASTED International Conference Artificial Intelligence
and Soft Computing, August, 2006, pp. 28-33.

[GuB07]

S. Guérif, and Y. Bennani, Dimensionality Reduction through Unsupervised Features Selection, In


Proceeding of the 10th International Conference on Engineering Applications of Neural Networks,
Thessaloniki, Hellas, Greece, August, 2007, pp. 98-106.

[GBJ05]

S. Guérif, Y. Bennani, and E. Janvier, -som: Weighting Features During Clustering, In Proceeding
of the 5th Workshop On Self-Organizing Maps, September, 2005, pp. 397-404.

[GuE03]

I. Guyon, and A. Elisseeff, An Introduction to Variable and Feature Selection, Journal of Machine
Learning Research, Vol. 3 (2003), pp. 1157-1182.

[HaK03]

K.M. Hammouda, and M.S. Kamel, Incremental Document Clustering Using Cluster Similarity
Histograms, In Proceeding of the IEEE International Conference on Web Intelligence, June, 2003,
pp. 597-601.

[HaK92]

L. Hagen, and A.B. Kahng, New Spectral Methods for Ratio Cut Partitioning and Clustering, IEEE
Transaction Computer-Aided Design of Integrated Circuits and Systems, Vol. 11, No. 9, September,
1992, pp. 10741085.

[HGM00] V. Hatzivassiloglou, L. Gravano, and A. Maganti, An Investigation of Linguistic Features and


Clustering Algorithms for Topical Document Clustering, In Proceedings of the 23rd Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval,
Athens, Greece, July, 2000, pp. 224-231.
[HeP96]

M. A. Hearst and J. O. Pedersen, Reexamining the Cluster Hypothesis: Scatter/Gather on retrieval


results, Proceedings of the 19th International ACM Conference on Research and Development in
Information Retrieval, Zurich, Switzerland, August, 1996, pp. 76-84.

[HBL94]

W. R. Hersh, C. Buckley, T. J. Leone, and D. H. Hickam, Ohsumed: An Interactive Retrieval


Evaluation and new large Test Collection for Research, In Proceedings of the 17th Annual
International ACM Conference on Research and Development in Information Retrieval, Dublin,
Ireland, 1994, pp. 192-201.

[HiK99]

A. Hinneburg, and D. A. Keim, Optimal Grid-Clustering: Towards Breaking the Curse of


Dimensionality in High-Dimensional Clustering, In Proceedings of the 25th International Conference
on Very Large Data Bases, Edinburgh, 1999, pp. 506-517.

[HKE97]

I. Hmeidi, K. Kanaan, and M. Evens, Design and Implementation of Automatic Indexing for
Information Retrieval with Arabic Documents, Journal of the American Society for Information
Science, Vol. 48, No. 10 (1997), pp. 867-881.

[Yan05]

H. Yan, Techniques for Improved LSI Text Retrieval, Ph.D. Dissertation, Wayne State University,
Detroit, Michigan, USA, 2005.

[Yan08]

H. Yan, W. I. Grosky, and F. Fotouhi, Augmenting the power of LSI in text retrieval: Singular value
rescaling, Journal of Data and Knowledge Engineering, Vol. 65 (2008), pp. 108-125.

[HNR05]

J.Z. Huang, M.K. Ng, H. Rong, and Z. Li, Automated Variable Weighting in k-means Type
Clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 5 (2005),
pp. 657-668.

[HSD00] P. Husbands, H. Simon, and C. H. Ding. On the Use of the Singular Value Decomposition for Text
Retrieval, Computational information Retrieval, M. W. Berry, Ed. Society for Industrial and Applied
Mathematics, Philadelphia, PA, 2001, pp. 145-156.
[Ibn90]

Ibn Manzour, Lisan Al-Arab, Arabic Encyclopedia, 1290.

[JaD88]

A. Jain, and R. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, N.J., 1988.

[JMF99]

A.K. Jain, M.N. Murty, and P.J. Flynn, Data clustering: a review, ACM Computing Surveys, Vol.
31, No. 3 (1999), pp. 264-323.

[JaZ97]

A. Jain, and D. Zongker, Feature Selection: Evaluation, Application, and Small Sample
Performance, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 2
(1997), pp. 153-158.

[JaS71]

N. Jardine, and R. Sibson, Mathematical Taxonomy, Wiley, London and New York, 1971.

[Jia97]

J. Jiang, Using Latent Semantic Indexing for Data Mining, MS Thesis, Department of Computer
Science, University of Tennessee, December, 1997.

[Jia98]

E.P. Jiang, Information retrieval and Filtering Using the Riemannian SVD, Ph.D. Dissertation,
Department of Computer Science, University of Tennessee, August, 1998.

[JKP94]

G.H. John, R. Kohavi, and K. Pfleger, Irrelevant features and the subset selection problem, In
Proceedings of the 11th International Conference on Machine Learning, San Francisco, CA, USA,
1994, pp. 121-129.

[Jon72]

K.S. Jones, A Statistical Interpretation of Term Specificity and its Application in Retrieval, Journal
of Documentation, Vol. 28, No. 1 (1972), pp. 11-21.

[KaR90]

L. Kaufman, and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis,


Wiley series in probability and mathematical statistics-Applied probability and statistics, A WileyInterscience, New York, NY, 1990.

[Kin67]

B. King, Step-wise clustering procedures, Journal of the American Statistical Association, Vol. 69
(1967), pp. 86-101.

[KlJ04]

I.A. Klampanos, and J.M. Jose, An Architecture for Information Retrieval over Semi-Collaborating
Peer-to-Peer Networks, In Proceedings of the 2004 ACM Symposium on Applied Computing, Vol. 2
(2004), Nicosia, Cyprus, March, pp. 1078-1083.

[KJR06]

I.A. Klampanos, J.M. Jose, and C.J.K. van Rijsbergen, Single-Pass Clustering for Peer-to-Peer
Information Retrieval: The Effect of Document Ordering, Proceedings of the 1st International
Conference on Scalable information Systems, Hong Kong, May, 2006, Article 36.

[Kho01]

S. Khoja, APT: Arabic Part-of-speech Tagger, In Proceedings of the Student Workshop at the 2nd
Meeting of the North American Chapter of the Association for Computational Linguistics, 2001, pp.
20-25.

[KhG99]

S. Khoja, and R. Garside, Stemming Arabic text, Technical Report, Computing Department,
Lancaster University, Lancaster, September, 1999.

[KiR92]

K. Kira, and L. A. Rendell, A Practical Approach to Feature Selection, In Proceedings of the 9th
International Conference on Machine Learning, San Francisco, CA, USA, 1992, pp. 249-256.

[KoJ97]

R. Kohavi, and G. H. John, Wrappers for feature subset selection, Journal of Artificial Intelligence,
Vol. 97, No. 1-2 (1997), pp. 273-324.

[KoO96]

T.G. Kolda, and D.P. O'Leary, Large Latent Semantic Indexing via a Semi-Discrete Matrix
Decomposition, Technical Report, No. UMCP-CSD CS-TR-3713, Department of Computer Science,
Univ. of Maryland, 1996.

[KoO98]

T.G. Kolda, and D.P. O'Leary, A Semi-Discrete Matrix Decomposition for Latent Semantic Indexing
in Information Retrieval, ACM Transactions on Information Systems, Vol. 16, No. 4 (1998), pp. 322-

346.
[KoS96]

D. Koller, and M. Sahami, Toward Optimal Feature Selection, In Proceedings of the 13th
International Conference on Machine Learning, 1996, pp. 284292.

[KWX01] B. Krishnamurthy, J. Wang, and Y. Xie, Early Measurements of a Cluster-Based Architecture for
P2P Systems, Internet Measurement Workshop, ACM SIGCOMM, San Francisco, USA, November,
2001.
[KuL51]

S. Kullback, and R. A. Leibler, On Information and Sufficiency, Annual Mathematical Statistics,


Vol. 22 (1951), pp.79-86.

[LaL06]

S. Lafon, and A.B. Lee, Diffusion Maps and Coarse-Graining: A Unified Framework for
Dimensionality Reduction, Graph Partitioning, and Data Set Parameterization, IEEE Transactions on
Pattern Analysis and Machine Intelligence, Vol. 28, No. 9 (2006), pp. 1393-1403.

[LFL98]

T.K. Landauer, P. W. Foltz, and D. Laham, An Introduction to Latent Semantic Analysis, Discourse
Processes, Vol. 25 (1998), pp. 259-284.

[Lap00]

E. Laporte, Mot et niveau lexical, Ingénierie des langues, 2000, pp. 25-46.

[LBC02]

L. S. Larkey, L. Ballesteros, and M. Connell, Improving Stemming for Arabic Information Retrieval
: Light Stemming and Cooccurrence Analysis, In Proceedings of the 25th Annual International
Conference on Research and Development in Information Retrieval (SIGIR 2002), Tampere, Finland,
August 2002, pp. 275-282.

[LaM71]

D. N. Lawley, and A. E. Maxwell, Factor Analysis as a Statistical Method, 2nd edition, American
Elsevier Publication, New York, USA, 1971.

[Law03]

N. D. Lawrence, Gaussian Process Latent Variable Models for Visualisation of High Dimensional
Data, In Proceeding of Neural Information Processing Systems, December, 2003.

[Lee94]

J.H. Lee, Properties of Extended Boolean Models in Information Retrieval, Proceedings of the 17th
Annual International ACM SIGIR Conference, Dublin, Ireland, 1994, pp. 182-190.

[Ler99]

K. Lerman, Document Clustering in Reduced Dimension Vector Space, Unpublished Manuscript,


1999, http://www.isi.edu/~lerman/papers/papers.html.

[Let96]

T.A. Letsche, Toward Large-Scale Information Retrieval Using Latent Semantic Indexing, MS
Thesis, Department of Computer Science, University of Tennessee, August 1996.

[LeB97]

T.A. Letsche, and M.W. Berry, Large-Scale Information Retrieval with Latent Semantic Indexing,
Information Sciences, Vol. 100, No. 1-4 (1997), pp. 105-137.

[Leu01]

A. Leuski, Evaluating Document Clustering for Interactive Information Retrieval, Proceedings of


the ACM 10th International Conference on Information and Knowledge Management, Atlanta,
Georgia, November, 2001, pp. 33-40.

[Lit69]

B. Litofsky, Utility of automatic classification systems for information storage and retrieval, Ph.D.
Dissertation, University of Pennsylvania, 1969.

[LiM98]

H. Liu, and H. Motoda, Feature Selection for Knowledge Discovery & Data Mining, The Kluwer
International Series in Engineering and Computer Science, Kluwer Academic Publishers, Boston,
USA, 1998.

[Mac67]

J. B. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations,


Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley,
University of California Press, Vol. 1 (1967), pp. 281-297.

[MaL01]

V. Makarenkov, and P. Legendre, Optimal Variable Weighting for Ultrametric and Additive Trees
and k-means Partitioning: Methods and Software, Journal of Classification, Vol. 18, No. (2001), pp.
245-271.

[MAS03]

J. Makkonen, H. Ahonen-Myka, and M. Salmenkivi, Topic Detection and Tracking with SpatioTemporal Evidence, In Proceedings of 25th European Conference on Information Retrieval
Research, 2003, pp. 251-265.

[MaS99]

C. Manning, and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press,
Cambridge, MA, 1999.

[MeS01]

M. Meila, and J. Shi, A Random Walks View of Spectral Segmentation, In Proceedings of


International Workshop on Artificial Intelligence and Statistics, Key West, Florida, USA, January,
2001.

[Mil02]

A. Miller, Subset Selection in Regression, 2nd edition, Chapman & Hall/CRC, 2002.

[MBS97]

M. Mitra, C. Buckley, A. Singhal, and C. Cardie, An Analysis of Statistical and Syntactic Phrases, In
Proceedings of the 5ème Conférence de Recherche d'Information Assistée par Ordinateur, Montreal,
Canada, June, 1997, pp. 200-214.

[MMP02] P. Mitra, C.A. Murthy, and S.K. Pal, Unsupervised Feature Selection Using Feature Similarity,
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 3 (2002), pp.301-312.
[Mun57]

J. Munkres, Algorithms for the Assignment and Transportation Problems, Journal of the Society for
Industrial and Applied Mathematics, Vol. 5, No. 1 (1957), pp. 32-38.

[MuA04]

S. H. Mustafa, and Q. A. Al-Radaideh, Using N-grams for Arabic Text Searching, Journal of the
American Society for Information Science and Technology, Vol. 55, No. 11, September, 2004, pp.
1002-1007.

[NJW02]

A. Ng, M. Jordan, and Y. Weiss, On Spectral Clustering: Analysis and an Algorithm, In


Proceedings of 14th Advances in Neural Information Processing Systems, 2002.

[NiC05]

M. Nikkhou, and K. Choukri, Report on Survey on Arabic Language Resources and Tools in
Mediterranean Countries, ELDA, NEMLAR, 2005.

[Obr94]

G. W. OBrien, Information Management Tools for Updating an SVD Encoded Indexing Scheme,
Masters Thesis, The University of Knoxville, Tennessee, Knoxville, TN, 1994.

[PLL01]

J.M. Pena, J.A. Lozano, P. Larranaga, and I. Inza, Dimensionality Reduction in Unsupervised
Learning of Conditional Gaussian Networks, IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. 23, No. 6, June, 2001, pp. 590-603.

[PoC98]

J.M. Ponte, and W.B. Croft, A Language Modeling Approach to information retrieval, Proceedings
of the 21st Annual International ACM SIGIR Conference, 1998, pp. 275-281.

[PTS92]

W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling, Numerical Recipes in C: The Art of
Scientific Computing, 2nd edition, Cambridge University Press, 1992, pp. 994.

[PrS72]

N.S. Prywes, and D.P. Smith, Organization of Information, Annual Review of Information Science
and Technology, Vol. 7 (1972), pp. 103-158.

[PNK94]

P. Pudil, J. Novovicova, and J. Kittler, Floating Search Methods in Feature Selection, Journal of
Pattern Recognition Letters, Vol. 15, No. 11 (1994), pp. 1119-1125.

[Rad79]

T. Radecki, Fuzzy Set Theoretical Approach to Document Retrieval, Journal of Information


Processing and Management, Vol. 15 (1979), pp. 247-259.

[Ras92]

E. Rasmussen, Clustering Algorithms, In Information Retrieval: Data Structures and Algorithms,


1992, pp. 419-442.

[RoS76]

S.E. Robertson, and K. Sparck Jones, Relevance Weighting of Search Terms, Journal of American
Society for Information Sciences, Vol. 27, No. 3 (1976), pp. 129-146.

[RWH94] S. E. Robertson, S. Walker, M. Hancock-Beaulieu, and M. Gatford, Okapi at TREC-3, In


Proceedings of TREC-3, November, 1994, pp. 109-126.
[Rom90]

P. M. Romer, Endogenous Technical Change, Journal of Political Economy, Vol. 98, No. 5 (1990),
pp. 71-102.

[RoS00]

S.T. Roweis, and L.K. Saul, Nonlinear Dimensionality Reduction by Locally Linear Embedding,
Journal of Science, Vol. 290, No. 5500 (2000), pp. 2323-2326.

[SaB90]

G. Salton, and C. Buckley, Improving retrieval performance by relevance feedback, Journal of the
American Society for Information Science, Vol. 41, No. 4 (1990), pp. 288-297.

[Sal68]

G. Salton, Automatic Information Organization and Retrieval, McGraw-Hill, New York, 1968.

[Sal71]

G. Salton, The SMART Retrieval System Experiments in Automatic Document Processing,


Prentice-Hall Inc, Englewood Cliffs, New Jersey, 1971.

[SaM83]

G. Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw Hill Publishing
Company, New York, 1983.

[SaW78]

G. Salton, and A. Wong, Generation and Search of Clustered Files, ACM Transaction on Database

Systems, Vol. 3, No. 4, December, 1978, pp. 321-346.
[Sav02]

J. Savoy, Morphologie et recherche d'information, Technical Report, CLEF, 2002.

[SSM99]

B. Scholkopf, A. J. Smola, and K. Muller, Kernel Principal Component Analysis, In B. Scholkopf,


C. J. C. Burges, and A. J. Smola edition, Advances in Kernel Methods - Support Vector Learning,
MIT Press, Cambridge, MA, 1999, pp. 327-352.

[Sch94]

H. Schmid, Probabilistic Part-of-Speech Tagging Using Decision Trees, Proceedings of


International Conference on New Methods in Language Processing, Manchester, UK, July, 1994, pp.
172-176.

[Sch02]

N. Schmitt, Using corpora to teach and assess vocabulary, Chapter in Corpus Studies in Language
Education, Melinda Tan Edition, IELE Press, 2002, pp. 31-44.

[SAS04]

Y. Seo, A. Ankolekar, and K. Sycara, Feature Selection for Extracting Semantically Rich Words,
Technical Report CMU-RI-TR-04-18, Robotics Institute, Carnegie Mellon University, March, 2004.

[ShM00]

J. Shi, and J. Malik, Normalized Cuts and Image Segmentation, IEEE Transaction on Pattern
Analysis and Machine Intelligence, Vol. 22, No. 8 (2000), pp. 888-905.

[SBM96]

A. Singhal, C. Buckley, and M. Mitra, Pivoted document length normalization, Proceedings of the
19th Annual International ACM SIGIR Conference, Zurich, Switzerland, August, 1996, pp. 21-29.

[SnS73]

P. H. A. Sneath, and , R. R. Sokal, Numerical Taxonomy, Freeman, London, UK, 1973.

[SGM00] A. Strehl, J. Ghosh, and R. Mooney, Impact of Similarity Measures on Web-page Clustering, In Proceedings of the AAAI Workshop on AI for Web Search, K. Bollacker (Ed.), TR WS-00-01, AAAI Press, July, 2000, pp. 58-64.

[SLP97] T. Strzalkowski, F. Lin, and J. Perez-Carballo, Natural Language Information Retrieval: TREC-6 Report, In Proceedings of the 6th Text Retrieval Conference, 1997, pp. 347-366.

[Sub92] J.L. Subbiondo, John Wilkins' Theory of Meaning and the Development of a Semantic Model, In John Wilkins and 17th-Century British Linguistics, Chap. 5: Wilkins' Classification of Reality, Joseph L. Subbiondo (Ed.), Amsterdam, 1992, pp. 291-308.

[TEC05] K. Taghva, R. Elkhoury, and J. Coombs, Arabic Stemming Without A Root Dictionary, In Proceedings of Information Technology: Coding and Computing, Las Vegas, NV, April, 2005, pp. 152-157.

[TSL00] J.B. Tenenbaum, V. de Silva, and J. C. Langford, A Global Geometric Framework for Nonlinear Dimensionality Reduction, Science, Vol. 290 (2000), pp. 2319-2323.

[TuC91] H. Turtle, and W. Croft, Evaluation of an Inference Network-based Retrieval Model, ACM Transactions on Information Systems, Vol. 9, No. 3, 1991, pp. 187-222.

[VHL05] U. Vaidya, G. Hagen, S. Lafon, A. Banaszuk, I. Mezic, and R. R. Coifman, Comparison of Systems using Diffusion Maps, In Proceedings of the 44th IEEE Conference on Decision and Control and the European Control Conference, Seville, Spain, December, 2005, pp. 7931-7936.
[Van72] C.J. Van Rijsbergen, Automatic Information Structuring and Retrieval, Ph.D. Dissertation, University of Cambridge, 1972.

[Van79] C.J. Van Rijsbergen, Information Retrieval, Second Edition, Butterworths Publishing Company, London, 1979.

[Von06] U. Von Luxburg, A Tutorial on Spectral Clustering, Technical Report TR-149, Max Planck Institute for Biological Cybernetics, 2006.

[WaK79] W. G. Waller, and D. H. Kraft, A Mathematical Model of a Weighted Boolean Retrieval System, Information Processing & Management, Vol. 15, No. 5 (1979), pp. 235-245.

[Wei99] Y. Weiss, Segmentation Using Eigenvectors: A Unifying View, In Proceedings of the IEEE International Conference on Computer Vision, Vol. 14 (1999), pp. 975-982.

[VeA99] J. Vesanto, and J. Ahola, Hunting for Correlations in Data Using the Self-Organizing Map, In Proceedings of the International ICSC Congress on Computational Intelligence Methods and Applications, 1999, pp. 279-285.

[WMC01] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, Feature Selection for SVMs, Advances in Neural Information Processing Systems, Vol. 13 (2001), pp. 668-674.
[Wit97] D. I. Witter, Downdating the Latent Semantic Indexing Model for Information Retrieval, MS Thesis, Department of Computer Science, University of Tennessee, 1997.

[WMB94] I.H. Witten, A. Moffat, and T.C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, Van Nostrand Reinhold, New York, NY, 1994.

[WoF00] W. Wong, and A. Fu, Incremental Document Clustering for Web Page Classification, In Proceedings of the International Conference on Information Society, Japan, 2000.

[WZW85] S.K.M. Wong, W. Ziarko, and P.C.N. Wong, Generalized Vector Spaces Model in Information Retrieval, In Proceedings of the 8th Annual International ACM SIGIR Conference, Montreal, Quebec, Canada, 1985, pp. 18-25.

[XJK01] E.P. Xing, M.I. Jordan, and R.M. Karp, Feature Selection for High-Dimensional Genomic Microarray Data, In Proceedings of the 18th International Conference on Machine Learning, San Francisco, CA, USA, 2001, pp. 601-608.

[XuC98] J. Xu, and W.B. Croft, Corpus-Based Stemming Using Co-occurrence of Word Variants, ACM Transactions on Information Systems, Vol. 16, No. 1 (1998), pp. 61-81.

[XFW01] J. Xu, A. Fraser, and R. Weischedel, TREC 2001 Crosslingual Retrieval at BBN, In TREC 2001, Gaithersburg: NIST, 2001.

[Yah89] A. H. Yahya, On the Complexity of the Initial Stages of Arabic Text Processing, In First Great Lakes Computer Science Conference, Kalamazoo, Michigan, U.S.A., October, 1989, pp. 18-20.

[YaH98] J. Yang, and V. Honavar, Feature Subset Selection Using a Genetic Algorithm, IEEE Intelligent Systems, Vol. 13, No. 2 (1998), pp. 44-49.

[YaP97] Y. Yang, and J.O. Pedersen, A Comparative Study of Feature Selection in Text Categorization, In Proceedings of the 14th International Conference on Machine Learning, San Francisco, CA, USA, 1997, pp. 412-420.

[YuL03] L. Yu, and H. Liu, Feature Selection for High-Dimensional Data: A Fast Correlation-based Filter Solution, In Proceedings of the 20th International Conference on Machine Learning, 2003, pp. 856-863.

[Zah71] C. T. Zahn, Graph-Theoretic Methods for Detecting and Describing Gestalt Clusters, IEEE Transactions on Computers, Vol. 20, No. 1 (1971), pp. 68-86.

[ZaE98] O. Zamir, and O. Etzioni, Web Document Clustering: A Feasibility Demonstration, In Proceedings of the 21st International ACM SIGIR Conference, Melbourne, Australia, 1998, pp. 46-54.

[ZeH01] S. Zelikovitz, and H. Hirsh, Using LSI for Text Classification in the Presence of Background Text, In Proceedings of the ACM 10th International Conference on Information and Knowledge Management (CIKM'01), Atlanta, Georgia, November, 2001, pp. 113-118.

[ZTM96] C. Zhai, X. Tong, N. Milic-Frayling, and D. A. Evans, Evaluation of Syntactic Phrase Indexing - CLARIT NLP Track Report, In Proceedings of the 5th Text Retrieval Conference, Gaithersburg, MD, November, 1996, pp. 347-358.

[ZhZ02] Z. Zhang, and H. Zha, Principal Manifolds and Nonlinear Dimension Reduction via Local Tangent Space Alignment, Technical Report CSE-02-019, Dept. of Computer Science and Eng., Pennsylvania State University, Pennsylvania, USA, 2002.

[ZhG02] R. Zhao, and W. I. Grosky, Negotiating the Semantic Gap: from Feature Maps to Semantic Landscapes, Pattern Recognition, Vol. 35, No. 3 (2002), pp. 593-600.
