
A Hybrid Approach for Indexing and Retrieval of Archaeological Textual Information

Ammar Halabi¹, Ahmed-Derar Islim², and Mohamed-Zakaria Kurdi¹,³

¹ School of Informatics and Computing, Indiana University, Bloomington
ammar.halabi@gmail.com
² Financial Mathematics Department, Florida State University, Tallahassee
derarief@gmail.com
³ Department of Computer Science, Mamoun University of Science and Technology,
Aleppo, Syria
mzkurdi@yahoo.com

Abstract. This paper focuses on the problem of archaeological textual information
retrieval, covering various field-related topics and investigating issues related
to the special characteristics of Arabic.
The suggested hybrid retrieval approach employs various clustering and
classification methods that enhance both retrieval and presentation, and infer
further information from the results returned by a primary retrieval engine,
which, in turn, uses Latent Semantic Analysis (LSA) as its primary retrieval
method. In addition, a stemmer for Arabic words was designed and implemented
to facilitate the indexing process and to enhance the quality of retrieval.
The performance of our module was measured by carrying out experiments
using standard datasets, where the system showed promising results with many
possibilities for future research and further development.

Keywords: Information Retrieval, Arabic Information Retrieval, Arabic Stemming,
Arabic Lexical Analysis, Latent Semantic Analysis, Automatic Document
Categorization.

1 Introduction
Today, as new archaeological sites are discovered and interesting findings are
unearthed at already established ones, every mission places a significant and
growing amount of information at the archaeological community's disposal at the
end of each successful season.
In principle, archaeologists and researchers in associated disciplines should be able
to access this information in a convenient and consistent manner to effectively
retrieve material for the support of their own research, and to conduct collaborative
research, via information exchange, with other researchers in the community or in
other research communities.
The needs of information recording, organization, acquisition and dissemination in
the archaeological community suggest interesting possibilities for the adoption of

R. Setchi et al. (Eds.): KES 2010, Part IV, LNAI 6279, pp. 527–535, 2010.
© Springer-Verlag Berlin Heidelberg 2010

computer-based information systems. We started our research with the goal of designing
and implementing a robust Archaeological Information Retrieval system that is
capable of indexing and searching cross-language corpora as well as cross-media
corpora. This information retrieval system is a basic need and a critical component of
a complete Archaeological Information System, and can be considered the starting
point for developing such a system [10].
The main differences between such an archaeological system and other common
information retrieval engines are:
• Limitation of the domain: in domain-oriented applications, the lexicon may be
  large but it is usually limited. This leads to a significant reduction of textual
  ambiguity.
• Multilingualism: the archaeological data may be in several languages (especially
  ancient ones such as Akkadian, Sumerian, Aramaic, Assyrian, etc.).
In this paper, we tackle the problem of retrieving archaeological textual information.
We give background on textual information retrieval systems, then propose an
architecture for the Archaeological Textual Information Retrieval System, and end
by presenting our results and conclusions. The current application covers the Arabic
language only, but the adopted design makes it relatively easy to add new languages:
only a lightweight stemmer needs to be added per language.

2 State of the Art

2.1 Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA) [5] is a technique used in statistics and natural
language processing to find hidden (latent) relations between a set of observations
and a set of associated features. This is done by mapping features and observations
onto an intermediate concept space which preserves only the most significant
characteristics. With this new representation, relations between features and
observations that were not obvious can be revealed. In the context of natural
language, features are represented by terms, and observations are represented by
documents. LSA can be applied to document retrieval by projecting user queries and
indexed documents onto the concept space to uncover the relation between the user's
needs and the documents in the corpus. The application of LSA in textual information
retrieval is known as Latent Semantic Indexing (LSI).
In theory, to find a lower-rank approximation of the term-document matrix,
LSA makes use of the reduced Singular Value Decomposition (SVD), a matrix
factorization tool used in signal processing and statistics [1].
Using SVD, the term-document matrix $X$ (where row vectors represent terms, and
column vectors represent documents) is written as the product of three matrices,
$X = U \Sigma V^{T}$, where $U$ and $V$ have orthonormal columns and $\Sigma$ is
diagonal, as shown in Fig. 1.
As in Fig. 1, $U$ holds the eigenvectors of the matrix $X X^{T}$, $V$ holds the
eigenvectors of the matrix $X^{T} X$, and $\Sigma$ is a diagonal matrix whose diagonal
is formed by the square roots of the eigenvalues of $X X^{T}$ (or, equally, by the
square roots of the eigenvalues of $X^{T} X$). The diagonal values of $\Sigma$ are the
singular values, and the columns of $U$ and $V$ are the left and right singular vectors.

Fig. 1. SVD of the term-document matrix (X)

The LSA concept space is extracted by keeping only the k largest singular values
and the corresponding k left and right singular vectors. These k singular values
represent the new reduced concept space, where the left and right singular vectors
provide the means to transform to and from this space.
After computing the SVD of the term-document matrix, relations between terms,
documents, or terms and documents can be revealed. Moreover, user query vectors
can be projected onto the concept space to retrieve the closest set of documents that
meet the user's needs.
LSI can be applied for document indexing and retrieval, including cross-lingual
corpora, and for modeling the process of human learning and text
comprehension [13], [5]. LSI is also reported to outperform the Vector Space
retrieval model, where comparisons between query vectors and document vectors
are done in the original space [14].
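
The truncation and query projection described above can be sketched in a few lines
of Python with NumPy. This is only an illustrative sketch, not the system's actual
implementation: the function names, the toy matrix, and the choice of comparing
queries and documents in the coordinates $U_k^{T} x$ (rather than another equivalent
scaling) are assumptions made for the example.

```python
import numpy as np

def lsi_fit(X, k):
    """Truncated SVD of a term-document matrix X (rows: terms, columns: documents)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    # Document j in concept space is U_k^T x_j = Sigma_k V_k^T[:, j].
    doc_vectors = (np.diag(s_k) @ Vt_k).T      # one k-dimensional row per document
    return U_k, s_k, doc_vectors

def lsi_project(q, U_k):
    """Project a term-space vector (e.g. a query) onto the concept space."""
    return U_k.T @ q

def rank_documents(q, U_k, doc_vectors):
    """Return document indices sorted by cosine similarity to the query."""
    q_hat = lsi_project(q, U_k)
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_hat)
    sims = doc_vectors @ q_hat / (norms + 1e-12)   # small constant avoids division by zero
    return np.argsort(-sims), sims

# Hypothetical toy corpus; term order: pottery, ceramic, tablet, cuneiform.
X = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 2, 1],
              [0, 0, 1, 0]], dtype=float)
U_k, s_k, docs_k = lsi_fit(X, k=2)
query = np.array([1.0, 0.0, 0.0, 0.0])          # the single term "pottery"
order, sims = rank_documents(query, U_k, docs_k)
print(np.round(sims, 3))
```

In this toy example, document 1 contains only "ceramic" and shares no term with the
query, so its similarity in the original term space is zero; in the reduced space it
scores close to 1 because "pottery" and "ceramic" co-occur in document 0.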

2.2 Automatic Document Clustering and Classification

Document clustering can be defined as the partitioning of a dataset of documents into
subsets (clusters), where all documents in each subset share common traits,
expressed through a chosen distance measure. This technique is an unsupervised
learning method, in that no prior information about potential similarities between
documents is used in the learning process [2].
Document classification is a supervised machine learning method which assigns
documents to pre-defined labels (categories). This technique is used in pattern
recognition and data analysis. It works by constructing a classifier which learns a
model from a training dataset composed of documents along with their corresponding
categories. This dataset has to contain enough information for the classifier model
to be effective at predicting the classes of new documents.

3 The Archaeological Text Retrieval System


In Fig. 2, we illustrate our proposed hybrid architecture for the Archaeological Text
Retrieval System.
This architecture is primarily based on the general architecture and conventions
used in text retrieval systems, with an additional layer of functionality inspired by the
work of Sahami [22].

Fig. 2. Generic architecture of IR systems; the crescent refers to offline operations; lightning
bolts refer to online operations

For illustrative purposes, the architecture can be divided into two main blocks.
The first block (1) is modeled after the general architecture of text retrieval systems
and provides primary retrieval of documents. This is achieved by:
• Document pre-processing, which includes parsing files of different formats, term
  stopping, and stemming.
• Construction of the inverted index.
• Utilizing a primary retrieval engine based on the well-defined LSI retrieval model.
The second block (2) adds extra levels of functionality in order to refine the results of
primary retrieval (i.e. the set of retrieved documents obtained from the first block) and
to improve the quality of result presentation: documents are automatically clustered
and classified into different categories, giving the user a better perception of the
retrieval result than a simple list of retrieved documents would. This is achieved by:

• Feature selection for primarily retrieved documents.
• Clustering of primarily retrieved documents.
• Classification of new documents into one of the learned categories.
With this architecture, we seek to improve the quality of retrieval and result
presentation of the overall system. In other words, the features provided by the
Archaeological Text Retrieval subsystem are:
• Primary, semi-semantic text retrieval.
• Enhanced retrieval results automatically refined using clustering and classification
  techniques.
The problem of archaeological text retrieval is similar to the problem of general text
retrieval: only minimal modifications of a general text retrieval system, such as
special stop lists and additional stemming rules, are required to adapt it to retrieval
in a specific field. Therefore, we approached the problem from a general point of view,
ensuring that the selected methods can be extended to build a cross-media retrieval
system.

3.1 Document Pre-processing

Concerning Arabic in particular, we compiled our own stop-word list, which
includes functional vocabulary such as prepositions, adverbs, pronouns, and others.
We also implemented a light stemming algorithm which strips frequently used
prefixes and suffixes of Arabic words. Light stemming of Arabic terms is reported to
contribute to the effectiveness of retrieval more than root normalization, which
adopts a more aggressive stemming approach by reducing words to their roots [15],
[16]. Our algorithm is similar in concept to the stemming algorithms described in [16]
and [4]. However, we suggested and implemented a different set of stemming rules,
making more use of the knowledge of Arabic morphology.
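
To make the idea of stop-word removal followed by light stemming concrete, the
following is a minimal Python sketch. It does not reproduce the stemming rules of
our system: the short prefix, suffix, and stop-word lists below are illustrative
assumptions in the general spirit of light stemming [16], and the names
`light_stem` and `preprocess` are hypothetical.

```python
# Illustrative affix lists only (assumed for this sketch, not the system's rule set).
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")   # longer prefixes are tried first
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")
STOP_WORDS = frozenset({"في", "من", "على", "إلى", "عن"})

def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word

def preprocess(text):
    """Whitespace tokenization, stop-word removal, then light stemming."""
    return [light_stem(token) for token in text.split() if token not in STOP_WORDS]

print(light_stem("الكتاب"))      # definite article removed
print(light_stem("المعلمون"))    # article and plural suffix removed
print(preprocess("في المدينة"))  # stop word dropped, remaining token stemmed
```

The minimum-length guard is one simple way to avoid stripping letters that belong
to a short stem; a rule set that uses more morphological knowledge, as ours does,
would replace the flat affix lists.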

3.2 Indexing

In this operation, the index is constructed and the term-document matrix is built,
which serves as an abstract representation of documents for the retrieval model to act
upon.
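
As an illustration of this step, the sketch below builds an inverted index and the
corresponding term-document matrix from already pre-processed documents. It is a
minimal example under stated assumptions: the function name `build_index` is
hypothetical, and raw term frequencies are used as matrix entries, whereas a real
system may apply a weighting scheme that is not specified here.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Build an inverted index and a term-document matrix.

    docs: list of token lists (already stopped and stemmed).
    Returns (inverted, vocab, matrix), where matrix[i][j] is the frequency
    of term vocab[i] in document j.
    """
    inverted = defaultdict(dict)                 # term -> {doc_id: term frequency}
    for doc_id, tokens in enumerate(docs):
        for term, tf in Counter(tokens).items():
            inverted[term][doc_id] = tf
    vocab = sorted(inverted)                     # fixed row order for the matrix
    matrix = [[inverted[term].get(doc_id, 0) for doc_id in range(len(docs))]
              for term in vocab]
    return dict(inverted), vocab, matrix

# Hypothetical two-document corpus.
inverted, vocab, matrix = build_index([["tablet", "clay", "tablet"],
                                       ["clay", "seal"]])
print(vocab)
print(matrix)
```

Rows of the resulting matrix correspond to terms and columns to documents, which
matches the orientation of the matrix X used in Section 2.1.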

3.3 Primary Text Retrieval

Latent Semantic Indexing (LSI) was chosen as the primary retrieval model for the
following reasons:
• LSI generally outperforms the vector space model and provides solutions to the
  problems of synonymy [3], [5].
• An LSI-based retrieval engine can be extended to achieve cross-language
  information retrieval [19], [20].
• LSI uses the reduced SVD to project document and query vectors onto a new
  dimensionally reduced space. This new representation of documents can
  also be used for document clustering to refine primary search results [17], [24].

This LSA-based retrieval engine provides primary retrieval of relevant documents,
whose result is then refined by the subsequent clustering and classification steps.

3.4 Document Clustering

Both hierarchical agglomerative clustering (HAC) and k-means were used in document
clustering. The output of HAC was used to seed the k-means algorithm. With this
approach, after documents are grouped under different levels of hierarchy using HAC,
one level of the output hierarchy is used to seed k-means, which refines the clustering
at the given level by making useful re-assignments of documents to clusters. In
addition, using HAC to seed k-means can yield improvements in performance, since
k-means will potentially converge faster than with randomized document seeding [22].
Upon clustering, cluster descriptors are extracted; these are the most representative
terms of the documents contained in the corresponding cluster. They help users
understand the categories of the clustered retrieval results. These descriptors are
extracted using the Probabilistic Odds method [22].
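
The seeding scheme can be sketched as follows in Python with NumPy. This is a
simplified illustration rather than the system's implementation: the naive
centroid-linkage HAC, the Euclidean distance, and the function names are all
assumptions made for the example, and descriptor extraction is not shown.

```python
import numpy as np

def hac_labels(X, k):
    """Naive agglomerative clustering (centroid linkage) cut at k clusters."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        cents = np.array([X[c].mean(axis=0) for c in clusters])
        dist = np.linalg.norm(cents[:, None] - cents[None, :], axis=2)
        np.fill_diagonal(dist, np.inf)           # never merge a cluster with itself
        a, b = np.unravel_index(np.argmin(dist), dist.shape)
        a, b = int(min(a, b)), int(max(a, b))
        merged = clusters.pop(b)                 # b > a, so index a stays valid
        clusters[a].extend(merged)
    labels = np.empty(len(X), dtype=int)
    for j, members in enumerate(clusters):
        labels[members] = j
    return labels

def kmeans_seeded(X, seed_labels, k, max_iter=100):
    """k-means whose initial centroids are the means of the seeding clusters."""
    centroids = np.array([X[seed_labels == j].mean(axis=0) for j in range(k)])
    labels = seed_labels.copy()
    for _ in range(max_iter):
        dist = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        new_labels = dist.argmin(axis=1)         # assign each point to nearest centroid
        for j in range(k):
            if np.any(new_labels == j):          # keep old centroid if cluster is empty
                centroids[j] = X[new_labels == j].mean(axis=0)
        if np.array_equal(new_labels, labels):   # no re-assignments: converged
            break
        labels = new_labels
    return labels, centroids

# Hypothetical 2-D document vectors forming two well-separated groups.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
seed = hac_labels(X, k=2)
labels, centroids = kmeans_seeded(X, seed, k=2)
print(labels)
```

In practice the rows of X would be the document vectors in the reduced LSI space;
because k-means starts from the HAC partition instead of random seeds, it typically
needs only a few iterations to settle.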

3.5 Document Classification

A Naïve Bayesian classifier [18] was employed for the classification of user queries
and new documents. Naïve Bayesian classifiers have yielded good results in text
classification [6], [22], and they have been applied successfully to junk e-mail
filtering [23].
After primary retrieval results are clustered, the resulting clusters serve as a training
set for the classifier, so that new documents or queries presented to the classifier when
the system is online are classified as belonging to one of the learned classes.
This allows users to determine the cluster that is most relevant
to their search queries or example documents.
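
A minimal sketch of this step is given below, using cluster assignments as class
labels. The multinomial model with add-one (Laplace) smoothing and the class name
`NaiveBayes` are assumptions chosen for illustration; the exact variant used in our
system is not detailed here.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative sketch)."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: cluster id of each document.
        self.class_docs = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        self.n_docs = len(docs)
        return self

    def predict(self, tokens):
        best_label, best_score = None, -math.inf
        vocab_size = len(self.vocab)
        for label, n_label in self.class_docs.items():
            total = sum(self.term_counts[label].values())
            score = math.log(n_label / self.n_docs)          # log prior
            for t in tokens:
                if t in self.vocab:                          # ignore unseen terms
                    count = self.term_counts[label][t]
                    score += math.log((count + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical clustered retrieval result: two clusters with ids 0 and 1.
docs = [["pottery", "kiln"], ["pottery", "glaze"],
        ["tablet", "cuneiform"], ["tablet", "scribe"]]
clf = NaiveBayes().fit(docs, [0, 0, 1, 1])
print(clf.predict(["pottery", "fragment"]))
print(clf.predict(["cuneiform"]))
```

A query or new document is tokenized and pre-processed in the same way as the
indexed documents before being passed to `predict`, which returns the id of the
most probable cluster.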

4 Results
By using our light Arabic stemmer and refining primarily retrieved documents, we
obtained promising results, where these techniques proved efficient and effective in
Arabic textual information retrieval.
Fig. 3 compares the effects of our stemming technique to those of the light10
stemmer [16] on the performance of document clustering using the k-means algorithm,
with document features projected onto the reduced space of SVD. This experiment was
performed using the Sulaiti dataset [25], which was assembled from newspapers,
magazines, radio, TV and webpages, totaling 411 texts that are manually classified
under 8 different categories. The quality of document clustering was measured using
the F-measure:

$$F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
These results show that our stemmer matches the light10 stemmer in effectiveness
for document clustering. Moreover, Table 1 shows, for three stemmed datasets, that
our proposed stemmer does a better job of reducing the number of features (terms)
to be used in indexing and retrieval: it leaves roughly 22% to 39% fewer terms than
Light10 across the three datasets. This considerably improves the performance of
computationally demanding retrieval systems when stemming is applied. The datasets
used include the Sulaiti dataset [25] in addition to two datasets we assembled from
researchers in archaeology and an online resource for ancient Arabic texts.

Fig. 3. Document clustering performance as a function of (k)

Table 1. Number of Terms Extracted from Arabic Datasets Using Different Stemmers

Dataset     No Stemmer   Light10   New Stemmer
Sulaiti     108209       44329     31198
Hammadeh    22921        10197     7980
Mashkat     656807       222478    135369

Regarding the performance of our proposed retrieval model, we lacked freely available
Arabic datasets that are tailored for evaluating retrieval engines. Therefore, we were
not able to conduct numerical experiments to measure the quality of our Arabic
retrieval engine. However, informal observation and user feedback indicated that the
retrieval results of the primary LSI retrieval engine on indexed Arabic corpora were
good. In addition, the subsequent clustering and classification operations improved the
quality of presentation for the primarily retrieved group of documents, and
assisted users in reaching the required information more rapidly.

5 Conclusions
In this paper, we have presented our work in Arabic information retrieval and
described the architecture of our Archaeological Text Retrieval System. This
architecture was designed after the generic retrieval system architecture, with an
additional layer of functionality that improves the presentation of retrieval results.
Our results showed that our stemming algorithm is highly effective, and that
statistical and probabilistic methods for retrieval and language modeling, such as LSI,
automatic document clustering, and classification, are effective for Arabic textual
information.

References
1. Akritas, G., Malaschonok, G.I.: Applications of Singular-Value Decomposition.
Mathematics and Computers in Simulation 67(1-2), 15–31 (2004)
2. Berkhin, P.: Survey of clustering data mining techniques. Tech. Rep., Accrue Software,
San Jose, CA (2002)
3. Berry, M.W., Dumais, S.T., O'Brien, G.W.: Using Linear Algebra for Intelligent
Information Retrieval. SIAM Review 37(4), 573–595 (1995)
4. Chen, F.G.: Building an Arabic Stemmer for Information Retrieval. In: Proc. Eleventh
Text Retrieval Conference TREC 2002, Gaithersburg, Maryland, USA, pp. 19–22 (2002)
5. Deerwester, S., Dumais, S., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by
Latent Semantic Analysis. Journal of the American Society for Information Science
41(6), 391–407 (1990)
6. Dumais, S., Platt, J., Heckerman, D., Sahami, M.: Inductive learning algorithms and
representations for text categorization. In: 7th ACM International Conference on
Information and Knowledge Management ACM-CIKM 1998, Bethesda, USA, pp. 148–155 (1998)
7. Fox: Lexical Analysis and Stoplists. In: Frakes, W.B., Baeza-Yates, R. (eds.) Information
Retrieval: Data Structures and Algorithms. Prentice Hall, Englewood Cliffs (1992)
8. Frakes, W.B.: Stemming Algorithms. In: Frakes, W.B., Baeza-Yates, R. (eds.) Information
Retrieval: Data Structures and Algorithms. Prentice Hall, Englewood Cliffs (1992)
9. Frakes, B., Baeza-Yates, R.: Information Retrieval: Data Structures and Algorithms.
Prentice-Hall, Englewood Cliffs (1992)
10. Halabi, A.D.I., Keshishian, R., Rehawi, O.: The Archaeological Text Retrieval System.
BSc. thesis, Dept. Artificial Intelligence, Faculty of Informatics, University of Aleppo
(2007)
11. Hearst, M.A., Pedersen, J.O.: Reexamining the Cluster Hypothesis: Scatter/Gather on
Retrieval Results. In: Proc. 19th Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval (SIGIR 1996), Zurich, Switzerland, June 1996,
pp. 76–84 (1996)
12. Hull: Stemming algorithms: A case study for detailed evaluation. Journal of the American
Society for Information Science 47(1), 70–84 (1996)
13. Landauer, T.K., Foltz, P.W., Laham, D.: Introduction to Latent Semantic Analysis.
Discourse Processes 25, 259–284 (1998)
14. Landauer, T.K., Littman, M.L.: A statistical method for language-independent
representation of the topical content of text segments. In: Proc. Eleventh International
Conference: Expert Systems and Their Applications, Avignon, France, vol. 8, pp. 77–85
(May 1991)
15. Larkey, L., Ballesteros, L., Connell, M.E.: Improving stemming for Arabic information
retrieval: light stemming and co-occurrence analysis. In: Proc. 25th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere,
Finland, pp. 275–282 (2002)
16. Larkey, L., Ballesteros, L., Connell, M.: Light Stemming for Arabic Information Retrieval.
In: Soudi, A., van den Bosch, A., Neumann, G. (eds.) Arabic Computational Morphology:
Knowledge-based and Empirical Methods. Series on Text, Speech, and Language
Technology. Kluwer/Springer (2005)

17. Lerman, K.: Document Clustering in Reduced Dimension Vector Space (1999)
(unpublished), http://www.isi.edu/~lerman/papers/papers.html (retrieved on
13-08-2007)
18. Lewis, D.D.: Naive Bayes at forty: The independence assumption in information retrieval.
In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 4–15. Springer,
Heidelberg (1998)
19. Littman, M.L., Dumais, S.T., Landauer, T.K.: Automatic cross-language information
retrieval using latent semantic indexing. In: Grefenstette, G. (ed.) Cross-Language
Information Retrieval, pp. 51–62. Kluwer Academic Publishers, Dordrecht (1998)
20. Littman, M.L., Jiang, F.: A Comparison of Two Corpus-Based Methods for Translingual
Information Retrieval. Tech. Rep. CS-98-11, Duke University, Department of Computer
Science, Durham, NC (June 1998)
21. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval, August
13. Cambridge University Press, Cambridge (2007),
http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html
22. Sahami, M.: Using Machine Learning to Improve Information Access. Ph.D. thesis, Dept.
Computer Science, Stanford University (1999)
23. Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian Approach to Filtering
Junk E-mail. In: Proc. AAAI 1998 Workshop on Learning for Text
Categorization, Madison, Wisconsin, USA, pp. 55–62 (1998)
24. Schütze, H., Silverstein, C.: Projections for efficient document clustering. In: Proc. 20th
Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, Philadelphia, Pennsylvania, USA, pp. 74–81 (1997)
25. Al-Sulaiti, L., Atwell, E.: Designing and Developing a Corpus of Contemporary Arabic.
In: Proc. Sixth TALC Conference, Granada, Spain, p. 92 (2004)