Академический Документы
Профессиональный Документы
Культура Документы
Abstract— A general-purpose data mining model for Arabic has become closer to Indo-European languages in both
texts (Arabic Meaning Extraction through Lexical Resources, lexicon and syntax) [15].
ArMExLeR) is proposed which employs a chained pipeline of To overcome these difficulties, our project capitalizes on
existing public domain and published lexical resources the use of existing Arabic lexical resources that are linked to
(Stanford Parser, WordNet, Arabic WordNet, SUMO, larger, general-purpose resources, by devising specific
AraMorph, A Frequency Dictionary of Arabic) in order to strategies to fill the gaps in these resources. Resources are
extract a weakly hierarchised, single-predicate level, aligned through a pipeline which is fed by the input text and
representation of meaning. This kind of model would be of outputs, after several rings in the chain, a relatively hollow
high impact on the study of the computational analysis of
semantic representation that allows for further data mining
Arabic for there is no such comparable tool for this language,
and will be a challenge for the nature of its specificities. One
operations, thanks to the Suggested Upper Merged Ontology
should, in fact, cope with the unique writing system that is (SUMO) format.
mostly consonant-based and doesn’t always mark vowels Next sections will discuss [II] the tools used in the
explicitly. This is crucial when you want to analyze an Arabic project, [III] the workflow of the system, [IV] an example
corpus for the same consonantal ductus may be read in several derivation, [V] test results and [VI] some conclusions and
ways. suggestions for further developments.
parser and either of these yields a good performance found in their respective tables, the three components are
statistical parsing system. As well as providing an English defined suitable and the word is confirmed as valid.
parser, the Stanford parser can be and has been adapted to Hereafter (Fig. 1), we put in evidence an example of an
work with other languages such as Chinese (based on the Arabic word and analyzed by AraMorph:
Chinese Treebank), German (based on the Negra corpus) and
Arabic (based on the Penn Arabic Treebank). Finally, this WORD NO. 10223: الثوري 17 occurrences
parser has also been used for other languages, such as Italian,
Bulgarian, and Portuguese. UNVOCALIZED TRANSCRIPTION: Al+vwry+
Although the parser provides Stanford Dependencies INPUT STRING: الثوري
output as well as phrase structure trees for English and SOLUTION: ّ ِ َْ
Al+vaworiy~+ ال٭ثوري٭
Chinese, this component has to be implemented for Arabic. vaworiy~_1 []ثور
So, for now, we have an analysis of the sentences that cannot ENGLISH GLOSS: the+revolutionary+
trace the subject or the object of a verb in a sentence, but we
have a reliable parsing of constituents that could be used to POS ANALYSIS:
deepen the analysis with other tools. Al/DET+vaworiy~/ADJ+
The Arabic Parser from Stanford can only cope with non- Figure 1. Excerpt of an AraMorph analysis of Meedan Memory
vocalized texts, the tokenization being based on the Arabic
used in the Penn Arabic Treebank (ATB) and is based on a However, AraMorph presents some problems regarding
whitespace tokenizer. Segmentation is done based on the the analysis of texts types that do not match to ideal text
Buckwalter analyzer (morphology). genre targeted by Buckwalter (newspaper texts and other
The character encoding is set on Universal Character Set Modern Standard Arabic non-literary texts). Indeed, the
(UCS) Transformation Format—8-bit (UTF-8), but it may be program shows three major weaknesses:
changed if needed. The normalization of the text is needed in it does not either fully or sparsely analyze vocalized
order to analyze the text, because otherwise the parser cannot texts;
recognize the Arabic ductus, that is the representation of it does reject many words attested in some textual
consonants and long vowels which are the only obligatory types which are not contained either in the sample of
component of Arabic script, and often the only one actually the text corpora chosen by Buckwalter, or in the
present in texts (auxiliary graphemes, such as short vowels lookup lists of AraMorph;
and consonant reduplication markers, must be deleted). For there is neither any stylistic nor chronological
the part-of-speech (POS) tags, the parser uses an “augmented information in the lookup lists; the same for a lot of
Bies” tag set that uses the Buckwalter morphological transliterated foreign named entities which cannot be
analyzer and links it to a subset of the POS tags used in the found in classical texts and in modern literary texts,
Penn English Treebank (sometimes with slightly different giving rise to a number of false positives.
meanings) [1]. Phrasal categories are the same from the Penn
ATB Guidelines [16]. As mentioned, there is no tool in the In order to overcome these problems, a group of
parser itself that can normalize or segment an Arabic ductus, researchers on linguistics at Roma Tre University had
so one is compelled to employ other tools in order to perform modified the original AraMorph in a new algorithm named
these tasks. “Revised AraMorph” (RAM), within a project of
B. AraMorph automatically analysis of ḥadīṯ corpora (SALAH project) [7].
The modifications, which are implemented to solve the
AraMorph [2] is a program designed to allow the weakness previously listed, are respectively:
morphological analysis for Arabic texts in order to segment
a mechanism which takes into account the vowels
Arabic words in prefixes, stems and suffixes according to the
present in the text in order to reduce ambiguity
following rules:
linked to non-vocalized texts;
the prefix can be 0 to 4 characters in length;
a file with additional stems automatically extracted
the stem can be 1 to infinite characters in length; from Anthony Salomé Arabic-English dictionary (a
the suffix can be 0 to 6 characters in length. work from the end of 19th century encoded in TEI-
compliant XLM format) [11] and with additional
Each possible segmentation is verified by asking the lists of prefixes and suffixes with the relative
software to check if the prefix, the stem and the suffix exist combination tables of most frequent unrecognized
in the embedded dictionary. In fact, the program has three tokens;
tables containing all Arabic prefixes, all Arabic stems and all a mechanism which removes automatically items in
Arabic suffixes, respectively. Indeed, if all three components order to allow them matching to contemporary
are found in these tables, the program checks if their foreign named entities, especially proper names and
morphological categories are listed as compatible pairs in place-names. In the other hand, the items above are
three tables (table AB for prefix and stem compatibility; not included in Salomé’s dictionary (this way Arabic
table AC for prefix and suffix compatibility; table BC for named entities which can be found in Classical texts
stem and suffix compatibility). Finally, if all three pairs are are retained for the analysis).
without anaphoric elements (which are not in the scope of inter-clausal relations (in order to avoid wrong
the current model). interpretations of counterfactuals and other “possible worlds”
The evaluation of the system was performed on the structures) and the development of links to other existing
remaining 196 cases (56% of the original sample). The resources, such as the Arabic version of VerbNet.
analysis was deemed correct if the verb and at least one other
PCs were regarded as properly assigned by at least one of the REFERENCES
evaluators. This choice was motivated by the inherent [1] http://www.ldc.upenn.edu/Catalog/docs/LDC2010T13/atb1-v4.1-
problem in role-assignment caused by the lack of a taglist-conversion-to-PennPOS-forrelease.lisp
dependency module for Arabic in SP (while such a module is [2] T. Buckwalter, Buckwalter Arabic Morphological Analyzer Version
available for English): therefore, the list of arguments and 1.0. Philadelphia:Linguistic Data Consortium, 2002.
their relative order is not yet reliable. Only one agreement [3] T. Buckwalter and D. Parkinson, A Frequency Dictionary of Arabic.
London and New York: Routledge, 2011.
was deemed sufficient because a relatively high degree of
disagreement between annotators has always been noticed [4] S. Green and J. DeNero, “A class-based agreement model for
generating accurately inflected translations”, in ACL ’12,
for WordNet-related semantic projects (such as SemCor). Proceedings of the 50th Annual Meeting of the Association for
Results are summarized in Table I: Computational Linguistics: Long Papers, vol. 1, pp. 146-155.
[5] C. Fellbaum, “WordNet and wordnets”, in: Brown, Keith et al. (eds.),
TABLE I. RESULT SUMMARY Encyclopedia of Language and Linguistics, Second Edition, Oxford:
Elsevier, 2005, pp. 665-670
error rate 60.54%
[6] S. Green and Ch. D. Manning, “Better Arabic parsing: baselines,
precision 64.90% evaluations, and analysis”, in COLING ’10, Proceedings of the 23rd
recall 74.56% International Conference on Computational Linguistics, pp. 394-402.
F measure 1.39 [7] M. Boella, F. R. Romani, A. Al-Raies, C. Solimando, and G.
Lancioni, “The SALAH Project: segmentation and linguistic analysis
Comparing these results with other systems is not easy, of hadīth arabic texts. information retrieval technology lecture notes”
in Computer Science vol. 7097, Springer, Heidelberg, 2011, pp 538-
since the ArMExLeR system evaluation applies to a specific 549.
subset of relations at the end of a relatively complex, [8] Ch. M. Meyer and I. Gurevych, “What psycholinguists know about
automatic pipeline, which is not the case for other systems in chemistry: aligning Wiktionary and WordNet for increased domain
Arabic text data mining. Therefore, we shall defer cross- coverage”, in Proceedings of the 5th International Joint Conference
comparison of our system to further research. on Natural Language Processing, Chiang Mai, Thailand, 2011, pp.
883-892.
VI. CONCLUSIONS AND FURTHER DEVELOPMENTS [9] E. Niemann and I. Gurevych, “The people’s web meets linguistic
knowledge: automatic sense alignment of wikipedia and wordnet”, in:
The ArMExLeR project shows a number of interesting Proceedings of the International Conference on Computational
features, which pave the way to further refinements and Semantics (IWCS), Oxford, United Kingdom, 2011, pp. 205-214.
developments. [10] E. Wolf and I. Gurevych, “Aligning Sense Inventories in Wikipedia
First, the system performs reasonably well, despite some and WordNet”, in: Proceedings of the First Workshop on Automated
shortcomings in some of the elements of the pipeline, which Knowledge Base Construction (AKBC), Grenoble, France, 2010, pp.
shows the feasibility of a predominantly symbolic, rather 24-28.
than purely statistic, approach to content extraction, [11] H. A. Salmoné, An Advanced Learner’s Arabic-English Dictionary.
Beirut: Librairie du Liban, 1889.
especially in the case of a morphologically complex
[12] N. Habash, O. Rambow, and R. Roth, “MADA+TOKAN: A toolkit
language such as Arabic. for Arabic tokenization, diacritization, morphological disambiguation,
Second, a partial syntactic analysis reveals itself to be POS tagging, stemming and lemmatization”, in Proceedings of the
sufficient to extract a reasonable amount of information from 2nd International Conference on Arabic Language Resources and
corpus texts. This is encouraging, since it is expected that a Tools (MEDAR), Cairo, Egypt, 2009, pp. 102-109.
fuller match between syntax and semantics (especially when [13] I. Turki Khemakhem, S. Jamoussi and A. Ben Hamadou, “Integrating
a fuller argument extraction component is developed, which morpho-syntactic features in English-Arabic statistical machine
translation”, in Proceedings of the Second Workshop on Hybrid
includes nominalization, a highly prominent feature in Approaches to Translation”, Sofia, Bulgaria, 2013, pp. 74-81.
Arabic texts) can bring significant improvements. [14] M. Daoud, D. Daoud and Ch. Boitet, “Collaborative construction of
Third, the results of the project demonstrate that a very Arabic lexical resources”, in Proceedings of the 2nd International
complex pipeline of several independent projects can work Conference on Arabic Language Resources and Tools (MEDAR),
provided a consistent way to link chains can be found. This Cairo, Egypt, 2009, pp. 119-124.
stresses the importance of developing links between existing [15] S. Izwaini, “Problems of Arabic Machine Translation: evaluation of
lexical resources in order to capitalize on their three systems”, in Proceedings of the International Conference on the
Challenge of Arabic for NLP/MT, The British Computer Society
interconnection. (BSC), London, 2006, pp. 118-148.
Further developments in the project will include —
[16] Arabic Treebank Guidelines, http://www.ircs.upenn.edu/arabic/
besides optimization, in order to allow researcher for tests on guidelines.html. Accessed October 2013.
larger data sets, and refinements in the evaluation stage, to [17] https://github.com/anastaw/Meedan-Memory. Accessed October
allow a finer assessment of the contribution of the single 2013.
components— strategies for anaphora resolution, analysis of