
A Large Scale Corpus of Gulf Arabic

Salam Khalifa, Nizar Habash, Dana Abdulrahim , Sara Hassan


Computational Approaches to Modeling Language Lab, New York University Abu Dhabi, UAE

University of Bahrain, Bahrain
{salamkhalifa,nizar.habash,sah650}@nyu.edu,darahim@uob.edu.bh

Abstract
Most Arabic natural language processing tools and resources are developed to serve Modern Standard Arabic (MSA), which is the
official written language in the Arab World. Some Dialectal Arabic varieties, notably Egyptian Arabic, have received some attention
lately and have a growing collection of resources that include annotated corpora and morphological analyzers and taggers. Gulf Arabic,
however, lags behind in that respect. In this paper, we present the Gumar Corpus, a large-scale corpus of Gulf Arabic consisting of 110
million words from 1,200 forum novels. We annotate the corpus for sub-dialect information at the document level. We also present
results of a preliminary study in the morphological annotation of Gulf Arabic which includes developing guidelines for a conventional
orthography. The text of the corpus is publicly browsable through a web interface we developed for it.

Keywords: Arabic Dialects, Corpus, Large-Scale, Gulf Arabic

1. Introduction

Most Arabic natural language processing (NLP) tools and resources are developed to serve Modern Standard Arabic (MSA), the official written language in the Arab World. Using such tools to understand and process Dialectal Arabic (DA) is a challenging task because of the phonological and morphological differences between DA and MSA. In addition, there is no standard orthography for DA, which complicates matters further. Some DA varieties, notably Egyptian Arabic, have received some attention lately and have a growing collection of resources that include annotated corpora and morphological analyzers and taggers. Gulf Arabic (GA), broadly defined as the variety of Arabic spoken in the countries of the Gulf Cooperation Council (Bahrain, Kuwait, Oman, Qatar, Saudi Arabia, and the United Arab Emirates), however, lags behind in that respect.
In this paper, we present the Gumar Corpus,1 a large-scale corpus of GA that includes a number of sub-dialects. We also present preliminary results on GA morphological annotation. Building a morphologically annotated GA corpus is a first step towards developing NLP applications for searching, retrieving, machine-translating, and spell-checking GA text, among other applications. The importance of processing and understanding GA text (as with all DA text) is increasing due to the exponential growth of socially generated dialectal content in social media and printed works (Sarnakh, 2014), in addition to existing materials such as folklore and local proverbs that are found scattered on the web.
The rest of this paper is structured as follows. We present some related work in Dialectal Arabic NLP in Section 2. This is followed by a background discussion on GA in Section 3. We then discuss the collection of the corpus and describe its genre in Section 4. We present our preliminary annotation study and evaluate it in Section 5. Finally, we present the Gumar Corpus web interface in Section 6.

1 Gumar قمر /gumEr/ is the word for 'moon' in Gulf Arabic.

2. Related Work

2.1. Dialectal Corpora
There have been many notable efforts on the development of annotated Arabic language corpora (Maamouri and Cieri, 2002; Maamouri et al., 2004; Smrž and Hajič, 2006; Habash and Roth, 2009; Zaghouani et al., 2014). Most contributions, however, targeted MSA, developing annotation guidelines and producing large-scale Arabic Treebanks. These resources were instrumental in pushing the state-of-the-art of Arabic NLP.
Contributions that are specific to DA are limited in size, more scattered and more recent. Some of the earliest and relatively largest efforts have targeted Egyptian Arabic (EGY). They include the CALLHOME Egyptian Arabic (CHE) corpus (Gadalla et al., 1997) and its associated Egyptian Colloquial Arabic Lexicon (ECAL) (Kilany et al., 2002). In addition, there is the YADAC corpus (Al-Sabbagh and Girju, 2012), which was based on dialectal content identification and web harvesting of blogs, micro blogs, and forums of EGY content. Most recently, the Linguistic Data Consortium collected and annotated a sizable EGY corpus (Maamouri et al., 2012b; Maamouri et al., 2012a; Maamouri et al., 2014). Levantine Arabic received less attention, with notable efforts including the Levantine Arabic Treebank (LATB) of Jordanian Arabic (Maamouri et al., 2006) and the Curras corpus of Palestinian Arabic (Jarrar et al., 2014). Efforts on other dialects include corpora for Tunisian Arabic (Masmoudi et al., 2014) and Algerian Arabic (Smaïli et al., 2014). There are also some efforts that targeted multiple dialects, such as the COLABA project (Diab et al., 2010), which annotated dialectal content resources for Egyptian, Iraqi, Levantine, and Moroccan dialects from online weblogs, the Tharwa multi-dialectal lexicon (Diab et al., 2014), the multidialectal parallel Arabic corpus (Bouamor et al., 2014), and the highly dialectal online commentary corpus (Zaidan and Callison-Burch, 2011). Most recently, in this conference proceedings, Al-Shargi et al. (2016) present two morphologically annotated corpora for Moroccan Arabic and Sana'ani Yemeni Arabic.

As far as Gulf Arabic is concerned, Halefom et al. (2013) created an Emirati Arabic Corpus (EAC) consisting of 2 million words of transcribed Emirati TV and radio shows. The corpus was transcribed in broad IPA and translated to English. Morphological and lexical annotations as well as Arabic script annotation were manually done for a small portion of the corpus (around 15,000 words). Furthermore, Ntelitheos and Idrissi (Forthcoming) created an Emirati Arabic Language Acquisition Corpus (EMALAC) consisting of 78,000 words following the style of the widely studied CHILDES collection of corpora (MacWhinney, 2000). Both EAC and EMALAC were created by linguists with the primary purpose of studying the grammatical system of the Emirati Arabic dialect and its development. There is a lot of emphasis in the annotations of these corpora on the phonological and morphosyntactic phenomena of Emirati Arabic. Our Gumar Corpus is differently oriented and designed: text, as opposed to speech, is the starting point, and computational models of GA are our target. Our corpus only includes language created by adult speakers (unlike EMALAC), in a slightly conventionalized novel-like form. Finally, the Gumar Corpus includes texts from a number of the Gulf countries and is not limited to the UAE.
For recent surveys of Arabic resources for NLP, see Zaghouani (2014) and Shoufan and Al-Ameri (2015).

2.2. Dialectal Orthography
Due to the lack of standardized orthography guidelines for DA, along with the phonological differences from MSA and dialectal variations within the dialects themselves, there are many orthographic variations for written DA content. Writers in DA, regardless of the context, are often inconsistent with others and even with themselves when it comes to the written form of a dialect, writing with MSA driven orthography, or phonologically driven orthography in Arabic script or even Latin script (Darwish, 2013; Al-Badrashiny et al., 2014). These orthographic variations make it difficult for computational models to properly identify and reason about the words of a given dialect (Habash et al., 2012b); hence, a conventional form for the orthographic notations is important. Habash et al. (2012b) proposed a Conventional Orthography for Dialectal Arabic (CODA). CODA is designed for the purpose of developing conventional computational models of Arabic dialects in general, which makes it easy to extend to other dialects. Initially, the guidelines of CODA were mainly specific to EGY. Jarrar et al. (2014) extended the existing CODA to cover Palestinian Arabic. Recent work on Tunisian (Zribi et al., 2014), Algerian (Saadane and Habash, 2015) and Maghrebi Arabic (Turki et al., 2016) extended the original version of CODA. We extend CODA to cover Gulf Arabic.

2.3. Arabic Dialect Morphological Modeling
Most of the work that explored morphology in Arabic focused on MSA (Al-Sughaiyer and Al-Kharashi, 2004; Buckwalter, 2004; Graff et al., 2009; Pasha et al., 2014). Contributions to DA morphology analysis are usually based on either extending available MSA tools to cover DA characteristics, as in the work of Abo Bakr et al. (2008) and Salloum and Habash (2011), or modeling DAs directly, without relying on existing MSA contributions (Habash and Rambow, 2006). One of the notable recent contributions for Egyptian Arabic morphological analysis is CALIMA (Habash et al., 2012a). The CALIMA analyzer for EGY and the commonly used SAMA analyzer for MSA (Graff et al., 2009) are central in the functioning of the EGY morphological tagger MADA-ARZ (Habash et al., 2013), and its successor MADAMIRA (Pasha et al., 2014), which supports both MSA and EGY. Eskander et al. (2013) describe a technique for automatic extraction of morphological lexicons from morphologically annotated corpora and demonstrate it on EGY. Al-Shargi et al. (2016) apply the technique of Eskander et al. (2013) and build two morphological analyzers for Moroccan Arabic and Sana'ani Yemeni Arabic. As for GA, we are aware of a single effort on a rule-based stemmer (Abuata and Al-Omari, 2015) that works on sets of words collected online; they compare their results to other well-known MSA stemmers. In this paper we use MADAMIRA-EGY as a starting point for the GA morphological annotation, following the approach taken by Jarrar et al. (2014).

3. Gulf Arabic Dialect
Strictly speaking, Gulf Arabic refers to the linguistic varieties spoken on the western coast of the Arabian Gulf, in Bahrain, Qatar, and the seven Emirates of the UAE (Qafisheh, 1977), as well as in Kuwait and in Al-Hasa, the eastern region of Saudi Arabia (Holes, 1990). Omani, Hijazi, Najdi, and Baharna Arabic, among other additional dialects spoken in the Arabian Peninsula, are usually not included in grammars of Gulf Arabic because they vary considerably in their linguistic features from the set of dialects listed above. In this project, we extend the use of the term Gulf Arabic to include any Arabic variety spoken by the indigenous populations residing in the six countries of the Gulf Cooperation Council: Bahrain (BH), Kuwait (KW), Oman (OM), United Arab Emirates (AE), Qatar (QA), and Kingdom of Saudi Arabia (SA).

2 Phonetic transcription is presented in IPA.

The cultural homogeneity of the Gulf region does not necessarily entail linguistic homogeneity. Indeed, GA dialects differ extensively in their morpho-phonological and lexical features, reflecting a number of geographical and social factors (Holes, 1990) in addition to being influenced by different contact languages at different time periods. A number of linguistic features set GA dialects apart from other dialects spoken in the Arab world. One of the distinguishing phonological features of most GA dialects is maintaining the pharyngealized fricative /zQ /,2 as well as the interdental /T/ and /D/, unlike what happens in other Arabic dialects. Among the most prominent phonological features in GA are the variant pronunciations of the sounds /q/, /dZ/, /S/, and /k/. /q/ may be realized in certain dialects as /g/ as in /ga:l/ 'he said', or as /dZ/ as in /dZIdIr/ 'pot'. /dZ/, on the other hand, may be realized in some varieties as /j/, as in /jImEl/ 'camel'. The palatal /S/ and the velar /k/ may both turn into the alveopalatal /tS/ as in /tSa:j/ 'tea', and /tSEf/ 'palm'. Moreover, the 2nd singular feminine possessive and object pronoun /kI/ retains its phonological form

in certain dialects (e.g. some Saudi dialects) but is realized in some other dialects as /tS/, /S/, or /ts/.
In terms of the morphological features, and as is the case with most spoken Arabic dialects, GA dialects have also lost case inflection (with the exception of some Bedouin dialects, e.g. /bIntIn P@sQ i:lE/ 'respectable girl'). Possession may also be marked by clitics such as /ma:l/ and /ag/ (Holes, 1990), e.g. /lI-kta:b ma:lEl m2r2h/ 'The book of the lady'. Negation is also marked by the particles /mu:/ (and its variants /mUb/, /mUhUb/, and /hUb/) (Holes, 1990). The plural and the dual masculine and feminine forms of verbs and nouns are collapsed into one form in most dialects, but some distinctions are still maintained in certain others. For instance, in some varieties of Saudi, Emirati, and Omani Arabic, verbs and possessive pronouns inflected for the plural feminine are quite distinct from the masculine forms:

/ga:mEt/ 'She stood up' - /ga:mEn/ 'They[3FP] stood up'.
/wEdZhIk/ 'Your[2FS] face' - /w@dZu:hkIn/ 'Your[2FP] faces'.

Additional morphological features include the cliticized /ba/ for future (instead of the standard /sa/ prefix). Alternatively, in some varieties of GA, the motion verb /ra:/ has grammaticalized into a future marker (e.g. Kuwaiti, Bahraini, and some varieties of Saudi Arabic). Less prominently, yet still a distinctive feature in some GA dialects, is the epenthetic /n/ found in the active participle in some varieties of Emirati and Baharna Arabic: /ma:QtQ Inh@m/ 'I've/I'd given them'. Finally, the lexicon of GA consists of standard Arabic cognates that may or may not follow the phonotactics of the respective dialects. Unsurprisingly, many cognate expressions that are highly frequent in a given dialect have reduced in form, such as

/liPay SayP/ 'For which thing' → /le:S/ 'Why'.

As for lexical borrowings in GA dialects, there is an undoubtedly substantial number of lexical items that have been borrowed from various contact languages throughout different historical periods, e.g.,

/QEmbElu:sQ / 'Ambulance' from English.
/hEst/ 'There is' from Persian.
/Pa:lu:/ 'Potatoes' from Hindi.

4. Corpus Description
Corpus Collection Gulf Arabic, just like any other Arabic dialect, has no written convention, nor is it used as a formal means of communication in the media, education or official documents. Hence there are no known go-to resources. A unique genre of written materials that is specifically known to GA is online, anonymous, publicly published long conversational novels. We have found a huge collection of these novels online in one place.3 We automatically downloaded about 1,200 MS Word documents. Usually, such novels are written in lengthy threads that can be found in online forums. The data we got was collected by volunteering forum members into MS Word documents and then published by another member in an organized manner.4

3 www.graaam.com
4 There are no copyright claims by the anonymous writers or organizers; and we do not claim any copyrights to the text. We will make the cleaned up and extended versions of the data fully publicly available.

Corpus Genre The main theme of most of the novels is romantic; but they also include drama and tragedy. The structure of a novel is simple. It starts with a brief introduction that contains the title of the novel, the writer's pen name (no real names are used) and the country of the novel. The introduction is then followed by a prologue that usually contains a small piece of dialectal poetry or a small piece of literary writing, usually in MSA. It also contains a brief description of the novel's characters, though some writers prefer to introduce the characters as their role appears. Then comes the main body of the novel, which is often a dialogue between the characters. There are also some pieces of narration between conversations in either the dialect or MSA. The last part of the novel usually has some "moral" lessons narrated by the writer. Writers tend to ask the audience for positive criticism and opinions and whether they should continue writing more novels or not.
The novels are entirely written in DA except for the parts mentioned above. The dialect of the novel is not necessarily the same as the dialect of the writer. Most of the time the writers remain anonymous under nicknames, though they ask to be credited if the novel is transferred to another forum. Hence some writers are quite famous among the audience. The targeted audience is mainly female teenagers. The nature of publishing the novels is highly interactive and dependent on the activity of the audience. The writer usually ends each "part" in the novel with a teaser and demands participation and encouragement from the audience. Table 1 shows statistics on all the collected text. Words are whitespace tokenized and the counts include punctuation. The number of sentences represents the number of lines. Most of the time, each document represents a single novel; but in a few cases a novel may be split into more than one document.

Gumar Corpus
Words       112,410,688
Sentences     9,335,224
Documents         1,236

Table 1: Statistics on the Gumar Corpus

Corpus Dialects and Dialect Annotation We have annotated the corpus at the document level for the dialect, novel name and writer name of each document. The dialect of the written text was the most challenging to determine. In some documents, the dialect or the country of the writer was explicitly stated; in others, names of cities clearly indicated the country of origin. However, in many cases, further investigation was needed. The GA dialects are closely intertwined, yet when thoroughly observed show evident differences. These differences were observed through common trends in relation to each GA dialect. It is important to point out that the names given to the characters in each story have shown a trend with the dialect used, for example: SA

dialect novels have repeatedly used the names J
fySl5
Dialect Percentage
/fe:sQ El/ Faisal, QK
Q@
YJ. bd Alzyz /QEbdIlQEzi:z/ Ab- SA 60.52

dulaziz and QK trky /tIrkej/ Turky for male characters AE 13.35

KW 5.91
and wsn /wEsEn/ Wasan and
A lmA /lEmE/ Lama, OM 1.13
Xm.' njwd /ndZu:d/ Njood for female characters. On the QA 0.65
other hand the AE dialect novels have used the names BH 0.49
@Q  hzA /hEzza:Q/ Hazza, YK @P zAyd /za:jId/ Zayed

GA (other) 10.03
and Y@ P rAd /ra:SId/ Rashid as male names and k

Arabic (other) 7.93

HSh /IsQ sQ E/ Hessa, ZAJJ
myA /me:TE/Maytha and
 mh /SEmmE/ Shamma for female characters. It is Table 2: Distribution of Dialects across the Gumar Corpus
worth mentioning that in both SA and AE dialect novels the
male names could be related to the leaders of each coun-
try and hence this may be a cultural influence that could cases we marked as GA (other) also. The rest of the cor-
be related to the authors intuition when selecting charac- pus (7.93%) is mostly MSA (original text or translation at-
ter names. The KW dialect novels used the names
PA tempts of existing non Arabic text) and other DA such as
DAry /dQ a:rej/ Dhari,
PA
 mAry /mSa:rej/ Mshari and Egyptian, Iraqi, Levantine, ... etc.
hAJ. SbAH /sQ Uba:/ Subah for male characters while the
  fwzyh 5. Preliminary Investigation into GA
names j.J
. sbyjh /sIbi:tSE/ Sebichah and K
P
/fozIjjE/ Fawziyya were noticed to be prominent for fe-
Annotation
male characters. The second noticed trend is the use of We describe next a pilot study in semi-automatic annotation
Hijri dates in SA dialect novels, which is explained by the of GA. We use the MADAMIRA (Pasha et al., 2014) mor-
countrys official calendar use of the Hijri date. This trend phological tagger (which works in two modes: MSA and
was only noticed in SA dialect novels. Thirdly, the use of EGY) and manually change its output in accordance to the
commonly spoken words that could be directly traced to a orthography and morphology guidelines that are discussed

particular dialect such as Qm 'A hAlHzzh /hElEzzE/ Now next in this section.
and X AK
yA mwd /ja: mQEwwEd/ O man in KW di- Orthography Guidelines GA speakers who write in the
P QK . bzrAn /bIzra:n/ Little kids, @P wrak /wEra:k/
alect, @ dialect produce spontaneous inconsistent spellings that

as in H. Am.' A @P wrAk mA tjAwb /wEra:k ma: tdZa:wIb/ sometimes reflect the phonology of the GA, and other times
Why are you[2MS] not answering? which is traced to SA the words cognate relationship with MSA. We follow the
dialect and AE words such as @YJ
sydA /si:dE/ Straight work of Habash et al. (2012b) for CODA in order to over-

forward, H. Q@ Aqrb /Igr@b/ Come in, please! and A
   g come these inconsistencies. There are the general CODA
xAwqh /xa:Su:gE/ Spoon. The above are trends noticed rules that apply for every dialect, among which is affix
when annotating over a thousand GA novels that helped in spelling. Affix spelling includes the spelling of the Ta
adding efficiency to the task. These trends were noticed Marbuta; if the Ta Marbuta is at the end of the word it

will always be h and not h regardless of the pronuncia-
as the process progressed and selecting a dialect was only
completed once parts of the story were read, alongside facts tion. When inside a word (before a clitic) Ta Marbuta will
become K, (e.g. YK
Yg. k
   PAJ syArh HSh jdydh /sE-
provided by the author, which include providing an insight

to the readers that the story will be written in a particular jja:rEt isQ sQ ah jIdi:dEh/ Hissas car is new, AEPA  J
syArthA
dialect and/or constantly providing details of where events /sajja:retha:/ her car). Following on the discussion about
took place (i.e. city and/or country names). GA dialect properties in section 3., we extend several as-
Following on the annotation effort, we present the distribu- pects of the original CODA. The root consonant mapping
tion of the dialects across the corpus, see Table 2. We have rules are extended to cover the GA pronunciations that are
observed that 92% of the entire corpus is actually written unseen in other dialects, see Table 3. Another aspect is the
in GA with SA being the most dominant and BH the least. spelling of the 2nd person singular feminine pronominal
There is also around 10% that is identified as GA (other) clitic; if spelled differently than the k /kI/ equivalent in
which are the cases of a novel containing a combination MSA, it is mapped to h. j /dZ/, (e.g.   . AJ ktAb /kta:bIS/,
of several GA dialects that is due to multiple writers with K . AJ ktAbts /kta:bIts/, and l'. . AJ ktAbj /kta:bItS/ Your[FS]
different dialects or due to the existence of different char-  
acters in the novel. It was sometimes hard to differenti- book, becomes l' . . AJ ktAbj /kta:bItS/ but not K. AJ ktAbik
ate through the text between the three dialects of OM, QA /kta:bIk/) in CODA. As with the original CODA for EGY,
and AE even with a native speaker annotating, hence these we also maintain a list of exceptional spellings for uniquely
dialectal words. One example in GA is the spelling of the
kAn /ka:n/ was and its variant Ag
perfective verb A . jAn
5
Arabic transliteration is presented in the Habash-Soudi-
/tSa:n/: the perfective verb form is used as in MSA, however
Buckwalter scheme (Habash et al., 2007): (in alphabetical order)
  P   the other variant is considered a modal auxiliary (Brustad,
@ H. H H h. h p X XP
2000). Both spellings are kept due to the difference in their
b t j H x d r z s S D T D f q k l m n h w y usage despite the fact that they share the same origin. An-
 
and the additional symbols: Z, @, A @, A @, w ' , y Z', h , . other example is the the negation particles I. mb /mUb/

'not' and mAnyb /mEni:b/ 'I'm not', both of which have a number of non-CODA variants such as mwb and mnyb, respectively. The complete guidelines for the GA CODA will be available separately as a technical report. An example of the application of several CODA rules is presented in Table 4.

CODA   Pron. variation        Example
q      /q/ or /g/ or /dZ/     qidAm /dZIdda:m/ 'Front'; qlm /gElQ Em/ 'pen(cil)'
k      /k/ or /tS/ or /ts/    kbd /tSEbd/ 'Liver'
j      /dZ/ or /j/            jls /jIlas/ 'He sat'
š      /S/ or /tS/            šAy /tSa:j/ 'Tea'

Table 3: GA root consonants mapping rules

Morphology Guidelines For every input word MADAMIRA produces a list of analyses specifying every possible morphological interpretation of that word, covering all morphological features of the word (diacritization, part-of-speech (POS), lemma, and 13 inflectional and clitic features). MADAMIRA then applies a set of models (support vector machines and N-gram language models) to produce a prediction, per word in-context, for different morphological features, such as POS, lemma, gender, number or person. A ranking component scores the analyses produced by the morphological analyzer using a tuned weighted sum of matches with the predicted features. The top-scoring analysis is chosen as the predicted interpretation for that word in context (Pasha et al., 2014). We follow an approach similar to the one used to morphologically annotate both the EGY and LEV corpora. We select the following set of features, with their initial values taken from the output of MADAMIRA-EGY on a given GA text, to annotate:

• Word orthography We follow the previously discussed orthography guidelines.
• Morphemic tokenization A word is split into its morphemes and stem.
• Part of Speech We use the MADAMIRA POS tag set.
• CATiB 6 POS Except the tag for passive verbs (Habash and Roth, 2009).
• Lemma Diacritized form of the lemma.
• English gloss The English translation of the lemma.

Besides the above guidelines, there are cases of erroneous merging and splitting of words that get no analysis from the automatic annotation. For merged words, we place a # symbol where a split should happen; this fix propagates through all the other annotated features. In the case where there is a split, we place the # symbol at the end of the first split part to indicate the merging position.
Table 5 shows an annotation example following the above guidelines. Green and red shaded cells indicate a change.

Evaluation We conduct an evaluation of the quality of automatic morphological annotation tools (taggers) on this corpus to assess the amount of effort needed to manually annotate it. Following the annotation guidelines discussed above, we manually annotated around 4K words from four different novels, with the goal of capturing different dialects, styles of writing, etc. An example of an annotated sentence is shown in Table 5. As a preliminary experiment, we investigated the frequency of out-of-vocabulary (OOV) words in both systems of MADAMIRA: MSA and EGY. The Egyptian model OOV rate (5.6%) was about 40% lower than that of MSA (9.3%), suggesting it is better to work with Egyptian as the base system for manual annotation.

Feature   MADAMIRA-MSA   MADAMIRA-EGY
CODA      83.81          88.34
Morph     76.16          83.62
POS       72.37          80.39
CATiB     76.28          81.51
Lemma     64.03          77.02

Table 6: Results on evaluation of our Gold annotation against the output of MADAMIRA in both modes: MSA and EGY

We evaluated using the accuracy measure for word orthography, morphemic tokenization, POS, CATiB POS and lemma against the output of both models of MADAMIRA: MSA and EGY. Table 6 shows the results of the evaluation for all words. These numbers allow us to assess the basic quality of these tools on GA. As expected, MADAMIRA-EGY outperforms MADAMIRA-MSA by between 4 and 13% absolute on different metrics, confirming that it is better to use it as a baseline. This is similar to results reported by Jarrar et al. (2014) on Palestinian Arabic.
Error Analysis We manually investigated four sets of 100 words from different parts of the MADAMIRA-EGY annotated sub-corpus (400 words in total, with an average of 30 words per set containing at least one error). 52.1% of the errors are likely due to the wrong assignment of POS, especially in proper nouns that look like nouns or adjectives. Another major source of error is the lemma (18.2%). Out-of-vocabulary related errors account for 16.5%; these happen for two main reasons: either the word has never been seen before, or it contains a typo. Finally, errors that come from a mistake in merging or splitting word tokens, typos and tokenization together make up around 13%.

6. Gumar Interface
Following the collection of the corpus, we created a simple online interface that is specific for searching the corpus.6 The entire text of the corpus is stored in a relational database in an optimized manner. The lookup of the data is a simple search query that matches the user input to either a word token, lemma or stem form. Through the website interface, the rows of results are displayed to the user including the full context, word analysis that includes the POS,

6 http://camel.abudhabi.nyu.edu/gumar/

Example 1
  Raw      . . . Asm hAlHtsy mnts yAwylts . . .
  CODA     . . . Asm hAlHky mnj yAwylj . . .
  English  [If] I hear this talk from you [again][,] you will suffer
Example 2
  Raw      . . . s Ald jAhz?
  CODA     . . . s AldA jAhz?
  English  . . . Is lunch ready?
Example 3
  Raw      sArh: mnyb Syrrwnh AnA AllHyn fy AljAmh
  CODA     sArh: mAnyb Syrwnh AnA AlHyn fy AljAmh
  English  Sarah: I'm not a child, I'm now in university.

Table 4: Example of application of CODA rules
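
The exception list mentioned in the orthography guidelines can be pictured as a plain lookup table. The following is a minimal sketch, not the authors' implementation: the name `CODA_EXCEPTIONS` and the function `apply_exceptions` are hypothetical, and the table holds only the two negation-particle pairs named in the text (mwb and mnyb as non-CODA variants of mb and mAnyb). Tokens are in the same transliteration used in Table 4.

```python
# Hypothetical exception list: non-CODA variant -> CODA spelling.
# Only the two pairs named in the orthography guidelines are listed here;
# a real list would be much longer.
CODA_EXCEPTIONS = {
    "mwb": "mb",       # 'not'
    "mnyb": "mAnyb",   # "I'm not"
}


def apply_exceptions(tokens, exceptions=CODA_EXCEPTIONS):
    """Replace tokens found in the exception list; leave all others unchanged."""
    return [exceptions.get(token, token) for token in tokens]


if __name__ == "__main__":
    # kAn and jAn are both kept, as the guidelines keep both spellings.
    print(apply_exceptions(["sArh:", "mnyb", "kAn", "jAn"]))
```

A lookup of this kind only covers whole-word exceptions; the root consonant mappings of Table 3 depend on the word's MSA cognate and cannot be expressed as a simple string replacement.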

MADAMIRA-EGY
Raw CODA Morph POS CATiB 6 Lemma Gloss
XAK
P zyAd zyAd zyAd noun_prop PROP ziyAd Ziad
: : : : punc PNX : :
ZA
lysA lysA lysA noun NOM aloyas valiant
ZA
mysA mAysA mA+y+sA verb VRB asA be_harmed
JK@
PB AntwlAzm NOAN NOAN NOAN NOAN NOAN NOAN
. Am.'
K tjAbwn tjAbwn t+jAb+wn verb VRB ajAb be_answered
lY lY prep PRT alaY on
kl kl kl noun_quant NOM kul all

@ Aly Ally Ally pron_rel NOM Alliy which
K
yqwlkm yqwlkm y+qwl+km verb VRB qAl said

Manual Annotation
Raw CODA Morph POS CATiB 6 Lemma Gloss
XAK
P zyAd zyAd zyAd noun_prop PROP ziyAd Ziad
: : : : punc PNX : :
ZA
lysA lysA lysA noun_prop PROP laysaA Laysaa
ZA
mysA mysA mysA noun_prop PROP MayosaA Maysaa
JK@
PB AntwlAzm Antw#lAzm Antw#lAzm pron#noun NOM#NOM Antw#lAzim you#necessary
. Am.'
K tjAbwn tjAwbwn t+jAwb+wn verb VRB jAwab comply
lY lY prep PRT alaY on
kl kl kl noun_quant NOM kul all

@ Aly Ally Ally pron_rel NOM Alliy which
K
yqwlkm yqwlh#lkm y+qwl+h#l+km verb#prep VRB#PRT qAl#la said#to

Table 5: Example of manual annotation following the orthography and morphology guidelines. Columns represent features
to be annotated and rows represent words. NOAN means that no analysis was given automatically.
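
The `#` and `+` conventions visible in the manual annotation of Table 5 can be read mechanically. The sketch below is an illustration under assumptions: the function name `split_units` is hypothetical, and it handles only the merged-word case, where `#` marks the point at which one raw token should be split and the same marker appears in the Morph and POS columns. The trailing-`#` convention for wrongly split words is not covered.

```python
def split_units(morph, pos):
    """Pair each '#'-separated unit of a Morph cell with its POS tag.

    Within a unit, '+' separates morphemes (prefixes, stem, suffixes).
    """
    morph_units = morph.split("#")
    pos_units = pos.split("#")
    if len(morph_units) != len(pos_units):
        raise ValueError("Morph and POS cells disagree on the number of '#' units")
    return [(unit.split("+"), tag) for unit, tag in zip(morph_units, pos_units)]


if __name__ == "__main__":
    # Last row of the manual annotation in Table 5.
    for morphemes, tag in split_units("y+qwl+h#l+km", "verb#prep"):
        print(tag, morphemes)
```

Checking that both cells contain the same number of `#` units is a cheap consistency test for the requirement that the split marker carries through all annotated features.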

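The accuracy measure behind Table 6 compares, feature by feature, the gold annotation of each word with the tagger's output. A minimal sketch follows, assuming each word is stored as a dictionary keyed by feature name; the names `FEATURES`, `GOLD`, `AUTO` and `feature_accuracy` are hypothetical, and the four toy rows are loosely adapted from Table 5, so the resulting figures are illustrative only and are not the paper's results.

```python
FEATURES = ("coda", "morph", "pos", "catib6", "lemma")


def feature_accuracy(gold, predicted, features=FEATURES):
    """Return the percentage of words whose value matches the gold, per feature."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned word by word")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sample")
    return {
        name: 100.0 * sum(g[name] == p[name] for g, p in zip(gold, predicted)) / len(gold)
        for name in features
    }


def row(coda, morph, pos, catib6, lemma):
    """Build one annotated word as a feature dictionary."""
    return dict(zip(FEATURES, (coda, morph, pos, catib6, lemma)))


# Toy sample (four words), loosely adapted from Table 5.
GOLD = [
    row("zyAd", "zyAd", "noun_prop", "PROP", "ziyAd"),
    row("mysA", "mysA", "noun_prop", "PROP", "MayosaA"),
    row("tjAwbwn", "t+jAwb+wn", "verb", "VRB", "jAwab"),
    row("kl", "kl", "noun_quant", "NOM", "kul"),
]
AUTO = [
    row("zyAd", "zyAd", "noun_prop", "PROP", "ziyAd"),
    row("mAysA", "mA+y+sA", "verb", "VRB", "asA"),
    row("tjAbwn", "t+jAb+wn", "verb", "VRB", "ajAb"),
    row("kl", "kl", "noun_quant", "NOM", "kul"),
]

if __name__ == "__main__":
    for name, value in feature_accuracy(GOLD, AUTO).items():
        print(f"{name}: {value:.2f}")
```

Exact string match per feature is the simplest reading of "accuracy" here; whether the original evaluation normalized diacritics or treated `#`-split words specially is not stated in the text.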
lemma, stem and gloss entries in addition to the information about the novel the word belongs to. See Figure 1.

7. Conclusion and Future Work
We collected the Gumar Corpus, which consists of 110 million words from 1,200 forum novels. We annotated the corpus for sub-dialect information at the document level, in addition to the name of each novel and of its writer. We also performed a preliminary investigation on the annotation of GA text. As an initial experiment we annotated around 4K words from four different novels using proposed orthography and morphology guidelines that followed previous efforts. We compared our gold annotations to the automatic annotations provided by MADAMIRA in both its MSA and EGY modes. The evaluation of the accuracy suggests that using MADAMIRA-EGY automatic annotations as a starting point for manual annotation of GA speeds up the process. We plan to semi-automatically annotate the corpus and include a careful manual check of a large portion of it (1M words). We are also looking forward to building a morphological analyzer for GA. We also plan to use the Gumar Corpus dialect annotations for some NLP tasks such as dialect identification. We will make this corpus and its annotations publicly available.

8. Acknowledgments
We would like to thank Dimitrios Ntelitheos for helpful discussions. We would also like to thank Mustafa Jarrar and

Figure 1: Online web interface for browsing Gumar
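
The lookup described for the interface, a query matched against the word token, lemma or stem stored in a relational database, can be sketched with Python's built-in `sqlite3` module. Everything about the schema is an assumption made for illustration: the table name `words`, its columns, the function names and the sample rows (including the stem values) are hypothetical and do not describe the actual Gumar database.

```python
import sqlite3

# Hypothetical schema: one row per corpus word, with its analysis and context.
SCHEMA = """
CREATE TABLE words (
    token TEXT, lemma TEXT, stem TEXT, pos TEXT,
    gloss TEXT, novel TEXT, context TEXT
)
"""


def build_demo_db():
    """Create an in-memory database holding two toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rows = [
        ("yqwlkm", "qAl", "qwl", "verb", "said", "novel_0001", "Ally yqwlkm"),
        ("kl", "kul", "kl", "noun_quant", "all", "novel_0001", "kl Ally"),
    ]
    conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def search(conn, query):
    """Return rows whose token, lemma or stem equals the query string."""
    sql = (
        "SELECT token, lemma, stem, pos, gloss, novel, context "
        "FROM words WHERE token = ? OR lemma = ? OR stem = ?"
    )
    return conn.execute(sql, (query, query, query)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for result in search(conn, "qAl"):
        print(result)
```

Passing the user input as bound parameters rather than formatting it into the SQL string keeps the query safe against injection, which matters for a publicly browsable interface.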

Rami Asia for sharing their set up of the Curras database Diab, M., Habash, N., Rambow, O., AlTantawy, M., and
browsing website. Benajiba, Y. (2010). Colaba: Arabic dialect annotation
and processing. In Proceedings of the LREC Workshop
9. Bibliographical References for Language Resources (LRs) and Human Language
Technologies (HLT) for Semitic Languages: Status, Up-
Abo Bakr, H., Shaalan, K., and Ziedan, I. (2008). A Hy- dates, and Prospects.
brid Approach for Converting Written Egyptian Collo- Diab, M. T., Al-Badrashiny, M., Aminian, M., Attia, M.,
quial Dialect into Diacritized Arabic. In The 6th In- Elfardy, H., Habash, N., Hawwari, A., Salloum, W.,
ternational Conference on Informatics and Systems, IN- Dasigi, P., and Eskander, R. (2014). Tharwa: A large
FOS2008. Cairo University. scale dialectal arabic-standard arabic-english lexicon. In
Abuata, B. and Al-Omari, A. (2015). A rule-based stem- LREC, pages 37823789.
mer for Arabic Gulf Dialect. Journal of King Saud Uni-
Eskander, R., Habash, N., and Rambow, O. (2013).
versity - Computer and Information Sciences, 27(2):104
Automatic Extraction of Morphological Lexicons from
112.
Morphologically Annotated Corpora. In Proceedings of
Al-Badrashiny, M., Eskander, R., Habash, N., and Ram- tenth Conference on Empirical Methods in Natural Lan-
bow, O. (2014). Automatic Transliteration of Roman- guage Processing.
ized Dialectal Arabic. CoNLL-2014, page 30.
Gadalla, H., Kilany, H., Arram, H., Yacoub, A., El-
Al-Sabbagh, R. and Girju, R. (2012). A supervised POS
Habashi, A., Shalaby, A., Karins, K., Rowson, E., Mac-
tagger for written Arabic social networking corpora.
Intyre, R., Kingsbury, P., Graff, D., and McLemore, C.
In Jeremy Jancsary, editor, Proceedings of KONVENS
(1997). CALLHOME Egyptian Arabic Transcripts. In
2012, pages 3952. GAI, September. Main track: oral
Linguistic Data Consortium, Philadelphia.
presentations.
Graff, D., Maamouri, M., Bouziri, B., Krouna, S., Kulick,
Al-Shargi, F., Kaplan, A., Eskander, R., Habash, N., and
S., and Buckwalter, T. (2009). Standard Arabic Morpho-
Rambow, O. (2016). A Morphologically Annotated
logical Analyzer (SAMA) Version 3.1. Linguistic Data
Corpus and a Morphological Analyzer for Moroccan and
Consortium LDC2009E73.
Sanaani Yemeni Arabic. In Proceedings of the Interna-
tional Conference on Language Resources and Evalua- Habash, N. and Rambow, O. (2006). MAGEAD: A Mor-
tion (LREC), Portoro, Slovenia. phological Analyzer and Generator for the Arabic Di-
Al-Sughaiyer, I. A. and Al-Kharashi, I. A. (2004). Ara- alects. In Proceedings of the 44th Meeting of the Associ-
bic morphological analysis techniques: A comprehen- ation for Computational Linguistics (ACL06), Sydney,
sive survey. Journal of the American Society for Infor- Australia.
mation Science and Technology, 55(3):189213. Habash, N. and Roth, R. (2009). CATiB: The Columbia
Bouamor, H., Habash, N., and Oflazer, K. (2014). A multi- Arabic Treebank. In Proceedings of the ACL-IJCNLP
dialectal parallel corpus of arabic. In Proceedings of the 2009 Conference Short Papers, pages 221224, Suntec,
Ninth International Conference on Language Resources Singapore.
and Evaluation (LREC-2014). European Language Re- Habash, N., Soudi, A., and Buckwalter, T. (2007). On Ara-
sources Association (ELRA). bic Transliteration. In A. van den Bosch et al., editors,
Brustad, K. (2000). The Syntax of Spoken Arabic: A Arabic Computational Morphology: Knowledge-based
Comparative Study of Moroccan, Egyptian, Syrian, and and Empirical Methods. Springer.
Kuwaiti Dialects. Georgetown University Press. Habash, N., Eskander, R., and Hawwari, A. (2012a).
Buckwalter, T. (2004). Buckwalter Arabic Morpho- A Morphological Analyzer for Egyptian Arabic. In
logical Analyzer Version 2.0. LDC catalog number NAACL-HLT 2012 Workshop on Computational Mor-
LDC2004L02, ISBN 1-58563-324-0. phology and Phonology (SIGMORPHON2012), pages
Darwish, K. (2013). Arabizi Detection and Conversion to 19.
Arabic. CoRR. Habash, N., Diab, M. T., and Rambow, O. (2012b). Con-

4288
ventional Orthography for Dialectal Arabic. In LREC, Growth in Child Emirati Arabic. Hamid Ouali (ed.) Per-
pages 711718. spectives on Arabic Linguistics 29.
Habash, N., Roth, R., Rambow, O., Eskander, R., and Pasha, A., Al-Badrashiny, M., Kholy, A. E., Eskander, R.,
Tomeh, N. (2013). Morphological Analysis and Dis- Diab, M., Habash, N., Pooleery, M., Rambow, O., and
ambiguation for Dialectal Arabic. In Proceedings of the Roth, R. (2014). MADAMIRA: A Fast, Comprehensive
2013 Conference of NAACL-HLT, Atlanta, GA. Tool for Morphological Analysis and Disambiguation of
Halefom, G., Leung, T., and Ntelitheos, D. (2013). A cor- Arabic. In In Proceedings of LREC, Reykjavik, Iceland.
pus of Emirati Arabic. Technical Report NRF Grant (31 Qafisheh, H. A. (1977). A Short Reference Grammar of
H001), United Arab Emirates University. Gulf Arabic.
Holes, C. (1990). Gulf Arabic. Psychology Press. Saadane, H. and Habash, N. (2015). A Conventional
Orthography for Algerian Arabic. In ANLP Workshop
Jarrar, M., Habash, N., Akra, D., and Zalmout, N. (2014).
2015, page 69.
Building a Corpus for Palestinian Arabic: a Preliminary
Salloum, W. and Habash, N. (2011). Dialectal to Standard
Study. ANLP 2014, page 18.
Arabic Paraphrasing to Improve Arabic-English Statis-
Kilany, H., Gadalla, H., Arram, H., Yacoub, A., El- tical Machine Translation. In Proceedings of the First
Habashi, A., and McLemore, C. (2002). Egyp- Workshop on Algorithms and Resources for Modelling
tian Colloquial Arabic Lexicon. LDC catalog number of Dialects and Language Varieties, pages 1021, Edin-
LDC99L22. burgh, Scotland.
Maamouri, M. and Cieri, C. (2002). Resources for Arabic Sarnakh. (2014). Sukayk 1: riwayah emaratiyya
Natural Language Processing. In International Sympo- mah.aliyyah sakhirah. Midad Publishing & Distribution.
sium on Processing Arabic, volume 1. Shoufan, A. and Al-Ameri, S. (2015). Natural language
Maamouri, M., Bies, A., Buckwalter, T., and Mekki, W. processing for dialectical arabic: A survey. In ANLP
(2004). The Penn Arabic Treebank: Building a Large- Workshop 2015, page 36.
Scale Annotated Arabic Corpus. In NEMLAR Confer- Smali, K., Abbas, M., Meftouh, K., and Harrat, S. (2014).
ence on Arabic Language Resources and Tools, pages Building resources for Algerian Arabic dialects. In 15th
102109, Cairo, Egypt. Annual Conference of the International Communication
Maamouri, M., Bies, A., Buckwalter, T., Diab, M., Habash, Association Interspeech.
N., Rambow, O., and Tabessi, D. (2006). Developing Smr, O. and Hajic, J. (2006). The Other Arabic Treebank:
and Using a Pilot Dialectal Arabic Treebank. In Pro- Prague Dependencies and Functions. In Ali Farghaly,
ceedings of the Fifth International Conference on Lan- editor, Arabic Computational Linguistics: Current Im-
guage Resources and Evaluation, LRECAZ06. plementations. CSLI Publications.
Maamouri, M., Bies, A., Kulick, S., Tabessi, D., Turki, H., Adel, E., Daouda, T., and Regragui, N. (2016).
and Krouna, S. (2012a). Egyptian Arabic Tree- A Conventional Orthography for Maghrebi Arabic. In
bank DF Parts 1-8 V2.0 - LDC catalog num- Proceedings of the International Conference on Lan-
bers LDC2012E93, LDC2012E98, LDC2012E89, guage Resources and Evaluation (LREC), Portoro,
LDC2012E99, LDC2012E107, LDC2012E125, Slovenia.
LDC2013E12, LDC2013E21. Zaghouani, W., Mohit, B., Habash, N., Obeid, O., Tomeh,
Maamouri, M., Krouna, S., Tabessi, D., Hamrouni, N., and N., Rozovskaya, A., Farra, N., Alkuhlani, S., and
Habash, N. (2012b). Egyptian Arabic Morphological Oflazer, K. (2014). Large scale Arabic error annotation:
Annotation Guidelines. Guidelines and framework. In International Conference
on Language Resources and Evaluation (LREC 2014).
Maamouri, M., Bies, A., Kulick, S., Ciul, M., Habash, N.,
Zaghouani, W. (2014). Critical survey of the freely avail-
and Eskander, R. (2014). Developing an Egyptian Ara-
able arabic corpora. In Proceedings of the Workshop
bic Treebank: Impact of Dialectal Morphology on An-
on Free/Open-Source Arabic Corpora and Corpora Pro-
notation and Tool Development. In Proceedings of the
cessing Tools, LREC, pages 18.
Ninth International Conference on Language Resources
and Evaluation (LREC-2014). European Language Re- Zaidan, O. F. and Callison-Burch, C. (2011). The Arabic
sources Association (ELRA). online commentary dataset: an annotated dataset of in-
formal arabic with high dialectal content. In Proceed-
MacWhinney, B. (2000). The childes project: Tools for
ings of the 49th Annual Meeting of the Association for
analyzing talk: Volume i: Transcription format and pro-
Computational Linguistics: Human Language Technolo-
grams, volume ii: The database. Computational Linguis-
gies: short papers-Volume 2, pages 3741. Association
tics, 26(4):657657.
for Computational Linguistics.
Masmoudi, A., Ellouze Khmekhem, M., Esteve, Y., Zribi, I., Boujelbane, R., Masmoudi, A., Ellouze, M.,
Hadrich Belguith, L., and Habash, N. (2014). A cor- Belguith, L., and Habash, N. (2014). A Conventional
pus and phonetic dictionary for tunisian arabic speech Orthography for Tunisian Arabic. In Proceedings of
recognition. In Proceedings of the Ninth International the Language Resources and Evaluation Conference
Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland.
(LREC-2014). European Language Resources Associa-
tion (ELRA).
Ntelitheos, D. and Idrissi, A. (Forthcoming). Language

4289
