Abstract

This work addresses the computation of the gender of nouns in Urdu text, a problem that falls under morphology [1]. The system developed for this purpose takes Urdu text as input and processes it to identify and/or convert the nouns used in that text, using a system-generated database of Urdu nouns. If a noun is not found in the system’s database, it is looked up in an Urdu noun dictionary, which contains information about different nouns. The system then applies the gender conversion rules to the noun if it is animate [2, 3, 4]; otherwise its context is analyzed to find its correct gender. The system is designed to help people who need to know the correct gender of Urdu nouns for purposes such as automatic part-of-speech tagging. Further, the system can also be used in Urdu natural language systems with respect to nouns.

1. Introduction

This research paper is about the computational [...] name of the noun, its opposite gender and whether it is a living noun or not. The second important component of the system is an Urdu noun dictionary, which contains the most frequently used nouns along with some of their characteristics. The third part of the system is a rule base, constructed to compute the opposite gender of animate nouns. The corpus is another component of the system, as shown in Figure 1.1. A corpus is a large and structured set of texts, usually electronically stored and processed [6]. It can be used for statistical analysis, for checking occurrences or for validating linguistic rules in a specific domain. Here, it is used to find the context information of nouns.

The rest of the paper is organized as follows. Section 2 discusses the major system components along with the sources from which the data was acquired and the formulation of rules. Section 3 gives the details of the developed algorithm, its implementation and the flow of information inside the system. Section 4 covers the evaluation of the software. Section 5 concludes the paper. The detailed architecture of the system can be understood from Figure 1.1:

[Figure 1.1: system architecture. Recoverable label: "List of inanimate nouns along with gender"]
Proceedings of the Conference on Language & Technology 2009
namely the database, the dictionary, the corpus and the rule base.

2.1. Database

The database was included in the system to provide fast access to information: whenever the system tries to compute the gender of a noun, it first checks the database for the relevant information before trying the other options. The database was developed using Microsoft SQL Server 2000. It consists of a table of nouns, which is updated by the system after each successful iteration. For each noun the table stores a reference identifier, name, type, gender and whether the noun is animate or inanimate.

The system takes its input in the form of Urdu text and processes the input file word by word. A word is read from the input file and searched in the database, on the assumption that it is a noun. If the word matches an entry in the database, it is temporarily copied to a file for later use, i.e. for display to the system user. If the word is not found in the database, the system tries the other options to compute or find its gender. Finally, the word is stored in the database for later use.

2.2. Dictionary

The online Urdu Dictionary [7] hosted by the Center for Research in Urdu Language Processing (CRULP), National University of Computer and Emerging Sciences, Lahore, was accessed for the most commonly used nouns and their corresponding morphological information. The required information was downloaded in HTML form and then converted into XML. The XML data was rearranged locally for efficient access using Microsoft SQL Server. The dictionary consists of a table with attributes such as reference identifier, word, gender and whether the noun is animate or inanimate.

2.3. Corpus

The third component of the system is an Urdu corpus, which was organized in a database using Microsoft SQL Server 2000. The corpus mainly contains news and current affairs text, mostly acquired from Urdu websites in Unicode form. This corpus can be used to compute the gender of a noun if it is not available in the dictionary. Access to the corpus is bi-directional, i.e. information can be retrieved from and added to the corpus: whenever the system finds new data, it is added to the corpus for later use.

If a particular noun is not found in the dictionary, the system uses the context of that noun in the corpus to find its gender. The context information lets the system decide whether a noun is masculine or feminine. The corpus can also be accessed for those animate nouns whose conversion to the opposite gender cannot be achieved via the rule base. Further, a website about Patras Bokhari’s [8] work was accessed for data collection; this data was used to check the system’s performance.

2.4. Rule base

The rule base is the last component of the system. It contains rules for the conversion of animate nouns to their opposite gender. These rules were developed during the literature study. The system uses the rule base when an animate noun is found in the dictionary. The rules were developed with regard to the frequency of conversion patterns: since nouns follow particular patterns of conversion from one gender to the other, rules were written for the most commonly observed patterns. A noun may nevertheless be converted incorrectly to its opposite gender by the system. For such cases the system includes an evaluation process, implemented through an analysis algorithm. This algorithm validates the conversion of nouns by cross-checking the output against the corpus, i.e. by inspecting the context of nouns that were incorrectly converted.

An appendix at the end lists the patterns frequently found in the gender conversion process. Based on the observations given in the appendix, the following rules were defined in the rule base. The “0” represents the null character, i.e. the absence of a gender inflection marker.

Rule No. 1.
If the last character of the masculine is “ا/ﮦ/0”, then replace “ا/ﮦ/0” with “/” to form the feminine. This rule was derived from observations 1-4 while converting masculine nouns into feminine ones.

Rule No. 2.
If the last character of the feminine is “/”, then replace “/” with “ا/ﮦ/0” to form the masculine. This rule was derived from observations 1-4 while converting feminine nouns into masculine ones.

Rule No. 3.
If the last character of the feminine is “ﮦ” (combined or separated), then remove “ﮦ” from the feminine to form
masculine. This rule was derived from observations 1, 5 and 6.

Rule No. 4.
If the last character of the masculine is not “ا/ﮦ”, then add “ﮦ” (combined or separated) to the masculine to form the feminine. This rule was derived from observations 1, 5 and 6.

Rule No. 5.
If the last character of the feminine is “ن”, then replace “ن” with “0/ ا/ﮦ/” to form the masculine. This rule was derived from observations 7-10.

Rule No. 6.
If the last three or last two characters of the feminine are “ ا/ ”, then replace “ ا/ ” with “ن/” to form the masculine. This rule was derived from observation No. 11.

Rule No. 7.
If the last character of the masculine is “ن/”, then replace “ن/” with “ا” to form the feminine. This rule was derived from observation No. 11.

Rule No. 8.
If the last three characters of the feminine are “ ا”, then remove them to form the masculine. This rule was derived from observation No. 12.

Rule No. 9.
If the last two characters of the feminine are “ ”, then replace them with “0” to form the masculine. This rule was derived from observation No. 13.

No regular pattern was found in observations 14 to 16, so the gender of these nouns cannot be computed through a rule. The information about all such nouns was stored in the noun dictionary.

Consider the sentences in Table 2.1, which show the masculine and feminine gender of inanimate nouns with the help of the context they are used in. Here nouns are used only when they are possessed by someone.
Table 2.1
The above table contains some gender markers (boldface in the third column) which are very helpful in the gender recognition process. These markers are used by the system to identify the gender of nouns. There is another issue: sometimes a noun’s context gives no information about its gender. For example:

(6) کتاب اور قلم لاؤ
Kitāb aur qalam lāo
Book and pen bring
“Bring book and pen.”

In example (6), the two nouns “قلم” (pen) and “کتاب” (book) are used in a way that makes gender identification impossible from the context information. To handle such situations, the system was designed to find multiple occurrences of such nouns with varying context information in the corpus. After the gender has been decided from the corpus, the resulting information is stored in the database. As the last option, if the system is not able to find or compute the gender of a noun, expert users are given the option to store the noun’s gender in the database interactively.
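The lookup order described in this section (database, then dictionary, then corpus context, then the expert user) can be illustrated with a minimal Python sketch. It is not the authors' implementation, which used Microsoft SQL Server and Visual C++. The function names, the in-memory dictionaries standing in for the SQL tables, and the marker sets are all assumptions made for illustration. In particular, the markers of Table 2.1 are not reproduced here; the sketch assumes the possessive forms کا/میرا (masculine) and کی/میری (feminine), which agree in gender with the noun that follows them.

```python
from collections import Counter

# Assumed marker sets (hypothetical stand-ins for the markers of Table 2.1):
# Urdu possessive forms agree in gender with the noun that follows them.
MASCULINE_MARKERS = {"کا", "میرا"}
FEMININE_MARKERS = {"کی", "میری"}


def gender_from_context(noun, sentences):
    """Vote on a noun's gender over all its occurrences in the corpus.

    Returns "masculine", "feminine", or None when no occurrence carries
    a marker or the votes are tied.
    """
    votes = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token != noun or i == 0:
                continue
            previous = tokens[i - 1]
            if previous in MASCULINE_MARKERS:
                votes["masculine"] += 1
            elif previous in FEMININE_MARKERS:
                votes["feminine"] += 1
    if not votes:
        return None
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: the context does not decide
    return ranked[0][0]


def resolve_gender(noun, database, dictionary, sentences):
    """Try database, dictionary, then corpus context; cache any result.

    Returns (gender, source). A gender of None with source "ask_user"
    means the expert user would have to supply the gender.
    """
    if noun in database:
        return database[noun], "database"
    if noun in dictionary:
        gender, source = dictionary[noun], "dictionary"
    else:
        gender, source = gender_from_context(noun, sentences), "corpus"
    if gender is None:
        return None, "ask_user"
    database[noun] = gender  # stored for later use, as in Section 2.1
    return gender, source


corpus = ["احمد کی کتاب", "علی کا قلم", "کتاب اور قلم لاؤ"]
database = {}
dictionary = {"میز": "feminine"}
result = resolve_gender("کتاب", database, dictionary, corpus)
```

In this sketch the third sentence, which is example (6), contributes no votes, so the gender of each noun is decided by its other occurrences in the corpus. A majority vote is one simple way to combine multiple occurrences; the paper does not state how the system weighs them.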
The corpus contains data from a variety of domains, but it can be broadly classified under the news and current affairs domain. The corpus was used to evaluate the system's performance on a given set of data, i.e. a file. During the testing of the system, it was observed that 90% of the nouns were present in the dictionary. Subsequently, 73% of the animate nouns were successfully converted into their opposite gender. For the remaining 27% of the animate nouns and for all inanimate nouns, the system was supported by the corpus. The corpus was accessed by the system to find the gender of nouns, which resulted in an overall accuracy of 87%. The remaining 13% of the inanimate nouns were entered by the system user. After a noun’s gender was found through its context, the resulting information was stored in the database. In this way the performance of the system can improve automatically over time, as the system is designed to adopt and store external information.

5. Conclusion

Gender is an important characteristic of nouns in the Urdu language. Each noun in Urdu is either masculine or feminine. The aim was to develop an efficient computerized system for recognizing the gender of animate and inanimate nouns in Urdu text and then converting animate nouns into their opposite gender. This system is primarily designed to help people who are learning Urdu as a second language, by deciding the correct gender of nouns. This work can also contribute to automatic part-of-speech tagging. Further, this work will contribute to the morphological components of Urdu natural language systems.

6. References

[1] W. O'Grady, M. Dobrovolsky and F. Katamba, “Contemporary Linguistics: An Introduction”, Addison Wesley Longman, London, 1997.

[6] S. Gries and A. Stefanowitsch, “Corpus Linguistics and Linguistic Theory (CLLT)”, Mouton de Gruyter, USA and Germany, ISSN 1613-7035, 2005-2008.

[7] Center for Research in Urdu Language Processing (CRULP), Online Urdu Dictionary Service, Pakistan, Retrieved 11-21-2007, Available: http://www.crulp.org/oud/WordIndex.aspx.

[8] S. A. Bokhari, Pakistan Data Management Services, Karachi, 2005, Retrieved 01-02-2008, Available: http://patrasbokhari.com

[9] P. K. Das, Grammatical Agreement in Hindi-Urdu and its Major Varieties, PhD thesis, JNU, New Delhi-67, India, 2005.

[10] Visual Studio.Net, Visual C++, Microsoft SQL Server 2000, Microsoft Windows XP, Microsoft Corporation ®, 2008.

[11] Olero Training Biz Name space, ORM Sample Class Library, (Object Relational Mapping), 2008.

Appendix A

Nouns

The following observations were made during the literature study while converting a masculine noun into a feminine one:

Observation No. 1
Delete the last character “ا” of the masculine and add “” at the end to form the feminine. Some examples are:
[Urdu example pairs not recoverable from the extracted text]

Observation No. 2
Add “” at the end of the masculine to form the feminine. Some examples are:
71
Proceedings of the Conference on Language & Technology 2009
72
Proceedings of the Conference on Language & Technology 2009
Observation 16 (Exception)
There are certain animate nouns which are always
used as masculine. Some examples are: