Richa: rsrishti@gmail.com
Abstract:
POS-tagging is the process of labeling the words of a running
corpus with their grammatical categories and, optionally, with
their associated grammatical features. It is essentially a
classification problem, but for languages with split-orthography¹
it is also a mapping problem: arrays of tokens (words, chunks or
sentences) must be mapped onto arrays of tags in agreement with
the syntactic structure of the language. While POS-tagging is an
established technology for European languages and even for some
Asian languages like Arabic and Chinese, it is an emerging field
for Indian languages, where little work has been done so far,
particularly for those languages that use the Persio-Arabic script
(e.g. Urdu, Kashmiri, Shina, Balti, and Purki). It has been argued
that such languages pose a real challenge to already complex NLP
tasks like tokenization, POS-tagging and chunking because of their
split-orthography. The problem of script needs to be addressed
carefully so that such languages do not lag behind in the
advancing field of Indian language technology. Since Kashmiri is
one such language with severe split-orthography, this paper is an
initiative to put the problem in the right perspective and to
develop a versatile, fine-grained, hierarchical² tag-set for
Kashmiri that can handle script-related issues as well as other
linguistic issues. The tag-set is also designed so that POS-tagging
facilitates parsing as far as possible. The tag-set is strictly
morpho-syntactic in nature, following the guidelines of the Expert
Advisory Group for Language Engineering Standards (henceforth
EAGLES) for morpho-syntactic annotation (Leech and Wilson, 1999).
Morpho-syntactic availability of the grammatical features is
therefore the governing principle of the present tag-set.
Capturing grammatical features that are only semantically or
lexically available is outside the scope of the present tag-set
and is left to future work.
1
The term split-orthography is used here because the existing literature offers no technical
term for the splitting tendency of the Persio-Arabic script, whereby affixes and roots are
written separately and even some lexical items are written as two tokens, forming
multi-token words. The term is thus a new coinage to describe this tokenization problem in
Kashmiri, Urdu, Shina, etc.
2
The term “hierarchical”, when used of a tag set, means that the categories in that tag set are
structured relative to one another. A hierarchical tag set will contain a small number of
categories, each of which contains a number of sub-categories, each of which may contain
sub-sub-categories, and so on, in a tree-like structure (Hardie 2003: 48).
The paper is organized as follows. Section 1 introduces the motivation behind
the present work along with the backdrop in which it is rooted. Section 2
presents the nature of Kashmiri, its idiosyncrasies and the similarities it
shares with other Indian languages. Section 3 presents the categorization
scheme followed in the present tag-set along with the rationale for adopting
it. Section 4 presents some aspects of the EAGLES scheme and their
implementation for Kashmiri. This section also shows some gaps where EAGLES
cannot be implemented but International Standards for Language Engineering
(henceforth ISLE) may work. Section 5 highlights some key issues that are
crucial to the development of a tag-set of any sort and to the annotation
process itself. Section 6 presents the tokenization problem along with an
implementable solution. Section 7 concludes the paper.
Key Words: POS-Tagging, Token, Split-orthography, Morpho-syntactic
annotation, Granularity, etc.
3
The Linguistic Data Consortium for Indian Languages (LDCIL) is an 11th Plan project of the
Govt. of India, set up at the Central Institute of Indian Languages (CIIL), Mysore, on the
lines of the LDC at the University of Pennsylvania, USA. Its primary goal is to create
quality annotated data for the 22 Scheduled Languages of the Indian Union.
However, many of the tag-sets have rendered POS-tagging an isolated NLP task,
focusing on corpus annotation alone rather than on the need to support
natural language parsing. As Andrew Hardie (2009: 269) writes in the
context of South Asian languages, "Most work on annotation has been focused
on POS-tagging and parsing in particular...the goal of the tagging is to support
the requirements of computational linguistic techniques rather than linguistic
analysis per se". Similarly, Leech & Smith (1990: 27) made it clear that
syntactic parsing is the central task of NLP, and that POS tagging, being a
prerequisite to parsing, is "the most central area of corpus processing
technology". The present work is rooted in the ILPOST and LDCIL tag-sets. It is
a finely crafted tag-set for Kashmiri, designed to handle split-orthography,
to lower redundancy, to reduce the cognitive load on the annotator, to lessen
computational complexity (by reducing the number of categories and
attributes) and, ultimately, to facilitate dependency parsing of Kashmiri.
Since the tag-set has so far been tested only on a very small corpus of
approximately 300 words, it is expected to be extended in future after testing
on a corpus of at least 10k words.
2. Kashmiri and Other Technologically Deprived Languages
Kashmiri is one of the major Indian languages listed in the 8th Schedule of the
Indian constitution. It is mainly spoken in the Kashmir valley by more than 4
million speakers (Koul, 2006). It is a Dardic language closely related to
Shina and some other languages of the North-West frontier. It shares some
morphological features with other languages, such as pronominal agreement
with Sindhi and remote pronouns and demonstratives with Assamese. However, it
has some features that are unique within the Indo-Aryan language family. For
example, the finite verb always occurs in the second position, except in
relative clauses. Thus, its word order resembles that of German, Dutch, etc.
These languages constitute a distinct group of Verb Second (V-2)⁴ languages
(somewhat different from verb-middle languages such as English, which is SVO).
Kashmiri, though spoken by the dominant majority of people in the valley, has
never been used as an official language. Persian was introduced as the
official language in the 14th century. It was later replaced by Urdu, which
continues to be the official language of Kashmir along with English as its
associate official language. Kashmiri was previously hegemonised by Sanskrit
and Persian, and is now hegemonised by Urdu and English. Consequently, it
borrowed heavily from Persian, either directly or indirectly through Urdu.
Like other languages, Kashmiri is heterogeneous in nature, with three regional
dialects: KamrAz (North Kashmiri), MarAz (South Kashmiri) and YamrAz (Standard
Kashmiri, spoken in Srinagar). Moreover, on the basis of ethno-political
biases in language use,
it has two varieties (see Grierson, 1919: 234 and Kachru, 1969): one is written
in Devanagari script (Sanskritised) and the other in Persio-Arabic
script (Persianised).
4
In a V2 language, any constituent of a sentence can precede the verb in contrast to verb
middle languages where only restricted constituents can precede the verb.
Persio-Arabic script is the standard one, and writers
irrespective of their religious background use Persio-Arabic script in their
works. They, however, retain their religious identity by using Persian-borrowed
or Sanskrit-borrowed lexical items, which shows their selective use of the
existing vocabulary of Kashmiri. The divide imposed by the script and the
loan-words is analogous to the Urdu-Hindi divide, except that the Urdu-Hindi
divide is far more intense and, consequently, Urdu and Hindi have attained
the status of separate languages.
Like Kashmiri and Urdu, many other Indian languages use a modified
Persio-Arabic script, for example Shina, Sindhi and Purki. They are replete with
borrowed Persian constructions (IzAfe) and open-class items (infixations).
These languages are technologically deprived in the Indian language
technology scenario and need to be dealt with carefully, with due
emphasis on split-orthography, which is a key issue in text processing
(tokenization, POS-tagging, etc.). The present tag-set is designed to address
this problem effectively.
5
In a flat design, there is a large number of independent categories without subcategories
but optionally with feature values.
is that the former captures the higher levels of granularity⁶. It includes twelve
categories, thirty-three subcategories and fifteen attributes along with their
values, as given in the appendix. The key feature of this tag-set is that it
merges all the modifying parts of speech under a single category and takes a
clear-cut stance of classifying on the basis of function, which is hardly
evident in many POS schemes. This stance is intended to achieve better
results in the annotation process.
The proposed tag-set is organized in three levels. Level-1, the top level of the
hierarchy, is an inventory of categories. Each category leads to Level-2, an
intermediate level that is itself an inventory of subcategories. Each
subcategory in turn leads to Level-3, the bottom level, which is an inventory of
attributes with their embedded values. For example, Dimension is an attribute
of Definite demonstratives with the embedded values Proximal, Distal and
Remote. The overall hierarchical structure of the tag-set, which captures fine
granularity, is shown in the following snapshot:
Fig.1
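The three-level organization described above can be sketched as a small nested data structure. This is only an illustrative sketch, not the authors' implementation: the class names, the `TAGSET` variable and the helper functions are hypothetical, and the only entry filled in is the Dimension example given in the text (the labels "Demonstrative" and "Definite" are assumed spellings of the category and subcategory).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Subcategory:
    """Level-2 node: holds Level-3 attributes and their embedded values."""
    name: str
    attributes: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class Category:
    """Level-1 node: holds an inventory of Level-2 subcategories."""
    name: str
    subcategories: Dict[str, Subcategory] = field(default_factory=dict)


# Only the Dimension values come from the text; the labels "Demonstrative"
# and "Definite" are assumed spellings used for illustration.
TAGSET: Dict[str, Category] = {
    "Demonstrative": Category(
        name="Demonstrative",
        subcategories={
            "Definite": Subcategory(
                name="Definite",
                attributes={"Dimension": ["Proximal", "Distal", "Remote"]},
            )
        },
    )
}


def allowed_values(category: str, subcategory: str, attribute: str) -> List[str]:
    """Walk Level-1 -> Level-2 -> Level-3 and return the permitted values."""
    return TAGSET[category].subcategories[subcategory].attributes[attribute]


def is_valid(category: str, subcategory: str, attribute: str, value: str) -> bool:
    """Check a proposed attribute value against the hierarchy."""
    try:
        return value in allowed_values(category, subcategory, attribute)
    except KeyError:
        # Unknown category, subcategory or attribute: not a valid annotation.
        return False
```

An annotation tool built on such a structure could call `is_valid("Demonstrative", "Definite", "Dimension", "Distal")` to reject attribute values that the hierarchy does not license for a given subcategory.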
4. EAGLES but Not ISLE: Strictly Morpho-syntactic Annotation
EAGLES provides a set of attributes for morpho-syntactic POS tagging.
Some of them are mandatory and others are recommended. The mandatory
attributes cover the grammatical categories, whereas the recommended
attributes cover grammatical descriptions (feature columns) of those
categories. Therefore, its recommended tags are decomposable⁷: each tag is a
set of morpho-syntactic attributes.
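Decomposability can be illustrated with the dotted tag strings that appear later in this paper, such as MAdj.0.0.0. The sketch below is hedged: it assumes that the part before the first dot is the category/subcategory label, that each following dot-separated slot holds one attribute value, and that "0" marks an unspecified value. The paper does not state which attribute each slot stands for, so slots are identified only by position, and the function names are hypothetical.

```python
from typing import List, Tuple

UNSPECIFIED = "0"  # assumption: "0" marks an attribute slot with no value


def decompose_tag(tag: str) -> Tuple[str, List[str]]:
    """Split a dotted tag into its label and its attribute slots."""
    label, *slots = tag.split(".")
    return label, slots


def compose_tag(label: str, slots: List[str]) -> str:
    """Rebuild the dotted tag string from a label and attribute slots."""
    return ".".join([label, *slots])


def specified_slots(tag: str) -> List[Tuple[int, str]]:
    """Return (position, value) for every slot that carries a real value."""
    _, slots = decompose_tag(tag)
    return [(i, value) for i, value in enumerate(slots) if value != UNSPECIFIED]
```

Because each sub-string is meaningful on its own, a tagger or parser can work with the label alone (coarse-grained) or with the full string (fine-grained) without needing two separate tag-sets.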
6
Granularity basically means the extent to which a tag can be enriched with linguistic
information.
7
A tag is decomposable if the string representing the tag contains one or more shorter sub-
strings that are meaningful out of the context of the original tag.
Generally, fine-grained tag-sets could have been designed without the
EAGLES recommendations, but capturing high granularity would have been a
problem. The EAGLES recommendations provide a means by which a balance
could be maintained between the granularity of a tag-set and the ability to
capture that granularity. The focus on morpho-syntactic annotation vis-à-vis
morpho-syntactic availability has made the tag-set information-rich and
tangible, hence easily computable. Some attributes which are not
morpho-syntactically available but only lexically available are non-tangible,
and capturing them is outside the scope of the EAGLES recommendations and
hence outside the scope of the present tag-set. For example, jAy (NC) has the
value feminine (fem) for the attribute gender, but there is no marker for it.
For the attribute number, however, the plural value has a marker (-eh), as in
jAyeh. The capture of such lexically available features is covered by ISLE, a
lexical standardization initiative that broadens the scope of EAGLES. While
capturing non-tangible, morpho-syntactically unavailable features has become
the main concern of ISLE in the context of European languages, the Lex-tagging
initiative of LDCIL has focused on the same concern for Indian languages.
script and borrow heavily from Persian, Kashmiri has split-orthography, due to
which there is a lack of one-to-one correspondence between tokens and words,
thereby posing a mapping problem between the array of tokens and the array
of tags. Many derivational morphemes and their bases are written as separate
tokens instead of one in Kashmiri, for example “mazi-dAr” (delicious),
“miltry-voul” (military personnel) and “kasUr-vAr” (guilty). These
words are bi-token words in which the first token is a morphological base and
the second is a derivational bound morpheme. The second token of such
words does not fit in any of the POS schemes. Hence, to tag such tokens
without affecting the syntactic structure (and so creating problems in
parsing), a syntactically neutral tag called the Null-Tag (NT) is devised.
This tag is applied to the first token of a bi-token word, and the second token
is tagged with the POS category to which the whole word belongs, as given below:
Mazi\NT.0.0.0 + dAr\MAdj.0.0.0 = MAdj.0.0.0
KasUr\NT.0.0.0 + vAr\MAdj.0.0.0 = MAdj.0.0.0
(Null Tag) + (POS Tag) = (Resultant Tag)
This strategy resolves the mapping problem and prepares the ground for
the ultimate goal of tagging, i.e. to facilitate parsing.
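The Null-Tag strategy can be sketched as a small post-processing step that folds each NT-tagged token into the token that follows it, so that a parser sees one word with one tag. This is a minimal sketch under stated assumptions, not the authors' tool: the function names are hypothetical, the input is assumed to be whitespace-separated token\tag items in the notation used above, and joining the merged tokens with a hyphen is an illustrative choice.

```python
from typing import List, Tuple

NULL_TAG = "NT"  # label of the syntactically neutral Null-Tag

TaggedToken = Tuple[str, str]  # (token, tag)


def parse_tagged(text: str) -> List[TaggedToken]:
    """Parse whitespace-separated items of the form token, backslash, tag."""
    pairs = []
    for item in text.split():
        token, tag = item.split("\\", 1)
        pairs.append((token, tag))
    return pairs


def merge_null_tagged(tagged: List[TaggedToken]) -> List[TaggedToken]:
    """Fold each NT-tagged token into the next POS-tagged token.

    The merged word takes the tag of the POS-tagged token, mirroring
    (Null Tag) + (POS Tag) = (Resultant Tag).
    """
    words: List[TaggedToken] = []
    pending: List[str] = []
    for token, tag in tagged:
        pending.append(token)
        if tag.split(".")[0] == NULL_TAG:
            continue  # wait for the token that carries the real POS tag
        words.append(("-".join(pending), tag))  # hyphen joiner is an assumption
        pending = []
    if pending:
        raise ValueError(f"Null-Tag token(s) with no following POS-tagged token: {pending}")
    return words
```

For instance, `merge_null_tagged(parse_tagged(r"Mazi\NT.0.0.0 dAr\MAdj.0.0.0"))` yields a single word-level entry carrying the tag MAdj.0.0.0, which restores the one-to-one correspondence between words and tags.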
7. Conclusion
A tag-set is an important prerequisite that determines not only the nature of the
annotated (POS-tagged) corpus and its utility in research and development but
also the entire course of NLP. However, guidelines for corpus annotation also
play a crucial role. Our experience in developing a POS tag-set for Urdu and
Kashmiri has shown that split-orthography is not really a hard problem that
can hamper the progress of language technology in such resource-poor
languages, though it is a different problem from those of other Indian
languages that do not use the Persio-Arabic script. Further, extending the
EAGLES annotation scheme to Kashmiri revealed that modularity should be
maintained between morpho-syntactic tagging and lexical tagging (part of
semantic tagging) in order to achieve high scores in machine learning while
encoding maximum linguistic information, so that developing an automatic
tagger becomes a comparatively easy task and quality control (accuracy) is
possible in automatic as well as manual POS-tagging.
References:
Koul, Omkar Nath (2006) Modern Kashmiri Grammar. USA: McNeil
Technologies, Inc.
Habash, N. & Rambow, O. (2005) Arabic Tokenization, Morphological
Analysis, and Part-of-Speech Tagging in One Fell Swoop. In Proceedings of
the Conference of American Association for Computational Linguistics
(ACL05).
Baskaran, S. et al. (2007) Framework for a Common Parts-of-Speech Tag-set
for Indic Languages. (Draft) http://research.microsoft.com/~baskaran/POSTagset/
Leech, G. & Wilson, A. (1999) Standards for Tag-sets. In Syntactic Wordclass
Tagging, ed. Hans van Halteren. Dordrecht: Kluwer Academic.
Santorini, B. (1990) Part-of-speech tagging guidelines for the Penn Treebank
Project. Technical report MS-CIS-90-47, Department of Computer and
Information Science, University of Pennsylvania.
IIT-tagset. A Parts-of-Speech tag-set for Indian languages.
http://shiva.iiit.ac.in/SPSAL2007/iiit_tagset_guidelines.pdf
Brill, E. (1993) A corpus-based Approach to Language Learning.
Hardie, A. (2003) Developing a tag-set for automated part-of-speech tagging
in Urdu. Proceedings of the Corpus Linguistics 2003 conference, 16, 2003.
Leech, G. & Wilson, A. (1999) Recommendations for the Morpho-syntactic
Annotation of Corpora. EAGLES Report EAG-TCWG-MAC/R, 1999.
Kashmiri Tag-set
Category (level-1)   Subcategory (level-2)   Attributes (level-3)
Noun   1.Common   Number. Case. Case marker. Emphatic
Verb 14.Main Person. Number. Gender. Aspect. Mood. Finiteness. Emphatic. Pronominal Agreement
16.Adverb Emphatic
Postposition Number. Gender. Case Marker. Emphatic. Animacy
Particles
21.Co-ordinating
22.Subordinating
23.(Dis)Agreement
24.Interjection
27.Dedative
28.Dubitative
29.Others
Null Tag 30.Part of word
31.X-Tag
Reduplication Distributive
33.Symbol
Unknown
Punctuation
Table 1