
Developing Hierarchical Tag-set for Kashmiri

Shahid Mushtaq Bhat: Shahid.bhat3@gmail.com

Richa: rsrishti@gmail.com

Farooq Ahmad Sheikh: farooq.linguist@gmail.com

LDC-IL, CIIL, Mysore


Department of Linguistics, University of Kashmir, Srinagar

Abstract:
POS-tagging is the process of labeling the words of a running
corpus with their grammatical categories and, optionally, with
their associated grammatical features. It is essentially a
classification problem, but for languages with
split-orthography¹ it is also a mapping problem, which involves
mapping the arrays of tokens (words, chunks or sentences)
on the arrays of tags in proper agreement with the syntactic
structure of a language. While POS-tagging is an established
technology in European languages and even in some Asian
languages like Arabic and Chinese, it is an emerging
field in Indian languages, where little work has been done so
far, particularly in those languages which use the Persio-Arabic
script (e.g. Urdu, Kashmiri, Shina, Balti, and Purki). It has been
argued that such languages pose a real challenge to already
complex NLP tasks like tokenization, POS-tagging and
chunking because of their split-orthography. The problem of script
needs to be addressed carefully so that such languages do
not lag behind in the development of Indian language
technology. Since Kashmiri is one such language with
severe split-orthography, this paper is an initiative to put the
problem in the right perspective and to develop a versatile,
fine-grained, hierarchical² tag-set for Kashmiri that can handle
script-related issues as well as other linguistic issues. The
tag-set is also intended to make POS-tagging as useful as
possible for parsing. It is strictly morpho-syntactic in nature,
following the guidelines of the Expert Advisory Group on Language
Engineering Standards (henceforth EAGLES) for
morpho-syntactic annotation (Leech and Wilson, 1999). Therefore,
morpho-syntactic availability of the grammatical features
would be the governing principle for the present tag-set.
Capturing of semantically or lexically available grammatical
features is out of the scope of the present tag-set and will be
handled in the future work.

¹ The term split-orthography is used because no technical term exists in the
literature for the splitting tendency of the Persio-Arabic script, due to which
affixes and roots are written separately; even some lexical items are written as two tokens,
forming multi-token words. The term is, in a way, a new coinage to describe this tokenization
problem of Kashmiri, Urdu, Shina, etc.
² The term “hierarchical”, when used of a tag set, means that the categories in that tag set are
structured relative to one another. A hierarchical tag set will contain a small number of
categories, each of which contains a number of sub-categories, each of which may contain
sub-sub-categories, and so on, in a tree-like structure (Hardie 2003: 48).

The paper is organized in various sections. In section 1, the motivation behind
the present work will be introduced along with the backdrop in which the
present work is rooted. Section 2 presents the nature of Kashmiri, its
idiosyncrasies and similarities shared with other Indian languages. Section 3
presents the categorization scheme that is followed in the present tag set along
with the rationale to adopt such a categorization scheme. Section 4 presents
some aspects of the EAGLES scheme and their application to Kashmiri.
This section also points out gaps where EAGLES cannot be
applied but International Standards for Language Engineering
(henceforth ISLE) may work. Section 5 highlights some key
issues that are crucial to the development of any tag-set and to the
annotation process itself. Section 6 presents the tokenization problem along
with an implementable solution. Section 7 concludes the paper.
Key Words: POS-Tagging, Token, Split-orthography, Morpho-syntactic
annotation, Granularity, etc.

1. Introducing the Backdrop of the Present Tag-set


Research in language technology is quite recent in India.
Consequently, tag-set design for Indian languages
(henceforth ILs) is also very recent compared to its European counterpart.
Despite being an emerging field and a difficult task (due to the complex nature of
ILs), it has matured in a short period of time (due to external as well
as internal inputs) and has produced a number of tag-sets and common
frameworks for ILs. On the basis of these tag-sets, a large amount of corpus
data has been annotated to feed research and industry hungry for annotated data. The
initial efforts in POS tag-set design resulted in tag-sets such as Brown, C5
and UPenn (designed mainly for English). They were mostly simple inventories of
tags corresponding to morpho-syntactic features, and varied greatly in
their granularity (Hardie, 2004). The CLAWS2 tag-set (Sartoni,
1987) is a landmark in the history of tag-set design: it marked an
important change in the structure of tag-sets, from a flat structure to a
hierarchical structure. Tag-sets have also been designed to provide
resources for Indian languages, including the AU-KBC Tamil tag-set (2001),
Hardie's tag-set for Urdu (Hardie, 2005), the IIIT-Hyderabad tag-set for Hindi
(Bharati et al. 2006), the Microsoft Research India (MSRI) IL-POSTS based
on Hindi & Bangla (Baskaran et al. 2008), the MSRI-JNU Sanskrit tag-set, the
CSI-HCU tag-set for Telugu (Sree R.J et al. 2008), the IIT-Kharagpur tag-set for Bangla,
the Nelralec tag-set for Nepali and the LDCIL³ tag-sets for Indian languages (2009).

³ The Linguistic Data Consortium for Indian Languages (LDC-IL) is an 11th Plan project of the
Govt. of India, set up at the Central Institute of Indian Languages (CIIL), Mysore, on the lines
of the LDC at the University of Pennsylvania, USA. Its primary goal is to create quality
annotated data for the 22 Scheduled Languages of the Indian Union.

However, many of these tag-sets have treated POS-tagging as an isolated NLP
task, focusing only on corpus annotation rather than on the need to support
natural language parsing. As Andrew Hardie (2009: 269) writes in the
context of South Asian languages, "Most work on annotation has been focused
on POS-tagging and parsing in particular...the goal of the tagging is to support
the requirements of computational linguistic techniques rather than linguistic
analysis per se". Similarly, Leech & Smith (1990: 27) made it clear that
syntactic parsing is the central task of NLP, and POS tagging, being a
prerequisite to parsing, is "the most central area of corpus processing
technology". The present work is rooted in the ILPOST and LDCIL tag-sets. It is
a carefully designed tag-set for Kashmiri, oriented to handle
split-orthography, to lower redundancy, to reduce the cognitive load on the annotator,
to lessen computational complexity (by reducing the number of categories
and attributes) and, ultimately, to facilitate dependency parsing of Kashmiri.
Since the tag-set has so far been tested only on a very small corpus of
approximately 300 words, it is expected to be extended in future after
testing on a corpus of at least 10k words.
2. Kashmiri and Other Technologically Deprived Languages
Kashmiri is one of the major Indian languages listed in the Eighth Schedule of the
Indian constitution. It is mainly spoken in the Kashmir valley by more than 4
million speakers (O.N. Koul, 2006). It is a Dardic language closely related to
Shina and some other languages of the North-West frontier. It shares some
morphological features with other languages, such as pronominal agreement
with Sindhi and remote pronouns and demonstratives with Assamese. However, it has some
features that are unique in the Indo-Aryan language family. For example, the finite
verb always occurs in the second position, except in relative clauses. Thus, its
word order resembles that of German, Dutch, etc. These languages
constitute a distinct group of Verb Second (V-2)⁴ languages (somewhat different
from verb-middle languages like English, which is SVO). Kashmiri, though spoken by
the dominant majority of people in the valley, has never been used as an
official language. Persian was introduced as the official language in the 14th
century. It was later replaced by Urdu, which continues to be the official
language of Kashmir along with its associate official language, English.
Previously, Kashmiri was dominated by Sanskrit and Persian, and now by
Urdu and English. Consequently, it has borrowed heavily from Persian, either
directly or indirectly through Urdu. Like other languages, Kashmiri is
heterogeneous in nature, with three regional dialects: KamrAz (North
Kashmiri), MarAz (South Kashmiri) and YamrAz (Standard Kashmiri, spoken
in Srinagar). Moreover, on the basis of ethno-political biases in language use,
it has two varieties (see Grierson, 1919: 234 and Kachru, 1969): one is written
in the Devanagari script (Sanskritised) and the other is written in the Persio-Arabic

⁴ In a V2 language, any constituent of a sentence can precede the verb, in contrast to
verb-middle languages, where only restricted constituents can precede the verb.

Script (Persianised). The Persio-Arabic script is the standard one, and writers
use it in their works irrespective of their religious background.
They, however, retain their religious identity by using lexical items
borrowed from Persian or Sanskrit, which shows their selective use of the existing
vocabulary of Kashmiri. The divide imposed by the script and the
loan-words is analogous to the Urdu-Hindi divide, except that the
Urdu-Hindi divide is much sharper and, consequently, Urdu and Hindi have attained
the status of separate languages.
Like Kashmiri and Urdu, many Indian languages use a modified
Persio-Arabic script, for example Shina, Sindhi and Purki. They are replete with
borrowed Persian constructions (IzAfe) and open-class items (infixations).
These languages are technologically deprived in the Indian
language technology scenario and need to be dealt with carefully, with due
emphasis on split-orthography, which is a key issue in text processing
(tokenization, POS-tagging, etc.). The present tag-set is designed to address
this problem effectively.

3. Categorization, Sub-categorization and the Attributes


An empirical approach to creating a POS-categorization scheme for a
language is to use its corpus. However, this cannot be done before a
tag-set has been created. To design a POS tag-set for Kashmiri, experience in Urdu POS tagging
and native speakers’ intuitions have been taken into account, along with
published descriptive works on Kashmiri grammar such as “Modern
Kashmiri Grammar” (O.N. Koul, 2006). Further, Dionysius Thrax’s Techne (c. 100 B.C.),
a grammatical sketch of Greek, is not only a model for contemporary
POS descriptions of European languages; it is also a model for POS
descriptions of Indian languages. The Techne includes a scheme of eight
POS categories (noun, verb, participle, article, pronoun, preposition, adverb,
and conjunction). The present POS categorization scheme belongs to the same
tradition; however, it is a variant of the scheme used in the ILPOST and the
LDCIL tag-sets. The variation lies in the reduced number of
categories and attributes and in the inclusion of a non-POS category. An efficient
tag-set is one that includes all the possible POS categories of a language
(Hardie, 2004). But the present tag-set is not merely an inventory of all
possible POS categories of Kashmiri. It is a minimal, hierarchically organized
collection of categories, subcategories and the corresponding
morpho-syntactic attributes. Being fine-grained, it encodes maximum
linguistic information with a minimum of categories and attributes. Since it is very
economical, it is expected to help achieve better scores in machine
learning. The rationale for using a hierarchical design instead of a flat design⁵

⁵ In a flat design, there is a large number of independent categories without subcategories,
but optionally with feature values.

is that the former captures higher levels of granularity⁶. It includes twelve
categories, thirty-three subcategories and fifteen attributes along with their
values, as given in the appendix. The key feature of this tag-set is that it merges all
the modifying parts of speech under a single category and takes a clear-cut
stance of classifying on the basis of function, which is hardly evident in many POS
schemes. This stance is intended to achieve better results in the annotation process.
The proposed tag-set is organized in three levels. Level-1, the top level of the
hierarchy, is an inventory of categories, each of which leads to Level-2,
an intermediate level which is itself an inventory of subcategories. Each
subcategory in turn leads to Level-3, the bottom level, which is an inventory of
attributes with their embedded values. For example, Dimension is an attribute
of Definite demonstratives with the embedded values Proximal, Distal and
Remote. The overall hierarchical structure of the tag-set, designed to capture fine
granularity, is shown in the following snapshot:

[Fig. 1: overall hierarchical structure of the tag-set]
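
The three levels described above can be pictured as a nested mapping. The following Python sketch is only an illustration and not part of the tag-set specification: the names TAGSET, attributes_of and is_valid_value are hypothetical, only a few rows of Table 1 are included, and the only attribute values filled in are the Dimension values mentioned in the text. An empty list marks values that are not enumerated in this sketch.

```python
from typing import Dict, List

# Level-1 category -> Level-2 subcategory -> Level-3 attribute -> values.
# Hypothetical, partial encoding: rows follow Table 1; an empty list means
# "values not enumerated in this sketch", not "no values".
TAGSET: Dict[str, Dict[str, Dict[str, List[str]]]] = {
    "Noun": {
        "Common": {"Number": [], "Case": [], "Case marker": [], "Emphatic": []},
        "Proper": {"Case": [], "Case marker": [], "Emphatic": []},
    },
    "Demonstrative": {
        "Definite": {
            "Number": [], "Gender": [], "Case": [], "Case marker": [],
            "Emphatic": [],
            # The only values taken from the text of the paper.
            "Dimension": ["Proximal", "Distal", "Remote"],
        },
    },
}


def attributes_of(category: str, subcategory: str) -> List[str]:
    """Return the Level-3 attribute names of a subcategory."""
    return list(TAGSET[category][subcategory])


def is_valid_value(category: str, subcategory: str,
                   attribute: str, value: str) -> bool:
    """Check a value against the values enumerated for an attribute.

    Returns False for unknown attributes and for attributes whose values
    are not enumerated in this sketch.
    """
    return value in TAGSET[category][subcategory].get(attribute, [])


print(attributes_of("Noun", "Proper"))
print(is_valid_value("Demonstrative", "Definite", "Dimension", "Remote"))
```

Keeping the hierarchy as data rather than as a flat list of tag strings means that a subcategory's attributes can be looked up or validated without parsing tag names.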
4. EAGLES but Not ISLE: Strictly Morpho-syntactic Annotation
EAGLES provides a set of attributes for morpho-syntactic POS tagging.
Some of them are mandatory and others are recommended. The mandatory
attributes include the grammatical categories, whereas the recommended
attributes include grammatical descriptions (feature columns) of those
categories. Therefore, its recommended tags are decomposable⁷: each is a set of
morpho-syntactic attributes.
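
Decomposability can be illustrated with the dot-separated tags used in the Null-Tag examples of section 6, such as MAdj.0.0.0. The Python sketch below rests on assumptions: it takes the first field to be the category/subcategory label and the remaining fields to be attribute-value slots, and it reads 0 as "not specified". Neither the slot order nor the meaning of 0 is spelled out in this paper, and the names DecomposedTag, decompose and specified are hypothetical.

```python
from typing import NamedTuple, Tuple


class DecomposedTag(NamedTuple):
    label: str                # e.g. "MAdj" or "NT"
    values: Tuple[str, ...]   # attribute slots, in the order they appear


def decompose(tag: str) -> DecomposedTag:
    """Split a dot-separated tag into its label and attribute slots."""
    label, *values = tag.split(".")
    return DecomposedTag(label, tuple(values))


def specified(tag: DecomposedTag) -> Tuple[str, ...]:
    """Return only the slots that carry a value (assumes "0" = unspecified)."""
    return tuple(v for v in tag.values if v != "0")


parsed = decompose("MAdj.0.0.0")
print(parsed.label, parsed.values, specified(parsed))
```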
Generally, fine-grained tag-sets could have been designed without the
EAGLES recommendations but capturing high granularity would have been a

⁶ Granularity, basically, means the extent to which a tag can be enriched with linguistic
information.
⁷ A tag is decomposable if the string representing the tag contains one or more shorter
sub-strings that are meaningful out of the context of the original tag.

problem. The EAGLES recommendations provide a means by which a balance
can be maintained between the granularity of a tag-set and the ability to
capture that granularity. The focus on morpho-syntactic annotation vis-à-vis
morpho-syntactic availability has made the tag-set information-rich and
tangible, hence easily computable. Some attributes which are not
morpho-syntactically available but only lexically available are non-tangible;
capturing them is outside the scope of the EAGLES recommendations and hence outside
the scope of the present tag-set. For example, jAy (NC) has the value
feminine (fem) for the attribute gender, but there is no marker for it.
However, for the attribute number (value plural), it has a marker (-eh),
as in jAyeh. Efforts to capture such features are covered by ISLE, a lexical
standardization initiative that broadens the scope of EAGLES. While the capturing of
non-tangible, morpho-syntactically unavailable features has become the main
concern of ISLE in the context of European languages, the Lex-tagging initiative
of LDCIL has focused on the same concern for Indian languages.

5. Understanding the Key Issues: Understanding the Annotator


Tag-set design is a purpose-oriented task. So, before developing a tag-set
for any language, a number of decisions have to be taken, such as whether the
tag-set will be hierarchical or flat, and whether it will
be fine-grained or coarse-grained. Such dualities of tag-set design are
generally resolved on the basis of the purpose and utility of the tag-set. But the
dualities that a human annotator faces while applying the tag-set are
crucial to fulfilling its main purpose. The most important duality
is whether a tag should be based on form or on function. For example, in the
Kashmiri phrase “shongith shahmAr”, the token “shongith” is a participle by
its form but functions as an adjective. Similarly, “miyon” and “chuon” are
possessive pronouns by their form but adjectives by their function. In AnnCorra
(Bharati et al. 2006), a token is tagged on the basis of form rather than function.
But Hardie (2005) uses a function-based approach and tags possessive
pronouns like “mera, tumhara” as adjectives. The form-function duality
creates computational complexity and puts cognitive load on the annotator.
Since the present tag-set is expected to lower computational complexity
and to facilitate the parsing of Kashmiri sentences (by creating a
common category of modifiers), a function-based approach is more
viable.

6. Null-Tag: A Solution to Tokenization and Mapping Problems


Like Urdu, Shina, and Purki, Kashmiri uses a modified Persio-Arabic script
with some additional diacritics to capture secondary
articulations (like palatalization), which are idiosyncratic to Kashmiri. The script
imposes a splitting tendency on words, due to which there are
word-internal spaces (which may or may not demarcate tokens) as well as
word-external spaces (which demarcate words). So, like Urdu and other languages
that follow Persio-Arabic

script and borrow heavily from Persian, Kashmiri has split-orthography, due to
which there is no one-to-one correspondence between tokens and words.
This poses a mapping problem between the array of tokens and the array
of tags. Many derivational morphemes and their bases are written as separate
tokens instead of one in Kashmiri, for example “mazi-dAr” (delicious),
“miltry-voul” (military personnel) and “kasUr-vAr” (guilty). These
are bi-token words in which the first token is a morphological base and
the second is a derivational bound morpheme. The second token of such
words does not fit into any POS scheme. Hence, to tag such words
without affecting the syntactic structure (and thereby creating problems in parsing), a
syntactically neutral tag called the Null-Tag (NT) is devised. This tag is
applied to the first token of a bi-token word, and the second token is tagged
with the POS category to which the whole word belongs, as shown below:
Mazi\NT.0.0.0 + dAr\MAdj.0.0.0 = MAdj.0.0.0
KasUr\NT.0.0.0 + vAr\MAdj.0.0.0 = MAdj.0.0.0
(Null Tag) + (POS Tag) = (Resultant Tag)
This strategy resolves the mapping problem and prepares the ground for
the ultimate goal of tagging, i.e. to facilitate parsing.
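
The Null-Tag convention can be sketched as a small post-processing step that maps the array of tagged tokens back onto words. The Python sketch below is a minimal illustration under stated assumptions: the function name merge_null_tagged is hypothetical, tags are taken to be dot-separated strings whose first field is the label, and joining the tokens with a hyphen is only a display choice of this sketch.

```python
from typing import List, Tuple

NULL_TAG = "NT"  # label of the Null-Tag as written in the examples above


def merge_null_tagged(tokens: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Collapse (token, tag) pairs into (word, tag) pairs.

    Tokens carrying the Null-Tag are attached to the next token that has a
    POS tag; the resulting word receives that POS tag.
    """
    words: List[Tuple[str, str]] = []
    pending: List[str] = []  # NT tokens waiting for their POS-bearing token
    for form, tag in tokens:
        if tag.split(".", 1)[0] == NULL_TAG:
            pending.append(form)
        else:
            # Joining with a hyphen is only a display convention of this sketch.
            words.append(("-".join(pending + [form]), tag))
            pending = []
    if pending:
        raise ValueError(f"Null-tagged token(s) without a POS token: {pending}")
    return words


tagged = [("Mazi", "NT.0.0.0"), ("dAr", "MAdj.0.0.0"),
          ("KasUr", "NT.0.0.0"), ("vAr", "MAdj.0.0.0")]
print(merge_null_tagged(tagged))
```

Because the Null-Tag carries no syntactic information, a parser that consumes the merged output sees one word with one POS tag per lexical item, which restores the one-to-one mapping.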

7. Conclusion
A tag-set is an important prerequisite that determines not only the nature of the
annotated (POS-tagged) corpus and its utility in research and development but
also the entire course of NLP. However, guidelines for corpus annotation also
play a crucial role. Our experience in developing tag-sets for POS tagging of Urdu and
Kashmiri has shown that split-orthography is
not really a hard problem that can hamper the progress of language
technology in such resource-poor languages; it is rather a different problem from
those of other Indian languages that do not use the Persio-Arabic script. Further, extending
the EAGLES annotation scheme to Kashmiri revealed that modularity should be maintained
between morpho-syntactic tagging and lexical tagging (part of
semantic tagging) in order to achieve high scores in machine learning while encoding
maximum linguistic information, so that developing an automatic tagger
becomes a comparatively easy task and quality (accuracy) can be controlled
in automatic as well as manual POS-tagging.

References:

Hardie, A. (2004) The Computational Analysis of Morpho-syntactic Categories
in Urdu. PhD thesis submitted to Lancaster University.
Leech, G. and Wilson, A. (1996) Recommendations for the Morpho-syntactic
Annotation of Corpora. EAGLES Report EAG-TCWG-MAC/R.
Koul, Omkar Nath (2006) Modern Kashmiri Grammar. USA: McNeil
Technologies, Inc.
Habash, N. and Rambow, O. (2005) Arabic Tokenization, Morphological
Analysis, and Part-of-Speech Tagging in One Fell Swoop. In Proceedings of
the Conference of American Association for Computational Linguistics
(ACL05).
Baskaran, S. et al. (2007) Framework for a Common Parts-of-Speech Tag-set
for Indic Languages. (Draft) http://research.microsoft.com/~baskaran/POSTagset/
Leech, G. and Wilson, A. (1999) Standards for Tag-sets. In Syntactic Wordclass
Tagging, ed. Hans van Halteren, Dordrecht: Kluwer Academic.
Santorini, B. (1990) Part-of-speech tagging guidelines for the Penn Treebank
Project. Technical report MS-CIS-90-47, Department of Computer and
Information Science, University of Pennsylvania.
IIT-tagset. A Parts-of-Speech tag-set for Indian languages.
http://shiva.iiit.ac.in/SPSAL2007/iiit_tagset_guidelines.pdf
Brill, E. (1993) A Corpus-based Approach to Language Learning.
Hardie, A. (2003) Developing a tag-set for automated part-of-speech tagging
in Urdu. Proceedings of the Corpus Linguistics 2003 conference, 16, 2003.
Leech, G. and Wilson, A. (1999) Recommendations for the Morpho-syntactic
Annotation of Corpora. EAGLES Report EAG-TCWG-MAC/R, 1999.

Kashmiri Tag-set

Category        Subcategory          Attributes
(level-1)       (level-2)            (level-3)

Noun            1. Common            Number, Case, Case marker, Emphatic
                2. Proper            Case, Case marker, Emphatic
                3. Verbal            Case, Case marker, Emphatic
                4. Spatio-temporal   Case, Case marker, Emphatic, Dimension

Pronoun         6. Definite          Person, Number, Gender, Case, Case marker,
                                     Emphatic, Honorific, Dimension
                7. Indefinite        Person, Number, Gender, Case, Case marker,
                                     Emphatic, Dimension
                8. Reflexive         Number, Gender, Case, Case marker, Emphatic
                9. Reciprocal        Case, Emphatic
                10. Relative         Number, Gender, Case, Case marker, Emphatic

Demonstrative   11. Definite         Number, Gender, Case, Case marker, Emphatic,
                                     Dimension
                12. Indefinite       Number, Gender, Case, Case marker, Emphatic,
                                     Dimension
                13. Relative         Number, Gender, Case, Case marker, Emphatic

Verb            14. Main             Person, Number, Gender, Aspect, Mood,
                                     Finiteness, Emphatic, Pronominal Agreement
                15. Auxiliary        Person, Number, Gender, Tense, Finiteness,
                                     Emphatic, Pronominal Agreement, Negative

Modifier        16. Adjective        Number, Gender, Case, Case marker, Emphatic
                17. Quantifier       Number, Case, Case marker, Emphatic, Numeral
                18. Intensifier      Case, Case marker, Emphatic
                19. Participle       Aspect, Emphatic
                20. Adverb           Emphatic

Postposition                         Number, Gender, Case marker, Emphatic, Animacy

Particles       21. Co-ordinating
                22. Subordinating
                23. (Dis)Agreement
                24. Interjection
                25. Similative       Number, Gender
                26. Delimitive       Inclusive, Exclusive
                27. Dedative
                28. Dubitative
                29. Others

Null Tag        30. Part of word
                31. X-Tag

Reduplication                        Distributive

Residuals       32. Foreign word
                33. Symbol

Unknown

Punctuation

Table 1

