
Finite State Morphology: The Turkish Nominal Paradigm

A Thesis by Philip Makedonski /nero_pdm@yahoo.com/ Submitted to the Seminar für Sprachwissenschaft, Eberhard Karls Universität Tübingen, 72074 Tübingen, Germany, in fulfillment of the requirements for the degree Bachelor of Arts in Computational Linguistics

July 2005

ABSTRACT
Finite State Morphology: The Turkish Nominal Paradigm
Makedonski, Philip
Seminar für Sprachwissenschaft, Eberhard Karls Universität Tübingen
Supervisor: Dr. Dale Gerdemann
July 2005, 24 Pages

In this thesis I present a finite state approach to the inflectional morphology of Turkish nouns, with the ultimate goal of building a morphological analyzer for Turkish nouns. I deal primarily with the principles of vowel harmony across the different inflectional noun suffixes, which is the most interesting phenomenon in this area, and with my implementation of these principles in the Xerox Finite State Toolbox (XFST). I also pay attention to the other morphophonological alternations that occur in the stem and in the suffixes attached to it as a result of the inflectional processes.

Keywords: Natural Language Processing, Finite State Networks, Morphology, Computational Linguistics, Turkish

To my family, to my love

ACKNOWLEDGEMENTS
First, I would like to thank my supervisor, Dr. Dale Gerdemann, for his support and advice on this project. I appreciate the freedom and independence I had in the choice of topic and approach. I would also like to thank Dr. Sandra Kübler for her support and understanding throughout this course of studies, which in many cases turned out to be crucial for my progress. Many, many thanks to my family for their constant support, no matter what was happening. Thanks to my friends for their understanding. And most of all, special thanks to Nevin Recep for sparking my interest in the Turkish language and supporting me all the time.

TABLE OF CONTENTS
ABSTRACT
DEDICATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
1. INTRODUCTION
    1.1 MOTIVATION
    1.2 MORPHOLOGY
    1.3 RELATED WORK
    1.4 OVERVIEW
2. BACKGROUND
    2.1 TURKISH
    2.2 FINITE STATE TECHNOLOGY
        2.2.1 Finite State Automata (FSA)
        2.2.2 Finite State Transducers (FSTs)
    2.3 XFST
3. THE MODEL
    3.1 THE NOMINAL PARADIGM OF TURKISH. MORPHOTACTICS
        3.1.1 Inflection for Number
        3.1.2 Case Inflection
        3.1.3 Inflection for Possession
        3.1.4 Lexical Exceptions: the su case
    3.2 PHONOLOGICAL ALTERNATION RULES
        3.2.1 Resolving Vowel Harmony
        3.2.2 Consonant Alternation Rules
            3.2.2.1 Final Consonant (De)Voicing
            3.2.2.2 (De)Gemination
        3.2.3 Other Alternations
            3.2.3.1 Vowel Insertion/Deletion
            3.2.3.2 The Glottal Stop
    3.3 IMPLEMENTATION
        3.3.1 The Lexicon
        3.3.2 The Rules Component
            3.3.2.1 Vowel Harmony Rules
            3.3.2.2 Consonant Alternation Rules
            3.3.2.3 Fixing the Morphotactics
            3.3.2.4 Rule Order
4. CONCLUSIONS
5. FUTURE WORK
APPENDIX A: LIST OF ABBREVIATIONS
APPENDIX B: LEXC CODE SAMPLES
APPENDIX C: ON REPLACEMENT RULES

1. Introduction
1.1 Motivation
In morphologically rich languages like Bulgarian, Turkish, Russian, Spanish and many others, grammatical features and functions that morphologically poor languages like English typically assign to the syntactic structure are often represented in the morphological structure. As a consequence, any adequate Natural Language Processing (NLP) application for these languages requires a good morphological component. This in turn requires a rich lexicon, and a lexicon that explicitly lists all possible forms as separate entries would quickly explode to an unmanageable size because of the rich inflectional and derivational possibilities for a single base (dictionary) form (stem). In Turkish, for example, the nominal inflectional paradigm has three basic types of suffixes, for number, possession and case (the number varies in the different sources), and the verbal inflectional paradigm is even more complicated with its eight affixes (again, the number might differ depending on the source). There are approximately 20,000 stems and 300-400 roots actively used in Turkish, which effectively amount to millions of inflected and derived forms. This further increases the demand for automated morphological analysis. As it turns out, morphological structures are much more regular than syntactic ones. They can be handled very efficiently and accurately using sets of rules and compact lexicons of base forms (stems). Furthermore, important semantic and grammatical information can be encoded in such lexicons as well.

1.2 Morphology
The central concepts of morphology are morphotactics and (morpho)phonological alternations. Morphotactics (also morphosyntax or word formation) defines the constraints on possible morpheme combinations. Phonological (also orthographical) alternations define the changes that morphemes undergo in particular environments. An example from Karttunen (Karttunen, 2003) illustrates the issue:

(1)   pity
      piti-less
      piti-less-ness
      (Karttunen, 2003, p. XVI)

The morphotactic definition accounts for the acceptability of a word like piti-less-ness and the unacceptability of a word like *piti-ness-less. Phonological alternations, on the other hand, describe why pity is realized as piti in the context of a following -less. These are simple examples that could easily be captured with a few basic rules. But full-scale NLP needs a much more sophisticated system. This holds especially for agglutinative languages like Turkish, where the concept of a word is much wider: relations between the words in a sentence are mostly expressed by affixes. Furthermore, many affixes and roots in Turkish change their shape depending on the environment and have to obey various constraints such as vowel harmony.
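The distinction between morphotactics and alternation in example (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-stem lexicon, the ALLOWED_NEXT table and both function names are hypothetical and cover nothing beyond the pity example.

```python
# Hypothetical toy lexicon for example (1); not a description of English.
STEMS = {"pity"}

# Morphotactics: which suffix may follow which preceding morpheme.
ALLOWED_NEXT = {
    "STEM": {"less"},
    "less": {"ness"},
    "ness": set(),
}


def is_well_formed(morphemes):
    """Check a morpheme sequence against the toy morphotactics."""
    if not morphemes or morphemes[0] not in STEMS:
        return False
    previous = "STEM"
    for suffix in morphemes[1:]:
        if suffix not in ALLOWED_NEXT.get(previous, set()):
            return False
        previous = suffix
    return True


def spell_out(morphemes):
    """Concatenate morphemes, realizing stem-final y as i before a suffix."""
    stem, suffixes = morphemes[0], list(morphemes[1:])
    if suffixes and stem.endswith("y"):
        stem = stem[:-1] + "i"
    return "".join([stem] + suffixes)


print(is_well_formed(["pity", "less", "ness"]))  # morphotactically fine
print(is_well_formed(["pity", "ness", "less"]))  # wrong suffix order
print(spell_out(["pity", "less", "ness"]))
```

The first function only decides which combinations are allowed, while the second only decides how an allowed combination is spelled; keeping the two apart mirrors the division of labor described above.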

1.3 Related Work


A significant amount of work has already been done on the computational modeling of Turkish morphology: Köksal's first approach to a computerized model for automatic morphological analysis of Turkish (Köksal, 1975); Hankamer's description in terms of finite state morphology (Hankamer, 1986); numerous recent works by Kemal Oflazer based on his two-level model of Turkish (Oflazer, 1994); and Schaaik's Studies in Turkish Grammar (Schaaik, 1996), a comprehensive guide to building a computational model for full nominal phrases using the functional grammar formalism (Dik, 1981). Since the earlier works are hard to find, I will briefly discuss only the more recent works by Oflazer and Schaaik, which are closely related to what I am doing in this project. Oflazer's work is based primarily on his two-level model for Turkish morphology (Oflazer, 1994). The idea behind the two-level models originates from Koskenniemi (Koskenniemi, 1983). The most significant difference from the ordered linear approach of composed sequences of rule transducers[1] is that all the rules operate in parallel. To illustrate the difference, a basic two-level model and a cascade-based model relating the languages of lexical and surface forms are presented in Figure 1.1 below:

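[Figure 1.1 (diagram): in the cascade panel, FST 1, FST 2, ..., FST n are chained between the Lexical Form and the Surface Form, with intermediate forms passed from one transducer to the next; in the two-level panel, FST 1, FST 2, ..., FST n all stand between the Lexical Form and the Surface Form.]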

Figure 1.1: Cascade-based and two-level (parallel) models in finite state morphology.

In the cascade model of composed rule transducers, each transducer operates on its own input and output, producing an intermediate output that feeds the next transducer in the cascade. Feeding is the key concept here: the major drawback of the two-level models has been that bleeding or feeding relations between rules (which are common in generative phonology) can hardly be defined within this approach, other than by designing the rules very carefully in order to get the necessary result. But the convenience of the cascade-based model in this respect comes at a price. In the process of composition, the network can easily explode to an unmanageable size, as many parts of it may need to be copied. Luckily, there are some techniques to restrict such growth. My project combines both models, as we shall see later. The advantage is that whenever parallel operation of rules is needed, the parallel model is used, and whenever sequential (linear) operation of rules is needed, the cascade is used.

[1] More on transducers and automata follows in the technical background on finite state technology in Section 2.2. For now, think of rule transducers simply as a way to implement rules.

1.4 Overview
In the following sections I will present a finite state approach to a part of Turkish morphology. I will focus on the nominal morphology only, in particular the different inflectional paradigms, as the complete nominal morphology of Turkish is too broad a subject to cover here (let alone the complete Turkish morphology). Once a solution for the nominal morphology is designed, however, it could easily be extended to cover the other major word classes of the language. I will try to approach the task as modularly as possible, so that if changes or extensions are required, all that is needed is to plug in the extension component and occasionally tune the system a little. The key concept here is modularity. My work is based primarily on Geoffrey Lewis's Turkish (Lewis, 1989) and Turkish Grammar (Lewis, 1967), which most papers refer to as the standard language guides for Turkish. For the purpose of this project I will be using the Xerox Finite State Toolbox (XFST) and its manual by Lauri Karttunen (Karttunen, 2003). In Section 2 I will outline the background information needed to follow the paper: Section 2.1 provides the linguistic background on Turkish; Sections 2.2 and 2.3 provide some technical background on the technology employed and the particular toolbox I have chosen to use. The actual model and its implementation are presented in Section 3. I conclude in Section 4, and in Section 5 I present an outlook on possible future elaborations.

2. Background
In the following sections I will present the basic technical properties of the language and the technology used to model it.

2.1 Turkish
In this subsection I will present the most important features of Turkish that we will be dealing with in the subsequent sections. Turkish is an agglutinative language from the family of Turkic languages. A Turkish word consists of a root (base form) and a number of suffixes attached to it, each extending its meaning or changing its word class:

(2)   bilgi                     'knowledge'
      bilgisiz                  'without knowledge'
      bilgisizlik               'lack of knowledge'
      bilgisizlikleri           'their lack of knowledge'
      bilgisizliklerinden       'from their lack of knowledge'
      bilgisizliklerindenmiş    'I gather that it was from their lack of knowledge'
      (Lewis, 1989, p. 3)

As one might infer, many ideas typically expressed by prepositions or pronouns across languages are expressed by suffixes in Turkish. Another important feature of the Turkish language is vowel harmony. Vowel harmony is basically a progressive sound assimilation phenomenon: put simply, the features of a vowel depend on the features of the preceding vowel. We will be dealing exclusively with the vowel harmony of suffixes in Turkish and, as mentioned before, the scope of this project is restricted to inflectional noun suffixes only. Geoffrey Lewis (Lewis, 1989) describes vowel harmony in Turkish with a general law of vowel harmony in terms of the feature +/-back of vowels. The Turkish vowel system is shown in Table 2.1 below:

             Unrounded         Rounded
             Low     High      Low     High
Front        e       i         ö       ü
Back         a       ı         o       u

Table 2.1: The vowel system of Turkish.

As stated in (Lewis, 1989), all the vowels in a word agree with the backness value of the first vowel of that word:

(3)   -Back (front vowels)              +Back (back vowels)
      sekiz 'eight'                     dokuz 'nine'
      seksen 'eighty'                   doksan 'ninety'
      sinir 'nerve'                     sınır 'frontier'
      sinirler 'nerves'                 sınırlar 'frontiers'
      sinirlerimiz 'our nerves'         sınırlarımız 'our frontiers'
      (Lewis, 1989, p. 11)

In cases of disharmony[1] in the root, or if an invariable suffix is attached, the harmonic suffixes harmonize with the vowel of the last preceding syllable. So attaching the plural suffix -ler/-lar, which harmonizes for backness, to anne (mother) results in anneler (mothers) and not in *annelar, which would harmonize with the vowel of the first syllable.
[1] Exceptions to this principle are: a small number of native Turkish words, such as elma (apple), anne (mother), kardeş (brother or sister); eight invariable suffixes; compound words, such as bilgisayar (computer), from bilgi (information) and sayar (counter, lister); and loanwords. Clements and Sezer account for them in (Clements, 1982).

There is also what Lewis (Lewis, 1989) refers to as a special law of vowel harmony, which constrains the occurrence of vowels in terms of roundedness[1]. Unrounded vowels are typically followed by unrounded vowels, and rounded vowels are typically followed by low unrounded or high rounded vowels. Combining the two principles, we end up with the following:

(4)   a is followed by a or ı
      e is followed by e or i
      ı is followed by a or ı
      i is followed by e or i
      o is followed by a or u
      ö is followed by e or ü
      u is followed by a or u
      ü is followed by e or ü
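The combined principle in (4) amounts to a successor table over vowels, which can be checked mechanically. The following is a minimal sketch, assuming lowercase input in standard Turkish orthography; the names FOLLOWERS and is_harmonic are hypothetical, and the check ignores the exceptions discussed in the footnotes.

```python
# Successor table transcribed from (4): vowel -> vowels that may follow it.
FOLLOWERS = {
    "a": {"a", "ı"},
    "e": {"e", "i"},
    "ı": {"a", "ı"},
    "i": {"e", "i"},
    "o": {"a", "u"},
    "ö": {"e", "ü"},
    "u": {"a", "u"},
    "ü": {"e", "ü"},
}


def is_harmonic(word):
    """Return True if every pair of consecutive vowels in word obeys (4).

    Assumes the word is already lowercase; consonants are skipped.
    """
    vowels = [ch for ch in word if ch in FOLLOWERS]
    return all(nxt in FOLLOWERS[prev] for prev, nxt in zip(vowels, vowels[1:]))


for word in ["sinirlerimiz", "sınırlarımız", "dokuz", "anne", "bilgisayar"]:
    print(word, is_harmonic(word))
```

Words such as anne and bilgisayar come out as disharmonic, which matches their status as exceptions to the general law.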

Turkish suffixes, except the eight invariable ones, harmonize with (for the sake of simplicity) the vowel of the last syllable of the word they are attached to. They can be divided into two groups: the vowels of the first group alternate between the low unrounded vowels a and e (the so-called e-type[2] suffixes (Pollard, 1996)), and the vowels of the second group alternate between the high vowels ı, i, u and ü (the so-called i-type[2] suffixes (Pollard, 1996)). With one exception, the present tense verbal suffix -ıyor/-iyor/-uyor/-üyor, no suffixes contain o or ö. (4) above gives a basic notion of this classification. The plural suffix -ler/-lar falls in the first class, whereas a suffix like the definite objective case suffix is an i-type suffix.

(5)   ev (house)         evler (houses)         evi (the house)
      kol (arm)          kollar (arms)          kolu (the arm)
      kitap (book)       kitaplar (books)       kitabı (the book)
      köprü (bridge)     köprüler (bridges)     köprüyü (the bridge)

One might notice a few additional things from (5). First of all, vowel sequences are generally not possible in Turkish; exceptions are some loanwords like saat (hour). Typically a buffer y is inserted if a suffix beginning with a vowel is attached to a word ending in a vowel. In some cases the buffer is an n or an s. Second, words in Turkish typically end in voiceless consonants, which change to voiced ones intervocalically. This topic, along with the other alternations occurring in the process of suffixation, is elaborated further in Section 3.2.2. These are the general morphological and phonological features of Turkish that we will pay attention to. In Sections 3.1 and 3.2 I will present the actual morphotactics of the Turkish nominal inflectional paradigm and the phonological alternation rules, respectively.
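The two classes of harmonizing suffix vowels can be sketched as two small functions that pick the suffix vowel from the last vowel of the word. This is an illustrative sketch only: the function names are hypothetical, lowercase input is assumed, and buffer consonants and consonant voicing (as in kitap, kitabı) are deliberately left out.

```python
VOWELS = set("aeıioöuü")
FRONT = set("eiöü")
ROUNDED = set("oöuü")


def last_vowel(word):
    """Return the last vowel of word (the vowel a suffix harmonizes with)."""
    for ch in reversed(word):
        if ch in VOWELS:
            return ch
    raise ValueError("no vowel in " + repr(word))


def e_type_vowel(word):
    """Low unrounded suffix vowel: e after front vowels, a after back vowels."""
    return "e" if last_vowel(word) in FRONT else "a"


def i_type_vowel(word):
    """High suffix vowel agreeing with the last vowel in backness and rounding."""
    vowel = last_vowel(word)
    if vowel in FRONT:
        return "ü" if vowel in ROUNDED else "i"
    return "u" if vowel in ROUNDED else "ı"


def plural(noun):
    """Attach the e-type plural suffix -ler/-lar."""
    return noun + "l" + e_type_vowel(noun) + "r"


for noun in ["ev", "kol", "kitap", "köprü"]:
    print(noun, plural(noun), i_type_vowel(noun))
```

Because the functions look only at the last vowel, a disharmonic root such as anne still receives the expected plural anneler.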

[1] Exceptions to this principle are: tapu (title-deed), avuç (hollow of the hand), abuk sabuk (nonsensical), çamur (mud); in general, a can be followed by u if a p, v, b or m intervenes. These exceptions apparently occur only root-internally and do not seem to affect suffixation: kitap (book), kitabı (book, definite objective case, "the book").
[2] The e-/i-type distinction is really a distinction between harmonizing vowels and not between suffixes, as Pollard (Pollard, 1996) proposes. Some suffixes, like the 3pPl Poss. -leri/-ları, feature both types of harmonizing vowels.

2.2 Finite State Technology


Finite state technology was quickly condemned by linguists in the early stages of its development because of its weak descriptive power. Later on, however, it proved to be quite useful for modeling parts of languages that can be considered finite and regular. Various tasks are nowadays approached using finite state technology: part-of-speech disambiguation, tokenization, shallow parsing and others. But the most significant and core application of finite state technology in NLP remains morphological analysis, which is the basis for any further kind of natural language processing. The basic idea behind finite state technology is a set of states with different properties and a set of arcs that connect these states. Arcs have a direction and an input symbol. That is, for a particular state there is a set of outgoing arcs with their respective input symbols. The states and arcs together form networks[1].

2.2.1 Finite State Automata (FSA)


Finite state networks typically have one start state and one or more final states. Transitions between the states are possible only if the required input is recognized. The sequence of transitions over arcs to a particular state is called a path. In the network in Figure 2.1 there are two possible paths to the final state 3. In order to accept a string, the network must be in a final state at the end of the input. Valid inputs for the network in Figure 2.1 are b and ab, but not a by itself.

[Figure 2.1 (diagram): states 1, 2 and 3; the arcs, as implied by the accepted strings, are 1 -a-> 2, 2 -b-> 3 and 1 -b-> 3.]
Figure 2.1: A simple three-state network. The state marked with an arrow (1) is the start state, the state marked with a double circle (3) is the final state.

For the slightly more complicated network in Figure 2.2, valid input sequences are: b, ab, bcb, abcb, bcab, abcab, and so on.

[Figure 2.2 (diagram): the network of Figure 2.1 with an additional arc labeled c from state 3 back to state 1.]
Figure 2.2: A slightly more complicated three-state network. The arc with input c takes us back to the start state, creating a loop.

Because of the looping arc through c, we end up with an infinite set of acceptable input strings. All the possible input strings in this case follow a particular (regular) pattern. Enumerating all the inputs is unreasonable; we would rather define a rule that selects the valid inputs. A more compact representation can be defined using regular expressions. A regular expression (or regex) is a pattern that matches a set of strings which obey particular syntax rules. It is an essential concept in finite state technology. Regular expressions describe the languages accepted by Finite State Automata, the regular languages. The regular expressions used in current toolboxes are only partially related to regular expressions in the strict formal sense: every toolbox defines additional operations that extend its capabilities and expressive power, and the precise syntax varies among applications and toolboxes. I will describe the necessary syntax basics in further detail, in terms of the toolbox I am using, in Section 2.3. A model solution for the above networks using the lexc language is provided in the appendix.

[1] We will be talking about networks here as a general term abstracting over transducers and automata. Automata are finite state machines that only accept a set of given strings (a language), whereas transducers provide a set of outputs for an accepted input, which might as well be identical to the input. Automata describe languages, whereas transducers express relations between languages.
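The acceptance procedure described for Figure 2.2 can be sketched as a transition table and a loop over the input. This is a minimal sketch, not the lexc solution from the appendix: the arc table is an assumption inferred from the accepted strings listed in the text (b, ab, bcb, ...), and the names ARCS and accepts are hypothetical.

```python
START = 1
FINAL = {3}

# (state, input symbol) -> next state; arcs inferred from the accepted strings.
ARCS = {
    (1, "a"): 2,
    (1, "b"): 3,
    (2, "b"): 3,
    (3, "c"): 1,  # the looping arc back to the start state
}


def accepts(string):
    """Follow one arc per input symbol; accept only if we end in a final state."""
    state = START
    for symbol in string:
        state = ARCS.get((state, symbol))
        if state is None:  # no arc for this symbol: the input is rejected
            return False
    return state in FINAL


for candidate in ["b", "ab", "bcb", "abcab", "a", "abc"]:
    print(candidate, accepts(candidate))
```

Under the same assumption, the accepted language corresponds to the ordinary regular expression (a?bc)*a?b, which illustrates how a pattern replaces the infinite enumeration of inputs.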

2.2.2 Finite State Transducers (FSTs)


A Finite State Network (or a Finite State Machine), as noted above, is the general term for Finite State Automata (FSA) and Finite State Transducers (FSTs). Where FSA deal with acceptance/recognition only, FSTs also provide output(s) for the recognized input. This major difference is described using symbol pairs in the model in Figure 2.3 below:

[Figure 2.3 (diagram): the network of Figure 2.2 with the arc labels a:A, b:B, b:B and c.]
Figure 2.3: A Finite State Transducer. It accepts the same strings as the FSA in Figure 2.2, but transforms the lowercase a's and b's into uppercase A's and B's respectively. The c's remain unchanged.
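The behavior of the transducer in Figure 2.3 can be sketched by attaching an output symbol to each arc of an acceptor. As before, the arc table is an assumption inferred from the figure labels and the accepted strings, and the names ARCS and transduce are hypothetical.

```python
START = 1
FINAL = {3}

# (state, input symbol) -> (output symbol, next state)
ARCS = {
    (1, "a"): ("A", 2),
    (1, "b"): ("B", 3),
    (2, "b"): ("B", 3),
    (3, "c"): ("c", 1),  # identity pair: c is related to itself
}


def transduce(string):
    """Return the output string for an accepted input, or None if rejected."""
    state, output = START, []
    for symbol in string:
        arc = ARCS.get((state, symbol))
        if arc is None:
            return None
        out_symbol, state = arc
        output.append(out_symbol)
    return "".join(output) if state in FINAL else None


print(transduce("ab"))
print(transduce("abcb"))
print(transduce("a"))
```

This particular network is deterministic on its input side, so a single result suffices; a general transducer may relate one input to several outputs.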

For an input string like ab the output will be AB, for abcb it will be ABcB, and so on. This looks like a simple replacement operation, but no such operation is involved here. Instead, strings from one language (later on referred to as the UPPER language[1]) are related to strings from another language (which will be called the LOWER language[1]). The c, which remains unchanged, is subject to the identity relation. These are the basics. Once we have designed a network describing a language or a relation, we can apply different operations to it: intersection (&)[2], union (|), concatenation ( ), negation (~), subtraction (-), composition (.o.), etc. The essential terms will be explained as needed as we proceed. The most important operation to note here is composition (.o.). A general feature of Finite State Networks is that they can be composed together, yielding a sequence of transducers or automata, a modular structure that is essential to our purpose in this paper. Composition is an operation on two relations. Say we have the transducer above (Figure 2.3), which turns lowercase a's and b's into uppercase A's and B's respectively. In terms of relations, this can be described as <a,A> and <b,B>. Say we then have another transducer that turns capital A's and B's into numbers, <A,1> and <B,2>. Composing the two of them gives us a new transducer that takes the upper side of the first and the lower side of the second transducer, wherever the inner symbols match:

(6)   [<a,A>, <b,B>] .o. [<A,1>, <B,2>]  =>  [<a,1>, <b,2>]

[1] The terms will be explained in more detail in Section 2.3.
[2] The operators and their syntax vary among toolboxes. I will be using the ones described in (Karttunen, 2003).

All the operations can be applied multiple times to different networks. For some of them the order matters, for others it does not. Composition allows us to build a cascade of multiple transducers into a single transducer. For the task at hand, this means composing multiple rule transducers into a single lexical transducer that relates strings from the language of surface forms to strings from the language of lexical (underlying) forms. It was C.D. Johnson (Johnson, 1972) who first realized that morphophonological knowledge could be modeled using finite state networks. The most fascinating part is that once we have constructed a transducer for morphological generation, we can easily apply it in the other direction for the task of morphological analysis. This natural feature of finite state networks is what makes them so suitable for morphological processing. I will spare the reader the mathematical model behind Finite State Networks, as it is not necessary for understanding the current paper. For further information on finite state technology and automata theory refer to (Hopcroft, 1979).
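Composition and bidirectional application can be illustrated on the symbol pairs from (6) by treating each relation as a plain set of pairs. This is a sketch of the relational idea only, not of how XFST represents networks internally; the function names compose, invert and apply_down are hypothetical.

```python
def compose(first, second):
    """Pairs (x, z) such that (x, y) is in first and (y, z) is in second."""
    return {(x, z) for (x, y1) in first for (y2, z) in second if y1 == y2}


def invert(relation):
    """Swap the upper and lower side of every pair."""
    return {(lower, upper) for (upper, lower) in relation}


def apply_down(relation, symbol):
    """All lower-side symbols related to the given upper-side symbol."""
    return {lower for (upper, lower) in relation if upper == symbol}


to_upper_case = {("a", "A"), ("b", "B")}
to_digit = {("A", "1"), ("B", "2")}

composed = compose(to_upper_case, to_digit)
print(sorted(composed))
print(apply_down(composed, "a"))          # generation direction
print(apply_down(invert(composed), "2"))  # the same relation used for analysis
```

The last line shows the point made above: nothing new has to be built for analysis, since the relation constructed for generation is simply read in the other direction.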

2.3 XFST
The Xerox Finite State Toolbox (XFST) was developed at the Xerox Research Centre Europe (XRCE) by Kenneth R. Beesley and Lauri Karttunen. It implements the standard finite state operations such as composition and union, as well as several innovative operations like replacement rules[1] and local sequentialization. XFST includes lexc, a compiler for lexicons in the lexc language, which is specifically designed for handling morphotactics in natural languages, and xfst, the core tool, which provides an interface to the finite state calculus for building, accessing and manipulating Finite State Networks, together with a compiler for regular expressions and replacement rules, which will be essential to my work. Additionally, there is a compiler for two-level morphology rules (twolc) as described by Koskenniemi (Koskenniemi, 1983), but its application is beyond the scope of my work, so I will leave it aside. XFST also provides two tools, lookup and tokenize, designed for testing and applying larger projects, but they will not be discussed any further in this paper. In the process of implementing a morphological analyzer, the morphotactics will be defined in lexc, as intended, whereas the phonological/orthographical alternation rules will be defined as separate transducers (mostly using replacement rules) and composed together into a single transducer. This transducer is in turn composed with the network derived from the lexc definition of the lexicon, finally resulting in a lexical transducer that serves our final purpose. Additional transducers can be composed with the network at hand to impose restrictions, define alternations or add more content. XFST defines transducers as relations between two languages. The upper language can be thought of as the input and the lower language as the output when we apply an input to a transducer downwards.
If we apply input to the transducer upwards, the roles switch: the input is applied on the lower side and the output comes from the upper side. Although this may seem a bit confusing, the terms upper and lower remain constant. In the definition of a lexical transducer, the upper side language describes the lexical (underlying) forms of the language to be analyzed, and the lower side language contains the actual surface forms in the standard orthography.

[1] A brief overview of the formalism is available in the appendix.

3. The Model
In this section I will present the nominal paradigm of Turkish and my implementation of it. There are two modules in the model: the lexicon, defining the morphotactics of Turkish nouns, and the morphophonological rules component, describing the alternations occurring on the surface. In Sections 3.1 and 3.2 I will present the theoretical background behind my model. An important notion in the following sections is that of archiphonemic descriptions. As I was implementing the vowel harmony principles using variables for the alternating vowel segments, I realized that the idea of using variables could be employed further to describe other phenomena, such as the consonant alternations. My initial approach, which applied consonant alternation rules to the surface forms, failed to describe the exceptional cases, so I had to redesign it using underspecified abstract symbols on the lexical side for entries that do undergo the alternations, while fully specifying the entries that do not. The general idea is that I will be using, both in theory and in practice, the so-called archiphonemes to describe classes of similar phonemes that alternate depending on the environment. For example, to describe vowel harmony I will be using I to generalize over the class of high vowels that alternate according to the principle of i-type vowel harmony, and E to generalize over the class of low unrounded vowels that alternate in accordance with the principle of e-type vowel harmony. The symbols denoting the particular classes of alternating phonemes will be defined as needed as we proceed.

3.1 The Nominal Paradigm of Turkish. Morphotactics


The nominal inflectional paradigm is defined in different ways in the various sources. The basic pattern on which everyone agrees though is: STEM NUM POSS CASE Turkish has no distinction of grammatical gender. Worth mentioning is that in some sources, the relativising suffix ki is classified as part of the nominal inflectional paradigm. At the current stage of development I wont be concerned with it however. On the other side, casetype suffixes are also differently defined in the various sources in some of the recent works, the suffix (y)la/(y)le is classified as an instrumental case suffix. Well get back to this issue in the subsequent sections. NUM POSS CASE STEM

[Figure: FSA diagram; recoverable arc labels: STEM, NUM, POSS, CASE and 0.]

Figure 3.1: A simplified FSA model for the nominal morphotactics in Turkish.

So let's have a closer look at the core of the Turkish noun paradigm. The definition will be further extended in the subsequent sections.


3.1.1 Inflection for Number


The basic uninflected dictionary form of Turkish nouns is singular (or, as claimed in some sources, numberless). The plural form is derived by attaching the -ler/-lar suffix. It generally comes before any other inflectional suffix. Its vowel is of e-type harmony, so the compact representation using an archiphonemic description is -lEr. Ketrez (Ketrez, 2003) provides an extensive study on the multiple readings of the Turkish plural morpheme, but it is mostly from syntactic and semantic points of view, and I won't discuss the issue any further.

3.1.2 Case Inflection


Lewis (Lewis, 1967, 1989) defines six cases in his grammar of Turkish. Table 3.1 below provides an overview of the case paradigm in Turkish:

Case \ Last preceding vowel        e or i    ö or ü    a or ı    o or u
Absolute (Nominative)              (none)    (none)    (none)    (none)
Definite Objective (Accusative)    -(y)i     -(y)ü     -(y)ı     -(y)u
Genitive (of)                      -(n)in    -(n)ün    -(n)ın    -(n)un
Dative (to, for)                   -(y)e     -(y)e     -(y)a     -(y)a
Locative (in, on, at)              -de       -de       -da       -da
Ablative (from, out of)            -den      -den      -dan      -dan

Table 3.1: Summary of case suffixes in Turkish.

The bracketed y and n are realized on the surface only if the word the suffix is attached to ends in a vowel. The locative and ablative suffixes are generally realized as -de/-da and -den/-dan, but when attached to a word ending in a voiceless consonant (ç, f, h, k, p, s, ş, and t), they are realized as -te/-ta and -ten/-tan respectively. So, using archiphonemic descriptions and the principles of vowel harmony, the case inflection summary looks like this:

Case                               Lexical Form of the Suffix
Absolute (Nominative)              (none)
Definite Objective (Accusative)    -(y)I
Genitive (of)                      -(n)In
Dative (to, for)                   -(y)E
Locative (in, on, at)              -DE
Ablative (from, out of)            -DEn

Table 3.2: Summary of case suffixes in Turkish using archiphonemic descriptions.

A few examples (LF = lexical form, SF = surface form):

(7) araba (car, Nom.)    araba-(y)I (car, Acc. / LF)    arabayı (car, Acc. / SF 'the car')
    ev (house, Nom.)     ev-DE (house, Loc. / LF)       evde (house, Loc. / SF 'in the house')
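The mapping from the lexical forms of Table 3.2 to surface forms can be illustrated with a minimal Python sketch. It is not the xfst implementation described later: the function name `attach_case` and the table `CASE_SUFFIXES` are hypothetical, stems are assumed to be plain surface strings without archiphonemes, and only the buffer consonants, the D alternation and the two harmony types are handled.

```python
# Illustrative sketch only (assumed names, not the thesis's xfst rules).
BACK = set("aıou")
ROUNDED = set("oöuü")
VOWELS = set("aeıioöuü")
VOICELESS = set("çfhkpsşt")
# i-type harmony: (last vowel is back, last vowel is rounded) -> high vowel
HIGH = {(True, True): "u", (True, False): "ı",
        (False, True): "ü", (False, False): "i"}

CASE_SUFFIXES = {"Acc": "(y)I", "Gen": "(n)In", "Dat": "(y)E",
                 "Loc": "DE", "Abl": "DEn"}


def attach_case(stem, case):
    """Attach a case suffix to a surface stem, resolving it left to right."""
    suffix = CASE_SUFFIXES[case]
    out = stem
    i = 0
    while i < len(suffix):
        ch = suffix[i]
        if ch == "(":
            # "(y)" / "(n)": the buffer consonant surfaces only after a vowel.
            if out[-1] in VOWELS:
                out += suffix[i + 1]
            i += 3
            continue
        if ch == "D":
            # d/t alternation depends on the voicing of the preceding segment.
            out += "t" if out[-1] in VOICELESS else "d"
        elif ch in "EI":
            prev = [c for c in out if c in VOWELS][-1]  # last preceding vowel
            if ch == "E":
                out += "a" if prev in BACK else "e"
            else:
                out += HIGH[(prev in BACK, prev in ROUNDED)]
        else:
            out += ch
        i += 1
    return out


print(attach_case("araba", "Acc"), attach_case("ev", "Loc"))
```

Calling `attach_case("kitap", "Loc")` additionally exercises the voiceless context for D.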


As mentioned above, some more recent works treat what used to be (and I believe still is) a postposition (ilE) following absolute or genitive forms as an additional instrumental/comitative case suffix (-(y)lE). As far as I know, however, it is still used both as a postposition and as a cliticized suffix. I will stick to the classic works for now and treat it as a separate (non-case) suffix[1].

3.1.3 Inflection for Possession


Where in many languages possession is expressed using pre- or post-posed pronouns (English: my/mine, your/yours, his, her/hers, etc.; German: mein (my), dein (your), sein (his), ihr (her), etc.; Bulgarian: pre-posed [moy] (my), [tvoy] (your), [negov] (his), [nein] (her); post-posed [mi] (my, of mine), [ti] (your, of yours), [mu] (his, of his), [i] (her, of hers), etc.), in Turkish possession is expressed by suffixes. The complexity of the possessives varies across languages, depending on their overall morphological complexity. In Bulgarian, for example, the pre-posed possessives act pretty much like adjectives and typically precede them, so they carry the inflection for gender, number and definiteness. In Turkish the possessive suffixes are partially derived from the present tense forms of the verb 'to be'. A summary of the possessive suffixes is presented in Table 3.3 below:

Person    Suffix     Gloss
1pSg      -(I)m      my
2pSg      -(I)n      your
3pSg      -(s)I      his/her/its
1pPl      -(I)mIz    our
2pPl      -(I)nIz    your
3pPl      -lErI      their

Table 3.3: Summary of possessive suffixes in Turkish using archiphonemic descriptions.

Again, the bracketed segments surface only under particular conditions. In contrast to the case suffixes, where the bracketed segments surface only if the word they are attached to ends in a vowel, here the optional segments surface either if the word ends in a consonant (for the first and second person singular and plural) or if the word ends in a vowel (for the third person singular). So we have vowel deletion in one case and consonant insertion in the other, both serving to avoid vowel sequences[2].

(8) ev (house)     ev-(I)m (house, 1pSg Poss. / LF)      evim (house, 1pSg Poss. / SF 'my house')
    araba (car)    araba-(I)mIz (car, 1pPl Poss. / LF)   arabamız (car, 1pPl Poss. / SF 'our car')
    araba (car)    araba-(s)I (car, 3pSg Poss. / LF)     arabası (car, 3pSg Poss. / SF 'his/her car')
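The behaviour of the bracketed segments in both paradigms can be summarized by one condition: an optional segment surfaces only when it differs in type (vowel vs. consonant) from the segment before it. The following Python sketch, with the hypothetical helper `attach_lexical`, illustrates just this step under that assumption; the harmony vowels are deliberately left abstract.

```python
# Illustrative sketch only: resolves bracketed segments, nothing else.
# E and I are counted as vowels so that abstract vowels also block or license
# a following optional segment.
VOWELS = set("aeıioöuüEI")


def attach_lexical(stem, suffix):
    """Attach a suffix such as '(I)m' or '(s)I', keeping or dropping the
    bracketed segment so that no vowel sequence arises."""
    out = stem
    i = 0
    while i < len(suffix):
        if suffix[i] == "(":
            seg = suffix[i + 1]
            # Keep a vowel only after a consonant, a consonant only after a vowel.
            if (seg in VOWELS) != (out[-1] in VOWELS):
                out += seg
            i += 3
        else:
            out += suffix[i]
            i += 1
    return out


for stem, suffix in [("ev", "(I)m"), ("araba", "(I)mIz"), ("araba", "(s)I")]:
    print(stem, suffix, attach_lexical(stem, suffix))
```

Note that `attach_lexical("ev", "(s)I")` and `attach_lexical("ev", "(y)I")` yield the same string, since both optional consonants are dropped after a consonant.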

[1] Lewis (Lewis, 1967, 1989) states that it is attached to nominative nouns and genitive pronouns; in this sense it could be considered an additional case suffix. I will leave it aside until I get a clearer view on the issue.
[2] More on vowel sequences to come in the description of the rules in the following sections.


Possessive suffixes precede case suffixes. Looking at the two inflectional paradigms again, one may notice that some of the suffixed forms can overlap on the surface. For example, the underlyingly different ev-(y)I (house, Definite Objective (Accusative) case, 'the house') and ev-(s)I (house, 3pSg possessive, 'his house') end up exactly the same on the surface, as evi:

(9) ev (house)    ev-(y)I (house, Acc. / LF)          evi (house, Acc. / SF 'the house')
    ev (house)    ev-(s)I (house, 3pSg Poss. / LF)    evi (house, 3pSg Poss. / SF 'his/her house')

Things get further complicated if there are multiple instances of the plural suffix -lEr. In the case of the 3pPl possessive, for example, if the possessed noun is already plural, as in evler (houses), one -lEr gets deleted: *evlerleri -> evleri (their houses). So we end up with the single form evleri for both 'their house' and 'their houses'. A closer look, however, reveals even further complications: evleri could also denote the accusative case of the plural ('the houses') and the 3pSg possessive of the plural ('his/her houses'). Even though Turkish is morphologically highly specified, we often have 2-, 3- or, as in this case, 4-fold ambiguities. The derivations from the underlying lexical representations of the four interpretations of evleri are given in (10) below:

(10) Pl.Acc. (the houses):                ev-lEr-(y)I    ->  ev-ler-I     ->  evleri
     Pl.3pPl.Poss. (their houses):        ev-lEr-lErI    ->  ev-ler-leri  ->  evleri
     Sg.3pPl.Poss. (their house):         ev-lErI        ->  ev-leri      ->  evleri
     Pl.3pSg.Poss. (his/her houses):      ev-lEr-(s)I    ->  ev-ler-i     ->  evleri
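That the four lexical forms in (10) collapse into a single surface form can be checked with a small Python sketch. It is a simplification under stated assumptions, not the actual rule set: the function `realize` is hypothetical, it collapses a doubled plural, treats every bracketed segment as a buffer consonant, and resolves E and I against the last preceding vowel.

```python
# Illustrative sketch only (assumed names); covers just what (10) needs.
VOWELS = set("aeıioöuü")
BACK, ROUNDED = set("aıou"), set("oöuü")
HIGH = {(True, True): "u", (True, False): "ı",
        (False, True): "ü", (False, False): "i"}


def realize(lexical):
    """Turn a lexical form such as 'ev-lEr-(s)I' into a surface form."""
    # A doubled plural is rewritten as a single one: -lEr-lErI -> -lErI.
    form = lexical.replace("-lEr-lErI", "-lErI")
    out = ""
    i = 0
    while i < len(form):
        ch = form[i]
        if ch == "-":
            pass  # morpheme boundaries are not realized
        elif ch == "(":
            if out[-1] in VOWELS:  # buffer consonant only after a vowel
                out += form[i + 1]
            i += 2  # together with the increment below this skips "(x)"
        elif ch in "EI":
            prev = [c for c in out if c in VOWELS][-1]
            if ch == "E":
                out += "a" if prev in BACK else "e"
            else:
                out += HIGH[(prev in BACK, prev in ROUNDED)]
        else:
            out += ch
        i += 1
    return out


analyses = ["ev-lEr-(y)I", "ev-lEr-lErI", "ev-lErI", "ev-lEr-(s)I"]
print({a: realize(a) for a in analyses})
```

The same function applied to a back-vowel stem, e.g. `realize("araba-lEr-(y)I")`, shows the harmony working in the other direction.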

It is worth noting, just to make things even more confusing, that after the third person possessive suffixes a so-called pronominal n is added when a case suffix follows.

(11) evi (his/her house, also the house)


but:

(12) evinde ('in his/her house' in our case, but also identical with 'in your house')

Confusing? Typically, ambiguities are resolved by looking at the context in which the ambiguous word occurs: ambiguous forms are usually used with the genitive of the personal pronouns to avoid confusion. In this case the noun itself reverts to accusative case.

(13) evleri (their house) (house, 3pPl.Poss)       onların evi (their house, the house of theirs) (they, Gen.; house, Acc.)
     evleri (their houses) (houses, 3pPl.Poss)     onların evleri (their houses, the houses of theirs) (they, Gen.; houses, Acc.)
     evleri (his houses) (houses, 3pSg.Poss)       onun evleri (his houses, the houses of his) (he, Gen.; houses, Acc.)

For the purpose of this project, however, I won't be concerned with morphological disambiguation, as this task should be performed at a later stage, after examining the already analyzed context. There are further distinctions in the uses of the possessives in Turkish, but again, this topic is beyond the scope of my work. As one might imagine, for a single entry in the lexicon, that is, for a single noun stem, there are plenty of possible inflections: 2 for number, times 7 for possession (the six possessive suffixes plus the possession-free form), times 6 for case (or even 7 if the instrumental case is included), which gives 84 basic forms from inflection only (even though some of them might be identical), and things get further complicated.
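The count of basic forms can be verified by enumerating the feature combinations. The Python sketch below uses label names chosen for illustration only; the instrumental is included as an optional seventh case, as discussed above.

```python
from itertools import product

# Feature labels are illustrative; only their number matters for the count.
NUMBER = ["Sg", "Pl"]
POSSESSION = ["none", "1pSg", "2pSg", "3pSg", "1pPl", "2pPl", "3pPl"]
CASE = ["Abs", "DefObj", "Gen", "Dat", "Loc", "Abl"]

basic_forms = list(product(NUMBER, POSSESSION, CASE))
with_instrumental = list(product(NUMBER, POSSESSION, CASE + ["Ins"]))

print(len(basic_forms), len(with_instrumental))
```

With six cases the enumeration gives 2 x 7 x 6 = 84 combinations, and 2 x 7 x 7 = 98 if the instrumental is counted as a case.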

3.1.4 Lexical Exceptions: the su Case


There is only one pure lexical exception to the paradigm: the noun su (water). There is, however, a large number of derived noun roots that end in su, for example akarsu (river, literally 'running water'). For this reason, it deserves special treatment. The exception manifests itself as su taking the -yun suffix for the genitive (instead of the standard -nun suffix); in the possessive forms, too, there is always a y preceding the possessive suffix: suyum (my water) instead of *sum, suyu (his/her water) instead of *susu. In general, the y is inserted whenever a suffix starting in a vowel or a dropping consonant is attached to the word.
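One way to picture the su exception is to give the stem an abstract final segment that surfaces as y before a suffix with a vowel or an optional segment at its onset, and as nothing otherwise. The Python sketch below assumes the symbol Y for that segment and a hypothetical function `realize_su`; it is an illustration of the principle, not the thesis's rule.

```python
# Illustrative sketch only; Y is an assumed abstract symbol for the su class.
VOWELS = set("aeıioöuü")
BACK, ROUNDED = set("aıou"), set("oöuü")
VOICELESS = set("çfhkpsşt")
HIGH = {(True, True): "u", (True, False): "ı",
        (False, True): "ü", (False, False): "i"}


def is_vowel(ch):
    return ch in VOWELS or ch in ("E", "I")


def realize_su(lexical):
    """Realize forms like 'suY-(n)In' or regular ones like 'araba-(n)In'."""
    stem, _, suffix = lexical.partition("-")
    if stem.endswith("Y"):
        # y surfaces before a vowel-initial suffix or a dropping segment.
        weak_onset = suffix[:1] in ("(", "E", "I")
        stem = stem[:-1] + ("y" if weak_onset else "")
    # Pass 1: an optional segment is kept only if it differs in type
    # (vowel vs. consonant) from the segment before it.
    form = stem
    i = 0
    while i < len(suffix):
        if suffix[i] == "(":
            seg = suffix[i + 1]
            if is_vowel(seg) != is_vowel(form[-1]):
                form += seg
            i += 3
        elif suffix[i] == "-":
            i += 1
        else:
            form += suffix[i]
            i += 1
    # Pass 2: d/t alternation and vowel harmony, left to right.
    out = ""
    for ch in form:
        if ch == "D":
            ch = "t" if out[-1] in VOICELESS else "d"
        elif ch in ("E", "I"):
            prev = [c for c in out if c in VOWELS][-1]
            if ch == "E":
                ch = "a" if prev in BACK else "e"
            else:
                ch = HIGH[(prev in BACK, prev in ROUNDED)]
        out += ch
    return out


for form in ["suY-(n)In", "suY-(I)m", "suY-(s)I", "suY-lEr"]:
    print(form, realize_su(form))
```

Because the inserted y is a consonant, the optional n and s are then dropped by the ordinary condition, which is why no separate -yun suffix has to be listed.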

3.2 Phonological Alternation Rules


In the following subsections I will outline theoretically the basics of the phonological alternation rules in Turkish with respect to the task at hand.

3.2.1 Resolving Vowel Harmony


The vowel harmony principles as described in Section 2.1 are rather simple to implement. I will present how the basics work and then address some of the exceptional cases. I split the two harmony classes into two rules, one for e-type harmony and one for i-type harmony. The e-type harmony rule checks the value of the backness feature of the last preceding vowel: if it is a back vowel, the underlying E is realized as a; if it is a front vowel, it is realized as e. Since the system does not provide us with feature specifications of phonemes, I had to define the classes of vowels as sets:

(14) define BackV [a | ı | o | u];
     define FrontV [e | i | ö | ü];
     define LowV [a | e | o | ö];
     define HighV [ı | i | u | ü];
     define UnroundedV [a | ı | e | i];
     define RoundedV [o | ö | u | ü];

The intersection (&) of those sets provides us with the sub-classes of vowels having combined features. So the set of back unrounded vowels will be derived as:

(15) [ BackV & UnroundedV ]

which results in:

(16) [ a | ı ]
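The same derivation can be mirrored with ordinary set intersection. The Python sketch below only restates the classes in (14) as sets; the helper `with_features` is a hypothetical name introduced for illustration.

```python
# The vowel classes of (14) as Python sets (illustrative restatement).
BACK_V = set("aıou")
FRONT_V = set("eiöü")
LOW_V = set("aeoö")
HIGH_V = set("ıiuü")
UNROUNDED_V = set("aıei")
ROUNDED_V = set("oöuü")


def with_features(*classes):
    """Intersect vowel classes, in the manner of the & operator in (15)."""
    result = set(classes[0])
    for cls in classes[1:]:
        result &= cls
    return result


print(sorted(with_features(BACK_V, UNROUNDED_V)))
print(sorted(with_features(HIGH_V, BACK_V, ROUNDED_V)))
```

Intersecting three classes, one of them the high vowels, always leaves exactly one vowel, which is what makes the feature-based statement of the i-type harmony possible.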
This is essential for defining the i-type harmony, as it is based on two features rather than one, namely backness and roundedness. So, if the last preceding vowel is back and unrounded, the underlying I is realized as ı (the high, back and unrounded vowel, so to say, obtained by intersecting the set of high vowels with the sets describing the features of the last preceding vowel). The same holds for the other realizations of the underlying I:

(17) I -> [HighV & BackV & RoundedV] || [BackV & RoundedV] Consonant _    [1]

This should be read as: I is realized as the high back rounded vowel (u) in the context of a back rounded last preceding vowel (o or u). The other rules are identical:

(18) I -> [HighV & BackV & UnroundedV] || [BackV & UnroundedV] Consonant _
     I -> [HighV & FrontV & RoundedV] || [FrontV & RoundedV] Consonant _
     I -> [HighV & FrontV & UnroundedV] || [FrontV & UnroundedV] Consonant _

This is only necessary to state clearly the principles governing vowel harmony. One might as well simply write the rules as I -> i || [ i | e ] Consonant _ , but that won't have much descriptive linguistic value. In my solution the rules operate in parallel locally, that is, the e-type rules operate together among themselves and so do the i-type rules, but the e-type harmony still has precedence over the i-type. The reason, apart from backness harmony being the more general principle with broader coverage, is that the abstract symbols have to be resolved in a left-to-right fashion, and e-type suffixes at the current stage precede i-type suffixes. We need the exact properties of the last preceding vowel in order to resolve the next variable vowel in the following suffix (or even in the same suffix). In this sense, I might need to combine the e-type and i-type rules into one single rule operating in parallel as the system gets more sophisticated[2].

A few words about the exceptions to vowel harmony: we will be concerned with roots whose last vowel does not have predictive power over the harmonic features of the suffixes attached to it. Schaaik (Schaaik, 1996) refers to words which induce such exceptions as disharmonic roots. The same term, however, is used in some sources for roots that do not conform to the principles of vowel harmony internally, such as the exceptional cases already mentioned in Section 2.1: anne (mother), çamur (mud), etc. Although the two groups often overlap, it can't be stated that this is always the case. The exceptions we will be dealing with are mostly of foreign origin: alkol (alcohol), rol (role), saat (clock), etc. are realized as alkolü (alcohol, Acc.), rolü (role, Acc.), saati (clock, Acc.) and alkoller (alcohol, Pl.), roller (role, Pl.), saatler (clock, Pl.), instead of *alkolu, *rolu, *saatı and *alkollar, *rollar, *saatlar respectively, in their accusative and plural forms.

[1] Consonant is also a defined class featuring all the consonants.
[2] A small issue occurred when I accidentally switched the order of the rules: words having a round vowel in their last syllable (like katalog (catalogue)) were resolved in an unusual way, *kataloglarunuz, whereas the correct form would be kataloglarınız (your catalogues). This was due to resolving the -InIz (2pPl possessive suffix) as -unuz in concordance with the last (resolved) preceding vowel o (the E in the plural suffix -lEr was still pending resolution). This is important, because if an e-type suffix is added, all the following suffixes feature unrounded vowels (unless a suffix with an invariable rounded vowel is added).
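The effect of the rule order can be reproduced with a small Python simulation. It is a sketch under assumptions, not the xfst rules themselves: the functions `e_rule` and `i_rule` are hypothetical, each one replaces all of its archiphonemes in one parallel step, and the left context is read from the input of that step, so that unresolved E and I are skipped when looking for the last preceding vowel.

```python
# Illustrative simulation of rule ordering (assumed names).
VOWELS = set("aeıioöuü")
BACK, ROUNDED = set("aıou"), set("oöuü")
HIGH = {(True, True): "u", (True, False): "ı",
        (False, True): "ü", (False, False): "i"}


def last_concrete_vowel(prefix):
    """Last already-resolved vowel; unresolved E and I do not count."""
    for ch in reversed(prefix):
        if ch in VOWELS:
            return ch
    return None


def e_rule(form):
    """Resolve every E in parallel against the input of this step."""
    out = []
    for i, ch in enumerate(form):
        if ch == "E":
            ch = "a" if last_concrete_vowel(form[:i]) in BACK else "e"
        out.append(ch)
    return "".join(out)


def i_rule(form):
    """Resolve every I in parallel against the input of this step."""
    out = []
    for i, ch in enumerate(form):
        if ch == "I":
            prev = last_concrete_vowel(form[:i])
            ch = HIGH[(prev in BACK, prev in ROUNDED)]
        out.append(ch)
    return "".join(out)


lexical = "kataloglErInIz"
print(i_rule(e_rule(lexical)))  # e-type first
print(e_rule(i_rule(lexical)))  # i-type first
```

With the e-type rule applied first, the I segments see the unrounded a of the plural suffix; with the order reversed they see the rounded o of the stem, which gives the unwanted form mentioned in the footnote.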

3.2.2 Consonant Alternation Rules


As mentioned in Section 2.1, word-final consonants undergo particular alternations depending on the environment. For the purpose of this project, I used archiphonemic descriptions for the alternating segments. Most of the related works pick the capital letter of the voiced phoneme (B for b and p, D for d and t, etc.) or a capital for the geminating phoneme (S for ss and s, etc.). I will stick to this standard notation to avoid unnecessary confusion. The abstraction is necessary to model the exceptions to these alternation rules. We will pay some attention to the exceptions at the end of the section. My approach to this issue is partially based on the paper by Sharon Inkelas and Orhan Orgun (Inkelas, 1997), in which lexical exceptions are treated in terms of Optimality Theory. In brief: the alternating word-final consonant in regular roots that undergo the alternations will be left unspecified in the lexicon, using a special symbol, and the exceptional cases will be underspecified with their non-alternating surface realizations, so that they won't trigger the alternation rules.

3.2.2.1 Final Consonant (De)Voicing


The final consonant voicing occurs when a suffix starting in a vowel or a dropping consonant is attached to the stem. It covers the voiceless plosives p, ç, and t, which transform into their voiced counterparts. Additionally, what is often classified separately as a K/0[1] alternation (namely because of the subclass of velar consonants k, g and ğ that exhibit similar behavior) falls into this category as well. (19) below gives a basic idea of which alternations occur, where they occur, and what the archiphonemic symbols stand for:

(19) B -> b || _ Vowel, otherwise B -> p
     D -> d || _ Vowel, otherwise D -> t
     C -> c || _ Vowel, otherwise C -> ç
     K -> ğ || _ Vowel, otherwise K -> k
     G -> ğ || _ Vowel, otherwise G -> g
     Q -> g || _ Vowel, otherwise Q -> k

So far this seems fine as far as alternations in the stems are concerned. But similar alternations occur in suffixes as well. They depend on the preceding phoneme and assimilate the value of its voicing feature. So we have[2]:

(20) B -> p || VoiceLessCons _ , otherwise B -> b
     D -> t || VoiceLessCons _ , otherwise D -> d
     C -> ç || VoiceLessCons _ , otherwise C -> c

[1] K/0 because the counterpart of k intervocalically is the so-called yumuşak ge (soft g), which is phonologically realized as lengthening of the preceding vowel.
[2] For the purpose of this project only the d/t alternation will actually be used, as it is the only one occurring in the inflectional suffixes of nouns.

An example for both phenomena, where several rules apply, is the inflection of kitap (book) in Table 3.4:

Surface Form    Lexical Form      Alternation Rules           Gloss
kitap           kitaB             B->p                        book, Sg, Nom.
kitaplar        kitaB-lEr         B->p, E->a                  book, Pl, Nom.
kitabım         kitaB-(I)m        B->b, I->ı                  book, Sg, 1pSg Poss., Nom.
kitapta         kitaB-DE          B->p, D->t, E->a            book, Sg, Loc.
kitabımda       kitaB-(I)m-DE     B->b, I->ı, D->d, E->a      book, Sg, 1pSg Poss., Loc.

Table 3.4: Summary of the application of the phonological alternation rules.
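The interplay of the rules in Table 3.4 can be traced with a compact Python sketch. It is a simplified illustration with hypothetical names (`realize`, `VOICED`, `UNVOICED`), covering only B, C and D: a symbol directly after a morpheme boundary is treated as a suffix onset (progressive assimilation), any other occurrence as a stem-final segment (voiced only before a vowel).

```python
# Illustrative sketch only (assumed names); velars and gemination are omitted.
VOWELS = set("aeıioöuü")
ALL_VOWELS = VOWELS | {"E", "I"}
BACK, ROUNDED = set("aıou"), set("oöuü")
VOICELESS = set("çfhkpsşt")
HIGH = {(True, True): "u", (True, False): "ı",
        (False, True): "ü", (False, False): "i"}
VOICED = {"B": "b", "C": "c", "D": "d"}
UNVOICED = {"B": "p", "C": "ç", "D": "t"}


def realize(lexical):
    """Realize a lexical form such as 'kitaB-(I)m-DE'."""
    # Pass 1: keep an optional segment only if it differs in type
    # (vowel vs. consonant) from the segment before it.
    form = ""
    i = 0
    while i < len(lexical):
        if lexical[i] == "(":
            seg = lexical[i + 1]
            prev = form.rstrip("-")[-1]
            if (seg in ALL_VOWELS) != (prev in ALL_VOWELS):
                form += seg
            i += 3
        else:
            form += lexical[i]
            i += 1
    # Pass 2: consonant alternations and harmony, left to right.
    out = ""
    for j, ch in enumerate(form):
        if ch == "-":
            continue
        if ch in VOICED:
            if j > 0 and form[j - 1] == "-":
                # suffix onset: assimilate to the preceding segment
                ch = UNVOICED[ch] if out[-1] in VOICELESS else VOICED[ch]
            else:
                # stem-final: voiced only if a vowel follows
                nxt = form[j + 1:].lstrip("-")[:1]
                ch = VOICED[ch] if nxt in ALL_VOWELS else UNVOICED[ch]
        elif ch in ("E", "I"):
            prev = [c for c in out if c in VOWELS][-1]
            if ch == "E":
                ch = "a" if prev in BACK else "e"
            else:
                ch = HIGH[(prev in BACK, prev in ROUNDED)]
        out += ch
    return out


for lf in ["kitaB", "kitaB-lEr", "kitaB-(I)m", "kitaB-DE", "kitaB-(I)m-DE"]:
    print(lf, realize(lf))
```

The sketch reads the stem-final context to the right and the suffix-onset context to the left, which corresponds to the regressive and progressive assimilation distinguished in this section.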

The rules in (19) and (20) are oversimplified, of course. In the actual implementation they feature a wider context, including morpheme boundaries, to make the distinctions clearer. In linguistic terms we have regressive assimilation in stems and progressive assimilation in suffixes. The exceptions to these rules include primarily monosyllabic words that preserve the quality of their final consonant. There are, however, monosyllabic words that do undergo the alternation rules, just as there are polysyllabic words that do not. Such exceptions will be underspecified in the lexicon with their unchanging consonant.

3.2.2.2 (De)Gemination
Apart from the final stop voicing/devoicing, which is the most productive type of consonant alternation, a few other types of alternations are worth mentioning. The final consonant (de)gemination occurs only in a small number of Arabic loan words. The nature of this phenomenon is similar to that of the final consonant (de)voicing: a word-final segment gets doubled if a suffix starting in a vowel (or a dropping consonant) is attached to the word:

(21) his (feeling)    hissi (feeling, Acc., 'the feeling')    hisler (feelings)
     hat (line)       hattı (line, Acc., 'the line')          hatlar (lines)

Again, we will have to employ special symbols that are realized differently on the surface depending on the context, as proposed by Schaaik (Schaaik, 1996)[1]. He proceeds even further, investigating the dependence of these alternations on the re-syllabification processes occurring with the different suffixes. I will not go into detail, however, as my project is not intended to feature a syllabification module at its current stage of development.
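A special symbol of this kind can be pictured as follows. The Python sketch assumes the symbols S and T for geminating s and t and a hypothetical function `realize_geminate`; its inputs are lexical forms in which the optional suffix segments have already been resolved (e.g. hiS-I rather than hiS-(y)I).

```python
# Illustrative sketch only; S and T are assumed symbols for geminating s and t.
VOWELS = set("aeıioöuü")
ALL_VOWELS = VOWELS | {"E", "I"}
BACK, ROUNDED = set("aıou"), set("oöuü")
HIGH = {(True, True): "u", (True, False): "ı",
        (False, True): "ü", (False, False): "i"}
GEMINATE = {"S": "s", "T": "t"}


def realize_geminate(form):
    """Double S/T before a vowel-initial suffix; keep them single before a
    consonant-initial suffix or at the end of the word."""
    out = ""
    for j, ch in enumerate(form):
        if ch == "-":
            continue
        if ch in GEMINATE:
            nxt = form[j + 1:].lstrip("-")[:1]
            ch = GEMINATE[ch] * (2 if nxt in ALL_VOWELS else 1)
        elif ch in ("E", "I"):
            prev = [c for c in out if c in VOWELS][-1]
            if ch == "E":
                ch = "a" if prev in BACK else "e"
            else:
                ch = HIGH[(prev in BACK, prev in ROUNDED)]
        out += ch
    return out


for lf in ["hiS", "hiS-I", "hiS-lEr", "haT-I", "haT-lEr"]:
    print(lf, realize_geminate(lf))
```

The alternative mentioned in the footnote, listing the double consonant in the lexicon and removing one segment where needed, would invert the default of this sketch but use the same context.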

3.2.3 Other Alternations


Two other alternations are worth mentioning for the sake of completeness. One of them involves vowel insertion/deletion and the other describes the status of the glottal stop in Turkish. The first one is rather common, whereas the second operates on a limited domain of Arabic loan words. Both of them show some ambiguities.

[1] This issue could be approached differently, by underspecifying the geminating stems with their double consonants in the lexicon and then removing the additional segment if necessary.

3.2.3.1 Vowel Insertion/Deletion


Some stems in Turkish exhibit an interesting property of forming stem final consonant clusters via vowel epenthesis:

(22) burun (nose)        burnu (nose, Acc., 'the nose')
     fikir (idea)        fikri (idea, Acc., 'the idea')
     şehir (city)        şehri (city, Acc., 'the city')[1]
     ömür (life)         ömrü (life, Acc., 'the life')
     alın (forehead)     alnı (forehead, Acc., 'the forehead')

Again, this phenomenon occurs whenever a suffix starting in a vowel is attached to the stem (it seems that all the stem-internal alternations in Turkish are conditioned on the same context). The epenthesized vowel is always a high vowel, but its other features cannot be determined automatically, so it has to be hard-coded. Such stems will be indicated in the lexicon with a meta character preceding the vowel which is to be deleted. As for the quality of the consonant clusters that are formed after the epenthesis occurs, there have been several attempts to define the possible consonant sequences in such cases, but this is far beyond the scope of this paper.
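The marking of dropping vowels can be illustrated with a short Python sketch. It assumes the dollar sign as the meta character and a hypothetical function `realize_drop`; inputs are lexical forms with the optional suffix segments already resolved. Harmony is applied before the deletion, so that the suffix can still harmonize with the vowel that is about to be dropped.

```python
# Illustrative sketch only; "$" is assumed here as the meta character.
VOWELS = set("aeıioöuü")
BACK, ROUNDED = set("aıou"), set("oöuü")
HIGH = {(True, True): "u", (True, False): "ı",
        (False, True): "ü", (False, False): "i"}


def realize_drop(form):
    """Realize forms like 'bur$un-I': the vowel after '$' is deleted when a
    vowel-initial suffix follows and kept otherwise."""
    # Step 1: vowel harmony, while the dropping vowel is still present.
    harmonized = ""
    for ch in form:
        if ch in ("E", "I"):
            prev = [c for c in harmonized if c in VOWELS][-1]
            if ch == "E":
                ch = "a" if prev in BACK else "e"
            else:
                ch = HIGH[(prev in BACK, prev in ROUNDED)]
        harmonized += ch
    # Step 2: delete or keep the marked vowel, then remove the markers.
    stem, _, suffix = harmonized.partition("-")
    if "$" in stem:
        k = stem.index("$")
        if suffix[:1] in VOWELS:
            stem = stem[:k] + stem[k + 2:]  # drop "$" and the vowel
        else:
            stem = stem[:k] + stem[k + 1:]  # drop only "$"
    return stem + suffix.replace("-", "")


for lf in ["bur$un", "bur$un-I", "bur$un-lEr", "öm$ür-I"]:
    print(lf, realize_drop(lf))
```

Unmarked stems pass through unchanged, so a homograph without the marker keeps its vowel in all forms.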

3.2.3.2 The Glottal Stop


This, along with the gemination rule, is probably the least productive rule in Turkish. Both concern only a limited number of Arabic loan words. The nature of the glottal stop is not quite clear to me; however, I attempted an approach based on Schaaik's (Schaaik, 1996) description and the Turkish Lexical Database Project (TLDP). Schaaik (Schaaik, 1996) describes two types of glottal stop:

(23) Type 1: ^ -> 0 / ^ (0 if a consonant follows and ^ if a vowel follows)
     cami^ (mosque) -> camiler (mosques)
     cami^ (mosque) -> cami^i (the mosque / his/her mosque)[2]

     Type 2: ' -> i / ' (i if a consonant follows and ' if a vowel follows)
     nev' (sort) -> neviler (sorts)
     nev' (sort) -> nev'i (the sort / his/her sort)
     (Schaaik, 1996, p. 114)

Both are supposed to act as consonants if a vowel follows. In modern Turkish, however, the glottal stop is mostly omitted both in speech and in writing. It is preserved only when ambiguities occur: telin (of the wire / your wire) and tel'in (denunciation). Apparently, the glottal stop is not featured in TLDP either. Both cases are accepted there: camii and camisi both denote the 3pSg possessive form (his/her mosque), and likewise camii and camiyi both denote the accusative case (the mosque). For the second type, though, only yeisi (the despair / his/her despair) and neviyi (the sort) / nevisi (his/her sort) are recognized. So the first type allows for both realizations, whereas the second type behaves more or less as if it wasn't there at all. In my solution, I tried to approach the issue as in the TLDP. There are some mismatches, though, and even though it is more likely that the mistake is overgeneration on my side, it is also possible that the TLDP analyzer has some flaws. The examples I am concerned with are:

[1] In modern Turkish, the tendency is to retain the i in şehir (city): şehiri (the city).
[2] The Type 1 glottal marker ^ does not manifest itself orthographically.

(24) camim (mosque, 1pSg Poss., 'my mosque')

vs.

(25) camiim (mosque, 1pSg Poss., 'my mosque')

By analogy with camii and camisi (his/her mosque), they should both denote the same thing, but the TLDP analyzer provides different solutions, of which only the first one (camim) seems to be proper. I have to investigate the issue further. For now, in my project both will stand for 'my mosque'.
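The double realizations can be generated by treating the Type 1 marker as optionally present before a suffix with an optional onset segment. The Python sketch below is an illustration under that assumption, with hypothetical functions `attach` and `glottal_variants`; when the marker is kept it counts as a consonant, and it is never written on the surface.

```python
# Illustrative sketch only (assumed names); covers the Type 1 marker "^".
VOWELS = set("aeıioöuü")
BACK, ROUNDED = set("aıou"), set("oöuü")
HIGH = {(True, True): "u", (True, False): "ı",
        (False, True): "ü", (False, False): "i"}


def attach(stem, suffix):
    """Attach a suffix, resolving optional segments and vowel harmony."""
    out = stem
    i = 0
    while i < len(suffix):
        ch = suffix[i]
        if ch == "(":
            ch = suffix[i + 1]
            i += 3
            # An optional segment is dropped when it has the same type
            # (vowel vs. consonant) as the segment before it.
            if (ch in "EI") == (out[-1] in VOWELS):
                continue
        else:
            i += 1
        if ch in "EI":
            prev = [c for c in out if c in VOWELS][-1]
            if ch == "E":
                ch = "a" if prev in BACK else "e"
            else:
                ch = HIGH[(prev in BACK, prev in ROUNDED)]
        out += ch
    return out


def glottal_variants(stem, suffix):
    """All surface forms for a stem like 'cami^' with the given suffix."""
    candidates = [stem.replace("^", "")]  # glottal stop deleted
    if suffix.startswith("("):
        candidates.append(stem)           # glottal stop kept as a consonant
    return sorted({attach(c, suffix).replace("^", "") for c in candidates})


for suffix in ["(s)I", "(y)I", "(I)m", "lEr"]:
    print(suffix, glottal_variants("cami^", suffix))
```

Under this treatment camim and camiim come out as two realizations of the same analysis, which matches the decision taken above for the time being.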

3.3 Implementation
The model comprises two components: the lexicon, defined in lexc, describing the morphotactics of Turkish (technically it is implemented as an FSA, but it does include some transductions for the tags), and a set of rules that describe the morphophonological alternations that occur on the surface (implemented, naturally, as a set of FSTs in xfst, using the formalism of replacement rules).

3.3.1 The Lexicon


The lexicon network implemented in lexc describes the morphotactics of the Turkish nominal inflection. First of all, there is a multicharacter symbols definition (26), where a set of symbol sequences that should be treated as atomic symbols is defined:

(26) Multichar_Symbols +Noun +Poss +Case +1p +2p +3p +Sg +Pl +DefObj +Gen
     +Dat +Loc +Abl +Abs

These are primarily used to define the tags (case marking, possession, number, etc.). Further on, the lexicon contains a sub-lexicon of the noun stems. It is the simplest but most important part: it contains the noun stems in their lexical (underlying) form, which could be extracted automatically from a dictionary. This form includes all the special symbols that denote alternating segments and trigger the alternation rules. At the next stage (the standard continuation class for all nouns) a tag +Noun is attached on the upper side, that is, it is visible only if morphological analysis (or lookup) is performed (the same holds for all the other tags). On the lower (surface) side it is realized as an epsilon. The continuation class from there is the number lexicon: number suffixes are attached on the lower (surface) side and the tags +Sg and +Pl are attached on the upper (lexical) side (the dash stands for a morpheme boundary):

(27) LEXICON Number
     +Sg:0        Possessive;
     +Pl:-lEr     Possessive;

A possessive sub-lexicon follows which defines the inflection for possession as described in Section 3.1.3 with the appropriate tags. There is an intermediate lexicon however, that specifies the optionality of the possessive suffixes:

(28) LEXICON Possessive
     +Poss:0      PSuff;
     +Case:0      CSuff;

That is, either take a possessive tag +Poss and go to the lexicon of possessive suffixes, or take a +Case tag and go to the lexicon of case suffixes. So the actual sub-lexicon for the possessive suffixes is called PSuff:

(29) LEXICON PSuff
     +1p+Sg:-*Im       Case;    ! "my"
     +2p+Sg:-*In       Case;    ! "your"
     +3p+Sg:-*sIN      Case;    ! "his/her/its"
     +1p+Pl:-*ImIz     Case;    ! "our"
     +2p+Pl:-*InIz     Case;    ! "your"
     +3p+Pl:-lErIN     Case;    ! "their"

After taking a possessive suffix there is again an intermediate stage that has to be passed: the possessive forms still have to take a +Case tag. In the morphological analysis module of the Turkish WordNet the possessive markup is obligatory. It is referred to as possessive agreement there, and if there is none, the tag is +Pnon. I don't find it necessary for now, but of course it won't be any problem to tune my system so that it features the same type of mark-up. Two more points to make clear: the optional segments, which were marked with brackets in the theoretical part, are prefixed with an optionality marker (*); the pronominal n is denoted by the capital N. Oflazer (Oflazer, 1995) defines it as a part of the case suffixes. In my case it is an optional segment that surfaces only if there is a suffix following the third person singular and plural possessive forms. In his case, there are two copies of each case suffix: one that follows the third person possessive forms and one for all the other possessive and non-possessive forms. To me it seems more intuitive to have it as a part of the possessive, as it is indeed a pronominal n, and I don't see much sense in having two instances of every case inflection.

(30) LEXICON CSuff
     +DefObj:-*yI     #;    ! Definite Objective Case (Accusative)
     +Gen:-*nIn       #;    ! Genitive Case - possessive, "of"
     +Dat:-*yE        #;    ! Dative Case - (indirect object) "to", "for"
     +Loc:-DE         #;    ! Locative Case - "in", "on", "at"
     +Abl:-DEn        #;    ! Ablative Case - "from", "out of", "through"
     +Abs:0           #;    ! Absolute (dictionary) form (Nominative)

The last component of our lexicon is the case inflection sub-lexicon. It is obligatory, as all uninflected nouns are in their absolute form (Nominative case). The hash symbol (#) is an anchor symbol denoting word boundary (in replacement rules it is circumfixed by dots: .#.). To summarize, a visual map of the lexicon network is presented in Figure 3.2 below:

[Figure: lexicon network with the states 0.Root, 1.Noun, 2.NN, 3.Number, 4.Possessive, 5.PSuff, 6.Case, 7.CSuff and 8.#. Arc labels: /Noun Stems/; +Noun:0; +Sg:0, +Pl:-lEr; +Poss:0; +Case:0; the six possessive entries +1p+Sg:-*Im, +2p+Sg:-*In, +3p+Sg:-*sIN, +1p+Pl:-*ImIz, +2p+Pl:-*InIz, +3p+Pl:-lErIN; and the six case entries +DefObj:-*yI, +Gen:-*nIn, +Dat:-*yE, +Loc:-DE, +Abl:-DEn, +Abs:0.]

Figure 3.2: Schematic visualization model of the lexicon network
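The continuation-class logic of the lexicon can be mimicked outside lexc to see which upper/lower pairs it licenses. The Python sketch below is only an approximation under assumptions: the dictionary `LEXICON` restates listings (27) to (30), the content of the intermediate Case lexicon (+Case:0 leading to CSuff) is inferred from Figure 3.2, and the generator `paths` is a hypothetical helper.

```python
# Illustrative sketch only: each entry is (upper-side tag, lower-side string,
# continuation class); "#" marks the end of the word.
LEXICON = {
    "Number": [("+Sg", "", "Possessive"), ("+Pl", "-lEr", "Possessive")],
    "Possessive": [("+Poss", "", "PSuff"), ("+Case", "", "CSuff")],
    "PSuff": [("+1p+Sg", "-*Im", "Case"), ("+2p+Sg", "-*In", "Case"),
              ("+3p+Sg", "-*sIN", "Case"), ("+1p+Pl", "-*ImIz", "Case"),
              ("+2p+Pl", "-*InIz", "Case"), ("+3p+Pl", "-lErIN", "Case")],
    "Case": [("+Case", "", "CSuff")],  # assumed from Figure 3.2
    "CSuff": [("+DefObj", "-*yI", "#"), ("+Gen", "-*nIn", "#"),
              ("+Dat", "-*yE", "#"), ("+Loc", "-DE", "#"),
              ("+Abl", "-DEn", "#"), ("+Abs", "", "#")],
}


def paths(state, upper, lower):
    """Yield every (upper, lower) pair reachable from the given state."""
    if state == "#":
        yield upper, lower
        return
    for tag, surface, nxt in LEXICON[state]:
        yield from paths(nxt, upper + tag, lower + surface)


# The stem entry and the +Noun tag are supplied by hand for one noun.
pairs = list(paths("Number", "ev+Noun", "ev"))
print(len(pairs))
print(pairs[0])
```

For a single stem the enumeration yields 84 pairs, in agreement with the count of basic forms given in Section 3.1.3; the lower sides are still lexical strings that the rules component has to resolve.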

3.3.2 The Rules Component


The rules component of the system is implemented as a sequence of composed transducers in xfst using the formalism of replacement rules. It currently features 17 rules, of which 12 are significant and 5 are just for cleaning up the markers[1]. The rules are composed in a particular sequence, as some of them depend on each other. Full independence is hardly achievable. In the case of a dropping vowel in the stem, for instance, the vowel harmony rules have to apply before the vowel is deleted, since the suffixes have to harmonize with this vowel. This is especially true for monosyllabic roots that lose their one and only vowel. The rules are split (for now) into several groups addressing the different types of phenomena that they describe. A few classes needed to be defined in order to make the rules operational. I initially defined a class for the vowels and one for the consonants, where the consonant class had to be extended to feature all the archiphonemic descriptions used. As already mentioned, the vowels are further divided into subclasses according to their features for the vowel harmony resolution. Further on, for the rule of progressive assimilation in suffixes, I had to define a class of voiceless consonants.

[1] I prefer to keep them apart in the development stage, as it often happens that I need to preserve some markers in order to see what exactly has gone wrong in case of an error.

3.3.2.1 Vowel Harmony Rules


So far, the rules for e-type and i-type vowel harmony are split into two separate rules (which operate in parallel among themselves), where the e-type precedes the i-type harmony resolution, but they might need to be merged into a single rule operating in parallel on all the harmonizing segments. As mentioned above in Section 3.2.1, an underlying E is realized as e on the surface when the last preceding vowel is a front vowel and as a when the last preceding vowel is a back vowel. This defines the e-type harmony rule:

(31) E -> e || FrontV ~$[Vowel] _ ,,
     E -> a || BackV ~$[Vowel] _

The dollar sign ($) has a special meaning in xfst: 'contains'. The tilde (~), on the other hand, is the complementation operator, i.e. negation (in this case: negation of the language that contains vowels). In simple words, the left context should be read as: there is a front vowel on the left, and between it and the symbol to be resolved (E) there are no other vowels. The same holds for the second line, only that it concerns back vowels in the left context. One more thing to mention: the double commas (,,) in xfst replacement rules stand for parallel operation (as opposed to the composition operator (.o.), which stands for sequential operation). In other words, this is a two-level rule. The same goes for the i-type harmony rule, only that it considers two features (backness and roundedness) of the last preceding vowel.

(32) I -> i || [FrontV & UnroundedV] ~$[Vowel] _ ,,
     I -> ü || [FrontV & RoundedV] ~$[Vowel] _ ,,
     I -> ı || [BackV & UnroundedV] ~$[Vowel] _ ,,
     I -> u || [BackV & RoundedV] ~$[Vowel] _

As far as vowel disharmony in suffixes is concerned, the stems that induce such disharmony will be marked as such (again, this could be implemented as an automated procedure) by inserting a (dis-)harmony marker after the last vowel of the stem. The disharmony marker itself is nothing more than the vowel that induces the new vowel harmony, prefixed by a harmony marker (H). For example, alkol (alcohol), which transforms into alkolü (instead of *alkolu) (the alcohol) and alkoller (instead of *alkollar) (alcohol, Pl.), will be lexically represented as alkoHöl.

3.3.2.2 Consonant Alternation Rules


The most productive consonant alternation rule, as described in Section 3.2.2.1, is the final stop devoicing rule. Similar rules (both in operation and in conditions) are the K/0 alternation rule and the consonant gemination rules. The suffix onset (de)voicing rule also falls into this category. For these rules, as already mentioned, I had to use abstract symbols denoting the alternating phonemes (just as in the vowel harmony rules). As we've already had an extensive overview of the principles behind these rules, I will not discuss them any further.

(33) Final Consonant Devoicing Rule:
     [ B -> b , C -> c , D -> d || _ %- (%*) Vowel ] .o.
     [ B -> p , C -> ç , D -> t || _ [[%- (%*) Cons] | .#.]]

     Velar Alternation Rule:
     [ G -> ğ , K -> ğ , Q -> g || _ %- (%*) Vowel ] .o.
     [ G -> g , K -> k , Q -> k || _ [[%- (%*) Cons] | .#.]]

     Suffix Onset Devoicing Rule:
     [ C -> ç , D -> t , G -> k || VLCons %- (%*) _ ] .o.
     [ C -> c , D -> d , G -> ğ || ~VLCons %- (%*) _ ]

     Gemination Rule:
     [ S -> s , T -> t || _ [%- Cons | .#. ] ] .o.
     [ S -> [ s s ] , T -> [ t t ] ]

The percent sign (%) is used as an escape character in xfst to literalize characters that otherwise have a special meaning. The anchor marker (.#.) is used to denote word boundaries (the beginning of the string if used on the left and the end of the string if used on the right). The parentheses denote optionality in the regular-expression sense: (%*) in a replacement rule means that there may be a literal * at that position. A few notes on the gemination rule: it is a rather radical approach as far as context is concerned, though it could be improved in case of failure; so far it covers only the cases of geminating s and t. They were chosen at random out of the set of eight geminating consonants in Turkish, just to implement the principle. For the remaining six consonants, special symbols have to be chosen and their transformations need to be inserted in the rule (a purely mechanical operation). There are, however, some special cases where gemination and devoicing occur simultaneously:

(34) muhip (friend)

muhibbi (friend, Acc., the friend)

and the even more complicated case of serhat (border), which is an exception to vowel harmony besides undergoing gemination and voicing: serhaddi (border, Acc., the border). This issue could be fixed using a few minor tricks, and the current system is ready to handle it, but I will leave it for a later stage of development.
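The final stop devoicing rule in (33) can be approximated outside xfst for quick experiments. The following is only a minimal Python sketch, not the xfst implementation: the contents of VOWELS and CONS, the function name final_stop_devoicing and the example forms are assumptions made for illustration, and the composition of the two rules is simulated by two successive regular expression substitutions.

```python
import re

# Assumed symbol classes (hypothetical; the thesis defines its own Vowel and Cons sets).
VOWELS = "aeıioöuüAI"                    # surface vowels plus the abstract vowels A and I
CONS = "bcçdfgğhjklmnprsştvyzBCDGKQST"   # surface consonants plus abstract consonant symbols

VOICED = {"B": "b", "C": "c", "D": "d"}
VOICELESS = {"B": "p", "C": "ç", "D": "t"}


def final_stop_devoicing(form: str) -> str:
    """Resolve the abstract stops B, C, D at the right edge of a morpheme."""
    # First rule: voiced realization before a morpheme boundary '-',
    # an optional '*' marker and a vowel.
    form = re.sub(rf"[BCD](?=-\*?[{VOWELS}])",
                  lambda m: VOICED[m.group()], form)
    # Second rule: voiceless realization before a boundary followed by a
    # consonant, or at the end of the string.
    form = re.sub(rf"[BCD](?=-\*?[{CONS}]|$)",
                  lambda m: VOICELESS[m.group()], form)
    return form


if __name__ == "__main__":
    for lexical in ["kitaB", "kitaB-I", "kitaB-lAr"]:
        print(lexical, "=>", final_stop_devoicing(lexical))
```

In this sketch a suffix-initial abstract consonant (such as D in a hypothetical kitaB-DA) is left untouched, since it is not followed by a boundary; it would be resolved by a separate suffix onset rule.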


3.3.2.3 Fixing the Morphotactics


The next few rules are used to fix the morphotactics: they deal with general phenomena such as vowel/consonant deletion, the pronominal n, the lexical exception su and the elimination of the multiple plural morpheme. The rule for multiple plurals simply takes two adjacent plural morphemes and rewrites them as a single morpheme, nothing unusual. The rule for the pronominal n simply drops the N word-finally (a tricky solution). The rule for the glottal stop is again a tricky solution. As we've seen, the ^ marker is either realized as an underlying consonant or as nothing at all. My approach was to optionally delete it if a vowel follows:

(35) [ %^ (->) [.0.] || _ %- %* ]


This way, both camii and camisi/camiyi will be recognized, as described in Section 3.2.3.1.

Next, the rule for dropping stem vowels has nothing particularly interesting to it. Dropping segments are prefixed by a literal dollar sign ($), so an underlying koy$un (bosom) will be realized as koyun (bosom) and koynu (the bosom), as opposed to koyun (sheep), which is realized as koyunlar (sheep, Pl) and koyunu (the sheep). The dropping segments remain if the attached suffix starts with a consonant on the surface.

The rule for the exceptional class of words ending in su (water) is again pretty simple. The words ending in su are specified in the lexicon as suY (this is partially motivated by the origin of the word: historically it derives from suw). This special symbol is then realized as y in the proper context or as epsilon by default. It took me quite a while to come to this idea, and I was happy to see that others have approached the issue in a similar way (footnote 1).

The most complicated rule, the one that took me the most time to design (and for which it is still open whether it is the best solution), is the rule that manages all the dropping consonants and vowels in suffixes, except the pronominal n. I called it fixing the vowel sequences, as this is more or less what it is supposed to do. In the case suffixes we have y and n insertion if the stem ends in a vowel. In the possessive inflectional paradigm, on the other hand, we have high vowel (I) deletion and s insertion if the stem ends in a vowel. In simple terms, all these phenomena occur to avoid vowel sequences. After quite a bit of thinking I dealt with all these phenomena in a single blow:

(36) [ [? - HighV] -> 0 || Cons %- %* _ ] .o.
     [ HighV -> 0 || Vowel %- %* _ ]

The above composition of two rules does two things:
1. It deletes every segment that is not a high vowel (I) and is marked as optional, if a consonant precedes it across a morpheme boundary.
2. It deletes every high vowel (I) that is marked as optional, if a vowel precedes it across a morpheme boundary.
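The effect of the two rules in (36) can be illustrated with a small simulation. The following Python sketch is only an approximation under stated assumptions: the symbol sets VOWELS, HIGH_VOWELS and CONS, the function name fix_vowel_seq and the lexical strings (such as ev-*sI) are hypothetical and chosen for illustration; the only conventions taken from the text are the morpheme boundary (-) and the optional-segment marker (*) that precede the candidate segment.

```python
import re

# Assumed symbol classes (hypothetical stand-ins for Vowel, HighV and Cons).
VOWELS = "aeıioöuüAI"
HIGH_VOWELS = "ıiuüI"
CONS = "bcçdfgğhjklmnprsştvyz"


def fix_vowel_seq(form: str) -> str:
    """Delete optional suffix-initial segments that are not needed."""
    # Rule 1: after Cons - *, delete a segment that is NOT a high vowel
    # (for example an optional buffer consonant).
    form = re.sub(rf"(?<=[{CONS}]-\*)[^{HIGH_VOWELS}]", "", form)
    # Rule 2: after Vowel - *, delete a high vowel.
    form = re.sub(rf"(?<=[{VOWELS}]-\*)[{HIGH_VOWELS}]", "", form)
    return form


if __name__ == "__main__":
    for lexical in ["ev-*sI", "araba-*sI", "ev-*Im", "araba-*Im"]:
        print(lexical, "=>", fix_vowel_seq(lexical))
```

The markers themselves are left in place here, in line with keeping the clean-up steps separate from the rules.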
1 See (Schaaik, 1996).


The remaining rules clean up the marker leftovers. The clean up procedures can be incorporated in the rules themselves, but during the development stage, I prefer to keep them separated for debugging purposes.

3.3.2.4 Rule Order


A few notes on the current rule ordering.

FixMultiPlural .o.
eTypeHR .o.
iTypeHR .o.
SUexception .o.
PronominalN .o.
FixGlottal .o.
FixVowelSeq .o.
FinalStopDevoicing .o.
VelarAlternations .o.
Gemination .o.
SuffOnsetDevoicing .o.
StemVDeletion .o.
ClearMBMarker .o.
ClearOptMarker .o.
ClearSVDMarker .o.
ClearGlottal .o.
ClearExHarmony
Figure 3.3: The current rule ordering

First things first: getting rid of the multiple plural morpheme is a good thing to start with. There are some local dependencies among the rules. As already mentioned, the e-type harmony rule has to precede the i-type harmony rule (or they will probably have to be merged into a single rule and apply simultaneously, as two-level rules do). Vowel harmony resolution also has to precede stem vowel deletion. If we proceed from left to right (with parallel rules), the stem vowels will be deleted before the suffix vowels that have to harmonize with the deleted vowel are resolved. There should also be some tendency to go from simpler and more general rules to more sophisticated and specific ones (either in upward or downward direction). No such tendency is present at the current stage of development, however.

The final stop devoicing, the velar alternations, the gemination, the stem vowel deletion and the rule for the su exception could all operate at a single stage, as they occur in identical contexts and their purpose is more or less the same. The suffix onset devoicing rule is partially dependent on the outcome of the final stop devoicing rule, but if the input is processed left to right, this outcome will be determined before the suffix onset devoicing rule applies. The pronominal n rule is also on its own, so, looking at the bigger picture, it seems that the rules are mostly independent. All that matters is to process the input sequentially, from left to right. With the wrong rule ordering, rules that apply to segments occurring after unresolved segments might therefore cause major trouble. This is the reason why most finite state approaches to Turkish morphology are based on two-level morphological descriptions.

4. Conclusions
In order to analyze the complex and often symbiotic relations between words, one needs first to determine the exact properties of each and every individual token. Some of the properties however, could only be determined after examining the environment. The common approach to this issue is inside-out (or bottom-up) starting from the basic entities and building up increasingly complex structures out of them. In this paper I presented an approach to part of the basic entities in the Turkish language.

5. Future Work
Where do we go from here? One could come up with various ideas, and I myself am not sure which way this project will take. First, before everything else, the model has to be completed to cover the other major word category in Turkish, as well as the minor word categories, to result in a full-featured morphological processor. Then, to extend functionality, a lexicon extraction routine could be implemented that automatically extracts entries from a dictionary into the morphological processor. This could be combined with a morphological guesser, and the two could form a symbiotic relation in which the former is used to train the latter, and the guessing algorithm occasionally provides material for the extension of the lexicon. Further on, I am thinking of implementing a syllabification module, as it seems quite necessary, and perhaps stress markup as well. Having a fully functional morphological processor at hand, there are various ways one could take: integrate it into a larger NLP system (speech synthesis/recognition applications, automatic machine translation applications, language tutoring applications, artificial intelligence components, OCR applications, supplemental linguistic applications); extend its functionality for different tasks (a major advantage of the modular approach: simply add a new module for the task at hand and occasionally tune up the existing modules); add a context component for disambiguation (this perhaps falls in the previous category); try approaching a different language; and numerous other options in the field. As a first step, however, complete coverage of the language of choice has to be accomplished.


Bibliography:
Dik, Simon C. 1981. Functional Grammar, 3rd ed. Foris. Dordrecht. The Netherlands.
Clements, George N. and Engin Sezer. 1982. Vowel and Consonant Disharmony in Turkish. Linguistic Models: The Structure of Phonological Representations (Part II), ed. by H. van der Hulst and N. Smith. Foris Publishing, Dordrecht, Holland.
Hankamer, Jorge. 1986. Finite State Morphology and Left to Right Parsing. Paper, 3rd International Conference on Turkish Linguistics, August 1986, Tilburg, The Netherlands.
Hopcroft, J.E. and J.D. Ullman. 1979. Introduction to Automata Theory, Languages and Computation. Addison Wesley.
Inkelas, Sharon and C. Orhan Orgun. 1997. The Implications of Lexical Exceptions for the Nature of Grammar. Derivations and Constraints in Phonology, ed. by Iggy Roca. Clarendon Press, Oxford.
Johnson, C. Douglas. 1972. Formal Aspects of Phonological Description. Mouton. The Hague. Paris.
Karttunen, Lauri, with Kenneth R. Beesley. 2003. Finite State Morphology. CSLI Publications. Stanford.
Ketrez, F. Nihan. 2003. Multiple Readings of the Plural Morpheme in Turkish. USC. USA. (online at: http://www-scf.usc.edu/~ketrez/papers/ADL2003ketrez.pdf - 25.06.2005)
Koskenniemi, Kimmo. 1983. Two-level Morphology. A General Computational Model for Word-Form Recognition and Production. Department of General Linguistics. University of Helsinki.
Köksal, A. 1975. A First Approach to a Computerized Model for the Automatic Morphological Analysis of Turkish. Doctoral Dissertation, Hacettepe Universitesi, Ankara.
Lewis, Geoffrey. 1967. Turkish Grammar. Oxford University Press. Oxford.
Lewis, Geoffrey. 1989. Turkish, 2nd ed. (Teach Yourself Books). Hodder and Stoughton. London.
Oflazer, Kemal. 1994. Two-level Description of Turkish Morphology. Literary and Linguistic Computing. (online at: http://acl.ldc.upenn.edu/E/E93/E93-1066.pdf - 25.06.2005)
Oflazer, Kemal, Elvan Göçmen and Cem Bozşahin. 1995. An Outline of Turkish Morphology. Technical Report. Middle East Technical University. (online at: http://www.lcsl.metu.edu.tr/ftp/papers/morphspecs.ps.gz - 18.07.2005)
Pollard, Asuman Çelen and David Pollard. 1996. Turkish: A complete course for beginners. (Teach Yourself Books). Hodder and Stoughton. London.
Schaaik, Gerjan van. 1996. Studies in Turkish Grammar. Harrassowitz Verlag, Wiesbaden, Germany.
Sebüktekin, Hikmet I. 1971. Turkish-English Contrastive Analysis. Turkish Morphology and Corresponding English Structures. Mouton. The Hague. Paris.

Useful links:
http://www.hlst.sabanciuniv.edu/TL/ - The Turkish Lexical Database Project - provides morphological analysis to verify the results
http://www.turkishdictionary.net/ - Turkish online dictionary - additional glossary
http://www.google.com/ - Everything is there! - using the web as a corpus


Appendix A: List of Abbreviations


CASES:
Nom./+Abs       Nominative/Absolute
Acc./+DefObj    Accusative/Definite Objective
Dat./+Dat       Dative
Gen./+Gen       Genitive
Loc./+Loc       Locative
Abl./+Abl       Ablative

NUMBER/POSSESSIVE:
Sg./+Sg         Singular
Pl./+Pl         Plural
(+)1p/2p/3p     1/2/3 Person
Poss./+Poss     Possessive

GENERAL:
FST             Finite State Transducer
FSA             Finite State Automaton (-ta)
FSN             Finite State Network
LF              Lexical Form (lexicon entry form)
SF              Surface Form (standard orthographical representation)


Appendix B: lexc Code Samples


!##############A lexc solution to the network in Figure 2.2########################
LEXICON Root        !#The start state so to say. Every lexicon needs it.
One;                !#A line in lexc has two components:
                    !#1. An expression (which could be as complex as needed)
                    !#2. A continuation class
                    !#Think of the expression as the symbol over the arc and
                    !#the continuation class as the destination state
LEXICON One         !#Figuratively speaking State 1
a Two;              !#The two arcs with the respective input symbols and destinations
b Three;
LEXICON Two         !#State 2
b Three;
LEXICON Three       !#State 3
#;                  !#The hash symbol denotes end of input, or a final state
c One;              !#The loop back to State 1

!#################A model lexc solution for Figure 2.3###########################
!#Same as above for the most part
LEXICON Root
One;
LEXICON One
a:A Two;            !#The colon operator denotes a transduction here
b:B Three;          !#Basically the expressions could be regular expressions
                    !#with varying complexity, combining various operations,
                    !#but as my key concept is modularity, I will try to keep
                    !#them as simple as possible.
LEXICON Two
b:B Three;
LEXICON Three
#;
c One;


Appendix C: On Replacement Rules


Replacement rules are simply intuitive and convenient shorthands for more complex regular expressions. The most general shape of a context-free replacement rule is:

A->B

where A and B are regular languages (which could be arbitrarily complex regular expressions themselves). In this case, every string from the upper language (the universal language, see footnote 1) is mapped to itself, except that whenever a substring from A is encountered, it is related to a substring from B. This is in contrast to ordinary transducers, where nothing happens and there is no output if the input string doesn't match a string from the upper language. The formalism is further extended to include context:

A->B || L _ R

where A, B, L and R all denote languages and not relations (both L and R are optional). What happens here is essentially the same as above, only that the languages A and B are further contextually restricted. A substring from A is related to a substring from B only if it is preceded by a substring from L and followed by a substring from R. The double vertical bars separate the rule(s) from the context. Different rules operating in the same context are separated by a comma:

A->B , C->D || L _ R

The same is valid for contexts:

A->B || L1 _ R1 , L2 _ R2

Replacement rules can be constructed to operate in parallel (as in two-level models) using the double comma (,,) separator:

A->B || L1 _ R1 ,, C->D || L2 _ R2

Or they can be composed as standard networks:

[A->B || L1 _ R1] .o. [C->D || L2 _ R2]

The difference is crucial if the rules are dependent on each other. These are the basics. For more information on XFST and its replacement rules refer to (Karttunen, 2003).
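Why the difference between parallel and composed rules matters can be seen in a toy example. The Python sketch below is not xfst and handles only context-free, single-string replacements; the helper names apply_parallel and apply_composed are hypothetical. It contrasts two rules that feed each other (a->b and b->a) applied in one simultaneous pass with the same rules applied one after the other.

```python
import re


def apply_parallel(rules: dict, text: str) -> str:
    """Apply all replacements in a single pass over the input (like ',,')."""
    pattern = re.compile("|".join(re.escape(src) for src in rules))
    return pattern.sub(lambda m: rules[m.group()], text)


def apply_composed(rules: list, text: str) -> str:
    """Apply replacements one after the other (like '.o.'):
    each rule sees the output of the previous one."""
    for src, dst in rules:
        text = text.replace(src, dst)
    return text


if __name__ == "__main__":
    print(apply_parallel({"a": "b", "b": "a"}, "ab"))       # ba
    print(apply_composed([("a", "b"), ("b", "a")], "ab"))   # aa
```

In the parallel case each symbol is rewritten once on the basis of the original input, whereas in the composed case the second rule also rewrites the output of the first one.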

1 The language of all possible strings.
