
Latin Noun Inflection and Latin Prosody -

A Finite State Implementation

BA-Thesis

Author: Bettina Demmer


Nauklerstraße 63
72074 Tübingen
bettina.demmer@gmx.de
-------------------------------
Seminar: Finite State Methods in Computational Linguistics (SS 2006)
Instructor: Dr. Dale Gerdemann
International Studies in Computational Linguistics
Seminar für Sprachwissenschaft
Eberhard Karls Universität Tübingen
I hereby declare that I wrote the submitted thesis independently, using only the sources and
aids listed, including the www and other electronic sources. All passages of the thesis that I
have taken from other works, whether verbatim or in substance, are marked as such.

Tübingen, 21 August 2006

Bettina Demmer

In principio erat verbum (Joh 1,1)

Table of Contents

Abstract
1 Introduction
2 Morphology in Computational Linguistics
2.1 Definition of Morphology
2.2 Computational Applications of Morphology
2.3 What is Finite State Morphology?
2.4 Existing Approaches to the Morphology of Latin
3 The Latin Noun
3.1 Latin Alphabet
3.2 Latin Noun Inflection
3.2.1 Stem + Ending
3.2.2 Case, Number, Gender
3.2.3 Declension
3.3 Latin Prosody and Stress Assignment
3.3.1 Latin Syllabification
3.3.2 Penultimate Law
4 Latin Finite State Implementation in xfst
4.1 The Overall Structure of the Script
4.2 Introduction to the xfst Syntax
4.3 The xfst Script in More Detail
5 Bibliography
6 Appendix: xfst Script File

Abstract
This paper, submitted for the degree 'Bachelor of Arts', describes a finite state approach to
the inflectional morphology and the prosody of classical Latin nouns. Using the Xerox finite
state tools, we developed an xfst script that describes step by step, as a series of small
transducers composed together, the construction of a classical Latin noun on the one hand and
stress assignment to that noun on the other. The advantage of finite state tools is that the
approach is bidirectional: the same xfst script can be used to form a declined Latin noun
surface form with assigned stress from a given lexicon entry (generation) or to map a given
Latin surface form back to its lexicon entry (analysis). The paper also covers a general
introduction to morphology, a definition of finite state morphology, which is used to describe
the natural language morphology of Latin, a survey of existing computational approaches to
Latin morphology and a linguistic description of Latin inflectional morphology and prosody.

1 Introduction
Finite state morphology is a computational description of natural language morphology.
Morphology, a classical branch of linguistics, deals with the formation of words out of
smaller pieces, called morphemes (→ section 2.1). Finite state morphology (→ section 2.3)
treats the morphologies of natural languages in a technical way. One tries to extract rules
that describe the structural patterns of word formation, that is, rules that can be spelled
out to form a two-way program which is able to analyze surface word forms of a specific
language and to generate word forms from the lexicon according to specific features. In this
paper we discuss a finite state implementation of the inflectional morphology and prosody of
Latin nouns. The program generates an inflected Latin noun from the lexicon according to
features specified by the user (case and number) and assigns stress to it. It can also analyze
an inflected noun specified by the user, returning its lemma (dictionary entry) and the
morphological features it contains.

In section two of this paper we give a general introduction to morphology. What is
morphology and how can it be described in terms of a finite state machine? As this paper deals
with Latin, a highly inflecting language, we focus especially on the definition of
inflectional morphology (→ section 2.1). In section 2.2 we survey possible applications of
morphology in computational linguistics. We then discuss the general ideas of finite state
morphology (→ section 2.3). That section introduces the basics of finite state theory;
readers familiar with this theory can skip it. At the end of section two (→ section 2.4), we
give an overview of the research done on Latin morphology and of existing approaches dealing
with it.

In section three of this paper we give an overview of Latin noun morphology (→ section
3.2) and Latin prosody (→ section 3.3). We explain our motivation for choosing Latin as
an example language for experimenting with finite state tools. We argue that splitting Latin
nouns into stem and ending is more general and computationally more efficient than the
traditional way of splitting them into root, theme vowel and suffix (→ section 3.2.1). The
prosody of Latin is also discussed in this section. The Latin stress rule, the 'Penultimate
Law', is explained (→ section 3.3.2), and we argue that it is possible to assign stress to
Latin nouns without knowing the exact syllable boundaries of the word.

In section four of this paper we come to the finite state implementation, i.e. the
application of the previously discussed theory of finite state morphology to Latin, a natural
though 'dead' language. We first argue for the 'Item-and-Process' theory, on which we based
the basic structure of our xfst implementation (→ section 4.1). In section 4.2 we give an
introduction to the syntax of xfst. We then explain step by step the rules of the finite state
transducers, which together form a complete 'construction plan' of a Latin noun
(→ section 4.3).

2 Morphology in Computational Linguistics

2.1 Definition of Morphology

Morphology, from Greek morphe 'form', is the branch of linguistics that studies the 'forms of
words'. It deals with the internal structure of words. The basic components of a word are one
or more morphemes. There are three types of morphemes (Müller, 2002): the root morpheme,
which carries lexical meaning and is the base of all morphologically related words of the
same family; the stem morpheme, which is a realization of a root morpheme and can be identical
to the root morpheme; and the affix, a dependent component of a complex word, which cannot
stand alone. Morphology divides the study of word formation into three fields (Matthews,
1991): Derivation describes the formation of a new word out of an existing word with the help
of derivational morphemes. This process usually involves a change of meaning or of the part of
speech of a word. An example of this process is the English prefix 'un-', which turns an
adjective into its negative counterpart. Composition describes the formation of new words out
of two (or more) single words. The third field is called inflection, which is concerned with
the grammatically motivated forms of words depending on the syntactic context they appear in.
In inflecting languages words are usually constructed of a basic morpheme, the root of a word
(which carries the lexical meaning of the word), and inflectional morphemes, i.e. affixes
(which carry other information, e.g. plural marking, case marking etc.). Traditionally,
inflection is presented in paradigms, a two-dimensional form of representation, which covers
one morphosyntactic category on one axis and another morphosyntactic category (a category that
is "directly referred to by specific rules in both morphology and syntax" (Matthews, 1991)) on
the other axis. These categories can consist of sets of variables. A word (or actually a
lemma, the morpheme carrying the meaning of the word) is then inflected according to the
categories on the two axes. Latin, which is described morphologically in section 3.2, is a
highly inflecting language: the function of a noun in its context is expressed by a suffix
that depends on its declension on the one axis and on gender, case and number as a set of
categories on the other axis. An advantage of the
representation of the inflection of a language in paradigms is that it is quite easy to find word
forms that share the same spelling and phonetics but express different functions. This
phenomenon is called 'syncretism' (Matthews, 1991). In Latin, for example, the ablative plural
form is always identical in spelling and phonetics to the dative plural form of the same noun.
But the function of the noun, either as dative plural or as ablative plural, in the context is
different. The branch of morphology that is concerned with inflection and paradigms is called
'Inflectional Morphology'.
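
The search for syncretic forms in a paradigm can be illustrated with a minimal Python sketch. It is only an illustration and not part of the xfst implementation; the dictionary layout and the function name are assumptions made for this example. The paradigm is that of 'stella' (Engl. star), with long vowels written with a following '=' as elsewhere in this paper.

```python
from collections import defaultdict

# Paradigm of 'stella' keyed by (case, number); '=' marks a long vowel.
# The data layout is an assumption made for this illustration.
PARADIGM = {
    ("nom", "sg"): "stella",
    ("gen", "sg"): "stellae",
    ("dat", "sg"): "stellae",
    ("acc", "sg"): "stellam",
    ("abl", "sg"): "stella=",
    ("nom", "pl"): "stellae",
    ("gen", "pl"): "stella=rum",
    ("dat", "pl"): "stelli=s",
    ("acc", "pl"): "stella=s",
    ("abl", "pl"): "stelli=s",
}


def syncretic_cells(paradigm):
    """Group paradigm cells by surface form and keep forms shared by several cells."""
    cells_by_form = defaultdict(list)
    for cell, form in paradigm.items():
        cells_by_form[form].append(cell)
    return {form: cells for form, cells in cells_by_form.items() if len(cells) > 1}


if __name__ == "__main__":
    for form, cells in syncretic_cells(PARADIGM).items():
        print(form, cells)
```

In this sketch the dative plural and the ablative plural end up in the same group, which is the case of syncretism mentioned above.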

2.2 Computational Applications of Morphology

This section gives an overview of the role of morphology in computational linguistics in
general and of some of its possible applications in natural language processing.

Morphology, as the study of the internal structure of words, forms the basis for almost all
natural language applications in computational linguistics. In classical applications such as
machine translation, information retrieval, parsing, part of speech disambiguation or data
mining, correctly determining the internal structure of a word is necessary and contributes to
the efficiency of the system. Machine translation, for example, needs to know the grammatical
function of a word in its context, which is mainly determined by its morphological analysis,
in order to choose the correct translation. Information retrieval, by contrast, benefits from
tracing a word back to its lemma (the base form under which the word appears in a dictionary)
rather than analyzing its grammatical realization in context, because this helps to summarize
the information of a text.
Morphology is actually the most important branch of linguistics and computational
linguistics, as it builds the basis for all the other branches: syntax, semantics and phonology.

2.3 What is Finite State Morphology?

Finite state morphology is a branch of computational linguistics which deals with morphology
in a technical sense. In finite state morphology, the morphological description of a natural
language is represented as a finite state automaton or as a finite state transducer (general
term: finite state machine). A finite state automaton describes a language, whereas a finite
state transducer describes a relation between two languages. The language or the relation is
specified in terms of regular expressions (Roark and Sproat, 2006).
Kaplan and Kay (1994), whose ideas were first presented in the early 1980s, is the most
influential work in the field of finite state morphology. It was their idea to represent
phonological rules as a cascade of transducers. Inspired by this idea, Koskenniemi (1983)
implemented a finite state system which he calls 'a general computational model for word-form
recognition and production'. Since his work, finite state methods have been used to describe
the morphology and phonology of a wide range of natural languages.

Every finite state machine consists of one or more states, exactly one start state and any
number of final states, which are connected by arcs. Every arc has a label and a destination
(one state of the network). Small networks can be viewed graphically as transition diagrams.
Every finite state network includes a 'sigma', the symbol alphabet of the machine. These
symbols represent the range of the language or relation that the network describes (Beesley
and Karttunen, 2003).

Finite State Automaton
Dealing with natural language, a finite state automaton is a network that accepts a regular
language. Figure one shows a finite state automaton which describes the language ab*cdd*e.

Figure 1: A simple finite-state automaton accepting the language ab*cdd*e (Roark and Sproat, 2006).

A language in finite state terms is a set of words over an alphabet, which in turn is a set of
characters. A language is called regular if it can be constructed from a finite alphabet using
the following operations: set union, concatenation and transitive closure (Roark and Sproat,
2006). A finite state automaton matches an input string against the labels of its arcs. If a
final state is reached once the whole string has been matched, the string is accepted and is
in the language of the automaton. Roark and Sproat (2006) give a technical summary of the
definition of a finite state automaton:

A finite-state automaton is a quintuple M = (K, s, F, Σ, δ) where:

1. K is a finite set of states
2. s is a designated initial state
3. F is a designated set of final states
4. Σ is an alphabet of symbols, and
5. δ is a transition relation from K × (Σ ∪ ε) to K
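
The definition can be made concrete with a small Python sketch of an automaton for the language ab*cdd*e from Figure 1. It is a simplified illustration, not the xfst representation: the state numbering is our own assumption, and the transition relation is restricted to a deterministic table without ε-arcs.

```python
# Transition table for ab*cdd*e. State numbers 0..4 are an assumption of this
# sketch; the figure in Roark and Sproat (2006) may number its states differently.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 1,  # loop for b*
    (1, "c"): 2,
    (2, "d"): 3,
    (3, "d"): 3,  # loop for d*
    (3, "e"): 4,
}
START = 0
FINAL = {4}


def accepts(word):
    """Return True if the automaton ends in a final state after reading word."""
    state = START
    for symbol in word:
        if (state, symbol) not in TRANSITIONS:
            return False  # no matching arc: the string is rejected
        state = TRANSITIONS[(state, symbol)]
    return state in FINAL


if __name__ == "__main__":
    for candidate in ["acde", "abbcddde", "ace", "abcd"]:
        print(candidate, accepts(candidate))
```

Here "acde" and "abbcddde" are accepted, while "ace" (no d) and "abcd" (no final e) are rejected.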

There are some special languages which should be mentioned:

1) The empty-string language contains exactly one string, the empty string ε. Its network
consists of exactly one state, which is final.
2) The null language (also called the empty language) does not contain any string, not even
the empty string. Its network consists of exactly one non-final state.
3) The universal language, denoted by Σ*, contains all strings that can be constructed from
the alphabet Σ, including the empty string ε.

Finite State Transducer


A finite state transducer is a network that describes a regular relation. Figure two shows a
finite state transducer that describes the regular relation (a:a)(b:b)*(c:g)(d:f)(d:f)*(e:e).

Figure 2: A simple finite-state transducer that computes the relation (a:a)(b:b)*(c:g)(d:f)(d:f)*(e:e)
(Roark and Sproat, 2006).

In natural language applications, a regular relation is almost always a mapping between pairs
of strings. A finite state transducer matches a string against the upper symbols of the labels
of its arcs and maps these to the lower symbols. If the string is matched, i.e. a final state
of the network is reached, the transformed string is given as output. Roark and Sproat (2006)
give a technical summary of the definition of a finite state transducer:

A (2-way) finite-state transducer is a quintuple M = (K, s, F, Σ × Σ, δ) where:

1. K is a finite set of states
2. s is a designated initial state
3. F is a designated set of final states
4. Σ is an alphabet of symbols, and
5. δ is a transition relation from K × ((Σ ∪ ε) × (Σ ∪ ε)) to K
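
As a rough illustration of this definition, the following Python sketch implements the relation (a:a)(b:b)*(c:g)(d:f)(d:f)*(e:e) as a table of arcs that carry an upper and a lower symbol. The state numbering and the function name are assumptions of this sketch, and ε-arcs are left out for simplicity.

```python
# Each arc maps (state, upper symbol) to (lower symbol, next state).
# State numbers are an assumption made for this illustration.
ARCS = {
    (0, "a"): ("a", 1),
    (1, "b"): ("b", 1),
    (1, "c"): ("g", 2),
    (2, "d"): ("f", 3),
    (3, "d"): ("f", 3),
    (3, "e"): ("e", 4),
}
START = 0
FINAL = {4}


def transduce(word):
    """Map an upper-side string to its lower-side string, or None if rejected."""
    state = START
    output = []
    for symbol in word:
        if (state, symbol) not in ARCS:
            return None  # no arc whose upper symbol matches
        lower, state = ARCS[(state, symbol)]
        output.append(lower)
    return "".join(output) if state in FINAL else None


if __name__ == "__main__":
    print(transduce("abbcdde"))
    print(transduce("abc"))
```

The string "abbcdde" is mapped to "abbgffe", whereas "abc" yields no output because no final state is reached.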

Composition plays an important role in finite state transducers (Roark and Sproat, 2006).
Composing two transducers means first applying the first transducer and then applying the
second transducer to the output of the first. We used this operation extensively in our finite
state description of Latin noun morphology, where we factored the system into a set of
operations that are executed one after another using composition (→ section 4.3). Another
central feature of finite state transducers is inversion (Roark and Sproat, 2006). Inversion
means that a system implemented as a finite state transducer, or as a set of transducers
composed together, can be used in two directions. In morphology it can map a string from the
lexical level to the surface level (generation) following several rules, or it can be used
the other way around, from the surface level to the lexical level (analysis). This feature
constitutes the innovation of our treatment of Latin nouns, as our program can be used to
generate Latin nouns from the lexicon as well as to analyze Latin nouns from a given surface
form (→ section 4.3).
Finite state methods can be used for speech and language processing, including morphology
and phonology, computational analysis of syntax, language modelling for speech recognition,
pronunciation modelling etc. (Roark and Sproat, 2006).
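
Composition and inversion can be sketched in Python on finite relations represented as sets of string pairs. This is a toy model, not the way xfst composes networks; the two relations and the tag notation '+Acc+Sg' are hypothetical and serve only as an example.

```python
def compose(first, second):
    """Pair x with z whenever first maps x to some y and second maps that y to z."""
    return {(x, z) for (x, y) in first for (y2, z) in second if y == y2}


def invert(relation):
    """Swap the upper and lower side of every pair."""
    return {(y, x) for (x, y) in relation}


# Hypothetical two-step derivation: choose an ending, then remove the boundary.
ADD_ENDING = {("amica+Acc+Sg", "amica^m"), ("amica+Nom+Sg", "amica^")}
REMOVE_BOUNDARY = {("amica^m", "amicam"), ("amica^", "amica")}

GENERATOR = compose(ADD_ENDING, REMOVE_BOUNDARY)  # lexical level -> surface level
ANALYZER = invert(GENERATOR)                      # surface level -> lexical level

if __name__ == "__main__":
    print(sorted(GENERATOR))
    print(sorted(ANALYZER))
```

The same composed relation thus serves for generation and, once inverted, for analysis.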

For our implementation we used the Xerox finite state tools, specifically the xfst platform,
to describe transducers which, composed together, form the construction plan of a Latin
noun. xfst includes a compiler which builds a finite state network out of the description of
the transducers in the xfst script file. For more information on the syntax of xfst see
section 4.2.

2.4 Existing Approaches to the Morphology of Latin

Latin is a very popular language for morphological analysis, and much research has been done
on Latin inflectional morphology. In this section we present an overview of literature and
systems concerned with Latin morphology.
Matthews (1972) describes the inflection of Latin verbs in order to explain inflectional
morphology in general. In his book, he contrasts the 'Item-and-Arrangement' theory with the
'Item-and-Process' theory. He argues that for highly inflecting languages such as Latin, the
'Item-and-Arrangement' theory, in which morphemes are the basic units of meaning and are
arranged linearly, is not sufficient. Instead he argues for the 'Item-and-Process' theory, in
which morphology is viewed as the construction of words out of base forms (stems or roots)
modified by rules. See section 4.1 for further discussion.
In Matthews (1991), Latin is used as a representative of the highly inflecting Indo-European
languages in order to show and explain paradigm structures. Paradigms (Greek parádeigma
'pattern') are the traditional way of presenting a word, in our paper a noun, and its
inflectional changes according to certain features and contexts. Paradigms are two-dimensional
constructions in which one category is set against other categories. In Latin noun inflection,
which is described in this paper, we set the declension of a noun against its case and number.
Lindsay (1894), Sommer (1914) and Sommer (1977) give classical analyses of Latin
morphology without reference to computational applications. Our summary of Latin noun
morphology in sections 3.1 and 3.2 is mostly taken from these books.
Bender, in his collection 'Essays on Morphology', describes Latin noun inflection somewhat
differently from traditional descriptions. He splits Latin nouns into stems and endings, as
opposed to the traditional analysis into root, theme vowel and suffix (with fusion of the
latter two), in order to minimize the number of morphological conditioning contexts. He argues
that this way of splitting generalizes over the traditional theme-vowel-plus-suffix endings,
which in most cases differ only in the theme vowel. When the theme vowel is counted as part of
the stem together with the root, the declension membership of the noun becomes predictable:
the final character of the stem decides which of the six stem classes the noun belongs to (the
five traditional declensions, with the third split into i-stems and consonantal stems). Our
implementation adopts Bender's analysis. For further discussion see section 3.2.1.
Covington (1999) adopts the same generalization about Latin noun inflection as Bender. He
argues that leaving the theme vowel with the root of the word, so that both together form the
stem, and generalizing the remaining ending across the declensions contributes to the
"economy of representation when the inflectional system is stored as a transition network
[…], a representation that is computationally efficient and may be psychologically
realistic". He refers to Bubenheimer (1995), who has implemented a morphological analyzer
based on transition networks.
McLean presents a Latin translator on his homepage. The program takes inflected Latin words
as input and returns the English translation and a short analysis of the word (including case
and number but not declension). A disadvantage of the program is that it does not trace
declined nouns back to their lemmas.
Logos offers 'language solutions' on its homepage. There we found a 'universal conjugator', a
system, which also handles Latin verb inflection. It is possible to enter a Latin verb and the
output of the system is the complete conjugation of that verb.
Bozzi and Cappelli (1991) present 'A project for Latin Lexicography: 2. A Latin
morphological analyzer'. In their article they describe a morphological analyzer, which
comprises a base dictionary, a table of suffixes, a table of endings and a table of postfixes.
The Perseus Project, 'a digital library for the humanities', offers morphological analysis for
inflecting languages such as Latin and Greek. Using the tool 'Latin Morphological Analysis',
the user can enter an inflected word (the system covers adjectives, nouns and verbs) and gets
a table with all possible morphological analyses of the entered word, including the lemma of
the word, its English translation and its frequency in the Latin Prose, Latin Poetry or Latin
Texts corpus.
During our literature research on Latin morphology, we encountered many morphological
implementations of, and ideas about, Latin morphology. Most of the systems we found, however,
deal with Latin verb morphology rather than with Latin noun morphology, and all of them
provide analysis only. What is new in the approach described in this paper is the construction
of a bidirectional system, which on the one hand analyzes Latin noun forms and on the other
hand generates Latin noun forms according to given features.

3 The Latin Noun
Latin belongs to the Indo-European language family. It is the mother language of the
Romance languages in the Indo-European language tree (Stowasser, 2004).
Classical Latin, also called 'aurea Latinitas', is the name of a period of about 100 years in
the first century BC in the development of Latin. It is regarded as the period in which Latin
was most highly developed. In this time Latin developed into a cultivated language of
literature and education (Stowasser, 2004). Latin became the official language of the Roman
Empire. As classical Latin is the phase in the history of the language in which the most
grammatical restrictions existed, we concentrate in the following analysis on the grammar of
the Latin of this time. We always refer to classical Latin when we mention the language Latin.
A Latin noun is determined by case, number and gender. Further, Latin nouns are grouped
into five different declensions, which are distinguished by the final character of the stem
of a noun.

3.1 Latin Alphabet

The alphabet of Latin consists of 24 letters (Stock, 1970).

A B C D E F G H I K L M N O P Q R S T U V X Y Z
a b c d e f g h i k l m n o p q r s t u v x y z

Classical Latin has


• 6 vowels a e i o u (y), each of which can be either short or long: a= e= i= o= u= (y=)
(long vowels are marked with an =-sign in the text)
• 4 diphthongs ae oe au eu, which are always long
• 18 consonants b p d t g c q k l r m n f v s z h x (Stock, 1970).
In written Latin there is no graphical distinction between long and short vowels. Thus,
many homographs can be found in Latin, i.e. words which share the same spelling but have
different meanings caused by differences in vowel quantity, e.g. iace=re (Engl. to lie) vs.
iacere (Engl. to throw), pare=re (Engl. to obey) vs. parere (Engl. to give birth), cupi=do (Engl.
desire) vs. cupido (Engl. someone who is eager (in the dative case)) (Stowasser, 2004). The
information about vowel quantity has to be given in the lexicon.

3.2 Latin Noun Inflection

3.2.1 Stem + Ending

Latin is an inflecting language, i.e. the grammatical function of a word form is expressed
mainly by changes of the word final ending (Stowasser, 2004), e.g. the word form 'amica-m'
(engl. '(female) friend') can be analyzed in the following way:

amica - m
'stem' 'ending'

The stem carries the lexical meaning of the word, in this case '(female) friend', and the
information which declension the noun belongs to; in this case the final character of the
stem is an 'a', so the noun belongs to the first declension (or a-declension).

The ending carries the grammatical function of the word, in this case 'accusative' +
'singular' + 'feminine'.

Stem
The stem of a noun can be found by cutting off the ending '–um' or '–rum' in the genitive
plural form of a noun, e.g. flamma-rum, lupo-rum, passu-um, die-rum, turri-um, reg-um
(Stock, 1970). This stem appears in front of all case endings except for the nominative
singular ending (and except for accusative singular ending in neuter nouns). In front of
nominative singular endings (and accusative singular endings in neuter nouns), a stem change
takes place (Sommer, 1914). The changed stem that is used in front of the nominative singular
ending (and accusative singular ending) combined with its nominative singular ending
constitutes the base form, the lemma, that is found in the Latin dictionary. The final character
of the stem is decisive for the declension of the noun (→ section 3.2.3).

Ending
The ending is added to this stem according to the declension, the case, the number and the
gender of the noun. Most grammars split up Latin nouns differently: they count the final
character of the stem as part of the ending, so that every declension has its own ending for
each case, number and gender. This way of splitting is easier for learning the language. The
learner first studies the different declensions that exist in Latin. Then she learns all the
endings according to the declension, i.e. the endings that contain the theme vowel. In this
paper, however, we discuss a computational implementation of Latin noun morphology, so we
prefer the highest generalization that examination of Latin nouns allows. Splitting Latin
nouns into stem and ending allows the endings to be generalized further. This can be seen in
more detail in the description of the xfst implementation (→ section 4.3). All the endings
can be summarized in only two tables (→ table1 and table2). The endings can be condensed even
further: e.g. the ablative singular ending for masculine/feminine nouns is the same for all
declensions except for the consonantal declension. This phenomenon is called 'syncretism'.
Scaling down the endings and combining features in this way runs counter to the needs of a
language learner, for whom it is easier to have a separate ending for every feature.
The following tables show all Latin noun endings (taken from Bender). The proper ending for a
noun is determined by the final character of its stem (→ declension) and its gender. Table1
covers all endings for masculine and feminine nouns; table2 covers all endings for neuter
nouns. As neuter nouns occur only in the second, third and fourth declensions, the other
columns are left empty. Some special characters have to be explained: A caret (^) preceding a
vowel-initial suffix indicates that that vowel replaces the stem vowel, if any. A tilde (~)
preceding a consonantal suffix indicates that the preceding vowel is always short. A hash (#)
stands for a zero suffix. A colon (:) following a vowel indicates that the vowel turns into a
long vowel.

1 masc/fem     1st: a   2nd: o   5th: e   4th: u   3rd: i   3rd: C
nom sg         #        s        s        s        s        s
acc sg         ~m       ~m       ~m       ~m       ~m       em
dat sg         e        :        i:       i:       i:       i:
abl sg         :        :        :        :        :        e
gen sg         e        ^i:      i:       :s       ^is      ^is
nom pl         e        ^i:      :s       :s       ^e:s     ^e:s
acc pl         :s       :s       :s       :s       :s       e:s
dat pl/abl pl  ^i:s     ^i:s     bus      ^ibus    ^ibus    ^ibus
gen pl         :rum     :rum     :rum     um       um       um

2 neut         1st: a   2nd: o   5th: e   4th: u   3rd: i   3rd: C
nom sg                  ~m                :        #        #
acc sg                  ~m                :        #        #
dat sg                  :                 :        :        i:
abl sg                  :                 :        :        e
gen sg                  ^i:               :s       s        is
nom pl                  ^a                a        a        a
acc pl                  ^a                a        a        a
dat pl/abl pl           ^i:s              ^ibus    ^ibus    ^ibus
gen pl                  :rum              um       um       um

As an example we take the word 'stella' (engl. star) and go through its declension. The stem of
that noun which is given in the lexicon is 'stella-'. From the final character of the stem we can
see that the noun belongs to a-declension. The noun is feminine (which is given in the
lexicon), that means that the endings are taken from table1, 2nd column (with the title '1st: a').
Now all the endings from this column can be added to the stem according to case and number.
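
The procedure can be sketched in Python as follows. This is a simplified illustration and not the xfst script discussed in section 4.3: the function names and the dictionary layout are assumptions of this sketch, the endings are copied from the '1st: a' column of table1, and long vowels are written with a following '='. Stems are assumed to be stored with a short final vowel, so the tilde requires no further action here.

```python
VOWELS = "aeiouy"

# Endings of the '1st: a' column of table1, keyed by (case, number).
ENDINGS_A = {
    ("nom", "sg"): "#",
    ("acc", "sg"): "~m",
    ("dat", "sg"): "e",
    ("abl", "sg"): ":",
    ("gen", "sg"): "e",
    ("nom", "pl"): "e",
    ("acc", "pl"): ":s",
    ("dat", "pl"): "^i:s",
    ("abl", "pl"): "^i:s",
    ("gen", "pl"): ":rum",
}


def attach(stem, ending):
    """Attach one ending to a stem, interpreting the special characters #, ^, ~ and :."""
    if ending == "#":
        return stem  # zero suffix
    if ending.startswith("^"):
        # The suffix vowel replaces the stem-final vowel, if there is one.
        if stem and stem[-1] in VOWELS:
            stem = stem[:-1]
        ending = ending[1:]
    elif ending.startswith("~"):
        # The preceding vowel stays short; short vowels are unmarked here.
        ending = ending[1:]
    # A colon lengthens the vowel it follows; length is written as '='.
    return (stem + ending).replace(":", "=")


def decline(stem, endings):
    """Build all forms of a noun from its stem and one column of endings."""
    return {cell: attach(stem, ending) for cell, ending in endings.items()}


if __name__ == "__main__":
    for cell, form in decline("stella", ENDINGS_A).items():
        print(cell, form)
```

For the stem 'stella' this yields, for example, 'stellam' for the accusative singular, 'stella=' for the ablative singular and 'stelli=s' for the dative and ablative plural.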

3.2.2 Case, Number, Gender

In Latin, six cases are distinguished: nominative (subject of sentence), genitive (possession,
attachment), dative (indirect object), accusative (direct object), ablative (means, object of
prepositions of position) and vocative (personal address). The vocative case endings
correspond to the nominative case endings in all declensions except for the vocative singular
ending in o-declension nouns, where the ending is '–e' instead of the nominative singular
ending '–us' (Stock, 1970).
The number of a noun is either singular or plural.
The gender of a Latin noun can be masculine, feminine or neuter. The gender information
cannot be seen by looking at the noun. Thus, the information must be given in the dictionary.

3.2.3 Declension

Latin nouns are grouped into five different declensions (Stock, 1970). A noun's membership
in a declension is decided by the last character of its stem, e.g. the noun rex (Engl. king)
has the stem reg-, which ends in a consonant. Therefore 'rex' belongs to the third declension
(consonantal stems). The first declension covers nouns with a-stems, the second declension
nouns with o-stems; the third, as already mentioned, covers nouns with consonantal stems,
i-stems or mixed nouns (which are nouns that belong to the consonantal stem nouns in the
singular and to the i-stem nouns in the plural). The fourth declension covers nouns with
u-stems, the fifth declension nouns with e-stems. Some grammars (e.g. Stock, 1970) name the
declensions after their stems: a-declension, o-declension, consonantal declension,
i-declension and mixed declension (together the third), u-declension and e-declension,
respectively.
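
The mapping from the final character of the stem to the declension can be written as a small Python sketch. The function name and the labels are assumptions made for this illustration; mixed nouns cannot be recognized from the final character alone and are therefore not covered.

```python
# Declension by stem-final vowel; every other final character counts as a consonant.
DECLENSION_BY_FINAL_VOWEL = {
    "a": "1st (a-declension)",
    "o": "2nd (o-declension)",
    "i": "3rd (i-declension)",
    "u": "4th (u-declension)",
    "e": "5th (e-declension)",
}


def declension_of(stem):
    """Return the declension of a noun given its stem, e.g. 'reg-' or 'stella'."""
    # Drop a trailing hyphen or length mark so that 'reg-' and 'reg' behave alike.
    final = stem.rstrip("-=")[-1].lower()
    return DECLENSION_BY_FINAL_VOWEL.get(final, "3rd (consonantal declension)")


if __name__ == "__main__":
    for stem in ["flamma", "lupo", "passu", "die", "turri", "reg-"]:
        print(stem, declension_of(stem))
```

The example stems are those obtained from the genitive plural forms in section 3.2.1.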

3.3 Latin Prosody and Stress Assignment

As Latin is no longer spoken, Latin prosody as a science is based on the syllabification
theories of the Roman grammarians, on actual syllabification in inscriptions and on a theory
of syllable boundaries with which linguistic phenomena can be explained (Sommer, 1977).
In written Latin, stress is not visible. But from observations of how stress affects meaning
or how stress occurs in verse, and from ancient notes, it is possible to reconstruct stress
assignment in Latin words.
Until the 5th century BC, Latin had an expiratory stress, which is produced by stronger air
pressure during the pronunciation of the stressed syllable. In this phase, stress was always
assigned to the first syllable of the word form (initial stress) (Allen, 1978). Later in the
5th century BC, this type of stress changed to a musical stress, in which the stressed
syllable is pronounced at a higher pitch. In this second phase, stress was assigned according
to the quantity of the penultimate syllable of the word form. In a third phase, the stress
changed again into an expiratory stress (Sommer, 1914), which distinguishes accented syllables
by giving them greater energy of articulation than unaccented ones. The stress remained in its
old place (according to the Penultimate Law). This stress type lives on in the Romance
languages (Stowasser, 2004).

3.3.1 Latin Syllabification

The basic principle of Latin syllabification is that a syllable ends in a vowel and begins
with a consonant or a combination of consonants (Lindsay, 1894). The syllabification rule of
the Roman grammarians confirms that a group of consonants within a word is assigned to the
following syllable unless the group is unpronounceable. In the latter case the syllable
boundary falls inside the consonant group (Sommer, 1977; Lindsay, 1894). Compounds are split
etymologically. In inscriptions, however, syllable boundaries are found in different places:
syllables are always split between consonants (if more than two consonants occur, the
syllable boundary lies before the last consonant), except for the 'muta cum liquida'
sequence, which always counts towards the next syllable.
As the literature on Latin syllabification is not conclusive, we use a different analysis for
stress assignment on Latin nouns. We rely on the fact that every syllable contains exactly
one vowel or diphthong. A syllable (or vowel) counts as long (= heavy) when the vowel is
followed by at least two consonants (position length) (Allen, 1978). With this information it
is possible to apply the penultimate law to Latin nouns without knowing the exact syllable
boundaries in a word. The penultimate law thus applies to the individual vowels as proxies
for the syllables.
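
The idea of treating vowels and diphthongs as stand-ins for syllables can be illustrated
with a small Python sketch (Python is used only for illustration; it is not part of the xfst
script). The function names are hypothetical, and the notation, with '=' marking a long vowel
and the four diphthongs ae, au, oe, eu, is assumed from the script in the appendix. The
sketch deliberately shares the simplification that every such letter sequence is a diphthong
and that 'qu' is not treated specially.

```python
import re

# One nucleus per assumed syllable: a diphthong, or a vowel with an optional
# "=" length mark (notation assumed from the appendix of this thesis).
NUCLEUS = re.compile(r"ae|au|oe|eu|[aeiou]=?")

def nuclei(word):
    """Return the vowel/diphthong nuclei of a word in left-to-right order."""
    return NUCLEUS.findall(word)

def syllable_count(word):
    """Approximate the number of syllables by the number of nuclei."""
    return len(nuclei(word))
```

For instance, `nuclei("integrum")` yields three nuclei, so the word is treated as trisyllabic
without any syllable boundary having been computed.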

3.3.2 Penultimate Law

The penultimate law is a rule that describes stress assignment in Latin words. According to
this law, the last syllable of a word is extrametrical. If a word consists of one or two
syllables, stress is assigned to the first syllable. If a word consists of at least three
syllables, the quantity of the penultimate syllable is decisive: stress falls on the
penultimate syllable if it is heavy and on the antepenultimate syllable if the penultimate
syllable is light (i.e. ending in a short vowel) (Stowasser, 2004; Sommer, 1914; Lindsay,
1894; Allen, 1978; Kenstowicz, 1994; Zirin, 1967).
A syllable is heavy in Latin either by nature ('natural length') or by position ('position
length'). A syllable is naturally heavy if it ends in a long vowel or a diphthong (diphthongs
are always long). If a short vowel is followed by at least two consonants, the syllable is
called 'closed', which makes it heavy by position (Stock, 1970). If a 'muta' (i.e. b, p, g,
c, d, t) is followed by a 'liquida' (l, r), the so-called muta cum liquida, both consonants
belong to the onset of the following syllable, e.g. inte-grum. The preceding syllable is then
not closed, which triggers stress on the antepenultimate syllable (Allen, 1978).
Words which have stress on the third syllable have a second stress on the first syllable
(Sommer, 1914; Lindsay, 1894).
If an enclitic occurs at the end of a word, stress is shifted to the syllable before the
enclitic, regardless of the quantity of that syllable (Allen, 1978).
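
The core of the penultimate law can be written down in a few lines. The following Python
sketch is only an illustration under stated assumptions: the function name and the
representation of a word as a list of syllable weights are hypothetical, and secondary stress
and enclitics are left out.

```python
def stressed_syllable(heavy):
    """Return the 0-based index of the syllable that carries main stress.

    heavy[i] is True if syllable i is heavy (by nature or by position).
    """
    n = len(heavy)
    if n == 0:
        raise ValueError("a word needs at least one syllable")
    if n <= 2:
        return 0          # one or two syllables: stress the first syllable
    if heavy[-2]:
        return n - 2      # heavy penultimate syllable carries the stress
    return n - 3          # light penultimate: stress the antepenultimate
```

For in-te-grum the weights are heavy, light, heavy, so `stressed_syllable([True, False, True])`
selects the first syllable, in line with the muta cum liquida example above.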

4 Latin Finite State Implementation in xfst

4.1 The Overall Structure of the Script

The overall structure of the Latin xfst script that we implemented for this paper follows the
traditional 'Item-and-Process' theory. There has been a long discussion about two different
morphological theories. On the one hand, there is the 'Item-and-Arrangement' theory (Hockett,
1954), in which morphology is viewed as the construction of words out of morphemes, i.e.
small lexical pieces. On the other hand, there is the 'Item-and-Process' theory (Hockett,
1954), which we use for our analysis of Latin nouns in this paper and in which morphology is
viewed as the construction of words out of base forms (stems or roots) modified by rules.
These different approaches are motivated by the properties of different languages. Roark and
Sproat (2006) argue that the differences between the two approaches are not that significant
from a more formal or computational point of view. As Latin noun inflection is traditionally
presented with paradigms listing the various inflected word forms according to their
functions and with rules for deriving these forms, an analysis of Latin (and of other highly
inflected Indo-European languages such as Classical Greek and Sanskrit) is most naturally
carried out in the framework of the 'Item-and-Process' theory. In more detail, the first
reason to apply 'Item-and-Process' rather than 'Item-and-Arrangement' to Latin noun
morphology is that in Latin several morphosyntactic features are expressed by one single
suffix (Roark and Sproat, 2006). A noun is not built out of many morphs, each of which could
be associated with a corresponding morpheme. A second reason why 'Item-and-Arrangement' is
inappropriate for Latin noun inflection is that there is not always a single stem to which
suffixes are attached, but sometimes two, depending on the context of the suffix. In Latin,
if a second stem occurs, it appears in all cases except the nominative singular (and, in
neuter nouns, the accusative singular). Thirdly, the suffixes may change depending on the
class of the noun they are attached to. In Latin, for example, the dative/ablative plural
ending is '-is' in the first and second declension but '-(i)bus' in the other declensions,
even though both represent the same morphosyntactic features. For these three reasons, the
'Item-and-Process' theory fits the structure of Latin noun inflection better: it is more
adequately described in terms of rules that introduce suffixes according to some features
than by assuming separate morphemes that encode the features (Roark and Sproat, 2006).

The 'Item-and-Process' and 'Item-and-Arrangement' theories are reformulated by Stump
(2001), who classifies morphological theories with four terms: lexical versus inferential,
and incremental versus realizational. Without explaining these four terms in detail here, we
just mention that his inferential-realizational theory, which he also calls Paradigm Function
Morphology, corresponds best to the 'Item-and-Process' theory. He uses the term 'inferential'
for theories in which "associations between the morphosyntactic properties of a word
and its morphology are expressed by morphological rules which relate that word to its root"
(Stump, 2001). The term 'realizational' refers to theories in which "the association of a
word with a particular set of morphosyntactic properties licenses the introduction of the
inflectional exponents of those properties" (Stump, 2001).

4.2 Introduction to the xfst Syntax

In this section we give an introduction to the syntax of xfst. Xfst is part of the Xerox
finite-state tools and "provides a regular-expression compiler and direct access to the XEROX
FINITE-STATE CALCULUS, the algorithm for building and manipulating finite-state
networks" (Beesley and Karttunen, 2003). Xfst is maintained and expanded by Lauri
Karttunen. The following table lists all operators and commands that are used in the xfst
script below together with their functions (taken from Beesley and Karttunen, 2003).

Notation              Function
0                     EPSILON symbol: empty-string language or corresponding identity
                      relation
?                     ANY symbol: language of all single-symbol strings or
                      corresponding identity relation
a                     single symbol: language that consists of the corresponding
                      string or identity relation on that language
a:b                   pair of symbols: relation that consists of the corresponding
                      ordered pair of strings; a = UPPER symbol, b = LOWER symbol
.#.                   boundary symbol; designates the beginning of a string in the
                      left or the end of a string in the right context of a
                      restriction
[]                    empty string language
[..] -> A             epenthesis rule: mapping the empty string into non-empty
                      string A
()                    optionality
A+                    Kleene-plus: concatenation of A with itself one or more times
A*                    Kleene-star: union of A+ with the empty string language
~A                    complement of A
[A B]                 concatenation of the two languages or relations
{word} = [w o r d]    concatenation of the corresponding single-character symbols
$A                    'contains A'
[A | B]               union of the two languages or relations; DISJUNCTION
[A & B]               intersection of the two languages; CONJUNCTION
[A - B]               all the strings in A that are not members of B
[A .o. B]             composition of the relation A with the relation B
A.u                   upper language of the relation A
A.l                   lower language of the relation A
A.i                   inverse of the relation A
[A -> B || L _ R]     replacement of an original upper-side string A by a string
                      from B if the indicated condition (L = left context; R = right
                      context) is fulfilled
clear                 clear the stack (which saves networks)
define VAR            text command to define a variable VAR
#                     start of a comment
source <filename>     e.g. source Latin.xfst: reads in the xfst script and builds a
                      network out of it
read regex            a single regular expression can be read in with this command
apply up              in Latin.xfst: the user can enter a declined noun after this
                      command and gets back the lemma of the noun and some features
                      (analysis)
apply down            in Latin.xfst: the user can enter a Latin noun in nominative
                      singular and some features after this command and gets back
                      the declined noun (generation)
print lower-words     print all surface strings (i.e. declined nouns) that the
                      network covers
print upper-words     print all lexical strings (i.e. lemma and features) that the
                      network covers
print net             print information about the network
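
To make the relational reading of some of these operators more concrete, the following
Python sketch models a relation as a finite set of (upper, lower) string pairs. This is an
assumption made purely for illustration: real finite-state networks can encode infinite
relations, and the function names as well as the example strings are hypothetical and do not
reproduce the xfst implementation or the exact output of our script.

```python
def compose(a, b):
    """A .o. B: relate x to z whenever A maps x to some y and B maps y to z."""
    return {(x, z) for (x, y1) in a for (y2, z) in b if y1 == y2}

def inverse(a):
    """A.i: swap the upper and the lower side of every pair."""
    return {(y, x) for (x, y) in a}

def upper(a):
    """A.u: the set of upper-side strings."""
    return {x for (x, _) in a}

def lower(a):
    """A.l: the set of lower-side strings."""
    return {y for (_, y) in a}

def apply_down(a, s):
    """Generation: all lower strings paired with the upper string s."""
    return {y for (x, y) in a if x == s}

def apply_up(a, s):
    """Analysis: all upper strings paired with the lower string s."""
    return {x for (x, y) in a if y == s}

# Toy relations (illustrative strings only).
ENDINGS = {("stella fem dat pl", "stella^i:s"), ("stella fem abl pl", "stella^i:s")}
PHONOLOGY = {("stella^i:s", "stelli=s")}
NOUNS = compose(ENDINGS, PHONOLOGY)
```

With these toy relations, `apply_up(NOUNS, "stelli=s")` returns both the dative and the
ablative plural reading, which is how ambiguity shows up in analysis.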

4.3 The xfst Script in More Detail

Stem + Ending
In order to understand the particular rules we begin with the definition of the lexicon which is
later handled by replace rules. The lexicon entries look like |1|:

|1| [{stella} [noun & $fem]]
[{genus} %# {gener} [noun & $neut]]

Each entry of the lexicon consists of the stem of the Latin noun and information on its part
of speech (in case the script is later extended to handle other parts of speech) and on its
gender. A variant stem is also listed in the lexicon, as it cannot be reconstructed
automatically from the other stem. In the example above, 'stella' does not have a variant
stem, but 'gener' does: the variant stem 'genus' is used in the nominative singular and
accusative singular. That is why the nominative singular form is given in front of the
'traditional' stem (separated by a hash sign for further processing). How the stem of a Latin
noun can be found is described in section 3.2.1. The lexicon with its features is the part of
the program that is used as input to generation and as output of analysis. To make the
program more user-friendly, i.e. so that the user does not have to know the stem of a word by
heart but can use the program with the better-known nominative singular form, we implemented
a transducer that automatically turns the stem of a noun into its nominative singular form,
which is the form under which a noun is listed in a Latin dictionary (→ |2|).

|2| define StemToNomSg [?* []:[noun & $nom & $sg]] .o. Lexicon .o. Suffixes;

If a noun has a variant stem (which is already given as nominative singular form) this form is
used. That means, either the nominative singular form has to be newly constructed ('Case2')
or the already given nominative singular form is used ('Case1') (→ |3|). In the end, the
inversion of the StemToDict transducer is composed with the Morphology transducer, which
is described later in this section.

|3| define Case1 [
$%#
.o.
[..] -> %# || \LexFeatures _ LexFeatures
.o.
%# ?* %# -> 0
.o.
~$%#
];

define Case2 [
~$%# .o. [StemToNomSg LexFeatures*]
];

define StemToDict [
Case1
|
Case2
];

define Dictionary [
[StemToDict].i .o. Morphology
];
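
The distinction between the two cases can be paraphrased in Python as follows. This is a
simplified sketch, not a translation of the transducers: lexical features are ignored, the
function name is hypothetical, and the function passed in as `nom_sg_from_stem` merely stands
in for what StemToNomSg does.

```python
def to_dictionary_form(entry, nom_sg_from_stem):
    """Return the nominative singular (dictionary) form for a lexicon entry.

    entry is either "variant#stem" (Case1) or just "stem" (Case2).
    nom_sg_from_stem is a caller-supplied stand-in for StemToNomSg.
    """
    if "#" in entry:
        variant, _stem = entry.split("#", 1)
        return variant                  # Case1: the listed form is used as is
    return nom_sg_from_stem(entry)      # Case2: the form is built from the stem
```

For a-declension nouns the nominative singular ending is empty, so the identity function is
a sufficient stand-in for 'stella'.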

Now, we come to the actual program, the part which works only on the lexicon itself. Firstly,
the 'noun' feature in the lexicon is replaced by the four features that it describes: the part of
speech tag 'Noun', gender 'Gend' (which is already replaced by the gender that is given in the
entry because of the 'contains'-sign $), case 'Case' and number 'Num' (→ |4|).

|4| define noun Noun Gend Case Num;

In the next step, these features are extended by two more features, namely 'Ending', which
is a placeholder for the possible endings that can be attached to the stem of the word, and
'DeclTag', which is a placeholder for the class (declension) the noun belongs to (→ |5|).
These are 'helper' tags: they are neither relevant nor harmful for the user. The declension
tag can be rewritten automatically by recognizing the context, so the user does not have to
specify the class of the noun.

|5| define Features [
[..] -> [Ending DeclTag] || _ noun
];

After this step, the lexicon entry looks like the following:
{stella} Ending DeclTag Noun Gend Case Num

This is now the starting point for the other transducers of the program. All these tags are
necessary to determine the surface word form of a Latin noun in the generation phase; all
transducers can also be used the other way around, for analysis. In the following part of
the program, each of these 'tags' (Ending, DeclTag, Noun, Gend, Case, Num) is replaced by
its possible realizations (conditioned by the realization of the other tags).

|6| define Declension [
DeclTag -> adecl || [a|a=] Ending _
.o.
DeclTag -> odecl || [o|o=] Ending _
.o.
DeclTag -> edecl || [e|e=] Ending _
.o.
DeclTag -> udecl || [u|u=] Ending _
.o.
DeclTag -> idecl || [i|i=] Ending _
.o.
DeclTag -> cdecl || C Ending _
];

|6| describes the rewriting of the declension tag. The script distinguishes six declension
tags, because the traditional third declension is split into consonantal stems ('cdecl') and
i-stems ('idecl'). The final character of the stem determines which declension a word belongs
to, as can be seen in the context part of the rewrite rules (after the ||). In our example
from the lexicon (the entry 'stella'), DeclTag is rewritten to 'adecl', as the word stem ends
in an 'a'. The gender tag has already been rewritten to 'fem' on the basis of the information
in the lexicon.
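
The decision made by the Declension rules can be mirrored by a short Python function. The
sketch assumes the notation of the appendix ('=' after a vowel marks length, consonant
inventory as defined there); the function name and the example stems 'reg' and 're=' are used
only for illustration.

```python
CONSONANTS = set("bcdfghlmnpqrstvxz")   # consonant inventory of the appendix

def declension_tag(stem):
    """Pick the declension tag from the stem-final segment."""
    # A trailing "=" only marks length, so look at the vowel in front of it.
    last = stem[-2] if stem.endswith("=") else stem[-1]
    if last in "aeiou":
        return last + "decl"            # adecl, odecl, edecl, udecl, idecl
    if last in CONSONANTS:
        return "cdecl"
    raise ValueError(f"unexpected stem-final segment in {stem!r}")
```

As in the xfst rules, the user never supplies the declension; it follows from the stem alone.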

|7| define StemChange [
[C|V]+ -> 0 || .#. _ %# [C|V]+ Ending cdecl Noun masc nom pl
.
.
.
.o.
[C|V]+ -> 0 || %# _ Ending cdecl Noun masc nom sg
.
.
.
.o.
o -> u || _ Ending odecl Noun masc acc sg
];

In |7| the definition of the possible stem changes can be seen. The first part of the
definition lists all contexts in which the 'standard' stem (→ section 3.2.1) is used: all
cases except the nominative singular for masculine and feminine nouns, and all cases except
the nominative and accusative singular for neuter nouns. In these contexts the variant stem
in front of the hash sign is deleted. The second part lists all nominative (and, for neuter
nouns, accusative) singular forms in the context part of the rewrite rules. In these contexts
the variant stem is used and the 'standard' stem (following the hash sign) is deleted. At the
end of the StemChange definition, one exceptional case is listed: in masculine o-declension
nouns the o of the stem changes into a u in the accusative singular. |7| does not affect our
example from the lexicon, as 'stella' does not have a variant stem.
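
The choice between the two stems can be summarized by a small Python function. It is a
sketch under assumptions: the function name is hypothetical, the entry 'rex#reg' is an
illustrative entry built from the example in section 3.2.3, and the o-to-u change of the
o-declension is left out.

```python
def select_stem(entry, gender, case, number):
    """Choose between the variant and the standard stem of "variant#stem"."""
    if "#" not in entry:
        return entry                    # only one stem, nothing to choose
    variant, standard = entry.split("#", 1)
    uses_variant = number == "sg" and (
        case == "nom" or (gender == "neut" and case == "acc")
    )
    return variant if uses_variant else standard
```

The single condition `uses_variant` replaces the long enumeration of contexts that the
rewrite rules need, because the rules cannot refer to "all other cases" directly.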

|8| define Endings [
Ending -> 0 || _ adecl Noun fem nom sg
.o.
.
.
.
Ending -> {~m} || _ adecl Noun fem acc sg
.o.
.
.
.
Ending -> {:rum} || _ adecl Noun fem gen pl
.o.
.
.
.
Ending -> {^i:s} || _ adecl Noun fem dat pl
];

In the Endings section (|8| is just an extract) the Ending tag is rewritten to the respective
'real' ending, with declension, gender, case and number as conditions. We first listed all
possible endings for all possible conditions and realized later that many cells of the
abstract paradigms share the same ending (this is called syncretism in morphology). It is
possible to summarize these correspondences with another rule, which states that two
conditions trigger the same rewriting of the Ending tag. In the Endings section three special
characters can be observed which have to be removed later from the surface string. These
special characters are explained in section 3.2.1. It can be seen again that the ending of a
Latin noun is determined by its declension, its gender, its case and its number. Our example
from the lexicon gets its endings in the Endings section according to the remaining
information, case and number.
Rules |9| to |11| do not replace stems or endings. They describe phonological replacements:
in |9| vowels preceding the caret sign (^) are deleted, in |10| vowels preceding the colon (:)
become long vowels and in |11| vowels preceding the tilde (~) become short vowels (→ recall
the special characters in section 3.2.1):
|9| define Voweldeletion

|10| define Long

|11| define Short

Rules |12| and |13| describe stylistic replacements: in |12| all the special characters which
were used to trigger phonological changes in rules |9| to |11| are deleted, and in |13| all
features are deleted in order to leave only the word in its surface form.

|12| define RemoveSpecialCharacters

|13| define RemoveFeatures
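
Since only the names of rules |9| to |12| are shown here, the following Python sketch
reconstructs their effect from the prose description alone; it should be read as an
assumption about their behaviour, not as a rendering of the actual xfst definitions. The
function name is hypothetical and '=' marks a long vowel, as in the appendix.

```python
import re

def apply_phonology(form):
    """Interpret the markers ^ : ~ and then remove them from the form."""
    form = re.sub(r"[aeiou]=?\^", "^", form)         # |9|  delete vowel before ^
    form = re.sub(r"([aeiou])=?:", r"\1=:", form)    # |10| lengthen vowel before :
    form = re.sub(r"([aeiou])=?~", r"\1~", form)     # |11| shorten vowel before ~
    return re.sub(r"[\^:~]", "", form)               # |12| drop the markers
```

Applied to the stem 'stella' with the dative plural ending from |8|, the input
`"stella^i:s"` loses its stem vowel and ends in a long i.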

Finally, in |14|, all the rules just described (each of which is a small transducer) are
composed into a bigger transducer which we call 'Suffixes'. Thus, in 'Suffixes' the smaller
transducers are applied one after the other. Composition is an operation on two relations
that yields a new relation. In this case we compose several relations, which means that the
lower language of one transducer serves as the upper language of the next transducer. This
continues through all the transducers until, in the end, we obtain the composition of all
the relations.
|14| define Suffixes [
Features
.o.
Declension
.o.
StemChange
.o.
RemoveHashSign
.o.
Endings
.o.
Voweldeletion
.o.
Long
.o.
Short
.o.
RemoveSpecialCharacters
.o.
RemoveFeatures
];
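
That the order of the composed stages matters can be shown with a toy cascade in Python. The
stage functions below are hypothetical simplifications (they operate on space-separated
symbols and are not the transducers of the script); only the principle of feeding the output
of one stage into the next is taken from |14|.

```python
from functools import reduce

FEATURES = {"Noun", "masc", "fem", "neut", "nom", "gen", "dat", "acc", "abl", "sg", "pl"}

def cascade(*stages):
    """Chain string-to-string stages left to right, like a sequence of .o."""
    def run(s):
        return reduce(lambda acc, stage: stage(acc), stages, s)
    return run

def remove_features(s):
    """Drop every space-separated symbol that is a feature tag."""
    return " ".join(tok for tok in s.split() if tok not in FEATURES)

def join_symbols(s):
    """Concatenate the remaining symbols into one surface string."""
    return s.replace(" ", "")

generate = cascade(remove_features, join_symbols)
```

Swapping the two stages leaves the feature tags in the result, because after joining they
can no longer be recognized as separate symbols.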

To read in the Lexicon and to actually build the network we have to give the command 'read
regex' (→ |15|). First we define the morphology of Latin nouns, which is the composition of
the Lexicon, with all its stem entries and some additional information, with the Suffixes (→
|14|). Then we compose the inversion of the StemToDict transducer with the 'Morphology'
transducer in order to get the nominative singular forms instead of the stems. This
'Dictionary' transducer is then composed with 'Spaces', which is a stylistic transducer that
adds spaces between the lexical features on the lexical side (upper language) purely for
readability. This is the final definition that is read in, in order to build the finite-state
network which can be used by the user.

|15| define Morphology [
Lexicon .o. Suffixes
];

define Dictionary [
[StemToDict].i .o. Morphology
];

read regex Spaces .o. Dictionary;

Stress Assignment
After completing the 'stem-and-ending' transducer we turn to stress assignment on Latin
nouns. As already mentioned in section 3.3.1, we argue that it is possible to assign stress
to a Latin noun without knowing the exact syllable boundaries. Two facts suffice: every
syllable contains exactly one vowel or diphthong, and every vowel is long by position when it
is followed by at least two consonants. The exception is a following 'muta cum liquida'
sequence, which counts towards the beginning of the following syllable rather than being
split and therefore does not make the vowel long by position. With these two facts it is
possible to formulate rules for stress assignment.
Initially, in |16|, vowels are replaced by long vowels where applicable, namely only in the
context where the vowel is followed by at least two consonants.

|16| define LongVowel [
a -> a= || _ [C C+] - MutaCumLiquida
.o.
e -> e= || _ [C C+] - MutaCumLiquida
.
.
.
];

All the other vowels which are naturally long are marked in the lexicon.
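
The effect of the LongVowel rules can be approximated in Python as follows. The sketch is an
illustration, not a translation of the xfst rules: the function name is hypothetical,
diphthongs are not treated specially, and the consonant and muta cum liquida inventories are
assumed from the appendix.

```python
CONSONANTS = "bcdfghlmnpqrstvxz"
MUTA_CUM_LIQUIDA = {m + l for m in "bpgcdt" for l in "lr"}

def mark_position_length(word):
    """Append "=" to each short vowel followed by two or more consonants,
    unless the consonant cluster is exactly a muta cum liquida pair."""
    out = []
    for i, ch in enumerate(word):
        out.append(ch)
        # Only unmarked (short) vowels are candidates for position length.
        if ch in "aeiou" and word[i + 1:i + 2] != "=":
            j = i + 1
            while j < len(word) and word[j] in CONSONANTS:
                j += 1
            cluster = word[i + 1:j]
            if len(cluster) >= 2 and cluster not in MUTA_CUM_LIQUIDA:
                out.append("=")
    return "".join(out)
```

In 'integrum' the i is followed by 'nt' and becomes long by position, while the e before the
muta cum liquida 'gr' stays short.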
In the next steps (→ |17| to |20|) the nouns are divided into three classes according to the
number of syllables (i.e. vowels/diphthongs) they consist of: one syllable (→ |17|), two
syllables (→ |18|) or three or more syllables (→ |19| and |20|). If a word consists of only
one syllable, the stress lies on the single vowel or diphthong:

|17| define OneSyllable [
a -> 'a || .#. C* _ C* .#.
.o.
.
.
.
a= -> 'a= || .#. C* _ C* .#.
.o.
.
.
.
{ae} -> '{ae} || .#. C* _ C* .#.
.o.
.
.
.
];

If a word consists of two syllables (i.e. two vowels/diphthongs), stress is assigned to the
first vowel or diphthong independent of its quantity, i.e. whether it is long or short:

|18| define TwoSyllables [
a -> 'a || .#. C* _ C* [V|D] C* .#.
.o.
.
.
.
a= -> 'a= || .#. C* _ C* [V|D] C* .#.
.o.
.
.
.
{ae} -> '{ae} || .#. C* _ C* [V|D] C* .#.
.o.
.
.
.
];

If the word consists of three or more syllables, the penultimate law (→ section 3.3.2) comes
into play. That means if the second last vowel is a long vowel or a diphthong, stress is
assigned to it:

|19| define ThreeOrMoreSyllablesPenult [
a= -> 'a= || _ C* [V|D] C* .#.
.o.
.
.
.
{ae} -> '{ae} || _ C* [V|D] C* .#.
.o.
.
.
.
];

If, on the other hand, the second last syllable contains a short vowel, stress is assigned to
the vowel or diphthong preceding that short vowel, independent of its quantity:

|20| define ThreeOrMoreSyllablesAntepenult [
a -> 'a || _ C* VShort C* [V|D] C* .#.
.o.
.
.
.
a= -> 'a= || _ C* VShort C* [V|D] C* .#.
.o.
.
.
.
{ae} -> '{ae} || _ C* VShort C* [V|D] C* .#.
.o.
.
.
.
];

As in |14|, all five transducers are composed to build the 'StressAssignment' transducer
(→ |21|):

|21| define StressAssignment [
LongVowel
.o.
OneSyllable
.o.
TwoSyllables
.o.
ThreeOrMoreSyllablesPenult
.o.
ThreeOrMoreSyllablesAntepenult
];
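
The combined effect of the stress rules |17| to |20| on a word whose long vowels are already
marked can be sketched in Python. The function name is hypothetical, the apostrophe is placed
in front of the stressed vowel or diphthong as in the rules above, and the nucleus inventory
is assumed from the appendix; the sketch is not a translation of the transducers.

```python
import re

NUCLEUS = re.compile(r"ae|au|oe|eu|[aeiou]=?")
DIPHTHONGS = {"ae", "au", "oe", "eu"}

def assign_stress(word):
    """Insert "'" before the stressed nucleus of a length-marked word."""
    spans = [m.span() for m in NUCLEUS.finditer(word)]
    if not spans:
        return word                         # no vowel, nothing to stress
    if len(spans) <= 2:
        target = 0                          # one or two syllables: first nucleus
    else:
        start, end = spans[-2]
        penult = word[start:end]
        heavy = penult.endswith("=") or penult in DIPHTHONGS
        target = len(spans) - 2 if heavy else len(spans) - 3
    pos = spans[target][0]
    return word[:pos] + "'" + word[pos:]
```

The input is expected to carry the '=' marks for both natural and position length, i.e. to
have passed through a step corresponding to LongVowel first.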

'Prosody' then is the composition of 'Spaces', the dictionary and the just mentioned
'StressAssignment' transducer:

|22| define Prosody [
Spaces .o. Dictionary .o. StressAssignment
];

|23| read regex Prosody;

|23| is the command to read in the transducer to build the actual finite state network.

One Final Remark

The implementation of Latin noun inflection can be used in two ways: for generation and for
analysis of declined Latin nouns. The implementation of Latin stress assignment, on the other
hand, is only of interest in one direction, namely generation. If we always composed the
stress assignment with 'Dictionary', the user would have to specify the main stress in the
word entered for analysis, which is impractical: a user who does not know the main stress
rule could then not use the program for analysis at all. Therefore, it is useful to keep the
two finite state networks separate. If the user wishes to have information on stress in Latin
nouns in the generation phase, the second network 'Prosody' is activated. Otherwise just the
network 'Dictionary' is used.

5 Bibliography
Allen, W. Sidney. Vox Latina – A Guide to the Pronunciation of Classical Latin, Second
edition. Cambridge University Press, Cambridge, 1978.

Beesley, Kenneth R. and Karttunen, Lauri. Finite State Morphology. CSLI Publications,
Leland Stanford Junior University, 2003.

Bender, Byron W. Latin Noun Inflection (A Solution to Latin 10). University of Hawaii at
Manoa.

Bozzi, A. and Cappelli, G. A Project for Latin Lexicography: 2. A Latin Morphological
Analyzer. In Computers and the Humanities, 24 (5-6). 1991.

Bubenheimer, Uli. Eine Morphologische Analysekomponente für das Lateinische zum Einsatz
in einem Lehrunterstützendem System. Studienarbeit, Universität Koblenz-Landau, 1995.

Covington, Michael A. Converging Transition Networks and Sub-Morphemic Regularities in
Latin Noun Inflection. Draft. Artificial Intelligence Center, University of Georgia, Athens,
1999.

Hockett, Charles F. Two models of grammatical description. 1954.

Kaplan, Ronald M. and Kay, Martin. Regular Models of Phonological Rule Systems.
Computational Linguistics 20:331-378, 1994.

Kenstowicz, Michael. Phonology in Generative Grammar. Blackwell Publishing, Oxford,
1994.

Koskenniemi, Kimmo. Two-Level Morphology: A General Computational Model for
Word-Form Recognition and Production. Dissertation, University of Helsinki, 1983.

Lindsay, W.M. The Latin Language – An Historical Account of Latin Sounds, Stems, and
Flexions. Clarendon Press, Oxford, 1894.

Logos Group. URL: http://www.logosconjugator.org. 2006. Universal Conjugator.

Matthews, P. H. Inflectional Morphology. University Press, Cambridge, 1972.

Matthews, P. H. Morphology, Second Edition. University Press, Cambridge, 1991.

McLean, Adam. URL: http://www.levity.com/alchemy/latin/latintrans.html. Latin parser and
translator 0.96.

Müller, Horst M. (Hrsg.). Arbeitsbuch Linguistik. Schöningh, Paderborn, 2002.

Perseus Digital Library Project. URL: http://www.perseus.tufts.edu. Ed. Gregory R. Crane.
Tufts University.

Roark, Brian and Sproat, Richard. Computational Approaches to Morphology and Syntax.
2006. (unpublished draft).

Sommer, Ferdinand. Handbuch der Lateinischen Laut- und Formenlehre – Eine Einführung
in das sprachwissenschaftliche Studium des Lateins, 2. und 3. Auflage. Carl Winters
Universitätsbuchhandlung, Heidelberg, 1914.

Sommer, Ferdinand. Handbuch der lateinischen Laut- und Formenlehre – Eine Einführung in
das sprachwissenschaftliche Studium des Lateins, 4. Auflage. Carl Winter
Universitätsverlag, Heidelberg, 1977.

Stock, Leo. Langenscheidts Kurzgrammatik Latein, 24. Auflage. Langenscheidt, Berlin, 1970.

Stowasser, J.M. et al. Stowasser, Auflage 2004. HPT Medien AG, Zug, 1979.

Stump, Gregory T. Inflectional Morphology - A Theory of Paradigm Structure. Cambridge
University Press, Cambridge, 2001.

Zirin, R. The Phonological Basis of Latin Prosody. University Microfilms, Inc., Ann Arbor,
Michigan, 1967.

6 Appendix: xfst Script File
clear
undefine all

define VShort a | e | i | o | u; #short vowel
define VLong a= | e= | i= | o= | u=; #long vowel
define V VShort | VLong; #vowel
define D {ae} | {au} | {oe} | {eu}; #diphthong
define C b | c | d | f | g | h | l | m | n | p | q | r | s | t | v | x | z;
#consonant
define Seg C | V; #segment
define Gend masc | fem | neut; #gender
define Case nom | gen | dat | acc | abl; #case
define Num sg | pl; #number
define Decl adecl | odecl | edecl | udecl | idecl | cdecl; #declension
define MutaCumLiquida {bl} | {br} | {pl} | {pr} | {gl} | {gr} | {cl} | {cr}
| {dl} | {dr} | {tl} | {tr}; #muta cum liquida
define noun Noun Gend Case Num; #noun
define POS Noun; #part of speech
define LexFeatures [Gend | Case | Num | POS]; #lexical features

####################STEM+ENDING############################################
###########################################################################
###########################################################################

# The lexical features given in the lexicon of the noun are extended by two
# more: 'Ending' and 'DeclTag' (declension)
define Features [
[..] -> [Ending DeclTag] || _ noun
];

# The 'DeclTag' feature is rewritten by the actual declension depending on
# the context
define Declension [
DeclTag -> adecl || [a|a=] Ending _
.o.
DeclTag -> odecl || [o|o=] Ending _
.o.
DeclTag -> edecl || [e|e=] Ending _
.o.
DeclTag -> udecl || [u|u=] Ending _
.o.
DeclTag -> idecl || [i|i=] Ending _
.o.
DeclTag -> cdecl || C Ending _
];

# The following definition about stem change contains three different
# condition contexts: 1) all segments preceding the hash sign are deleted
# in all cases 2) except for the nominative singular (and for neuter nouns
# in the accusative singular) where instead the segments following the
# hash sign are deleted and 3) final stem character o changes into u with
# o-declension masculine accusative singular nouns
define StemChange [
Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun masc nom pl
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun masc gen
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun masc dat
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun masc acc
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun masc abl
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun fem nom pl
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun fem gen
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun fem dat
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun fem acc
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun fem abl
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun fem nom pl
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun fem gen
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun fem dat
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun fem acc
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun fem abl
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun masc nom pl
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun masc gen
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun masc dat
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun masc acc
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun masc abl
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun neut nom pl
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun neut gen
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun neut dat
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun neut acc pl
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun neut abl
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun neut nom pl
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun neut gen
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun neut dat
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun neut acc pl
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun neut abl
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun neut nom pl
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun neut gen
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun neut dat
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun neut acc pl
.o.
Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun neut abl
.o.
Seg+ -> 0 || %# _ Ending cdecl Noun masc nom sg
.o.
Seg+ -> 0 || %# _ Ending cdecl Noun fem nom sg
.o.
Seg+ -> 0 || %# _ Ending idecl Noun fem nom sg
.o.
Seg+ -> 0 || %# _ Ending odecl Noun masc nom sg
.o.
Seg+ -> 0 || %# _ Ending cdecl Noun neut nom sg
.o.
Seg+ -> 0 || %# _ Ending idecl Noun neut nom sg
.o.
Seg+ -> 0 || %# _ Ending odecl Noun neut nom sg
.o.
Seg+ -> 0 || %# _ Ending cdecl Noun neut acc sg
.o.
Seg+ -> 0 || %# _ Ending idecl Noun neut acc sg
.o.
Seg+ -> 0 || %# _ Ending odecl Noun neut acc sg
.o.
o -> u || _ Ending odecl Noun masc acc sg
];

# The auxiliary hash sign is deleted after 'StemChange'
define RemoveHashSign [
%# -> 0
];

# The Ending tag is rewritten to the actual ending of the noun according to
# its declension, its gender, its case and its number
define Endings [
Ending -> 0 || _ adecl Noun fem nom sg
.o.
Ending -> e || _ adecl Noun fem gen sg
.o.
Ending -> e || _ adecl Noun fem dat sg
.o.
Ending -> {~m} || _ adecl Noun fem acc sg
.o.
Ending -> {:} || _ adecl Noun fem abl sg
.o.
Ending -> e || _ adecl Noun fem nom pl
.o.
Ending -> {:rum} || _ adecl Noun fem gen pl
.o.
Ending -> {^i:s} || _ adecl Noun fem dat pl
.o.
Ending -> {:s} || _ adecl Noun fem acc pl
.o.
Ending -> {^i:s} || _ adecl Noun fem abl pl
.o.
Ending -> 0 || _ odecl Noun masc nom sg
.o.
Ending -> {^i:} || _ odecl Noun masc gen sg
.o.
Ending -> {:} || _ odecl Noun masc dat sg
.o.
Ending -> {~m} || _ odecl Noun masc acc sg
.o.
Ending -> {:} || _ odecl Noun masc abl sg
.o.
Ending -> {^i:} || _ odecl Noun masc nom pl
.o.
Ending -> {:rum} || _ odecl Noun masc gen pl
.o.
Ending -> {^i:s} || _ odecl Noun masc dat pl
.o.
Ending -> {:s} || _ odecl Noun masc acc pl
.o.
Ending -> {^i:s} || _ odecl Noun masc abl pl
.o.
Ending -> s || _ edecl Noun fem nom sg
.o.
Ending -> {i:} || _ edecl Noun fem gen sg
.o.
Ending -> {i:} || _ edecl Noun fem dat sg
.o.
Ending -> {~m} || _ edecl Noun fem acc sg
.o.
Ending -> {:} || _ edecl Noun fem abl sg
.o.
Ending -> {:s} || _ edecl Noun fem nom pl
.o.
Ending -> {:rum} || _ edecl Noun fem gen pl
.o.
Ending -> {bus} || _ edecl Noun fem dat pl
.o.
Ending -> {:s} || _ edecl Noun fem acc pl
.o.
Ending -> {bus} || _ edecl Noun fem abl pl
.o.
Ending -> s || _ udecl Noun masc nom sg
.o.
Ending -> {:s} || _ udecl Noun masc gen sg
.o.
Ending -> {i:} || _ udecl Noun masc dat sg
.o.
Ending -> {~m} || _ udecl Noun masc acc sg
.o.
Ending -> {:} || _ udecl Noun masc abl sg
.o.
Ending -> {:s} || _ udecl Noun masc nom pl
.o.
Ending -> {um} || _ udecl Noun masc gen pl
.o.
Ending -> {^ibus} || _ udecl Noun masc dat pl
.o.
Ending -> {:s} || _ udecl Noun masc acc pl
.o.
Ending -> {^ibus} || _ udecl Noun masc abl pl
.o.
Ending -> s || _ udecl Noun fem nom sg
.o.
Ending -> {:s} || _ udecl Noun fem gen sg
.o.
Ending -> {i:} || _ udecl Noun fem dat sg
.o.
Ending -> {~m} || _ udecl Noun fem acc sg
.o.
Ending -> {:} || _ udecl Noun fem abl sg
.o.
Ending -> {:s} || _ udecl Noun fem nom pl
.o.
Ending -> {um} || _ udecl Noun fem gen pl
.o.
Ending -> {^ibus} || _ udecl Noun fem dat pl
.o.
Ending -> {:s} || _ udecl Noun fem acc pl
.o.
Ending -> {^ibus} || _ udecl Noun fem abl pl
.o.
Ending -> s || _ idecl Noun fem nom sg
.o.
Ending -> {^is} || _ idecl Noun fem gen sg
.o.
Ending -> {^i:} || _ idecl Noun fem dat sg
.o.
Ending -> {~m} || _ idecl Noun fem acc sg
.o.
Ending -> {:} || _ idecl Noun fem abl sg
.o.
Ending -> {^e:s} || _ idecl Noun fem nom pl
.o.
Ending -> {um} || _ idecl Noun fem gen pl
.o.
Ending -> {^ibus} || _ idecl Noun fem dat pl
.o.
Ending -> {:s} || _ idecl Noun fem acc pl
.o.
Ending -> {^ibus} || _ idecl Noun fem abl pl
.o.
Ending -> 0 || _ cdecl Noun masc nom sg
.o.
Ending -> {^is} || _ cdecl Noun masc gen sg
.o.
Ending -> {i:} || _ cdecl Noun masc dat sg
.o.
Ending -> {em} || _ cdecl Noun masc acc sg
.o.
Ending -> e || _ cdecl Noun masc abl sg
.o.
Ending -> {^e:s} || _ cdecl Noun masc nom pl
.o.
Ending -> {um} || _ cdecl Noun masc gen pl
.o.
Ending -> {^ibus} || _ cdecl Noun masc dat pl
.o.
Ending -> {e:s} || _ cdecl Noun masc acc pl
.o.
Ending -> {^ibus} || _ cdecl Noun masc abl pl
.o.
Ending -> 0 || _ cdecl Noun fem nom sg
.o.
Ending -> {^is} || _ cdecl Noun fem gen sg
.o.
Ending -> {i:} || _ cdecl Noun fem dat sg
.o.
Ending -> {em} || _ cdecl Noun fem acc sg
.o.
Ending -> e || _ cdecl Noun fem abl sg
.o.
Ending -> {^e:s} || _ cdecl Noun fem nom pl
.o.
Ending -> {um} || _ cdecl Noun fem gen pl
.o.
Ending -> {^ibus} || _ cdecl Noun fem dat pl
.o.
Ending -> {e:s} || _ cdecl Noun fem acc pl
.o.
Ending -> {^ibus} || _ cdecl Noun fem abl pl
.o.
Ending -> 0 || _ odecl Noun neut nom sg
.o.
Ending -> {^i:} || _ odecl Noun neut gen sg
.o.
Ending -> {:} || _ odecl Noun neut dat sg
.o.
Ending -> 0 || _ odecl Noun neut acc sg
.o.
Ending -> {:} || _ odecl Noun neut abl sg
.o.
Ending -> {^a} || _ odecl Noun neut nom pl
.o.
Ending -> {:rum} || _ odecl Noun neut gen pl
.o.
Ending -> {^i:s} || _ odecl Noun neut dat pl
.o.
Ending -> {^a} || _ odecl Noun neut acc pl
.o.
Ending -> {^i:s} || _ odecl Noun neut abl pl
.o.
Ending -> {:} || _ udecl Noun neut nom sg
.o.
Ending -> {:s} || _ udecl Noun neut gen sg
.o.
Ending -> {:} || _ udecl Noun neut dat sg
.o.
Ending -> {:} || _ udecl Noun neut acc sg
.o.
Ending -> {:} || _ udecl Noun neut abl sg
.o.
Ending -> a || _ udecl Noun neut nom pl
.o.
Ending -> {um} || _ udecl Noun neut gen pl
.o.
Ending -> {^ibus} || _ udecl Noun neut dat pl
.o.
Ending -> a || _ udecl Noun neut acc pl
.o.
Ending -> {^ibus} || _ udecl Noun neut abl pl
.o.
Ending -> 0 || _ idecl Noun neut nom sg
.o.
Ending -> s || _ idecl Noun neut gen sg
.o.
Ending -> {:} || _ idecl Noun neut dat sg
.o.
Ending -> 0 || _ idecl Noun neut acc sg
.o.
Ending -> {:} || _ idecl Noun neut abl sg
.o.
Ending -> a || _ idecl Noun neut nom pl
.o.
Ending -> {um} || _ idecl Noun neut gen pl
.o.
Ending -> {^ibus} || _ idecl Noun neut dat pl
.o.
Ending -> a || _ idecl Noun neut acc pl
.o.
Ending -> {^ibus} || _ idecl Noun neut abl pl
.o.
Ending -> 0 || _ cdecl Noun neut nom sg
.o.
Ending -> {is} || _ cdecl Noun neut gen sg
.o.
Ending -> {i:} || _ cdecl Noun neut dat sg
.o.
Ending -> 0 || _ cdecl Noun neut acc sg
.o.
Ending -> e || _ cdecl Noun neut abl sg
.o.
Ending -> a || _ cdecl Noun neut nom pl
.o.
Ending -> {um} || _ cdecl Noun neut gen pl
.o.
Ending -> {^ibus} || _ cdecl Noun neut dat pl
.o.
Ending -> a || _ cdecl Noun neut acc pl
.o.
Ending -> {^ibus} || _ cdecl Noun neut abl pl
];

#define Referral

# Vowels preceding a caret (^) are deleted
define Voweldeletion [
V -> 0 || _ %^
];

# Short vowels preceding a colon (:) turn into the corresponding long vowels
define Long [
a -> a= || _ %:
.o.
e -> e= || _ %:
.o.
i -> i= || _ %:
.o.
o -> o= || _ %:
.o.
u -> u= || _ %:
];

# A vowel preceding a tilde (~) is always short
define Short [
[a|a=] -> a || _ %~
.o.
[e|e=] -> e || _ %~
.o.
[i|i=] -> i || _ %~
.o.
[o|o=] -> o || _ %~
.o.
[u|u=] -> u || _ %~
];

# After 'Voweldeletion', 'Long' and 'Short' all special characters are
# deleted
define RemoveSpecialCharacters [
%^ -> 0 || Seg _
.o.
%: -> 0 || VLong _
.o.
%~ -> 0 || VShort _
];
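
# Worked example (a sketch, not part of the original script): the dative
# plural of 'stella'. 'Endings' rewrites the Ending tag to ^i:s, so the
# string entering the four rule sets above is stella^i:s (tags omitted).
#   Voweldeletion:            stella^i:s -> stell^i:s
#   Long:                     stell^i:s  -> stell^i=:s
#   Short:                    no change (no tilde present)
#   RemoveSpecialCharacters:  stell^i=:s -> stelli=s
# The trace assumes that V contains a, Seg contains l and VLong contains i=,
# as their names suggest; these sets are defined earlier in the script.
# To check it interactively, one could uncomment the following two lines:
#regex {stella^i:s} .o. Voweldeletion .o. Long .o. Short .o. RemoveSpecialCharacters;
#print lower-words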

# Finally, all tags are deleted to leave just the surface form of the noun
# as a result
define RemoveFeatures [
Decl -> 0
.o.
Gend -> 0
.o.
nom | gen | dat | acc | abl -> 0
.o.
sg | pl -> 0
.o.
POS -> 0
];

define Suffixes [
Features
.o.
Declension
.o.
StemChange
.o.
RemoveHashSign
.o.
Endings
.o.
Voweldeletion
.o.
Long
.o.
Short
.o.
RemoveSpecialCharacters
.o.
RemoveFeatures
];

define Lexicon [{stella} [noun & $fem]] |
[{fenestra} [noun & $fem]] |
[{servus} %# {servo} [noun & $masc]] |
[{bellum} %# {bello} [noun & $neut]] |
[{integrum} %# {integro} [noun & $neut]] |
[{puer} %# {puero} [noun & $masc]] |
[{ager} %# {agro} [noun & $masc]] |
[{vir} %# {viro} [noun & $masc]] |
[{deus} %# {deo} [noun & $masc]] |
[{rex} %# r e= g [noun & $masc]] |
[{cor} %# {cord} [noun & $neut]] |
[{iter} %# {itiner} [noun & $neut]] |
[{caput} %# {capit} [noun & $neut]] |
[c o= {nsul} [noun & $masc]] |
[{pater} %# {patr} [noun & $masc]] |
[n o= {men} %# n o= {min} [noun & $neut]] |
[{genus} %# {gener} [noun & $neut]] |
[{corpus} %# {corpor} [noun & $neut]] |
[{turri} [noun & $fem]] |
[i= {gni} [noun & $fem]] |
[{animal} %# {anim} a= {li} [noun & $neut]] |
[{manu} [noun & $fem]] |
[{lacu} [noun & $masc]] |
[{genu} [noun & $neut]] |
[r e= [noun & $fem]] |
[{di} e= [noun & $fem]] |
[{fid} e= [noun & $fem]];

define Spaces [
~[{ } ?*]
.o.
~[?* { }]
.o.
~$[{ } { }]
.o.
~$[Seg { } Seg]
.o.
~[?* [Seg|Gend|Case|Num|POS] [Gend|Case|Num|POS] ?*]
.o.
{ } -> 0
];

# The morphology of Latin nouns is defined as the composition of the
# lexicon with the suffixes
define Morphology [
Lexicon .o. Suffixes
];

# Stylistic transducer to change every stem into its nominative singular
# (standard dictionary) form
define StemToNomSg [?* []:[noun & $nom & $sg]] .o. Lexicon .o. Suffixes;

# If a noun has a variant stem, the nominative singular form of the noun is
# given in the lexicon preceding a hash sign. In these cases the nominative
# singular form of the noun does not have to be formed but can be taken
# from the lexicon.
define Case1 [
$%#
.o.
[..] -> %# || \LexFeatures _ LexFeatures
.o.
%# ?* %# -> 0
.o.
~$%#
];

# If there is a hash sign in the lexicon entry of a noun, the nominative
# singular form is taken from the lexicon, otherwise the nominative
# singular form of the noun is newly constructed
define StemToDict [
Case1
|
[~$%# .o. [StemToNomSg LexFeatures*]]
];

# The dictionary is defined to be the composition of the inversion of the
# 'StemToDict' function with the Latin morphology
define Dictionary [
[StemToDict].i .o. Morphology
];

read regex Spaces .o. Dictionary;

####################PROSODY################################################
###########################################################################
###########################################################################

# Every short vowel is marked long (-> heavy syllable, 'length by position')
# when it is followed by at least two consonants that do not form a muta
# cum liquida cluster
define LongVowel [
a -> a= || _ [C C+] - MutaCumLiquida
.o.
e -> e= || _ [C C+] - MutaCumLiquida
.o.
i -> i= || _ [C C+] - MutaCumLiquida
.o.
o -> o= || _ [C C+] - MutaCumLiquida
.o.
u -> u= || _ [C C+] - MutaCumLiquida
];

# If a word consists of only one syllable, stress is assigned to that
# syllable
define OneSyllable [
a -> 'a || .#. C* _ C* .#.
.o.
e -> 'e || .#. C* _ C* .#.
.o.
i -> 'i || .#. C* _ C* .#.
.o.
o -> 'o || .#. C* _ C* .#.
.o.
u -> 'u || .#. C* _ C* .#.
.o.
a= -> 'a= || .#. C* _ C* .#.
.o.
e= -> 'e= || .#. C* _ C* .#.
.o.
i= -> 'i= || .#. C* _ C* .#.
.o.
o= -> 'o= || .#. C* _ C* .#.
.o.
u= -> 'u= || .#. C* _ C* .#.
.o.
{ae} -> '{ae} || .#. C* _ C* .#.
.o.
{au} -> '{au} || .#. C* _ C* .#.
.o.
{oe} -> '{oe} || .#. C* _ C* .#.
];
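
# Illustration (a sketch, not part of the original script): for the
# monosyllable r e= s (nominative singular of the lexicon entry 'r e='),
# only the rule for e= matches, giving r 'e= s. This assumes that C
# contains r and s; C is defined earlier in the script.
# To check it interactively, one could uncomment the following two lines:
#regex [r e= s] .o. OneSyllable;
#print lower-words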

# If a word consists of two syllables (two vowels or diphthongs) stress is
# assigned to the first syllable
define TwoSyllables [
a -> 'a || .#. C* _ C* [V|D] C* .#.
.o.
e -> 'e || .#. C* _ C* [V|D] C* .#.
.o.
i -> 'i || .#. C* _ C* [V|D] C* .#.
.o.
o -> 'o || .#. C* _ C* [V|D] C* .#.
.o.
u -> 'u || .#. C* _ C* [V|D] C* .#.
.o.
a= -> 'a= || .#. C* _ C* [V|D] C* .#.
.o.
e= -> 'e= || .#. C* _ C* [V|D] C* .#.
.o.
i= -> 'i= || .#. C* _ C* [V|D] C* .#.
.o.
o= -> 'o= || .#. C* _ C* [V|D] C* .#.
.o.
u= -> 'u= || .#. C* _ C* [V|D] C* .#.
.o.
{ae} -> '{ae} || .#. C* _ C* [V|D] C* .#.
.o.
{au} -> '{au} || .#. C* _ C* [V|D] C* .#.
.o.
{oe} -> '{oe} || .#. C* _ C* [V|D] C* .#.
];

# If a word consists of three or more syllables stress is assigned to the
# second last syllable if it is a heavy syllable (ending in a long vowel or
# diphthong)
define ThreeOrMoreSyllablesPenult [
a= -> 'a= || _ C* [V|D] C* .#.
.o.
e= -> 'e= || _ C* [V|D] C* .#.
.o.
i= -> 'i= || _ C* [V|D] C* .#.
.o.
o= -> 'o= || _ C* [V|D] C* .#.
.o.
u= -> 'u= || _ C* [V|D] C* .#.
.o.
{ae} -> '{ae} || _ C* [V|D] C* .#.
.o.
{au} -> '{au} || _ C* [V|D] C* .#.
.o.
{oe} -> '{oe} || _ C* [V|D] C* .#.
];

# If a word consists of three or more syllables and the second last
# syllable is a light syllable (ending in a short vowel) stress is assigned
# to the third last syllable (vowel or diphthong)
define ThreeOrMoreSyllablesAntepenult [
a -> 'a || _ C* VShort C* [V|D] C* .#.
.o.
e -> 'e || _ C* VShort C* [V|D] C* .#.
.o.
i -> 'i || _ C* VShort C* [V|D] C* .#.
.o.
o -> 'o || _ C* VShort C* [V|D] C* .#.
.o.
u -> 'u || _ C* VShort C* [V|D] C* .#.
.o.
a= -> 'a= || _ C* VShort C* [V|D] C* .#.
.o.
e= -> 'e= || _ C* VShort C* [V|D] C* .#.
.o.
i= -> 'i= || _ C* VShort C* [V|D] C* .#.
.o.
o= -> 'o= || _ C* VShort C* [V|D] C* .#.
.o.
u= -> 'u= || _ C* VShort C* [V|D] C* .#.
.o.
{ae} -> '{ae} || _ C* VShort C* [V|D] C* .#.
.o.
{au} -> '{au} || _ C* VShort C* [V|D] C* .#.
.o.
{oe} -> '{oe} || _ C* VShort C* [V|D] C* .#.
];

define StressAssignment [
LongVowel
.o.
OneSyllable
.o.
TwoSyllables
.o.
ThreeOrMoreSyllablesPenult
.o.
ThreeOrMoreSyllablesAntepenult
];

define Prosody [
Spaces .o. Dictionary .o. StressAssignment
];

#read regex Prosody;
