April 23-27, 2012
Avignon, France
© 2012 The Association for Computational Linguistics
ISBN 978-1-937284-19-0
Preface: General Chair
Welcome to EACL 2012, the 13th Conference of the European Chapter of the Association for
Computational Linguistics. We are happy that despite strong competition from other Computational
Linguistics events and economic turmoil in many European countries, this EACL is comparable to the
successful previous ones, both in terms of the number of papers submitted and in terms of attendance. We
have a strong scientific program, including ten workshops, four tutorials, a demos session and a student
research workshop. I am convinced that you will appreciate our program.
What does a General Chair at EACL have to do? Not much, it turns out. My job was to act as a liaison
between the local organizing team, the scientific committees, and the EACL board, and to give advice
when needed. Looking back at the thousands of e-mails I was copied on reminded me of the Jerome K.
Jerome quote: "I like work. I can sit and look at it for hours." It has been an enjoyable experience to
cooperate with the many people who made this conference happen, and to see them work. I have learned
a lot from them.
The Program Committee at an ACL conference is a trained army of Area Chairs, Program Committee
members, and additional reviewers. Mirella Lapata and Lluís Màrquez commanded this particular one.
It is thanks to the voluntary peer reviewing work, year after year, of this large group of people, formed by
the top researchers in our field, that you will find a high-quality program. It is thanks to Mirella and Lluís
that you will not only find the quality we expect from EACL, but also innovation, coherence, breadth,
and depth. I can't thank them enough for their work on all aspects of the scientific program and for their
advice on virtually any other aspect of the organization. Many thanks also to Regina Barzilay, Raymond
Mooney, and Martin Cooke for agreeing to present invited lectures, thereby increasing the appeal of
this event even more.
As in previous years, the selection of the workshops of all ACL conferences in the same year is
coordinated in a single committee. For EACL, Kristiina Jokinen and Alessandro Moschitti collaborated
with the NAACL and ACL chairs in reviewing and selecting the workshops. As EACL is the first
conference of the three, they had to initiate the call for proposals and activate their colleagues long
before they were planning to. Thanks to their professionalism and efficiency, the process went very
smoothly, and the resulting workshops program reflects the diversity and maturity of the field. For
even more variation during the first two days of the conference, we also have a strong tutorial program.
Tutorial Chairs Lieve Macken and Eneko Agirre managed to attract an impressive list of high-quality
submissions and performed a thorough and thoughtful review and selection. It is truly a pity only four
could be accommodated in the program, but their quality and timeliness is inspiring. Many thanks to
Kristiina, Alessandro, Lieve, and Eneko for making this important part of the scientific program such a
success.
As in previous editions of EACL, the Student Research Workshop was organized by the student members
of the EACL board: Pierre Lison, Mattias Nilsson, and Marta Recasens, with help from faculty advisor
Laurence Danlos. Their task was a huge one: to organize a mini-conference within the conference.
This included finding reviewers, selecting papers, setting up a program for the student session, finding
mentors for the accepted papers, selecting a best paper award, . . . The amount of work they did cannot
be overestimated, and the result is brilliant. Thank you! To round off the scientific program, we
have stimulating demonstration sessions, selected and coordinated by Demonstrations Chair Frédérique
Segond. Thank you for showing so clearly the rapid progress application-oriented computational
linguistics is making.
Thanks also to Gertjan van Noord and Caroline Sporleder for accepting the role of coordinators of the
mentoring service. In the end, they didn't have to assign mentors, but it is important that such a service
is available when needed.
For EACL 2012 we decided to switch to digital proceedings only. They were available before the
conference from the website, during the conference on the memory stick you received with your
registration material, and afterwards from the website and the ACL Anthology. An exception was made
for the tutorial notes, which are available to participants on paper as well. I warned the Publications
Chairs, Adrià de Gispert and Fabrice Lefèvre, beforehand that theirs was probably the most demanding
and stressful task of the conference: making sure that huge volumes of material from so many sources are
available in time and in the right format, incorporating last minute corrections, and handling unavoidable
glitches in the publications software. It is a formidable task, but they completed it without flinching. We
all owe them our gratitude.
EACL seems to follow economic crises; let us hope it does not become a habit. Both the previous
conference in 2009 and the current one happened in grim economic times. Being a Sponsorship Chair
is not a happy occasion in such times. Nevertheless, both the international ACL Sponsorship Committee
(with Massimiliano Ciaramita as EACL member) and the local Sponsorship Chairs (Eric SanJuan and
Stéphane Huet) left no stone unturned looking for sponsors. We would have ended up in a much worse
financial situation if it hadn't been for their efforts. Thank you! And of course also many thanks to our
sponsors who, despite the economic situation, decided to help us financially with the conference. I am
convinced their investment will be rewarded.
Organizing large conferences like this is a complex undertaking, even with the help of extensive material
(the ACL conference handbook). Whenever in doubt, I have had the opportunity to interact with the
EACL Board, and occasionally with the ACL Board and with Priscilla Rasmussen. This has always been
a pleasure. I have learned that the people running our associations are dedicated, know everything, and
never sleep.
Last but not least, the local organizing team has had to carry the largest burden in the organization. The
sheer number of tasks and actions the local organizers of a conference like EACL have to assume is
astonishing. Marc El-Bèze has been a wonderful chair and his team (Frédéric Béchet, Yann Fernandez,
Stéphane Huet, Tania Jimenez, Fabrice Lefèvre, Georges Linarès, Alexis Nasr, Eric SanJuan, and Iria
Da Cunha) has done outstanding work. One can hardly begin to list the many tasks they had to
fulfill to make this a top conference. I am very grateful for all the work they put into the event and for
the stress-free and friendly cooperation. I am also grateful for the support of the University of Avignon.
I hope you will have many fond memories of EACL 2012, organized in these stunning surroundings
in Avignon, both about the exciting scientific program and about the superb social program and local
arrangements.
Walter Daelemans
General Chair
March 2012
Preface: Program Chairs
We are delighted to present you with this volume containing the papers accepted for presentation at
the 13th Conference of the European Chapter of the Association for Computational Linguistics, held in
Avignon, France, from April 23 to April 27, 2012.
EACL 2012 received 326 submissions. We were able to accept 85 papers in total (an acceptance rate
of 26%). 48 of the papers (14.7%) were accepted for oral presentation, and 34 (10.4%) for poster
presentation. One oral paper was subsequently withdrawn after acceptance. The papers were selected
by a program committee of 28 area chairs, from Asia, Europe, and North America, assisted by a panel
of 471 reviewers. Each submission was reviewed by three reviewers, who were furthermore encouraged
to discuss any divergences they might have, and the papers in each area were ranked by the area chairs.
The final selection was made by the program co-chairs after an independent check of all reviews and
discussions with the area chairs.
This year EACL introduced an author response period. Authors were able to read and respond to the
reviews of their paper before the program committee made a final decision. They were asked to correct
factual errors in the reviews and answer questions raised in the reviewers' comments. The intention was
to help produce more accurate reviews. In some cases, reviewers changed their scores in view of the
authors' response and the area chairs read all responses carefully prior to making recommendations for
acceptance. Another new feature was to allow authors to include optional supplementary material in
addition to the paper itself (e.g., code, data sets, and resources). Finally, in an attempt to eliminate any
bias from the reviewing process we put in place a double-blind reviewing system where the identity of
the authors was not revealed to the area chairs.
After the program was selected, each of the area chairs was asked to nominate the best paper from his
or her area, or to explicitly decline to nominate any. This resulted in several nominations out of which
three stood out and were considered in more detail by a dedicated committee chaired by Stephen
Clark. This independent committee selected the best paper of the conference, which will also be awarded
a prize sponsored by Google. The best paper and the other two finalists will be presented in plenary
sessions at the conference.
In addition to the main conference program, EACL 2012 will feature the now traditional Student
Research Workshop, 10 workshops, 4 tutorials and a demo session with 21 presentations. We are also
fortunate to have three invited speakers, Martin Cooke, Ikerbasque (Basque Foundation for Science),
Regina Barzilay, Massachusetts Institute of Technology, and Raymond Mooney, University of Texas at
Austin. Martin Cooke will speak about "Speech Communication in the Wild", Regina Barzilay will
discuss the topic of "Learning to Behave by Reading", and Raymond Mooney will present on "Learning
Language from Perceptual Context".
First and foremost, we would like to thank the authors who submitted their work to EACL. The sheer
number of submissions reflects how broad and active our field is. We are deeply indebted to the area
chairs and the reviewers for their hard work. They enabled us to select an exciting program and to
provide valuable feedback to the authors. We are grateful to our invited speakers who graciously agreed
to give talks at EACL. Additional thanks to the Publications Chairs, Adrià de Gispert and Fabrice
Lefèvre, who put this volume together. We are grateful to Rich Gerber and the START team who
always responded to our questions quickly, and helped us manage the large number of submissions
smoothly. Thanks are due to the local organizing committee chair, Marc El-Bèze, for his cooperation
with us over many organisational issues. We are also grateful to the Student Research Workshop chairs,
Pierre Lison, Mattias Nilsson, and Marta Recasens, and the NAACL-HLT (Srinivas Bangalore, Eric
Fosler-Lussier and Ellen Riloff) and ACL (Chin-Yew Lin and Miles Osborne) program chairs for their
smooth collaboration in the handling of double submissions. Last but not least, we are indebted to the
General Chair, Walter Daelemans, for his guidance and support throughout the whole process.
Organizing Committee
General Chair:
Walter Daelemans, University of Antwerp, Belgium
Area Chairs:
Katja Filippova, Google
Min-Yen Kan, National University of Singapore
Charles Sutton, University of Edinburgh
Ivan Titov, Saarland University
Xavier Carreras, Universitat Politècnica de Catalunya (UPC)
Kenji Sagae, University of Southern California
Kallirroi Georgila, Institute for Creative Technologies, University of Southern California
Michael Strube, HITS gGmbH
Pascale Fung, The Hong Kong University of Science and Technology
Bing Liu, University of Illinois at Chicago
Theresa Wilson, Johns Hopkins University
David McClosky, Stanford University
Sebastian Riedel, University of Massachusetts
Phil Blunsom, University of Oxford
Mikel L. Forcada, Universitat d'Alacant
Christof Monz, University of Amsterdam
Sharon Goldwater, University of Edinburgh
Richard Wicentowski, Swarthmore College
Patrick Pantel, Microsoft Research
Hiroya Takamura, Tokyo Institute of Technology
Alexander Koller, University of Potsdam
Sebastian Padó, Universität Heidelberg
Maarten de Rijke, University of Amsterdam
Julio Gonzalo, UNED
Lori Levin, Carnegie Mellon University
Piek Vossen, VU University Amsterdam
Afra Alishahi, Tilburg University, The Netherlands
John Hale, Cornell University
Student Research Workshop Chairs:
Pierre Lison, University of Oslo, Norway
Mattias Nilsson, Uppsala University, Sweden
Marta Recasens, University of Barcelona, Spain
Publications Committee:
Adrià de Gispert, University of Cambridge, UK
Fabrice Lefèvre, University of Avignon, France
Sponsorship Committee:
Massimiliano Ciaramita
Mentoring service:
Caroline Sporleder, Saarland University, Germany
Gertjan van Noord, University of Groningen, The Netherlands
Table of Contents
Answer Sentence Retrieval by Matching Dependency Paths acquired from Question/Answer Sentence
Pairs
Michael Kaisser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Can Click Patterns across Users' Query Logs Predict Answers to Definition Questions?
Alejandro Figueroa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Adaptation of Statistical Machine Translation Model for Cross-Lingual Information Retrieval in a Ser-
vice Context
Vassilina Nikoulina, Bogomil Kovachev, Nikolaos Lagos and Christof Monz . . . . . . . . . . . . . . . . 109
Tree Representations in Probabilistic Models for Extended Named Entities Detection
Marco Dinarelli and Sophie Rosset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
A Probabilistic Model of Syntactic and Semantic Acquisition from Child-Directed Utterances and their
Meanings
Tom Kwiatkowski, Sharon Goldwater, Luke Zettlemoyer and Mark Steedman . . . . . . . . . . . . . . . 234
CLex: A Lexicon for Exploring Color, Concept and Emotion Associations in Language
Svitlana Volkova, William B. Dolan and Theresa Wilson. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .306
User Edits Classification Using Document Revision Histories
Amit Bronner and Christof Monz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
Combining Tree Structures, Flat Features and Patterns for Biomedical Relation Extraction
Md. Faisal Mahbub Chowdhury and Alberto Lavelli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
Cutting the Long Tail: Hybrid Language Models for Translation Style Adaptation
Arianna Bisazza and Marcello Federico . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
Detecting Highly Confident Word Translations from Comparable Corpora without Any Prior Knowledge
Ivan Vulic and Marie-Francine Moens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
Evaluating language understanding accuracy with respect to objective outcomes in a dialogue system
Myroslava O. Dzikovska, Peter Bell, Amy Isard and Johanna D. Moore . . . . . . . . . . . . . . . . . . . . . 471
Joint Satisfaction of Syntactic and Pragmatic Constraints Improves Incremental Spoken Language Un-
derstanding
Andreas Peldszus, Okko Buß, Timo Baumann and David Schlangen . . . . . . . . . . . . . . . . . . . . . . . . 514
Learning How to Conjugate the Romanian Verb. Rules for Regular and Partially Irregular Verbs
Liviu P. Dinu, Vlad Niculae and Octavia-Maria Sulea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History
Torsten Zesch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
Perplexity Minimization for Translation Model Domain Adaptation in Statistical Machine Translation
Rico Sennrich . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
Learning for Microblogs with Distant Supervision: Political Forecasting with Twitter
Micol Marchetti-Bowick and Nathanael Chambers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
Framework of Semantic Role Assignment based on Extended Lexical Conceptual Structure: Comparison
with VerbNet and FrameNet
Yuichiroh Matsubayashi, Yusuke Miyao and Akiko Aizawa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 686
Elliphant: Improved Automatic Detection of Zero Subjects and Impersonal Constructions in Spanish
Luz Rello, Ricardo Baeza-Yates and Ruslan Mitkov . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 706
To what extent does sentence-internal realisation reflect discourse context? A study on word order
Sina Zarrieß, Aoife Cahill and Jonas Kuhn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 767
Not as Awful as it Seems: Explaining German Case through Computational Experiments in Fluid Con-
struction Grammar
Remi van Trijp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 829
Conference Program
Wednesday April 25, 2012 (continued)
10:30 Answer Sentence Retrieval by Matching Dependency Paths acquired from Ques-
tion/Answer Sentence Pairs
Michael Kaisser
10:55 Can Click Patterns across Users' Query Logs Predict Answers to Definition Questions?
Alejandro Figueroa
11:20 Adaptation of Statistical Machine Translation Model for Cross-Lingual Information Re-
trieval in a Service Context
Vassilina Nikoulina, Bogomil Kovachev, Nikolaos Lagos and Christof Monz
14:25 Tree Representations in Probabilistic Models for Extended Named Entities Detection
Marco Dinarelli and Sophie Rosset
14:50 When Did that Happen? Linking Events and Relations to Timestamps
Dirk Hovy, James Fan, Alfio Gliozzo, Siddharth Patwardhan and Christopher Welty
Wednesday April 25, 2012 (continued)
16:10 A Probabilistic Model of Syntactic and Semantic Acquisition from Child-Directed Utter-
ances and their Meanings
Tom Kwiatkowski, Sharon Goldwater, Luke Zettlemoyer and Mark Steedman
Wednesday April 25, 2012 (continued)
16:10 CLex: A Lexicon for Exploring Color, Concept and Emotion Associations in Language
Svitlana Volkova, William B. Dolan and Theresa Wilson
16:10 Generalization Methods for In-Domain and Cross-Domain Opinion Holder Extraction
Michael Wiegand and Dietrich Klakow
16:10 Skip N-grams and Ranking Functions for Predicting Script Events
Bram Jans, Steven Bethard, Ivan Vulic and Marie-Francine Moens
Thursday April 26, 2012
16:10 Combining Tree Structures, Flat Features and Patterns for Biomedical Relation Extraction
Md. Faisal Mahbub Chowdhury and Alberto Lavelli
16:10 Cutting the Long Tail: Hybrid Language Models for Translation Style Adaptation
Arianna Bisazza and Marcello Federico
16:10 Detecting Highly Confident Word Translations from Comparable Corpora without Any
Prior Knowledge
Ivan Vulic and Marie-Francine Moens
Thursday April 26, 2012 (continued)
16:10 Evaluating language understanding accuracy with respect to objective outcomes in a dia-
logue system
Myroslava O. Dzikovska, Peter Bell, Amy Isard and Johanna D. Moore
16:10 Joint Satisfaction of Syntactic and Pragmatic Constraints Improves Incremental Spoken
Language Understanding
Andreas Peldszus, Okko Buß, Timo Baumann and David Schlangen
16:10 Learning How to Conjugate the Romanian Verb. Rules for Regular and Partially Irregular
Verbs
Liviu P. Dinu, Vlad Niculae and Octavia-Maria Sulea
16:10 Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revi-
sion History
Torsten Zesch
16:10 Perplexity Minimization for Translation Model Domain Adaptation in Statistical Machine
Translation
Rico Sennrich
16:10 Subcat-LMF: Fleshing out a standardized format for subcategorization frame interoper-
ability
Judith Eckle-Kohler and Iryna Gurevych
16:10 The effect of domain and text type on text prediction quality
Suzan Verberne, Antal van den Bosch, Helmer Strik and Lou Boves
Thursday April 26, 2012 (continued)
10:30 Learning for Microblogs with Distant Supervision: Political Forecasting with Twitter
Micol Marchetti-Bowick and Nathanael Chambers
10:55 Learning from evolving data streams: online triage of bug reports
Grzegorz Chrupała
Friday April 27, 2012 (continued)
10:30 Smart Paradigms and the Predictability and Complexity of Inflectional Morphology
Grégoire Détrez and Aarne Ranta
11:45 Identifying Broken Plurals, Irregular Gender, and Rationality in Arabic Text
Sarah Alkuhlani and Nizar Habash
10:30 Framework of Semantic Role Assignment based on Extended Lexical Conceptual Struc-
ture: Comparison with VerbNet and FrameNet
Yuichiroh Matsubayashi, Yusuke Miyao and Akiko Aizawa
11:20 Elliphant: Improved Automatic Detection of Zero Subjects and Impersonal Constructions
in Spanish
Luz Rello, Ricardo Baeza-Yates and Ruslan Mitkov
Friday April 27, 2012 (continued)
14:00 To what extent does sentence-internal realisation reflect discourse context? A study on
word order
Sina Zarrieß, Aoife Cahill and Jonas Kuhn
14:25 Behind the Article: Recognizing Dialog Acts in Wikipedia Talk Pages
Oliver Ferschke, Iryna Gurevych and Yevgen Chebotar
14:50 An Unsupervised Dynamic Bayesian Network Approach to Measuring Speech Style Ac-
commodation
Mahaveer Jain, John McDonough, Gahgene Gweon, Bhiksha Raj and Carolyn Penstein Rosé
Friday April 27, 2012 (continued)
14:50 Not as Awful as it Seems: Explaining German Case through Computational Experiments
in Fluid Construction Grammar
Remi van Trijp
Speech Communication in the Wild
Martin Cooke
Language and Speech Laboratory
University of the Basque Country
Ikerbasque (Basque Science Foundation)
m.cooke@ikerbasque.org
Abstract
Much of what we know about speech perception comes from laboratory studies with clean, canonical speech, ideal listeners and artificial tasks. But how do interlocutors manage to communicate effectively in the seemingly less-than-ideal conditions of everyday listening, which frequently involve trying to make sense of speech while listening in a non-native language, or in the presence of competing sound sources, or while multitasking? In this talk I'll examine the effect of real-world conditions on speech perception and quantify the contributions made by factors such as binaural hearing, visual information and prior knowledge to speech communication in noise. I'll present a computational model which trades stimulus-related cues with information from learnt speech models, and examine how well it handles both energetic and informational masking in a two-sentence separation task. Speech communication also involves listening-while-talking. In the final part of the talk I'll describe some ways in which speakers might be making communication easier for their interlocutors, and demonstrate the application of these principles to improving the intelligibility of natural and synthetic speech in adverse conditions.
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, page 1, Avignon, France, April 23-27, 2012. © 2012 Association for Computational Linguistics
Power-Law Distributions for Paraphrases Extracted from Bilingual Corpora
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 2-11, Avignon, France, April 23-27, 2012. © 2012 Association for Computational Linguistics
each sub-phrase-table (Section 2.2). The underlying connectivity of the source and target clusters gives rise to a natural graph representation for each cluster (Section 3.1). The vertices of the graphs consist of phrases and features with a dual smoothing/syntactic-information-carrier role. The latter allow (a) redistribution of the mass for phrases with no appropriate paraphrases and (b) the extraction of syntactic paraphrases. The proximity among vertices of a graph is measured by means of a random walk distance measure, the commute time (Aldous and Fill, 2001). This measure is known to perform well in identifying similar words on the graph of WordNet (Rao et al., 2008), and a related measure, the hitting time, is known to perform well in harvesting paraphrases on a graph constructed from multiple phrase-tables (KB).

Generally in NLP, power-law distributions are typically encountered in the collection of counts during the training stage. The distances of Section 3.1 are converted into artificial co-occurrence counts with a novel technique (Section 3.2). Although they need not be integers, the main challenge is the type of the underlying distributions; it should ideally emulate the count distributions resulting from the phrase extraction stage of a monolingual parallel corpus (Dolan et al., 2004). These counts give rise to the desired probability distributions by means of relative frequencies.

2 Sub-phrase-tables & Clustering

2.1 Extracting Connected Components

For the decomposition of the phrase-table into sub-phrase-tables it is convenient to view the phrase-table as an undirected, unweighted graph P with the vertex set being the source and target phrases and the edge set being the phrase-table entries. For the rest of this section, we do not distinguish between source and target phrases, i.e. both types are treated equally as vertices of P. When referring to the size of a graph, we mean the number of vertices it contains.

A trivial initial decomposition of P is achieved by identifying all its connected components (components for brevity), i.e. the mutually disjoint connected subgraphs, {P0, P1, ..., Pn}. It turns out (see Section 4.1) that the largest component, say P0, is of significant size. We call P0 giant and it needs to be further decomposed. This is done by identifying all vertices such that, upon removal, the component becomes disconnected. Such vertices are called articulation points or cut-vertices. Cut-vertices of high connectivity degree are removed from the giant component (see Section 4.1). For the remaining vertices of the giant component, new components are identified and we proceed iteratively, while keeping track of the cut-vertices that are removed at each iteration, until the size of the largest component is less than a certain threshold (see Section 4.1).

Note that at each iteration, when removing cut-vertices from a giant component, the resulting collection of components may include graphs consisting of a single vertex. We refer to such vertices as residues. They are excluded from the resulting collection and are considered for separate treatment, as explained later in this section.

The cut-vertices need to be inserted appropriately back into the components: Starting from the last iteration step, the respective cut-vertices are added to all the components of P0 which they used to glue together; this process is performed iteratively, until there are no more cut-vertices to add. By addition of a cut-vertex to a component, we mean the re-establishment of edges between the former and other vertices of the latter. The result is a collection of components whose total number of unique vertices is less than the number of vertices of the initial giant component P0. These remaining vertices are the residues. We then construct the graph R which consists of the residues together with all their translations (even those that are included in components of the above collection) and then identify its components {R0, ..., Rm}. It turns out that the largest component, say R0, is giant, and we repeat the decomposition process that was performed on P0. This results in a new collection of components as well as new residues: The components need to be pruned (see Section 4.1) and the residues give rise to a new graph R′ which is constructed in the same way as R. We proceed iteratively until the number of residues stops changing. For each remaining residue u, we identify its translations, and for each translation v we identify the largest component of which v is a member and add u to that component.

The final result is a collection C = D ∪ F, where D is the collection of components emerging from the entire iterative decomposition of P0
and R, and F = {P1, ..., Pn}. Figure 1 shows the decomposition of a connected graph G0; for simplicity we assume that only one cut-vertex is removed at each iteration and ties are resolved arbitrarily. In Figure 2 the residue graph is constructed and its two components are identified. The iterative insertion of the cut-vertices is also depicted. The resulting two components together with those from R form the collection D for G0.

The addition of cut-vertices into multiple components, as well as the construction method of the residue-based graph R, can yield occurrences of a vertex in multiple components in D. We exploit this property in two ways:

(a) In order to mitigate the risk of excessive decomposition (which implies greater risk of good paraphrases being in different components), as well as to reduce the size of D, a conservative merging algorithm of components is employed. Suppose that the elements of D are ranked according to size in ascending order as D = {D1, ..., Dk, Dk+1, ..., D|D|}, where |Di| is at most some threshold for i = 1, ..., k (see Section 4.1). Each component Di with i ∈ {1, ..., k} is examined as follows: For each vertex of Di the number of its occurrences in D is inspected; this is done in order to identify an appropriate vertex b to act as a bridge between Di and other components of which b is a member. Note that translations of a vertex b with a smaller number of occurrences in D are less likely to capture their full spectrum of paraphrases. We thus choose a vertex b from Di with the smallest number of occurrences in D, resolving ties arbitrarily, and proceed with merging Di with the largest component, say Dj with j ∈ {1, ..., |D| − 1}, of which b is also a member. The resulting merged component Dj′ contains all vertices and edges of Di and Dj, and new edges, which are formed according to the rule: if u is a vertex of Di and v is a vertex of Dj and (u, v) is a phrase-table entry, then (u, v) is an edge in Dj′. As long as no connected component has identified Di as the component with which it should be merged, Di is deleted from the collection D.

(b) We define an idf-inspired measure for each phrase pair (x, x′) of the same type (source or target) as

    idf(x, x′) = (1 / log |D|) · log( 2 c(x, x′) |D| / (c(x) + c(x′)) ),    (1)

where c(x, x′) is the number of components in which the phrases x and x′ co-occur, and equivalently for c(·). The purpose of this measure is the pruning of paraphrase candidates, and its use is explained in Section 3.1. Note that idf(x, x′) ∈ [0, 1].

The merging process and the idf measure are irrelevant for phrases belonging to the components of F, since the vertex set of each component of F is mutually disjoint with the vertex set of any other component in C.

[Figure 1: The decomposition of G0 with vertices si and tj: the cut-vertex of the ith iteration is denoted by ci, and r collects the residues after each iteration. The task is completed in Figure 2.]

[Figure 2: Top: Residue graph with its components (no further decomposition is required). Bottom: Adding cut-vertices back to their components.]

2.2 Clustering Connected Components

The aim of this subsection is to generate separate clusters for the source and target phrases of each sub-phrase-table (component) C ∈ C. For this purpose the Information-Theoretic Co-Clustering (ITC) algorithm (Dhillon et al., 2003) is employed, which is a general principled clustering algorithm that generates hard clusters (i.e. every element belongs to exactly one cluster) of two interdependent quantities and is known to perform well on high-dimensional and sparse data. In our case, the interdependent quantities are the source and target phrases and the sparse data is the phrase-table.

ITC is a search algorithm similar to K-means, in the sense that a cost function is minimized at each iteration step and the numbers of clusters for both quantities are meta-parameters. The number of clusters is set to the most conservative initialization for both source and target phrases, namely to as many clusters as there are phrases. At each

[...]

than some threshold (see Section 4.1). If two phrases satisfy condition (b) and have translations in more than one common target cluster, a distinct such edge is established. All edges are bi-directional with distinct weights for both directions.

Figure 3 depicts an example of such a construction; a link between a phrase si and a target cluster implies the existence of at least one translation for si in that cluster. We are not interested in the target phrases and they are thus not shown. For simplicity we assume that condition (b) is always satisfied and the extracted graph contains the maxi-
iteration, new clusters are constructed based on mum possible edges. Observe that phrases s3 and
the identification of the argmin of the cost func- s4 have two edges connecting them, (due to tar-
tion for each phrase, which gradually reduces the get clusters Tc and Td ) and that the target cluster
number of clusters. Ta is irrelevant to the construction of the graph,
We observe that conservative choices for the since s1 is the only phrase with translations in it.
meta-parameters often result in good paraphrases This conversion of a source cluster into a graph G
being in different clusters. To overcome this prob-
lem, the hard clusters are converted into soft (i.e. s1 s2 s3 s4 s5 s6 s7 s8
an element may belong to several clusters): One
step before the stopping criterion is met, we mod-
ify the algorithm so that instead of assigning a Ta Tb Tc Td Te Tf
phrase to the cluster with the smallest cost we se-
lect the bottom-X clusters ranked by cost. Addi- s1 s4 s7
tionally, only a certain number of phrases is cho- s3
sen for soft clustering. Both selections are done s6
conservatively with criteria based on the proper- s5 s8
s2
ties of the cost functions.
The formation of clusters leads to a natural re-
finement of the idf measure defined in eqn. (1): Figure 3: Top: A source cluster containing
The quantity c(x, x0 ) is redefined as the number phrases s1 ,..., s8 and the associated target clusters
of components in which the phrases x and x0 co- Ta ,..., Tf . Bottom: The extracted graph from the
occur in at least one cluster. source cluster. All edges are bi-directional.
5
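As a concrete illustration (not part of the original paper), the idf-inspired measure of eqn. (1) can be sketched in Python, assuming the components of D are stored as sets of phrases; the phrases below are hypothetical:

```python
import math

def idf(x, xp, components):
    """idf-inspired measure of eqn. (1) for a same-type phrase pair (x, x').

    components: list of phrase sets, one per connected component in D
    (assumes |D| > 1 so that log |D| is non-zero).
    """
    D = len(components)
    c_pair = sum(1 for comp in components if x in comp and xp in comp)
    c_x = sum(1 for comp in components if x in comp)
    c_xp = sum(1 for comp in components if xp in comp)
    return math.log(2 * c_pair * D / (c_x + c_xp)) / math.log(D)

# Hypothetical collection: "auto" and "car" co-occur in 2 of 4 components.
comps = [{"auto", "car"}, {"auto", "car", "vehicle"}, {"auto"}, {"car"}]
print(idf("auto", "car", comps))  # log(16/6)/log(4), roughly 0.71
```

Pairs that co-occur in most of the components containing them score higher, which is what makes the measure useful as a filter on candidate edges.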
Figure 4: Adding feature vertices to the extracted graph (has) ↔ (owns) ↔ (i have) ↔ (i had). Phrase, POS-tag feature and stem feature vertices are drawn in circles, dotted rectangles and solid rectangles respectively. All edges are bi-directional.

The purpose of the feature vertices, unlike KB, is primarily smoothing and secondarily the identification of paraphrases with the same syntactic information; this will become clear in the description of the computation of weights.

The set of all phrase vertices that are adjacent to s is written as Γ(s) and referred to as the neighborhood of s. Let n(s, t) denote the co-occurrence count of a phrase-table entry (s, t) (Koehn, 2009). We define the strength of s in the subgraph generated by cluster T as

    n(s; T) = Σ_{t ∈ T} n(s, t),    (2)

which is simply a partial occurrence count for s. We proceed with computing weights for all edges of G:

Phrase ↔ phrase weights: Inspired by the notion of preferential attachment (Yule, 1925), which is known to produce power-law weight distributions for evolving weighted networks (Barrat et al., 2004), we set the weight of a directed edge from s to s' to be proportional to the strengths of s' in all subgraphs in which both s and s' are members. Thus, in the random-walk framework, s is more likely to visit a stronger (more reliable) neighbor. If T_{s,s'} = {T | s and s' coexist in the subgraph generated by T}, then the weight w(s → s') of the directed edge from s to s' is given by

    w(s → s') = Σ_{T ∈ T_{s,s'}} n(s'; T)    (3)

if s' ∈ Γ(s), and 0 otherwise.

Phrase ↔ feature weights: As mentioned above, feature vertices have the dual role of carrying syntactic information and smoothing. From eqn. (3) it can be deduced that, if for a phrase s the amount of its outgoing weight is close to the amount of its incoming weight, then this is an indication that at least a significant part of its neighborhood is reliable; the larger the strengths, the more certain the indication. Otherwise, either s or a significant part of its neighborhood is unreliable. The amount of weight from s to its feature vertices should depend on this observation, and we thus let

    net(s) = Σ_{s' ∈ Γ(s)} (w(s → s') − w(s' → s)) + ε,    (4)

where ε prevents net(s) from becoming 0 (see Section 4.1). The net weight of a phrase vertex s is distributed over its feature vertices as

    w(s → f_X) = ⟨w(s → s')⟩ + net(s),    (5)

where the first summand is the average weight from s to its neighboring phrase vertices and X ∈ {POS, STEM}. If s has multiple POS-tag sequences, we distribute the weight of eqn. (5) relative to the co-occurrences of s with the respective POS-tag feature vertices. The quantity ⟨w(s → s')⟩ accounts for the basic smoothing and is augmented by the value net(s), which measures the reliability of the neighborhood of s; the more unreliable the neighborhood, the larger the net weight and thus the larger the overall weights to the feature vertices.

The choice for the opposite direction is trivial:

    w(f_X → s) = 1 / |{s' : (f_X, s') is an edge}|,    (6)

where X ∈ {POS, STEM}. Note the effect of eqns. (4)–(6) in the case where the neighborhood of s has unreliable strengths: in a random walk the feature vertices of s will be preferred, and the resulting similarities between s and other phrase vertices will be small, as desired. Nonetheless, if the syntactic information is the same as that of any other phrase vertex in G, then the paraphrases will be captured.

The transition probability from any vertex u to any other vertex v in G, i.e., the probability of
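A minimal sketch of the strength and edge-weight computations of eqns. (2) and (3), with a hypothetical two-entry phrase-table; the neighborhood test s' ∈ Γ(s) is left implicit here, since a positive shared-cluster strength implies adjacency:

```python
def strength(s, T, n):
    """Strength of phrase s in the subgraph generated by target cluster T,
    eqn. (2): the sum of phrase-table co-occurrence counts n[(s, t)]."""
    return sum(n.get((s, t), 0) for t in T)

def edge_weight(s, sp, clusters, n):
    """Directed phrase-to-phrase weight of eqn. (3): the strength of sp is
    summed over all target clusters in which both s and sp have at least
    one translation."""
    return sum(strength(sp, T, n) for T in clusters
               if strength(s, T, n) > 0 and strength(sp, T, n) > 0)

# Hypothetical counts for two English phrases with two German targets.
n = {("has", "hat"): 3, ("owns", "hat"): 2,
     ("has", "besitzt"): 1, ("owns", "besitzt"): 4}
clusters = [{"hat"}, {"besitzt"}]
print(edge_weight("has", "owns", clusters, n))  # -> 6
print(edge_weight("owns", "has", clusters, n))  # -> 4
```

The asymmetry of the two weights (6 vs. 4) is exactly what biases a random walk toward the stronger neighbor.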
hopping from u to v in one step, is given by

    p(u → v) = w(u → v) / Σ_{v'} w(u → v'),    (7)

where we sum over all vertices adjacent to u in G. We can thus compute the similarity between any two vertices u and v in G by their commute time, i.e., the expected number of steps in a round trip in a random walk from u to v and then back to u, which is denoted by κ(u, v) (see Section 4.1 for the method of computation of κ). Since κ(u, v) is a distance measure, the smaller its value, the more similar u and v are.

3.2 Counts

We convert the distance κ(u, v) of a vertex pair u, v in a graph G into a co-occurrence count nG(u, v) with a novel technique: In order to assess the quality of the pair u, v with respect to G we compare κ(u, v) with κ(u, x) and κ(v, x) for all other vertices x in G. We thus consider the average distance of u to the vertices of G other than v, and similarly for v. This quantity is denoted by κ̄(u; v) and κ̄(v; u) respectively, and by definition it is given by

    κ̄(i; j) = Σ_{x ∈ G, x ≠ j} κ(i, x) pG(x|i),    (8)

where pG(x|i) ≡ p(x|G, i) is a yet unknown probability distribution with respect to G. The quantity (κ̄(u; v) + κ̄(v; u))/2 can then be viewed as the average distance of the pair u, v to the rest of the graph G. The co-occurrence count of u and v in G is thus defined by

    nG(u, v) = (κ̄(u; v) + κ̄(v; u)) / (2 κ(u, v)).    (9)

In order to calculate the probabilities pG(·|·) we employ the following heuristic: Starting with a uniform distribution pG(0)(·|·) at timestep t = 0, we iterate

    κ̄(t)(i; j) = Σ_{x ∈ G, x ≠ j} κ(i, x) pG(t)(x|i),    (10)

    nG(t)(u, v) = (κ̄(t)(u; v) + κ̄(t)(v; u)) / (2 κ(u, v)),    (11)

    pG(t+1)(v|u) = nG(t)(u, v) / Σ_{x ∈ G} nG(t)(u, x),    (12)

for all pairs of vertices u, v in G until convergence. Experimentally, we find that convergence is always achieved. After the execution of this iterative process we divide each count by the smallest count in order to achieve a lower bound of 1.

A pair u, v may appear in multiple graphs in the same sub-phrase-table C. The total co-occurrence count of u and v in C and the associated conditional probabilities are thus given by

    nC(u, v) = Σ_{G ∈ C} nG(u, v),    (13)

    pC(v|u) = nC(u, v) / Σ_{x ∈ C} nC(u, x).    (14)

A pair u, v may also appear in multiple sub-phrase-tables, and for the calculation of the final count n(u, v) we need to average over the associated counts from all sub-phrase-tables. Moreover, we have to take into account the type of the vertices: For the simplest case where both u and v represent phrase vertices, their expected count is, by definition, given by

    n(s, s') = Σ_C nC(s, s') p(C|s, s').    (15)

On the other hand, if at least one of u or v is a feature vertex, then we have to consider the phrase vertex that generates this feature: Suppose that u is the phrase vertex s = acquire and v the POS-tag vertex f = NN, and that they co-occur in two sub-phrase-tables C and C' with positive counts nC(s, f) and nC'(s, f) respectively; the feature vertex f is generated by the phrase vertex ownership in C and by possession in C'. In that case, an interpolation of the counts nC(s, f) and nC'(s, f) as in eqn. (15) would be incorrect, and a direct sum nC(s, f) + nC'(s, f) would provide the true count. As a result we have

    n(s, f) = Σ_{s'} Σ_C nC(s, f(s')) p(C|s, f(s')),    (16)

where the first summation is over all phrase vertices s' such that f(s') = f. With a similar argument we can write

    n(f, f') = Σ_{s, s'} Σ_C nC(f(s), f(s')) p(C|f(s), f(s')).    (17)
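The iterative heuristic of eqns. (8)-(12) can be sketched as follows; a hand-made toy distance table stands in for the commute times, and a fixed iteration count stands in for the convergence test:

```python
def cooccurrence_counts(kappa, vertices, iters=50):
    """Iterative heuristic of eqns. (8)-(12): turn distances kappa[(u, v)]
    into co-occurrence counts n_G(u, v), starting from a uniform
    distribution p_G(.|.) and re-estimating it from the counts."""
    p = {u: {x: 1.0 / (len(vertices) - 1) for x in vertices if x != u}
         for u in vertices}
    n = {}
    for _ in range(iters):
        # eqn. (10): average distance of i to the rest of G, excluding j
        kbar = {(i, j): sum(kappa[(i, x)] * p[i][x]
                            for x in vertices if x not in (i, j))
                for i in vertices for j in vertices if i != j}
        # eqn. (11): compare the pair distance with the average distances
        n = {(u, v): (kbar[(u, v)] + kbar[(v, u)]) / (2.0 * kappa[(u, v)])
             for u in vertices for v in vertices if u != v}
        # eqn. (12): renormalize the counts into the next distribution
        p = {u: {v: n[(u, v)] / sum(n[(u, x)] for x in vertices if x != u)
                 for v in vertices if v != u}
             for u in vertices}
    return n

# Toy symmetric distances: a and b are closer to each other than to c.
kappa = {("a", "b"): 2.0, ("b", "a"): 2.0, ("a", "c"): 4.0,
         ("c", "a"): 4.0, ("b", "c"): 4.0, ("c", "b"): 4.0}
counts = cooccurrence_counts(kappa, ["a", "b", "c"])
```

In this toy run the closer pair (a, b) receives the larger count, which is the behavior eqn. (9) is designed to produce.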
For the interpolants, from standard probability we find

    p(C|u, v) = pC(v|u) p(C|u) / Σ_{C'} pC'(v|u) p(C'|u),    (18)

where the probabilities p(C|u) can be computed by considering the likelihood function

    ℓ(u) = Π_{i=1..N} p(xi|u) = Π_{i=1..N} Σ_C pC(xi|u) p(C|u)

and by maximizing the average log-likelihood (1/N) log ℓ(u), where N is the total number of vertices with which u co-occurs with positive counts in all sub-phrase-tables.

Finally, the desired probability distributions are given by the relative frequencies

    p(v|u) = n(u, v) / Σ_x n(u, x),    (19)

for all pairs of vertices u, v.

Figure 5: Log-log plot of ranked components according to their size (number of source and target phrases) for: (a) components extracted from P (1-1 components are not shown); (b) components extracted from the decomposition of P0.

4 Experiments

4.1 Setup

The data for building the phrase-table P is drawn from DE-EN bitexts crawled from www.project-syndicate.org, which is a standard resource provider for the WMT campaigns (News Commentary bitexts; see, e.g., Callison-Burch et al. (2007)). The filtered bitext consists of 125K sentences; word alignment was performed by running GIZA++ in both directions and generating the symmetric alignments with the grow-diag-final-and heuristic. The resulting P has 7.7M entries, 30% of which are 1-1, i.e., entries (s, t) that satisfy p(s|t) = p(t|s) = 1. These entries are irrelevant for paraphrase harvesting for both the baseline and our method, and are thus excluded from the process.

The initial giant component P0 contains 1.7M vertices (Figure 5), of which 30% become residues and are used to construct R. At each iteration of the decomposition of a giant component, we remove the top 0.5% × size cut-vertices ranked by degree of connectivity, where size is the number of vertices of the giant component, and we set the stopping criterion to 2500. The latter choice is appropriate for the subsequent step of co-clustering the components, for both time complexity and performance of the ITC algorithm.

In the components emerging from the decomposition of R0, we observe an excessive number of cut-vertices. Note that the vertices that constitute these components can be of two types: i) former residues, i.e., residues that emerged from the decomposition of P0, and ii) other vertices of P0. Cut-vertices can be of either type. For each component, we remove cut-vertices that are not translations of the former residues of that component. Following this pruning strategy, the degeneracy of excessive cut-vertices does not reappear in the subsequent iterations of decomposing components generated by new residues, but the emergence of two giant components was observed: one consisting mostly of source-type vertices and one of target-type vertices. Without going into further details, the algorithm extends straightforwardly to multiple giant components.

For the merging process of the collection D we set the threshold to 5000, to avoid the emergence of a giant component. The sizes of the resulting sub-phrase-tables are shown in Figure 6. For the ITC algorithm we use the smoothing technique discussed in (Dhillon and Guan, 2003) with parameter 10^6.

Figure 6: Log-log plot of ranked sub-phrase-tables according to their size (number of source and target phrases), before and after merging.

For the monolingual graphs, we set the threshold to 0.65 and discard graphs with more than 20 phrase vertices, as they contain mostly noise. Thus, the sizes of the graphs allow us to use analytical methods to compute the commute times: For a graph G, we form the transition matrix P, whose entries P(u, v) are given by eqn. (7), and the fundamental matrix (Grinstead and Snell, 2006; Boley et al., 2011) Z = (I − P + 1 π^T)^(−1), where I is the identity matrix, 1 denotes the vector of all ones, and π is the vector of stationary probabilities (Aldous and Fill, 2001), which satisfies π^T P = π^T and π^T 1 = 1 and can be computed as in (Hunter, 2000). The commute time between any vertices u and v in G is then given by (Grinstead and Snell, 2006)

    κ(u, v) = (Z(v, v) − Z(u, v))/π(v) + (Z(u, u) − Z(v, u))/π(u).    (20)

For the parameter ε of eqn. (4), an appropriate choice is |Γ(s)| + 1; for reliable neighborhoods, this quantity is insignificant. POS tags and lemmata are generated with TreeTagger¹.

Figure 7 depicts the most basic type of graph that can be extracted from a cluster; it includes two source phrase vertices a and b of different syntactic information. Suppose that both a and b are highly reliable, with strengths n(a; T) = n(b; T) = 40 for some target cluster T. The resulting conditional probabilities adequately represent the proximity of the involved vertices. On the other hand, the range of the co-occurrence counts is not compatible with that of the strengths. This is because i) there are no phrase vertices with small strength in the graph, and ii) eqn. (9) is essentially a comparison between a pair of vertices and the rest of the graph. To overcome this problem, inflation vertices i_a and i_b of strength 1, with accompanying feature vertices, are introduced to the graph. Figure 8 depicts the new graph, where the lengths of the edges represent the magnitude of the commute times. Observe that the quality of the probabilities is preserved but the counts are inflated, as required.

Figure 7: Top: A graph with source phrase vertices a and b, both of strength 40, with accompanying distinct POS-sequence vertices f(·) and stem-sequence vertices g(·). Bottom: The resulting co-occurrence counts and conditional probabilities for a.

Figure 8: The inflated version of Figure 7.

In general, if a source phrase vertex s has at least one translation t such that n(s, t) ≥ 3, then a triplet (i_s, f(i_s), g(i_s)) is added to the graph as in Figure 8. The inflation vertex i_s establishes edges with all other phrase and inflation vertices in the graph, and weights are computed as in Section 3.1. The pipeline remains the same up to eqn. (13), where all counts that include inflation vertices are ignored.

¹ http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
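The analytical route described in this subsection can be sketched in pure Python; a small Gauss-Jordan inverse stands in for a linear-algebra library, and the triangle graph (whose commute time between any two vertices is known to be 2|E| × effective resistance = 2 · 3 · 2/3 = 4) serves as a toy check, not as data from the paper:

```python
def mat_inv(A):
    """Gauss-Jordan inverse of a small square matrix (lists of lists)."""
    n = len(A)
    M = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]      # partial pivoting
        d = M[col][col]
        M[col] = [v / d for v in M[col]]     # normalize pivot row
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def commute_times(P, pi):
    """Commute times via the fundamental matrix Z = (I - P + 1 pi^T)^-1
    and eqn. (20)."""
    n = len(P)
    M = [[(1.0 if i == j else 0.0) - P[i][j] + pi[j] for j in range(n)]
         for i in range(n)]
    Z = mat_inv(M)
    return [[(Z[v][v] - Z[u][v]) / pi[v] + (Z[u][u] - Z[v][u]) / pi[u]
             for v in range(n)] for u in range(n)]

# Simple random walk on a triangle: uniform stationary distribution,
# commute time between any two distinct vertices is 4.
P = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
pi = [1 / 3, 1 / 3, 1 / 3]
K = commute_times(P, pi)  # K[0][1] is 4.0
```

For the small graphs used here (at most 20 phrase vertices plus features), a direct inverse of this kind is entirely adequate.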
4.2 Results

Our method generates conditional probabilities for any pair chosen from {phrase, POS sequence, stem sequence}, but for this evaluation we restrict ourselves to phrase pairs. For a phrase s, the quality of a paraphrase s' is assessed by

    P(s'|s) ∝ p(s'|s) + p(f1(s')|s) + p(f2(s')|s),    (21)

where f1(s') and f2(s') denote the POS-tag sequence and stem sequence of s', respectively. All three summands of eqn. (21) are computed from eqn. (19). The baseline is given by pivoting (Bannard and Callison-Burch, 2005),

    P(s'|s) = Σ_t p(t|s) p(s'|t),    (22)

where p(t|s) and p(s'|t) are the phrase-based relative frequencies of the translation model.

Table 1: Mean Expected Precision (MEP) at k under lenient and strict evaluation criteria.

                Lenient MEP          Strict MEP
    Method      @1    @5    @10      @1    @5    @10
    Baseline    .58   .47   .41      .43   .33   .28
    Graphs      .72   .61   .52      .53   .40   .33

We select 150 phrases (an equal number for

... by eqns. (15)-(17)) for all vertices u and v, belongs to the power-law family (Figure 9). This is evidence that the monolingual graphs can simulate the phrase extraction process of a monolingual parallel corpus. Intuitively, we may think of the German side of the DE-EN parallel corpus as the "English" approximation to an EN-EN parallel corpus, and the monolingual graphs as the word alignment process.
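The pivoting baseline of eqn. (22) reduces to a sum over the shared foreign translations of the two phrases; a minimal sketch, with hypothetical translation probabilities:

```python
def pivot_score(s, sp, p_fwd, p_bwd):
    """Pivoting baseline of eqn. (22): P(s'|s) = sum_t p(t|s) p(s'|t),
    marginalizing over the foreign translations t of s."""
    total = 0.0
    for t, pts in p_fwd.get(s, {}).items():
        total += pts * p_bwd.get(t, {}).get(sp, 0.0)
    return total

# Hypothetical toy translation probabilities (EN -> DE and DE -> EN).
p_fwd = {"under control": {"unter kontrolle": 0.6, "im griff": 0.4}}
p_bwd = {"unter kontrolle": {"under control": 0.7, "in check": 0.3},
         "im griff": {"under control": 0.5, "in check": 0.5}}
print(pivot_score("under control", "in check", p_fwd, p_bwd))
# 0.6*0.3 + 0.4*0.5 = 0.38
```

Note that the baseline only ever connects phrases through the bilingual phrase-table, whereas the graph method also exploits monolingual structure.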
References

David Aldous and James A. Fill. 2001. Reversible Markov Chains and Random Walks on Graphs. http://www.stat.berkeley.edu/aldous/RWG/book.html

Ion Androutsopoulos and Prodromos Malakasiotis. 2010. A Survey of Paraphrasing and Textual Entailment Methods. Journal of Artificial Intelligence Research, 38:135-187.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with Bilingual Parallel Corpora. Proc. ACL, pp. 597-604.

Alain Barrat, Marc Barthélemy, and Alessandro Vespignani. 2004. Modeling the Evolution of Weighted Networks. Phys. Rev. Lett., 92.

Daniel Boley, Gyan Ranjan, and Zhi-Li Zhang. 2011. Commute Times for a Directed Graph using an Asymmetric Laplacian. Linear Algebra and its Applications, Issue 2, pp. 224-242.

Chris Callison-Burch. 2008. Syntactic Constraints on Paraphrases Extracted from Parallel Corpora. Proc. EMNLP, pp. 196-205.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) Evaluation of Machine Translation. Proc. Workshop on Statistical Machine Translation, pp. 136-158.

Chris Callison-Burch, Philipp Koehn, and Miles Osborne. 2006. Improved Statistical Machine Translation Using Paraphrases. Proc. HLT/NAACL, pp. 17-24.

Inderjit S. Dhillon and Yuqiang Guan. 2003. Information Theoretic Clustering of Sparse Co-Occurrence Data. Proc. IEEE Int'l Conf. Data Mining, pp. 517-520.

Inderjit S. Dhillon, Subramanyam Mallela, and Dharmendra S. Modha. 2003. Information-Theoretic Co-clustering. Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 89-98.

William Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources. Proc. COLING, pp. 350-356.

Juri Ganitkevitch, Chris Callison-Burch, Courtney Napoles, and Benjamin Van Durme. 2011. Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation. Proc. EMNLP, pp. 1168-1179.

Charles Grinstead and Laurie Snell. 2006. Introduction to Probability. Second ed., American Mathematical Society.

Jeffrey J. Hunter. 2000. A Survey of Generalized Inverses and their Use in Stochastic Modelling. Res. Lett. Inf. Math. Sci., Vol. 1, pp. 25-36.

Philipp Koehn. 2009. Statistical Machine Translation. Cambridge University Press, Cambridge, UK.

Stanley Kok and Chris Brockett. 2010. Hitting the Right Paraphrases in Good Time. Proc. NAACL, pp. 145-153.

Roland Kuhn, Boxing Chen, George Foster, and Evan Stratford. 2010. Phrase Clustering for Smoothing TM Probabilities: or, How to Extract Paraphrases from Phrase Tables. Proc. COLING, pp. 608-616.

Nitin Madnani and Bonnie Dorr. 2010. Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods. Computational Linguistics, 36(3):341-388.

Donald Metzler, Eduard Hovy, and Chunliang Zhang. 2011. An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques. Proc. ACL: Short Papers, pp. 546-551.

Takashi Onishi, Masao Utiyama, and Eiichiro Sumita. 2010. Paraphrase Lattice for Statistical Machine Translation. Proc. ACL: Short Papers, pp. 1-5.

Delip Rao, David Yarowsky, and Chris Callison-Burch. 2008. Affinity Measures Based on the Graph Laplacian. Proc. TextGraphs Workshop on Graph-based Algorithms for NLP at COLING, pp. 41-48.

George U. Yule. 1925. A Mathematical Theory of Evolution, Based on the Conclusions of Dr. J. C. Willis, F.R.S. Philos. Trans. R. Soc. London, B 213, pp. 21-87.

Shiqi Zhao, Haifeng Wang, Ting Liu, and Sheng Li. 2008. Pivot Approach for Extracting Paraphrase Patterns from Bilingual Corpora. Proc. ACL, pp. 780-788.
A Bayesian Approach to Unsupervised Semantic Role Induction

Abstract

We introduce two Bayesian models for the unsupervised semantic role labeling (SRL) task. The models treat SRL as clustering of syntactic signatures of arguments, with clusters corresponding to semantic roles. The first model induces these clusterings independently for each predicate, exploiting the Chinese Restaurant Process (CRP) as a prior. In a more refined hierarchical model, we inject the intuition that the clusterings are similar across different predicates, even though they are not necessarily identical. This intuition is encoded as a distance-dependent CRP, with a distance between two syntactic signatures indicating how likely they are to correspond to a single semantic role. These distances are automatically induced within the model and shared across predicates. Both models achieve state-of-the-art results when evaluated on PropBank, with the coupled model consistently outperforming the factored counterpart in all experimental set-ups.

1 Introduction

Semantic role labeling (SRL) (Gildea and Jurafsky, 2002), a shallow semantic parsing task, has recently attracted a lot of attention in the computational linguistics community (Carreras and Marquez, 2005; Surdeanu et al., 2008; Hajic et al., 2009). The task involves prediction of predicate-argument structure, i.e., both the identification of arguments and the assignment of labels according to their underlying semantic role. For example, in the following sentences:

(a) [A0 Mary] opened [A1 the door].
(b) [A0 Mary] is expected to open [A1 the door].
(c) [A1 The door] opened.
(d) [A1 The door] was opened [A0 by Mary].

Mary always takes an agent role (A0) for the predicate open, and door is always a patient (A1). SRL representations have many potential applications in natural language processing and have recently been shown to be beneficial in question answering (Shen and Lapata, 2007; Kaisser and Webber, 2007), textual entailment (Sammons et al., 2009), machine translation (Wu and Fung, 2009; Liu and Gildea, 2010; Wu et al., 2011; Gao and Vogel, 2011), and dialogue systems (Basili et al., 2009; van der Plas et al., 2011), among others. Though syntactic representations are often predictive of semantic roles (Levin, 1993), the interface between syntactic and semantic representations is far from trivial. The lack of simple deterministic rules for mapping syntax to shallow semantics motivates the use of statistical methods.

Although current statistical approaches have been successful in predicting shallow semantic representations, they typically require large amounts of annotated data to estimate model parameters. These resources are scarce and expensive to create, and even the largest of them have low coverage (Palmer and Sporleder, 2010). Moreover, these models are domain-specific, and their performance drops substantially when they are used in a new domain (Pradhan et al., 2008). Such domain specificity is arguably unavoidable for a semantic analyzer, as even the definitions of semantic roles are typically predicate-specific, and different domains can have radically different distributions of predicates (and their senses). The necessity of large amounts of human-annotated data for every language and domain is one of the major obstacles to the widespread adoption of semantic role representations.

These challenges motivate the need for unsupervised methods which, instead of relying on labeled data, can exploit large amounts of unlabeled texts. In this paper, we propose simple and efficient hierarchical Bayesian models for this task.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 12-22, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

It is natural to split the SRL task into two stages: the identification of arguments (the identification stage) and the assignment of semantic roles (the labeling stage). In this and in much of the previous work on unsupervised techniques, the focus is on the labeling stage. Identification, though an important problem, can be tackled with heuristics (Lang and Lapata, 2011a; Grenager and Manning, 2006) or, potentially, by using a supervised classifier trained on a small amount of data. We follow Lang and Lapata (2011a) and regard the labeling stage as clustering of syntactic signatures of argument realizations for every predicate. In our first model, as in most of the previous work on unsupervised SRL, we define an independent model for each predicate. We use the Chinese Restaurant Process (CRP) (Ferguson, 1973) as a prior for the clustering of syntactic signatures. The resulting model achieves state-of-the-art results, substantially outperforming previous methods evaluated in the same setting.

In the first model, for each predicate we independently induce a linking between syntax and semantics, encoded as a clustering of syntactic signatures. The clustering implicitly defines the set of permissible alternations, or changes in the syntactic realization of the argument structure of the verb. Though different verbs admit different alternations, some alternations are shared across multiple verbs and are very frequent (e.g., passivization, example sentences (a) vs. (d), or dativization: John gave a book to Mary vs. John gave Mary a book) (Levin, 1993). Therefore, it is natural to assume that the clusterings should be similar, though not identical, across verbs.

Our second model encodes this intuition by replacing the CRP prior for each predicate with a distance-dependent CRP (dd-CRP) prior (Blei and Frazier, 2011) shared across predicates. The distance between two syntactic signatures encodes how likely they are to correspond to a single semantic role. Unlike most of the previous work exploiting distance-dependent CRPs (Blei and Frazier, 2011; Socher et al., 2011; Duan et al., 2007), we do not encode prior or external knowledge in the distance function, but rather induce it automatically within our Bayesian model. The coupled dd-CRP model consistently outperforms the factored CRP counterpart across all the experimental settings (with gold and predicted syntactic parses, and with gold and automatically identified arguments).

Both models admit efficient inference: the estimation time on the Penn Treebank WSJ corpus does not exceed 30 minutes on a single processor, and the inference algorithm is highly parallelizable, reducing inference time down to several minutes on multiple processors. This suggests that the models scale to much larger corpora, which is an important property for a successful unsupervised learning method, as unlabeled data is abundant.

The rest of the paper is structured as follows. Section 2 begins with a definition of the semantic role labeling task and discusses some specifics of the unsupervised setting. In Section 3, we describe CRPs and dd-CRPs, the key components of our models. In Sections 4-6, we describe our factored and coupled models and the inference method. Section 7 provides both evaluation and analysis. Finally, additional related work is presented in Section 8.

2 Task Definition

In this work, instead of assuming the availability of role-annotated data, we rely only on automatically generated syntactic dependency graphs. While we cannot expect that syntactic structure can trivially map to a semantic representation (Palmer et al., 2005)¹, we can use syntactic cues to help us in both stages of unsupervised SRL. Before defining our task, let us consider the two stages separately.

In the argument identification stage, we implement a heuristic proposed in Lang and Lapata (2011a) consisting of a list of 8 rules, which use nonlexicalized properties of syntactic paths between a predicate and a candidate argument to iteratively discard non-arguments from the list of all words in a sentence. Note that inducing these rules for a new language would require some linguistic expertise. One alternative may be to annotate a small number of arguments and train a classifier with nonlexicalized features instead.

In the argument labeling stage, semantic roles are represented by clusters of arguments, and labeling a particular argument corresponds to deciding on its role cluster. However, instead of dealing with argument occurrences directly, we represent them as predicate-specific syntactic signatures, and refer to them as argument keys. This representation aids our models in inducing high-purity clusters (of argument keys) while reducing their granularity. We follow Lang and Lapata (2011a) and use the following syntactic features to form the argument key representation:

- Active or passive verb voice (ACT/PASS).
- Argument position relative to the predicate (LEFT/RIGHT).
- Syntactic relation to its governor.
- Preposition used for argument realization.

In the example sentences in Section 1, the argument keys for the candidate argument Mary in sentences (a) and (d) would be ACT:LEFT:SBJ and PASS:RIGHT:LGS->by,² respectively. While aiming to increase the purity of argument key clusters, this particular representation will not always produce a good match: e.g., the door in sentence (c) will have the same key as Mary in sentence (a). Increasing the expressiveness of the argument key representation by flagging intransitive constructions would distinguish that pair of arguments. However, we keep this particular representation, in part to compare with the previous work.

In this work, we treat the unsupervised semantic role labeling task as clustering of argument keys. Thus, argument occurrences in the corpus whose keys are clustered together are assigned the same semantic role. Note that some adjunct-like modifier arguments are already explicitly represented in syntax and thus do not need to be clustered (modifiers AM-TMP, AM-MNR, AM-LOC, and

A common metaphor for describing CRPs is the assignment of tables to restaurant customers. Assume a restaurant with a sequence of tables, and customers who walk into the restaurant one at a time and choose a table to join. The first customer to enter is assigned the first table. Suppose that when customer number i enters the restaurant, i − 1 customers are sitting at the tables k = 1, ..., K occupied so far. The new customer is then either seated at one of the K tables with probability Nk/(i − 1 + α), where Nk is the number of customers already sitting at table k, or assigned to a new table with probability α/(i − 1 + α). The concentration parameter α encodes the granularity of the drawn partitions: the larger α, the larger the expected number of occupied tables. Though it is convenient to describe the CRP in a sequential manner, the probability of a seating arrangement is invariant to the order of customer arrival, i.e., the process is exchangeable. In our factored model, we use CRPs as a prior for clustering argument keys, as we explain in Section 4.

Often the CRP is used as part of a Dirichlet process mixture model, where each subset in the partition (each table) selects a parameter (a meal) from some base distribution over parameters. This parameter is then used to generate all data points corresponding to customers assigned to the table. Dirichlet processes (DPs) are closely connected to CRPs: instead of choosing meals for customers through the described generative story, one can equivalently draw a distribution G over meals from a DP and then draw a meal for every customer from G. We refer the reader to Teh (2010) for details on CRPs and DPs. In our method, we use DPs to model distributions of arguments for every role.

¹ Although it provides a strong baseline which is difficult to beat (Grenager and Manning, 2006; Lang and Lapata, 2010; Lang and Lapata, 2011a).
AM-DIR are encoded as syntactic relations TMP, In order to clarify how similarities between
MNR, LOC, and DIR, respectively (Surdeanu et al., customers can be integrated in the generative pro-
2008)); instead we directly use the syntactic labels cess, we start by reformulating the traditional
as semantic roles. CRP in an equivalent form so that distance-
dependent CRP (dd-CRP) can be seen as its gen-
3 Traditional and Distance-dependent eralization. Instead of selecting a table for each
CRPs customer as described above, one can equiva-
The central components of our non-parametric lently assume that a customer i chooses one of
Bayesian models are the Chinese Restaurant Pro- the previous customers ci as a partner with prob-
cesses (CRPs) and the closely related Dirichlet 1
ability i1+ and sits at the same table, or occu-
Processes (DPs) (Ferguson, 1973). pies a new table with the probability i1+
. The
CRPs define probability distributions over par- transitive closure of this seating-with relation de-
titions of a set of objects. An intuitive metaphor termines the partition.
2
LGS denotes a logical subject in a passive construction A generalization of this view leads to the defini-
(Surdeanu et al., 2008). tion of the distance-dependent CRP. In dd-CRPs,
14
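As a concrete illustration, the sequential seating scheme described above can be simulated in a few lines of Python. This is our own sketch (the function name and parameter values are illustrative, not from the paper): customer i joins table k with probability N_k / (i - 1 + α) and opens a new table with probability α / (i - 1 + α).

```python
import random

def crp_partition(n_customers, alpha, rng):
    """Sample a partition of n_customers customers via the sequential CRP."""
    tables = []        # tables[k] = N_k, number of customers at table k
    assignment = []
    for i in range(1, n_customers + 1):
        # join table k with prob N_k / (i - 1 + alpha),
        # or open a new table with prob alpha / (i - 1 + alpha)
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)   # new table
        else:
            tables[k] += 1
        assignment.append(k)
    return assignment

rng = random.Random(0)
partition = crp_partition(10, 1.0, rng)   # table index for each customer
```

Because the process is exchangeable, permuting the arrival order changes the individual draws but not the distribution over the resulting partitions.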
In dd-CRPs, a customer i chooses a partner c_i = j with probability proportional to some non-negative score d_{i,j} (d_{i,j} = d_{j,i}) which encodes a similarity between the two customers.[3] More formally,

p(c_i = j | D, α) ∝ d_{i,j} if i ≠ j, and α if i = j,   (1)

where D is the entire similarity graph. This process lacks the exchangeability property of the traditional CRP, but efficient approximate inference with the dd-CRP is possible with Gibbs sampling. For more details on inference with dd-CRPs, we refer the reader to Blei and Frazier (2011).

[3] It may be more standard to use a decay function f : R → R and choose a partner with probability proportional to f(d_{i,j}). However, the two forms are equivalent, and using the scores d_{i,j} directly is more convenient for our induction purposes.

Though in previous work the dd-CRP was used either to encode prior knowledge (Blei and Frazier, 2011) or other external information (Socher et al., 2011), we treat D as a latent variable drawn from some prior distribution over weighted graphs. This view provides a powerful approach for coupling a family of distinct but similar clusterings: the family of clusterings can be drawn by first choosing a similarity graph D for the entire family and then re-using D to generate each of the clusterings independently of each other as defined by equation (1). In Section 5, we explain how we use this formalism to encode relatedness between argument key clusterings for different predicates.

4 Factored Model

In this section we describe the factored method, which models each predicate independently. In Section 2 we defined our task as clustering of argument keys, where each cluster corresponds to a semantic role. If an argument key k is assigned to a role r (k ∈ r), all of its occurrences are labeled r.

Our Bayesian model encodes two common assumptions about semantic roles. First, we enforce the selectional restriction assumption: we assume that the distribution over potential argument fillers is sparse for every role, implying that peaky distributions of arguments for each role r are preferred to flat distributions. Second, each role normally appears at most once per predicate occurrence. Our inference will search for a clustering which meets the above requirements to the maximal extent.

Our model associates two distributions with each predicate: one governs the selection of argument fillers for each semantic role, and the other models (and penalizes) duplicate occurrences of roles. Each predicate occurrence is generated independently given these distributions. Let us describe the model by first defining how the set of model parameters and an argument key clustering are drawn, and then explaining the generation of individual predicate and argument instances. The generative story is formally presented in Figure 1.

We start by generating a partition of argument keys B_p, with each subset r ∈ B_p representing a single semantic role. The partitions are drawn from CRP(α) (see the Factored model section of Figure 1) independently for each predicate. The crucial part of the model is the set of selectional preference parameters θ_{p,r}, the distributions of arguments x for each role r of predicate p. We represent arguments by their syntactic heads,[4] or more specifically, by either their lemmas or the word clusters assigned to the head by an external clustering algorithm, as we will discuss in more detail in Section 7.[5] For the agent role A0 of the predicate open, for example, this distribution would assign most of the probability mass to arguments denoting sentient beings, whereas the distribution for the patient role A1 would concentrate on arguments representing openable things (doors, boxes, books, etc).

[4] For prepositional phrases, we take as head the head noun of the object noun phrase, as it encodes crucial lexical information. However, the preposition is not ignored but rather encoded in the corresponding argument key, as explained in Section 2.

[5] Alternatively, the clustering of arguments could be induced within the model, as done in (Titov and Klementiev, 2011).

In order to encode the assumption about the sparseness of the distributions θ_{p,r}, we draw them from the DP prior DP(β, H^(A)) with a small concentration parameter β; the base probability distribution H^(A) is just the normalized frequencies of arguments in the corpus. The geometric distribution ψ_{p,r} is used to model the number of times a role r appears with a given predicate occurrence. The decision whether to generate at least one role r is drawn from the uniform Bernoulli distribution. If 0 is drawn then the semantic role is not realized for the given occurrence; otherwise the number of additional roles r is drawn from the geometric distribution Geom(ψ_{p,r}).
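As a small worked sketch (our own helper, with illustrative parameter values), the number of arguments a single role receives in one predicate occurrence combines the Bernoulli gate with the geometric draw just described:

```python
import random

def draw_role_count(psi, rng):
    """How many arguments one role receives in a single predicate occurrence:
    a uniform Bernoulli decides whether the role is realized at all, then the
    number of additional duplicates follows Geom(psi)."""
    if rng.random() < 0.5:        # 0 drawn: role not realized
        return 0
    count = 1                     # role appears at least once
    while rng.random() < psi:     # each extra occurrence with probability psi
        count += 1
    return count

rng = random.Random(0)
counts = [draw_role_count(0.05, rng) for _ in range(10000)]
```

A small psi makes duplicate roles rare, matching the assumption that a role normally appears at most once per predicate occurrence.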
The Beta priors over ψ_{p,r} can indicate the preference towards generating at most one argument for each role. For example, it would express the preference that the predicate open typically appears with a single agent and a single patient argument.

Now, when the parameters and argument key clusterings are chosen, we can summarize the remainder of the generative story as follows. We begin by independently drawing occurrences for each predicate. For each predicate role we independently decide on the number of role occurrences. Then we generate each of the arguments (see GenArgument) by generating an argument key k_{p,r} uniformly from the set of argument keys assigned to the cluster r, and finally choosing its filler x_{p,r}, where the filler is either a lemma or a word cluster corresponding to the syntactic head of the argument.

Clustering of argument keys:
  Factored model:
    for each predicate p = 1, 2, . . . :
      B_p ~ CRP(α)  [partition of arg keys]
  Coupled model:
    D ~ NonInform  [similarity graph]
    for each predicate p = 1, 2, . . . :
      B_p ~ dd-CRP(α, D)  [partition of arg keys]

Parameters:
  for each predicate p = 1, 2, . . . :
    for each role r ∈ B_p:
      θ_{p,r} ~ DP(β, H^(A))  [distrib of arg fillers]
      ψ_{p,r} ~ Beta(η_0, η_1)  [geom distr for dup roles]

Data Generation:
  for each predicate p = 1, 2, . . . :
    for each occurrence l of p:
      for every role r ∈ B_p:
        if [n ~ Unif(0, 1)] = 1:  [role appears at least once]
          GenArgument(p, r)  [draw one arg]
          while [n ~ ψ_{p,r}] = 1:  [continue generation]
            GenArgument(p, r)  [draw more args]

GenArgument(p, r):
  k_{p,r} ~ Unif(1, . . . , |r|)  [draw arg key]
  x_{p,r} ~ θ_{p,r}  [draw arg filler]

Figure 1: Generative stories for the factored and coupled models.

5 Coupled Model

As we argued in Section 1, clusterings of argument keys implicitly encode the pattern of alternations for a predicate. E.g., passivization can be roughly represented by clustering the key ACT:LEFT:SBJ with PASS:RIGHT:LGS->by and ACT:RIGHT:OBJ with PASS:LEFT:SBJ. The set of permissible alternations is predicate-specific,[6] but nevertheless they arguably represent a small subset of all clusterings of argument keys. Also, some alternations are more likely to be applicable to a verb than others: for example, the passivization and dativization alternations are both fairly frequent, whereas the locative-preposition-drop alternation (Mary climbed up the mountain vs. Mary climbed the mountain) is less common and applicable only to several classes of predicates representing motion (Levin, 1993). We represent this observation by quantifying how likely a pair of keys is to be clustered. These scores (d_{i,j} for every pair of argument keys i and j) are induced automatically within the model, and treated as latent variables shared across predicates. Intuitively, if the data for several predicates strongly suggests that two argument keys should be clustered (e.g., there is a large overlap between the argument fillers for the two keys), then the posterior will indicate that d_{i,j} is expected to be greater for the pair {i, j} than for some other pair {i', j'} for which the evidence is less clear. Consequently, argument keys i and j will be clustered even for predicates without strong evidence for such a clustering, whereas i' and j' will not.

[6] Or, at least, specific to a class of predicates (Levin, 1993).

One argument against coupling predicates may stem from the fact that we are using unlabeled data and may be able to obtain a sufficient amount of learning material even for less frequent predicates. This may be a valid observation, but another rationale for sharing this similarity structure is the hypothesis that alternations may be easier to detect for some predicates than for others. For example, argument key clustering for predicates with very restrictive selectional restrictions on argument fillers is presumably easier than clustering for predicates with less restrictive and overlapping selectional restrictions, as compactness of selectional preferences is a central assumption driving unsupervised learning of semantic roles. E.g., the predicates change and defrost belong to the same Levin class (change-of-state verbs) and therefore admit similar alternations. However, the set of potential patients of defrost is sufficiently restricted,
whereas the selectional restrictions for the patient of change are far less specific and overlap with the selectional restrictions for the agent role, further complicating the clustering induction task. This observation suggests that sharing clustering preferences across verbs is likely to help even if the unlabeled data is plentiful for every predicate.

More formally, we generate the scores d_{i,j}, or equivalently the full labeled graph D with vertices corresponding to argument keys and edges weighted with the similarity scores, from a prior. In our experiments we use a non-informative prior which factorizes over pairs (i.e. edges of the graph D), though more powerful alternatives can be considered. Then we use it, in a dd-CRP(α, D), to generate clusterings of argument keys for every predicate. The rest of the generative story is the same as for the factored model. The part relevant to this model is shown in the Coupled model section of Figure 1.

Note that this approach does not assume that the frequencies of the syntactic patterns corresponding to alternations are similar, and a large value for d_{i,j} does not necessarily mean that the corresponding syntactic frames i and j are very frequent in a corpus. What it indicates is that a large number of different predicates undergo the corresponding alternation; the frequency of the alternation is a different matter. We believe that this is an important point, as we do not make the restricting assumption that an alternation has the same distributional properties for all verbs which undergo it.

6 Inference

An inference algorithm for an unsupervised model should be efficient enough to handle vast amounts of unlabeled data, as such data can easily be obtained and is likely to improve results. We use a simple approximate inference algorithm based on greedy MAP search. We start by discussing MAP search for argument key clustering with the factored model, and then discuss its extension applicable to the coupled model.

6.1 Role Induction

For the factored model, semantic roles for every predicate are induced independently. Nevertheless, the search for a MAP clustering can be expensive, as even a move involving a single argument key implies some computations for all its occurrences in the corpus. Instead of more complex MAP search algorithms (see, e.g., Daume III (2007)), we use a greedy procedure where we start with each argument key assigned to an individual cluster, and then iteratively try to merge clusters. Each move involves (1) choosing an argument key and (2) deciding on a cluster to reassign it to. This is done by considering all clusters (including creating a new one) and choosing the most probable one.

Instead of choosing argument keys randomly at the first stage, we order them by corpus frequency. This ordering is beneficial, as getting the clustering right for frequent argument keys is more important, and the corresponding decisions should be made earlier.[7] We used a single iteration in our experiments, as we have not noticed any benefit from using multiple iterations.

[7] This idea has been explored before for shallow semantic representations (Lang and Lapata, 2011a; Titov and Klementiev, 2011).

6.2 Similarity Graph Induction

In the coupled model, clusterings for different predicates are statistically dependent, as the similarity structure D is latent and shared across predicates. Consequently, a more complex inference procedure is needed. For simplicity, here and in our experiments we use the non-informative prior distribution over D, which assigns the same prior probability to every possible weight d_{i,j} for every pair {i, j}.

Recall that the dd-CRP prior is defined in terms of customers choosing other customers to sit with. For the moment, let us assume that this relation among argument keys is known, that is, every argument key k for predicate p has chosen an argument key c_{p,k} to sit with. We can compute the MAP estimate for all d_{i,j} by maximizing the objective:

argmax_{d_{i,j}, i ≠ j}  Σ_p Σ_{k ∈ K^p} log ( d_{k,c_{p,k}} / Σ_{k' ∈ K^p} d_{k,k'} ),

where K^p is the set of all argument keys for the predicate p. We slightly abuse the notation by using d_{i,i} to denote the concentration parameter α in the previous expression. Note that we also assume that similarities are symmetric, d_{i,j} = d_{j,i}.
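The objective above is straightforward to evaluate. The following Python sketch is our own (all names are illustrative): `d` holds one symmetric score per unordered key pair, and d_{k,k} is replaced by the concentration parameter α, as in the text.

```python
import math

def graph_objective(d, partners, keysets, alpha):
    """Sum over predicates p and argument keys k of
    log( d[k, c_{p,k}] / sum over k' in K^p of d[k, k'] ), with d[k, k] = alpha.

    d:        dict mapping frozenset({i, j}) -> similarity score d_{i,j}
    partners: partners[p][k] = the key that k sits with under predicate p
    keysets:  keysets[p] = list of argument keys K^p observed with p
    """
    total = 0.0
    for p, keys in keysets.items():
        for k in keys:
            c = partners[p][k]
            num = alpha if c == k else d[frozenset((k, c))]
            den = sum(alpha if k2 == k else d[frozenset((k, k2))]
                      for k2 in keys)
            total += math.log(num / den)
    return total
```

As the surrounding text explains, the paper maximizes this objective over the scores d_{i,j} with gradient descent; the sketch only computes the value for a fixed sit-with assignment.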
If the set of argument keys K^p were the same for every predicate, then the optimal d_{i,j} would be proportional to the number of times either i selects j as a partner, or j chooses i as a partner.[8] This no longer holds if the sets are different, but the solution can be found efficiently using a numeric optimization strategy; we use the gradient descent algorithm.

[8] Note that the weights d_{i,j} are invariant under rescaling when the rescaling is also applied to the concentration parameter α.

We do not learn the concentration parameter α, as it is used in our model to indicate the desired granularity of semantic roles, but instead only learn the d_{i,j} (i ≠ j). However, just learning the concentration parameter would not be sufficient, as the effective concentration can be reduced or increased arbitrarily by scaling all the similarities d_{i,j} (i ≠ j) at once, as follows from expression (1). Instead, we enforce a normalization constraint on the similarities d_{i,j}. We ensure that the prior probability of choosing itself as a partner, averaged over predicates, is the same as it would be with uniform d_{i,j} (d_{i,j} = 1 for every key pair {i, j}, i ≠ j). This roughly says that we want to preserve the same granularity of clustering as with the uniform similarities. We accomplish this normalization in a post-hoc fashion by dividing the weights after optimization by

Σ_p Σ_{k,k' ∈ K^p, k' ≠ k} d_{k,k'} / Σ_p |K^p|(|K^p| − 1).

If D is fixed, partners for every predicate p and every k can be found using virtually the same algorithm as in Section 6.1: the only difference is that, instead of a cluster, each argument key iteratively chooses a partner.

Though, in practice, both the choice of partners and the similarity graph are latent, we can use an iterative approach to obtain a joint MAP estimate of c_k (for every k) and the similarity graph D by alternating the two steps.[9] Notice that the resulting algorithm is again highly parallelizable: the graph induction stage is fast, and induction of the seat-with relation (i.e. clustering argument keys) is factorizable over predicates.

[9] In practice, two iterations were sufficient.

One shortcoming of this approach is typical for generative models with multiple features: when such a model predicts a latent variable, it tends to ignore the prior class distribution and rely solely on the features. This behavior is due to the over-simplifying independence assumptions. It is well known, for instance, that the posterior with Naive Bayes tends to be overconfident due to violated conditional independence assumptions (Rennie, 2001). The same behavior is observed here: the shared prior does not have sufficient effect on frequent predicates.[10] Though different techniques have been developed to discount this over-confidence (Kolcz and Chowdhury, 2005), we use the most basic one: we raise the likelihood term to the power 1/T, where the parameter T is chosen empirically.

[10] The coupled model without discounting still outperforms the factored counterpart in our experiments.

7 Empirical Evaluation

7.1 Data and Evaluation

We keep the general setup of Lang and Lapata (2011a) to evaluate our models and compare them to the current state of the art. We run all of our experiments on the standard CoNLL 2008 shared task (Surdeanu et al., 2008) version of the Penn Treebank WSJ and PropBank. In addition to gold dependency analyses and gold PropBank annotations, it has dependency structures generated automatically by the MaltParser (Nivre et al., 2007). We vary our experimental setup as follows:

- We evaluate our models on gold and automatically generated parses, and use either gold PropBank annotations or the heuristic from Section 2 to identify arguments, resulting in four experimental regimes.

- In order to reduce the sparsity of predicate argument fillers, we consider replacing the lemmas of their syntactic heads with word clusters induced by a clustering algorithm as a preprocessing step. In particular, we use Brown (Br) clustering (Brown et al., 1992) induced over the RCV1 corpus (Turian et al., 2010). Although the clustering is hierarchical, we only use the cluster at the lowest level of the hierarchy for each word.

We use the purity (PU) and collocation (CO) metrics, as well as their harmonic mean (F1), to measure the quality of the resulting clusters. Purity measures the degree to which each cluster contains arguments sharing the same gold role:

PU = (1/N) Σ_i max_j |G_j ∩ C_i|,

where C_i is the set of arguments in the i-th induced cluster, G_j is the set of arguments in the j-th gold cluster, and N is the total number of arguments.
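The purity metric above, together with the collocation metric and their harmonic mean used throughout Section 7, can be sketched as follows. This is our own implementation of the standard formulas; the cluster ids and role labels in the usage are illustrative.

```python
from collections import Counter

def purity_collocation_f1(induced, gold):
    """Cluster purity (PU), collocation (CO) and their harmonic mean (F1).

    induced, gold: parallel lists giving, for each argument, its induced
    cluster id and its gold role label, respectively.
    """
    n = len(induced)
    # overlap[(c, g)] = |C_c intersect G_g|
    overlap = Counter(zip(induced, gold))
    clusters = set(induced)
    roles = set(gold)
    pu = sum(max(overlap[(c, g)] for g in roles) for c in clusters) / n
    co = sum(max(overlap[(c, g)] for c in clusters) for g in roles) / n
    f1 = 2 * pu * co / (pu + co)
    return pu, co, f1
```

Aggregating over predicates, as in the paper, would additionally weight each predicate's scores by its number of argument occurrences.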
Collocation evaluates the degree to which arguments with the same gold roles are assigned to a single cluster. It is computed as follows:

CO = (1/N) Σ_j max_i |G_j ∩ C_i|.

We compute the aggregate PU, CO, and F1 scores over all predicates in the same way as Lang and Lapata (2011a), by weighting the scores of each predicate by the number of its argument occurrences. Note that since our goal is to evaluate the clustering algorithms, we do not include incorrectly identified arguments (i.e. mistakes made by the heuristic defined in Section 2) when computing these metrics.

We evaluate both the factored and coupled models proposed in this work, with and without Brown word clustering of argument fillers (Factored, Coupled, Factored+Br, Coupled+Br). Our models are robust to parameter settings; they were tuned (to an order of magnitude) on the development set and were the same for all model variants: α = 1.e-3, β = 1.e-3, η_0 = 1.e-3, η_1 = 1.e-10, T = 5. Although they can be induced within the model, we set them by hand to indicate granularity preferences. We compare our results with the following alternative approaches. The syntactic function baseline (SyntF) simply clusters predicate arguments according to the dependency relation to their head. Following Lang and Lapata (2010), we allocate a cluster for each of the 20 most frequent relations in the CoNLL dataset and one cluster for all other relations. We also compare our performance with the Latent Logistic classification (Lang and Lapata, 2010), Split-Merge clustering (Lang and Lapata, 2011a), and Graph Partitioning (Lang and Lapata, 2011b) approaches (labeled LLogistic, SplitMerge, and GraphPart, respectively), which achieve the current best unsupervised SRL results in this setting.

7.2 Results

7.2.1 Gold Arguments

Experimental results are summarized in Table 1. We begin by comparing our models to the three existing clustering approaches on gold syntactic parses, using gold PropBank annotations to identify predicate arguments. In this set of experiments we measure the relative performance of argument clustering, removing the identification stage, and minimize the noise due to automatic syntactic annotations.

                 gold parses         auto parses
                 PU    CO    F1      PU    CO    F1
  LLogistic      79.5  76.5  78.0    77.9  74.4  76.2
  SplitMerge     88.7  73.0  80.1    86.5  69.8  77.3
  GraphPart      88.6  70.7  78.6    87.4  65.9  75.2
  Factored       88.1  77.1  82.2    85.1  71.8  77.9
  Coupled        89.3  76.6  82.5    86.7  71.2  78.2
  Factored+Br    86.8  78.8  82.6    83.8  74.1  78.6
  Coupled+Br     88.7  78.1  83.0    86.2  72.7  78.8
  SyntF          81.6  77.5  79.5    77.1  70.9  73.9

Table 1: Argument clustering performance with gold argument identification. Bold-face is used to highlight the best F1 scores.

All four variants of the models we propose substantially outperform the other models: the coupled model with Brown clustering of argument fillers (Coupled+Br) beats the previous best model, SplitMerge, by 2.9% F1 score. As mentioned in Section 2, our approach specifically does not cluster some of the modifier arguments. In order to verify that this and argument filler clustering were not the only aspects of our approach contributing to the performance improvements, we also evaluated our coupled model without Brown clustering and treating modifiers as regular arguments. This model achieves 89.2% purity, 74.0% collocation, and 80.9% F1 scores, still substantially outperforming all of the alternative approaches. Replacing gold parses with MaltParser analyses, we see a similar trend, where Coupled+Br outperforms the best alternative approach, SplitMerge, by 1.5%.

7.2.2 Automatic Arguments

Results are summarized in Table 2.[11]

[11] Note that the scores are computed on correctly identified arguments only, and tend to be higher in these experiments, probably because the complex arguments get discarded by the heuristic.

The precision and recall of our re-implementation of the argument identification heuristic described in Section 2 on gold parses were 87.7% and 88.0%, respectively, and do not quite match the 88.1% and 87.9% reported in (Lang and Lapata, 2011a). Since we could not reproduce their argument identification stage exactly, we omit their results for these two regimes, instead including the results for our two best models, Factored+Br and Coupled+Br. We see a similar trend, where the coupled system consistently outperforms its factored counterpart, achieving 85.8% and 83.9% F1
for gold and MaltParser analyses, respectively.

                 gold parses         auto parses
                 PU    CO    F1      PU    CO    F1
  Factored+Br    87.8  82.9  85.3    85.8  81.1  83.4
  Coupled+Br     89.2  82.6  85.8    87.4  80.7  83.9
  SyntF          83.5  81.4  82.4    81.4  79.1  80.2

Table 2: Argument clustering performance with automatic argument identification.

We observe that, consistently through the four regimes, the sharing of alternations between predicates captured by the coupled model outperforms the factored version, and that reducing the argument filler sparsity with clustering also has a substantial positive effect. Due to space constraints we are not able to present a detailed analysis of the induced similarity graph D; however, the argument-key pairs with the highest induced similarity encode, among other things, passivization, benefactive alternations, near-interchangeability of some subordinating conjunctions and prepositions (e.g., if and whether), as well as restoring some of the unnecessary splits introduced by the argument key definition (e.g., semantic roles for adverbials do not normally depend on whether the construction is passive or active).

8 Related Work

Most SRL research has focused on the supervised setting (Carreras and Marquez, 2005; Surdeanu et al., 2008); however, the lack of annotated resources for most languages and the insufficient coverage provided by the existing resources motivate the need for using unlabeled data or other forms of weak supervision. This work includes methods based on graph alignment between labeled and unlabeled data (Furstenau and Lapata, 2009), using unlabeled data to improve lexical generalization (Deschacht and Moens, 2009), and projection of annotation across languages (Pado and Lapata, 2009; van der Plas et al., 2011). Semi-supervised and weakly-supervised techniques have also been explored for other types of semantic representations, but these studies have mostly focused on restricted domains (Kate and Mooney, 2007; Liang et al., 2009; Titov and Kozhevnikov, 2010; Goldwasser et al., 2011; Liang et al., 2011).

Unsupervised learning has been one of the central paradigms for the closely related area of relation extraction, where several techniques have been proposed to cluster semantically similar verbalizations of relations (Lin and Pantel, 2001; Banko et al., 2007). Early unsupervised approaches to the SRL problem include the work by Swier and Stevenson (2004), where the VerbNet verb lexicon was used to guide unsupervised learning, and the generative model of Grenager and Manning (2006), which exploits linguistic priors on the syntactic-semantic interface. More recently, the role induction problem has been studied in Lang and Lapata (2010), where it was reformulated as a problem of detecting alternations and mapping non-standard linkings to the canonical ones. Later, Lang and Lapata (2011a) proposed an algorithmic approach to clustering argument signatures which achieves higher accuracy and outperforms the syntactic baseline. In Lang and Lapata (2011b), the role induction problem is formulated as a graph partitioning problem: each vertex in the graph corresponds to a predicate occurrence, and edges represent lexical and syntactic similarities between the occurrences. Unsupervised induction of semantics has also been studied in Poon and Domingos (2009) and Titov and Klementiev (2010), but the induced representations are not entirely compatible with PropBank-style annotations, and they have been evaluated only on a question answering task for the biomedical domain. Also, the related task of unsupervised argument identification was considered in Abend et al. (2009).

9 Conclusions

In this work we introduced two Bayesian models for unsupervised role induction. They treat the task as a family of related clustering problems, one for each predicate. The first, factored, model induces each clustering independently, whereas the second model couples them by exploiting a novel technique for sharing clustering preferences across a family of clusterings. Both methods achieve state-of-the-art results, with the coupled model outperforming the factored counterpart in all regimes.

Acknowledgements

The authors acknowledge the support of the MMCI Cluster of Excellence, and thank Hagen Furstenau, Mikhail Kozhevnikov, Alexis Palmer, Manfred Pinkal, Caroline Sporleder and the anonymous reviewers for their suggestions, and Joel Lang for answering questions about their methods and data.
References

Omri Abend, Roi Reichart, and Ari Rappoport. 2009. Unsupervised argument identification for semantic role labeling. In ACL-IJCNLP.
Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In IJCAI.
Roberto Basili, Diego De Cao, Danilo Croce, Bonaventura Coppola, and Alessandro Moschitti. 2009. Cross-language frame semantics transfer in bilingual corpora. In CICLING.
David M. Blei and Peter Frazier. 2011. Distance dependent Chinese restaurant processes. Journal of Machine Learning Research, 12:2461-2488.
Peter F. Brown, Vincent Della Pietra, Peter V. deSouza, Jenifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models for natural language. Computational Linguistics, 18(4):467-479.
Xavier Carreras and Lluis Marquez. 2005. Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling. In CoNLL.
Hal Daume III. 2007. Fast search for Dirichlet process mixture models. In AISTATS.
Koen Deschacht and Marie-Francine Moens. 2009. Semi-supervised semantic role labeling using the Latent Words Language Model. In EMNLP.
Jason Duan, Michele Guindani, and Alan Gelfand. 2007. Generalized spatial Dirichlet process models. Biometrika, 94:809-825.
Thomas S. Ferguson. 1973. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209-230.
Hagen Furstenau and Mirella Lapata. 2009. Graph alignment for semi-supervised semantic role labeling. In EMNLP.
Qin Gao and Stephan Vogel. 2011. Corpus expansion for statistical machine translation with semantic role label substitution rules. In ACL:HLT.
Daniel Gildea and Daniel Jurafsky. 2002. Automatic labelling of semantic roles. Computational Linguistics, 28(3):245-288.
Dan Goldwasser, Roi Reichart, James Clarke, and Dan Roth. 2011. Confidence driven unsupervised semantic parsing. In ACL.
Michael Kaisser and Bonnie Webber. 2007. Question answering based on semantic roles. In ACL Workshop on Deep Linguistic Processing.
Rohit J. Kate and Raymond J. Mooney. 2007. Learning language semantics from ambiguous supervision. In AAAI.
Aleksander Kolcz and Abdur Chowdhury. 2005. Discounting over-confidence of naive Bayes in high-recall text classification. In ECML.
Joel Lang and Mirella Lapata. 2010. Unsupervised induction of semantic roles. In ACL.
Joel Lang and Mirella Lapata. 2011a. Unsupervised semantic role induction via split-merge clustering. In ACL.
Joel Lang and Mirella Lapata. 2011b. Unsupervised semantic role induction with graph partitioning. In EMNLP.
Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.
Percy Liang, Michael I. Jordan, and Dan Klein. 2009. Learning semantic correspondences with less supervision. In ACL-IJCNLP.
Percy Liang, Michael Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In ACL:HLT.
Dekang Lin and Patrick Pantel. 2001. DIRT - discovery of inference rules from text. In KDD.
Ding Liu and Daniel Gildea. 2010. Semantic role features for machine translation. In Coling.
J. Nivre, J. Hall, S. Kubler, R. McDonald, J. Nilsson, S. Riedel, and D. Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In EMNLP-CoNLL.
Sebastian Pado and Mirella Lapata. 2009. Cross-lingual annotation projection for semantic roles. Journal of Artificial Intelligence Research, 36:307-340.
Alexis Palmer and Caroline Sporleder. 2010. Evaluating FrameNet-style semantic parsing: the role of coverage gaps in FrameNet. In COLING.
M. Palmer, D. Gildea, and P. Kingsbury. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71-106.
Hoifung Poon and Pedro Domingos. 2009. Unsupervised semantic parsing. In EMNLP.
Trond Grenager and Christoph Manning. 2006. Unsu-
pervised discovery of a statistical verb lexicon. In Sameer Pradhan, Wayne Ward, and James H. Martin.
EMNLP. 2008. Towards robust semantic role labeling. Com-
Jan Hajic, Massimiliano Ciaramita, Richard Johans- putational Linguistics, 34:289310.
son, Daisuke Kawahara, Maria Antonia Mart, Llus Jason Rennie. 2001. Improving multi-class text
Marquez, Adam Meyers, Joakim Nivre, Sebastian classification with Naive bayes. Technical Report
Pado, Jan Stepanek, Pavel Stranak, Mihai Surdeanu, AITR-2001-004, MIT.
Nianwen Xue, and Yi Zhang. 2009. The CoNLL- M. Sammons, V. Vydiswaran, T. Vieira, N. Johri,
2009 shared task: Syntactic and semantic depen- M. Chang, D. Goldwasser, V. Srikumar, G. Kundu,
dencies in multiple languages. In Proceedings Y. Tu, K. Small, J. Rule, Q. Do, and D. Roth. 2009.
of the 13th Conference on Computational Natural Relation alignment for textual entailment recogni-
Language Learning (CoNLL-2009), June 4-5. tion. In Text Analysis Conference (TAC).
21
Dan Shen and Mirella Lapata. 2007. Using semantic
roles to improve question answering. In EMNLP.
Richard Socher, Andrew Maas, and Christopher Man-
ning. 2011. Spectral chinese restaurant processes:
Nonparametric clustering based on similarities. In
AISTATS.
Mihai Surdeanu, Adam Meyers Richard Johansson,
Llus Marquez, and Joakim Nivre. 2008. The
CoNLL-2008 shared task on joint parsing of syn-
tactic and semantic dependencies. In CoNLL 2008:
Shared Task.
Richard Swier and Suzanne Stevenson. 2004. Unsu-
pervised semantic role labelling. In EMNLP.
Yee Whye Teh. 2010. Dirichlet processes. In Ency-
clopedia of Machine Learning. Springer.
Ivan Titov and Alexandre Klementiev. 2011. A
Bayesian model for unsupervised semantic parsing.
In ACL.
Ivan Titov and Mikhail Kozhevnikov. 2010.
Bootstrapping semantic analyzers from non-
contradictory texts. In ACL.
Joseph Turian, Lev Ratinov, and Yoshua Bengio.
2010. Word representations: A simple and general
method for semi-supervised learning. In ACL.
Lonneke van der Plas, Paola Merlo, and James Hen-
derson. 2011. Scaling up automatic cross-lingual
semantic role annotation. In ACL.
Dekai Wu and Pascale Fung. 2009. Semantic roles for
SMT: A hybrid two-pass model. In NAACL.
Dekai Wu, Marianna Apidianaki, Marine Carpuat, and
Lucia Specia, editors. 2011. Proc. of Fifth Work-
shop on Syntax, Semantics and Structure in Statisti-
cal Translation. ACL.
22
Entailment above the word level in distributional semantics
23
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 2332,
Avignon, France, April 23 - 27 2012.
2012
c Association for Computational Linguistics
…entails animal). With almost no manual effort, we achieve performance nearly identical with the state-of-the-art balAPinc measure that Kotlerman et al. (2010) crafted, which detects feature inclusion between the two nouns' occurrence contexts.

Our second experiment goes beyond lexical inference. We look at phrases built from a quantifying determiner[1] and a noun (QNs) and use their distributional vectors to recognize entailment relations of the form many dogs |= some dogs, between two QNs sharing the same noun. It turns out that a classifier trained on a set of Q1 N |= Q2 N pairs can recognize entailment in pairs with a new quantifier configuration. For example, we can train on many dogs |= some dogs and then correctly predict all cats |= several cats. Interestingly, on the QN entailment task, neither our classifier trained on AN-N pairs nor the balAPinc method beat baseline methods. This suggests that our successful QN classifiers tap into vector properties beyond such relations as feature inclusion that those methods for nominal entailment rely upon.

[1] In the sequel we will simply refer to a quantifying determiner as a quantifier.

Together, our experiments show that corpus-harvested DS representations of composite expressions such as ANs and QNs contain sufficient information to capture and generalize their inference patterns. This result brings DS closer to the central concerns of FS. In particular, the QN study is the first to our knowledge to show that DS vectors capture semantic properties not only of content words, but of an important class of function words (quantifying determiners) deeply studied in FS but of little interest until now in DS.

Besides these theoretical implications, our results are of practical import. First, our AN study presents a novel, practical method for detecting lexical entailment that reaches state-of-the-art performance with little or no manual intervention. Lexical entailment is in turn fundamental for constructing ontologies and other lexical resources (Buitelaar and Cimiano, 2008). Second, our QN study demonstrates that phrasal entailment can be automatically detected and thus paves the way to apply DS to advanced NLP tasks such as recognizing textual entailment (Dagan et al., 2009).

2 Background

2.1 Distributional semantics above the word level

DS models such as LSA (Landauer and Dumais, 1997) and HAL (Lund and Burgess, 1996) approximate the meaning of a word by a vector that summarizes its distribution in a corpus, for example by counting co-occurrences of the word with other words. Since semantically similar words tend to share similar contexts, DS has been very successful in tasks that require quantifying semantic similarity among words, such as synonym detection and concept clustering (Turney and Pantel, 2010).

Recently, there has been a flurry of interest in DS to model meaning composition: How can we derive the DS representation of a composite phrase from that of its constituents? Although the general focus in the area is to perform algebraic operations on word semantic vectors (Mitchell and Lapata, 2010), some researchers have also directly examined the corpus contexts of phrases. For example, Baldwin et al. (2003) studied vector extraction for phrases because they were interested in the decomposability of multiword expressions. Baroni and Zamparelli (2010) and Guevara (2010) look at corpus-harvested phrase vectors to learn composition functions that should derive such composite vectors automatically. Baroni and Zamparelli, in particular, showed qualitatively that directly corpus-harvested vectors for AN constructions are meaningful; for example, the vector of young husband has nearest neighbors small son, small daughter and mistress. Following up on this approach, we show here quantitatively that corpus-harvested AN vectors are also useful for detecting entailment. We find moreover distributional vectors informative and useful not only for phrases made of content words (such as ANs) but also for phrases containing functional elements, namely quantifying determiners.

2.2 Entailment from formal to distributional semantics

Entailment in FS  To characterize the conditions under which a sentence is true, FS begins with the lexical meanings of the words in the sentence and builds up the meanings of larger and larger phrases until it arrives at the meaning of the whole sentence.
The meanings throughout this compositional process inhabit a variety of semantic domains, depending on the syntactic category of the expressions: typically, a sentence denotes a truth value (true or false) or truth conditions, a noun such as cat denotes a set of entities, and a quantifier phrase (QP) such as all cats denotes a set of sets of entities.

The entailment relation (|=) is a core notion of logic: it holds between one or more sentences and a sentence such that it cannot be that the former (antecedent) are true and the latter (consequent) is false. FS extends this notion from formal-logic sentences to natural-language expressions. By assigning meanings to parts of a sentence, FS allows defining entailment not only among sentences but also among words and phrases. Each semantic domain A has its own entailment relation |=A. The entailment relation |=S among sentences is the logical notion just described, whereas the entailment relations |=N and |=QP among nouns and quantifier phrases are the inclusion relations among sets of entities and sets of sets of entities respectively. Our results in Section 5 show that DS needs to treat |=N and |=QP differently as well.

Empirical, corpus-based perspectives on entailment  Until recently, the corpus-based research tradition has studied entailment mostly at the word level, with applied goals such as classifying lexical relations and building taxonomic WordNet-like resources automatically. The most popular approach, first adopted by Hearst (1992), extracts lexical relations from patterns in large corpora. For instance, from the pattern N1 such as N2 one learns that N2 |= N1 (from insects such as beetles, derive beetles |= insects). Several studies have refined and extended this approach (Pantel and Ravichandran, 2004; Snow et al., 2005; Snow et al., 2006; Turney, 2008).

While empirically very successful, the pattern-based method is mostly limited to single content words (or frequent content-word phrases). We are interested in entailment between phrases, where it is not obvious how to use lexico-syntactic patterns and cope with data sparsity. For instance, it seems hard to find a pattern that frequently connects one QP to another it entails, as in all beetles PATTERN many beetles. Hence, we aim to find a more general method and investigate whether DS vectors (whether corpus-harvested or compositionally derived) encode the information needed to account for phrasal entailment in a way that can be captured and generalized to unseen phrase pairs.

Rather recently, the study of sentential entailment has taken an empirical turn, thanks to the development of benchmarks for entailment systems. The FS definition of entailment has been modified by taking common sense into account. Instead of a relation from the truth of the antecedent to the truth of the consequent in any circumstance, the applied view looks at entailment in terms of plausibility: p |= q if a human who reads (and trusts) p would most likely infer that q is also true. Entailment systems have been compared under this new perspective in various evaluation campaigns, the best known being the Recognizing Textual Entailment (RTE) initiative (Dagan et al., 2009). Most RTE systems are based on advanced NLP components, machine learning techniques, and/or syntactic transformations (Zanzotto et al., 2007; Kouleykov and Magnini, 2005). A few systems exploit deep FS analysis (Bos and Markert, 2006; Chambers et al., 2007). In particular, the FS results about QP properties that affect entailment have been exploited by Chambers et al., who complement a core broad-coverage system with a Natural Logic module to trade lower recall for higher precision. For instance, they exploit the monotonicity properties of no that cause the following reversal in entailment direction: some beetles |= some insects but no insects |= no beetles.

To investigate entailment step by step, we address here a much simpler and clearer type of entailment than the more complex notion taken up by the RTE community. While RTE is outside our present scope, we do focus on QP entailment as Natural Logic does. However, our evaluation differs from Chambers et al.'s, since we rely on general-purpose DS vectors as our only resource, and we look at phrase pairs with different quantifiers but the same noun. For instance, we aim to predict that all beetles |= many beetles but few beetles ⊭ all beetles. QPs, of course, have many well-known semantic properties besides entailment; we leave their analysis to future study.
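Concretely, the quantifier-level judgments we target can be stated compactly: whether Q1 N |= Q2 N holds depends only on the quantifier pair, whatever the shared noun. A minimal sketch of this task format (the two judgments encoded are the beetle examples just given; the dict and function names are illustrative, not from any released code):

```python
# Gold labels for QP entailment depend only on the quantifier pair,
# not on the shared noun. The two judgments below are the beetle
# examples from the text; names and structure are illustrative only.
QUANTIFIER_ENTAILS = {
    ("all", "many"): True,    # all N |= many N
    ("few", "all"): False,    # few N does not entail all N
}

def qp_entails(q1: str, q2: str, noun: str) -> bool:
    """Label for 'q1 noun |= q2 noun'; the noun plays no role by design."""
    return QUANTIFIER_ENTAILS[(q1, q2)]
```

The inert noun argument is exactly the invariance that the classifiers of Section 5 are asked to recover from distributional vectors alone.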
Entailment in DS  Erk (2009) suggests that it may not be possible to induce lexical entailment directly from a vector space representation, but it is possible to encode the relation in this space after it has been derived through other means. On the other hand, recent studies (Geffet and Dagan, 2005; Kotlerman et al., 2010; Weeds et al., 2004) have pursued the intuition that entailment is the asymmetric ability of one term to substitute for another. For example, baseball contexts are also sport contexts but not vice versa, hence baseball is narrower than sport and baseball |= sport. On this view, entailment between vectors corresponds to inclusion of contexts or features, and can be captured by asymmetric measures of distributional similarity. In particular, Kotlerman et al. (2010) carefully crafted the balAPinc measure (see Section 3.5 below). We adopt this measure because it has been shown to outperform others in several tasks that require lexical entailment information.

Like Kotlerman et al., we want to capture the entailment relation between vectors of features. However, we are interested in entailment not only between words but also between phrases, and we ask whether the DS view of entailment as feature inclusion, which captures entailment between nouns, also captures entailment between QPs. To this end, we complement balAPinc with a more flexible supervised classifier.
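The substitutability intuition behind such asymmetric measures can be made concrete with a deliberately simplified score (our own toy simplification, not balAPinc itself): the fraction of u's positive feature mass that falls on features also active for v.

```python
# Deliberately simplified asymmetric inclusion score (NOT balAPinc):
# the share of u's positive feature mass that falls on features that
# are also positive for v. High u->v coverage with low v->u coverage
# suggests u |= v, as with baseball and sport.
def inclusion(u: dict, v: dict) -> float:
    positive = {f: w for f, w in u.items() if w > 0}
    if not positive:
        return 0.0
    covered = sum(w for f, w in positive.items() if v.get(f, 0) > 0)
    return covered / sum(positive.values())

# Toy positive-PMI vectors (made-up features and weights):
baseball = {"bat": 2.0, "inning": 1.5, "stadium": 1.0}
sport = {"bat": 0.5, "inning": 0.2, "stadium": 2.0, "goal": 1.8, "race": 1.0}
```

On these toy vectors every baseball context is a sport context but not vice versa, so the score is asymmetric in the expected direction.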
3 Data and methods

3.1 Semantic space

We construct distributional semantic vectors from the 2.83-billion-token concatenation of the British National Corpus (http://www.natcorp.ox.ac.uk/), WackyPedia and ukWaC (http://wacky.sslmit.unibo.it/). We tokenize and POS-tag this corpus, then lemmatize it with TreeTagger (Schmid, 1995) to merge singular and plural instances of words and phrases (some dogs is mapped to some dog).

We process the corpus in two steps to compute semantic vectors representing our phrases of interest. We use "phrases of interest" as a general term to refer to both multiword phrases and single words, and more precisely to: those AN and QN sequences that are in the data sets (see next subsections), the adjectives, quantifiers and nouns contained in those sequences, and the most frequent (9.8K) nouns and (8.1K) adjectives in the corpus. The first step is to count the content words (more precisely, the most frequent 9.8K nouns, 8.1K adjectives, and 9.6K verbs in the corpus) that occur in the same sentence as phrases of interest. In the second step, following standard practice, the co-occurrence counts are converted into pointwise mutual information (PMI) scores (Church and Hanks, 1990). The result of this step is a sparse matrix (with both positive and negative entries) with 48K rows (one per phrase of interest) and 27K columns (one per content word).

3.2 The AN |= N data set

To characterize entailment between nouns using their semantic vectors, we need data exemplifying which noun entails which. This section introduces one cheap way to collect such a training data set exploiting semantic vectors for composed expressions, namely AN sequences. We rely on the linguistic fact that ANs share a syntactic category and semantic type with plain common nouns (big cat shares syntactic category and semantic type with cat). Furthermore, most adjectives are restrictive in the sense that, for every noun N, the AN sequence entails the N alone (every big cat is a cat). From a distributional point of view, the vector for an N should by construction include the information in the vector for an AN, given that the contexts where the AN occurs are a subset of the contexts where the N occurs (cat occurs in all the contexts where big cat occurs). This ideal inclusion suggests that the DS notion of lexical entailment as feature inclusion (see Section 2.2 above) should be reflected in the AN |= N pattern.

Because most ANs entail their head Ns, we can create positive examples of AN |= N without any manual inspection of the corpus: simply pair up the semantic vectors of ANs and Ns. Furthermore, because an AN usually does not entail another N, we can create negative examples (AN1 ⊭ N2) just by randomly permuting the Ns. Of course, such unsupervised data would be slightly noisy, especially because some of the most frequent adjectives are not restrictive.

To collect cleaner data and to be sure that we are really examining the phenomenon of entailment, we took a mere few moments of manual effort to select the 256 restrictive adjectives from the most frequent 300 adjectives in the corpus. We then took the Cartesian product of these 256 adjectives with the 200 concrete nouns in the BLESS data set (Baroni and Lenci, 2011). Those nouns were chosen to avoid highly polysemous words. From the Cartesian product, we obtain a total of 1246 AN sequences, such as big cat, that occur more than 100 times in the corpus. These AN sequences encompass 190 of the 256 adjectives and 128 of the 200 nouns.

The process results in 1246 positive instances of AN |= N entailment, which we use as training data. To create a comparable amount of negative data, we randomly permuted the nouns in the positive instances to obtain pairs of AN1 ⊭ N2 (e.g., big cat ⊭ dog). We manually double-checked that all positive and negative examples are correctly classified (2 of 1246 negative instances were removed, leaving 1244 negative training examples).

3.3 The lexical entailment N1 |= N2 data set

For testing data, we first listed all WordNet nouns in our corpus, then extracted hyponym-hypernym chains linking the first synsets of these nouns. For example, pope is found to entail leader because WordNet contains the chain pope → spiritual leader → leader. Eliminating the 20 hypernyms with more than 180 hyponyms (mostly very abstract nouns such as entity, object, and quality) yields 9734 hyponym-hypernym pairs, encompassing 6402 nouns. Manually double-checking these pairs leaves us with 1385 positive instances of N1 |= N2 entailment.

We again created 1385 negative instance pairs by inverting 33% of the positive instances (from pope |= leader to leader ⊭ pope) and by randomly shuffling the words across the positive instances. We also manually double-checked these pairs to make sure that they are not hyponym-hypernym pairs.

3.4 The Q1 N |= Q2 N data set

We study 12 quantifiers: all, both, each, either, every, few, many, most, much, no, several, some. We took the Cartesian product of these quantifiers with the 6402 WordNet nouns described in Section 3.3. From this Cartesian product, we obtain a total of 28926 QN sequences, such as every cat, that occur at least 100 times in the corpus. These are our QN phrases of interest to which the procedure in Section 3.1 assigns a semantic vector.

Also, from the set of quantifier pairs (Q1, Q2) where Q1 ≠ Q2, we identified 13 clear cases where Q1 |= Q2 and 17 clear cases where Q1 ⊭ Q2. These 30 cases are listed in the first column of Table 1. For each of these 30 quantifier pairs (Q1, Q2), we enumerate those WordNet nouns N such that semantic vectors are available for both Q1 N and Q2 N (that is, both sequences occur at least 100 times). Each such noun then gives rise to an instance of entailment (Q1 N |= Q2 N if Q1 |= Q2; example: many dogs |= several dogs) or non-entailment (Q1 N ⊭ Q2 N if Q1 ⊭ Q2; example: many dogs ⊭ most dogs). The number of QN pairs that each quantifier pair gives rise to in this way is listed in the second column of Table 1. As shown there, we have a total of 7537 positive instances and 8455 negative instances of QN entailment.

Quantifier pair      Instances   Correct
all |= some          1054        1044 (99%)
all |= several       557         550 (99%)
each |= some         656         647 (99%)
all |= many          873         772 (88%)
much |= some         248         217 (88%)
every |= many        460         400 (87%)
many |= some         951         822 (86%)
all |= most          465         393 (85%)
several |= some      580         439 (76%)
both |= some         573         322 (56%)
many |= several      594         113 (19%)
most |= many         463         84 (18%)
both |= either       63          1 (2%)
Subtotal             7537        5804 (77%)
some ⊭ every         484         481 (99%)
several ⊭ all        557         553 (99%)
several ⊭ every      378         375 (99%)
some ⊭ all           1054        1043 (99%)
many ⊭ every         460         452 (98%)
some ⊭ each          656         640 (98%)
few ⊭ all            157         153 (97%)
many ⊭ all           873         843 (97%)
both ⊭ most          369         347 (94%)
several ⊭ few        143         134 (94%)
both ⊭ many          541         397 (73%)
many ⊭ most          463         300 (65%)
either ⊭ both        63          39 (62%)
many ⊭ no            714         369 (52%)
some ⊭ many          951         468 (49%)
few ⊭ many           161         33 (20%)
both ⊭ several       431         63 (15%)
Subtotal             8455        6690 (79%)
Total                15992       12494 (78%)

Table 1: Entailing and non-entailing quantifier pairs with number of instances per pair (Section 3.4) and SVM_pair-out performance breakdown (Section 5).

3.5 Classification methods

We consider two methods to classify candidate pairs as entailing or non-entailing: the balAPinc measure of Kotlerman et al. (2010) and a standard Support Vector Machine (SVM) classifier.
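Both classifiers consume the same kind of labeled pairs, and the cheap pairing-and-permuting recipe of Section 3.2 that produces the training pairs can be sketched as follows (made-up phrases; function and variable names are ours, illustrative only):

```python
import random

# Sketch of the Section 3.2 recipe (illustrative phrases; the function
# and variable names are ours): a restrictive AN entails its head N,
# so (AN, head N) pairs are free positives; permuting the head nouns
# so that no AN keeps its own head yields negatives (AN1 does not
# entail N2).
def build_pairs(an_phrases):
    positives = [(an, an.split()[-1], True) for an in an_phrases]
    nouns = [head for _, head, _ in positives]
    rng = random.Random(0)  # fixed seed: reproducible toy permutation
    shuffled = nouns[:]
    while any(a == b for a, b in zip(nouns, shuffled)):
        rng.shuffle(shuffled)  # retry until no AN keeps its own head
    negatives = [(an, n2, False)
                 for (an, _, _), n2 in zip(positives, shuffled)]
    return positives + negatives

pairs = build_pairs(["big cat", "red car", "old dog"])
```

In the paper the items paired are semantic vectors rather than strings, and the resulting negatives are then manually double-checked; the sketch only shows the pairing logic.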
balAPinc  As discussed in Section 2.2, balAPinc is optimized to capture a relation of feature inclusion between the narrower (entailing) and broader (entailed) terms, while capturing other intuitions about the relative relevance of features. balAPinc averages two terms, APinc and LIN. APinc is given by:

    APinc(u |= v) = [ Σ_{r=1..|Fu|} P(r) · rel'(f_r) ] / |Fu|

APinc is a version of the Average Precision measure from Information Retrieval tailored to lexical inclusion. Given vectors Fu and Fv representing the dimensions with positive PMI values in the semantic vectors of the candidate pair u |= v, the idea is that we want the features (that is, vector dimensions) that have larger values in Fu to also have large values in Fv (the opposite does not matter because it is u that should be included in v, not vice versa). The Fu features are ranked according to their PMI value so that f_r is the feature in Fu with rank r, i.e., r-th highest PMI. Then the sum of the product of the two terms P(r) and rel'(f_r) across the features in Fu is computed. The first term is the precision at r, which is higher when highly ranked u features are present in Fv as well. The relevance term rel'(f_r) is higher when the feature f_r in Fu also appears in Fv with a high rank. (See Kotlerman et al. for how P(r) and rel'(f_r) are computed.) The resulting score is normalized by dividing by the entailing vector size |Fu| (in accordance with the idea that having more v features should not hurt, because the u features should be included in the v features, not vice versa).

To balance the potentially excessive asymmetry of APinc towards the features of the antecedent, Kotlerman et al. average it with LIN, the widely used symmetric measure of distributional similarity proposed by Lin (1998):

    LIN(u, v) = Σ_{f ∈ Fu ∩ Fv} [w_u(f) + w_v(f)] / ( Σ_{f ∈ Fu} w_u(f) + Σ_{f ∈ Fv} w_v(f) )

LIN essentially measures feature vector overlap. The positive PMI values w_u(f) and w_v(f) of a feature f in Fu and Fv are summed across those features that are positive in both vectors, normalizing by the cumulative positive PMI mass in both vectors. Finally, balAPinc is the geometric average of APinc and LIN:

    balAPinc(u |= v) = sqrt( APinc(u |= v) · LIN(u, v) )
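The three formulas above translate almost line by line into code. The sketch below follows our reading of the P(r) and rel'(f_r) definitions in Kotlerman et al. (2010) (precision at rank r, and a rank-discounted relevance term); it is an illustration of the measure's shape, not their released implementation:

```python
import math

# Sketch of balAPinc following the formulas above. P(r) and rel'(f)
# follow our reading of Kotlerman et al. (2010); this is an
# illustration, not their released implementation. A "vector" is a
# dict mapping feature -> positive PMI weight.
def apinc(u: dict, v: dict) -> float:
    fu = sorted(u, key=u.get, reverse=True)       # F_u ranked by PMI
    fv = sorted(v, key=v.get, reverse=True)       # F_v ranked by PMI
    rank_v = {f: i + 1 for i, f in enumerate(fv)}
    total, included = 0.0, 0
    for r, f in enumerate(fu, start=1):
        if f in rank_v:                           # rel'(f_r) = 0 otherwise
            included += 1
            p_at_r = included / r                 # precision at rank r
            rel = 1 - rank_v[f] / (len(fv) + 1)   # rank-discounted relevance
            total += p_at_r * rel
    return total / len(fu) if fu else 0.0

def lin(u: dict, v: dict) -> float:
    shared = set(u) & set(v)
    num = sum(u[f] + v[f] for f in shared)
    den = sum(u.values()) + sum(v.values())
    return num / den if den else 0.0

def balapinc(u: dict, v: dict) -> float:
    return math.sqrt(apinc(u, v) * lin(u, v))
```

The asymmetry is the point: a narrow term whose features are fully covered by a broader term scores higher in the u |= v direction than in the reverse.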
To adapt balAPinc to recognize entailment, we must select a threshold t above which we classify a pair as entailing. In the experiments below, we explore two approaches. In balAPinc_upper, we optimize the threshold directly on the test data, by setting t to maximize the F-measure on the test set. This gives us an upper bound on how well balAPinc could perform on the test set (but note that optimizing F does not necessarily translate into a good accuracy performance, as clearly illustrated by Table 3 below). In balAPinc_AN|=N, we use the AN |= N data set as training data and pick the t that maximizes F on this training set.

We use the balAPinc measure as a reference point because, on the evidence provided by Kotlerman et al., it is the state of the art in various tasks related to lexical entailment. We recognize however that it is somewhat complex and specifically tuned to capturing the relation of feature inclusion. Consequently, we also experiment with a more flexible classifier, which can detect other systematic properties of vectors in an entailment relation. We present this classifier next.

SVM  Support vector machines are widely used high-performance discriminative classifiers that find the hyperplane providing the best separation between negative and positive instances (Cristianini and Shawe-Taylor, 2000). Our SVM classifiers are trained and tested using Weka 3 and LIBSVM 2.8 (Chang and Lin, 2011). We use the default polynomial kernel ((u · v / 600)^3) with ε (tolerance of termination criterion) set to 1.6. This value was tuned on the AN |= N data set, which we never use for testing. In the same initial tuning experiments on the AN |= N data set, SVM outperformed decision trees, naive Bayes, and k-nearest neighbors.

We feed each potential entailment pair to SVM by concatenating the two vectors representing the antecedent and consequent expressions.[2] However, for efficiency and to mitigate data sparseness, we reduce the dimensionality of the semantic vectors to 300 columns using Singular Value Decomposition (SVD) before feeding them to the classifier.[3] Because the SVD-reduced semantic vectors occupy a 300-dimensional space, the entailment pairs occupy a 600-dimensional space.

[2] We have tried also to represent a pair by subtracting and by dividing the two vectors. The concatenation operation gave more successful results.

[3] To keep a manageable parameter space, we picked 300 columns without tuning. This is the best value reported in many earlier studies, including classic LSA. Since SVD sometimes improves the semantic space (Landauer and Dumais, 1997; Rapp, 2003; Schutze, 1997), we tried balAPinc on the SVD-reduced vectors as well, but results were consistently worse than with PMI vectors.

An SVM with a polynomial kernel takes into account not only individual input features but also their interactions (Manning et al., 2008, chapter 15). Thus, our classifier can capture not just properties of individual dimensions of the antecedent and consequent pairs, but also properties of their combinations (e.g., the product of the first dimensions of the antecedent and the consequent). We conjecture that this property of SVMs is fundamental to their success at detecting entailment, where relations between the antecedent and the consequent should matter more than their independent characteristics.

4 Predicting lexical entailment from AN |= N evidence

Since the contexts of AN must be a subset of the contexts of N, semantic vectors harvested from AN phrases and their head Ns are by construction in an inclusion relation. The first experiment shows that these vectors constitute excellent training data to discover entailment between nouns. This suggests that the vector pairs representing entailment between nouns are also in an inclusion relation, supporting the conjectures of Kotlerman et al. (2010) and others.

Table 2 reports the results we obtained with balAPinc_upper, balAPinc_AN|=N (Section 3.5) and SVM_AN|=N (the SVM classifier trained on the AN |= N data). As an upper bound for methods that generalize from AN |= N, we also report the performance of SVM trained with 10-fold cross-validation on the N1 |= N2 data themselves (SVM_upper). Finally, we tried two baseline classifiers. The first baseline (fq(N1) < fq(N2)) guesses entailment if the first word is less frequent than the second. The second (cos(N1, N2)) applies a threshold (determined on the test set) to the cosine similarity of the pair. The results of these baselines shown in Table 2 use SVD; those without SVD are similar. Both baselines outperformed more trivial methods such as random guessing or fixed response, but they performed significantly worse than SVM and balAPinc.

                     P     R     F     Accuracy (95% C.I.)
SVM_upper            88.6  88.6  88.5  88.6 (87.3–89.7)
balAPinc_AN|=N       65.2  87.5  74.7  70.4 (68.7–72.1)
balAPinc_upper       64.4  90.0  75.1  70.1 (68.4–71.8)
SVM_AN|=N            69.3  69.3  69.3  69.3 (67.6–71.0)
cos(N1, N2)          57.7  57.6  57.5  57.6 (55.8–59.5)
fq(N1) < fq(N2)      52.1  52.1  51.8  53.3 (51.4–55.2)

Table 2: Detecting lexical entailment. Results ranked by accuracy and expressed as percentages. 95% confidence intervals around accuracy calculated by binomial exact tests.

Both methods that generalize entailment from AN |= N to N1 |= N2 perform well, with 70% accuracy on the test set, which is balanced between positive and negative instances. Interestingly, the balAPinc decision thresholds tuned on the AN |= N set and on the test data are very close (0.26 vs. 0.24), resulting in very similar performance for balAPinc_AN|=N and balAPinc_upper. This suggests that the relation captured by balAPinc on the phrasal entailment training data is indeed the same that the measure captures when applied to lexical entailment data.

The success of this first experiment shows that the entailment relation present in the distributional representation of AN phrases and their head Ns transfers to lexical entailment (entailment among Ns). Most importantly, this result demonstrates that the semantic vectors of composite expressions (such as ANs) are useful for lexical entailment. Moreover, the result is in accordance with the view of FS that ANs and Ns have the same semantic type, and thus they enter entailment relations of the same kind. Finally, the hypothesis that entailment among nouns is reflected by distributional inclusion among their semantic vectors (Kotlerman et al., 2010) is supported both by the successful generalization of the SVM classifier trained on AN |= N pairs and by the good performance of the balAPinc measure.
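The feature pipeline used in this experiment (SVD reduction of the PMI matrix, then pair representation by concatenation, as described in Section 3.5) can be sketched with toy dimensions; the downstream polynomial-kernel SVM of the paper's Weka/LIBSVM setup is omitted and would consume the pair vectors produced here:

```python
import numpy as np

# Sketch of the feature pipeline described in Section 3.5: reduce the
# PMI matrix with SVD, then represent an entailment candidate (u, v)
# as the concatenation of the two reduced vectors. Dimensions are toy
# (10 instead of 300) and the matrix is random; the learner (a
# polynomial-kernel SVM in the paper) is omitted.
rng = np.random.default_rng(0)
pmi = rng.standard_normal((50, 40))   # 50 phrases x 40 context words

k = 10
U, S, Vt = np.linalg.svd(pmi, full_matrices=False)
reduced = U[:, :k] * S[:k]            # each row: a k-dim phrase vector

def pair_features(i: int, j: int) -> np.ndarray:
    """Concatenate antecedent and consequent vectors, as fed to the SVM."""
    return np.concatenate([reduced[i], reduced[j]])

x = pair_features(0, 1)               # one 2k-dimensional training instance
```

Note that the representation is order-sensitive: swapping antecedent and consequent yields a different instance, which is what lets the classifier learn an asymmetric relation.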
5 Generalizing QN entailment

The second study is somewhat more ambitious, as it aims to capture and generalize the entailment relation between QPs (of shape QN) using only the corpus-harvested semantic vectors representing these phrases as evidence. We are thus first and foremost interested in testing whether these vectors encode information that can help a powerful classifier, such as SVM, to detect entailment.

To abstract away from lexical or other effects linked to a specific quantifier, we consider two challenging training and testing regimes. In the first (SVM_pair-out), we hold out one quantifier pair as testing data and use the other 29 pairs in Table 1 […]

                       P     R     F     Accuracy (95% C.I.)
SVM_pair-out           76.7  77.0  76.8  78.1 (77.5–78.8)
SVM_quantifier-out     70.1  65.3  68.0  71.0 (70.3–71.7)
SVM^Q_pair-out         67.9  69.8  68.9  70.2 (69.5–70.9)
SVM^Q_quantifier-out   53.3  52.9  53.1  56.0 (55.2–56.8)
cos(QN1, QN2)          52.9  52.3  52.3  53.1 (52.3–53.9)
balAPinc_AN|=N         46.7  5.6   10.0  52.5 (51.7–53.3)
SVM_AN|=N              2.8   42.9  5.2   52.4 (51.7–53.2)
fq(QN1) < fq(QN2)      51.0  47.4  49.1  50.2 (49.4–51.0)
balAPinc_upper         47.1  100   64.1  47.2 (46.4–47.9)

Table 3: Detecting quantifier entailment. Results ranked by accuracy and expressed as percentages. 95% confidence intervals around accuracy calculated by binomial exact tests.

[…] baselines are only slightly better overall than more trivial baselines.) We consider moreover an alternative approach that ignores the noun altogether and uses vectors for the quantifiers only (e.g., the decision about all dogs |= some dogs considers the corpus-derived all and some vectors only). The models resulting from this Q-only strategy are marked with the superscript Q in the table.

The results confirm clearly that semantic vectors for QNs contain enough information to allow a classifier to detect entailment: SVM_quantifier-out performs as well as the lexical entailment classifiers of our first study, and SVM_pair-out does even better. This success is especially impressive given our challenging training and testing regimes.

In contrast to the first study, now SVM_AN|=N, the classifier trained on the AN |= N data set, and balAPinc perform no better than the baselines. (Here balAPinc_upper and balAPinc_AN|=N pick very different thresholds: the first settling on a very low t = 0.01, whereas for the second t = 0.26.) As predicted by FS (see Section 2.2 above), noun-level entailment does not generalize to quantifier phrase entailment, since the two structures have different semantic types, cor-
as training data. Thus, for example, the classifier responding to different kinds of entailment rela-
must discover all dogs |= some dogs without see- tions. Moreover, the failure of balAPinc suggests
ing any all N |= some N instance in the training that, whatever evidence the SVMs rely upon, it is
data. In the second (SVMquantifier-out ), we hold out not simple feature inclusion.
one of the 12 quantifiers as testing data (that is, Interestingly, even the Q vectors alone encode
hold out every pair involving a certain quantifier) enough information to capture entailment above
and use the rest as training data. For example, chance. Still, the huge drop in performance from
the quantifier must guess all dogs |= some dogs SVMQ Q
pair-out to SVMquantifier-out suggests that the Q-
without ever seeing all in the training data. We only method learned ad-hoc properties that do not
expect the second training regime to be more dif- generalize (e.g., all entails every Q2 ).
ficult, not just because there is less training data, Tables 1 and 4 break down the SVM results by
but also because the trained classifier is tested on (pairs of) quantifiers. We highlight the remark-
a quantifier that it has never encountered within able dichotomy in Table 4 between the good per-
any training QN sequence.4 formance on the universal-like quantifiers (each,
Table 3 reports the results for SVMpair-out and every, all, much) and the poor performance on the
SVMquantifier-out , as well as for the methods we existential-like ones (some, no, both, either).
tried in the lexical entailment experiments. (As In sum, the QN experiments show that seman-
in the first study, the frequency- and cosine-based tic vectors contain enough information to detect
4
a logical relation such as entailment not only be-
In our initial experiments, we added negative entail-
ment instances by blindly permuting the nouns, under the
tween words, but also between phrases contain-
assumption that Q1 N1 typically does not entail Q2 N2 when ing quantifiers that determine their entailment re-
Q1 6= Q2 and N1 6= N2 . These additional instances turned lation. While a flexible classifier such as SVM
out to be much easier to classify: adding an equal proportion performs this task well, neither measuring fea-
of them to the training data and testing data, such that the
number of instances where N1 = N2 and where N1 6= N2
ture inclusion nor generalizing nominal entail-
is equal, reduced every error rate roughly by half. The re- ment works. SVMs are evidently tapping into
ported results do not involve these additional instances. other properties of the vectors.
30
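The two held-out regimes described above (pair-out and quantifier-out) can be sketched as follows. The instance format (a quantifier pair plus whatever features and label go with it) and the function names are illustrative assumptions, not the authors' code, and the actual SVM training step (e.g., with LIBSVM) is omitted.

```python
def pair_out_splits(instances):
    """Yield one (held_out, train, test) split per quantifier pair.

    Each instance is a tuple (q1, q2, ...). The classifier never sees
    any Q1 N |= Q2 N instance for the held-out pair during training.
    """
    pairs = sorted({(q1, q2) for q1, q2, *_ in instances})
    for held_out in pairs:
        train = [x for x in instances if (x[0], x[1]) != held_out]
        test = [x for x in instances if (x[0], x[1]) == held_out]
        yield held_out, train, test


def quantifier_out_splits(instances):
    """Yield one (held_out, train, test) split per quantifier.

    Every pair involving the held-out quantifier is removed from
    training, so the tested quantifier never occurs in any training
    QN sequence.
    """
    quantifiers = sorted({q for q1, q2, *_ in instances for q in (q1, q2)})
    for held_out in quantifiers:
        train = [x for x in instances if held_out not in (x[0], x[1])]
        test = [x for x in instances if held_out in (x[0], x[1])]
        yield held_out, train, test
```

With 30 quantifier pairs over 12 quantifiers, `pair_out_splits` produces 30 folds and `quantifier_out_splits` produces 12 larger, harder folds, matching the two regimes in the text.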
  Quantifier   Instances        Correct
               |=       ⊭       |=       ⊭
  each           656     656     649     637   (98%)
  every          460    1322     402    1293   (95%)
  much           248       0     216       0   (87%)
  all           2949    2641    2011    2494   (81%)
  several       1731    1509    1302    1267   (79%)
  many          3341    4163    2349    3443   (77%)
  few              0     461       0     311   (67%)
  most           928     832     549     511   (60%)
  some          4062    3145    1780    2190   (55%)
  no               0     714       0     380   (53%)
  both           636    1404     589     303   (44%)
  either          63      63       2      41   (34%)
  Total        15074   16910    9849   12870   (71%)

Table 4: Breakdown of results with the leave-one-quantifier-out (SVM_quantifier-out) training regime.

6 Conclusion

Our main results are as follows.

1. Corpus-harvested semantic vectors representing adjective-noun constructions and their heads encode a relation of entailment that can be exploited to train a classifier to detect lexical entailment. In particular, a relation of feature inclusion between the narrower antecedent and broader consequent terms captures both AN |= N and N1 |= N2 entailment.

2. The semantic vectors of quantifier-noun constructions also encode information sufficient to learn an entailment relation that generalizes to QNs containing quantifiers that were not seen during training.

3. Neither the entailment information encoded in AN |= N vectors nor the balAPinc measure generalizes well to entailment detection in QNs. This result suggests that QN vectors encode a different kind of entailment, as also suggested by type distinctions in Formal Semantics.

In future work, we want first of all to conduct an analysis of the features in the Q1 N |= Q2 N vectors that are crucially exploited by our successful entailment recognizers, in order to understand which characteristics of entailment are encoded in these vectors.

Very importantly, instead of extracting vectors representing phrases directly from the corpus, we intend to derive them by compositional operations proposed in the literature (see Section 2.1 above). We will look for composition methods producing vector representations of composite expressions that are as good as (or better than) vectors directly extracted from the corpus at encoding entailment.

Finally, we would like to evaluate our entailment detection strategies for larger phrases and sentences, possibly containing multiple quantifiers, and eventually embed them as core components of an RTE system.

Acknowledgments

We thank the Erasmus Mundus EMLCT Program for the student and visiting scholar grants to the third and fourth author, respectively. The first two authors are partially funded by the ERC 2011 Starting Independent Research Grant supporting the COMPOSES project (nr. 283554). We are grateful to Gemma Boleda, Louise McNally, and the anonymous reviewers for valuable comments, and to Ido Dagan for important insights into entailment from an empirical point of view.

References

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89-96.

Marco Baroni and Alessandro Lenci. 2011. How we BLESSed distributional semantic evaluation. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics.

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of EMNLP, pages 1183-1193, Boston, MA.

Johan Bos and Katja Markert. 2006. When logical inference helps determining textual entailment (and when it doesn't). In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.

Paul Buitelaar and Philipp Cimiano. 2008. Bridging the Gap between Text and Knowledge. IOS, Amsterdam.

Nathanael Chambers, Daniel Cer, Trond Grenager, David Hall, Chloe Kiddon, Bill MacCartney, Marie-Catherine de Marneffe, Daniel Ramage, Eric Yeh, and Christopher D. Manning. 2007. Learning alignments and leveraging natural logic. In ACL-PASCAL Workshop on Textual Entailment and Paraphrasing.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1-27:27.

Kenneth Church and Peter Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22-29.

Nello Cristianini and John Shawe-Taylor. 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge.

Ido Dagan, Bill Dolan, Bernardo Magnini, and Dan Roth. 2009. Recognizing textual entailment: rational, evaluation and approaches. Natural Language Engineering, 15:459-476.

Katrin Erk. 2009. Supporting inferences in semantic space: representing words as regions. In Proceedings of IWCS, pages 104-115, Tilburg, Netherlands.

Maayan Geffet and Ido Dagan. 2005. The distributional inclusion hypotheses and lexical entailment. In Proceedings of ACL, pages 107-114, Ann Arbor, MI.

Edward Grefenstette and Mehrnoosh Sadrzadeh. 2011. Experimental support for a categorical compositional distributional model of meaning. In Proceedings of EMNLP, pages 1395-1404, Edinburgh.

Emiliano Guevara. 2010. A regression model of adjective-noun compositionality in distributional semantics. In Proceedings of the ACL GEMS Workshop, pages 33-37, Uppsala, Sweden.

Marti Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING, pages 539-545, Nantes, France.

Irene Heim and Angelika Kratzer. 1998. Semantics in Generative Grammar. Blackwell, Oxford.

Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4):359-389.

Milen Kouleykov and Bernardo Magnini. 2005. Tree edit distance for textual entailment. In Proceedings of RANLP-2005, International Conference on Recent Advances in Natural Language Processing, pages 271-278.

Thomas Landauer and Susan Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211-240.

Dekang Lin. 1998. An information-theoretic definition of similarity. In Proceedings of ICML, pages 296-304, Madison, WI, USA.

Kevin Lund and Curt Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, 28:203-208.

Chris Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388-1429.

Richard Montague. 1970. Universal Grammar. Theoria, 36:373-398.

Patrick Pantel and Deepak Ravichandran. 2004. Automatically labeling semantic classes. In Proceedings of HLT-NAACL 2004, pages 321-328.

Reinhard Rapp. 2003. Word sense discovery based on sense descriptor dissimilarity. In Proceedings of the 9th MT Summit, pages 315-322, New Orleans, LA.

Magnus Sahlgren. 2006. The Word-Space Model. Dissertation, Stockholm University.

Helmut Schmid. 1995. Improvements in part-of-speech tagging with an application to German. In Proceedings of the EACL-SIGDAT Workshop, Dublin, Ireland.

Hinrich Schütze. 1997. Ambiguity Resolution in Natural Language Learning. CSLI, Stanford, CA.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. In Proceedings of NIPS 17.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of ACL 2006, pages 801-808.

Richmond H. Thomason, editor. 1974. Formal Philosophy: Selected Papers of Richard Montague. Yale University Press, New York.

Peter Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141-188.

Peter Turney. 2008. A uniform approach to analogies, synonyms, antonyms and associations. In Proceedings of COLING, pages 905-912, Manchester, UK.

Julie Weeds, David Weir, and Diana McCarthy. 2004. Characterising measures of lexical distributional similarity. In Proceedings of the 20th International Conference of Computational Linguistics, COLING-2004, pages 1015-1021.

Fabio M. Zanzotto, Marco Pennacchiotti, and Alessandro Moschitti. 2007. Shallow semantics in fast textual entailment rule learners. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing.

Maayan Zhitomirsky-Geffet and Ido Dagan. 2010. Bootstrapping distributional feature vector quality. Computational Linguistics, 35(3):435-461.
Evaluating Distributional Models of Semantics for Syntactically Invariant Inference

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 33-43, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
Existing evaluations of distributional semantic models fall short of measuring this. One evaluation approach consists of lexical-level word substitution tasks, which primarily evaluate a system's ability to disambiguate word senses within a controlled syntactic environment (McCarthy and Navigli, 2009, for example). Another approach is to evaluate parsing accuracy (Socher et al., 2010, for example), which is really a formalism-specific approximation to argument structure analysis. These evaluations may certainly be relevant to specific components of, for example, machine translation or natural language generation systems, but they tell us little about a semantic model's ability to support inference.

In this paper, we propose a general framework for evaluating distributional semantic models that build sentence representations, and suggest two evaluation methods that test the notion of structurally invariant inference directly. Both rely on determining whether sentences express the same semantic relation between entities, a crucial step in solving a wide variety of inference tasks like recognizing textual entailment, information retrieval, question answering, and summarization.

The first evaluation is a relation classification task, where a semantic model is tested on its ability to recognize whether a pair of sentences both contain a particular semantic relation, such as Company X acquires Company Y. The second task is a question answering task, the goal of which is to locate the sentence in a document that contains the answer. Here, the semantic model must match the question, which expresses a proposition with a missing argument, to the answer-bearing sentence which contains the full proposition.

We apply these new evaluation protocols to several recent distributional models, extending several of them to build sentence representations. We find that the models outperform a simple lemma overlap model only slightly, but that combining these models with the lemma overlap model can improve performance. This result is likely due to weaknesses in current models' ability to deal with issues such as named entities, coreference, and negation, which are not emphasized by existing evaluation methods, but it does suggest that distributional models of semantics can play a more central role in systems that require deep, precise inference.

2 Compositionality and Distributional Semantics

The idea of compositionality has been central to understanding contemporary natural language semantics from an historiographic perspective. The idea is often credited to Frege, although in fact Frege had very little to say about compositionality that had not already been repeated since the time of Aristotle (Hodges, 2005). Our modern notion of compositionality took shape primarily with the work of Tarski (1956), who was actually arguing that a central difference between formal languages and natural languages is that natural language is not compositional. This in turn was the contention, that an important theoretical difference exists between formal and natural languages, that Richard Montague so famously rejected (Montague, 1974). Compositionality also features prominently in Fodor and Pylyshyn's (1988) rejection of early connectionist representations of natural language semantics, which seems to have influenced Mitchell and Lapata (2008) as well.

Logic-based forms of compositional semantics have long strived for syntactic invariance in meaning representations, which is known as the doctrine of the canonical form. The traditional justification for canonical forms is that they allow easy access to a knowledge base to retrieve some desired information, which amounts to a form of inference. Our work can be seen as an extension of this notion to distributional semantic models, with a more general notion of representational similarity and inference.

There are many regular alternations that semantic models have tried to account for, such as passive or dative alternations. There are also many lexical paraphrases which can take drastically different syntactic forms. Take the following example from Poon and Domingos (2009), in which the same semantic relation can be expressed by a transitive verb or an attributive prepositional phrase:

(1) Utah borders Idaho.
    Utah is next to Idaho.

In distributional semantics, the original sentence similarity test proposed by Kintsch (2001) served as the inspiration for the evaluation performed by Mitchell and Lapata (2008) and most later work in the area. Intransitive verbs are given
in the context of their syntactic subject, and candidate synonyms are ranked for their appropriateness. This method targets the fact that a synonym is appropriate for only some of the verb's senses, and the intended verb sense depends on the surrounding context. For example, burn and beam are both synonyms of glow, but given a particular subject, one of the synonyms (called the High similarity landmark) may be a more appropriate substitution than the other (the Low similarity landmark). So, if the fire is the subject, glowed is the High similarity landmark, and beamed the Low similarity landmark.

Fundamentally, this method was designed as a demonstration that compositionality in computing phrasal semantic representations does not interfere with the ability of a representation to synthesize non-compositional collocation effects that contribute to the disambiguation of homographs. Here, word-sense disambiguation is implicitly viewed as a very restricted, highly lexicalized case of inference for selecting the appropriate disjunct in the representation of a word's meaning.

Kintsch (2001) was interested in sentence similarity, but he only conducted his evaluation on a few hand-selected examples. Mitchell and Lapata (2008) conducted theirs on a much larger scale, but chose to focus only on this single case of syntactic combination, intransitive verbs and their subjects, in order to factor out inessential degrees of freedom to compare their various alternative models more equitably. This was not necessary (using the same, sufficiently large, unbiased but syntactically heterogeneous sample of evaluation sentences would have served as an adequate control), and this decision furthermore prevents the evaluation from testing the desired invariance of the semantic representation.

Other lexical evaluations suffer from the same problem. One uses the WordSim-353 dataset (Finkelstein et al., 2002), which contains human word pair similarity judgments that semantic models should reproduce. However, the word pairs are given without context, and homography is unaddressed. Also, it is unclear how reliable the similarity scores are, as different annotators may interpret the integer scale of similarity scores differently. Recent work uses this dataset mostly for parameter tuning. Another is the lexical paraphrase task of McCarthy and Navigli (2009), in which words are given in the context of the surrounding sentence, and the task is to rank a given list of proposed substitutions for that word. The list of substitutions as well as the correct rankings are elicited from annotators. This task was originally conceived as an applied evaluation of WSD systems, not an evaluation of phrase representations.

Parsing accuracy has been used as a preliminary evaluation of semantic models that produce syntactic structure (Socher et al., 2010; Wu and Schuler, 2011). However, syntax does not always reflect semantic content, and we are specifically interested in supporting syntactic invariance when doing semantic inference. Also, this type of evaluation is tied to a particular grammar formalism.

The existing evaluations that are most similar in spirit to what we propose are paraphrase detection tasks that do not assume a restricted syntactic context. Washtell (2011) collected human judgments on the general meaning similarity of candidate phrase pairs. Unfortunately, no additional guidance on the definition of "most similar in meaning" was provided, and it appears likely that subjects conflated lexical, syntactic, and semantic relatedness. Dolan and Brockett (2005) define paraphrase detection as identifying sentences that are in a bidirectional entailment relation. While such sentences do support exactly the same inferences, we are also interested in the inferences that can be made from similar sentences that are not paraphrases according to this strict definition, a situation that is more often encountered in end applications. Thus, we adopt a less restricted notion of paraphrase.

3 An Evaluation Framework

We now describe a simple, general framework for evaluating semantic models. Our framework consists of the following components: a semantic model to be evaluated, pairs of sentences that are considered to have high similarity, and pairs of sentences that are considered to have low similarity.

In particular, the semantic model is a binary function, s = M(x, x′), which returns a real-valued similarity score, s, given a pair of arbitrary linguistic units (that is, words, phrases, sentences, etc.), x and x′. Note that this formulation of the semantic model is agnostic to whether the models use compositionality to build a phrase representation from constituent representations, and even to the actual representation used. The model is tested by applying it to each element in the following two sets:

H = {(h, h′) | h and h′ are linguistic units with high similarity}   (2)
L = {(l, l′) | l and l′ are linguistic units with low similarity}   (3)

The resulting sets of similarity scores are:

S^H = {M(h, h′) | (h, h′) ∈ H}   (4)
S^L = {M(l, l′) | (l, l′) ∈ L}   (5)

The semantic model is evaluated according to its ability to separate S^H and S^L. We will define specific measures of separation for the tasks that we propose shortly. While the particular definitions of "high similarity" and "low similarity" depend on the task, at the crux of both our evaluations is that two sentences are similar if they express the same semantic relation between a given entity pair, and dissimilar otherwise. This threshold for similarity is closely tied to the argument structure of the sentence, and allows considerable flexibility in the other semantic content that may be contained in the sentence, unlike the bidirectional paraphrase detection task. Yet it ensures that a consistent and useful distinction for inference is being detected, unlike unconstrained similarity judgments.

Also, compared to word similarity assessments or paraphrase elicitation, determining whether a sentence expresses a semantic relation is a much easier task cognitively for human judges. This binary judgment does not involve interpreting a numerical scale or coming up with an open-ended set of alternative paraphrases. It is thus easier to get reliable annotated data.

Below, we present two tasks that instantiate this evaluation framework and choice of similarity threshold. They differ in that the first is targeted towards recognizing declarative sentences or phrases, while the second is targeted towards a question answering scenario, where one argument in the semantic relation is queried.

3.1 Task 1: Relation Classification

The first task is a relation classification task. Relation extraction and recognition are central to a variety of other tasks, such as information retrieval, ontology construction, recognizing textual entailment and question answering.

In this task, the high and the low similarity sentence pairs are constructed in the following manner. First, a target semantic relation, such as Company X acquires Company Y, is chosen, and entities are chosen for each slot in the relation, such as Company X = Pfizer and Company Y = Rinat Neuroscience. Then, sentences containing these entities are extracted and divided into two subsets. In one of them, E, the entities are in the target semantic relation, while in the other, NE, they are not. The evaluation sets H and L are then constructed as follows:

H = E × E \ {(e, e) | e ∈ E}   (6)
L = E × NE   (7)

In other words, the high similarity sentence pairs are all the pairs where both express the target semantic relation, except the pairs between a sentence and itself, while the low similarity pairs are all the pairs where exactly one of the two sentences expresses the target relation.

Several sentences expressing the relation Pfizer acquires Rinat Neuroscience are shown in Examples 8 to 10. These sentences illustrate the amount of syntactic and lexical variation that the semantic model must recognize as expressing the same semantic relation. In particular, besides recognizing synonymy or near-synonymy at the lexical level, models must also account for subcategorization differences, extra arguments or adjuncts, and part-of-speech differences due to nominalization.

(8) Pfizer buys Rinat Neuroscience to extend neuroscience research and in doing so acquires a product candidate for OA. (lexical difference)

(9) A month earlier, Pfizer paid an estimated several hundred million dollars for biotech firm Rinat Neuroscience. (extra argument, subcategorization)

(10) Pfizer to Expand Neuroscience Research With Acquisition of Biotech Company Rinat Neuroscience (nominalization)

Since our interest is to measure the models' ability to separate S^H and S^L in an unsupervised setting, standard supervised classification accuracy is not applicable.
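The set constructions of Equations (6)-(7) and the score sets of Equations (4)-(5) translate directly into code. In this sketch, `E` and `NE` are lists of sentences and `model` is any similarity function M; the function names are our own illustrative choices, not part of the paper's implementation.

```python
from itertools import product


def build_eval_sets(E, NE):
    """Equations (6)-(7): H = E x E minus identity pairs; L = E x NE."""
    H = [(e1, e2) for e1, e2 in product(E, E) if e1 != e2]
    L = list(product(E, NE))
    return H, L


def score_sets(model, H, L):
    """Equations (4)-(5): apply the similarity model M to each pair."""
    SH = [model(h1, h2) for h1, h2 in H]
    SL = [model(l1, l2) for l1, l2 in L]
    return SH, SL
```

Any model that separates `SH` from `SL` well, under whatever separation measure the task defines, is detecting the target semantic relation.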
Instead, we employ the area under a ROC curve (AUC), which does not depend on choosing an arbitrary classification threshold. A ROC curve is a plot of the true positive versus false positive rate of a binary classifier as the classification threshold is varied. The area under a ROC curve can thus be seen as the performance of linear classifiers over the scores produced by the semantic model. The AUC can also be interpreted as the probability that a randomly chosen positive instance will have a higher similarity score than a randomly chosen negative instance. A random classifier is expected to have an AUC of 0.5.

3.2 Task 2: Restricted QA

The second task that we propose is a restricted form of question answering. In this task, the system is given a question q and a document D consisting of a list of sentences, in which one of the sentences contains the answer to the question. We define:

H = {(q, d) | d ∈ D and d answers q}   (11)
L = {(q, d) | d ∈ D and d does not answer q}   (12)

In other words, the sentences are divided into two subsets; those that contain the answer to q should be similar to q, while those that do not should be dissimilar. We also assume that only one sentence in each document contains the answer, so H contains only one sentence.

Unrestricted question answering is a difficult problem that forces a semantic representation to deal sensibly with a number of other semantic issues, such as coreference and information aggregation, which still seem to be out of reach for contemporary distributional models of meaning. Since our focus in this work is on argument structure semantics, we restrict the question-answer pairs to those that only require dealing with paraphrases of this type.

To do so, we semi-automatically restrict the question-answer pairs by using the output of an unsupervised clustering semantic parser (Poon and Domingos, 2009). The semantic parser clusters semantic sub-expressions derived from a dependency parse of the sentence, so that those sub-expressions that express the same semantic relations are clustered. The parser is used to answer questions, and the output of the parser is manually checked. We use only those cases that have thus been determined to be correct question-answer pairs. As a result of this restriction, this task is rather more like Task 1 in how it tests a model's ability to recognize lexical and syntactic paraphrases. This task also involves recognizing voicing alternations, which were automatically extracted by the semantic parser.

An example of a question-answer pair involving a voicing alternation that is used in this task is presented in Example 13.

(13) Q: What does IL-2 activate?
     A: PI3K
     Sentence: Phosphatidyl inositol 3-kinase (PI3K) is activated by IL-2.

Since there is only one element in H, and hence in S^H, for each question and document, we measure the separation between S^H and S^L using the rank of the score of the answer-bearing sentence among the scores of all the sentences in the document. We normalize the rank so that it is between 0 (ranked least similar) and 1 (ranked most similar). Where ties occur, the sentence is ranked as if it were in the median position among the tied sentences. If the question-answer pairs are zero-indexed by i, answer(i) is the index of the sentence containing the answer for the i-th pair, and length(i) is the number of sentences in the document, then the mean normalized rank score of a system is:

norm_rank = E_i [ 1 − answer(i) / (length(i) − 1) ]   (14)

4 Experiments

We drew a number of recent distributional semantic models to compare in this paper. We first describe the models and our reimplementation of them, before describing the tasks and the datasets used in detail and the results.

4.1 Distributional Semantic Models

We tested four recent distributional models and a lemma overlap baseline, which we now describe. We extended several of the models to compositionally construct phrase representations using component-wise vector addition and multiplication, as we note below. Since the focus of this paper is on evaluation methods for such models, we did not experiment with other compositionality operators.
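The probabilistic reading of AUC given in Section 3.1 above (the chance that a randomly chosen high-similarity score exceeds a randomly chosen low-similarity one) yields a direct, threshold-free way to compute it. This brute-force O(|S^H| * |S^L|) sketch is our own illustration, not the paper's implementation; ties count as one half, so a constant classifier scores 0.5.

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive (high-similarity)
    score exceeds a random negative one, counting ties as 1/2."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

For large score sets, the same quantity can be obtained in O(n log n) from the rank sum of the positive scores (the Mann-Whitney U statistic), but the quadratic form above makes the probabilistic interpretation explicit.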
37
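The normalized rank measure of Equation (14) in Section 3.2 above, with its median tie-handling rule, can be sketched per document as follows. The names `scores` (a document's per-sentence similarity scores) and `answer_idx` (the position of the answer-bearing sentence) are our own hypothetical conventions, and the document is assumed to contain at least two sentences.

```python
def normalized_rank(scores, answer_idx):
    """Normalized rank of the answer-bearing sentence: 1.0 means ranked
    most similar, 0.0 least similar. A tied sentence takes the median
    position among the tied sentences."""
    s = scores[answer_idx]
    higher = sum(1 for x in scores if x > s)
    tied = sum(1 for x in scores if x == s) - 1  # other sentences tied with it
    rank = higher + tied / 2.0                   # median position among ties
    return 1.0 - rank / (len(scores) - 1)
```

A system's overall score is then the mean of `normalized_rank` over all question-answer pairs, matching the expectation over i in Equation (14).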
operators. We do note, however, that component-wise operators have been popular in the recent literature, and have been applied across unrestricted syntactic contexts (Mitchell and Lapata, 2009), so there is value in evaluating the performance of these operators in itself. The models were trained on the Gigaword corpus (2nd ed., ~2.3B words). All models use cosine similarity to measure the similarity between representations, except for the baseline model.

Lemma Overlap  This baseline simply represents a sentence as the counts of each lemma present in the sentence after removing stop words. Let a sentence x consist of lemma-tokens m_1, ..., m_|x|. The similarity between two sentences is then defined as

    M(x, x′) = #In(x, x′) + #In(x′, x)    (15)

    #In(x, x′) = Σ_{i=1}^{|x|} 1_{x′}(m_i)    (16)

where 1_{x′}(m_i) is an indicator function that returns 1 if m_i ∈ x′, and 0 otherwise. This definition accounts for multiple occurrences of a lemma.

M&L  Mitchell and Lapata (2008) propose a framework for compositional distributional semantics using a standard term-context vector space word representation. A phrase is represented as a vector of context-word counts (actually, PMI-scaled values), which is derived compositionally by a function over constituent vectors, such as component-wise addition or multiplication. This model ignores syntactic relations and is insensitive to word order.

E&P  Erk and Padó (2008) introduce a structured vector space model which uses syntactic dependencies to model the selectional preferences of words. The vector representation of a word in context depends on the inverse selectional preferences of its dependents, and the selectional preferences of its head. For example, suppose catch occurs with a dependent ball in a direct object relation. The vector for catch would then be influenced by the inverse direct object preferences of ball (e.g., throw, organize), and the vector for ball would be influenced by the selectional preferences of catch (e.g., cold, drift). More formally, given words a and b in a dependency relation r, and a distributional representation of a, v_a, the representation of a in context, a′, is given by

    a′ = v_a ⊙ R_b(r⁻¹)    (17)

    R_b(r) = Σ_{c : f(c,r,b) > θ} f(c, r, b) · v_c    (18)

where R_b(r) is the vector describing the selectional preference of word b in relation r, f(c, r, b) is the frequency of this dependency triple, θ is a frequency threshold to weed out uncommon dependency triples (10 in our experiments), and ⊙ is a vector combination operator, here component-wise multiplication. We extend the model to compute sentence representations from the contextualized word vectors using component-wise addition and multiplication.

TFP  The model of Thater et al. (2010) is also sensitive to selectional preferences, but to two degrees. For example, the vector for catch might contain a dimension labelled (OBJ, OBJ⁻¹, throw), which indicates the strength of connection between the two verbs through all of the co-occurring direct objects which they share. Unlike E&P, TFP's model encodes the selectional preferences in a single vector using frequency counts. We extend the model to the sentence level with component-wise addition and multiplication, and word vectors are contextualized by their dependency neighbours. We use a frequency threshold of 10 and a PMI threshold of 2 to prune infrequent words and dependencies.

D&L  Dinu and Lapata (2010) (D&L) assume a global set of latent senses for all words, and model each word as a mixture over these latent senses. The vector for a word t_i in the context of a word c_j is modelled by

    v(t_i, c_j) = ⟨P(z_1 | t_i, c_j), ..., P(z_K | t_i, c_j)⟩    (19)

where z_1, ..., z_K are the latent senses. By making independence assumptions and decomposing probabilities, training becomes a matter of estimating the probability distributions P(z_k | t_i) and P(c_j | z_k) from data. While Dinu and Lapata (2010) describe two methods to do so, based on non-negative matrix factorization and latent Dirichlet allocation, the performances are similar, so we tested only the latent Dirichlet allocation method. Like the two previous models, we extend the model to build sentence representations
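As a concrete illustration, the lemma overlap baseline of Eqs. (15)-(16) can be sketched in a few lines of Python. This is a minimal sketch; the function name and toy inputs are ours, not from the paper's implementation.

```python
def overlap_similarity(x, y):
    """Lemma-overlap baseline, Eqs. (15)-(16): M(x, x') = #In(x, x') + #In(x', x),
    where #In(x, x') counts the tokens of x whose lemma also occurs in x'.
    `x` and `y` are lists of lemmas with stop words already removed."""
    x_set, y_set = set(x), set(y)
    in_xy = sum(1 for m in x if m in y_set)  # Eq. (16): sum of the indicator 1_{x'}(m_i)
    in_yx = sum(1 for m in y if m in x_set)
    return in_xy + in_yx                     # Eq. (15)
```

Note that repeated lemmas contribute once per occurring token, which is how the definition accounts for multiple occurrences of a lemma.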
            Pfizer/Rinat N.  Yahoo/Inktomi  Besson/Paris  Antoinette/Vienna  Average
Overlap     0.7393           0.6007         0.7395        0.8914             0.7427
Models trained on the entire Gigaword
M&L add     0.6196           0.5387         0.5259        0.7275             0.6029
M&L mult    0.9036           0.6099         0.6443        0.8467             0.7511
D&L add     0.9214           0.8168         0.6989        0.8932             0.8326
D&L mult    0.7732           0.6734         0.6527        0.7659             0.7163
Models trained on the AFP section
E&P add     0.7536           0.4933         0.2780        0.6408             0.5414
E&P mult    0.5268           0.5328         0.5252        0.8421             0.6067
TFP add     0.4357           0.5325         0.8725        0.7183             0.6398
TFP mult    0.5554           0.5524         0.7283        0.6917             0.6320
M&L add     0.5643           0.5504         0.4594        0.7640             0.5845
M&L mult    0.8679           0.6324         0.4356        0.8258             0.6904
D&L add     0.8143           0.9062         0.6373        0.8664             0.8061
D&L mult    0.8429           0.7461         0.645         0.5948             0.7072

Table 1: Task 1 results in AUC scores. The values in bold indicate the best-performing model for a particular training corpus. The expected random baseline performance is 0.5.
from the contextualized representations. We set the number of latent senses to 1200, and train for 600 Gibbs sampling iterations.

4.2 Training and Parameter Settings

We reimplemented these four models, following the parameter settings described by previous work where possible, though we also aimed for consistency in parameter settings between models (for example, in the number of context words). For the non-baseline models, we followed previous work and model only the 30000 most frequent lemmata. Context vectors are constructed using a symmetric window of 5 words, and their dimensions represent the 3000 most frequent lemmatized context words excluding stop words. Due to resource limitations, we trained the syntactic models over the AFP subset of Gigaword (~338M words). We also trained the other two models on just the AFP portion for comparison. Note that the AFP portion of Gigaword is three times larger than the BNC corpus (~100M words), on which several previous syntactic models were trained. Because our main goal is to test the general performance of the models and to demonstrate the feasibility of our evaluation methods, we did not further tune the parameter settings to each of the tasks, as doing so would likely only yield minor improvements.

4.3 Task 1

We used the dataset by Bunescu and Mooney (2007), which we selected because it contains multiple realizations of an entity pair in a target semantic relation, unlike similar datasets such as the one by Roth and Yih (2002). Controlling for the target entity pair in this manner makes the task more difficult, because the semantic model cannot make use of distributional information about the entity pair in inference. The dataset is separated into subsets depending on the target binary relation (Company X acquires Company Y or Person X was born in Place Y) and the entity pair (e.g., Yahoo and Inktomi) (Table 2).

Entities: {X, Y}                  +     N
Relation: acquires
{Pfizer, Rinat Neuroscience}      41    50
{Yahoo, Inktomi}                  115   433
Relation: was born in
{Luc Besson, Paris}               6     126
{Marie Antoinette, Vienna}        39    105

Table 2: Task 1 dataset characteristics. N is the total number of sentences. + is the number of sentences that express the relation.

The dataset was constructed semi-automatically using a Google search for the two entities in order with up to seven content words in between. Then, the extracted sentences were hand-labelled with whether they express the target relation. Because the order of the entities has been fixed, passive alternations do not appear
in this dataset.

The results for Task 1 indicate that the D&L addition model performs the best (Table 1), though the lemma overlap model presents a surprisingly strong baseline. The syntax-modulated E&P and TFP models perform poorly on this task, even when compared to the other models trained on the AFP subset. The M&L multiplication model outperforms the addition model, a result which corroborates previous findings on the lexical substitution task. The same does not hold in the D&L latent sense space. Overall, some of the datasets (Yahoo and Antoinette) appear to be easier for the models than others (Pfizer and Besson), but more entity pairs and relations would be needed to investigate the models' variance across datasets.

4.4 Task 2

We used the question-answer pairs extracted by the Poon and Domingos (2009) semantic parser from the GENIA biomedical corpus that have been manually checked to be correct (295 pairs). Because our models were trained on newspaper text, they required adaptation to this specialized domain. Thus, we also trained the M&L, E&P and TFP models on the GENIA corpus, backing off to word vectors from the GENIA corpus when a word vector could not be found in the Gigaword-trained model. We could not do this for the D&L model, since the global latent senses that are found by latent Dirichlet allocation training do not have any absolute meaning that holds across multiple runs. Instead, we found the 5 words in the Gigaword-trained D&L model that were closest to each novel word in the GENIA corpus according to cosine similarity over the co-occurrence vectors of the words in the GENIA corpus, and took their average latent sense distributions as the vector for that word.

Unlike in Task 1, there is no control for the named entities in a sentence, because one of the entities in the semantic relation is missing. Also, distributional models have problems in dealing with named entities, which are common in this corpus, such as the names of genes and proteins. To address these issues, we tested hybrid models where the similarity score from a semantic model is added to the similarity score from the lemma overlap model.

            Pure models       Mixed models
            All     Subset    All     Subset
Overlap     0.8770  0.7291    0.8770  0.7291
Models trained on the entire Gigaword
M&L add     0.7467  0.6106    0.8782  0.7523
M&L mult    0.5331  0.5690    0.8841  0.7678
D&L add     0.6552  0.5716    0.8791  0.7539
D&L mult    0.5488  0.5255    0.8841  0.7466
Models trained on the AFP section
E&P add     0.4589  0.4516    0.8748  0.7375
E&P mult    0.5201  0.5584    0.8882  0.7719
TFP add     0.6887  0.6443    0.8940  0.7871
TFP mult    0.5210  0.5199    0.8785  0.7432
M&L add     0.7588  0.6206    0.8710  0.7371
M&L mult    0.5710  0.5540    0.8801  0.7540
D&L add     0.6358  0.5402    0.8713  0.7305
D&L mult    0.5647  0.5461    0.8856  0.7683

Table 3: Task 2 results, in normalized rank scores. Subset is the set of cases where lemma overlap does not achieve a perfect score. The two columns on the right indicate performance using the sum of the scores from the lemma overlap and the semantic model. The expected random baseline performance is 0.5.

The results are presented in Table 3. Lemma overlap again presents a strong baseline, but the hybridized models are able to outperform simple lemma overlap. Unlike in Task 1, the E&P and TFP models are comparable to the D&L model, and the mixed TFP addition model achieves the best result, likely due to the need to more precisely distinguish syntactic roles in this task. The D&L addition model, which achieved the best performance in Task 1, does not perform as well in this task. This could be due to the domain adaptation procedure for the D&L model, which could not be reasonably trained on such a small, specialized corpus.

5 Related Work

Turney and Pantel (2010) survey various types of vector space models and applications thereof in computational linguistics. We summarize below a number of other word- or phrase-level distributional models.

Several approaches are specialized to deal with homography. The top-down multi-prototype approach determines a number of senses for each word, and then clusters the occurrences of the word into these senses (Reisinger and Mooney, 2010). A prototype vector is created for each of these sense clusters. When a new occurrence
of a word is encountered, it is represented as a combination of the prototype vectors, with the degree of influence from each prototype determined by the similarity of the new context to the existing sense contexts. In contrast, the bottom-up exemplar-based approach assumes that each occurrence of a word expresses a different sense of the word. The most similar senses of the word are activated and combined when a new occurrence of it is encountered, for example with a kNN algorithm (Erk and Padó, 2010).

The models we compared and the above work assume that each dimension in the feature vector corresponds to a context word. In contrast, Washtell (2011) uses potential paraphrases directly as dimensions in his expectation vectors. Unfortunately, this approach does not outperform various context word-based approaches in two phrase similarity tasks.

In terms of the vector composition function, component-wise addition and multiplication are the most popular in recent work, but there exist a number of other operators, such as the tensor product and the convolution product, which are reviewed by Widdows (2008). Instead of vector space representations, one could also use a matrix space representation with its much more expressive matrix operators (Rudolph and Giesbrecht, 2010). So far, however, this has only been applied to specific syntactic contexts (Baroni and Zamparelli, 2010; Guevara, 2010; Grefenstette and Sadrzadeh, 2011) or tasks (Yessenalina and Cardie, 2011).

Neural networks have been used to learn both phrase structure and representations. In Socher et al. (2010), word representations learned by neural network models such as those of Bengio et al. (2006) and Collobert and Weston (2008) are fed as input into a recursive neural network whose nodes represent syntactic constituents. Each node models both the probability of the input forming a constituent and the phrase representation resulting from composition.

6 Conclusions

We have proposed an evaluation framework for distributional models of semantics which build phrase- and sentence-level representations, and instantiated two evaluation tasks which test for the crucial ability to recognize whether sentences express the same semantic relation. Our results demonstrate that compositional distributional models of semantics already have some utility in the context of more empirically complex semantic tasks than WSD-like lexical substitution tasks, in which compositional invariance is a requisite property. Simply computing lemma overlap, however, is a very competitive baseline, due to issues in these protocols with named entities and domain adaptivity. The better performance of the mixture models in Task 2 shows that such weaknesses can be addressed by hybrid semantic models. Future work should investigate more refined versions of such hybridization, as well as extend this idea to other semantic phenomena like coreference, negation and modality.

We also observe that no single model or composition operator performs best for all tasks and datasets. The latent sense mixture model of Dinu and Lapata (2010) performs well in recognizing semantic relations in general web text. Because of the difficulty of adapting it to a specialized domain, however, it does less well in biomedical question answering, where the syntax-based model of Thater et al. (2010) performs the best. A more thorough investigation of the factors that can predict the performance and/or invariance of a given composition operator is warranted.

In the future, we would like to evaluate other models of compositional semantics that have been recently proposed. We would also like to collect more comprehensive test data, to increase the external validity of our evaluations.

Acknowledgments

We would like to thank Georgiana Dinu and Stefan Thater for help with reimplementing their models. Saif Mohammad, Peter Turney, and the anonymous reviewers provided valuable comments on drafts of this paper. This project was supported by the Natural Sciences and Engineering Research Council of Canada.

References

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183–1193.

Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Fréderic Morin, and Jean-Luc Gauvain.
2006. Neural probabilistic language models. Innovations in Machine Learning, pages 137–186.

Razvan C. Bunescu and Raymond J. Mooney. 2007. Learning to extract relations from the web using minimal supervision. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 576–583.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167.

Georgiana Dinu and Mirella Lapata. 2010. Measuring distributional similarity in context. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1162–1172.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing, pages 9–16.

Katrin Erk and Sebastian Padó. 2008. A structured vector space model for word meaning in context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 897–906.

Katrin Erk and Sebastian Padó. 2010. Exemplar-based models for word meaning in context. In Proceedings of the ACL 2010 Conference Short Papers, pages 92–97.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1):116–131.

Jerry A. Fodor and Zenon W. Pylyshyn. 1988. Connectionism and cognitive architecture: A critical analysis. Cognition, 28:3–71.

Edward Grefenstette and Mehrnoosh Sadrzadeh. 2011. Experimental support for a categorical compositional distributional model of meaning. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1394–1404.

Emiliano Guevara. 2010. A regression model of adjective-noun compositionality in distributional semantics. In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics, pages 33–37.

Zellig S. Harris. 1954. Distributional structure. Word, 10(2–3):146–162.

Wilfred Hodges. 2005. The interplay of fact and theory in separating syntax from meaning. In Workshop on Empirical Challenges and Analytical Alternatives to Strict Compositionality.

Walter Kintsch. 2001. Predication. Cognitive Science, 25(2):173–202.

Diana McCarthy and Roberto Navigli. 2009. The English lexical substitution task. Language Resources and Evaluation, 43(2):139–159.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT, pages 236–244.

Jeff Mitchell and Mirella Lapata. 2009. Language models based on semantic composition. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 430–439.

Richard Montague. 1974. English as a formal language. Formal Philosophy, pages 188–221.

Hoifung Poon and Pedro Domingos. 2009. Unsupervised semantic parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1–10.

Joseph Reisinger and Raymond J. Mooney. 2010. Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Dan Roth and Wen-tau Yih. 2002. Probabilistic reasoning for entity & relation recognition. In Proceedings of the 19th International Conference on Computational Linguistics, pages 835–841.

Sebastian Rudolph and Eugenie Giesbrecht. 2010. Compositional matrix-space models of language. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 907–916.

Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the Deep Learning and Unsupervised Feature Learning Workshop of NIPS 2010, pages 1–9.

Alfred Tarski. 1956. The concept of truth in formalized languages. Logic, Semantics, Metamathematics, pages 152–278.

Stefan Thater, Hagen Fürstenau, and Manfred Pinkal. 2010. Contextualizing semantic representations using syntactically enriched vector models. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 948–957.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Justin Washtell. 2011. Compositional expectation: A purely distributional model of compositional semantics. In Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011), pages 285–294.

Dominic Widdows. 2008. Semantic vector products: Some initial investigations. In Second AAAI Symposium on Quantum Interaction.
Stephen Wu and William Schuler. 2011. Structured composition of semantic vectors. In Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011), pages 295–304.

Ainur Yessenalina and Claire Cardie. 2011. Compositional matrix-space models for sentiment analysis. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 172–182.
Cross-Framework Evaluation for Statistical Parsing
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 44–54,
Avignon, France, April 23–27, 2012. © 2012 Association for Computational Linguistics
and ellipsis) do not comply with the single-head assumption of dependency treebanks. Secondly, these scripts may be labor-intensive to create, and are available mostly for English, so the evaluation protocol becomes language-dependent.

In Tsarfaty et al. (2011) we proposed a general protocol for handling annotation discrepancies when comparing parses across different dependency theories. The protocol consists of three phases: converting all structures into function trees; generalizing, for each sentence, the different gold-standard function trees to get their common denominator; and employing an evaluation measure based on tree edit distance (TED) which discards edit operations that recover theory-specific structures. Although the protocol is potentially applicable to a wide class of syntactic representation types, formal restrictions in the procedures effectively limit its applicability to representations that are isomorphic to dependency trees.

The present paper breaks new ground in the ability to soundly compare the accuracy of different parsers relative to one another given that they employ different formal representation types and obey different theoretical assumptions. Our solution generally conforms to the protocol proposed in Tsarfaty et al. (2011) but is re-formalized to allow for arbitrary linearly ordered labeled trees, thus encompassing constituency-based as well as dependency-based representations. The framework in Tsarfaty et al. (2011) assumes structures that are isomorphic to dependency trees, bypassing the problem of arbitrary branching. Here we lift this restriction, and define a protocol based on generalization and TED measures to soundly compare the output of different parsers.

We demonstrate the utility of this protocol by comparing the performance of different parsers for English and Swedish. For English, our parser evaluation across representation types allows us to analyze and precisely quantify previously encountered performance tendencies. For Swedish we show the first-ever evaluation between dependency-based and constituency-based parsing models, all trained on the Swedish treebank data. All in all, we show that our extended protocol, which can handle linearly ordered labeled trees with arbitrary branching, can soundly compare parsing results across frameworks in a representation-independent and language-independent fashion.

2 Preliminaries: Relational Schemes for Cross-Framework Parse Evaluation

Traditionally, different statistical parsers have been evaluated using specially designated evaluation measures that are designed to fit their representation types. Dependency trees are evaluated using attachment scores (Buchholz and Marsi, 2006), phrase-structure trees are evaluated using ParsEval (Black et al., 1991), LFG-based parsers postulate an evaluation procedure based on f-structures (Cahill et al., 2008), and so on. From a downstream-application point of view, there is no significance as to which formalism was used for generating the representation and which learning methods have been utilized. The bottom line is simply which parsing framework most accurately recovers a useful representation that helps to unravel the human-perceived interpretation.

Relational schemes, that is, schemes that encode the set of grammatical relations that constitute the predicate-argument structures of sentences, provide an interface to semantic interpretation. They are more intuitively understood than, say, phrase-structure trees, and thus they are also more useful for practical applications. For these reasons, relational schemes have been repeatedly singled out as an appropriate level of representation for the evaluation of statistical parsers (Lin, 1995; Carroll et al., 1998; Cer et al., 2010).

The annotated data which statistical parsers are trained on encode these grammatical relationships in different ways. Dependency treebanks provide a ready-made representation of grammatical relations on top of arcs connecting the words in the sentence (Kübler et al., 2009). The Penn Treebank and phrase-structure annotated resources encode partial information about grammatical relations as dash-features decorating phrase-structure nodes (Marcus et al., 1993). Treebanks like Tiger for German (Brants et al., 2002) and Talbanken for Swedish (Nivre and Megyesi, 2007) explicitly map phrase structures onto grammatical relations using dedicated edge labels. The Relational-Realizational structures of Tsarfaty and Simaan (2008) encode relational networks (sets of relations) projected and realized by syntactic categories on top of ordinary phrase-structure nodes.

Function trees, as defined in Tsarfaty et al. (2011), are linearly ordered labeled trees in which every node is labeled with the grammatical function …
Figure 2: Unary chains in function trees
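The unary chains of Figure 2 correspond to the sorted-function-tree conversion described in Section 3.3: every label set over a span becomes a chain of single-labeled nodes in a canonical (here alphabetic) order, and an empty set yields a 0-length chain. A minimal sketch, using our own dictionary encoding of a multi-function tree as a map from spans to label sets (not the paper's implementation):

```python
def to_sorted_chains(multi_tree):
    """Turn a multi-function tree, encoded as {(i, j): set_of_labels}, into
    sorted unary chains: one single-labeled node per label, listed in a fixed
    alphabetic order. An empty label set becomes a 0-length chain (no nodes)."""
    return {span: sorted(labels) for span, labels in multi_tree.items()}
```

Fixing a canonical order prevents a chain in one tree from being reordered to match another tree, reflecting the fact that the constraints over a span are unordered.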
Figure 3: The Evaluation Protocol. Different formal frameworks yield different parse and gold formal types. All types are transformed into multi-function trees. All gold trees enter generalization to yield a new gold for each sentence. The different arcs represent the different edit scripts used for calculating the TED-based scores.
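As a toy illustration of the structures flowing through the protocol of Figure 3, a (multi-)function tree can be encoded as a map from word spans to sets of grammatical-function labels. The encoding and the particular labels below are our own simplification, using the example sentence John loves Mary:

```python
# Toy multi-function tree for "John loves Mary": spans (i, j) over word
# positions, mapped to the grammatical-function labels dominating them.
gold_a = {(0, 3): {"root"}, (0, 1): {"sbj"}, (1, 2): {"hd"}, (2, 3): {"obj"}}

# A second gold theory may constrain the same spans differently; the protocol
# generalizes all gold trees span by span before TED-based scoring.
gold_b = {(0, 3): {"root"}, (0, 1): {"sbj"}, (2, 3): {"obj"}}
```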
the intersection of the label sets dominating the same span in both trees. The unification tree contains nodes that exist in one tree or another, and for each span it is labeled by the union of all label sets for this span in either tree. If we generalize two trees and one tree has no specification for labels over a span, it does not share anything with the label set dominating the same span in the other tree, and the label set dominating this span in the generalized tree is empty. If the trees do not agree on any label for a particular span, the respective node is similarly labeled with an empty set. When we wish to unify theories, an empty set over a span is unified with any other set dominating the same span in the other tree, without altering it.

Digression: Using Unification to Merge Information From Different Treebanks  In Tsarfaty et al. (2011), only the generalization operation was used, providing the common denominator of all the gold structures and serving as a common ground for evaluation. The unification operation is useful for other NLP tasks, for instance, combining information from two different annotation schemes or enriching one annotation scheme with information from a different one. In particular, we can take advantage of the new framework to enrich the node structure reflected in one theory with grammatical functions reflected in an annotation scheme that follows a different theory. To do so, we define the Tree-Labeling-Unification operation on multi-function trees.

TL-Unification, denoted ⊔tl, is an operation that returns a tree that retains the structure of the first tree and adds labels that exist over its spans in the second tree. Formally: τ1 ⊔tl τ2 = τ3 iff for every node n ∈ τ1 there exists a node m ∈ τ3 such that span(m) = span(n) and labels(m) = labels(n) ∪ labels(τ2, span(n)),

where labels(τ2, span(n)) is the set of labels of the node with yield span(n) in τ2 if such a node exists, and ∅ otherwise. We further discuss the TL-Unification and its use for data preparation in Section 4.

3.3 TED Measures for Multi-Function Trees

The result of the generalization operation provides us with multi-function trees for each of the sentences in the test set, representing sets of constraints on which the different gold theories agree. We would now like to use distance-based metrics in order to measure the gap between the gold and predicted theories. The idea behind distance-based evaluation in Tsarfaty et al. (2011) is that recording the edit operations between the native gold and the generalized gold allows one to discard their cost when computing the cost of a parse hypothesis turned into the generalized gold. This makes sure that different parsers do not get penalized, or favored, due to annotation-specific decisions that are not shared by other frameworks.

The problem is now that TED is undefined with respect to multi-function trees, because it cannot handle complex labels. To overcome this, we convert multi-function trees into sorted function trees, which are simply function trees in which any label set is represented as a unary chain of single-labeled nodes, and the nodes are sorted according to the canonical order of their labels.² In case of an empty set, a 0-length chain is created, that is, no node is created over this span. Sorted function trees prevent reordering nodes in a chain in one tree to fit the order in another tree, since that would violate the idea that the set of constraints over a span in a multi-function tree is unordered.

The edit operations we assume are add-node(l, i, j) and delete-node(l, i, j), where l ∈ L is a grammatical function label and i < j define the span of a node in the tree. Insertion into a unary chain must conform to the canonical order of the labels. Every operation is assigned a cost. An edit script is a sequence of edit operations that turns a function tree τ1 into τ2, that is:

    ES(τ1, τ2) = ⟨e1, ..., ek⟩

Since all operations are anchored in spans, the sequence can be determined to have a unique order of traversing the tree (say, DFS). Different edit scripts then only differ in their set of operations on spans. The edit distance problem is finding the minimal-cost script, that is, one needs to solve:

    ES*(τ1, τ2) = argmin_{ES(τ1,τ2)} Σ_{e ∈ ES(τ1,τ2)} cost(e)

In the current setting, when using only add and delete operations on spans, there is only one edit script that corresponds to the minimal edit cost. So, finding the minimal edit script entails finding a single set of operations turning τ1 into τ2.

² The ordering can be alphabetic, thematic, etc.
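Encoding a multi-function tree as a map from spans to label sets (our own simplification, not the paper's chart-based implementation), the generalization and TL-Unification operations described above reduce to intersection and union of label sets over shared spans. A minimal sketch:

```python
def generalize(t1, t2):
    """Generalization: for every span, keep only the labels the two trees agree
    on; a span specified in just one tree keeps an empty label set."""
    return {span: t1.get(span, set()) & t2.get(span, set())
            for span in set(t1) | set(t2)}

def tl_unify(t1, t2):
    """Tree-Labeling-Unification: keep the spans of the first tree and add the
    labels the second tree carries over those same spans."""
    return {span: labels | t2.get(span, set()) for span, labels in t1.items()}
```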
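With only span-anchored add and delete operations, the minimal edit script between two sorted function trees is fixed by the labeled spans present in one tree but not the other, and the Δ and sentence-score definitions that follow become direct set arithmetic. A sketch assuming unit operation costs and our own set-of-(label, i, j)-nodes encoding of trees:

```python
def edit_script(t1, t2):
    """Minimal add/delete edit script between trees encoded as sets of
    (label, i, j) nodes: delete what only t1 has, add what only t2 has."""
    return {("delete", n) for n in t1 - t2} | {("add", n) for n in t2 - t1}

def sentence_score(parse, gold, gen):
    """Per-sentence score 1 - Delta / (|parse| + |gen|), where Delta discards
    the operations shared with the gold-to-generalized-gold script, so a parser
    is not penalized for its theory's annotation-specific decisions."""
    parse_to_gen = edit_script(parse, gen)
    gold_to_gen = edit_script(gold, gen)
    delta = len(parse_to_gen) - len(parse_to_gen & gold_to_gen)
    return 1 - delta / (len(parse) + len(gen))
```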
We can now define Δ for the ith framework as the error of parse_i relative to its native gold standard gold_i and to the generalized gold gen. This is the edit cost minus the cost of the script turning parse_i into gen intersected with the script turning gold_i into gen. The underlying intuition is that if an operation that was used to turn parse_i into gen is used to discard theory-specific information from gold_i, its cost should not be counted as error.

    Δ(parse_i, gold_i, gen) = cost(ES(parse_i, gen)) − cost(ES(parse_i, gen) ∩ ES(gold_i, gen))

In order to turn distance measures into parse scores, we now normalize the error relative to the size of the trees and subtract it from unity. So the Sentence Score for parsing with framework i is:

    score(parse_i, gold_i, gen) = 1 − Δ(parse_i, gold_i, gen) / (|parse_i| + |gen|)

Finally, the Test-Set Average is defined by macro-averaging over all sentences in the test set:

    1 − [ Σ_{j=1..|testset|} Δ(parse_ij, gold_ij, gen_j) ] / [ Σ_{j=1..|testset|} (|parse_ij| + |gen_j|) ]

This last formula represents the TEDEVAL metric that we use in our experiments.

A Note on System Complexity: Conversion of a dependency or a constituency tree into a function tree is linear in the size of the tree. Our implementation of the generalization and unification operation is an exact, greedy, chart-based algorithm that runs in polynomial time (O(n²) in n, the number of terminals). The TED software that we utilize builds on the efficient TED algorithm of Zhang and Shasha (1989), which runs in O(|T1||T2| min(d_1, n_1) min(d_2, n_2)) time, where d_i is the tree degree (depth) and n_i is the number of terminals in the respective tree (Bille, 2005).

4 Experiments

We validate our cross-framework evaluation procedure on two languages, English and Swedish. For English, we compare the performance of two dependency parsers, MaltParser (Nivre et al., 2006) and MSTParser (McDonald et al., 2005), and two constituency-based parsers, the Berkeley parser (Petrov et al., 2006) and the Brown parser (Charniak and Johnson, 2005). All experiments use Penn Treebank (PTB) data. For Swedish, we compare MaltParser and MSTParser with two variants of the Berkeley parser, one trained on phrase structure trees, and one trained on a variant of the Relational-Realizational representation of Tsarfaty and Sima'an (2008). All experiments use the Talbanken Swedish Treebank (STB) data.

4.1 English Cross-Framework Evaluation

We use sections 02–21 of the WSJ Penn Treebank for training and section 00 for evaluation and analysis. We use two different native gold standards subscribing to different theories of encoding grammatical relations in tree structures:

THE DEPENDENCY-BASED THEORY is the theory encoded in the basic Stanford Dependencies (SD) scheme. We obtain the set of basic Stanford dependency trees using the software of de Marneffe et al. (2006) and train the dependency parsers directly on it.

THE CONSTITUENCY-BASED THEORY is the theory reflected in the phrase-structure representation of the PTB (Marcus et al., 1993) enriched with function labels compatible with the Stanford Dependencies (SD) scheme. We obtain trees that reflect this theory by TL-Unification of the PTB multi-function trees with the SD multi-function trees (PTB ⊔ttl SD), as illustrated in Figure 4.

The theory encoded in the multi-function trees corresponding to SD is different from the one obtained by our TL-Unification, as may be seen from the difference between the flat SD multi-function tree and the result of the PTB ⊔ttl SD in Figure 4. Another difference concerns coordination structures, encoded as binary branching trees in SD and as flat productions in the PTB ⊔ttl SD. Such differences are not only observable but also quantifiable, and using our redefined TED metric the cross-theory overlap is 0.8571.

The two dependency parsers were trained using the same settings as in Tsarfaty et al. (2011), using SVMTool (Gimenez and Marquez, 2004) to predict part-of-speech tags at parsing time. The two constituency parsers were used with default settings and were allowed to predict their own part-of-speech tags. We report three different evaluation metrics for the different experiments.
Figure 4: Conversion of PTB and SD tree to multi-function trees, followed by TL-Unification of the trees. Note that some PTB nodes remain without an SD label. [Tree diagrams for "John loves Mary" not reproduced.]

- LAS/UAS (Buchholz and Marsi, 2006)
- PARSEVAL (Black et al., 1991)
- TEDEVAL as defined in Section 3

We use LAS/UAS for dependency parsers that were trained on the same dependency theory. We use PARSEVAL to evaluate phrase-structure parsers that were trained on PTB trees in which dash-features and empty traces are removed. We use our implementation of TEDEVAL to evaluate parsing results across all frameworks under two different scenarios:³ TEDEVAL SINGLE evaluates against the native gold multi-function trees. TEDEVAL MULTIPLE evaluates against the generalized (cross-theory) multi-function trees. Unlabeled TEDEVAL scores are obtained by simply removing all labels from the multi-function nodes, and using unlabeled edit operations. We calculate pairwise statistical significance using a shuffling test with 10K iterations (Cohen, 1995).

Tables 1 and 2 present the results of our cross-framework evaluation for English parsing. In the left column of Table 1 we report PARSEVAL scores for constituency-based parsers. As expected, F-scores for the Brown parser are higher than the F-scores of the Berkeley parser. F-scores are however not applicable across frameworks. In the rightmost column of Table 1 we report the LAS/UAS results for all parsers. If a parser yields a constituency tree, it is converted to and evaluated on SD. Here we see that MST outperforms Malt, though the differences for labeled dependencies are insignificant. We also observe here a familiar pattern from Cer et al. (2010) and others, where the constituency parsers significantly outperform the dependency parsers after conversion of their output into dependencies.

The conversion to SD allows one to compare results across formal frameworks, but not without a cost. The conversion introduces a set of annotation-specific decisions which may introduce a bias into the evaluation. In the middle column of Table 1 we report the TEDEVAL metrics measured against the generalized gold standard for all parsing frameworks. We can now confirm that the constituency-based parsers significantly outperform the dependency parsers, and that this is not due to specific theoretical decisions which are seen to affect LAS/UAS metrics (Schwartz et al., 2011). For the dependency parsers we now see that Malt outperforms MST on labeled dependencies slightly, but the difference is insignificant.

The fact that the discrepancy in theoretical assumptions between different frameworks indeed affects the conversion-based evaluation procedure is reflected in the results we report in Table 2. Here the leftmost and rightmost columns report TEDEVAL scores against the parser's own native gold (SINGLE) and the middle column against the generalized gold (MULTIPLE). Had the theories for SD and PTB ⊔ttl SD been identical, TEDEVAL SINGLE and TEDEVAL MULTIPLE would have been equal in each line. Because of theoretical discrepancies, we see small gaps in parser performance between these cases. Our protocol ensures that such discrepancies do not bias the results.

4.2 Cross-Framework Swedish Parsing

We use the standard training and test sets of the Swedish Treebank (Nivre and Megyesi, 2007) with two gold standards presupposing different theories:

THE DEPENDENCY-BASED THEORY is the dependency version of the Swedish Treebank. All trees are projectivized (STB-Dep).

THE CONSTITUENCY-BASED THEORY is the standard Swedish Treebank with grammatical function labels on the edges of constituency structures (STB).

³ Our TedEval software can be downloaded at http://stp.lingfil.uu.se/tsarfaty/unipar/download.html.
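The shuffling test mentioned above can be sketched as a standard approximate-randomization test over paired per-sentence scores (an illustrative sketch, not the authors' implementation; the function name, seeding, and defaults are ours):

```python
import random

def shuffle_test(scores_a, scores_b, iterations=10_000, seed=0):
    """Approximate-randomization (shuffling) significance test.

    scores_a / scores_b: paired per-sentence scores of two systems.
    Returns a p-value: the share of random pair-swaps whose mean score
    difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(iterations):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # swap the systems' scores for this pair
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            hits += 1
    return (hits + 1) / (iterations + 1)  # add-one smoothing of the p-value
```

With 10K iterations, as in the paper, the resolution of the p-value is fine enough for the usual 0.05 significance threshold.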
  Formalism   PS Trees         MF Trees               Dep Trees
  Theory      PTB ⊔ttl SD      (PTB ⊔ttl SD) ⊓t SD    SD
  Metrics     PARSEVAL         TEDEVAL                ATTSCORES
  MALT        N/A              U: 0.9525  L: 0.9088   U: 0.8962  L: 0.8772
  MST         N/A              U: 0.9549  L: 0.9049   U: 0.9059  L: 0.8795
  BERKELEY    F-Score 0.9096   U: 0.9677  L: 0.9227   U: 0.9254  L: 0.9031
  BROWN       F-Score 0.9129   U: 0.9702  L: 0.9264   U: 0.9289  L: 0.9057

Table 1: English cross-framework evaluation: Three measures as applicable to the different schemes. Boldface scores are the highest in their column. Italic scores are the highest for dependency parsers in their column.

  Formalism   PS Trees               MF Trees               Dep Trees
  Theory      PTB ⊔ttl SD            (PTB ⊔ttl SD) ⊓t SD    SD
  Metrics     TEDEVAL SINGLE         TEDEVAL MULTIPLE       TEDEVAL SINGLE
  MALT        N/A                    U: 0.9525  L: 0.9088   U: 0.9524  L: 0.9186
  MST         N/A                    U: 0.9549  L: 0.9049   U: 0.9548  L: 0.9149
  BERKELEY    U: 0.9645  L: 0.9271   U: 0.9677  L: 0.9227   U: 0.9649  L: 0.9324
  BROWN       U: 0.9667  L: 0.9301   U: 0.9702  L: 0.9264   U: 0.9679  L: 0.9362

Table 2: English cross-framework evaluation: TEDEVAL scores against the native gold and the generalized gold.

  Formalism     PS Trees         MF Trees               Dep Trees
  Theory        STB              STB ⊓t Dep             Dep
  Metrics       PARSEVAL         TEDEVAL                ATTSCORE
  MALT          N/A              U: 0.9266  L: 0.8225   U: 0.8298  L: 0.7782
  MST           N/A              U: 0.9275  L: 0.8121   U: 0.8438  L: 0.7824
  BKLY/STB-RR   F-Score 0.7914   U: 0.9281  L: 0.7861   N/A
  BKLY/STB-PS   F-Score 0.7855   N/A                    N/A

Table 3: Swedish cross-framework evaluation: Three measures as applicable to the different schemes. Boldface scores are the highest in their column.

  Formalism     PS Trees               MF Trees               Dep Trees
  Theory        STB                    STB ⊓t Dep             Dep
  Metrics       TEDEVAL SINGLE         TEDEVAL MULTIPLE       TEDEVAL SINGLE
  MALT          N/A                    U: 0.9266  L: 0.8225   U: 0.9264  L: 0.8372
  MST           N/A                    U: 0.9275  L: 0.8121   U: 0.9272  L: 0.8275
  BKLY-STB-RR   U: 0.9239  L: 0.7946   U: 0.9281  L: 0.7861   N/A

Table 4: Swedish cross-framework evaluation: TEDEVAL scores against the native gold and the generalized gold. Boldface scores are the highest in their column.
LAS/UAS to compare across frameworks, and we use TEDEVAL for cross-framework evaluation. Training the Berkeley parser on RR trees, which encode a mapping of PS nodes to grammatical functions, allows us to compare parse results for trees belonging to the STB theory with trees obeying the STB-Dep theory. For unlabeled TEDEVAL scores, the dependency parsers perform at the same level as the constituency parser; the difference is insignificant. For labeled TEDEVAL, the dependency parsers significantly outperform the constituency parser. When considering only the dependency parsers, there is a small advantage for Malt on labeled dependencies, and an advantage for MST on unlabeled dependencies, but the latter is insignificant. This effect is replicated in Table 4, where we evaluate dependency parsers using TEDEVAL against their own gold theories. Table 4 further confirms that there is a gap between the STB and the STB-Dep theories, reflected in the scores against the native and generalized gold.

5 Discussion

We presented a formal protocol for evaluating parsers across frameworks and used it to soundly compare parsing results for English and Swedish. Our approach follows the three-phase protocol of Tsarfaty et al. (2011), namely: (i) obtaining a formal common ground for the different representation types, (ii) computing the theoretical common ground for each test sentence, and (iii) counting only what counts, that is, measuring the distance between the common ground and the parse tree while discarding annotation-specific edits.

A pre-condition for applying our protocol is the availability of a relational interpretation of trees in the different frameworks. For dependency frameworks this is straightforward, as these relations are encoded on top of dependency arcs. For constituency trees with an inherent mapping of nodes onto grammatical relations (Merlo and Musillo, 2005; Gabbard et al., 2006; Tsarfaty and Sima'an, 2008), a procedure for reading relational schemes off of the trees is trivial to implement.

For parsers that are trained on and parse into bare-bones phrase-structure trees this is not so. Reading off the relational structure may be more costly and require the interjection of additional theoretical assumptions via manually written scripts. Scripts that read off grammatical relations based on tree positions work well for configurational languages such as English (de Marneffe et al., 2006), but since grammatical relations are reflected differently in different languages (Postal and Perlmutter, 1977; Bresnan, 2000), a procedure to read off these relations in a language-independent fashion from phrase-structure trees does not, and should not, exist (Rambow, 2010).

The crucial point is that even when using external scripts for recovering a relational scheme for phrase-structure trees, our protocol has a clear advantage over simply scoring converted trees. Manually created conversion scripts alter the theoretical assumptions inherent in the trees and thus may bias the results. Our generalization operation and three-way TED make sure that theory-specific idiosyncrasies injected through such scripts do not lead to over-penalizing or over-crediting theory-specific structural variations.

Certain linguistic structures cannot yet be evaluated with our protocol because of the strict assumption that the labeled spans in a parse form a tree. In the future we plan to extend the protocol for evaluating structures that go beyond linearly-ordered trees, in order to allow for non-projective trees and directed acyclic graphs. In addition, we plan to lift the restriction that the parse yield is known in advance, in order to allow for the evaluation of joint parse-segmentation hypotheses.

6 Conclusion

We developed a protocol for comparing parsing results across different theories and representation types which is framework-independent in the sense that it can accommodate any formal syntactic framework that encodes grammatical relations, and language-independent in the sense that there is no language-specific knowledge encoded in the procedure. As such, this protocol is adequate for parser evaluation in cross-framework and cross-language tasks and parsing competitions, and using it across the board is expected to open new horizons in our understanding of the strengths and weaknesses of different parsers in the face of different theories and different data.

Acknowledgments: We thank David McClosky, Marco Kuhlmann, Yoav Goldberg and three anonymous reviewers for useful comments. We further thank Jennifer Foster for the Brown parses and parameter files. This research is partly funded by the Swedish National Science Foundation.
References

Philip Bille. 2005. A survey on tree edit distance and related problems. Theoretical Computer Science, 337:217–239.

Ezra Black, Steven P. Abney, D. Flickenger, Claudia Gdaniec, Ralph Grishman, P. Harrison, Donald Hindle, Robert Ingria, Frederick Jelinek, Judith L. Klavans, Mark Liberman, Mitchell P. Marcus, Salim Roukos, Beatrice Santorini, and Tomek Strzalkowski. 1991. A procedure for quantitatively comparing the syntactic coverage of English grammars. In Proceedings of the DARPA Workshop on Speech and Natural Language, pages 306–311.

Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The Tiger treebank. In Proceedings of TLT.

Joan Bresnan. 2000. Lexical-Functional Syntax. Blackwell.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL-X, pages 149–164.

Aoife Cahill, Michael Burke, Ruth O'Donovan, Stefan Riezler, Josef van Genabith, and Andy Way. 2008. Wide-coverage deep statistical parsing using automatic dependency structure annotation. Computational Linguistics, 34(1):81–124.

John Carroll, Edward Briscoe, and Antonio Sanfilippo. 1998. Parser evaluation: A survey and a new proposal. In Proceedings of LREC, pages 447–454.

Daniel Cer, Marie-Catherine de Marneffe, Daniel Jurafsky, and Christopher D. Manning. 2010. Parsing to Stanford Dependencies: Trade-offs between speed and accuracy. In Proceedings of LREC.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of ACL.

Paul Cohen. 1995. Empirical Methods for Artificial Intelligence. The MIT Press.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, pages 449–454.

Ryan Gabbard, Mitchell Marcus, and Seth Kulick. 2006. Fully parsing the Penn treebank. In Proceedings of HLT-NAACL, pages 184–191.

Jesús Giménez and Lluís Màrquez. 2004. SVMTool: A general POS tagger generator based on support vector machines. In Proceedings of LREC.

Sandra Kübler, Ryan McDonald, and Joakim Nivre. 2009. Dependency Parsing. Number 2 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Dekang Lin. 1995. A dependency-based method for evaluating broad-coverage parsers. In Proceedings of IJCAI-95, pages 1420–1425.

Mohamed Maamouri, Ann Bies, Tim Buckwalter, and Wigdan Mekki. 2004. The Penn Arabic treebank: Building a large-scale annotated Arabic corpus. In Proceedings of the NEMLAR International Conference on Arabic Language Resources and Tools.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In HLT '05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 523–530, Morristown, NJ, USA. Association for Computational Linguistics.

Beáta Megyesi. 2009. The open source tagger HunPoS for Swedish. In Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA), pages 239–241.

Igor Mel'čuk. 1988. Dependency Syntax: Theory and Practice. State University of New York Press.

Paola Merlo and Gabriele Musillo. 2005. Accurate function parsing. In Proceedings of EMNLP, pages 620–627.

Joakim Nivre and Beáta Megyesi. 2007. Bootstrapping a Swedish Treebank using cross-corpus harmonization and annotation projection. In Proceedings of TLT.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of LREC, pages 2216–2219.

Joakim Nivre, Laura Rimell, Ryan McDonald, and Carlos Gómez-Rodríguez. 2010. Evaluation of dependency parsers on unbounded dependencies. In Proceedings of COLING, pages 813–821.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of ACL.

Paul M. Postal and David M. Perlmutter. 1977. Toward a universal characterization of passivization. In Proceedings of the 3rd Annual Meeting of the Berkeley Linguistics Society, pages 394–417.

Owen Rambow. 2010. The simple truth about dependency and phrase structure representations: An opinion piece. In Proceedings of HLT-ACL, pages 337–340.

Roy Schwartz, Omri Abend, Roi Reichart, and Ari Rappoport. 2011. Neutralizing linguistically problematic annotations in unsupervised dependency parsing evaluation. In Proceedings of ACL, pages 663–672.

Khalil Sima'an, Alon Itai, Yoad Winter, Alon Altman, and Noa Nativ. 2001. Building a Tree-Bank for Modern Hebrew Text. In Traitement Automatique des Langues.

Reut Tsarfaty and Khalil Sima'an. 2008. Relational-Realizational parsing. In Proceedings of COLING.

Reut Tsarfaty, Joakim Nivre, and Evelina Andersson. 2011. Evaluating dependency parsing: Robust and heuristics-free cross-framework evaluation. In Proceedings of EMNLP.

Kaizhong Zhang and Dennis Shasha. 1989. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18:1245–1262.
Dependency Parsing of Hungarian: Baseline Results and Challenges

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 55–65, Avignon, France, April 23–27, 2012. ©2012 Association for Computational Linguistics
and old information (the topic) precedes the focus position. Thus, the position relative to the verb has no predictive force as regards the syntactic function of the given argument: while in English, the noun phrase before the verb is most typically the subject, in Hungarian, it is the focus of the sentence, which itself can be the subject, object or any other argument (É. Kiss, 2002).

The grammatical function of words is determined by case suffixes, as in gyerek 'child' vs. gyereknek (child-DAT) 'for (a/the) child'. Hungarian nouns can have about 20 cases,¹ which mark the relationship between the head and its arguments and adjuncts. Although there are postpositions in Hungarian, case suffixes can also express relations that are expressed by prepositions in English.

Verbs are inflected for person and number and the definiteness of the object. Since conjugational information is sufficient to deduce the pronominal subject or object, they are typically omitted from the sentence: Várlak (wait-1SG2OBJ) 'I am waiting for you'. This pro-drop feature of Hungarian leads to the fact that there are several clauses without an overt subject or object.

Another peculiarity of Hungarian is that the third person singular present tense indicative form of the copula is phonologically empty, i.e. there are apparently verbless sentences in Hungarian: A ház nagy (the house big) 'The house is big'. However, in other tenses or moods, the copula is present, as in A ház nagy lesz (the house big will.be) 'The house will be big'.

There are two possessive constructions in Hungarian. First, the possessive relation is only marked on the possessed noun (in contrast, it is marked only on the possessor in English): a fiú kutyája (the boy dog-POSS) 'the boy's dog'. Second, both the possessor and the possessed bear a possessive marker: a fiúnak a kutyája (the boy-DAT the dog-POSS) 'the boy's dog'. In the latter case, the possessor and the possessed may not be adjacent within the sentence, as in A fiúnak látta a kutyáját (the boy-DAT see-PAST3SG.OBJ the dog-POSS-ACC) 'He saw the boy's dog', which results in a non-projective syntactic tree. Note that in the first case, the form of the possessor coincides with that of a nominative noun, while in the second case, it coincides with a dative noun.

According to these facts, a Hungarian parser must rely much more on morphological analysis than e.g. an English one, since in Hungarian it is morphemes that mostly encode morphosyntactic information. One of the consequences of this is that Hungarian sentences are shorter in terms of word numbers than English ones. Based on the word counts of the Hungarian–English parallel corpus Hunglish (Varga et al., 2005), an English sentence contains 20.5% more words than its Hungarian equivalent. These extra words in English are most frequently prepositions, pronominal subjects or objects, whose parent and dependency label are relatively easy to identify (compared to other word classes). This train of thought indicates that the cross-lingual comparison of final parser scores should be conducted very carefully.

3 Related work

We decided to focus on dependency parsing in this study as it is a superior framework for non-configurational languages. It has gained interest in natural language processing recently because the representation itself does not require the words inside of constituents to be consecutive and it naturally represents discontinuous constructions, which are frequent in languages where grammatical relations are often signaled by morphology instead of word order (McDonald and Nivre, 2011). The two main efficient approaches for dependency parsing are the graph-based and the transition-based parsers. The graph-based models look for the highest scoring directed spanning tree in the complete graph whose nodes are the words of the sentence in question. They solve the machine learning problem of finding the optimal scoring function of subgraphs (Eisner, 1996; McDonald et al., 2005). The transition-based approaches parse a sentence in a single left-to-right pass over the words. The next transition in these systems is predicted by a classifier that is based on history-related features (Kudo and Matsumoto, 2002; Nivre et al., 2004).

Although the available treebanks for Hungarian are relatively big (82K sentences) and fully manually annotated, the studies on parsing Hungarian are rather limited. The Szeged (Constituency) Treebank (Csendes et al., 2005) con-

¹ Hungarian grammars and morphological coding systems do not agree on the exact number of cases; some rare suffixes are treated as derivational suffixes in one grammar and as case suffixes in others; see e.g. Farkas et al. (2010).
sists of six domains, namely short business news, newspaper, law, literature, compositions and informatics, and it is manually annotated for the possible alternatives of words' morphological analyses, the disambiguated analysis and constituency trees. We are aware of only two articles on phrase-structure parsers which were trained and evaluated on this corpus (Barta et al., 2005; Ivan et al., 2007), and there are a few studies on hand-crafted parsers reporting results on small own corpora (Babarczy et al., 2005; Prószéky et al., 2004).

The Szeged Dependency Treebank (Vincze et al., 2010) was constructed by first automatically converting the phrase-structure trees into dependency trees; then each of them was manually investigated and corrected. We note that the dependency treebank contains more information than the constituency one, as linguistic phenomena (like discontinuous structures) were not annotated in the former corpus, but were added to the dependency treebank. To the best of our knowledge, no parser results have been published on this corpus. Both corpora are available at www.inf.u-szeged.hu/rgai/SzegedTreebank.

The multilingual track of the CoNLL-2007 Shared Task (Nivre et al., 2007) also addressed the task of dependency parsing of Hungarian. The Hungarian corpus used for the shared task consists of automatically converted dependency trees from the Szeged Constituency Treebank. Several issues of the automatic conversion tool were reconsidered before the manual annotation of the Szeged Dependency Treebank was launched, and the annotation guidelines contained instructions related to linguistic phenomena which could not be converted from the constituency representation; for a detailed discussion, see Vincze et al. (2010). Hence the annotation schemata of the CoNLL-2007 Hungarian corpus and the Szeged Dependency Treebank are rather different, and the final scores reported for the former are not directly comparable with our reported scores here (see Section 5).

4 The Szeged Dependency Treebank

We utilize the Szeged Dependency Treebank (Vincze et al., 2010) as the basis of our experiments for Hungarian dependency parsing. It contains 82,000 sentences, 1.2 million words and 250,000 punctuation marks from six domains. The annotation employs 16 coarse-grained POS tags, 95 morphological feature values and 29 dependency labels. 19.6% of the sentences in the corpus contain non-projective edges and 1.8% of the edges are non-projective,² which is almost 5 times more frequent than in English and is the same as the Czech non-projectivity level (Buchholz and Marsi, 2006). Here we discuss two annotation principles, along with our modifications in the dataset for this study, which strongly influence the parsers' accuracies.

Named Entities (NEs) were treated as one token in the Szeged Dependency Treebank. Assuming a perfect phrase recogniser on the whitespace-tokenised input for them is quite unrealistic, thus we decided to split them into tokens for this study. The new tokens automatically got a proper noun morphological analysis with default morphological features, except for the last token (the head of the phrase), which inherited the morphological analysis of the original multiword unit (which can contain various grammatical information). This resulted in an N N N N POS sequence for Kovács és társa kft. 'Smith and Co. Ltd.', which would be annotated as N C N N in the Penn Treebank. Moreover, we did not annotate any internal structure of Named Entities. We consider the last word of multiword named entities as the head for morphological reasons (the last word of multiword units gets inflected in Hungarian), and all the previous elements are attached to the succeeding word, i.e. the penultimate word is attached to the last word, the antepenultimate word to the penultimate one, etc. The reasons for these considerations are that we believe that there are no downstream applications which can exploit the information of the internal structures of Named Entities, and we imagine a pipeline where a Named Entity Recogniser precedes the parsing step.

Empty copula: In the verbless clauses (predicative nouns or adjectives) the Szeged Dependency Treebank introduces virtual nodes (16,000 items in the corpus). This solution means that a similar tree structure is ascribed to the same sentence in the present third person singular and all the other tenses / persons. A further argument for the use of a virtual node is that the virtual node is always present at the syntactic level

² Using the transitive closure definition of Nivre and Nilsson (2005).
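An edge is non-projective when some word between the head and the dependent is not dominated by the head; the statistics above use the transitive-closure definition of Nivre and Nilsson (2005). A small illustrative checker (our own sketch, not the treebank tooling), assuming heads[i] gives the head of token i and 0 denotes the artificial root:

```python
# Illustrative non-projectivity check: the edge head -> dep is non-projective
# iff some token strictly between them is not dominated by the head
# (dominance = reachability by repeatedly following head pointers upward).

def is_nonprojective_edge(heads, dep):
    head = heads[dep]
    lo, hi = sorted((head, dep))
    for w in range(lo + 1, hi):
        a = w
        while a != 0 and a != head:  # walk up toward the root
            a = heads[a]
        if a != head:                # never reached `head`: not dominated
            return True
    return False

# Toy tree with a crossing edge: 3 -> 1 spans token 2, which hangs off the root.
heads = {1: 3, 2: 0, 3: 2, 4: 3}
print(is_nonprojective_edge(heads, 1))  # True
print(is_nonprojective_edge(heads, 4))  # False
```

Counting over all edges of a treebank with such a predicate is how per-edge and per-sentence non-projectivity rates like the 1.8% / 19.6% figures above can be derived.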
  corpus            Malt                        MST                         Mate
                    ULA          LAS            ULA          LAS            ULA          LAS
  Hungarian dev     88.3 (89.9)  85.7 (87.9)    86.9 (88.5)  80.9 (82.9)    89.7 (91.1)  86.8 (89.0)
  Hungarian test    88.7 (90.2)  86.1 (88.2)    87.5 (89.0)  81.6 (83.5)    90.1 (91.5)  87.2 (89.4)
  English dev       87.8 (89.1)  84.5 (86.1)    89.4 (91.2)  86.1 (87.7)    91.6 (92.7)  88.5 (90.0)
  English test      88.8 (89.9)  86.2 (87.6)    90.7 (91.8)  87.7 (89.2)    92.6 (93.4)  90.3 (91.5)

Table 1: Results achieved by the three parsers on the (full) Hungarian (Szeged Dependency Treebank) and English (CoNLL-2009) datasets. The scores in brackets are achieved with gold-standard POS tagging.
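ULA and LAS in Table 1 are the usual unlabeled and labeled attachment scores: the percentage of tokens that receive the correct head, and the correct head together with the correct dependency label, respectively. A minimal sketch (our illustrative representation of a parsed sentence as a list of (head, label) pairs, one per token):

```python
# Illustrative attachment-score computation for one sentence.
# gold/pred: one (head, label) pair per token, in token order.

def attachment_scores(gold, pred):
    n = len(gold)
    ula = 100.0 * sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    las = 100.0 * sum(g == p for g, p in zip(gold, pred)) / n
    return ula, las

gold = [(2, "SUBJ"), (0, "ROOT"), (2, "OBJ"), (3, "ATT")]
pred = [(2, "SUBJ"), (0, "ROOT"), (2, "ATT"), (2, "ATT")]
print(attachment_scores(gold, pred))  # (75.0, 50.0)
```

In the example, the third token has the right head but the wrong label (it counts for ULA only), and the fourth token has the wrong head (it counts for neither).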
since it is overt in all the other forms, tenses and moods of the verb. Still, the state-of-the-art dependency parsers cannot handle virtual nodes. For this study, we followed the solution of the Prague Dependency Treebank (Hajič et al., 2000): virtual nodes were removed from the gold standard annotation, all of their dependents were attached to the head of the original virtual node, and they were given a dedicated edge label (Exd).

Dataset splits: We formed training, development and test sets from the corpus, where each set consists of texts from each of the domains. We paid attention to the issue that a document should not be separated into different datasets, because that could result in a situation where a part of the test document was seen in the training dataset (which is unrealistic because of unknown words, style and frequently used grammatical structures). As the fiction subcorpus consists of three books and the law subcorpus consists of two rules, we took half of one of the documents for the test and development sets and used the other part(s) for training there. This principle was followed in our cross-fold-validation experiments as well, except for the law subcorpus. We applied 3 folds for cross-validation for the fiction subcorpus; otherwise we used 10 folds (splitting at document boundaries would yield a training fold consisting of just 3000 sentences).³

5 Experiments

We carried out experiments using three state-of-the-art parsers on the Szeged Dependency Treebank (Vincze et al., 2010) and on the English datasets of the CoNLL-2009 Shared Task (Hajič et al., 2009).

Tools: We employed a finite state automata-based morphological analyser constructed from the morphdb.hu lexical resource (Trón et al., 2006) and we used the MSD-style morphological code system of the Szeged TreeBank (Alexin et al., 2003). The output of the morphological analyser is a set of possible lemma–morphological analysis pairs. This set of possible morphological analyses for a word form is then used as the set of possible alternatives, instead of open and closed tag sets, in a standard sequential POS tagger. Here, we applied the Conditional Random Fields-based Stanford POS tagger (Toutanova et al., 2003) and carried out 5-fold cross POS training/tagging inside the subcorpora.⁴ For the English experiments we used the predicted POS tags provided for the CoNLL-2009 shared task (Hajič et al., 2009).

As dependency parsers we employed three state-of-the-art data-driven systems: a transition-based parser (Malt) and two graph-based parsers (the MST and Mate parsers). The Malt parser (Nivre et al., 2004) is a transition-based system, which uses an arc-eager system along with support vector machines to learn the scoring function for transitions, and which uses greedy, deterministic one-best search at parsing time. As one of the graph-based parsers, we employed the MST parser (McDonald et al., 2005) with a second-order feature decoder. It uses an approximate exhaustive search for unlabeled parsing; then a separate arc label classifier is applied to label each arc. The Mate parser (Bohnet, 2010) is an efficient second-order dependency parser that models the interaction between siblings as well as grandchildren (Carreras, 2007). Its decoder works on labeled edges, i.e. it uses a single-step approach for obtaining labeled dependency trees. Mate uses a rich and

³ Both the training/development/test and the cross-validation splits are available at www.inf.u-szeged.hu/rgai/SzegedTreebank.

⁴ The JAVA implementation of the morphological analyser and the slightly modified POS tagger, along with trained models, are available at http://www.inf.u-szeged.hu/rgai/magyarlanc.
corpus #sent. length CPOS DPOS ULA all ULA LAS all LAS
newspaper 9189 21.6 97.2 96.5 88.0 (90.0) +0.8 84.7 (87.5) +1.0
short business 8616 23.6 98.0 97.7 93.8 (94.8) +0.3 91.9 (93.4) +0.4
fiction 9279 12.6 96.9 95.8 87.7 (89.4) -0.5 83.7 (86.2) -0.3
law 8347 27.3 98.3 98.1 90.6 (90.7) +0.2 88.9 (89.0) +0.2
computer 8653 21.9 96.4 95.8 91.3 (92.8) -1.2 88.9 (91.2) -1.6
composition 22248 13.7 96.7 95.6 92.7 (93.9) +0.3 88.9 (91.0) +0.3
Table 2: Domain results achieved by the Mate parser in cross-validation settings. The scores in brackets are
achieved with gold-standard POS tagging. The all columns contain the added value of extending the training
sets with each of the five out-domain subcorpora.
well-engineered feature set and it is enhanced by ment was to gain an insight into the performance
a Hash Kernel, which leads to higher accuracy. of the parsers which can only access configura-
tional information. These parsers achieved worse
Evaluation metrics: We apply the Labeled At-
results than the full parsers by 6.8 ULA, 20.3 LAS
tachment Score (LAS) and Unlabeled Attachment
and 2.9 ULA, 6.4 LAS on the development sets
Score (ULA), taking into account punctuation as
of Hungarian and English, respectively. As ex-
well for evaluating dependency parsers and the
pected, Hungarian suffers much more when the
accuracy on the main POS tags (CPOS) and a
parser has to learn from configurational informa-
fine-grained morphological accuracy (DPOS) for
tion only, especially when grammatical functions
evaluating the POS tagger. In the latter, the analy-
have to be predicted (LAS). Despite this, the re-
sis is regarded as correct if the main POS tag and
sults of Table 1 show that the parsers can practi-
each of the morphological features of the token in
cally eliminate this gap by learning from morpho-
question are correct.
logical features (and lexicalization). This means
Results: Table 1 shows the results got by the that the data-driven parsers employing a very rich
parsers on the whole Hungarian corpora and on feature set can learn a model which effectively
the English datasets. The most important point captures the dependency structures using feature
is that scores are not different from the English weights which are radically different from the
scores (although they are not directly compara- ones used for English.
ble). To understand the reasons for this, we man- Another cause of the relatively high scores is
ually investigated the set of firing features with that the CPOS accuracy scores on Hungarian
the highest weights in the Mate parser. Although and English are almost equal: 97.2 and 97.3, re-
the assessment of individual feature contributions spectively. This also explains the small differ-
to a particular decoder decision is not straightfor- ence between the results got by gold-standard and
ward, we observed that features encoding config- predicted POS tags. Moreover, the parser can
urational information (i.e. the direction or length also exploit the morphological features as input
of an edge, the words or POS tag sequences/sets in Hungarian.
between the governor and the dependent) were The Mate parser outperformed the other two
frequently among the highest weighted features parsers on each of the four datasets. Comparing
in English but were extremely rare in Hungarian. the two graph-based parsers Mate and MST, the
For instance, one of the top weighted features for gap between them was twice as big in LAS than in
a subject dependency in English was the there is ULA in Hungarian, which demonstrates that the
no word between the head and the dependent fea- one-step approach looking for the maximum
ture while this never occurred among the top fea- labeled spanning tree is more suitable for Hun-
tures in Hungarian. garian than the two-step arc labeling approach of
As a control experiment, we trained the Mate MST. This probably holds for other morpholog-
parser only having access to the gold-standard ically rich languages too as the decoder can ex-
POS tag sequences of the sentences, i.e. we ploit information from the labels of decoded arcs.
switched off the lexicalization and detailed mor- Based on these results, we decided to use only
phological information. The goal of this experi- Mate for our further experiments.
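The tagging setup described above, where a sequential tagger's candidate set for each word is restricted to the analyses proposed by a morphological analyser rather than a global open/closed tag set, can be sketched as follows. This is a minimal illustration with a greedy decoder; the toy lexicon and scoring function are hypothetical stand-ins, not the magyarlanc or Stanford tagger components.

```python
# Sketch: a greedy sequential tagger whose candidate tags per token come
# from a morphological analyser instead of a global tag set. The lexicon
# and scores below are toy stand-ins for illustration only; a CRF-based
# tagger would supply real scores over the same restricted candidates.

TOY_ANALYSER = {                # word form -> possible (lemma, tag) analyses
    "várom": [("vár", "VERB"), ("vár", "NOUN")],   # ambiguous form
    "a":     [("a", "DET")],
    "házat": [("ház", "NOUN")],
}

def candidate_tags(word):
    """Tags allowed for a word: analyser output if known, else an open set."""
    analyses = TOY_ANALYSER.get(word)
    if analyses:
        return sorted({tag for _, tag in analyses})
    return ["NOUN", "VERB", "DET"]          # fallback open tag set

def score(prev_tag, tag, word):
    """Toy transition/emission score standing in for a trained model."""
    bonus = 1.0 if (prev_tag, tag) == ("DET", "NOUN") else 0.0
    return bonus + (0.5 if tag == "NOUN" else 0.1)

def tag_sentence(words):
    tags, prev = [], "<S>"
    for w in words:
        best = max(candidate_tags(w), key=lambda t: score(prev, t, w))
        tags.append(best)
        prev = best
    return tags

print(tag_sentence(["a", "házat", "várom"]))   # ['DET', 'NOUN', 'NOUN']
```

The point of the restriction is that for a known form the tagger only ever chooses among the analyser's proposals, so impossible tag assignments are ruled out before decoding.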
59
Table 2 provides an insight into the effect of domain differences on POS tagging and parsing scores. There is a noticeable difference between the newspaper and the short business news corpora. Although these domains seem to be close to each other at first glance (both are news), they have different characteristics. On the one hand, short business news is a very narrow domain consisting of 2-3 sentence long financial reports. It frequently uses the same grammatical structures (like "Stock indexes rose X percent at the Y Stock on Wednesday") and its lexicon is also limited. On the other hand, the newspaper subcorpus consists of full journal articles covering various domains, and it has an elaborate journalistic style.

The effect of extending the training dataset with out-of-domain parses is not convincing. In spite of the ten times bigger training datasets, there are two subcorpora where they just harmed the parser, and the improvement on the other subcorpora is less than 1 percent. This demonstrates well the domain dependence of parsing.

The parser and the POS tagger react to domain difficulties in a similar way, according to the first four rows of Table 2. This observation also holds for the scores of the parsers working with gold-standard POS tags, which suggests that domain difficulties harm POS tagging and parsing alike. Regarding the two last subcorpora, the compositions consist of very short and usually simple sentences, and the training corpora are twice as big compared with the other subcorpora. Both factors are probably the reasons for the good parsing performance. In the computer corpus, there are many English terms which are manually tagged with an "unknown" tag. They could not be accurately predicted by the POS tagger, but the parser could predict their syntactic role.

Table 2 also tells us that the difference between CPOS and DPOS is usually less than 1 percent. This experimentally supports that the ambiguity among alternative morphological analyses is mostly present at the POS level, and that the morphological features are efficiently identified by our morphological analyser. The most frequent morphological features which cannot be disambiguated at the word level are related to suffixes with multiple functions, or to words which cannot be unambiguously segmented into morphemes. Although the number of such ambiguous cases is low, they form important features for the parser, thus we will focus on the more accurate handling of these cases in future work.

Comparison to CoNLL-2007 results: The best performing participant of the CoNLL-2007 Shared Task (Nivre et al., 2007) achieved a ULA of 83.6 and a LAS of 80.3 (Hall et al., 2007) on the Hungarian corpus. The differences between the top performing English and Hungarian systems were 8.14 ULA and 9.3 LAS. The results reported in 2007 were thus significantly lower, and the gap between English and Hungarian was higher than our current values. To locate the sources of the difference, we carried out further experiments with Mate on the CoNLL-2007 dataset using the gold-standard POS tags (the shared task used gold-standard POS tags for evaluation).

First we trained and evaluated Mate on the original CoNLL-2007 datasets, where it achieved a ULA of 84.3 and a LAS of 80.0. Then we used the sentences of the CoNLL-2007 datasets but with the new, manual annotation. Here, Mate achieved a ULA of 88.6 and a LAS of 85.5, which means that the modified annotation schema and the less erroneous/noisy annotation caused an improvement of 4.3 ULA and 5.5 LAS. The annotation schema changed a lot: coordination had to be corrected manually since it is treated differently after conversion; moreover, the internal structure of adjectival/participial phrases was not marked in the original constituency treebank, so it was also added manually (Vincze et al., 2010). The improvement in the labeled attachment score is probably due to the reduction of the label set (from 49 to 29 labels), a step which was justified by the fact that some morphosyntactic information was doubly coded in the case of nouns in the original CoNLL-2007 dataset: first by their morphological case (Cas=ins) and second by their dependency label (INS), e.g. hazzal (house-INS) 'with the/a house'.

Lastly, as the CoNLL-2007 sentences came from the newspaper subcorpus, we can compare these scores with the ULA of 90.0 and LAS of 87.5 in Table 2. The differences of 1.5 ULA and 2.0 LAS are the result of the bigger training corpus (9189 sentences on average, compared to 6390 in the CoNLL-2007 dataset).
Hungarian                    label   attach.  |  English                    label   attach.
virtual nodes                31.5%   39.5%    |  multiword NEs              15.2%   17.6%
conjunctions and negation      -     11.2%    |  PP-attachment                -     15.9%
noun attachment                -      9.6%    |  non-canonical word order   6.4%     6.5%
more than 1 premodifier        -      5.1%    |  misplaced clause             -      9.7%
coordination                 13.5%   16.5%    |  coordination               8.5%    12.5%
mislabeled adverb            16.3%     -      |  mislabeled adverb          40.1%     -
annotation errors            10.7%    6.8%    |  annotation errors          9.7%     8.5%
other                        28.0%   11.3%    |  other                      20.1%   29.3%
TOTAL                        100%    100%     |  TOTAL                      100%    100%

Table 3: The most frequent corpus-specific and general attachment and labeling error categories (based on a manual investigation of 200-200 erroneous sentences, i.e. 200 per language).
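The raw material for such an error breakdown, the set of dependencies where parser output and gold standard disagree, split into attachment errors and labeling errors, can be extracted mechanically before the manual categorization. A minimal sketch follows; the tuple-based token representation is an illustrative assumption.

```python
# Sketch: collect attachment errors (wrong head) and label errors
# (correct head, wrong label) from gold vs. predicted dependencies,
# as a starting point for manual error categorization as in Table 3.
# Tokens are (id, gold_head, gold_label, pred_head, pred_label) tuples.

def split_errors(sentence):
    attachment, label = [], []
    for tid, g_h, g_l, p_h, p_l in sentence:
        if g_h != p_h:
            attachment.append(tid)      # head mismatch dominates
        elif g_l != p_l:
            label.append(tid)           # head correct, label wrong
    return attachment, label

sent = [(1, 2, "DET",  2, "DET"),
        (2, 3, "SUBJ", 3, "OBJ"),   # label error
        (3, 0, "ROOT", 0, "ROOT"),
        (4, 3, "OBJ",  2, "OBJ")]   # attachment error
print(split_errors(sent))           # ([4], [2])
```

Only sentences with a non-empty result would then be inspected by hand and assigned to the categories of Table 3.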
verb. In free word order languages, the order of the arguments of the infinitive and the main verb may get mixed, which is called scrambling (Ross, 1986). This is not a common source of error in English, as English arguments cannot scramble.

Article attachment: In Hungarian, if there is an article before a prenominal modifier, it can belong either to the head noun or to the modifier. In a szoba ajtaja (the room door-3SGPOSS) 'the door of the room', the article belongs to the modifier, but when the prenominal modifier cannot have an article, as in a februarban indulo projekt (the February-INE starting project) 'the project starting in February', it is attached to the head noun (i.e. to projekt 'project'). It was not always clear for the parser which parent to select for the article. In contrast, these cases are not problematic in English, since the modifier typically follows the head and thus each article precedes its head noun.

Conjunctions and negation words (most typically the words is 'too', csak 'only/just' and nem/sem 'not') were much more frequently attached to the wrong node in Hungarian than in English. In Hungarian, they are ambiguous between being adverbs and conjunctions, and it is mostly their conjunctive uses which are problematic from the viewpoint of parsing. On the other hand, these words have an important role in marking the information structure of the sentence: they are usually attached to the element in focus position, and if there is no focus, they are attached to the verb. However, sentences with or without focus can have similar word order, while their stress patterns differ. Dependency parsers obviously cannot recognize stress patterns, hence conjunctions and negation words are sometimes erroneously attached to the verb in Hungarian.

English sentences with non-canonical word order (e.g. questions) were often incorrectly parsed; e.g. the noun following the main verb was analyzed as the object in sentences like "Replied a salesman: Exactly.", where it is the subject that follows the verb for stylistic reasons. In Hungarian, however, morphological information helps in such sentences, as it is not the position relative to the verb but the case suffix that determines the grammatical role of the noun.

In English, high or low PP-attachment was responsible for many parsing ambiguities: most typically, a prepositional complement which follows the head was attached to the verb instead of the noun, or vice versa. In contrast, Hungarian is a head-after-dependent language, which means that dependents most often occur before the head. Furthermore, there are no prepositions in Hungarian, and grammatical relations encoded by prepositions in English are conveyed by suffixes or postpositions. Thus, if there is a modifier before the nominal head, it requires the presence of a participle, as in Felvette a kirakatban levo ruhat (take.on-PAST3SGOBJ the shop.window-INE being dress-ACC) 'She put on the dress in the shop window'. The English sentence is ambiguous (either the event happens in the shop window or the dress was originally in the shop window), while the Hungarian one has only the latter meaning.[6]

General dependency parsing difficulties: There were certain structures that led to typical label and/or attachment errors in both languages. The most frequent one among them is coordination. However, it should be mentioned that such syntactic ambiguities are often problematic even for humans to disambiguate without contextual or background semantic knowledge.

In the case of label errors, the relation between the given node and its parent was labeled incorrectly. In both English and Hungarian, one of the most common errors of this type was mislabeled adverbs and adverbial phrases, e.g. locative adverbs were labeled as ADV/MODE. However, the frequency of this error type is much higher in English than in Hungarian, which may be related to the fact that in the English corpus there is a much more balanced distribution of adverbial labels than in the Hungarian one (where the categories MODE and TLOCY are responsible for 90% of the occurrences). Assigning the most frequent label of the training dataset to each adverb yields an accuracy of 82% in English and 93% in Hungarian, which suggests that there is a higher level of ambiguity for English adverbial phrases. For instance, the preposition by may introduce an adverbial modifier of manner (MNR), as in "by creating a bill", or the agent of a passive sentence (LGS). Thus, labeling adverbs seems to be a more

[6] However, there exists a head-before-dependent version of the sentence (Felvette a ruhat a kirakatban), whose preferred reading is 'She was in the shop window while dressing up', that is, the modifier belongs to the verb.
difficult task in English.[7]

Clauses were also often mislabeled in both languages, most typically when there was no overt conjunction between the clauses. Another source of error was when more than one modifier occurred before a noun (5.1% and 4.2% of attachment errors in Hungarian and in English, respectively): in these cases, the first modifier could belong to the noun (a brown Japanese car) or to the second modifier (a brown-haired girl).

Multiword Named Entities: As we mentioned in Section 4, members of multiword Named Entities had a proper noun POS tag and an NE label in our dataset. Hence, when parsing is based on gold-standard POS tags, their recognition is almost perfect, while it is a frequent source of errors in the CoNLL-2009 corpus. We investigated the parses of our 200 sentences with predicted POS tags at NEs and found that this introduces several errors (about 5% of both attachment and labeling errors) in Hungarian. On the other hand, the results are only slightly worse in English, i.e. identifying the inner structure of NEs does not depend on whether the parser builds on gold-standard or predicted POS tags, since function words like conjunctions or prepositions which mark grammatical relations are tagged in the same way in both cases. The relative frequency of this error type is much higher in English even when the Hungarian parser does not have access to the gold proper noun POS tags. The reason for this is simple: in the Penn Treebank, the correct internal structure of the NEs has to be identified beyond the phrase boundaries, while in Hungarian their members just form a chain.

Annotation errors: We note that our analysis took into account only sentences which contained at least one parsing error, and we examined only the dependencies where the gold-standard annotation and the output of the parser did not match. Hence, the frequency of annotation errors is probably higher than what we found during our investigation (about 1% of the entire set of dependencies), as there could be annotation errors in the error-free sentences and also in the investigated sentences where the parser agrees with the erroneous annotation.

[7] We would nevertheless like to point out that adverbial labels have a highly semantic nature, i.e. it could be argued that it is not the syntactic parser that should identify them but a semantic processor.

7 Conclusions

We showed that state-of-the-art dependency parsers achieve similar results in terms of attachment scores on Hungarian and English. Although this comparison should be taken with a pinch of salt, as sentence lengths (and the information encoded in single words) differ, and domain differences and annotation schema divergences cannot be fully factored out, we conclude that parsing Hungarian is just as hard a task as parsing English. We argued that this is due to the relatively good POS tagging accuracy (which is a consequence of the low ambiguity among the alternative morphological analyses of a sentence and the good coverage of the morphological analyser) and to the fact that data-driven dependency parsers employ a rich feature representation which enables them to learn different kinds of feature weight profiles.

We also discussed the domain differences among the subcorpora of the Szeged Dependency Treebank and their effect on parsing results. Our results support that the differences in parsing scores among domains within one language can be higher than those among corpora from a similar domain but different languages (which again marks the pitfalls of inter-language comparison of parsing scores).

Our systematic error analysis showed that handling virtual nodes (mostly empty copulas) is a frequent source of errors. We identified several phenomena which are not typically listed as Hungarian syntax-specific features but are challenging for current data-driven parsers, while they are not problematic in English (like the attachment of conjunctions and negation words and the attachment problem of nouns and articles).

Based on our quantitative analysis, we concluded that a further notable error reduction is only achievable if distinctive attention is paid to these language-specific phenomena.

As future work, we intend to investigate the problem of virtual nodes in dependency parsing in more depth and to implement new feature templates for the Hungarian-specific challenges.

Acknowledgments

This work was supported in part by the Deutsche Forschungsgemeinschaft grant SFB 732 and the NIH grant (project codename MASZEKER) of the Hungarian government.
References

Zoltan Alexin, Janos Csirik, Tibor Gyimothy, Karoly Bibok, Csaba Hatvani, Gabor Proszeky, and Laszlo Tihanyi. 2003. Annotated Hungarian National Corpus. In Proceedings of the EACL, pages 53-56.

Anna Babarczy, Balint Gabor, Gabor Hamp, and Andras Rung. 2005. Hunpars: a rule-based sentence parser for Hungarian. In Proceedings of the 6th International Symposium on Computational Intelligence.

Csongor Barta, Dora Csendes, Janos Csirik, Andras Hocza, Andras Kocsor, and Kornel Kovacs. 2005. Learning syntactic tree patterns from a balanced Hungarian natural language database, the Szeged Treebank. In Proceedings of 2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering, pages 225-231.

Bernd Bohnet. 2010. Top accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 89-97.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X Shared Task on Multilingual Dependency Parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pages 149-164.

Xavier Carreras. 2007. Experiments with a higher-order projective dependency parser. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 957-961.

Dora Csendes, Janos Csirik, Tibor Gyimothy, and Andras Kocsor. 2005. The Szeged Treebank. In TSD, pages 123-131.

Katalin E. Kiss. 2002. The Syntax of Hungarian. Cambridge University Press, Cambridge.

Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: an exploration. In Proceedings of the 16th Conference on Computational Linguistics - Volume 1, COLING '96, pages 340-345.

Richard Farkas, Daniel Szeredi, Daniel Varga, and Veronika Vincze. 2010. MSD-KR harmonizacio a Szeged Treebank 2.5-ben [Harmonizing MSD and KR codes in the Szeged Treebank 2.5]. In VII. Magyar Szamitogepes Nyelveszeti Konferencia, pages 349-353.

Jan Hajic, Alena Bohmova, Eva Hajicova, and Barbora Vidova-Hladka. 2000. The Prague Dependency Treebank: A Three-Level Annotation Scenario. In Anne Abeille, editor, Treebanks: Building and Using Parsed Corpora, pages 103-127. Amsterdam: Kluwer.

Jan Hajic, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antonia Marti, Lluis Marquez, Adam Meyers, Joakim Nivre, Sebastian Pado, Jan Stepanek, Pavel Stranak, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, pages 1-18.

Johan Hall, Jens Nilsson, Joakim Nivre, Gulsen Eryigit, Beata Megyesi, Mattias Nilsson, and Markus Saers. 2007. Single Malt or Blended? A Study in Multilingual Parser Optimization. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 933-939.

Szilard Ivan, Robert Ormandi, and Andras Kocsor. 2007. Magyar mondatok SVM alapu szintaxiselemzese [SVM-based syntactic parsing of Hungarian sentences]. In V. Magyar Szamitogepes Nyelveszeti Konferencia, pages 281-283.

Taku Kudo and Yuji Matsumoto. 2002. Japanese dependency analysis using cascaded chunking. In Proceedings of the 6th Conference on Natural Language Learning - Volume 20, COLING-02, pages 1-7.

Ryan McDonald and Joakim Nivre. 2011. Analyzing and integrating dependency parsers. Computational Linguistics, 37:197-230.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic. 2005. Non-Projective Dependency Parsing using Spanning Tree Algorithms. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 523-530.

Joakim Nivre and Jens Nilsson. 2005. Pseudo-Projective Dependency Parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), pages 99-106.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2004. Memory-Based Dependency Parsing. In HLT-NAACL 2004 Workshop: Eighth Conference on Computational Natural Language Learning (CoNLL-2004), pages 49-56.

Joakim Nivre, Johan Hall, Sandra Kubler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 Shared Task on Dependency Parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 915-932.

Gabor Proszeky, Laszlo Tihanyi, and Gabor L. Ugray. 2004. Moose: A Robust High-Performance Parser and Generator. In Proceedings of the 9th Workshop of the European Association for Machine Translation.

John R. Ross. 1986. Infinite Syntax! ABLEX, Norwood, NJ.

Lucien Tesniere. 1959. Elements de syntaxe structurale. Klincksieck, Paris.
Kristina Toutanova, Dan Klein, Christopher D. Man-
ning, and Yoram Singer. 2003. Feature-rich part-
of-speech tagging with a cyclic dependency net-
work. In Proceedings of the 2003 Conference
of the North American Chapter of the Association
for Computational Linguistics on Human Language
Technology - Volume 1, pages 173-180.
Viktor Tron, Peter Halacsy, Peter Rebrus, Andras
Rung, Eszter Simon, and Peter Vajda. 2006. Mor-
phdb.hu: Hungarian lexical database and morpho-
logical grammar. In Proceedings of 5th Inter-
national Conference on Language Resources and
Evaluation (LREC 06).
Daniel Varga, Peter Halacsy, Andras Kornai, Viktor
Nagy, Laszlo Nemeth, and Viktor Tron. 2005. Par-
allel corpora for medium density languages. In Pro-
ceedings of the RANLP, pages 590-596.
Veronika Vincze, Dora Szauter, Attila Almasi, Gyorgy
Mora, Zoltan Alexin, and Janos Csirik. 2010. Hun-
garian Dependency Treebank. In Proceedings of the
Seventh Conference on International Language Re-
sources and Evaluation (LREC '10).
Dependency Parsing with Undirected Graphs
Abstract
We introduce a new approach to transition-
based dependency parsing in which the
parser does not directly construct a depen- 0 1 2 3
dency structure, but rather an undirected
graph, which is then converted into a di- Figure 1: An example dependency structure where
rected dependency tree in a post-processing transition-based parsers enforcing the single-head con-
step. This alleviates error propagation, straint will incur in error propagation if they mistak-
since undirected parsers do not need to ob- enly build a dependency link 1 2 instead of 2 1
serve the single-head constraint. (dependency links are represented as arrows going
Undirected parsers can be obtained by sim- from head to dependent).
plifying existing transition-based parsers
satisfying certain conditions. We apply this
approach to obtain undirected variants of It has been shown by McDonald and Nivre
the planar and 2-planar parsers and of Cov- (2007) that such parsers suffer from error prop-
ingtons non-projective parser. We perform
agation: an early erroneous choice can place the
experiments on several datasets from the
CoNLL-X shared task, showing that these
parser in an incorrect state that will in turn lead to
variants outperform the original directed al- more errors. For instance, suppose that a sentence
gorithms in most of the cases. whose correct analysis is the dependency graph
in Figure 1 is analyzed by any bottom-up or left-
1 Introduction to-right transition-based parser that outputs de-
Dependency parsing has proven to be very use- pendency trees, therefore obeying the single-head
ful for natural language processing tasks. Data- constraint (only one incoming arc is allowed per
driven dependency parsers such as those by Nivre node). If the parser chooses an erroneous transi-
et al. (2004), McDonald et al. (2005), Titov and tion that leads it to build a dependency link from
Henderson (2007), Martins et al. (2009) or Huang 1 to 2 instead of the correct link from 2 to 1, this
and Sagae (2010) are accurate and efficient, they will lead it to a state where the single-head con-
can be trained from annotated data without the straint makes it illegal to create the link from 3 to
need for a grammar, and they provide a simple 2. Therefore, a single erroneous choice will cause
representation of syntax that maps to predicate- two attachment errors in the output tree.
argument structure in a straightforward way. With the goal of minimizing these sources of
In particular, transition-based dependency errors, we obtain novel undirected variants of
parsers (Nivre, 2008) are a type of dependency several parsers; namely, of the planar and 2-
parsing algorithms which use a model that scores planar parsers by Gomez-Rodrguez and Nivre
transitions between parser states. Greedy deter- (2010) and the non-projective list-based parser
ministic search can be used to select the transition described by Nivre (2008), which is based on
to be taken at each state, thus achieving linear or Covingtons algorithm (Covington, 2001). These
quadratic time complexity. variants work by collapsing the LEFT- ARC and
66
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 6676,
Avignon, France, April 23 - 27 2012.
2012
c Association for Computational Linguistics
RIGHT- ARC transitions in the original parsers, say that i is the head of j and, conversely, that j
which create right-to-left and left-to-right depen- is a syntactic dependent of i.
dency links, into a single ARC transition creating Given a dependency graph G = (Vw , E), we
an undirected link. This has the advantage that write i ? j E if there is a (possibly empty)
the single-head constraint need not be observed directed path from i to j; and i ? j E if
during the parsing process, since the directed no- there is a (possibly empty) path between i and j in
tions of head and dependent are lost in undirected the undirected graph underlying G (omitting the
graphs. This gives the parser more freedom and references to E when clear from the context).
can prevent situations where enforcing the con- Most dependency-based representations of syn-
straint leads to error propagation, as in Figure 1.

On the other hand, these new algorithms have the disadvantage that their output is an undirected graph, which has to be post-processed to recover the direction of the dependency links and generate a valid dependency tree. Thus, some complexity is moved from the parsing process to this post-processing step; and each undirected parser will outperform the directed version only if the simplification of the parsing phase avoids more errors than the post-processing generates. As will be seen in later sections, experimental results indicate that this is in fact the case.

The rest of this paper is organized as follows: Section 2 introduces some notation and concepts that we will use throughout the paper. In Section 3, we present the undirected versions of the parsers by Gómez-Rodríguez and Nivre (2010) and Nivre (2008), as well as some considerations about the feature models suitable to train them. In Section 4, we discuss post-processing techniques that can be used to recover dependency trees from undirected graphs. Section 5 presents an empirical study of the performance obtained by these parsers, and Section 6 contains a final discussion.

2 Preliminaries

2.1 Dependency Graphs

Let w = w1 … wn be an input string. A dependency graph for w is a directed graph G = (Vw, E), where Vw = {0, …, n} is the set of nodes, and E ⊆ Vw × Vw is the set of directed arcs. Each node in Vw encodes the position of a token in w, and each arc in E encodes a dependency relation between two tokens. We write i → j to denote a directed arc (i, j), which will also be called a dependency link from i to j.[1] We say that i is the head of j and, conversely, that j is a dependent of i. Most dependency representations of syntax do not allow arbitrary dependency graphs; instead, they are restricted to acyclic graphs that have at most one head per node. Dependency graphs satisfying these constraints are called dependency forests.

Definition 1 A dependency graph G is said to be a forest iff it satisfies:

1. Acyclicity constraint: if i →* j, then not j → i.

2. Single-head constraint: if j → i, then there is no k ≠ j such that k → i.

A node that has no head in a dependency forest is called a root. Some dependency frameworks add the additional constraint that dependency forests have only one root (or, equivalently, that they are connected). Such a forest is called a dependency tree. A dependency tree can be obtained from any dependency forest by linking all of its root nodes as dependents of a dummy root node, conventionally located in position 0 of the input.

2.2 Transition Systems

In the framework of Nivre (2008), transition-based parsers are described by means of a non-deterministic state machine called a transition system.

Definition 2 A transition system for dependency parsing is a tuple S = (C, T, cs, Ct), where

1. C is a set of possible parser configurations,

2. T is a finite set of transitions, which are partial functions t : C → C,

3. cs is a total initialization function mapping each input string to a unique initial configuration, and

4. Ct ⊆ C is a set of terminal configurations.

To obtain a deterministic parser from a non-deterministic transition system, an oracle is used to deterministically select a single transition at each configuration.

[1] In practice, dependency links are usually labeled, but to simplify the presentation we will ignore labels throughout most of the paper. However, all the results and algorithms presented can be applied to labeled dependency graphs and will be so applied in the experimental evaluation.
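Definition 1 is easy to operationalize. The following sketch (our illustration; the function and variable names are ours, not the paper's) checks an arc set for both forest constraints:

```python
def is_forest(n, arcs):
    """Check Definition 1 for a set of directed arcs (head, dependent)
    over nodes 0..n. Returns True iff the graph is a dependency forest."""
    # Single-head constraint: no node may have two distinct heads.
    dependents = [d for (_, d) in arcs]
    if len(dependents) != len(set(dependents)):
        return False
    # Acyclicity constraint: no directed cycle (DFS from every node).
    adj = {}
    for h, d in arcs:
        adj.setdefault(h, []).append(d)
    state = {}  # node -> "visiting" | "done"
    def has_cycle(u):
        state[u] = "visiting"
        for v in adj.get(u, []):
            if state.get(v) == "visiting":
                return True
            if state.get(v) is None and has_cycle(v):
                return True
        state[u] = "done"
        return False
    return not any(state.get(u) is None and has_cycle(u)
                   for u in range(n + 1))
```

For example, {0 → 1, 1 → 2, 1 → 3} is a forest, while {1 → 2, 2 → 1} violates acyclicity and {0 → 2, 1 → 2} violates the single-head constraint.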
An oracle for a transition system S = (C, T, cs, Ct) is a function o : C → T. Suitable oracles can be obtained in practice by training classifiers on treebank data (Nivre et al., 2004).

2.3 The Planar, 2-Planar and Covington Transition Systems

Our undirected dependency parsers are based on the planar and 2-planar transition systems by Gómez-Rodríguez and Nivre (2010) and the version of the Covington (2001) non-projective parser defined by Nivre (2008). We now outline these directed parsers briefly; a more detailed description can be found in the above references.

2.3.1 Planar

The planar transition system by Gómez-Rodríguez and Nivre (2010) is a linear-time transition-based parser for planar dependency forests, i.e., forests whose dependency arcs do not cross when drawn above the words. The set of planar dependency structures is a very mild extension of that of projective structures (Kuhlmann and Nivre, 2006).

Configurations in this system are of the form c = ⟨Σ, B, A⟩, where Σ and B are disjoint lists of nodes from Vw (for some input w), and A is a set of dependency links over Vw. The list B, called the buffer, holds the input words that are still to be read. The list Σ, called the stack, is initially empty and is used to hold words that have dependency links pending to be created. The system is shown at the top in Figure 2, where the notation Σ|i is used for a stack with top i and tail Σ, and we invert the notation for the buffer for clarity (i.e., i|B is a buffer with top i and tail B).

The system reads the input sentence and creates links in a left-to-right order by executing its four transitions, until it gets to a terminal configuration. A SHIFT transition moves the first (leftmost) node in the buffer to the top of the stack. Transitions LEFT-ARC and RIGHT-ARC create a leftward or rightward link, respectively, involving the first node in the buffer and the topmost node in the stack. Finally, the REDUCE transition is used to pop the top word from the stack when we have finished building arcs to or from it.

2.3.2 2-Planar

The 2-planar transition system by Gómez-Rodríguez and Nivre (2010) is an extension of the planar system that uses two stacks, allowing it to recognize 2-planar structures, a larger set of dependency structures that has been shown to cover the vast majority of non-projective structures in a number of treebanks (Gómez-Rodríguez and Nivre, 2010).

This transition system, shown in Figure 2, has configurations of the form c = ⟨Σ0, Σ1, B, A⟩, where we call Σ0 the active stack and Σ1 the inactive stack. Its SHIFT, LEFT-ARC, RIGHT-ARC and REDUCE transitions work similarly to those in the planar parser, but while SHIFT pushes the first word in the buffer onto both stacks, the other three transitions only work with the top of the active stack, ignoring the inactive one. Finally, a SWITCH transition is added that makes the active stack inactive and vice versa.

2.3.3 Covington Non-Projective

Covington (2001) proposes several incremental parsing strategies for dependency representations, and one of them can recover non-projective dependency graphs. Nivre (2008) implements a variant of this strategy as a transition system with configurations of the form c = ⟨λ1, λ2, B, A⟩, where λ1 and λ2 are lists containing partially processed words and B is the buffer list of unprocessed words.

The Covington non-projective transition system is shown at the bottom in Figure 2. At each configuration c = ⟨λ1, λ2, B, A⟩, the parser has to consider whether any dependency arc should be created involving the top of the buffer and the words in λ1. A LEFT-ARC transition adds a link from the first node j in the buffer to the node at the head of the list λ1, which is moved to the list λ2 to signify that we have finished considering it as a possible head or dependent of j. The RIGHT-ARC transition does the same manipulation, but creating the symmetric link. A NO-ARC transition removes the head of the list λ1 and inserts it at the head of the list λ2 without creating any arcs: this transition is to be used where there is no dependency relation between the top node in the buffer and the head of λ1, but we still may want to create an arc involving the top of the buffer and other nodes in λ1. Finally, if we do not want to create any such arcs at all, we can execute a SHIFT transition, which advances the parsing process by removing the first node in the buffer B and inserting it at the head of a list obtained by concatenating λ1 and λ2.
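A deterministic parse in the sense of Definition 2 is then a simple loop. The sketch below is our minimal rendering (the tuple-based configurations and the toy single-transition system are illustrative assumptions, not the authors' implementation):

```python
def parse(transitions, c_s, is_terminal, oracle, w):
    """Run a transition system S = (C, T, c_s, C_t) deterministically:
    from the initial configuration, repeatedly apply the transition
    chosen by the oracle until a terminal configuration is reached."""
    c = c_s(w)
    while not is_terminal(c):
        t = oracle(c)          # in practice: a classifier trained on treebank data
        c = transitions[t](c)  # transitions are partial functions C -> C
    return c

# A toy system with a single SHIFT transition that just consumes the buffer;
# configurations are (processed, buffer, arcs) tuples.
ts = {"SHIFT": lambda c: (c[0] + [c[1][0]], c[1][1:], c[2])}
final = parse(ts,
              c_s=lambda w: ([], list(range(1, len(w) + 1)), set()),
              is_terminal=lambda c: not c[1],
              oracle=lambda c: "SHIFT",
              w=["the", "cat", "sleeps"])
```

With the toy oracle above, `final` is `([1, 2, 3], [], set())`: the buffer has been consumed and no arcs were built.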
This list becomes the new λ1, whereas λ2 is empty in the resulting configuration.

Note that the Covington parser has quadratic complexity with respect to input length, while the planar and 2-planar parsers run in linear time.

3 The Undirected Parsers

The transition systems defined in Section 2.3 share the common property that their LEFT-ARC and RIGHT-ARC transitions have exactly the same effects except for the direction of the links that they create. We can take advantage of this property to define undirected versions of these transition systems, by transforming them as follows:

- Configurations are changed so that the arc set A is a set of undirected arcs, instead of directed arcs.

- The LEFT-ARC and RIGHT-ARC transitions in each parser are collapsed into a single ARC transition that creates an undirected arc.

- The preconditions of transitions that guarantee the single-head constraint are removed, since the notions of head and dependent are lost in undirected graphs.

By performing these transformations and leaving the systems otherwise unchanged, we obtain the undirected variants of the planar, 2-planar and Covington algorithms that are shown in Figure 3.

Note that the transformation can be applied to any transition system having LEFT-ARC and RIGHT-ARC transitions that are equal except for the direction of the created link, and thus collapsible into one. The above three transition systems fulfill this property, but not every transition system does. For example, the well-known arc-eager parser of Nivre (2003) pops a node from the stack when creating left arcs, and pushes a node onto the stack when creating right arcs, so the transformation cannot be applied to it.[2]

3.1 Feature models

Some of the features that are typically used to train transition-based dependency parsers depend on the direction of the arcs that have been built up to a certain point. For example, two such features for the planar parser could be the POS tag associated with the head of the topmost stack node, or the label of the arc going from the first node in the buffer to its leftmost dependent.[3]

As the notion of head and dependent is lost in undirected graphs, this kind of feature cannot be used to train undirected parsers. Instead, we use features based on undirected relations between nodes. We found that the following kinds of features worked well in practice as a replacement for features depending on arc direction:

- Information about the ith node linked to a given node (topmost stack node, topmost buffer node, etc.) on the left or on the right, and about the associated undirected arc, typically for i = 1, 2, 3,

- Information about whether two nodes are linked or not in the undirected graph, and about the label of the arc between them,

- Information about the first left and right undirected siblings of a given node, i.e., the first node q located to the left of the given node p such that p and q are linked to some common node r located to the right of both, and vice versa. Note that this notion of undirected siblings does not correspond exclusively to siblings in the directed graph, since it can also capture other second-order interactions, such as grandparents.

4 Reconstructing the dependency forest

The modified transition systems presented in the previous section generate undirected graphs. To obtain complete dependency parsers that are able to produce directed dependency forests, we need a reconstruction step that assigns a direction to the arcs in such a way that the single-head constraint is obeyed. This reconstruction step can be implemented by building a directed graph with weighted arcs corresponding to both possible directions of each undirected edge, and then finding an optimum branching to reduce it to a directed tree.

[2] One might think that the arc-eager algorithm could still be transformed by converting each of its arc transitions into an undirected transition, without collapsing them into one. However, this would result in a parser that violates the acyclicity constraint, since the algorithm is designed in such a way that acyclicity is only guaranteed if the single-head constraint is kept. It is easy to see that this problem cannot happen in parsers where LEFT-ARC and RIGHT-ARC transitions have the same effect: in these, if a directed graph is not parsable in the original algorithm, its underlying undirected graph cannot be parsable in the undirected variant.

[3] These example features are taken from the default model for the planar parser in version 1.5 of MaltParser (Nivre et al., 2006).
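The undirected-sibling feature of Section 3.1 can be made concrete as follows. This is our illustrative sketch (the edge representation and function name are assumptions); it computes the first left undirected sibling of a node:

```python
def first_left_sibling(p, edges):
    """First undirected sibling of p on the left: the closest q < p such
    that q and p are both linked to a common node r to the right of both.
    `edges` is a set of frozensets {i, j} representing undirected arcs."""
    def neighbours(x):
        return {next(iter(e - {x})) for e in edges if x in e}
    right_neighbours_of_p = {r for r in neighbours(p) if r > p}
    for q in range(p - 1, 0, -1):  # scan leftwards, nearest candidate first
        if any(r in right_neighbours_of_p for r in neighbours(q)):
            return q
    return None  # no left undirected sibling
```

For example, with edges {1, 4}, {2, 4} and {3, 5}, node 2 has node 1 as its first left undirected sibling (both are linked to 4, which lies to the right of both), while node 3 has none. The symmetric right-sibling feature would scan rightwards instead.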
Planar — initial/terminal configurations: cs(w1 … wn) = ⟨[], [1 … n], ∅⟩, Cf = {⟨Σ, [], A⟩ ∈ C}
Transitions:
  SHIFT:      ⟨Σ, i|B, A⟩ ⇒ ⟨Σ|i, B, A⟩
  REDUCE:     ⟨Σ|i, B, A⟩ ⇒ ⟨Σ, B, A⟩
  LEFT-ARC:   ⟨Σ|i, j|B, A⟩ ⇒ ⟨Σ|i, j|B, A ∪ {(j, i)}⟩
              only if ∄k | (k, i) ∈ A (single-head) and there is no path between i and j in A (acyclicity)
  RIGHT-ARC:  ⟨Σ|i, j|B, A⟩ ⇒ ⟨Σ|i, j|B, A ∪ {(i, j)}⟩
              only if ∄k | (k, j) ∈ A (single-head) and there is no path between i and j in A (acyclicity)

2-Planar — initial/terminal configurations: cs(w1 … wn) = ⟨[], [], [1 … n], ∅⟩, Cf = {⟨Σ0, Σ1, [], A⟩ ∈ C}
Transitions:
  SHIFT:      ⟨Σ0, Σ1, i|B, A⟩ ⇒ ⟨Σ0|i, Σ1|i, B, A⟩
  REDUCE:     ⟨Σ0|i, Σ1, B, A⟩ ⇒ ⟨Σ0, Σ1, B, A⟩
  LEFT-ARC:   ⟨Σ0|i, Σ1, j|B, A⟩ ⇒ ⟨Σ0|i, Σ1, j|B, A ∪ {(j, i)}⟩  (preconditions as in the planar LEFT-ARC)
  RIGHT-ARC:  ⟨Σ0|i, Σ1, j|B, A⟩ ⇒ ⟨Σ0|i, Σ1, j|B, A ∪ {(i, j)}⟩  (preconditions as in the planar RIGHT-ARC)
  SWITCH:     ⟨Σ0, Σ1, B, A⟩ ⇒ ⟨Σ1, Σ0, B, A⟩

Covington — initial/terminal configurations: cs(w1 … wn) = ⟨[], [], [1 … n], ∅⟩, Cf = {⟨λ1, λ2, [], A⟩ ∈ C}
Transitions:
  SHIFT:      ⟨λ1, λ2, i|B, A⟩ ⇒ ⟨λ1·λ2|i, [], B, A⟩
  NO-ARC:     ⟨λ1|i, λ2, B, A⟩ ⇒ ⟨λ1, i|λ2, B, A⟩
  LEFT-ARC:   ⟨λ1|i, λ2, j|B, A⟩ ⇒ ⟨λ1, i|λ2, j|B, A ∪ {(j, i)}⟩  (preconditions as in the planar LEFT-ARC)
  RIGHT-ARC:  ⟨λ1|i, λ2, j|B, A⟩ ⇒ ⟨λ1, i|λ2, j|B, A ∪ {(i, j)}⟩  (preconditions as in the planar RIGHT-ARC)

Figure 2: Transition systems for planar, 2-planar and Covington non-projective dependency parsing.

Figure 3: Transition systems for undirected planar, 2-planar and Covington non-projective dependency parsing.
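As an illustration of the collapsed systems in Figure 3, the sketch below (ours, not the authors' implementation; configurations are modelled as (stack, buffer, arcs) tuples) renders the undirected planar transitions. The single-head precondition is dropped; we keep an acyclicity-style precondition, which we read here as the two nodes not yet being connected by an undirected path (an assumption on our part):

```python
def shift(c):
    stack, (i, *rest), arcs = c
    return (stack + [i], rest, arcs)

def reduce_(c):
    stack, buf, arcs = c
    return (stack[:-1], buf, arcs)

def connected(i, j, arcs):
    """True if i and j are already linked by an undirected path in arcs."""
    seen, frontier = {i}, [i]
    while frontier:
        u = frontier.pop()
        for e in arcs:
            if u in e:
                v = next(iter(e - {u}))
                if v == j:
                    return True
                if v not in seen:
                    seen.add(v)
                    frontier.append(v)
    return False

def arc(c):
    """Single ARC transition replacing LEFT-ARC/RIGHT-ARC: links the top
    of the stack and the first buffer node with an undirected edge.
    No single-head check remains, only the acyclicity precondition."""
    stack, buf, arcs = c
    i, j = stack[-1], buf[0]
    assert not connected(i, j, arcs)  # acyclicity precondition
    return (stack, buf, arcs | {frozenset({i, j})})
```

Starting from the configuration ([1], [2, 3], ∅), an ARC followed by a SHIFT yields the stack [1, 2], the buffer [3], and the single undirected edge {1, 2}.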
tree. Different criteria for assigning weights to arcs provide different variants of the reconstruction technique.

To describe these variants, we first introduce preliminary definitions. Let U = (Vw, E) be an undirected graph produced by an undirected parser for some string w. We define the following sets of arcs:

A1(U) = {(i, j) | j ≠ 0 ∧ {i, j} ∈ E},
A2(U) = {(0, i) | i ∈ Vw}.

Note that A1(U) represents the set of arcs obtained by assigning an orientation to an edge in U, except arcs whose dependent is the dummy root, which are disallowed. On the other hand, A2(U) contains all the possible arcs originating from the dummy root node, regardless of whether their underlying undirected edges are in U or not; this is so that reconstructions are allowed to link unattached tokens to the dummy root.

The reconstruction process consists of finding a minimum branching (i.e. a directed minimum spanning tree) for a weighted directed graph obtained by assigning a cost c(i, j) to each arc (i, j) of the following directed graph:

D(U) = (Vw, A(U) = A1(U) ∪ A2(U)).

That is, we will find a dependency tree T = (Vw, AT ⊆ A(U)) such that the sum of the costs of the arcs in AT is minimal. In general, such a minimum branching can be calculated with the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967). Since the graph D(U) has O(n) nodes and O(n) arcs for a string of length n, this can be done in O(n log n) if implemented as described by Tarjan (1977).

However, applying these generic techniques is not necessary in this case: since our graph U is acyclic, the problem of reconstructing the forest can be reduced to choosing a root word for each connected component in the graph, linking it as a dependent of the dummy root and directing the other arcs in the component in the (unique) way that makes them point away from the root.

It remains to see how to assign the costs c(i, j) to the arcs of D(U): different criteria for assigning scores will lead to different reconstructions.

4.1 Naive reconstruction

A first, very simple reconstruction technique can be obtained by assigning costs to the arcs in A(U) as follows:

c(i, j) = 1 if (i, j) ∈ A1(U),
c(i, j) = 2 if (i, j) ∈ A2(U) ∧ (i, j) ∉ A1(U).

This approach gives the same cost to all arcs obtained from the undirected graph U, while also allowing (at a higher cost) any node to be attached to the dummy root. To obtain satisfactory results with this technique, we must train our parser to explicitly build undirected arcs from the dummy root node to the root word(s) of each sentence using arc transitions (note that this implies that we need to represent forests as trees, in the manner described at the end of Section 2.1). Under this assumption, it is easy to see that we can obtain the correct directed tree T for a sentence if the algorithm is provided with its underlying undirected tree U: the tree is obtained in O(n) as the unique orientation of U that makes each of its edges point away from the dummy root.

This approach to reconstruction has the advantage of being very simple and of not adding any complications to the parsing process, while guaranteeing that the correct directed tree will be recovered whenever the undirected tree for a sentence is generated correctly. However, it is not very robust, since the direction of all the arcs in the output depends on which node is chosen as sentence head and linked to the dummy root. Therefore, a parsing error affecting the undirected edge involving the dummy root may result in many dependency links being erroneous.

4.2 Label-based reconstruction

To achieve a more robust reconstruction, we use labels to encode a preferred direction for dependency arcs. To do so, for each pre-existing label X in the training set, we create two labels Xl and Xr. The parser is then trained on a modified version of the training set where leftward links originally labelled X are labelled Xl, and rightward links originally labelled X are labelled Xr. Thus, the output of the parser on a new sentence will be an undirected graph where each edge has a label with an annotation indicating whether the reconstruction process should prefer to link the pair of nodes with a leftward or a rightward arc. We can then assign costs to our minimum branching algorithm so that it will return a tree agreeing with as many such annotations as possible.
To do this, we call A1+(U) ⊆ A1(U) the set of arcs in A1(U) that agree with the annotations, i.e., arcs (i, j) ∈ A1(U) where either i < j and (i, j) is labelled Xr in U, or i > j and (i, j) is labelled Xl in U. We call A1−(U) the set of arcs in A1(U) that disagree with the annotations, i.e., A1−(U) = A1(U) \ A1+(U). And we assign costs as follows:

c(i, j) = 1 if (i, j) ∈ A1+(U),
c(i, j) = 2 if (i, j) ∈ A1−(U),
c(i, j) = 2n if (i, j) ∈ A2(U) ∧ (i, j) ∉ A1(U).

[Figure omitted: panels (a)–(c) showing an example over nodes 0–5 with L/R-annotated undirected edges.]
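The label-based cost assignment above can be written out directly. In this sketch (our notation, not the authors' code), every undirected edge carries its direction annotation 'l' or 'r', and root attachments whose underlying edge is absent from U get the prohibitive cost 2n:

```python
def label_costs(n, annotated_edges):
    """Build the cost table for the label-based reconstruction.
    `annotated_edges` maps frozenset({i, j}) -> 'l' (leftward arc
    preferred, head on the right) or 'r' (rightward arc preferred)."""
    costs = {}
    for e, ann in annotated_edges.items():
        if 0 in e:
            # Arcs whose dependent is the dummy root are disallowed,
            # so only the orientation 0 -> word exists (a rightward arc).
            costs[(0, max(e))] = 1 if ann == "r" else 2
            continue
        i, j = sorted(e)                          # i < j
        costs[(i, j)] = 1 if ann == "r" else 2    # rightward arc i -> j
        costs[(j, i)] = 1 if ann == "l" else 2    # leftward arc j -> i
    for i in range(1, n + 1):                     # A2(U) \ A1(U): fallback
        costs.setdefault((0, i), 2 * n)           # root attachment, cost 2n
    return costs
```

For n = 3 with edge {1, 2} annotated 'r' and edge {2, 3} annotated 'l', the agreeing arcs (1, 2) and (3, 2) cost 1, their reversals cost 2, and every fallback root attachment costs 2n = 6, so the minimum branching prefers orientations that agree with as many annotations as possible.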
The LIBSVM feature models for the arc-eager projective and pseudo-projective parsers are the same as those used by these parsers in the CoNLL-X shared task, where the pseudo-projective version of MaltParser was one of the two top-performing systems (Buchholz and Marsi, 2006). For the 2-planar parser, we took the feature models from Gómez-Rodríguez and Nivre (2010) for the languages included in that paper. For all the algorithms and datasets, the feature models used for the undirected parsers were adapted from those of the directed parsers as described in Section 3.1.[4]

The results show that the use of undirected parsing with label-based reconstruction clearly improves performance on the vast majority of the datasets for the planar and Covington algorithms, where in many cases it also improves upon the corresponding projective and non-projective state-of-the-art parsers provided for comparison. In the case of the 2-planar parser the results are less conclusive, with improvements over the directed version in five out of the eight languages.

The improvements in LAS obtained with label-based reconstruction over directed parsing are statistically significant at the .05 level[5] for Danish, German and Portuguese in the case of the planar parser, and for Czech, Danish and Turkish in the case of Covington's parser. No statistically significant decrease in accuracy was detected in any of the algorithm/dataset combinations.

As expected, the good results obtained by the undirected parsers with label-based reconstruction contrast with those obtained by the variants with root-based reconstruction, which performed worse in all the experiments.

6 Discussion

We have presented novel variants of the planar and 2-planar transition-based parsers by Gómez-Rodríguez and Nivre (2010) and of Covington's non-projective parser (Covington, 2001; Nivre, 2008) which ignore the direction of dependency links, together with reconstruction techniques that can be used to recover the direction of the arcs thus produced. The results obtained show that this idea of undirected parsing, together with the label-based reconstruction technique of Section 4.2, improves parsing accuracy on most of the tested dataset/algorithm combinations, and that it can outperform state-of-the-art transition-based parsers.

The accuracy improvements achieved by relaxing the single-head constraint to mitigate error propagation were able to overcome the errors generated in the reconstruction phase, which were few: we observed empirically that the differences between the undirected LAS obtained from the undirected graph before reconstruction and the final directed LAS are typically below 0.20%. This is true both for the naive and label-based transformations, indicating that both techniques are able to recover arc directions accurately, and that the accuracy differences between them come mainly from differences in training (e.g. having tentative arc direction as part of the feature information in the label-based reconstruction but not in the naive one) rather than from the reconstruction methods themselves.

The reason why we can apply the undirected simplification to the three parsers used in this paper is that their LEFT-ARC and RIGHT-ARC transitions have the same effect except for the direction of the links they create. The same transformation and reconstruction techniques could be applied to any other transition-based dependency parser sharing this property. The reconstruction techniques alone could potentially be applied to any dependency parser (transition-based or not) as long as it can somehow be converted to output undirected graphs.

The idea of parsing with undirected relations between words has been applied before in the work on Link Grammar (Sleator and Temperley, 1991), but in that case the formalism itself works with undirected graphs, which are the final output of the parser. To our knowledge, the idea of using an undirected graph as an intermediate step towards obtaining a dependency structure has not been explored before.

Acknowledgments

This research has been partially funded by the Spanish Ministry of Economy and Competitiveness and FEDER (projects TIN2010-18552-C03-01 and TIN2010-18552-C03-02), the Ministry of Education (FPU Grant Program) and Xunta de Galicia (Rede Galega de Recursos Lingüísticos para unha Soc. do Conec.). The experiments were conducted with the help of computing resources provided by the Supercomputing Center of Galicia (CESGA). We thank Joakim Nivre for helpful input in the early stages of this work.

[4] All the experimental settings and feature models used are included in the supplementary material and are also available at http://www.grupolys.org/cgomezr/exp/.

[5] Statistical significance was assessed using Dan Bikel's randomized comparator: http://www.cis.upenn.edu/~dbikel/software.html
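The significance test referenced in footnote 5 is a paired randomization (shuffling) test. A minimal sketch of the underlying idea (ours, not Bikel's actual comparator; the function name and score representation are assumptions) is:

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Paired approximate randomization: swap each sentence's pair of
    scores with probability 1/2 and count how often the absolute mean
    difference matches or exceeds the observed one."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(trials):
        total_a = total_b = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:   # randomly swap the paired scores
                a, b = b, a
            total_a += a
            total_b += b
        if abs(total_a - total_b) / n >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # smoothed p-value estimate
```

With identical score vectors the estimated p-value is 1.0, while consistently large per-sentence differences drive it well below the .05 threshold used in the experiments.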
          Planar                      UPlanarN                    UPlanarL                    MaltP
Lang.     LAS (p)       UAS (p)      LAS (p)       UAS (p)      LAS (p)       UAS (p)      LAS (p)       UAS (p)
Arabic 66.93 (67.34) 77.56 (77.22) 65.91 (66.33) 77.03 (76.75) 66.75 (67.19) 77.45 (77.22) 66.43 (66.74) 77.19 (76.83)
Chinese 84.23 (84.20) 88.37 (88.33) 83.14 (83.10) 87.00 (86.95) 84.51* (84.50*) 88.37 (88.35*) 86.42 (86.39) 90.06 (90.02)
Czech 77.24 (77.70) 83.46 (83.24) 75.08 (75.60) 81.14 (81.14) 77.60* (77.93*) 83.56* (83.41*) 77.24 (77.57) 83.40 (83.19)
Danish 83.31 (82.60) 88.02 (86.64) 82.65 (82.45) 87.58 (86.67*) 83.87* (83.83*) 88.94* (88.17*) 83.31 (82.64) 88.30 (86.91)
German 84.66 (83.60) 87.02 (85.67) 83.33 (82.77) 85.78 (84.93) 86.32* (85.67*) 88.62* (87.69*) 86.12 (85.48) 88.52 (87.58)
Portug. 86.22 (83.82) 89.80 (86.88) 85.89 (83.82) 89.68 (87.06*) 86.52* (84.83*) 90.28* (88.03*) 86.60 (84.66) 90.20 (87.73)
Swedish 83.01 (82.44) 88.53 (87.36) 81.20 (81.10) 86.50 (85.86) 82.95 (82.66*) 88.29 (87.45*) 82.89 (82.44) 88.61 (87.55)
Turkish 62.70 (71.27) 73.67 (78.57) 59.83 (68.31) 70.15 (75.17) 63.27* (71.63*) 73.93* (78.72*) 62.58 (70.96) 73.09 (77.95)
Table 1: Parsing accuracy of the undirected planar parser with naive (UPlanarN) and label-based (UPlanarL)
postprocessing in comparison to the directed planar (Planar) and the MaltParser arc-eager projective (MaltP)
algorithms, on eight datasets from the CoNLL-X shared task (Buchholz and Marsi, 2006): Arabic (Hajic et al.,
2004), Chinese (Chen et al., 2003), Czech (Hajic et al., 2006), Danish (Kromann, 2003), German (Brants et
al., 2002), Portuguese (Afonso et al., 2002), Swedish (Nilsson et al., 2005) and Turkish (Oflazer et al., 2003;
Atalay et al., 2003). We show labelled (LAS) and unlabelled (UAS) attachment score excluding and including
punctuation tokens in the scoring (the latter in brackets). Best results for each language are shown in boldface,
and results where the undirected parser outperforms the directed version are marked with an asterisk.
Table 2: Parsing accuracy of the undirected 2-planar parser with naive (U2PlanarN) and label-based (U2PlanarL)
postprocessing in comparison to the directed 2-planar (2Planar) and MaltParser arc-eager pseudo-projective
(MaltPP) algorithms. The meaning of the scores shown is as in Table 1.
Table 3: Parsing accuracy of the undirected Covington non-projective parser with naive (UCovingtonN) and
label-based (UCovingtonL) postprocessing in comparison to the directed algorithm (Covington). The meaning
of the scores shown is as in Table 1.
References

Susana Afonso, Eckhard Bick, Renato Haber, and Diana Santos. 2002. "Floresta sintá(c)tica": a treebank for Portuguese. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1698–1703, Paris, France. ELRA.

Nart B. Atalay, Kemal Oflazer, and Bilge Say. 2003. The annotation process in the Turkish treebank. In Proceedings of the EACL Workshop on Linguistically Interpreted Corpora (LINC-03), pages 243–246, Morristown, NJ, USA. Association for Computational Linguistics.

Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The TIGER treebank. In Proceedings of the Workshop on Treebanks and Linguistic Theories, September 20–21, Sozopol, Bulgaria.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL), pages 149–164.

Chih-Chung Chang and Chih-Jen Lin. 2001. LIBSVM: A Library for Support Vector Machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

K. Chen, C. Luo, M. Chang, F. Chen, C. Chen, C. Huang, and Z. Gao. 2003. Sinica treebank: Design criteria, representational issues and implementation. In Anne Abeillé, editor, Treebanks: Building and Using Parsed Corpora, chapter 13, pages 231–248. Kluwer.

Y. J. Chu and T. H. Liu. 1965. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400.

Michael A. Covington. 2001. A fundamental algorithm for dependency parsing. In Proceedings of the 39th Annual ACM Southeast Conference, pages 95–102.

Jack Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards, 71B:233–240.

R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874.

Carlos Gómez-Rodríguez and Joakim Nivre. 2010. A transition-based parser for 2-planar dependency structures. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10), pages 1492–1501, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jan Hajič, Otakar Smrž, Petr Zemánek, Jan Šnaidauf, and Emanuel Beška. 2004. Prague Arabic Dependency Treebank: Development in data and tools. In Proceedings of the NEMLAR International Conference on Arabic Language Resources and Tools.

Jan Hajič, Jarmila Panevová, Eva Hajičová, Petr Sgall, Petr Pajas, Jan Štěpánek, Jiří Havelka, and Marie Mikulová. 2006. Prague Dependency Treebank 2.0. CD-ROM CAT: LDC2006T01, ISBN 1-58563-370-4. Linguistic Data Consortium.

Liang Huang and Kenji Sagae. 2010. Dynamic programming for linear-time incremental parsing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10), pages 1077–1086, Stroudsburg, PA, USA. Association for Computational Linguistics.

Matthias T. Kromann. 2003. The Danish dependency treebank and the underlying linguistic theory. In Proceedings of the 2nd Workshop on Treebanks and Linguistic Theories (TLT), pages 217–220, Växjö, Sweden. Växjö University Press.

Marco Kuhlmann and Joakim Nivre. 2006. Mildly non-projective dependency structures. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 507–514.

André Martins, Noah Smith, and Eric Xing. 2009. Concise integer linear programming formulations for dependency parsing. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP), pages 342–350.

Ryan McDonald and Joakim Nivre. 2007. Characterizing the errors of data-driven dependency parsing models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 122–131.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 523–530.

Jens Nilsson, Johan Hall, and Joakim Nivre. 2005. MAMBA meets TIGER: Reconstructing a Swedish treebank from Antiquity. In Peter Juel Henrichsen, editor, Proceedings of the NODALIDA Special Session on Treebanks.

Joakim Nivre and Jens Nilsson. 2005. Pseudo-projective dependency parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 99–106.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2004. Memory-based dependency parsing. In Proceedings of the 8th Conference on Computational Natural Language Learning (CoNLL-2004), pages 49–56, Morristown, NJ, USA. Association for Computational Linguistics.
Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pages 2216–2219.

Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 149–160.

Joakim Nivre. 2008. Algorithms for deterministic incremental dependency parsing. Computational Linguistics, 34(4):513–553.

Kemal Oflazer, Bilge Say, Dilek Zeynep Hakkani-Tür, and Gökhan Tür. 2003. Building a Turkish treebank. In Anne Abeillé, editor, Treebanks: Building and Using Parsed Corpora, pages 261–277. Kluwer.

Daniel Sleator and Davy Temperley. 1991. Parsing English with a link grammar. Technical Report CMU-CS-91-196, Carnegie Mellon University, Computer Science.

R. E. Tarjan. 1977. Finding optimum branchings. Networks, 7:25–35.

Ivan Titov and James Henderson. 2007. A latent variable model for generative dependency parsing. In Proceedings of the 10th International Conference on Parsing Technologies (IWPT), pages 144–155.
The Best of Both Worlds: A Graph-based Completion Model for Transition-based Parsers
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 77–87, Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics
called second-order factors) that closely follow the bottom-up combination of subspans in the parsing algorithm, i.e., the feature functions depend on the presence of two specific dependency edges. Configurations not directly supported by the bottom-up building of larger spans are more cumbersome to integrate into the model (since the combination algorithm has to be adjusted), in particular for third-order factors or higher.

Empirically, i.e., when applied in supervised machine learning experiments based on existing treebanks for various languages, both strategies (and further refinements of them not mentioned here) turn out roughly equal in their capability of picking up most of the relevant patterns well; some subtle strengths and weaknesses are complementary, such that stacking of two parsers representing both strategies yields the best results (Nivre and McDonald, 2008): in training and application, one of the parsers is run on each sentence prior to the other, providing additional feature information for the other parser. Another successful technique to combine parsers is voting as carried out by Sagae and Lavie (2006).

The present paper addresses the question of whether and how a more integrated combination of the strengths of the two strategies can be achieved and implemented efficiently to warrant competitive results.

The main issue and solution strategy. In order to preserve the conceptual (and complexity) advantages of the transition-based strategy, the integrated algorithm we are looking for has to be transition-based at the top level. The advantages of the graph-based approach (a more globally informed basis for the decision among different attachment options) have to be included as part of the scoring procedure. As a prerequisite, our algorithm will require a memory for storing alternative analyses among which to choose. This has been previously introduced in transition-based approaches in the form of a beam (Johansson and Nugues, 2006): rather than representing only the best-scoring history of transitions, the k best-scoring alternative histories are kept around. As we will indicate in the following, the mere addition of beam search does not help overcome a key representational issue of transition-based parsing: in many situations, a transition-based parser is forced to make an attachment decision for a given input word at a point where no or only partial information about the word's own dependents (and further descendants) is available. Figure 1 illustrates such a case.

Figure 1: The left set of brackets indicates material that has been processed or is under consideration; on the right is the input, still to be processed. Access to information that is yet unavailable would help the parser to decide on the correct transition.

Here, the parser has to decide whether to create an edge between house and with or between bought and with (which is technically achieved by first popping house from the stack and then adding the edge). At this time, no information about the object of with is available; with fails to provide what we call a complete factor for the calculation of the scores of the alternative transitions under consideration. In other words, the model cannot make use of any evidence to distinguish between the two examples in Figure 1, and it is bound to get one of the two cases wrong.

Figure 2 illustrates the same case from the perspective of a graph-based parser.

Figure 2: A second order model as used in graph-based parsers has access to the crucial information to build the correct tree. In this case, the parser considers the word friend (as opposed to garden, for instance) as it introduces the bold-face edge.

Here, the combination of subspans is performed at a point when their internal structure has been finalized, i.e., the attachment of with (to bought or house) is not decided until it is clear that friend is the object of with; hence, the semantically important lexicalization of with's object informs the higher-level attachment decision through a so-called second order factor in the feature model.
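The notion of a factor becoming "complete" can be made concrete with a small sketch (the function name and edge representation are ours, purely illustrative, not the paper's implementation): a (grand-head, head, dependent) second-order factor only exists once the inner word has both a head and a dependent of its own in the partial tree.

```python
def grandchild_factors(edges):
    # edges: set of (head, dependent) pairs of the partial tree.
    # Return the (grand-head, head, dependent) triples that are
    # complete, i.e. whose middle word already has its own head.
    head_of = {d: h for h, d in edges}
    return {(head_of[h], h, d) for h, d in edges if h in head_of}
```

Attaching only with to bought yields no such factor; once friend is attached to with, the factor (bought, with, friend) becomes available and can inform scoring.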
Given a suitable amount of training data, the model can thus learn to make the correct decision. The dynamic-programming based graph-based parser is designed in such a way that any score calculation is based on complete factors for the subspans that are combined at this point.

Note that the problem for the transition-based parser cannot be remedied by beam search alone. If we were to keep the two options for attaching with around in a beam (say, with a slightly higher score for attachment to house, but with bought following narrowly behind), there would be no point in the further processing of the sentence at which the choice could be corrected: the transition-based parser still needs to make the decision that friend is attached to with, but this will not lead the parser to reconsider the decision made earlier on.

The strategy we describe in this paper applies in this very type of situation: whenever information is added in the transition-based parsing process, the scores of all the histories stored in the beam are recalculated based on a scoring model inspired by the graph-based parsing approach, i.e., taking complete factors into account as they become incrementally available. As a consequence, the beam is reordered, and hence the incorrect preference of an attachment of with to house (based on incomplete factors) can later be corrected as friend is processed and the complete second-order factor becomes available.²

² Since search is not exhaustive, there is of course a slight danger that the correct history drops out of the beam before complete information becomes available. But as our experiments show, this does not seem to be a serious issue empirically.

The integrated transition-based parsing strategy has a number of advantages:
(1) We can integrate and investigate a number of third order factors, without the need to implement a more complex parsing model anew each time to explore the properties of such a distinct model.
(2) The parser with completion model maintains the favorable complexity of transition-based parsers.
(3) The completion model compensates for the lower accuracy in cases where only incomplete information is available.
(4) The parser combines the two leading parsing paradigms in a single efficient parser without stacking the two approaches. Therefore the parser requires only one training phase (without jackknifing) and it uses only a single transition-based decoder.

The structure of this paper is as follows. In Section 2, we discuss related work. In Section 3, we introduce our transition-based parser and in Section 4 the completion model as well as the implementation of third order models. In Section 5, we describe experiments and provide evaluation results on selected data sets.

2 Related Work

Kudo and Matsumoto (2002) and Yamada and Matsumoto (2003) carried over the idea of deterministic parsing by chunks from Abney (1991) to dependency parsing. Nivre (2003) describes, in a stricter sense, the first incremental parser that tries to find the most appropriate dependency tree by a sequence of local transitions. In order to optimize the results towards a more globally optimal solution, Johansson and Nugues (2006) first applied beam search, which leads to a substantial improvement of the results (cf. also Titov and Henderson (2007)). Zhang and Clark (2008) augment the beam-search algorithm, adapting the early update strategy of Collins and Roark (2004) to dependency parsing. In this approach, the parser stops and updates the model when the oracle transition sequence drops out of the beam. In contrast to most other approaches, the training procedure of Zhang and Clark (2008) takes the complete transition sequence into account as it is calculating the update. Zhang and Clark compare aspects of transition-based and graph-based parsing, and end up using a transition-based parser with a combined transition-based/second-order graph-based scoring model (Zhang and Clark, 2008, 567), which is similar to the approach we describe in this paper. However, their approach does not involve beam rescoring, as the partial structures built by the transition-based parser are subsequently augmented; hence, there are cases in which our approach is able to differentiate based on higher-order factors that go unnoticed by the combined model of (Zhang and Clark, 2008, 567).

One step beyond the use of a beam is a dynamic programming approach to carry out a full search in the state space, cf. (Huang and Sagae, 2010; Kuhlmann et al., 2011). However, in this case one has to restrict the employed features to a set which fits the elements composed by the dynamic programming approach.
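The beam reordering described above can be sketched as follows (the state representation and the toy completion score are illustrative assumptions, not the paper's implementation): the total score of each state is recomputed as its transition score plus a graph-based completion score over its partial tree, and the beam is re-sorted.

```python
def rescore_beam(beam, completion_score, k):
    # beam: list of states {"y": set of (head, dep) edges, "score_T": float};
    # rank by transition score plus completion score and keep the best k
    ranked = sorted(beam,
                    key=lambda s: s["score_T"] + completion_score(s["y"]),
                    reverse=True)
    return ranked[:k]

def toy_completion_score(y):
    # toy second-order weight: once the (bought -> with -> friend)
    # grandchild factor is complete, it rewards attaching "with" to "bought"
    return 0.5 if {("bought", "with"), ("with", "friend")} <= y else 0.0
```

With a slightly higher base score for attaching with to house, completing the grandchild factor flips the ranking in favor of bought, which is exactly the correction described in the text.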
This is a trade-off between an exhaustive search and an unrestricted (rich) feature set, and the question of which provides a higher accuracy is still an open research question, cf. (Kuhlmann et al., 2011).

Parsing of non-projective dependency trees is an important feature for many languages. At first, most algorithms were restricted to projective dependency trees and used pseudo-projective parsing (Kahane et al., 1998; Nivre and Nilsson, 2005). Later, additional transitions were introduced to handle non-projectivity (Attardi, 2006; Nivre, 2009). The most common strategy uses the swap transition (Nivre, 2009; Nivre et al., 2009); an alternative solution uses two planes and a switch transition to switch between the two planes (Gómez-Rodríguez and Nivre, 2010).

Since we use the scoring model of a graph-based parser, we briefly review related work on graph-based parsing. The most well known graph-based parser is the MST (maximum spanning tree) parser, cf. (McDonald et al., 2005; McDonald and Pereira, 2006). The idea of the MST parser is to find the highest scoring tree in a graph that contains all possible edges. Eisner (1996) introduced a dynamic programming algorithm to solve this problem efficiently. Carreras (2007) introduced the left-most and right-most grandchild as factors. We use the factor model of Carreras (2007) as a starting point for our experiments, cf. Section 4. We extend the graph-based model of Carreras (2007) with factors involving three edges, similar to those of Koo and Collins (2010).

3 Transition-based Parser with a Beam

This section specifies the transition-based beam-search parser underlying the combined approach more formally. Sec. 4 will discuss the graph-based scoring model that we are adding.

The input to the parser is a word string x; the goal is to find the optimal set y of labeled edges x_i →_l x_j forming a dependency tree over x ∪ {root}. We characterize the state of a transition-based parser as Λ_i = ⟨σ_i, β_i, y_i, h_i⟩, Λ_i ∈ Λ, the set of possible states. σ_i is a stack of words from x that are still under consideration; β_i is the input buffer, the suffix of x yet to be processed; y_i is the set of labeled edges already assigned (a partial labeled dependency tree); h_i is a sequence recording the history of transitions (from the set of operations Ω = {shift, left-arc_l, right-arc_l, reduce, swap}) taken up to this point.

(1) The initial state Λ_0 has an empty stack, the input buffer is the full input string x, and the edge set is empty. (2) The (partial) transition function δ(Λ_i, t): Λ × Ω → Λ maps a state and an operation t to a new state Λ_{i+1}. (3) Final states Λ_f are characterized by an empty input buffer and stack; no further transitions can be taken.

The transition function is informally defined as follows. The shift transition removes the first element of the input buffer and pushes it to the stack. The left-arc_l transition adds an edge with label l from the first word in the buffer to the word on top of the stack, removes the top element from the stack and pushes the first element of the input buffer to the stack. The right-arc_l transition adds an edge from the word on top of the stack to the first word in the input buffer, removes the top element of the input buffer and pushes that element onto the stack. The reduce transition pops the top word from the stack. The swap transition changes the order of the two top elements on the stack (possibly generating non-projective trees).

When more than one operation is applicable, a scoring function assigns a numerical value (based on a feature vector and a weight vector trained by supervised machine learning) to each possible continuation. When using a beam search approach with beam size k, the highest-scoring k alternative states with the same length n of transition history h are kept in a set beam_n.

In the beam-based parsing algorithm (cf. the pseudo code in Algorithm 1), all candidate states for the next set beam_{n+1} are determined using the transition function δ, but based on the scoring function, only the best k are preserved. (Final) states to which no more transitions apply are copied to the next state set. This means that once all transition paths have reached a final state, the overall best-scoring states can be read off the final beam_n. The y of the top-scoring state is the predicted parse.

Under the plain transition-based scoring regime score_T, the score for a state Λ is the sum of the local scores for the transitions t_i in the state's history sequence:

score_T(Λ) = Σ_{i=0}^{|h|} w · f(Λ_i, t_i)
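The transition inventory can be rendered as a toy Python sketch (a literal reading of the informal definitions above, with preconditions omitted; the State fields mirror ⟨σ, β, y, h⟩, and all names are ours):

```python
from collections import namedtuple

# state <stack, buffer, edges, history>; edges are (head, label, dependent)
State = namedtuple("State", "stack buffer edges history")

def shift(s):
    return State(s.stack + [s.buffer[0]], s.buffer[1:], s.edges,
                 s.history + ["shift"])

def left_arc(s, label):
    # edge from buffer front to stack top; the stack top is popped and
    # the buffer front moves onto the stack
    edge = (s.buffer[0], label, s.stack[-1])
    return State(s.stack[:-1] + [s.buffer[0]], s.buffer[1:],
                 s.edges | {edge}, s.history + ["left-arc"])

def right_arc(s, label):
    # edge from stack top to buffer front; the buffer front moves onto the stack
    edge = (s.stack[-1], label, s.buffer[0])
    return State(s.stack + [s.buffer[0]], s.buffer[1:],
                 s.edges | {edge}, s.history + ["right-arc"])

def reduce_(s):
    return State(s.stack[:-1], s.buffer, s.edges, s.history + ["reduce"])

def swap(s):
    return State(s.stack[:-2] + [s.stack[-1], s.stack[-2]], s.buffer,
                 s.edges, s.history + ["swap"])
```

For "He bought it", the sequence shift, left-arc(SBJ), right-arc(OBJ), reduce, reduce ends in a final state (empty stack and buffer) with the two expected edges.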
Algorithm 1: Transition-based parser
// x is the input sentence, k is the beam size
σ_0 ← ∅, β_0 ← x, y_0 ← ∅, h_0 ← ∅
Λ_0 ← ⟨σ_0, β_0, y_0, h_0⟩  // initial parts of a state
beam_0 ← {Λ_0}  // create initial state
n ← 0  // iteration
repeat
    n ← n + 1
    for all Λ_j ∈ beam_{n-1} do
        transitions ← possible-applicable-transitions(Λ_j)
        // if no transition is applicable keep state Λ_j:
        if transitions = ∅ then beam_n ← beam_n ∪ {Λ_j}
        else for all t_i ∈ transitions do
            // apply the transition t_i to state Λ_j
            Λ ← δ(Λ_j, t_i)
            beam_n ← beam_n ∪ {Λ}
        // end for
    // end for
    sort beam_n by score(Λ_j)
    beam_n ← sublist(beam_n, 0, k)
until beam_{n-1} = beam_n  // beam changed?

4 Completion Model

We define an augmented scoring function which can be used in the same beam-search algorithm in order to ensure that in the scoring of alternative transition paths, larger configurations can be exploited as they are completed in the incremental process. The feature configurations can be largely taken from graph-based approaches. Here, spans from the string are assembled in a bottom-up fashion, and the scoring for an edge can be based on structurally completed subspans (factors).

Our completion model for scoring a state Λ_n incorporates factors for all configurations (matching the extraction scheme that is applied) that are present in the partial dependency graph y_n built up to this point, which is continuously augmented. This means that if at a given point n in the transition path, complete information for a particular configuration (e.g., a third-order factor involving a head, its dependent and its grand-child dependent) is unavailable, scoring will ignore this factor at time n, but the configuration will inform the scoring later on, maybe at point n + 4, when the complete information for this factor has entered the partial graph y_{n+4}.

We present results for a number of different second-order and third-order feature models.

Second Order Factors. We start with the model introduced by Carreras (2007). Figure 3 illustrates the factors used.

Figure 4: 2b. The left-most dependent of the head or the right-most dependent in the right-headed case.

Figure 4 illustrates a new type of factor we use, which includes the left-most dependent in the left-headed case and, symmetrically, the right-most sibling in the right-headed case.

Third Order Factors. In addition to the second order factors, we investigate combinations of third order factors. Figures 5 and 6 illustrate the third order factors, which are similar to the factors of Koo and Collins (2010). They restrict the factor to the innermost sibling pair for the tri-siblings
and the outermost pair for the grand-siblings. We use the first two siblings of the dependent from the left side of the head for the tri-siblings and the first two dependents of the child for the grand-siblings. With these factors, we aim to capture non-projective edges and subcategorization information. Figure 7 illustrates a factor of a sequence of four nodes. All the right-headed variants are symmetric and left out for brevity.

Figure 5: 3a. The first two children of the head, which do not include the edge between the head and the dependent.

Figure 6: 3b. The first two children of the dependent.

Figure 7: 3c. The right-most dependent of the right-most dependent.

Integrated approach. To obtain an integrated system for the various feature models, the scoring function of the transition-based parser from Section 3 is augmented by a family of scoring functions score_Gm for the completion model, where m is from 2a, 2b, 3a, etc., x is the input string, and y is the (partial) dependency tree built so far:

score_Tm(Λ) = score_T(Λ) + score_Gm(x, y)

The scoring function of the completion model depends on the selected factor model Gm. The model G2a comprises the edge factoring of Figure 3. With this model, we obtain the following scoring function:

score_G2a(x, y) = Σ_{(h,c)∈y} w · f_first(x, h, c)
    + Σ_{(h,c,ci)∈y} w · f_sib(x, h, c, ci)
    + Σ_{(h,c,cmo)∈y} w · f_gra(x, h, c, cmo)
    + Σ_{(h,c,cmi)∈y} w · f_gra(x, h, c, cmi)

The function f maps the input sentence x and a subtree y defined by the indexes to a feature vector. Again, w is the corresponding weight vector. In order to add the factor of Figure 4 to our model, we have to add to the scoring function (2a) the sum:

(2b) score_G2b(x, y) = score_G2a(x, y) + Σ_{(h,c,cmi)∈y} w · f_gra(x, h, c, cmi)

In order to build a scoring function for a combination of the factors shown in Figures 5 to 7, we have to add to equation 2b one or more of the following sums:

(3a) Σ_{(h,c,ch1,ch2)∈y} w · f_gra(x, h, c, ch1, ch2)
(3b) Σ_{(h,c,cm1,cm2)∈y} w · f_gra(x, h, c, cm1, cm2)
(3c) Σ_{(h,c,cmo,tmo)∈y} w · f_gra(x, h, c, cmo, tmo)

Feature Set. The feature set of the transition model is similar to that of Zhang and Nivre (2011). In addition, we use the cross product of morphological features between the head and the dependent, since we also apply the parser to morphologically rich languages.

The feature sets of the completion model described above are mostly based on previous work (McDonald et al., 2005; McDonald and Pereira, 2006; Carreras, 2007; Koo and Collins, 2010). The models denoted with + use all combinations of words before and after the head, dependent, sibling, grandchildren, etc. These are three- and four-grams for the first and second order factors, respectively. The algorithm includes these features only if the words left and right do not overlap with the factor (e.g. the head, dependent, etc.). We use a feature extraction procedure for second order and third order factors. Each feature extracted in this procedure includes information about the position of the nodes relative to the other nodes of the part and a factor identifier.

Training. For the training of our parser, we use a variant of the perceptron algorithm that uses the Passive-Aggressive update function, cf. (Freund and Schapire, 1998; Collins, 2002; Crammer et al., 2006). The Passive-Aggressive perceptron uses an aggressive update strategy, modifying the weight vector by as much as needed to classify the current example correctly, cf. (Crammer et al., 2006). We apply a random function (hash function) to retrieve the weights from the weight vector instead of a table. Bohnet (2010) showed that the Hash Kernel improves parsing speed and accuracy since the parser additionally uses negative features. Ganchev and Dredze (2008) used this technique for structured prediction in NLP to reduce the needed space, cf. (Shi et al., 2009).
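A minimal sketch of how a completion score in the spirit of (2a) can be computed with a hash-kernel style weight lookup (the feature strings, the precomputed factor extraction passed in as arguments, and all names are illustrative assumptions, not the paper's implementation):

```python
def hash_weight(w, feature, size=2**20):
    # hash-kernel style lookup (cf. Bohnet, 2010): index the weight
    # vector by a hash of the feature string instead of a feature table
    return w.get(hash(feature) % size, 0.0)

def score_g2a(w, words, y, siblings, grandchildren):
    # Sum first-order, sibling and grandchild factor weights over the
    # edges (h, c) of the partial tree y; h, c are word indices into words.
    total = 0.0
    for (h, c) in y:
        total += hash_weight(w, f"first:{words[h]}:{words[c]}")
        for ci in siblings.get((h, c), ()):
            total += hash_weight(w, f"sib:{words[h]}:{words[c]}:{words[ci]}")
        for cm in grandchildren.get((h, c), ()):
            total += hash_weight(w, f"gra:{words[h]}:{words[c]}:{words[cm]}")
    return total
```

Because the weights are retrieved through a hash, negative features (features never seen in training) simply look up a zero entry rather than growing a feature table.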
We use a weight vector size of 800 million. After the training, we counted 65 million non-zero weights for English (Penn2Malt), 83 million for Czech and 87 million for German. The feature vectors are the union of features originating from the transition sequence of a sentence and the features of the factors over all edges of a dependency tree (e.g. G2a, etc.). To prevent over-fitting, we use averaging, cf. (Freund and Schapire, 1998; Collins, 2002). We calculate the error e as the sum of all attachment errors and label errors, both weighted by 0.5. We use the following equations to compute the update:

loss: l_t = e − (score_T(x_t, y_t^g) − score_T(x_t, y_t))

PA-update: τ_t = l_t / ||f_g − f_p||²

We train the model to select the transitions and the completion model together and therefore we use one parameter space. In order to compute the weight vector, we employ standard online learning with 25 training iterations, and carry out early updates, cf. Collins and Roark (2004) and Zhang and Clark (2008).

Efficient Implementation. Keeping the scoring with the completion model tractable with millions of feature weights and for second- and third-order factors requires careful bookkeeping and a number of specialized techniques from recent work on dependency parsing.

We use two variables to store the scores: (a) for complete factors and (b) for incomplete factors. The complete factors (first-order factors and higher-order factors for which further augmentation is structurally excluded) need to be calculated only once and can then be stored with the tree factors. The incomplete factors (higher-order factors whose node elements may still receive additional descendants) need to be dynamically recomputed while the tree is built.

The parsing algorithm only has to compute the scores of the factored model when the transition-based parser selects a left-arc or right-arc transition and the beam has to be sorted. The parser sorts the beam when it exceeds the maximal beam size, in order to discard superfluous parses, or when the parsing algorithm terminates, in order to select the best parse tree. The complexity of the transition-based parser is quadratic in the worst case due to the swap operation, which is rare, and O(n) in the best case, cf. (Nivre, 2009). The beam size B is constant. Hence, the complexity is in the worst case O(n²).

The parsing time is to a large degree determined by the feature extraction, the score calculation and the implementation, cf. also (Goldberg and Elhadad, 2010). The transition-based parser is able to parse 30 sentences per second. The parser with completion model processes about 5 sentences per second with a beam size of 80. Note that we use a rich feature set, a completion model with third order factors, negative features, and a large beam.³

³ 6-core, 3.33 GHz Intel Nehalem.

We implemented the following optimizations: (1) We use parallel feature extraction for the beam elements. Each process extracts the features, scores the possible transitions and computes the score of the completion model. After the extension step, the beam is sorted and the best elements are selected according to the beam size. (2) The calculation of each score is optimized (beyond the distinction of a static and a dynamic component): We calculate for each location, determined by the last element s_l of σ_i and the first element b_0 of β_i, a numeric feature representation. This is kept fixed and we add only the numeric value for each of the edge labels plus a value for the transition left-arc or right-arc. In this way, we create the features incrementally. This has some similarity to Goldberg and Elhadad (2010). (3) We apply edge filtering as it is used in graph-based dependency parsing, cf. (Johansson and Nugues, 2008), i.e., we calculate the edge weights only for the labels that were found for the part-of-speech combination of the head and dependent in the training data.

5 Parsing Experiments and Discussion

The results of different parsing systems are often hard to compare due to differences in phrase structure to dependency conversions, corpus version, and experimental settings. For better comparison, we provide results on English for two commonly used data sets, based on two different conversions of the Penn Treebank. The first uses the Penn2Malt conversion based on the head-finding rules of Yamada and Matsumoto (2003).
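The loss and PA-update equations from the Training paragraph above can be sketched with sparse dict-based vectors (the representation and all names are ours, not the paper's implementation):

```python
def pa_update(w, f_gold, f_pred, e):
    # Passive-aggressive update sketch: loss l_t = e - (score(gold) -
    # score(pred)); step size tau_t = l_t / ||f_gold - f_pred||^2.
    # w and the feature vectors are sparse dicts {feature: value}.
    score = lambda f: sum(w.get(k, 0.0) * v for k, v in f.items())
    loss = e - (score(f_gold) - score(f_pred))
    diff = {k: f_gold.get(k, 0.0) - f_pred.get(k, 0.0)
            for k in set(f_gold) | set(f_pred)}
    norm_sq = sum(v * v for v in diff.values())
    if loss <= 0.0 or norm_sq == 0.0:
        return dict(w)  # already separated by a sufficient margin
    tau = loss / norm_sq
    return {**w, **{k: w.get(k, 0.0) + tau * v for k, v in diff.items()}}
```

The update is "aggressive" in the sense that a single step moves the weights exactly far enough that the gold structure outscores the prediction by the error-weighted margin.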
Section        Sentences  PoS Acc.
Training 2–21  39,832     97.08
Dev 24          1,394     97.18
Test 23         2,416     97.30

Table 1: Overview of the training, development and test data split, converted to dependency graphs with the head-finding rules of (Yamada and Matsumoto, 2003). The last column shows the accuracy of the part-of-speech tags.

Table 1 gives an overview of the properties of the corpus. The annotation of the corpus does not contain non-projective links. The training data was 10-fold jackknifed with our own tagger.⁴ Table 1 shows the tagging accuracy.

⁴ http://code.google.com/p/mate-tools/

Table 2 lists the accuracy of our transition-based parser with completion model together with results from related work. All results use predicted PoS tags. As a baseline, we present in addition results without the completion model and a graph-based parser with second order features (G2a). For the graph-based parser, we used 10 training iterations. The following rows, denoted T2a, T2ab, T2ab3a, T2ab3b, T2ab3c, and T2ab3abc+, present the results for the parser with completion model. The subscript letters denote the factors of the completion model used, as shown in Figures 3 to 7. The parsers with a subscripted plus (e.g. G2a+) in addition use feature templates that contain one word left or right of the head, dependent, siblings, and grandchildren. We left those features out in our previous models as they may interfere with the second and third order factors. As in previous work, we exclude punctuation marks for the English data converted with Penn2Malt in the evaluation, cf. (McDonald et al., 2005; Koo and Collins, 2010; Zhang and Nivre, 2011).⁵ We optimized the feature model of our parser on section 24 and used section 23 for evaluation. We use a beam size of 80 for our transition-based parser and 25 training iterations.

⁵ We follow Koo and Collins (2010) and ignore any token whose POS tag is one of the punctuation tags : , .

Parser                        UAS    LAS
(McDonald et al., 2005)       90.9
(McDonald and Pereira, 2006)  91.5
(Huang and Sagae, 2010)       92.1
(Zhang and Nivre, 2011)       92.9
(Koo and Collins, 2010)       93.04
(Martins et al., 2010)        93.26
T (baseline)                  92.7
G2a (baseline)                92.89
T2a                           92.94  91.87
T2ab                          93.16  92.08
T2ab3a                        93.20  92.10
T2ab3b                        93.23  92.15
T2ab3c                        93.17  92.10
T2ab3abc+                     93.39  92.38
G2a+                          93.1
(Koo et al., 2008)†           93.16
(Carreras et al., 2008)†      93.5
(Suzuki et al., 2009)†        93.79

Table 2: English Attachment Scores for the Penn2Malt conversion of the Penn Treebank for the test set. Punctuation is excluded from the evaluation. The results marked with † are not directly comparable to our work as they depend on additional sources of information (Brown Clusters).

The second English data set was obtained by using the LTH conversion schema as used in the CoNLL Shared Task 2009, cf. (Hajic et al., 2009). This corpus preserves the non-projectivity of the phrase structure annotation, has a rich edge label set, and provides automatically assigned PoS tags. From the same data set, we selected the corpora for Czech and German. In all cases, we used the provided training, development, and test data split, cf. (Hajic et al., 2009). In contrast to the evaluation of the Penn2Malt conversion, we include punctuation marks for these corpora and follow in that the evaluation schema of the CoNLL Shared Task 2009. Table 3 presents the results obtained for these data sets.

The transition-based parser obtains higher accuracy scores for Czech but still lower scores for English and German. For Czech, the result of T is 1.59 percentage points higher than the top labeled score in the CoNLL shared task 2009. The reason is that T already includes third order features that are needed to determine some edge labels. The transition-based parser with completion model T2a has an even 2.62 percentage points higher accuracy and improves the results of the parser T by an additional 1.03 percentage points. The results of the parser T are lower for English and German compared to the results of the graph-based parser G2a. The completion model T2a can reach a similar accuracy level for these two languages. The third order features let the transition-based parser reach higher scores than the graph-based parser. The third order features contribute for each language a relatively small improvement of the score.
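The attachment-score evaluation used throughout these experiments (UAS/LAS, with punctuation optionally excluded) can be sketched as follows (the token representation and the default punctuation tag set are illustrative assumptions):

```python
def attachment_scores(gold, pred, punct_tags=(",", ".", ":")):
    # gold and pred: parallel lists of (head, label, pos_tag) per token;
    # tokens whose gold tag is a punctuation tag are excluded, as in
    # the Penn2Malt evaluation described above
    pairs = [(g, p) for g, p in zip(gold, pred) if g[2] not in punct_tags]
    uas = sum(g[0] == p[0] for g, p in pairs) / len(pairs)
    las = sum(g[0] == p[0] and g[1] == p[1] for g, p in pairs) / len(pairs)
    return uas, las
```

For the CoNLL 2009 data sets, one would pass an empty punct_tags tuple, since punctuation is included in that evaluation scheme.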
Parser                    Eng.         Czech        German
(Gesmundo et al., 2009)*  88.79/-      80.38        87.29
(Bohnet, 2009)            89.88/-      80.11        87.48
T (Baseline)              89.52/92.10  81.97/87.26  87.53/89.86
G2a (Baseline)            90.14/92.36  81.13/87.65  87.79/90.12
T2a                       90.20/92.55  83.01/88.12  88.22/90.36
T2ab                      90.26/92.56  83.22/88.34  88.31/90.24
T2ab3a                    90.20/90.51  83.21/88.30  88.14/90.23
T2ab3b                    90.26/92.57  83.22/88.35  88.50/90.59
T2ab3abc                  90.31/92.58  83.31/88.30  88.33/90.45
G2a+                      90.39/92.8   81.43/88.0   88.26/90.50
T2ab3ab+                  90.36/92.66  83.48/88.47  88.51/90.62

Table 3: Labeled Attachment Scores of parsers that use the data sets of the CoNLL shared task 2009. In line with previous work, punctuation is included. The parsers marked with * used a joint model for syntactic parsing and semantic role labelling. We provide more parsing results for the languages of the CoNLL-X Shared Task at http://code.google.com/p/mate-tools/.

Parser                   UAS   LAS
(Zhang and Clark, 2008)  84.3
(Huang and Sagae, 2010)  85.2
(Zhang and Nivre, 2011)  86.0  84.4
T2ab3abc+                87.5  85.9

Table 4: Chinese Attachment Scores for the conversion of CTB 5 with the head rules of Zhang and Clark (2008). We take the standard split of CTB 5 and, in line with previous work, use gold segmentation and POS tags and exclude punctuation marks from the evaluation.

Small and statistically significant improvements are provided by the additional second order factor (2b).⁶ We tried to determine the best third order factor or set of factors, but we cannot identify a single factor that is best for all languages. For German, we obtained a significant improvement with the factor (3b). We believe that this is due to the flat annotation of PPs in the German corpus. If we combine all third order factors, we obtain for the Penn2Malt conversion a small improvement of 0.2 percentage points over the results of (2ab). We think that a deeper feature selection for third order factors may help to improve the accuracy further.

⁶ The results of the baseline T compared to T2ab3abc are statistically significant (p < 0.01).

In Table 4, we present results on the Chinese Treebank. To our knowledge, we obtain the best published results so far.

6 Conclusion and Future Work

The parser introduced in this paper combines advantageous properties from the two major paradigms in data-driven dependency parsing, in particular the worst case quadratic complexity of transition-based parsing with a swap operation and the consideration of complete second and third order factors in the scoring of alternatives. While previous work using third order factors, cf. Koo and Collins (2010), was restricted to unlabeled and projective trees, our parser can produce labeled and non-projective dependency trees.

In contrast to parser stacking, which involves running two parsers in training and application, we use only the feature model of a graph-based parser but not the graph-based parsing algorithm. This is not only conceptually superior, but makes training much simpler, since no jackknifing has to be carried out. Zhang and Clark (2008) proposed a similar combination, without the rescoring procedure. Our implementation allows for the use of rich feature sets in the combined scoring functions, and our experimental results show that the graph-based completion model leads to an increase of between 0.4 (for English) and about 1 percentage point (for Czech). The scores go beyond the current state of the art results for typologically different languages such as Chinese, Czech, English, and German. For Czech, English (Penn2Malt) and German, these are to our knowledge the highest reported scores of a dependency parser that does not use additional sources of information (such as extra unlabeled training data for clustering). Note that the efficient techniques and implementation such as the Hash Kernel, the incremental calculation of the scores of the completion model, and the parallel feature extraction as well as the parallelized transition-based parsing strategy play an important role in carrying out this idea in practice.

References

S. Abney. 1991. Parsing by chunks. In Principle-Based Parsing, pages 257–278. Kluwer Academic Publishers.

G. Attardi. 2006. Experiments with a Multilanguage Non-Projective Dependency Parser. In Tenth Conference on Computational Natural Language Learning (CoNLL-X).

B. Bohnet. 2009. Efficient Parsing of Syntactic and
Semantic Dependency Structures. In Proceedings of the 13th Conference on Computational Natural Language Learning (CoNLL-2009).

B. Bohnet. 2010. Top accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 89–97, Beijing, China, August. Coling 2010 Organizing Committee.

X. Carreras, M. Collins, and T. Koo. 2008. TAG, dynamic programming, and the perceptron for efficient, feature-rich parsing. In Proceedings of the Twelfth Conference on Computational Natural Language Learning, CoNLL '08, pages 9–16, Stroudsburg, PA, USA. Association for Computational Linguistics.

X. Carreras. 2007. Experiments with a Higher-order Projective Dependency Parser. In EMNLP/CoNLL.

M. Collins and B. Roark. 2004. Incremental parsing with the perceptron algorithm. In ACL, pages 111–118.

M. Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In EMNLP.

K. Crammer, O. Dekel, S. Shalev-Shwartz, and Y. Singer. 2006. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research,

S. Pado, J. Stepanek, P. Stranak, M. Surdeanu, N. Xue, and Y. Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, pages 1–18, Boulder, United States, June.

L. Huang and K. Sagae. 2010. Dynamic programming for linear-time incremental parsing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1077–1086, Uppsala, Sweden, July. Association for Computational Linguistics.

R. Johansson and P. Nugues. 2006. Investigating multilingual dependency parsing. In Proceedings of the Shared Task Session of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pages 206–210, New York City, United States, June 8–9.

R. Johansson and P. Nugues. 2008. Dependency-based Syntactic–Semantic Analysis with PropBank and NomBank. In Proceedings of the Shared Task Session of CoNLL-2008, Manchester, UK.

S. Kahane, A. Nasr, and O. Rambow. 1998. Pseudo-projectivity: A polynomially parsable non-projective dependency grammar. In COLING-ACL, pages 646–652.

T. Koo and M. Collins. 2010. Efficient third-order
7:551585.
dependency parsers. In Proceedings of the 48th
J. Eisner. 1996. Three New Probabilistic Models for
Annual Meeting of the Association for Computa-
Dependency Parsing: An Exploration. In Proceed-
tional Linguistics, pages 111, Uppsala, Sweden,
ings of the 16th International Conference on Com-
July. Association for Computational Linguistics.
putational Linguistics (COLING-96), pages 340
Terry Koo, Xavier Carreras, and Michael Collins.
345, Copenhaen.
2008. Simple semi-supervised dependency parsing.
Y. Freund and R. E. Schapire. 1998. Large margin pages 595603.
classification using the perceptron algorithm. In T. Kudo and Y. Matsumoto. 2002. Japanese de-
11th Annual Conference on Computational Learn- pendency analysis using cascaded chunking. In
ing Theory, pages 209217, New York, NY. ACM proceedings of the 6th conference on Natural lan-
Press. guage learning - Volume 20, COLING-02, pages 1
K. Ganchev and M. Dredze. 2008. Small statisti- 7, Stroudsburg, PA, USA. Association for Compu-
cal models by random feature mixing. In Proceed- tational Linguistics.
ings of the ACL-2008 Workshop on Mobile Lan- M. Kuhlmann, C. Gomez-Rodrguez, and G. Satta.
guage Processing. Association for Computational 2011. Dynamic programming algorithms for
Linguistics. transition-based dependency parsers. In ACL, pages
A. Gesmundo, J. Henderson, P. Merlo, and I. Titov. 673682.
2009. A Latent Variable Model of Syn- Andre Martins, Noah Smith, Eric Xing, Pedro Aguiar,
chronous Syntactic-Semantic Parsing for Multiple and Mario Figueiredo. 2010. Turbo parsers: De-
Languages. In Proceedings of the 13th Confer- pendency parsing by approximate variational infer-
ence on Computational Natural Language Learning ence. pages 3444.
(CoNLL-2009), Boulder, Colorado, USA., June 4-5. R. McDonald and F. Pereira. 2006. Online Learning
Y. Goldberg and M. Elhadad. 2010. An efficient al- of Approximate Dependency Parsing Algorithms.
gorithm for easy-first non-directional dependency In In Proc. of EACL, pages 8188.
parsing. In HLT-NAACL, pages 742750. R. McDonald, K. Crammer, and F. Pereira. 2005. On-
C. Gomez-Rodrguez and J. Nivre. 2010. A line Large-margin Training of Dependency Parsers.
Transition-Based Parser for 2-Planar Dependency In Proc. ACL, pages 9198.
Structures. In ACL, pages 14921501. J. Nivre and R. McDonald. 2008. Integrating Graph-
J. Hajic, M. Ciaramita, R. Johansson, D. Kawahara, Based and Transition-Based Dependency Parsers.
M. Antonia Mart, L. Marquez, A. Meyers, J. Nivre, In ACL-08, pages 950958, Columbus, Ohio.
86
J. Nivre and J. Nilsson. 2005. Pseudo-projective de-
pendency parsing. In ACL.
J. Nivre, M. Kuhlmann, and J. Hall. 2009. An im-
proved oracle for dependency parsing with online
reordering. In Proceedings of the 11th Interna-
tional Conference on Parsing Technologies, IWPT
09, pages 7376, Stroudsburg, PA, USA. Associa-
tion for Computational Linguistics.
J. Nivre. 2003. An Efficient Algorithm for Pro-
jective Dependency Parsing. In 8th International
Workshop on Parsing Technologies, pages 149160,
Nancy, France.
J. Nivre. 2009. Non-Projective Dependency Parsing
in Expected Linear Time. In Proceedings of the
47th Annual Meeting of the ACL and the 4th IJC-
NLP of the AFNLP, pages 351359, Suntec, Singa-
pore.
K. Sagae and A. Lavie. 2006. Parser combina-
tion by reparsing. In NAACL 06: Proceedings of
the Human Language Technology Conference of the
NAACL, Companion Volume: Short Papers on XX,
pages 129132, Morristown, NJ, USA. Association
for Computational Linguistics.
Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola,
and S.V.N. Vishwanathan. 2009. Hash Kernels for
Structured Data. In Journal of Machine Learning.
J. Suzuki, H. Isozaki, X. Carreras, and M Collins.
2009. An empirical study of semi-supervised struc-
tured conditional models for dependency parsing.
In EMNLP, pages 551560.
I. Titov and J. Henderson. 2007. A Latent Variable
Model for Generative Dependency Parsing. In Pro-
ceedings of IWPT, pages 144155.
H. Yamada and Y. Matsumoto. 2003. Statistical De-
pendency Analysis with Support Vector Machines.
In Proceedings of IWPT, pages 195206.
Y. Zhang and S. Clark. 2008. A tale of two
parsers: investigating and combining graph-based
and transition-based dependency parsing using
beam-search. In Proceedings of EMNLP, Hawaii,
USA.
Y. Zhang and J. Nivre. 2011. Transition-based de-
pendency parsing with rich non-local features. In
Proceedings of the 49th Annual Meeting of the As-
sociation for Computational Linguistics: Human
Language Technologies, pages 188193, Portland,
Oregon, USA, June. Association for Computational
Linguistics.
87
Answer Sentence Retrieval by Matching Dependency Paths
Acquired from Question/Answer Sentence Pairs

Michael Kaisser
AGT Group (R&D) GmbH
Jägerstr. 41, 10117 Berlin, Germany
mkaisser@agtgermany.com

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 88-98, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
Set        Head     Tail
Query #    15,665   12,500
how        1.33%    2.42%
what       0.77%    1.89%
define     0.34%    0.18%
is/are     0.25%    0.42%
where      0.18%    0.45%
do/does    0.14%    0.30%
can        0.14%    0.25%
why        0.13%    0.30%
who        0.12%    0.38%
when       0.09%    0.21%
which      0.03%    0.08%
Total      3.55%    6.86%

Table 1: Percentages of Natural Language queries in head and tail search engine query logs. See text for details.

...sued less than 500 times during a three-month period, and it disregards query frequency. As a result, rare and frequent queries have the same chance of being selected. Doubles are excluded from both sets. Table 1 lists the percentage of queries in the query sets that start with the specified word. In most contexts this indicates that the query is a question, which in turn means that we are dealing with an NLQ. Of course there are many NLQs that start with words other than the ones listed, so we can expect their real percentage to be even higher.

2 Related Work

In IR the problem that queries and relevant textual content often do not exhibit the same terms is commonly encountered. Latent Semantic Indexing (Deerwester et al., 1990) was an early, highly influential approach to solve this problem. More recently, a significant amount of research has been dedicated to query alteration approaches. (Cui et al., 2002), for example, assume that if queries containing one term often result in the selection of documents containing another term, then a strong relationship between the two terms exists. In their approach, query terms and document terms are linked via sessions in which users click on documents that are presented as results for the query. (Riezler and Liu, 2010) apply a Statistical Machine Translation model to parallel data consisting of user queries and snippets from clicked web documents and in such a way extract contextual expansion terms from the query rewrites.

We see our work as addressing the same fundamental problem, but shifting focus from query term/document term mismatch to mismatches observed between the grammatical structure of Natural Language Queries and relevant text pieces. In order to achieve this we analyze the syntactic structure of the queries and of the relevant content by using dependency paths.

Especially in QA there is a strong tradition of using dependency structures: (Lin and Pantel, 2001) present an unsupervised algorithm to automatically discover inference rules (essentially paraphrases) from text. These inference rules are based on dependency paths, each of which connects two nouns. Their paths have the following form:

N:subj:V <- find -> V:obj:N <- solution -> N:to:N

This path represents the relation "X finds a solution to Y" and can be mapped to another path representing e.g. "X solves Y". As such, the approach is suitable to detect paraphrases that describe the relation between two entities in documents. However, the paper does not describe how the mined paraphrases can be linked to questions, and which paraphrase is suitable to answer which question type.

(Attardi et al., 2001) describes a QA system that, after a set of candidate answer sentences has been identified, matches their dependency relations against the question. Questions and answer sentences are parsed with MiniPar (Lin, 1998) and the dependency output is analyzed in order to determine whether relations present in a question also appear in a candidate sentence. For the question "Who killed John F. Kennedy?", for example, an answer sentence is expected to contain the answer as subject of the verb "kill", to which "John F. Kennedy" should be in object relation.

(Cui et al., 2005) describe a fuzzy dependency relation matching approach to passage retrieval in QA. Here, the authors present a statistical technique to measure the degree of overlap between dependency relations in candidate sentences and their corresponding relations in the question. Question/answer passage pairs from the TREC-8 and TREC-9 evaluations are used as training data. As in some of the papers mentioned earlier, a statistical translation model is used, but this time to learn relatedness between paths. (Cui et al., 2004) apply the same idea to answer extraction. In each sentence returned by the IR module, all named entities of the expected answer types are treated as answer candidates. For questions with an unknown answer type, all NPs in the candidate sentence are considered. Then those paths in the answer sentence that are connected to an answer candidate are compared against the corresponding paths in the question, in a similar fashion as in (Cui et al., 2005). The candidate whose paths show the highest matching score is selected.

(Shen and Klakow, 2006) also describe a method that is primarily based on similarity scores between dependency relation pairs. However, their algorithm computes the similarity of paths between key phrases, not between words. Furthermore, it takes relations in a path not as independent from each other, but acknowledges that they form a sequence, by comparing two paths with the help of an adaptation of the Dynamic Time Warping algorithm (Rabiner et al., 1991). (Molla, 2006) presents an approach for the acquisition of question answering rules by applying graph manipulation methods. Questions are represented as dependency graphs, which are extended with information from answer sentences. These combined graphs can then be used to identify answers. Finally, in (Wang et al., 2007), a quasi-synchronous grammar (Smith and Eisner, 2006) is used to model relations between questions and answer sentences.

In this paper we describe an algorithm that learns possible syntactic answer sentence formulations for syntactic question classes from a set of example question/answer sentence pairs. Unlike the related work described above, it acknowledges that a) a valid answer sentence's syntax might be very different from the question's syntax and b) several valid answer sentence structures, which might be completely independent from each other, can exist for one and the same question.

To illustrate this, consider the question "When was Alaska purchased?" The following four sentences all answer the given question, but only the first sentence is a straightforward reformulation of the question:

1. The United States purchased Alaska in 1867 from Russia.
2. Alaska was bought from Russia in 1867.
3. In 1867, the Russian Empire sold the Alaska territory to the USA.
4. The acquisition of Alaska by the United States of America from Russia in 1867 is known as Seward's Folly.

The remaining three sentences introduce various forms of syntactic and semantic transformations. In order to capture a wide range of possible ways in which answer sentences can be formulated, in our model a candidate sentence is not evaluated according to its similarity with the question. Instead, its similarity to known answer sentences (which were presented to the system during training) is evaluated. This allows us to capture a much wider range of syntactic and semantic transformations.

3 Overview of the Algorithm

Our algorithm uses input data containing pairs of the following:

NLQs/Questions: NLQs that describe the user's information need. For the experiments carried out in this paper we use questions from the TREC QA track 2002-2006.

Relevant textual content: This is a piece of text that is relevant to the user query in that it contains the information the user is searching for. In this paper, we use sentences extracted from the AQUAINT corpus (Graff, 2002) that contain the answer to the given TREC question.

In total, the data available to us for our experiments consists of 8,830 question/answer sentence pairs. This data is publicly available, see (Kaisser and Lowe, 2008). The algorithm described in this paper has three main steps:

Phrase alignment: Key phrases from the question are paired with phrases from the answer sentences.

Pattern creation: The dependency structures of queries and answer sentences are analyzed and patterns are extracted.

Pattern evaluation: The patterns discovered in the last step are evaluated and a confidence score is assigned to each.

The acquired patterns can then be used during retrieval, where a question is matched against the antecedents describing the syntax of the question.
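The three-step pipeline above can be sketched in code. The function names and the Pattern record below are illustrative assumptions (the paper does not publish an implementation), but the data flow follows the steps listed:

```python
from dataclasses import dataclass


@dataclass
class Pattern:
    # Flat syntactic question representation, e.g. "When[1]+was[2]+NP[3]+VERB[4]"
    query_template: str
    # Dependency path from each aligned constituent to the answer
    # (constituent index -> tuple of relation labels, or None if unaligned)
    paths: dict
    correct: int = 0      # filled in later, during pattern evaluation
    incorrect: int = 0


def learn_patterns(pairs, segment, align, extract_paths):
    """pairs: iterable of (question, answer_sentence, answer_string).

    segment, align and extract_paths are the three processing steps;
    they are passed in here as callables to keep the sketch generic.
    """
    patterns = []
    for question, sentence, answer in pairs:
        template, key_phrases = segment(question)            # question segmentation
        alignment = align(key_phrases, sentence)             # phrase alignment
        paths = extract_paths(alignment, sentence, answer)   # pattern creation
        patterns.append(Pattern(template, paths))
    return patterns
```

The evaluation step would then fill in the `correct`/`incorrect` counts of each stored pattern.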
Input: (a) Query: When was Alaska purchased?
       (b) Answer sentence: The acquisition of Alaska happened in 1867.

Step 1: The question is segmented into key phrases and stop words:
    When[1]+was[2]+NP[3]+VERB[4]

Step 2: Key question phrases are aligned with key answer sentence phrases:
    [3] Alaska    -> Alaska
    [4] purchased -> acquisition
    ANSWER        -> 1867

Step 3: A pre-computed parse tree of the answer sentence is loaded:
    1: The         (the, DT, 2)          [det]
    2: acquisition (acquisition, NN, 5)  [nsubj]
    3: of          (of, IN, 2)           [prep]
    4: Alaska      (Alaska, NNP, 3)      [pobj]
    5: happened    (happen, VBD, null)   [ROOT]
    6: in          (in, IN, 5)           [prep]
    7: 1867        (1867, CD, 6)         [pobj]

Step 4: Dependency paths from key question phrases to the answer are computed:
    Alaska -> 1867:      pobj-prep-nsubj-prep-pobj
    acquisition -> 1867: nsubj-prep-pobj

Step 5: The resulting pattern is stored:
    Query:  When[1]+was[2]+NP[3]+VERB[4]
    Path 3: pobj-prep-nsubj-prep-pobj
    Path 4: nsubj-prep-pobj

Figure 1: The pattern creation algorithm exemplified in five key steps for the query "When was Alaska purchased?" and the answer sentence "The acquisition of Alaska happened in 1867."
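Step 4 can be made concrete: given each token's head index and dependency relation from the parse in Step 3, the path between two tokens runs from each token up to their lowest common ancestor. A minimal sketch (our own illustration, not the paper's code; we read Alaska's head as token 3, "of", which the paths shown in Step 4 imply):

```python
def path_to_root(tok, heads):
    """Return the list of token indices from tok up to the root (inclusive)."""
    chain = [tok]
    while heads[tok] is not None:
        tok = heads[tok]
        chain.append(tok)
    return chain


def dep_path(a, b, heads, rels):
    """Relation labels along the path from token a to token b
    through their lowest common ancestor (LCA)."""
    up, down = path_to_root(a, heads), path_to_root(b, heads)
    common = set(up) & set(down)
    # The LCA is the first shared node on a's chain to the root.
    lca = next(n for n in up if n in common)
    up_part = [rels[n] for n in up[:up.index(lca)]]          # a upward to LCA
    down_part = [rels[n] for n in reversed(down[:down.index(lca)])]  # LCA down to b
    return up_part + down_part


# Parse from Figure 1: token index -> head index (root has head None)
heads = {1: 2, 2: 5, 3: 2, 4: 3, 5: None, 6: 5, 7: 6}
rels = {1: "det", 2: "nsubj", 3: "prep", 4: "pobj",
        5: "ROOT", 6: "prep", 7: "pobj"}
```

Applied to the example, `dep_path(4, 7, heads, rels)` (Alaska to 1867) yields the path pobj-prep-nsubj-prep-pobj shown in Step 4.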
Note that one question can potentially match several patterns. The consequents contain descriptions of grammatical structures of potential answer sentences that can be used to identify and evaluate candidate sentences.

4 Phrase Alignment

The goal of this processing step is to align phrases from the question with corresponding phrases from the answer sentences in the training data. Consider the following example:

Query: When was the Alaska territory purchased?
Answer sentence: The acquisition of what would become the territory of Alaska took place in 1867.

The mapping that has to be achieved is:

Query phrase       Answer sentence phrase
Alaska territory   territory of Alaska
purchased          acquisition
ANSWER             1867

In our approach, this is a two-step process. First we align on a word level, then the output of the word alignment process is used to identify and align phrases. Word alignment is important in many fields of NLP, e.g. Machine Translation (MT), where words in parallel, bilingual corpora need to be aligned; see (Och and Ney, 2003) for a comparison of various statistical alignment models. In our case, however, we are dealing with a monolingual alignment problem, which enables us to exploit clues not available for bilingual alignment: First of all, we can expect many query words to be present in the answer sentence, either with the exact same surface appearance or in some morphological variant. Secondly, there are tools available that tell us how semantically related two words are, most notably WordNet (Miller et al., 1993). For these reasons we implemented a bespoke alignment strategy, tailored towards our problem description.

This method is described in detail in (Kaisser, 2009). The processing steps described in the next sections build on its output. For reasons of brevity, we skip a detailed explanation in this paper and focus only on its key part: the alignment of words with very different surface structures. For more details we would like to point the reader to the aforementioned work.

In the above example, the alignment of "purchased" and "acquisition" is the most problematic, because the surface structures of the two words clearly are very different. For such cases we experimented with a number of alignment strategies based on WordNet. These approaches are similar in that each picks one word that has to be aligned from the question at a time and compares it to all of the non-stop words in the answer sentence. Each of the answer sentence words is assigned a value between zero and one expressing its relatedness to the question word. The highest scoring word, if above a certain threshold, is selected as the closest semantic match. Most of these approaches make use of WordNet::Similarity, a Perl software package that "measures semantic similarity (or relatedness) between a pair of word senses by returning a numeric value that represents the degree to which they are similar or related" (Pedersen et al., 2004). Additionally, we developed a custom-built method that assumes that two words are semantically related if any kind of pointer exists between any occurrences of the words' root forms in WordNet. For details of these experiments, please refer to (Kaisser, 2009). In our experiments the custom-built method performed best, and was therefore used for the experiments described in this paper. The main reasons for this are:

1. Many of the measures in the WordNet::Similarity package take only hyponym/hypernym relations into account. This makes aligning words of different parts of speech difficult or even impossible. However, such alignments are important for our needs.

2. Many of the measures return results even if only a weak semantic relationship exists. For our purposes, however, it is beneficial to only take strong semantic relations into account.

5 Pattern Creation

Figure 1 details our algorithm in its five key steps. In steps 1 and 2, key phrases from the question are aligned to the corresponding phrases in the answer sentence, see Section 4 of this paper. Step 3 is concerned with retrieving the parse tree for the answer sentence. In our implementation, all answer sentences in the training set have, for performance reasons, been parsed beforehand with the Stanford Parser (Klein and Manning, 2003b; Klein and Manning, 2003a), so at this point they are simply loaded from file. Step 4 is the key step in our algorithm. From the previous steps, we know where the key constituents from the question as well as the answer are located in the answer sentence. This enables us to compute the dependency paths in the answer sentence's parse tree that connect the answer with the key constituents. In our example, the answer is "1867" and the key constituents are "acquisition" and "Alaska". Knowing the syntactic relationships (captured by their dependency paths) between the answer and the key phrases enables us to capture one syntactic possibility of how answer sentences to queries of the form When+was+NP+VERB can be formulated.

As can be seen in Step 5, a flat syntactic question representation is stored, together with numbers assigned to each constituent. The numbers of those constituents for which alignments in the answer sentence were sought and found are listed together with the resulting dependency paths. Path 3, for example, denotes the path from constituent 3 (the NP "Alaska") to the answer. If no alignment could be found for a constituent, null is stored instead of a path. Should two or more alternative constituents be identified for one question constituent, additional patterns are created, so that each contains one of the possibilities. The described procedure is repeated for all question/answer sentence pairs in the training set and for each, one or more patterns are created.

It is worth noting that many TREC questions are fairly short and grammatically simple. In our training data we for example find 102 questions matching the pattern When[1]+was[2]+NP[3]+VERB[4], which together list 382 answer sentences, and thus 382 potentially different answer sentence structures from which patterns can be gained. As a result, the amount of training examples we have available is sufficient to achieve the performance described in Section 7. The algorithm described in this paper can of course also be used for more complicated NLQs, although in such a scenario a significantly larger amount of training data would have to be used.

6 Pattern Evaluation

For each created pattern, at least one matching example must exist: the sentence that was
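The word-level selection procedure described in Section 4 (score every non-stop answer-sentence word against a question word, keep the best match if it clears a threshold) can be sketched as follows. The relatedness table and threshold value here are illustrative stand-ins; the actual system uses the WordNet-pointer method described in (Kaisser, 2009):

```python
STOP_WORDS = {"the", "of", "was", "a", "in", "to"}


def align_word(question_word, answer_words, relatedness, threshold=0.5):
    """Pick the non-stop answer-sentence word most related to
    question_word, provided its score clears the threshold."""
    best_word, best_score = None, 0.0
    for w in answer_words:
        if w.lower() in STOP_WORDS:
            continue
        score = relatedness(question_word, w)
        if score > best_score:
            best_word, best_score = w, score
    return best_word if best_score >= threshold else None


# Toy relatedness table standing in for the WordNet-based measure.
REL = {("purchased", "acquisition"): 0.9, ("purchased", "happened"): 0.1}


def toy_relatedness(a, b):
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0  # exact surface match
    return REL.get((a, b), REL.get((b, a), 0.0))
```

With the Figure 1 sentence, this aligns "purchased" to "acquisition" and "Alaska" to itself, mirroring the alignment shown in Step 2.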
used to create it in the first place. However, we do not know how precise each pattern is. To this end, an additional processing step between pattern creation and application is needed: pattern evaluation. Similar approaches to ours have been described in the relevant literature, many of them concerned with bootstrapping, starting with (Ravichandran and Hovy, 2002). The general purpose of this step is to use the available data about questions and their correct answers to evaluate how often each created pattern returns a correct or an incorrect result. This data is stored with each pattern, and the result of the equation, often called pattern precision, can be used during the retrieval stage. Pattern precision in our case is defined as:

p = (#correct + 1) / (#correct + #incorrect + 2)    (1)

We use Lucene to retrieve the top 100 paragraphs from the AQUAINT corpus by issuing a query that consists of the query's key words and all non-stop words in the answer. Then, all patterns are loaded whose antecedent matches the query that is currently being processed. After that, constituents from all sentences in the retrieved 100 paragraphs are aligned to the query's constituents in the same way as for the sentences during pattern creation, see Section 5. Now, the paths specified in these patterns are searched for in the paragraphs' parse trees. If they are all found, it is checked whether they all point to the same node and whether this node's surface structure is present in some morphological form in the answer strings associated with the question in our training data. If this is the case, a variable in the pattern named correct is increased by 1; otherwise the variable incorrect is increased by 1. After the evaluation process is finished, the final version of the pattern given as an example in Figure 1 is:

Query:     When[1]+was[2]+NP[3]+VERB[4]
Path 3:    pobj-prep-nsubj-prep-pobj
Path 4:    nsubj-prep-pobj
Correct:   15
Incorrect: 4

The variables correct and incorrect are used during retrieval, where the score of an answer candidate ac is the sum of the scores of all matching patterns p_i:

score(ac) = sum_{i=1..n} score(p_i)    (2)

where

score(p_i) = (correct_i + 1) / (correct_i + incorrect_i + 2)   if the pattern matches    (3)
           = 0                                                 otherwise

The highest scoring candidate is selected.

We would like to explicitly call out one property of our algorithm: it effectively returns two entities: a) a sentence that constitutes a valid response to the query, and b) the head node of a phrase in that sentence that constitutes the answer. Therefore the algorithm can be used for sentence retrieval or for answer retrieval; it depends on the application which of the two behaviors is desired. In the next section, we evaluate its answer retrieval performance.

7 Experiments & Results

This section provides an evaluation of the algorithm described in this paper. The key questions we seek to answer are the following:

1. How does our method perform when compared to a baseline that extracts dependency paths from the question?

2. How much does the described algorithm improve the performance of a state-of-the-art QA system?

3. What is the effect of training data size on performance? Can we expect that more training data would further improve the algorithm's performance?

7.1 Evaluation Setup

We use all factoid questions in TREC's QA test sets from 2002 to 2006 for which a known answer exists in the AQUAINT corpus. Additionally, the data in (Lin and Katz, 2005) is used. In this paper the authors attempt to identify a much more complete set of relevant documents for a subset of TREC 2002 questions than TREC itself. We adopt a cross-validation approach for our evaluation. Table 4 shows how the data is split into five folds.

In order to evaluate the algorithm's patterns we need a set of sentences to which they can be applied. In a traditional QA system architecture,
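The pattern scoring of formulas (1) to (3) amounts to add-one smoothed precision estimates and can be transcribed directly (a sketch; the variable names are ours):

```python
def pattern_precision(correct, incorrect):
    # Formulas (1)/(3): add-one smoothed precision of a pattern.
    # With no evidence at all (0 correct, 0 incorrect) this is 0.5.
    return (correct + 1) / (correct + incorrect + 2)


def score_candidate(matching_patterns):
    """Formula (2): an answer candidate's score is the sum of the
    precisions of all patterns that matched it.

    matching_patterns: iterable of (correct, incorrect) counts."""
    return sum(pattern_precision(c, i) for c, i in matching_patterns)
```

For the evaluated pattern from Figure 1 (Correct: 15, Incorrect: 4), `pattern_precision(15, 4)` gives 16/21, roughly 0.762.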
Test  Number of Correct Answer Sentences                                          Mean  Med
set   =0     <=1    <=3    <=5    <=10   <=25   <=50   >=75   >=90   >=100
2002  0.203  0.396  0.580  0.671  0.809  0.935  0.984  0.0    0.0    0.0     6.86  2.0
2003  0.249  0.429  0.627  0.732  0.828  0.955  0.997  0.003  0.003  0.0     5.67  2.0
2004  0.221  0.368  0.539  0.637  0.799  0.936  0.985  0.0    0.0    0.0     6.51  3.0
2005  0.245  0.404  0.574  0.665  0.777  0.912  0.987  0.0    0.0    0.0     7.56  2.0
2006  0.241  0.389  0.568  0.665  0.807  0.920  0.966  0.006  0.0    0.0     8.04  3.0

Table 2: Fraction of sentences that contain correct answers in Evaluation Set 1 (approximation).

Table 3: Fraction of sentences that contain correct answers in Evaluation Set 2 (approximation).
Test Q Qs with Min one Overall Accuracy Acc. if
place the answer sentences in the input file with set number patterns correct correct overall pattern
2002 429 321 77 37 0.086 0.115
the questions, and assume that the question word 2003 354 237 39 26 0.073 0.120
2004 204 142 25 15 0.074 0.073
indicates the position where the answer should be 2005 319 214 38 18 0.056 0.084
located. 2006 352 208 34 16 0.045 0.077
Sum 1658 1122 213 112 0.068 0.100
Test Q Qs with >1 Overall Accuracy Acc. if
set number patterns correct correct overall pattern
2002 429 321 147 50 0.117 0.156
Table 8: Baseline performance based on evaluation set
2003 354 237 76 22 0.062 0.093 2.
2004 204 142 74 26 0.127 0.183
2005 319 214 97 46 0.144 0.215
2006 352 208 85 31 0.088 0.149
Sum 1658 1122 452 176 0.106 0.156 described in this paper and the baseline approach
do not make use of many techniques commonly
Table 5: Performance based on evaluation set 1.
used to increase performance of a QA system, e.g.
TF-IDF fallback strategies, fuzzy matching, man-
Test Q Qs with >1 Overall Accuracy Acc. if
set number patterns correct correct overall pattern ual reformulation patterns etc. It was a deliberate
2002 429 321 239 133 0.310 0.414
2003 354 237 149 88 0.248 0.371 decision from our side not to use any of these ap-
2004 204 142 119 65 0.319 0.458
2005 319 214 161 92 0.288 0.429 proaches. After all, this would result in an ex-
2006 352 208 139 84 0.238 0.403
Sum 1658 1122 807 462 0.278 0.411
perimental setup where the performance of our
answer extraction strategy could not have been
Table 6: Performance based on evaluation set 2. observed in isolation. The QA system used as a
baseline in the next section makes use of many of
Tables 5 and 6 show how our algorithm per- these techniques and we will see that our method,
forms on evaluation sets 1 and 2, respectively. Ta- as described here, is suitable to increase its per-
bles 7 and 8 show how the baseline performs on formance significantly.
evaluation sets 1 and 2, respectively. The tables
columns list the year of the TREC test set used, 7.3 Impact on an existing QA System
the number of questions in the set (we only use Tables 9 and 10 show how our algorithm in-
questions for which we know that there is an an- creases performance of our QuALiM system, see
swer in the corpus), the number of questions for e.g. (Kaisser et al., 2006). Section 6 in this pa-
which one or more patterns exist, how often at per describes via formulas 2 and 3 how answer
least one pattern returned the correct answer, how often we get an overall correct result by taking all patterns and their confidence values into account, accuracy@1 of the overall system, and accuracy@1 computed only for those questions for which we have at least one pattern available (for all other questions the system returns no result). As can be seen, on evaluation set 1 our method outperforms the baseline by 300%, and on evaluation set 2 by 311%, taking accuracy if a pattern exists as a basis.

Test set  Questions  Qs with pattern  Min. one correct  Overall correct  Acc. overall  Acc. if pattern
2002      429        321              43                14               0.033         0.044
2003      354        237              28                10               0.028         0.042
2004      204        142              19                 6               0.029         0.042
2005      319        214              21                 7               0.022         0.033
2006      352        208              20                 7               0.020         0.034
Sum      1658       1122             131                44               0.027         0.039

Table 7: Baseline performance based on evaluation set 1.

Many of the papers cited earlier that use an approach similar to our baseline approach of course report much better results than Tables 7 and 8. This, however, is not too surprising, as the approach

how candidates are ranked. This ranking is combined with the existing QA system's candidate ranking by simply using it as an additional feature that boosts candidates proportionally to their confidence score. The difference between both tables is that the first uses all 1658 questions in our test sets for the evaluation, whereas the second considers only those 1122 questions for which our system was able to learn a pattern. Thus, for Table 10, questions which the system had no chance of answering due to limited training data are omitted. As can be seen, accuracy@1 increases by 4.9% on the complete test set and by 11.5% on the partial set.

Note that the QA system used as a baseline is at an advantage in at least two respects: a) it has important web-based components and as such has access to a much larger body of textual information; b) the algorithm described in this paper is an answer extraction approach only. For paragraph retrieval we use the same approach as for evaluation set 1 (see Section 7.1). However, in more than 20% of the cases, this method returns not a single paragraph that contains both the answer and at least one question keyword. In such cases, the simple paragraph retrieval makes it close to impossible for our algorithm to return the correct answer.
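The combination with the host QA system's ranking can be sketched as follows. The additive scheme and the `alpha` weight are illustrative assumptions; the text states only that candidates are boosted proportionally to the confidence score of the pattern that proposed them:

```python
def combine_rankings(candidates, pattern_confidence, alpha=0.5):
    """candidates: list of (answer, qa_score) pairs from the host QA system.
    pattern_confidence: confidence of the learned pattern that proposed
    each answer (0.0 if no pattern fired). alpha is a hypothetical weight."""
    combined = [
        (answer, qa_score + alpha * pattern_confidence.get(answer, 0.0))
        for answer, qa_score in candidates
    ]
    # accuracy@1 is then computed on the highest-scoring candidate.
    return sorted(combined, key=lambda pair: pair[1], reverse=True)

reranked = combine_rankings([("Paris", 0.40), ("Lyon", 0.50)], {"Paris": 0.90})
```

With these toy scores, the pattern-backed candidate overtakes the one the host system originally ranked first.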
Test Set  QuALiM  QASP   combined  increase
2002      0.530   0.156  0.595     12.3%
2003      0.380   0.093  0.430     13.3%
2004      0.465   0.183  0.514     10.6%
2005      0.388   0.214  0.421      8.4%
2006      0.385   0.149  0.428     11.3%
02-06     0.436   0.157  0.486     11.5%

Table 10: Top-1 accuracy of the QuALiM system on its own and when combined with the algorithm described in this paper, considering only questions for which a pattern could be acquired from the training data. All increases are statistically significant using a sign test (p < 0.05).

7.4 Effect of Training Data Size

We now assess the effect of training data size on performance. Tables 5 and 6, presented earlier, show that an average of 32.2% of the questions have no matching patterns. This is because the data used for training contained no examples for a significant subset of question classes. It can be expected that, if more training data were available, this percentage would decrease and performance would increase. In order to test this assumption, we repeated the evaluation procedure detailed in this section several times, initially using data from only one TREC test set for training and then gradually adding more sets until all available training data had been used. The results for evaluation set 2 are presented in Figure 2. As can be seen, every time more data is added, performance increases. This strongly suggests that the point of diminishing returns, at which adding additional training data no longer improves performance, has not yet been reached.

8 Conclusions

In this paper we present an algorithm that learns, from a collection of paired questions and answer sentences, syntactic information about how textual content relevant to a question can be formulated. Unlike previous work employing dependency paths for QA, our approach does not assume that a valid answer sentence is similar to the question, and it allows for many potentially very different syntactic answer sentence structures. The algorithm is evaluated using TREC data, and it is shown that it outperforms an algorithm that merely uses the syntactic information contained in the question itself by 300%. It is also shown that the algorithm significantly improves the performance of a state-of-the-art QA system.

As always, there are many ways in which our algorithm could be improved. Combining it with fuzzy matching techniques as in (Cui et al., 2004) or (Cui et al., 2005) is an obvious direction for future work. We are also aware that, in order to apply our algorithm on a larger scale and in a real-world setting with real users, we would need a much larger set of training data. This data could be acquired semi-manually, for example by using crowd-sourcing techniques. We are also thinking about fully automated approaches, or about using indirect human evidence, e.g., user clicks in search engine logs. Typically, users only see the title and a short abstract of a document when clicking on a result, so it is possible to imagine a scenario where a subset of these abstracts, paired with user queries, could serve as training data.
References

Giuseppe Attardi, Antonio Cisternino, Francesco Formica, Maria Simi, and Alessandro Tommasi. 2001. PIQASso: Pisa Question Answering System. In Proceedings of the 2001 Edition of the Text REtrieval Conference (TREC-01).

Gosse Bouma, Jori Mur, and Gertjan van Noord. 2005. Reasoning over Dependency Relations for QA. In Proceedings of the IJCAI Workshop on Knowledge and Reasoning for Answering Questions (KRAQ-05).

Hang Cui, Ji-Rong Wen, Jian-Yun Nie, and Wei-Ying Ma. 2002. Probabilistic Query Expansion Using Query Logs. In 11th International World Wide Web Conference (WWW-02).

Hang Cui, Keya Li, Renxu Sun, Tat-Seng Chua, and Min-Yen Kan. 2004. National University of Singapore at the TREC-13 Question Answering Main Task. In Proceedings of the 2004 Edition of the Text REtrieval Conference (TREC-04).

Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan, and Tat-Seng Chua. 2005. Question Answering Passage Retrieval Using Dependency Relations. In Proceedings of the 28th ACM-SIGIR International Conference on Research and Development in Information Retrieval (SIGIR-05).

Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, and Richard Harshman. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6).

David Graff. 2002. The AQUAINT Corpus of English News Text.

Michael Kaisser and John Lowe. 2008. Creating a Research Collection of Question Answer Sentence Pairs with Amazon's Mechanical Turk. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC-08).

Michael Kaisser, Silke Scheible, and Bonnie Webber. 2006. Experiments at the University of Edinburgh for the TREC 2006 QA Track. In Proceedings of the 2006 Edition of the Text REtrieval Conference (TREC-06).

Michael Kaisser. 2009. Acquiring Syntactic and Semantic Transformations in Question Answering. Ph.D. thesis, University of Edinburgh.

Dan Klein and Christopher D. Manning. 2003a. Accurate Unlexicalized Parsing. In Proceedings of the 41st Meeting of the Association for Computational Linguistics (ACL-03).

Dan Klein and Christopher D. Manning. 2003b. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15.

Jimmy Lin and Boris Katz. 2005. Building a Reusable Test Collection for Question Answering. Journal of the American Society for Information Science and Technology (JASIST).

Dekang Lin and Patrick Pantel. 2001. Discovery of Inference Rules for Question-Answering. Natural Language Engineering, 7(4):343–360.

Dekang Lin. 1998. Dependency-based Evaluation of MINIPAR. In Workshop on the Evaluation of Parsing Systems.

George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller. 1993. Introduction to WordNet: An On-Line Lexical Database. Journal of Lexicography, 3(4):235–244.

Diego Molla. 2006. Learning of Graph-based Question Answering Rules. In Proceedings of the HLT/NAACL 2006 Workshop on Graph Algorithms for Natural Language Processing.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–52.

Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::Similarity - Measuring the Relatedness of Concepts. In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04).

John Prager. 2006. Open-Domain Question-Answering. Foundations and Trends in Information Retrieval, 1(2).

L. R. Rabiner, A. E. Rosenberg, and S. E. Levinson. 1991. Considerations in Dynamic Time Warping Algorithms for Discrete Word Recognition. IEEE Transactions on Acoustics, Speech and Signal Processing.

Deepak Ravichandran and Eduard Hovy. 2002. Learning Surface Text Patterns for a Question Answering System. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02).

Stefan Riezler and Yi Liu. 2010. Query Rewriting Using Monolingual Statistical Machine Translation. Computational Linguistics, 36(3).

Dan Shen and Dietrich Klakow. 2006. Exploring Correlation of Dependency Relation Paths for Answer Extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL (COLING/ACL-06).

David A. Smith and Jason Eisner. 2006. Quasi-synchronous Grammars: Alignment by Soft Projection of Syntactic Dependencies. In Proceedings of the HLT-NAACL Workshop on Statistical Machine Translation.

Ellen M. Voorhees. 1999. Overview of the Eighth Text REtrieval Conference (TREC-8). In Proceedings of the Eighth Text REtrieval Conference (TREC-8).

Ellen M. Voorhees. 2003. Overview of the TREC 2003 Question Answering Track. In Proceedings of the 2003 Edition of the Text REtrieval Conference (TREC-03).

Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the Jeopardy Model? A Quasi-synchronous Grammar for QA. In Proceedings of EMNLP-CoNLL 2007.
Can Click Patterns across Users' Query Logs Predict Answers to Definition Questions?

Alejandro Figueroa
Yahoo! Research Latin America
Blanco Encalada 2120, Santiago, Chile
afiguero@yahoo-inc.com

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 99–108, Avignon, France, April 23-27, 2012. © 2012 Association for Computational Linguistics
compared to other research in the area of definition extraction. It is at the crossroads of query log analysis and QA systems. We study the click behavior of search engine users with regard to definition questions. Based on this study, we propose a novel way of acquiring large-scale and heterogeneous training material for this task, which consists of:

- automatically obtaining positive samples in accordance with click patterns of search engine users. This aids in harvesting a host of descriptions from non-KB sources in conjunction with descriptive information from KBs.

- automatically acquiring negative data in consonance with redundancy patterns across snippets displayed within search engine results when processing definition queries.

In brief, our experiments reveal that these patterns can be effectively exploited for devising efficient models.

Given the huge amount of amassed data, we additionally contrast the performance of systems built on top of samples originating solely from KBs, solely from non-KBs, and from both combined. Our comparison corroborates that KBs yield massive trustworthy descriptive knowledge, but they do not bear enough diversity to discriminate all answering nuggets within any kind of text. Essentially, our experiments unveil that non-KB data is richer and is therefore useful for discovering more descriptive nuggets than KB material; but its usage relies on its cleanness and on a negative set. Many people have had these intuitions before, but to the best of our knowledge, we provide the first empirical confirmation and quantification.

The road-map of this paper is as follows: section 2 touches on related work; section 3 digs deeper into click patterns for definition questions; subsequently, section 4 explains our corpus construction strategy; section 5 describes our experiments; and section 6 draws final conclusions.

(Katz et al., 2007; Westerhout, 2009; Navigli and Velardi, 2010). Due to training, there is a pressing necessity for large-scale authoritative sources of descriptive and non-descriptive nuggets. In the same manner, there is a growing importance of strategies capable of extracting trustworthy negative and positive samples from any type of text. Conventionally, these methods interpret descriptions as positive examples, whereas contexts providing non-descriptive information are taken as negative elements. Four representative techniques are:

- The centroid vector (Xu et al., 2003; Cui et al., 2004) collects an array of articles about the definiendum from a battery of pre-determined KBs. These articles are then used to learn a vector of word frequencies, with which answer candidates are rated afterwards. Sometimes web-snippets together with a query reformulation method are exploited instead of pre-defined KBs (Chen et al., 2006).

- (Androutsopoulos and Galanis, 2005) gathered articles from KBs to score 250-character windows carrying the definiendum. These windows were taken from the Internet and, accordingly, highly similar windows were interpreted as positive examples, while highly dissimilar ones were taken as negative samples. For this purpose, two thresholds are used, which ensure the trustworthiness of both sets. However, they also cause the sets to be less diverse, as not all definienda are widely covered across KBs. Indeed, many facets outlined within the 250-character windows will not be detected.

- (Xu et al., 2005) manually labeled samples taken from an Intranet. Manual annotations are constrained to a small amount of examples, because it requires substantial human effort to tag a large corpus, and disagreements between annotators are not uncommon.
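A minimal sketch of the centroid-vector idea above: learn a word-frequency vector from KB articles about the definiendum and rate answer candidates by similarity to it. Whitespace tokenization, raw counts (rather than, e.g., TF-IDF weights), and cosine scoring are simplifying assumptions, not the cited systems' exact recipe:

```python
from collections import Counter
import math

def centroid(kb_articles):
    # Word-frequency vector learned from KB articles about the definiendum.
    freqs = Counter()
    for text in kb_articles:
        freqs.update(text.lower().split())
    return freqs

def cosine(u, v):
    # Cosine similarity between two sparse count vectors (Counters).
    dot = sum(u[w] * v[w] for w in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_candidates(kb_articles, candidates):
    # Highest-rated candidate first.
    c = centroid(kb_articles)
    return sorted(candidates,
                  key=lambda s: cosine(Counter(s.lower().split()), c),
                  reverse=True)

ranked = rank_candidates(
    ["tom is a ballet dancer", "a dancer at the ballet"],
    ["tom is a principal dancer", "buy cheap tickets now"])
```

The narrow-coverage problem discussed later in the paper follows directly from this design: candidates using vocabulary absent from the KB articles score near zero.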
definitional QA, that is to say, massive examples harvested from KBs and non-KBs. Fundamentally, positive examples are extracted from web snippets grounded on click patterns of users of a search engine, whereas the negative collection is acquired via redundancy patterns across web-snippets displayed to the user by the search engine. This data is capitalized on by two state-of-the-art definition extractors, which are different in nature. In addition, our paper discusses the effect on performance of different sorts (KBs and non-KBs) and amounts of training data.

As for user clicks, they provide valuable relevance feedback for a variety of tasks, cf. (Radlinski et al., 2010). For instance, (Ji et al., 2009) extracted relevance information from clicked and non-clicked documents within aggregated search sessions. They modelled sequences of clicks as a means of learning to globally rank the relative relevance of all documents with respect to a given query. (Xu et al., 2010) improved the quality of training material for learning-to-rank approaches by predicting labels using clickthrough data. In our work, we combine click patterns across Yahoo! search query logs with QA techniques to build one-sided and two-sided classifiers for recognizing answers to definition questions.

3 User Click Analysis for Definition QA

In this section, we examine a collection of queries submitted to the Yahoo! search engine during the period from December 2010 to March 2011. More specifically, for this analysis, we considered a log encompassing a random sample of 69,845,262 (23,360,089 distinct) queries. Basically, this log comprises the query sent by the user in conjunction with the displayed URLs and information about the sequence of their clicks.

In the first place, we associate each query with a category in the taxonomy proposed by (Rose and Levinson, 2004), and in this way definition queries are selected. Secondly, we investigate user click patterns observed across these filtered definition questions.

3.1 Finding Definition Queries

According to (Broder, 2002; Lee et al., 2005; Dupret and Piwowarski, 2008), the intention of the user falls into at least two categories: navigational (e.g., google) and informational (e.g., maximum entropy models). The former entails the desire of going to a specific site that the user has in mind, and the latter regards the goal of learning something by reading or viewing some content (Rose and Levinson, 2004). Navigational queries are hence of less relevance to definition questions, and for this reason they were removed according to the following three criteria:

- (Lee et al., 2005) pointed out that users will only visit the web site they bear in mind when prompting navigational queries. Thus, these queries are characterized by clicks on the same URL almost all the time (Lee et al., 2005). More precisely, we discarded queries that a) appear more than four times in the query log and b) whose most clicked URL accounts for more than 98% of all their clicks. Following the same idea, we additionally eliminated prompted URLs and queries where the clicked URL is of the form www.search-query-without-spaces.

- By the same token, queries containing keywords such as homepage, on-line, and sign in were also removed.

- After the previous steps, many navigational queries (e.g., facebook) still remained in the query log. We noticed that a substantial portion was signaled by several frequently and indistinctly clicked URLs. Take for instance facebook: www.facebook.com and www.facebook.com/login.php. With this in mind, we discarded entries contained in a manually compiled black list of the 600 most frequent such cases.

A third category in (Rose and Levinson, 2004) regards resource queries, which we distinguished via keywords like image, lyrics and maps. Altogether, 24,916,610 queries (35.67%; 3,576,817 distinct) were seen as navigational or resource. Note that in (Rose and Levinson, 2004) both classes together encompassed between 37% and 38% of their query set.

Subsequently, we profited from the remaining 44,928,652 (informational) entries for detecting queries where the intention of the user is finding descriptive information about a topic (i.e., the definiendum).
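The navigational/resource filtering criteria can be sketched as follows. The occurrence and click-share thresholds (4 and 98%) come from the text; the keyword sets shown are illustrative subsets, and the URL-matching heuristic is a simplified stand-in:

```python
NAV_WORDS = {"homepage", "on-line", "sign in"}
RESOURCE_WORDS = {"image", "lyrics", "maps"}

def is_navigational_or_resource(query, frequency, url_clicks, blacklist=frozenset()):
    """query: the query string; frequency: how often it occurs in the log;
    url_clicks: clicked URL -> click count for this query."""
    q = query.lower()
    if q in blacklist:  # manually compiled list of frequent cases
        return True
    if any(word in q for word in NAV_WORDS | RESOURCE_WORDS):
        return True
    total = sum(url_clicks.values())
    if frequency > 4 and total:
        top_url, top_count = max(url_clicks.items(), key=lambda kv: kv[1])
        # One URL concentrates almost all clicks.
        if top_count / total > 0.98:
            return True
        # Clicked URL is the query itself without spaces,
        # e.g., "face book" -> www.facebook.com.
        if top_url.replace("www.", "").split(".")[0] == q.replace(" ", ""):
            return True
    return False
```

A query such as facebook, clicked ten times with 99% of clicks on one host, is filtered out, whereas a noun phrase with dispersed clicks survives.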
In the taxonomy delineated by (Rose and Levinson, 2004), informational queries are sub-categorized into five groups, including list, locate, and definitional (directed and undirected). In practice, we filtered definition questions as follows:

1. We exploited an array of expressions that are commonly utilized in query analysis for classifying definition questions (Figueroa, 2010), e.g., Who is/was..., What is/was a/an..., define..., and describe.... Overall, these rules assisted in selecting 332,227 entries.

2. As stated in (Dupret and Piwowarski, 2008), informational queries are typified by the user clicking several documents. In light of that, we say that some definitional queries are characterized by multiple clicks, where at least one belongs to a KB. This aids in capturing the intention of the user when looking for descriptive knowledge while only entering noun phrases like thoracic outlet syndrome:

   1: www.medicinenet.com, en.wikipedia.org, health.yahoo.net, www.livestrong.com
   2: health.yahoo.net, en.wikipedia.org, www.medicinenet.com, www.mayoclinic.com
   3: en.wikipedia.org, www.nismat.org
   4: en.wikipedia.org

   Table 1: Four distinct sequences of hosts clicked by users given the search query thoracic outlet syndrome.

   In so doing, we manually compiled a list of 36 frequently clicked KB hosts (e.g., Wikipedia and the Britannica encyclopedia). This filter produced 567,986 queries.

Unfortunately, since query logs stored by search engines are not publicly available due to privacy and legal concerns, there is no accessible training material to build models on top of annotated data. Thus, we exploited the aforementioned hand-crafted rules to connect queries to their respective category in this taxonomy.

3.2 User Click Patterns

In substance, the first filter recognizes the intention of the user by means of the formulation given by the user (e.g., What is a/the/an...). With regard to this filter, some interesting observations are as follows:

- In 40.27% of the entries, users did not visit any of the displayed web-sites. Consequently, we concluded that the information conveyed within the multiple snippets was often enough to answer the respective definition question. In other words, a significant fraction of the users were satisfied with a small set of brief, but quickly generated, descriptions.

- In 2.18% of these cases, the search engine returned no results, and a few times users tried another paraphrase or query, due to useless results or misspellings.

- We also noticed that definition questions matched by these expressions are seldom related to more than one click, although informational queries in general produce several clicks. In 46.44% of the cases, the user clicked a sole document, and more surprisingly, we observed that users are likely to click sources different from KBs, in contrast to the widespread belief in definition QA research. Users pick hits originating from small but domain-specific web-sites as a result of at least two effects: a) they are looking for minor or ancillary senses of the definiendum (e.g., ETA in www.travel-industry-dictionary.com); and, more pertinently, b) the user does not trust the information yielded by KBs and chooses more authoritative resources, for instance, when looking for reliable medical information (e.g., What is hypothyroidism? and What is mrsa infection?).

While the first filter infers the intention of the user from the query itself, the second deduces it from the origin of the clicked documents. With regard to this second filter, clicking patterns are more dispersed. Here, the first two clicks normally correspond to the top two or three ranked hits returned by the search engine; see also (Ji et al., 2009).
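The two definition-query filters (surface formulation of the query, and multiple clicks with at least one on a KB host) can be sketched together. The patterns and hosts shown are illustrative subsets of the actual rule set and of the 36-host list:

```python
import re

# Illustrative subset of the surface expressions used to spot definition queries.
DEF_PATTERNS = [
    re.compile(r"^(who|what)\s+(is|was)\b"),
    re.compile(r"^(define|describe)\b"),
]
# Illustrative subset of the manually compiled KB host list.
KB_HOSTS = {"en.wikipedia.org", "www.britannica.com"}

def is_definition_query(query, clicked_hosts):
    q = query.lower().strip()
    # Filter 1: the surface formulation of the query itself.
    if any(p.search(q) for p in DEF_PATTERNS):
        return True
    # Filter 2: multiple clicks, at least one on a knowledge-base host.
    if len(clicked_hosts) > 1 and any(h in KB_HOSTS for h in clicked_hosts):
        return True
    return False
```

Filter 2 is what lets a bare noun phrase like thoracic outlet syndrome be recognized as a definition query even though it matches no lexical pattern.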
Also, sequences of clicks signal that users normally visit only one site belonging to a KB, and at least one coming from a non-KB (see Table 1).

All in all, the insight gained in this analysis allows the construction of a heterogeneous corpus for definition question answering. Put differently, these user click patterns offer a way to obtain huge amounts of heterogeneous training material. In this way, the heavy dependence of open-domain description identifiers on KB data can be alleviated.

4 Click-Based Corpus Acquisition

Since queries obtained by the previous two filters are not associated with the actual snippets seen by the users (due to storage limitations), snippets were recovered by means of submitting the queries to the Yahoo! search engine.

After retrieval, we benefited from OpenNLP1 for detecting sentence boundaries, tokenization and part-of-speech (POS) information. Here, we additionally interpreted truncations (. . .) as sentence delimiters. POS tags were used to recognize and replace numbers with a placeholder (#CD#) as a means of creating sentence templates. We replaced numbers as their value is just as often confusing as useful (Baeza-Yates and Ribeiro-Neto, 1999).

Along with numbers, sequences of full and partial matches of the definiendum were also substituted with placeholders, #Q# and #QT#, respectively. To exemplify, consider this pre-processed snippet regarding Benjamin Millepied from www.mashceleb.com:

   #Q# / News & Biography - MashCeleb
   Latest news coverage of #Q#
   #Q# ( born #CD# ) is a principal dancer at New York City Ballet and a ballet choreographer...

We benefit from these templates for building both a positive and a negative training set.

4.1 Negative Set

The negative set comprised templates appearing across all (clicked and unclicked) web-snippets which, at the same time, are related to more than five distinct queries. We hypothesize that these prominent elements correspond to non-informative, and thus non-descriptive, content, as they appear within snippets across several questions. In other words: if it seems to answer every question, it will probably answer no question. Take for instance:

   Information about #Q# in the Columbia Encyclopedia , Computer Desktop Encyclopedia , computing dictionary

Conversely, templates that are more plausible answers are strongly related to their specific definition questions, and consequently they are low in frequency and unlikely to be in the result set of a large number of queries. This negative set was expanded with templates coming from titles of snippets which, at the same time, have a frequency higher than four across all snippets (independent of which queries they appear with). This process helped gather 1,021,571 different negative examples. In order to measure the precision of this process, we randomly selected and checked 1,000 elements, and found an error rate of 1.3%.

4.2 Positive Set

As for the positive set, this was constructed only from the summary section of web-snippets clicked by the users. We constrained these snippets to bear a title template associated with at least two web-snippets clicked for two distinct queries. Some good examples are:

   What is #Q# ? Choices and Consequences.
   Biology question : What is an #Q# ?

Since clicks are linked with entire snippets, it is uncertain which sentences are genuine descriptions (see the previous example). Therefore, we removed those templates already contained in the negative set, along with those samples that matched an array of well-known hand-crafted rules. This set included:

a. sentences containing words such as ask, report, say, and unless (Kil et al., 2005; Schlaefer et al., 2007);

b. sentences bearing several named entities (Schlaefer et al., 2006; Schlaefer et al., 2007), which were recognized by the number of tokens starting with a capital letter versus those starting with a lowercase letter;

c. statements of persons (Schlaefer et al., 2007); and

1 http://opennlp.sourceforge.net
d. we also profited from about five hundred common expressions across web snippets, including Picture of, Jump to : navigation , search, as well as Recent posts.

This process assisted in acquiring 881,726 different examples, of which 673,548 came from KBs. Here, we also randomly selected 1,000 instances and manually checked whether they were actual descriptions. The error rate of this set was 12.2%.

To put things into perspective, in contrast to other corpus acquisition approaches, the present method generated more than 1,800,000 positive and negative training samples combined, while the open-domain strategy of (Miliaraki and Androutsopoulos, 2004; Androutsopoulos and Galanis, 2005) produced ca. 20,000 examples, the closed-domain technique of (Xu et al., 2005) about 3,000, and (Fahmi and Bouma, 2006) ca. 2,000.

5 Answering New Definition Queries

In our experiments, we checked the effectiveness of our user click-based corpus acquisition technique by studying its impact on two state-of-the-art systems. The first one is based on the bi-term LMs proposed by (Chen et al., 2006). This system requires only positive samples as training material. Conversely, our second system capitalizes on both positive and negative examples, and is based on the Maximum Entropy (ME) models presented by (Fahmi and Bouma, 2006). These ME2 models amalgamated bigrams and unigrams as well as two additional syntactic features which were not applicable to our task (i.e., sentence position). We added sentence length to this model as a feature in order to homologate the attributes used by both systems, therefore offering a good framework to assess the impact of our negative set. Note that (Fahmi and Bouma, 2006), unlike us, applied their models only to sentences observing some specific syntactic patterns.

With regard to the test set, this was constructed by manually annotating 113,184 sentence templates corresponding to 3,162 unseen definienda. In total, this array of unseen testing instances encompassed 11,566 different positive samples. In order to build a balanced testing collection, the same number of negative examples was randomly selected. Overall, our testing set contains 23,132 elements, and some illustrative annotations are shown in Table 2. It is worth highlighting that these examples signal that our models are considering pattern-free descriptions; that is to say, unlike other systems (Xu et al., 2003; Katz et al., 2004; Fernandes, 2004; Feng et al., 2006; Figueroa and Atkinson, 2009; Westerhout, 2009), which consider definitions aligning an array of well-known patterns (e.g., is a and also known as), our models disregard any class of syntactic constraint.

As for a baseline system, we accounted for the centroid vector (Xu et al., 2003; Cui et al., 2004). When implementing it, we followed the blueprint in (Chen et al., 2006), and it was built for each definiendum from a maximum of 330 web snippets fetched by means of Bing Search. This baseline achieved a modest performance, as it correctly classified 43.75% of the testing examples. In detail, 47.75% out of the 56.25% of misclassified elements were a result of data-sparseness. This baseline has been widely used as a starting point for comparison purposes; however, it is hard for this technique to discover diverse descriptive nuggets. This problem stems from the narrow coverage of the centroid vector learned for the respective definienda (Zhang et al., 2005). In short, these figures support the necessity for more robust methods based on massive training material.

Experiments. We trained both models by systematically increasing the size of the training material by 1%. For this, we randomly split the training data into 100 equally sized packs, and systematically added one to the previously selected sets (i.e., 1%, 2%, 3%, ..., 99%, 100%). We also experimented with: 1) positive examples originating solely from KBs; 2) positive samples harvested only from non-KBs; and eventually 3) all positive examples combined.

Figure 1 juxtaposes the outcomes accomplished by both techniques under the different configurations. These figures, compared with results obtained by the baseline, indicate the important contribution of our corpus to tackling data-sparseness. This contrast substantiates our claim that click patterns can be utilized as indicators of answers to definition questions. Since our models ignore definition patterns, they have the potential of detecting a wide diversity of descriptive information.

2 http://maxent.sourceforge.net/about.html
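The template-building and negative-set selection steps of Section 4 can be sketched as follows. The digit regex and the token-level partial-match handling are simplifications (the paper detects numbers via OpenNLP POS tags and matches the definiendum more flexibly); the "more than five distinct queries" threshold comes from the text:

```python
import re
from collections import defaultdict

def make_template(sentence, definiendum):
    """Replace full matches of the definiendum with #Q#, single tokens of it
    with #QT#, and numbers with #CD#."""
    t = re.sub(re.escape(definiendum), "#Q#", sentence, flags=re.IGNORECASE)
    for token in definiendum.split():
        t = re.sub(r"\b%s\b" % re.escape(token), "#QT#", t, flags=re.IGNORECASE)
    return re.sub(r"\b\d+(?:[.,]\d+)*\b", "#CD#", t)

def negative_templates(template_query_pairs, min_distinct_queries=6):
    """Templates related to more than five distinct queries are treated as
    non-descriptive boilerplate, i.e., negative examples."""
    queries_per_template = defaultdict(set)
    for template, query in template_query_pairs:
        queries_per_template[template].add(query)
    return {t for t, qs in queries_per_template.items()
            if len(qs) >= min_distinct_queries}
```

A template that shows up for six different queries lands in the negative set; one tied to a single query does not.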
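The incremental training protocol from the Experiments paragraph (100 equally sized packs, evaluated cumulatively at 1%, 2%, ..., 100%) can be sketched generically; `train_and_eval` is a hypothetical callable standing in for training and scoring either the bi-term LM or the ME system:

```python
import random

def training_curve(examples, train_and_eval, packs=100, seed=0):
    """Randomly split the training data into `packs` equally sized packs and
    evaluate after cumulatively adding one pack at a time."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    pack_size = len(shuffled) // packs
    scores = []
    for k in range(1, packs + 1):
        subset = shuffled[:k * pack_size]
        scores.append((k, train_and_eval(subset)))
    return scores

# The best-performing fraction (32% for ME-combined in the paper)
# would then be: max(curve, key=lambda kv: kv[1])
curve = training_curve(list(range(200)), lambda subset: len(subset) / 200)
```

This makes the later observation explicit: the score at k = 100 need not be the maximum of the curve.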
Label Example/Template
+ Propylene #Q# is a type of alcohol made from fermented yeast and carbohydrates and
is commonly used in a wide variety of products .
+ #Q# is aggressive behavior intended to achieve a goal .
+ In Hispanic culture , when a girl turns #CD# , a celebration is held called the #Q#,
symbolizing the girl s passage to womanhood .
+ Kirschwasser , German for cherry water and often shortened to #Q# in English-speaking
countries , is a colorless brandy made from black ...
+ From the Gaelic dubhglas meaning #Q#, #QT# stream , or from the #QT# river .
+ Council Bluffs Orthopedic Surgeon Doctors physician directory - Read about #Q#, damage
to any of the #CD# tendons that stabilize the shoulder joint .
+ It also occurs naturally in our bodies in fact , an average size adult manufactures up to
#CD# grams of #Q# daily during normal metabolism .
- Sterling Silver #Q# Hoop Earrings Overstockjeweler.com
- I know V is the rate of reaction and the #Q# is hal ...
- As sad and mean as that sounds , there is some truth to it , as #QT# as age their bodies do
not function as well as they used to ( in all respects ) so there is a ...
- If you re new to the idea of Christian #Q#, what I call the wild things of God ,
- A look at the Biblical doctrine of the #QT# , showing the biblical basis for the teaching and
including a discussion of some of the common objections .
- #QT# is Users Choice ( application need to be run at #QT# , but is not system critical ) ,
this page shows you how it affects your Windows operating system .
- Your doctor may recommend that you use certain drugs to help you control your #Q# .
- Find out what is the full meaning of #Q# on Abbreviations.com !
means of exploiting our negative set makes its Best True Positive
positive contribution clear. In particular, this sup- Conf. of Accuracy positives examples
ME-combined 80.72% 88% 881,726
ports our hypothesis that redundancy across web-
ME-KB 80.33% 89.37% 673,548
snippets pertaining to several definition questions ME-N-KB 78.99% 93.38% 208,178
can be exploited as negative evidence. On the
whole, this enhancement also suggests that ME Table 3: Comparison of performance, the total amount
models are a better option than LMs. and origin of training data, and the number of recog-
nized descriptions.
Furthermore, in the case of ME models, putting together evidence from KB and non-KBs betters the performance. Conversely, in the case of LMs, we do not observe a noticeable improvement when unifying both sources. We attribute this difference to the fact that non-KB data is noisier, and thus negative examples are necessary to cushion this noise. By and large, the outcomes show that the usage of descriptive information derived exclusively from KBs is not the best, but a cost-efficient, solution.

Incidentally, Figure 1 reveals that more training data does not always imply better results. Overall, the best performance (ME-combined, 80.72%) was reaped when considering solely 32% of the training material. Hence, ME-KB finished with the best performance when accounting for about 215,500 positive examples (see Table 3). Adding more examples brought about a decline in accuracy. Nevertheless, this fraction (32%) is still larger than the data-sets considered by other open-domain Machine Learning approaches (Miliaraki and Androutsopoulos, 2004; Androutsopoulos and Galanis, 2005).

In detail, when contrasting the confusion matrices of the best configurations accomplished by ME-combined (80.72%), ME-KB (80.33%) and ME-N-KB (78.99%), one finds that ME-combined correctly identified 88% of the answers (true positives), while ME-KB identified 89.37% and ME-N-KB 93.38% (see Table 3).

Interestingly enough, non-KB data embodies only 23.61% of all positive training material, but it still has the ability to recognize more answers. Despite that, the other two strategies outperform ME-N-KB, because they are able
Figure 1: Results for each configuration (accuracy).
to correctly label more negative test examples. Given these figures, we can conclude that this is achieved by mitigating the impact of the noise in the training corpus by means of cleaner (KB) data. We verified this synergy by inspecting the number of answers from non-KBs detected by the three top configurations in Table 3: ME-combined (9,086), ME-KB (9,230) and ME-N-KB (9,677). In like manner, we examined the confusion matrix for the best configuration (ME-combined, 80.72%): 1,388 (6%) positive examples were mislabeled as negative, while 3,071 (13.28%) negative samples were mistagged as positive.

In addition, we performed significance tests using a two-tailed paired t-test at a 95% confidence interval on twenty samples. For this, we used only the top three configurations in Table 3, and each sample was determined by bootstrap resampling. Each sample has the same size as the original test corpus. Overall, the tests implied that all pairs were statistically different from each other.

In summary, the results show that both negative examples and combining positive examples from heterogeneous sources are indispensable to tackle any class of text. However, it is vital to lessen the noise in non-KB data, since this causes a more adverse effect on the performance. Given the upper bound in accuracy, our outcomes indicate that cleanness and quality are more important than the size of the corpus. Our figures additionally suggest that more effort should go into increasing diversity than into the number of training instances. In light of these observations, we also conjecture that a smaller, but diverse and manually annotated, corpus might be more effective; in particular, a manually checked corpus distilled by inspecting click patterns across query logs of search engines.

Lastly, in order to evaluate how good a click predictor the three top ME-configurations are, we focused our attention only on the manually labeled positive samples (answers) that were clicked by the users. Overall, 86.33% (ME-combined), 88.85% (ME-KB) and 92.45% (ME-N-KB) of these responses were correctly predicted. In light of that, one can conclude that (clicked and non-clicked) answers to definition questions can be identified/predicted on the basis of users' click patterns across query logs.

From the viewpoint of search engines, web snippets are computed off-line, in general. In so doing, some methods select the spans of text bearing query terms with the potential of putting the document at the top of the rank (Turpin et al., 2007; Tsegay et al., 2009). This helps to create an abridged version of the document that can quickly produce the snippet. This has to do with the trade-off between storage capacity, indexing, and retrieval speed. Ergo, our technique can help to de-
termine whether or not a span of text is worth expanding, or in some cases whether or not it should be included in the snippet view of the document. In our instructive snippet, we now might have:

    Benjamin Millepied / News & Biography - MashCeleb
    Benjamin Millepied (born 1977) is a principal dancer at New York
    City Ballet and a ballet choreographer of international reputation.
    Millepied was born in Bordeaux, France. His...

Improving the results of informational (e.g., definition) queries, especially of less frequent ones, is key for competing commercial search engines, as they are embodied in the non-navigational tail where these engines differ the most (Zaragoza et al., 2010).

6 Conclusions

This work investigates the click behavior of commercial search engine users regarding definition questions. These behaviour patterns are then exploited as a corpus acquisition technique for definition QA, which offers the advantage of encompassing positive samples from heterogeneous sources. In contrast, negative examples are obtained in conformity with redundancy patterns across snippets, which are returned by the search engine when processing several definition queries. The effectiveness of these patterns, and hence of the obtained corpus, was tested by means of two models different in nature, where both were capable of achieving an accuracy higher than 70%.

As future work, we envision that answers detected by our strategy can aid in determining some query expansion terms, and thus in devising some relevance feedback methods that can bring about an improvement in terms of the recall of answers. Along the same lines, it can cooperate in the visualization of the results by highlighting and/or extending truncated answers, that is, more informative snippets, which is one of the holy grails of search operators, especially when processing informational queries.

NLP tools (e.g., parsers and named entity recognizers) can also be exploited for designing better training data filters and more discriminative features for our models that can assist in enhancing the performance, cf. (Surdeanu et al., 2008; Figueroa, 2010; Surdeanu et al., 2011). However, this implies that these tools have to be re-trained to cope with web-snippets.

Acknowledgements

This work was partially supported by R&D project FONDEF D09I1185. We also thank our reviewers for their interesting comments, which helped us to make this work better.

References

I. Androutsopoulos and D. Galanis. 2005. A practically Unsupervised Learning Method to Identify Single-Snippet Answers to Definition Questions on the web. In HLT/EMNLP, pages 323–330.

R. Baeza-Yates and B. Ribeiro-Neto. 1999. Modern Information Retrieval. Addison Wesley.

A. Broder. 2002. A Taxonomy of Web Search. SIGIR Forum, 36:3–10, September.

Y. Chen, M. Zhon, and S. Wang. 2006. Reranking Answers for Definitional QA Using Language Modeling. In Coling/ACL-2006, pages 1081–1088.

H. Cui, K. Li, R. Sun, T.-S. Chua, and M.-Y. Kan. 2004. National University of Singapore at the TREC 13 Question Answering Main Task. In Proceedings of TREC 2004. NIST.

Georges E. Dupret and Benjamin Piwowarski. 2008. A user browsing model to predict search engine click data from past observations. In SIGIR '08, pages 331–338.

Ismail Fahmi and Gosse Bouma. 2006. Learning to Identify Definitions using Syntactic Features. In Proceedings of the Workshop on Learning Structured Information in Natural Language Applications.

Donghui Feng, Deepak Ravichandran, and Eduard H. Hovy. 2006. Mining and Re-ranking for Answering Biographical Queries on the Web. In AAAI.

Aaron Fernandes. 2004. Answering Definitional Questions before they are Asked. Master's thesis, Massachusetts Institute of Technology.

A. Figueroa and J. Atkinson. 2009. Using Dependency Paths For Answering Definition Questions on The Web. In WEBIST 2009, pages 643–650.

Alejandro Figueroa. 2010. Finding Answers to Definition Questions on the Web. PhD thesis, Universitaet des Saarlandes, 7.

K. Han, Y. Song, and H. Rim. 2006. Probabilistic Model for Definitional Question Answering. In Proceedings of SIGIR 2006, pages 212–219.

Shihao Ji, Ke Zhou, Ciya Liao, Zhaohui Zheng, Gui-Rong Xue, Olivier Chapelle, Gordon Sun, and Hongyuan Zha. 2009. Global ranking by exploiting user clicks. In Proceedings of the 32nd international ACM SIGIR conference on Research and
development in information retrieval, SIGIR '09, pages 35–42, New York, NY, USA. ACM.

B. Katz, M. Bilotti, S. Felshin, A. Fernandes, W. Hildebrandt, R. Katzir, J. Lin, D. Loreto, G. Marton, F. Mora, and O. Uzuner. 2004. Answering multiple questions on a topic from heterogeneous resources. In Proceedings of TREC 2004. NIST.

B. Katz, S. Felshin, G. Marton, F. Mora, Y. K. Shen, G. Zaccak, A. Ammar, E. Eisner, A. Turgut, and L. Brown Westrick. 2007. CSAIL at TREC 2007 Question Answering. In Proceedings of TREC 2007. NIST.

Jae Hong Kil, Levon Lloyd, and Steven Skiena. 2005. Question Answering with Lydia (TREC 2005 QA track). In Proceedings of TREC 2005. NIST.

U. Lee, Z. Liu, and J. Cho. 2005. Automatic Identification of User Goals in Web Search. In Proceedings of the 14th WWW conference, WWW '05, pages 391–400.

S. Miliaraki and I. Androutsopoulos. 2004. Learning to identify single-snippet answers to definition questions. In COLING '04, pages 1360–1366.

Roberto Navigli and Paola Velardi. 2010. Learning Word-Class Lattices for Definition and Hypernym Extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010).

Filip Radlinski, Martin Szummer, and Nick Craswell. 2010. Inferring query intent from reformulations and clicks. In Proceedings of the 19th international conference on World wide web, WWW '10, pages 1171–1172, New York, NY, USA. ACM.

Daniel E. Rose and Danny Levinson. 2004. Understanding User Goals in Web Search. In WWW, pages 13–19.

B. Sacaleanu, G. Neumann, and C. Spurk. 2008. DFKI-LT at QA@CLEF 2008. In Working Notes for the CLEF 2008 Workshop.

Nico Schlaefer, P. Gieselmann, and Guido Sautter. 2006. The Ephyra QA System at TREC 2006. In Proceedings of TREC 2006. NIST.

Nico Schlaefer, Jeongwoo Ko, Justin Betteridge, Guido Sautter, Manas Pathak, and Eric Nyberg. 2007. Semantic Extensions of the Ephyra QA System for TREC 2007. In Proceedings of TREC 2007. NIST.

Mihai Surdeanu, Massimiliano Ciaramita, and Hugo Zaragoza. 2008. Learning to Rank Answers on Large Online QA Collections. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL 2008), pages 719–727.

Mihai Surdeanu, Massimiliano Ciaramita, and Hugo Zaragoza. 2011. Learning to rank answers to non-factoid questions from web collections. Computational Linguistics, 37:351–383.

Yohannes Tsegay, Simon J. Puglisi, Andrew Turpin, and Justin Zobel. 2009. Document compaction for efficient query biased snippet generation. In Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval, ECIR '09, pages 509–520, Berlin, Heidelberg. Springer-Verlag.

Andrew Turpin, Yohannes Tsegay, David Hawking, and Hugh E. Williams. 2007. Fast generation of result snippets in web search. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '07, pages 127–134, New York, NY, USA. ACM.

Eline Westerhout. 2009. Extraction of definitions using grammar-enhanced machine learning. In Proceedings of the EACL 2009 Student Research Workshop, pages 88–96.

Jinxi Xu, Ana Licuanan, and Ralph Weischedel. 2003. TREC2003 QA at BBN: Answering Definitional Questions. In Proceedings of TREC 2003, pages 98–106. NIST.

J. Xu, Y. Cao, H. Li, and M. Zhao. 2005. Ranking Definitions with Supervised Learning Methods. In WWW2005, pages 811–819.

Jingfang Xu, Chuanliang Chen, Gu Xu, Hang Li, and Elbio Renato Torres Abib. 2010. Improving quality of training data for learning to rank using clickthrough data. In Proceedings of the third ACM international conference on Web search and data mining, WSDM '10, pages 171–180, New York, NY, USA. ACM.

H. Zaragoza, B. Barla Cambazoglu, and R. Baeza-Yates. 2010. Web Search Solved? All Result Rankings the Same? In Proceedings of CIKM '10, pages 529–538.

Zhushuo Zhang, Yaqian Zhou, Xuanjing Huang, and Lide Wu. 2005. Answering Definition Questions Using Web Knowledge Bases. In Proceedings of IJCNLP 2005, pages 498–506.
Adaptation of Statistical Machine Translation Model for Cross-Lingual Information Retrieval in a Service Context

Vassilina Nikoulina
Xerox Research Center Europe
vassilina.nikoulina@xrce.xerox.com

Bogomil Kovachev
Informatics Institute, University of Amsterdam
B.K.Kovachev@uva.nl

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 109–119, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
ing according to the BLEU (Papineni et al., 2001) score (a standard MT evaluation metric), or via reranking the Nbest translation candidates generated by a baseline system based on new parameters (and possibly new features) that aim to optimize a retrieval metric.

It is important to note that both of the proposed approaches allow keeping the MT system independent of the document collection and indexing, and thus suitable for a query translation service. These two approaches can also be combined by using the model produced with the first approach as a baseline that produces the Nbest list of translations that is then given to the reranking approach.

The remainder of this paper is organized as follows. We first present related work addressing the problem of query translation. We then describe two approaches towards adapting an SMT system to the query-genre: tuning the SMT system on a parallel set of queries (Section 3.1) and adapting machine translation via the reranking framework (Section 3.2). We then present our experimental settings and results (Section 4) and conclude in Section 5.

2 Related work

We may distinguish two main groups of approaches to CLIR: document translation and query translation. We concentrate on the second group, which is more relevant to our settings. The standard query translation methods use different translation resources such as bilingual dictionaries, parallel corpora and/or machine translation. The aspect of disambiguation is important for the first two techniques.

Different methods were proposed to deal with disambiguation issues, often relying on the document collection or embedding the translation step directly into the retrieval model (Hiemstra and Jong, 1999; Berger et al., 1999; Kraaij et al., 2003). Other methods rely on external resources like query logs (Gao et al., 2010), Wikipedia (Jadidinejad and Mahmoudi, 2009) or the web (Nie and Chen, 2002; Hu et al., 2008). (Gao et al., 2006) proposes syntax-based translation models (NP-based, dependency-based) to deal with the disambiguation issues. The candidate translations proposed by these models are then reranked with a model learned to minimize the translation error on the training data.

To our knowledge, existing works that use MT-based techniques for query translation use an out-of-the-box MT system, without adapting it for query translation in particular (Jones et al., 1999; Wu et al., 2008), although some query expansion techniques might be applied to the produced translation afterwards (Wu and He, 2010).

There are a number of works on domain adaptation in Statistical Machine Translation. However, we want to distinguish between genre and domain adaptation in this work. Generally, genre can be seen as a sub-problem of domain. Thus, we consider genre to be the general style of the text, e.g. conversation, news, blog, query (responsible mostly for the text structure), while the domain reflects more what the text is about, e.g. social science, healthcare, history; so domain adaptation involves lexical disambiguation and extra lexical coverage problems. To our knowledge, there is not much work addressing explicitly the problem of genre adaptation for SMT. Some work done on domain adaptation could be applied to genre adaptation, such as incorporating available in-domain corpora in the SMT model: either monolingual (Bertoldi and Federico, 2009; Wu et al., 2008; Zhao et al., 2004; Koehn and Schroeder, 2007), or small parallel data used for tuning the SMT parameters (Zheng et al., 2010; Pecina et al., 2011).

3 Our approach

This work is based on the hypothesis that a general-purpose SMT system needs to be adapted for query translation. Although (Ferro and Peters, 2009) mention that using Google Translate (a general-purpose MT system) for query translation allowed CLEF participants to obtain the best CLIR performance, there is still a 10% gap between monolingual and cross-lingual IR. We believe that, as in (Clinchant and Renders, 2007), more adapted query translation, possibly further combined with query expansion techniques, can lead to improved retrieval.

The problem of SMT adaptation for query-genre translation has different quality aspects. On the one hand, we want our model to produce a good translation (well-formed and transmitting the information contained in the source query) of an input query. On the other hand, we want to obtain good retrieval performance using
the proposed translation. These two aspects are not necessarily correlated: a bag-of-words translation can lead to good retrieval performance, even though it will not be syntactically well-formed; at the same time a well-formed translation can lead to worse retrieval if the wrong lexical choice is made. Moreover, the retrieval often demands some linguistic preprocessing (e.g. lemmatisation, PoS tagging), which, in interaction with badly-formed translations, might bring some noise.

A couple of works studied the correlation between the standard MT evaluation metrics and the retrieval precision. Thus, (Fujii et al., 2009) showed a good correlation of the BLEU scores with the MAP scores for Cross-Lingual Patent Retrieval. However, the topics in patent search (long and well structured) are very different from standard queries. (Kettunen, 2009) also found a pretty high correlation (0.8-0.9) between standard MT evaluation metrics (METEOR (Banerjee and Lavie, 2005), BLEU, NIST (Doddington, 2002)) and retrieval precision for long queries. However, the same work shows that the correlation decreases (0.6-0.7) for short queries.

In this paper we propose two approaches to SMT adaptation for queries. The first one optimizes BLEU, while the second one optimizes Mean Average Precision (MAP), a standard metric in information retrieval. We will address the issue of the correlation between BLEU and MAP in Section 4.

Both of the proposed approaches rely on the phrase-based SMT (PBMT) model (Koehn et al., 2003) implemented in the Open Source SMT toolkit MOSES (Koehn et al., 2007).

3.1 Tuning for genre adaptation

First, we propose to adapt the PBMT model by tuning the model's weights on a parallel set of queries. This approach addresses the first aspect of the problem, which is producing a good translation. The PBMT model combines different types of features via a log-linear model. The standard features include (Koehn, 2010, Chapter 5): language model, word penalty, distortion, different translation models, etc. The weights of these features are learned during the tuning step with the MERT (Och, 2003) algorithm. Roughly, the MERT algorithm tunes feature weights one by one and optimizes them according to the BLEU score obtained.

Our hypothesis is that the impact of different features should be different depending on whether we translate a full sentence or a query-genre entry. Thus, one would expect that in the case of query-genre the language model or the distortion features should get less importance than in the case of full-sentence translation. MERT tuning on a genre-adapted parallel corpus should leverage this information from the data, adapting the SMT model to the query-genre. We would also like to note that the tuning approach (proposed for domain adaptation by (Zheng et al., 2010)) seems to be more appropriate for genre adaptation than for domain adaptation, where the problem of lexical ambiguity is encoded in the translation model and re-weighting the main features might not be sufficient.

We use the MERT implementation provided with the Moses toolkit with default settings. Our assumption is that this procedure, although not explicitly aimed at improving retrieval performance, will nevertheless lead to better query translations when compared to the baseline. The results of this approach also allow us to observe whether and to what extent changes in BLEU scores are correlated with changes in MAP scores.

3.2 Reranking framework for query translation

The second approach addresses the retrieval quality problem. An SMT system is usually trained to optimize the quality of the translation (e.g. the BLEU score for SMT), which is not necessarily correlated with the retrieval quality (especially for short queries). Thus, for example, the word order which is crucial for translation quality (and is taken into account by most MT evaluation metrics) is often ignored by IR models. Our second approach follows the argument of (Nie, 2010, p. 106) that the translation problem is an integral part of the whole CLIR problem, and that unified CLIR models integrating translation should be defined. We propose integrating the IR metric (MAP) into the translation model optimisation step via the reranking framework.

Previous attempts to apply the reranking approach to SMT did not show significant improvements in terms of MT evaluation metrics (Och et al., 2003; Nikoulina and Dymetman, 2008). One of the reasons is the poor diversity of the Nbest list of translations. However, we be-
lieve that this approach has more potential in the context of query translation.

First of all, the average query length is 5 words, which means that the Nbest list of translations is more diverse than in the case of general phrase translation (average length 25-30 words).

Moreover, the retrieval precision is more naturally integrated into the reranking framework than standard MT evaluation metrics such as BLEU. The main reason is that the notion of Average Retrieval Precision is well defined for a single query translation, while BLEU is defined at the corpus level and correlates poorly with human quality judgements for individual translations (Specia et al., 2009; Callison-Burch et al., 2009).

Finally, the reranking framework allows a lot of flexibility. Thus, it allows enriching the baseline translation model with new complex features which might be difficult to introduce into the translation model directly.

Other works applied the reranking framework to different NLP tasks such as Named Entity Extraction (Collins, 2001), parsing (Collins and Roark, 2004), and language modelling (Roark et al., 2004). Most of these works used the reranking framework to combine generative and discriminative methods when both approaches aim at solving the same problem: the generative model produces a set of hypotheses, and the best hypothesis is chosen afterwards via the discriminative reranking model, which allows enriching the baseline model with new complex and heterogeneous features. We suggest using the reranking framework to combine two different tasks: Machine Translation and Cross-lingual Information Retrieval. In this context the reranking framework not only allows enriching the baseline translation model but also allows performing training using a more appropriate evaluation metric.

3.2.1 Reranking training

Generally, the reranking framework can be summarized in the following steps:

1. The baseline (generic-purpose) MT system generates a list of candidate translations GEN(q) for each query q;

2. A vector of features F(t) is assigned to each translation t in GEN(q);

3. The best translation t* is chosen as the one maximizing the translation score, which is defined as a weighted linear combination of features: t*(lambda) = arg max_{t in GEN(q)} lambda . F(t).

As shown above, the best translation is selected according to the feature weights lambda. In order to learn the weights lambda maximizing the retrieval performance, an appropriate annotated training set has to be created. We use the CLEF tracks to create the training set. The retrieval score annotations are based on the document relevance annotations performed by human annotators during the CLEF campaign.

The annotated training set is created out of queries {q_1, ..., q_K} with an Nbest list of translations GEN(q_i) of each query q_i, i in {1..K}, as follows:

- A list of N (we take N = 1000) translations GEN(q_i) is produced by the baseline MT model for each query q_i, i = 1..K.

- Each translation t in GEN(q_i) is used to perform a retrieval from a target document collection, and an Average Precision score AP(t) is computed for each t in GEN(q_i) by comparing its retrieval to the relevance annotations done during the CLEF campaign.

The weights lambda are learned with the objective of maximizing MAP for all the queries of the training set, and, therefore, are optimized for retrieval quality.

The weights optimization is done with the Margin Infused Relaxed Algorithm (MIRA) (Crammer and Singer, 2003), which was applied to SMT by (Watanabe et al., 2007; Chiang et al., 2008). MIRA is an online learning algorithm where each weights update is done to keep the new weights as close as possible to the old weights (first term), and to score the oracle translation (the translation giving the best retrieval score: t*_i = arg max_t AP(t)) higher than each non-oracle translation t_ij by a margin at least as wide as the loss l_ij (second term):

    lambda_new = arg min_{lambda'} (1/2) ||lambda' - lambda||^2
                 + C sum_{i=1}^{K} max_{j=1..N} [ l_ij - lambda' . (F(t*_i) - F(t_ij)) ]

The loss l_ij is defined as the difference in retrieval average precision between the oracle and non-oracle translations: l_ij = AP(t*_i) - AP(t_ij). C is the regularization parameter, which is chosen via 5-fold cross-validation.
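The MIRA update described above can be sketched for the simplest case of a single non-oracle translation, where the constrained step has a closed-form solution. The feature names, weights, and loss value below are invented for illustration; a full implementation would loop over all queries and the whole Nbest list.

```python
# Sketch of a single MIRA-style update: raise the oracle translation's
# score above a non-oracle candidate by a margin of at least the loss
# l_ij, while keeping the new weights close to the old ones.
# One-constraint closed form; all feature names and values are invented.

def mira_update(weights, f_oracle, f_other, loss, C=1.0):
    delta = {k: f_oracle.get(k, 0.0) - f_other.get(k, 0.0)
             for k in set(f_oracle) | set(f_other)}
    margin = sum(weights.get(k, 0.0) * v for k, v in delta.items())
    norm_sq = sum(v * v for v in delta.values())
    if norm_sq == 0.0:
        return dict(weights)
    # Step size: margin-constraint violation over ||delta F||^2, capped by C.
    tau = min(C, max(0.0, (loss - margin) / norm_sq))
    return {k: weights.get(k, 0.0) + tau * delta.get(k, 0.0)
            for k in set(weights) | set(delta)}

weights = {"lm": 0.2, "tm": 0.1}
f_oracle = {"lm": -2.0, "tm": -0.5}   # features of the AP-best translation
f_other = {"lm": -1.5, "tm": -1.5}    # features of a rival translation
new_weights = mira_update(weights, f_oracle, f_other, loss=0.3)
```

With these illustrative numbers the step is not clipped by C, so after the update the oracle outscores the rival by exactly the loss (0.3), as the margin constraint requires.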
3.2.2 Features

One of the advantages of the reranking framework is that new complex features can be easily integrated. We suggest enriching the reranking model with different syntax-based features, such as:

- features relying on dependency structures: called therein coupling features (proposed by (Nikoulina and Dymetman, 2008));

- features relying on Part of Speech tagging: called therein PoS mapping features.

By integrating the syntax-based features we have a double goal: showing the potential of the reranking framework with more complex features, and examining whether the integration of syntactic information could be useful for query translation.

Coupling features. The goal of the coupling features is to measure the similarity between source and target dependency structures. The initial hypothesis is that a better translation should have a dependency structure closer to that of the source query.

In this work we experiment with two different coupling variants proposed in (Nikoulina and Dymetman, 2008), namely, Lexicalised and Label-Specific coupling features.

The generic coupling features are based on the notion of rectangles, which are of the following type: ((s1, ds12, s2), (t1, dt12, t2)), where ds12 is an edge between source words s1 and s2, dt12 is an edge between target words t1 and t2, s1 is aligned with t1, and s2 is aligned with t2. Lexicalised features take into account the quality of the lexical alignment, by weighting each rectangle (s1, s2, t1, t2) by the probability of aligning s1 to t1 and s2 to t2 (e.g. p(s1|t1)p(s2|t2) or p(t1|s1)p(t2|s2)).

The Label-Specific features take into account the nature of the aligned dependencies. Thus, a rectangle of the form ((s1, subj, s2), (t1, subj, t2)) will get more weight than a rectangle ((s1, subj, s2), (t1, nmod, t2)). The importance of each rectangle is learned on the parallel annotated corpus by introducing a collection of Label-Specific coupling features, one for each specific pair of source label and target label.

PoS mapping features. The goal of the PoS mapping features is to control the correspondence of Part of Speech tags between an input query and its translation. Like the coupling features, the PoS mapping features rely on the word alignments between the source sentence and its translation (this alignment can be either produced by a toolkit like GIZA++ (Och and Ney, 2003) or obtained directly from the system that produced the Nbest list of translations, i.e. Moses). A vector of sparse features is introduced where each component corresponds to a pair of PoS tags aligned in the training data. We introduce a generic PoS map variant, which counts the number of occurrences of a specific pair of PoS tags, and a lexical PoS map variant, which weights down these pairs by a lexical alignment score (p(s|t) or p(t|s)).

4 Experiments

4.1 Experimental basis

4.1.1 Data

To simulate parallel query data we used translation-equivalent CLEF topics. The data set used for the first approach consists of the CLEF topic data from the following years and tasks: AdHoc-main track from 2000 to 2008; CLEF AdHoc-TEL track 2008; Domain Specific tracks from 2000 to 2008; CLEF robust tracks 2007 and 2008; GeoCLEF tracks 2005-2007. To avoid the issue of overlapping topics we removed duplicates. The created parallel query set contained 500-700 parallel entries (depending on the language pair, Table 1) and was used for Moses parameter tuning.

In order to create the training set for the reranking approach, we need to have access to the relevance judgements. We did not have access to all relevance judgements of the previously described tracks. Thus we used only a subset of the previously extracted parallel set, which includes CLEF 2000-2008 topics from the AdHoc-main, AdHoc-TEL and GeoCLEF tracks.

The number of queries obtained altogether is shown in Table 1.

4.1.2 Baseline

We tested our approaches on the CLEF AdHoc-TEL 2009 task (50 topics). This task dealt with monolingual and cross-lingual search in a library catalog. The monolingual retrieval is
Language pair      Number of queries

Total queries
En-Fr, Fr-En       470
En-De, De-En       714

Annotated queries
En-Fr, Fr-En       400
En-De, De-En       350

Table 1: Top: total number of parallel queries gathered from all the CLEF tasks (size of the tuning set). Bottom: number of queries extracted from the tasks for which the human relevance judgements were available (size of the reranking training set).

performed with the Lemur toolkit (Ogilvie and Callan, 2001). The preprocessing includes lemmatisation (with the Xerox Incremental Parser, XIP (Aït-Mokhtar et al., 2002)) and filtering out the function words (based on XIP PoS tagging). Table 2 shows the performance of the monolingual retrieval model for each collection. The monolingual retrieval results are comparable to those of the CLEF AdHoc-TEL 2009 participants (Ferro and Peters, 2009). Let us note here that this is not the case for our CLIR results, since we did not exploit the fact that each of the collections could actually contain entries in a language other than the official language of the collection.

The cross-lingual retrieval is performed as follows:

- the input query (e.g. in English) is first translated into the language of the collection (e.g. German);

- this translation is used to search the target collection (e.g. the Austrian National Library for German).

The baseline translation is produced with Moses trained on Europarl. Table 2 reports the baseline performance both in terms of the MT evaluation metric (BLEU) and the Information Retrieval evaluation metric MAP (Mean Average Precision).

The 1best MAP score corresponds to the case when a single translation is proposed for the retrieval by the query translation model. The 5best MAP score corresponds to the case when the 5 top translations proposed by the translation service are concatenated and used for the retrieval. The 5best retrieval can be seen as a sort of query expansion, without accessing the document collection or any external resources.

Given that the query length is shorter than for a standard sentence, the 4-gram BLEU (used for standard MT evaluation) might not be able to capture the difference between the translations (e.g. English-German 4-gram BLEU is equal to 0 for our task). For that reason we report both 3- and 4-gram BLEU scores.

Note that the French-English baseline retrieval quality is much better than the German-English. This is probably due to the fact that our German-English translation system does not use any decompounding, which results in many non-translated words.

4.2 Results

We performed the query-genre adaptation experiments for English-French, French-English, German-English and English-German language pairs.

Ideally, we would have liked to combine the two approaches we proposed: use the query-genre-tuned model to produce the Nbest list, which is then reranked to optimize the MAP score. However, this was not possible in our experimental settings due to the small amount of training data available. We thus simply compare these two approaches to a baseline approach and comment on their respective performance.

4.2.1 Query-genre tuning approach

For the CLEF-tuning experiments we used the same translation model and language model as for the baseline (Europarl-based). The weights were then tuned on the CLEF topics described in Section 4.1.1. We then tested the obtained system on 50 parallel queries from the CLEF AdHoc-TEL 2009 task.

Table 3 describes the results of the evaluation. We observe consistent 1-best MAP improvements, but unstable BLEU (3-gram) scores (improvements for English-German, and degradation for the other language pairs), although one would have expected BLEU to be improved in this experimental setting, given that BLEU was the objective function for MERT. These results, on one side, confirm the remark of (Kettunen, 2009) that there is a correlation (although low) between BLEU
4
http://www.lemurproject.org/ and MAP scores. The unstable BLEU scores
114
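As noted above, 4-gram BLEU collapses on typical queries simply because a two- or three-word query contains no 4-grams. A minimal sketch of the clipped (modified) n-gram precision behind BLEU, assuming plain whitespace tokenization, makes this concrete:

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, ref, n):
    """Clipped n-gram precision p_n; returns 0.0 when the
    hypothesis has no n-grams of this order at all."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    if not hyp_counts:
        return 0.0
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matched / sum(hyp_counts.values())

# A two-word query has no trigrams or 4-grams, so even a perfect
# translation gets zero 3- and 4-gram precision.
hyp = "female martyrs".split()
ref = "female martyrs".split()
print([modified_precision(hyp, ref, n) for n in (1, 2, 3, 4)])
# → [1.0, 1.0, 0.0, 0.0]
```

Even a translation identical to its two-word reference has zero 3- and 4-gram precision, which is why 3-gram BLEU is reported alongside 4-gram BLEU in these experiments.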
Monolingual IR (MAP)
English   0.3159
French    0.2386
German    0.2162

Bilingual IR       MAP 1-best   MAP 5-best   BLEU 4-gram   BLEU 3-gram
French-English     0.1828       0.2186       0.1199        0.1568
German-English     0.0941       0.0942       0.2351        0.2923
English-French     0.1504       0.1543       0.2863        0.3423
English-German     0.1009       0.1157       0.0000        0.1218

Table 2: Baseline MAP scores for the monolingual and bilingual CLEF AdHoc-TEL 2009 task.

Table 3: BLEU and MAP performance on the CLEF AdHoc-TEL 2009 task for the genre-tuned model.
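MAP, the retrieval metric reported in these tables, is the mean over queries of average precision. A small self-contained sketch (binary relevance; the document ids and relevance sets below are hypothetical):

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked result list against a
    set of relevant document ids (binary relevance)."""
    hits, score = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            score += hits / rank      # precision at this recall point
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two toy queries: relevant docs retrieved at ranks 1 and 3 for the
# first query, and at rank 2 for the second.
runs = [(["d1", "d5", "d2"], {"d1", "d2"}),
        (["d9", "d4"], {"d4"})]
print(mean_average_precision(runs))   # (0.8333 + 0.5) / 2 ≈ 0.667
```

Because MAP rewards relevant documents appearing early in the ranking, concatenating 5-best translations can raise MAP even when individual translations are imperfect.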
On the other hand, the unstable BLEU scores might also be explained by the small size of the test set (compared to a standard test set of 1000 full sentences).

Secondly, we looked at the weights of the features both in the baseline model (Europarl-tuned) and in the adapted model (CLEF-tuned), shown in Table 4. We are unsure how suitable the sizes of the CLEF tuning sets are, especially for the pairs involving English and French. Nevertheless, we do observe and comment on some patterns.

For the pairs involving English and German, the distortion weight is much higher when tuning with CLEF data than when tuning with Europarl data. The picture is reversed for the two pairs involving English and French. This is to be expected if we interpret a high distortion weight as follows: it discourages placing source words that are near to each other far away from each other in the translation. Indeed, local reorderings are much more frequent between English and French (e.g. white house = maison blanche), while long-distance reorderings are more typical between English and German.

The word penalty is consistently higher over all pairs when tuning with CLEF data than when tuning with Europarl data. If we interpret a higher word penalty as a preference for shorter translations, an explanation for this pattern lies in the smaller size of the CLEF sentences: both the smaller average size of the queries and the specific query structure, mostly content words and fewer function words when compared to a full sentence.

The language model weight is consistently, though not drastically, smaller when tuning with CLEF data. We suppose that this is due to the fact that a Europarl-based language model is not the best choice for translating query data.

4.2.2 Reranking approach

The reranking experiments include different feature combinations. First, we experiment with the Moses features only, in order to make this approach comparable with the first one. Secondly, we compare different syntax-based feature combinations, as described in section 3.2.2. Thus, we compare the following reranking models (defined by their feature sets): moses, lex (lexical coupling + moses features), lab (label-specific coupling + moses features), posmaplex (lexical PoS mapping + moses features), lab-lex (label-specific coupling + lexical coupling + moses features), lab-lex-posmap (label-specific coupling + lexical coupling features + generic PoS mapping). To reduce the size of the feature-function vectors, we take only the 20 most frequent features in the training data for the label-specific coupling and PoS mapping features. The computation of the syntax features is based on the rule-based XIP parser, where some heuristics specific to query processing have been integrated into the English and French (but not German) grammars (Brun et al., 2012).
Lng pair   Tune set   DW        LM        φ(f|e)    lex(f|e)   φ(e|f)    lex(e|f)   PP        WP
Fr-En      Europarl   0.0801    0.1397    0.0431    0.0625     0.1463    0.0638    -0.0670   -0.3975
Fr-En      CLEF       0.0015    0.0795   -0.0046    0.0348     0.1977    0.0208    -0.2904    0.3707
De-En      Europarl   0.0588    0.1341    0.0380    0.0181     0.1382    0.0398    -0.0904   -0.4822
De-En      CLEF       0.3568    0.1151    0.1168    0.0549     0.0932    0.0805     0.0391   -0.1434
En-Fr      Europarl   0.0789    0.1373    0.0002    0.0766     0.1798    0.0293    -0.0978   -0.4002
En-Fr      CLEF       0.0322    0.1251    0.0350    0.1023     0.0534    0.0365    -0.3182   -0.2972
En-De      Europarl   0.0584    0.1396    0.0092    0.0821     0.1823    0.0437    -0.1613   -0.3233
En-De      CLEF       0.3451    0.1001    0.0248    0.0872     0.2629    0.0153    -0.0431    0.1214

Table 4: Feature weights for the query-genre tuned model. Abbreviations: DW - distortion weight, LM - language model weight, PP - phrase penalty, WP - word penalty, φ - phrase translation probability, lex - lexical weighting.
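The weights in Table 4 parameterize a standard log-linear model: each candidate translation is scored by the dot product of its feature vector with the tuned weights, and MERT moves the weights to favor better candidates. The sketch below is purely illustrative: the feature values are invented, and the weights merely echo the shape of a CLEF-tuned row, not the experiments themselves.

```python
# Feature names mirror the abbreviations of Table 4.
FEATURES = ["DW", "LM", "phi_fe", "lex_fe", "phi_ef", "lex_ef", "PP", "WP"]

def score(weights, features):
    """Log-linear model score: dot product of weights and features."""
    return sum(weights[k] * features[k] for k in FEATURES)

# Hypothetical tuned weights (shapes only, not values from the paper).
weights = {"DW": 0.36, "LM": 0.12, "phi_fe": 0.12, "lex_fe": 0.05,
           "phi_ef": 0.09, "lex_ef": 0.08, "PP": 0.04, "WP": -0.14}

# Hypothetical log-domain feature values for two candidate translations.
candidates = {
    "genetic manipulation of the human being":
        {"DW": -1.0, "LM": -12.0, "phi_fe": -2.0, "lex_fe": -3.1,
         "phi_ef": -1.8, "lex_ef": -2.9, "PP": -4.0, "WP": -6.0},
    "on the genetic manipulation of people":
        {"DW": -3.0, "LM": -11.5, "phi_fe": -2.4, "lex_fe": -3.5,
         "phi_ef": -2.2, "lex_ef": -3.3, "PP": -5.0, "WP": -6.0},
}

best = max(candidates, key=lambda c: score(weights, candidates[c]))
print(best)
```

With a large distortion weight, the candidate needing less reordering (smaller distortion cost) wins, which is the mechanism behind the De-En and En-De patterns discussed above.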
Query   Example                                       MAP     1-gram BLEU
Src 1   Weibliche Märtyrer
Ref     Female Martyrs
T1      female martyrs                                0.07    1
T2      Women martyr                                  0.4     0
Src 2   Genmanipulation am Menschen
Ref     Human Gene Manipulation
T1      On the genetic manipulation of people         0.044   0.167
T2      genetic manipulation of the human being       0.069   0.286
Src 3   Arbeitsrecht in der Europäischen Union
Ref     European Union Labour Laws
T1      Labour law in the European Union              0.015   0.5
T2      labour legislation in the European Union      0.036   0.5

Table 5: Some examples of query translations (T1: baseline, T2: after reranking with lab-lex), with MAP and 1-gram BLEU scores for German-English.

The results of these experiments are illustrated in Figure 1. To keep the figure more readable, we report only 3-gram BLEU scores. When computing the 5-best MAP score, the order in the N-best list is defined by the corresponding reranking model. Each reranking model is illustrated by a single horizontal red bar. We compare the reranking results to the baseline model (vertical line) and also to the results of the first approach (yellow bar labelled MERT:moses) in the same figure.

First, we remark that the adapted models (query-genre tuning and reranking) outperform the baseline in terms of MAP (1-best and 5-best) for French-English and German-English translations for most of the models. The only exception is the posmaplex model (based on PoS tagging) for German, which can be explained by the fact that the German grammar used for query processing was not adapted for queries, as opposed to the English and French grammars. However, we do not observe the same tendency for the BLEU score, where only a few of the adapted models outperform the baseline, which confirms the hypothesis of the low correlation between BLEU and MAP scores in these settings. Table 5 gives some examples of the query translations before (T1) and after (T2) reranking. These examples also illustrate different types of disagreement between MAP and 1-gram BLEU scores (the higher-order BLEU scores are equal to 0 for most of the individual translations).

The results for English-German and English-French look more confusing. This can be partly due to the richer morphology of the target languages, which may create more noise in the syntax structure. Reranking however improves over the 1-best MAP baseline for English-German, and 5-best MAP is also improved, excluding the models involving PoS tagging for German (posmap, posmaplex, lab-lex-posmap). The results for English-French are more difficult to interpret. To find out the reason for such behavior, we looked at the translations. We observed the following tokenization problem for French: the apostrophe is systematically separated, e.g. "d aujourd hui". This leads to both noisy pre-retrieval preprocessing (e.g. "d" is tagged as a NOUN) and noisy syntax-based feature values, which might explain the unstable results.

Finally, we can see that the syntax-based features can be beneficial for the final retrieval quality: the models with syntax features can outperform the model based on the moses features only. The syntax-based feature combination leading to the most stable results seems to be lab-lex (the combination of lexical and label-specific coupling): it leads to the best gains over 1-best and 5-best MAP for all language pairs excluding English-French. This is a surprising result, given that the underlying IR model does not take syntax into account in any way. In our opinion, this is probably due to the interaction between the pre-retrieval preprocessing (lemmatisation, PoS tagging) done with the linguistic tools, which might produce noisy results when applied to the SMT outputs. Reranking with syntax-based features allows choosing a better-formed query for which the PoS tagging and lemmatisation tools produce less noise, which leads to better retrieval.

Figure 1: Reranking results. The vertical line corresponds to the baseline scores. The lowest bar (MERT:moses, in yellow): the results of the tuning approach; the other bars (in red): the results of the reranking approach.

5 Conclusion

In this work we proposed two methods for query-genre adaptation of an SMT model: the first method addresses the translation quality aspect, and the second one the retrieval precision aspect. We have shown that CLIR performance in terms of MAP is improved by between 1 and 2.5 points. We believe that the combination of these two methods would be the most beneficial setting, although we were not able to prove this experimentally (due to the lack of training data). Neither of these methods requires access to the document collection at test time, and both can be used in the context of a query translation service. The combination of our adapted SMT model with other state-of-the-art CLIR techniques (e.g. query expansion with PRF) will be explored in future work.

Acknowledgements

This research was supported by the European Union's ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme, CIP ICT-PSP under grant agreement nr 250430 (Project GALATEAS).

References
Salah Aït-Mokhtar, Jean-Pierre Chanod, and Claude Roux. 2002. Robustness beyond shallowness: incremental deep parsing. Natural Language Engineering, 8:121–144, June.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Adam Berger and John Lafferty. 1999. The weaver system for document retrieval. In Proceedings of the Eighth Text REtrieval Conference (TREC-8), pages 163–174.

Nicola Bertoldi and Marcello Federico. 2009. Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 182–189. Association for Computational Linguistics.

Caroline Brun, Vassilina Nikoulina, and Nikolaos Lagos. 2012. Linguistically-adapted structural query annotation for digital libraries in the social sciences. In Proceedings of the 6th EACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Avignon, France, April.

Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 Workshop on Statistical Machine Translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 1–28, Athens, Greece, March. Association for Computational Linguistics.

David Chiang, Yuval Marton, and Philip Resnik. 2008. Online large-margin training of syntactic and structural translation features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 224–233. Association for Computational Linguistics.

Stephane Clinchant and Jean-Michel Renders. 2007. Query translation through dictionary adaptation. In CLEF'07, pages 182–187.

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In ACL '04: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics.

Michael Collins. 2001. Ranking algorithms for named-entity extraction: boosting and the voted perceptron. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 489–496, Philadelphia, Pennsylvania. Association for Computational Linguistics.

Koby Crammer and Yoram Singer. 2003. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991.

George Doddington. 2002. Automatic evaluation of Machine Translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, pages 138–145, San Diego, California. Morgan Kaufmann Publishers Inc.

Nicola Ferro and Carol Peters. 2009. CLEF 2009 ad hoc track overview: TEL and Persian tasks. In Working Notes for the CLEF 2009 Workshop, Corfu, Greece.

Atsushi Fujii, Masao Utiyama, Mikio Yamamoto, and Takehito Utsuro. 2009. Evaluating effects of machine translation accuracy on cross-lingual patent retrieval. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09, pages 674–675.

Jianfeng Gao, Jian-Yun Nie, and Ming Zhou. 2006. Statistical query translation models for cross-language information retrieval. 5:323–359, December.

Wei Gao, Cheng Niu, Jian-Yun Nie, Ming Zhou, Kam-Fai Wong, and Hsiao-Wuen Hon. 2010. Exploiting query logs for cross-lingual query suggestions. ACM Trans. Inf. Syst., 28(2).

Djoerd Hiemstra and Franciska de Jong. 1999. Disambiguation strategies for cross-language information retrieval. In Proceedings of the Third European Conference on Research and Advanced Technology for Digital Libraries, pages 274–293.

Rong Hu, Weizhu Chen, Peng Bai, Yansheng Lu, Zheng Chen, and Qiang Yang. 2008. Web query translation via web log mining. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '08, pages 749–750. ACM.

Amir Hossein Jadidinejad and Fariborz Mahmoudi. 2009. Cross-language information retrieval using meta-language index construction and structural queries. In Proceedings of the 10th Cross-Language Evaluation Forum Conference on Multilingual Information Access Evaluation: Text Retrieval Experiments, CLEF'09, pages 70–77, Berlin, Heidelberg. Springer-Verlag.

Gareth Jones, Sakai Tetsuya, Nigel Collier, Akira Kumano, and Kazuo Sumita. 1999. Exploring the use of machine translation resources for English-Japanese cross-language information retrieval. In Proceedings of MT Summit VII Workshop on Machine Translation for Cross Language Information Retrieval, pages 181–188.

Kimmo Kettunen. 2009. Choosing the best MT programs for CLIR purposes: can MT metrics be helpful? In Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval, ECIR '09, pages 706–712, Berlin, Heidelberg. Springer-Verlag.

Philipp Koehn and Josh Schroeder. 2007. Experiments in domain adaptation for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT '07, pages 224–227. Association for Computational Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54, Morristown, NJ, USA. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: open source toolkit for statistical machine translation. In ACL '07: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press.

Wessel Kraaij, Jian-Yun Nie, and Michel Simard. 2003. Embedding web-based statistical translation models in cross-language information retrieval. Computational Linguistics, 29:381–419, September.

Jian-Yun Nie and Jiang Chen. 2002. Exploiting the web as parallel corpora for cross-language information retrieval. Web Intelligence, pages 218–239.

Jian-Yun Nie. 2010. Cross-Language Information Retrieval. Morgan & Claypool Publishers.

Vassilina Nikoulina and Marc Dymetman. 2008. Experiments in discriminating phrase-based translations on the basis of syntactic coupling features. In Proceedings of the ACL-08: HLT Second Workshop on Syntax and Structure in Statistical Translation (SSST-2), pages 55–60. Association for Computational Linguistics, June.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur, Anoop Sarkar, Kenji Yamada, Alex Fraser, Shankar Kumar, Libin Shen, David Smith, Katherine Eng, Viren Jain, Zhen Jin, and Dragomir Radev. 2003. Syntax for Statistical Machine Translation: Final report of the Johns Hopkins 2003 Summer Workshop. Technical report, Johns Hopkins University.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In ACL '03: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 160–167, Morristown, NJ, USA. Association for Computational Linguistics.

Paul Ogilvie and James P. Callan. 2001. Experiments using the lemur toolkit. In TREC.

K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2001. BLEU: a method for automatic evaluation of machine translation.

Pavel Pecina, Antonio Toral, Andy Way, Vassilis Papavassiliou, Prokopis Prokopidis, and Maria Giagkou. 2011. Towards using web-crawled data for domain adaptation in statistical machine translation. In Proceedings of the 15th Annual Conference of the European Association for Machine Translation, pages 297–304, Leuven, Belgium. European Association for Machine Translation.

Brian Roark, Murat Saraclar, Michael Collins, and Mark Johnson. 2004. Discriminative language modeling with conditional random fields and the perceptron algorithm. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL'04), July.

Lucia Specia, Marco Turchi, Nicola Cancedda, Marc Dymetman, and Nello Cristianini. 2009. Estimating the sentence-level quality of machine translation systems. In Proceedings of the 13th Annual Conference of the EAMT, pages 28–35, Barcelona, Spain.

Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 764–773, Prague, Czech Republic. Association for Computational Linguistics.

Dan Wu and Daqing He. 2010. A study of query translation using the Google machine translation system. In Computational Intelligence and Software Engineering (CiSE).

Hua Wu, Haifeng Wang, and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 993–1000.

Bing Zhao, Matthias Eck, and Stephan Vogel. 2004. Language model adaptation for statistical machine translation with structured query models. In Proceedings of the 20th International Conference on Computational Linguistics, COLING '04. Association for Computational Linguistics.

Zhongguang Zheng, Zhongjun He, Yao Meng, and Hao Yu. 2010. Domain adaptation for statistical machine translation in development corpus selection. In Universal Communication Symposium (IUCS), 2010 4th International, pages 2–7. IEEE.
Computing Lattice BLEU Oracle Scores for Machine Translation

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 120–129, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
for other metrics such as METEOR (Banerjee and Lavie, 2005) or TER (Snover et al., 2006). The exact computation of oracles under corpus-level metrics, such as BLEU, poses supplementary combinatorial problems that will not be addressed in this work.

In this paper, we present two original methods for finding approximate oracle hypotheses on lattices. The first one is based on a linear approximation of the corpus BLEU that was originally designed for efficient Minimum Bayesian Risk decoding on lattices (Tromble et al., 2008). The second one, based on Integer Linear Programming, is an extension to lattices of a recent work on failure analysis for phrase-based decoders (Wisniewski et al., 2010). In this framework, we study two decoding strategies: one based on a generic ILP solver, and one based on Lagrangian relaxation.

Our contribution is also experimental, as we compare the quality of the BLEU approximations and the time performance of these new approaches with several existing methods, for different language pairs and using the lattice generation capacities of two publicly available state-of-the-art phrase-based decoders: Moses (http://www.statmt.org/moses/) and N-code (http://ncode.limsi.fr/).

The rest of this paper is organized as follows. In Section 2, we formally define the oracle decoding task and recall the formalism of finite-state automata over semirings. We then describe (Section 3) two existing approaches for solving this task, before detailing our new proposals in Sections 4 and 5. We then report evaluations of the existing and new oracles on machine translation tasks.

2 Preliminaries

2.1 Oracle Decoding Task

We assume that a phrase-based decoder is able to produce, for each source sentence f, a lattice Lf = <Q, Σ>, with #{Q} vertices (states) and #{Σ} edges. Each edge carries a source phrase f_i, an associated output phrase e_i, as well as a feature vector h_i, the components of which encode various compatibility measures between f_i and e_i. We further assume that Lf is a word lattice, meaning that each e_i carries a single word (converting a phrase lattice to a word lattice is a simple matter of redistributing a compound input or output over a linear chain of arcs), and that it contains a unique initial state q0 and a unique final state qF. Let Γf denote the set of all paths from q0 to qF in Lf. Each path γ ∈ Γf corresponds to a possible translation e_γ. The job of a (conventional) decoder is to find the best path(s) in Lf using scores that combine the edges' feature vectors with the parameters learned during tuning.

In oracle decoding, the decoder's job is quite different, as we assume that at least one reference rf is provided to evaluate the quality of each individual hypothesis. The decoder therefore aims at finding the path γ* that generates the hypothesis that best matches rf. For this task, only the output labels e_i matter; the other information can be left aside. (The algorithms described below can be straightforwardly generalized to compute oracle hypotheses under combined metrics mixing model scores and quality measures (Chiang et al., 2008), by weighting each edge with its model score and by using these weights down the pipe.)

Oracle decoding assumes the definition of a measure of the similarity between a reference and a hypothesis. In this paper we will consider sentence-level approximations of the popular BLEU score (Papineni et al., 2002). BLEU is formally defined for two parallel corpora, E = {e_j}_{j=1}^{J} and R = {r_j}_{j=1}^{J}, each containing J sentences, as:

    n-BLEU(E, R) = BP × (∏_{m=1}^{n} p_m)^{1/n},    (1)

where BP = min(1, e^{1 − c1(R)/c1(E)}) is the brevity penalty and p_m = c_m(E, R)/c_m(E) are clipped or modified m-gram precisions: c_m(E) is the total number of word m-grams in E; c_m(E, R) accumulates over sentences the number of m-grams in e_j that also belong to r_j. These counts are clipped, meaning that an m-gram that appears k times in E and l times in R, with k > l, is only counted l times. As is well known, BLEU performs a compromise between precision, which appears directly in Equation (1), and recall, which is indirectly taken into account via the brevity penalty. In most cases, Equation (1) is computed with n = 4 and we use BLEU as a synonym for 4-BLEU.
BLEU is defined for a pair of corpora but, as an oracle decoder works at the sentence level, it should rely on an approximation of BLEU that can evaluate the similarity between a single hypothesis and its reference. This approximation introduces a discrepancy, as gathering sentences with the highest (local) approximation may not result in the highest possible (corpus-level) BLEU score. Let BLEU' be such a sentence-level approximation of BLEU. Then lattice oracle decoding is the task of finding an optimal path γ*(f) among all paths Γf for a given f, and amounts to the following optimization problem:

    γ*(f) = argmax_{γ ∈ Γf} BLEU'(e_γ, rf).    (2)

2.2 Compromises of Oracle Decoding

As proved by Leusch et al. (2008), even with the brevity penalty dropped, the problem of deciding whether a confusion network contains a hypothesis with clipped uni- and bigram precisions all equal to 1.0 is NP-complete (and so is the associated optimization problem of oracle decoding for 2-BLEU). The case of more general word and phrase lattices and of the 4-BLEU score is consequently also NP-complete. This complexity stems from the chaining of local unigram decisions that, due to the clipping constraints, have a non-local effect on the bigram precision scores. It is consequently necessary to keep a possibly exponential number of non-recombinable hypotheses (characterized by counts for each n-gram in the reference) until very late states in the lattice.

These complexity results imply that any oracle decoder has to waive either the form of the objective function, replacing BLEU with better-behaved scoring functions, or the exactness of the solution, relying on approximate heuristic search algorithms.

In Table 1, we summarize the different compromises that the existing (Section 3), as well as our novel (Sections 4 and 5), oracle decoders have to make. The target and target level columns specify the targeted score. None of the decoders optimizes it directly: their objective function is rather the approximation of BLEU given in the target replacement column. Column search details the accuracy of the target replacement optimization. Finally, columns clipping and brevity indicate whether the corresponding properties of the BLEU score are considered in the target substitute and in the search algorithm.

2.3 Finite State Acceptors

The implementations of the oracles described in the first part of this work (Sections 3 and 4) use the common formalism of finite-state acceptors (FSA) over different semirings and are implemented using the generic OpenFST toolbox (Allauzen et al., 2007).

A (⊕, ⊗)-semiring K over a set K is a system <K, ⊕, ⊗, 0̄, 1̄>, where <K, ⊕, 0̄> is a commutative monoid with identity element 0̄, and <K, ⊗, 1̄> is a monoid with identity element 1̄. ⊗ distributes over ⊕, so that a ⊗ (b ⊕ c) = (a ⊗ b) ⊕ (a ⊗ c) and (b ⊕ c) ⊗ a = (b ⊗ a) ⊕ (c ⊗ a), and the element 0̄ annihilates K (a ⊗ 0̄ = 0̄ ⊗ a = 0̄).

Let A = (Σ, Q, I, F, E) be a weighted finite-state acceptor with labels in Σ and weights in K, meaning that the transitions (q, σ, q') in A carry a weight w ∈ K. Formally, E is a mapping from (Q × Σ × Q) into K; likewise, the initial I and final F weight functions are mappings from Q into K. We borrow the notations of Mohri (2009): if π = (q, a, q') is a transition in domain(E), p(π) = q (resp. n(π) = q') denotes its origin (resp. destination) state, w(π) denotes its label, and E(π) its weight. These notations extend to paths: if γ is a path in A, p(γ) (resp. n(γ)) is its initial (resp. ending) state and w(γ) is the label along the path. A finite-state transducer (FST) is an FSA with an output alphabet, so that each transition carries a pair of input/output symbols.

As discussed in Sections 3 and 4, several oracle decoding algorithms can be expressed as shortest-path problems, provided a suitable definition of the underlying acceptor and associated semiring. In particular, quantities such as:

    ⊕_{γ ∈ Γ(A)} E(γ),    (3)

where the total weight of a successful path γ = π_1 … π_l in A is computed as:

    E(γ) = I(p(π_1)) ⊗ (⊗_{i=1}^{l} E(π_i)) ⊗ F(n(π_l)),

can be efficiently found by generic shortest-distance algorithms over acyclic graphs (Mohri, 2002). For FSA-based implementations over semirings where ⊕ = max, the optimization problem (2) is thus reduced to Equation (3), while the oracle-specific details can be incorporated into the definition of ⊗.
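The generic shortest-distance computation behind Equation (3) visits states in topological order, ⊗-extending path weights along edges and ⊕-combining them at each destination. A small sketch with a pluggable semiring (the 4-state lattice and its weights are hypothetical; with (max, +), the returned "distance" is the best path score):

```python
from collections import namedtuple

Semiring = namedtuple("Semiring", "plus times zero one")

# (max, +): 'plus' keeps the better path, 'times' extends a path.
MAX_PLUS = Semiring(plus=max, times=lambda a, b: a + b,
                    zero=float("-inf"), one=0.0)
# Tropical (min, +): classical shortest path.
MIN_PLUS = Semiring(plus=min, times=lambda a, b: a + b,
                    zero=float("inf"), one=0.0)

def shortest_distance(n_states, edges, sr):
    """Generic single-source shortest distance over an acyclic
    automaton whose states 0..n_states-1 are numbered in
    topological order. edges: list of (src, dst, weight)."""
    d = [sr.zero] * n_states
    d[0] = sr.one
    for src, dst, w in sorted(edges):   # topological order by source state
        d[dst] = sr.plus(d[dst], sr.times(d[src], w))
    return d[n_states - 1]

# Hypothetical word lattice: two paths q0 -> qF, scoring 1+2=3 and 4+0=4.
edges = [(0, 1, 1.0), (1, 3, 2.0), (0, 2, 4.0), (2, 3, 0.0)]
print(shortest_distance(4, edges, MAX_PLUS))   # → 4.0
print(shortest_distance(4, edges, MIN_PLUS))   # → 3.0
```

Only the semiring changes between the two calls; this is the pluggability the oracle decoders below exploit by packing oracle-specific bookkeeping into ⊗ and ⊕.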
this paper existing oracle target target level target replacement search clipping brevity
LM-2g/4g 2/4- BLEU sentence P2 (e; r) or P4 (e; r) exact no no
PB 4- BLEU sentence partial log BLEU (4) appr. no no
PB` 4- BLEU sentence partial log BLEU (4) appr. no yes
LB-2g/4g 2/4- BLEU corpus linear appr. lin BLEU (5) exact no yes
SP 1- BLEU sentence unigram count exact no yes
ILP 2- BLEU sentence uni/bi-gram counts (7) appr. yes yes
RLX 2- BLEU sentence uni/bi-gram counts (8) exact yes yes
Table 1: Recapitulative overview of oracle decoders.
In this section, we describe our reimplementation of two approximate search algorithms that have been proposed in the literature to solve the oracle decoding problem for BLEU. In addition to their approximate nature, none of them accounts for the fact that the count of each matching word has to be clipped.

3.1 Language Model Oracle (LM)

The simplest approach we consider is introduced in (Li and Khudanpur, 2009), where oracle decoding is reduced to the problem of finding the most likely hypothesis under an n-gram language model trained with the sole reference translation.

Let us suppose we have an n-gram language model that gives a probability P(e_n | e_1 ... e_{n-1}) of word e_n given the n-1 previous words. The probability of a hypothesis e is then P_n(e|r) = ∏_i P(e_{i+n} | e_i ... e_{i+n-1}). The language model can conveniently be represented as an FSA A_LM, with each arc carrying a negative log-probability weight and with additional φ-type failure transitions to accommodate back-off arcs.

If we train, for each source sentence f, a separate language model A_LM(r_f) using only the reference r_f, oracle decoding amounts to finding a shortest (most probable) path in the weighted FSA resulting from the composition L ∘ A_LM(r_f) over the (min, +)-semiring:

    π*_LM(f) = ShortestPath(L ∘ A_LM(r_f)).

This approach replaces the optimization of n-BLEU with a search for the most probable path under a simplistic n-gram language model. One may expect the most probable path to select frequent n-grams from the reference, thus augmenting n-BLEU.

3.2 Partial BLEU Oracle (PB)

Another approach is put forward in (Dreyer et al., 2007) and used in (Li and Khudanpur, 2009): oracle translations are shortest paths in a lattice L, where the weight of each path is the sentence-level log BLEU(π) score of the corresponding complete or partial hypothesis:

    log BLEU(π) = 1/4 Σ_{m=1..4} log p̃_m.    (4)

Here, the brevity penalty is ignored, and n-gram precisions are offset to avoid null counts: p̃_m = (c_m(e, r) + 0.1)/(c_m(e) + 0.1).

This approach has been reimplemented using the FST formalism by defining a suitable semiring. Let each weight of the semiring keep a set of tuples accumulated up to the current state of the lattice. Each tuple contains three words of recent history, a partial hypothesis, as well as the current values of the length of the partial hypothesis, the n-gram counts (4 numbers), and the sentence-level log BLEU score defined by Equation (4). In the beginning, each arc is initialized with a singleton set containing one tuple with a single word as the partial hypothesis. For the semiring operations, we define one common ⊗-operation and two versions of the ⊕-operation: L1 ⊗_PB L2 appends a word on the edge of L2 to L1's hypotheses, shifts their recent histories, and updates n-gram counts, lengths, and the current score; L1 ⊕_PB L2 merges all sets from L1 and L2 and recombines those having the same recent history; L1 ⊕_PB` L2 merges all sets from L1 and L2 and recombines those having the same recent history and the same hypothesis length.

If several hypotheses have the same recent history (and length, in the case of ⊕_PB`), recombination removes all of them but the one
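To make Equation (4) concrete, the smoothed sentence-level score can be computed directly from n-gram counts. The following sketch is our illustration, not part of the original implementation; it assumes whitespace-tokenized sentences and takes c_m(e, r) to be the number of hypothesis m-grams that also occur in the reference:

```python
from collections import Counter
from math import log

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def log_bleu(hyp, ref):
    """Sentence-level log BLEU of Equation (4): the brevity penalty is
    ignored and every m-gram precision is offset by 0.1 to avoid nulls."""
    total = 0.0
    for m in range(1, 5):
        h, r = Counter(ngrams(hyp, m)), Counter(ngrams(ref, m))
        matches = sum(c for g, c in h.items() if g in r)  # unclipped counts
        total += log((matches + 0.1) / (sum(h.values()) + 0.1))
    return total / 4
```

Note that the match counts are deliberately left unclipped here, which is precisely the weakness that the clipped oracles of Section 5 address.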
Figure 1: Examples of the Ψ_n automata for Σ = {0, 1} and n = 1...3. Initial and final states are marked, respectively, with bold and with double borders. Note that arcs between final states are weighted with 0, while in reality they will have this weight only if the corresponding n-gram does not appear in the reference.
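The de Bruijn skeleton underlying the automata of Figure 1 can be enumerated in a few lines. The sketch below (helper names are ours) lists the (n-1)-gram states and the transition structure, without the reference-dependent weights:

```python
from itertools import product

def debruijn_arcs(sigma, n):
    """States and arcs of the de Bruijn graph B(sigma, n): from state
    u = (s_1, ..., s_{n-1}), reading s_n moves to state (s_2, ..., s_n)."""
    states = [tuple(p) for p in product(sigma, repeat=n - 1)]
    # each arc is (source state, input symbol, destination state)
    arcs = [(u, s, (u + (s,))[1:]) for u in states for s in sigma]
    return states, arcs
```

For Σ = {0, 1} and n = 3 this yields the four 2-gram states and eight arcs visible in the rightmost automaton of Figure 1.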
with the largest current BLEU score. The optimal path is then found by launching the generic ShortestDistance(L) algorithm over one of the semirings above.

The (⊕_PB`, ⊗_PB)-semiring, in which the equal-length requirement also implies equal brevity penalties, is more conservative in recombining hypotheses and should achieve a final BLEU that is at least as good as that obtained with the (⊕_PB, ⊗_PB)-semiring (see, however, the experiments in Section 6).

4 Linear BLEU Oracle (LB)

In this section, we propose a new oracle based on the linear approximation of the corpus BLEU introduced in (Tromble et al., 2008). While this approximation was earlier used for Minimum Bayes Risk decoding in lattices (Tromble et al., 2008; Blackwood et al., 2010), we show here how it can also be used to approximately compute an oracle translation.

Given five real parameters θ_0...θ_4 and a word vocabulary Σ, Tromble et al. (2008) showed that one can approximate the corpus BLEU with its first-order (linear) Taylor expansion:

    lin BLEU(π) = θ_0 |e_π| + Σ_{n=1..4} Σ_{u∈Σ^n} θ_n c_u(e_π) δ_u(r),    (5)

where c_u(e) is the number of times the n-gram u appears in e, and δ_u(r) is an indicator variable testing the presence of u in r.

To exploit this approximation for oracle decoding, we construct four weighted FSTs Ψ_n containing a (final) state for each possible (n-1)-gram, and all weighted transitions of the kind (σ_1^{n-1}, σ_n : σ_1^n / θ_n δ_{σ_1^n}(r), σ_2^n), where the σ's are in Σ, and the input word sequence σ_1^{n-1} and the output sequence σ_2^n are, respectively, the maximal prefix and suffix of the n-gram σ_1^n.

In supplement, we add auxiliary states corresponding to m-grams (m < n-1), whose functional purpose is to help reach one of the main (n-1)-gram states. There are (|Σ|^{n-1} - |Σ|)/(|Σ| - 1), n > 1, such supplementary states, and their transitions are (σ_1^k, σ_{k+1} : σ_1^{k+1} / 0, σ_1^{k+1}), k = 1...n-2. Apart from these auxiliary states, the rest of the graph (i.e., all final states) reproduces the structure of the well-known de Bruijn graph B(Σ, n) (see Figure 1).

To actually compute the best hypothesis, we first weight all arcs in the input FSA L with θ_0 to obtain Ψ_0. This makes each word's weight equal in a hypothesis path, and the total weight of a path in Ψ_0 is proportional to the number of words in it. Then, by sequentially composing Ψ_0 with the other Ψ_n's, we discount arcs whose output n-gram corresponds to a matching n-gram. The amount of discount is regulated by the ratio between the θ_n's for n > 0.

With all operations performed over the (min, +)-semiring, the oracle translation is then given by:

    π*_LB = ShortestPath(Ψ_0 ∘ Ψ_1 ∘ Ψ_2 ∘ Ψ_3 ∘ Ψ_4).

We set the parameters θ_n as in (Tromble et al., 2008): θ_0 = 1, roughly corresponding to the brevity penalty (each word in a hypothesis adds up equally to the final path length), and θ_n = (4 p r^{n-1})^{-1}, which are increasing discounts
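Viewed over the (min, +)-semiring, each hypothesis word contributes θ_0 = 1 to the path cost, while every hypothesis n-gram found in the reference earns a discount θ_n. A toy re-scorer in this spirit follows; it is our simplification, bypassing the FST construction entirely, and the exact sign conventions may differ from the actual implementation:

```python
def lb_cost(hyp, ref, p=0.3, r=0.2):
    """(min,+) path cost in the spirit of the LB oracle: each word costs
    theta_0 = 1, and every hypothesis n-gram also present in the reference
    earns a discount theta_n = 1/(4*p*r**(n-1)). Lower cost is better."""
    cost = float(len(hyp))
    for n in range(1, 5):
        theta_n = 1.0 / (4 * p * r ** (n - 1))
        ref_grams = {tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)}
        matches = sum(1 for i in range(len(hyp) - n + 1)
                      if tuple(hyp[i:i + n]) in ref_grams)
        cost -= theta_n * matches
    return cost
```

The LB oracle then amounts to picking the lattice path that minimizes this cost, which the composition with Ψ_1...Ψ_4 achieves without enumerating paths.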
for matching n-grams. The values of p and r were found by grid search with a 0.05 step value. A typical result of the grid evaluation of the LB oracle for the German-to-English WMT11 task is displayed in Figure 2. The optimal values for the other pairs of languages were roughly in the same ballpark, with p ≈ 0.3 and r ≈ 0.2.

Figure 2: Performance of the LB-4g oracle for different combinations of p and r on the WMT11 de2en task.

5 Oracles with n-gram Clipping

In this section, we describe two new oracle decoders that take n-gram clipping into account. These oracles leverage the well-known fact that the shortest path problem, at the heart of all the oracles described so far, can be reduced straightforwardly to an Integer Linear Programming (ILP) problem (Wolsey, 1998). Once oracle decoding is formulated as an ILP problem, it is relatively easy to introduce additional constraints, for instance to enforce n-gram clipping. We first describe the optimization problem of oracle decoding and then present several ways to solve it efficiently.

5.1 Problem Description

Throughout this section, abusing the notations, we will also think of an edge π_i as a binary variable describing whether the edge is selected or not. The set {0, 1}^{#E} of all possible edge assignments, where #E denotes the number of edges in the lattice, will be denoted by P. Note that Π, the set of all paths in the lattice, is a subset of P: by enforcing some constraints on an assignment in P, it can be guaranteed that it represents a path in the lattice. For the sake of presentation, we assume that each edge π_i generates a single word w(π_i), and we focus first on finding the optimal hypothesis with respect to the sentence approximation of the 1-BLEU score.

As 1-BLEU is decomposable, it is possible to define, for every edge π_i, an associated reward ω_i that describes the edge's local contribution to the hypothesis score. For instance, for the sentence approximation of the 1-BLEU score, the rewards are defined as:

    ω_i = ω_1    if w(π_i) is in the reference,
    ω_i = -ω_2   otherwise,

where ω_1 and ω_2 are two positive constants chosen to maximize the corpus BLEU score (we tried several combinations of ω_1 and ω_2 and kept the one that had the highest corpus 4-BLEU score). Constant ω_1 (resp. ω_2) is a reward (resp. a penalty) for generating a word in the reference (resp. not in the reference). The score of an assignment π ∈ P is then defined as score(π) = Σ_{i=1..#E} ω_i π_i. This score can be seen as a compromise between the number of common words in the hypothesis and the reference (accounting for recall) and the number of words of the hypothesis that do not appear in the reference (accounting for precision).

As explained in Section 2.3, finding the oracle hypothesis amounts to solving the shortest distance (or path) problem (3), which can be reformulated as a constrained optimization problem (Wolsey, 1998):

    argmax_{π∈P} Σ_{i=1..#E} ω_i π_i    (6)
    s.t.  Σ_{π∈δ⁻(q_F)} π = 1,    Σ_{π∈δ⁺(q_0)} π = 1,
          Σ_{π∈δ⁺(q)} π - Σ_{π∈δ⁻(q)} π = 0,  ∀q ∈ Q \ {q_0, q_F}

where q_0 (resp. q_F) is the initial (resp. final) state of the lattice and δ⁻(q) (resp. δ⁺(q)) denotes the set of incoming (resp. outgoing) edges of state q. These path constraints ensure that the solution of the problem is a valid path in the lattice.

The optimization problem in Equation (6) can be further extended to take clipping into account. Let us introduce, for each word w, a variable γ_w that denotes the number of times w appears in the hypothesis, clipped to the number of times it appears in the reference. Formally, γ_w is defined by:

    γ_w = min( Σ_{π∈δ(w)} π, c_w(r) )
where δ(w) is the subset of edges generating w, Σ_{π∈δ(w)} π is the number of occurrences of w in the solution, and c_w(r) is the number of occurrences of w in the reference r. Using the γ variables, we define a clipped approximation of 1-BLEU:

    ω_1 Σ_w γ_w - ω_2 ( Σ_{i=1..#E} π_i - Σ_w γ_w )

Indeed, the clipped number of words in the hypothesis that appear in the reference is given by Σ_w γ_w, and Σ_{i=1..#E} π_i - Σ_w γ_w corresponds to the number of words in the hypothesis that do not appear in the reference or that are surplus to the clipped count.

Finally, the clipped lattice oracle is defined by the following optimization problem:

    argmax_{π∈P, γ} (ω_1 + ω_2) Σ_w γ_w - ω_2 Σ_{i=1..#E} π_i    (7)
    s.t.  γ_w ≥ 0,  γ_w ≤ c_w(r),  γ_w ≤ Σ_{π∈δ(w)} π,  ∀w
          Σ_{π∈δ⁻(q_F)} π = 1,    Σ_{π∈δ⁺(q_0)} π = 1,
          Σ_{π∈δ⁺(q)} π - Σ_{π∈δ⁻(q)} π = 0,  ∀q ∈ Q \ {q_0, q_F}

where the first three sets of constraints are the linearization of the definition of γ_w, made possible by the positivity of ω_1 and ω_2, and the last three sets of constraints are the path constraints.

In our implementation we generalized this optimization problem to bigram lattices, in which each edge is labeled by the bigram it generates. Such bigram FSAs can be produced by composing the word lattice with Ψ_2 from Section 4. In this case, the reward of an edge is defined as a combination of the (clipped) numbers of unigram and bigram matches, and solving the optimization problem yields a 2-BLEU optimal hypothesis. The approach can be further generalized to higher-order BLEU or other metrics, as long as the reward of an edge can be computed locally. The constrained optimization problem (7) can be solved efficiently using off-the-shelf ILP solvers (in our experiments we used Gurobi (Gurobi Optimization, 2010), a commercial ILP solver that offers a free academic license).

5.2 Shortest Path Oracle (SP)

As a trivial special case of the above formulation, we also define a Shortest Path Oracle (SP) that solves the optimization problem in (6). As no clipping constraints apply, it can be solved efficiently using the standard Bellman algorithm.

5.3 Oracle Decoding through Lagrangian Relaxation (RLX)

In this section, we introduce another method to solve problem (7) without relying on an external ILP solver. Following (Rush et al., 2010; Chang and Collins, 2011), we propose an original method for oracle decoding based on Lagrangian relaxation. This method relies on the idea of relaxing the clipping constraints: starting from an unconstrained problem, the count clipping is enforced by incrementally strengthening the weight of paths satisfying the constraints.

The oracle decoding problem with clipping constraints amounts to solving:

    argmin_{π∈Π} - Σ_{i=1..#E} ω_i π_i    (8)
    s.t.  Σ_{π∈δ(w)} π ≤ c_w(r),  ∀w ∈ r

where, by abusing the notations, r also denotes the set of words in the reference. For the sake of clarity, the path constraints are incorporated into the domain (the argmin runs over Π and not over P). To solve this optimization problem, we consider its dual form and use Lagrangian relaxation to deal with the clipping constraints.

Let Λ = {λ_w}_{w∈r} be positive Lagrange multipliers, one for each different word of the reference; then the Lagrangian of problem (8) is:

    L(Λ, π) = - Σ_{i=1..#E} ω_i π_i + Σ_{w∈r} λ_w ( Σ_{π∈δ(w)} π - c_w(r) )

The dual objective is L(Λ) = min_{π∈Π} L(Λ, π), and the dual problem is max_{Λ≥0} L(Λ). To solve the latter, we first need to work out the dual objective:

    π̂ = argmin_{π∈Π} L(Λ, π) = argmin_{π∈Π} Σ_{i=1..#E} ( λ_{w(π_i)} - ω_i ) π_i
where we assume that λ_{w(π_i)} is 0 when the word w(π_i) is not in the reference. In the same way as in Section 5.2, the solution of this problem can be efficiently retrieved with a shortest path algorithm.

It is possible to optimize L(Λ) by noticing that it is a concave function. It can be shown (Chang and Collins, 2011) that, at convergence, the clipping constraints will be enforced in the optimal solution. In this work, we chose to use a simple gradient descent to solve the dual problem. A subgradient of the dual objective is:

    ∂L(Λ)/∂λ_w = Σ_{π̂∈δ(w)} π̂ - c_w(r).

Each component of the gradient corresponds to the difference between the number of times the word w appears in the hypothesis and the number of times it appears in the reference. The algorithm below sums up the optimization of task (8); α(t) denotes the step size at the t-th iteration, and in our experiments we used a constant step size of 0.1. Compared to the usual gradient descent algorithm, there is an additional projection step of Λ onto the positive orthant, which enforces the constraint Λ ≥ 0.

    ∀w, λ_w^(0) ← 0
    for t = 1 ... T do
        π^(t) = argmin_{π∈Π} Σ_i ( λ_{w(π_i)}^(t-1) - ω_i ) π_i
        if all clipping constraints are enforced then
            optimal solution found; stop
        else foreach w ∈ r do
            n_w ← number of occurrences of w in π^(t)
            λ_w^(t) ← λ_w^(t-1) + α(t) (n_w - c_w(r))
            λ_w^(t) ← max(0, λ_w^(t))

6 Experiments

For the proposed new oracles and the existing approaches, we compare the quality of oracle translations and the average time per sentence needed to compute them (experiments were run in parallel on a server with 64G of RAM and 2 Xeon CPUs with 4 cores at 2.3 GHz) on several datasets for 3 language pairs, using lattices generated by two open-source decoders: N-code and Moses (Figures 3 and 4). As the ILP and RLX oracles were implemented in Python, we pruned Moses lattices to accelerate task preparation for them.

Systems were trained on the data provided for the WMT11 Evaluation task (http://www.statmt.org/wmt2011), tuned on the WMT09 test data, and evaluated on the WMT10 test set to produce lattices; all BLEU scores are reported using the multi-bleu.pl script. The BLEU test scores and oracle scores on 100-best lists with the approximation (4) for N-code and Moses are given in Table 2. It is not until considering 10,000-best lists that n-best oracles achieve performance comparable to the (mediocre) SP oracle.

    decoder           fr2en   de2en   en2de
    test     N-code   27.88   22.05   15.83
             Moses    27.68   21.85   15.89
    oracle   N-code   36.36   29.22   21.18
             Moses    35.25   29.13   22.03

Table 2: Test BLEU scores and oracle scores on 100-best lists for the evaluated systems.

To make a fair comparison with the ILP and RLX oracles, which optimize 2-BLEU, we included 2-BLEU versions of the LB and LM oracles, identified below with the -2g suffix. The two versions of the PB oracle are denoted, respectively, PB and PB`, by the type of the ⊕-operation they consider (Section 3.2). Parameters p and r for the LB-4g oracle for N-code were found with grid search and reused for Moses: p = 0.25, r = 0.15 (fr2en); p = 0.175, r = 0.575 (en2de); and p = 0.35, r = 0.425 (de2en). Correspondingly, for the LB-2g oracle: p = 0.3, r = 0.15; p = 0.3, r = 0.175; and p = 0.575, r = 0.1.

The proposed LB, ILP and RLX oracles were the best performing oracles, with the ILP and RLX oracles being considerably faster, suffering only a negligible decrease in BLEU compared to the 4-BLEU-optimized LB oracle. We stopped the RLX oracle after 20 iterations, as letting it converge had a small negative effect (about 1 point of corpus BLEU), because of the sentence/corpus discrepancy ushered in by the BLEU score approximation.

Experiments showed consistently inferior performance of the LM oracle, resulting from the optimization of the sentence probability rather than BLEU. The PB oracle often performed comparably to our new oracles, however, with sporadic resource-consumption bursts that are difficult to
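The two oracles of Sections 5.2 and 5.3 can be sketched together: the SP oracle is a single reward-maximizing path search over a DAG lattice, and the RLX loop repeats that search with word penalties λ_w updated by the projected subgradient rule above. The following is a toy illustration under our own lattice encoding (topologically numbered states, edges as (src, dst, word, reward) tuples), not the paper's implementation:

```python
from collections import Counter

def best_path(n_states, edges, q0, qF, penalty):
    """SP oracle sketch: maximize summed edge rewards over a DAG lattice
    whose states are topologically numbered. penalty maps words to a score
    subtracted from their reward (zero for the plain SP oracle)."""
    best = [float("-inf")] * n_states
    back = [None] * n_states
    best[q0] = 0.0
    for src, dst, word, reward in sorted(edges, key=lambda e: e[0]):
        w = best[src] + reward - penalty.get(word, 0.0)
        if best[src] > float("-inf") and w > best[dst]:
            best[dst], back[dst] = w, (src, word)
    path, q = [], qF
    while back[q] is not None:  # follow back-pointers from the final state
        q, word = back[q]
        path.append(word)
    return path[::-1]

def rlx_oracle(n_states, edges, q0, qF, ref_counts, steps=20, alpha=0.1):
    """RLX oracle sketch: enforce clipping by subgradient ascent on the
    multipliers lambda_w, re-solving the path subproblem each iteration."""
    lam = {w: 0.0 for w in ref_counts}
    path = best_path(n_states, edges, q0, qF, lam)
    for _ in range(steps):
        counts = Counter(path)
        if all(counts[w] <= c for w, c in ref_counts.items()):
            break  # all clipping constraints satisfied
        for w in lam:  # lambda_w += alpha * (n_w - c_w(r)), projected >= 0
            lam[w] = max(0.0, lam[w] + alpha * (counts[w] - ref_counts[w]))
        path = best_path(n_states, edges, q0, qF, lam)
    return path
```

On lattices where the unconstrained best path over-uses a reference word, the multipliers progressively tax that word until the clipping constraint holds.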
[Figure 3: BLEU scores and average times per sentence of the RLX, ILP, LB-4g, LB-2g, PB, PB`, SP, LM-4g and LM-2g oracles on N-code lattices for the three language pairs; only axis and bar-label residue of the plots survives extraction.]
Figure 4: Oracles' performance for Moses lattices pruned with parameter -b 0.5.
avoid without more cursory hypothesis recombination strategies and the effect they induce on translation quality. The length-aware PB` oracle has unexpectedly poorer scores than its length-agnostic PB counterpart, while it should, at least, stay even, as it takes the brevity penalty into account. We attribute this fact to the complex effect of clipping, coupled with the lack of control over the process of selecting one hypothesis among several having the same BLEU score, length and recent history. Anyhow, the BLEU scores of both PB oracles are only marginally different, so PB`'s conservative policy of pruning and, consequently, much heavier memory consumption make it an unwanted choice.

7 Conclusion

We proposed two methods for finding oracle translations in lattices, based, respectively, on a linear approximation to the corpus-level BLEU and on integer linear programming techniques. We also proposed a variant of the latter approach based on Lagrangian relaxation that does not rely on a third-party ILP solver. All these oracles have superior performance to existing approaches in terms of the quality of the found translations and resource consumption and, for the LB-2g oracle, in terms of speed. It is thus possible to use better approximations of BLEU than was previously done, taking the corpus-based nature of BLEU or the clipping constraints into account, delivering better oracles without compromising speed.

Using 2-BLEU and 4-BLEU oracles yields comparable performance, which confirms the intuition that hypotheses sharing many 2-grams will likely have many common 3- and 4-grams as well. Taking into consideration the exceptional speed of the LB-2g oracle, in practice one can safely optimize for 2-BLEU instead of 4-BLEU, saving large amounts of time for oracle decoding on long sentences.

Overall, these experiments accentuate the acuteness of the scoring problems that plague modern decoders: very good hypotheses exist for most input sentences, but they are poorly evaluated by a linear combination of standard feature functions. Even though the tuning procedure can be held responsible for part of the problem, the comparison between lattice and n-best oracles shows that the beam search leaves good hypotheses out of the n-best list until very high values of n that are never used in practice.

Acknowledgments

This work has been partially funded by OSEO under the Quaero program.
References

Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst: A general and efficient weighted finite-state transducer library. In Proc. of the Int. Conf. on Implementation and Application of Automata, pages 11-23.

Michael Auli, Adam Lopez, Hieu Hoang, and Philipp Koehn. 2009. A systematic analysis of translation model search spaces. In Proc. of WMT, pages 224-232, Athens, Greece.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proc. of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation, pages 65-72, Ann Arbor, MI, USA.

Graeme Blackwood, Adrià de Gispert, and William Byrne. 2010. Efficient path counting transducers for minimum Bayes-risk decoding of statistical machine translation lattices. In Proc. of the ACL 2010 Conference Short Papers, pages 27-32, Stroudsburg, PA, USA.

Yin-Wen Chang and Michael Collins. 2011. Exact decoding of phrase-based translation models through Lagrangian relaxation. In Proc. of the 2011 Conf. on EMNLP, pages 26-37, Edinburgh, UK.

David Chiang, Yuval Marton, and Philip Resnik. 2008. Online large-margin training of syntactic and structural translation features. In Proc. of the 2008 Conf. on EMNLP, pages 224-233, Honolulu, Hawaii.

Markus Dreyer, Keith B. Hall, and Sanjeev P. Khudanpur. 2007. Comparing reordering constraints for SMT using efficient BLEU oracle computation. In Proc. of the Workshop on Syntax and Structure in Statistical Translation, pages 103-110, Morristown, NJ, USA.

Gregor Leusch, Evgeny Matusov, and Hermann Ney. 2008. Complexity of finding the BLEU-optimal hypothesis in a confusion network. In Proc. of the 2008 Conf. on EMNLP, pages 839-847, Honolulu, Hawaii.

Zhifei Li and Sanjeev Khudanpur. 2009. Efficient extraction of oracle-best translations from hypergraphs. In Proc. of Human Language Technologies: The 2009 Annual Conf. of the North American Chapter of the ACL, Companion Volume: Short Papers, pages 9-12, Morristown, NJ, USA.

Percy Liang, Alexandre Bouchard-Côté, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proc. of the 21st Int. Conf. on Computational Linguistics and the 44th Annual Meeting of the ACL, pages 761-768, Morristown, NJ, USA.

Mehryar Mohri. 2002. Semiring frameworks and algorithms for shortest-distance problems. J. Autom. Lang. Comb., 7:321-350.

Mehryar Mohri. 2009. Weighted automata algorithms. In Manfred Droste, Werner Kuich, and Heiko Vogler, editors, Handbook of Weighted Automata, chapter 6, pages 213-254.

Gurobi Optimization. 2010. Gurobi optimizer, April. Version 3.0.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. of the Annual Meeting of the ACL, pages 311-318.

Alexander M. Rush, David Sontag, Michael Collins, and Tommi Jaakkola. 2010. On dual decomposition and linear programming relaxations for natural language processing. In Proc. of the 2010 Conf. on EMNLP, pages 1-11, Stroudsburg, PA, USA.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proc. of the Conf. of the Association for Machine Translation in the Americas (AMTA), pages 223-231.

Roy W. Tromble, Shankar Kumar, Franz Och, and Wolfgang Macherey. 2008. Lattice minimum Bayes-risk decoding for statistical machine translation. In Proc. of the Conf. on EMNLP, pages 620-629, Stroudsburg, PA, USA.

Marco Turchi, Tijl De Bie, and Nello Cristianini. 2008. Learning performance of a machine translation system: a statistical and computational analysis. In Proc. of WMT, pages 35-43, Columbus, Ohio.

Guillaume Wisniewski, Alexandre Allauzen, and François Yvon. 2010. Assessing phrase-based translation models with oracle decoding. In Proc. of the 2010 Conf. on EMNLP, pages 933-943, Stroudsburg, PA, USA.

L. Wolsey. 1998. Integer Programming. John Wiley & Sons, Inc.
Toward Statistical Machine Translation without Parallel Corpora

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 130-140, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
2 Background

We begin with a brief overview of the standard phrase-based statistical machine translation model. Here, we define the parameters which we later replace with monolingual alternatives. We continue with a discussion of bilingual lexicon induction [...]
2.2 Bilingual lexicon induction for SMT

[...] algorithms that attempt to learn translations from monolingual corpora. Rapp (1995) was the first to propose using non-parallel texts to learn the translations of words. Using large, unrelated English and German corpora (with 163m and 135m words) and a small German-English dictionary (with 22k entries), Rapp (1999) demonstrated that reasonably accurate translations could be learned for 100 German nouns that were not contained in the seed bilingual dictionary. His algorithm [...]
[Figure (residue): a Spanish context vector is projected into English via a seed dictionary and compared against English context vectors; accompanying plots show the temporal occurrence signatures of terrorist (en), terrorista (es) and riqueza (es) over time.]
be scored in terms of their temporal similarity (Schafer and Yarowsky, 2002; Klementiev and Roth, 2006; Alfonseca et al., 2009). The intuition is that news stories in different languages will tend to discuss the same world events on the same day. The frequencies of translated phrases over time give them particular signatures that will tend to spike on the same dates. For instance, if the phrase asian tsunami is used frequently during a particular time span, the Spanish translation maremoto asiático is likely to also be used frequently during that time. Figure 4 illustrates how the temporal distribution of terrorist is more similar to that of the Spanish terrorista than to other Spanish phrases. We calculate the temporal similarity between a pair of phrases t(f, e) using the method defined by Klementiev and Roth (2006). We generate a temporal signature for each phrase by sorting the set of (time-stamped) documents in the monolingual corpus into a sequence of equally sized temporal bins and then counting the number of phrase occurrences in each bin. In our experiments, we set the window size to 1 day, so the size of the temporal signatures is equal to the number of days spanned by our corpus. We use cosine distance to compare the normalized temporal signatures for a pair of phrases (f, e).

Topic similarity. Phrases and their translations are likely to appear in articles written about the same topic in two languages. Thus, topic or category information associated with monolingual data can also be used to indicate similarity between a phrase and its candidate translation. In order to score a pair of phrases, we collect their topic signatures by counting their occurrences in each topic and then comparing the resulting vectors. We again use the cosine similarity measure on the normalized topic signatures. In our experiments, we use interlingual links between Wikipedia articles to estimate topic similarity. We treat each linked article pair as a topic and collect counts for each phrase across all articles in its corresponding language. Thus, the size of a phrase topic signature is the number of article pairs with interlingual links in Wikipedia, and each component contains the number of times the phrase appears in (the appropriate side of) the corresponding pair. Our Wikipedia-based topic similarity feature, w(f, e), is similar in spirit to polylingual topic models (Mimno et al., 2009), but it is scalable to full bilingual lexicon induction.

3.2 Lexical similarity features

In addition to the three phrase similarity features used in our model (c(f, e), t(f, e) and w(f, e)), we include four additional lexical similarity features for each phrase pair. The first three lexical features clex(f, e), tlex(f, e) and wlex(f, e) are the lexical equivalents of the phrase-level contextual, temporal and Wikipedia topic similarity scores. They score the similarity of individual words within the phrases. To compute these lexical similarity features, we average similarity scores over all possible word alignments across the two phrases. Because individual words are more frequent than multiword phrases, the accuracy of clex, tlex, and wlex tends to be higher than that of their phrasal equivalents (this is similar to the effect observed in Figure 2).

Orthographic / phonetic similarity. The final lexical similarity feature that we incorporate is o(f, e), which measures the orthographic similarity between words in a phrase pair. Etymologically related words often retain similar spelling across languages with the same writing system, and low string edit distance sometimes signals translation equivalency. Berg-Kirkpatrick and Klein (2011) present methods for learning correspondences between the alphabets of two languages. We can also extend this idea to language pairs not sharing the same writing system, since many cognates, borrowed words, and names remain phonetically similar. Transliterations can be generated for tokens in a source phrase (Knight and Graehl, 1997), with o(f, e) calculating phonetic similarity rather than orthographic.

The three phrasal and four lexical similarity scores are incorporated into the log-linear translation model as feature functions, replacing the bilingually estimated phrase translation probabilities φ and lexical weighting probabilities w. Our seven similarity scores are not the only ones that could be incorporated into the translation model. Various other similarity scores can be computed depending on the available monolingual data and its associated metadata (see, e.g., Schafer and Yarowsky (2002)).

3.3 Reordering

The remaining component of the phrase-based SMT model is the reordering model. We introduce a novel algorithm for estimating
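Both the temporal and the topic features reduce to a cosine between normalized count vectors. A minimal sketch, with names of our own choosing:

```python
from math import sqrt

def signature_similarity(sig_f, sig_e):
    """Cosine similarity between two phrase signatures, e.g. per-day
    occurrence counts (temporal) or per-topic counts (Wikipedia topics).
    The normalization is folded into the cosine itself."""
    dot = sum(a * b for a, b in zip(sig_f, sig_e))
    norm_f = sqrt(sum(a * a for a in sig_f))
    norm_e = sqrt(sum(b * b for b in sig_e))
    return dot / (norm_f * norm_e) if norm_f and norm_e else 0.0
```

Identical signatures score 1, disjoint ones score 0, so phrase pairs whose usage spikes on the same days (or in the same Wikipedia topics) receive high similarity.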
po (orientation|f, e) from two monolingual corpora instead of a bitext.
Figure 1 illustrates how the phrase pair orientation statistics are estimated in the standard phrase-based SMT pipeline. For a phrase pair like (f = Profils, e = profile), we count its orientation with the previously translated phrase pair (f' = in Facebook, e' = Facebook) across all translated sentence pairs in the bitext.
In our pipeline we do not have translated sentence pairs. Instead, we look for monolingual sentences in the source corpus which contain the source phrase that we are interested in, like f = Profils, and at least one other phrase that we have a translation for, like f' = in Facebook. We then look for all target language sentences in the target monolingual corpus that contain the translation of f (here e = profile) and any translation of f'. Figure 6 illustrates that it is possible to find evidence for po (swapped|Profils, profile), even from the non-parallel, non-translated sentences drawn from two independent monolingual corpora. By looking for foreign sentences containing pairs of adjacent foreign phrases (f, f') and English sentences containing their corresponding translations (e, e'), we are able to increment orientation counts for (f, e) by looking at whether e and e' are adjacent, swapped, or discontinuous. The orientations correspond directly to those shown in Figure 1.
One subtlety of our method is that shorter and more frequent phrases (e.g. punctuation) are more likely to appear in multiple orientations with a given phrase, and therefore provide poor evidence of reordering. Therefore, we (a) collect the longest contextual phrases (which also appear in the phrase table) for reordering feature estimation, and (b) prune the set of sentences so that we only keep a small set of least frequent contextual phrases (this has the effect of dropping many function words and punctuation marks and relying more heavily on multi-word content phrases to estimate the reordering).2
Our algorithm for learning the reordering parameters is given in Figure 5. The algorithm estimates a probability distribution over monotone, swap, and discontinuous orientations (pm, ps, pd) for a phrase pair (f, e) from two monolingual corpora Cf and Ce. It begins by calling CollectOccurs to collect the longest matching phrase table phrases that precede f in source monolingual data (Bf), as well as those that precede (Be), follow (Ae), and are discontinuous (De) with e in the target language data. For each unique phrase f' preceding f, we look up translations in the phrase table T. Next, we count3 how

Input:  Source and target phrases f and e,
        Source and target monolingual corpora Cf and Ce,
        Phrase table pairs T = {(f(i), e(i))}, i = 1..N.
Output: Orientation features (pm, ps, pd).

Sf <- sentences containing f in Cf;
Se <- sentences containing e in Ce;
(Bf, -, -)   <- CollectOccurs(f, U_{i=1..N} f(i), Sf);
(Be, Ae, De) <- CollectOccurs(e, U_{i=1..N} e(i), Se);
cm = cs = cd = 0;
foreach unique f' in Bf do
    foreach translation e' of f' in T do
        cm = cm + #Be(e');
        cs = cs + #Ae(e');
        cd = cd + #De(e');
c <- cm + cs + cd;
return (cm/c, cs/c, cd/c)

CollectOccurs(r, R, S)
    B <- (); A <- (); D <- ();
    foreach sentence s in S do
        foreach occurrence of phrase r in s do
            B <- B + (longest preceding r and in R);
            A <- A + (longest following r and in R);
            D <- D + (longest discontinuous w/ r and in R);
    return (B, A, D);

Figure 5: Algorithm for estimating reordering probabilities from monolingual data.

Figure 6: Collecting phrase orientation statistics for an English-German phrase pair (profile, Profils) from non-parallel sentences (the German sentence translates as "Creating a Facebook profile is easy"). [The figure graphic is not recoverable from the extraction; it pairs the German sentence "Das Anlegen eines Facebook Profils ist einfach" with the English sentence "What does your Facebook profile reveal".]

2 The pruning step has an additional benefit of minimizing the memory needed for orientation feature estimations.
3 #L(x) returns the count of object x in list L.
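The counting procedure of Figure 5 can be sketched in Python as follows. This is an illustrative simplification, not the authors' implementation: phrases are reduced to single tokens, and the "longest contextual phrase" collection is reduced to looking at the immediately adjacent token, so only the overall counting structure is faithful to the algorithm.

```python
from collections import Counter

def collect_occurs(phrase, known, sentences):
    """Collect known phrases occurring before (B), after (A),
    or elsewhere in (D) each sentence containing `phrase`."""
    before, after, disc = [], [], []
    for s in sentences:
        toks = s.split()
        for i in range(len(toks)):
            if toks[i] != phrase:
                continue
            for j, tok in enumerate(toks):
                if j == i or tok not in known:
                    continue
                if j == i - 1:
                    before.append(tok)
                elif j == i + 1:
                    after.append(tok)
                else:
                    disc.append(tok)
    return before, after, disc

def orientation_features(f, e, table, src_sents, tgt_sents):
    """Estimate (pm, ps, pd) for phrase pair (f, e) from two
    monolingual corpora, following Figure 5."""
    src_known = set(table)                                   # known source phrases
    tgt_known = {t for ts in table.values() for t in ts}     # known target phrases
    bf, _, _ = collect_occurs(f, src_known,
                              [s for s in src_sents if f in s.split()])
    be, ae, de = collect_occurs(e, tgt_known,
                                [s for s in tgt_sents if e in s.split()])
    cb, ca, cdd = Counter(be), Counter(ae), Counter(de)
    cm = cs = cd = 0
    for f2 in set(bf):                   # unique f' preceding f
        for e2 in table.get(f2, []):     # translations of f' in T
            cm += cb[e2]
            cs += ca[e2]
            cd += cdd[e2]
    total = (cm + cs + cd) or 1          # avoid division by zero
    return cm / total, cs / total, cd / total
```

Run on the paper's running example, the German context phrase "Facebook" precedes "Profils" on the source side and its translation precedes "profile" on the target side, so all mass goes to the monotone orientation.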
Monolingual training corpora:

                      Europarl      Gigaword         Wikipedia
  date range          4/96-10/09    5/94-12/08       n/a
  uniq shared dates   829           5,249            n/a
  Spanish articles    n/a           3,727,954        59,463
  English articles    n/a           4,862,876        59,463
  Spanish lines       1,307,339     22,862,835       2,598,269
  English lines       1,307,339     67,341,030       3,630,041
  Spanish words       28,248,930    774,813,847      39,738,084
  English words       27,335,006    1,827,065,374    61,656,646

Spanish-English phrase table:

  Phrase pairs        3,093,228
  Spanish phrases     89,386
  English phrases     926,138
  Spanish unigrams    13,216    (avg # translations  98.7)
  Spanish bigrams     41,426    (avg # translations  31.9)
  Spanish trigrams    34,744    (avg # translations  13.5)

Table 1: Statistics about the monolingual training data and the phrase table that was used in all of the experiments.
many translations e' of f' appeared before, after or were discontinuous with e in the target language data. Finally, the counts are normalized and returned. These normalized counts are the values we use as estimates of po (orientation|f, e).

4 Experimental Setup

We use the Spanish-English language pair to test our method for estimating the parameters of an SMT system from monolingual corpora. This allows us to compare our method against the normal bilingual training procedure. We expect bilingual training to result in higher translation quality because it is a more direct method for learning translation probabilities. We systematically remove different parameters from the standard phrase-based model, and then replace them with our monolingual equivalents. Our goal is to recover as much of the loss as possible for each of the deleted bilingual components.
The standard phrase-based model that we use as our top-line is the Moses system (Koehn et al., 2007) trained over the full Europarl v5 parallel corpus (Koehn, 2005). With the exception of maximum phrase length (set to 3 in our experiments), we used default values for all of the parameters. All experiments use a trigram language model trained on the English side of the Europarl corpus using SRILM with Kneser-Ney smoothing. To tune feature weights in minimum error rate training, we use a development bitext of 2,553 sentence pairs, and we evaluate performance on a test set of 2,525 single-reference translated newswire articles. These development and test datasets were distributed in the WMT shared task (Callison-Burch et al., 2010).4 MERT was re-run for every experiment.
We estimate the parameters of our model from two sets of monolingual data, detailed in Table 1:
First, we treat the two sides of the Europarl parallel corpus as independent, monolingual corpora. Haghighi et al. (2008) also used this method to show how well translations could be learned from monolingual corpora under ideal conditions, where the contextual and temporal distributions of words in the two monolingual corpora are nearly identical.
Next, we estimate the features from truly monolingual corpora. To estimate the contextual and temporal similarity features, we use the Spanish and English Gigaword corpora.5 These corpora are substantially larger than the Europarl corpora, providing 27x as much Spanish and 67x as much English for contextual similarity, and 6x as many paired dates for temporal similarity. Topical similarity is estimated using Spanish and English Wikipedia articles that are paired with inter-language links.
To project context vectors from Spanish to English, we use a bilingual dictionary containing entries for 49,795 Spanish words. Note that end-to-end translation quality is robust to substantially reducing dictionary size, but we omit these experiments due to space constraints. The context vectors for words and phrases incorporate co-occurrence counts using a two-word window on either side.
The title of our paper uses the word towards because we assume that an inventory of phrase pairs is given. Future work will explore inducing the

4 Specifically, news-test2008 plus news-syscomb2009 for dev and newstest2009 for test.
5 We use the afp, apw and xin sections of the corpora.
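The context-vector construction and dictionary projection described above can be sketched as follows. This is an illustrative sketch only: the window size and projection-by-dictionary follow the text, but the function names, the use of cosine similarity, and the toy data are our assumptions.

```python
from collections import Counter
from math import sqrt

def context_vector(word, corpus, window=2):
    """Co-occurrence counts within a two-word window on either side."""
    vec = Counter()
    for sentence in corpus:
        toks = sentence.split()
        for i, tok in enumerate(toks):
            if tok != word:
                continue
            lo, hi = max(0, i - window), min(len(toks), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[toks[j]] += 1
    return vec

def project(vec, dictionary):
    """Project a Spanish context vector into English via a (possibly
    partial) bilingual dictionary; untranslatable context is dropped."""
    out = Counter()
    for w, c in vec.items():
        if w in dictionary:
            out[dictionary[w]] += c
    return out

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0
```

A Spanish phrase and its English translation should then receive similar context vectors once the Spanish vector is projected through the dictionary.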
[Figures 7 and 8: bar charts of BLEU scores for experiments 1-11, where features are estimated using Europarl, and experiments 12-19, where features are estimated using truly monolingual corpora; bar labels range from 4.00 to 23.36 BLEU. The chart graphics are not recoverable from the extraction. Surviving caption fragment: "... of parameters are estimated; the first part is for phrase-table features, the second is for reordering probabilities."]

Exp     Phrase scores / orientation scores
1       B/B    bilingual / bilingual (Moses)
2       B/-    bilingual / distortion
3       -/B    none / bilingual
4       -/-    none / distortion
5, 12   -/M    none / mono
6, 13   t/-    temporal mono / distortion
7, 14   o/-    orthographic mono / distortion
8, 15   c/-    contextual mono / distortion
16      w/-    Wikipedia topical mono / distortion
9, 17   M/-    all mono / distortion
10, 18  M/M    all mono / mono
11, 19  BM/B   bilingual + all mono / bilingual

5 Experimental Results

Figures 7 and 8 give experimental results. Figure 7 shows the performance of the phrase-based model when each of the bilingually estimated features is removed. It shows how much of the loss can be recovered using our monolingually estimated features.
phrase table itself from monolingual texts. Across all of our experiments, we use the phrase table that the bilingual model learned from the Europarl parallel corpus. We keep its phrase pairs, but we drop all of its scores. Table 1 gives details of the phrase pairs. In our experiments, we estimated similarity and reordering scores for more than 3 million phrase pairs. For each source phrase, the set of possible translations was constrained and likely to contain good translations. However, the average number of possible translations was high (ranging from nearly 100 translations for each unigram to 14 for each trigram). These contain a lot of noise and result in low end-to-end translation quality without good estimates of translation quality, as the experiments in Section 5.1 show.
Experiments 1-4 remove bilingually estimated parameters from the standard model. For Spanish-English, the relative contribution of the phrase-table features (which include the phrase translation probabilities and the lexical weights w) is greater than the reordering probabilities. When the reordering probability po (orientation|f, e) is eliminated and replaced with a simple distance-based distortion feature that does not require a bitext to estimate, the score dips only marginally since word order in English and Spanish is similar. However, when both the reordering and the phrase table features are dropped, leaving only the LM feature and the phrase penalty, the resulting translation quality is abysmal, with the score dropping a total of over 17 BLEU points.
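The distance-based distortion feature mentioned above can be illustrated with a short sketch. This is the generic linear distortion penalty used by phrase-based decoders such as Moses, not code from the paper; the function name is ours.

```python
def distortion_cost(prev_end, next_start):
    """Linear distortion penalty: the size of the jump between the
    source position where the previous phrase ended and the position
    where the next phrase starts. Monotone decoding
    (next_start == prev_end + 1) costs nothing; no bitext is needed
    to compute this, unlike lexicalized reordering probabilities."""
    return abs(next_start - prev_end - 1)

# Translating source phrases left to right incurs no penalty:
print(distortion_cost(prev_end=1, next_start=2))   # 0
# Swapping two adjacent two-word phrases does:
print(distortion_cost(prev_end=-1, next_start=2))  # first phrase jumps ahead -> 2
print(distortion_cost(prev_end=3, next_start=0))   # then jumps back -> 4
```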
reordering probabilities from monolingual data (-/M) adds 5 BLEU points, which is 73% of the potential recovery going from the model (-/-) to the model with bilingual reordering features (-/B).
Of the temporal, orthographic, and contextual monolingual features, the temporal feature performs the best. Together (M/-), they recover more than each individually. Combining monolingually estimated reordering and phrase table features (M/M) yields a total gain of 13.5 BLEU points, or over 75% of the BLEU score loss that occurred when we dropped all features from the phrase table. However, these results use monolingual corpora which have practically identical phrasal and temporal distributions.

5.3 Estimating features using truly monolingual corpora

Experiments 12-18 estimate all of the features from truly monolingual corpora. Our novel algorithm for estimating reordering holds up well and recovers 69% of the loss, only 0.4 BLEU points less than when estimated from the Europarl monolingual texts. The temporal similarity feature does not perform as well as when it was estimated using Europarl data, but the contextual feature does. The topic similarity using Wikipedia performs the strongest of the individual features.
Combining the monolingually estimated reordering features with the monolingually estimated similarity features (M/M) yields a total gain of 14.8 BLEU points, or over 82% of the BLEU point loss that occurred when we dropped all features from the phrase table. This is equivalent to training the standard system on a bitext with roughly 60,000 lines or nearly 2 million words (learning curve omitted for space).
Finally, we supplement the standard bilingually estimated model parameters with our monolingual features (BM/B), and we see a 1.5 BLEU point increase over the standard model. Therefore, our monolingually estimated scores capture some novel information not contained in the standard feature set.

6 Additional Related Work

Carbonell et al. (2006) described a data-driven MT system that used no parallel text. It produced translation lattices using a bilingual dictionary and scored them using an n-gram language model. Their method has no notion of translation similarity aside from a bilingual dictionary. Similarly, Sanchez-Cartagena et al. (2011) supplement an SMT phrase table with translation pairs extracted from a bilingual dictionary and give each a frequency of one for computing translation scores. Ravi and Knight (2011) treat MT without parallel training data as a decipherment task and learn a translation model from monolingual text. They translate corpora of Spanish time expressions and subtitles, which both have a limited vocabulary, into English. Their method has not been applied to broader domains of text.
Most work on learning translations from monolingual texts only examines small numbers of frequent words. Huang et al. (2005) and Daume and Jagarlamudi (2011) are exceptions that improve MT by mining translations for OOV items.
A variety of past research has focused on mining parallel or comparable corpora from the web (Munteanu and Marcu, 2006; Smith et al., 2010; Uszkoreit et al., 2010). Others use an existing SMT system to discover parallel sentences within independent monolingual texts, and use them to re-train and enhance the system (Schwenk, 2008; Chen et al., 2008; Schwenk and Senellart, 2009; Rauf and Schwenk, 2009; Lambert et al., 2011). These are complementary but orthogonal to our research goals.

7 Conclusion

This paper has demonstrated a novel set of techniques for successfully estimating phrase-based SMT parameters from monolingual corpora, potentially circumventing the need for large bitexts, which are expensive to obtain for new languages and domains. We evaluated the performance of our algorithms in a full end-to-end translation system. Assuming that a bilingual-corpus-derived phrase table is available, we were able to utilize our monolingually-estimated features to recover over 82% of the BLEU loss that resulted from removing the bilingual-corpus-derived phrase-table probabilities. We also showed that our monolingual features add 1.5 BLEU points when combined with standard bilingually estimated features. Thus our techniques have stand-alone efficacy when large bilingual corpora are not available and also make a significant contribution to combined ensemble performance when they are.
References

Enrique Alfonseca, Massimiliano Ciaramita, and Keith Hall. 2009. Gazpacho and summer rash: lexical relationships from temporal patterns of web search queries. In Proceedings of EMNLP.

Taylor Berg-Kirkpatrick and Dan Klein. 2011. Simple effective decipherment via combinatorial optimization. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP-2011), Edinburgh, Scotland, UK.

Shane Bergsma and Benjamin Van Durme. 2011. Learning bilingual lexicons using the visual similarity of labeled web images. In Proceedings of the International Joint Conference on Artificial Intelligence.

Peter Brown, John Cocke, Stephen Della Pietra, Vincent Della Pietra, Frederick Jelinek, Robert Mercer, and Paul Roossin. 1988. A statistical approach to language translation. In 12th International Conference on Computational Linguistics (CoLing-1988).

Peter Brown, Stephen Della Pietra, Vincent Della Pietra, and Robert Mercer. 1993. The mathematics of machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311, June.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Workshop on Statistical Machine Translation.

Jaime Carbonell, Steve Klein, David Miller, Michael Steinbaum, Tomer Grassiany, and Jochen Frey. 2006. Context-based machine translation. In Proceedings of AMTA.

Boxing Chen, Min Zhang, Aiti Aw, and Haizhou Li. 2008. Exploiting n-best hypotheses for SMT self-enhancement. In Proceedings of ACL/HLT, pages 157-160.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of ACL.

Hal Daume and Jagadeesh Jagarlamudi. 2011. Domain adaptation for machine translation by mining unseen words. In Proceedings of ACL/HLT.

Pascale Fung and Lo Yuen Yee. 1998. An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of ACL/CoLing.

Nikesh Garera, Chris Callison-Burch, and David Yarowsky. 2009. Improving translation lexicon induction from monolingual corpora via dependency contexts and part-of-speech equivalences. In Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), Boulder, Colorado.

Ulrich Germann. 2001. Building a statistical machine translation system from scratch: How much bang for the buck can we expect? In ACL 2001 Workshop on Data-Driven Machine Translation, Toulouse, France.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of ACL/HLT.

Fei Huang, Ying Zhang, and Stephan Vogel. 2005. Mining key phrase translations from web corpora. In Proceedings of EMNLP.

Alexandre Klementiev and Dan Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In Proceedings of the ACL/Coling.

Kevin Knight and Jonathan Graehl. 1997. Machine transliteration. In Proceedings of ACL.

Philipp Koehn and Kevin Knight. 2002. Learning a translation lexicon from monolingual corpora. In ACL Workshop on Unsupervised Lexical Acquisition.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of HLT/NAACL.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the ACL-2007 Demo and Poster Sessions.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Machine Translation Summit.

Shankar Kumar and William Byrne. 2004. Local phrase reordering models for statistical machine translation. In Proceedings of HLT/NAACL.

Patrik Lambert, Holger Schwenk, Christophe Servan, and Sadaf Abdul-Rauf. 2011. Investigations on translation model adaptation using monolingual data. In Proceedings of the Workshop on Statistical Machine Translation, pages 284-293, Edinburgh, Scotland, UK.

David Mimno, Hanna Wallach, Jason Naradowsky, David Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of EMNLP.

Dragos Stefan Munteanu and Daniel Marcu. 2006. Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of the ACL/Coling.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417-449.
Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. Ph.D. thesis, RWTH Aachen.
Franz Josef Och. 2003. Minimum error rate training
for statistical machine translation. In Proceedings
of ACL.
Reinhard Rapp. 1995. Identifying word translations
in non-parallel texts. In Proceedings of ACL.
Reinhard Rapp. 1999. Automatic identification of
word translations from unrelated English and Ger-
man corpora. In Proceedings of ACL.
Sadaf Abdul Rauf and Holger Schwenk. 2009. On the
use of comparable corpora to improve SMT perfor-
mance. In Proceedings of EACL.
Sujith Ravi and Kevin Knight. 2011. Deciphering for-
eign language. In Proceedings of ACL/HLT.
Víctor M. Sánchez-Cartagena, Felipe Sánchez-Martínez, and Juan Antonio Pérez-Ortiz. 2011.
Integrating shallow-transfer rules into phrase-based
statistical machine translation. In Proceedings of
the XIII Machine Translation Summit.
Charles Schafer and David Yarowsky. 2002. Inducing
translation lexicons via diverse similarity measures
and bridge languages. In Proceedings of CoNLL.
Holger Schwenk and Jean Senellart. 2009. Transla-
tion model adaptation for an Arabic/French news
translation system by lightly-supervised training. In
MT Summit.
Holger Schwenk. 2008. Investigations on large-scale
lightly-supervised training for statistical machine
translation. In Proceedings of IWSLT.
Jason R. Smith, Chris Quirk, and Kristina Toutanova.
2010. Extracting parallel sentences from compa-
rable corpora using document level alignment. In
Proceedings of HLT/NAACL.
Christoph Tillman. 2004. A unigram orientation
model for statistical machine translation. In Pro-
ceedings of HLT/NAACL.
Christoph Tillmann. 2003. A projection extension al-
gorithm for statistical machine translation. In Pro-
ceedings of EMNLP.
Jakob Uszkoreit, Jay M. Ponte, Ashok C. Popat, and
Moshe Dubiner. 2010. Large scale parallel docu-
ment mining for machine translation. In Proceed-
ings of CoLing.
Ashish Venugopal, Stephan Vogel, and Alex Waibel.
2003. Effective phrase translation extraction from
alignment models. In Proceedings of ACL.
Character-Based Pivot Translation for Under-Resourced Languages and
Domains
Jörg Tiedemann
Department of Linguistics and Philology
Uppsala University, Uppsala/Sweden
jorg.tiedemann@lingfil.uu.se
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 141-151, Avignon, France, April 23-27, 2012.
© 2012 Association for Computational Linguistics
cuss character-based translation models followed by a detailed presentation of our experimental results. Finally, we briefly summarize related work and conclude the paper with discussions and prospects for future work.

2 Pivot Models

Information from pivot languages can be incorporated in SMT models in various ways. The main principle refers to the combination of source-to-pivot and pivot-to-target translation models. In our setup, one of these models includes a resource-poor language (source or target) and the other one refers to a standard model with appropriate data resources. A condition is that we have at least some training data for the translation between pivot and the resource-poor language. However, for the original task (source-to-target translation) we do not require any data resources except for purposes of comparison.
We will explore various models for the translation between the resource-poor language and the pivot language, and most of them are not compatible with standard phrase-based translation models. Hence, triangulation methods (Cohn and Lapata, 2007) for combining phrase tables are not applicable in our case. Instead, we explore a cascaded approach (also called transfer method (Wu and Wang, 2009)) in which we translate the input text in two steps using a linear interpolation for rescoring N-best lists. Following the method described in (Utiyama and Isahara, 2007) and (Wu and Wang, 2009), we use the best n hypotheses from the translation of source sentences s to pivot sentences p and combine them with the top m hypotheses for translating these pivot sentences to target sentences t:

    \hat{t} \approx \arg\max_{t} \sum_{k=1}^{L} \left[ \alpha \, \lambda_k^{sp} h_k^{sp}(s, p) + (1 - \alpha) \, \lambda_k^{pt} h_k^{pt}(p, t) \right]

where h_k^{xy} are feature functions for model xy with appropriate weights \lambda_k^{xy}.1 Basically, this means that we simply add the scores and, similar to related work, we assume that the feature weights can be set independently for each model using minimum error rate training (MERT) (Och, 2003). In our setup we added the parameter \alpha that can be used to weight the importance of one model over the other. This can be useful as we do not consider the entire hypothesis space but only a small subset of N-best lists. In the simplest case, this weight is set to 0.5, making both models equally important. An alternative to fitting the interpolation weight would be to perform a global optimization procedure. However, a straightforward implementation of pivot-based MERT would be prohibitively slow due to the expensive two-step translation procedure over n-best lists.
A general condition for the pivot approach is to assume independent training sets for both translation models, as already pointed out by (Bertoldi et al., 2008). In contrast to research presented in related work (see, for example, (Koehn et al., 2009)), this condition is met in our setup, in which all data sets represent different samples over the languages considered (see section 4).2

3 Character-Based SMT

The basic idea behind character-based translation models is to take advantage of the strong lexical and syntactic similarities between closely related languages. Consider, for example, Figure 1. Related languages like Catalan and Spanish or Danish and Norwegian have common roots and, therefore, use similar concepts and express them in similar grammatical structures. Spelling conventions can still be quite different, but those differences are often very consistent. The Bosnian-Macedonian example also shows that we do not have to require any alphabetic overlap in order to obtain character-level similarities.
Regularities between such closely related languages can be captured below the word level. We can also assume a more or less monotonic relation between the two languages, which motivates the idea of translation models over character N-grams, treating translation as a transliteration task (Vilar et al., 2007). Conceptually it is straightforward to think of phrase-based models on the character level. Sequences of characters can be used instead of word N-grams for both translation and language models. Training can proceed with the same tools and approaches. The basic task is to

1 Note that we do not require the same feature functions in both models even though the formula above implies this for simplicity of representation.
2 Note that different samples may still include common sentences.
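The interpolated rescoring over the two n-best lists can be sketched as follows. This is an illustrative sketch, not the authors' implementation: each hypothesis is assumed to carry a single pre-combined model score (the weighted sum of its feature functions), and the data layout is our own.

```python
def rescore_pivot(src_nbest, pivot_nbest, alpha=0.5):
    """Two-step pivot translation with linear interpolation.

    src_nbest:   list of (pivot_sentence, sp_score) pairs, the n best
                 source->pivot hypotheses.
    pivot_nbest: dict mapping a pivot sentence to a list of
                 (target_sentence, pt_score) pairs, the m best
                 pivot->target hypotheses.
    alpha:       interpolation weight; 0.5 makes both models
                 equally important.
    Returns the target hypothesis maximizing the interpolated score.
    """
    best_target, best_score = None, float("-inf")
    for pivot, sp_score in src_nbest:
        for target, pt_score in pivot_nbest.get(pivot, []):
            score = alpha * sp_score + (1 - alpha) * pt_score
            if score > best_score:
                best_target, best_score = target, score
    return best_target, best_score

# A hypothetical toy run: a weaker source->pivot hypothesis can still
# win if its pivot->target continuation scores much better.
src = [("p1", -1.0), ("p2", -2.0)]
piv = {"p1": [("t1", -3.0)], "p2": [("t2", -0.5)]}
print(rescore_pivot(src, piv))  # ('t2', -1.25)
```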
prepare the data to comply with the training procedures (see Figure 2).

Figure 1: Some examples of movie subtitle translations between closely related languages (either sharing parts of the same alphabet or not). [Figure graphic not recoverable from the extraction.]

Figure 2: Data pre-processing for training models on the character level. Spaces are represented by a place-holder character and each sentence is treated as one sequence of characters. [Figure graphic not recoverable from the extraction.]

3.1 Character Alignment

One crucial difference is the alignment of characters, which is required instead of an alignment of words. Clearly, the traditional IBM word alignment models are not designed for this task, especially with respect to distortion. However, the same generative story can still be applied in general. Vilar et al. (2007) explore a two-step procedure where words are aligned first (with the traditional IBM models) to divide sentence pairs into aligned segments of reasonable size and the characters are then aligned with the same algorithm.
An alternative is to use models designed for transliteration or related character-level transformation tasks. Many approaches are based on transducer models that resemble string edit operations such as insertions, deletions and substitutions (Ristad and Yianilos, 1998). Weighted finite state transducers (WFSTs) can be trained on unaligned pairs of character sequences and have been shown to be very effective for transliteration tasks or letter-to-phoneme conversions (Jiampojamarn et al., 2007). The training procedure usually employs an expectation maximization (EM) procedure and the resulting transducer can be used to find the Viterbi alignment between characters according to the best sequence of edit operations applied to transform one string into the other. Extensions to this model are possible, for example the use of many-to-many alignments which have been shown to be very effective in letter-to-phoneme alignment tasks (Jiampojamarn et al., 2007).
One advantage of the edit-distance-based transducer models is that the alignments they predict are strictly monotonic and cannot easily be confused by spurious relations between characters over longer distances. Long distance alignments are only possible in connection with a series of insertions and deletions that usually increase the alignment costs in such a way that they are avoided if possible. On the other hand, IBM word alignment models also prefer monotonic alignments over non-monotonic ones if there is no good reason to do otherwise (i.e., there is frequent evidence of distorted alignments). However, the size of the vocabulary in a character-level model is very small (several orders of magnitude smaller than on the word level) and this may cause serious confusion of the word alignment model, which relies very much on context-independent lexical translation probabilities. Hence, for character alignment, the lexical evidence is much less reliable without context.
It is certainly possible to find a compromise between word-level and character-level models in order to generalize below word boundaries while avoiding the alignment problems discussed above. Morpheme-based translation models have been explored in several studies with a similar motivation to ours: better generalization from sparse training data (Fishel and Kirik, 2010; Luong et al., 2010). However, these approaches have the drawback that they require proper morphological analyses. Data-driven techniques exist even for morphology, but their use in SMT still needs to be shown (Fishel, 2009). The situation is comparable to the problems of integrating linguistically motivated phrases into phrase-based SMT (Koehn et al., 2003). Instead we opt for a more general approach to extend context in order to facilitate, especially, the alignment step. Figure 3 shows how we can transform texts into sequences of bigrams that can be aligned with standard approaches without making any assumptions about linguistically motivated segmentations.
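The bigram transformation of Figure 3 can be sketched in a few lines. This is a minimal illustration of the idea; using the underscore as the space place-holder is our assumption, since the original marker symbol did not survive extraction.

```python
def to_char_bigrams(sentence, space="_"):
    """Turn a sentence into a space-separated sequence of character
    bigrams, replacing spaces with a place-holder character and
    appending an end-of-sentence marker, as in Figure 3."""
    chars = [space if c == " " else c for c in sentence] + [space]
    return " ".join(a + b for a, b in zip(chars, chars[1:]))

print(to_char_bigrams("que es eso ?"))
# qu ue e_ _e es s_ _e es so o_ _? ?_
```

Because consecutive bigrams overlap by one character, the first character of each aligned bigram still identifies the character at that position, which is what allows the bigram alignment to be reused for training a character-level system.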
Figure 3: Two Spanish sentences as sequences of character bigrams with a final place-holder marking the end of a sentence:

cu ur rs so o c co on nf fi ir rm ma ad do o . .
q qu ue e e es s e es so o ? ?

In this way we can construct a parallel corpus with slightly richer contextual information as input to the alignment program. The vocabulary remains small (for example, 1,267 bigrams in the case of Spanish compared to 84 individual characters in our experiments) but lexical translation probabilities now become much more differentiated.
With this, it is now possible to use the alignment between bigrams to train a character-level translation system, as we have the same number of bigrams as we have characters (and the first character in each bigram corresponds to the character at that position). Certainly, it is also possible to train a bigram translation model (and language model). This has the (one and only) advantage that one character of context across phrase boundaries (i.e. character N-grams) is used in the selection of translation alternatives from the phrase table.3

3 Using larger units (trigrams, for example) led to lower

3.2 Tuning Character-Level Models

A final remark on training character-based SMT models concerns feature weight tuning. It makes little sense to compute character-level BLEU scores for tuning feature weights, especially with the standard settings of matching relatively short N-grams. Instead we would still like to measure performance in terms of word-level BLEU scores (or any other MT evaluation metric used in minimum error rate training). Therefore, it is important to post-process character-translated development sets before adjusting weights. This is simply done by merging characters accordingly and replacing the place-holders with spaces again. Thereafter, MERT can run as usual.

3.3 Evaluation

Character-level translations can be evaluated in the same way as other translation hypotheses, for example using automatic measures such as BLEU, NIST, METEOR etc. The same simple post-processing as mentioned in the previous section can be applied to turn the character translations into normal text. However, it can be useful to look at some other measures as well that consider near matches on the character level instead of matching words and word N-grams only. Character-level models have the ability to produce strings that may be close to the reference and still do not match any of the words contained. They may generate non-words that include mistakes which look like spelling errors or minor grammatical mistakes. Those words are usually close enough to the correct target words to be recognized by the user, which is often more acceptable than leaving foreign words untranslated. This is especially true as many unknown words represent important content words that bear a lot of information. The problem of unknown words is even more severe for morphologically rich languages, as many word forms are simply not part of (sparse) training data sets. Untranslated words are especially annoying when translating between languages that use different writing systems. Consider, for example, the following subtitles in Macedonian (using Cyrillic letters) that have been translated from Bosnian (written in Latin characters):

reference:  [Cyrillic text not recoverable from the extraction]
word-based: casu vina, [Cyrillic]
char-based: [Cyrillic]

reference:  [Cyrillic]
word-based: starom svetilistu.
char-based: [Cyrillic]

The underlined parts mark examples of character-level differences with respect to the reference translation. For the pivot translation approach, it is important that the translations generated in the first step can be handled by the second one. This means that words generated by a character-based model should at least be valid input words for the second step, even though they might refer to erroneous inflections in that context. Therefore, we add another measure to our experimental results presented below: the number of unknown words with respect to the input language of the second step. This applies only to models that are used as the first step in pivot-based translations. For other models, we include a string similarity measure based on the longest common subsequence
scores in our experiments (probably due to data sparseness) ratio (LCSR) (Stephen, 1992) in order to give an
and, therefore, are not reported here. impression about the closeness of the system
144
output to the reference translations. and another 2000 sentences for testing. For Gali-
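The preprocessing and post-processing described above, together with the LCSR measure, can be sketched as follows. This is a minimal illustration under our own naming; the paper's actual scripts are not shown, and the "_" space place-holder and "$" end-of-sentence marker are assumptions, as the extraction lost the original symbols.

```python
def to_bigrams(sentence, space="_", eos="$"):
    """Turn a sentence into a space-separated sequence of character bigrams
    (cf. Figure 3): spaces become a place-holder symbol and a final marker
    closes the sentence. The first character of each unit corresponds to
    the character at that position."""
    chars = [space if c == " " else c for c in sentence]
    chars.append(eos)
    return " ".join(chars[i] + chars[i + 1] for i in range(len(chars) - 1))

def from_chars(units, space="_", eos="$"):
    """Merge translated character units back into words (the post-processing
    applied to development sets before MERT, Section 3.2): keep the first
    character of every unit, drop the end marker, restore spaces."""
    first = "".join(u[0] for u in units.split())
    return first.replace(eos, "").replace(space, " ").strip()

def lcsr(a, b):
    """Longest common subsequence ratio between two strings:
    |LCS(a, b)| divided by the length of the longer string."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] \
                else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / max(m, n) if max(m, n) else 1.0
```

For example, `to_bigrams("la paz")` yields `"la a_ _p pa az z$"`, and applying `from_chars` to a (translated) unit sequence undoes the encoding so that word-level BLEU can be computed during tuning.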
4 Experiments

We conducted a series of experiments to test the ideas of (character-level) pivot translation for resource-poor languages. We chose to use data from a collection of translated subtitles compiled in the freely available OPUS corpus (Tiedemann, 2009b). This collection includes a large variety of languages and contains mainly short sentences and sentence fragments, which suits character-level alignment very well. The selected settings represent translation tasks between languages (and domains) for which only very limited training data is available or none at all.

Below we present results from two general tasks:[4] (i) Translating between English and a resource-poor language (in both directions) via a pivot language that is closely related to the resource-poor language. (ii) Translating between two languages in a domain for which no in-domain training data is available via a pivot language with in-domain data. We will start with the presentation of the first task and the character-based translation between closely related languages.

4.1 Task 1: Pivoting via Related Languages

We decided to look at resource-poor languages from two language families: Macedonian representing a Slavic language from the Balkan region, and Catalan and Galician representing two Romance languages spoken mainly in Spain. There is only little or no data available for translating from or to English for these languages. However, there are related languages with medium or large amounts of training data. For Macedonian, we use Bulgarian (which also uses a Cyrillic alphabet) and Bosnian (another related language that mainly uses Latin characters) as the pivot languages. For Catalan and Galician, the obvious choice was Spanish (however, Portuguese, for example, would have been another reasonable option for Galician). Table 1 lists the data available for training the various models. Furthermore, we reserved 2000 sentences for tuning parameters and another 2000 sentences for testing. For Galician, we only used 1000 sentences for each set due to the lack of additional data. We were especially careful when preparing the data to exclude all sentences from tuning and test sets that could be found in any pivot or direct translation model. Hence, all test sentences are unseen strings for all models presented in this paper (but they are not comparable with each other as they are sampled individually from independent data sets).

  language pair          #sents   #words
  Galician-English          -        -
  Galician-Spanish         2k      15k
  Catalan-English         50k     400k
  Catalan-Spanish         64k     500k
  Spanish-English         30M     180M
  Macedonian-English     220k     1.2M
  Macedonian-Bosnian      12k      60k
  Macedonian-Bulgarian   155k     800k
  Bosnian-English         2.1M     11M
  Bulgarian-English       14M      80M

Table 1: Training data for the translation task between closely related languages in the domain of movie subtitles. Number of sentences (#sents) and number of words (#words) in thousands (k) and millions (M) (averages of source and target language).

The data sets represent several interesting test cases: Galician is the least supported language, with extremely little training data for building our pivot model. There is no data for the direct model and, therefore, no explicit baseline for this task. There is 30 times more data available for Catalan-English, but still too little for a decent standard SMT model. Interesting here is that we have more or less the same amount of data available for the baseline and for the pivot translation between the related languages. The data set for Macedonian-English is by far the largest among the baseline models and also bigger than the sets available for the related pivot languages. Especially Macedonian-Bosnian is not well supported. The interesting question is whether tiny amounts of pivot data can still be competitive. In all three cases, there is much more data available for the translation models between English and the pivot language.

In the following section we will look at the translation between related languages with various models and training setups before we consider the actual translation task via the bridge languages.

[4] In all experiments we use standard tools like Moses, GIZA++, SRILM, mteval, etc. Details about basic settings are omitted here due to space constraints but can be found in the supplementary material. The data sets are available from http://stp.lingfil.uu.se/joerg/index.php?resources
                    bs-mk              bg-mk              es-gl              es-ca
  Model           BLEU%   LCSR       BLEU%   LCSR       BLEU%   LCSR       BLEU%   LCSR
  word-based      15.43   0.5067     14.66   0.6225     41.11   0.7966     62.73   0.8526
  char WFST1:1    21.37++ 0.6903     13.33   0.6159     36.94   0.7832     73.17++ 0.8728
  char WFST2:2    19.17++ 0.6737     12.67   0.6190     43.39++ 0.8083     70.64++ 0.8684
  char IBMchar    23.17++ 0.6968     14.57   0.6347     45.21++ 0.8171     73.12++ 0.8767
  char IBMbigram  24.84++ 0.7046     15.01++ 0.6374     44.06++ 0.8144     74.21++ 0.8803

Table 2: Translating from a related pivot language to the target language. Bosnian (bs) / Bulgarian (bg) → Macedonian (mk); Galician (gl) / Catalan (ca) → Spanish (es). Word-based refers to standard phrase-based SMT models. All other models use phrases over character sequences. The WFSTx:y models use weighted finite-state transducers for character alignment with units that are at most x and y characters long, respectively. Other models use Viterbi alignments created by IBM model 4 using GIZA++ (Och and Ney, 2003) between characters (IBMchar) or bigrams (IBMbigram). LCSR refers to the averaged longest common subsequence ratio between system translations and references. Results are significantly better (p < 0.01: ++, p < 0.05: +) or worse (p < 0.01: --, p < 0.05: -) than the word-based baseline.

Table 3: Translating from the source language to a related pivot language. UNK gives the proportion of unknown words with respect to the translation model from the pivot language to English. [table body not recoverable from the extraction]

4.1.1 Translating Related Languages

The main challenge for the translation models between related languages is the restriction to very limited parallel training data. Character-level models make it possible to generalize to very basic translation units, leading to robust models in the sense of models without unknown events. The basic question is whether they provide reasonable translations with respect to given accepted references. Tables 2 and 3 give a comprehensive summary of various models for the languages selected in our experiments.

We can see that at least one character-based translation model outperforms the standard word-based model in all cases. This is true (and not very surprising) for the language pairs with very little training data, but it is also the case for language pairs with slightly more reasonable data sets like Bulgarian-Macedonian. The automatic measures indicate decent translation performance at this stage, which encourages their use in pivot translation, as discussed in the next section.

Furthermore, we can also see the influence of different character alignment algorithms. Somewhat surprisingly, the best results are achieved with IBM alignment models that are not designed for this purpose. Transducer-based alignments produce consistently worse translation models (at least in terms of BLEU scores). The reason for this might be that the IBM models can handle noise in the training data more robustly. However, in terms of unknown words, WFST-based alignment is very competitive and often the best choice (but not much different from the best IBM-based models). The use of character bigrams leads to further BLEU improvements for all data sets except Galician-Spanish. However, this data set is extremely small, which may cause unpredictable results. In any case, the differences between character-based alignments and bigram-based ones are rather small and our experiments do not lead to conclusive results.

4.1.2 Pivot Translation

In this section we now look at cascaded translations via the related pivot language. Tables 4 and 5 summarize the results for various settings.

As we can see, the pivot translations for Catalan and Galician outperform the baselines by a large margin. Here, the baselines are, of course, very weak due to the minimal amount of training data. Furthermore, the Catalan-English test set appears to be very easy considering the relatively high BLEU scores achieved even with tiny
amounts of training data for the baseline. Still, no test sentence appears in any training or development set for either direct translation or pivot models. From the results, we can also see that Catalan and Galician are quite different from Spanish and require language-specific treatment. Using a large Spanish-English model (with over 30% BLEU in both directions) to translate from or to Catalan or Galician is not an option. The experiments show that character-based pivot models lead to better translations than word-based pivot models (in terms of BLEU scores). This reflects the performance gains presented in Table 2. Rescoring of N-best lists, on the other hand, does not have a big impact on our results. However, we did not spend time optimizing the parameters of N-best size and interpolation weight.

  Model (BLEU in %)                    1x1      10x10
  English→Catalan (baseline)          26.70
  English→(Spanish = Catalan)          8.38
  English→Spanish -word- Catalan      38.91++  39.59++
  English→Spanish -char- Catalan      44.46++  46.82++
  Catalan→English (baseline)          27.86
  (Catalan = Spanish)→English          9.52
  Catalan -word- Spanish→English      38.41++  38.65++
  Catalan -char- Spanish→English      40.43++  40.73++
  English→Galician (baseline)           -
  English→(Spanish = Galician)         7.46
  English→Spanish -word- Galician     20.55    20.76
  English→Spanish -char- Galician     21.12    21.09
  Galician→English (baseline)           -
  (Galician = Spanish)→English         5.76
  Galician -word- Spanish→English     13.16    13.20
  Galician -char- Spanish→English     16.04    16.02

Table 4: Translating between Galician/Catalan and English via Spanish using a standard phrase-based SMT baseline, Spanish-English SMT models to translate from/to Catalan/Galician, and pivot-based approaches using word-level models or character-level models (based on IBMbigram alignments) with either one-best (1x1) or N-best lists (10x10 with interpolation weight 0.85).

  Model (BLEU in %)                    1x1      10x10
  English→Maced. (baseline)           11.04
  English→Bosn. -word- Maced.          7.33     7.64
  English→Bosn. -char- Maced.          9.99    10.34
  English→Bulg. -word- Maced.         12.49++  12.62++
  English→Bulg. -char- Maced.         11.57++  11.59+
  Maced.→English (baseline)           20.24
  Maced. -word- Bosn.→English         12.36    12.48
  Maced. -char- Bosn.→English         18.73    18.64
  Maced. -word- Bulg.→English         19.62    19.74
  Maced. -char- Bulg.→English         21.05    21.10

Table 5: Translating between Macedonian (Maced) and English via Bosnian (Bosn) / Bulgarian (Bulg).

The results from the Macedonian task are not as clear. This is especially due to the different setup, in which the baseline uses more training data than any of the related-language pivot models. However, we can still see that the pivot translation via Bulgarian clearly outperforms the baseline. For the case of translating to Macedonian via Bulgarian, the word-based model seems to be more robust than the character-level model. This may be due to a larger number of non-words generated by the character-based pivot model. In general, the BLEU scores are much lower for all models involved (even for the high-density languages), which indicates larger problems with the generation of correct output and intermediate translations.

Interesting is the fact that we can achieve almost the same performance as the baseline when translating via Bosnian, even though we had much less training data at our disposal for the translation between Macedonian and Bosnian. In this setup, we can see that a character-based model was necessary in order to obtain the desired abstraction from the tiny amount of training data.

4.2 Task 2: Pivoting for Domain Adaptation

Sparse resources are not only a problem for specific languages but also for specific domains. SMT models are very sensitive to domain shifts, and domain-specific data is often rare. In the following, we investigate a test case of translating between two languages (English and Norwegian) with reasonable amounts of data resources but in the wrong domain (movie subtitles instead of legal texts). Here again, we facilitate the translation process by a pivot language, this time with domain-specific data.

The task is to translate legal texts from Norwegian (Bokmål) to English and vice versa. The test set is taken from the English-Norwegian Parallel Corpus (ENPC) (Johansson et al., 1996) and contains 1493 parallel sentences (a selection of European treaties, directives and agreements). Otherwise, there is no training data available in this domain for English and Norwegian. Table 6 lists the other data resources we used in our study.

As we can see, there is a decent amount of training data for English-Norwegian, but the domain is strikingly different. On the other hand, there
is in-domain data for other languages like Danish that may act as an intermediate pivot. Furthermore, we have out-of-domain data for the translation between the pivot and Norwegian. The sizes of the training data sets for the pivot models are comparable (in terms of words). The in-domain pivot data is controlled and very consistent and, therefore, high-quality translations can be expected. The subtitle data is noisy and includes various movie genres. It is important to mention that the pivot data still does not contain any sentence included in the English-Norwegian test set.

  Language pair       Domain     #sents   #words
  English-Norwegian   subtitles   2.4M     18M
  Norwegian-Danish    subtitles   1.5M     10M
  Danish-English      DGT-TM     430k       9M

Table 6: Training data available for the domain adaptation task. DGT-TM refers to the translation memories provided by the JRC (Steinberger et al., 2006).

Table 7 summarizes the results of our experiments when using Danish and in-domain data as a pivot in translations from and to Norwegian.

  Model (task: English→Norwegian)               BLEU
  (step 1) English →dgt Danish                  52.76
  (step 2) Danish →subs_wo Norwegian            29.87
  (step 2) Danish →subs_ch Norwegian            29.65
  (step 2) Danish →subs_bi Norwegian            25.65
  English →subs Norwegian (baseline)             7.20
  English →dgt (Danish = Norwegian)              9.44++
  English →dgt Danish →subs_wo Norwegian        17.49++
  English →dgt Danish →subs_ch Norwegian        17.61++
  English →dgt Danish →subs_bi Norwegian        14.07++

  Model (task: Norwegian→English)               BLEU
  (step 1) Norwegian →subs_wo Danish            30.15
  (step 1) Norwegian →subs_ch Danish            27.81
  (step 1) Norwegian →subs_bi Danish            28.52
  (step 2) Danish →dgt English                  57.23
  Norwegian →subs English (baseline)            11.41
  (Norwegian = Danish) →dgt English             13.21++
  Norwegian →subs+dgtLM English                 13.33++
  Norwegian →subs_wo Danish →dgt English        25.75++
  Norwegian →subs_ch Danish →dgt English        23.77++
  Norwegian →subs_bi Danish →dgt English        26.29++

Table 7: Translating out-of-domain data via Danish. Models using in-domain data are marked with dgt and out-of-domain models are marked with subs. subs+dgtLM refers to a model with an out-of-domain translation model and an added in-domain language model. The subscripts wo, ch and bi refer to word, character and bigram models, respectively.

The influence of in-domain data in the translation process is enormous. As expected, the out-of-domain baseline does not perform well even though it uses the largest amount of training data in our setup. It is even outperformed by the in-domain pivot model when pretending that Norwegian is in fact Danish. For the translation into English, the in-domain language model helps a little bit (similar resources are not available for the other direction). However, having the strong in-domain model for translating to (and from) the pivot language improves the scores dramatically. The out-of-domain model in the other part of the cascaded translation does not destroy this advantage completely, and the overall score is much higher than any other baseline.

In our setup, we again used a closely related language as a pivot. However, this time we had more data available for training the pivot translation model. Naturally, the advantages of the character-level approach diminish and the word-level model becomes a better alternative. However, there can still be a good reason for the use of a character-based model, as we can see in the success of the bigram model (subs_bi) in the translation from Norwegian to English (via Danish). A character-based model may generalize beyond domain-specific terminology, which leads to a reduction of unknown words when applied to a new domain. Note that using a character-based model in step two could possibly cause more harm than using it in step one of the pivot-based procedure. Using N-best lists for a subsequent word-based translation in step two may fix errors caused by character-based translation simply by ignoring hypotheses containing them, which makes such a model more robust to noisy input.

Finally, as an alternative, we can also look at other pivot languages. The domain adaptation task is not at all restricted to closely related pivot languages, especially considering the success of word-based models in the experiments above. Table 8 lists results for three other pivot languages.

Surprisingly, the results are much worse than for the Danish test case. Apparently, these models are strongly influenced by the out-of-domain translation between Norwegian and the pivot language. The only success can be seen with another closely related language, Swedish. Lexical and syntactic similarity seems to be important to create models that are robust enough for domain shifts in the cascaded translation setup.
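The cascaded N-best setup used above (10x10 translation with score interpolation) can be sketched roughly as follows. This is our own illustrative reconstruction, not the authors' code: `translate_nbest` stands in for a decoder call (e.g., Moses with N-best output), and `lam` corresponds to the interpolation weight mentioned with Tables 4 and 5 (assumed to mix the two steps' model scores).

```python
def pivot_translate(src, translate_nbest, step1, step2, n=10, lam=0.85):
    """Cascade two SMT systems through a pivot language.

    translate_nbest(system, sentence, n) is assumed to return a list of
    (hypothesis, score) pairs with higher scores being better. The final
    hypothesis maximizes an interpolation of the two steps' scores over
    all n x n combinations; n=1 gives the one-best (1x1) cascade.
    """
    best, best_score = None, float("-inf")
    for mid, s1 in translate_nbest(step1, src, n):      # step 1: source -> pivot
        for out, s2 in translate_nbest(step2, mid, n):  # step 2: pivot -> target
            score = lam * s1 + (1.0 - lam) * s2         # interpolated score
            if score > best_score:
                best, best_score = out, score
    return best
```

With `n=1` the second system simply translates the single best pivot translation; larger `n` lets a slightly worse intermediate hypothesis win if it leads to a much better second-step translation.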
  Pivot=xx   en→xx   xx→no   en→xx→no
  German     53.09   23.60     3.15
  French     66.47   17.84     5.03
  Swedish    52.62   24.79    10.07++

  Pivot=xx   no→xx   xx→en   no→xx→en
  German     15.02   53.02     5.52
  French     17.69   65.85     8.78
  Swedish    19.72   59.55    16.35++

Table 8: Alternative word-based pivot translations between Norwegian (no) and English (en).

5 Related Work

There is a wide range of pivot language approaches to machine translation, and a number of strategies have been proposed. One of them is often called triangulation and usually refers to the combination of phrase tables (Cohn and Lapata, 2007). Phrase translation probabilities are merged and lexical weights are estimated by bridging word alignment models (Wu and Wang, 2007; Bertoldi et al., 2008). Cascaded translation via pivot languages is discussed by Utiyama and Isahara (2007) and is frequently used by various researchers (de Gispert and Mariño, 2006; Koehn et al., 2009; Wu and Wang, 2009) and commercial systems such as Google Translate. A third strategy is to generate or augment data sets with the help of pivot models. This is, for example, explored by de Gispert and Mariño (2006) and Wu and Wang (2009) (who call it the synthetic method). Pivoting has also been used for paraphrasing and lexical adaptation (Bannard and Callison-Burch, 2005; Crego et al., 2010). Nakov and Ng (2009) investigate pivot languages for resource-poor languages (but only when translating from the resource-poor language). They also use transliteration for adapting models to a new (related) language. Character-level SMT has been used for transliteration (Matthews, 2007; Tiedemann and Nabende, 2009) and also for the translation between closely related languages (Vilar et al., 2007; Tiedemann, 2009a).

6 Conclusions and Discussion

In this paper, we have discussed possibilities to translate via pivot languages on the character level. These models are useful for supporting under-resourced languages and explore strong lexical and syntactic similarities between closely related languages. Such an approach makes it possible to train reasonable translation models even with extremely sparse data sets. Moreover, character-level models introduce an abstraction that reduces the number of unknown words dramatically. In most cases, these unknown words represent information-rich units that bear large portions of the meaning to be translated. The following illustrates this effect on example translations with and without the pivot model:

  Example: Catalan→English (via Spanish)
  Reference:   I have to grade these papers.
  Baseline:    Tincque qualificar these examens.
  Pivot_word:  Tincque qualificar these tests.
  Pivot_char:  I have to grade these papers.

  Example: Macedonian→English (via Bulgarian)
  Reference:   It's a simple matter of self-preservation.
  Baseline:    It's simply a question of [Cyrillic].
  Pivot_word:  That's a matter of [Cyrillic].
  Pivot_char:  It's just a question of yourself.

Leaving unseen words untranslated is not only annoying (especially if the input language uses a different writing system) but often makes translations completely incomprehensible. Pivot translations will still not be perfect (see example two above), but can at least be more intelligible. Character-based models can even take care of tokenization errors such as the one shown above (Tincque should be two words: Tinc que). Fortunately, the generation of non-word sequences (observed as unknown words) does not seem to be a big problem, and no special treatment is required to avoid such output. We would still like to address this issue in future work by adding a word-level LM to character-based SMT. However, Vilar et al. (2007) already showed that this did not have any positive effect in their character-based system. In a second study, we also showed that pivot models can be useful for adapting to a new domain. The use of in-domain pivot data leads to systems that outperform out-of-domain translation models by a large margin. Our findings point to many prospects for future work. For example, we would like to investigate combinations of character-based and word-based models. Character-based models may also be used for treating unknown words only. Multiple-source approaches via several pivots are another possibility to be explored. Finally, we also need to further investigate the robustness of the approach with respect to other language pairs, data sets and learning parameters.
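One of the future directions mentioned above, applying a character-based model only to unknown words, could be sketched as follows. This is purely illustrative and not part of the paper's experiments: `word_translate`, `char_translate` and `vocab` are hypothetical stand-ins for the word-based decoder, the character-based decoder, and the word model's known source vocabulary.

```python
def translate_with_char_backoff(sentence, vocab, word_translate, char_translate):
    """Translate with the word-based system, but route out-of-vocabulary
    tokens through a character-level system instead of leaving them
    untranslated (copied through) by the word model."""
    prepared = []
    for token in sentence.split():
        if token in vocab:
            prepared.append(token)
        else:
            # back off: character-level translation of the unknown token
            prepared.append(char_translate(token))
    return word_translate(" ".join(prepared))
```

In a real system the backoff step would itself be a character-level SMT decoder producing target-side (or pivot-side) word forms; here it is reduced to a single function call to keep the control flow visible.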
References

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 597-604, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Nicola Bertoldi, Madalina Barbaiani, Marcello Federico, and Roldano Cattoni. 2008. Phrase-based statistical machine translation with pivot languages. In Proceedings of the International Workshop on Spoken Language Translation, pages 143-149, Hawaii, USA.

Trevor Cohn and Mirella Lapata. 2007. Machine translation by triangulation: Making effective use of multi-parallel corpora. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 728-735, Prague, Czech Republic, June. Association for Computational Linguistics.

Josep Maria Crego, Aurélien Max, and François Yvon. 2010. Local lexical adaptation in machine translation through triangulation: SMT helping SMT. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 232-240, Beijing, China, August. Coling 2010 Organizing Committee.

A. de Gispert and J.B. Mariño. 2006. Catalan-English statistical machine translation without parallel corpus: Bridging through Spanish. In Proceedings of the 5th Workshop on Strategies for developing Machine Translation for Minority Languages (SALTMIL'06) at LREC, pages 65-68, Genova, Italy.

Mark Fishel and Harri Kirik. 2010. Linguistically motivated unsupervised segmentation for machine translation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pages 1741-1745, Valletta, Malta.

Mark Fishel. 2009. Deeper than words: Morph-based alignment for statistical machine translation. In Proceedings of the Conference of the Pacific Association for Computational Linguistics (PacLing 2009), Sapporo, Japan.

Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 372-379, Rochester, New York, April. Association for Computational Linguistics.

Stig Johansson, Jarle Ebeling, and Knut Hofland. 1996. Coding and aligning the English-Norwegian Parallel Corpus. In K. Aijmer, B. Altenberg, and M. Johansson, editors, Languages in Contrast, pages 87-112. Lund University Press.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 48-54, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic, June. Association for Computational Linguistics.

Philipp Koehn, Alexandra Birch, and Ralf Steinberger. 2009. 462 machine translation systems for Europe. In Proceedings of MT Summit XII, pages 65-72, Ottawa, Canada.

Minh-Thang Luong, Preslav Nakov, and Min-Yen Kan. 2010. A hybrid morpheme-word representation for machine translation of morphologically rich languages. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 148-157, Cambridge, MA, October. Association for Computational Linguistics.

David Matthews. 2007. Machine transliteration of proper names. Master's thesis, School of Informatics, University of Edinburgh.

Preslav Nakov and Hwee Tou Ng. 2009. Improved statistical machine translation for resource-poor languages using related resource-rich languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1358-1367, Singapore, August. Association for Computational Linguistics.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160-167, Sapporo, Japan, July. Association for Computational Linguistics.

Eric Sven Ristad and Peter N. Yianilos. 1998. Learning string edit distance. IEEE Transactions on Pattern Recognition and Machine Intelligence, 20(5):522-532, May.

Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, and Dan Tufiş. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of
the 5th International Conference on Language Resources and Evaluation (LREC), pages 2142-2147.

Graham A. Stephen. 1992. String Search. Technical report, School of Electronic Engineering Science, University College of North Wales, Gwynedd.

Jörg Tiedemann and Peter Nabende. 2009. Translating transliterations. International Journal of Computing and ICT Research, 3(1):33-41.

Jörg Tiedemann. 2009a. Character-based PSMT for closely related languages. In Proceedings of the 13th Annual Conference of the European Association for Machine Translation (EAMT'09), pages 12-19, Barcelona, Spain.

Jörg Tiedemann. 2009b. News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In Recent Advances in Natural Language Processing, volume V, pages 237-248. John Benjamins, Amsterdam/Philadelphia.

Masao Utiyama and Hitoshi Isahara. 2007. A comparison of pivot methods for phrase-based statistical machine translation. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 484-491, Rochester, New York, April. Association for Computational Linguistics.

David Vilar, Jan-Thorsten Peter, and Hermann Ney. 2007. Can we translate letters? In Proceedings of the Second Workshop on Statistical Machine Translation, pages 33-39, Prague, Czech Republic, June. Association for Computational Linguistics.

Hua Wu and Haifeng Wang. 2007. Pivot language approach for phrase-based statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 856-863, Prague, Czech Republic, June. Association for Computational Linguistics.

Hua Wu and Haifeng Wang. 2009. Revisiting pivot language approach for machine translation. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 154-162, Suntec, Singapore, August. Association for Computational Linguistics.
Does more data always yield better translations?

Guillem Gascó, Martha-Alicia Rocha, Germán Sanchis-Trilles,
Jesús Andrés-Ferrer and Francisco Casacuberta
Departament de Sistemes Informàtics i Computació
Universitat Politècnica de València
Camí de Vera s/n, 46022 València, Spain
{ggasco,mrocha,gsanchis,jandres,fcn}@dsic.upv.es

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 152-161,
Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
only for the purpose of selecting bilingual sentences. However, references are not used at any stage within the translation system for obtaining the hypotheses. Note that although we are not able to achieve such an improvement without an oracle, this result restates the BSS problem as an interesting approach not only for reducing computational effort but also for significantly boosting performance. To our knowledge, no previous work has quantified the room for improvement that BSS techniques could attain.

In order to assess the performance of the different BSS techniques, translation results are obtained by using a standard state-of-the-art SMT system (Koehn et al., 2007). The most recent literature defines the SMT problem (Papineni et al., 1998; Och and Ney, 2002) as follows: given an input sentence f from a certain source language, the purpose is to find an output sentence e in a certain target language such that

    ê = argmax_e Σ_{k=1}^{K} λ_k h_k(f, e)    (1)

where h_k(f, e) is a score function representing an important feature for the translation of f into e, for example the language model of the target language, a reordering model or several translation models. The λ_k are the log-linear combination weights.

The main contributions of this paper are:

- A BSS technique is analysed which improves the results obtained with a random bilingual sentence selection strategy when the specific domain to be translated significantly differs from that of the pool of sentences.

- Another BSS technique is analysed that, using less than 0.5% of the sentences available, significantly improves over random selection, beating a system trained with the whole pool of sentences.

- We prove, by means of an oracle, that a wise BSS technique can yield large improvements when compared with systems trained with all data available.

The remainder of the paper is structured as follows. Section 2 summarises the related work. Sections 3 and 4 present two BSS techniques, namely probabilistic sampling and recovery of infrequent n-grams. In Section 5 experimental results are reported. Finally, the main results of the work and several future work directions are discussed in Section 6.

2 Related Work

Training data selection has been receiving an increasing amount of attention within the SMT community. For instance, in (Li et al., 2010; Gasco et al., 2010) several BSS techniques, similar to those analysed in this paper, have been applied for training MT systems when large training corpora are available. However, neither have such techniques been formalised, nor has their performance been thoroughly analysed. A similar approach that gives weights to different subcorpora was proposed in (Matsoukas et al., 2009).

In (Lu et al., 2007), information retrieval methods are used in order to produce different submodels which are then weighted according to the sentence to be translated. In that work, the authors define the baseline as the result obtained by training only with the corpus that shares the same domain as the test set. They then claim that they are able to improve baseline translation quality by adding new sentences retrieved with their method. However, they compare their technique neither with random sentence selection, nor with a model trained with all the corpora.

Although the techniques that are applied for BSS are often very similar to those applied for active learning (AL), both problems are essentially different. Since AL strategies assume that the pool of sentences is not translated, they are usually interested in finding the best monolingual subset of sentences to be translated by a human annotator. In contrast, in BSS it is assumed that a fairly large amount of bilingual corpora is readily available, and the main goal consists in selecting only those sentences which will maximise system performance.

Some works have applied sentence selection in small-scale AL frameworks. These works extend the training corpora with at most 5,000 sentences. In (Ananthakrishnan et al., 2010), sentences are selected by means of discriminative techniques. In (Haffari et al., 2009) a technique is proposed for increasing the counts of phrases that are considered infrequent. Both works significantly differ from the current work not only in the framework, but also in the scale of the experiments, the
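For concreteness, the log-linear decision rule of Eq. (1) can be sketched in a few lines of Python; the two feature functions below are toy stand-ins for illustration only, not the trained language, reordering, or translation models an actual SMT system would use:

```python
def decode(f, candidates, features, weights):
    """Return the candidate translation e maximising Eq. (1):
    the weighted sum of feature scores h_k(f, e)."""
    def score(e):
        return sum(w * h(f, e) for w, h in zip(weights, features))
    return max(candidates, key=score)

# Toy stand-in features (hypothetical): a length-difference penalty
# and a crude source/target word-overlap count.
def h_length(f, e):
    return -abs(len(f.split()) - len(e.split()))

def h_overlap(f, e):
    return len(set(f.split()) & set(e.split()))

best = decode("la casa verde",
              ["the green house", "green house the of of"],
              [h_length, h_overlap],
              [1.0, 0.5])
```

With trained models, each h_k would be a log-probability and the weights λ_k would be tuned on held-out data rather than fixed by hand.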
proposed techniques and the obtained improvements. Similar ideas applied to adaptation problems have been proposed in (Moore and Lewis, 2010; Axelrod et al., 2011).

3 Probabilistic Sampling

As discussed in Section 2, BSS has many meaningful links with AL techniques. Selecting samples for learning our models incurs a well-known difficulty in AL, the so-called sample bias problem (Dasgupta, 2009). This problem, which carries over to the BSS case, can be summarised as the distortion introduced by the active strategy into the probability distribution underlying the training corpus. This bias forces the training algorithm to learn a distorted probability model which can significantly differ from the actual one.

In order to further analyse the sampling bias problem, consider the maximum likelihood estimation (MLE) of a probability model p_θ(e, f) for a given corpus of N data points {(e_n, f_n)}, sampled from the actual probability distribution Pr(e, f). Recall that e denotes a target sentence whereas f stands for its source counterpart. MLE techniques aim at minimising the Kullback-Leibler divergence between the actual unknown probability distribution and the probability model (Bishop, 2006), defined as

    KL(Pr ∥ p_θ) = Σ_{e,f} Pr(e, f) log ( Pr(e, f) / p_θ(e, f) )    (2)

When minimising, Eq. (2) is simplified to

    θ̂ = argmax_θ Σ_{e,f} Pr(e, f) log p_θ(e, f)    (3)

which is approximated by a sufficiently large dataset under the commonly held assumption that it is independently and identically distributed according to Pr(e, f):

    θ̂ = argmax_θ Σ_n log p_θ(e_n, f_n)    (4)

Therefore, by perturbing the sample {(e_n, f_n)} with an active strategy we are, in fact, modifying the approximation to Eq. (3) and learning a different underlying probability distribution.

In this section a statistical framework is proposed to build systems with BSS while avoiding the sample bias. The proposed approach relies on conserving the probability distribution of the task domain by wisely selecting the bilingual pairs to be used from the whole pool of sentences. Hence, it is mandatory to exclude sentences from the pool that distort the actual probability. In order to approximate the probability distribution, we assume that a small but representative corpus is available from the task domain. This corpus, referred to henceforth as the in-domain corpus, provides a way to build an initial model which approximates the actual probability of the system. The pool of sentences will be denoted, by opposition, as the out-of-domain corpus.

The actual probability of the task domain, the so-called in-domain probability, is approximated with the following model

    p(e, f, |e|, |f|) = p(e, f | |e|, |f|) · p(|e|, |f|)    (5)

where p(|e|, |f|) denotes the in-domain length probability, and p(e, f | |e|, |f|) the in-domain bilingual probability.

The length probability is estimated by MLE:

    p(|e|, |f|) = N(|e| + |f|) / N    (6)

where N(|e| + |f|) is the number of bilingual pairs in the in-domain corpus whose lengths sum up to |e| + |f|, and N denotes the total number of sentences. Note that no distinction is made between source and target lengths since the model is intended for sampling.

The complexity of the in-domain bilingual probability distribution, p(e, f | |e|, |f|), requires a more sophisticated approximation

    p(e, f | |e|, |f|) = exp( Σ_k λ_k f_k(e, f) ) / Z    (7)

Z being a normalisation constant, and where f_k(·) and λ_k are the features of the model and their respective parametric weights. Specifically, four logarithmic features were considered for this sampling technique: a direct and an inverse IBM model 4 (Brown et al., 1994), and both source and target 5-gram language models. All feature models are estimated on the in-domain corpus with standard techniques (Brown et al., 1994; Stolcke, 2002). As a first approach, the parameters of the log-linear model in Eq. (7), λ_k, were uniformly fixed to 1.
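A minimal sketch of the two ingredients of this model, assuming a toy in-domain corpus of (target, source) string pairs; the paper's four real features (IBM model 4 scores and 5-gram language model scores) are replaced here by whatever per-pair log scores the caller supplies:

```python
from collections import Counter

def length_probability(corpus):
    """MLE estimate of the combined-length probability, Eq. (6):
    p(|e|, |f|) = N(|e| + |f|) / N over the in-domain corpus."""
    counts = Counter(len(e.split()) + len(f.split()) for e, f in corpus)
    total = sum(counts.values())
    return {length: c / total for length, c in counts.items()}

def loglinear_score(feature_scores, lambdas=None):
    """Unnormalised log numerator of Eq. (7): sum_k lambda_k * f_k(e, f).
    With all lambda_k = 1 (the paper's first approach) this is simply
    the sum of the log feature scores."""
    if lambdas is None:
        lambdas = [1.0] * len(feature_scores)
    return sum(l * s for l, s in zip(lambdas, feature_scores))

# Toy in-domain corpus: combined lengths are 6, 4 and 4 tokens.
in_domain = [("the green house", "la casa verde"),
             ("a house", "una casa"),
             ("the house", "la casa")]
p_len = length_probability(in_domain)
```

The normalisation constant Z of Eq. (7) never needs to be computed explicitly for sampling within a length bucket, since it is shared by all candidates of that bucket.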
Once we have an appropriate model for the in-domain probability distribution, the proposed method randomly samples a given number of bilingual pairs from the out-of-domain corpora (the pool of sentences). The process of extending the in-domain corpus with additional bilingual pairs from the out-of-domain corpus is summarised as follows:

- Decide, according to the in-domain length probability in Eq. (6), how many samples should be drawn for each length, i.e. divide the number of sentences to add into length-dependent buckets.

- Randomly draw the number of samples specified in each bucket according to the in-domain bilingual probability in Eq. (7), among all the bilingual sentences that share the current bucket length.

Although the pool of sentences is typically large, it is not large enough to gather a significant amount of probability mass. Consequently, a small set of sentences accumulates most of the probability mass and tends to be selected multiple times. To avoid this awkward and undesired behaviour, the sampling is performed without replacement.

4 Recovery of Infrequent n-grams

be different from the concatenation of the translations of both words separately.

When selecting sentences from the pool it is important to choose sentences that contain n-grams that have never been seen (or have been seen just a few times) in the training set. Such n-grams will henceforth be referred to as infrequent n-grams. An n-gram is considered infrequent when it appears fewer times than an infrequency threshold t. If the source language sentences to be translated are known beforehand, the set of infrequent n-grams can be reduced to those present in such sentences. Then, the technique consists in selecting from the pool those sentences which contain infrequent n-grams present in the source sentences to be translated.

Sentences in the pool are sorted by their infrequency score in order to select the most informative first. Let X be the set of n-grams that appear in the sentences to be translated and w one of them; C(w) the counts of w in the source language training set; and N(w) the counts of w in the source sentence f to be scored. The infrequency score of f is:

    i(f) = Σ_{w ∈ X} min(1, N(w)) · max(0, t − C(w))    (8)
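The scoring of Eq. (8) translates directly into code. The helper `ngrams`, the toy counts and the threshold below are illustrative choices, not the paper's exact implementation:

```python
from collections import Counter

def ngrams(tokens, n_max=4):
    """All n-grams of the token list up to order n_max."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def infrequency_score(pool_sentence, test_ngrams, train_counts, t):
    """Eq. (8): i(f) = sum over w in X of min(1, N(w)) * max(0, t - C(w)),
    where N(w) counts w in the candidate pool sentence f and C(w) counts
    w in the source-language training set."""
    sent_counts = Counter(ngrams(pool_sentence.split()))
    return sum(min(1, sent_counts[w]) * max(0, t - train_counts[w])
               for w in test_ngrams)

# Toy data: rank pool sentences by how many infrequent test n-grams they cover.
train_counts = Counter(ngrams("the cat sat".split()))
test_ngrams = set(ngrams("the dog ran".split()))
pool = ["a dog ran fast", "the cat sat again"]
ranked = sorted(pool,
                key=lambda s: infrequency_score(s, test_ngrams, train_counts, t=2),
                reverse=True)
```

Once every infrequent n-gram has been covered t times, all remaining pool sentences score 0, which bounds how many sentences can be selected for each t.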
        t = 1          t = 10         t = 25
        tr    all      tr    all      tr    all
1-gr    11.6  1.3      40.5  3.5      59.9  5.1
2-gr    38    9.8      73.2  21.3     84.9  27.9
3-gr    66.8  33.5     91.1  55.7     96.4  64.9
4-gr    87.1  65.8     98.2  85.5     99.4  90.7

Table 1: Percentage of infrequent n-grams in the TED test set when considering only the TED training set (tr), and when adding the out-of-domain pool (all), for different infrequency thresholds t.

Subset   |S|     Language   |W|     |V|
train    47.5K   English    747K    24.6K
                 French     793K    31.7K
dev      571     English    9.2K    1.9K
                 French     10.3K   2.2K
test     641     English    12.6K   2.4K
                 French     12.8K   2.7K

Table 2: TED corpus main figures. K denotes thousands of elements. |S| stands for number of sentences, |W| for number of running words, and |V| for vocabulary size.
Corpus   |S|      Language   |W|      |V|
Euro     1.25M    English    25.6M    81K
                  French     28.2M    101K
UN       5M       English    94.4M    302K
                  French     107M     283K
Giga     15.5M    English    303M     1.6M
                  French     361M     1.6M

Table 4: Figures of the corpora used as sentence pool. M stands for millions of elements.
An effective technique that is commonly used is to reproduce out-of-vocabulary words from the source sentence in the target hypothesis. However, invariable n-grams are usually infrequent as well, which implies that the infrequent n-grams technique would select sentences containing such n-grams, even though they do not provide further information. As a first approach, we exclude n-grams without any letter.

Baseline experiments have been carried out for the TED and NC corpora using the corresponding training set. For comparison purposes, we also included results for a purely random sentence selection without replacement. In the plots, each point corresponding to random selection represents the average of 10 repetitions. Experiments using all data are also reported, although a 64GB machine was necessary, even with binarized phrase and distortion tables.

Experiments were conducted by selecting a fixed amount of sentences according to each one of the techniques described above. Then, these sentences were included into the training data and subsequent SMT systems were built for translating the test set.

Results are shown in terms of BLEU (Papineni et al., 2001), which is an accuracy metric that measures n-gram precision, with a penalty for sentences that are too short. Although it could be argued that the improvements obtained might be due to a side effect of the brevity penalty, this was not found to be true: the BSS techniques (including random) and the system considering all data yielded very similar brevity penalties (±0.005) within each corpus. In addition, TER scores (Snover et al., 2006) were also computed, but are omitted for clarity and because they were found to be coherent with BLEU. TER is an error metric that computes the minimum number of edits required to modify the system hypotheses so that they match the reference translations.

Figure 2: Combined length relative frequency.

5.2 Results for Probabilistic Sampling

In addition to the probabilistic sampling technique proposed in Section 3, we also analysed the effect of sampling only according to the combined source-reference length, with the purpose of establishing whether potential improvements were only due to the length component, or rather to the complete sampling model. Results for the 2009 test set are shown in Figure 1. Several things should be noted:

- Performing sentence selection only according to sentence lengths does not achieve better performance than random selection.

- Selecting sentences according to probabilistic sampling is able to improve over random selection in the case of the TED corpus, but is not able to do so in the case of the NC corpus. Significance tests for the 500K case reported that the differences were significant in the case of the TED corpus, but not in the case of the NC corpus.

- In the case of the TED corpus, the performance achieved with the system built by sampling 500K sentences is only 0.5 BLEU points below the performance achieved by the system built with all the data available.

The explanation for the fact that probabilistic sampling is able to improve over random sampling only in the case of the TED corpus, but not in the case of NC, lies in the nature of the corpora. Although both of them belong to a very generic domain, their characteristics are very different. In fact, the NC data is very similar to the sentences in the pool but, in contrast, the sentences present in the TED corpus have a much more different structure. This difference is illustrated in Figure 2, where the relative frequency of
Figure 1: Effect of adding sentences over the BLEU score using the probabilistic sampling, length sampling and random selection techniques for the two corpora, TED and News Commentary. Horizontal lines represent the scores when using just the in-domain training set and all the data available.
Figure 3: Effect of adding sentences over the BLEU score using the infrequent n-grams (with different thresholds) and random selection techniques for the two corpora, TED and News Commentary. Horizontal lines represent the scores when using just the in-domain training set and all the data available.
each combined sentence length is shown. In this plot, it stands out clearly that the TED corpus has a very different length distribution than the other four corpora considered, whereas the NC corpus presents a very similar distribution. This implies that, when considering TED, an intelligent data selection strategy will have better chances of improving over random selection than in the case of NC.

5.3 Results for Infrequent n-grams Recovery

Figure 3 shows the effect of adding sentences using the infrequent n-grams and the random selection techniques on the 2009 test set. Once all the infrequent n-grams have been covered t times, the infrequency score for all the sentences remaining in the pool is 0, and none of them can be selected. Hence, the number of sentences that can be selected for each t is limited. Although for clarity we only show results for t = {10, 25}, experiments have also been carried out for t = {1, 5, 10, 25}. Such results presented similar curves, although fewer sentences can be selected and hence the improvements obtained are slightly lower. Several conclusions can be drawn:

- The translation quality provided by the infrequent n-grams technique is significantly better than the results achieved with random selection for similar amounts of sentences. Specifically, the improvements obtained are in the range of 3 BLEU points.

- Results for the TED corpus are more irregular. The best performance is achieved for t = 25 and 50K sentences added. In NC, the best result is for t = 10 and 112K.

- Selecting sentences with the infrequent n-grams technique provides better results than including all the available data. While using less than 0.5% of the data, improvements between 0.5 and 1 BLEU points are achieved.

When looking at Figure 3, one might suspect that t needs to be set specifically for a given test
set, and that results from one set are not to be extrapolated to other test sets. For this reason, we selected the best configuration in Figure 3 and used it to build a new system for translating the unseen NC 2010 test set. This experiment, with t = 10 and including all sentences with score greater than 0 (~110K), is shown in Table 5 and evidences that the improvements are actually coherent among different test sets.

technique        BLEU   TER    #phrases
in-domain        19.0   65.2   5.1M
all data         22.7   60.8   1236M
infreq. t = 10   23.6   59.2   16.5M

Table 5: Effect of the infrequent n-gram recovery technique for an unseen test set when setting t = 10; #phrases denotes the number of phrases (parameters) of the models.

Src  the budget has also been criticised by klaus .
Bsl  le budget a également été criticised par m. klaus .
Rdm  le budget a également été critiquées par m. klaus .
PS   le budget a également été critiquée par klaus .
All  le budget a également été critiqué par klaus .
Infr le budget a également été critiqué par klaus .
Ref  klaus critique également le budget .

Src  and one has come from music .
Bsl  et un a de la musique .
Rdm  et on vient de musique .
PS   et on a viennent de musique .
All  et de la musique .
Infr et un est venu de la musique .
Ref  et un vient du monde de la musique .

Figure 4: Examples of two translations for each of the SMT systems built: Src (source sentence), Bsl (baseline), Rdm (random selection), PS (probabilistic sampling), All (all the data available), Infr (infrequent n-grams) and Ref (reference).
best suited for dealing with a domain-specific test set. This adaptation process ought to be achieved by means of a (potentially small) adaptation set which belongs to the same domain as the test data. In contrast, BSS tackles the problem of how to select samples from a large pool of training data, regardless of whether such pool is in-domain or out-of-domain. Hence, in one case we can assume we have a fairly well estimated translation model, which is to be adapted, whereas in BSS we still have full control over the estimation of such a model and need not aim at a specific domain, although it might often be so.

BSS is related to instance weighting (Jiang and Zhai, 2007; Foster et al., 2010). Adaptation and BSS can be considered orthogonal (yet complementary) problems under the instance weighting paradigm. In such a case, instance weighting can be considered to span a complete paradigmatic space between both: at one end there is sample selection (BSS for SMT), while at the other end there is adaptation. For instance, it is quite common to confront the adaptation problem by extracting different phrase tables from different corpora, and then interpolating such tables. This technique could also be applied to promote the performance of the system built by means of BSS. However, this is left as future work.

We thoroughly analysed two BSS approaches that obtain competitive results while using a small fraction of the training data, although there is still much to be gained. For instance, oracle results have also been reported in this work, yielding improvements of up to 10 BLEU points. Even though the use of an oracle typically implies that the results obtained are not realistic, recall that the proposed oracle is special, in the sense that it only uses the reference sentences for the specific purpose of selecting training samples, but the references are not included into the training data as such. This is useful for assessing the potential behind BSS: ideally, if we were able to design a BSS strategy that, without using the references, would select exactly those training samples, we would be boosting system performance by 10 BLEU points. This re-states BSS as a compelling technique that has not yet received the attention it deserves.

BSS is not aimed at optimising computational requirements, but does so as a byproduct. This may seem a minor point, but it would allow running more experiments with the same resources, using larger corpora or even more complex techniques, such as synchronous grammars or hierarchical models. For instance, the infrequent n-grams technique has beaten all the other systems using just a small fraction of the corpus, only 0.5%, and is yet able to outperform a system trained with all the data by 0.9 BLEU points and the random baseline by 3 points. This baseline has been proved difficult to beat by other works.

Preliminary experiments were performed in order to analyse the perplexity of the references, the number of out-of-vocabulary words (OoVs) and the ratio of target-source phrases. These experiments revealed that the improvements obtained are largely correlated with a decrease in perplexity and in the number of OoVs. On the one hand, reducing the amount of OoVs was mirrored by an important improvement in BLEU when the amount of additional data was small, and also entailed a decrease in perplexity. However, a reduction in perplexity by itself did not always imply significant improvements. Moreover, no real conclusion could be drawn from the analysis of the target-source phrase ratio. Hence, we understand that the improvements obtained are provided mainly by a more specialised estimation of the model parameters. However, further experiments should still be conducted in order to verify this conclusion.

Acknowledgments

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement nr. 287755. This work was also supported by the Spanish MEC/MICINN under the MIPRCV Consolider Ingenio 2010 program (CSD2007-00018) and the iTrans2 (TIN2009-14511) project. Also supported by the Spanish MITyC under the erudito.com (TSI-020110-2009-439) project and by Instituto Tecnológico de León, DGEST-PROMEP and CONACYT, Mexico.
References

Sankaranarayanan Ananthakrishnan, Rohit Prasad, David Stallard, and Prem Natarajan. 2010. Discriminative sample selection for statistical machine translation. In Proc. of EMNLP, pages 626-635, Cambridge, MA, October.

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proc. of EMNLP, pages 355-362.

Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1994. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.

Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 Workshop on Statistical Machine Translation. In Proc. of the WSMT, pages 1-28, Athens, Greece, March.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar Zaidan. 2010. Findings of the 2010 joint Workshop on Statistical Machine Translation and Metrics for Machine Translation. In Proc. of the MATR (ACL), pages 17-53, Uppsala, Sweden, July.

Sanjoy Dasgupta. 2009. The two faces of active learning. In Proc. of the Twentieth Conference on Algorithmic Learning Theory, page 1, Porto, Portugal, October.

George Foster, Cyril Goutte, and Roland Kuhn. 2010. Discriminative instance weighting for domain adaptation in statistical machine translation. In Proc. of EMNLP, pages 451-459, Cambridge, MA, October.

Guillem Gasco, Vicent Alabau, Jesus Andres-Ferrer, Jesus Gonzalez-Rubio, Martha-Alicia Rocha, German Sanchis-Trilles, Francisco Casacuberta, Jorge Gonzalez, and Joan-Andreu Sanchez. 2010. ITI-UPV system description for IWSLT 2010. In Proc. of IWSLT 2010, Paris, France, December.

Gholamreza Haffari, Maxim Roy, and Anoop Sarkar. 2009. Active learning for statistical phrase-based machine translation. In Proc. of HLT/NAACL 2009, pages 415-423, Morristown, NJ, USA.

Jing Jiang and ChengXiang Zhai. 2007. Instance weighting for domain adaptation in NLP. In Proc. of ACL 2007, pages 264-271.

Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. Proc. of ICASSP, II:181-184, May.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of ACL, pages 177-180.

Zhifei Li, Chris Callison-Burch, Chris Dyer, Juri Ganitkevitch, Ann Irvine, Sanjeev Khudanpur, Lane Schwartz, Wren Thornton, Ziyuan Wang, Jonathan Weese, and Omar Zaidan. 2010. Joshua 2.0: A toolkit for parsing-based machine translation with syntax, semirings, discriminative training and other goodies. In Proc. of the MATR (ACL), pages 139-143, Uppsala, Sweden, July.

Yajuan Lu, Jin Huang, and Qun Liu. 2007. Improving statistical machine translation performance by training data selection and optimization. In Proc. of EMNLP-CoNLL, pages 343-350, Prague, Czech Republic, June.

Spyros Matsoukas, Antti-Veikko I. Rosti, and Bing Zhang. 2009. Discriminative corpus weight estimation for machine translation. In Proc. of EMNLP, pages 708-717, Singapore, August.

Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In ACL (Short Papers), pages 220-224.

Franz J. Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proc. of ACL, pages 295-302.

Franz J. Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Kishore Papineni, Salim Roukos, and Todd Ward. 1998. Maximum likelihood and discriminative training of direct translation models. In Proc. of ICASSP 1998, pages 189-192.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. Bleu: A method for automatic evaluation of machine translation. Technical Report RC22176 (W0109-022).

Michael Paul, Marcello Federico, and Sebastian Stüker. 2010. Overview of the IWSLT 2010 evaluation campaign. In Proc. of IWSLT 2010, Paris, France, December.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proc. of AMTA 2006.

Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proc. of ICSLP.

Elia Yuste, Manuel Herranz, Antonio Lagarda, Lionel Tarazon, Isaias Sanchez-Cortina, and Francisco Casacuberta. 2010. PangeaMT - putting open standards to work... well. In Proc. of AMTA 2010, Denver, CO, USA, November.
Recall-Oriented Learning of Named Entities in Arabic Wikipedia
Behrang Mohit Nathan Schneider Rishav Bhowmick Kemal Oflazer Noah A. Smith
School of Computer Science, Carnegie Mellon University
P.O. Box 24866, Doha, Qatar / Pittsburgh, PA 15213, USA
{behrang@,nschneid@cs.,rishavb@qatar.,ko@cs.,nasmith@cs.}cmu.edu
This paper considers named entity recognition (NER) in text that is different from most past research on NER. Specifically, we consider Arabic Wikipedia articles with diverse topics beyond the commonly-used news domain. These data challenge past approaches in two ways:

First, Arabic is a morphologically rich language (Habash, 2010). Named entities are referenced using complex syntactic constructions (cf. English NEs, which are primarily sequences of proper nouns). The Arabic script suppresses most vowels, increasing lexical ambiguity, and lacks capitalization, a key clue for English NER.

Second, much research has focused on the use of news text for system building and evaluation. Wikipedia articles are not news, belonging instead to a wide range of domains that are not clearly

Experiments show consistent gains on the challenging problem of identifying named entities in Arabic Wikipedia text.

2 Arabic Wikipedia NE Annotation

Most of the effort in NER has been focused around a small set of domains and general-purpose entity classes relevant to those domains, especially the categories PER(SON), ORG(ANIZATION), and LOC(ATION) (POL), which are highly prominent in news text. Arabic is no exception: the publicly available NER corpora (ACE (Walker et al., 2006), ANER (Benajiba et al., 2008), and OntoNotes (Hovy et al., 2006)) all are in the news domain.2 However,

1 The annotated dataset and a supplementary document with additional details of this work can be found at: http://www.ark.cs.cmu.edu/AQMAR

2 OntoNotes contains news-related text. ACE includes some text from blogs. In addition to the POL classes, both corpora include additional NE classes such as facility, event, product, vehicle, etc. These entities are infrequent and may not be comprehensive enough to cover the larger set of pos-
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 162-173,
Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
        History                Science            Sports                   Technology
dev:    Damascus               Atom               Raul Gonzales            Linux
        Imam Hussein Shrine    Nuclear power      Real Madrid              Solaris
test:   Crusades               Enrico Fermi       2004 Summer Olympics     Computer
        Islamic Golden Age     Light              Christiano Ronaldo       Computer Software
        Islamic History        Periodic Table     Football                 Internet
        Ibn Tolun Mosque       Physics            Portugal football team   Richard Stallman
        Ummaya Mosque          Muhammad al-Razi   FIFA World Cup           X Window System

Sample NEs with standard and article-specific classes (Arabic script not reproduced): Claudio Filippone (PER); Linux (SOFTWARE); Spanish League (CHAMPIONSHIPS); proton (PARTICLE); nuclear radiation (GENERIC-MISC); Real Zaragoza (ORG).

Table 1: Translated titles of Arabic Wikipedia articles in our development and test sets, and some NEs with standard and article-specific classes. Additionally, Prussia and Amman were reserved for training annotators, and Gulf War for estimating inter-annotator agreement.
…appropriate entity classes will vary widely by domain; occurrence rates for entity classes are quite different in news text vs. Wikipedia, for instance (Balasuriya et al., 2009). This is abundantly clear in technical and scientific discourse, where much of the terminology is domain-specific, but it holds elsewhere. Non-POL entities in the history domain, for instance, include important events (wars, famines) and cultural movements (romanticism). Ignoring such domain-critical entities likely limits the usefulness of the NE analysis.

Recognizing this limitation, some work on NER has sought to codify more robust inventories of general-purpose entity types (Sekine et al., 2002; Weischedel and Brunstein, 2005; Grouin et al., 2011) or to enumerate domain-specific types (Settles, 2004; Yao et al., 2003). Coarse, general-purpose categories have also been used for semantic tagging of nouns and verbs (Ciaramita and Johnson, 2003). Yet as the number of classes or domains grows, rigorously documenting and organizing the classes – even for a single language – requires intensive effort. Ideally, an NER system would refine the traditional classes (Hovy et al., 2011) or identify new entity classes when they arise in new domains, adapting to new data. For this reason, we believe it is valuable to consider NER systems that identify (but do not necessarily label) entity mentions, and also to consider annotation schemes that allow annotators more freedom in defining entity classes.

Our aim in creating an annotated dataset is to provide a testbed for evaluation of new NER models. We will use these data as development and testing examples, but not as training data. In §4 we will discuss our semisupervised approach to learning, which leverages ACE and ANER data as an annotated training corpus.

2.1 Annotation Strategy

We conducted a small annotation project on Arabic Wikipedia articles. Two college-educated native Arabic speakers annotated about 3,000 sentences from 31 articles. We identified four topical areas of interest – history, technology, science, and sports – and browsed these topics until we had found 31 articles that we deemed satisfactory on the basis of length (at least 1,000 words), cross-lingual linkages (associated articles in English, German, and Chinese³), and subjective judgments of quality. The list of these articles along with sample NEs is presented in table 1. These articles were then preprocessed to extract main article text (eliminating tables, lists, info-boxes, captions, etc.) for annotation.

Our approach follows ACE guidelines (LDC, 2005) in identifying NE boundaries and choosing POL tags. In addition to this traditional form of annotation, annotators were encouraged to articulate one to three salient, article-specific entity categories per article. For example, names of particles (e.g., proton) are highly salient in the Atom article. Annotators were asked to read the entire article first, and then to decide which non-traditional classes of entities would be important in the context of the article. In some cases, annotators reported using heuristics (such as being proper

³ These three languages have the most articles on Wikipedia. Associated articles here are those that have been manually hyperlinked from the Arabic page as cross-lingual correspondences. They are not translations, but if the associations are accurate, these articles should be topically similar to the Arabic page that links to them.
Inter-annotator agreement:

  Token position agreement rate    92.6%    Cohen's κ: 0.86
  Token agreement rate             88.3%    Cohen's κ: 0.86
  Token F1 between annotators      91.0%
  Entity boundary match F1         94.0%
  Entity category match F1         87.4%

Article-specific entity classes (sample): History articles (Gulf War, Prussia, Damascus, Crusades): WAR, CONFLICT. Science articles (Atom, Periodic table): THEORY, CHEMICAL, NAME, ROMAN, PARTICLE.
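The κ values above are Cohen's kappa over the two annotators' parallel token-level decisions. As a generic illustration of the statistic (our own sketch, not the project's scripts), kappa discounts raw agreement by the agreement expected from each annotator's label distribution:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' parallel label sequences."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both annotators independently
    # pick the same label, given their marginal label distributions.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[lab] * cb[lab] for lab in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: two annotators' BIO decisions over six tokens.
a = ["B", "I", "O", "O", "B", "O"]
b = ["B", "I", "O", "B", "B", "O"]
print(round(cohens_kappa(a, b), 3))  # 0.739
```

In the paper's setting the labels would be token-level entity tags; the same computation applies to any categorical annotation.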
…NEs from English articles to their Arabic counterparts (Hassan et al., 2007), automatically clustering non-canonical types of entities into article-specific or cross-article classes (cf. Freitag, 2004), or using non-canonical classes to improve the (author-specified) article categories in Wikipedia.

Hereafter, we merge all article-specific categories with the generic MIS category. The proportion of entity mentions that are tagged as MIS, while varying to a large extent by document, is a major indication of the gulf between the news data (<10%) and the Wikipedia data (53% for the development set, 37% for the test set).

Below, we aim to develop entity detection models that generalize beyond the traditional POL entities. We do not address here the challenges of automatically classifying entities or inferring non-canonical groupings.

3 Data

Table 4 summarizes the various corpora used in this work.⁶ Our NE-annotated Wikipedia subcorpus, described above, consists of several Arabic Wikipedia articles from four focus domains.⁷ We do not use these for supervised training data; they serve only as development and test data. A larger set of Arabic Wikipedia articles, selected on the basis of quality heuristics, serves as unlabeled data for semisupervised learning.

Table 4: Number of words (entity mentions) in data sets.

                                       words       NEs
  Training
    ACE+ANER                           212,839     15,796
    Wikipedia (unlabeled, 397 docs)    1,110,546
  Development
    ACE                                7,776       638
    Wikipedia (4 domains, 8 docs)      21,203      2,073
  Test
    ACE                                7,789       621
    Wikipedia (4 domains, 20 docs)     52,650      3,781

Our out-of-domain labeled NE data is drawn from the ANER (Benajiba et al., 2007) and ACE-2005 (Walker et al., 2006) newswire corpora. Entity types in this data are POL categories (PER, ORG, LOC) and MIS. Portions of the ACE corpus were held out as development and test data; the remainder is used in training.

4 Models

Our starting point for statistical NER is a feature-based linear model over sequences, trained using the structured perceptron (Collins, 2002).⁸

In addition to lexical and morphological⁹ features known to work well for Arabic NER (Benajiba et al., 2008; Abdul-Hamid and Darwish, 2010), we incorporate some additional features enabled by Wikipedia. We do not employ a gazetteer, as the construction of a broad-domain gazetteer is a significant undertaking orthogonal to the challenges of a new text domain like Wikipedia.¹⁰ A descriptive list of our features is available in the supplementary document.

We use a first-order structured perceptron; none of our features consider more than a pair of consecutive BIO labels at a time. The model enforces the constraint that NE sequences must begin with B (so the bigram ⟨O, I⟩ is disallowed).

Training this model on ACE and ANER data achieves performance comparable to the state of the art (F1-measure¹¹ above 69%), but fares much worse on our Wikipedia test set (F1-measure around 47%); details are given in §5.

4.1 Recall-Oriented Perceptron

By augmenting the perceptron's online update with a cost function term, we can incorporate a task-dependent notion of error into the objective, as with structured SVMs (Taskar et al., 2004; Tsochantaridis et al., 2005). Let c(y, y′) denote a measure of error when y is the correct label sequence but y′ is predicted. For observed sequence x and feature weights (model parameters) w, the structured hinge loss is

    ℓhinge(x, y, w) = max_{y′} ( w⊤g(x, y′) + c(y, y′) ) − w⊤g(x, y)    (1)

The maximization problem inside the parentheses is known as cost-augmented decoding.

⁶ Additional details appear in the supplement.
⁷ We downloaded a snapshot of Arabic Wikipedia (http://ar.wikipedia.org) on 8/29/2009 and preprocessed the articles to extract main body text and metadata using the mwlib package for Python (PediaPress, 2010).
⁸ A more leisurely discussion of the structured perceptron and its connection to empirical risk minimization can be found in the supplementary document.
⁹ We obtain morphological analyses from the MADA tool (Habash and Rambow, 2005; Roth et al., 2008).
¹⁰ A gazetteer ought to yield further improvements in line with previous findings in NER (Ratinov and Roth, 2009).
¹¹ Though optimizing NER systems for F1 has been called into question (Manning, 2006), no alternative metric has achieved widespread acceptance in the community.
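Cost-augmented decoding, the inner maximization in Eq. 1, is ordinary Viterbi search with the word-local cost added to each token's score. The sketch below is our own illustration, not the authors' implementation; it also enforces the constraint stated above that an entity must begin with B (no ⟨O, I⟩ bigram, and no sentence-initial I):

```python
LABELS = ("B", "I", "O")

def allowed(prev, cur):
    # An entity must begin with B: I may not start a sentence or follow O.
    return not (cur == "I" and prev in (None, "O"))

def viterbi(n, score, cost=None, gold=None):
    """argmax over BIO sequences of sum_i [score(i, prev, cur) (+ cost(cur, gold[i]))].
    With `cost` supplied, this is cost-augmented decoding: it finds the
    sequence whose model score *plus* error cost is highest, i.e. the
    most-violated output used in the cost-augmented perceptron update."""
    NEG = float("-inf")
    best = [dict() for _ in range(n)]  # best[i][cur] = (score, backpointer)
    for i in range(n):
        prevs = (None,) if i == 0 else LABELS
        for cur in LABELS:
            top = (NEG, None)
            for prev in prevs:
                if not allowed(prev, cur):
                    continue
                base = 0.0 if i == 0 else best[i - 1][prev][0]
                if base == NEG:
                    continue
                s = base + score(i, prev, cur)
                if cost is not None:
                    s += cost(cur, gold[i])
                if s > top[0]:
                    top = (s, prev)
            best[i][cur] = top
    cur = max(LABELS, key=lambda lab: best[n - 1][lab][0])
    seq = [cur]
    for i in range(n - 1, 0, -1):
        cur = best[i][cur][1]
        seq.append(cur)
    return seq[::-1]
```

Because gold labels incur zero cost, augmentation can only make wrong labels more attractive during this search; the perceptron update then penalizes exactly those attractive mistakes.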
If c factors similarly to the feature function g(x, y), then we can increase penalties for y′ that have more local mistakes. This raises the learner's awareness about how it will be evaluated. Incorporating cost-augmented decoding into the perceptron leads to this decoding step:

    ŷ ← argmax_{y′} ( w⊤g(x, y′) + c(y, y′) ),    (2)

which amounts to performing stochastic subgradient ascent on an objective function with the Eq. 1 loss (Ratliff et al., 2006).

In this framework, cost functions can be formulated to distinguish between different types of errors made during training. For a tag sequence y = ⟨y1, y2, …, yM⟩, Gimpel and Smith (2010b) define word-local cost functions that differently penalize precision errors (i.e., yi = O ∧ y′i ≠ O for the ith word), recall errors (yi ≠ O ∧ y′i = O), and entity class/position errors (other cases where y′i ≠ yi). As will be shown below, a key problem in cross-domain NER is poor recall, so we will penalize recall errors more severely:

    c(y, y′) = Σ_{i=1..M} { 0 if y′i = yi;  β if yi ≠ O ∧ y′i = O;  1 otherwise }    (3)

for a penalty parameter β > 1. We call our learner the recall-oriented perceptron (ROP).

We note that Minkov et al. (2006) similarly explored the recall vs. precision tradeoff in NER. Their technique was to directly tune the weight of a single feature, the feature marking O (non-entity tokens); a lower weight for this feature will incur a greater penalty for predicting O. Below we demonstrate that our method, which is less coarse, is more successful in our setting.¹²

In our experiments we will show that injecting arrogance into the learner via the recall-oriented loss function substantially improves recall, especially for non-POL entities (§5.3).

4.2 Self-Training and Semisupervised Learning

As we will show experimentally, the differences between news text and Wikipedia text call for domain adaptation. In the case of Arabic Wikipedia, there is no available labeled training data. Yet the available unlabeled data is vast, so we turn to semisupervised learning.

  Input: labeled data ⟨⟨x(n), y(n)⟩⟩ for n = 1..N; unlabeled data ⟨x(j)⟩ for j = 1..J;
         supervised learner L; number of iterations T′
  Output: w
    w ← L(⟨⟨x(n), y(n)⟩⟩_{n=1..N})
    for t = 1 to T′ do
        for j = 1 to J do
            y(j) ← argmax_y w⊤g(x(j), y)
        w ← L(⟨⟨x(n), y(n)⟩⟩_{n=1..N} ∪ ⟨⟨x(j), y(j)⟩⟩_{j=1..J})
  Algorithm 1: Self-training.

Here we adapt self-training, a simple technique that leverages a supervised learner (like the perceptron) to perform semisupervised learning (Clark et al., 2003; Mihalcea, 2004; McClosky et al., 2006). In our version, a model is trained on the labeled data, then used to label the unlabeled target data. We iterate between training on the hypothetically-labeled target data plus the original labeled set, and relabeling the target data; see Algorithm 1. Before self-training, we remove sentences hypothesized not to contain any named entity mentions, which we found avoids further encouragement of the model toward low recall.

5 Experiments

We investigate two questions in the context of NER for Arabic Wikipedia:

Loss function: Does integrating a cost function into our learning algorithm, as we have done in the recall-oriented perceptron (§4.1), improve recall and overall performance on Wikipedia data?

Semisupervised learning for domain adaptation: Can our models benefit from large amounts of unlabeled Wikipedia data, in addition to the (out-of-domain) labeled data? We experiment with a self-training phase following the fully supervised learning phase.

We report experiments for the possible combinations of the above ideas. These are summarized in table 5. Note that the recall-oriented perceptron can be used for the supervised learning phase, for the self-training phase, or both. This leaves us with the following combinations:

reg/none (baseline): regular supervised learner.
ROP/none: recall-oriented supervised learner.

¹² The distinction between the techniques is that our cost function adjusts the whole model in order to perform better at recall on the training data.
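The word-local cost of Eq. 3 takes only a few lines. This is our own sketch of the definition above, with β as the recall-error penalty (tuned on development data; see figure 1):

```python
def recall_oriented_cost(gold, pred, beta):
    """Eq. 3: 0 for a correct tag, beta for a recall error (a gold entity
    token predicted O), and 1 for any other mistake. Assumes two equal-length
    BIO tag sequences and beta > 1."""
    assert len(gold) == len(pred)
    total = 0.0
    for g, p in zip(gold, pred):
        if p == g:
            continue
        if g != "O" and p == "O":  # false negative: penalized harder
            total += beta
        else:                      # false positive or class/position error
            total += 1.0
    return total
```

With the tuned supervised value β = 200, a single recall error outweighs hundreds of other mistakes, which is what pushes the learner toward predicting entities.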
Figure 1: Tuning the recall-oriented cost parameter β for different learning settings. We optimized for development set F1, choosing penalty β = 200 for recall-oriented supervised learning (in the plot, ROP/*; this is regardless of whether a stage of self-training will follow); β = 100 for recall-oriented self-training following recall-oriented supervised learning (ROP/ROP); and β = 3200 for recall-oriented self-training following regular supervised learning (reg/ROP).

reg/reg: standard self-training setup.
ROP/reg: recall-oriented supervised learner, followed by standard self-training.
reg/ROP: regular supervised model as the initial labeler for recall-oriented self-training.
ROP/ROP (the double ROP condition): recall-oriented supervised model as the initial labeler for recall-oriented self-training. Note that the two ROPs can use different cost parameters.

For evaluating our models we consider the named entity detection task, i.e., recognizing which spans of words constitute entities. This is measured by per-entity precision, recall, and F1.¹³ To measure statistical significance of differences between models we use Gimpel and Smith's (2010) implementation of the paired bootstrap resampler of Koehn (2004), taking 10,000 samples for each comparison.

5.1 Baseline

Our baseline is the perceptron, trained on the POL entity boundaries in the ACE+ANER corpus (reg/none).¹⁴ Development data was used to select the number of iterations (10). We performed 3-fold cross-validation on the ACE data and found wide variance in the in-domain entity detection performance of this model:

             P      R      F1
  fold 1     70.43  63.08  66.55
  fold 2     87.48  81.13  84.18
  fold 3     65.09  51.13  57.27
  average    74.33  65.11  69.33

(Fold 1 corresponds to the ACE test set described in table 4.) We also trained the model to perform POL detection and classification, achieving nearly identical results in the 3-way cross-validation of ACE data. From these data we conclude that our baseline is on par with the state of the art for Arabic NER on ACE news text (Abdul-Hamid and Darwish, 2010).¹⁵

Here is the performance of the baseline entity detection model on our 20-article test set:¹⁶

               P      R      F1
  technology   60.42  20.26  30.35
  science      64.96  25.73  36.86
  history      63.09  35.58  45.50
  sports       71.66  59.94  65.28
  overall      66.30  35.91  46.59

Unsurprisingly, performance on Wikipedia data varies widely across article domains and is much lower than in-domain performance. Precision scores fall between 60% and 72% for all domains, but recall in most cases is far worse. Miscellaneous class recall, in particular, suffers badly (under 10%), which partially accounts for the poor recall in science and technology articles (they have by far the highest proportion of MIS entities).

5.2 Self-Training

Following Clark et al. (2003), we applied self-training as described in Algorithm 1, with the perceptron as the supervised learner. Our unlabeled data consists of 397 Arabic Wikipedia articles (1 million words) selected at random from all articles exceeding a simple length threshold (1,000 words); see table 4. We used only one iteration (T′ = 1), as experiments on development data showed no benefit from additional rounds. Several rounds of self-training hurt performance,

¹³ Only entity spans that exactly match the gold spans are counted as correct. We calculated these scores with the conlleval.pl script from the CoNLL 2003 shared task.
¹⁴ In keeping with prior work, we ignore non-POL categories for the ACE evaluation.
¹⁵ Abdul-Hamid and Darwish report as their best result a macroaveraged F1-score of 76. As they do not specify which data they used for their held-out test set, we cannot perform a direct comparison. However, our feature set is nearly a superset of their best feature set, and their result lies well within the range of results seen in our cross-validation folds.
¹⁶ Our Wikipedia evaluations use models trained on POLM entity boundaries in ACE. Per-domain and overall scores are microaverages across articles.
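The detection metric just described (exact span match, as computed by conlleval) can be reproduced in a few lines. This sketch is ours, not the shared-task script, and it collapses all classes since only detection is being scored:

```python
def entity_spans(bio):
    """(start, end) spans of entities in a BIO sequence, end-exclusive."""
    out, start = [], None
    for i, tag in enumerate(bio):
        if tag == "B":            # a new entity begins
            if start is not None:
                out.append((start, i))
            start = i
        elif tag == "O":          # close any open entity
            if start is not None:
                out.append((start, i))
            start = None
        elif start is None:       # stray I: leniently open an entity
            start = i
    if start is not None:
        out.append((start, len(bio)))
    return out

def detection_prf(gold, pred):
    """Per-entity precision, recall, F1; only exact span matches count."""
    g, p = set(entity_spans(gold)), set(entity_spans(pred))
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

A predicted entity that overlaps a gold entity without matching both boundaries counts as both a false positive and a false negative, which is why boundary errors are doubly punished by this metric.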
                         SELF-TRAINING
  SUPERVISED      none              reg               ROP
                  P    R    F1      P    R    F1      P    R    F1
  reg             66.3 35.9 46.59   66.7 35.6 46.41   59.2 40.3 47.97
  ROP             60.9 44.7 51.59   59.8 46.2 52.11   58.0 47.4 52.16

Table 5: Entity detection precision, recall, and F1 for each learning setting, microaveraged across the 20 articles in our Wikipedia test set. Rows differ in the supervised learning condition on the ACE+ANER data (regular vs. recall-oriented perceptron). Columns indicate whether this supervised learning phase was followed by self-training on unlabeled Wikipedia data, and if so which version of the perceptron was used for self-training.

an effect attested in earlier research (Curran et al., 2007) and sometimes known as semantic drift.

Results are shown in table 5. We find that standard self-training (the middle column) has very little impact on performance.¹⁷ Why is this the case? We venture that poor baseline recall and the domain variability within Wikipedia are to blame.

5.3 Recall-Oriented Learning

The recall-oriented bias can be introduced in either or both of the stages of our semisupervised learning framework: in the supervised learning phase, modifying the objective of our baseline (§5.1); and within the self-training algorithm (§5.2).¹⁸ As noted in §4.1, the aim of this approach is to discourage recall errors (false negatives), which are the chief difficulty for the news-text-trained model in the new domain. We selected the value of the false negative penalty for cost-augmented decoding, β, using the development data (figure 1).

The results in table 5 demonstrate improvements due to the recall-oriented bias in both stages of learning.¹⁹ When used in the supervised phase (bottom left cell), the recall gains are substantial: nearly 9% over the baseline. Integrating this bias within self-training (last column of the table) produces a more modest improvement (less than 3%) relative to the baseline. In both cases, the improvements to recall more than compensate for the amount of degradation to precision. This trend is robust: wherever the recall-oriented perceptron is added, we observe improvements in both recall and F1. Perhaps surprisingly, these gains are somewhat additive: using the ROP in both learning phases gives a small (though not always significant) gain over alternatives (standard supervised perceptron, no self-training, or self-training with a standard perceptron). In fact, when the standard supervised learner is used, recall-oriented self-training succeeds despite the ineffectiveness of standard self-training.

Performance breakdowns by (gold) class, figure 2, and domain, figure 3, further attest to the robustness of the overall results. The most dramatic gains are in miscellaneous class recall: each form of the recall bias produces an improvement, and using this bias in both the supervised and self-training phases is clearly most successful for miscellaneous entities. Correspondingly, the technology and science domains (in which this class dominates: 83% and 61% of mentions,

¹⁷ In neither case does regular self-training produce a significantly different F1 score than no self-training.
¹⁸ Standard Viterbi decoding was used to label the data within the self-training algorithm; note that cost-augmented decoding only makes sense in learning, not as a prediction technique, since it deliberately introduces errors relative to a correct output that must be provided.
¹⁹ In terms of F1, the worst of the 3 models with the ROP supervised learner significantly outperforms the best model with the regular supervised learner (p < 0.005). The improvements due to self-training are marginal, however: ROP self-training produces a significant gain only following regular supervised learning (p < 0.05).
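Algorithm 1 is a thin wrapper around any supervised learner. A schematic rendering (ours; `learn` and `predict` stand in for perceptron training and Viterbi decoding, and the filter implements the removal of sentences hypothesized to contain no entities described in §4.2):

```python
def self_train(labeled, unlabeled, learn, predict, iterations=1):
    """Train on labeled data, auto-label unlabeled target-domain sentences,
    then retrain on the union; repeat T' times (the paper uses T' = 1).
    `labeled` is a list of (x, y) pairs; `learn` maps such a list to a model;
    `predict(model, x)` returns a hypothesized BIO sequence for x."""
    model = learn(labeled)
    for _ in range(iterations):
        auto = [(x, predict(model, x)) for x in unlabeled]
        # Drop sentences hypothesized to contain no entity mentions; the
        # paper found this avoids pushing the model further toward low recall.
        auto = [(x, y) for x, y in auto if any(tag != "O" for tag in y)]
        model = learn(labeled + auto)
    return model
```

Note that all retained auto-labeled sentences are used; unlike some self-training variants, there is no confidence-based filtering (see §7).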
versus 6% and 12% for history and sports, respectively) receive the biggest boost. Still, the gaps between domains are not entirely removed.

Figure 3: Supervised learner precision vs. recall as evaluated on Wikipedia test data in different topical domains. The regular perceptron (baseline model) is contrasted with ROP. No self-training is applied.

Most improvements relate to the reduction of false negatives, which fall into three groups: (a) entities occurring infrequently or partially in the labeled training data (e.g. uranium); (b) domain-specific entities sharing lexical or contextual features with the POL entities (e.g. Linux, titanium); and (c) words with Latin characters, common in the science and technology domains. (a) and (b) are mostly transliterations into Arabic.

An alternative, and simpler, approach to controlling the precision-recall tradeoff is the Minkov et al. (2006) strategy of tuning a single feature weight subsequent to learning (see §4.1 above). We performed an oracle experiment to determine how this compares to recall-oriented learning in our setting. An oracle trained with the method of Minkov et al. outperforms the three models in table 5 that use the regular perceptron for the supervised phase of learning, but underperforms the supervised ROP conditions.²⁰

Overall, we find that incorporating the recall-oriented bias in learning is fruitful for adapting to Wikipedia because the gains in recall outpace the damage to precision.

6 Discussion

To our knowledge, this work is the first suggestion that substantively modifying the supervised learning criterion in a resource-rich domain can reap benefits in subsequent semisupervised application in a new domain. Past work has looked at regularization (Chelba and Acero, 2006) and feature design (Daumé III, 2007); we alter the loss function. Not surprisingly, the double-ROP approach harms performance on the original domain (on ACE data, we achieve 55.41% F1, far below the standard perceptron). Yet we observe that models can be prepared for adaptation even before a learner is exposed to a new domain, sacrificing performance in the original domain.

The recall-oriented bias is not merely encouraging the learner to identify entities already seen in training. As recall increases, so does the number of new entity types recovered by the model: of the 2,070 NE types in the test data that were never seen in training, only 450 were ever found by the baseline, versus 588 in the reg/ROP condition, 632 in the ROP/none condition, and 717 in the double-ROP condition.

We note finally that our method is a simple extension to the standard structured perceptron; cost-augmented inference is often no more expensive than traditional inference, and the algorithmic change is equivalent to adding one additional feature. Our recall-oriented cost function is parameterized by a single value, β; recall is highly sensitive to the choice of this value (figure 1 shows how we tuned it on development data), and thus we anticipate that, in general, such tuning will be essential to leveraging the benefits of arrogance.

²⁰ Tuning the O feature weight to optimize for F1 on our test set, we found that oracle precision would be 66.2, recall would be 39.0, and F1 would be 49.1. The F1 score of our best model is nearly 3 points higher than the Minkov et al.-style oracle, and over 4 points higher than the non-oracle version where the development set is used for tuning.

7 Related Work

Our approach draws on insights from work in the areas of NER, domain adaptation, NLP with Wikipedia, and semisupervised learning. As all are broad areas of research, we highlight only the most relevant contributions here.

Research in Arabic NER has been focused on compiling and optimizing the gazetteers and feature sets for standard sequential modeling algorithms (Benajiba et al., 2008; Farber et al., 2008; Shaalan and Raza, 2008; Abdul-Hamid and Darwish, 2010). We make use of features identified in this prior work to construct a strong baseline system. We are unaware of any Arabic NER work that has addressed diverse text domains like Wikipedia. Both the English and Arabic versions of Wikipedia have been used, however, as resources in service of traditional NER (Kazama and Torisawa, 2007; Benajiba et al., 2008). Attia et al. (2010) heuristically induce a mapping between Arabic Wikipedia and Arabic WordNet to construct Arabic NE gazetteers.

Balasuriya et al. (2009) highlight the substantial divergence between entities appearing in English Wikipedia versus traditional corpora, and the effects of this divergence on NER performance. There is evidence that models trained on Wikipedia data generalize and perform well on corpora with narrower domains. Nothman et al. (2009) and Balasuriya et al. (2009) show that NER models trained on both automatically and manually annotated Wikipedia corpora perform reasonably well on news corpora. The reverse scenario does not hold for models trained on news text, a result we also observe in Arabic NER. Other work has gone beyond the entity detection problem: Florian et al. (2004) additionally predict within-document entity coreference for Arabic, Chinese, and English ACE text, while Cucerzan (2007) aims to resolve every mention detected in English Wikipedia pages to a canonical article devoted to the entity in question.

The domain and topic diversity of NEs has been studied in the framework of domain adaptation research. A group of these methods use self-training and select the most informative features and training instances to adapt a source domain learner to the new target domain. Wu et al. (2009) bootstrap the NER learner with a subset of unlabeled instances that bridge the source and target domains. Jiang and Zhai (2006) and Daumé III (2007) make use of some labeled target-domain data to tune or augment the features of the source model towards the target domain. Here, in contrast, we use labeled target-domain data only for tuning and evaluation. Another important distinction is that domain variation in this prior work is restricted to topically-related corpora (e.g. newswire vs. broadcast news), whereas in our work, major topical differences distinguish the training and test corpora, and consequently their salient NE classes. In these respects our NER setting is closer to that of Florian et al. (2010), who recognize English entities in noisy text; of Surdeanu et al. (2011), which concerns information extraction in a topically distinct target domain; and of Dalton et al. (2011), which addresses English NER in noisy and topically divergent text.

Self-training (Clark et al., 2003; Mihalcea, 2004; McClosky et al., 2006) is widely used in NLP and has inspired related techniques that learn from automatically labeled data (Liang et al., 2008; Petrov et al., 2010). Our self-training procedure differs from some others in that we use all of the automatically labeled examples, rather than filtering them based on a confidence score.

Cost functions have been used in non-structured classification settings to penalize certain types of errors more than others (Chan and Stolfo, 1998; Domingos, 1999; Kiddon and Brun, 2011). The goal of optimizing our structured NER model for recall is quite similar to the scenario explored by Minkov et al. (2006), as noted above.

8 Conclusion

We explored the problem of learning an NER model suited to domains for which no labeled training data are available. A loss function to encourage recall over precision during supervised discriminative learning substantially improves recall and overall entity detection performance, especially when combined with a semisupervised learning regimen incorporating the same bias. We have also developed a small corpus of Arabic Wikipedia articles via a flexible entity annotation scheme spanning four topical domains (publicly available at http://www.ark.cs.cmu.edu/AQMAR).

Acknowledgments

We thank Mariem Fekih Zguir and Reham Al Tamime for assistance with annotation, Michael Heilman for his tagger implementation, and Nizar Habash and colleagues for the MADA toolkit. We thank members of the ARK group at CMU, Hal Daumé, and anonymous reviewers for their valuable suggestions. This publication was made possible by grant NPRP-08-485-1-083 from the Qatar National Research Fund (a member of the Qatar Foundation). The statements made herein are solely the responsibility of the authors.
References

Ahmed Abdul-Hamid and Kareem Darwish. 2010. Simplified feature set for Arabic named entity recognition. In Proceedings of the 2010 Named Entities Workshop, pages 110–115, Uppsala, Sweden, July. Association for Computational Linguistics.

Mohammed Attia, Antonio Toral, Lamia Tounsi, Monica Monachini, and Josef van Genabith. 2010. An automatically built named entity lexicon for Arabic. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA).

Bogdan Babych and Anthony Hartley. 2003. Improving machine translation quality with automatic named entity recognition. In Proceedings of the 7th International EAMT Workshop on MT and Other Language Technology Tools, EAMT '03.

Dominic Balasuriya, Nicky Ringland, Joel Nothman, Tara Murphy, and James R. Curran. 2009. Named entity recognition in Wikipedia. In Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources, pages 10–18, Suntec, Singapore, August. Association for Computational Linguistics.

Yassine Benajiba, Paolo Rosso, and José Miguel Benedí Ruiz. 2007. ANERsys: an Arabic named entity recognition system based on maximum entropy. In Alexander Gelbukh, editor, Proceedings of CICLing, pages 143–153, Mexico City, Mexico. Springer.

Yassine Benajiba, Mona Diab, and Paolo Rosso. 2008. Arabic named entity recognition using optimized feature sets. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 284–293, Honolulu, Hawaii, October. Association for Computational Linguistics.

Philip K. Chan and Salvatore J. Stolfo. 1998. Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 164–168, New York City, New York, USA, August. AAAI Press.

Ciprian Chelba and Alex Acero. 2006. Adaptation of maximum entropy capitalizer: Little data can help a lot. Computer Speech and Language, 20(4):382–399.

Massimiliano Ciaramita and Mark Johnson. 2003. Supersense tagging of unknown nouns in WordNet. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 168–175.

Stephen Clark, James Curran, and Miles Osborne. 2003. Bootstrapping POS-taggers using unlabelled data. In Walter Daelemans and Miles Osborne, editors, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 49–55.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.

Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 708–716, Prague, Czech Republic, June.

James R. Curran, Tara Murphy, and Bernhard Scholz. 2007. Minimising semantic drift with Mutual Exclusion Bootstrapping. In Proceedings of PACLING, 2007.

Jeffrey Dalton, James Allan, and David A. Smith. 2011. Passage retrieval for incorporating global evidence in sequence labeling. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM '11), pages 355–364, Glasgow, Scotland, UK, October. ACM.

Hal Daumé III. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 256–263, Prague, Czech Republic, June. Association for Computational Linguistics.

Pedro Domingos. 1999. MetaCost: a general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 155–164.

Benjamin Farber, Dayne Freitag, Nizar Habash, and Owen Rambow. 2008. Improving NER in Arabic using a morphological tagger. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, and Daniel Tapias, editors, Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), pages 2509–2514, Marrakech, Morocco, May. European Language Resources Association (ELRA).

Radu Florian, Hany Hassan, Abraham Ittycheriah, Hongyan Jing, Nanda Kambhatla, Xiaoqiang Luo, Nicolas Nicolov, and Salim Roukos. 2004. A statistical model for multilingual entity detection and tracking. In Susan Dumais, Daniel Marcu, and Salim Roukos, editors, Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 1–8, Boston, Massachusetts, USA, May. Association for Computational Linguistics.

Dirk Hovy, Chunliang Zhang, Eduard Hovy, and Anselmo Peñas. 2011. Unsupervised discovery of domain-specific knowledge from text. In Proceed-
Computational Linguistics. ings of the 49th Annual Meeting of the Association
Radu Florian, John Pitrelli, Salim Roukos, and Imed for Computational Linguistics: Human Language
Zitouni. 2010. Improving mention detection ro- Technologies, pages 14661475, Portland, Oregon,
bustness to noisy input. In Proceedings of EMNLP USA, June. Association for Computational Linguis-
2010, pages 335345, Cambridge, MA, October. tics.
Association for Computational Linguistics. Jing Jiang and ChengXiang Zhai. 2006. Exploit-
Dayne Freitag. 2004. Trained named entity recog- ing domain structure for named entity recognition.
nition using distributional clusters. In Dekang Lin In Proceedings of the Human Language Technol-
and Dekai Wu, editors, Proceedings of EMNLP ogy Conference of the NAACL (HLT-NAACL), pages
2004, pages 262269, Barcelona, Spain, July. As- 7481, New York City, USA, June. Association for
sociation for Computational Linguistics. Computational Linguistics.
Kevin Gimpel and Noah A. Smith. 2010a. Softmax- Junichi Kazama and Kentaro Torisawa. 2007.
margin CRFs: Training log-linear models with loss Exploiting Wikipedia as external knowledge for
functions. In Proceedings of the Human Language named entity recognition. In Proceedings of
Technologies Conference of the North American the 2007 Joint Conference on Empirical Meth-
Chapter of the Association for Computational Lin- ods in Natural Language Processing and Com-
guistics, pages 733736, Los Angeles, California, putational Natural Language Learning (EMNLP-
USA, June. CoNLL), pages 698707, Prague, Czech Republic,
Kevin Gimpel and Noah A. Smith. 2010b. June. Association for Computational Linguistics.
Softmax-margin training for structured log- Chloe Kiddon and Yuriy Brun. 2011. Thats what
linear models. Technical Report CMU-LTI- she said: double entendre identification. In Pro-
10-008, Carnegie Mellon University. http: ceedings of the 49th Annual Meeting of the Associ-
//www.lti.cs.cmu.edu/research/ ation for Computational Linguistics: Human Lan-
reports/2010/cmulti10008.pdf. guage Technologies, pages 8994, Portland, Ore-
Cyril Grouin, Sophie Rosset, Pierre Zweigenbaum, gon, USA, June. Association for Computational
Karn Fort, Olivier Galibert, and Ludovic Quin- Linguistics.
tard. 2011. Proposal for an extension of tradi-
Philipp Koehn. 2004. Statistical significance tests for
tional named entities: from guidelines to evaluation,
machine translation evaluation. In Dekang Lin and
an overview. In Proceedings of the 5th Linguis-
Dekai Wu, editors, Proceedings of EMNLP 2004,
tic Annotation Workshop, pages 92100, Portland,
pages 388395, Barcelona, Spain, July. Association
Oregon, USA, June. Association for Computational
for Computational Linguistics.
Linguistics.
LDC. 2005. ACE (Automatic Content Extraction)
Nizar Habash and Owen Rambow. 2005. Arabic to-
Arabic annotation guidelines for entities, version
kenization, part-of-speech tagging and morpholog-
5.3.3. Linguistic Data Consortium, Philadelphia.
ical disambiguation in one fell swoop. In Proceed-
ings of the 43rd Annual Meeting of the Associa- Percy Liang, Hal Daume III, and Dan Klein. 2008.
tion for Computational Linguistics (ACL05), pages Structure compilation: trading structure for fea-
573580, Ann Arbor, Michigan, June. Association tures. In Proceedings of the 25th International Con-
for Computational Linguistics. ference on Machine Learning (ICML), pages 592
Nizar Habash. 2010. Introduction to Arabic Natural 599, Helsinki, Finland.
Language Processing. Morgan and Claypool Pub- Chris Manning. 2006. Doing named entity recogni-
lishers. tion? Dont optimize for F1 . http://nlpers.
Ahmed Hassan, Haytham Fahmy, and Hany Hassan. blogspot.com/2006/08/doing-named-
2007. Improving named entity translation by ex- entity-recognition-dont.html.
ploiting comparable and parallel corpora. In Pro- David McClosky, Eugene Charniak, and Mark John-
ceedings of the Conference on Recent Advances son. 2006. Effective self-training for parsing. In
in Natural Language Processing (RANLP 07), Proceedings of the Human Language Technology
Borovets, Bulgaria. Conference of the NAACL, Main Conference, pages
Eduard Hovy, Mitchell Marcus, Martha Palmer, 152159, New York City, USA, June. Association
Lance Ramshaw, and Ralph Weischedel. 2006. for Computational Linguistics.
OntoNotes: the 90% solution. In Proceedings of Rada Mihalcea. 2004. Co-training and self-training
the Human Language Technology Conference of for word sense disambiguation. In HLT-NAACL
the NAACL (HLT-NAACL), pages 5760, New York 2004 Workshop: Eighth Conference on Computa-
City, USA, June. Association for Computational tional Natural Language Learning (CoNLL-2004),
Linguistics. Boston, Massachusetts, USA.
172
Einat Minkov, Richard Wang, Anthony Tomasic, and Khaled Shaalan and Hafsa Raza. 2008. Arabic
William Cohen. 2006. NER systems that suit users named entity recognition from diverse text types. In
preferences: adjusting the recall-precision trade-off Advances in Natural Language Processing, pages
for entity extraction. In Proceedings of the Human 440451. Springer.
Language Technology Conference of the NAACL, Mihai Surdeanu, David McClosky, Mason R. Smith,
Companion Volume: Short Papers, pages 9396, Andrey Gusev, and Christopher D. Manning. 2011.
New York City, USA, June. Association for Com- Customizing an information extraction system to
putational Linguistics. a new domain. In Proceedings of the ACL 2011
Luke Nezda, Andrew Hickl, John Lehmann, and Sar- Workshop on Relational Models of Semantics, Port-
mad Fayyaz. 2006. What in the world is a Shahab? land, Oregon, USA, June. Association for Compu-
Wide coverage named entity recognition for Arabic. tational Linguistics.
In Proccedings of LREC, pages 4146. Ben Taskar, Carlos Guestrin, and Daphne Koller.
Joel Nothman, Tara Murphy, and James R. Curran. 2004. Max-margin Markov networks. In Sebastian
2009. Analysing Wikipedia and gold-standard cor- Thrun, Lawrence Saul, and Bernhard Scholkopf,
pora for NER training. In Proceedings of the 12th editors, Advances in Neural Information Processing
Conference of the European Chapter of the Associ- Systems 16. MIT Press.
ation for Computational Linguistics (EACL 2009), Antonio Toral, Elisa Noguera, Fernando Llopis, and
pages 612620, Athens, Greece, March. Associa- Rafael Munoz. 2005. Improving question an-
tion for Computational Linguistics. swering using named entity recognition. Natu-
PediaPress. 2010. mwlib. http://code. ral Language Processing and Information Systems,
pediapress.com/wiki/wiki/mwlib. 3513/2005:181191.
Slav Petrov, Pi-Chuan Chang, Michael Ringgaard, and Ioannis Tsochantaridis, Thorsten Joachims, Thomas
Hiyan Alshawi. 2010. Uptraining for accurate de- Hofmann, and Yasemin Altun. 2005. Large margin
terministic question parsing. In Proceedings of the methods for structured and interdependent output
2010 Conference on Empirical Methods in Natural variables. Journal of Machine Learning Research,
Language Processing, pages 705713, Cambridge, 6:14531484, September.
MA, October. Association for Computational Lin- Christopher Walker, Stephanie Strassel, Julie Medero,
guistics. and Kazuaki Maeda. 2006. ACE 2005 multi-
Lev Ratinov and Dan Roth. 2009. Design chal- lingual training corpus. LDC2006T06, Linguistic
lenges and misconceptions in named entity recog- Data Consortium, Philadelphia.
nition. In Proceedings of the Thirteenth Confer- Ralph Weischedel and Ada Brunstein. 2005.
ence on Computational Natural Language Learning BBN pronoun coreference and entity type cor-
(CoNLL-2009), pages 147155, Boulder, Colorado, pus. LDC2005T33, Linguistic Data Consortium,
June. Association for Computational Linguistics. Philadelphia.
Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Dan Wu, Wee Sun Lee, Nan Ye, and Hai Leong Chieu.
Zinkevich. 2006. Subgradient methods for maxi- 2009. Domain adaptive bootstrapping for named
mum margin structured learning. In ICML Work- entity recognition. In Proceedings of the 2009 Con-
shop on Learning in Structured Output Spaces, ference on Empirical Methods in Natural Language
Pittsburgh, Pennsylvania, USA. Processing, pages 15231532, Singapore, August.
Ryan Roth, Owen Rambow, Nizar Habash, Mona Association for Computational Linguistics.
Diab, and Cynthia Rudin. 2008. Arabic morpho- Tianfang Yao, Wei Ding, and Gregor Erbach. 2003.
logical tagging, diacritization, and lemmatization CHINERS: a Chinese named entity recognition sys-
using lexeme models and feature ranking. In Pro- tem for the sports domain. In Proceedings of the
ceedings of ACL-08: HLT, pages 117120, Colum- Second SIGHAN Workshop on Chinese Language
bus, Ohio, June. Association for Computational Processing, pages 5562, Sapporo, Japan, July. As-
Linguistics. sociation for Computational Linguistics.
Satoshi Sekine, Kiyoshi Sudo, and Chikashi Nobata.
2002. Extended named entity hierarchy. In Pro-
ceedings of LREC.
Burr Settles. 2004. Biomedical named entity recogni-
tion using conditional random fields and rich feature
sets. In Nigel Collier, Patrick Ruch, and Adeline
Nazarenko, editors, COLING 2004 International
Joint workshop on Natural Language Processing in
Biomedicine and its Applications (NLPBA/BioNLP)
2004, pages 107110, Geneva, Switzerland, Au-
gust. COLING.
173
Tree Representations in Probabilistic Models for Extended Named
Entities Detection
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 174–184, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
Figure 1: Examples of structured named entities annotated on the data used in this work.

[...] several tree representations, which result in different parsing models with different performances. We provide a detailed evaluation of our models. Results can be compared with those obtained in the evaluation campaign where the same data were used. Our system outperforms the best system of the evaluation campaign by a significant margin.

The rest of the paper is structured as follows: in the next section we introduce the extended named entities used in this work, in section 3 we describe our two-steps algorithm for parsing entity trees, and in section 4 we detail the second step of our approach, based on syntactic parsing approaches; in particular we describe the different tree representations used in this work to encode entity trees in parsing models. In section 5 we discuss related work, in section 6 we describe and comment on experiments, and finally, in section 7, we draw some conclusions.

2 Extended Named Entities

The most important aspect of the NER task we investigated is provided by the tree structure of named entities. Examples of such entities are given in figures 1 and 2, where words have been removed for readability; they are: (90 persons are still present at Atambua. It's there that 3 employees of the High Commissariat of the United Nations for refugees have been killed yesterday morning):

    90 personnes toujours présentes à Atambua c'est là qu'hier matin ont été tués 3 employés du haut commissariat des Nations unies aux réfugiés, le HCR

Words realizing entities in figure 2 are in bold, and they correspond to the tree leaves in the picture. As we see in the figures, entities can have complex structures. Beyond the use of subtypes, like individual in person (to give pers.ind), or administrative in organization (to give org.adm), entities with more specific content can be constituents of more general entities to form tree structures, like name.first and name.last for pers.ind, or val (for value) and object for amount.

These named entities have been annotated on transcriptions of French broadcast news coming from several radio channels. The transcriptions constitute a corpus that has been split into training, development and evaluation sets. The evaluation set, in particular, is composed of two sets of data, Broadcast News (BN in the table) and Broadcast Conversations (BC in the table). The evaluation of the models presented in this work is performed on the merge of the two data types. Some statistics of the corpus are reported in tables 1 and 2. This set of named entities has been defined in order to provide finer semantic information for entities found in the data, e.g. a person is better specified by first and last name, and is fully described in (Grouin, 2011). In order to avoid confusion, entities that can be associated directly to words, like name.first, name.last, val and object, are called entity constituents, components or entity pre-terminals (as they are pre-terminal nodes in the trees). The other entities, like pers.ind or amount, are called entities or non-terminal entities, depending on the context.

Figure 2: An example of a named entity tree corresponding to the entities of a whole sentence. Tree leaves, corresponding to sentence words, have been removed to keep readability.

Table 1: Statistics on the training and development sets of the Quaero corpus

    Quaero                     training                  dev
                          words      entities      words   entities
    # sentences                43,251                   112
    # tokens           1,251,432       245,880      2,659        570
    # vocabulary          39,631           134        891         30
    # components                       133,662                   971
    # components dict.                      28                    18
    # OOV rate [%]                                   17.15          0

3 Models Cascade for Extended Named Entities

Since the task of Named Entity Recognition presented here cannot be modeled as sequence labelling and, as mentioned previously, an approach
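To make the distinction between components and non-terminal entities concrete, a tree like the ones in the figures can be sketched as a nested (label, children) structure. This is a hypothetical illustration of the data model, not the annotation format used in the campaign:

```python
# Hypothetical sketch: a structured named entity as a nested (label, children) tree.
# Components (pre-terminals) dominate words; non-terminal entities dominate nodes.
def components(tree):
    """Collect (component, word) pairs from a nested (label, children) tree."""
    label, children = tree
    if all(isinstance(c, str) for c in children):
        # pre-terminal node: its children are sentence words
        return [(label, w) for w in children]
    pairs = []
    for child in children:
        pairs.extend(components(child))
    return pairs

# pers.ind built from its two components, as described in section 2
pers = ("pers.ind", [("name.first", ["Nicolas"]), ("name.last", ["Sarkozy"])])
print(components(pers))  # [('name.first', 'Nicolas'), ('name.last', 'Sarkozy')]
```

In this encoding the component labels are exactly the labels the first, word-level step has to predict, while the non-terminal entities only appear in the tree above them.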
coming from syntactic parsing to perform named entity annotation in one shot is not robust on the data used in this work, we adopt a two-steps approach. The first step is designed to be robust to noisy data and is used to annotate entity components, while the second is used to parse complete entity trees and is based on a relatively simple model. Since we are dealing with noisy data, the hardest part of the task is indeed to annotate components on words. On the other hand, since entity trees are relatively simple, at least much simpler than syntactic trees, once entity components have been annotated in a first step, a complex model is not required for the second step, which would also make the processing slower. Taking all these issues into account, the two steps of our system for tree-structured named entity recognition are performed as follows:

1. A CRF model (Lafferty et al., 2001) is used to annotate components on words.

2. A PCFG model (Johnson, 1998) is used to parse complete entity trees upon components, i.e. using components annotated by CRF as starting point.

This processing schema is depicted in figure 3. Conditional Random Fields are described shortly in the next subsection. PCFG models, which constitute the main part of this work together with the analysis of tree representations, are described in more detail in the next sections.

Figure 3: Processing schema of the two-steps approach proposed in this work: CRF plus PCFG.

Table 2: Statistics on the test set of the Quaero corpus, divided in Broadcast News (BN) and Broadcast Conversations (BC)

    Quaero                     test BN                 test BC
                          words    entities       words    entities
    # sentences              1,704                   3,933
    # tokens              32,945        2,762      69,414       2,769
    # vocabulary              28                        28
    # components                        4,128                   4,017
    # components dict.                     21                      20
    # OOV rate [%]          3.63            0        3.84           0

3.1 Conditional Random Fields

CRFs are particularly suitable for sequence labelling tasks (Lafferty et al., 2001). Beyond the possibility to include a huge number of features using the same framework as Maximum Entropy models (Berger et al., 1996), CRF models encode global conditional probabilities normalized at sentence level.

Given a sequence of N words W_1^N = w_1, ..., w_N and its corresponding component sequence E_1^N = e_1, ..., e_N, CRF trains the conditional probabilities

    P(E_1^N | W_1^N) = \frac{1}{Z} \prod_{n=1}^{N} \exp\Big( \sum_{m=1}^{M} \lambda_m h_m(e_{n-1}, e_n, w_{n-2}^{n+2}) \Big)    (1)

where \lambda_m are the training parameters and h_m(e_{n-1}, e_n, w_{n-2}^{n+2}) are the feature functions capturing dependencies of entities and words. Z is the partition function:

    Z = \sum_{E_1^N} \prod_{n=1}^{N} H(e_{n-1}, e_n, w_{n-2}^{n+2})    (2)

which ensures that probabilities sum up to one. e_{n-1} and e_n are the components for the previous and current words, and H(e_{n-1}, e_n, w_{n-2}^{n+2}) is an abbreviation for \exp\big( \sum_{m=1}^{M} \lambda_m h_m(e_{n-1}, e_n, w_{n-2}^{n+2}) \big), i.e. the set of active feature functions at the current position in the sequence.

In the last few years different CRF implementations have been realized. The implementation we refer to in this work is the one described in (Lavergne et al., 2010), which optimizes the following objective function:

    -\log(P(E_1^N | W_1^N)) + \rho_1 \|\lambda\|_1 + \frac{\rho_2}{2} \|\lambda\|_2^2    (3)

\|\lambda\|_1 and \|\lambda\|_2^2 are the \ell_1 and \ell_2 regularizers (Riezler and Vasserman, 2004), which together in a linear combination implement the elastic net regularizer (Zou and Hastie, 2005). As mentioned in (Lavergne et al., 2010), this kind of regularizer is very effective for feature selection at training time, which is a very good point when dealing with noisy data and big sets of features.
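The partition function of equation 2 sums over all label sequences, but it can be computed in polynomial time with the standard forward algorithm. A minimal numeric sketch, assuming the per-position factors H are already computed (this is an illustration, not the wapiti implementation):

```python
def partition(H):
    """Forward algorithm for Z = sum_{E} prod_n H[n][e_prev][e], where
    H[n][i][j] is the exponentiated feature score for (e_{n-1}=i, e_n=j).
    Row 0 of H[0] plays the role of a dummy start label."""
    K = len(H[0][0])                    # number of labels
    alpha = list(H[0][0])               # forward scores at position 0
    for n in range(1, len(H)):
        alpha = [sum(alpha[i] * H[n][i][j] for i in range(K)) for j in range(K)]
    return sum(alpha)

# Sanity check against brute-force enumeration on 2 positions, 2 labels
H = [[[1.0, 2.0], [0.5, 1.0]], [[1.0, 3.0], [2.0, 0.5]]]
brute = sum(H[0][0][a] * H[1][a][b] for a in range(2) for b in range(2))
print(partition(H), brute)  # 9.0 9.0
```

The recursion is why normalization at sentence level stays tractable even though the number of label sequences grows exponentially with the sentence length.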
4 Models for Parsing Trees

The models used in this work for parsing entity trees refer to the models described in (Johnson, 1998), (Charniak, 1997; Caraballo and Charniak, 1997) and (Charniak et al., 1998), which constitute the basis of the maximum entropy model for parsing described in (Charniak, 2000). A similar lexicalized model has also been proposed by Collins (Collins, 1997). All these models are based on a PCFG trained from data and used in a chart parsing algorithm to find the best parse for the given input. The PCFG model of (Johnson, 1998) is made of rules of the form:

    X_i \rightarrow X_j X_k    (4)
    X_i \rightarrow w    (5)

Figure 4: Baseline tree representations used in the PCFG parsing model.

Figure 5: Filler-parent tree representations used in the PCFG parsing model.
Figure 6: Parent-context tree representations used in the PCFG parsing model.

Figure 7: Parent-node tree representations used in the PCFG parsing model.

Figure 8: Parent-node-filler tree representations used in the PCFG parsing model.

[...] contextualized so as to be distinguished from the other fillers. In the first representation we give the filler the same label as the parent node, while in the second representation we use a concatenation of the filler and the label of the parent node. These two representations are shown in figures 5 and 6, respectively. The first one will be referred to as filler-parent, while the second will be referred to as parent-context. A problem that may be introduced by the first representation is that some entities that originally were used only as non-terminal entities will also appear as components, i.e. entities annotated on words. This may introduce some ambiguity.

Another possible contextualization is to annotate each node with the label of the parent node. This representation is shown in figure 7 and will be referred to as parent-node. Intuitively, this representation is effective since entities annotated directly on words also provide the entity of the parent node. However this representation drastically increases the number of entities, in particular the number of components, which in our case are the set of labels to be learned by the CRF model. For the same reason this representation produces more rigid models, since label sequences vary widely and are thus not likely to match sequences not seen in the training data.

Finally, another interesting tree representation is a variation of the parent-node tree, where entity fillers are only distinguished from fillers not in an entity, using the label ne-filler, but are not contextualized with entity information. This representation is shown in figure 8 and will be referred to as parent-node-filler. It is a good trade-off between contextual information and rigidity: it still represents entities as concatenations of labels, while using a common special label for entity fillers. This allows keeping the number of entities annotated on words, i.e. components, lower.

Using different tree representations affects both the structure and the performance of the parsing model. The structure is described in the next section, the performance in the evaluation section.

4.2 Structure of the Model

Lexicalized models for syntactic parsing described in (Charniak, 2000; Charniak et al., 1998) and (Collins, 1997) integrate more information than what is used in equations 4 and 5. Considering a particular node in the entity tree, not including terminals, the information used is:

- s: the head word of the node, i.e. the most important word of the chunk covered by the current node
- h: the head word of the parent node
- t: the entity tag of the current node
- l: the entity tag of the parent node

The head word of the parent node is defined by percolating head words from children nodes to parent nodes, giving priority to verbs. Head words can be found using automatic approaches based on word and entity tag co-occurrence or mutual information. Using this information, the model described in (Charniak et al., 1998) is P(s|h,t,l). Being conditioned on several pieces of information, this model can be affected by data sparsity problems. Thus, the model is actually approximated as an interpolation of probabilities:

    P(s|h,t,l) = \lambda_1 P(s|h,t,l) + \lambda_2 P(s|c_h,t,l) + \lambda_3 P(s|t,l) + \lambda_4 P(s|t)    (6)
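The backoff behaviour of equation 6 can be sketched with dictionary-backed estimates. The tables, weights and example values below are hypothetical; only the interpolation structure follows the text:

```python
# Sketch of equation 6: interpolating estimates with decreasing conditioning.
def p_head(s, h, t, l, tables, weights, cluster):
    """tables = (P_full, P_clust, P_tl, P_t): dicts mapping conditioning
    tuples to probabilities; weights = (l1, l2, l3, l4), tuned parameters;
    cluster maps a head word h to its cluster id c_h."""
    P_full, P_clust, P_tl, P_t = tables
    l1, l2, l3, l4 = weights
    return (l1 * P_full.get((s, h, t, l), 0.0)
            + l2 * P_clust.get((s, cluster.get(h), t, l), 0.0)
            + l3 * P_tl.get((s, t, l), 0.0)
            + l4 * P_t.get((s, t), 0.0))

# Even with the fully conditioned tables empty (an unseen (s, h, t, l)),
# the lower-order terms still give a non-zero estimate.
tables = ({}, {}, {("Sarkozy", "name.last", "pers.ind"): 0.4},
          {("Sarkozy", "name.last"): 0.5})
p = p_head("Sarkozy", "Nicolas", "name.last", "pers.ind",
           tables, (0.4, 0.3, 0.2, 0.1), {})
print(round(p, 3))  # 0.13
```

This is exactly the point made in the text: when the richer conditioning cannot be estimated reliably, the model still provides a probability from the terms conditioned on less information.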
where \lambda_i, i = 1, ..., 4, are parameters of the model to be tuned, and c_h is the cluster of head words for a given entity tag t. With such a model, when not all pieces of information are available to reliably estimate the probability with more conditioning, the model can still provide a probability from terms conditioned on less information. The use of head words and their percolation over the tree is called lexicalization. The goal of tree lexicalization is to add lexical information all over the tree. This way the probability of all rules can also be conditioned on lexical information, allowing the probabilities P(s|h,t,l) and P(s|c_h,t,l) to be defined. Tree lexicalization reflects the characteristics of syntactic parsing, for which the models described in (Charniak, 2000; Charniak et al., 1998) and (Collins, 1997) were defined. Head words are very informative since they constitute keywords instantiating labels, regardless of whether they are syntactic constituents or named entities. However, for named entity recognition it doesn't make sense to give priority to verbs when percolating head words over the tree, even more so because head words of named entities are most of the time nouns. Moreover, it doesn't make sense to give priority to the head word of a particular entity with respect to the others.

[...] have shown to be less effective for syntactic parsing than their lexicalized counterparts, there is evidence showing that they can be effective in our task. With reference to figure 4, considering the entity pers.ind instantiated by Nicolas Sarkozy, our algorithm first detects name.first for Nicolas and name.last for Sarkozy using the CRF model. As mentioned earlier, once the CRF model has detected components, since entity trees do not have a complex structure with respect to syntactic trees, even a simple model like the one in equation 7 or 8 is effective for entity tree parsing. For example, once name.first and name.last have been detected by CRF, pers.ind is the only entity having name.first and name.last as children. Ambiguities, like for example for kind or qualifier, which can appear in many entities, can affect model 7, but they are overcome by model 8, which takes the entity tag of the parent node into account. Moreover, the use of CRF allows including in the model many more features than the lexicalized model in equation 6. Using features like word prefixes (P), suffixes (S), capitalization (C), morpho-syntactic features (MS) and other features indicated as F, the CRF model encodes the conditional probability:

    P(t|w, P, S, C, MS, F)    (9)
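The feature types of equation 9 can be sketched as a simple extractor over a [-2, +2] window. The feature-name strings and the helper itself are made up for illustration; the actual system relies on wapiti templates and external taggers:

```python
def word_features(words, i):
    """Standard CRF features at position i: prefixes/suffixes (length 1..6),
    a capitalization flag, and neighboring words in a [-2, +2] window."""
    w = words[i]
    feats = {"w0=" + w, "cap=" + str(w[:1].isupper())}
    for k in range(1, min(6, len(w)) + 1):
        feats.add("pre%d=%s" % (k, w[:k]))      # prefix features (P)
        feats.add("suf%d=%s" % (k, w[-k:]))     # suffix features (S)
    for off in (-2, -1, 1, 2):                  # context window
        j = i + off
        if 0 <= j < len(words):
            feats.add("w%+d=%s" % (off, words[j]))
    return feats

f = word_features(["le", "président", "Nicolas", "Sarkozy"], 2)
print("pre3=Nic" in f, "w+1=Sarkozy" in f, "cap=True" in f)  # True True True
```

Each such string becomes a binary feature, which is why the number of CRF features in table 3 runs into the millions while the label set stays small.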
5 Related Work

While the models used for named entity detection and the sets of named entities defined over the years have been discussed in the introduction and in section 2, since CRFs and models for parsing constitute the main issue in our work, we discuss some important models here.

Beyond the models for parsing discussed in section 4, together with the motivations for using them or not in our work, another important model for syntactic parsing has been proposed in (Ratnaparkhi, 1999). Such a model is made of four Maximum Entropy models used in cascade for parsing at different stages. This model also makes use of head words, like those described in section 4, thus the same considerations hold; moreover it seems quite complex for real applications, as it involves the use of four different models together. The models described in (Johnson, 1998), (Charniak, 1997; Caraballo and Charniak, 1997), (Charniak et al., 1998), (Charniak, 2000), (Collins, 1997) and (Ratnaparkhi, 1999) constitute the main individual models proposed for constituent-based syntactic parsing. Later, other approaches based on model combination have been proposed, like e.g. the reranking approach described in (Collins and Koo, 2005), among many others, as well as evolutions or improvements of these models.

More recently, approaches based on log-linear models, also called Tree CRFs, have been proposed for parsing (Clark and Curran, 2007; Finkel et al., 2008), also using different training criteria (Auli and Lopez, 2011). Using such models in our work has basically two problems: one is related to scaling issues, since our data present a large number of labels, which makes CRF training problematic, even more so when using Tree CRFs; the other is related to the difference between the syntactic parsing and named entity detection tasks, as mentioned in sub-section 4.2. Adapting Tree CRFs to our task is thus quite complex and constitutes an entire work by itself; we leave it as future work.

Concerning linear-chain CRF models, the one we use is a state-of-the-art implementation (Lavergne et al., 2010), as it implements the most effective optimization algorithms as well as state-of-the-art regularizers (see sub-section 3.1). Some improvements of linear-chain CRFs have been proposed, trying to integrate higher order target-side features (Tang et al., 2006). An integration of the same kind of features has also been tried in the model used in this work, without giving significant improvements, but making model training much harder. Thus, this direction has not been further investigated.

6 Evaluation

In this section we describe the experiments performed to evaluate our models. We first describe the settings used for the two models involved in the entity tree parsing, and then describe and comment on the results obtained on the test corpus.

6.1 Settings

The CRF implementation used in this work is described in (Lavergne et al., 2010), named wapiti.[3] We didn't optimize the parameters \rho_1 and \rho_2 of the elastic net (see section 3.1); although this significantly improves the performances and leads to more compact models, default values lead in most cases to very accurate models. We used a wide set of features in the CRF models, in a window of [-2, +2] around the target word:

- A set of standard features like word prefixes and suffixes of length from 1 to 6, plus some Yes/No features like Does the word start with a capital letter?, etc.
- Morpho-syntactic features extracted from the output of the tool tagger (Allauzen and Bonneau-Maynard, 2008)
- Features extracted from the output of the semantic analyzer (Rosset et al., 2009) provided by the tool WMatch (Galibert, 2009).

This analysis provides morpho-syntactic information as well as semantic information at the same level as named entities. Using two different sets of morpho-syntactic features results in more effective models, as they create a kind of agreement for a given word in case of match. Concerning the PCFG model, the grammars, tree binarization and the different tree representations are created with our own scripts, while entity tree parsing is performed with the chart parsing algorithm described in (Johnson, 1998).[4]

[3] available at http://wapiti.limsi.fr
[4] available at http://web.science.mq.edu.au/~mjohnson/Software.htm
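As an illustration of what such scripts do, one of the tree transforms, the parent-node relabeling of figure 7, can be sketched as follows. The (label, children) encoding, the "^" separator and the "TOP" root marker are assumptions of this sketch, not the authors' format:

```python
def parent_node(tree, parent="TOP"):
    """Annotate every entity label with its parent's label (parent-node style).
    Trees are (label, children) tuples; leaves are plain word strings."""
    label, children = tree
    new_children = [c if isinstance(c, str) else parent_node(c, label)
                    for c in children]
    return (label + "^" + parent, new_children)

t = ("pers.ind", [("name.first", ["Nicolas"]), ("name.last", ["Sarkozy"])])
print(parent_node(t))
# ('pers.ind^TOP', [('name.first^pers.ind', ['Nicolas']),
#                   ('name.last^pers.ind', ['Sarkozy'])])
```

Applying such a transform before grammar extraction is what multiplies the label and rule inventories reported in table 3 for the parent-node and parent-node-filler representations.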
CRF PCFG DEV TEST
Model # features # labels # rules Model SER F1 SER F1
baseline 3,041,797 55 29,611 baseline 20.0% 73.4% 14.2% 79.4%
filler-parent 3,637,990 112 29,611 filler-parent 16.2% 77.8% 12.5% 81.2%
parent-context 3,605,019 120 29,611 parent-context 15.2% 78.6% 11.9% 81.4%
parent-node 3,718,089 441 31,110 parent-node 6.6% 96.7% 5.9% 96.7%
parent-node-filler 3,723,964 378 31,110 parent-node-filler 6.8% 95.9% 5.7% 96.8%
Table 3: Statistics showing the characteristics of the different Table 4: Results computed from oracle predictions obtained with
models used in this work the different models presented in this work
6.2 Evaluation Metrics

All results are expressed in terms of Slot Error Rate (SER) (Makhoul et al., 1999), which has a definition similar to that of word error rate for ASR systems, with the difference that substitution errors are split into three types: i) correct entity type with wrong segmentation; ii) wrong entity type with correct segmentation; iii) wrong entity type with wrong segmentation. Here, i) and ii) are given half points, while iii), as well as insertion and deletion errors, are given full points. Moreover, results are also given using the well-known F1 measure, defined as a function of precision and recall.

6.3 Results

In this section we provide evaluations of the models described in this work, based on the combination of CRF and PCFG and using different tree representations of named entity trees.

6.3.1 Model Statistics

As a first evaluation, we describe some statistics computed from the CRF and PCFG models using the tree representations. Such statistics provide interesting clues about how difficult the task is to learn and what performance we can expect from the models. Statistics for this evaluation are presented in Table 3. Rows correspond to the different tree representations described in this work, while the columns show the number of features and labels for the CRF models (# features and # labels), and the number of rules for the PCFG models (# rules).

As we can see from the table, the number of rules is the same for the tree representations baseline, filler-parent and parent-context, and for the representations parent-node and parent-node-filler. This is a consequence of the contextualization applied by the latter representations: parent-node and parent-node-filler create several different labels depending on the context, so the corresponding grammar will have more rules. For example, the rule pers.ind → name.first name.last can appear as it is or contextualized with func.ind, as in Figure 8. In contrast, the other tree representations modify only fillers, so the number of rules is not affected.

Concerning the CRF models, as shown in Table 3, the use of the different tree representations results in an increasing number of labels to be learned by the CRF. This aspect is quite critical in CRF learning, as training time is exponential in the number of labels. Indeed, the most complex models, obtained with the parent-node and parent-node-filler tree representations, took roughly 8 days to train. Additionally, increasing the number of labels can create data sparseness problems; however, this problem does not seem to arise in our case since, apart from the baseline model, which has considerably fewer features, all the other models have approximately the same number of features, meaning that there is actually enough data to learn the models regardless of the number of labels.

6.3.2 Evaluations of Tree Representations

In this section we evaluate the models in terms of the evaluation metrics described in the previous section, Slot Error Rate (SER) and the F1 measure. In order to evaluate the PCFG models alone, we performed entity tree parsing using as input reference transcriptions (i.e. manual transcriptions) and reference component annotations taken from the development and test sets. This can be considered a kind of oracle evaluation and provides an upper bound on the performance of the PCFG models. Results for this evaluation are reported in Table 4.

Table 5: Results obtained with our combined algorithm based on CRF and PCFG

                     DEV                 TEST
Model                SER      F1         SER      F1
baseline             33.5%    72.5%      33.4%    72.8%
filler-parent        31.3%    74.4%      33.4%    72.7%
parent-context       30.9%    74.6%      33.3%    72.8%
parent-node          31.2%    77.8%      31.4%    79.5%
parent-node-filler   28.7%    78.9%      30.2%    80.3%
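The half-point and full-point scheme of Section 6.2 can be sketched in a few lines. This is a simplification, not the official scorer: entity slots are modeled as (type, start, end) tuples, and full substitutions (both type and segmentation wrong) are approximated here as a deletion plus an insertion.

```python
# Simplified sketch of Slot Error Rate (SER) scoring as described in
# Section 6.2 (not the official scorer). Slots are (type, start, end)
# tuples; type-only or segmentation-only substitutions count half a point.

def slot_error_rate(reference, hypothesis):
    ref, hyp = set(reference), set(hypothesis)
    correct = ref & hyp
    matched = set(correct)
    errors = 0.0
    for r_type, r_start, r_end in ref - correct:
        partial = None
        for h_type, h_start, h_end in hyp - matched:
            overlap = not (h_end <= r_start or r_end <= h_start)
            same_span = (h_start, h_end) == (r_start, r_end)
            if (h_type == r_type and overlap) or (same_span and h_type != r_type):
                partial = (h_type, h_start, h_end)
                break
        if partial is not None:
            errors += 0.5            # half point: type OR segmentation wrong
            matched.add(partial)
        else:
            errors += 1.0            # deletion: full point
    # unmatched hypothesis slots are insertions (so a full substitution
    # counts as deletion + insertion in this simplification)
    errors += len(hyp - matched)
    return errors / len(ref)
```

For instance, one correct slot plus one type-only substitution over two reference slots yields an SER of 0.25.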
As can be intuitively expected, adding more contextualization in the trees results in more accurate models: the simplest model, baseline, has the worst oracle performance, while the filler-parent and parent-context models, which add similar contextualization information, have very similar oracle performances. The same line of reasoning applies to the models parent-node and parent-node-filler, which also add similar contextualization and have very similar oracle predictions. These last two models also have the best absolute oracle performances. However, adding more contextualization to the trees also results in more rigid models: the fact that the models are robust on reference transcriptions and reference component annotations does not imply a proportional robustness on the component sequences generated by the CRF models.

This intuition is confirmed by the results reported in Table 5, where a real evaluation of our models is reported, this time using CRF output components as input to the PCFG models to parse entity trees. The results reported in Table 5 show in particular that models using the baseline, filler-parent and parent-context tree representations have similar performances, especially on the test set. Models characterized by the parent-node and parent-node-filler tree representations have indeed the best performances, although the gain with respect to the other models is not as large as could be expected given the difference in the oracle performances discussed above. In particular, the best absolute performance is obtained with the parent-node-filler model. As we mentioned in subsection 4.1, this model represents the best trade-off between rigidity and accuracy, using the same label for all entity fillers but still distinguishing between fillers found in entity structures and other fillers found in words not instantiating any entity.

6.3.3 Comparison with Official Results

As a final evaluation of our models, we provide a comparison with the official results obtained at the 2011 evaluation campaign of extended named entity recognition (Galibert et al., 2011). Results are reported in Table 6, where the other two participants in the campaign are indicated as P1 and P2. These two participants, P1 and P2, used a system based on CRF and rules for deep syntactic analysis, respectively. In particular, P2 obtained superior performances in a previous evaluation campaign on named entity recognition. The system we proposed at the evaluation campaign used the parent-context tree representation. The results obtained at the evaluation campaign are in the first three lines of Table 6. We compare these results with those obtained with the parent-node and parent-node-filler tree representations, reported in the last two rows of the same table. As we can see, the new tree representations described in this work achieve the best absolute performances.

Table 6: Results obtained with our combined algorithm based on CRF and PCFG

Participant           SER
P1                    48.9
P2                    41.0
parent-context        33.3
parent-node           31.4
parent-node-filler    30.2

7 Conclusions

In this paper we have presented a Named Entity Recognition system dealing with extended named entities that have a tree structure. Given such a representation of named entities, the task cannot be modeled as a sequence labelling problem. We thus proposed a two-step system based on CRF and PCFG: the CRF annotates entity components directly on words, while the PCFG applies parsing techniques to predict the whole entity tree. We motivated our choice by showing that it is not effective to apply techniques used widely for syntactic parsing, such as tree lexicalization. We presented an analysis of different tree representations for the PCFG, which significantly affect parsing performance.

We provided and discussed a detailed evaluation of all the models obtained by combining CRF and PCFG with the different tree representations proposed. Our combined models achieve better performances than the other models proposed at the official evaluation campaign, as well as our own previous model used at the evaluation campaign.

Acknowledgments

This work has been funded by the project Quaero, under the program Oseo, the French State agency for innovation.
References

Ralph Grishman and Beth Sundheim. 1996. Message Understanding Conference-6: a brief history. In Proceedings of the 16th Conference on Computational Linguistics - Volume 1, pages 466-471, Stroudsburg, PA, USA. Association for Computational Linguistics.

Satoshi Sekine and Chikashi Nobata. 2004. Definition, Dictionaries and Tagger for Extended Named Entity Hierarchy. In Proceedings of LREC.

G. Doddington, A. Mitchell, M. Przybocki, L. Ramshaw, S. Strassel, and R. Weischedel. 2004. The Automatic Content Extraction (ACE) Program: Tasks, Data, and Evaluation. In Proceedings of LREC 2004, pages 837-840.

Cyril Grouin, Sophie Rosset, Pierre Zweigenbaum, Karën Fort, Olivier Galibert, and Ludovic Quintard. 2011. Proposal for an extension of traditional named entities: From guidelines to evaluation, an overview. In Proceedings of the Linguistic Annotation Workshop (LAW).

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 282-289, Williamstown, MA, USA, June.

Mark Johnson. 1998. PCFG models of linguistic tree representations. Computational Linguistics, 24:613-632.

Stefan Hahn, Marco Dinarelli, Christian Raymond, Fabrice Lefèvre, Patrick Lehnen, Renato De Mori, Alessandro Moschitti, Hermann Ney, and Giuseppe Riccardi. 2010. Comparing stochastic approaches to spoken language understanding in multiple languages. IEEE Transactions on Audio, Speech and Language Processing (TASLP), 99.

Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22:39-71.

Thomas Lavergne, Olivier Cappé, and François Yvon. 2010. Practical very large scale CRFs. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 504-513, July.

Stefan Riezler and Alexander Vasserman. 2004. Incremental feature selection and l1 regularization for relaxed maximum-entropy modeling. In Proceedings of the International Conference on Empirical Methods in Natural Language Processing (EMNLP).

Hui Zou and Trevor Hastie. 2005. Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society B, 67:301-320.

Eugene Charniak. 1997. Statistical parsing with a context-free grammar and word statistics. In Proceedings of the Fourteenth National Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence, AAAI'97/IAAI'97, pages 598-603. AAAI Press.

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 132-139, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Sharon A. Caraballo and Eugene Charniak. 1997. New figures of merit for best-first probabilistic chart parsing. Computational Linguistics, 24:275-298.

Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, ACL '98, pages 16-23, Stroudsburg, PA, USA. Association for Computational Linguistics.

Eugene Charniak, Sharon Goldwater, and Mark Johnson. 1998. Edge-based best-first chart parsing. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 127-133. Morgan Kaufmann.

Alexandre Allauzen and Hélène Bonneau-Maynard. 2008. Training and evaluation of POS taggers on the French MULTITAG corpus. In Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May.

Olivier Galibert. 2009. Approches et méthodologies pour la réponse automatique à des questions adaptées à un cadre interactif en domaine ouvert. Ph.D. thesis, Université Paris Sud, Orsay.

Sophie Rosset, Olivier Galibert, Guillaume Bernard, Eric Bilinski, and Gilles Adda. 2009. The LIMSI multilingual, multitask QAst system. In Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access, CLEF'08, pages 480-487, Berlin, Heidelberg. Springer-Verlag.

Azeddine Zidouni, Sophie Rosset, and Hervé Glotin. 2010. Efficient combined approach for named entity recognition in spoken language. In Proceedings of the International Conference of the Speech Communication Association (Interspeech), Makuhari, Japan.

John Makhoul, Francis Kubala, Richard Schwartz, and Ralph Weischedel. 1999. Performance measures for information extraction. In Proceedings of the DARPA Broadcast News Workshop, pages 249-252.

Adwait Ratnaparkhi. 1999. Learning to Parse Natural Language with Maximum Entropy Models. Machine Learning, 34(1-3):151-175.
Michael Collins and Terry Koo. 2005. Discriminative Re-ranking for Natural Language Parsing. Machine Learning, 31(1):25-70.

Stephen Clark and James R. Curran. 2007. Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. Computational Linguistics, 33(4):493-552.

Jenny R. Finkel, Alex Kleeman, and Christopher D. Manning. 2008. Efficient, Feature-based, Conditional Random Field Parsing. In Proceedings of the Association for Computational Linguistics, pages 959-967, Columbus, Ohio.

Michael Auli and Adam Lopez. 2011. Training a Log-Linear Parser with Loss Functions via Softmax-Margin. In Proceedings of Empirical Methods in Natural Language Processing, pages 333-343, Edinburgh, U.K.

Jie Tang, MingCai Hong, Juan-Zi Li, and Bangyong Liang. 2006. Tree-Structured Conditional Random Fields for Semantic Annotation. In Proceedings of the International Semantic Web Conference, pages 640-653. Springer.

Olivier Galibert, Sophie Rosset, Cyril Grouin, Pierre Zweigenbaum, and Ludovic Quintard. 2011. Structured and Extended Named Entity Evaluation in Automatic Speech Transcriptions. In Proceedings of IJCNLP 2011.

Marco Dinarelli and Sophie Rosset. 2011. Models Cascade for Tree-Structured Named Entity Detection. In Proceedings of IJCNLP 2011.
When Did that Happen? Linking Events and Relations to Timestamps

Dirk Hovy*, James Fan, Alfio Gliozzo, Siddharth Patwardhan and Chris Welty
IBM T. J. Watson Research Center
19 Skyline Drive
Hawthorne, NY 10532
dirkh@isi.edu, {fanj,gliozzo,siddharth,welty}@us.ibm.com

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 185-193, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
the versatility of our approach by achieving competitive results on yet another similar task with a different data set.

Our approach requires us to capture contextual properties of the text surrounding events, fluents and time expressions that enable an automatic system to detect temporal linking within our framework. A common strategy for this is to follow standard feature engineering methodology and manually develop features for a machine learning model from the lexical, syntactic and semantic analysis of the text. A key contribution of our work in this paper is to demonstrate a shallow tree-like representation of the text that enables us to employ tree kernel models and more accurately detect temporal linking. The feature space represented by such tree kernels is far larger than a manually engineered feature space, and is capable of capturing the contextual information required for temporal linking.

The remainder of this paper goes into the details of our approach to temporal linking, and presents empirical evidence for its effectiveness. The contributions of this paper can be summarized as follows:

1. We define a common methodology to link events and fluents to timestamps.

2. We use tree kernels in combination with classical feature-based approaches to obtain significant gains by exploiting context.

3. Empirical evidence illustrates that our framework for temporal linking is very effective for the task, achieving an F1-score of 0.76 on events and 0.72 on fluents/relations, as well as 0.65 for TempEval-2, approaching the state of the art.

2 Related Work

Most of the previous work on relation extraction focuses on entity-entity relations, such as in the ACE (Doddington et al., 2004) tasks. Temporal relations are part of this, but to a lesser extent. The primary research effort in event temporality has gone into ordering events with respect to one another (e.g., Chambers and Jurafsky (2008)) and detecting their typical durations (e.g., Pan et al. (2006)).

Recently, TempEval workshops have focused on temporal issues in NLP. Some of the TempEval tasks overlap with ours in many ways. Our task is similar to tasks A and C of TempEval-1 (Verhagen et al., 2007) in the sense that we attempt to identify temporal relations between events and time expressions or document dates. However, we do not use a restricted set of events, but focus primarily on a single temporal relation, tlink, instead of named relations like BEFORE, AFTER or OVERLAP (although we show that we can incorporate these as well). Part of our task is similar to task C of TempEval-2 (Verhagen et al., 2010): determining the temporal relation between an event and a time expression in the same sentence. In this paper, we do apply our system to TempEval-2 data and compare our performance to the participating systems.

Our work is similar to that of Boguraev and Ando (2005), whose research only deals with temporal links between events and time expressions (and does not consider relations at all). They employ a sequence tagging model with manual feature engineering for the task and achieved state-of-the-art results on TimeBank (Pustejovsky et al., 2003) data. Our task is slightly different because we include relations in the temporal linking, and our use of tree kernels enables us to explore a wider feature space very quickly.

Filatova and Hovy (2001) also explore temporal linking with events, but do not assume that events and time stamps have been provided by an external process. They used a heuristics-based approach to assign temporal expressions to events (also relying on proximity as a base case). They report the accuracy of the assignment for the correctly classified events, the best being 82.29%. Our best event system achieves an accuracy of 84.83%. These numbers are difficult to compare, however, since accuracy does not efficiently capture the performance of a system on a task with so many negative examples.

Mirroshandel et al. (2011) describe the use of syntactic tree kernels for event-time links. Their results on TempEval are comparable to ours. In contrast to them, however, we found that syntactic tree kernels alone do not perform as well as using several flat tree representations.

3 Problem Definition

The task of linking events and relations to time stamps can be defined as follows: given a set of expressions denoting events or relation mentions in a document, and a set of time expressions in the same document, find all instances of the tlink relation between elements of the two input sets. The existence of a tlink(e, t) means that e, which is an event or a relation mention, occurs within the temporal context specified by the time expression t.

Thus, our task can be cast as a binary relation classification task: for each possible pair of (event/relation, time) in a document, decide whether there exists a link between the two, and if so, express it in the data.

In addition, we make these assumptions about the data:

1. There does not exist a timestamp for every event/relation in a document. Although events and relations typically have temporal context, it may not be explicitly stated in a document.

2. Every event/relation has at most one time expression associated with it. This is a simplifying assumption, which in the case of relations we explore as future work.

3. Each temporal expression can be linked to one or more events or relations. Since multiple events or relations may happen at a given time, it is safe to assume that each temporal expression can be linked to more than one event/relation.

In general, the events/relations and their associated timestamps may occur within the same sentence or across different sentences. In this paper, we focus our effort and our evaluation on the same-sentence linking task. In order to solve the problem of temporal linking completely, however, it will be important to also address the links that hold between entities across sentences. We estimate, based on our data set, that across-sentence links account for 41% of all correct event-time pairs in a document. For fluents, the ratio is much higher: more than 80% of the correct fluent-time links are across sentences. One of the main obstacles for our approach in the cross-sentence case is the very low ratio of positive to negative instances (3:100) in the set of all pairs in a document. Most pairs are not linked to one another.

4 Temporal Linking Framework

As previously mentioned, we approach the temporal linking problem as a classification task. In the framework of classification, we refer to each pair of (event/relation, temporal expression) occurring within a sentence as an instance. The goal is to devise a classifier that separates positive (i.e., linked) instances from negative ones, i.e., pairs where there is no link between the event/relation and the temporal expression in question. The latter case is far more frequent, so we have an inherent bias toward negative examples in our data.¹

Note that the basis of the positive and negative links is the context around the target terms. It is impossible even for humans to determine the existence of a link based only on the two terms without their context. For instance, given just two words (e.g., "said" and "yesterday") there is no way to tell if it is a positive or a negative example. We need the context to decide.

Therefore, we base our classification models on contextual features drawn from lexical and syntactic analyses of the text surrounding the target terms. For this, we first define a feature-based approach, then we improve on it by using tree kernels. These two subsections, plus the treatment of fluent relations, are the main contributions of this paper. In all of this work, we employ SVM classifiers (Vapnik, 1995) for machine learning.

4.1 Feature Engineering

A manual analysis of development data provided several intuitions about the kinds of features that would be useful in this task. Based on this analysis, and with inspiration from previous work (cf. Boguraev and Ando (2005)), we established three categories of features, whose description follows.

Features describing events or relations. We check whether the event or relation is phrasal, a verb, or a noun; whether it is present tense, past tense, or progressive; the type assigned to the event/relation by the UIMA type system used for processing; and whether it includes certain trigger words, such as reporting verbs ("said", "reported", etc.).

¹ Initially, we employed an instance filtering method to address this, which proved to be ineffective and was subsequently left out.
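The instance construction described at the start of this section can be sketched as follows. This is our reading of the setup, not the authors' code: data shapes and names are hypothetical, and every same-sentence (event/relation, time expression) pair becomes one instance, labeled positive only if it appears among the gold tlinks.

```python
# Sketch of same-sentence instance generation (Section 4): each
# (event/relation, temporal expression) pair within a sentence is one
# classification instance; gold tlinks mark the positives. Note the
# inherent bias toward negative examples in realistic data.

def make_instances(sentences, gold_tlinks):
    instances = []
    for sent in sentences:
        for target in sent["events"] + sent["relations"]:
            for timex in sent["timexes"]:
                label = 1 if (target, timex) in gold_tlinks else 0
                instances.append(((target, timex), label))
    return instances

sents = [{"events": ["said"], "relations": [], "timexes": ["yesterday", "1999"]}]
data = make_instances(sents, {("said", "yesterday")})
# data == [(("said", "yesterday"), 1), (("said", "1999"), 0)]
```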
Features describing temporal expressions. We check for the presence of certain trigger words ("last", "next", "old", numbers, etc.) and the type of the expression (DURATION, TIME, or DATE) as specified by the UIMA type system.

Features describing context. We also include syntactic/structural features, such as testing whether the relation/event dominates the temporal expression, which one comes first in the sentence order, and whether either of them is dominated by a separate verb, a preposition, "that" (which often indicates a subordinate sentence), or counter-factual nouns or verbs (which would negate the temporal link).

It is not surprising that some of the most informative features (event comes before temporal expression, time is a syntactic child of the event) are strongly correlated with the baselines. Less salient features include the tests for certain words indicating that the event is a noun or a verb, and if so which tense it has and whether it is a reporting verb.

4.2 Tree Kernel Engineering

We expect that there exist certain patterns between the entities of a temporal link, which manifest on several levels: some on the lexical level, others expressed by certain sequences of POS tags, NE labels, or other representations. Kernels provide a principled way of expanding the number of dimensions in which we search for a decision boundary, and allow us to easily model local sequences and patterns in a natural way (Giuliano et al., 2009). While it is possible to define a space in which we find a decision boundary that separates positive and negative instances with manually engineered features, these features can hardly capture the notion of context as well as those explored by a tree kernel.

Tree kernels are a family of kernel functions developed to compute the similarity between tree structures by counting the number of subtrees they have in common. This generates a high-dimensional feature space that can be handled efficiently using dynamic programming techniques (Shawe-Taylor and Cristianini, 2004). For our purposes we used an implementation of the Subtree and Subset Tree (SST) kernel (Moschitti, 2006). The advantages of using tree kernels are two-fold: thanks to an existing implementation (SVM-light with tree kernels, Moschitti (2004)), it is faster and easier than traditional feature engineering. The tree structure also allows us to use different levels of representation (POS, lemma, etc.) and combine their contributions, while at the same time taking into account the ordering of labels. We use POS, lemma, semantic type, and a representation that replaces each word with a concatenation of its features (capitalization, countability, abstract/concrete noun, etc.).

We developed a shallow tree representation that captures the context of the target terms without encoding too much structure (which may prevent generalization). In essence, our tree structure induces behavior somewhat similar to a string kernel. In addition, we can model the tasks by providing specific markup on the generated tree. For example, in our experiments we used the labels EVENT (or equivalently RELATION) and TIMESTAMP to mark our target terms. In order to reduce the complexity of this comparison, we focus on the substring between event/relation and time stamp, and the rest of the tree structure is truncated.

Figure 1 illustrates an example of the structure described so far for both lemmas and POS tags (note that the lowest level of the tree contains tokenized items, so their number can differ from the actual words, as in "attorney general"). Similar trees are produced for each level of representation used, and for each instance (i.e., pair of time expression and event/relation). If a sentence contains more than one event/relation, we create separate trees for each of them, which differ in the position of the EVENT/RELATION marks (at level 1 of the tree).

The tree kernel implicitly expands this structure into a number of substructures, allowing us to capture sequential patterns in the data. As we will see, this step provides significant boosts to task performance.

Curiously, using a full-parse syntactic tree as input representation did not help performance. This is in line with our finding that syntactic relations are less important than sequential patterns (see also Section 5.2). Therefore we adopted the string-kernel-like representation illustrated in Figure 1.
[Figure 1: Input sentence and tree kernel representations for Bag of Words (BOW) and POS tags (BOP). Example sentence: "Scores of supporters of detained Egyptian opposition leader Nur demonstrated outside the attorney general's office in Cairo last Saturday, demanding he be freed immediately."]
3. Linking Events to Temporal Expressions (TempEval-2, task C)

The first two data sets contained annotations in the intelligence community (IC) domain, i.e., mainly news reports about terrorism. It comprised 169 documents. This dataset was developed in the context of the machine reading program (MRP) (Strassel et al., 2010). In both cases our goal is to develop a binary classifier to judge whether the event (or relation) overlaps with the time interval denoted by the timestamp. Success of this classification can be measured by precision and recall on annotated data. We originally considered using accuracy as a measure of performance, but this does not correctly reflect the true performance of the system.

The size of the relation data set after filtering is 5511 instances (1847 positive, 3395 negative). In order to increase the originally lower number of event instances, we made use of the annotated event coreference as a sort of closure to add more instances: if events A and B corefer, and there is a link between A and time expression t, then there is also a link between B and t. This was not explicitly expressed in the data.

For the task at hand, we used gold-standard annotations for timestamps, events and relations. The task was thus not the identification of these objects (a necessary precursor and a difficult task in itself), but the decision as to which events and time expressions could and should be linked.

We also evaluated our system on TempEval-2 (Verhagen et al., 2010) for better comparison
(OVERLAP, BEFORE, AFTER, BEFORE-OR-OVERLAP, OVERLAP-OR-AFTER). This is a bit different from our setting, as it required the implementation of a multi-class classifier. Therefore we trained three different binary classifiers (using the same feature set) for the first three of those types (for which there was sufficient training data), and we used a one-versus-all strategy to distinguish positive from negative examples. The output of the system is the category with the highest SVM decision score. Since we only use three labels, we incur an error every time the gold label is something else. Note that this is stricter than the evaluation in the actual task, which left contestants with the option of skipping examples their systems could not classify.

[Figure 2: Performance on events. Bar chart comparing Precision, Recall and F1 for BL-parent, BL-closest, features, and features + tree kernel.]

Table 1: Comparison to the best systems in TempEval-2

System                             Accuracy
TRIOS                              65%
this work                          64.5%
JU-CSE, NCSU-indi, TRIPS, USFD2    63%
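The one-versus-all decision described above can be sketched in a few lines. This is a sketch under our assumptions, not the authors' code: the scoring functions below are dummy stand-ins for trained SVM decision values, and the predicted category is simply the label whose classifier returns the highest score.

```python
# Sketch of the one-versus-all strategy: one binary classifier per
# relation type with sufficient training data; the output is the
# category with the highest decision score.

LABELS = ("OVERLAP", "BEFORE", "AFTER")

def predict(instance, classifiers):
    # classifiers maps label -> decision function returning a real score
    scores = {label: classifiers[label](instance) for label in LABELS}
    return max(scores, key=scores.get)

# dummy decision functions standing in for trained SVM outputs
toy = {
    "OVERLAP": lambda x: 0.3,
    "BEFORE": lambda x: 1.2,
    "AFTER": lambda x: -0.5,
}
print(predict(None, toy))  # BEFORE
```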
5.2 Baselines

Intuitively, one would expect temporal expressions to be close to the event they denote, or even syntactically related to it. In order to test this, we applied two baselines. In the first, each temporal expression was linked to the closest event (as measured in token distance). In the second, we attached each temporal expression to its syntactic head, if the head was an event. Results are reported in Figure 2.

While these results are encouraging for our task, it seems at first counter-intuitive that the syntactic baseline does worse than the proximity-based one. It does, however, reveal two facts: events are not always synonymous with syntactic units, and they are not always bound to temporal expressions through direct syntactic links. The latter makes even more sense given that links can even occur across sentence boundaries. Parsing quality could play a role, yet seems far-fetched as an account of the difference. More important than syntactic relations seem to be sequential patterns on different levels, a fact we exploit with the different tree representations used (POS tags, NE types, etc.).

For relations, we only applied the closest-relation baseline. Since relations consist of two or more arguments that occur in different, often separated syntactic constituents, a syntactic approach seems futile, especially given our experience with events. Results are reported in Figure 3.

5.3 Events

Figure 2 shows the improvements of the feature-based approach over the two baselines, and the additional gain obtained by using the tree kernel. Both the features and the tree kernels mainly improve precision, while the tree kernel adds a small boost in recall. It is remarkable, though, that the closest-event baseline has a very high recall value. This suggests that most of the links actually do occur between items that are close to one another. For a possible explanation of the low precision value, see the error analysis (Section 5.5).

Using a two-tailed t-test, we compute the significance of the difference between the F1-scores. Both the feature-based and the tree kernel improvements are statistically significant at p < 0.001 over the baseline scores.

Table 1 compares the performance of our system to the state-of-the-art systems on TempEval-2 data, task C, showing that our approach is very competitive. The best systems there used sequential models. We attribute the competitive nature of our results to the use of tree kernels, which enable us to make use of contextual information.

5.4 Relations

In general, performance for relations is not as high as for events (see Figure 3). The reason here is two-fold: relations consist of two (or more) elements, which can be in various positions with respect to one another and the temporal expression, and each relation can be expressed in a number of
[Figure: baseline comparison — bar chart of precision, recall and F1 (values include 29.0 and 24.0), with the metric on the x-axis.]

where last Tuesday modifies arrested. It limits the amount of context that is available to the tree kernels, since we truncate the tree representations
from the dependency trees, or our formulation of the task was not amenable to it. We did not investigate this further, but leave it to future work.

6 Conclusion and Future Work

We cast the problem of linking events and relations to temporal expressions as a classification task using a combination of features and tree kernels, with probabilistic type filtering. Our main contributions are:

- We showed that within-sentence temporal links for both events and relations can be approached with a common strategy.

- We developed flat tree representations and showed that these produce considerable gains, with significant improvements over different baselines.

- We applied our technique without great adjustments to an existing data set and achieved competitive results.

- Our best systems achieve F1 scores of 0.76 on events and 0.72 on relations, and are effective at the task of temporal linking.

We developed the models as part of a machine reading system and are currently evaluating it in an end-to-end task. Following tasks proposed in TempEval-2, we plan to use our approach for across-sentence classification, as well as a similar model for linking entities to the document creation date.

Acknowledgements

We would like to thank Alessandro Moschitti for his help with the tree kernel setup, and the reviewers who supplied us with very constructive feedback. Research supported in part by Air Force Contract FA8750-09-C-0172 under the DARPA Machine Reading Program.

References

Ken Barker, Bhalchandra Agashe, Shaw-Yi Chaw, James Fan, Noah Friedland, Michael Glass, Jerry Hobbs, Eduard Hovy, David Israel, Doo Soon Kim, Rutu Mulkar-Mehta, Sourabh Patwardhan, Bruce Porter, Dan Tecuci, and Peter Yeh. 2007. Learning by reading: A prototype system, performance baseline and lessons learned. In Proceedings of the 22nd National Conference for Artificial Intelligence, Vancouver, Canada, July.

Branimir Boguraev and Rie Kubota Ando. 2005. TimeML-compliant text analysis for temporal reasoning. In Proceedings of IJCAI, volume 5, pages 997–1003.

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains, pages 789–797. Association for Computational Linguistics.

Peter Clark and Phil Harrison. 2010. Machine reading as a process of partial question-answering. In Proceedings of the NAACL HLT Workshop on Formalisms and Methodology for Learning by Reading, Los Angeles, CA, June.

George Doddington, Alexis Mitchell, Mark Przybocki, Lance Ramshaw, Stephanie Strassel, and Ralph Weischedel. 2004. The automatic content extraction program – tasks, data and evaluation. In Proceedings of the LREC Conference, Canary Islands, Spain, July.

Oren Etzioni, Michele Banko, and Michael Cafarella. 2007. Machine reading. In Proceedings of the AAAI Spring Symposium Series, Stanford, CA, March.

Elena Filatova and Eduard Hovy. 2001. Assigning time-stamps to event-clauses. In Proceedings of the Workshop on Temporal and Spatial Information Processing, volume 13, pages 1–8. Association for Computational Linguistics.

Claudio Giuliano, Alfio Massimiliano Gliozzo, and Carlo Strapparava. 2009. Kernel methods for minimally supervised WSD. Computational Linguistics, 35(4).

Seyed A. Mirroshandel, Mahdy Khayyamian, and Gholamreza Ghassem-Sani. 2011. Syntactic tree kernels for event-time temporal relation learning. Human Language Technology. Challenges for Computer Science and Linguistics, pages 213–223.

Alessandro Moschitti. 2004. A study on convolution kernels for shallow semantic parsing. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 335–342. Association for Computational Linguistics.

Alessandro Moschitti. 2006. Making tree kernels practical for natural language learning. In Proceedings of EACL, volume 6.

Feng Pan, Rutu Mulkar, and Jerry R. Hobbs. 2006. Learning event durations from event descriptions. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 393–400. Association for Computational Linguistics.

James Pustejovsky, Patrick Hanks, Roser Saurí, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, and Marcia Lazo. 2003. The TIMEBANK Corpus. In Proceedings of Corpus Linguistics 2003, pages 647–656.

John Shawe-Taylor and Nello Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.

Stephanie Strassel, Dan Adams, Henry Goldberg, Jonathan Herr, Ron Keesing, Daniel Oblinger, Heather Simpson, Robert Schrag, and Jonathan Wright. 2010. The DARPA Machine Reading Program – Encouraging Linguistic and Reasoning Research with a Series of Reading Tasks. In Proceedings of LREC 2010.

Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer, New York, NY.

Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Graham Katz, and James Pustejovsky. 2007. SemEval-2007 task 15: TempEval temporal relation identification. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 75–80. Association for Computational Linguistics.

Marc Verhagen, Roser Saurí, Tommaso Caselli, and James Pustejovsky. 2010. SemEval-2010 task 13: TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 57–62. Association for Computational Linguistics.
Compensating for Annotation Errors in Training a Relation Extractor

Bonan Min and Ralph Grishman
New York University
715 Broadway, 7th floor, New York, NY 10003, USA
min@cs.nyu.edu, grishman@cs.nyu.edu

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 194–203, Avignon, France, April 23-27, 2012. © 2012 Association for Computational Linguistics
2. Background

2.1 Supervised Relation Extraction

One of the most studied relation extraction tasks is the ACE relation extraction evaluation sponsored by the U.S. government. ACE 2005 defined 7 major entity types, such as PER (Person), LOC (Location), ORG (Organization). A relation in ACE is defined as an ordered pair of entities appearing in the same sentence which expresses one of the predefined relations. ACE 2005 defines 7 major relation types and more than 20 subtypes. Following previous work, we ignore sub-types in this paper and only evaluate on types when reporting relation classification performance. Types include General-affiliation (GEN-AFF), Part-whole (PART-WHOLE), Person-social (PER-SOC), etc. ACE provides a large corpus which is manually annotated with entities (with coreference chains between entity mentions annotated), relations, events and values. Each mention of a relation is tagged with a pair of entity mentions appearing in the same sentence as its arguments. More details about the ACE evaluation are on the ACE official website.

Given a sentence s and two entity mentions arg1 and arg2 contained in s, a candidate relation mention r with argument arg1 preceding arg2 is defined as r=(s, arg1, arg2). The goal of Relation Detection and Classification (RDC) is to determine whether r expresses one of the types defined. If so, classify it into one of the types.

Supervised learning treats RDC as a classification problem and solves it with supervised machine learning algorithms such as MaxEnt and SVM. There are two commonly used learning strategies (Sun et al., 2011). Given an annotated corpus, one could apply a flat learning strategy, which trains a single multi-class classifier on training examples labeled as one of the relation types or not-a-relation, and apply it to determine the type, or output not-a-relation, for each candidate relation mention during testing. The examples of each type are the relation mentions that are tagged as instances of that type, and the not-a-relation examples are constructed from pairs of entities that appear in the same sentence but are not tagged as any of the types. Alternatively, one could apply a hierarchical learning strategy, which trains two classifiers: a binary classifier RD for relation detection and a multi-class classifier RC for relation classification. RD is trained by grouping tagged relation mentions of all types as positive instances and using all the not-a-relation cases (same as described above) as negative examples. RC is trained on the annotated examples with their tagged types. During testing, RD is applied first to identify whether an example expresses some relation; then RC is applied to determine the most likely type, only if the example is detected as correct by RD.

State-of-the-art supervised methods for relation extraction also differ from each other in data representation. Given a relation mention, feature-based methods (Miller et al., 2000; Kambhatla, 2004; Boschee et al., 2005; Grishman et al., 2005; Zhou et al., 2005; Jiang and Zhai, 2007; Sun et al., 2011) extract a rich list of structural, lexical, syntactic and semantic features to represent it; in contrast, kernel-based methods (Zelenko et al., 2003; Bunescu and Mooney, 2005a; Bunescu and Mooney, 2005b; Zhao and Grishman, 2005; Zhang et al., 2006a; Zhang et al., 2006b; Zhou et al., 2007; Qian et al., 2008) represent each instance with an object such as an augmented token sequence or a parse tree, and use a carefully designed kernel function, e.g. the subsequence kernel (Bunescu and Mooney, 2005b) or the convolution tree kernel (Collins and Duffy, 2001), to calculate their similarity. These objects are usually augmented with features such as semantic features.

In this paper, we use the hierarchical learning strategy since it simplifies the problem by letting us focus on relation detection only. The relation classification stage remains unchanged and we will show that it benefits from improved detection. For experiments on both relation detection and relation classification, we use SVM⁴ (Vapnik, 1998) as the learning algorithm since it can be extended to support transductive inference as discussed in section 4.3. However, for the analysis in section 3.2 and the purification preprocessing steps in section 4.2, we use a MaxEnt⁵ model since it outputs probabilities⁶ for its predictions. For the choice of features, we use the full set of features from Zhou et al. (2005) since it is reported to have state-of-the-art performance (Sun et al., 2011).

2.2 ACE 2005 annotation

The ACE 2005 training data contains 599 articles from newswire, broadcast news, weblogs, usenet newsgroups/discussion forums, conversational telephone speech and broadcast conversations. The annotation process is conducted as follows: two annotators working independently annotate each article and complete all annotation tasks (entities, values, relations and events). After the two annotators have both finished annotating a file, all discrepancies are adjudicated by a senior annotator. This results in a high-quality annotation file. More details can be found in the documentation of ACE 2005 Multilingual Training Data V3.0.

Since the final release of the ACE training corpus only contains the final adjudicated annotations, in which all the traces of the two

the entity mentions. Following up, we checked the relation mentions⁷ from fp1 and fp2 against the adjudicated list of entity mentions from adj and found that 682 and 665 relation mentions respectively have at least one argument which doesn't appear in the list of adjudicated entity mentions.

Given the list of relation mentions with both arguments appearing in the list of adjudicated entity mentions, figure 1 shows the inter-annotator agreement of the ACE 2005 relation annotation. In this figure, the three circles represent the list of relation mentions in fp1, fp2 and adj, respectively.

[Figure 1: inter-annotator agreement of the ACE 2005 relation annotation (Venn diagram over the relation mentions in fp1, fp2 and adj).]

⁴ SVM-Light is used. http://svmlight.joachims.org/
⁵ The OpenNLP MaxEnt package is used. http://maxent.sourceforge.net/about.html
⁶ SVM also outputs a value associated with each prediction. However, this value cannot be interpreted as a probability.
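The hierarchical RD-then-RC strategy of section 2.1 can be sketched as below. This is an illustrative toy, not the authors' system: the feature dictionaries and toy examples stand in for the Zhou et al. (2005) feature set, and scikit-learn's LinearSVC stands in for the SVM-Light setup used in the paper.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy candidate relation mentions r = (s, arg1, arg2), already reduced to
# hypothetical feature dicts, labeled with an ACE-style type or "not-a-relation".
train = [
    ({"head1": "president", "head2": "US", "dist": 1}, "ORG-AFF"),
    ({"head1": "son", "head2": "doctor", "dist": 2}, "PER-SOC"),
    ({"head1": "protesters", "head2": "streets", "dist": 3}, "PHYS"),
    ({"head1": "said", "head2": "Monday", "dist": 5}, "not-a-relation"),
    ({"head1": "went", "head2": "today", "dist": 6}, "not-a-relation"),
]

vec = DictVectorizer()
X = vec.fit_transform([feats for feats, _ in train])
labels = [y for _, y in train]

# RD: binary detector (relation vs. not-a-relation), trained on all examples.
rd = LinearSVC().fit(X, [y != "not-a-relation" for y in labels])

# RC: multi-class classifier, trained only on the tagged relation mentions.
pos = [i for i, y in enumerate(labels) if y != "not-a-relation"]
rc = LinearSVC().fit(X[pos], [labels[i] for i in pos])

def classify(feats):
    # RD first; RC assigns a type only if RD detects a relation.
    x = vec.transform([feats])
    if not rd.predict(x)[0]:
        return "not-a-relation"
    return rc.predict(x)[0]
```

Note that RC never sees not-a-relation examples, which is what lets the detection stage be improved independently, as the paper does in section 4.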
relations in figure 2 (3 of the classes, accounting together for less than 10% of the cases, are omitted) and the other class. It seems that it is generally easier for the annotators to find and agree on relation mentions of the type Preposition/PreMod/Possessives but harder to find and agree on the ones belonging to Verbal and Other. The definition and examples of these syntactic classes can be found in the annotation guidelines.

In the following sections, we will show the analysis on fp1 and adj since the result is similar for fp2.

[Figure 2. Percentage of examples of major syntactic classes.]

3.2 Why the differences?

To understand what causes the missing annotations and the spurious ones, we need methods to find how similar/different the false positives are to true positives and also how similar/different the false negatives (missing annotations) are to true negatives. If we adopt a good similarity metric, which captures the structural, lexical and semantic similarity between relation mentions, this analysis will help us to understand the similarity/difference from an extraction perspective.

We use a state-of-the-art feature space (Zhou et al., 2005) to represent examples (including all correct examples, erroneous ones and untagged examples) and use MaxEnt as the weight learning model since it shows competitive performance in relation extraction (Jiang and Zhai, 2007) and outputs probabilities associated with each prediction. We train a MaxEnt model for relation detection on true positives and true negatives, which respectively are the subset of correct examples annotated by fp1 (and adjudicated as correct ones) and negative examples that are not annotated in adj, and use it to make predictions on the mixed pool of correct examples, missing examples and spurious ones.

To illustrate how distinguishable the missing examples (false negatives) are from the true negative ones, 1) we apply the MaxEnt model on both false negatives and true negatives, 2) put them together and rank them by the model-predicted probabilities of being positive, and 3) calculate their relative rank in this pool. We plot the cumulative distribution of frequency (CDF) of the ranks (as percentages in the mixed pools) of false negatives in figure 3. We took similar steps for the spurious ones (false positives) and plot them in figure 3 as well (however, they are ranked by model-predicted probabilities of being negative).

[Figure 3: cumulative distribution of frequency (CDF) of the relative ranking of model-predicted probability of being positive for false negatives in a pool mixed of false negatives and true negatives; and the CDF of the relative ranking of model-predicted probability of being negative for false positives in a pool mixed of false positives and true positives.]

For false negatives, it shows a highly skewed distribution in which around 75% of the false negatives are ranked within the top 10%. That means the missing examples are lexically, structurally or semantically similar to correct examples, and are distinguishable from the true negative examples. However, the distribution of false positives (spurious examples) is close to uniform (flat curve), which means they are generally indistinguishable from the correct examples.

3.3 Categorize annotation errors

The automatic method shows that the errors (spurious annotations) are very similar to the correct examples but provides little clue as to why that is the case. To understand their causes, we sampled 65 examples from fp1 (10% of the 645 errors), read the sentences containing these
Category: Duplicate relation mention for coreferential entity mentions (49.2%)
  Relation type: ORG-AFF
  Sampled text (fp1): "his budding friendship with US President George W. Bush in the face of"
  Similar example in adj: "his budding friendship with US President George W. Bush in the face of"

Category: Correct (20%)
  Relation type: PHYS
  Sampled text (fp1): "Hundreds of thousands of demonstrators took to the streets in Britain"
  Note: symmetric relation
  Relation type: PER-SOC
  Sampled text (fp1): "The dead included the quack doctor, 55-year-old Nityalila Naotia, his teenaged son and"
  Similar example in adj: "The dead included the quack doctor, 55-year-old Nityalila Naotia, his teenaged son"

Category: Argument not in list (15.4%)
  Relation type: PER-SOC
  Sampled text (fp1): "Putin had even secretly invited British Prime Minister Tony Blair, Bush's staunchest backer in the war on Iraq"

Category: Violate reasonable reader rule (6.2%)
  Relation type: PHYS
  Sampled text (fp1): "'The amazing thing is they are going to turn San Francisco into ground zero for every criminal who wants to profit at their chosen profession', Paredes said."

Category: Errors (6.1%)
  Relation type: PART-WHOLE
  Sampled text (fp1): "a likely candidate to run Vivendi Universal's entertainment unit in the United States"
  Note: arguments are tagged reversed
  Relation type: PART-WHOLE
  Sampled text (fp1): "Khakamada argued that the United States would also need Russia's help 'to make the new Iraqi government seem legitimate.'"
  Note: relation type error

Category: Illegal promotion through blocked categories (3%)
  Relation type: PHYS
  Sampled text (fp1): "Up to 20,000 protesters thronged the plazas and streets of San Francisco, where"
  Similar example in adj: "Up to 20,000 protesters thronged the plazas and streets of San Francisco, where"

Table 1. Categories of spurious relation mentions in fp1 (on a sample of 10% of relation mentions), ranked by the percentage of the examples in each category. In the sampled text, red text (marked with dotted underlines in the original) shows the head words of the first arguments and underlined text shows the head words of the second arguments.
erroneous relation mentions and compared them to the correct relation mentions in the same sentence; we categorized these examples and show them in table 1. The most common type of error is duplicate relation mention for coreferential entity mentions. The first row in table 1 shows an example, in which there is a relation ORG-AFF tagged between US and George W. Bush in adj. Because President and George W. Bush are coreferential, the example <US, President> from fp1 is adjudicated as incorrect. This shows that if a relation is expressed repeatedly across relation mentions whose arguments are coreferential, the adjudicator only tags one of the relation mentions as correct, although the other is correct too. This shares the same principle with another type of error, illegal promotion through blocked categories⁹, as defined in the annotation guideline. The second largest category is correct, by which we mean the example is a correct relation mention and the adjudicator made a mistake. The third largest category is argument not in list, by which we mean that at least one of the arguments is not in the list of adjudicated entity mentions.

Based on Table 1, we can see that as many as 72%-88% of the examples which are adjudicated as incorrect are actually correct if viewed from a relation learning perspective, since most of them contain informative expressions for tagging relations. The annotation guideline is designed to ensure high quality while not imposing too much burden on human annotators. To reduce annotation effort, it defined rules such as illegal promotion through blocked categories. The annotators' practice suggests that they are following another rule of not annotating duplicate relation mentions for coreferential entity mentions. This follows the same principle of reducing annotation effort but is not explicitly stated in the guideline: to avoid propagation of a relation through a coreference chain. However, these examples are useful for learning more ways to express a relation. Moreover, even for the erroneous examples (shown in table 1 as violate reasonable reader rule and errors), most of them have some level of similar structure or semantics to the targeted relation. Therefore, it is very hard to distinguish them without human proofreading.

⁹ For example, in the sentence "Smith went to a hotel in Brazil", (Smith, hotel) is a taggable PHYS relation but (Smith, Brazil) is not, because to get the second relationship, one would have to promote Brazil through hotel. For the precise definition of the annotation rules, please refer to the ACE (Automatic Content Extraction) English Annotation Guidelines for Relations, version 5.8.3.
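The ranking analysis of section 3.2 can be sketched as below. This is a toy reconstruction, not the authors' code: synthetic Gaussian "features" stand in for the Zhou et al. (2005) feature space, and scikit-learn's LogisticRegression serves as the MaxEnt model, so the numbers only mimic the qualitative shape of Figure 3.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy 2-D "feature vectors": positives and missing annotations cluster together.
tp = rng.normal(loc=2.5, size=(50, 2))    # true positives (annotated and adjudicated correct)
tn = rng.normal(loc=-2.5, size=(200, 2))  # true negatives
fn = rng.normal(loc=2.5, size=(20, 2))    # missing annotations (false negatives)

# Train the detector on true positives and true negatives only.
model = LogisticRegression().fit(np.vstack([tp, tn]), [1] * len(tp) + [0] * len(tn))

# Rank the mixed pool of false negatives and true negatives by P(positive).
pool = np.vstack([fn, tn])                # false negatives occupy the first rows
p_pos = model.predict_proba(pool)[:, 1]
order = np.argsort(-p_pos)                # descending probability of being positive
rank = np.empty(len(pool), dtype=int)
rank[order] = np.arange(len(pool))
fn_percentile = rank[: len(fn)] / len(pool)  # relative rank of each false negative

# With separable toy data, most false negatives land near the top of the
# ranking, mirroring the skewed CDF the paper reports for Figure 3.
frac_in_top_10pct = float(np.mean(fn_percentile <= 0.10))
```

The same procedure with the sign flipped (ranking by P(negative) over false positives and true positives) yields the near-uniform curve the paper observes for spurious examples.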
Exp # | Training data | Testing data | Detection P/R/F1 (%)  | Classification P/R/F1 (%)
1     | fp1           | adj          | 83.4 / 60.4 / 70.0    | 75.7 / 54.8 / 63.6
2     | fp2           | adj          | 83.5 / 60.5 / 70.2    | 76.0 / 55.1 / 63.9
3     | adj           | adj          | 80.4 / 69.7 / 74.6    | 73.4 / 63.6 / 68.2

Table 2. Performance of RDC trained on fp1/fp2/adj, and tested on adj.
3.4 Why missing annotations and how many examples are missing?

4. Relation extraction with low-cost annotation
already showed that most of the spurious annotations are not actually errors from an extraction perspective, and table 2 shows that they do not hurt precision, we will only focus on utilizing the missing examples; in other words, training with an incomplete annotation.

4.2 Purify the set of negative examples

As discussed in section 2, traditional supervised methods find all pairs of entity mentions that appear within a sentence, and then use the pairs that are not annotated as relation mentions as the negative examples for the purpose of training a relation detector. This relies on the assumption that the annotators annotated all relation mentions and missed no (or very few) examples. However, this is not true for training on a single-pass annotation, in which a significant portion of relation mentions are left unannotated. If this scheme is applied, all of the correct pairs which the annotators missed belong to this negative category. Therefore, we need a way to purify the negative set of examples obtained by this conventional approach.

Li and Liu (2003) focus on classifying documents with only positive examples. Their algorithm initially sets all unlabeled data to be negative and trains a Rocchio classifier, selects negative examples which are closer to the negative centroid than to the positive centroid as the purified negative examples, and then retrains the model. Their algorithm performs well for text classification. It is based on the assumption that there are fewer unannotated positive examples than negative ones in the unlabeled set, so true negative examples still dominate the set of noisy negative examples in the purification step. Based on the same assumption, our purification process consists of the following steps:

1) Use annotated relation mentions as positive examples; construct all possible relation mentions that are not annotated, and initially set them to be negative. We call this noisy data set D.
2) Train a MaxEnt relation detection model Mdet on D.
3) Apply Mdet on all unannotated examples, and rank them by the model-predicted probabilities of being positive.
4) Remove the top N examples from D.

These preprocessing steps result in a purified data set, which we can use for the normal training process of a supervised relation extraction algorithm.

The algorithm is similar to Li and Liu (2003). However, we drop a few noisy examples instead of choosing a small purified subset since we have relatively few false negatives compared to the entire set of unannotated examples. Moreover, after step 3, most false negatives are clustered within the small region of top-ranked examples which has a high model-predicted probability of being positive. The intuition is similar to what we observed from figure 3 for false negatives, since we also observed a very similar distribution using the model trained with noisy data. Therefore, we can purify negatives by removing examples in this noisy subset.

However, the false negatives are still mixed with true negatives. For example, still slightly more than half of the top 2000 examples are true negatives. Thus we cannot simply flip their labels and use them as positive examples. In the following section, we will use them in the form of unlabeled examples to help train a better model.

4.3 Transductive inference on unlabeled examples

Transductive SVM (Vapnik, 1998; Joachims, 1999) is a semi-supervised learning method which learns a model from a data set consisting of both labeled and unlabeled examples. Compared to its popular antecedent SVM, it also learns a maximum-margin classification hyperplane, but additionally forces it to separate a set of unlabeled data with large margin. The optimization function of Transductive SVM (TSVM) for the non-separable case is the following (reconstructed from Joachims, 1999):

    minimize over (y*_1, ..., y*_k, w, b, ξ_1, ..., ξ_n, ξ*_1, ..., ξ*_k):
        (1/2)·||w||² + C·Σ_{i=1..n} ξ_i + C*·Σ_{j=1..k} ξ*_j
    subject to:
        for all i = 1..n:  y_i·(w·x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0
        for all j = 1..k:  y*_j·(w·x*_j + b) ≥ 1 − ξ*_j,  ξ*_j ≥ 0

Figure 4. TSVM optimization function for the non-separable case (Joachims, 1999).

TSVM can leverage an unlabeled set of examples to improve supervised learning. As shown in section 3, a significant number of relation mentions are missing from the single-pass annotation data. Although it is not possible to find all missing annotations without human effort, we can improve the model by further utilizing the fact that some unannotated examples should have been annotated.

The purification process discussed in the previous section removes N examples which have a high density of false negatives. We further utilize the N examples as follows:

1) Construct a training corpus from the purified data set by taking a random sample¹¹ of N·(1−p)/p (p is the ratio of annotated examples to all examples; p=0.05 in fp1) negatively labeled examples and setting them to be unlabeled. In addition, the N examples removed by the purification process are added back as unlabeled examples.
2) Train TSVM on this corpus.

The second step trains a model which replaces the detection model in the hierarchical detection-classification learning scheme we used. We will show in the next section that this improves the model.

5. Experiments

Experiments were conducted over the same set of documents on which we did the analysis: the 511 documents which have completed annotation in all of fp1, fp2 and adj from the ACE 2005 Multilingual Training Data V3.0. To reemphasize, we apply the hierarchical learning scheme and we focus on improving relation detection while keeping relation classification unchanged (results show that its performance is improved because of the improved detection). We use SVM as our learning algorithm with the full feature set from Zhou et al. (2005).

Baseline algorithm: The relation detector is unchanged. We follow the common practice, which is to use annotated examples as positive ones and all possible untagged relation mentions as negative ones. We sub-sampled the negative data since that shows better performance.

+purify: This algorithm adds an additional purification preprocessing step (section 4.2) before the hierarchical learning RDC algorithm. After purification, the RDC algorithm is trained on the positive examples and purified negative examples. We set N=2000¹² in all experiments.

+tSVM: First, the same purification process as in +purify is applied. Then we follow the steps described in section 4.3 to construct the set of unlabeled examples, and set all the rest of the purified negative examples to be negative. Finally, we train TSVM on both labeled and unlabeled data and replace the relation detection in the RDC algorithm. The relation classification is unchanged.

Table 3 shows the results. All experiments are done with 5-fold cross validation¹³ using testing data from adj. The first three rows show experiments trained on fp1, and the last row (ADJ) shows the unmodified RDC algorithm trained on adj for comparison. The purification of negative examples shows a significant performance gain: 3.7% F1 on relation detection and 3.4% on relation classification. The precision decreases but recall increases substantially since the missing examples are not treated as negatives. Experiments show that the purification process removes more than 60% of the false negatives. Transductive SVM further improved performance by a relatively small margin. This shows that the latent positive examples can help refine the model. Results also show that transductive inference can find around 17% of missing relation mentions. We notice that the performance of relation classification is improved since, by improving relation detection, some examples that do not express a relation are removed. The classification performance on single-pass annotation is close to the one trained on adj due to the help from a better relation detector trained with our algorithm.

We also did 5-fold cross validation with a model trained on a fraction of the 4/5 (4 folds) of adj data (each experiment shown in table 4 uses 4 folds of adj documents for training since one fold is left for cross validation). The documents are sampled randomly. Table 4 shows results for varying training data size. Compared to the results shown in the +tSVM row of table 3, we can see that our best model trained on single-pass annotation outperforms SVM trained on 90% of the dual-pass, adjudicated data in both relation detection and classification, although it costs less than half of the 3-pass annotation. This suggests that given the same amount of human effort for

…and fp2 is different. Moreover, algorithms trained on them show similar performance.
¹¹ We included this large random sample so that the balance of positive to negative examples in the unlabeled set would be similar to that of the labeled data. The test data is not included in the unlabeled set.
¹² We choose 2000 because it is close to the number of relations missed from each single-pass annotation. In practice, it contains more than 70% of the false negatives, and it is less than 10% of the unannotated examples. To estimate how many examples are missing (section 3.4), one should perform multiple passes of independent annotation on a small dataset and measure inter-annotator agreements.
¹³ Details about the settings for 5-fold cross validation are in section 4.1.
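The purification steps of section 4.2 can be sketched as below. This is an assumption-laden toy: scikit-learn's LogisticRegression stands in for the MaxEnt model, synthetic points stand in for feature vectors, and since scikit-learn has no transductive SVM, the removed top-N examples are simply returned as the unlabeled set that a TSVM implementation (e.g. SVM-Light's transductive mode) would then consume.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def purify(X_pos, X_unannotated, n_remove):
    """Steps 1-4 of section 4.2: treat all unannotated pairs as negative,
    train a detector, rank unannotated examples by P(positive), and drop
    the top n_remove as suspected missing annotations."""
    X = np.vstack([X_pos, X_unannotated])
    y = [1] * len(X_pos) + [0] * len(X_unannotated)   # step 1: noisy data set D
    m_det = LogisticRegression().fit(X, y)            # step 2: train Mdet on D
    p_pos = m_det.predict_proba(X_unannotated)[:, 1]  # step 3: rank by P(positive)
    top = np.argsort(-p_pos)[:n_remove]               # highest-probability suspects
    keep = np.setdiff1d(np.arange(len(X_unannotated)), top)
    # step 4: purified negatives, plus the suspects to be reused as unlabeled data
    return X_unannotated[keep], X_unannotated[top]

rng = np.random.default_rng(1)
X_pos = rng.normal(2.0, size=(30, 2))                 # annotated relation mentions
# Unannotated pool: mostly true negatives plus a few missed positives.
X_un = np.vstack([rng.normal(-2.0, size=(200, 2)),
                  rng.normal(2.0, size=(10, 2))])
negatives, unlabeled = purify(X_pos, X_un, n_remove=20)
```

On this toy data, the planted missed positives score highest and land in the unlabeled set, which mirrors the paper's observation that the removed region has a high density of false negatives.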
Algorithm     | Detection P/R/F1 (%)  | Classification P/R/F1 (%)
Baseline      | 83.4 / 60.4 / 70.0    | 75.7 / 54.8 / 63.6
+purify       | 76.8 / 70.9 / 73.7    | 69.8 / 64.5 / 67.0
+tSVM         | 76.4 / 72.1 / 74.2    | 69.4 / 65.2 / 67.2
ADJ (on adj)  | 80.4 / 69.7 / 74.6    | 73.4 / 63.6 / 68.2

Table 3. 5-fold cross-validation results. All are trained on fp1 (except the last row, which shows the unchanged algorithm trained on adj for comparison), and tested on adj. McNemar's test shows that the improvements from +purify to +tSVM, and from +tSVM to ADJ, are statistically significant (with p<0.05).
Percentage of adj (4/5) used | Detection P/R/F1 (%)  | Classification P/R/F1 (%)
60%                          | 86.9 / 41.2 / 55.8    | 78.6 / 37.2 / 50.5
70%                          | 85.5 / 51.3 / 64.1    | 77.7 / 46.6 / 58.2
80%                          | 83.3 / 58.1 / 68.4    | 75.8 / 52.9 / 62.3
90%                          | 82.0 / 64.9 / 72.5    | 74.9 / 59.4 / 66.2

Table 4. Performance with SVM trained on a fraction of adj. It shows 5-fold cross validation results.
relation annotation, annotating more documents with single-pass offers advantages over annotating less data with high quality assurance (dual passes and adjudication).

6. Related work

Dligach et al. (2010) studied WSD annotation from a cost-effectiveness viewpoint. They showed empirically that, with the same amount of annotation dollars spent, single-annotation is better than dual-annotation and adjudication. The common practice for quality control of WSD annotation is similar to relation annotation. However, the task of WSD annotation is very different from relation annotation. WSD requires that every example be assigned some tag, whereas that is not required for relation tagging. Moreover, relation tagging requires identifying two arguments and correctly categorizing their types.

The purification approach applied in this paper is related to the general framework of learning from positive and unlabeled examples. Li and Liu (2003) initially set all unlabeled data to be negative and train a Rocchio classifier, then select negative examples which are closer to the negative centroid than to the positive centroid as the purified negative examples. We share a similar assumption with Li and Liu (2003) but we use a different method to select negative examples, since the false negative examples show a very skewed distribution, as described in section 5.2. Transductive SVM was introduced by Vapnik (1998) and later refined in Joachims (1999). A few related methods were studied on the subtask of relation classification (the second stage of the hierarchical learning scheme) in Zhang (2005). Chan and Roth (2011) observed a similar phenomenon, that ACE annotators rarely duplicate a relation link for coreferential mentions. They use an evaluation scheme to avoid being penalized by the relation mentions which are not annotated because of this behavior.

7. Conclusion

We analyzed a snapshot of the ACE 2005 relation annotation and found that each single-pass annotation missed around 18-28% of relation mentions and contained around 10% spurious mentions. A detailed analysis showed that it is possible to find some of the false negatives, and that most spurious cases are actually correct examples from a system builder's perspective. By automatically purifying negative examples and applying transductive inference on suspicious examples, we can train a relation classifier whose performance is comparable to a classifier trained on the dual-annotated and adjudicated data. Furthermore, we show that single-pass annotation is more cost-effective than annotation with high quality assurance.

Acknowledgments

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL, or the U.S. Government.
References

ACE. http://www.itl.nist.gov/iad/mig/tests/ace/

ACE (Automatic Content Extraction) English Annotation Guidelines for Relations, version 5.8.3. 2005. http://projects.ldc.upenn.edu/ace/.

ACE 2005 Multilingual Training Data V3.0. 2005. LDC2005E18. LDC Catalog.

Elizabeth Boschee, Ralph Weischedel, and Alex Zamanian. 2005. Automatic information extraction. In Proceedings of the International Conference on Intelligence Analysis.

Razvan C. Bunescu and Raymond J. Mooney. 2005a. A shortest path dependency kernel for relation extraction. In Proceedings of HLT/EMNLP-2005.

Razvan C. Bunescu and Raymond J. Mooney. 2005b. Subsequence kernels for relation extraction. In Proceedings of NIPS-2005.

Yee Seng Chan and Dan Roth. 2011. Exploiting Syntactico-Semantic Structures for Relation Extraction. In Proceedings of ACL-2011.

Michael Collins and Nigel Duffy. 2001. Convolution Kernels for Natural Language. In Proceedings of NIPS-2001.

Dmitriy Dligach, Rodney D. Nielsen and Martha Palmer. 2010. To annotate more accurately or to annotate more. In Proceedings of the Fourth Linguistic Annotation Workshop at ACL 2010.

Ralph Grishman, David Westbrook and Adam Meyers. 2005. NYU's English ACE 2005 System Description. In Proceedings of the ACE 2005 Evaluation Workshop.

Heng Ji, Ralph Grishman, Hoa Trang Dang and Kira Griffitt. 2010. An Overview of the TAC2010 Knowledge Base Population Track. In Proceedings of TAC-2010.

Jing Jiang and ChengXiang Zhai. 2007. A systematic exploration of the feature space for relation extraction. In Proceedings of HLT-NAACL-2007.

Thorsten Joachims. 1999. Transductive Inference for Text Classification using Support Vector Machines. In Proceedings of ICML-1999.

Nanda Kambhatla. 2004. Combining lexical, syntactic, and semantic features with maximum entropy models for information extraction. In Proceedings of ACL-2004.

Xiao-Li Li and Bing Liu. 2003. Learning to classify text using positive and unlabeled data. In Proceedings of IJCAI-2003.

Scott Miller, Heidi Fox, Lance Ramshaw, and Ralph Weischedel. 2000. A novel use of statistical parsing to extract information from text. In Proceedings of NAACL-2000.

Longhua Qian, Guodong Zhou, Qiaoming Zhu and Peide Qian. 2008. Exploiting constituent dependencies for tree kernel-based semantic relation extraction. In Proceedings of COLING-2008.

Ang Sun, Ralph Grishman and Satoshi Sekine. 2011. Semi-supervised Relation Extraction with Large-scale Word Clustering. In Proceedings of ACL-2011.

Vladimir N. Vapnik. 1998. Statistical Learning Theory. John Wiley.

Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2003. Kernel methods for relation extraction. Journal of Machine Learning Research.

Min Zhang, Jie Zhang and Jian Su. 2006a. Exploring syntactic features for relation extraction using a convolution tree kernel. In Proceedings of HLT-NAACL-2006.

Min Zhang, Jie Zhang, Jian Su, and GuoDong Zhou. 2006b. A composite kernel to extract relations between entities with both flat and structured features. In Proceedings of COLING-ACL-2006.

Zhu Zhang. 2005. Mining Inter-Entity Semantic Relations Using Improved Transductive Learning. In Proceedings of IJCNLP-2005.

Shubin Zhao and Ralph Grishman. 2005. Extracting Relations with Integrated Information Using Kernel Methods. In Proceedings of ACL-2005.

Guodong Zhou, Jian Su, Jie Zhang and Min Zhang. 2005. Exploring various knowledge in relation extraction. In Proceedings of ACL-2005.

Guodong Zhou, Min Zhang, DongHong Ji, and QiaoMing Zhu. 2007. Tree kernel-based relation extraction with context-sensitive structured parse tree information. In Proceedings of EMNLP/CoNLL-2007.
Incorporating Lexical Priors into Topic Models

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 204-213, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
mln, dlrs, billion, year, pct, company, share, april, record, cts, quarter, march, earnings, stg, first, pay
mln, NUM, cts, loss, net, dlrs, shr, profit, revs, year, note, oper, avg, shrs, sales, includes
lt, company, shares, corp, dlrs, stock, offer, group, share, common, board, acquisition, shareholders
bank, market, dollar, pct, exchange, foreign, trade, rate, banks, japan, yen, government, rates, today
oil, tonnes, prices, mln, wheat, production, pct, gas, year, grain, crude, price, corn, dlrs, bpd, opec
Table 1: Topics identified by LDA on the frequent-5 categories of the Reuters corpus. The categories are Earn, Acquisition, Forex, Grain and Crude (in order of document frequency).
1  company, billion, quarter, shrs, earnings
2  acquisition, procurement, merge
3  exchange, currency, trading, rate, euro
4  grain, wheat, corn, oilseed, oil
5  natural, gas, oil, fuel, products, petrol

Table 2: An example of sets of seed words (seed topics) for the frequent-5 categories of the Reuters-21578 categorization corpus. We use them as a running example in the rest of the paper.

papers that such topics should exist in the corpus. By allowing the user to provide some seed words related to these underrepresented topics, we encourage the model to find evidence of these topics in the data. Importantly, we only encourage the model to follow the seed sets and do not force it. So if it has compelling evidence in the data to overcome the seed information then it still has the freedom to do so. Our seeding approach in combination with interactive topic modeling (Hu et al., 2011) will allow a user to both explore a corpus and also guide the exploration towards the distinctions that he/she finds more interesting.

2 Incorporating Seeds

Our approach to allowing a user to guide the topic discovery process is to let him provide seed information at the level of word type. Namely, the user provides sets of seed words that are representative of the corpus. Table 2 shows an example of seed sets one might use for the Reuters corpus. This kind of supervision is similar to the seeding in the bootstrapping literature (Thelen and Riloff, 2002) or prototype-based learning (Haghighi and Klein, 2006). Our reliance on seed sets is orthogonal to existing approaches that use external knowledge, which operate at the level of documents (Blei and McAuliffe, 2008), tokens (Andrzejewski and Zhu, 2009) or pair-wise constraints (Andrzejewski et al., 2009).

We build a model that uses the seed words in two ways: to improve both topic-word and document-topic probability distributions. For ease of exposition, we present these ideas separately and then in combination (Section 2.3). To improve topic-word distributions, we set up a model in which each topic prefers to generate words that are related to the words in a seed set (Section 2.1). To improve document-topic distributions, we encourage the model to select document-level topics based on the existence of input seed words in that document (Section 2.2). Before moving on to the details of our models, we briefly recall the generative story of the LDA model; the reader is encouraged to refer to (Blei et al., 2003) for further details.

1. For each topic k = 1 ... T, choose φ_k ~ Dir(β).
2. For each document d, choose θ_d ~ Dir(α). For each token i = 1 ... N_d:
   (a) Select a topic z_i ~ Mult(θ_d).
   (b) Select a word w_i ~ Mult(φ_{z_i}).

where T is the number of topics, α and β are hyperparameters of the model, and φ_k and θ_d are the topic-word and document-topic Multinomial probability distributions respectively.

2.1 Word-Topic Distributions (Model 1)

In regular topic models, each topic k is defined by a Multinomial distribution φ_k over words. We extend this notion and instead define a topic as a mixture of two Multinomial distributions: a seed topic distribution and a regular topic distribution. The seed topic distribution is constrained to only generate words from a corresponding seed set. The regular topic distribution may generate any word (including seed words). For example, seed topic 4 (in Table 2) can only generate the five words in its set. The word oil can be generated by seed topics 4 and 5, as well as any regular
[Figure 1: Tree representation of a document in Model 1 — a document node over topics z = 1 ... T, each topic splitting into its regular (φ_r) and seed (φ_s) distributions.]

topic. We want to emphasize that, like any regular topic, each seed topic is a non-uniform probability distribution over the words in its set. The user only inputs the sets of seed words and the model will infer their probability distributions.

For the sake of simplicity, we describe our model by assuming a one-to-one correspondence between seed and regular topics. This assumption can be easily relaxed by duplicating the seed topics when there are more regular topics. As shown in Fig. 1, each document is a mixture over T topics, where each of those topics is a mixture of a regular topic (φ_r) and its associated seed topic (φ_s) distributions. The parameter π_k controls the probability of drawing a word from the seed topic distribution versus the regular topic distribution. For our first model, we assume that the corpus is generated based on the following generative process (its graphical notation is shown in Fig. 2(a)):

1. For each topic k = 1 ... T,
   (a) Choose regular topic φ_r,k ~ Dir(β_r).
   (b) Choose seed topic φ_s,k ~ Dir(β_s).
   (c) Choose π_k ~ Beta(1, 1).
2. For each document d, choose θ_d ~ Dir(α). For each token i = 1 ... N_d:
   (a) Select a topic z_i ~ Mult(θ_d).
   (b) Select an indicator x_i ~ Bern(π_{z_i}).
   (c) If x_i is 0, select a word w_i ~ Mult(φ_r,z_i). // choose from regular topic
   (d) If x_i is 1, select a word w_i ~ Mult(φ_s,z_i). // choose from seed topic

The first step is to generate Multinomial distributions for both seed topics and regular topics. The seed topics are drawn in a way that constrains their distribution to only generate words in the corresponding seed set. Then, for each token in a document, we first generate a topic. After choosing a topic, we flip a (biased) coin to pick either the seed or the regular topic distribution. Once this distribution is selected we generate a word from it. It is important to note that although there are 2T topic-word distributions in total, each document is still a mixture of only T topics (as shown in Fig. 1). This is crucial in relating seed and regular topics and is similar to the way topics and aspects are tied in the TAM model (Paul and Girju, 2010).

To understand how this model gathers words related to seed words, consider a seed topic (say the fourth row in Table 2) with seed words {grain, wheat, corn, etc.}. Now by assigning all the related words such as tonnes, agriculture, production, etc. to its corresponding regular topic, the model can potentially put high probability mass on topic z = 4 for agriculture-related documents. Instead, if it places these words in another regular topic, say z = 3, then the document probability mass has to be distributed among topics 3 and 4 and as a result the model will pay a steeper penalty. Thus the model uses the seed topic to gather related words into its associated regular topic and, as a consequence, the document-topic distributions also become focused.

We have experimented with two ways of choosing the binary variable x_i (step 2b) of the generative story. In the first method, we fix this sampling probability to a constant value which is independent of the chosen topic (i.e. π_i = π, i = 1 ... T). And in the second method we learn the probability as well (Sec. 4).

2.2 Document-Topic Distributions (Model 2)

In the previous model we used seed words to improve topic-word probability distributions. Here we propose a model to explore the use of seed words to improve document-topic probability distributions. Unlike the previous model, we will present this model in the general case where the number of seed topics is not equal to the number of regular topics. Hence, we associate each seed set (we refer to a seed set as a group for conciseness) with a Multinomial distribution over the regular topics, which we call the group-topic distribution.

To give an overview of our model, first, we transfer the seed information from words onto
[Figure 2: The graphical notation of all three models. In Model 1 we use seed topics to improve the topic-word probability distributions. In Model 2, the seed topic information is first transferred to the document level based on the document tokens and then it is used to improve document-topic distributions. In the final, SeededLDA, model we combine both models. In Model 1 and SeededLDA, we dropped the dependency of φ_s on hyperparameter β_s since it is observed. And, for clarity, we also dropped the dependency of x on π.]

the documents that contain them. Then, the document-topic distribution is drawn in a two-step process: we sample a seed set (g for group) and then use its group-topic distribution (ψ_g) as a prior to draw the document-topic distribution (θ_d). We use this two-step process to allow a flexible number of seed and regular topics, and to tie the topic distributions of all the documents within a group. We assume the following generative story; its graphical notation is shown in Fig. 2(b).

1. For each topic k = 1 ... T,
   (a) Choose φ_r,k ~ Dir(β_r).
2. For each seed set s = 1 ... S,
   (a) Choose group-topic distribution ψ_s ~ Dir(α). // the topic distribution for the s-th group (seed set), a vector of length T
3. For each document d,
   (a) Choose a binary vector ~b of length S.
   (b) Choose a document-group distribution ζ_d ~ Dir(τ·~b).
   (c) Choose a group variable g ~ Mult(ζ_d).
   (d) Choose θ_d ~ Dir(ψ_g). // of length T
   (e) For each token i = 1 ... N_d:
      i. Select a topic z_i ~ Mult(θ_d).
      ii. Select a word w_i ~ Mult(φ_r,z_i).

We first generate T topic-word distributions (φ_k) and S group-topic distributions (ψ_s). Then, for each document, we generate a list of seed sets that are allowed for this document. This list is represented using the binary vector ~b. This binary vector can be populated based on the document words and hence it is treated as an observed variable. For example, consider the (very short!) document oil companies have merged. According to the seed sets from Table 2, we define a binary vector that denotes which seed topics contain words in this document. In this case, this vector is ~b = ⟨1, 1, 0, 1, 1⟩, indicating the presence of seeds from sets 1, 2, 4 and 5.1 As discussed in (Williamson et al., 2010), generating a binary vector is crucial if we want a document to talk about topics that are less prominent in the corpus.

The binary vector ~b, which indicates which seeds exist in this document, defines the mean of a Dirichlet distribution from which we sample a document-group distribution, ζ_d (step 3b). We set the concentration of this Dirichlet to a hyperparameter τ, which we set by hand (Sec. 4); thus, ζ_d ~ Dir(τ·~b). From the resulting multinomial, we draw a group variable g for this document. This group variable brings clustering structure among the documents by grouping the documents that are likely to talk about the same seed set.

Once the group variable (g) is drawn, we choose the document-topic distribution (θ_d) from a Dirichlet distribution with the group's topic distribution as the prior (step 3d). This step ensures that the topic distributions of documents within each group are related. The remaining sampling

1 As a special case, if no seed word is found in the document, ~b is defined as the all-ones vector.
process proceeds like LDA. We sample a topic for each word and then generate a word from its corresponding topic-word distribution. Observe that, if the binary vector is all ones and if we set θ_d = ζ_d, then this model reduces to the LDA model with τ and β_r as the hyperparameters.

2.3 SeededLDA

Both of our models use seed words, in different ways, to improve topic-word and document-topic distributions respectively. We can combine both of the above models easily. We refer to the combined model as SeededLDA and its generative story is as follows (its graphical notation is shown in Fig. 2(c)). The variables have the same semantics as in the previous models.

1. For each topic k = 1 ... T,
   (a) Choose regular topic φ_r,k ~ Dir(β_r).
   (b) Choose seed topic φ_s,k ~ Dir(β_s).
   (c) Choose π_k ~ Beta(1, 1).
2. For each seed set s = 1 ... S,
   (a) Choose group-topic distribution ψ_s ~ Dir(α).
3. For each document d,
   (a) Choose a binary vector ~b of length S.
   (b) Choose a document-group distribution ζ_d ~ Dir(τ·~b).
   (c) Choose a group variable g ~ Mult(ζ_d).
   (d) Choose θ_d ~ Dir(ψ_g). // of length T
   (e) For each token i = 1 ... N_d:
      i. Select a topic z_i ~ Mult(θ_d).
      ii. Select an indicator x_i ~ Bern(π_{z_i}).
      iii. If x_i is 0, select a word w_i ~ Mult(φ_r,z_i).
      iv. If x_i is 1, select a word w_i ~ Mult(φ_s,z_i).

In the SeededLDA model, the process for generating the group variable of a document is the same as the one described in Model 2. And like in Model 2, we sample a document-topic probability distribution as a Dirichlet draw with the group-topic distribution of the chosen group as the prior. Subsequently, we choose a topic for each token and then flip a biased coin. We choose either the seed or the regular topic based on the result of the coin toss and then generate a word from its distribution.

2.4 Automatic Seed Selection

In (Andrzejewski and Zhu, 2009; Andrzejewski et al., 2009), the seed information is provided manually. Here, we describe the use of feature selection techniques, prevalent in the classification literature, to automatically derive the seed sets. If we want the topicality structure identified by the LDA to align with the underlying class structure, then the seed words need to be representative of the underlying topicality structure. To enable this, we first take class-labeled data (it doesn't need to be multi-class labeled data, unlike (Ramage et al., 2009)) and identify the discriminating features for each class. Then we choose these discriminating features as the initial sets of seed words. In principle, this is similar to prototype-driven unsupervised learning (Haghighi and Klein, 2006).

We use Information Gain (Mitchell, 1997) to identify the required discriminating features. The Information Gain (IG) of a word (w) in a class (c) is given by

IG(c, w) = H(c) − H(c|w)

where H(c) is the entropy of the class and H(c|w) is the conditional entropy of the class given the word. In computing Information Gain, we binarize the document vectors and consider whether a word occurs in any document of a given class or not. The ranked lists of words thus obtained for each class are filtered for ambiguous words and then used as the initial sets of seed words to be input to the model.

3 Related Work

Seed-based supervision is closely related to the idea of seeding in the bootstrapping literature for learning semantic lexicons (Thelen and Riloff, 2002). The goals are similar as well: growing a small set of seed examples into a much larger set. A key difference is the type of semantic information that the two approaches aim to capture: semantic lexicons are based on much more specific notions of semantics (e.g. all the country names) than the generic topic semantics of topic models. The idea of seeding has also been used in prototype-driven learning (Haghighi and Klein, 2006) and has shown similar efficacies for these semi-supervised learning approaches.

LDAWN (Boyd-Graber et al., 2007) models sets of words for the word sense disambiguation
task. It assumes that a topic is a distribution over synsets and relies on WordNet to obtain the synsets. The most related prior work is that of (Andrzejewski et al., 2009), who propose the use of Dirichlet Forest priors to incorporate Must-Link and Cannot-Link constraints into topic models. This work is analogous to constrained K-means clustering (Wagstaff et al., 2001; Basu et al., 2008). A must-link between a pair of word types represents that the model should encourage both words to have either high or low probability in any particular topic. A cannot-link between a word pair indicates that both words should not have high probability in a single topic. In the Dirichlet Forest approach, the constraints are first converted into trees with words as the leaves and edges having pre-defined weights. All the trees are joined to a dummy node to form a forest. The sampling for a word translates into a random walk on the forest: starting from the root and selecting one of its children based on the edge weights until you reach a leaf node.

While the Dirichlet Forest method requires supervision in terms of must-link and cannot-link information, the Topics In Sets (Andrzejewski and Zhu, 2009) model proposes a different approach. Here, the supervision is provided at the token level. The user chooses specific tokens and restricts them to occur only within a specified list of topics. While this needs minimal changes to the inference process of LDA, it requires information at the level of tokens. The word-type-level seed information can be converted into token-level information (like we do in Sec. 4), but this prevents their model from distinguishing the tokens based on word senses.

Several models have been proposed which use supervision at the document level. Supervised LDA (Blei and McAuliffe, 2008) and DiscLDA (Lacoste-Julien et al., 2008) try to predict the category labels (e.g. sentiment classification) for the input documents based on document-labeled data. Of these models, the most related one to SeededLDA is the LabeledLDA model (Ramage et al., 2009). Their model operates on a multi-class labeled corpus. Each document is assumed to be a mixture over a known subset of topics (classes), with each topic being a distribution over words. The process of generating the document-topic distribution in LabeledLDA is similar to the process of generating the group distribution in our Model 2 (Sec. 2.2). However, our model differs from LabeledLDA in the subsequent steps. Rather than using the group distribution directly, we sample a group variable and use it to constrain the document-topic distributions of all the documents within this group. Moreover, in their model the binary vector is observed directly in the form of document labels while, in our case, it is automatically populated based on the document tokens.

Interactive topic modeling brings the user into the loop, by allowing him/her to make suggestions on how to improve the quality of the topics at each iteration (Hu et al., 2011). In their approach, the authors use the Dirichlet Forest method to incorporate the user's preferences. In our experiments (Sec. 4), we show that SeededLDA performs better than the Dirichlet Forest method, so SeededLDA, when used with their framework, can allow a user to explore a document collection in a more meaningful manner.

4 Experiments

We evaluate different aspects of the model separately. Our experimental setup proceeds as follows: a) Using an existing model, we evaluate the effectiveness of automatically derived constraints, indicating the potential benefits of adding seed words into topic models. b) We evaluate each of our proposed models in different settings and compare with multiple baseline systems.

Since our aim is to overcome the dominance of majority topics by encouraging the topicality structure identified by the topic models to align with that of the document corpus, we choose extrinsic evaluation as the primary evaluation method. We use the document clustering task with the frequent-5 categories of the Reuters-21578 corpus (Lewis et al., 2004) and four classes from the 20 Newsgroups data set (i.e. rec.autos, sci.electronics, comp.hardware and alt.atheism). For both corpora we do the standard preprocessing of removing stopwords and infrequent words (Williamson et al., 2010).

For all the models, we use a Collapsed Gibbs sampler (Griffiths and Steyvers, 2004) for the inference process. We use the standard hyperparameter values α = 1.0, β = 0.01 and τ = 1.0 and run the sampler for 1000 iterations, but one can use techniques like slice sampling to estimate the hyperparameters (Johnson and Goldwater, 2009).
                   |        Reuters          |      20 Newsgroups
                   | F-measure      VI       | F-measure      VI
LDA                | 0.64 (.05)  1.26 (.16)  | 0.77 (.06)  0.90 (.13)
Dirichlet Forest   | 0.67 (.02)  1.17 (.11)  | 0.79 (.01)  0.83 (.03)
  over LDA         |  (+4.68%)    (-7.1%)    |  (+2.6%)    (-7.8%)

Table 3: The effect of adding constraints by Dirichlet Forest encoding. For Variation of Information (VI) a lower score indicates a better clustering. * indicates statistical significance at p = 0.01 as measured by the t-test. All four improvements are significant at p = 0.05.

We run all the models with the same number of topics as the number of clusters. Then, for each document, we find the topic that has maximum probability in the posterior document-topic distribution and assign it to that cluster. The accuracy of the document clustering is measured in terms of F-measure and Variation of Information. F-measure is calculated based on pairs of documents, i.e. if two documents belong to a cluster in both the ground truth and the clustering proposed by the system then it is counted as correct, otherwise it is counted as wrong. The Variation of Information (VI) of two clusterings X and Y is given as (Meila, 2007):

VI(X, Y) = H(X) + H(Y) − 2I(X, Y)

where H(X) denotes the entropy of the clustering X and I(X, Y) denotes the mutual information between the two clusterings. For VI, a lower value indicates a better clustering. All the accuracies are averaged over 25 different random initializations and all the significance results are measured using the t-test at p = 0.01.

4.1 Seed Extraction

The seeds were extracted automatically (Sec. 2.4) based on a small sample of labeled data other than the test data. We first extract 25 seed words per class and then remove the seed words that appear in more than one class. After this filtering, on average, we are left with 9 and 15 words per seed topic for the Reuters and 20 Newsgroups corpora respectively.

We use the existing Dirichlet Forest method to evaluate the effectiveness of the automatically extracted seed words. The Must and Cannot links required for the supervision (Andrzejewski et al., 2009) are automatically obtained by adding a must-link between every pair of words belonging to the same seed set and a split constraint between every pair of words belonging to different sets. The accuracies are averaged over 25 different random initializations and are shown in Table 3. We have also indicated the relative performance gains compared to LDA. The significant improvement over plain LDA demonstrates the effectiveness of the automatic extraction of seed words in topic models.

4.2 Document Clustering

In the next experiment, we compare our models with LDA and other baselines. The first baseline (maxCluster) simply counts the number of tokens in each document from each of the seed topics and assigns the document to the seed topic that has the most tokens. This results in a clustering of documents based on the seed topic they are assigned to. This baseline evaluates the effectiveness of the seed words with respect to the underlying clustering. Apart from the maxCluster baseline, we use LDA and z-labels (Andrzejewski and Zhu, 2009) as our baselines. For z-labels, we treat all the tokens of a seed word in the same way. Table 4 shows the comparison of our models with respect to the baseline systems.2

Comparing the performance of maxCluster to that of LDA, we observe that the seed words themselves do a poor job of clustering the documents.

We experimented with two variants of Model 1. In the first run (Model 1) we sample the π_k value, i.e. the probability of choosing a seed topic, for each topic. While in the Model 1 (π = 0.7) run, we fix this probability to a constant value of 0.7 irrespective of the topic.3 Though both the models

2 The code used for the LDA baseline in Tables 3 and 4 is different. For Table 3, we use the code available from http://pages.cs.wisc.edu/andrzeje/research/df lda.html. We use our own version for Table 4. We tried to produce a comparable baseline by running the former for more iterations and with different hyperparameters. In Table 3, we report their best results.
3 We chose this value based on intuition; it is not tuned.
Reuters 20 Newsgroups
F-measure VI F-measure VI
maxCluster 0.53 1.75 0.58 1.44
LDA 0.66 (.04) 1.2 (.12) 0.76 (.06) 0.9 (.14)
z-labels 0.73 (.01) 1.04 (.01) 0.8 (.00) 0.82 (.01)
over LDA (+10.6%) (-13.3%) (+5.26%) (-8.8%)
Model 1 0.69 (.00) 1.13 (.01) 0.8 (.01) 0.81 (.02)
Model 1 ( = 0.7) 0.73 (.00) 1.09 (.01) 0.8 (.01) 0.81 (.02)
Model 2 0.66 (.04) 1.22 (.1) 0.77 (.07) 0.85 (.12)
SeededLDA 0.76 (.01) 0.99 (.03) 0.81 (.01) 0.75 (.02)
over LDA (+15.5%) (-17.5%) (+6.58%) (-16.7%)
Table 4: Accuracies on document clustering task with different models. indicates significant improvement
compared to the z-labels approach, as measured by the t-test with p = 0.01. The relative performance gains are
with respect to the LDA model and are provided for comparison with Dirichlet Forest method (in Table 3.)
performed better than LDA, fixing the probability gave better results. When we attempt to learn this value, the model chooses to explain some of the seed words by the regular topics. On the other hand, when it is fixed, the model explains almost all the seed words based on the seed topics. The next row (Model 2) indicates the performance of our second model on the same data sets. The first model seems to perform better than the second model, which is justifiable since the latter uses seed topics indirectly. Though the variants of Model 1 and Model 2 performed better than LDA, they fell short of the z-labels approach.

Table 4 also shows the performance of our combined model (SeededLDA) on both corpora. When the models are combined, the performance improves over each of them and is also better than the baseline systems. As explained before, our individual models improve the topic-word and the document-topic distributions respectively, but it turns out that the knowledge learnt by the two individual models is complementary. As a result, the combined model performed better than the individual models and the other baseline systems. Comparing the last rows of Tables 4 and 3, we notice that the relative performance gains observed in the case of SeededLDA are significantly higher than the gains obtained by incorporating the constraints using the Dirichlet Forest method. Moreover, as indicated in Table 4, SeededLDA achieves significant gains over the z-labels approach as well.

We have also provided the standard intervals for each of the approaches. A quick inspection of these intervals reveals the superior performance of SeededLDA compared to all the baselines. The standard deviation of the F-measures over different random initializations of our model is about 1% for both corpora, while it is 4% and 6% for LDA on the Reuters and 20 Newsgroups corpora respectively. The reduction in the variance, across all the approaches that use seed information, shows the increased robustness of the inference process when using seed words. From the accuracies in both tables, it is clear that the SeededLDA model out-performs the other models which try to incorporate seed information into topic models.

4.3 Effect of Ambiguous Seeds

In the following experiment we study the effect of ambiguous seeds. We allow a seed word to occur in multiple seed sets. Table 6 shows the corresponding results. The performance drops when we add ambiguous seed words, but it is still higher than that of the LDA model. This suggests that the quality of the seed topics is determined by the discriminative power of the seed words rather than the number of seed words in each seed topic. The topics identified by SeededLDA on the Reuters corpus are shown in Table 5. With the help of the seed sets, the model is able to split Grain and Crude into two separate topics, which were merged into a single topic by plain LDA.

4.4 Qualitative Evaluation on NIPS papers

We ran the LDA and SeededLDA models on the NIPS papers from 2001 to 2010. For this corpus, the seed words are chosen from the call for proposals.
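Seed information of the kind used throughout this section is often injected into topic models through asymmetric Dirichlet priors over the topic-word distributions. The sketch below illustrates only that generic mechanism; it is our own illustration, not the paper's Model 1 / Model 2 / SeededLDA formulation, and the function name and constants are made up:

```python
import numpy as np

def seeded_topic_priors(vocab, seed_sets, base=0.01, boost=1.0):
    """Build asymmetric Dirichlet priors over words: topic t's prior is `base`
    everywhere, plus an extra `boost` pseudo-count on the words of seed set t."""
    word_id = {w: i for i, w in enumerate(vocab)}
    priors = np.full((len(seed_sets), len(vocab)), base)
    for t, seeds in enumerate(seed_sets):
        for w in seeds:
            priors[t, word_id[w]] += boost
    return priors

# Toy seed sets in the spirit of the Reuters topics of Table 5.
vocab = ["wheat", "corn", "oil", "opec", "bank", "yen"]
seeds = [["wheat", "corn"], ["oil", "opec"], ["bank", "yen"]]
priors = seeded_topic_priors(vocab, seeds, base=0.5, boost=1.0)
```

A Gibbs sampler or variational inference run with such priors is merely nudged, not forced, toward the seed sets, which is one reason ambiguous seed words degrade gracefully rather than catastrophically.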
group, offer, common, cash, agreement, shareholders, acquisition, stake, merger, board, sale
oil, price, prices, production, lt, gas, crude, 1987, 1985, bpd, opec, barrels, energy, first, petroleum
0, mln, cts, net, loss, 2, dlrs, shr, 3, profit, 4, 5, 6, revs, 7, 9, 8, year, note, 1986, 10, 0, sales
tonnes, wheat, mln, grain, week, corn, department, year, export, program, agriculture, 0, soviet, prices
bank, market, pct, dollar, exchange, billion, stg, today, foreign, rate, banks, japan, yen, rates, trade

Table 5: Topics identified by SeededLDA on the Reuters corpus (top words per topic).
References

Andrzejewski, D. and Zhu, X. (2009). Latent Dirichlet allocation with topic-in-set knowledge. In Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing, SemiSupLearn '09, pages 43-48, Morristown, NJ, USA. Association for Computational Linguistics.

Andrzejewski, D., Zhu, X., and Craven, M. (2009). Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 25-32, New York, NY, USA. ACM.

Basu, S., Davidson, I., and Wagstaff, K. (2008). Constrained Clustering: Advances in Algorithms, Theory, and Applications. Chapman & Hall/CRC Press.

Blei, D. and McAuliffe, J. (2008). Supervised topic models. In Advances in Neural Information Processing Systems 20, pages 121-128, Cambridge, MA. MIT Press.

Blei, D. M. and Lafferty, J. (2009). Topic models. In Text Mining: Theory and Applications. Taylor and Francis.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022.

Boyd-Graber, J., Blei, D. M., and Zhu, X. (2007). A topic model for word sense disambiguation. In Empirical Methods in Natural Language Processing.

Chang, J., Boyd-Graber, J., Wang, C., Gerrish, S., and Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems.

Griffiths, T., Steyvers, M., and Tenenbaum, J. (2007). Topics in semantic representation. Psychological Review, 114(2):211-244.

Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences USA, 101(Suppl 1):5228-5235.

Griffiths, T. L., Steyvers, M., Blei, D. M., and Tenenbaum, J. B. (2005). Integrating topics and syntax. In Advances in Neural Information Processing Systems, volume 17, pages 537-544.

Haghighi, A. and Klein, D. (2006). Prototype-driven learning for sequence models. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL '06, pages 320-327, Stroudsburg, PA, USA. Association for Computational Linguistics.

Hu, Y., Boyd-Graber, J., and Satinoff, B. (2011). Interactive topic modeling. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 248-257, Stroudsburg, PA, USA. Association for Computational Linguistics.

Johnson, M. and Goldwater, S. (2009). Improving nonparametric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL '09, pages 317-325, Stroudsburg, PA, USA. Association for Computational Linguistics.

Lacoste-Julien, S., Sha, F., and Jordan, M. (2008). DiscLDA: Discriminative learning for dimensionality reduction and classification. In Proceedings of NIPS '08.

Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397.

Meila, M. (2007). Comparing clusterings - an information based distance. Journal of Multivariate Analysis, 98:873-895.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York.

Paul, M. and Girju, R. (2010). A two-dimensional topic-aspect model for discovering multi-faceted topics. In AAAI.

Ramage, D., Hall, D., Nallapati, R., and Manning, C. D. (2009). Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, EMNLP '09, pages 248-256, Morristown, NJ, USA. Association for Computational Linguistics.

Thelen, M. and Riloff, E. (2002). A bootstrapping method for learning semantic lexicons using extraction pattern contexts. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Wagstaff, K., Cardie, C., Rogers, S., and Schrodl, S. (2001). Constrained k-means clustering with background knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 577-584, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Wallach, H. M. (2005). Topic modeling: beyond bag-of-words. In NIPS 2005 Workshop on Bayesian Methods for Natural Language Processing.

Williamson, S., Wang, C., Heller, K. A., and Blei, D. M. (2010). The IBP compound Dirichlet process and its application to focused topic modeling. In ICML, pages 1151-1158.
DualSum: a Topic-Model based approach for update summarization
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 214-223,
Avignon, France, April 23 - 27 2012.
© 2012 Association for Computational Linguistics
Dirichlet Allocation (LDA) (Blei et al., 2003), aims to learn to distinguish between common information and novel information. We have evaluated this approach on the ROUGE scores and demonstrate that it produces results comparable to the top system in TAC-2011. Furthermore, our approach improves over that system when evaluated manually in terms of linguistic quality and overall responsiveness.

2 Related work

2.1 Bayesian approaches in Summarization

Most Bayesian approaches to summarization are based on topic models. These generative models represent documents as mixtures of latent topics, where a topic is a probability distribution over words. In TopicSum (Haghighi and Vanderwende, 2009), each word is generated by a single topic, which can be a corpus-wide background distribution over common words, a distribution of document-specific words, or a distribution of the core content of a given cluster. BayesSum (Daumé and Marcu, 2006) and the Special Words and Background model (Chemudugunta et al., 2006) are very similar to TopicSum.

A commonality of all these models is the use of collection and document-specific distributions in order to distinguish between the general and specific topics in documents. In the context of summarization, this distinction helps to identify the important pieces of information in a collection.

Models that use more structure in the representation of documents have also been proposed for generating more coherent and less redundant summaries, such as HierSum (Haghighi and Vanderwende, 2009) and TTM (Celikyilmaz and Hakkani-Tur, 2011). For instance, HierSum models the intuitions that first sentences in documents should contain more general information, and that adjacent sentences are likely to share specific content vocabulary. However, HierSum, which builds upon TopicSum, does not show a statistically significant improvement in ROUGE over TopicSum.

A number of techniques have been proposed to rank the sentences of a collection given a word distribution (Carbonell and Goldstein, 1998; Goldstein et al., 1999). The Kullback-Leibler divergence (KL) is a widely used measure in summarization. Given a target distribution T that we want a summary S to approximate, KL is commonly used as the scoring function to select the subset of sentences S* that minimizes the KL divergence with T:

S* = argmin_S KL(T, S) = argmin_S Σ_{w ∈ V} p_T(w) log [ p_T(w) / p_S(w) ]

where w is a word from the vocabulary V. This strategy is called KLSum. Usually, a smoothing factor is applied to the candidate distribution S in order to avoid the divergence being undefined (in our experiments we set it to 0.01). This objective function selects the most representative sentences of the collection, and at the same time it also diversifies the generated summary by penalizing redundancy. Since the problem of finding the subset of sentences of a collection that minimizes the KL divergence is NP-complete, a greedy algorithm is often used in practice; in our experiments, we follow the same approach as Haghighi and Vanderwende (2009), greedily adding sentences to a summary so long as they decrease the KL divergence. Some variations of this objective function can be considered, such as penalizing sentences that contain document-specific topics (Mason and Charniak, 2011) or rewarding sentences appearing closer to the beginning of the document.

Wang et al. (2009) propose a Bayesian approach for summarization that does not use KL for reranking. In their model, Bayesian Sentence-based Topic Models, every sentence in a document is assumed to be associated with a unique latent topic. Once the model parameters have been calculated, a summary is generated by choosing the sentence with the highest probability for each topic.

While hierarchical topic modeling approaches have shown remarkable effectiveness in learning the latent topics of document collections, they are not designed to capture the novel information in a collection with respect to another one, which is the primary focus of update summarization.

2.2 Update Summarization

The goal of update summarization is to generate an update summary of a collection B of recent documents assuming that the users have already read earlier documents from a collection A. We refer
to collection A as the base collection and to collection B as the update collection.

Update summarization is related to novelty detection, which can be defined as the problem of determining whether a document contains new information given an existing collection (Soboroff and Harman, 2005). Thus, while the goal of novelty detection is to determine whether some information is new, the goal of update summarization is to extract and synthesize the novel information.

Update summarization is also related to contrastive summarization, i.e. the problem of jointly generating summaries for two entities in order to highlight their differences (Lerman and McDonald, 2009). The primary difference here is that update summarization aims to extract novel or updated information in the update collection with respect to the base collection.

The most common approach for update summarization is to apply a normal multi-document summarizer, with some added functionality to remove sentences that are redundant with respect to collection A. This can be achieved using simple filtering rules (Fisher and Roark, 2008), Maximal Marginal Relevance (Boudin et al., 2008), or more complex graph-based algorithms (Shen and Li, 2010; Wenjie et al., 2008). The goal here is to boost sentences in B that bring out completely novel information. One problem with this approach is that it is likely to discard as redundant those sentences in B whose novel information is mixed with known information from collection A.

Another approach is to introduce specific features intended to capture the novelty in collection B. For example, comparing collections A and B, FastSum derives features for collection B, such as the number of named entities in the sentence that already occurred in the old cluster, or the number of new content words in the sentence not already mentioned in the old cluster, which are subsequently used to train a Support Vector Machine classifier (Schilder et al., 2008). A limitation of this approach is that there are no large training sets available and, the more features it has, the more it is affected by the sparsity of the training data.

3 DualSum

3.1 Model Formulation

The input for DualSum is a set of pairs of collections of documents C = {(A_i, B_i)}_{i=1...m}, where A_i is a base document collection and B_i is an update document collection. We use c to refer to a collection pair (A_c, B_c).

In DualSum, documents are modeled as bags of words that are assumed to be sampled from a mixture of latent topics. Each word is associated with a latent variable that specifies which topic distribution is used to generate it. Words in a document are assumed to be conditionally independent given the hidden topic.

As in previous Bayesian work on summarization (Daumé and Marcu, 2006; Chemudugunta et al., 2006; Haghighi and Vanderwende, 2009), DualSum not only learns collection-specific distributions, but also a general background distribution φ_G over common words and a document-specific distribution φ_cd for each document d in collection pair c, which is useful to separate the specific aspects from the general aspects of c. The main novelty is that DualSum introduces specific machinery for identifying novelty.

To capture the differences between the base and the update collection for each pair c, DualSum learns two topics for every collection pair. The joint topic φ_Ac captures the common information between the two collections in the pair, i.e. the main event that both collections are discussing. The update topic φ_Bc focuses on the aspects that are specific to the documents inside the update collection.

In the generative model:

- For a document d in a collection A_c, words can originate from one of three different topics: φ_G, φ_cd and φ_Ac, the last of which captures the main topic described in the collection pair.
- For a document d in a collection B_c, words can originate from one of four different topics: φ_G, φ_cd, φ_Ac and φ_Bc. The last one will capture the most important updates to the main topic.

To make this representation easier, we can also state that both collections are generated from the four topics, but we constrain the topic probability
1. Sample φ_G ~ Dir(λ_G)
2. For each collection pair c = (A_c, B_c):
   - Sample φ_Ac ~ Dir(λ_A)
   - Sample φ_Bc ~ Dir(λ_B)
   - For each document d of type u_cd ∈ {A, B}:
     - Sample φ_cd ~ Dir(λ_D)
     - If (u_cd = A) sample ψ_cd ~ Dir(γ_A)
     - If (u_cd = B) sample ψ_cd ~ Dir(γ_B)
     - For each word w in document d:
       (a) Sample a topic z ~ Mult(ψ_cd), z ∈ {G, cd, A_c, B_c}
       (b) Sample a word w ~ Mult(φ_z)

Figure 1: Generative model in DualSum.

Figure 2: Graphical model representation of DualSum.

for B_c to be always zero when generating a base document.

We denote by u_cd ∈ {A, B} the type of a document d in pair c. This is an observed, Boolean variable stating whether the document d belongs to the base or the update collection inside the pair c.

there should be more words in the background than in the other distributions, so the mass is expected to be shared over a larger number of words.

Unlike the word distributions, the mixing probabilities are drawn from a Dirichlet distribution with asymmetric priors. The prior knowledge about the origin of words in the base and update collections is again encoded at the level of the hyper-parameters. For example, if we set γ_A = (5, 3, 2, 0), this would reflect the intuition that, on average, in the base collections, 50% of the words originate from the background distribution, 30% from the document-specific distribution, and 20% from the joint topic. Similarly, if we set γ_B = (5, 2, 2, 1), the prior reflects the assumption that, on average, in the update collections, 50% of the words originate from the background distribution, 20% from the document-specific distribution, 20% from the joint topic, and 10% from the novel, update topic. The priors we have actually used are reported in Section 4.

The inference task is to compute the posterior over the latent variables:

p(z, φ, ψ | w, u) = p(z, φ, ψ, w, u) / p(w, u)

Omitting hyper-parameters for notational simplicity, the joint distribution over the observed variables is:

p(w, u) = p(φ_G) · Π_c p(φ_Ac) p(φ_Bc) · Π_d p(u_cd) p(φ_cd) ∫ [ Π_n Σ_{z_cdn} p(w_cdn | φ_{z_cdn}) p(z_cdn | ψ_cd) ] p(ψ_cd | u_cd) dψ_cd
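The generative story of Figure 1 can be sketched directly in code. This is a toy illustration with a made-up vocabulary size and symmetric word-distribution pseudo-counts (the symbols φ, ψ, λ, γ follow the figure; only the example γ vectors come from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 50                           # toy vocabulary size (assumption, not from the paper)
lam = 0.1                        # symmetric pseudo-count for the word distributions

# Asymmetric priors over (G, cd, A_c, B_c), as in the example in the text.
gamma = {"A": np.array([5.0, 3.0, 2.0, 0.0]),
         "B": np.array([5.0, 2.0, 2.0, 1.0])}

phi_G = rng.dirichlet(np.full(V, lam))        # background word distribution

def generate_document(u, phi_Ac, phi_Bc, n_words=50):
    """Generate one document of type u in {'A', 'B'}, following Figure 1."""
    phi_cd = rng.dirichlet(np.full(V, lam))   # document-specific distribution
    # NumPy requires strictly positive alphas, so the zero pseudo-count that
    # forbids the update topic in base documents is approximated by 1e-12.
    psi = rng.dirichlet(gamma[u] + 1e-12)     # topic mixing proportions
    topics = [phi_G, phi_cd, phi_Ac, phi_Bc]
    words = []
    for _ in range(n_words):
        z = rng.choice(4, p=psi)                   # (a) sample a topic indicator
        words.append(rng.choice(V, p=topics[z]))   # (b) sample a word from phi_z
    return words

phi_Ac = rng.dirichlet(np.full(V, lam))       # joint topic of a collection pair
phi_Bc = rng.dirichlet(np.full(V, lam))       # update topic of the pair
doc = generate_document("A", phi_Ac, phi_Bc)
```

The zero entry in γ_A is what encodes the constraint that base documents never draw from the update topic; everything else is a standard LDA-style mixture.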
Variational approaches (Blei et al., 2003) and collapsed Gibbs sampling (Griffiths and Steyvers, 2004) are common techniques for approximate inference in Bayesian models. They offer different advantages: the variational approach is arguably faster computationally, but the Gibbs sampling approach is in principle more accurate, since it asymptotically approaches the correct distribution (Porteous et al., 2008). In this section, we provide details on a collapsed Gibbs sampling strategy to infer the model parameters of DualSum for a given dataset; the interested reader is invited to consult (Wang, 2011) for more details on using Gibbs sampling for LDA-like models.

Collapsed Gibbs sampling is a particular case of Markov Chain Monte Carlo (MCMC) that involves repeatedly sampling a topic assignment for each word in the corpus. A single iteration of the Gibbs sampler is completed after sampling a new topic for each word based on the previous assignment. In a collapsed Gibbs sampler, the model parameters are integrated out (or collapsed), so that only z needs to be sampled. Let us call w_cdn the n-th word in document d in collection c, and z_cdn its topic assignment. For Gibbs sampling, we need to calculate p(z_cdn | w, u, z_¬cdn), where z_¬cdn denotes the random vector of topic assignments excluding the assignment z_cdn:

p(z_cdn = j | w, u, z_¬cdn, λ_A, λ_B, γ) ∝
  (n_¬cdn,j^(w_cdn) + λ_j) / (Σ_{v=1..V} n_¬cdn,j^(v) + V λ_j)
  × (n_¬cdn,j^(cd) + γ_{j,u_cd}) / (Σ_{k∈K} (n_¬cdn,k^(cd) + γ_{k,u_cd}))

where K = {G, cd, A_c, B_c}, n_¬cdn,j^(v) denotes the number of times word v is assigned to topic j excluding the current assignment of word w_cdn, and n_¬cdn,j^(cd) denotes the number of words in document d of collection c that are assigned to topic j, excluding the current assignment of word w_cdn.

After each sampling iteration, the model parameters can be estimated using the following formulas:

φ_kw = (n_k^(w) + λ_k) / (Σ_{v=1..V} n_k^(v) + V λ_k)

ψ_kcd = (n_k^(cd) + γ_k) / (n_·^(cd) + Σ_{k'∈K} γ_{k'})

where k ∈ K, n_k^(v) denotes the number of times word v is assigned to topic k, and n_k^(cd) denotes the number of words in document d of collection c that are assigned to topic k.

By the strong law of large numbers, the average of the sampled parameters should converge towards the true expected value of the model parameters. Therefore, good estimates of the model parameters can be obtained by averaging over the sampled values. As suggested by Gamerman and Lopes (2006), we have set a lag (20 iterations) between samples in order to reduce auto-correlation between samples. Our sampler also discards the first 100 iterations as a burn-in period, in order to avoid averaging over samples that are still strongly influenced by the initial assignment.

4 Experiments in Update Summarization

The Bayesian graphical model described in the previous section can be run over a set of news collections to learn the background distribution, a joint distribution for each collection, an update distribution for each collection, and the document-specific distributions. Once this is done, one of the learned collections can be used to generate the summary that best approximates this collection, using the greedy algorithm described by Haghighi and Vanderwende (2009). Still, there are some parameters that can be defined and which affect the results obtained:

- DualSum's choice of hyper-parameters affects how the topics are learned.
- The documents can be represented with n-grams of different lengths.
- It is possible to generate a summary that approximates the joint distribution, the update-only distribution, or a combination of both.

This section describes how these parameters have been tuned.

4.1 Parameter tuning

We use the TAC 2008 and 2009 update task datasets as training set for tuning the hyper-parameters of the model, namely the pseudo-counts for the two Dirichlet priors that affect the topic mix assignment for each document. By performing a grid search over a large set of possible hyper-parameters, these have been fixed to
γ_A = (90, 190, 50, 0) and γ_B = (90, 170, 45, 25), as the values that produced the best ROUGE-2 score on those two datasets.

Regarding the base collection, this can be interpreted as setting as prior knowledge that roughly 27% of the words in the original dataset originate from the background distribution, 58% from the document-specific distributions, and 15% from the topic of the original collection. We remind the reader that the last value in γ_A is set to zero because, due to the problem definition, the original collection must have no words generated from the update topic, which reflects the most recent developments that are still not present in the base collections A.

Regarding the update set, 27% of the words are assumed to originate again from the background distribution, 51% from the document-specific distributions, 14% from a topic in common with the original collection, and 8% from the update-specific topic. One interesting fact to note from these settings is that most of the words belong to topics that are specific to single documents (58% and 51% respectively for sets A and B) and to the background distribution, whereas the joint and update topics generate a much smaller, limited set of words. This helps these two distributions to be more focused.

The other settings mentioned at the beginning of this section have been tuned using the TAC-2010 dataset, which we reserved as our development set. Once the different document-specific and collection-specific distributions have been obtained, we have to choose the target distribution T with which the possible summaries will be compared using the KL metric. Usually, the human-generated update summaries not only include the terms that are very specific to the latest developments, but they also include a little background regarding the developing event. Therefore, we try, for KLSum, a simple mixture between the joint topic (φ_A) and the update topic (φ_B).

Figure 3: Variation in ROUGE-2 score on the TAC-2010 dataset as we change the mixture weight for the joint topic model between 0 and 1.

Figure 3 shows the ROUGE-2 results obtained as we vary the mixture weight between the joint φ_A distribution and the update-specific φ_B distribution. As can be seen at the left of the curve, using only the update-specific model, which disregards the generic words about the topic described, produces much lower results. The results improve as the relative weight of the joint topic model increases, until it plateaus at a maximum around roughly the interval [0.6, 0.8]; from that point performance slowly degrades, as at the right part of the curve the update model is given very little importance in generating the summary. Based on these results, from this point onwards the mixture weight has been set to 0.7. Note that using only the joint distribution (setting the mixture weight to 1.0) also produces reasonable results, hinting that it successfully incorporates the most important n-grams from across the base and the update collections at the same time.

Figure 4: Effect of the mixture weight on ROUGE-2 scores (TAC-2010 dataset). Results are reported using bigrams (above, blue), unigrams (middle, red) and trigrams (below, yellow).

A second parameter is the size of the n-grams for representing the documents. The original implementations of SumBasic (Nenkova and Vanderwende, 2005) and TopicSum (Haghighi and Vanderwende, 2009) were defined over single words (unigrams). Still, Haghighi and Vanderwende (2009) report some improvements in the ROUGE-2 score when representing words as a bag of bigrams, and Darling (2010) mentions similar improvements when running SumBasic with bigrams. Figure 4 shows the effect on the ROUGE-2 curve when we switch to using unigrams and trigrams. As stated in previous work, using bigrams gives better results than using unigrams. Using trigrams was worse than either of them. This is probably because trigrams are too specific and the document collections are small, so the models are more likely to suffer from data sparseness.

4.2 Baselines

DualSum is a modification of TopicSum designed specifically for the case of update summarization, modifying TopicSum's graphical model in a way that captures the dependency between the joint and the update collections. Still, it is important to discover whether the new graphical model actually improves over simpler applications of TopicSum to this task. The three baselines that we have considered are:

- Running TopicSum on the set of collections containing only the update documents. We call this run TopicSum_B.
- Running TopicSum on the set of collections containing both the base and the update documents. Contrary to the previous run, the topic model for each collection in this run will contain information relevant to the base events. We call this run TopicSum_AB.
- Running TopicSum twice, once on the set of collections containing the update documents, and a second time on the set of collections containing the base documents. Then, for each collection, the obtained base and update models are combined in a mixture model using a mixture weight between zero and one. The weight has been tuned using TAC-2010 as development set. We call this run TopicSum_A + TopicSum_B.

4.3 Automatic evaluation

DualSum and the three baselines (using the settings obtained in the previous section, optimized on the datasets from previous TAC competitions) have been automatically evaluated using the TAC-2011 dataset. Table 1 shows the ROUGE results obtained. Because of the non-deterministic nature of Gibbs sampling, the results reported here are the average of five runs for all the baselines and for DualSum. DualSum outperforms two of the baselines in all three ROUGE metrics, and it also outperforms TopicSum_B on two of the three metrics.

The top three systems in TAC-2011 have been included for comparison. The results between these three systems, and between them and DualSum, are all indistinguishable at 95% confidence. Note that the best baseline, TopicSum_B, is quite competitive, with results that are indistinguishable from the top participants in this year's evaluation. Note as well that, because we have five different runs for our algorithms, whereas we just have one output for the TAC participants, the confidence intervals in the second case were slightly bigger when checking for statistical significance, so it is slightly harder for these systems to assert that they outperform the baselines with 95% confidence. These results would have made DualSum the second best system for ROUGE-1 and ROUGE-SU4, and the third best system in terms of ROUGE-2.

The supplementary materials contain a detailed example of the topic model obtained for the background in the TAC-2011 dataset, and the base and update models for collection D1110. As expected, the top unigrams and bigrams are all closed-class words and auxiliary verbs. Because trigrams are longer, background trigrams actually include some content words (e.g. university or director). Regarding the models for φ_A and φ_B, the base distribution contains words related to the original event of an earthquake in Sichuan province (China), and the update distribution focuses more on the official (updated) death toll numbers. It can be noted here that the tokenizer we used is very simple (splitting tokens separated by white-spaces or punctuation), so that numbers such as 7.9 (the magnitude of the earthquake) and 12,000 or 14,000 are divided into two tokens. We thought this might be a reason for the bigram-based system to produce better results, but we ran the summarizers with a numbers-aware tokenizer and the statistical differences between versions still hold.
Method                              R-1     R-2     R-SU4
TopicSum_B                          0.3442  0.0868  0.1194
TopicSum_AB                         0.3385  0.0809  0.1159
TopicSum_A + TopicSum_B             0.3328  0.0770  0.1125
DualSum                             0.3575  0.0924  0.1285
TAC-2011 best system (Peer 43)      0.3559  0.0958  0.1308
TAC-2011 2nd system (Peer 25)       0.3582  0.0926  0.1276
TAC-2011 3rd system (Peer 17)       0.3558  0.0886  0.1279

Table 1: Results on the TAC-2011 dataset. Three significance markers (one per baseline) indicate that a result is significantly better than TopicSum_B, TopicSum_AB and TopicSum_A + TopicSum_B, respectively (p < 0.05).
the collections available at the same time during Gibbs sampling, is the background distribution, which is estimated from all the collections simultaneously, roughly representing 27% of the words that should appear distributed across all documents.

The good news is that this background distribution will contain closed-class words in the language, which are domain-independent (see supplementary material for examples). Therefore, we can generate this distribution from one of the TAC datasets only once, and then it can be reused. Fixing the background distribution to a pre-computed value requires a very simple modification of the Gibbs sampling implementation, which just needs to adjust at each iteration the collection and document-specific models, and the topic assignment for the words.

Using this modified implementation, it is now possible to summarize a single collection independently. The summarization of a single collection of the size of the TAC collections is reduced on average to only three seconds on the same hardware settings, allowing the use of this summarizer in an on-line application.

5 Conclusions

The main contribution of this paper is DualSum, a new topic model that is specifically designed to identify and extract novelty from pairs of collections.

It is inspired by TopicSum (Haghighi and Vanderwende, 2009), with two main changes. Firstly, while TopicSum can only learn the main topic of a collection, DualSum focuses on the differences between two collections. Secondly, while TopicSum incorporates an additional layer to model topic distributions at the sentence level, we have found that relaxing this assumption and modeling the topic distribution at the document level does not decrease the ROUGE scores and reduces the sampling time.

The generated summaries, tested on the TAC-2011 collection, would have resulted in the second and third positions in the last summarization competition according to the different ROUGE scores. This would make DualSum statistically indistinguishable from the top system with 0.95 confidence.

We also propose and evaluate the applicability of an alternative implementation of Gibbs sampling to on-line settings. By fixing the background distribution we are able to summarize a collection in only three seconds, which seems reasonable for some on-line applications.

As future work, we plan to explore the use of DualSum to generate more general contrastive summaries, by identifying differences between collections whose differences are not of a temporal nature.

Acknowledgments

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 257790. We would also like to thank Yasemin Altun and the anonymous reviewers for their useful comments on the draft of this paper.

References

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, March.

Florian Boudin, Marc El-Beze, and Juan-Manuel Torres-Moreno. 2008. A scalable MMR approach to sentence scoring for multi-document update summarization. In Coling 2008: Companion volume: Posters, pages 23-26, Manchester, UK, August. Coling 2008 Organizing Committee.

J. Carbonell and J. Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 335-336. ACM.

Asli Celikyilmaz and Dilek Hakkani-Tur. 2011. Discovery of topically coherent sentences for extractive summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 491-499, Portland, Oregon, USA, June. Association for Computational Linguistics.

Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. 2006. Modeling general and specific aspects of documents with a probabilistic topic model. In NIPS, pages 241-248.

W.M. Darling. 2010. Multi-document summarization from first principles. In Proceedings of the third Text Analysis Conference, TAC-2010. NIST.

Hal Daume, III and Daniel Marcu. 2006. Bayesian query-focused summarization. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, ACL-2006, pages 305-312, Stroudsburg, PA, USA. Association for Computational Linguistics.

Gunes Erkan and Dragomir R. Radev. 2004. LexRank: graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res., 22:457-479, December.

S. Fisher and B. Roark. 2008. Query-focused supervised sentence ranking for update summaries. In Proceedings of the first Text Analysis Conference, TAC-2008.

Dani Gamerman and Hedibert F. Lopes. 2006. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. Chapman and Hall/CRC.

Jade Goldstein, Mark Kantrowitz, Vibhu Mittal, and Jaime Carbonell. 1999. Summarizing text documents: sentence selection and evaluation metrics. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '99, pages 121-128, New York, NY, USA. ACM.

T. L. Griffiths and M. Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl. 1):5228-5235, April.

A. Haghighi and L. Vanderwende. 2009. Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 362-370. Association for Computational Linguistics.

Feng Jin, Minlie Huang, and Xiaoyan Zhu. 2010. The THU summarization systems at TAC 2010. In Proceedings of the third Text Analysis Conference, TAC-2010.

Kevin Lerman and Ryan McDonald. 2009. Contrastive summarization: an experiment with consumer reviews. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, NAACL-Short '09, pages 113-116, Stroudsburg, PA, USA. Association for Computational Linguistics.

Xuan Li, Liang Du, and Yi-Dong Shen. 2011. Graph-based marginal ranking for update summarization. In Proceedings of the Eleventh SIAM International Conference on Data Mining. SIAM / Omnipress.

Rebecca Mason and Eugene Charniak. 2011. Extractive multi-document summaries should explicitly not contain document-specific content. In Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages, WASDGML '11, pages 49-54, Stroudsburg, PA, USA. Association for Computational Linguistics.

A. Nenkova and L. Vanderwende. 2005. The impact of frequency on summarization. Microsoft Research, Redmond, Washington, Tech. Rep. MSR-TR-2005-101.

Ian Porteous, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. 2008. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 569-577, New York, NY, USA, August. ACM.

Dragomir R. Radev, Hongyan Jing, Malgorzata Stys, and Daniel Tam. 2004. Centroid-based summarization of multiple documents. Inf. Process. Manage., 40:919-938, November.

Frank Schilder, Ravikumar Kondadadi, Jochen L. Leidner, and Jack G. Conrad. 2008. Thomson Reuters at TAC 2008: Aggressive filtering with FastSum for update and opinion summarization. In Proceedings of the first Text Analysis Conference, TAC-2008.

Chao Shen and Tao Li. 2010. Multi-document summarization via the minimum dominating set. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, pages 984-992, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ian Soboroff and Donna Harman. 2005. Novelty detection: the TREC experience. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 105-112, Stroudsburg, PA, USA. Association for Computational Linguistics.

Dingding Wang, Shenghuo Zhu, Tao Li, and Yihong Gong. 2009. Multi-document summarization using sentence-based topic models. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, ACLShort '09, pages 297-300, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yi Wang. 2011. Distributed Gibbs sampling of latent Dirichlet allocation: The gritty details.

Li Wenjie, Wei Furu, Lu Qin, and He Yanxiang. 2008. PNR2: ranking sentences with positive and negative reinforcement for query-oriented update summarization. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, COLING '08, pages 489-496, Stroudsburg, PA, USA. Association for Computational Linguistics.
Large-Margin Learning of Submodular Summarization Models
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 224-233,
Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
we formulate the learning problem as a structured prediction problem and derive a maximum-margin algorithm in the structural support vector machine (SVM) framework. Note that, unlike other learning approaches, our method does not require a heuristic decomposition of the learning task into binary classification problems (Kupiec et al., 1995), but directly optimizes a structured prediction. This enables our algorithm to directly optimize the desired performance measure (e.g. ROUGE) during training. Furthermore, our method is not limited to linear-chain dependencies like (Conroy and O'Leary, 2001; Shen et al., 2007), but can learn any monotone submodular scoring function.

This ability to easily train summarization models makes it possible to efficiently tune models to various types of document collections. In particular, we find that our learning method can reliably tune models with hundreds of parameters based on a training set of about 30 examples. This increases the fidelity of models compared to their hand-tuned counterparts, showing significantly improved empirical performance. We provide a detailed investigation into the sources of these improvements, identifying further directions for research.

2 Related work

Work on extractive summarization spans a large range of approaches. Starting with unsupervised methods, one of the widely known approaches is Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998). It uses a greedy approach for selection and considers the trade-off between relevance and redundancy. Later it was extended (Goldstein et al., 2000) to support multi-document settings by incorporating additional information available in this case. Good results can be achieved by reformulating this as a knapsack packing problem and solving it using dynamic programming (McDonald, 2007). Alternatively, we can use annotated phrases as textual units and select a subset that covers most concepts present in the input (Filatova and Hatzivassiloglou, 2004) (which can also be achieved by our coverage scoring function if it is extended with appropriate features).

A popular stochastic graph-based summarization method is LexRank (Erkan and Radev, 2004). It computes sentence importance based on the concept of eigenvector centrality in a graph of sentence similarities. Similarly, TextRank (Mihalcea and Tarau, 2004) is also a graph-based ranking system for the identification of important sentences in a document, using sentence similarity and PageRank (Brin and Page, 1998). Sentence extraction can also be implemented using other graph-based scoring approaches (Mihalcea, 2004) such as HITS (Kleinberg, 1999) and positional power functions. Graph-based methods can also be paired with clustering, as in CollabSum (Wan et al., 2007). This approach first uses clustering to obtain document clusters and then uses a graph-based algorithm for sentence selection which includes inter- and intra-document sentence similarities. Another clustering-based algorithm (Nomoto and Matsumoto, 2001) is a diversity-based extension of MMR that finds diversity by clustering and then proceeds to reduce redundancy by selecting a representative for each cluster.

The manually tuned sentence pairwise model (Lin and Bilmes, 2010; Lin and Bilmes, 2011) we took inspiration from is based on budgeted submodular optimization. A summary is produced by maximizing an objective function that includes coverage and redundancy terms. Coverage is defined as the sum of sentence similarities between the selected summary and the rest of the sentences, while redundancy is the sum of pairwise intra-summary sentence similarities. Another approach based on submodularity (Qazvinian et al., 2010) relies on extracting important keyphrases from citation sentences for a given paper and using them to build the summary.

In the supervised setting, several early methods (Kupiec et al., 1995) made independent binary decisions whether to include a particular sentence in the summary or not. This ignores dependencies between sentences and can result in high redundancy. The same problem arises when using learning-to-rank approaches such as ranking support vector machines, support vector regression and gradient boosted decision trees to select the most relevant sentences for the summary (Metzler and Kanungo, 2008).

Introducing some dependencies can improve the performance. One limited way of introducing dependencies between sentences is by using a linear-chain HMM. The HMM is assumed to produce the summary by having a chain transitioning
between summarization and non-summarization states (Conroy and O'Leary, 2001) while traversing the sentences in a document. A more expressive approach is using a CRF for sequence labeling (Shen et al., 2007), which can utilize larger and not necessarily independent feature spaces. The disadvantage of using linear-chain models, however, is that they represent the summary as a sequence of sentences. Dependencies between sentences that are far away from each other cannot be modeled efficiently. In contrast to such linear-chain models, our approach based on submodular scoring functions can model long-range dependencies. In this way our method can use properties of the whole summary when deciding which sentences to include in it.

More closely related to our work is that of Li et al. (2009). They use the diversified retrieval method proposed in Yue and Joachims (2008) for document summarization. Moreover, they assume that subtopic labels are available so that additional constraints for diversity, coverage and balance can be added to the structural SVM learning problem. In contrast, our approach does not require the knowledge of subtopics (thus allowing us to ap-

however, it uses a vine-growth model and employs search to find the best policy, which is then used to generate a summary.

A specific subclass of submodular (but not monotone) functions is defined by Determinantal Point Processes (DPPs) (Kulesza and Taskar, 2011). While they provide an elegant probabilistic interpretation of the resulting summarization models, the lack of monotonicity means that no efficient approximation algorithms are known for computing the highest-scoring summary.

3 Submodular document summarization

In this section, we illustrate how document summarization can be addressed using submodular set functions. The set of documents to be summarized is split into a set of individual sentences x = {s_1, ..., s_n}. The summarization method then selects a subset y ⊆ x of sentences that maximizes a given scoring function F_x : 2^x → R subject to a budget constraint (e.g. less than B characters):

    ŷ = argmax_{y ⊆ x} F_x(y)   s.t.   |y| ≤ B        (1)
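As a toy illustration of the selection problem in Eq. (1), the sketch below scores candidate subsets with a simple word-coverage function (one example of a monotone submodular F_x) and finds the exact maximizer by exhaustive search; realistic inputs require the greedy procedure of Algorithm 1. The sentence strings and unit weights are invented for illustration.

```python
from itertools import combinations

def coverage_score(summary, importance):
    """A word-coverage F_x: total weight of the distinct words covered
    by the selected sentences (monotone and submodular)."""
    covered = {w for s in summary for w in s.split()}
    return sum(importance.get(w, 0.0) for w in covered)

def best_summary(sentences, importance, budget):
    """Exact maximizer of Eq. (1) by exhaustive search, with |y| <= budget.
    Feasible only for toy inputs -- the problem is NP-hard in general."""
    best, best_score = [], 0.0
    for k in range(1, budget + 1):
        for y in combinations(sentences, k):
            score = coverage_score(y, importance)
            if score > best_score:
                best, best_score = list(y), score
    return best
```

With sentences ["a b", "b c", "c d"], unit word weights and a budget of two sentences, the maximizer picks the non-overlapping pair covering all four words.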
Figure 1: Illustration of the pairwise model. Not all edges are shown for clarity purposes. Edge thickness denotes the similarity score.

Figure 2: Illustration of the coverage model. Word border thickness represents importance.

In the above equation, σ(i, j) ≥ 0 denotes a measure of similarity between pairs of sentences i and j. The first term in Eq. 2 is a measure of how similar the sentences included in summary y are to the other sentences in x. The second term penalizes y by how similar its sentences are to each other. λ > 0 is a scalar parameter that trades off between the two terms. Maximizing F_x(y) amounts to increasing the similarity of the summary to excluded sentences while minimizing repetitions in the summary. An example is illustrated in Figure 1. In the simplest case, σ(i, j) may be the TFIDF (Salton and Buckley, 1988) cosine similarity, but we will show later how to learn sophisticated similarity functions.

3.2 Coverage scoring function

A second scoring function we consider was first proposed for diversified document retrieval (Swaminathan et al., 2009; Yue and Joachims, 2008), but it naturally applies to document summarization as well (Li et al., 2009). It is based on a notion of word coverage, where each word v has some importance weight θ(v) ≥ 0. A summary y covers a word if at least one of its sentences contains the word. The score of a summary is then simply the sum of the word weights it covers (though we could also include a concave discount function that rewards covering a word multiple times (Raman et al., 2011)):

    F_x(y) = Σ_{v ∈ V(y)} θ(v).        (3)

In the above equation, V(y) denotes the union of all words in y. This function is analogous to a maximum coverage problem, which is known to be submodular (Khuller et al., 1999).

An example of how a summary is scored is illustrated in Figure 2. Analogous to the definition of the similarity σ(i, j) in the pairwise model, the choice of the word importance function θ(v) is crucial in the coverage model. A simple heuristic is to weigh words highly that occur in many sentences of x, but in few other documents (Swaminathan et al., 2009). However, we will show in the following how to learn θ(v) from training data.

Algorithm 1 Greedy algorithm for finding the best summary ŷ given a scoring function F_x(y).
Parameter: r > 0.
    y ← ∅
    A ← x
    while A ≠ ∅ do
        k ← argmax_{l ∈ A}  [F_x(y ∪ {l}) − F_x(y)] / (c_l)^r
        if c_k + Σ_{i ∈ y} c_i ≤ B and F_x(y ∪ {k}) − F_x(y) ≥ 0 then
            y ← y ∪ {k}
        end if
        A ← A \ {k}
    end while

3.3 Computing a Summary

Computing the summary that maximizes either of the two scoring functions from above (i.e. Eqns. (2) and (3)) is NP-hard (McDonald, 2007). However, it is known that the greedy Algorithm 1 can achieve a 1 − 1/e approximation to the optimum solution for any linear budget constraint (Lin and Bilmes, 2010; Khuller et al., 1999). Even further, this algorithm provides a 1 − 1/e approximation for any monotone submodular scoring function. The algorithm starts with an empty summarization. In each step, a sentence is added to the summary that results in the maximum relative increase
of the objective. The increase is relative to the amount of budget that is used by the added sentence. The algorithm terminates when the budget B is reached.

Note that the algorithm has a parameter r in the denominator of the selection rule, which Lin and Bilmes (2010) report to have some impact on performance. In the algorithm, c_i represents the cost of the sentence (i.e., its length). Thus, the algorithm actually selects sentences with large marginal utility relative to their length (with the trade-off controlled by the parameter r). Selecting r to be less than 1 gives more importance to information density (i.e. sentences that have a higher ratio of score increase per length). The 1 − 1/e greedy approximation guarantee holds despite this additional parameter (Lin and Bilmes, 2010). More details on our choice of r and its effects are provided in the experiments section.

4 Learning algorithm

In this section, we propose a supervised learning method for training a submodular scoring function to produce desirable summaries. In particular, for the pairwise and the coverage model, we show how to learn the similarity function σ(i, j) and the word importance weights θ(v), respectively. We parameterize σ(i, j) and θ(v) using a linear model, allowing each to depend on the full set of input sentences x:

    σ_x(i, j) = w^T φ_px(i, j),    θ_x(v) = w^T φ_cx(v).        (4)

In the above equations, w is a weight vector that is learned, and φ_px(i, j) and φ_cx(v) are feature vectors. In the pairwise model, φ_px(i, j) may include features like the TFIDF cosine between i and j or the number of words from the document titles that i and j share. In the coverage model, φ_cx(v) may include features like a binary indicator of whether v occurs in more than 10% of the sentences in x or whether v occurs in the document title.

We propose to learn the weights following a large-margin framework using structural SVMs (Tsochantaridis et al., 2005). Structural SVMs learn a discriminant function

    ŷ = argmax_y w^T Ψ(x, y),        (5)

where Ψ(x, y) is called the joint feature map between input x and output y. Note that both submodular scoring functions in Eqns. (2) and (3) can be brought into the form w^T Ψ(x, y) for the linear parametrization in Eqns. (6) and (7):

    Ψ_p(x, y) = Σ_{i ∈ x\y, j ∈ y} φ_px(i, j) − λ Σ_{i,j ∈ y: i ≠ j} φ_px(i, j),        (6)
    Ψ_c(x, y) = Σ_{v ∈ V(y)} φ_cx(v).        (7)

After this transformation, it is easy to see that computing the maximizing summary in Eq. (1) and the structural SVM prediction rule in Eq. (5) are equivalent.

To learn the weight vector w, structural SVMs require training examples (x^1, y^1), ..., (x^n, y^n) of input/output pairs. In document summarization, however, the correct extractive summary is typically not known. Instead, training documents x^i are typically annotated with multiple manual (non-extractive) summaries (denoted by Y^i). To determine a single extractive target summary y^i for training, we find the extractive summary that (approximately) optimizes the ROUGE score or some other loss function Δ(Y^i, y) with respect to Y^i:

    y^i = argmin_{y ∈ Y} Δ(Y^i, y)        (8)

We call the y^i determined in this way the target summary for x^i. Note that y^i is a greedily constructed approximate target summary based on its proximity to Y^i via Δ. Because of this, we will learn a model that can predict approximately good summaries y^i from x^i. However, we believe that most of the score difference between manual summaries and y^i (as explored in the experiments section) is due to it being an extractive summary and not due to greedy construction.

Following the structural SVM approach, we can now formulate the problem of learning w as the following quadratic program (QP):

    min_{w, ξ ≥ 0}  (1/2) ||w||² + (C/n) Σ_{i=1}^{n} ξ_i        (9)
    s.t.  w^T Ψ(x^i, y^i)  ≥  w^T Ψ(x^i, ŷ^i) + Δ(Y^i, ŷ^i) − ξ_i
Algorithm 2 Cutting-plane algorithm for solving the learning optimization problem.
Parameter: desired tolerance ε > 0.
    ∀i: W_i ← ∅
    repeat
        for each training example i do
            ŷ ← argmax_y  w^T Ψ(x^i, y) + Δ(Y^i, y)
            if w^T Ψ(x^i, y^i) + ε ≤ w^T Ψ(x^i, ŷ) + Δ(Y^i, ŷ) − ξ_i then
                W_i ← W_i ∪ {ŷ}
                w ← solve QP (9) using constraints ∪_i W_i
            end if
        end for
    until no W_i has changed during the iteration

for any other summary ŷ^i (i.e., any ŷ^i ≠ y^i). The objective function learns a large-margin weight vector w while trading it off with an upper bound on the empirical loss. The two quantities are traded off with a parameter C > 0.

Even though the QP has exponentially many constraints in the number of sentences in the input documents, it can be solved approximately in polynomial time via a cutting-plane algorithm (Tsochantaridis et al., 2005). The steps of the cutting-plane algorithm are shown in Algorithm 2. In each iteration of the algorithm, for each training document x^i, a summary ŷ^i which most violates the constraint in (9) is found. This is done by finding

    ŷ ← argmax_{y ∈ Y}  w^T Ψ(x^i, y) + Δ(Y^i, y),

for which we use a variant of the greedy procedure in Algorithm 1. After a violating constraint for each training example is added, the resulting quadratic program is solved. These steps are repeated until all the constraints are satisfied to a required precision ε.

Finally, special care has to be taken to appropriately define the loss function Δ given the disparity of Y^i and y^i. Therefore, we first define an intermediate loss function

    ℓ_R(Y, y) = max(0, 1 − ROUGE1F(Y, y)),

based on the ROUGE-1 F score. To ensure that the loss function is zero for the target label as defined in (8), we normalize the above loss as below:

    Δ(Y^i, y) = max(0, ℓ_R(Y^i, y) − ℓ_R(Y^i, y^i)).

This loss Δ was used in our experiments. Thus training a structural SVM with this loss aims to maximize the ROUGE-1 F score with the manual summaries provided in the training examples, while trading it off with margin. Note that we could also use a different loss function (as the method is not tied to this particular choice) if we had a different target evaluation metric. Finally, once a w is obtained from structural SVM training, a predicted summary for a test document x can be obtained from (5).

5 Experiments

In this section, we empirically evaluate the approach proposed in this paper. Following Lin and Bilmes (2010), experiments were conducted on two different datasets (DUC '03 and '04). These datasets contain document sets with four manual summaries for each set. For each document set, we concatenated all the articles and split them into sentences using the tool provided with the '03 dataset. For the supervised setting we used 10 resamplings with a random 20/5/5 ('03) and 40/5/5 ('04) train/test/validation split. We determined the best C value in (9) using the performance on each validation set and then report average performance over the corresponding test sets. Baseline performance (the approach of Lin and Bilmes (2010)) was computed using all 10 test sets as a single test set. For all experiments and datasets, we used r = 0.3 in the greedy algorithm, as recommended in Lin and Bilmes (2010) for the '03 dataset. We find that changing r has only a small influence on performance.²

² Setting r to 1 and thus eliminating the non-linearity does lower the score (e.g. to 0.38466 for the pairwise model on DUC '03, compared with the results in Figure 3).

The construction of features for learning is organized by word groups. The most trivial group is simply all words (basic). Considering the properties of the words themselves, we constructed several features from properties such as capitalized words, non-stop words and words of certain length (cap+stop+len). We obtained another set of features from the most frequently occurring words in all the articles (minmax). We also considered the position of a sentence (containing
the word) in the article as another feature (location). All those word groups can then be further refined by selecting different thresholds, weighting schemes (e.g. TFIDF) and forming binned variants of these features.

For the pairwise model we use the cosine similarity between sentences, using only words in a given word group during computation. For the word coverage model we create separate features for covering words in different groups. This gives us fairly comparable feature strength in both models. The only further addition is the use of different word coverage levels in the coverage model. First, we consider how well a sentence covers a word (e.g. a sentence with five instances of the same word might cover it better than another with only a single instance). Secondly, we look at how important it is to cover a word (e.g. if a word appears in a large fraction of sentences we might want to be sure to cover it). Combining those two criteria using different thresholds, we get a set of features for each word. Our coverage features are motivated by the approach of Yue and Joachims (2008). In contrast, the hand-tuned pairwise baseline uses only TFIDF-weighted cosine similarity between sentences using all words, following the approach in Lin and Bilmes (2010).

The resulting summaries are evaluated using ROUGE version 1.5.5 (Lin and Hovy, 2003). We selected the ROUGE-1 F measure because it was used by Lin and Bilmes (2010) and because it is one of the commonly used performance scores in recent work. However, our learning method applies to other performance measures as well. Note that we use the ROUGE-1 F measure both for the loss function during learning and for the evaluation of the predicted summaries.

5.1 How does learning compare to manual tuning?

In our first experiment, we compare our supervised learning approach to the hand-tuned approach. The results from this experiment are summarized in Figure 3. First, supervised training of the pairwise model (Lin and Bilmes, 2010) resulted in a statistically significant (p ≤ 0.05 using a paired t-test) increase in performance on both datasets compared to our reimplementation of the manually tuned pairwise model. Note that our reimplementation of the approach of Lin and Bilmes (2010) resulted in slightly different performance numbers than those reported in Lin and Bilmes (2010): better on DUC '03 and somewhat lower on DUC '04, if evaluated on the same selection of test examples as theirs. We conjecture that this is due to small differences in implementation and/or preprocessing of the dataset. Furthermore, as the authors of Lin and Bilmes (2010) note in their paper, the '03 and '04 datasets behave quite differently.

model       dataset   ROUGE-1 F (stderr)
pairwise    DUC '03   0.3929 (0.0074)
coverage              0.3784 (0.0059)
hand-tuned            0.3571 (0.0063)
pairwise    DUC '04   0.4066 (0.0061)
coverage              0.3992 (0.0054)
hand-tuned            0.3935 (0.0052)

Figure 3: Results obtained on the DUC '03 and '04 datasets using the supervised models. The increase in performance over the hand-tuned model is statistically significant (p ≤ 0.05) for the pairwise model on both datasets, but only on DUC '03 for the coverage model.

Figure 3 also reports the performance for the coverage model as trained by our algorithm. These results can be compared against those for the pairwise model. Since we are using features of comparable strength in both approaches, as well as the same greedy algorithm and structural SVM learning method, this comparison largely reflects the quality of the models themselves. On the '04 dataset both models achieve the same performance, while on '03 the pairwise model performs significantly (p ≤ 0.05) better than the coverage model.

Overall, the pairwise model appears to perform slightly better than the coverage model with the datasets and features we used. Therefore, we focus on the pairwise model in the following.

5.2 How fast does the algorithm learn?

Hand-tuned approaches have limited flexibility. Whenever we move to a significantly different collection of documents, we have to reinvest time to retune them. Learning can make this adaptation to a new collection more automatic and faster, especially since training data has to be collected even for manual tuning.

Figure 4 evaluates how effectively the learning algorithm can make use of a given amount of training data. In particular, the figure shows the
Figure 4: Learning curve for the pairwise model on the DUC '04 dataset, showing ROUGE-1 F scores for different numbers of learning examples (logarithmic scale). The dashed line represents the performance of the hand-tuned model.

learning curve for our approach. Even with very few training examples, the learning approach already outperforms the baseline. Furthermore, at the maximum number of training examples available to us the curve still increases. We therefore conjecture that more data would further improve performance.

5.3 Where is room for improvement?

To get a rough estimate of what is actually achievable in terms of the final ROUGE-1 F score, we looked at different upper bounds under various scenarios (Figure 5). First, the ROUGE score is computed using four manual summaries from different assessors, so that we can estimate inter-subject disagreement. If one computes the ROUGE score of a held-out summary against the remaining three summaries, the resulting performance is given in the row labeled human of Figure 5. It provides a reasonable estimate of human performance.

Second, in extractive summarization we restrict summaries to sentences from the documents themselves, which is likely to lead to a reduction in ROUGE. To estimate this drop, we use the greedy algorithm to select the extractive summary that maximizes ROUGE on the test documents. The resulting performance is given in the row extractive of Figure 5. On both datasets, the drop in performance for this (approximately³) optimal extractive summary is about 10 points of ROUGE.

³ We compared the greedy algorithm with exhaustive search for up to three selected sentences (more than that would take too long). In about half the cases we got the same solution; in the other cases the solution was on average about 1% below optimal, confirming that greedy selection works quite well.

Third, we expect some drop in performance, since our model may not be able to fit the optimal extractive summaries due to a lack of expressiveness. This can be estimated by looking at training set performance, as reported in the row model fit of Figure 5. On both datasets, we see a drop of about 5 points of ROUGE performance. Adding more and better features might help the model fit the data better.

Finally, a last drop in performance may come from overfitting. The test set ROUGE scores are given in the row prediction of Figure 5. Note that the drop between training and test performance is rather small, so overfitting is not an issue and is well controlled in our algorithm. We therefore conclude that increasing model fidelity seems like a promising direction for further improvements.

bound        dataset   ROUGE-1 F
human        DUC '03   0.56235
extractive             0.45497
model fit              0.40873
prediction             0.39294
human        DUC '04   0.55221
extractive             0.45199
model fit              0.40963
prediction             0.40662

Figure 5: Upper bounds on ROUGE-1 F scores: agreement between manual summaries, greedily computed best extractive summaries, best model fit on the train set (using the best C value) and the test scores of the pairwise model.

5.4 Which features are most useful?

To understand which features affected the final performance of our approach, we assessed the strength of each set of our features. In particular, we looked at how the final test score changes when we remove certain feature groups (described in the beginning of Section 5), as shown in Figure 6.

The most important group of features are the basic features (pure cosine similarity between sentences), since removing them results in the largest drop in performance. However, other features play a significant role too (i.e. only the basic ones are not enough to achieve good performance). This confirms that performance can be improved by adding richer features instead of using only a single similarity score as in Lin and Bilmes (2010). Using learning for these complex

was 0.4010, which is slightly but not significantly lower than the 0.4066 obtained with four summaries (as shown in Figure 3). Similarly, on DUC '03 the performance drop from 0.3929 to 0.3838
model is essential, since hand-tuning is likely to was not significant as well.
be intractable. Based on those results, we conjecture that hav-
The second most important group of features ing more documents sets with only a single man-
considering the drop in performance (i.e. loca- ual summary is more useful for training than
tion) looks at positions of sentences in the arti- fewer training examples with better labels (i.e.
cles. This makes intuitive sense because the first multiple summaries). In both cases, we spend
sentences in news articles are usually packed with approximately the same amount of effort (as the
information. The other three groups do not have a summaries are the most expensive component of
significant impact on their own. the training data), however having more training
examples helps (according to the learning curve
removed ROUGE-1 F presented before) while spending effort on multi-
group ple summaries appears to have only minor benefit
none 0.40662 for training.
basic 0.38681
all except basic 0.39723 6 Conclusions
location 0.39782
This paper presented a supervised learning ap-
sent+doc 0.39901
proach to extractive document summarization
cap+stop+len 0.40273
based on structual SVMs. The learning method
minmax 0.40721
applies to all submodular scoring functions, rang-
ing from pairwise-similarity models to coverage-
Figure 6: Effects of removing different feature groups
on the DUC 04 dataset. Bold font marks significant based approaches. The learning problem is for-
difference (p 0.05) when compared to the full pari- mulated into a convex quadratic program and was
wise model. The most important are basic similar- then solved approximately using a cutting-plane
ity features including all words (similar to (Lin and method. In an empirical evaluation, the structural
Bilmes, 2010)). The last feature group actually low- SVM approach significantly outperforms conven-
ered the score but is included in the model because we tional hand-tuned models on the DUC 03 and
only found this out later on DUC 04 dataset.
04 datasets. A key advantage of the learn-
ing approach is its ability to handle large num-
5.5 How important is it to train with bers of features, providing substantial flexibility
multiple summaries? for building high-fidelity summarization models.
Furthermore, it shows good control of overfitting,
While having four manual summaries may be im-
making it possible to train models even with only
portant for computing a reliable ROUGE score
a few training examples.
for evaluation, it is not clear whether such an ap-
proach is the most efficient use of annotator re- Acknowledgments
sources for training. In our final experiment, we
trained our method using only a single manual We thank Claire Cardie and the members of the
summary for each set of documents. When us- Cornell NLP Seminar for their valuable feedback.
ing only a single manual summary, we arbitrarily This research was funded in part through NSF
took the first one out of the provided four refer- Awards IIS-0812091 and IIS-0905467.
ence summaries and used only it to compute the
target label for training (instead of using average
loss towards all four of them). Otherwise, the ex- References
perimental setup was the same as in the previous T. Berg-Kirkpatrick, D. Gillick and D. Klein. Jointly
subsections, using the pairwise model. Learning to Extract and Compress. In Proceedings
For DUC 04, the ROUGE-1 F score obtained of ACL, 2011.
using only a single summary per document set S. Brin and L. Page. The Anatomy of a Large-Scale
232
Hypertextual Web Search Engine. In Proceedings of WWW, 1998.

J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of SIGIR, 1998.

J. M. Conroy and D. P. O'Leary. Text summarization via hidden Markov models. In Proceedings of SIGIR, 2001.

H. Daumé III. Practical Structured Learning Techniques for Natural Language Processing. Ph.D. Thesis, 2006.

G. Erkan and D. R. Radev. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. In Journal of Artificial Intelligence Research, Vol. 22, 2004, pp. 457-479.

E. Filatova and V. Hatzivassiloglou. Event-Based Extractive Summarization. In Proceedings of ACL Workshop on Summarization, 2004.

T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In Proceedings of ICML, 2008.

D. Gillick and Y. Liu. A scalable global model for summarization. In Proceedings of ACL Workshop on Integer Linear Programming for Natural Language Processing, 2009.

J. Goldstein, V. Mittal, J. Carbonell, and M. Kantrowitz. Multi-document summarization by sentence extraction. In Proceedings of NAACL-ANLP, 2000.

S. Khuller, A. Moss and J. Naor. The budgeted maximum coverage problem. In Information Processing Letters, Vol. 70, Issue 1, 1999, pp. 39-45.

J. M. Kleinberg. Authoritative sources in a hyperlinked environment. In Journal of the ACM, Vol. 46, Issue 5, 1999, pp. 604-632.

A. Kulesza and B. Taskar. Learning Determinantal Point Processes. In Proceedings of UAI, 2011.

J. Kupiec, J. Pedersen, and F. Chen. A trainable document summarizer. In Proceedings of SIGIR, 1995.

L. Li, K. Zhou, G. Xue, H. Zha, and Y. Yu. Enhancing Diversity, Coverage and Balance for Summarization through Structure Learning. In Proceedings of WWW, 2009.

H. Lin and J. Bilmes. Multi-document summarization via budgeted maximization of submodular functions. In Proceedings of NAACL-HLT, 2010.

H. Lin and J. Bilmes. A Class of Submodular Functions for Document Summarization. In Proceedings of ACL-HLT, 2011.

C. Y. Lin and E. Hovy. Automatic evaluation of summaries using N-gram co-occurrence statistics. In Proceedings of NAACL, 2003.

F. T. Martins and N. A. Smith. Summarization with a joint model for sentence extraction and compression. In Proceedings of ACL Workshop on Integer Linear Programming for Natural Language Processing, 2009.

R. McDonald. A Study of Global Inference Algorithms in Multi-document Summarization. In Advances in Information Retrieval, Lecture Notes in Computer Science, 2007, pp. 557-564.

D. Metzler and T. Kanungo. Machine learned sentence selection strategies for query-biased summarization. In Proceedings of SIGIR, 2008.

R. Mihalcea. Graph-based ranking algorithms for sentence extraction, applied to text summarization. In Proceedings of the ACL on Interactive poster and demonstration sessions, 2004.

R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In Proceedings of EMNLP, 2004.

T. Nomoto and Y. Matsumoto. A new approach to unsupervised text summarization. In Proceedings of SIGIR, 2001.

V. Qazvinian, D. R. Radev, and A. Ozgur. Citation Summarization Through Keyphrase Extraction. In Proceedings of COLING, 2010.

K. Raman, T. Joachims and P. Shivaswamy. Structured Learning of Two-Level Dynamic Rankings. In Proceedings of CIKM, 2011.

G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, 1988, pp. 513-523.

D. Shen, J. T. Sun, H. Li, Q. Yang, and Z. Chen. Document summarization using conditional random fields. In Proceedings of IJCAI, 2007.

A. Swaminathan, C. V. Mathew and D. Kirovski. Essential Pages. In Proceedings of WI-IAT, IEEE Computer Society, 2009.

I. Tsochantaridis, T. Hofmann, T. Joachims and Y. Altun. Large margin methods for structured and interdependent output variables. In Journal of Machine Learning Research, Vol. 6, 2005, pp. 1453-1484.

X. Wan, J. Yang, and J. Xiao. CollabSum: Exploiting multiple document clustering for collaborative single document summarizations. In Proceedings of SIGIR, 2007.

Y. Yue and T. Joachims. Predicting diverse subsets using structural SVMs. In Proceedings of ICML, 2008.
A Probabilistic Model of Syntactic and Semantic Acquisition from Child-Directed Utterances and their Meanings

Tom Kwiatkowski*, Sharon Goldwater, Luke Zettlemoyer, Mark Steedman
tomk@cs.washington.edu, sgwater@inf.ed.ac.uk, lsz@cs.washington.edu, steedman@inf.ed.ac.uk
ILCC, School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, UK
Computer Science & Engineering, University of Washington, Seattle, WA, 98195, USA

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 234-244, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
learner captures the step-like learning curves for word order regularities that Thornton and Tesan (2007) claim children show. This result counters Thornton and Tesan's criticism of statistical grammar learners: that they tend to exhibit gradual learning curves rather than the abrupt changes in linguistic competence observed in children.

1.1 Related Work

Models of syntactic acquisition, whether they have addressed the task of learning both syntax and semantics (Siskind, 1992; Villavicencio, 2002; Buttery, 2006) or syntax alone (Gibson and Wexler, 1994; Sakas and Fodor, 2001; Yang, 2002), have aimed to learn a single, correct, deterministic grammar. With the exception of Buttery (2006) they also adopt the Principles and Parameters grammatical framework, which assumes detailed knowledge of linguistic regularities.² Our approach contrasts with all previous models in assuming a very general kind of linguistic knowledge and a probabilistic grammar. Specifically, we use the probabilistic Combinatory Categorial Grammar (CCG) framework, and assume only that the learner has access to a small set of general combinatory schemata and a functional mapping from semantic type to syntactic category. Furthermore, this paper is the first to evaluate a model of child syntactic-semantic acquisition by parsing unseen data.

Models of child word learning have focused on semantics only, learning word meanings from utterances paired with either sets of concept symbols (Yu and Ballard, 2007; Frank et al., 2008; Fazly et al., 2010) or a compositional meaning representation of the type used here (Siskind, 1996). The models of Alishahi and Stevenson (2008) and Maurits et al. (2009) learn, as well as word-meanings, orderings for verb-argument structures, but not the full parsing model that we learn here.

Semantic parser induction as addressed by Zettlemoyer and Collins (2005, 2007, 2009), Kate and Mooney (2007), Wong and Mooney (2006, 2007), Lu et al. (2008), Chen et al. (2010), Kwiatkowski et al. (2010, 2011) and Borschinger et al. (2011) has the same task definition as the one addressed by this paper. However, the learning approaches presented in those previous papers are not designed to be cognitively plausible, using batch training algorithms, multiple passes over the data, and language-specific initialisations (lists of noun phrases and additional corpus statistics), all of which we dispense with here. In particular, our approach is closely related to that of Kwiatkowski et al. (2010) but, whereas that work required careful initialisation and multiple passes over the training data to learn a discriminative parsing model, here we learn a generative parsing model without either.

1.2 Overview of the approach

Our approach takes, as input, a corpus of (utterance, meaning-candidates) pairs {(s_i, {m}_i) : i = 1, . . . , N}, and learns a CCG lexicon Λ and the probability of each production a → b that could be used in a parse. Together, these define a probabilistic parser that can be used to find the most probable meaning for any new sentence.

We learn both the lexicon and production probabilities from allowable parses of the training pairs. The set of allowable parses {t} for a single (utterance, meaning-candidates) pair consists of those parses that map the utterance onto one of the meanings. This set is generated with the functional mapping T:

    {t} = T(s, m),    (2)

which is defined, following Kwiatkowski et al. (2010), using only the CCG combinators and a mapping from semantic type to syntactic category (presented in Section 4).

The CCG lexicon Λ is learnt by reading off the lexical items used in all parses of all training pairs. Production probabilities are learnt in conjunction with Λ through the use of an incremental parameter estimation algorithm, online Variational Bayesian EM, as described in Section 5.

Before presenting the probabilistic model, the mapping T, and the parameter training algorithm, we first provide some background on the meaning representations we use and on CCG.

2 Background

2.1 Meaning Representations

We represent the meanings of utterances in first-order predicate logic using the lambda-calculus. An example logical expression (henceforth also referred to as a lambda expression) is:

    like(eve, mummy)    (3)

² This linguistic use of the term "parameter" is distinct from the statistical use found elsewhere in this paper.
which expresses a logical relationship like between the entity eve and the entity mummy. In Section 6.1 we will see how logical expressions like this are created for a set of child-directed utterances (to use in training our model).

The lambda-calculus uses λ operators to define functions. These may be used to represent functional meanings of utterances, but they may also be used as a glue language, to compose elements of first-order logical expressions. For example, the function λx.λy.like(y, x) can be combined with the object mummy to give the phrasal meaning λy.like(y, mummy) through the lambda-calculus operation of function application.

3 Modelling Derivations

The objective of our learning algorithm is to learn the correct parameterisation of a probabilistic model P(s, m, t) over (utterance, meaning, derivation) triples. This model assigns a probability to each of the grammar productions a → b used to build the derivation tree t. The probability of any given CCG derivation t with sentence s and semantics m is calculated as the product of all of its production probabilities:

    P(s, m, t) = ∏_{a→b ∈ t} P(b|a)    (4)
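Equation (4) is straightforward to compute once a derivation is represented as its collection of productions. A minimal sketch (the production names and probability values below are invented for illustration, not taken from the paper):

```python
from functools import reduce

# Hypothetical conditional probability table P(b | a); the productions
# and the numbers are made up purely to illustrate Equation (4).
P = {
    ("START", "Sdcl"): 1.0,
    ("Sdcl", "NP Sdcl\\NP"): 0.6,
    ("NP", "you"): 0.1,
}

def derivation_probability(t):
    """Equation (4): P(s, m, t) = product over a->b in t of P(b | a)."""
    return reduce(lambda p, ab: p * P[ab], t, 1.0)

# A toy derivation t, given as its list of productions (a, b).
t = [("START", "Sdcl"), ("Sdcl", "NP Sdcl\\NP"), ("NP", "you")]
print(derivation_probability(t))  # 1.0 * 0.6 * 0.1
```

Because the model is generative, the same product serves both for scoring a candidate (meaning, derivation) pair at parse time and for weighting parses during training.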
[Figure: an example CCG derivation tree, built from productions such as START → Sdcl and Sdcl → NP Sdcl\NP.]

. . . X = {(s_i, {m}_i) : i = 1, . . . , N}, the latent variables S (containing the productions used in each parse t) and the parsing parameters θ.

Syntactic Category  Semantic Type    Example Phrase
Sdcl                ⟨ev, t⟩          I took it ⊢ Sdcl : λe.took(i, it, e)
St                  t                I'm angry ⊢ St : angry(i)
Swh                 ⟨e, ⟨ev, t⟩⟩     Who took it? ⊢ Swh : λx.λe.took(x, it, e)
Sq                  ⟨ev, t⟩          Did you take it? ⊢ Sq : λe.Q(take(you, it, e))
N                   ⟨e, t⟩           cookie ⊢ N : λx.cookie(x)
NP                  e                John ⊢ NP : john
PP                  ⟨ev, t⟩          on John ⊢ PP : λe.on(john, e)

A cell entry X : h can be split into two categories that recombine to give h with function application:

    {(X/Y : f, Y : g), (Y : g, X\Y : f) | h = f(g)}    (10)

or by a reversal of the CCG composition combinators, if f and g can be recombined to give h with function composition. T cycles over all cell entries in increasingly small spans and populates the chart with their splits. For any cell entry X : h spanning more than one word, T generates a set of pairs representing the splits of X : h. For each split (Cl : ml, Cr : mr) and every binary partition (w_{i:k}, w_{k:j}) of the word-span, T creates two new cell entries in the chart: (Cl : ml)_{i:k} and (Cr : mr)_{k:j}.
Bayesian extension of the EM algorithm that accumulates observation pseudocounts n_{a→b} for each of the productions a → b in the grammar. These pseudocounts define the posterior over production probabilities as follows:

    (θ_{a→b_1}, . . . , θ_{a→b_j}) | X, S ∼ Dir(H(b_1) + n_{a→b_1}, . . . , H(b_j) + n_{a→b_j})    (15)

These pseudocounts are computed in two steps:

oVBE-step: For the training pair (s_i, {m}_i), which supports the set of parses {t}, the expectation E_{{t}}[a → b] of each production a → b is calculated by creating a packed chart representation of {t} and running the inside-outside algorithm. This is similar to the E-step in standard EM, apart from the fact that each production is scored with the current expectation of its parameter weight θ̂^{i−1}_{a→b}, where:

    θ̂^{i−1}_{a→b} = e^{Ψ(α_a H_a(a→b) + n^{i−1}_{a→b})} / e^{Ψ(Σ_{b′} α_a H_a(a→b′) + n^{i−1}_{a→b′})}    (16)

and Ψ is the digamma function (Beal, 2003).

oVBM-step: The expectations from the oVBE step are used to update the pseudocounts in Equation 15 as follows,

    n^i_{a→b} = n^{i−1}_{a→b} + γ_i (N E_{{t}}[a → b] − n^{i−1}_{a→b})    (17)

where γ_i is the learning rate and N is the size of the dataset.

5.2 The Training Algorithm

Now the training algorithm used to learn the lexicon Λ and pseudocounts {n_{a→b}} can be defined. The algorithm, shown in Algorithm 2, passes over the training data only once, and one training instance at a time. For each (s_i, {m}_i) it uses the function T |{m}_i| times to generate a set of consistent parses {t}′. The lexicon is populated by using the lex function to read all of the lexical items off from the derivations in each {t}′. In the parameter update step, the training algorithm updates the pseudocounts associated with each of the productions a → b that have ever been seen during training, according to Equation (17). Only non-zero pseudocounts are stored in our model. The count vector is expanded with a new entry every time a new production is used. While the parameter update step cycles over all productions in {t}, it is not necessary to store {t}, just the set of productions that it uses.

Input: Corpus D = {(s_i, {m}_i) | i = 1, . . . , N}, function T, semantics-to-syntactic-category mapping cat, function lex to read lexical items off derivations.
Output: Lexicon Λ, pseudocounts {n_{a→b}}.

    Λ = {}, {t} = {}
    for i = 1, . . . , N do
        {t}_i = {}
        for m′ ∈ {m}_i do
            C_{m′} = cat(m′)
            {t}′ = T(s_i, C_{m′} : m′)
            {t}_i = {t}_i ∪ {t}′, {t} = {t} ∪ {t}′
            Λ = Λ ∪ lex({t}′)
        for a → b ∈ {t} do
            n^i_{a→b} = n^{i−1}_{a→b} + γ_i (N E_{{t}_i}[a → b] − n^{i−1}_{a→b})

Algorithm 2: Learning Λ and {n_{a→b}}

6 Experimental Setup

6.1 Data

The Eve corpus, collected by Brown (1973), contains 14,124 English utterances spoken to a single child between the ages of 18 and 27 months. These have been hand annotated by Sagae et al. (2004) with labelled syntactic dependency graphs. An example annotation is shown in Figure 3.

While these annotations are designed to represent syntactic information, the parent-child relationships in the parse can also be viewed as a proxy for the predicate-argument structure of the semantics. We developed a template-based deterministic procedure for mapping this predicate-argument structure onto logical expressions of the type discussed in Section 2.1. For example, the dependency graph in Figure 3 is automatically transformed into the logical expression

    λe.have(you, another(y, cookie(y)), e) ∧ on(the(z, table(z)), e)    (18)

where e is a Davidsonian event variable used to deal with adverbial and prepositional attachments. The deterministic mapping to logical expressions uses 19 templates, three of which are used in this example: one for the verb and its arguments, one for the prepositional attachment, and one (used twice) for the quantifier-noun constructions.
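A toy caricature of such a template mapping, assembling the logical expression (18) from its parts. Only two hypothetical templates are sketched here, and all function names are ours for illustration; the authors' 19 templates are not reproduced:

```python
def quantified(det, noun, var):
    # quantifier-noun template, e.g. another(y, cookie(y))
    return f"{det}({var}, {noun}({var}))"

def verb_with_pp(verb, subj, obj, prep, pobj):
    # verb-argument template plus a prepositional attachment,
    # conjoined on a Davidsonian event variable e
    return f"\\e.{verb}({subj}, {obj}, e) & {prep}({pobj}, e)"

lf = verb_with_pp("have", "you",
                  quantified("another", "cookie", "y"),
                  "on",
                  quantified("the", "table", "z"))
print(lf)  # \e.have(you, another(y, cookie(y)), e) & on(the(z, table(z)), e)
```

In the real procedure, which templates fire and how variables are assigned would be driven by the dependency labels (SUBJ, OBJ, DET, JCT, POBJ) rather than fixed by hand as above.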
Figure 3: A labelled syntactic dependency graph (Sagae et al., 2004) for the utterance "You have another cookie on the table":

    SUBJ     ROOT    DET         OBJ       JCT      DET      POBJ
    pro|you  v|have  qn|another  n|cookie  prep|on  det|the  n|table
    You      have    another     cookie    on       the      table
We removed utterances that have no straightforward interpretation in our typed logical language (e.g. what; okay; alright; no; yeah; hmm; yes; uhhuh; mhm; thankyou), missing verbal arguments that cannot be properly guessed from the context (largely in imperative sentences such as drink the water), and complex noun constructions that are hard to match with a small set of templates (e.g. as top to a jar). We also remove the small number of utterances containing more than 10 words for reasons of computational efficiency (see discussion in Section 8).³

Following Alishahi and Stevenson (2010), we generate a context set {m}_i for each utterance s_i by pairing that utterance with its correct logical expression along with the logical expressions of the preceding and following (|{m}_i| − 1)/2 utterances.

³ Data available at www.tomkwiat.com/resources.html

6.2 Base Distributions and Learning Rate

Each of the production heads a in the grammar requires a base distribution H_a and concentration parameter α_a. For word-productions the base distribution is a geometric distribution over character strings and spaces. For syntactic-productions the base distribution is defined in terms of the new category to be named by cat and the probability of splitting the rule by reversing either the application or composition combinators. Semantic-productions' base distributions are defined by a probabilistic branching process conditioned on the type of the syntactic category. This distribution prefers less complex logical expressions. All concentration parameters are set to 1.0. The learning rate for parameter updates is γ_i = (0.8 + i)^−0.5.

7 Experiments

7.1 Parsing Unseen Sentences

We test the parsing model that is learnt by training on the first i files of the longitudinally ordered Eve corpus and testing on file i + 1, for i = 1 . . . 19. For each utterance s′ in the test file we use the parsing model to predict a meaning m and compare this to the target meaning m′. We report the proportion of utterances for which the prediction m is returned correctly, both with and without word-meaning guessing. When a word has never been seen at training time, our parser has the ability to guess a typed logical meaning with placeholders for constant and predicate names.

For comparison we use the UBL semantic parser of Kwiatkowski et al. (2010) trained in a similar setting, i.e., with no language-specific initialisation.⁴

⁴ Kwiatkowski et al. (2010) initialise lexical weights in their learning algorithm using corpus-wide alignment statistics across words and meaning elements. Instead we run UBL with small positive weight for all lexical items. When run with Giza++ parameter initialisations, UBL10 achieves 48.1% across folds compared to 49.2% for our approach.

[Figure 4: Meaning Prediction: Train on files 1, . . . , n, test on file n + 1. Accuracy (0.0-0.4) against the proportion of data seen, for Our Approach, Our Approach + Guess, UBL1 and UBL10.]

Figure 4 shows accuracy for our approach with and without guessing, for UBL
when run over the training data once (UBL1) and for UBL when run over the training data 10 times (UBL10) as in Kwiatkowski et al. (2010). Each of the points represents accuracy on one of the 19 test files. All of these results are from parsers trained on utterances paired with a single candidate meaning. The lines of best fit show the upward trend in parser performance over time.

Despite only seeing each training instance once, our approach, due to its broader lexical search strategy, outperforms both versions of UBL, which performs a greedy search in the space of lexicons and requires initialisation with co-occurrence statistics between words and logical constants to guide this search. These statistics are not justified in a model of language acquisition and so they are not used here. The low performance of all systems is due largely to the sparsity of the data, with 32.9% of all sentences containing a previously unseen word.

7.2 Word Learning

Due to the sparsity of the data, the training algorithm needs to be able to learn word-meanings on the basis of very few exposures. This is also a desirable feature from the perspective of modelling language acquisition, as Carey and Bartlett (1978) have shown that children have the ability to learn word meanings on the basis of one, or very few, exposures through the process of fast mapping.

[Figure 5: Posterior probability P(m|w) of the correct word meanings over the course of training (0-2000 utterances), with 1, 3, 5 and 7 candidate meanings per utterance.]

Figure 5 shows the posterior probability of the correct meanings for the quantifiers a, another and any over the course of training with 1, 3, 5 and 7 candidate meanings for each utterance⁵. These three words are all of the same class but have very different frequencies in the training subset shown (168, 10 and 2 respectively). In all training settings, the word a is learnt gradually from many observations, but the rarer words another and any are learnt (when they are learnt) through large updates to the posterior on the basis of few observations. These large updates result from a syntactic bootstrapping effect (Gleitman, 1990). When the model has great confidence about the derivation in which an unseen lexical item occurs, the pseudocounts for that lexical item get a large update under Equation 17. This large update has a greater effect on rare words, which are associated with small amounts of probability mass, than it does on common ones that have already accumulated large pseudocounts. The fast learning of rare words later in learning correlates with observations of word learning in children.

7.3 Word Order Learning

Figure 6 shows the posterior probability of the correct SVO word order learnt from increasing amounts of training data. This is calculated by summing over all lexical items containing transitive verb semantics and sampling in the space of parse trees that could have generated them. With no propositional uncertainty in the training data the correct word order is learnt very quickly and stabilises. As the amount of propositional uncertainty increases, the rate at which this rule is learnt decreases. However, even in the face of ambiguous training data, the model can learn the correct word-order rule. The distribution over word orders also exhibits initial uncertainty, followed by a sharp convergence to the correct analysis. This ability to learn syntactic regularities abruptly counters the claim that statistical learners must exhibit gradual learning curves (Thornton and Tesan, 2007).
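The fast-mapping behaviour described in Section 7.2 falls out of the pseudocount update in Equation 17. A toy illustration (the counts, expectation and dataset size below are invented, not taken from the experiments): a single confident observation moves a rare word's small pseudocount by a far larger relative amount than a frequent word's:

```python
def ovbm_update(n_prev, expectation, i, N):
    # Equation 17, with the paper's learning rate gamma_i = (0.8 + i) ** -0.5
    gamma = (0.8 + i) ** -0.5
    return n_prev + gamma * (N * expectation - n_prev)

N, i, e = 1000, 50, 0.9                 # invented dataset size, step, expectation
common = ovbm_update(400.0, e, i, N)    # word with large accumulated pseudocounts
rare = ovbm_update(0.5, e, i, N)        # word seen once or twice before
print(common / 400.0, rare / 0.5)       # relative jump is far larger for the rare word
```

Both words receive the same absolute-scale correction towards N·E[a→b], but relative to its starting mass the rare word's posterior jumps abruptly, which is the step-like behaviour reported above.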
[Figure 6: Learning SVO word order. Four panels (1, 3, 5 and 7 candidate meanings) plot P(word order) for the orders svo, sov, vso, vos, ovs and osv against the number of utterances seen (0-2000).]

8 Discussion

We have presented an incremental model of language acquisition that learns a probabilistic CCG grammar from utterances paired with one or more potential meanings. The model assumes no language-specific knowledge, but does assume that the learner has access to language-universal correspondences between syntactic and semantic types, as well as a Bayesian prior encouraging grammars with heavy reuse of existing rules and lexical items. We have shown that this model not only outperforms a state-of-the-art semantic parser, but also exhibits learning curves similar to children's: lexical items can be acquired on a single exposure and word order is learnt suddenly rather than gradually.

Although we use a Bayesian model, our approach is different from many of the Bayesian models proposed in cognitive science and language acquisition (Xu and Tenenbaum, 2007; Goldwater et al., 2009; Frank et al., 2009; Griffiths and Tenenbaum, 2006; Griffiths, 2005; Perfors et al., 2011). These models are intended as ideal observer analyses, demonstrating what would be learned by a probabilistically optimal learner. Our learner uses a more cognitively plausible but approximate online learning algorithm. In this way, it is similar to other cognitively plausible approximate Bayesian learners (Pearl et al., 2010; Sanborn et al., 2010; Shi et al., 2010).

Of course, despite the incremental nature of our learning algorithm, there are still many aspects that could be criticized as cognitively implausible. In particular, it generates all parses consistent with each training instance, which can be both memory- and processor-intensive. It is unlikely that children do this once they have learnt at least some of the target language. In future, we plan to investigate more efficient parameter estimation methods. One possibility would be an approximate oVBEM algorithm in which the expectations in Equation 17 are calculated according to a high-probability subset of the parses {t}. Another option would be particle filtering, which has been investigated as a cognitively plausible method for approximate Bayesian inference (Shi et al., 2010; Levy et al., 2009; Sanborn et al., 2010).

As a crude approximation to the context in which an utterance is heard, the logical representations of meaning that we present to the learner are also open to criticism. However, Steedman (2002) argues that children do have access to structured meaning representations from a much older apparatus used for planning actions, and we wish to eventually ground these in sensory input.

Despite the limitations listed above, our approach makes several important contributions to the computational study of language acquisition. It is the first model to learn syntax and semantics concurrently; previous systems (Villavicencio, 2002; Buttery, 2006) learnt categorial grammars from sentences where all word meanings were known. Our model is also the first to be evaluated by parsing sentences onto their meanings, in contrast to the work mentioned above and that of Gibson and Wexler (1994), Siskind (1992), Sakas and Fodor (2001), and Yang (2002). These all evaluate their learners on the basis of a small number of predefined syntactic parameters.

Finally, our work addresses a misunderstanding about statistical learners: that their learning curves must be gradual (Thornton and Tesan, 2007). By demonstrating sudden learning of word order and fast mapping, our model shows that statistical learners can account for sudden changes in children's grammars. In future, we hope to extend these results by examining other learning behaviors and testing the model on other languages.

9 Acknowledgements

We thank Mark Johnson for suggesting an analysis of learning rates. This work was funded by the ERC Advanced Fellowship 24952 GramPlus and EU IP grant EC-FP7-270273 Xperience.
References

Alishahi, A. and Stevenson, S. (2008). A computational model for early argument structure acquisition. Cognitive Science, 32(5):789-834.

Alishahi, A. and Stevenson, S. (2010). Learning general properties of semantic roles from usage data: a computational model. Language and Cognitive Processes, 25(1).

Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. Technical report, Gatsby Institute, UCL.

Borschinger, B., Jones, B. K., and Johnson, M. (2011). Reducing grounded learning tasks to grammatical inference. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1416-1425, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Brown, R. (1973). A First Language: the Early Stages. Harvard University Press, Cambridge, MA.

Buttery, P. J. (2006). Computational models for first language acquisition. Technical Report UCAM-CL-TR-675, University of Cambridge, Computer Laboratory.

Carey, S. and Bartlett, E. (1978). Acquiring a single new word. Papers and Reports on Child Language Development, 15.

Chen, D. L., Kim, J., and Mooney, R. J. (2010). Training a multilingual sportscaster: Using perceptual context to learn language. Journal of Artificial Intelligence Research, 37:397-435.

Fazly, A., Alishahi, A., and Stevenson, S. (2010). A probabilistic computational model of cross-situational word learning. Cognitive Science, 34(6):1017-1063.

Frank, M., Goodman, N., and Tenenbaum, J. (2009). Using speakers' referential intentions to model early cross-situational word learning. Psychological Science, 20(5):578-585.

Frank, M. C., Goodman, N. D., and Tenenbaum, J. B. (2008). A Bayesian framework for cross-situational word-learning. In Advances in Neural Information Processing Systems 20.

Gibson, E. and Wexler, K. (1994). Triggers. Linguistic Inquiry, 25:355-407.

Gleitman, L. (1990). The structural sources of verb meanings. Language Acquisition, 1:1-55.

Goldwater, S., Griffiths, T. L., and Johnson, M. (2009). A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1):21-54.

Griffiths, T. L. and Tenenbaum, J. B. (2005). Structure and strength in causal induction. Cognitive Psychology, 51:354-384.

Griffiths, T. L. and Tenenbaum, J. B. (2006). Optimal predictions in everyday cognition. Psychological Science.

Hoffman, M., Blei, D. M., and Bach, F. (2010). Online learning for latent Dirichlet allocation. In NIPS.

Kate, R. J. and Mooney, R. J. (2007). Learning language semantics from ambiguous supervision. In Proceedings of the 22nd Conference on Artificial Intelligence (AAAI-07).

Kwiatkowski, T., Zettlemoyer, L., Goldwater, S., and Steedman, M. (2010). Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Kwiatkowski, T., Zettlemoyer, L., Goldwater, S., and Steedman, M. (2011). Lexical generalization in CCG grammar induction for semantic parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Levy, R., Reali, F., and Griffiths, T. (2009). Modeling the effects of memory on human online sentence processing with particle filters. In Advances in Neural Information Processing Systems 21.

Lu, W., Ng, H. T., Lee, W. S., and Zettlemoyer, L. S. (2008). A generative model for parsing natural language to meaning representations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk. Lawrence Erlbaum, Mahwah, NJ.

Maurits, L., Perfors, A., and Navarro, D. (2009). Joint acquisition of word order and word reference. In Proceedings of the 31st Annual Conference of the Cognitive Science Society.

Pearl, L., Goldwater, S., and Steyvers, M. (2010). How ideal are we? Incorporating human limitations into Bayesian models of word segmentation. Pages 315-326, Somerville, MA. Cascadilla Press.

Perfors, A., Tenenbaum, J. B., and Regier, T. (2011). The learnability of abstract syntactic principles. Cognition, 118(3):306-338.

Sagae, K., MacWhinney, B., and Lavie, A. (2004). Adding syntactic annotations to transcripts of parent-child dialogs. In Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon. LREC.

Sakas, W. and Fodor, J. D. (2001). The structural triggers learner. In Bertolo, S., editor, Language Acquisition and Learnability, pages 172-233. Cambridge University Press, Cambridge.

Sanborn, A. N., Griffiths, T. L., and Navarro, D. J. (2010). Rational approximations to rational models: Alternative algorithms for category learning. Psychological Review.

Sato, M. (2001). Online model selection based on the variational Bayes. Neural Computation, 13(7):1649-1681.

Shi, L., Griffiths, T. L., Feldman, N. H., and Sanborn, A. N. (2010). Exemplar models as a mechanism for performing Bayesian inference. Psychonomic Bulletin & Review, 17(4):443-464.

Siskind, J. M. (1992). Naive Physics, Event Perception, Lexical Semantics, and Language Acquisition. PhD thesis, Massachusetts Institute of Technology.

Siskind, J. M. (1996). A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 61(1-2):1-38.

Steedman, M. (2000). The Syntactic Process. MIT Press, Cambridge, MA.

Steedman, M. (2002). Plans, affordances, and combinatory grammar. Linguistics and Philosophy, 25.

Thornton, R. and Tesan, G. (2007). Categorical acquisition: Parameter setting in universal grammar. Biolinguistics, 1.

Villavicencio, A. (2002). The acquisition of a unification-based generalised categorial grammar. Technical Report UCAM-CL-TR-533, University of Cambridge, Computer Laboratory.

Wong, Y. W. and Mooney, R. (2006). Learning for semantic parsing with statistical machine translation. In Proceedings of the Human Language Technology Conference of the NAACL.

Wong, Y. W. and Mooney, R. (2007). Learning synchronous grammars for semantic parsing with lambda calculus. In Proceedings of the Association for Computational Linguistics.

Xu, F. and Tenenbaum, J. B. (2007). Word learning as Bayesian inference. Psychological Review, 114:245-272.

Yang, C. (2002). Knowledge and Learning in Natural Language. Oxford University Press, Oxford.

Yu, C. and Ballard, D. H. (2007). A unified model of early word learning: Integrating statistical and social cues. Neurocomputing, 70(13-15):2149-2165.

Zettlemoyer, L. S. and Collins, M. (2005). Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Zettlemoyer, L. S. and Collins, M. (2007). Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Zettlemoyer, L. S. and Collins, M. (2009). Learning context-dependent mappings from sentences to logical form. In Proceedings of the Joint Conference of the Association for Computational Linguistics and International Joint Conference on Natural Language Processing.
Active learning for interactive machine translation
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 245-254,
Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
2. Concepts such as optimum translation and translation probability distribution are continually evolving, whereas existing AL algorithms only deal with constant concepts.

3. Data volume is unbounded, which makes it impractical to batch-learn one single system from all previously translated sentences. Therefore, model training must be done in an incremental fashion.

In this work, we present a proposal of AL for IMT specifically designed to work with stream data. In short, our proposal divides the data stream into blocks where AL techniques for static datasets are applied. Additionally, we implement an incremental learning technique to efficiently train the base SMT models as new data becomes available.

2 Related work

A body of work has recently been proposed to apply AL techniques to SMT (Haffari et al., 2009; Ambati et al., 2010; Bloodgood and Callison-Burch, 2010). The aim of these works is to build one single optimal SMT model from manually translated data extracted from static datasets. None of them fit in the setting of data streams.

Some of the above described challenges of AL from unbounded streams have been previously addressed in the MT literature. In order to deal with the evolutionary nature of the problem, Nepveu et al. (2004) propose an IMT system with dynamic adaptation via cache-based model extensions for language and translation models. Pursuing the same goal for SMT, Levenberg et al. (2010) study how to bound the space when processing (potentially) unbounded streams of parallel data and propose a method to incrementally retrain SMT models. Another method to efficiently retrain an SMT model with new data was presented in (Ortiz-Martínez et al., 2010). In this work, the authors describe an application of the online learning paradigm to the IMT framework.

To the best of our knowledge, the only previous work on AL for IMT is (González-Rubio et al., 2011). There, the authors present a naive application of the AL paradigm for IMT that does not take into account the dynamic change in the probability distribution of the stream. Nevertheless, results show that even that simple AL framework halves the required human effort to obtain a certain translation quality.

In this work, the AL framework presented in (González-Rubio et al., 2011) is extended in an effort to address all the above described challenges. In short, we propose an AL framework for IMT that splits the data stream into blocks. This approach allows us to have more context to model the changing probability distribution of the stream (challenge 2) and results in a more accurate sampling of the changing pool of sentences (challenge 1). In contrast to the proposal described in (González-Rubio et al., 2011), we define sentence sampling strategies whose underlying models can be updated with the newly available data. This way, the sentences to be supervised by the user are chosen taking into account previously supervised sentences. To efficiently retrain the underlying SMT models of the IMT system (challenge 3), we follow the online learning technique described in (Ortiz-Martínez et al., 2010). Finally, we integrate all these elements to define an AL framework for IMT with the objective of obtaining an optimum balance between translation quality and human user effort.

3 Interactive machine translation

IMT can be seen as an evolution of the SMT framework. Given a sentence f in a source language to be translated into a sentence e of a target language, the fundamental equation of SMT (Brown et al., 1993) is defined as follows:

    ê = arg max_e Pr(e | f)    (1)

where Pr(e | f) is usually approximated by a log-linear translation model (Koehn et al., 2003). In this case, the decision rule is given by the expression:

    ê = arg max_e { Σ_{m=1}^{M} λ_m h_m(e, f) }    (2)

where each h_m(e, f) is a feature function representing a statistical model and λ_m its weight.

In the IMT framework, a human translator is introduced in the translation process to collaborate with an SMT model. For a given source sentence, the SMT model fully automatically generates an initial translation. The human user checks this translation, from left to right, correcting the first
    source (f):               Para ver la lista de recursos
    desired translation (ê):  To view a listing of resources

    inter.-0   e_p:
               ê_s:  To view the resources list
    inter.-1   e_p:  To view
               k:    a
               ê_s:  list of resources
    inter.-2   e_p:  To view a list
               k:    i
               ê_s:  ng resources
    inter.-3   e_p:  To view a listing
               k:    o
               ê_s:  f resources
    accept     e_p:  To view a listing of resources

Figure 1: IMT session to translate a Spanish sentence into English. The desired translation is the translation the human user has in mind. At interaction-0, the system suggests a translation (ê_s). At interaction-1, the user moves the mouse to accept the first eight characters "To view " and presses the a key (k); then the system suggests completing the sentence with "list of resources" (a new ê_s). Interactions 2 and 3 are similar. In the final interaction, the user accepts the current translation.

error. Then, the SMT model proposes a new extension taking the correct prefix, e_p, into account. These steps are repeated until the user accepts the translation. Figure 1 illustrates a typical IMT session. In the resulting decision rule, we have to find an extension ê_s for a given prefix e_p. To do this, we reformulate equation (1) as follows, where the term Pr(e_p | f) has been dropped since it does not depend on e_s:

    ê_s = arg max_{e_s} Pr(e_p, e_s | f)    (3)
        ≈ arg max_{e_s} p(e_s | f, e_p)    (4)

The search is restricted to those sentences e which contain e_p as a prefix. Since e ≡ e_p e_s, we can use the same log-linear SMT model, equation (2), whenever the search procedures are adequately modified (Barrachina et al., 2009).

4 Active learning for IMT

The aim of the IMT framework is to obtain high-quality translations while minimizing the required human effort. Despite the fact that IMT may reduce the required effort with respect to post-editing, it still requires the user to supervise all the translations. To address this problem, we propose to use AL techniques to select only a small number of sentences whose translations are worth supervising by the human expert.

This approach implies a modification of the user-machine interaction protocol. For a given source sentence, the SMT model generates an initial translation. Then, if this initial translation is classified as incorrect or worthy of supervision, we perform a conventional IMT procedure as in Figure 1. If not, we directly return the initial automatic translation and no effort is required from the user. At the end of the process, we use the newly available sentence pair (f, e) to refine the SMT models used by the IMT system.

In this scenario, the user only checks a small number of sentences; thus, final translations are not error-free as in conventional IMT. However, results in previous work (González-Rubio et al., 2011) show that this approach yields important reductions in human effort. Moreover, depending on the definition of the sampling strategy, we can modify the ratio of sentences that are interactively translated to adapt our system to the requirements of a specific translation task. For example, if the main priority is to minimize human effort, our system can be configured to translate all the sentences without user intervention.

Algorithm 1 describes the basic algorithm to implement AL for IMT. The algorithm receives as input an initial SMT model, M, a sampling strategy, S, a stream of source sentences, F, and the block size, B. First, a block of B sentences, X, is extracted from the data stream (line 3). From this block, we sample those sentences, Y, that are worth supervising by the human expert (line 4). For each of the sentences in X, the current SMT model generates an initial translation, ê (line 6). If the sentence has been sampled as worthy of supervision, f ∈ Y, the user is required to interactively translate it (lines 8-13) as exemplified in Figure 1. The source sentence f and its human-supervised translation, e, are then used to retrain the SMT model (line 14). Otherwise, we directly output the automatic translation ê as our final translation (line 17).

Most of the functions in the algorithm denote different steps in the interaction between the human user and the machine:

translate(M, f): returns the most probable automatic translation of f given by M.

validPrefix(e): returns the prefix of e
    input    : M (initial SMT model)
               S (sampling strategy)
               F (stream of source sentences)
               B (block size)
    auxiliary: X (block of sentences)
               Y (sentences worth of supervision)
     1  begin
     2    repeat
     3      X = getSentsFromStream(B, F);
     4      Y = S(X, M);
     5      foreach f ∈ X do
     6        ê = translate(M, f);
     7        if f ∈ Y then
     8          e = ê;
     9          repeat
    10            e_p = validPrefix(e);
    11            ê_s = genSuffix(M, f, e_p);
    12            e = e_p ê_s;
    13          until validTranslation(e);
    14          M = retrain(M, (f, e));
    15          output(e);
    16        else
    17          output(ê);
    18    until True;
    19  end

Algorithm 1: Pseudo-code of the proposed algorithm to implement AL for IMT from unbounded data streams.

validated by the user as correct. This prefix includes the correction k.

genSuffix(M, f, e_p): returns the suffix ê_s of maximum probability that extends the prefix e_p.

validTranslation(e): returns True if the user considers the current translation to be correct and False otherwise.

5 Sentence sampling strategies

A good sentence sampling strategy must be able to select those sentences that, along with their correct translations, most improve the performance of the SMT model. To do that, the sampling strategy has to correctly discriminate informative sentences from those that are not. We can make different approximations to measure the informativeness of a given sentence. In the following sections, we describe the three different sampling strategies tested in our experimentation.

5.1 Random sampling

Arguably, the simplest sampling approach is random sampling, where the sentences to be interactively translated are selected at random. Although simple, it turns out that random sampling performs surprisingly well in practice. The success of random sampling stems from the fact that in data stream environments the translation probability distributions may vary significantly through time. While general AL algorithms ask the user to translate informative sentences, they may significantly change probability distributions by favoring certain translations; consequently, the previously human-translated sentences may no longer reveal the genuine translation distribution at the current point of the data stream (Zhu et al., 2007). This problem is less severe for static data, where the candidate pool is fixed and AL algorithms are able to survey all instances. Random sampling avoids this problem by randomly selecting sentences for human supervision. As a result, it always selects sentences with the distribution most similar to the current sentence distribution in the data stream.
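The block-based loop of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the `model` object and the `sample`, `get_user_prefix`, and `is_valid` callbacks are hypothetical stand-ins for the SMT model, the sampling strategy S, and the (possibly simulated) user interaction.

```python
from typing import Callable, Iterator, List


def al_imt_loop(
    model,                      # hypothetical SMT model: translate / gen_suffix / retrain
    sample: Callable,           # sampling strategy S: (block, model) -> worthy sentences
    stream: Iterator[str],      # unbounded stream of source sentences F
    block_size: int,            # block size B
    get_user_prefix: Callable,  # user action: validated prefix including correction k
    is_valid: Callable,         # user judgment: is the current translation correct?
):
    """Block-based active learning for IMT (sketch of Algorithm 1)."""
    while True:
        # line 3: extract a block of B sentences from the stream
        block: List[str] = [next(stream) for _ in range(block_size)]
        # line 4: sample the sentences worth supervising
        worthy = set(sample(block, model))
        for f in block:
            e_auto = model.translate(f)           # line 6: initial translation
            if f in worthy:
                e = e_auto
                while not is_valid(e):            # lines 9-13: interactive loop
                    prefix = get_user_prefix(e)   # validated prefix incl. correction
                    e = prefix + model.gen_suffix(f, prefix)
                model.retrain(f, e)               # line 14: online model update
                yield e                           # line 15: supervised translation
            else:
                yield e_auto                      # line 17: automatic translation
```

Writing the loop as a generator mirrors the unbounded nature of the stream: translations are produced one by one while the model keeps being updated in place.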
the training samples. Therefore, the score for a given sentence f is computed as:

    C(f) = ( Σ_{n=1}^{N} |N_n^{<A}(f)| ) / ( Σ_{n=1}^{N} |N_n(f)| )    (5)

where N_n(f) is the set of n-grams of size n in f, N_n^{<A}(f) is the set of n-grams of size n in f that are inaccurately represented in the training data, and N is the maximum n-gram order. In the experimentation, we assume N = 4 as the maximum n-gram order and a value of 10 for the threshold A. This sampling strategy works by selecting a given percentage of the highest scoring sentences.

We update the counts of the n-grams seen by the SMT model with each new sentence pair. Hence, the sampling strategy is always up-to-date with the latest training data.

5.3 Dynamic confidence sampling

Another technique is to consider that the most informative sentence is the one the current SMT model translates worst. The intuition behind this approach is that an SMT model cannot generate good translations unless it has enough information to translate the sentence.

The usual approach to compute the quality of a translation hypothesis is to compare it to a reference translation but, in this case, that is not a valid option since reference translations are not available. Hence, we use confidence estimation (Gandrabur and Foster, 2003; Blatz et al., 2004; Ueffing and Ney, 2007) to estimate the probability of correctness of the translations. Specifically, we estimate the quality of a translation from the confidence scores of its individual words.

The confidence score of a word e_i of the translation e = e_1 ... e_i ... e_I generated from the source sentence f = f_1 ... f_j ... f_J is computed as described in (Ueffing and Ney, 2005):

    C_w(e_i, f) = max_{0 ≤ j ≤ |f|} p(e_i | f_j)    (6)

where p(e_i | f_j) is an IBM model 1 (Brown et al., 1993) bilingual lexicon probability and f_0 is the empty source word. The confidence score for the full translation e is computed as the ratio of its words classified as correct by the word confidence measure. Therefore, we define the confidence-based informativeness score as:

    C(e, f) = 1 - |{ e_i | C_w(e_i, f) > τ_w }| / |e|    (7)

Finally, this sampling strategy works by selecting a given percentage of the highest scoring sentences.

We dynamically update the confidence sampler each time a new sentence pair is added to the SMT model. The incremental version of the EM algorithm (Neal and Hinton, 1999) is used to incrementally train the IBM model 1.

6 Retraining of the SMT model

To retrain the SMT model, we implement the online learning techniques proposed in (Ortiz-Martínez et al., 2010). In that work, a state-of-the-art log-linear model (Och and Ney, 2002) and a set of techniques to incrementally train this model were defined. The log-linear model is composed of a set of feature functions governing different aspects of the translation process, including a language model, a source sentence-length model, inverse and direct translation models, a target phrase-length model, a source phrase-length model and a distortion model.

The incremental learning algorithm allows us to process each new training sample in constant time (i.e. the computational complexity of training a new sample does not depend on the number of previously seen training samples). To do that, a set of sufficient statistics is maintained for each feature function. If the estimation of the feature function does not require the use of the well-known expectation-maximization (EM) algorithm (Dempster et al., 1977) (e.g. n-gram language models), then it is generally easy to incrementally extend the model given a new training sample. By contrast, if the EM algorithm is required (e.g. word alignment models), the estimation procedure has to be modified, since the conventional EM algorithm is designed for use in batch learning scenarios. For such models, the incremental version of the EM algorithm (Neal and Hinton, 1999) is applied. A detailed description of the update algorithm for each of the models in the log-linear combination is presented in (Ortiz-Martínez et al., 2010).

7 Experiments

We carried out experiments to assess the performance of the proposed AL implementation for IMT. In each experiment, we started with an initial SMT model that is incrementally updated
    corpus           use     sentences   words (Spa/Eng)
    Europarl         train   731K        15M/15M
                     devel.  2K          60K/58K
    News Commentary  test    51K         1.5M/1.2M

Table 1: Size of the Spanish-English corpora used in the experiments. K and M stand for thousands and millions of elements respectively.

with the sentences selected by the current sampling strategy. Due to the unavailability of public benchmark data streams, we selected a relatively large corpus and treated it as a data stream for AL. To simulate the interaction with the user, we used the reference translations in the data stream corpus as the translations the human user would like to obtain. Since each experiment is carried out under the same conditions, if one sampling strategy outperforms its peers, then we can safely conclude that this is because the sentences selected to be translated are more informative.

7.1 Training corpus and data stream

The training data comes from the Europarl corpus as distributed for the shared task of the NAACL 2006 workshop on statistical machine translation (Koehn and Monz, 2006). We used this data to estimate the initial log-linear model used by our IMT system (see Section 6). The weights of the different feature functions were tuned by means of minimum error-rate training (Och, 2003) executed on the Europarl development corpus. Once the SMT model was trained, we used the News Commentary corpus (Callison-Burch et al., 2007) to simulate the data stream. The size of these corpora is shown in Table 1. The reasons to choose the News Commentary corpus for our experiments are threefold: first, its size is large enough to simulate a data stream and test our AL techniques in the long term; second, it is out-of-domain data, which allows us to simulate a real-world situation that may occur in a translation company; and, finally, it consists of editorials from an eclectic range of domains (general politics, economics and science), which effectively represents the variations in the sentence distributions of the simulated data stream.

7.2 Assessment criteria

We want to measure both the quality of the generated translations and the human effort required to obtain them.

We measure translation quality with the well-known BLEU (Papineni et al., 2002) score.

To estimate human user effort, we simulate the actions taken by a human user in their interaction with the IMT system. The first translation hypothesis for each given source sentence is compared with a single reference translation and the longest common character prefix (LCP) is obtained. The first non-matching character is replaced by the corresponding reference character and then a new translation hypothesis is produced (see Figure 1). This process is iterated until a full match with the reference is obtained. Each computation of the LCP would correspond to the user looking for the next error and moving the pointer to the corresponding position of the translation hypothesis. Each character replacement, on the other hand, would correspond to a keystroke of the user.

Bearing this in mind, we measure the user effort by means of the keystroke and mouse-action ratio (KSMR) (Barrachina et al., 2009). This measure has been extensively used to report results in the IMT literature. KSMR is calculated as the number of keystrokes plus the number of mouse movements divided by the total number of reference characters. From a user point of view the two types of actions are different and require different types of effort (Macklovitch, 2006). In any case, as an approximation, KSMR assumes that both actions require a similar effort.

7.3 Experimental results

In this section, we report results for three different experiments. First, we studied the performance of the sampling strategies when dealing with the sampling bias problem. In the second experiment, we carried out a typical AL experiment measuring the performance of the sampling strategies as a function of the percentage of the corpus used to retrain the SMT model. Finally, we tested our AL implementation for IMT in order to study the tradeoff between required human effort and final translation quality.

7.3.1 Dealing with the sampling bias

In this experiment, we want to study the performance of the different sampling strategies when
Figure 2: Performance of the AL methods across different data blocks. Block size 500. Human supervision 10% of the corpus. (Curves: DCS, NS, RS; axes: BLEU vs. block number.)

Figure 3: BLEU of the initial automatic translations as a function of the percentage of the corpus used to retrain the model. (Curves: DCS, NS, SCS, RS; axes: BLEU vs. percentage of the corpus in words.)
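The two non-random sampling scores compared in these figures, the n-gram coverage score of equation (5) and the confidence-based score of equations (6) and (7), can be sketched as follows. This is an illustrative sketch, not the authors' code: interpreting "inaccurately represented" as an n-gram count below the threshold A is an assumption of this sketch, "NULL" stands in for the empty source word f_0, and `lex` stands in for an IBM model 1 lexicon table.

```python
from collections import Counter
from typing import Dict, List, Tuple


def ngram_score(f_toks: List[str], train_counts: Counter,
                N: int = 4, A: int = 10) -> float:
    """N-gram coverage score, eq. (5): fraction of the n-grams (n = 1..N) of f
    that are poorly represented in the training data (assumed here to mean
    seen fewer than A times)."""
    total = rare = 0
    for n in range(1, N + 1):
        ngrams = {tuple(f_toks[i:i + n]) for i in range(len(f_toks) - n + 1)}
        total += len(ngrams)
        rare += sum(1 for g in ngrams if train_counts[g] < A)
    return rare / total if total else 0.0


def confidence_score(e_toks: List[str], f_toks: List[str],
                     lex: Dict[Tuple[str, str], float],
                     tau: float = 0.5) -> float:
    """Confidence-based informativeness, eqs. (6)-(7): one minus the ratio of
    target words whose best IBM-1 lexicon probability (including the empty
    source word "NULL") exceeds the threshold tau."""
    sources = ["NULL"] + list(f_toks)
    correct = 0
    for e_i in e_toks:
        c_w = max(lex.get((e_i, f_j), 0.0) for f_j in sources)  # eq. (6)
        if c_w > tau:
            correct += 1
    return 1.0 - correct / len(e_toks)  # eq. (7)
```

In both cases a higher score marks a more informative sentence, and the sampler picks the top-scoring fraction of each block; keeping `train_counts` and `lex` updated after every supervised pair is what distinguishes the dynamic variants from their static counterparts.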
Figure 4: Quality of the data stream translation (BLEU) as a function of the required human effort (KSMR). w/o AL denotes a system with no retraining. (Curves: DCS, SCS, NS, RS, w/o AL.)

as can be seen, SCS obtained slightly worse results than DCS, showing the importance of dynamically adapting the underlying model used by the sampling strategy.

7.3.3 Balancing human effort and translation quality

Finally, we studied the balance between required human effort and final translation error. This can be useful in a real-world scenario where a translation company is hired to translate a stream of sentences. Under these circumstances, it would be important to be able to predict the effort required from the human translators to obtain a certain translation quality.

The experiment simulates this situation using our proposed IMT system with AL to translate the stream of sentences. To have a broad view of the behavior of our system, we repeated this translation process multiple times, requiring an increasing human effort each time. Experiments range from a fully-automatic translation system with no need of human intervention to a system where the human is required to supervise all the sentences. Figure 4 presents results for SCS (see section 7.3.2) and the sentence selection strategies presented in section 5. In addition, we also present results for a static system without AL (w/o AL). This system is equal to SCS but it does not perform any SMT retraining.

Results in Figure 4 show a consistent reduction in required user effort when using AL. For a given human effort, the use of AL methods allowed us to obtain twice the translation quality. Regarding the different AL sampling strategies, DCS obtains the best results, but the differences with the other methods are slight.

By varying the sentence classifier, we can achieve a balance between final translation quality and required human effort. This feature allows us to adapt the system to suit the requirements of the particular translation task or the available economic or human resources. For example, if a translation quality of 60 BLEU points is satisfactory, then the human translators would need to modify only 20% of the characters of the automatically generated translations.

Finally, it should be noted that our IMT systems with AL are able to generate new suffixes and retrain with new sentence pairs in tenths of a second. Thus, they can be applied in real-time scenarios.

8 Conclusions and future work

In this work, we have presented an AL framework for IMT specially designed to process data streams with massive volumes of data. Our proposal splits the data stream into blocks of sentences of a certain size and applies AL techniques individually to each block. For this purpose, we implemented different sampling strategies that measure the informativeness of a sentence according to different criteria.

To evaluate the performance of our proposed sampling strategies, we carried out experiments comparing them with random sampling and the only previously proposed AL technique for IMT, described in (González-Rubio et al., 2011). According to the results, one of the proposed sampling strategies, specifically the dynamic confidence sampling strategy, consistently outperformed all the other strategies.

The results of the experimentation show that the use of AL techniques allows us to make a tradeoff between required human effort and final translation quality. In other words, we can adapt our system to meet the translation quality requirements of the translation task or the available human resources.

As future work, we plan to investigate more sophisticated sampling strategies, such as those based on information density or query-by-committee. Additionally, we will conduct experiments with real users to confirm the results obtained by our user simulation.
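The simulated-user protocol behind the KSMR figures reported above (Section 7.2) can be sketched as follows. A sketch under stated assumptions, not the actual evaluation code: `translate_with_prefix` is a hypothetical hook standing in for the IMT system's prefix-constrained search, and, as described in Section 7.2, each LCP computation is counted as one mouse action and each character replacement as one keystroke.

```python
def simulate_ksmr(translate_with_prefix, f, reference: str) -> float:
    """Simulate the user of Section 7.2 and return the KSMR:
    (keystrokes + mouse actions) / number of reference characters."""
    keystrokes = mouse_actions = 0
    hyp = translate_with_prefix(f, "")  # initial automatic translation
    while hyp != reference:
        # user scans to the first error: one mouse action per LCP computation
        mouse_actions += 1
        lcp = 0
        while lcp < min(len(hyp), len(reference)) and hyp[lcp] == reference[lcp]:
            lcp += 1
        if lcp == len(reference):
            break  # hypothesis already contains the full reference as a prefix
        # user types the correct character: one keystroke
        keystrokes += 1
        hyp = translate_with_prefix(f, reference[:lcp + 1])
    return (keystrokes + mouse_actions) / len(reference)
```

A perfect system yields KSMR 0; a system that never reuses a corrected character approaches two actions per reference character, which is why the curves in Figure 4 are bounded by the effort of typing the whole reference.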
Acknowledgements

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287576. Work also supported by the EC (FEDER/FSE) and the Spanish MEC under the MIPRCV Consolider Ingenio 2010 program (CSD2007-00018) and the iTrans2 project (TIN2009-14511), and by the Generalitat Valenciana under grant ALMPR (Prometeo/2009/01).

References

Vamshi Ambati, Stephan Vogel, and Jaime Carbonell. 2010. Active learning and crowd-sourcing for machine translation. In Proc. of the conference on International Language Resources and Evaluation, pages 2169–2174.

Sergio Barrachina, Oliver Bender, Francisco Casacuberta, Jorge Civera, Elsa Cubel, Shahram Khadivi, Antonio Lagarda, Hermann Ney, Jesús Tomás, Enrique Vidal, and Juan-Miguel Vilar. 2009. Statistical approaches to computer-assisted translation. Computational Linguistics, 35:3–28.

John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. 2004. Confidence estimation for machine translation. In Proc. of the international conference on Computational Linguistics, pages 315–321.

Michael Bloodgood and Chris Callison-Burch. 2010. Bucking the trend: large-scale cost-focused active learning for statistical machine translation. In Proc. of the Association for Computational Linguistics, pages 854–864.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19:263–311.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) evaluation of machine translation. In Proc. of the Workshop on Statistical Machine Translation, pages 136–158.

Arthur Dempster, Nan Laird, and Donald Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38.

George Foster, Pierre Isabelle, and Pierre Plamondon. 1998. Target-text mediated interactive machine translation. Machine Translation, 12:175–194.

Simona Gandrabur and George Foster. 2003. Confidence estimation for text prediction. In Proc. of the Conference on Computational Natural Language Learning, pages 315–321.

Jesús González-Rubio, Daniel Ortiz-Martínez, and Francisco Casacuberta. 2011. An active learning scenario for interactive machine translation. In Proc. of the 13th International Conference on Multimodal Interaction. ACM.

Gholamreza Haffari, Maxim Roy, and Anoop Sarkar. 2009. Active learning for statistical phrase-based machine translation. In Proc. of the North American Chapter of the Association for Computational Linguistics, pages 415–423.

Pierre Isabelle and Kenneth Ward Church. 1997. Special issue on new tools for human translators. Machine Translation, 12(1-2):1–2.

Philipp Koehn and Christof Monz. 2006. Manual and automatic evaluation of machine translation between European languages. In Proc. of the Workshop on Statistical Machine Translation, pages 102–121.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54.

Philippe Langlais and Guy Lapalme. 2002. TransType: development-evaluation cycles to boost translators' productivity. Machine Translation, 17:77–98.

Abby Levenberg, Chris Callison-Burch, and Miles Osborne. 2010. Stream-based translation models for statistical machine translation. In Proc. of the North American Chapter of the Association for Computational Linguistics, pages 394–402, Los Angeles, California, June.

Elliott Macklovitch. 2006. TransType2: the last word. In Proc. of the conference on International Language Resources and Evaluation, pages 167–172.

Radford Neal and Geoffrey Hinton. 1999. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, pages 355–368.

Laurent Nepveu, Guy Lapalme, Philippe Langlais, and George Foster. 2004. Adaptive language and translation models for interactive machine translation. In Proc. of EMNLP, pages 190–197, Barcelona, Spain, July.

Franz Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proc. of the Association for Computational Linguistics, pages 295–302.

Franz Och. 2003. Minimum error rate training in statistical machine translation. In Proc. of the Association for Computational Linguistics, pages 160–167.
Daniel Ortiz-Martínez, Ismael García-Varea, and Francisco Casacuberta. 2010. Online learning for interactive statistical machine translation. In Proc. of the North American Chapter of the Association for Computational Linguistics, pages 546–554.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. of the Association for Computational Linguistics, pages 311–318.

Nicola Ueffing and Hermann Ney. 2005. Application of word-level confidence measures in interactive statistical machine translation. In Proc. of the European Association for Machine Translation conference, pages 262–270.

Nicola Ueffing and Hermann Ney. 2007. Word-level confidence estimation for machine translation. Computational Linguistics, 33:9–40.

Xingquan Zhu, Peng Zhang, Xiaodong Lin, and Yong Shi. 2007. Active learning from data streams. In Proc. of the 7th IEEE International Conference on Data Mining, pages 757–762. IEEE Computer Society.

Xingquan Zhu, Peng Zhang, Xiaodong Lin, and Yong Shi. 2010. Active learning from stream data using optimal weight classifier ensemble. Transactions on Systems, Man and Cybernetics, Part B, 40:1607–1621, December.
Adapting Translation Models to Translationese Improves SMT
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265,
Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics
We use these results as our departure point, but improve them in two major ways. First, we demonstrate that the other subset of the corpus, reflecting translation in the "wrong" direction, is also important for the translation task, and must not be ignored; second, we show that explicit information on the direction of translation of the parallel corpus, whether manually-annotated or machine-learned, is not mandatory. This is achieved by casting the problem in the framework of domain adaptation: we use domain-adaptation techniques to direct the SMT system toward producing output that better reflects the properties of translationese. We show that SMT systems adapted to translationese produce better translations than vanilla systems trained on exactly the same resources. We confirm these findings using an automatic evaluation metric, BLEU (Papineni et al., 2002), as well as through a qualitative analysis of the results.

Our departure point is the results of Kurokawa et al. (2009), which we successfully replicate in Section 3. First (Section 4), we explain why translation quality improves when the parallel corpus is translated in the "right" direction. We do so by showing that the subset of the corpus that was translated in the direction of the translation task (the "right" direction, henceforth source-to-target, or S→T) yields phrase tables that are better suited for translation of the original language than the subset translated in the reverse direction (the "wrong" direction, henceforth target-to-source, or T→S). We use several statistical measures that indicate the better quality of the phrase tables in the former case.

Then (Section 5), we explore ways to build a translation model that is adapted to the unique properties of translationese. We first show that using the entire parallel corpus, including texts that are translated both in the "right" and in the "wrong" direction, improves the quality of the results. Furthermore, we show that the direction of translation used for producing the parallel corpus can be approximated by defining several entropy-based measures that correlate well with translationese, and, consequently, with the quality of the translation.

Specifically, we use the entire corpus, create a single, unified phrase table, and then use the statistical measures mentioned above, and in particular cross-entropy, as a clue for selecting phrase pairs from this table. The benefit of this method is that not only does it yield the best results, but it also eliminates the need to directly predict the direction of translation of the parallel corpus. The main contribution of this work, therefore, is a methodology that improves the quality of SMT by building translation models that are adapted to the nature of translationese.

2 Related Work

Kurokawa et al. (2009) are the first to address the direction of translation in the context of SMT. Their main finding is that using the S→T portion of the parallel corpus results in much better translation quality than when the T→S portion is used for training the translation model. We indeed replicate these results here (Section 3), and view them as a baseline. Additionally, we show that the T→S portion is also important for machine translation and thus should not be discarded. Using information-theory measures, and in particular cross-entropy, we gain statistically significant improvements in translation quality beyond the results of Kurokawa et al. (2009). Furthermore, we eliminate the need to (manually or automatically) detect the direction of translation of the parallel corpus.

Lembersky et al. (2011) also investigate the relations between translationese and machine translation. Focusing on the language model (LM), they show that LMs trained on translated texts yield better translation quality than LMs compiled from original texts. They also show that perplexity is a good discriminator between original and translated texts.

Our current work is closely related to research in domain adaptation. In a typical domain-adaptation scenario, a system is trained on a large corpus of general (out-of-domain) training material, with a small portion of in-domain training texts. In our case, the translation model is trained on a large parallel corpus, of which some (generally unknown) subset is in-domain (S→T), and some other subset is out-of-domain (T→S). Most existing adaptation methods focus on selecting in-domain data from a general-domain corpus. In particular, perplexity is used to score the sentences in the general-domain corpus according to an in-domain language model. Gao et al. (2002) and Moore and Lewis (2010) apply this method to language modeling, while Foster et al. (2010) and Axelrod et al. (2011) use it on the translation model.
Moore and Lewis (2010) suggest a slightly different approach, using cross-entropy difference as a ranking function.

Domain adaptation methods are usually applied at the corpus level, while we focus on an adaptation of the phrase table used for SMT. In this sense, our work follows Foster et al. (2010), who weigh out-of-domain phrase pairs according to their relevance to the target domain. They use multiple features that help distinguish between phrase pairs in the general domain and those in the specific domain. We rely on features that are motivated by the findings of Translation Studies, having established their relevance through a comparative analysis of the phrase tables. In particular, we use measures such as translation model entropy, inspired by Koehn et al. (2009). Additionally, we apply the method suggested by Moore and Lewis (2010), using perplexity ratio instead of cross-entropy difference.

3 Experimental Setup

The tasks we focus on are translation between French and English, in both directions. We use the Hansard corpus, containing transcripts of the Canadian parliament from 1996-2007, as the source of all parallel data. The Hansard is a bilingual French-English corpus comprising approximately 80% English-original texts and 20% French-original texts. Crucially, each sentence pair in the corpus is annotated with the direction of translation. Both English and French are lowercased and tokenized using MOSES (Koehn et al., 2007). Sentences longer than 80 words are discarded.

To address the effect of the corpus size, we compile six subsets of different sizes (250K, 500K, 750K, 1M, 1.25M and 1.5M parallel sentences) from each portion (English-original and French-original) of the corpus. Additionally, we use the devtest section of the Hansard corpus to randomly select French-original and English-original sentences that are used for tuning (1,000 sentences each) and evaluation (5,000 sentences each). French-to-English MT systems are tuned and tested on French-original sentences, and English-to-French systems on English-original ones.

To replicate the results of Kurokawa et al. (2009) and set up a baseline, we train twelve French-to-English and twelve English-to-French phrase-based (PB-) SMT systems using the MOSES toolkit (Koehn et al., 2007), each trained on a different subset of the corpus. We use GIZA++ (Och and Ney, 2000) with grow-diag-final alignment, and extract phrases of length up to 10 words. We prune the resulting phrase tables as in Johnson et al. (2007), using at most 30 translations per source phrase and discarding singleton phrase pairs.

We construct English and French 5-gram language models from the English and French subsections of the Europarl-V6 corpus (Koehn, 2005), using interpolated modified Kneser-Ney discounting (Chen, 1998) and no cut-off on all n-grams. Europarl consists of a large number of subsets translated from various languages, and is therefore unlikely to be biased towards a specific source language. The reordering model used in all MT systems is trained on the union of the 1.5M French-original and the 1.5M English-original subsets, using msd-bidirectional-fe reordering. We use the MERT algorithm (Och, 2003) for tuning and BLEU (Papineni et al., 2002) as our evaluation metric. We test the statistical significance of the differences between the results using the bootstrap resampling method (Koehn, 2004).

A word on notation: we use English-original (EO) and French-original (FO) to refer to the subsets of the corpus that are translated from English to French and from French to English, respectively. The translation tasks are English-to-French (E2F) and French-to-English (F2E). We thus use S→T when the FO corpus is used for the F2E task or when the EO corpus is used for the E2F task, and T→S when the FO corpus is used for the E2F task or when the EO corpus is used for the F2E task.

Table 1 depicts the BLEU scores of the baseline systems. The data are consistent with the findings of Kurokawa et al. (2009): systems trained on S→T parallel texts outperform systems trained on T→S texts, even when the latter are much larger. The difference in BLEU score can be as high as 3 points.

4 Analysis of the Phrase Tables

The baseline results suggest that S→T and T→S phrase tables differ substantially, presumably due to the different characteristics of original and translated texts.
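The S→T / T→S bookkeeping defined in the notation paragraph of Section 3 is easy to get wrong, since it depends on both the corpus origin and the task. It can be made mechanical with a small helper; this is a sketch, and the function and its string labels are ours, not part of the paper:

```python
def direction(corpus_origin, task):
    """Classify a corpus/task pair as 'S->T' or 'T->S'.

    corpus_origin: 'FO' (French-original) or 'EO' (English-original)
    task: 'F2E' (French-to-English) or 'E2F' (English-to-French)

    A corpus is S->T for a task exactly when its original-language side
    is the task's source language (e.g. the FO corpus used for F2E).
    """
    source_matches = (corpus_origin, task) in {("FO", "F2E"), ("EO", "E2F")}
    return "S->T" if source_matches else "T->S"
```

For example, `direction("EO", "F2E")` is "T->S": the English-original corpus was produced by translating in the direction opposite to the French-to-English task.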
Task: French-to-English
  Corpus subset    S→T      T→S
  250K             34.35    31.33
  500K             35.21    32.38
  750K             36.12    32.90
  1M               35.73    33.07
  1.25M            36.24    33.23
  1.5M             36.43    33.73

Task: English-to-French
  Corpus subset    S→T      T→S
  250K             27.74    26.58
  500K             29.15    27.19
  750K             29.43    27.63
  1M               29.94    27.88
  1.25M            30.63    27.84
  1.5M             29.89    27.83

Table 1: BLEU scores of baseline systems

In this section we explain the better translation quality in terms of the better quality of the respective phrase tables, as defined by a number of statistical measures. We first relate these measures to the unique properties of translationese.

Translated texts tend to be simpler than original ones along a number of criteria. Generally, translated texts are not as rich and variable as original ones, and in particular, their type/token ratio is lower. Consequently, we expect S→T phrase tables (which are based on a parallel corpus whose source is original texts, and whose target is translationese) to have more unique source phrases and a lower number of translations per source phrase. A large number of unique source phrases suggests better coverage of the source text, while a small number of translations per source phrase means a lower phrase table entropy. Entropy-based measures are well-established tools to assess the quality of a phrase table. Phrase table entropy captures the amount of uncertainty involved in choosing candidate translation phrases (Koehn et al., 2009). Given a source phrase s and a phrase table T with translations t of s whose probabilities are p(t|s), the entropy H of s is:

    H(s) = − Σ_{t∈T} p(t|s) log2 p(t|s)    (1)

There are two major flavors of the phrase table entropy metric: Lambert et al. (2011) calculate the average entropy over all translation options for each source phrase (henceforth, phrase table entropy or PtEnt), whereas Koehn et al. (2009) search through all possible segmentations of the source sentence to find the optimal covering set of test sentences that minimizes the average entropy of the source phrases in the covering set (henceforth, covering set entropy or CovEnt).

We also propose a metric that assesses the quality of the source side of a phrase table. The metric finds the minimal covering set of a given text in the source language using source phrases from a particular phrase table, and outputs the average length of a phrase in the covering set (henceforth, covering set average length or CovLen).

Lembersky et al. (2011) show that perplexity distinguishes well between translated and original texts. Moreover, perplexity reflects the degree of relatedness of a given phrase to original language or to translationese. Motivated by this observation, we design two cross-entropy-based measures to assess how well each phrase table fits the genre of translationese. Since MT systems are evaluated against human translations, we believe that this factor may have a significant impact on translation performance. The cross-entropy of a text T = w1, w2, ..., wN according to a language model L is:

    H(T, L) = − (1/N) Σ_{i=1..N} log2 L(wi)    (2)

We build language models of translated texts as follows. For English translationese, we extract 170,000 French-original sentences from the English portion of Europarl, and 3,000 English-translated-from-French sentences from the Hansard corpus (disjoint from the training, development and test sets, of course). We use each corpus to train a trigram language model with interpolated modified Kneser-Ney discounting and no cut-off. All out-of-vocabulary words are mapped to a special token, ⟨unk⟩. Then, we interpolate the Hansard and Europarl language models to minimize the perplexity of the target side of the development set (λ = 0.58). For French translationese, we use 270,000 sentences from Europarl and 3,000 sentences from Hansard, with λ = 0.81. Finally, we compute the cross-entropy of each target phrase in the phrase tables according to these language models.
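Equations 1 and 2 are straightforward to compute given a translation distribution and a language model. A minimal sketch with toy values (a unigram word-probability callable stands in for the paper's trigram LMs; all names and numbers are ours):

```python
import math

def phrase_entropy(translations):
    """Eq. 1: entropy of a source phrase, given its translation
    distribution as a dict {target_phrase: p(t|s)}."""
    return -sum(p * math.log2(p) for p in translations.values() if p > 0)

def cross_entropy(tokens, lm_prob):
    """Eq. 2: per-word cross-entropy of a token sequence under a language
    model given as a word -> probability callable (unigram stand-in)."""
    return -sum(math.log2(lm_prob(w)) for w in tokens) / len(tokens)

# Toy usage: a uniform two-way translation distribution has entropy 1 bit,
# illustrating why fewer, more skewed options mean lower PtEnt.
h = phrase_entropy({"house": 0.5, "home": 0.5})
```

A real PtEnt computation would average `phrase_entropy` over all source phrases in the table, and a real LM would condition each word on its history rather than use unigram probabilities.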
As with the entropy-based measures, we define two cross-entropy metrics: phrase table cross-entropy or PtCrEnt calculates the average cross-entropy over weighted cross-entropies of all translation options for each source phrase, and covering set cross-entropy or CovCrEnt finds the optimal covering set of test sentences that minimizes the weighted cross-entropy of the source phrases in the covering set. Given a phrase table T and a language model L, the weighted cross-entropy W for a source phrase s is:

    W(s, L) = Σ_{t∈T} H(t, L) · p(t|s)    (3)

where H(t, L) is the cross-entropy of t according to a language model L.

Table 2 depicts various statistical measures computed on the phrase tables corresponding to our 24 SMT systems.[1] The data meet our preliminary expectations: S→T phrase tables have more unique source phrases, but fewer translation options per source phrase. They have lower entropy and cross-entropy, but higher covering set length.

In order to assess the correspondence of each measure to translation quality, we compute the correlation of BLEU scores from Table 1 with each of the measures specified in Table 2; we compute the correlation coefficient R² (the square of Pearson's product-moment correlation coefficient) by fitting a simple linear regression model. Table 3 lists the results. Only the covering set cross-entropy measure shows stability over the French-to-English and English-to-French translation tasks, with R² equal to 0.56 and 0.54, respectively. Other measures are sensitive to the translation task: covering set entropy has the highest correlation with BLEU (R² = 0.94) when translating French-to-English, but it drops to 0.46 for the reverse task. The covering set average length measure shows similar behavior: R² drops from 0.75 in French-to-English to 0.56 in English-to-French. Still, the correlation of these measures with BLEU is high.

  Measure     R² (FR-EN)   R² (EN-FR)
  AvgTran     0.06         0.22
  PtEnt       0.03         0.19
  CovEnt      0.94         0.46
  PtCrEnt     0.33         0.44
  CovCrEnt    0.56         0.54
  CovLen      0.75         0.56

Table 3: Correlation of BLEU scores with phrase table statistical measures

Consequently, we use the three best measures, namely covering set entropy, cross-entropy and average length, as indicators of better translations, more similar to translationese. Crucially, these measures are computed directly on the phrase table, and do not require reference translations or meta-information pertaining to the direction of translation of the parallel phrase.

5 Translation Model Adaptation

We have thus established the fact that S→T phrase tables have an advantage over T→S ones that stems directly from the different characteristics of original and translated texts. We have also identified three statistical measures that explain most of the variability in translation quality. We now explore ways of taking advantage of the entire parallel corpus, including translations in both directions, in light of the above findings. Our goal is to establish the best method to address the issue of different translation direction components in the parallel corpus.

First, we simply take the union of the two subsets of the parallel corpus. We create three different mixtures of FO and EO: 500K sentences each of FO and EO (MIX1), 500K sentences of FO and 1M sentences of EO (MIX2), and 1M sentences of FO and 500K sentences of EO (MIX3). We use these corpora to train French-to-English and English-to-French MT systems, evaluating their quality on the evaluation sets described in Section 3. We use the same Moses configuration as well as the same language and reordering models as in Section 3.

Table 4 reports the results, comparing them to the results obtained for the baseline MT systems trained on individual French-original and English-original bi-texts (see Section 3).[2] Note that the mixed corpus includes many more sentences than each of the baseline models; this is a realistic scenario, in which one can opt either to use the entire parallel corpus, or only its S→T subset.

[1] The phrase tables were pruned, retaining only phrases that are included in the evaluation set.
[2] Recall that when translating from French to English, S→T means that the bi-text is French-original; when translating from English to French, S→T means it is English-original.
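Equation 3 simply weights the cross-entropy of each translation option by its conditional probability. A minimal sketch with hypothetical values (names and numbers ours):

```python
def weighted_cross_entropy(options):
    """Eq. 3: W(s, L) = sum over translations t of H(t, L) * p(t|s).

    `options` maps each target phrase to a (cross_entropy, probability)
    pair; the probabilities p(t|s) are assumed to sum to 1.
    """
    return sum(h * p for h, p in options.values())

# Toy usage: two translation options with cross-entropies of 2.0 and
# 4.0 bits under the translationese LM.
w = weighted_cross_entropy({"maison": (2.0, 0.75), "domicile": (4.0, 0.25)})
```

PtCrEnt would then average this quantity over the source phrases of the table, while CovCrEnt would minimize it over covering sets of the test sentences.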
Task: French-to-English
  Set      Total   Source   AvgTran   PtEnt   CovEnt   PtCrEnt   CovCrEnt   CovLen
  S→T
  250K     231K    69K      3.35      0.86    0.36     3.94      1.64       2.44
  500K     360K    86K      4.21      0.98    0.35     3.52      1.30       2.64
  750K     461K    96K      4.81      1.05    0.35     3.24      1.10       2.77
  1M       544K    103K     5.27      1.10    0.34     3.09      0.99       2.85
  1.25M    619K    109K     5.66      1.14    0.34     2.98      0.91       2.92
  1.5M     684K    114K     6.01      1.18    0.33     2.90      0.85       2.97
  T→S
  250K     199K    55K      3.65      0.92    0.45     4.00      1.87       2.25
  500K     317K    69K      4.56      1.05    0.43     3.57      1.52       2.42
  750K     405K    78K      5.19      1.12    0.43     3.39      1.35       2.53
  1M       479K    85K      5.66      1.16    0.42     3.21      1.21       2.61
  1.25M    545K    90K      6.07      1.20    0.41     3.11      1.12       2.67
  1.5M     602K    94K      6.43      1.24    0.41     3.04      1.07       2.71

Task: English-to-French
  Set      Total   Source   AvgTran   PtEnt   CovEnt   PtCrEnt   CovCrEnt   CovLen
  S→T
  250K     224K    49K      4.52      1.07    0.63     3.48      1.88       2.08
  500K     346K    61K      5.64      1.21    0.59     3.08      1.49       2.25
  750K     437K    68K      6.39      1.29    0.57     2.91      1.33       2.33
  1M       513K    74K      6.95      1.34    0.55     2.75      1.18       2.41
  1.25M    579K    78K      7.42      1.38    0.54     2.63      1.09       2.46
  1.5M     635K    81K      7.83      1.41    0.53     2.58      1.03       2.50
  T→S
  250K     220K    46K      4.75      1.12    0.63     3.62      2.09       2.02
  500K     334K    57K      5.82      1.24    0.60     3.24      1.70       2.16
  750K     421K    64K      6.54      1.31    0.58     2.97      1.48       2.25
  1M       489K    69K      7.10      1.36    0.57     2.84      1.35       2.32
  1.25M    550K    73K      7.56      1.40    0.55     2.74      1.25       2.37
  1.5M     603K    76K      7.92      1.43    0.55     2.66      1.17       2.41

Table 2: Statistical measures computed on the phrase tables: total size, in tokens (Total); the number of unique source phrases (Source); the average number of translations per source phrase (AvgTran); phrase table entropy (PtEnt) and covering set entropy (CovEnt); phrase table cross-entropy (PtCrEnt) and covering set cross-entropy (CovCrEnt); and the covering set average length (CovLen)

Even with a corpus several times as large, however, the mixed MT systems perform only slightly better than the S→T ones. On the one hand, this means that one can train MT systems on S→T data only, at the expense of only a minor loss in quality. On the other hand, it is obvious that the T→S component also contributes to translation quality. We now look at ways to better utilize this portion.

We compute the measures established in the previous section on phrase tables trained on the MIX corpora, and compare them with the same measures computed for phrase tables trained on the relevant S→T corpus for both translation tasks. Table 5 displays the figures for the MIX1 corpus: phrase tables trained on mixed corpora have a higher covering set average length and similar covering set entropy, but significantly worse covering set cross-entropy. Consequently, improving covering set cross-entropy has the greatest potential for improving translation quality. We therefore use this feature to encourage the decoder to select translation options that are more related to the genre of translated texts.
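The covering-set measures compared above (CovEnt, CovCrEnt, CovLen) all rest on segmenting a text with source phrases from the table. The paper searches for an optimal covering set; the sketch below approximates CovLen with a simpler greedy longest-match segmentation, so it is an illustration of the idea, not the paper's exact procedure, and all names are ours:

```python
def greedy_cover(tokens, phrase_table, max_len=10):
    """Greedily cover a token sequence with the longest source phrases
    present in `phrase_table` (a set of space-joined phrases); words not
    covered by any multi-word phrase become single-token phrases."""
    cover, i = [], 0
    while i < len(tokens):
        for l in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + l])
            if l == 1 or phrase in phrase_table:
                cover.append(phrase)
                i += l
                break
    return cover

def cov_len(tokens, phrase_table):
    """Average phrase length of the (greedy) covering set, cf. CovLen."""
    cover = greedy_cover(tokens, phrase_table)
    return sum(len(p.split()) for p in cover) / len(cover)

# Toy usage: a table that knows the bigram "the house".
table = {"the house", "house"}
cover = greedy_cover("in the house".split(), table)  # ['in', 'the house']
```

CovEnt and CovCrEnt would be obtained the same way, averaging the per-phrase entropy (Eq. 1) or weighted cross-entropy (Eq. 3) over the covering set instead of the phrase lengths.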
Task: French-to-English
  System   MIX1    MIX2    MIX3
  Union    35.27   35.36   35.94
  S→T      35.21   35.21   35.73
  T→S      32.38   33.07   32.38

Task: English-to-French
  System   MIX1    MIX2    MIX3
  Union    29.27   30.01   29.44
  S→T      29.15   29.94   29.15
  T→S      27.19   27.19   27.88

Table 4: Evaluation of the MIX systems

  French-to-English
  Measure    MIX1   S→T
  CovLen     2.78   2.64
  CovEnt     0.37   0.35
  CovCrEnt   1.58   1.10

  English-to-French
  Measure    MIX1   S→T
  CovLen     2.40   2.25
  CovEnt     0.55   0.58
  CovCrEnt   2.09   1.48

Table 5: Statistical measures computed for mixed vs. source-to-target phrase tables

We do so by adding to each phrase pair in the phrase tables an additional factor, as a measure of its fitness to the genre of translationese. We experiment with two such factors. First, we use the language models described in Section 4 to compute the cross-entropy of each translation option according to this model. We add cross-entropy as an additional score of a translation pair that can be tuned by MERT (we refer to this system as CrEnt). Since cross-entropy is a lower-is-better metric, we adjust the range of values used by MERT for this score to be negative. Second, following Moore and Lewis (2010), we define an adapting feature that not only measures how close phrases are to translated language, but also how far they are from original language, and use it as a factor in a phrase table (this system is referred to as PplRatio). We build two additional language models of original texts as follows. For original English, we extract 135,000 English-original sentences from the English portion of Europarl, and 2,700 English-original sentences from the Hansard corpus. We train a trigram language model with interpolated modified Kneser-Ney discounting on each corpus, and we interpolate both models to minimize the perplexity of the source side of the development set for the English-to-French translation task (λ = 0.49). For original French, we use 110,000 sentences from Europarl and 2,900 sentences from Hansard, with λ = 0.61. Finally, for each target phrase t in the phrase table we compute the ratio of the perplexity of t according to the original language model Lo and the perplexity of t with respect to the translated model Lt (see Section 4). In other words, the factor F is computed as follows:

    F(t) = H(t, Lo) / H(t, Lt)    (4)

We apply these techniques to the French-to-English and English-to-French phrase tables built from the mixed corpora and use each phrase table to train an SMT system. Table 6 summarizes the performance of these systems. All systems outperform the corresponding Union systems. CrEnt systems show significant improvements (p < 0.05) on balanced scenarios (MIX1) and on scenarios biased towards the S→T component (MIX2 in the French-to-English task, MIX3 in English-to-French). PplRatio systems exhibit more consistent behavior, showing small, but statistically significant improvements (p < 0.05) in all scenarios.

Task: French-to-English
  System     MIX1    MIX2    MIX3
  Union      35.27   35.36   35.94
  CrEnt      35.54   35.45   36.75
  PplRatio   35.59   35.78   36.22

Task: English-to-French
  System     MIX1    MIX2    MIX3
  Union      29.27   30.01   29.44
  CrEnt      29.47   30.44   29.45
  PplRatio   29.65   30.34   29.62

Table 6: Evaluation of MT systems

Note again that all systems in the same column are trained on exactly the same corpus and have exactly the same phrase tables. The only difference is an additional factor in the phrase table that encourages the decoder to select translation options that are closer to translated texts than to original ones.
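Equation 4 is a per-phrase ratio of two cross-entropies. A minimal sketch (a unigram word-probability callable stands in for the paper's trigram LMs; names and toy probabilities are ours); values above 1 indicate a phrase that the translationese LM models better than the original-language LM:

```python
import math

def cross_entropy(tokens, lm_prob):
    """Per-word cross-entropy of a phrase under a word -> probability
    model (a unigram stand-in for the paper's trigram LMs)."""
    return -sum(math.log2(lm_prob(w)) for w in tokens) / len(tokens)

def ppl_ratio_factor(target_phrase, original_lm, translated_lm):
    """Eq. 4: F(t) = H(t, Lo) / H(t, Lt)."""
    tokens = target_phrase.split()
    return cross_entropy(tokens, original_lm) / cross_entropy(tokens, translated_lm)

# Toy usage: the "translated" LM finds every word twice as probable,
# so the factor comes out above 1 for this phrase.
f = ppl_ratio_factor("the house", lambda w: 0.25, lambda w: 0.5)
```

In the actual systems this factor is attached to each phrase pair as an extra score whose weight is then tuned by MERT alongside the standard features.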
6 Analysis

In order to study the effect of the adaptation qualitatively, rather than quantitatively, we focus on several concrete examples. We compare translations produced by the Union (henceforth baseline) and by the PplRatio (henceforth adapted) French-English SMT systems. We manually inspect 200 sentences of length between 15 and 25 from the French-English evaluation set.

In many cases, the adapted system produces more fluent and accurate translations. In the following examples, the baseline system generates common translations of French words that are adequate for a wider context, whereas the adapted system chooses less common, but more suitable translations:

Source: J'ai eu cette perception et j'étais assez certain que ça allait se faire.
Baseline: I had that perception and I was enough certain it was going do.
Adapted: I had that perception and I was quite certain it was going do.

Source: J'attends donc que vous en demandiez la permission, monsieur le Président.
Baseline: I look so that you seek permission, mr. chairman.
Adapted: I await, then, that you seek permission, mr. chairman.

In quite a few cases, the baseline system leaves out important words from the source sentence, producing ungrammatical, even illegible translations, whereas the adapted system generates good translations. Careful traceback reveals that the baseline system splits the source sentence into phrases differently (and less optimally) than the adapted system. Apparently, when the decoder is coerced to select translation options that are more adapted to translationese, it tends to select source phrases that are more related to original texts, resulting in more successful coverage of the source sentence:

Source: Pourtant, lorsqu'on les avait présentées, c'était pour corriger les problèmes liés au PCSRA.
Baseline: Yet when they had presented, it was to correct the problems the CAIS program.
Adapted: Yet when they had presented, it was to correct the problems associated with CAIS.

Source: Cependant, je pense qu'il est prématuré de le faire actuellement, étant donné que le ministre a lancé cette tournée.
Baseline: However, I think it is premature to the right now, since the minister launched this tour.
Adapted: However, I think it is premature to do so now, given that the minister has launched this tour.

Finally, there are often cultural differences between languages, specifically the use of a 24-hour clock (common in French) vs. a 12-hour clock (common in English). The adapted system is more consistent in translating the former to the latter:

Source: On avait décidé de poursuivre la séance jusqu'à 18 heures, mais on n'aura pas le temps de faire un autre tour de table.
Baseline: We had decided to continue the meeting until 18 hours, but we will not have the time to do another round.
Adapted: We had decided to continue the meeting until 6 p.m., but we won't have the time to do another round.

Source: Vu qu'il est 17 h 20, je suis d'accord pour qu'on ne discute pas de ma motion immédiatement.
Baseline: Seen that it is 17h 20, I agree that we are not talking about my motion immediately.
Adapted: Given that it is 5:20, I agree that we are not talking about my motion immediately.

In (human) translation circles, translating out of one's mother tongue is considered unprofessional, even unethical (Beeby, 2009). Many professional associations in Europe urge translators to work exclusively into their mother tongue (Pavlović, 2007). The two kinds of automatic systems built in this paper reflect only partly the human situation, but they do so in a crucial way. The S→T systems learn examples from many human translators who follow the decree according to which translation should be made into one's native tongue. The T→S systems are flipped directions of humans' input and output. The S→T direction proved to be more fluent, accurate and even more culturally sensitive. This has to do with the fact that the translators cover the source texts more fully, yielding a better translation model.
7 Conclusion

Phrase tables trained on parallel corpora that were translated in the same direction as the translation task perform better than ones trained on corpora translated in the opposite direction. Nonetheless, even "wrong" phrase tables contribute to the translation quality. We analyze both "correct" and "wrong" phrase tables, uncovering a great deal of difference between them. We use insights from Translation Studies to explain these differences; we then adapt the translation model to the nature of translationese.

We incorporate information-theoretic measures that correlate well with translationese into phrase tables as an additional score that can be tuned by MERT, and show a statistically significant improvement in the translation quality over all baseline systems. We also analyze the results qualitatively, showing that SMT systems adapted to translationese tend to produce more coherent and fluent outputs than the baseline systems. An additional advantage of our approach is that it does not require an annotation of the translation direction of the parallel corpus. It is completely generic and can be applied to any language pair, domain or corpus.

This work can be extended in various directions. We plan to further explore the use of two phrase tables, one for each direction-determined subset of the parallel corpus. Specifically, we will interpolate the translation models as in Foster and Kuhn (2007), including a maximum a posteriori combination (Bacchiani et al., 2006). We also plan to upweight the S→T subset of the parallel corpus and train a single phrase table on the concatenated corpus. Finally, we intend to extend this work by combining the translation-model adaptation we present here with the language-model adaptation suggested by Lembersky et al. (2011) in a unified system that is more tuned to generating translationese.

Acknowledgments

We are grateful to Cyril Goutte, George Foster and Pierre Isabelle for providing us with an annotated version of the Hansard corpus. This research was supported by the Israel Science Foundation (grant No. 137/06) and by a grant from the Israeli Ministry of Science and Technology.

References

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. Domain adaptation via pseudo in-domain data selection. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 355-362. Association for Computational Linguistics, July 2011. URL http://www.aclweb.org/anthology/D11-1033.

Michiel Bacchiani, Michael Riley, Brian Roark, and Richard Sproat. MAP adaptation of stochastic grammars. Computer Speech and Language, 20:41-68, January 2006. doi: 10.1016/j.csl.2004.12.001. URL http://dl.acm.org/citation.cfm?id=1648820.1648854.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233-252. John Benjamins, Amsterdam, 1993.

Mona Baker. Corpora in translation studies: An overview and some suggestions for future research. Target, 7(2):223-243, September 1995.

Mona Baker. Corpus-based translation studies: The challenges that lie ahead. In Terminology, LSP and Translation: Studies in language engineering in honour of Juan C. Sager, pages 175-186. John Benjamins, Amsterdam, 1996.

Marco Baroni and Silvia Bernardini. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259-274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1.

Alison Beeby. Direction of translation (directionality). In Mona Baker and Gabriela Saldanha, editors, Routledge Encyclopedia of Translation Studies, pages 84-88. Routledge (Taylor and Francis), New York, 2nd edition, 2009.

Stanley F. Chen. An empirical study of smoothing techniques for language modeling. Technical Report 10-98, Computer Science Group, Harvard University, November 1998.

George Foster and Roland Kuhn. Mixture-model adaptation for SMT. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 128-135. Association for Computational Linguistics, June 2007. URL http://www.aclweb.org/anthology/W/W07/W07-0717.

George Foster, Cyril Goutte, and Roland Kuhn. Discriminative instance weighting for domain adaptation in statistical machine translation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 451-459, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1870658.1870702.

Jianfeng Gao, Joshua Goodman, Mingjing Li, and Kai-Fu Lee. Toward a unified approach to statistical language modeling for Chinese. ACM Transactions on Asian Language Information Processing, 1:3-33, March 2002. doi: 10.1145/595576.595578. URL http://doi.acm.org/10.1145/595576.595578.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88-95. CWK Gleerup, Lund, 1986.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503-511. Springer, 2010. URL http://dx.doi.org/10.1007/978-3-642-12116-6.

Howard Johnson, Joel Martin, George Foster, and Roland Kuhn. Improving translation quality by discarding most of the phrasetable. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 967-975. Association for Computational Linguistics, June 2007. URL http://www.aclweb.org/anthology/D/D07/D07-1103.

Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP 2004, pages 388-395, Barcelona, Spain, July 2004. Association for Computational Linguistics.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79-86, Phuket, Thailand, 2005. AAMT. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Philipp Koehn, Alexandra Birch, and Ralf Steinberger. 462 machine translation systems for Europe. In Machine Translation Summit XII, 2009.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318-1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, 2009.

Patrik Lambert, Holger Schwenk, Christophe Servan, and Sadaf Abdul-Rauf. Investigations on translation model adaptation using monolingual data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 284-293. Association for Computational Linguistics, July 2011. URL http://www.aclweb.org/anthology/W11-2132.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363-374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.

Robert C. Moore and William Lewis. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference, Short Papers, pages 220-224, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1858842.1858883.

Franz Josef Och. Minimum error rate training in statistical machine translation. In ACL '03: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 160-167, Morristown, NJ, USA, 2003. Association for Computational Linguistics. doi: 10.3115/1075096.1075117.

Franz Josef Och and Hermann Ney. Improved statistical alignment models. In ACL '00: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 440-447, Morristown, NJ, USA, 2000. Association for Computational Linguistics. doi: 10.3115/1075218.1075274.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135.

Nataša Pavlović. Directionality in translation and interpreting practice: Report on a questionnaire survey in Croatia. Forum, 5(2):79-99, 2007.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Hans van Halteren. Source language markers in EUROPARL translations. In COLING '08: Proceedings of the 22nd International Conference on Computational Linguistics, pages 937-944, Morristown, NJ, USA, 2008. Association for Computational Linguistics.
Aspectual Type and Temporal Relation Classification

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 266-275, Avignon, France, April 23-27 2012. ©2012 Association for Computational Linguistics
<s>In Washington <TIMEX3 tid="t53" type="DATE"
value="1998-01-14">today</TIMEX3>, the Federal
Aviation Administration <EVENT eid="e1"
class="OCCURRENCE" stem="release"
aspect="NONE" tense="PAST" polarity="POS"
pos="VERB">released</EVENT> air traffic control
tapes from <TIMEX3 tid="t54" type="TIME"
value="1998-XX-XXTNI">the night</TIMEX3> the TWA
Flight eight hundred <EVENT eid="e2"
class="OCCURRENCE" stem="go" aspect="NONE"
tense="PAST" polarity="POS"
pos="VERB">went</EVENT> down.</s>
<TLINK lid="l1" relType="BEFORE" eventID="e2"
relatedToTime="t53"/>
<TLINK lid="l2" relType="OVERLAP"
eventID="e2" relatedToTime="t54"/>

Figure 1: Sample of the data annotated for TempEval, corresponding to the fragment: In Washington today, the Federal Aviation Administration released air traffic control tapes from the night the TWA Flight eight hundred went down.

                                    Task
                              A      B      C
Best system                  0.62   0.80   0.55
Average of all participants  0.56   0.74   0.51
Majority class baseline      0.57   0.56   0.47

Table 1: Results for English in TempEval (F-measure), from Verhagen et al. (2009)

...B focused on the temporal relation between events and the document's creation time, which is also annotated in TimeML (not shown in that Figure); and task C was about classifying the temporal relation between the main events of two consecutive sentences. The possible values for the type of temporal relation are BEFORE, AFTER and OVERLAP.[1]

[1] There are the additional disjunctive values BEFORE-OR-OVERLAP, OVERLAP-OR-AFTER and VAGUE, employed when the annotators could not make a more specific decision, but these affect a small number of instances.

Table 1 shows the results of the first TempEval evaluation. The results of TempEval-2 are fairly similar (Verhagen et al., 2010), although the data used are not identical.

The best system in TempEval for tasks A and B (Puscasu, 2007) combined statistical and knowledge-based methods to propagate temporal constraints along parse trees coming from a syntactic parser. The best system for task C (Min et al., 2007) also combined rule-based and machine learning approaches. It employed sophisticated NLP to compute some of the features used; more specifically, it used syntactic features.

Our goal with this work is to evaluate the impact of information about aspectual type on these tasks. The TimeML annotations include an attribute class for EVENTs that encodes some aspectual information, distinguishing between stative (annotated with the value STATE) and non-stative events (value OCCURRENCE). This attribute is relevant to the classification problem at hand, i.e. it is a useful feature for machine-learned classifiers for the TempEval tasks (although this class attribute encodes other kinds of information as well). However, aspectual distinctions can be more fine-grained than a mere binary distinction, and so far no system has explored this sort of information to help improve the solutions to temporal relation classification.

In this paper we work with Portuguese, but in principle there is no reason to believe that our findings would not apply to other languages that display similar aspectual phenomena, such as English. Some of the details, such as the material in Section 4.2, are however language-specific and would need adaptation.

2 Aspectual Type

Distinctions of aspectual type (also referred to as situation type, lexical aspect or Aktionsart) of the sort of Vendler (1967) and Dowty (1979) are expected to improve the existing solutions to the problem of temporal relation classification. The major aspectual distinctions are between (i) states (e.g. to hate beer, to know the answer, to own a car, to stink), (ii) processes, also called activities (to work, to eat ice cream, to grow, to play the piano), (iii) culminated processes, also called accomplishments (to paint a picture, to burn down, to deliver a sermon), and (iv) culminations, also called achievements (to explode, to win the game, to find the key). States and processes are atelic situations in that they do not make salient a specific instant in time. Culminated processes and culminations are telic situations: they have an intrinsic, instantaneous endpoint, called the culmination (e.g. in the case of to paint a picture, it is the moment when the picture is ready; in the case of to explode, it is the moment of the explosion).
267
There are several reasons to think aspectual type is relevant to temporal information processing. First, these distinctions are related to how long events last: culminations are punctual, whereas states can be very prolonged in time. States are thus more likely to temporally overlap other temporal entities than culminations, for instance.

Second, there are grammatical consequences for how events are anchored in time. Consider the following examples, from Ritchie (1979) and Moens and Steedman (1988):

(1) When they built the 59th Street bridge, they used the best materials.
(2) When they built that bridge, I was still a young lad.

The situation of building the bridge is a culminated process, composed of the process of actively building a bridge followed by the culmination of the bridge being finished. In sentence (1), the event described in the main clause (that of using the best materials) is a process, but in sentence (2) it is a state (the state of being a young lad). Even though the two clauses in each sentence are connected by when, the temporal relations holding between the events of each clause are different. On the one hand, in sentence (1) the event of using the best materials (a process) overlaps with the process of actively building the bridge and precedes the culmination of finishing the bridge. On the other hand, in sentence (2) the event of being a young lad (which is a state) overlaps with both the process of actively building the bridge and the culmination of the bridge being built. This difference is arguably caused by the different aspectual types of the main events of each sentence.

As another example, states overlap with temporal location adverbials, as in (3), while culminations are included in them, as in (4).

(3) He was happy last Monday.
(4) He reached the top of Mount Everest last Monday.

In other cases, differences in aspectual type can disambiguate ambiguous linguistic material. For instance, the preposition in is ambiguous, as it can be used to locate events in the future but also to measure the duration of culminated processes; it is thus ambiguous with culminated processes, as in he will read the book in three days, but not with other aspectual types, as in he will be living there in three days.

A factor related to aspectual class that is not trivial to account for is the phenomenon of aspectual shift, or aspectual coercion (Moens and Steedman, 1988; de Swart, 1998; de Swart, 2000). Many linguistic contexts pose constraints on aspectual type. This does not mean, however, that clashes of aspectual type cause ungrammaticality. What often happens is that phrases associated with an incompatible aspectual type get their type changed in order to be of the required type, causing a change in meaning.

For instance, the progressive construction combines with processes. When it combines with e.g. a culminated process, the culmination is stripped off from this culminated process, which is thus converted into a process. The result is that a sentence like (5) does not say that the bridge was finished (the event has no culmination), whereas one such as (6) does say this (the event has a culmination).

(5) They were building that bridge.
(6) They built that bridge.

Aspectual type is not a property of just words, but of phrases as well. For example, while the progressive construction just mentioned combines with processes, the resulting phrase behaves as a state (cf. the sentence When they built the 59th Street bridge, they were using the best materials and what was mentioned above about when clauses).

3 Strategy

Aspectual type is hard to annotate. This is partly because of what was just mentioned: it is not a property of just words, but rather of phrases, and different phrases with the same head word can have different aspectual types; however, annotation schemes like TimeML annotate the head word as denoting the event, not the full phrase or clause.
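The coercion behavior described in Section 2, which is also why aspectual type cannot simply be attached to a head word, can be made concrete in a toy model. The type names and the single coercion rule below are illustrative assumptions of ours, not part of any annotation scheme:

```python
# Toy model of the aspectual coercion discussed above (after Moens and
# Steedman, 1988): the progressive selects a process; a culminated
# process is coerced by stripping its culmination; the resulting
# progressive phrase behaves as a state.
STATE = "state"
PROCESS = "process"
CULMINATED_PROCESS = "culminated process"

def progressive(aspectual_type):
    """Aspectual type of the progressive of a phrase of the given type."""
    if aspectual_type == CULMINATED_PROCESS:
        # coercion: the culmination is stripped off, leaving a process
        aspectual_type = PROCESS
    if aspectual_type != PROCESS:
        raise ValueError("progressive does not combine with a " + aspectual_type)
    return STATE  # the resulting phrase behaves as a state

# (6) "They built that bridge": a culminated process.
# (5) "They were building that bridge": culmination dropped, behaves as a state.
print(progressive(CULMINATED_PROCESS))  # state
```

The point of the toy model is that the output type depends on the whole construction, not on the verb alone, which is exactly what makes word-level annotation of aspect lossy.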
For this reason, our strategy is to obtain aspectual type information from unannotated data. Because these data are gradient (an event-denoting word can be associated with different aspectual types, depending on word sense), we do not aim to extract categorical information, but rather numeric values for each event term that reflect its associations to aspectual types. These may be seen as values indicative of the frequencies with which an event term denotes a state, or a process, etc.

In order to extract these indicators, we resort to a methodology sometimes referred to as Google Hits: large numbers of queries are sent to a web search engine (not necessarily Google), and the number of search results (the number of web pages that match the query) is recorded and taken as a measure of the frequency of the queried expression.

This methodology is not perfect: multiple occurrences of the queried expression in the same web page are not reflected in the hit count, and in many cases the hit counts reported by search engines are just estimates and may not be very accurate. Additionally, carelessly formulated queries can match expressions that are syntactically and semantically very different from what was intended. In any case, it has the advantages of being based on a very large amount of data and of not requiring any manual annotation, which can introduce errors.

3.1 The Web as a Very Large Corpus

Hearst (1992) is one of the earliest studies where specific textual patterns are used to extract lexico-semantic information from very large corpora. The author's goal was to extract hyponymy relations. With the same goal, Kozareva et al. (2008) apply similar textual patterns to the web.

The web has been used as a corpus by many other authors with the purpose of extracting syntactic or semantic properties of words, or relations between them, e.g. Ravichandran and Hovy (2002), Etzioni et al. (2004), etc. Some of this work is especially relevant to the problem of temporal information processing. VerbOcean (Chklovski and Pantel, 2004) is a database of web-mined relations between verbs. Among other kinds of relations, it includes typical precedence relations, e.g. sleeping happens before waking up. This type of information has in fact been used by some of the participating systems of TempEval-2 (Ha et al., 2010), with good results.

More generally, there is a large body of work focusing on lexical acquisition from corpora. Just as an example, Mayol et al. (2005) learn subcategorization frames of verbs from large amounts of data. Relevant to our work is that of Siegel and McKeown (2000), who guess the aspectual type of verbs by searching for specific patterns in a one-million-word corpus that has been syntactically parsed. They extract several linguistic indicators and combine them with machine learning algorithms. The indicators that they extract are naturally different from ours, since they have access to syntactic structure and we do not, but our data are based on a much larger corpus.

3.2 Textual Patterns as Indicators of Aspectual Type

Because of aspectual shift phenomena (see Section 2), full syntactic parsing is necessary in order to determine the aspectual type of a natural language expression. However, this can be approximated by frequencies: it is natural to expect that e.g. stative verbs occur more frequently in stative contexts than non-stative verbs, even if there may be errors in determining these contexts when syntactic parsing is not a possibility.

If one uses Google Hits, syntactic information is not accessible. In return for its impreciseness, the Google Hits methodology has the advantage of being based on very large amounts of data.

4 Scope and Approach

In this study we focus exclusively on verbs, but events can be denoted by words belonging to other parts of speech. This limitation is linked to the fact that the textual patterns that are used to search for specific aspectual contexts are sensitive to part of speech (i.e. what may work for a verb may not work equally well for a noun).

In order to assess whether aspectual type information is relevant to the problem of temporal relation classification, our approach is to check whether incorporating that kind of information into existing solutions for this problem can improve their performance. TimeML-annotated data, such as those used for TempEval, can be used to train machine-learned classifiers. These can then be augmented with attributes encoding aspectual type information and their performance compared to that of the original classifiers.

Additionally, we work with Portuguese data. This is because our work is part of an effort to implement a temporal processing system for Portuguese. We briefly describe the data next.
<s>Em Washington, <TIMEX3 tid="t53" type="DATE"
value="1998-01-14">hoje</TIMEX3>, a Federal Aviation
Administration <EVENT eid="e1" class="OCCURRENCE"
stem="publicar" aspect="NONE" tense="PPI"
polarity="POS" pos="VERB">publicou</EVENT>
gravações do controlo de tráfego aéreo da <TIMEX3
tid="t54" type="TIME"
value="1998-XX-XXTNI">noite</TIMEX3> em que o voo
TWA800 <EVENT eid="e2" class="OCCURRENCE"
stem="cair" aspect="NONE" tense="PPI"
polarity="POS" pos="VERB">caiu</EVENT>.</s>
<TLINK lid="l1" relType="BEFORE" eventID="e2"
relatedToTime="t53"/>
<TLINK lid="l2" relType="OVERLAP"
eventID="e2" relatedToTime="t54"/>

Figure 2: Sample of the Portuguese data adapted from the TempEval data, corresponding to the fragment: Em Washington, hoje, a Federal Aviation Administration publicou gravações do controlo de tráfego aéreo da noite em que o voo TWA800 caiu.

4.1 Data

Our experiments used TimeBankPT (Costa and Branco, 2010; Costa and Branco, 2012; Costa, to appear). This corpus is an adaptation of the original TempEval data to Portuguese, obtained by translating them and then adapting the annotations. Figure 2 shows the Portuguese equivalent to the sample presented above in Figure 1. The two corpora are quite similar, apart of course from the language difference. TimeBankPT contains a few corrections to the data (mostly to the temporal relations), but these corrections only changed around 1.2% of the total number of annotated temporal relations (Costa and Branco, 2012). Although we did not test our results on English data, we speculate that they carry over to other languages.

Just like the original English corpus for TempEval, it is divided into a training part and a testing part. The numbers (sentences, words, annotated events, time expressions and temporal relations) are fairly similar for the two corpora (the English one and the Portuguese one).

4.2 Extracting the Aspectual Indicators

We extracted the 4,000 most common verbs from a 180-million-word corpus of Portuguese newspaper text, CETEMPúblico. Because this corpus is not annotated, we used a part-of-speech tagger and morphological analyzer (Barreto et al., 2006; Silva, 2007) to detect verbs and to obtain their dictionary form. We then used an inflection tool (Branco et al., 2009) to generate the specific verb forms that are used in the queries. They are mostly third person singular forms of several different tenses.

The indicators that we used are ratios of Google Hits; each compares two queries. Several indicators were tested. We provide examples with the verb fazer 'do' for the queries being compared by each indicator. The name of each indicator reflects the aspectual type being tested, i.e. states should present high values for State Indicators 1 and 2, processes should show high values for Process Indicators 1-4, etc.

State Indicator 1 (Indicator S1) is about imperfective and perfective past forms of verbs. It compares the number of hits a for an imperfective form fazia 'did' to the number of hits b for a perfective form fez 'did': a/(a+b). Assuming the imperfective past constrains the entire clause to be a state, and the perfective past constrains it to be telic, the higher this value, the more frequently the verb appears in stative clauses in a past tense.[2]

[2] We expect this frequency to be indicative of states because states can appear in the imperfective past tense with their interpretation unchanged, whereas non-stative events have their interpretation shifted to a stative one in that context (e.g. they get a habitual reading). In order to refer to an event occurring in the past with an on-going interpretation, non-stative verbs require the progressive construction to be used in Portuguese, whereas states do not. Therefore, states should occur more freely in the simple imperfective past.

State Indicator 2 (Indicator S2) is about the co-occurrence with acaba de 'has just finished'. It compares the number of hits a for acaba de fazer 'has just finished doing' to the number of hits b for fazer 'to do': b/(a+b). In Portuguese, this construction does not seem to be felicitous with states.

Process Indicator 1 (Indicator P1) is about past progressive forms and simple past forms (both imperfective). It compares the number of hits a for fazia 'did' to the number of hits b for estava a fazer 'was doing': b/(a+b). Assuming the progressive construction is a function from processes to states (see Section 2), the higher this value, the more likely the verb can occur with the interpretation of a process.
Process Indicator 2 (Indicator P2) is about past progressive forms vs. simple past forms (perfective). It compares the number of hits a for fez 'did' to the number of hits b for esteve a fazer 'was doing': b/(a+b). Similarly to the previous indicator, this one tests the frequency of a verb appearing in a context typical of processes.

Process Indicator 3 (Indicator P3) is about the occurrence of for-adverbials. It compares the number of hits a for fez 'did' to the number of hits b for fez durante muito tempo 'did for a long time': b/(a+b). This number is also intended to be an indication of how frequently a verb can be used with the interpretation of a process. Note that Portuguese allows modifiers to occur freely between a verb and its complements, so this test should work for transitive verbs (or any other subcategorization frame involving complements), not just intransitive ones.

Process Indicator 4 (Indicator P4) is about the co-occurrence of a verb with parar de 'to stop'. It compares the number of hits a for parou de fazer 'stopped doing' to the number of hits b for fazer 'to do': a/(a+b). Just like the English verbs stop and finish are sensitive to the aspectual type of their complement, so is the Portuguese verb parar, which selects for processes.

Atelicity Indicator 1 (Indicator A1) is about comparing in- and for-adverbials. It compares the number of hits a for fez num instante 'did in an instant' to the number of hits b for fez durante muito tempo 'did for a long time': b/(a+b). Processes can be modified by for-adverbials, whereas culminated processes are modified by in-adverbials. This indicator tests the occurrence of a verb in contexts that require these aspectual types.

Atelicity Indicator 2 (Indicator A2) is about comparing for-adverbials with suddenly. It compares the number of hits a for fez de repente 'did suddenly' to the number of hits b for fez durante muito tempo 'did for a long time': b/(a+b). De repente 'suddenly' seems to modify culminations, so this indicator compares process readings with culmination readings.

Culmination Indicator 1 (Indicator C1) is about differentiating culminations and culminated processes. It compares the number of hits a for fez de repente 'did suddenly' to the number of hits b for fez num instante 'did in an instant': a/(a+b).

For each of the 4,000 verbs, the queries required by these indicators were generated and then sent to a search engine. The queries were enclosed in quotes, so as to guarantee exact matches. The number of hits was recorded for each query.

We had some problems with outliers for a few rather infrequent verbs, which could show very extreme values for some indicators. In order to minimize their impact, for each indicator we homogenized the 100 highest values that were found: each one of the highest 100 values was replaced by the 100th highest value, and the bottom 100 values were similarly changed. This way the top 99 values and the bottom 99 values are effectively replaced by the 100th highest value and the 100th lowest value, respectively.

In theory, each indicator ranges between 0 and 1. In practice, we seldom find values close to the extremes, as this would imply that some queries had close to 0 hits, which does not occur very often (after all, we intentionally used queries for which we would expect large hit counts, as these are more likely to be representative of true language use). For this reason, each indicator is scaled so that its minimum (actual) value is 0 and its maximum (actual) value is 1.
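The indicator computation of Section 4.2 (a hit-count ratio per verb, replacement of the 100 most extreme values at each end, then min-max rescaling) can be sketched as follows. This is only an illustration under stated assumptions: the function names and the use of plain Python lists are our own choices, and real hit counts would come from a search engine API rather than be passed in directly.

```python
def atelicity_indicator_1(hits_in_an_instant: int, hits_for_a_long_time: int) -> float:
    """Indicator A1: b / (a + b), where a is the hit count for
    'fez num instante' and b the hit count for 'fez durante muito tempo'."""
    a, b = hits_in_an_instant, hits_for_a_long_time
    return b / (a + b) if a + b > 0 else 0.0


def clip_and_rescale(values: list[float]) -> list[float]:
    """Replace each of the 100 highest values by the 100th highest value
    (and likewise at the bottom), then rescale so that the actual minimum
    becomes 0 and the actual maximum becomes 1."""
    ordered = sorted(values)
    if len(ordered) >= 200:
        lo, hi = ordered[99], ordered[-100]  # 100th lowest / 100th highest
        values = [min(max(v, lo), hi) for v in values]
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [0.0] * len(values)
    return [(v - v_min) / (v_max - v_min) for v in values]
```

The guard on the list length simply skips the clipping step when fewer than 200 values are available; the paper does not discuss that corner case.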
The system of Hepple et al. (2007) was one of the participating systems of TempEval. It used machine learning algorithms implemented in Weka (Witten and Frank, 1999). For our experiments, we used Weka's implementation of the C4.5 algorithm, trees.J48 (Quinlan, 1993), the RIPPER algorithm as implemented by Weka's rules.JRip (Cohen, 1995), a nearest neighbors classifier, lazy.KStar (Cleary and Trigg, 1995), a Naive Bayes classifier, namely Weka's bayes.NaiveBayes (John and Langley, 1995), and a support vector classifier, Weka's functions.SMO (Platt, 1998). We chose these algorithms as they are representative of a wide range of machine learning approaches.

Recall that the tasks of TempEval are to guess the type of temporal relations. Each train or test instance thus corresponds to a temporal relation, i.e. a TLINK element in the TimeML annotations (see Figures 1 and 2). The classification problem is to determine the value of the attribute relType of TimeML TLINK elements. These temporal relations relate an event (referred to by the eventID attribute of TLINK elements) to another temporal entity, which can be a time (pointed to by the relatedToTime attribute), in the case of tasks A and B, or, in the case of task C, another event (given by the relatedToEvent attribute).

As for the features that were employed, we also took inspiration from the approach of Hepple et al. (2007). These authors used as classifier attributes two types of features. The first group of features corresponds to TimeML attributes: for instance the value of the aspect attribute of EVENT elements, for the events involved in the temporal relation to be classified. The second group of features corresponds to simple features that can be computed with string manipulation and do not require any kind of natural language processing. Table 2 shows the features that were tried and employed.

                                Task
    Attribute               A     B     C
    event-aspect            X     X
    event-polarity          X     X     X
    event-POS               X
    event-stem              X
    event-string            X
    event-class             X     X
    event-tense             X     X     X
    order-event-first       X    N/A   N/A
    order-event-between     X    N/A   N/A
    order-timex3-between         N/A   N/A
    order-adjacent          X    N/A   N/A
    timex3-mod              X          N/A
    timex3-type                        N/A
    tlink-relType           X     X     X

Table 2: Feature combinations used in the classifiers used as comparison bases. Features are inspired by the ones used by Hepple et al. (2007) in TempEval.

The event features correspond to attributes of EVENT elements, with the exception of the event-string feature, which takes as value the character data inside the corresponding TimeML EVENT element. In a similar spirit, the timex3 features are taken from the attributes of TIMEX3 elements with the same name. The tlink-relType feature is the class attribute and corresponds to the relType attribute of the TimeML TLINK element that represents the temporal relation to be classified. The order features are attributes computed from the document's textual content. The feature order-event-first encodes whether the event term precedes in the text the time expression it is related to by the temporal relation to classify. The classifier attribute order-event-between describes whether any other event is mentioned in the text between the two expressions for the entities that are in the temporal relation, and similarly order-timex3-between is about whether there is an intervening temporal expression. Finally, order-adjacent is true iff both order-timex3-between and order-event-between are false (even if other linguistic material occurs between the expressions denoting the two entities in the temporal relation).

In order to arrive at the final set of features (marked with a check mark in Table 2), we performed exhaustive search on all possible combinations of these features for each task, using the Naive Bayes algorithm. They were compared using 10-fold cross-validation on the training data. The feature combinations shown in Table 2 are the optimal combinations arrived at in this way.

These are the classifiers that we used for the
comparison with the aspectual type indicators. We chose this straightforward approach because it forms a basis for comparison that is easily reproducible: the algorithm implementations that were used are part of freely available software, and the features that were employed are easily computed from the annotated data, with no need to run any natural language processing tools whatsoever.

As mentioned before in Section 4.1, the data used are organized into a training set and an evaluation set. The training part is around 60K words long; the test data contain around 9K words. When tested on held-out data, these classifiers present the scores shown in italics in Table 3. These results are fairly similar to the scores that the system of Hepple et al. (2007) obtained in TempEval with English data: 0.59 for task A, 0.73 for task B, and 0.54 for task C. They are also not very far from the best results of TempEval. As such they represent interesting bases for comparison, as improving their performance is likely to be relevant to the best systems that have been developed for temporal information processing.

                                    Task
    Classifier                  A      B      C
    trees.J48                 0.57   0.77   0.53
      With best indicator                   0.55
    rules.JRip                0.60   0.76   0.51
      With best indicator     0.61          0.54
    lazy.KStar                0.54   0.70   0.52
      With best indicator            0.73   0.53
    bayes.NaiveBayes          0.50   0.76   0.53
      With best indicator     0.53          0.54
    functions.SMO             0.55   0.79   0.54
      With best indicator     0.56          0.55

Table 3: Evaluation on held-out test data of classifiers trained on the full training data. Values for the classifiers used as comparison bases are in italics. Boldface highlights improvements resulting from incorporating aspectual indicators as classifier features, and missing values represent no improvement.

5.2 Results and Discussion

After obtaining the bases for comparison described above, we proceeded to check whether the aspectual type indicators described in Section 4.2 can improve these results.

For each aspectual indicator, we implemented a classifier feature that encodes its value for the event term in the temporal relation (if it is not a verb, this value is missing). In the case of task C, two features are added for each indicator, one for each event term.

We extended each of these classifiers with one of these features at a time (two in the case of task C), and checked whether it improved the results on the test data. So for instance, in order to test Indicator S1, we extended each of these classifiers with a feature that encodes the value that this indicator presents for the term that denotes the event present in the temporal relation to be classified. In the case of task C, two classifier features are added, one for each event term, and both for the same Indicator S1. For instance, for the (training) instance corresponding to the TLINK in Figure 2 with the lid attribute that has the value l1, the classifier feature for Indicator S1 has the value that was computed for the verb "cair" 'go down', since this is the stem of the word that denotes the event that is the first argument of this temporal relation. After adding each of these features, we retrained the classifiers on the training data and tested them on the held-out test data. In order to keep the evaluation manageable, we did not test combinations of multiple indicators.

Table 3 shows the overall results. For task A, the best indicators were P4 (with JRip), A1 (NaiveBayes) and S1 (SMO). For task B the best one was P4 (KStar). For task C, the best indicators were P3 (J48), A1 and P3 (JRip), C1 (KStar), A1 (NaiveBayes) and P2 (SMO). Each of the indicators S2, P1 and A2 either does not improve the results or does so but not as much as another, better indicator for the same task and algorithm.

It seems clear from Table 3 that some tasks benefit from these indicators more than others. In particular, task C shows consistent improvements whereas task B is hardly affected. Since task C is about relations involving two events, the classifiers may be picking up the sort of linguistic generalizations mentioned in Section 2 about "when" clauses.

J48 and JRip produce human-readable models. We checked how these classifiers are taking advantage of the aspectual indicators. For task C, the induced models are generally associating high
values of the indicators A1 and P3 with overlap relations and low values of these indicators with other types of relations. This is expected. On the one hand, high values for these indicators are associated with atelicity (i.e. the endpoint of the corresponding event is not presented). On the other hand, both indicators are based on queries containing the phrase "durante muito tempo" 'for a long time', which, in addition to picking up events that can be modified by "for" adverbials, more specifically picks up events that happen for a long time and are thus likely to overlap other events.

For task A, JRip also associates high values of the indicator P4, which constitute evidence that the corresponding events are processes (which are atelic), with overlap relations. This is an especially interesting result, considering that the queries on which this indicator is based reflect a purely aspectual constraint.

6 Concluding Remarks

In this paper, we evaluated the relevance of information about aspectual type for temporal processing tasks.

Temporal information processing has received substantial attention recently with the two TempEval challenges in 2007 and 2010. The most interesting problem of temporal information processing, that of temporal relation classification, is still affected by high error rates.

Even though a very substantial part of the semantics literature on tense and aspect focuses on aspectual type, solutions to the problem of automatic temporal relation classification have not incorporated this sort of semantic information. In part this is expected, as aspectual type is very interconnected with syntax (cf. the discussion about aspectual coercion in Section 2), and the phenomenon of aspect shift can make it hard to compute even when syntactic information is available.

Our contribution with this paper is to incorporate this sort of information in existing machine learned classifiers that tackle this problem. Even though these classifiers do not have access to syntactic information, aspectual type information seemed to be useful in improving the performance of these models. We hypothesize that combining aspectual type information with information about syntactic structure can further improve temporal information processing, but we leave that research to future work.

An interesting question that we hope will be addressed by future work is how these results extend to other languages. We cannot provide an answer to this question, as we do not have the data. However, this experiment can be replicated for any language that has (i) TimeML annotated data, (ii) a reasonable number of documents on the Web and a search engine capable of separating them from the documents in other languages, and (iii) an aspectual system similar enough that the question being addressed in this paper makes sense (and useful patterns for queries can be constructed, even if not entirely identical to the ones that we used). The second criterion is met by many, many languages. The third one also seems to hold for many languages, as the existing literature on aspectual phenomena indicates that these phenomena are quite widespread. The first criterion is, at the moment, the hardest to fulfill, as not many languages have data with rich annotations about time (i.e. including events and temporal relations). We speculate that our results can extend to English, although a different set of query patterns may have to be used in order to extract the aspectual indicators that are employed. We believe this because the two languages largely overlap when it comes to aspectual phenomena.

References

Florbela Barreto, Antonio Branco, Eduardo Ferreira, Amalia Mendes, Maria Fernanda Nascimento, Filipe Nunes, and Joao Silva. 2006. Open resources and tools for the shallow processing of Portuguese: the TagShare project. In Proceedings of LREC 2006.

Antonio Branco, Francisco Costa, Eduardo Ferreira, Pedro Martins, Filipe Nunes, Joao Silva, and Sara Silveira. 2009. LX-Center: a center of online linguistic services. In Proceedings of the Demo Session, ACL-IJCNLP 2009, Singapore.

Timothy Chklovski and Patrick Pantel. 2004. VerbOcean: Mining the Web for fine-grained semantic verb relations. In Proceedings of EMNLP-2004, Barcelona, Spain.

John G. Cleary and Leonard E. Trigg. 1995. K*: An instance-based learner using an entropic distance measure. In 12th International Conference on Machine Learning, pages 108-114.

William W. Cohen. 1995. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115-123.

Francisco Costa and Antonio Branco. 2010. Temporal information processing of a new language: Fast
porting with minimal resources. In Proceedings of ACL 2010.

Francisco Costa and Antonio Branco. 2012. TimeBankPT: A TimeML annotated corpus of Portuguese. In Proceedings of LREC 2012.

Francisco Costa. to appear. Processing Temporal Information in Unstructured Documents. Ph.D. thesis, Universidade de Lisboa, Lisbon.

Henriette de Swart. 1998. Aspect shift and coercion. Natural Language and Linguistic Theory, 16:347-385.

Henriette de Swart. 2000. Tense, aspect and coercion in a cross-linguistic perspective. In Proceedings of the Berkeley Formal Grammar conference, Stanford. CSLI Publications.

David R. Dowty. 1979. Word Meaning and Montague Grammar: the Semantics of Verbs and Times in Generative Semantics and Montague's PTQ. Reidel, Dordrecht.

Oren Etzioni, Michael Cafarella, Doug Downey, Stanley Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2004. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web.

Eun Young Ha, Alok Baikadi, Carlyle Licata, and James C. Lester. 2010. NCSU: Modeling temporal relations with Markov logic and lexical ontology. In Proceedings of SemEval 2010.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics, volume 2, pages 539-545, Nantes, France.

Mark Hepple, Andrea Setzer, and Rob Gaizauskas. 2007. USFD: Preliminary exploration of features and classifiers for the TempEval-2007 tasks. In Proceedings of SemEval-2007, pages 484-487, Prague, Czech Republic. Association for Computational Linguistics.

George H. John and Pat Langley. 1995. Estimating continuous distributions in Bayesian classifiers. In Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338-345, San Mateo.

Zornitsa Kozareva, Ellen Riloff, and Eduard Hovy. 2008. Semantic class learning from the web with hyponym pattern linkage graphs. In Proceedings of ACL-08: HLT, pages 1048-1056, Columbus, Ohio. Association for Computational Linguistics.

Laia Mayol, Gemma Boleda, and Toni Badia. 2005. Automatic acquisition of syntactic verb classes with basic resources. Language Resources and Evaluation, 39(4):295-312.

Congmin Min, Munirathnam Srikanth, and Abraham Fowler. 2007. LCC-TE: A hybrid approach to temporal relation identification in news text. pages 219-222.

Marc Moens and Mark Steedman. 1988. Temporal ontology and temporal reference. Computational Linguistics, 14(2):15-28.

John Platt. 1998. Fast training of support vector machines using sequential minimal optimization. In Bernhard Schölkopf, Chris Burges, and Alexander J. Smola, editors, Advances in Kernel Methods: Support Vector Learning.

Georgiana Puscasu. 2007. WVALI: Temporal relation identification by syntactico-semantic analysis. In Proceedings of SemEval-2007, pages 484-487, Prague, Czech Republic. Association for Computational Linguistics.

James Pustejovsky, Jose Castano, Robert Ingria, Roser Saurí, Robert Gaizauskas, Andrea Setzer, and Graham Katz. 2003. TimeML: Robust specification of event and temporal expressions in text. In IWCS-5, Fifth International Workshop on Computational Semantics.

John Ross Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.

Deepak Ravichandran and Eduard Hovy. 2002. Learning surface text patterns for a question answering system. In Proceedings of ACL 2002.

Graeme D. Ritchie. 1979. Temporal clauses in English. Theoretical Linguistics, 6:87-115.

Eric V. Siegel and Kathleen McKeown. 2000. Learning methods to combine linguistic indicators: Improving aspectual classification and revealing linguistic insights. Computational Linguistics, 24(4):595-627.

Joao Ricardo Silva. 2007. Shallow processing of Portuguese: From sentence chunking to nominal lemmatization. Master's thesis, Faculdade de Ciencias da Universidade de Lisboa, Lisbon, Portugal.

Zeno Vendler. 1967. Verbs and times. Linguistics in Philosophy, pages 97-121.

Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, and James Pustejovsky. 2007. SemEval-2007 Task 15: TempEval temporal relation identification. In Proceedings of SemEval-2007.

Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Jessica Moszkowicz, and James Pustejovsky. 2009. The TempEval challenge: identifying temporal relations in text. Language Resources and Evaluation.

Marc Verhagen, Roser Saurí, Tommaso Caselli, and James Pustejovsky. 2010. SemEval-2010 task 13: TempEval-2. In Proceedings of SemEval-2010.

Ian H. Witten and Eibe Frank. 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco.
Automatic generation of short informative sentiment summaries

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 276-285, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
conveyed succinctly. We present a simple unsupervised method for extracting supporting sentences and show that it is superior to a baseline in a novel crowdsourcing-based evaluation.

In the next section, we describe related work that is relevant to our new approach. In Section 3 we present the approach we use to identify supporting sentences. Section 4 describes the feature representation of sentences and the classification method. In Section 5 we give an overview of the crowdsourcing evaluation. Section 6 discusses our experimental results. In Sections 7 and 8, we present our conclusions and plans for future work.

2 Related Work

Both sentiment analysis (Pang and Lee, 2008; Liu, 2010) and summarization (Nenkova and McKeown, 2011) are important subfields of NLP. The work most relevant to this paper is work on summarization methods that addresses the specific requirements of summarization in sentiment analysis. There are two lines of work in this vein with goals similar to ours: (i) aspect-based and pro/con summarization and (ii) approaches that extract summary sentences from reviews.

An aspect is a component or attribute of a product, such as "battery", "lens cap", "battery life", and "picture quality" for cameras. Aspect-oriented summarization (Hu and Liu, 2004; Zhuang et al., 2006; Kim and Hovy, 2006) collects sentiment assessments for a given set of aspects and returns a list of pros and cons about every aspect for a review or, in some cases, on a per-sentence basis.

Aspect-oriented summarization and pro/con summarization differ in a number of ways from supporting sentence summarization. First, aspects and pros & cons are taken from a fixed inventory. The inventory is typically small and does not cover the full spectrum of relevant information. Second, in its most useful form, aspect-oriented summarization requires classification of phrases and sentences according to the aspect they belong to; e.g., "The camera is very light" has to be recognized as being relevant to the aspect "weight". Developing a component that assigns phrases and sentences to their corresponding categories is time-consuming and has to be redone for each domain. Any such component will make mistakes, and undetected or incorrectly classified aspects can result in bad summaries.

Our approach enables us to find strong supporting sentences even if the reason given in that sentence does not fit well into the fixed inventory. No manual work like the creation of an aspect inventory is necessary, and there are no requirements on the format of the reviews such as author-provided pros and cons.

Aspect-oriented summarization also differs in that it does not differentiate along the dimension of quality of the reason given for a sentiment. For example, "I don't like the zoom" and "The zoom range is too limited" both give reasons for why a camera gets a negative evaluation, but only the latter reason is informative. In our work, we evaluate the quality of the reason given for a sentiment.

The use case we address in this paper requires a short, easy-to-read summary. A well-formed sentence is usually easier to understand than a pro/con table. It also has the advantage that the information conveyed accurately represents what the user wanted to say; this is not the case for a presentation that involves several complex processing steps and takes linguistic material out of the context that may be needed to understand it correctly.

Berend (2011) performs a form of pro/con summarization that does not rely on aspects. However, most of the problems of aspect-based pro/con summarization also apply to this paper: no differentiation between good and bad reasons, the need for human labels to train a classifier, and inferior readability compared to a well-formed sentence.

Two previous approaches that have attempted to extract sentences from reviews in the context of summarization are (Beineke et al., 2004) and (Arora et al., 2009). Beineke et al. (2004) train a classifier on rottentomatoes.com summary sentences provided by review authors. These sentences sometimes contain a specific reason for the overall sentiment of the review, but sometimes they are just catchy lines whose purpose is to draw moviegoers in to read the entire review; e.g., "El Bulli barely registers a pulse stronger than a book's" (which does not give a specific reason for why the movie does not register a strong pulse).

Arora et al. (2009) define two classes of sentences: qualified claims and bald claims. A qualified claim gives the reader more details (e.g.,
"This camera is small enough to fit easily in a coat pocket") while a bald claim is open to interpretation (e.g., "This camera is small"). Qualified/bald is a dimension of classification of sentiment statements that is to some extent orthogonal to quality of reason. Qualified claims do not have to contain a reason, and bald claims can contain an informative reason. For example, "I didn't like the camera, but I suspect it will be a great camera for first timers" is classified as a qualified claim, but the sentence does not give a good reason for the sentiment of the document. Both dimensions (qualified/bald, high-quality/low-quality reason) are important and can be valuable components of a complete sentiment analysis system.

Apart from the definition of the concept of supporting sentence, which we believe to be more appropriate for the application we have in mind than rottentomatoes.com summary sentences and qualified claims, there are two other important differences between our approach and these two papers. First, we directly evaluate the quality of the reasons in a crowdsourcing experiment. Second, our approach is unsupervised and does not require manual annotation of a training set of supporting sentences.

As we will discuss in Section 5, we propose a novel evaluation measure for summarization based on crowdsourcing in this paper. The most common use of crowdsourcing in NLP is to have workers label a training set and then train a supervised classifier on this training set. In contrast, we use crowdsourcing to directly evaluate the relative quality of the automatic summaries generated by the unsupervised method we propose.

3 Approach

Our approach is based on the following three premises.

(i) A good supporting sentence conveys both the review's sentiment and a supporting fact. We make this assumption because we want the sentence to be self-contained. If it only describes a fact about a product without evaluation, then it does not on its own explain which sentiment is conveyed by the article and why.

(ii) Supporting facts are most often expressed by noun phrases. We call a noun phrase that expresses a supporting fact a keyphrase. We are not assuming that all important words in the supporting sentence are nominal; the verb will be needed in many cases to accurately convey the reason for the sentiment expressed. However, it is a fairly safe assumption that part of the information is conveyed using noun phrases, since it is difficult to convey specific information without using specific noun phrases. Adjectives are often important when expressing a reason, but frequently a noun is also mentioned, or one would need to resolve a pronoun to make the sentence a self-contained supporting sentence. In a sentence like "It's easy to use" it is not clear what the adjective is referring to.

(iii) Noun phrases that express supporting facts tend to be domain-specific; they can be automatically identified by selecting noun phrases that are frequent in the domain either in relative terms (compared to a generic corpus) or in absolute terms. By making this assumption we may fail to detect supporting sentences that are worded in an original way using ordinary words. However, in a specific domain there is usually a lot of redundancy, and most good reasons occur many times and are expressed by similar words.

Based on these assumptions, we select the supporting sentence in two steps. In the first step, we determine the n sentences with the strongest sentiment within every review by classifying the polarity of the sentences (where n is a parameter). In the second step, we select one of the n sentences as the best supporting sentence by means of a weighting function.

Step 1: Sentiment Classification

In this step, we apply a sentiment classifier to all sentences of the review to classify sentences as positive or negative. We then select the n sentences with the highest probability of conforming with the overall sentiment of the document. For example, if the document's polarity is negative, we select the n sentences that are most likely to be negative according to the sentiment classifier. We restrict the set of n sentences to sentences with the "right" sentiment because even an excellent supporting sentence is not a good characterization of
the content of the review if it contradicts the overall assessment given by the review. Only in cases where there are fewer than n sentences with the correct sentiment do we also select sentences with the wrong sentiment, to obtain a minimum of n sentences for each review.

Step 2: Weighting Function

Based on premises (ii) and (iii) above, we score a sentence based on the number of noun phrases that occur with high absolute and relative frequency in the domain. We only consider simple nouns and compound nouns consisting of two nouns in this paper. In general, compound nouns are more informative and specific. A compound noun may refer to a specific reason even if the head noun does not (e.g., "life" vs. "battery life"). This means that we need to compute scores in a way that allows us to give higher weight to compound nouns than to simple nouns.

In addition, we also include counts of nouns and compounds in the scoring that do not have high absolute/relative frequency, because frequency heuristics identify keyphrases with only moderate accuracy. However, these nouns and compounds are given a lower weight.

This motivates a scoring function that is a weighted sum of four variables: the number of simple nouns with high frequency, the number of infrequent simple nouns, the number of compound nouns with high frequency, and the number of infrequent compound nouns. High frequency is defined as follows. Let f_dom(p) be the domain-specific absolute frequency of phrase p, i.e., the frequency in the review corpus, and f_wiki(p) the frequency of p in the English Wikipedia. We view the distribution of terms in Wikipedia as domain-independent and define the relative frequency as in Equation 1.

    f_rel(p) = f_dom(p) / f_wiki(p)    (1)

We do not consider nouns and compound nouns that do not occur in Wikipedia for computing the relative frequency. A noun (resp. compound noun) is deemed to be of high frequency if it is one of the k% nouns (resp. compound nouns) with the highest f_dom(p) and at the same time is one of the k% nouns (resp. compound nouns) with the highest f_rel(p), where k is a parameter.

Based on these definitions, we define four different sets: F1 (the set of nouns with high frequency), I1 (the set of infrequent nouns), F2 (the set of compounds with high frequency), and I2 (the set of infrequent compounds). An infrequent noun (resp. compound) is simply defined as a noun (resp. compound) that does not meet the frequency criterion.

We define the score s of a sentence with n tokens t_1 ... t_n (where the last token t_n is a punctuation mark) as follows:

    s = Σ_{i=1}^{n-1} ( w_f2 [[(t_i, t_{i+1}) ∈ F2]]
                      + w_i2 [[(t_i, t_{i+1}) ∈ I2]]
                      + w_f1 [[t_i ∈ F1]]
                      + w_i1 [[t_i ∈ I1]] )    (2)

where [[φ]] = 1 if φ is true and [[φ]] = 0 otherwise. Note that a noun in a compound will contribute to the overall score in two different summands.

The weights w_f2, w_i2, w_f1, and w_i1 are determined using logistic regression. The training set for the regression is created in an unsupervised fashion as follows. From each set of n sentences (one per review), we select the two highest scoring, i.e., the two sentences that were classified with the highest confidence. The two classes in the regression problem are then the top ranked sentences vs. the sentences at rank 2. Since taking all sentences turned out to be too noisy, we eliminate sentence pairs where the top sentence is better than the second sentence on almost all of the set counts (i.e., the counts of members of F1, I1, F2, and I2). Our hypothesis in setting up this regression was that the sentence with the strongest sentiment often does not give a good reason. Our experiments confirm that this hypothesis is true.

The weights w_f2, w_i2, w_f1, and w_i1 estimated by the regression are then used to score sentences according to Equation 2.

We give the same weight to all keyphrase compounds (and the same weight to all keyphrase nouns); in future work one could attempt to give higher weights to keyphrases with higher absolute or relative frequency. In this paper, our goal is to establish a simple baseline for the task of extraction of supporting sentences.

After computing the overall weight for each sentence in a review, the sentence with the highest weight is chosen as the supporting sentence: the sentence that is most informative for explaining the overall sentiment of the review.
279
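The frequency test above (Equation 1 combined with the double top-k% cut) can be sketched as follows. This is an illustrative reimplementation with hypothetical names, not the authors' code.

```python
def high_frequency_phrases(dom_counts, wiki_counts, k_percent):
    """Return the phrases that are in the top k% by both absolute domain
    frequency f_dom and relative frequency f_rel = f_dom / f_wiki
    (Equation 1).  Phrases absent from Wikipedia are ignored, as in the
    paper."""
    # Relative frequency is only defined for phrases found in Wikipedia.
    f_rel = {p: dom_counts[p] / wiki_counts[p]
             for p in dom_counts if wiki_counts.get(p, 0) > 0}
    n = max(1, int(len(f_rel) * k_percent / 100))
    top_abs = set(sorted(f_rel, key=lambda p: dom_counts[p], reverse=True)[:n])
    top_rel = set(sorted(f_rel, key=f_rel.get, reverse=True)[:n])
    # High frequency requires membership in BOTH top-k% lists.
    return top_abs & top_rel


# Toy counts: "cd300" never occurs in Wikipedia, so it is excluded;
# "zoom" is frequent both absolutely and relative to Wikipedia.
dom = {"zoom": 50, "thing": 100, "lens": 40, "day": 5, "cd300": 60}
wiki = {"zoom": 10, "thing": 1000, "lens": 20, "day": 500}
```

A call such as `high_frequency_phrases(dom, wiki, 50)` then yields only "zoom": "thing" is absolutely frequent but not relative to Wikipedia, while "lens" is the reverse.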
4 Experiments

4.1 Data

We use part of the Amazon dataset from Jindal and Liu (2008). The dataset consists of more than 5.8 million consumer-written reviews of several products, taken from the Amazon website. For our experiment we used the digital camera domain and extracted 15,340 reviews covering a total of 740 products. See Table 1 for key statistics of the data set.

    Type                       Number
    Brands                         17
    Products                      740
    Documents (all)            15,340
    Documents (cleaned)        11,624
    Documents (train)           9,880
    Documents (test)            1,744
    Short test documents          147
    Long test documents         1,597
    Average number of sents     13.36
    Median number of sents         10

    Table 1: Key statistics of our dataset

In addition to the review text, authors can give an overall rating (a number of stars) to the product. Possible ratings are 5 (very positive), 4 (positive), 3 (neutral), 2 (negative), and 1 (very negative). We unify ratings of 4 and 5 to positive and ratings of 1 and 2 to negative to obtain polarity labels for binary classification. Reviews with a rating of 3 are discarded.

4.2 Preprocessing

We tokenized and part-of-speech (POS) tagged the corpus using TreeTagger (Schmid, 1994). We split each review into individual sentences by using the sentence boundaries given by TreeTagger. One problem with user-written reviews is that they are often not written in coherent English, which results in wrong POS tags. To address some of these problems, we cleaned the corpus after the tokenization step. We separated word-punctuation clusters (e.g., word...word) and removed emoticons, HTML tags, and all sentences with three or fewer tokens, many of which were a result of wrong tokenization. We excluded all reviews with fewer than five sentences. Short reviews are often low-quality and do not give good reasons. The cleaned corpus consists of 11,624 documents. Finally, we split the corpus into training set (85%) and test set (15%) as shown in Table 1. The average number of sentences of a review is 13.36, the median number of sentences is 10.

4.3 Sentiment Classification

We first build a sentence sentiment classifier by training the Stanford maximum entropy classifier (Manning and Klein, 2003) on the sentences in the training set. Sentences occurring in positive (resp. negative) reviews are labeled positive (resp. negative). We use a simple bag-of-words representation (without punctuation characters and frequent stop words). Propagating labels from documents to sentences creates a noisy training set because some sentences have sentiment different from the sentiment of their documents; however, there is no alternative because we need per-sentence classification decisions, but do not have per-sentence human labels. The accuracy of the classifier is 88.4% on propagated sentence labels.

We use the sentence classifier in two ways. First, it defines our baseline BL for extracting supporting sentences: the baseline simply proposes the sentence with the highest sentiment score that is compatible with the sentiment of the document as the supporting sentence. Second, the sentence classifier selects a subset of candidate sentences that is then further processed using the scoring function in Equation 2. This subset consists of the n = 5 sentences with the highest sentiment scores of the right polarity; that is, if the document is positive (resp. negative), then the n = 5 sentences with the highest positive (resp. negative) scores are selected.

4.4 Determining Frequencies and Weights

The absolute frequency of nouns and compound nouns is simply computed as their token frequency in the training set. For computing the relative frequency (as described in Section 3, Equation 1), we use the 20110405 dump of the English Wikipedia.

In the product review corpora we studied, the percentage of high-frequency keyphrase compound nouns was higher than that of simple nouns. We therefore use two different thresholds for absolute and relative frequency.
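The cleaning pass and the rating-to-polarity mapping described above might look like this. The regular expressions are simplified stand-ins for the paper's heuristics, and all names are illustrative.

```python
import re

def polarity(stars):
    """Map a star rating to a binary polarity label; rating 3 is discarded."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return None  # neutral reviews are dropped

def clean_review(sentences):
    """Strip HTML tags and emoticons, separate word...word clusters, drop
    sentences with three or fewer tokens, and discard the whole review if
    fewer than 5 sentences remain."""
    cleaned = []
    for s in sentences:
        s = re.sub(r"<[^>]+>", " ", s)             # strip HTML tags
        s = re.sub(r"[:;=][\-o]?[()DPp]", " ", s)  # strip common emoticons
        s = re.sub(r"(\w\.{2,})(\w)", r"\1 \2", s) # separate word...word clusters
        s = " ".join(s.split())
        if len(s.split()) > 3:                     # keep sentences with >3 tokens
            cleaned.append(s)
    return cleaned if len(cleaned) >= 5 else None
```

In this sketch a review shrinks as noisy sentences are removed, and reviews that end up with fewer than five sentences are discarded entirely, mirroring the filtering order described above.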
We define F1 as the set of nouns that are in the top kn = 2.5% for both absolute and relative frequencies, and F2 as the set of compounds that are in the top kp = 5% for both absolute and relative frequencies. These thresholds are set to obtain a high density of good keyphrases with few false positives. Below the threshold there are still other good keyphrases, but they cannot be separated easily from non-keyphrases.

Sentences are scored according to Equation 2. Recall that the parameters wf2, wi2, wf1, and wi1 are determined using logistic regression. The obtained parameter values (see Table 2) indicate the relative importance of the four different types of terms. Compounds are the most important term, and even those with a frequency below the threshold kp still provide more detailed information than simple nouns above the threshold kn; the value of wi2 is approximately twice the value of wf1 for this reason. Non-keyphrase nouns are least important and are weighted with only a very small value of wi1 = 0.01.

    Phrase                    Par   Value
    keyphrase compounds       wf2    1.07
    non-keyphrase compounds   wi2    0.89
    keyphrase nouns           wf1    0.46
    non-keyphrase nouns       wi1    0.01

    Table 2: Weight settings

The scoring function with these parameter values is applied to the n = 5 selected sentences of the review. The highest scoring sentence is then selected as the supporting sentence proposed by our system.

For 1380 of the 1744 reviews, the sentence selected by our system is different from the baseline sentence; in the remaining 364 cases (20.9%) the two are the same. Only the 1380 cases where the two methods differ are included in the crowdsourcing evaluation to be described in the next section. As we will show below, our system selects better supporting sentences than the baseline in most cases. So if baseline and our system agree, then it is even more likely that the sentence selected by both is a good supporting sentence. However, there could also be cases where the n = 5 sentences selected by the sentiment classifier are all bad supporting sentences, or cases where the document does not contain any good supporting sentences.

5 Comparative Evaluation with Amazon Mechanical Turk

One standard way to evaluate summarization systems is to create hand-edited summaries and to compute some measure of similarity (e.g., word or n-gram overlap) between automatic and human summaries. An alternative for extractive summaries is to classify all sentences in the document with respect to their appropriateness as summary sentences. An automatic summary can then be scored based on its ability to correctly identify good summary sentences. Both of these methods require a large annotation effort and are most likely too complex to be outsourced to a crowdsourcing service because the creation of manual summaries requires skilled writers. For the second type of evaluation, ranking sentences according to a criterion is a lot more time consuming than making a binary decision; ranking the 13 or 14 sentences that a review contains on average for the entire test set would be a significant annotation effort. It would also be difficult to obtain consistent and repeatable annotation in crowdsourcing on this task due to its subtlety.

We therefore designed a novel evaluation methodology in this paper that has a much smaller startup cost. It is well known that relative judgments are easier to make on difficult tasks than absolute judgments. For example, much recent work on relevance ranking in information retrieval relies on relative relevance judgments (one document is more relevant than another) rather than absolute relevance judgments. We adopt this general idea and only request such relative judgments on supporting sentences from annotators. Unlike a complete ranking of the sentences (which would require m(m-1)/2 judgments where m is the length of the review), we choose a setup where we only need to elicit a single relative judgment per review: one relative judgment on a sentence pair (consisting of the baseline sentence and the system sentence) for each of the 1380 reviews selected in the previous section. This is a manageable annotation task that can be run on a crowdsourcing service in a short time and at little cost.

We use Amazon Mechanical Turk (AMT) for this annotation task. The main advantage of AMT is that cost per annotation task is very low, so that we can obtain large annotated datasets for an affordable price.
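Equation 2 itself is not reproduced in this excerpt, but Section 3 describes the scoring function as a weighted sum of the four per-sentence counts. Under that reading, and with the Table 2 weights, scoring and selection might look like this; the names and data layout are illustrative, not the authors' code.

```python
# Weights from Table 2.  Reading Equation 2 as the weighted sum of the
# four counts from Section 3 is an assumption, since the equation is
# only referenced in this excerpt.
W_F2, W_I2, W_F1, W_I1 = 1.07, 0.89, 0.46, 0.01

def score(nouns, compounds, F1, F2):
    """Weighted sum over keyphrase/non-keyphrase nouns and compounds."""
    f1 = sum(1 for n in nouns if n in F1)      # frequent simple nouns
    f2 = sum(1 for c in compounds if c in F2)  # frequent compound nouns
    return (W_F1 * f1 + W_I1 * (len(nouns) - f1)
            + W_F2 * f2 + W_I2 * (len(compounds) - f2))

def pick_supporting(candidates, F1, F2):
    """From the n=5 candidate sentences, return the highest-scoring one."""
    return max(candidates,
               key=lambda c: score(c["nouns"], c["compounds"], F1, F2))
```

Because W_F2 and W_I2 dominate, a sentence with a single keyphrase compound outscores one with a single keyphrase noun, matching the discussion of Table 2 above.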
The disadvantage is the level of quality of the annotation, which will be discussed at the end of this section.

[Figure 1: Our AMT interface design. The HIT displays the two sentences (e.g., "Sentence 1: This 5 meg camera meets all my requirements."), asks "Which sentence gives the more convincing reason? Fill out exactly one field, please. Please type the blue word of the chosen sentence into the corresponding answer field.", provides answer fields s1 and s2 plus a special field ("If both sentences do not give a convincing reason, type NOTCONV into this answer field."), and a Submit button.]

5.1 Task Design

We created a HIT (Human Intelligence Task) template including detailed annotation guidelines. Every HIT consists of a pair of sentences. One sentence is the baseline sentence; the other sentence is the system sentence, i.e., the sentence selected by the scoring function. The two sentences are presented in random order to avoid bias.

The workers are then asked to evaluate the relative quality of the sentences by selecting one of the following three options:

1. Sentence 1 has the more convincing reason
2. Sentence 2 has the more convincing reason
3. Neither sentence has a convincing reason

If both sentences contain reasons, the worker has to compare the two reasons and choose the sentence with the more convincing reason.

Each HIT was posted to three different workers to make it possible to assess annotator agreement. Every worker can process each HIT only once, so that the three assignments are always done by three different people.

Based on the worker annotations, we compute a gold standard score for each sentence. This score is simply the number of times it was rated better than its competitor. The score can be 0, 1, 2, or 3. HITs for which the worker chooses the option "Neither sentence has a convincing reason" are ignored when computing sentence scores. The sentence with the higher score is then selected as the best supporting sentence for the corresponding review.

In cases of ties, we posted the sentence pair one more time for one worker. If one of the two sentences has a higher score after this reposting, we choose it as the winner. Otherwise we label this sentence pair "no decision" or N-D.

5.2 Quality of AMT Annotations

Since our crowdsourcing-based evaluation is novel, it is important to investigate if human annotators perform the annotation consistently and reproducibly. The Fleiss agreement score for the final experiment is 0.17. AMT workers only have the instructions given by the requesters. If these are not clear enough or too complicated, workers can misunderstand the task, which decreases the quality of the answers. There are also AMT workers who spam and give random answers to tasks. Moreover, ranking sentences according to the quality of the given reason is a subjective task. Even if a sentence contains a reason, it might not be convincing for the worker. To ensure a high level of quality for our dataset, we took some precautions.
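The scoring and tie-breaking procedure from Section 5.1 can be sketched as follows. The vote encoding ('BL', 'SY', 'NOTCONV') is hypothetical, and the reposting step is approximated as a single extra judgment.

```python
def scores(votes):
    """votes: the three workers' answers, each 'BL', 'SY', or 'NOTCONV'.
    A sentence's score is the number of times it was rated better than
    its competitor; NOTCONV answers are ignored."""
    return votes.count("BL"), votes.count("SY")

def winner(votes, repost_vote=None):
    """The higher-scoring sentence wins; ties are reposted once to a
    single worker, and if still undecided the pair is labeled 'N-D'."""
    bl, sy = scores(votes)
    if bl != sy:
        return "BL" if bl > sy else "SY"
    if repost_vote in ("BL", "SY"):
        return repost_vote
    return "N-D"
```

Note that three all-NOTCONV answers and a 1-1 split both count as ties here, which is consistent with NOTCONV votes simply being excluded from the scores.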
    Experiment             # Docs    BL    SY   N-D   B=S
    1  AMT, first pass       1380  27.4  57.9  14.7     -
    2  AMT, second pass       203  46.8  45.8   7.4     -
    3  AMT final             1380  34.3  64.6   1.1     -
    4  AMT+[B=S]             1744  27.1  51.1   0.9  20.9

Table 3: AMT evaluation results. Numbers are percentages or counts. BL = baseline, SY = system, N-D = no decision, B=S = same sentence selected by baseline and system.

To force workers to actually read the sentences and not just click a few boxes, we randomly marked one word of each sentence blue. The worker had to type the word of their preferred sentence into the corresponding answer field, or NOTCONV into the special field if neither sentence was convincing. Figure 1 shows our AMT interface design.

For each answer field we have a gold standard (the words we marked blue and the word NOTCONV), which enables us to look for spam. The analysis showed that some workers mistyped some words, which however only indicates that the worker actually typed the word instead of copying it from the task. Some workers submitted inconsistent answers; for instance, they typed a random word or filled out all three answer fields. In such cases we reposted the HIT again to receive a correct answer.

After the task, we counted how often a worker said that neither sentence is convincing, since a high number indicates that the worker might have only copied the word for several sentence pairs without checking the content of the sentences. We also analyzed the time a worker needed for every HIT. Since no task was done in less than 10 seconds, the possibility of just copying the word was rather low.

6 Results and discussion

The results of the AMT experiment are shown in Table 3. As described above, each of the 1380 sentence pairs was evaluated by three workers. Workers rated the system sentence as better for 57.9% of the reviews, and the baseline sentence as better for 27.4% of the reviews; for 14.7% of reviews, the scores of the two sentences were tied (line 1 of Table 3). The 203 reviews in this category were reposted one more time (as described in Section 5). The responses were almost perfectly evenly split: about 47% of workers preferred the baseline sentence, 46% the system sentence; 7.4% of the responses were undecided (line 2). Line 3 presents the consolidated results, where the 14.7% ties on line 1 are replaced by the ratings obtained on line 2 in the second pass.

The consolidated results (line 3) show that our system is clearly superior to the baseline of selecting the sentence with the strongest sentiment. Our system selected a better supporting sentence for 64.6% of the reviews; the baseline selected a better sentence for 34.3% of the reviews. These results exclude the reviews where baseline and system selected the same sentence. If we assume that these sentences are also acceptable sentences (since they score well on the traditional sentiment metrics as well as on our new content keyword metric), then our system finds a good supporting sentence for 72.0% of reviews (51.1+20.9) whereas the baseline does so for only 48.0% (27.1+20.9).

6.1 Error Analysis

Our error analysis revealed that a significant proportion of system sentences that were worse than baseline sentences did contain a reason. However, the baseline sentence also contained a reason and was rated better by AMT annotators. Examples (1) and (2) show two such cases. The first sentence is the baseline sentence (BL), which was rated better. The system sentence (SY) contains a similar or different reason. Since rating reasons is a very subjective task, it is impossible to define which of these two sentences contains the better reason; the decision depends on how the workers think about it.

(1) BL: The best thing is that everything is just so easily displayed and one doesn't need a manual to start getting the work done.
    SY: The zoom is incredible, the video was so clear that I actually thought of making a 15 min movie.
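The consolidation arithmetic behind lines 3 and 4 of Table 3 can be checked directly from the reported percentages (small rounding differences aside):

```python
# Line 1 (all 1380 differing reviews) and line 2 (the 203 reposted
# ties), as percentages from Table 3.
bl1, sy1, tie1 = 27.4, 57.9, 14.7
bl2, sy2, nd2 = 46.8, 45.8, 7.4

# Line 3: redistribute the first-pass ties according to the second pass.
bl3 = bl1 + tie1 * bl2 / 100   # consolidated baseline wins, ~34.3
sy3 = sy1 + tie1 * sy2 / 100   # consolidated system wins,   ~64.6
nd3 = tie1 * nd2 / 100         # remaining no-decisions,     ~1.1

# Line 4: rescale to all 1744 reviews and add the 364 B=S cases.
scale = 1380 / 1744
bl4, sy4, nd4 = bl3 * scale, sy3 * scale, nd3 * scale
bs4 = 364 / 1744 * 100         # ~20.9
```

This reproduces the published rows, including the 72.0% (51.1 + 20.9) versus 48.0% (27.1 + 20.9) comparison quoted in the text.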
(2) BL: The colors are horrible, indoor shots are horrible, and too much noise.
    SY: Who cares about 8 mega pixels and 1600 iso when it takes such bad quality pictures.

In example (3) the system sentence is an incomplete sentence consisting of only two noun phrases. These cut-off sentences are mainly caused by incorrect usage of grammar and punctuation by the reviewers, which results in wrongly determined sentence boundaries in the preprocessing step.

(3) BL: Gives peace of mind to have it fit perfectly.
    SY: battery and SD card.

In some cases, the two sentences that were presented to the worker in the evaluation had a different polarity. This can have two reasons: (i) due to noisy training input, the classifier misclassified some of the sentences, and (ii) for short reviews we also used sentences with the non-conforming polarity. Sentences with different polarity often confused the workers, and they tended to prefer the positive sentence even if the negative one contained a more convincing reason, as can be seen in example (4).

(4) BL: It shares same basic commands and setup, so the learning curve was minimal.
    SY: I was not blown away by the image quality, and as others have mentioned, the flash really is weak.

A general problem with our approach is that the weighting function favors sentences with many noun phrases. The system sentence in example (5) contains many noun phrases, including some highly frequent nouns (e.g., lens, battery), but there is no convincing reason and the baseline sentence has been selected by the workers.

(5) BL: I have owned my cd300 for about 3 weeks and have already taken 700 plus pictures.
    SY: It has something to do with the lens because the manual says it only happens to the 300 and when I called Sony tech support the guy tried to tell me the battery was faulty and it wasn't.

Finally, there are a number of cases where our assumption that good supporting sentences contain keyphrases is incorrect. For example, sentence (6) does not contain any keyphrases indicative of good reasons. The information that makes it a good supporting sentence is mainly expressed using verbs and particles.

(6) I have had an occasional problem with the camera not booting up and telling me to turn it off and then on again.

7 Conclusion

In this work, we presented a system that extracts supporting sentences: single-sentence summaries of a document that contain a convincing reason for the author's opinion about a product. We used an unsupervised approach that extracts keyphrases of the given domain and then weights these keyphrases to identify supporting sentences. We used a novel comparative evaluation methodology with the crowdsourcing framework Amazon Mechanical Turk to evaluate this novel task, since no gold standard is available. We showed that our keyphrase-based system performs better than a baseline of extracting the sentence with the highest sentiment score.

8 Future work

Our method failed for some of the about 35% of reviews where it did not find a convincing reason because of the noisiness of reviews. Reviews are user-generated content, contain grammatically incorrect sentences, and are full of typographical errors. This makes it hard to perform preprocessing steps like part-of-speech tagging and sentence boundary detection correctly and reliably. We plan to address these problems in future work by developing a more robust processing pipeline.

Acknowledgments

This work was supported by Deutsche Forschungsgemeinschaft (Sonderforschungsbereich 732, Project D7) and in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. This publication only reflects the authors' views.
References

Shilpa Arora, Mahesh Joshi, and Carolyn P. Rose. 2009. Identifying types of claims in online customer reviews. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, NAACL-Short '09, pages 37-40, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philip Beineke, Trevor Hastie, Christopher Manning, and Shivakumar Vaithyanathan. 2004. Exploring sentiment summarization. In Proceedings of the AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications. AAAI Press. AAAI technical report SS-04-07.

Gabor Berend. 2011. Opinion expression mining by exploiting keyphrase extraction. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1162-1170, Chiang Mai, Thailand, November. Asian Federation of Natural Language Processing.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 168-177, New York, NY, USA. ACM.

Nitin Jindal and Bing Liu. 2008. Opinion spam and analysis. In WSDM '08: Proceedings of the International Conference on Web Search and Web Data Mining, pages 219-230, New York, NY, USA. ACM.

Soo-Min Kim and Eduard Hovy. 2006. Automatic identification of pro and con reasons in online reviews. In Proceedings of the COLING/ACL Main Conference Poster Sessions, COLING-ACL '06, pages 483-490, Stroudsburg, PA, USA. Association for Computational Linguistics.

Bing Liu. 2010. Sentiment analysis and subjectivity. In Handbook of Natural Language Processing, 2nd edition.

Christopher Manning and Dan Klein. 2003. Optimization, maxent models, and conditional estimation without magic. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Tutorials, Volume 5, NAACL-Tutorials '03, pages 8-8, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ani Nenkova and Kathleen McKeown. 2011. Automatic summarization. Foundations and Trends in Information Retrieval, 5(2-3):103-233.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.

Li Zhuang, Feng Jing, and Xiao-Yan Zhu. 2006. Movie review mining and summarization. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, CIKM '06, pages 43-50, New York, NY, USA. ACM.
Bootstrapped Training of Event Extraction Classifiers

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 286-295, Avignon, France, April 23-27 2012. (c) 2012 Association for Computational Linguistics
to learn extraction patterns. Our research also uses role-identifying nouns to learn extraction patterns, but the role-identifying nouns and patterns are then used to create training data for event extraction classifiers. Each classifier is then self-trained in a bootstrapping loop.

Our weakly supervised training procedure requires a small set of seed nouns for each event role, and a collection of relevant (in-domain) and irrelevant (out-of-domain) texts. No answer key templates or annotated texts are needed. The seed nouns are used to automatically generate a set of role-identifying patterns, and then the nouns, patterns, and a semantic dictionary are used to label training instances. We also propagate the event role labels across coreferent noun phrases within a document to produce additional training instances. The automatically labeled texts are used to train three components of TIER: its two types of sentence classifiers and its noun phrase classifiers. To create TIER's fourth component, its document genre classifier, we apply heuristics to the output of the sentence classifiers.

We present experimental results on the MUC-4 data set, which is a standard benchmark for event extraction research. Our results show that the bootstrapped system, TIERlite, outperforms previous weakly supervised event extraction systems and achieves performance levels comparable to supervised training with 700 manually annotated documents.

2 Related Work

Event extraction techniques have largely focused on detecting event triggers with their arguments for extracting role fillers. Classical methods are either pattern-based (Kim and Moldovan, 1993; Riloff, 1993; Soderland et al., 1995; Huffman, 1996; Freitag, 1998b; Ciravegna, 2001; Califf and Mooney, 2003; Riloff, 1996; Riloff and Jones, 1999; Yangarber et al., 2000; Sudo et al., 2003; Stevenson and Greenwood, 2005) or classifier-based (e.g., Freitag, 1998a; Chieu and Ng, 2002; Finn and Kushmerick, 2004; Li et al., 2005; Yu et al., 2005).

Recently, several approaches have been proposed to address the insufficiency of using only local context to identify role fillers. Some approaches look at the broader sentential context around a potential role filler when making a decision (e.g., Gu and Cercone, 2006; Patwardhan and Riloff, 2009). Other systems take a more global view and consider discourse properties of the document as a whole to improve performance (e.g., Maslennikov and Chua, 2007; Ji and Grishman, 2008; Liao and Grishman, 2010; Huang and Riloff, 2011). Currently, the learning-based event extraction systems that perform best all use supervised learning techniques that require a large number of texts coupled with manually-generated annotations or answer key templates.

A variety of techniques have been explored for weakly supervised training of event extraction systems, primarily in the realm of pattern- or rule-based approaches (e.g., Riloff, 1996; Riloff and Jones, 1999; Yangarber et al., 2000; Sudo et al., 2003; Stevenson and Greenwood, 2005). In some of these approaches, a human must manually review and clean the learned patterns to obtain good performance. Research has also been done to learn extraction patterns in an unsupervised way (e.g., Shinyama and Sekine, 2006; Sekine, 2006), but these efforts target open-domain information extraction. To extract domain-specific event information, domain experts are needed to select the pattern subsets to use.

There have also been weakly supervised approaches that use more than just local context. Patwardhan and Riloff (2007) use a semantic affinity measure to learn primary and secondary patterns, and the secondary patterns are applied only to event sentences. The event sentence classifier is self-trained using seed patterns. Most recently, Chambers and Jurafsky (2011) acquire event words from an external resource, group the event words to form event scenarios, and group extraction patterns for different event roles. However, these weakly supervised systems produce substantially lower performance than the best supervised systems.

3 Overview of TIER

The goal of our research is to develop a weakly supervised training process that can successfully train a state-of-the-art event extraction system for a new domain with minimal human input. We decided to focus our efforts on the TIER event extraction model because it recently produced better performance on the MUC-4 data set than prior learning-based event extraction systems (Huang and Riloff, 2011). In this section, we briefly give an overview of TIER's architecture and its components.
and time-consuming. Furthermore, answer key templates for one domain are virtually never reusable for different domains, so a new set of answer keys must be produced from scratch for each domain. In the next section, we present our weakly supervised approach for training TIER's event extraction classifiers.

[Figure 1: TIER Overview]
patterns automatically generated from unanno-
tated texts to assess the similarity of nouns. First,
Basilisk assigns a score to each pattern based on
the number of seed words that co-occur with it.
Basilisk then collects the noun phrases extracted
by the highest-scoring patterns. Next, the head
noun of each noun phrase is assigned a score
Figure 2: Using Basilisk to Induce Role-Identifying based on the set of patterns that it co-occurred
Patterns with. Finally, Basilisk selects the highest-scoring
nouns, automatically labels them with the seman-
fillers occur sparsely in text and in diverse con- tic class of the seeds, adds these nouns to the lex-
texts. icon, and restarts the learning process in a boot-
In this section, we explain how we gener- strapping fashion.
ate role-identifying patterns automatically using For our work, we give Basilisk role-identifying
seed nouns, and we discuss why we add seman- seed nouns for each event role. We run the boot-
tic constraints to the patterns when producing la- strapping process for 20 iterations and then har-
beled instances for training. Then, we discuss the vest the 40 best patterns that Basilisk identifies
coreference-based label propagation that we used for each event role. We also tried using the addi-
to obtain additional training instances. Finally, we tional role-identifying nouns learned by Basilisk,
give examples to illustrate how we create training but found that these nouns were too noisy.
instances.
4.1.2 Using the Patterns to Label NPs
4.1.1 Inducing Role-Identifying Patterns The induced role-identifying patterns can be
The input to our system is a small set of matched against the unannotated texts to produce
manually-defined seed nouns for each event role. labeled instances. However, relying solely on the
Specifically, the user is required to provide pattern contexts can be misleading. For example,
10 role-identifying nouns for each event role. the pattern context <subject> caused damage
(Phillips and Riloff, 2007) defined a noun as be- will extract some noun phrases that are weapons
ing role-identifying if its lexical semantics re- (e.g., the bomb) but some noun phrases that are
veal the role of the entity/object in an event. For example, the words assassin and sniper are people who participate in a violent event as a PERPETRATOR. Therefore, the entities referred to by role-identifying nouns are probable role fillers. However, treating every context surrounding a role-identifying noun as a role-identifying pattern is risky, because many instances of role-identifying nouns appear in contexts that do not describe the event. But if a pattern has been seen to extract many role-identifying nouns and has seldom been seen to extract other nouns, then the pattern likely represents an event context.

As (Phillips and Riloff, 2007) did, we use Basilisk to learn patterns for each event role. Basilisk was originally designed for semantic class learning (e.g., to learn nouns belonging to semantic categories, such as building or human). As shown in Figure 2, beginning with a small set of seed nouns for each semantic class, Basilisk learns additional nouns belonging to the same semantic class. Internally, Basilisk uses extraction [...] not (e.g., the tsunami).

Based on this observation, we add selectional restrictions to each pattern that require a noun phrase to satisfy certain semantic constraints in order to be extracted and labeled as a positive instance for an event role. The selectional restrictions are satisfied if the head noun is among the role-identifying seed nouns or if the semantic class of the head noun is compatible with the corresponding event role. In the previous example, tsunami will not be extracted as a weapon because it has an incompatible semantic class (EVENT), but bomb will be extracted because it has a compatible semantic class (WEAPON).

We use the semantic class labels assigned by the Sundance parser (Riloff and Phillips, 2004) in our experiments. Sundance looks up each noun in a semantic dictionary to assign the semantic class labels. As an alternative, general resources (e.g., WordNet (Miller, 1990)) or a semantic tagger (e.g., (Huang and Riloff, 2010)) could be used.
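As an illustration, the selectional-restriction test described above can be sketched as follows. The seed lists and semantic-class inventory here are invented stand-ins for the Basilisk seed nouns and Sundance class labels used in the paper:

```python
# Hypothetical, abbreviated resources; the real system uses the full
# Basilisk seed lists and Sundance semantic dictionary.
ROLE_SEEDS = {
    "weapon": {"bomb", "rifle", "grenade", "dynamite"},
}
ROLE_COMPATIBLE_CLASSES = {
    "weapon": {"WEAPON"},
}

def passes_selectional_restriction(head_noun, semantic_class, role):
    """A noun phrase is labeled as a positive instance for an event role
    only if its head noun is a role-identifying seed noun, or its
    semantic class is compatible with the event role."""
    if head_noun in ROLE_SEEDS[role]:
        return True
    return semantic_class in ROLE_COMPATIBLE_CLASSES[role]

# 'tsunami' has semantic class EVENT, incompatible with the weapon role:
passes_selectional_restriction("tsunami", "EVENT", "weapon")   # -> False
passes_selectional_restriction("bomb", "WEAPON", "weapon")     # -> True
```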
[Figure 3: Automatic Training Data Creation. The figure shows a semantic dictionary (e.g., men = Human, building = Object), role-identifying nouns (terrorists, assassins, snipers, ...) and role-identifying patterns ("was killed by <np>", "<subject> attacked", "<subject> fired shots") with their constraints, applied to an example text: "John Smith was killed by two armed men[1] in broad daylight this morning. The assassins[2] attacked the mayor as he left his house to go to work about 8:00 am. Police arrested the unidentified men[3] an hour later."]

4.1.3 Propagating Labels with Coreference

To enrich the automatically labeled training instances, we also propagate the event role labels across coreferent noun phrases within a document. The observation is that once a noun phrase has been identified as a role filler, its coreferent mentions in the same document likely fill the same event role, since they refer to the same real-world entity.

To leverage these coreferential contexts, we employ a simple head noun matching heuristic to identify coreferent noun phrases. This heuristic assumes that two noun phrases that have the same head noun are coreferential. We considered using an off-the-shelf coreference resolver, but decided that the head noun matching heuristic would likely produce higher precision results, which is important to produce high-quality labeled data. In the example in Figure 3, this heuristic would propagate the perpetrator label from noun phrase #1 to noun phrase #3.

4.2 Creating TIERlite with Bootstrapping

In this section, we explain how the labeled instances are used to train TIER's classifiers with bootstrapping. In addition to the automatically labeled instances, the training process depends on a text corpus that consists of both relevant (in-domain) and irrelevant (out-of-domain) documents. Positive instances are generated from the relevant documents and negative instances are generated by randomly sampling from the irrelevant documents.

The classifiers are all support vector machines (SVMs), implemented using the SVMlin software (Keerthi and DeCoste, 2005). When applying the classifiers during bootstrapping, we use a sliding confidence threshold to determine which labels are reliable based on the values produced by the SVM. Initially, we set the threshold to 2.0 to identify highly confident predictions. But if fewer than k instances pass the threshold, then we slide the threshold down in decrements of 0.1 until we obtain at least k labeled instances or the threshold drops below 0, in which case bootstrapping ends. We used k=10 for both sentence classifiers and k=30 for the noun phrase classifiers.

The following sections present the details of the bootstrapped training process for each of TIER's components.
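The sliding-threshold selection used during bootstrapping (Section 4.2) can be sketched as follows; the function and variable names are ours, and the SVMlin classifier itself is not shown:

```python
def select_confident(scores, k, start=2.0, step=0.1):
    """Slide the SVM confidence threshold down from `start` in `step`
    decrements until at least k instances pass, or the threshold drops
    below 0 (the signal that bootstrapping should stop).

    scores: SVM decision values, one per candidate instance.
    Returns (selected_indices, threshold), or (None, None) to stop."""
    t = start
    while t >= 0:
        selected = [i for i, s in enumerate(scores) if s >= t]
        if len(selected) >= k:
            return selected, t
        t = round(t - step, 10)  # avoid floating-point drift
    return None, None

# With k=2, the threshold slides from 2.0 down to 1.2 before two
# instances qualify:
select_confident([2.3, 1.2, 0.4], k=2)   # -> ([0, 1], 1.2)
select_confident([-1.0], k=1)            # -> (None, None): stop bootstrapping
```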
4.2.1 Noun Phrase Classifiers

[...] generated from the relevant documents following Section 4.1. The negative noun phrase instances are drawn randomly from the irrelevant documents. Considering the sparsity of role fillers in texts, we set the negative:positive ratio to 10:1. Once the classifier is trained, it is applied to the unlabeled noun phrases in the relevant documents. Noun phrases that are assigned role filler labels by the classifier with high confidence (using the sliding threshold) are added to the set of positive instances. New negative instances are drawn randomly from the irrelevant documents to maintain the 10:1 (negative:positive) ratio.

We extract features from each noun phrase (NP) and its surrounding context. The features include the NP head noun and its premodifiers. We also use the Stanford NER tagger (Finkel et al., 2005) to identify named entities within the NP. The context features include four words to the left of the NP, four words to the right of the NP, and the lexico-syntactic patterns generated by AutoSlog to capture expressions around the NP (see (Riloff, 1993) for details).

4.2.2 Event Sentence Classifier

The event sentence classifier is responsible for identifying sentences that describe a relevant event. Similar to the noun phrase classifier training, positive training instances are selected from the relevant documents and negative instances are drawn from the irrelevant documents. All sentences in the relevant documents that contain one or more labeled noun phrases (belonging to any event role) are labeled as positive training instances. We randomly sample sentences from the irrelevant documents to obtain a negative:positive training instance ratio of 10:1. The bootstrapping process is then identical to that of the noun phrase classifiers. The feature set for this classifier consists of unigrams, bigrams, and AutoSlog's lexico-syntactic patterns surrounding all noun phrases in the sentence.

4.2.3 Role-Specific Sentence Classifiers

The role-specific sentence classifiers are trained to identify the contexts specific to each event role. All sentences in the relevant documents that contain at least one labeled noun phrase for the appropriate event role are used as positive instances. Negative instances are randomly sampled from the irrelevant documents to maintain the negative:positive ratio of 10:1. The bootstrapping process and feature set are the same as for the event sentence classifier.

The difference between the two types of sentence classifiers is that the event sentence classifier uses positive instances from all event roles, while each role-specific sentence classifier only uses the positive instances for one particular event role. The rationale is similar to that in the supervised setting (Huang and Riloff, 2011): the event sentence classifier is expected to generalize over all event roles to identify event mention contexts, while the role-specific sentence classifiers are expected to learn to identify contexts specific to individual roles.

4.2.4 Event Narrative Document Classifier

TIER also uses an event narrative document classifier and only extracts information from role-specific sentences within event narrative documents. In the supervised setting, TIER uses heuristic rules derived from answer key templates to identify the event narrative documents in the training set, which are used to train an event narrative document classifier. The heuristic rules require that an event narrative should have a high density of relevant information and tend to mention the relevant information within the first several sentences.

In our weakly supervised setting, we use the information density heuristic directly instead of training an event narrative classifier. We approximate the relevant information density heuristic by computing the ratio of relevant sentences (both event sentences and role-specific sentences) out of all the sentences in a document. Thus, the event narrative labeller only relies on the output of the two sentence classifiers. Specifically, we label a document as an event narrative if at least 50% of the sentences in the document are relevant (i.e., labeled positively by either sentence classifier).

5 Evaluation

In this section, we evaluate our bootstrapped system, TIERlite, on the MUC-4 event extraction data set. First, we describe the IE task, the data set, and the weakly supervised baseline systems that we use for comparison. Then we present the results of our fully bootstrapped system TIERlite, the weakly supervised baseline systems, and two fully supervised event extraction systems, TIER and GLACIER. In addition, we analyze the performance of TIERlite using different configurations to assess the impact of its components.

5.1 IE Task and Data

We evaluated the performance of our systems on the MUC-4 terrorism IE task (MUC-4 Proceedings, 1992) about Latin American terrorist events. We used 1,300 texts (DEV) as our training set and 200 texts (TST3+TST4) as the test set. All the documents have answer key templates. For the training set, we used the answer keys to separate the documents into relevant and irrelevant subsets. Any document containing at least one relevant event was considered to be relevant.

Following previous studies, we evaluate our system on five MUC-4 string event roles: perpetrator individuals (PerpInd), perpetrator organizations (PerpOrg), physical targets, victims, and weapons. Table 1 shows the distribution of role fillers in the MUC-4 test set.

    PerpInd  PerpOrg  Target  Victim  Weapon
      129      74      126     201      58

Table 1: # of Role Fillers in the MUC-4 Test Set

The complete IE task involves the creation of answer key templates, one template per event.1 Our work focuses on extracting individual role fillers and not template generation, so we evaluate the accuracy of the role fillers irrespective of which template they occur in.

We used the same head noun scoring scheme as previous systems, where an extraction is correct if its head noun matches the head noun in the answer key.2 Pronouns were discarded from both the system responses and the answer keys since no coreference resolution is done. Duplicate extractions were conflated before being scored, so they count as just one hit or one miss.

5.2 Weakly Supervised Baselines

We compared the performance of our system with three previous weakly supervised event extraction systems.

AutoSlog-TS (Riloff, 1996) generates lexico-syntactic patterns exhaustively from unannotated texts and ranks them based on their frequency and probability of occurring in relevant documents. A human expert then examines the patterns and manually selects the best patterns for each event role. During testing, the patterns are matched against unseen texts to extract event role fillers.

PIPER (Patwardhan and Riloff, 2007; Patwardhan, 2010) learns extraction patterns using a semantic affinity measure; it distinguishes between primary and secondary patterns and applies them selectively.

(Chambers and Jurafsky, 2011) (C+J) created an event extraction system by acquiring event words from WordNet (Miller, 1990), clustering the event words into different event scenarios, and grouping extraction patterns for different event roles.

5.3 Performance of TIERlite

Table 2 shows the seed nouns that we used in our experiments, which were generated by sorting the nouns in the corpus by frequency and manually identifying the first 10 role-identifying nouns for each event role.3 Table 3 shows the number of training instances (noun phrases) that were automatically labeled for each event role using our training data creation approach (Section 4.1).

Event Role      Seed Nouns
Perpetrator     terrorists assassins criminals rebels
Individual      murderers death squads guerrillas
                member members individuals
Perpetrator     FMLN ELN FARC MRTA M-19 Front
Organization    Shining Path Medellin Cartel
                The Extraditables
                Army of National Liberation
Target          houses residence building home homes
                offices pipeline hotel car vehicles
Victim          victims civilians children jesuits Galan
                priests students women peasants Romero
Weapon          weapons bomb bombs explosives rifles
                dynamite grenades device car bomb

Table 2: Role-Identifying Seed Nouns

    PerpInd  PerpOrg  Target  Victim  Weapon
      296      157     522     798     248

Table 3: # of Automatically Labeled NPs

Table 4 shows how our bootstrapped system TIERlite compares with previous weakly supervised systems and two supervised systems: its supervised counterpart TIER (Huang and Riloff, 2011) and a model that jointly considers local and sentential contexts, GLACIER (Patwardhan and Riloff, 2009).

1 Documents may contain multiple events per article.
2 For example, "armed men" will match "5 armed men".
3 We only found 9 weapon terms among the high-frequency terms.
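The document-labeling rule from Section 4.2.4 can be sketched as follows (hypothetical function name; the real labeller consumes the outputs of the two sentence classifiers):

```python
def is_event_narrative(sentence_labels, min_ratio=0.5):
    """Label a document as an event narrative when at least `min_ratio`
    of its sentences were labeled relevant (positive) by either the
    event sentence classifier or a role-specific sentence classifier.

    sentence_labels: list of booleans, one per sentence in the document."""
    if not sentence_labels:
        return False
    return sum(sentence_labels) / len(sentence_labels) >= min_ratio

# 2 of 4 sentences relevant -> ratio 0.5, which meets the 50% cutoff:
is_event_narrative([True, True, False, False])   # -> True
is_event_narrative([True, False, False])         # -> False
```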
                     PerpInd   PerpOrg   Target    Victim    Weapon    Average
Weakly Supervised Baselines
AutoSlog-TS (1996)   33/49/40  52/33/41  54/59/56  49/54/51  38/44/41  45/48/46
PIPER-Best (2007)    39/48/43  55/31/40  37/60/46  44/46/45  47/47/47  44/46/45
C+J (2011)           -         -         -         -         -         44/36/40
Supervised Models
GLACIER (2009)       51/58/54  34/45/38  43/72/53  55/58/56  57/53/55  48/57/52
TIER (2011)          48/57/52  46/53/50  51/73/60  56/60/58  53/64/58  51/62/56
Weakly Supervised Models
TIERlite             47/51/49  60/39/47  37/65/47  39/53/45  53/55/54  47/53/50

Table 4: Precision/Recall/F-score results on the MUC-4 test set

[Figure 5: The Learning Curve of Supervised TIER (IE performance (F1) vs. # of training documents)]

5.4 Analysis

Table 6 shows the effect of the coreference propagation step described in Section 4.1.3 as part of training data creation. Without this step, the bootstrapped system yields an F score of 41. With the benefit of the additional training instances produced by coreference propagation, the system yields an F score of 50. The new instances produced by coreference propagation seem to substantially enrich the diversity of the set of labeled instances.

Seeding    P/R/F
wo/Coref   45/38/41
w/Coref    47/53/50

Table 6: Effect of coreference propagation (P/R/F)
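As a quick sanity check, the F scores in Table 6 are consistent with the usual harmonic mean of precision and recall (all values in percent):

```python
def f1(p, r):
    """Harmonic mean of precision and recall, both given in percent."""
    return 2 * p * r / (p + r)

# Table 6 rows, P/R -> F:
round(f1(45, 38))   # -> 41  (wo/Coref)
round(f1(47, 53))   # -> 50  (w/Coref)
```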
                         PerpInd   PerpOrg   Target    Victim    Weapon    Average
Supervised Classifier    25/67/36  26/78/39  34/83/49  32/72/45  30/75/43  30/75/42
Bootstrapped Classifier  30/54/39  37/53/44  30/71/42  28/63/39  36/57/44  32/60/42
R. Huang and E. Riloff. 2011. Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-11).

S. Huffman. 1996. Learning Information Extraction Patterns from Examples. In Stefan Wermter, Ellen Riloff, and Gabriele Scheler, editors, Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, pages 246-260. Springer-Verlag, Berlin.

H. Ji and R. Grishman. 2008. Refining Event Extraction through Cross-Document Inference. In Proceedings of ACL-08: HLT, pages 254-262, Columbus, OH, June.

S. Keerthi and D. DeCoste. 2005. A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs. Journal of Machine Learning Research.

J. Kim and D. Moldovan. 1993. Acquisition of Semantic Patterns for Information Extraction from Corpora. In Proceedings of the Ninth IEEE Conference on Artificial Intelligence for Applications, pages 171-176, Los Alamitos, CA. IEEE Computer Society Press.

Y. Li, K. Bontcheva, and H. Cunningham. 2005. Using Uneven Margins SVM and Perceptron for Information Extraction. In Proceedings of the Ninth Conference on Computational Natural Language Learning, pages 72-79, Ann Arbor, MI, June.

Shasha Liao and Ralph Grishman. 2010. Using Document Level Cross-Event Inference to Improve Event Extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL-10).

M. Maslennikov and T. Chua. 2007. A Multi-Resolution Framework for Information Extraction from Free Text. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics.

G. Miller. 1990. WordNet: An On-line Lexical Database. International Journal of Lexicography, 3(4).

MUC-4 Proceedings. 1992. Proceedings of the Fourth Message Understanding Conference (MUC-4). Morgan Kaufmann.

S. Patwardhan and E. Riloff. 2007. Effective Information Extraction with Semantic Affinity Patterns and Relevant Regions. In Proceedings of the 2007 Conference on Empirical Methods in Natural Language Processing (EMNLP-2007).

S. Patwardhan and E. Riloff. 2009. A Unified Model of Phrasal and Sentential Evidence for Information Extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP-2009).

S. Patwardhan. 2010. Widening the Field of View of Information Extraction through Sentential Event Recognition. Ph.D. thesis, University of Utah.

W. Phillips and E. Riloff. 2007. Exploiting Role-Identifying Nouns and Expressions for Information Extraction. In Proceedings of the 2007 International Conference on Recent Advances in Natural Language Processing (RANLP-07), pages 468-473.

E. Riloff and R. Jones. 1999. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. In Proceedings of the Sixteenth National Conference on Artificial Intelligence.

E. Riloff and W. Phillips. 2004. An Introduction to the Sundance and AutoSlog Systems. Technical Report UUCS-04-015, School of Computing, University of Utah.

E. Riloff. 1993. Automatically Constructing a Dictionary for Information Extraction Tasks. In Proceedings of the 11th National Conference on Artificial Intelligence.

E. Riloff. 1996. Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1044-1049.

Satoshi Sekine. 2006. On-demand Information Extraction. In Proceedings of the Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING/ACL-06).

Y. Shinyama and S. Sekine. 2006. Preemptive Information Extraction using Unrestricted Relation Discovery. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 304-311, New York City, NY, June.

S. Soderland, D. Fisher, J. Aseltine, and W. Lehnert. 1995. CRYSTAL: Inducing a Conceptual Dictionary. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1314-1319.

M. Stevenson and M. Greenwood. 2005. A Semantic Approach to IE Pattern Induction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 379-386, Ann Arbor, MI, June.

K. Sudo, S. Sekine, and R. Grishman. 2003. An Improved Extraction Pattern Representation Model for Automatic IE Pattern Acquisition. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-03).

R. Yangarber, R. Grishman, P. Tapanainen, and S. Huttunen. 2000. Automatic Acquisition of Domain Knowledge for Information Extraction. In Proceedings of the Eighteenth International Conference on Computational Linguistics (COLING 2000).

K. Yu, G. Guan, and M. Zhou. 2005. Resume Information Extraction with Cascaded Hybrid Model. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 499-506, Ann Arbor, MI, June.
Bootstrapping Events and Relations from Text

Ting Liu
ILS, University at Albany, USA
tliu@albany.edu

Tomek Strzalkowski
ILS, University at Albany, USA
Polish Academy of Sciences
tomek@albany.edu
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 296-305, Avignon, France, April 23-27 2012. (c) 2012 Association for Computational Linguistics

Abstract

In this paper, we describe a new approach to semi-supervised adaptive learning of event extraction from text. Given a set of examples and an un-annotated text corpus, the BEAR system (Bootstrapping Events And Relations) will automatically learn how to recognize and understand descriptions of complex semantic relationships in text, such as events involving multiple entities and their roles. For example, given a series of descriptions of bombing and shooting incidents (e.g., in newswire), the system will learn to extract, with a high degree of accuracy, other attack-type events mentioned elsewhere in text, irrespective of the form of description. A series of evaluations using the ACE data and event set show a significant performance improvement over our baseline system.

1 Introduction

We constructed a semi-supervised machine learning process that effectively exploits statistical and structural properties of natural language discourse in order to rapidly acquire rules to detect mentions of events and other complex relationships in text, extract their key attributes, and construct template-like representations. The learning process exploits descriptive and structural redundancy, which is common in language; it is often critical for achieving successful communication despite distractions, different contexts, or incompatible semantic models between a speaker/writer and a hearer/reader. We also take advantage of the high degree of referential consistency in discourse (e.g., as observed in word sense distribution by (Gale et al., 1992), and arguably applicable to larger linguistic units), which enables the reader to efficiently correlate different forms of description across coherent spans of text.

The method we describe here consists of two steps: (1) supervised acquisition of initial extraction rules from an annotated training corpus, and (2) self-adapting unsupervised multi-pass bootstrapping, by which the system learns new rules as it reads un-annotated text, using the rules learnt in the first step and in subsequent learning passes. When a sufficient quantity and quality of text material is supplied, the system will learn many ways in which a specific class of events can be described. This includes the capability to detect individual event mentions using a system of context-sensitive triggers and to isolate pertinent attributes such as agent, object, instrument, time, place, etc., as may be specific for each type of event. This method produces an accurate and highly adaptable event extraction system that significantly outperforms current information extraction techniques in terms of accuracy and robustness, as well as in deployment cost.

2 Learning by bootstrapping

As a semi-supervised machine learning method, bootstrapping can start either with a set of predefined rules or patterns, or with a collection of training examples (seeds) annotated by a domain expert on a (small) data set. These are normally related to a target application domain and may be regarded as initial teacher instructions to the learning system. The training set enables the system to derive initial extraction rules, which are applied to un-annotated text data in order to produce a much larger set of examples. The examples found by the initial rules will occur in a variety of linguistic contexts, and some of these contexts may provide support for creating alternative extraction rules. When the new rules are subsequently applied to the text corpus, additional instances of the target concepts will be identified, some of which will be positive and some not. As this process continues to iterate, the system acquires more extraction rules, fanning out from the seed set until no new rules can be learned.

Thus defined, bootstrapping has been used in natural language processing research, notably in word sense disambiguation (Yarowsky, 1995). Strzalkowski and Wang (1996) were the first to demonstrate that the technique could be applied to adaptive learning of named entity extraction rules. For example, given a naive rule for identifying company names in text, e.g., "capitalized NP followed by Co.", their system would first find a large number of (mostly) positive instances of company names, such as "Henry Kauffman Co." From the context surrounding each of these instances it would isolate alternative indicators, such as "the president of", which is noted to occur in front of many company names, as in "The president of American Electric Automobile Co. ...". Such alternative indicators give rise to new extraction rules, e.g., "president of" + CNAME. The new rules find more entities, including company names that do not end with "Co.", and the process iterates until no further rules are found. The technique achieved very high performance (95% precision and 90% recall), which encouraged more research on bootstrapping techniques in the IE area. Using a similar approach, (Thelen and Riloff, 2002) generated new syntactic patterns by exploiting the context of known seeds for learning semantic categories.

In Snowball (Agichtein and Gravano, 2000) and Yangarber's IE system (2000), the bootstrapping technique was applied to the extraction of binary relations, such as Organization-Location, e.g., between Microsoft and Redmond, WA. Then, Xu (2007) extended the method to more complex relation extraction by using sentence syntactic structure and data-driven pattern generation. In this paper, we describe a different approach to building event patterns and adapting to the different structures of unseen events.

3 Bootstrapping applied to event learning

Our objective in this project was to expand the bootstrapping technique to learn extraction of events from text, irrespective of their form of description, a property essential for successful adaptability to new domains and text genres. The major challenge in advancing from entities and binary relations to event learning is the complexity of the structures involved, which not only consist of multiple elements, but whose linguistic context may now extend well beyond a few surrounding words, even past sentence boundaries. These considerations guided the design of the BEAR system (Bootstrapping Events And Relations), which is described in this paper.

3.1 Event representation

An event description can vary from very concise, newswire-style to very rich and complex as may be found in essays and other narrative forms. The system needs to recognize any of these forms, and to do so we need to distill each description to a basic event pattern. This pattern will capture the heads of key phrases and their dependency structure while suppressing modifiers and certain other non-essential elements. Such skeletal representations cannot be obtained with keyword analysis or linear processing of sentences at word level (e.g., Agichtein and Gravano, 2000), because such methods cannot distinguish a phrase head from its modifier. A shallow dependency parser, such as Minipar (Lin, 1998), that recognizes dependency relations between words is quite sufficient for deriving head-modifier relations and thus for construction of event templates. Event templates are obtained by stripping the parse tree of modifiers while preserving the basic dependency structure as shown in Figure 1, which is a stripped-down parse tree of "Also Monday, Israeli soldiers fired on four diplomatic vehicles in the northern Gaza town of Beit Hanoun, said diplomats."

[Figure 1. Skeletal dependency structure representation of an event mention.]

The model proposed here represents a significant advance over the current methods for relation extraction, such as the SVO model (Yangarber et al., 2000) and its extensions, e.g., the chain model (Sudo et al., 2001) and other related variants (Riloff, 1996), all of which lack the expressive power to accurately recognize and represent complex event descriptions and to support successful machine learning. While Sudo's subtree model (2003) overcomes some of the limitations of the chain models and is thus conceptually closer to our method, it nonetheless lacks the efficiency required for practical applications.

We represent complex relations as tree-like structures anchored at an event trigger (which is usually but not necessarily the main verb) with branches extending to the event attributes (which are usually named entities). Unlike the singular concepts (i.e., named entities such as person or
location) or linear relations (i.e., tuples such as Gates-CEO-Microsoft), an event description consists of elements that form non-linear dependencies, which may not be apparent in the word order and therefore require syntactic and semantic analysis to extract. Furthermore, an arrangement of these elements in text can vary greatly from one event mention to the next, and there is usually other intervening material involved. Consequently, we construe event representation as a collection of paths linking the trigger to the attributes through the nodes of a parse tree.1

To create an event pattern (which will be part of an extraction rule), we generalize the dependency paths that connect the event trigger with each of the event key attributes (the roles). A dependency path consists of lexical and syntactic relations (POS and phrase dependencies), as well as semantic relations, such as entity tags (e.g., Person, Company, etc.) of event roles and word sense designations (based on WordNet senses) of event triggers. In addition to the trigger-role paths (which we shall call the sub-patterns), an event pattern also contains the following:

- Event Type and Subtype, inherited from the seed examples;
- Trigger class: an instance of the trigger must be found in text before any patterns are applied;
- Confidence score: the expected accuracy of the pattern, established during the training process;
- Context profile: additional features collected from the context surrounding the event description, including references to other types of events near this event, in the same sentence, same paragraph, or adjacent paragraphs.

We note that the trigger-attribute sub-patterns are defined over phrase structures rather than over linear text, as shown in Figure 2. In order to compose a complete event pattern, sub-patterns are collected across multiple mentions of the same-type event.

Attacker: <N(subj, PER): Attacker> <V(fire): trigger>
Place: <V(fire): trigger> <Prep> <N> <Prep(in)> <N(GPE): Place>
Target: <V(fire): trigger> <Prep(on)> <N(VEH): Target>
Time-Within: <N(timex2): Time-Within> <SentHead> <V(fire): trigger>

Figure 2. Trigger-attribute sub-patterns for key roles in a Conflict-Attack event pattern.

3.2 Designating the sense of event triggers

An event trigger may have multiple senses, but only one of them applies to the event representation. If the correct sense can be determined, we would be able to use its synonyms and hyponyms as alternative event triggers, thus enabling extraction of more events. This, in turn, requires sense disambiguation to be performed on the event triggers.

In MUC evaluations, participating groups (Yangarber and Grishman, 1998) used human experts to decide the correct sense of event triggers and then manually added correct synonyms to generalize event patterns. Although accurate, the process is time consuming and not portable to new domains.

We developed a new approach that utilizes WordNet to decide the correct sense of an event trigger. The method is based on the hypothesis that event triggers share the same sense when they represent the same type of event. For example, when the verbs attack, assail, strike, gas, and bomb are trigger words of a Conflict-Attack event, they share the same sense. This process is described in the following steps:

1) From the training corpus, collect all triggers, which specify the lemma, POS tag, and the type of event, and get all their possible senses from WordNet.
2) Order the triggers by the trigger frequency TrF(t, w_pos),2 which is calculated by dividing the number of times each word (w_pos) is used as a trigger for the event of type (t) by the total number of times this word occurs in the training corpus. Clearly, the greater the trigger frequency of a word, the more discriminative it is as a trigger for the given type of event. When the senses of the triggers with high accuracy are defined, they can serve as the reference for the triggers with low accuracy.
3) From the top of the trigger list, select the first trigger (Tr1) with no sense defined yet.
4) Again, beginning from the top of the trigger list, for every trigger Tr2 (other than Tr1), we look for a pair of compatible senses between Tr1 and Tr2. To do so, traverse Synonym, Hypernym, and Hyponym links starting from the sense(s) of Tr2 (use either the sense already assigned to Tr2 if it has one, or all its possible senses) and see whether there are paths which can reach the senses of Tr1. If such converging paths exist, the compatible senses

1 Details of how to derive the skeletal tree representation are described in (Liu, 2009).
2 t = the type of the event; w_pos = the lemma of a word and its POS.
3 In this figure we omit the parse tree trimming step, which was explained in the previous section.
[Figure 3. A Conflict-Attack event pattern derived from a positive example in the training corpus.]

are identified and assigned to Tr1 and Tr2 (if Tr2's sense wasn't assigned before). Then go back to step 3. However, if no such path exists between Tr1's senses and the other triggers' senses, the first sense listed in WordNet is assigned to Tr1.

This algorithm tries to assign the most appropriate sense to every trigger for one type of event. For example, the sense of fire as a trigger of a Conflict-Attack event is "start firing a weapon", while when it is used in a Personnel-End_Position event, its sense is "terminate the employment of". After the trigger sense is defined, we can expand the event triggers by adding their synonyms and hyponyms during event extraction.

3.3 Deriving initial rules from seed examples

Extraction rules are construed as transformations from the event patterns derived from text onto a formal representation of an event. The initial [...] relaxation, is particularly useful for rapid adaptation of extraction capability to slightly altered, partly ungrammatical, or otherwise variant data.

The basic idea is as follows: the patterns acquired in prior learning iterations (starting with those obtained from the seed examples) are matched against incoming text to extract new events. Along the way there will be a number of partial matches, i.e., when no existing pattern fully matches a span of text. This may simply mean that no event is present; however, depending upon the degree of the partial match, we may also consider that a novel structural variant was found. BEAR would automatically test this hypothesis by attempting to construe a new pattern out of the elements of existing patterns, in order to achieve a full match. If a match is achieved, the new "mutated" pattern will be added to BEAR's learned collection, subject to a validation step. The validation step (discussed later in this paper) is to assure that the added pattern would not introduce an unacceptable drop in overall system precision. Specific pattern mutation techniques include the following:

- Adding a role subpattern: When a pattern matches an event mention while there is sufficient linguistic evidence (e.g., presence of certain types of named entities) that additional roles may be present in the text, then appropriate role subpatterns can be "imported" from other, non-matching patterns (Figure 4).
- Replacing a role subpattern: When a pattern matches but for one role, the system
rules are derived from a manually annotated can replace this role subpattern by another
training text corpus (seed data), supplied as part subpattern for the same role taken from a
of an application task. Each rule contains the different pattern for the same event type.
type of events it extracts, trigger, a list of role Adding or replacing a trigger: When a
sub-patterns, and the confidence score obtained pattern matches but for the trigger, a new
through a validation process (see section 3.6). trigger can be added if it either is already
Figure 3 shows an extraction pattern for the Con- present in another pattern for the same
flict-Attack event derived from the training cor- event type or the syno-
pus (but not validated yet)3. nym/hyponym/hypernym of the trigger
(found in section 3.2).
3.4 Learning through pattern mutation We should point out that some of the same ef-
fects can be obtained by making patterns more
Given an initial set of extraction rules, a variety
general, i.e., adding "optional" attributes (i.e.,
of pattern mutation techniques are applied to de-
optional sub-patterns), etc. Nonetheless, the pat-
rive new patterns and new rules. This is done by
tern mutation is more efficient because it will
selecting elements of previously learnt patterns,
automatically learn such generalization on an as-
based on the history of partial matches and com-
needed basis in an entirely data-driven fashion,
bining them into new patterns. This form of
while also maintaining high precision of the re-
learning, which also includes conditional rule
sulting pattern set. It is thus a more general
3
In this figure we omit the parse tree trimming step which method. Figure 4 illustrated the use of the ele-
was explained in the previous section. ments combination technique. In this example,
neither of the two existing patterns can fully match the new event description; however, by combining the first pattern with the Place role sub-pattern from the second pattern we obtain a new pattern that fully matches the text. While this adjustment is quite simple, it is nonetheless performed automatically and without any human assistance. The new pattern is then learned by BEAR, subject to a verification step explained in a later section.

Figure 4. Deriving a new pattern by importing a role from another pattern.

3.5 Learning by exploiting structural duality

As the system reads through new text extracting more events using already learnt rules, each extracted event mention is analyzed for the presence of alternative trigger elements that can consistently predict the presence of a subset of events that includes the current one. Subsequently, an alternative sub-pattern structure is built with branches extending from the new trigger to the already identified attributes, as shown schematically in Figure 5.
In this example, a Conflict-Attack-type event is extracted using a pattern (shown in Figure 5A) anchored at the "bombing" trigger. Nonetheless, an alternative trigger structure is discovered, which is anchored at an "attack" NP, as shown on the right side of Figure 5. This discovery is based upon seeing the new trigger repeatedly: it needs to explain a subset of previously seen events to be adopted. The new trigger prompts BEAR to derive additional event patterns by computing alternative trigger-attribute paths in the dependency tree. The new pattern (shown in Figure 5B) is of course subject to confidence validation, after which it will be immediately applied to extract more events.

Pattern ID: 1207
Type: Conflict Subtype: Attack
Trigger: bombing_N
Target: <N(bombing): trigger> <Prep(of)> <N(FAC): Target>
Attacker: <N(PER): Attacker> <V> <N(bombing): trigger>
Time-Within: <N(bombing): trigger> <Prep> <N> <Prep> <N> <E0> <V> <N(timex2): Time-within>
Figure 5A. A pattern with the "bombing" trigger matches the event mention in Fig. 5.

Pattern ID: 1286
Type: Conflict Subtype: Attack
Trigger: attack_N
Target: <N(FAC): Target> <Prep(in)> <N(attack): trigger>
Attacker: <N(PER): Attacker> <V> <N> <Prep> <N> <Prep(in)> <N(attack): trigger>
Time-Within: <N(attack): trigger> <E0> <V> <N(timex2): Time-within>
Figure 5B. A new pattern is derived for the event in Fig. 5, with "attack" as the trigger.

Figure 5. A new extraction pattern is derived by identifying an alternative trigger for an event.

Another way of getting at this kind of structural duality is to exploit co-referential consistency within coherent spans of discourse, e.g., a single news article or a similar document. Such documents may contain references to multiple events, but when the same type of event is mentioned along with the same attributes, it is more likely than not in reference to the same event. This hypothesis is a variant of an argument advanced in (Gale et al., 1992) that a polysemous word used multiple times within a single document is consistently used in the same sense. So if we extract an event mention (of type T) with trigger t in one part of a document, and then find that t occurs in another part of the same document, we may assume that this second occurrence of t has the same sense as the first. Since t is a trigger for an event of type T, we can hypothesize that its subsequent occurrences indicate additional mentions of type T events that were not extracted by any of the existing patterns. Our objective is to exploit these unextracted mentions and then automatically generate additional event patterns.
Indeed, Ji (2008) showed that trigger co-occurrence helps finding new mentions of the
same event; however, we found that by using entity co-reference as another factor, more new mentions can be identified when the trigger has low projected accuracy (Liu, 2009; Hong et al., 2011). Our experiments (Figure 6), which compared the triggers and the roles across all event mentions within each document in the ACE training corpus, showed that when the trigger accuracy is 0.5 or higher, each of its occurrences within the document indicates an event mention of the same type with a very high probability (mostly > 0.9). For triggers with lower accuracy, this high probability is only achieved when the two mentions share at least 60% of their roles, in addition to having a common trigger. Thus our approach uses co-occurrence of both trigger and event argument for detecting new event mentions.

Figure 6. The probability of a sentence containing a mention of the same type of event within a single document. The X-axis is the percentage of entities coreferred between the EVMs (event mentions) and the SEs (sentences); the Y-axis shows the probability that the SE contains a mention of the same type as the EVM.

In Figure 7, an End-Position event is extracted from the left sentence (L), with "resign" as the trigger and Capek and UBS assigned the Person and Entity roles, respectively (Entity is the employer in the event pattern). The right sentence (R), taken from the same document, contains the same trigger word, "resigned," and also the same entities, Howard G. Capek and UBS. The projected accuracy of resign_V as an End-Position trigger is 0.88. With a 100% argument overlap rate, we estimate the probability that sentence R contains an event mention of the same type as sentence L (and in fact a co-referential mention) at 97% (we set 80% as the threshold). Thus a new event mention is found and a new pattern for End-Position is automatically derived from R, as shown in Figure 7A.

Figure 7. Two event mentions have different triggers and sub-pattern structures.

Pattern ID: -1
Type: Personnel Subtype: End-Position
Trigger: resign_V
Person: <N(PER, subj): Person> <V(resign): trigger>
Entity: <V(resign): trigger> <E0> <N(ORG): Entity> <N> <V>
Figure 7A. A new pattern for End-Position learned by exploiting event co-reference.

3.6 Pattern validation

Extraction patterns are validated after each learning cycle against the already annotated data. In the first supervised learning step, a pattern's accuracy is tested against the training corpus based on the similarity between the extracted events and the human-annotated events:

- A Full Match is achieved when the event type is correctly identified and all its roles are correctly matched. Full credit is added to the pattern score.
- A Partial Match is achieved when the event type is correctly identified but only a subset of roles is correctly extracted. A partial score, which is the ratio of the matched roles to all roles, is added.
- A False Alarm occurs when a wrong type of event is extracted (including when no event is present in the text). No credit is added to the pattern score.

In the subsequent steps, the validation is extended over parts of the unannotated corpus. In Riloff (1996) and Sudo et al. (2001), the pattern accuracy depends mainly on its occurrences in the relevant documents vs. the whole corpus (if a document contains the same type of events extracted in previous steps, the document is a relevant document to the …). However, one document may contain multiple types of events, thus we set a more restricted validation measure on new rules:

- Good Match: If a new rule rediscovers already extracted events of the same type, it is counted as either a Full Match or a Partial Match based on previous rules.
- Possible Match: If an already extracted event of the same type contains the same entities and trigger as the candidate extracted by the rule, the candidate is a possible match, so it will get a partial
score based on the statistics result from Figure 6.
- False Alarm: If a new rule picks up an already extracted event of a different type.

Thus, event patterns are validated for overall expected precision by calculating the ratio of positive matches to all matches against known events. This produces pattern confidence scores, which are used to decide if a pattern is to be learned or not. Learning only the patterns with sufficiently high confidence scores helps to guard the bootstrapping process from spinning off track; nonetheless, the overall objective is to maximize the performance of the resulting set of extraction rules, particularly by expanding its recall rate.
For the patterns where the projected accuracy score falls under the cutoff threshold, we may still be able to make some repairs by taking into account their context profile. To do so, we applied an approach similar to (Liao, 2010), which showed that some types of events can appear frequently with each other. We collected all the matches produced by such a failed pattern and created a list of all other events that occur in their immediate vicinity: in the same sentence, as well as in the sentences before and after it. (If a known event is detected in the same sentence (sent_...), the same paragraph (para_...), or an adjacent paragraph (adj_para_...) as the candidate event, it becomes an element of the pattern context support.) These other events, of different types and detected by different patterns, may be seen as co-occurring near the target event: those that co-occur near positive matches of our pattern are added to the positive context support of this pattern; conversely, events co-occurring near false alarms are added to the negative context support for this pattern. By collecting such contextual information, we can find contextually based indicators and non-indicators for the occurrence of event mentions. When these extra constraints are included in a previously failed pattern, its projected accuracy is expected to increase, in some cases above the threshold.
For example, the pattern in Figure 8 has an initially low projected accuracy score; however, we find that positive matches of this pattern show a very high (100%, in fact) degree of correlation with mentions of Demonstrate events. Therefore, limiting the application of this pattern to situations where a Justice-Arrest-Jail event is mentioned in nearby text improves its projected accuracy to 91%, which is well above the required threshold.

Event id: 27
from: sample
Projected Accuracy: 0.1765
Adjusted Projected Accuracy: 0.91
Type: Justice Subtype: Arrest-Jail
Trigger: capture
Person sub-pattern: <N(obj, PER): Person> <V(capture): trigger>
Co-occurrence ratio: {para_Conflict_Demonstrate=100%, }
Mutually exclusive ratio: {sent_Conflict_Attack=100%, para_Conflict_Attack=96.3%, }
Figure 8. An Arrest-Jail pattern with context profile information.

In addition to the confidence rate of each new pattern, we also calculate the projected accuracy of each of the role sub-patterns, because they may be used in the process of detecting new patterns, and it will be necessary to score partial matches as a function of confidence weights for pattern components. To validate a sub-pattern we apply it to the training corpus and calculate its projected accuracy score by dividing the number of correctly matched roles by the total number of matches returned. The projected accuracy score tells us how well a sub-pattern can distinguish a specific event role from other information when used independently from the other elements of the complete pattern.

Victim pattern: <N(obj, PER): Victim> <V(kill): trigger> (Life-Die)
Projected Accuracy: 0.9390243902439024
Number of negative matches: 5
Number of positive matches: 77

Attacker pattern: <N(subj, PE/PER/ORG): Attacker> <V> <V(use): trigger> (Conflict-Attack)
Projected Accuracy: 0.025210084033613446
Number of negative matches: 116
Number of positive matches: 3

Attacker pattern: <N(subj, GPE/PER): Attacker> <V(attack): trigger> (Conflict-Attack)
Projected Accuracy: 0.4166666666666667
Number of negative matches: 7
Number of positive matches: 5
Categories of positive matches: GPE: 4, GPE_Nation: 4, PER: 1, PER_Individual: 1
Categories of negative matches: GPE: 1, GPE_Nation: 1, PER: 6, PER_Group: 1, PER_Individual: 5
Figure 9. Sub-patterns with projected accuracy scores.

Figure 9 shows three sub-pattern examples. The first sub-pattern extracts the Victim role in a Life-Die event with very high projected accuracy. This sub-pattern is also a good candidate for generation of additional patterns for this type of event, a process which we describe in section D. The second sub-pattern was built to extract the Attacker role in Conflict-Attack events, but it has very low projected accuracy. The third one shows another Attacker sub-pattern whose projected accuracy score is 0.417 after the first step
in the validation process. This is quite low; however, it can be repaired by constraining its entity type to GPE. This is because we note that with a GPE entity the sub-pattern is 80% on target, while with a PER entity it is 85% a false alarm. After this sub-pattern is restricted to GPE, its projected accuracy becomes 0.8.

Table 1. Sub-patterns whose projected accuracy is significantly increased after noisy samples are removed

Sub-pattern                                                        Projected Accuracy   Additional constraints    Revised Accuracy
Movement-Transport:
<N(obj, PER/VEH): Artifact> <V(send): trigger>                     0.475                removing PER              0.667
<V(bring): trigger> <N(obj)> <Prep=to> <N(FAC/GPE): Destination>   0.375                removing GPE              1.0
Conflict-Attack:
<N(PER/ORG/GPE): Attacker> <N(attack): trigger>                    0.682                removing PER              0.8
<N(subj, GPE/PER): Attacker> <V(attack): trigger>                  0.417                removing GPE              0.8
<N(obj, VEH/PER/FAC): Target> <V(target): trigger>                 0.364                removing PER_Individual   0.667

Table 1 lists example sub-patterns for which the projected accuracy increases significantly after adding more constraints. When the projected accuracy of a sub-pattern is improved, all patterns containing this sub-pattern also improve their projected accuracy. If the adjusted projected accuracy rises above the predefined threshold, the repaired pattern is saved.
In the following section, we discuss the experiments conducted to evaluate the performance of the techniques underlying BEAR: how effectively it can learn and how accurately it can perform its extraction task.

4 Evaluation

We test the system's learning effectiveness by comparing its performance immediately following the first iteration (i.e., using rules derived from the training data) with its performance after N cycles of unsupervised learning. We split the ACE training corpus randomly into 5 folds, trained BEAR on four folds, and evaluated it on the remaining one; we then performed 5-fold cross-validation. Our experiments showed that BEAR reached the best cross-validated score, 66.72%, when the pattern accuracy threshold is set at 0.5. The highest score of a single run is 67.62%. In the remainder of this section, we use the results of one single run to display the learning behavior of BEAR.
In Figure 10, the X-axis shows values of the learning threshold (in descending order), while the Y-axis is the average F-score achieved by the automatically learned patterns for all types of events against the test corpus. The red (lower) line represents BEAR's base run immediately after the first iteration (the supervised learning step); the blue (upper) line represents BEAR's performance after an additional 10 unsupervised learning cycles are completed. We note that the final performance of the bootstrapped system steadily increases as the learning threshold is lowered, peaking at about the 0.5 threshold value, and then declines as the threshold value is further decreased, although it remains solidly above the base run. Analyzing more closely a few selected points on this chart, we note, for example, that the base run at a threshold of 0 has an F-score of 34.5%, which represents 30.42% recall and 40% precision. At the other end of the curve, at the threshold of 0.9, the base run precision is 91.8% but recall is only 21.5%, which produces an F-score of 34.8%. It is interesting to observe that at neither of these two extremes is the system's learning effectiveness particularly good; it is significantly less than at
the median threshold of 0.5 (based on the experiments conducted thus far), where the system performance improves from 42% to 66.86% F-score, which represents 83.9% precision and 55.57% recall.
Figure 11 shows BEAR's learning effectiveness at what we determined empirically to be the optimal confidence threshold (0.5) for pattern acquisition. We note that the performance of the system steadily increases until it reaches a plateau after about 10 learning cycles.
Figure 12 and Figure 13 show a detailed breakdown of BEAR's extraction performance after 10 learning cycles for different types of events. We note that while precision holds steady across the event types, recall levels vary significantly. The main reason for low recall in some types of events is the failure to find a sufficient number of high-confidence patterns. This may point to limitations of the current pattern discovery methods and may require new ways of reaching outside of the current feature set.

Figure 12. Event mention extraction after learning: precision for each type of event.
Figure 13. Event mention extraction after learning: recall for each type of event.

In the previous section we described several learning methods that BEAR uses to discover, validate and adapt new event extraction rules. Some of them work by manipulating already learnt patterns and adapting them to new data in order to create new patterns; we shall call these pattern-mutation methods (PMM). The other described methods work by exploiting the broader linguistic context in which the events occur, or context-based methods (CBM). CB methods look for structural duality in the text surrounding the events and thus discover alternative extraction patterns.
In Table 2, we report the results of running BEAR with each of these two groups of learning methods separately and then in combination, to see how they contribute to the end performance.

Table 2. BEAR performance following different selections of learning steps

        Precision   Recall   F-score
Base1   0.89        0.22     0.35
Base2   0.87        0.28     0.42
All     0.84        0.56     0.67
PMM     0.84        0.48     0.61
CBM     0.86        0.37     0.52

Base1 and Base2 show the results without and with adding trigger synonyms in event extraction. By introducing trigger synonyms, 27% more good events were extracted in the first iteration, and thus BEAR had more resources to use in the unsupervised learning steps.
All is the combination of PMM and CBM, which demonstrates that both methods contribute to the final results. Furthermore, as explained before, new extraction rules are learned in each iteration cycle based on what was learned in prior cycles, and new rules are adopted only after they are tested for their projected accuracy (confidence score), so that the overall precision of the resulting rule set is maintained at a high level relative to the base run.

5 Conclusion and future work

In this paper, we presented a semi-supervised method for learning new event extraction patterns from un-annotated text. The techniques described here add significant new tools that increase the capabilities of information extraction technology in general, and more specifically of systems that are built by purely supervised methods or from manually designed rules. Our evaluation using the ACE dataset demonstrated that bootstrapping can be effectively applied to learning event extraction rules for 33 different types of events, and that the resulting system can significantly outperform a supervised system (the base run).
Some follow-up research issues include:
- New techniques are needed to recognize event descriptions that still evade the current pattern derivation techniques, especially for the events defined in the Personnel, Business, and Transactions classes.
- Adapting the bootstrapping method to extract events in a different language, e.g., Chinese or Arabic.
- Expanding this method to the extraction of larger scenarios, i.e., groups of correlated events that form coherent stories often described in larger sections of text, e.g., an event and its immediate consequences.
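The validation arithmetic of section 3.6 can be made concrete with a short sketch: score each candidate pattern from its match outcomes (full credit for a Full Match, the matched-to-total role ratio for a Partial Match, nothing for a False Alarm), then keep only candidates whose projected accuracy clears the learning threshold. This is an illustrative reconstruction under our own naming; the function names and the outcome-tuple layout are hypothetical, not the authors' implementation.

```python
def projected_accuracy(match_results):
    """Ratio of accumulated credit to all matches (cf. section 3.6).

    Each outcome is ("full",), ("partial", matched_roles, total_roles),
    or ("false_alarm",); false alarms contribute no credit.
    (Hypothetical data layout for illustration only.)
    """
    credit = 0.0
    for outcome in match_results:
        if outcome[0] == "full":
            credit += 1.0
        elif outcome[0] == "partial":
            matched, total_roles = outcome[1], outcome[2]
            credit += matched / total_roles
    return credit / len(match_results) if match_results else 0.0


def validate_candidates(candidates, threshold=0.5):
    """Keep only candidate patterns whose projected accuracy clears
    the learning threshold; `candidates` maps a pattern name to its
    list of match outcomes."""
    return {name: projected_accuracy(outcomes)
            for name, outcomes in candidates.items()
            if projected_accuracy(outcomes) >= threshold}
```

For the outcomes [("full",), ("partial", 2, 4), ("false_alarm",)] the score is (1 + 0.5 + 0) / 3 = 0.5, exactly the threshold value that section 4 finds optimal.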
References

Agichtein, E. and Gravano, L. 2000. Snowball: Extracting Relations from Large Plain-Text Collections. In Proceedings of the Fifth ACM International Conference on Digital Libraries.

Gale, W. A., Church, K. W., and Yarowsky, D. 1992. One sense per discourse. In Proceedings of the Workshop on Speech and Natural Language, 233-237. Harriman, New York: Association for Computational Linguistics.

Hong, Y., Zhang, J., Ma, B., Yao, J., Zhou, G., and Zhu, Q. 2011. Using Cross-Entity Inference to Improve Event Extraction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2011). Portland, Oregon, USA.

Ji, H. and Grishman, R. 2008. Refining Event Extraction Through Unsupervised Cross-document Inference. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2008). Ohio, USA.

Liao, S. and Grishman, R. 2010. Using Document Level Cross-Event Inference to Improve Event Extraction. In Proceedings of ACL 2010, 789-797. Uppsala, Sweden, July.

Lin, D. 1998. Dependency-based evaluation of MINIPAR. In Workshop on the Evaluation of Parsing Systems. Granada, Spain.

Liu, T. 2009. BEAR: Bootstrap Event and Relations from Text. Ph.D. Thesis.

Riloff, E. 1996. Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, 1044-1049. The AAAI Press/MIT Press.

Strzalkowski, T. and Wang, J. 1996. A self-learning universal concept spotter. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2, 931-936. Copenhagen, Denmark: Association for Computational Linguistics.

Sudo, K., Sekine, S., and Grishman, R. 2001. Automatic Pattern Acquisition for Japanese Information Extraction. In Proceedings of the Human Language Technology Conference (HLT 2001).

Sudo, K., Sekine, S., and Grishman, R. 2003. An improved extraction pattern representation model for automatic IE pattern acquisition. In Proceedings of ACL 2003, 224-231. Tokyo.

Thelen, M. and Riloff, E. 2002. A bootstrapping method for learning semantic lexicons using extraction pattern contexts. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, 214-222. Morristown, NJ: Association for Computational Linguistics.

Xu, F., Uszkoreit, H., and Li, H. 2007. A seed-driven bottom-up machine learning framework for extracting relations of various complexity. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, 584-591. Prague, Czech Republic.

Yangarber, R. and Grishman, R. 1998. NYU: Description of the Proteus/PET System as Used for MUC-7 ST. In Proceedings of the 7th Conference on Message Understanding.

Yangarber, R., Grishman, R., Tapanainen, P., and Huttunen, S. 2000. Unsupervised discovery of scenario-level patterns for information extraction. In Proceedings of the Sixth Conference on Applied Natural Language Processing (ANLP-NAACL 2000), 282-289.

Yarowsky, D. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, 189-196. Cambridge, Massachusetts: Association for Computational Linguistics.
CLex: A Lexicon for Exploring Color, Concept and Emotion Associations in Language
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 306-314, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
collect more linguistically rich color-concept annotations associated with mood, cognitive state, behavior and attitude. We also do not place any restrictions on color naming, which helps us to discover a rich lexicon of color terms and collocations that represent various hues, darkness, saturation and other natural-language collocations.
We also perform a comprehensive analysis of the data by investigating several questions, including: What affect terms are evoked by a certain color, e.g., positive vs. negative? What concepts are frequently associated with a particular color? What is the distribution of part-of-speech tags over concepts and affect terms in data collected without any preselected set of affect terms and concepts? What affect terms are strongly associated with a certain concept or a category of concepts, and is there any correlation with the semantic orientation of a concept?
Finally, we share our experience collecting the data using crowdsourcing, and describe advantages and disadvantages as well as the strategies we used to ensure high-quality annotations.

2 Related Work

Interestingly, some color-concept associations vary by culture and are influenced by the traditions and beliefs of a society. As shown in (Sable and Akcay, 2010), green represents danger in Malaysia, envy in Belgium, and love and happiness in Japan; red is associated with luck in China and Denmark, but with bad luck in Nigeria and Germany, and reflects ambition and desire in India.
Some expressions involving colors share the same meaning across many languages. For instance, "white heat" or "red heat" (the state of high physical and mental tension), "blue blood" (an aristocrat, royalty), "white-collar" or "blue-collar" (office clerks). However, there are some expressions where color associations differ across languages: e.g., a British or Italian "black eye" becomes blue in Germany, purple in Spain and "black butter" in France; your French, Italian and English neighbors are "green with envy" while Germans are "yellow with envy" (Bortoli and Maroto, 2001).
There has been little academic work on constructing color-concept and color-emotion lexicons. The work most closely related to ours collects concept-color (Mohammad, 2011c) and concept-emotion (EmoLex) associations, both relying on crowdsourcing. That project involved collecting color and emotion annotations for 10,170 word-sense pairs from the Macquarie Thesaurus (http://www.macquarieonline.com.au). They analyzed their annotations, looking for associations with the 11 basic color terms from Berlin and Kay (1988). The set of emotion labels used in their annotations was restricted to the 8 basic emotions proposed by Plutchik (1980). Their annotators were restricted to the US, and produced 4.45 annotations per word-sense pair on average.
There is also a commercial project, Cymbolism (http://www.cymbolism.com/), to collect concept-color associations. It has 561,261 annotations for a restricted set of 256 concepts, mainly nouns, adjectives and adverbs.
Other work on collecting the emotional aspect of concepts includes WordNet-Affect (WNA) (Strapparava and Valitutti, 2004), the General Inquirer (GI) (Stone et al., 1966), Affective Norms for English Words (Bradley and Lang, 1999) and Elliott's Affective Reasoner (Elliott, 1992).
The WNA lexicon is a set of affect terms from WordNet (Miller, 1995). It contains emotions, cognitive states, personality traits, behavior, attitude and feelings, e.g., joy, doubt, competitive, cry, indifference, pain. A total of 289 affect terms were manually extracted, but later the lexicon was extended using WordNet semantic relationships. WNA covers 1903 affect terms: 539 nouns, 517 adjectives, 238 verbs and 15 adverbs.
The General Inquirer covers 11,788 concepts labeled with 182 category labels, including certain affect categories (e.g., pleasure, arousal, feeling, pain) in addition to positive/negative semantic orientation for concepts (http://www.wjh.harvard.edu/inquirer/).
Affective Norms for English Words describes a manually collected set of normative emotional ratings for 1K English words that are rated in terms of emotional arousal (ranging from calm to excited), affective valence (ranging from pleasant to unpleasant) and dominance (ranging from in control to dominated).
Elliott's Affective Reasoner is a collection of programs that is able to reason about human emotions. The system covers a set of 26 emotion categories from Ortony et al. (1988).
Kaya (2004) and Strapparava and Ozbal (2010) both have worked on inferring emotions associated with colors using semantic similarity. Their
307
research found that Americans perceive red as excitement, yellow as cheer, purple as dignity, and associate blue with comfort and security. Other research includes work geared toward discovering culture-specific color-concept associations (Gage, 1993) and color preference, for example, in children vs. adults (Ou et al., 2011).

3 Data Collection

In order to collect color-concept and color-emotion associations, we use Amazon Mechanical Turk.5 It is a fast and relatively inexpensive way to get a large amount of data from many cultures all over the world.

5 http://www.mturk.com

3.1 MTurk and Data Quality

Amazon Mechanical Turk is a crowdsourcing platform that has been extensively used for obtaining low-cost human annotations for various linguistic tasks over the last few years (Callison-Burch, 2009). The quality of the data obtained from non-expert annotators, also referred to as workers or turkers, was investigated by Snow et al. (2008). Their empirical results show that the quality of non-expert annotations is comparable to the quality of expert annotations on a variety of natural language tasks, but the cost of the annotation is much lower.

There are various quality control strategies that can be used to ensure annotation quality. For instance, one can restrict a crowd by creating a pilot task that allows only workers who passed the task to proceed with annotations (Chen and Dolan, 2011). In addition, new quality control mechanisms have recently been introduced, e.g., Masters: groups of workers who are trusted for their consistently high-quality annotations, but who cost more to employ.

Our task required direct natural language input from workers and did not include any multiple-choice questions (which tend to attract more cheating). Thus, we limited our quality control efforts to (1) checking for empty input fields and (2) blocking copy/paste functionality on the form. We did not ask workers to complete any qualification tasks, because it is impossible to have gold-standard answers for color-emotion and color-concept associations. In addition, we limited our crowd to a set of trusted workers who had been consistently working on similar tasks for us.

3.2 Task Design

Our task was designed to collect a linguistically rich set of color terms, emotions, and concepts associated with a large set of colors, specifically the 152 RGB values corresponding to facial features of cartoon human avatars. In total we had 36 colors for hair/eyebrows, 18 for eyes, 27 for lips, 26 for eye shadows, 27 for facial mask and 18 for skin. These data are necessary to achieve our long-term goal, which is to model natural human-computer interactions in a virtual world domain such as the avatar editor.

We designed two MTurk tasks. For Task 1, we showed a swatch for one RGB value and asked 50 workers to name the color, describe the emotions this color evokes, and define a set of concepts associated with that color. For Task 2, we showed a particular facial feature and a swatch in a particular color, and asked 50 workers to name the color and describe the concepts and emotions associated with that color. Figure 1 shows what would be presented to a worker for Task 2.

Q1. How would you name this color?
Q2. What emotion does this color evoke?
Q3. What concepts do you associate with it?

Figure 1: Example of MTurk Task 2. Task 1 is the same except that only a swatch is given.

The design that we suggested has a minor limitation in that a color swatch may display differently on different monitors. However, we hope to overcome this issue by collecting 50 annotations per RGB value. Example color →e emotion →c concept associations produced by different annotators (ai) are shown below:

[R=222, G=207, B=186] (a1) light golden yellow →e purity, happiness →c butter cookie, vanilla; (a2) gold →e cheerful, happy →c sun, corn; (a3) golden →e sexy →c beach, jewelery.

[R=218, G=97, B=212] (a4) pinkish purple →e peace, tranquility, stressless →c justin bieber's headphones, someday perfume; (a5) pink →e happiness →c rose, bougainvillea.

In addition, we collected data about workers' gender, age, native language, number of years of experience with English, and color preferences. This data is useful for investigating variance in annotations of color-emotion-concept associations among workers from different cultural and linguistic backgrounds.

4 Data Analysis

We collected 15,200 annotations, evenly divided between the two tasks, over 12 days. In total, 915 workers (41% male, 51% female and 8% who did not specify), mainly from India and the United States, completed our tasks, as shown in Table 1. 18% of the workers produced 20 or more annotations. They spent 78 seconds on average per annotation, with an average pay rate of $2.3 per hour ($0.05 per completed task).

Table 1: Demographic information about annotators: top 5 countries represented in our dataset.

Country          Annotations
India            7844
United States    5824
Canada           187
United Kingdom   172
Colombia         100

In total, we collected 2,315 unique color terms, 3,397 unique affect terms, and 1,957 unique concepts for the given 152 RGB values. In the sections below we discuss our findings on color naming, color-emotion and color-concept associations. We also give a comparison of annotated affect terms and concepts from CLEX and other existing lexicons.

4.1 Color Terms

Berlin and Kay (1988) state that as languages evolve they acquire new color terms in a strict chronological order. When a language has only two color terms, they are white (light, warm) and black (dark, cold). English is considered to have 11 basic colors: white, black, red, green, yellow, blue, brown, pink, purple, orange and gray, which is known as the B&K order.

In addition, colors can be distinguished along at most three independent dimensions: hue (olive, orange), darkness (dark, light, medium), saturation (grayish, vivid), and brightness (deep, pale) (Mojsilovic, 2002). Interestingly, we observe these dimensions in CLEX by looking for B&K color terms and their frequent collocations. We present the top 10 collocations for the B&K colors in Table 2. As can be seen, color terms truly are distinguished by darkness, saturation and brightness terms, e.g., light, dark, greenish, deep. In addition, we find that color terms are also associated with color-specific collocations, e.g., sky blue, chocolate brown, pea green, salmon pink, carrot orange. These collocations were produced by annotators to describe the color of particular RGB values. We investigate these color-concept associations in more detail in Section 4.3.

Table 2: Top 10 color term collocations for the 11 B&K colors; co-occurrences are sorted by frequency in decreasing order; P = Σ(i=1..10) p(wi | color) is the total estimated probability of the top 10 co-occurrences.

Color    Top 10 co-occurrences                                                              P
white    off, antique, half, dark, black, bone, milky, pale, pure, silver                   0.62
black    light, blackish brown, brownish, brown, jet, dark, green, off, ash, blackish grey  0.43
red      dark, light, dish brown, brick, orange, brown, indian, dish, crimson, bright       0.59
green    dark, light, olive, yellow, lime, forest, sea, dark olive, pea, dirty              0.54
yellow   light, dark, green, pale, golden, brown, mustard, orange, deep, bright             0.63
blue     light, sky, dark, royal, navy, baby, grey, purple, cornflower, violet              0.55
brown    dark, light, chocolate, saddle, reddish, coffee, pale, deep, red, medium           0.67
pink     dark, light, hot, pale, salmon, baby, deep, rose, coral, bright                    0.55
purple   light, dark, deep, blue, bright, medium, pink, pinkish, bluish, pretty             0.69
orange   light, burnt, red, dark, yellow, brown, brownish, pale, bright, carrot             0.68
gray     dark, light, blue, brown, charcoal, leaden, greenish, grayish blue, pale, grayish brown  0.62

In total, CLEX has 2,315 unique color
names for the set of 152 RGB values. The inter-annotator agreement rate on color naming is shown in Table 3. We report free-marginal Kappa (Randolph, 2005) because we did not force annotators to assign a certain number of RGB values to a certain number of color terms. Additionally, we report inter-annotator agreement for an exact string match (e.g., purple, green) and for a substring match (e.g., pale yellow = yellow = golden yellow).

Table 3: Inter-annotator agreement on assigning names to RGB values: 100 annotators, 152 RGB values and 16 color categories, including the 11 B&K colors, 4 additional colors and "none of the above".

Agreement                 Color Term         Value
% of overall agreement    Exact match        0.492
                          Substring match    0.461
Free-marginal Kappa       Exact match        0.458
                          Substring match    0.424

4.2 Color-Emotion Associations

In total, the CLEX lexicon has 3,397 unique affect terms representing feelings (calm, pleasure), emotions (joy, love, anxiety), attitudes (indifference, caution), and moods (anger, amusement). The affect terms in CLEX include the 8 basic emotions from (Plutchik, 1980): joy, sadness, anger, fear, disgust, surprise, trust and anticipation.6

CLEX is a very rich lexicon because we did not restrict our annotators to any specific set of affect terms. A wide range of parts of speech is represented, as shown in the first column of Table 4. For instance, the term love is represented by other semantically related terms such as lovely, loved, loveliness, loveless and love-able, and the term joy is represented as enjoy, enjoyable, enjoyment, joyful, joyfulness and overjoyed.

Table 4: Main syntactic categories for affect terms and concepts in CLEX.

POS          Affect Terms, %    Concepts, %
Nouns        79                 52
Adjectives   12                 29
Adverbs      3                  5
Verbs        6                  12

6 The set of 8 Plutchik emotions is a superset of the emotions from (Ekman, 1992).

The manually constructed portion of WordNet-Affect includes 101 positive and 188 negative affect terms (Strapparava and Valitutti, 2004). Of this set, 41% appeared at least once in CLEX. We also looked specifically at the set of terms labeled as emotions in the WordNet-Affect hierarchy. Of these, 12 are positive emotions and 10 are negative emotions.

We found that 9 out of 12 positive emotion terms (all except self-pride, levity and fearlessness) and 9 out of 10 negative emotion terms (all except ingratitude) also appear in CLEX, as shown in Table 5. Thus, we can conclude that annotators do not associate any colors with self-pride, levity, fearlessness and ingratitude. In addition, some emotions were associated with colors more frequently than others. For instance, positive emotions like calmness, joy and love are more frequent in CLEX than expectation and gratitude; negative emotions like sadness and fear are more frequent than shame, humility and daze.

Table 5: WordNet-Affect positive and negative emotion terms from CLEX. Emotions are sorted by frequency in decreasing order, from the total of 27,802 annotations.

Positive      Freq.    Negative     Freq.
calmness      1045     sadness      356
joy           527      fear         250
love          482      anxiety      55
hope          147      despair      19
affection     86       compassion   10
enthusiasm    33       dislike      8
liking        5        shame        5
expectation   3        humility     3
gratitude     3        daze         1

Next, we analyze the color-emotion associations in CLEX in more detail and compare them with the only other publicly available color-emotion lexicon, EMOLEX. Recall that EMOLEX (Mohammad, 2011a) has the 11 B&K colors associated with the 8 basic positive and negative emotions from (Plutchik, 1980). Affect terms in CLEX are not labeled as conveying positive or negative emotions. Instead, we use the 289 affect terms that overlap between WordNet-Affect and CLEX and propagate labels from WordNet-Affect to the corresponding affect terms in CLEX. As a result, we discover positive and negative affect term associations with the 11 B&K colors. Table 6 shows the percentage of positive and negative affect term associations with colors for both CLEX and EMOLEX.
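The free-marginal kappa reported in Table 3 can be computed directly from per-item category counts. The sketch below illustrates Randolph's (2005) statistic; it is not the authors' code, and the example counts are invented rather than taken from the paper's annotations.

```python
# Illustrative sketch of Randolph's free-marginal multirater kappa, the
# agreement statistic reported in Table 3. Chance agreement is fixed at
# 1/k for k available categories (16 colour categories in the paper).
# The example data below are invented, not the paper's annotations.
def free_marginal_kappa(items, k):
    """items: one dict per RGB value mapping category -> number of raters
    who chose it, e.g. {'red': 40, 'pink': 10}. k: number of categories."""
    observed = 0.0
    for counts in items:
        n = sum(counts.values())                    # raters for this item
        pairs_agreeing = sum(c * (c - 1) for c in counts.values())
        observed += pairs_agreeing / (n * (n - 1))  # per-item agreement
    p_o = observed / len(items)
    p_e = 1.0 / k                                   # free-marginal chance agreement
    return (p_o - p_e) / (1.0 - p_e)

# Perfect agreement yields kappa = 1; an even 25/25 split yields far less.
print(free_marginal_kappa([{"red": 50}], k=16))           # 1.0
print(free_marginal_kappa([{"red": 25, "pink": 25}], k=16))
```

The fixed 1/k chance term is what distinguishes this statistic from Fleiss' fixed-marginal kappa, and it is appropriate here because annotators were free to pick any category for any RGB value.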
Table 6: The percentage of affect terms associated with the B&K colors in CLEX and EMOLEX (similar color-emotion associations are shown in bold).

         Positive         Negative
         CLEX    EL       CLEX    EL
white    2.5     20.1     0.3     2.9
black    0.6     3.9      9.3     28.3
red      1.7     8.0      8.2     21.6
green    3.3     15.5     2.7     4.7
yellow   3.0     10.8     0.7     6.9
blue     5.9     12.0     1.6     4.1
brown    6.5     4.8      7.6     9.4
pink     5.6     7.8      1.1     1.2
purple   3.1     5.7      1.8     2.5
orange   1.6     5.4      1.7     3.8
gray     1.0     5.7      3.6     14.1

The percentages of color-emotion associations in CLEX and EMOLEX differ because the set of affect terms in CLEX consists of 289 positive and negative affect terms, compared to 8 affect terms in EMOLEX. Nevertheless, we observe the same pattern as (Mohammad, 2011a) for negative emotions: they are associated with the colors black, red and gray, except that yellow becomes a color of positive emotions in CLEX. Moreover, we found the associations with the color brown to be ambiguous, as it was associated with both positive and negative emotions. In addition, we did not observe strong associations between white and positive emotions. This may be because white is the color of grief in India. The rest of the positive emotions follow the EMOLEX pattern and are associated with the colors green, pink, blue and purple.

Next, we perform a detailed comparison between the CLEX and EMOLEX color-emotion associations for the 11 B&K colors and the 8 basic emotions from (Plutchik, 1980) in Table 7. Recall that the annotations in EMOLEX were done by workers from the USA only. Thus, we report two numbers for CLEX: annotations from workers from the USA (CA) and all annotations (C). We take the EMOLEX results from (Mohammad, 2011c). We observe a strong correlation between the CLEX and EMOLEX affect lexicons for some color-emotion associations. For instance, anger has a strong association with red and brown, anticipation with green, fear with black, joy with pink, sadness with black, brown and gray, surprise with yellow and orange, and, finally, trust is associated with blue and brown. Nonetheless, we also found a disagreement in color-emotion associations between CLEX and EMOLEX. For instance, anticipation is associated with orange in CLEX, compared to white, red or yellow in EMOLEX. We also found quite a few inconsistent associations with the disgust emotion. This inconsistency may be explained by several reasons: (a) EMOLEX associates emotions with colors through concepts, while CLEX has color-emotion associations obtained directly from annotators; (b) CLEX has 3,397 affect terms, compared to 8 basic emotions in EMOLEX, and may therefore introduce some ambiguous color-emotion associations.

Finally, we investigate cross-cultural differences in color-emotion associations between the two most representative groups of our annotators: US-based and India-based. We consider the 8 Plutchik emotions and allow associations with all possible color terms (rather than only the 11 B&K colors). We show the top 5 colors associated with each emotion for the two groups of annotators in Figure 2. For example, we found that US-based annotators associate pink with joy and dark brown with trust, whereas India-based annotators associate yellow with joy and blue with trust.

4.3 Color-Concept Associations

In total, workers annotated the 152 RGB values with 37,693 concepts, which is on average 2.47 concepts per annotation, compared to 1.82 affect terms. CLEX contains 1,957 unique concepts, including 1,667 nouns, 23 verbs, 28 adjectives, and 12 adverbs. We investigate the overlap of concepts by part-of-speech tag between CLEX and other lexicons, including EMOLEX (EL), the Affective Norms for English Words (AN) and the General Inquirer (GI). The results are shown in Table 8.

Finally, we generate concept clusters associated with the colors yellow, white and brown in Figure 3. From the clusters, we observe that the most frequent k concepts associated with these colors correlate with either positive or negative emotion. For example, white is frequently associated with snow, milk and cloud, and all of these concepts evoke positive emotions. This observation helps resolve the ambiguity in color-emotion associations we found in Table 7.

5 Conclusions

We have described a large-scale crowdsourcing effort aimed at constructing a rich color-emotion-
               white  black  red   green  yellow  blue  brown  pink  purple  orange  grey
anger      C    -     3.6    43.4  0.3    0.3     0.3   3.3    0.6   0.3     1.5     2.1
           CA   -     3.8    40.6  0.8    -       -     4.5    -     0.8     2.3     0.8
           EA   2.1   30.7   32.4  5.0    5.0     2.4   6.6    0.5   2.3     2.5     9.9
sadness    C    0.3   24.0   0.3   0.6    0.3     4.2   11.4   0.3   2.2     0.3     10.3
           CA   -     22.2   -     0.6    -       5.3   9.4    -     4.1     -       12.3
           EA   3.0   36.0   18.6  3.4    5.4     5.8   7.1    0.5   1.4     2.1     16.1
fear       C    0.8   43.0   8.9   2.0    1.2     0.4   6.1    0.4   0.8     0.4     2.0
           CA   -     29.5   10.5  3.2    1.1     -     3.2    -     1.1     1.1     4.2
           EA   4.5   31.8   25.0  3.5    6.9     3.0   6.1    1.3   2.3     3.3     11.8
disgust    C    -     2.3    1.1   11.2   1.1     1.1   24.7   1.1   3.4     1.1     -
           CA   -     -      -     14.8   1.8     -     33.3   -     1.8     -       -
           EA   2.0   33.7   24.9  4.8    5.5     1.9   9.7    1.1   1.8     3.5     10.5
joy        C    1.0   0.2    0.2   3.4    5.7     4.2   4.2    9.1   4.4     4.0     0.6
           CA   0.9   -      0.3   3.3    4.5     4.8   2.7    10.6  4.2     3.9     0.6
           EA   21.8  2.2    7.4   14.1   13.4    11.3  3.1    11.1  6.3     5.8     2.8
trust      C    -     -      1.2   3.5    1.2     17.4  8.1    1.2   1.2     5.8     1.2
           CA   -     -      3.0   6.1    3.0     3.0   9.1    -     -       3.0     3.0
           EA   22.0  6.3    8.4   14.2   8.3     14.4  5.9    5.5   4.9     3.8     5.8
surprise   C    -     -      -     3.3    6.7     6.7   3.3    3.3   6.7     13.3    3.3
           CA   -     -      -     -      5.6     5.6   -      5.6   11.1    11.1    -
           EA   11.0  13.4   21.0  8.3    13.5    5.2   3.4    5.2   4.1     5.6     8.8
anticipation C  -     -      -     5.3    5.3     -     5.3    5.3   -       15.8    5.3
           CA   -     -      -     -      -       -     -      10.0  -       10.0    10.0
           EA   16.2  7.5    11.5  16.2   10.7    9.5   5.7    5.9   3.1     4.9     8.4

Table 7: The percentage of the 8 basic emotions associated with the 11 B&K colors in CLEX vs. EMOLEX; e.g., sadness is associated with black by 36% of annotators in EMOLEX (EA), by 22.2% in CLEX (CA, US-based annotators only), and by 24% in CLEX (C, all annotators). Zero associations are reported as "-".

Figure 2: Apparent cross-cultural differences in color-emotion associations between US- and India-based annotators: (a) Joy (US: 331, I: 154); (b) Trust (US: 33, I: 47); (c) Surprise (US: 18, I: 12); (d) Anticipation (US: 10, I: 9); (e) Anger (US: 133, I: 160); (f) Sadness (US: 171, I: 142); (g) Fear (US: 95, I: 105); (h) Disgust (US: 54, I: 16). 10.6% of US workers associated joy with pink, while 7.1% of India-based workers associated joy with yellow (based on 331 joy associations from the US and 154 from India).
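The Figure 2 comparison can be reproduced from raw annotation records with simple counting. This is a toy sketch with invented (country, emotion, colour) triples, not the released data or the authors' code.

```python
from collections import Counter, defaultdict

# Toy sketch of the Figure 2 analysis: rank the colours most frequently
# associated with each emotion, separately per annotator group. The
# records below are invented (country, emotion, colour) triples.
def top_colors(records, n=5):
    counts = defaultdict(Counter)    # (country, emotion) -> colour counts
    for country, emotion, color in records:
        counts[(country, emotion)][color] += 1
    return {key: [c for c, _ in cnt.most_common(n)]
            for key, cnt in counts.items()}

records = [
    ("US", "joy", "pink"), ("US", "joy", "pink"), ("US", "joy", "yellow"),
    ("India", "joy", "yellow"), ("India", "joy", "yellow"), ("India", "joy", "blue"),
]
ranked = top_colors(records, n=2)
print(ranked[("US", "joy")])     # ['pink', 'yellow']
print(ranked[("India", "joy")])  # ['yellow', 'blue']
```

Ranking per group rather than pooling all annotators is what exposes the divergences the paper reports, such as pink vs. yellow for joy.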
Figure 3: Concept clusters of color-concept associations for the ambiguous colors: (a) yellow, (b) brown, (c) white.

concept association lexicon, CLEX. This lexicon links concepts, color terms and emotions to specific RGB values, and may help to disambiguate objects when modeling conversational interactions in many domains. We have examined the association between color terms and positive or negative emotions.

Our work also investigated cross-cultural differences in color-emotion associations between India- and US-based annotators. We identified frequent color-concept associations, which suggests that concepts associated with a particular color may express the same sentiment as the color.

Our future work includes applying statistical inference to discover a hidden structure of concept-emotion associations. Moreover, automatically identifying the strength of association between a particular concept and emotions is another task, one more difficult than just identifying the polarity of a word. We are also interested in using a similar approach to investigate

Table 8: The overlap of concepts by part-of-speech tag between CLEX and existing lexicons. CLEX∩GI stands for the intersection of sets; CLEX\GI denotes the difference of sets.

         CLEX∩AN    CLEX∩EL    CLEX∩GI
Noun     287        574        708
Verb     4          13         17
Adj      28         53         66
Adv      1          2          3
Total    320        642        794

AN\CLEX: 712      EL\CLEX: 7,445    GI\CLEX: 11,101
CLEX\AN: 1,637    CLEX\EL: 1,315    CLEX\GI: 1,163

the way that colors are associated with concepts and emotions in languages other than English.

Acknowledgments

We are grateful to everyone in the NLP group at Microsoft Research for helpful discussion and feedback, especially Chris Brockett, Piali Choudhury, and Hassan Sajjad. We thank Natalia Rud from Tyumen State University, Center of Linguistic Education, for helpful comments and suggestions.

References

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: machine learning for text-based emotion prediction. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 579-586, Stroudsburg, PA, USA. Association for Computational Linguistics.

Brent Berlin and Paul Kay. 1988. Basic Color Terms: their Universality and Evolution. Berkeley: University of California Press.

M. Bortoli and J. Maroto. 2001. Translating colors in web site localisation. In Proceedings of European Languages and the Implementation of Communication and Information Technologies (Elicit).

M. Bradley and P. Lang. 1999. Affective norms for English words (ANEW): Instruction manual and affective ratings.

Chris Callison-Burch. 2009. Fast, cheap, and creative: evaluating translation quality using Amazon's Mechanical Turk. In EMNLP '09: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 286-295, Stroudsburg, PA, USA. Association for Computational Linguistics.
David L. Chen and William B. Dolan. 2011. Building a persistent workforce on Mechanical Turk for multilingual data collection. In Proceedings of The 3rd Human Computation Workshop (HCOMP 2011), August.

Paul Ekman. 1992. An argument for basic emotions. Cognition & Emotion, 6(3):169-200.

Clark Davidson Elliott. 1992. The affective reasoner: a process model of emotions in a multi-agent system. Ph.D. thesis, Evanston, IL, USA. UMI Order No. GAX92-29901.

J. Gage. 1993. Color and Culture: Practice and Meaning from Antiquity to Abstraction. University of California Press.

C. Hardin and L. Maffi. 1997. Color Categories in Thought and Language. Cambridge University Press.

N. Jacobson and W. Bender. 1996. Color as a determined communication. IBM Systems Journal, 35:526-538, September.

N. Kaya. 2004. Relationship between color and emotion: a study of college students. College Student Journal.

Efthymios Kouloumpis, Theresa Wilson, and Johanna Moore. 2011. Twitter sentiment analysis: The good the bad and the OMG! In Proc. ICWSM.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38:39-41.

Saif M. Mohammad and Peter D. Turney. 2010. Emotions evoked by common words and phrases: using Mechanical Turk to create an emotion lexicon. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, CAAGET '10, pages 26-34, Stroudsburg, PA, USA. Association for Computational Linguistics.

Saif Mohammad. 2011a. Colourful language: Measuring word-colour associations. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, pages 97-106, Portland, Oregon, USA, June. Association for Computational Linguistics.

Saif Mohammad. 2011b. From once upon a time to happily ever after: Tracking emotions in novels and fairy tales. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 105-114, Portland, OR, USA, June. Association for Computational Linguistics.

Saif M. Mohammad. 2011c. Even the abstract have colour: consensus in word-colour associations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, HLT '11, pages 368-373, Stroudsburg, PA, USA. Association for Computational Linguistics.

Aleksandra Mojsilovic. 2002. A method for color naming and description of color composition in images. In Proc. IEEE Int. Conf. Image Processing, pages 789-792.

Andrew Ortony, Gerald L. Clore, and Allan Collins. 1988. The Cognitive Structure of Emotions. Cambridge University Press, July.

Li-Chen Ou, M. Ronnier Luo, Pei-Li Sun, Neng-Chung Hu, and Hung-Shing Chen. 2011. Age effects on colour emotion, preference, and harmony. Color Research and Application.

R. Plutchik. 1980. A general psychoevolutionary theory of emotion, pages 3-33. Academic Press, New York.

Justus J. Randolph. 2005. Free-marginal multirater kappa: An alternative to Fleiss' fixed-marginal multirater kappa.

P. Sable and O. Akcay. 2010. Color: Cross cultural marketing perspectives as to what governs our response to it. In Proceedings of ASSBS, volume 17.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast - but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 254-263, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, and Daniel M. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press.

Carlo Strapparava and Gozde Ozbal. 2010. The color of emotions in text. COLING, pages 28-32.

C. Strapparava and A. Valitutti. 2004. WordNet-Affect: an affective extension of WordNet. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, pages 1083-1086.
Extending the Entity-based Coherence Model with Multiple Ranks

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 315-324, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
(Section 5.1.2) and different distributions of permutations used in training (Section 5.1.3). We show that these two aspects are crucial, depending on the characteristics of the dataset.

2 Entity-based Coherence Model

2.1 Document Representation

The original entity-based coherence model is based on the assumption that a document makes repeated reference to elements of a set of entities that are central to its topic. For a document d, an entity grid is constructed, in which the columns represent the entities referred to in d, and the rows represent the sentences. Each cell corresponds to the grammatical role of an entity in the corresponding sentence: subject (S), object (O), neither (X), or nothing (-). An example fragment of an entity grid is shown in Table 1; it shows the representation of three sentences from a text on a Philippine earthquake.

Table 1: A fragment of an entity grid for five entities across three sentences.

     Manila  Miles  Island  Quake  Baco
1    X       X      -       -      -
2    S       O      -       -      -
3    X       X      X       X      X

B&L define a local transition as a sequence {S, O, X, -}^n, representing the occurrence and grammatical roles of an entity in n adjacent sentences. Such transition sequences can be extracted from the entity grid as continuous subsequences in each column. For example, the entity Manila in Table 1 has a bigram transition {S, X} from sentence 2 to 3. The entity grid is then encoded as a feature vector Φ(d) = (p1(d), p2(d), ..., pm(d)), where pt(d) is the probability of the transition t in the entity grid, and m is the number of transitions with length no more than a predefined optimal transition length k. pt(d) is computed as the number of occurrences of t in the entity grid of document d, divided by the total number of transitions of the same length in the entity grid.

For entity extraction, Barzilay and Lapata (2008) had two conditions: Coreference+ and Coreference-. In Coreference+, entity coreference relations in the document were resolved by an automatic coreference resolution tool (Ng and Cardie, 2002), whereas in Coreference-, nouns are simply clustered by string matching.

2.2 Evaluation Tasks

The two evaluation tasks for Barzilay and Lapata (2008)'s entity-based model are sentence ordering and summary coherence rating.

In sentence ordering, a set of random permutations is created for each source document, and the learning procedure is conducted on this synthetic mixture of coherent and incoherent documents. Barzilay and Lapata (2008) experimented on two datasets: news articles on the topic of earthquakes (Earthquakes) and narratives on the topic of aviation accidents (Accidents). A training data instance is constructed as a pair consisting of a source document and one of its random permutations, and the permuted document is always considered to be less coherent than the source document. The entity transition features are then used to train a support vector machine ranker (Joachims, 2002) to rank the source documents higher than the permutations. The model is tested on a different set of source documents and their permutations, and the performance is evaluated as the fraction of correct pairwise rankings in the test set.

In summary coherence rating, a similar experimental framework is adopted. However, in this task, rather than training and evaluating on a set of synthetic data, system-generated summaries and human-composed reference summaries from the Document Understanding Conference (DUC 2003) were used. Human annotators were asked to give a coherence score on a seven-point scale for each item. The pairwise ranking preferences between summaries generated from the same input document cluster (excluding the pairs consisting of two human-written summaries) are used by a support vector machine ranker to learn a discriminant function that ranks each pair according to their coherence scores.

2.3 Extended Models

Filippova and Strube (2007) applied Barzilay and Lapata's model to a German corpus of newspaper articles with manual syntactic, morphological, and NP coreference annotations provided. They further clustered entities by semantic relatedness as computed by the WikiRelate! API (Strube and Ponzetto, 2006). Though the improvement was not significant, interestingly, a short subsection in their paper described their approach to extending pairwise rankings to longer rankings by supplying the learner with rankings of all renderings as computed by Kendall's τ, which is one of the extensions considered in this paper. Although Filippova and Strube simply discarded this idea because it hurt accuracies when tested on their data, we found it a promising direction for further exploration. Cheung and Penn (2010) adapted the standard entity-based coherence model to the same German corpus, but replaced the original linguistic dimension used by Barzilay and Lapata (2008), grammatical role, with topological field information, and showed that for German text such a modification improves accuracy.

For English text, two extensions have been proposed recently. Elsner and Charniak (2011) augmented the original features used in the standard entity-based coherence model with a large number of entity-specific features, and their extension significantly outperformed the standard model on two tasks: document discrimination (another name for sentence ordering) and sentence insertion. Lin et al. (2011) adapted the entity grid representation in the standard model into a discourse role matrix, in which additional discourse information about the document was encoded. Their extended model significantly improved ranking accuracies on the same two datasets used by Barzilay and Lapata (2008), as well as on the Wall Street Journal corpus.

However, while enriching or modifying the original features used in the standard model is certainly a direction for refinement of the model, it usually requires more training data or a more sophisticated feature representation. In this paper, we instead modify the learning approach and propose a concise and highly adaptive extension that can easily be combined with other extended features or applied to different languages.

3 Experimental Design

Following Barzilay and Lapata (2008), we wish to train a discriminative model to give the cor-

3.1 Sentence Ordering

In the standard entity-based model, a discriminative system is trained on the pairwise rankings between source documents and their permutations (see Section 2.2). However, a model learned from these pairwise rankings is not sufficiently fine-grained, since the subtle differences between the permutations are not learned. Our major contribution is to further differentiate among the permutations generated from the same source documents, rather than simply treating them all as being of the same degree of coherence.

Our fundamental assumption is that there exists a canonical ordering for the sentences of a document; therefore, we can approximate the degree of coherence of a document by the similarity between its actual sentence ordering and that canonical sentence ordering. Practically, we automatically assign an objective score to each permutation to estimate its dissimilarity from the source document (see Section 4). By learning from all the pairs across a source document and its permutations, the effective size of the training data is increased while no further manual annotation is required, which is favorable in real applications, where available samples with manually annotated coherence scores are usually limited. For r source documents, each with m random permutations, the number of training instances in the standard entity-based model is therefore r x m, while in our multiple-rank model learning process it is r x (m(m+1)/2 - 1) = r x (m^2 + m - 2)/2 > r x m, when m > 2.

3.2 Summary Coherence Rating

Compared to the standard entity-based coherence model, our major contribution in this task is to show that, by automatically assigning an objective score to each machine-generated summary to estimate its dissimilarity from the human-generated summary from the same input document cluster, we are able to achieve performance competitive with, or even superior to, that of B&L's model without knowing the true coherence scores given by human judges.

Evaluating our multiple-rank model in this task is crucial, since in summary coherence rating,
rect ranking preference between two documents the coherence violations that the reader might en-
in terms of their degree of coherence. We experi- counter in real machine-generated texts can be
ment on the same two tasks as in their work: sen- more precisely approximated, while the sentence
tence ordering and summary coherence rating. ordering task is only partially capable of doing so.
317
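The instance-count comparison above can be made concrete with a short sketch; the function name and the "source"/"perm" labels are ours, purely illustrative:

```python
from itertools import combinations

def training_pairs(num_perms, multiple_rank=True):
    """Enumerate ranking preferences for one source document.

    The source document always outranks its permutations.  The
    standard two-rank model yields only source-vs-permutation pairs
    (m per document); the multiple-rank model also orders the
    permutations among themselves, giving C(m+1, 2) pairs when all
    ranks are distinct.
    """
    renderings = ["source"] + ["perm_%d" % i for i in range(1, num_perms + 1)]
    if multiple_rank:
        return list(combinations(renderings, 2))
    return [("source", p) for p in renderings[1:]]

# With m = 20 permutations per document, each source document
# contributes 20 standard pairs but 21 * 20 / 2 = 210 multiple-rank pairs.
```

With stratified ranks (Section 5.1.1), pairs that happen to share a rank would carry no preference, so C(m+1, 2) is an upper bound on the usable pairs.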
4 Dissimilarity Metrics

As mentioned previously, the subtle differences among the permutations of the same source document can be used to refine the model learning process. Considering an original document d and one of its permutations, we call σ = (1, 2, . . . , N) the reference ordering, which is the sentence ordering in d, and π = (o_1, o_2, . . . , o_N) the test ordering, which is the sentence ordering in that permutation, where N is the number of sentences rendered in both documents.

In order to approximate different degrees of coherence among the set of permutations which bear the same content, we need a suitable metric to quantify the dissimilarity between the test ordering π and the reference ordering σ. Such a metric needs to satisfy the following criteria: (1) it can be automatically computed while being highly correlated with human judgments of coherence, since additional manual annotation is certainly undesirable; (2) it depends on the particular sentence ordering in a permutation while remaining independent of the entities within the sentences; otherwise our multiple-rank model might be trained to fit particular probability distributions of entity transitions rather than true coherence preferences. In our work we use three different metrics: Kendall's τ distance, average continuity, and edit distance.

Kendall's τ distance: This metric has been widely used in evaluation of sentence ordering (Lapata, 2003; Lapata, 2006; Bollegala et al., 2006; Madnani et al., 2007).¹ It measures the disagreement between two orderings σ and π in terms of the number of inversions of adjacent sentences necessary to convert one ordering into the other. Kendall's τ distance is defined as

    τ = 2m / (N(N − 1)),

where m is the number of sentence inversions necessary to convert π to σ.

Footnote 1: Filippova and Strube (2007) found that their performance dropped when using this metric for longer rankings; but they were using data in a different language and with manual annotations, so its effect on our datasets is worth trying nonetheless.

Average continuity (AC): Following Zhang (2011), we use average continuity as the second dissimilarity metric. It was first proposed by Bollegala et al. (2006). This metric estimates the quality of a particular sentence ordering by the number of correctly arranged continuous sentences, compared to the reference ordering. For example, if π = (. . . , 3, 4, 5, 7, . . . , o_N), then {3, 4, 5} is considered continuous while {3, 4, 5, 7} is not. Average continuity is calculated as

    AC = exp( (1 / (n − 1)) · Σ_{i=2}^{n} log(P_i + α) ),

where n = min(4, N) is the maximum number of continuous sentences to be considered, and α = 0.01. P_i is the proportion of continuous sentences of length i in π that are also continuous in the reference ordering σ. To represent the dissimilarity between the two orderings π and σ, we use its complement AC′ = 1 − AC, such that the larger AC′ is, the more dissimilar the two orderings are.²

Footnote 2: We will refer to AC′ as AC from now on.

Edit distance (ED): Edit distance is a commonly used metric in information theory to measure the difference between two sequences. Given a test ordering π, its edit distance is defined as the minimum number of edits (i.e., insertions, deletions, and substitutions) needed to transform it into the reference ordering σ. For permutations, the edits are essentially movements, which can be considered as equal numbers of insertions and deletions.

5 Experiments

5.1 Sentence Ordering

Our first set of experiments is on sentence ordering. Following Barzilay and Lapata (2008), we use all transitions of length ≤ 3 for feature extraction. In addition, we explore three specific aspects in our experiments: rank assignment, entity extraction, and permutation generation.

5.1.1 Rank Assignment

In our multiple-rank model, pairwise rankings between a source document and its permutations are extended into a longer ranking with multiple ranks. We assign a rank to a particular permutation based on the result of applying a chosen dissimilarity metric from Section 4 (τ, AC, or ED) to the sentence ordering in that permutation. We experiment with two different approaches to assigning ranks to permutations.
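As a concrete reference, the three metrics above can be sketched as follows; the function names and the list-based representation of orderings are ours, not the paper's:

```python
import math

def kendall_tau_distance(test, ref):
    """tau = 2m / (N(N-1)), where m counts pairs of sentences that
    appear in opposite relative order in the two orderings
    (equivalently, the number of adjacent-swap inversions)."""
    pos = {s: i for i, s in enumerate(ref)}
    n = len(test)
    m = sum(1 for i in range(n) for j in range(i + 1, n)
            if pos[test[i]] > pos[test[j]])
    return 2.0 * m / (n * (n - 1))

def average_continuity_dissim(test, ref, alpha=0.01):
    """AC' = 1 - exp((1/(n-1)) * sum_{i=2..n} log(P_i + alpha)),
    where P_i is the share of length-i runs of test that are also
    contiguous, in order, in ref, and n = min(4, N)."""
    pos = {s: i for i, s in enumerate(ref)}
    n = min(4, len(test))
    total = 0.0
    for i in range(2, n + 1):
        runs = [test[j:j + i] for j in range(len(test) - i + 1)]
        cont = sum(1 for r in runs
                   if all(pos[r[k + 1]] == pos[r[k]] + 1
                          for k in range(i - 1)))
        total += math.log(cont / len(runs) + alpha)
    return 1.0 - math.exp(total / (n - 1))

def edit_distance(test, ref):
    """Minimum insertions, deletions, and substitutions needed to
    turn the test ordering into the reference ordering (Levenshtein)."""
    d = [[max(i, j) if i == 0 or j == 0 else 0
          for j in range(len(ref) + 1)] for i in range(len(test) + 1)]
    for i in range(1, len(test) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if test[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[len(test)][len(ref)]
```

Note that, because of the α smoothing, AC itself can slightly exceed 1 for a perfect ordering, so AC′ can be marginally negative; this follows directly from the formula as given.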
Each source document is always assigned a zero (the highest) rank.

In the raw option, we rank the permutations directly by their dissimilarity scores to form a full ranking over the set of permutations generated from the same source document.

Since a full ranking might be too sensitive to noise in training, we also experiment with the stratified option, in which C ranks are assigned to the permutations generated from the same source document. The permutation with the smallest dissimilarity score is assigned the same (zero, the highest) rank as the source document, and the one with the largest score is assigned the lowest (C − 1) rank; the ranks of the other permutations are uniformly distributed in this range according to their raw dissimilarity scores. We experiment with 3 to 6 ranks (the case where C = 2 reduces to the standard entity-based model).

5.1.2 Entity Extraction

Barzilay and Lapata's (2008) best results were achieved by employing an automatic coreference resolution tool (Ng and Cardie, 2002) for extracting entities from a source document; the permutations were generated only afterwards. Entity extraction from a permuted document thus depends on knowing the correct sentence order and the oracular entity information from the source document, since resolving coreference relations in permuted documents is too unreliable for an automatic tool.

We implement our multiple-rank model with full coreference resolution using Ng and Cardie's coreference resolution system and the entity extraction approach described above: the Coreference+ condition. However, as argued by Elsner and Charniak (2011), to better simulate the real situations that human readers might encounter in machine-generated documents, such oracular information should not be taken into account. Therefore we also employ two alternative approaches for entity extraction: (1) use the same automatic coreference resolution tool on permuted documents (we call it the Coreference condition); (2) use no coreference resolution, i.e., group head nouns into clusters by simple string matching (B&L's Coreference− condition).

5.1.3 Permutation Generation

The quality of the learned model depends on the set of permutations used in training. We are not aware of how B&L's permutations were generated, but we assume they were generated in a perfectly random fashion.

However, in reality, the probabilities of seeing documents with different degrees of coherence are not equal. For example, in an essay scoring task, if the target group is (near-)native speakers with sufficient education, we should expect their essays to be less incoherent: most of the essays will be coherent in most parts, with only a few minor problems regarding discourse coherence. In such a setting, the performance of a model trained on permutations generated from a uniform distribution may suffer some accuracy loss.

Therefore, in addition to the set of permutations used by Barzilay and Lapata (2008) (PS_BL), we create another set of permutations for each source document (PS_M) by assigning most of the probability mass to permutations which are mostly similar to the original source document. Besides its capability of better approximating real-life situations, training our model on permutations generated in this way has another benefit: in the standard entity-based model, all permuted documents are treated as incoherent; thus there are many more incoherent training instances than coherent ones (typically the proportion is 20:1). In contrast, in our multiple-rank model, permuted documents are assigned different ranks to further differentiate the degrees of coherence within them. By doing so, our model is able to learn the characteristics of a coherent document from those near-coherent documents as well, and therefore the problem of lacking coherent instances can be mitigated.

Our permutation generation algorithm is shown in Algorithm 1, where β = 0.05, λ = 5.0, MAX_NUM = 50, and K and K′ are two normalization factors that make p(swap_num) and p(i, j) proper probability distributions. For each source document, we create the same number of permutations as in PS_BL.

5.2 Summary Coherence Rating

In the summary coherence rating task, we are dealing with a mixture of multi-document summaries generated by systems and written by humans. Barzilay and Lapata (2008) did not assume
Algorithm 1: Permutation Generation
  Input: S_1, S_2, . . . , S_N; σ = (1, 2, . . . , N)
  Choose a number of sentence swaps swap_num with probability e^(−β · swap_num) / K
  for i = 1 to swap_num do
    Swap a pair of sentences (S_i, S_j) with probability p(i, j) = e^(−λ|i−j|) / K′
  end for
  Output: π = (o_1, o_2, . . . , o_N)

a simple binary distinction among the summaries generated from the same input document cluster; rather, they had human judges give scores for each summary based on its degree of coherence (see Section 3.2). Therefore, it seems that the subtle differences among incoherent documents (system-generated summaries in this case) have already been learned by their model.

But we wish to see whether we can replace human judgments by our computed dissimilarity scores, so that the original supervised learning is converted into unsupervised learning while retaining competitive performance. However, given a summary, computing its dissimilarity score is somewhat involved, because we do not know its correct sentence order. To tackle this problem, we employ a simple sentence alignment between a system-generated summary and a human-written summary originating from the same input document cluster. Given a system-generated summary D_s = (S_s1, S_s2, . . . , S_sn) and its corresponding human-written summary D_h = (S_h1, S_h2, . . . , S_hN) (here it is possible that n ≠ N), we treat the sentence ordering (1, 2, . . . , N) in D_h as σ (the original sentence ordering), and compute π = (o_1, o_2, . . . , o_n) based on D_s. To compute each o_i in π, we find the most similar sentence S_hj, j ∈ [1, N], in D_h by computing the cosine similarity over all tokens in S_hj and S_si; if all sentences in D_h have zero cosine similarity with S_si, we assign −1 to o_i.

Once π is known, we can compute its dissimilarity from σ using a chosen metric. But because π is now not guaranteed to be a permutation of σ (there may be repetition or missing values, i.e., −1, in π), Kendall's τ cannot be used, and we use only average continuity and edit distance as dissimilarity metrics in this experiment.

The remaining experimental configuration is the same as that of Barzilay and Lapata (2008), with the optimal transition length set to 2.

6 Results

6.1 Sentence Ordering

In this task, we use the same two sets of source documents (Earthquakes and Accidents; see Section 3.1) as Barzilay and Lapata (2008). Each contains 200 source documents, equally divided between training and test sets, with up to 20 permutations per document. We conduct experiments on these two domains separately. For each domain, we accompany each source document with two different sets of permutations: the one used by B&L (PS_BL), and the one generated by our model described in Section 5.1.3 (PS_M). We train our multiple-rank model and B&L's standard two-rank model on each set of permutations using the SVM_rank package (Joachims, 2006), and evaluate both systems on their test sets. Accuracy is measured as the fraction of correct pairwise rankings for the test set.

6.1.1 Full Coreference Resolution with Oracular Information

In this experiment, we implement B&L's fully-fledged standard entity-based coherence model, and extract entities from permuted documents using oracular information from the source documents (see Section 5.1.2).

Results are shown in Table 2. For each test situation, we list the best accuracy (in the Acc columns) for each chosen dissimilarity metric, with the corresponding rank assignment approach. C represents the number of ranks used in stratifying raw scores (N if using the raw configuration; see Section 5.1.1 for details). Baselines are accuracies of models trained using the standard entity-based coherence model.³

Footnote 3: There are discrepancies between our reported accuracies and those of Barzilay and Lapata (2008). The differences are due to the fact that we use a different parser, the Stanford dependency parser (de Marneffe et al., 2006), and might have extracted entities in a slightly different way than theirs, although we keep the other experimental configurations as close as possible to theirs. But when comparing our model with theirs, we always use the exact same set of features, so the absolute accuracies do not matter.

Our model outperforms the standard entity-based model on both permutation sets for both datasets. The improvement is not significant when trained on the permutation set PS_BL, and is achieved with only one of the three metrics.
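Algorithm 1 above can be sketched as runnable code; the parameter names (beta, lam) are our rendering of the paper's decay constants, and resampling the swapped pair independently at each step is our reading of the pseudocode:

```python
import math
import random

def generate_permutation(sentences, beta=0.05, lam=5.0, max_num=50):
    """Biased permutation in the spirit of Algorithm 1 (assumes at
    least two sentences).  The number of swaps decays exponentially
    with rate beta (capped at max_num), and each swap prefers nearby
    sentence pairs with rate lam, so most output orderings stay close
    to the source ordering.  The normalized weight lists play the
    roles of the constants K and K'."""
    n = len(sentences)
    # p(swap_num) proportional to exp(-beta * swap_num), swap_num in [1, max_num]
    counts = list(range(1, max_num + 1))
    count_w = [math.exp(-beta * k) for k in counts]
    swap_num = random.choices(counts, weights=count_w)[0]
    # p(i, j) proportional to exp(-lam * |i - j|): local swaps are favored
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    pair_w = [math.exp(-lam * (j - i)) for i, j in pairs]
    order = list(range(n))
    for _ in range(swap_num):
        i, j = random.choices(pairs, weights=pair_w)[0]
        order[i], order[j] = order[j], order[i]
    return [sentences[k] for k in order]
```

Because swaps are transpositions, the output is always a true permutation of the input, which matches the requirement that PS_M documents carry exactly the source content.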
Condition: Coreference+

                    Earthquakes       Accidents
Perms    Metric     C     Acc        C     Acc
PS_BL    τ          3     79.5       3     82.0
         AC         4     85.2       3     83.3
         ED         3     86.8       6     82.2
         Baseline         85.3             83.2
PS_M     τ          3     86.8       3     85.2*
         AC         3     85.6       N     85.4*
         ED         N     87.9*      4     86.3*
         Baseline         85.3             81.7

Table 2: Accuracies (%) of extending the standard entity-based coherence model with multiple-rank learning in sentence ordering, using the Coreference+ option. Accuracies which are significantly better than the baseline (p < .05) are indicated by *.

Condition: Coreference

                    Earthquakes       Accidents
Perms    Metric     C     Acc        C     Acc
PS_BL    τ          3     71.0       3     73.3
         AC         3     76.8*      3     74.5
         ED         4     77.4*      6     74.4
         Baseline         71.7             73.8
PS_M     τ          3     55.9       3     51.5
         AC         4     53.9       6     49.0
         ED         4     53.9       5     52.3
         Baseline         49.2             53.2

Table 3: Accuracies (%) of extending the standard entity-based coherence model with multiple-rank learning in sentence ordering, using the Coreference option. Accuracies which are significantly better than the baseline (p < .05) are indicated by *.
But when trained on PS_M (the set of permutations generated by our biased model), our model's performance significantly exceeds B&L's⁴ for all three metrics, especially as their model's performance drops on the Accidents dataset.

Footnote 4: Following Elsner and Charniak (2011), we use the Wilcoxon sign-rank test for significance.

From these results, we see that in the ideal situation, where we extract entities and resolve their coreference relations based on the oracular information from the source document, our model is effective in improving ranking accuracies, especially when trained on our more realistic permutation sets PS_M.

6.1.2 Full Coreference Resolution without Oracular Information

In this experiment, we apply the same automatic coreference resolution tool (Ng and Cardie, 2002) to not only the source documents but also their permutations. We want to see how removing the oracular component in the original model affects the performance of our multiple-rank model and of the standard model. Results are shown in Table 3.

First, we can see that when trained on PS_M, running full coreference resolution significantly hurts performance for both models. This suggests that in real-life applications, where the distribution of training instances with different degrees of coherence is skewed (as in the set of permutations generated from our model), running full coreference resolution is not a good option, since it makes the accuracies almost no better than random guessing (50%).

Moreover, considering training using PS_BL, running full coreference resolution has a different influence on the two datasets. For Earthquakes, our model significantly outperforms B&L's, while the improvement is insignificant for Accidents. This is most probably due to the different ways that entities are realized in these two datasets. As analyzed by Barzilay and Lapata (2008), in the Earthquakes dataset entities tend to be referred to by pronouns in subsequent mentions, while in the Accidents dataset literal string repetition is more common.

Given a balanced permutation distribution, as we assumed in PS_BL, switching distant sentence pairs in Accidents may result in an entity distribution very similar to that of switching closer sentence pairs, as recognized by the automatic tool. Therefore, compared to Earthquakes, our multiple-rank model may be less powerful in indicating the dissimilarity between the sentence orderings in a permutation and its source document, and therefore can improve on the baseline only by a small margin.

6.1.3 No Coreference Resolution

In this experiment, we do not employ any coreference resolution tool, and simply cluster head
[Figure: accuracy (%) curves for the ED metric under the Coreference+ and Coreference conditions on the Earthquakes and Accidents datasets; the figure, Table 4, and the surrounding text are not recoverable from this extraction.]
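For the summary rating results that follow, the test orderings π are obtained by the cosine-based sentence alignment of Section 5.2. A minimal sketch, with whitespace tokenization and function names of our own choosing:

```python
from collections import Counter
import math

def align_orderings(system_sents, human_sents):
    """Map each system-summary sentence to the 1-based index of the
    most token-cosine-similar human-summary sentence, or to -1 when
    no human sentence shares any token with it."""
    def vec(sent):
        # Bag-of-words vector over lowercased whitespace tokens.
        return Counter(sent.lower().split())

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        if dot == 0:
            return 0.0
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb)

    human_vecs = [vec(s) for s in human_sents]
    pi = []
    for s in system_sents:
        sims = [cosine(vec(s), h) for h in human_vecs]
        best = max(range(len(sims)), key=lambda k: sims[k])
        pi.append(best + 1 if sims[best] > 0 else -1)
    return pi
```

As the paper notes, the resulting π may contain repeated or missing indices, which is why only AC and ED (not Kendall's τ) are applicable to it.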
Entities          Metric     Same-cluster    Full
Coreference+      AC         82.5            72.6*
                  ED         81.3            73.0**
                  Baseline   78.8            70.9
Coreference−      AC         76.3            72.0
                  ED         78.8            71.7
                  Baseline   80.0            72.3

Table 5: Accuracies (%) of extending the standard entity-based coherence model with multiple-rank learning in summary rating. Baselines are results of the standard entity-based coherence model. Accuracies which are significantly better than the corresponding baseline are indicated by * (p < .05) and ** (p < .01).

for entity extraction. We train both models on the ranking preferences (144 in all) among summaries originating from the same input document cluster using the SVM_rank package (Joachims, 2006), and test on two different test sets: same-cluster test and full test. Same-cluster test is the one used by Barzilay and Lapata (2008), in which only the pairwise rankings (80 in all) between summaries originating from the same input document cluster are tested; we also experiment with full test, in which the pairwise rankings (1520 in all) between all summary pairs, excluding pairs of two human-written summaries, are tested.

Results are shown in Table 5. Coreference+ and Coreference− denote the configurations of using full coreference resolution and no resolution, respectively. First, for both models, performance on full test is clearly inferior to that on same-cluster test, but our model is still able to achieve performance competitive with the standard model, even though our fundamental assumption about the existence of a canonical sentence ordering in documents with the same content may break down on those test pairs not originating from the same input document cluster. Secondly, for the baseline model, using the Coreference− configuration yields better accuracy in this task (80.0% vs. 78.8% on same-cluster test, and 72.3% vs. 70.9% on full test), which is consistent with the findings of Barzilay and Lapata (2008). But our multiple-rank model seems to favor the Coreference+ configuration, and our best accuracy even exceeds B&L's best when tested on the same set: 82.5% vs. 80.0% on same-cluster test, and 73.0% vs. 72.3% on full test.

When our model performs worse than the baseline (under the Coreference− configuration), the difference is not significant, which suggests that our multiple-rank model with unsupervised score assignment via simple cosine matching can remain competitive with the standard model, which requires human annotations to obtain a more fine-grained coherence spectrum. This observation is consistent with Banko and Vanderwende's (2004) discovery that human-generated summaries look quite extractive.

7 Conclusions

In this paper, we have extended the popular coherence model of Barzilay and Lapata (2008) by adopting a multiple-rank learning approach. This is inherently different from other extensions to this model, in which the focus is on enriching the set of features for entity-grid construction; we instead keep their original feature set intact and manipulate only their learning methodology. We show that this concise extension is effective and able to outperform B&L's standard model in various experimental setups, especially when the experimental configurations are most suitable considering certain dataset properties (see the discussion in Section 6.1.4).

We experimented with two tasks, sentence ordering and summary coherence rating, following B&L's original framework. In sentence ordering, we also explored the influence of removing the oracular component in their original model and of dealing with permutations generated from different distributions, showing that our model is robust across different experimental situations. In summary coherence rating, we further extended their model such that their original supervised learning is converted into unsupervised learning with competitive or even superior performance.

Our multiple-rank learning model can be easily adapted into other extended entity-based coherence models with their enriched feature sets, and further improvement in ranking accuracies should be expected.

Acknowledgments

This work was financially supported by the Natural Sciences and Engineering Research Council of Canada and by the University of Toronto.
References

Michele Banko and Lucy Vanderwende. 2004. Using n-grams to understand the nature of summaries. In Proceedings of Human Language Technologies and North American Association for Computational Linguistics 2004: Short Papers, pages 1–4.

Regina Barzilay and Mirella Lapata. 2005. Modeling local coherence: An entity-based approach. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pages 141–148.

Regina Barzilay and Mirella Lapata. 2008. Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1):1–34.

Danushka Bollegala, Naoaki Okazaki, and Mitsuru Ishizuka. 2006. A bottom-up approach to sentence ordering for multi-document summarization. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 385–392.

Jackie Chi Kit Cheung and Gerald Penn. 2010. Entity-based local coherence modelling using topological fields. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010), pages 186–195.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006).

Micha Elsner and Eugene Charniak. 2011. Extending the entity grid with entity-specific features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011), pages 125–129.

Katja Filippova and Michael Strube. 2007. Extending the entity-grid coherence model to semantically related entities. In Proceedings of the Eleventh European Workshop on Natural Language Generation (ENLG 2007), pages 139–142.

Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2002), pages 133–142.

Thorsten Joachims. 2006. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006), pages 217–226.

Mirella Lapata. 2003. Probabilistic text structuring: Experiments with sentence ordering. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), pages 545–552.

Mirella Lapata. 2006. Automatic evaluation of information ordering: Kendall's tau. Computational Linguistics, 32(4):471–484.

Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2011. Automatically evaluating text coherence using discourse relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011), pages 997–1006.

Nitin Madnani, Rebecca Passonneau, Necip Fazil Ayan, John M. Conroy, Bonnie J. Dorr, Judith L. Klavans, Dianne P. O'Leary, and Judith D. Schlesinger. 2007. Measuring variability in sentence ordering for news summarization. In Proceedings of the Eleventh European Workshop on Natural Language Generation (ENLG 2007), pages 81–88.

Vincent Ng and Claire Cardie. 2002. Improving machine learning approaches to coreference resolution. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pages 104–111.

Michael Strube and Simone Paolo Ponzetto. 2006. WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of the 21st National Conference on Artificial Intelligence, pages 1219–1224.

Renxian Zhang. 2011. Sentence ordering driven by local and global coherence for summary generation. In Proceedings of the ACL 2011 Student Session, pages 6–11.
Generalization Methods for In-Domain and Cross-Domain Opinion Holder Extraction

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 325–335, Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics
In this work, we will consider three different generalization methods: simple unsupervised word clustering, an induction method, and the usage of lexical resources. We show that generalization causes significant improvements and that the impact of the improvement depends on how much the training and test data differ from each other. We also address the issue of opinion holders in patient position and present methods, including a novel extraction method, to detect these opinion holders without any labeled training data, as standard datasets contain too few instances of them.

In the context of generalization it is also important to consider different classification methods, as the incorporation of generalization may have a varying impact depending on how robust the classifier is by itself, i.e. how well it generalizes even with a standard feature set. We compare two state-of-the-art learning methods, conditional random fields and convolution kernels, and a rule-based method.

2 Data

As a labeled dataset we mainly use the MPQA 2.0 corpus (Wiebe et al., 2005). We adhere to the definition of opinion holders from previous work (Wiegand and Klakow, 2010; Wiegand and Klakow, 2011a; Wiegand and Klakow, 2011b), i.e. every source of a private state or a subjective speech event (Wiebe et al., 2005) is considered an opinion holder.

This corpus contains almost exclusively news texts. In order to divide it into different domains, we use the topic labels from (Stoyanov et al., 2004). By inspecting those topics, we found that many of them can be grouped into a cluster of news items discussing human rights issues, mostly in the context of combating global terrorism. This means that there is little point in considering every single topic as a distinct (sub)domain and, therefore, we consider this cluster as one single domain, ETHICS.³ For our cross-domain evaluation, we want to have another topic that is fairly different from this set of documents. By visual inspection, we found that the topic discussing issues regarding the International Space Station would suit our purpose. It is henceforth called SPACE.

Footnote 3: The cluster is the union of documents with the following MPQA-topic labels: axisofevil, guantanamo, humanrights, mugabe and settlements.

Domain      # Sentences    # Holders in sentence (average)
ETHICS      5700           0.79
SPACE       628            0.28
FICTION     614            1.49

Table 1: Statistics of the different domain corpora.

In addition to these two (sub)domains, we chose a text type that is not even news text, in order to have a very distant domain. Therefore, we had to use some text not included in the MPQA corpus. Existing text collections containing product reviews (Kessler et al., 2010; Toprak et al., 2010), which are generally a popular resource for sentiment analysis, were not found suitable, as they contain only few distinct opinion holders. We finally used a few summaries of fictional work (two Shakespeare plays and one novel by Jane Austen⁴), since their language is notably different from that of news texts and they contain a large number of different opinion holders (therefore, opinion holder extraction is a meaningful task on this text type). These texts make up our third domain, FICTION. We manually labeled it with opinion holder information by applying the annotation scheme of the MPQA corpus.

Footnote 4: Available at: www.absoluteshakespeare.com/guides/{othello|twelfth night}/summary/{othello|twelfth night} summary.htm and www.wikisummaries.org/Pride and Prejudice

Table 1 lists the properties of the different domain corpora. Note that ETHICS is the largest domain. We consider it our primary (source) domain, as it serves both as a training and an (in-domain) test set. Due to their size, the other domains only serve as test sets (target domains).

For some of our generalization methods, we also need a large unlabeled corpus. We use the North American News Text Corpus (LDC95T21).

3 The Different Types of Generalization

3.1 Word Clustering (Clus)

The simplest generalization method that is considered in this paper is word clustering. By that, we understand the automatic grouping of words occurring in similar contexts. Such clusters are usually computed on a large unlabeled corpus. Unlike lexical features, features based on clusters are less sparse and have been proven to significantly improve data-driven classifiers in related tasks, such as named-entity recognition (Turian et al., 2010).
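The way such clusters generalize a feature set can be sketched as follows; the word-to-cluster file format and the UNK fallback are our own illustrative choices, not SRILM's output format:

```python
def load_clusters(pairs):
    """Build a word -> cluster-id map from (word, cluster_id) pairs,
    as produced by some word-clustering run over unlabeled text."""
    return dict(pairs)

def cluster_features(tokens, word2cluster):
    """Replace each token by its cluster id, so that unseen but
    distributionally similar words (e.g. two person names in the
    same cluster) share a single, less sparse feature value;
    out-of-vocabulary tokens fall back to an UNK symbol."""
    return [word2cluster.get(t, "UNK") for t in tokens]
```

A sequence classifier can then receive the cluster ids alongside (or instead of) the raw lexical features, which is the kind of generalization the passage above describes.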
I. Madrid, Dresden, Bordeaux, Istanbul, Caracas, Manila, ...
II. Toby, Betsy, Michele, Tim, Jean-Marie, Rory, Andrew, ...
III. detest, resent, imply, liken, indicate, suggest, owe, expect, ...
IV. disappointment, unease, nervousness, dismay, optimism, ...
V. remark, baby, book, saint, manhole, maxim, coin, batter, ...

Table 2: Some automatically induced clusters.

ETHICS    SPACE    FICTION
1.47      2.70     11.59

Table 3: Percentage of opinion holders as patients.

Such a generalization is, in particular, attractive as it is cheaply produced. As a state-of-the-art clustering method, we consider Brown clustering (Brown et al., 1992) as implemented in the SRILM toolkit (Stolcke, 2002). We induced 1000 clusters, which is also the configuration used in (Turian et al., 2010).⁵

Table 2 illustrates a few of the clusters induced from our unlabeled dataset introduced in Section 2. Some of these clusters represent location or person names (e.g. I. and II.). This exemplifies why clustering is effective for named-entity recognition. We also find clusters that intuitively

majority of holders are agents (4). A certain number of predicates, however, also have opinion holders in patient position, e.g. (5) and (6).

Wiegand and Klakow (2011b) found that many of those latter predicates are listed in one of Levin's verb classes, called amuse verbs. While in the evaluation on the entire MPQA corpus opinion holders in patient position are fairly rare (Wiegand and Klakow, 2011b), we may wonder whether the same applies to the individual domains that we consider in this work. Table 3 lists the proportion of those opinion holders (computed manually) based on a random sample of 100 opinion holder mentions from those corpora. The table shows that on the domains from the MPQA corpus, i.e. ETHICS and SPACE, those opinion holders indeed play a minor role, but there is a notably higher proportion on the FICTION domain.

3.3 Task-Specific Lexicon Induction (Induc)

3.3.1 Distant Supervision with Prototypical Opinion Holders

Lexical resources are potentially much more expressive than word clustering. This knowledge,
however, is usually manually compiled, which
seem to be meaningful for our task (e.g. III. &
makes this solution much more expensive. Wie-
IV.) but, on the other hand, there are clusters that
gand and Klakow (2011a) present an intermedi-
contain words that with the exception of their part
ate solution for opinion holder extraction inspired
of speech do not have anything in common (e.g.
by distant supervision (Mintz et al., 2009). The
V.).
output of that method is also a lexicon of predi-
3.2 Manually Compiled Lexicons (Lex) cates but it is automatically extracted from a large
The major shortcoming of word clustering is that unlabeled corpus. This is achieved by collecting
it lacks any task-specific knowledge. The oppo- predicates that frequently co-occur with prototyp-
site type of generalization is the usage of manu- ical opinion holders, i.e. common nouns such as
ally compiled lexicons comprising predicates that opponents (7) or critics (8), if they are an agent
indicate the presence of opinion holders, such as of that predicate. The rationale behind this is
supported, worries or disappointed in (4)-(6). that those nouns act very much like actual opin-
ion holders and therefore can be seen as a proxy.
(4) I always supported this idea. holder:agent.
(5) This worries me. holder:patient (7) Opponents say these arguments miss the point.
(6) He disappointed me. holder:patient (8) Critics argued that the proposed limits were unconstitutional.
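The induction step of Section 3.3.1 can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the (predicate, agent head) pair representation, the small list of prototypical holder nouns (only opponents and critics come from examples (7) and (8)), and the frequency threshold are made up for the example.

```python
from collections import Counter

# Illustrative prototypical opinion holders; "opponents" and "critics"
# are from examples (7) and (8), the rest are assumed additions.
PROTOTYPICAL_HOLDERS = {"opponents", "critics", "proponents", "skeptics"}

def induce_predicate_lexicon(pred_agent_pairs, min_count=2):
    """Collect predicates that frequently take a prototypical opinion
    holder as their agent (distant supervision)."""
    counts = Counter(
        pred for pred, agent in pred_agent_pairs
        if agent.lower() in PROTOTYPICAL_HOLDERS
    )
    return {pred for pred, n in counts.items() if n >= min_count}

# (predicate, agent-head) pairs as they could be read off a parsed
# unlabeled corpus:
pairs = [
    ("say", "Opponents"), ("argue", "Critics"), ("say", "critics"),
    ("argue", "opponents"), ("run", "He"), ("say", "Mary"),
]
print(sorted(induce_predicate_lexicon(pairs)))  # ['argue', 'say']
```

In a real setting, the pairs would be produced by a parser over a corpus like the one introduced in Section 2, and the threshold would be tuned to corpus size.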
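The use of induced word clusters as a generalization (Section 3.1, Table 2) can be sketched as follows. The word-to-cluster mapping and the cluster ids below are toy values; a real setup would read the assignments produced by Brown clustering over a large corpus.

```python
# Toy word-to-cluster mapping of the kind produced by Brown clustering
# (cluster ids are arbitrary labels).
CLUSTERS = {
    "madrid": "C17", "dresden": "C17", "bordeaux": "C17",  # locations (cf. I.)
    "toby": "C42", "betsy": "C42",                         # person names (cf. II.)
    "detest": "C35265", "resent": "C35265",                # opinion predicates (cf. III.)
}

def cluster_features(tokens):
    """Replace each token by its (less sparse) cluster id; unknown
    words fall back to a generic label."""
    return [CLUSTERS.get(t.lower(), "C_UNK") for t in tokens]

print(cluster_features(["Toby", "detest", "Madrid"]))
# ['C42', 'C35265', 'C17']
```

This is what makes cluster features less sparse than lexical features: unseen but distributionally similar words share the same cluster id.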
This lexicon is limited to agentive opinion holders. Opinion holders in patient position, such as the ones taken by amuse verbs in (5) and (6), are not covered. Wiegand and Klakow (2011a) show that considering less restrictive contexts significantly drops classification performance. So the natural extension of looking for predicates having prototypical opinion holders in patient position is not effective. Sentences such as (9) would mar the result.

(9) They criticized their opponents.

In (9) the prototypical opinion holder opponents (in the patient position) is not a true opinion holder.

Our novel method to extract those predicates rests on the observation that the past participle of those verbs, such as shocked in (10), is very often identical to some predicate adjective (11) having a similar, if not identical, meaning. For the predicate adjective, however, the opinion holder is its subject/agent and not its patient.

(10) He had shocked_verb me. holder:patient
(11) I was shocked_adj. holder:agent

Instead of extracting those verbs directly (10), we take the detour via their corresponding predicate adjectives (11). This means that we collect all those verbs (from our large unlabeled corpus (Section 2)) for which there is a predicate adjective that coincides with the past participle of the verb.

To increase the likelihood that our extracted predicates are meaningful for opinion holder extraction, we also need to check the semantic type in the relevant argument position, i.e. make sure that the agent of the predicate adjective (which would be the patient of the corresponding verb) is an entity likely to be an opinion holder. Our initial attempts with prototypical opinion holders were too restrictive, i.e. the number of prototypical opinion holders co-occurring with those adjectives was too small. Therefore, we widen the semantic type of this position from prototypical opinion holders to persons. This means that we allow personal pronouns (i.e. I, you, he, she and we) to appear in this position. We believe that this relaxation can be made in this particular case, as adjectives are a priori much more likely to convey opinions than verbs (Wiebe et al., 2004).

An intrinsic evaluation of the predicates that we thus extracted from our unlabeled corpus is difficult. The list of the 250 most frequent verbs exhibiting this special property of coinciding with adjectives (this will be the list that we use in our experiments) contains 42% of the amuse verbs (Section 3.2). However, we also found many other potentially useful predicates on this list that are not listed as amuse verbs (Table 4). As the amuse verbs cannot be considered a complete gold standard for all predicates taking opinion holders as patients, we will focus on a task-based evaluation of our automatically extracted list (Section 6).

anguish°, astonish, astound, concern, convince, daze, delight, disenchant°, disappoint, displease, disgust, disillusion, dissatisfy, distress, embitter°, enamor°, engross, enrage, entangle°, excite, fatigue°, flatter, fluster, flummox°, frazzle°, hook°, humiliate, incapacitate°, incense, interest, irritate, obsess, outrage, perturb, petrify°, sadden, sedate°, shock, stun, tether°, trouble

Table 4: Examples of the automatically extracted verbs taking opinion holders as patients (°: not listed as amuse verb).

4 Data-driven Methods

In the following, we present the two supervised classifiers we use in our experiments. Both classifiers incorporate the same levels of representation, including the same generalization methods.

4.1 Conditional Random Fields (CRF)

The supervised classifier most frequently used for information extraction tasks, in general, is the conditional random field (CRF) (Lafferty et al., 2001). Using CRFs, the task of opinion holder extraction is framed as a tagging problem in which, given a sequence of observations x = x1 x2 ... xn (the words in a sentence), a sequence of output tags y = y1 y2 ... yn indicating the boundaries of opinion holders is computed by modeling the conditional probability P(y|x).

The features we use (Table 5) are mostly inspired by Choi et al. (2005) and by the ones used for plain support vector machines (SVMs) in (Wiegand and Klakow, 2010). They are organized into groups. The basic group Plain does not contain any generalization method. Each other group is dedicated to one specific generalization method that we want to examine (Clus, Induc and Lex). Apart from considering generalization features indicating the presence of generalization types, we also consider those types in conjunction with semantic roles. As already indicated above, semantic roles are especially important for the detection of opinion holders. Unfortunately, the corresponding feature from the Plain feature group that also includes the lexical form of the predicate is most likely a sparse feature. For the opinion holder me in (10), for example, it would correspond to A1_shock. Therefore, we introduce for each generalization method an additional feature replacing the sparse lexical item by a generalization label, i.e. Clus: A1_CLUSTER-35265, Induc: A1_INDUC-PRED and Lex: A1_LEX-PRED.6

Group  Features
Plain  Token features: unigrams and bigrams
       POS/chunk/named-entity features: unigrams, bigrams and trigrams
       Constituency tree path to nearest predicate
       Nearest predicate
       Semantic role to predicate + lexical form of predicate
Clus   Cluster features: unigrams, bigrams and trigrams
       Semantic role to predicate + cluster-id of predicate
       Cluster-id of nearest predicate
Induc  Is there a predicate from the induced lexicon within a window of 5 tokens?
       Semantic role to predicate, if predicate is contained in induced lexicon
       Is nearest predicate contained in induced lexicon?
Lex    Is there a predicate from the manually compiled lexicons within a window of 5 tokens?
       Semantic role to predicate, if predicate is contained in manually compiled lexicons
       Is nearest predicate contained in manually compiled lexicons?

Table 5: Feature set for CRF.

For this learning method, we use CRF++.7 We choose a configuration that provides good performance on our source domain (i.e. ETHICS).8 For semantic role labeling we use SWIRL9, for chunk parsing CASS (Abney, 1991), and for constituency parsing the Stanford Parser (Klein and Manning, 2003). Named-entity information is provided by the Stanford Tagger (Finkel et al., 2005).

4.2 Convolution Kernels (CK)

Convolution kernels (CK) are special kernel functions. A kernel function K : X x X -> R computes the similarity of two data instances x_i and x_j (x_i, x_j in X). It is mostly used in SVMs, which estimate a hyperplane separating data instances from different classes, H(x) = w * x + b = 0, where w in R^n and b in R (Joachims, 1999). In convolution kernels, the structures to be compared within the kernel function are not vectors comprising manually designed features but the underlying discrete structures, such as syntactic parse trees or part-of-speech sequences. Since they are directly provided to the learning algorithm, a classifier can be built without the effort of implementing an explicit feature extraction.

We take the best configuration from (Wiegand and Klakow, 2010), which comprises a combination of three different tree kernels: two tree kernels based on constituency parse trees (one with predicate and another with semantic scope) and a tree kernel encoding predicate-argument structures based on semantic role information. These representations are illustrated in Figure 1. The resulting kernels are combined by plain summation.

In order to integrate our generalization methods into the convolution kernels, the input structures, i.e. the linguistic tree structures, have to be augmented. For that, we just add additional nodes whose labels correspond to the respective generalization types (i.e. Clus: CLUSTER-ID, Induc: INDUC-PRED and Lex: LEX-PRED). The nodes are added in such a way that they (directly) dominate the leaf node for which they provide a generalization.10 If several generalization methods are used and several of them apply to the same lexical unit, then the (vertical) order of the generalization nodes is LEX-PRED > INDUC-PRED > CLUSTER-ID.11 Figure 2 illustrates the predicate-argument structure from Figure 1 augmented with INDUC-PRED and CLUSTER-IDs.

For this learning method, we use the SVMLight-TK toolkit.12 Again, we tune the parameters to our source domain (ETHICS).13

5 Rule-based Classifiers (RB)

Finally, we also consider rule-based classifiers (RB). The main difference towards CRF and CK is that this is an unsupervised approach not requiring training data. We re-use the framework by Wiegand and Klakow (2011b). The candidate set are all noun phrases in a test set. A candidate is classified as an opinion holder if all of the following

6 Predicates in patient position are given the same generalization label as the predicates in agent position. Specially marking them did not result in a notable improvement.
7 http://crfpp.sourceforge.net
8 The soft margin parameter c is set to 1.0 and all features occurring less than 3 times are removed.
9 http://www.surdeanu.name/mihai/swirl
10 Note that even for the configuration Plain the trees are already augmented with named-entity information.
11 We chose this order as it roughly corresponds to the specificity of those generalization types.
12 disi.unitn.it/moschitti
13 The cost parameter j (Morik et al., 1999) was set to 5.
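The extraction heuristic for verbs taking opinion holders as patients (the participle/predicate-adjective coincidence of Section 3.3.2) can be sketched as follows. The observation format and the toy data are assumptions for illustration; a real run would collect these observations from the large unlabeled corpus.

```python
# Pronouns accepted as "person" subjects (cf. the relaxation from
# prototypical opinion holders to persons).
PERSON_PRONOUNS = {"i", "you", "he", "she", "we"}

def extract_patient_predicates(verb_participles, adj_observations):
    """Keep verbs whose past participle coincides with a predicate
    adjective whose subject/agent is a person pronoun - the heuristic
    for verbs taking opinion holders as patients."""
    person_adjs = {
        adj for adj, subject in adj_observations
        if subject.lower() in PERSON_PRONOUNS
    }
    return {verb for verb, participle in verb_participles.items()
            if participle in person_adjs}

# Toy data: verb -> past participle, and observed predicate adjectives
# with the heads of their subjects.
participles = {"shock": "shocked", "eat": "eaten", "interest": "interested"}
adjectives = [("shocked", "I"), ("interested", "she"), ("eaten", "cake")]
print(sorted(extract_patient_predicates(participles, adjectives)))
# ['interest', 'shock']
```

Note how eat is filtered out: eaten occurs as a predicate adjective, but not with a person subject.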
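The replacement of the sparse role+lexical feature (e.g. A1_shock) by generalization labels, as described for the CRF feature set in Section 4.1, can be sketched as follows. The function and data structures are illustrative, not CRF++ template code; the sketch emits the Plain feature together with whichever generalized variants apply.

```python
def role_predicate_feature(role, predicate, clusters, induc_lex, manual_lex):
    """Build the sparse role+lexical feature of the Plain group and
    its generalized variants (cf. A1_shock vs. A1_CLUSTER-35265)."""
    feats = [f"{role}_{predicate}"]                    # Plain: sparse
    if predicate in clusters:                          # Clus
        feats.append(f"{role}_CLUSTER-{clusters[predicate]}")
    if predicate in induc_lex:                         # Induc
        feats.append(f"{role}_INDUC-PRED")
    if predicate in manual_lex:                        # Lex
        feats.append(f"{role}_LEX-PRED")
    return feats

print(role_predicate_feature("A1", "shock",
                             clusters={"shock": "35265"},
                             induc_lex={"shock"},
                             manual_lex=set()))
# ['A1_shock', 'A1_CLUSTER-35265', 'A1_INDUC-PRED']
```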
Figure 1: The different structures (left: constituency trees, right: predicate argument structure) derived from
Sentence (1) for the opinion holder candidate Malaysia used as input for convolution kernels (CK).
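The augmentation of the convolution-kernel input trees with generalization nodes (Section 4.2) can be sketched as follows; the tuple-based tree encoding (label, children) is an assumption for illustration. The generalization nodes directly dominate the leaf, in the vertical order LEX-PRED > INDUC-PRED > CLUSTER-ID.

```python
# A tree node is (label, children); a leaf is (word, []).
def augment_leaf(word, cluster_id=None, in_induc=False, in_lex=False):
    """Insert generalization nodes directly dominating a leaf, in the
    vertical order LEX-PRED > INDUC-PRED > CLUSTER-ID (LEX-PRED
    highest), as done for the kernel input trees."""
    node = (word, [])
    if cluster_id is not None:
        node = (f"CLUSTER-{cluster_id}", [node])
    if in_induc:
        node = ("INDUC-PRED", [node])
    if in_lex:
        node = ("LEX-PRED", [node])
    return node

print(augment_leaf("shock", cluster_id="35265", in_induc=True, in_lex=False))
# ('INDUC-PRED', [('CLUSTER-35265', [('shock', [])])])
```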
                  Induc           Lex             Induc+Lex
Domains           AG     AG+PT    AG     AG+PT    AG+PT
ETHICS            50.77  50.99    52.22  52.27    53.07
SPACE             45.81  46.55    47.60  48.47    45.20
FICTION           46.59  49.97    54.84  59.35    63.11

Table 6: F-scores of the rule-based classifiers (RB).
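The classifier combination evaluated later (Table 10) predicts an opinion holder if either CK or RB predicts one. As a sketch, with predicted holders represented as (start, end) token offsets, an assumption for illustration:

```python
def combine_predictions(ck_spans, rb_spans):
    """Union combination: predict an opinion holder if either the
    learning-based classifier (CK) or the rule-based classifier (RB)
    predicts one. Spans are (start, end) token offsets."""
    return sorted(set(ck_spans) | set(rb_spans))

ck = [(0, 1), (7, 8)]
rb = [(7, 8), (12, 14)]
print(combine_predictions(ck, rb))  # [(0, 1), (7, 8), (12, 14)]
```

The union trades precision for recall, which is exactly the effect reported for the distant FICTION domain.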
only be measured on the FICTION domain, since this is the only domain with a significant proportion of those opinion holders (Table 3).

6.2 In-Domain Evaluation of Learning-based Methods

Table 7 shows the performance of the learning-based methods CRF and CK on an in-domain evaluation (ETHICS domain) using different amounts of labeled training data. We carry out a 5-fold cross-validation and use n% of the training data in the training folds. The table shows that CK is more robust than CRF. The fewer training data are used, the more important generalization becomes. CRF benefits much more from generalization than CK. Interestingly, the CRF configuration with the best generalization is usually as good as plain CK. This proves the effectiveness of CK. In principle, Lex is the strongest generalization method, while Clus is by far the weakest. For Clus, systematic improvements over no generalization (even though they are minor) can only be observed with CRF. As far as combinations are concerned, either Lex+Induc or All performs best. This in-domain evaluation proves that opinion holder extraction is different from named-entity recognition: simple unsupervised generalization, such as word clustering, is not effective, and popular sequential classifiers are less robust than margin-based tree kernels.

                       Training Size (%)
Features      Alg.     5      10     20     50     100
Plain         CRF    32.14  35.24  41.03  51.05  55.13
              CK     42.15  46.34  51.14  56.39  59.52
+Clus         CRF    33.06  37.11  43.47  52.05  56.18
              CK     42.02  45.86  51.11  56.59  59.77
+Induc        CRF    37.28  42.31  46.54  54.27  56.71
              CK     46.26  49.35  53.26  57.28  60.42
+Lex          CRF    40.69  43.91  48.43  55.37  58.46
              CK     46.45  50.59  53.93  58.63  61.50
+Clus+Induc   CRF    37.27  42.19  47.35  54.95  57.14
              CK     45.14  48.20  52.39  57.37  59.97
+Clus+Lex     CRF    40.52  44.29  49.32  55.44  58.80
              CK     45.89  49.35  53.56  58.74  61.43
+Lex+Induc    CRF    42.23  45.92  49.96  55.61  58.40
              CK     47.46  51.44  54.80  58.74  61.58
All           CRF    41.56  45.75  50.39  56.24  59.08
              CK     46.18  50.10  54.04  58.92  61.44

Table 7: F-score of in-domain (ETHICS) learning-based classifiers.

Table 8 complements Table 7 in that it compares the learning-based methods with the best rule-based classifier and also displays precision and recall. RB achieves a high recall, whereas the learning-based methods always surpass RB in precision.14 Applying generalization to the learning-based methods results in an improvement of both recall and precision if few training data are used. The impact on precision decreases, however, the more training data are added. There is always a significant increase in recall, but learning-based methods may not reach the level of RB even though they use the same resources. This is a side-effect of preserving a much higher precision. It also explains why learning-based methods with generalization may have a lower F-score than RB.

14 The reason for RB having a high recall is extensively discussed in (Wiegand and Klakow, 2011b).

6.3 Out-of-Domain Evaluation of Learning-based Methods

Table 9 presents the results of out-of-domain classifiers. The complete ETHICS dataset is used for training. Some properties are similar to the previous experiments: CK always outperforms CRF. RB provides a high recall, whereas the learning-based methods maintain a higher precision. Similar to the in-domain setting using few labeled training data, the incorporation of generalization increases both precision and recall. Moreover, a combination of generalization methods is on average better than just using one method, although Lex is again a fairly robust individual generalization method. Generalization is more effective in this setting than in the in-domain evaluation using all training data, in particular for CK, since the training and test data are much more different from each other and suitable generalization methods partly close that gap.

There is a notable difference in precision between the SPACE and FICTION domains (and also the source domain ETHICS (Table 8)). We strongly assume that this is due to the distribution of opinion holders in those datasets (Table 1). The FICTION domain contains many more opinion holders; therefore, the chance that a predicted opinion holder is correct is much higher.

With regard to recall, a level of performance similar to the ETHICS domain can only be achieved in the SPACE domain, i.e. CK achieves a recall of 60%. In the FICTION domain, however, the recall is much lower (the best recall of CK is below 47%). This is no surprise, as the SPACE domain is more similar to the source domain than
the FICTION domain, since ETHICS and SPACE are news texts. FICTION contains more out-of-domain language. Therefore, RB (which exclusively uses domain-independent knowledge) outperforms both learning-based methods, including the ones incorporating generalization. Similar results have been observed for rule-based classifiers in other tasks in cross-domain sentiment analysis, such as subjectivity detection and polarity classification: high-level information as it is encoded in a rule-based classifier generalizes better than learning-based methods (Andreevskaia and Bergler, 2008; Lambov et al., 2009).

                     CRF                    CK
Size  Feat.   Prec   Rec    F1      Prec   Rec    F1
10    Plain   52.17  26.61  35.24   58.26  38.47  46.34
      All     62.85  35.96  45.75   63.18  41.50  50.10
50    Plain   59.85  44.50  51.05   59.60  53.50  56.39
      All     62.99  50.80  56.24   61.91  56.20  58.92
100   Plain   64.14  48.33  55.13   62.38  56.91  59.52
      All     64.75  54.32  59.08   63.81  59.24  61.44
RB            47.38  60.32  53.07   47.38  60.32  53.07

Table 8: Comparison of best RB with learning-based approaches on in-domain classification.

We set up another experiment exclusively for the FICTION domain in which we combine the output of our best learning-based method, i.e. CK, with the prediction of a rule-based classifier. The combined classifier predicts an opinion holder if either classifier predicts one. The motivation for this is the following: the FICTION domain is the only domain to have a significant proportion of opinion holders appearing as patients. We want to know how many of them can be recognized with the best out-of-domain classifier using training data with only very few instances of this type, and what benefit the addition of various RBs, which have a clearer notion of these constructions, brings about. Moreover, we already observed that the learning-based methods have a bias towards preserving a high precision, and this may have as a consequence that the generalization features incorporated into CK will not receive sufficiently large weights. Unlike the SPACE domain, where a sufficiently high recall is already achieved with CK (presumably due to its stronger similarity to the source domain), the FICTION domain may be more severely affected by this bias, and evidence from RB may compensate for this.

Table 10 shows the performance of those combined classifiers. For all generalization types considered, there is, indeed, an improvement by adding information from RB, resulting in a large boost in recall. Already the application of our induction approach Induc results in an increase of more than 8 percentage points compared to plain CK. The table also shows that there is always some improvement if RB considers opinion holders as patients (AG+PT). This can be considered as some evidence that (given the available data we use) opinion holders in patient position can only be effectively extracted with the help of RB, and as further evidence that our novel approach to extract those predicates (Section 3.3.2) is effective.

Algorithms      Generalization  Prec   Rec    F
CK              (Plain)         66.90  41.48  51.21
CK              Induc           67.06  45.15  53.97
CK+RB_AG        Induc           60.22  54.52  57.23
CK+RB_AG+PT     Induc           61.09  58.14  59.58
CK              Lex             69.45  46.65  55.81
CK+RB_AG        Lex             67.36  59.02  62.91
CK+RB_AG+PT     Lex             68.25  63.28  65.67
CK              Induc+Lex       69.73  46.17  55.55
CK+RB_AG        Induc+Lex       61.41  65.56  63.42
CK+RB_AG+PT     Induc+Lex       62.26  70.56  66.15

Table 10: Combination of out-of-domain CK and rule-based classifiers on FICTION (i.e. the distant domain).

The combined approach in Table 10 not only outperforms CK (discussed above) but also RB (Table 6). We manually inspected the output of the classifiers and also found cases in which CK detects opinion holders that RB misses. CK has the advantage that it is not bound only to the relationship between candidate holder and predicate. It learns further heuristics, e.g. that sentence-initial mentions of persons are likely opinion holders. In (12), for example, this heuristic fires, while RB overlooks the instance, as to give someone a share of advice is not part of the lexicon.

(12) She later gives Charlotte her share of advice on running a household.

7 Related Work

The research on opinion holder extraction has been focusing on applying different data-driven approaches. Choi et al. (2005) and Choi et al. (2006) explore conditional random fields, Wiegand and Klakow (2010) examine different combinations of convolution kernels, while Johansson and Moschitti (2010) present a re-ranking approach modeling complex relations between multiple opinions in a sentence. A comparison of
those methods has not yet been attempted. In this work, we compare the popular state-of-the-art learning algorithms conditional random fields and convolution kernels for the first time. All these data-driven methods have been evaluated on the MPQA corpus. Some generalization methods are incorporated but, unlike in this paper, they are neither systematically compared nor combined. The role of resources that provide the knowledge of argument positions of opinion holders is not covered in any of these works. This kind of knowledge would have to be learnt directly from the labeled training data. In this work, however, we found that the distribution of argument positions of opinion holders varies across the different domains and, therefore, cannot be learnt from an arbitrary out-of-domain training set.

              SPACE (similar target domain)                FICTION (distant target domain)
              CRF                   CK                     CRF                   CK
Features      Prec   Rec    F1      Prec   Rec    F1       Prec   Rec    F1      Prec   Rec    F1
Plain         47.32  48.62  47.96   45.89  57.07  50.87    68.58  28.96  40.73   66.90  41.48  51.21
+Clus         49.00  48.62  48.81   49.23  57.64  53.10    71.85  32.21  44.48   67.54  41.21  51.19
+Induc        42.92  49.15  45.82   46.66  60.45  52.67    71.59  34.77  46.80   67.06  45.15  53.97
+Lex          49.65  49.07  49.36   49.60  59.88  54.26    71.91  35.83  47.83   69.45  46.65  55.81
+Clus+Induc   46.61  48.78  47.67   48.65  58.20  53.00    71.32  35.88  47.74   67.46  42.17  51.90
+Lex+Induc    48.75  50.87  49.78   49.92  58.76  53.98    74.02  37.37  49.67   69.73  46.17  55.55
+Clus+Lex     49.72  50.87  50.29   53.70  59.32  56.37    73.41  37.15  49.33   70.59  43.98  54.20
All           49.87  51.03  50.44   51.68  58.76  54.99    72.00  37.44  49.26   70.61  44.83  54.84
best RB       41.72  57.80  48.47   41.72  57.80  48.47    63.26  62.96  63.11   63.26  62.96  63.11

Table 9: Results of the out-of-domain classifiers (the complete ETHICS dataset is used for training).

Bethard et al. (2004) and Kim and Hovy (2006) explore the usefulness of semantic roles provided by FrameNet (Fillmore et al., 2003). Bethard et al. (2004) use this resource to acquire labeled training data, while in (Kim and Hovy, 2006) FrameNet is used within a rule-based classifier mapping frame elements of frames to opinion holders. Bethard et al. (2004) only evaluate on an artificial dataset (i.e. a subset of sentences from FrameNet and PropBank (Kingsbury and Palmer, 2002)). The only realistic test set on which Kim and Hovy (2006) evaluate their approach are news texts. Their method is compared against a simple rule-based baseline and, unlike this work, not against a robust data-driven algorithm.

(Wiegand and Klakow, 2011b) is similar to (Kim and Hovy, 2006) in that a rule-based approach is used relying on the relationship to predictive predicates. Diverse resources are considered for obtaining such words; however, they are only evaluated on the entire MPQA corpus.

The only cross-domain evaluation of opinion holder extraction is reported in (Li et al., 2007), using the MPQA corpus as a training set and the NTCIR collection as a test set. A low cross-domain performance is obtained, and the authors conclude that this is due to the very different annotation schemes of those corpora.

8 Conclusion

We examined different generalization methods for opinion holder extraction. We found that for in-domain classification, the more labeled training data are used, the smaller the impact of generalization. Robust learning methods, such as convolution kernels, benefit less from generalization than weaker classifiers, such as conditional random fields. For cross-domain classification, generalization is always helpful. Distant domains are problematic for learning-based methods; however, rule-based methods provide a reasonable recall and can be effectively combined with the learning-based methods. The types of generalization that help best are manually compiled lexicons, followed by an induction method inspired by distant supervision. Finally, we examined the case of opinion holders as patients and presented a novel automatic extraction method that proved effective. Such dedicated extraction methods are important, as common labeled datasets (from the news domain) do not provide sufficient training data for these constructions.

Acknowledgements

This work was funded by the German Federal Ministry of Education and Research (Software-Cluster) under grant no. 01IC10S01. The authors thank Alessandro Moschitti, Benjamin Roth and Josef Ruppenhofer for their technical support and interesting discussions.
References

Steven Abney. 1991. Parsing By Chunks. In Robert Berwick, Steven Abney, and Carol Tenny, editors, Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht.

Alina Andreevskaia and Sabine Bergler. 2008. When Specialists and Generalists Work Together: Overcoming Domain Dependence in Sentiment Tagging. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL/HLT), Columbus, OH, USA.

Steven Bethard, Hong Yu, Ashley Thornton, Vasileios Hatzivassiloglou, and Dan Jurafsky. 2004. Extracting Opinion Propositions and Opinion Holders using Syntactic and Lexical Cues. In Computing Attitude and Affect in Text: Theory and Applications. Springer-Verlag.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479.

Yejin Choi, Claire Cardie, Ellen Riloff, and Siddharth Patwardhan. 2005. Identifying Sources of Opinions with Conditional Random Fields and Extraction Patterns. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP), Vancouver, BC, Canada.

Yejin Choi, Eric Breck, and Claire Cardie. 2006. Joint Extraction of Entities and Relations for Opinion Recognition. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Sydney, Australia.

Charles J. Fillmore, Christopher R. Johnson, and Miriam R. Petruck. 2003. Background to FrameNet. International Journal of Lexicography, 16:235–250.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Ann Arbor, MI, USA.

Thorsten Joachims. 1999. Making Large-Scale SVM Learning Practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press.

Richard Johansson and Alessandro Moschitti. 2010. Reranking Models in Fine-grained Opinion Analysis. In Proceedings of the International Conference on Computational Linguistics (COLING), Beijing, China.

Richard Johansson and Alessandro Moschitti. 2011. Extracting Opinion Expressions and Their Polarities - Exploration of Pipelines and Joint Models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Portland, OR, USA.

Jason S. Kessler, Miriam Eckert, Lyndsay Clarke, and Nicolas Nicolov. 2010. The ICWSM JDPA 2010 Sentiment Corpus for the Automotive Domain. In Proceedings of the International AAAI Conference on Weblogs and Social Media Data Challenge Workshop (ICWSM-DCW), Washington, DC, USA.

Soo-Min Kim and Eduard Hovy. 2006. Extracting Opinions, Opinion Holders, and Topics Expressed in Online News Media Text. In Proceedings of the ACL Workshop on Sentiment and Subjectivity in Text, Sydney, Australia.

Paul Kingsbury and Martha Palmer. 2002. From TreeBank to PropBank. In Proceedings of the Conference on Language Resources and Evaluation (LREC), Las Palmas, Spain.

Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Sapporo, Japan.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the International Conference on Machine Learning (ICML).

Dinko Lambov, Gaël Dias, and Veska Noncheva. 2009. Sentiment Classification across Domains. In Proceedings of the Portuguese Conference on Artificial Intelligence (EPIA), Aveiro, Portugal. Springer-Verlag.

Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.

Yaoyong Li, Kalina Bontcheva, and Hamish Cunningham. 2007. Experiments of Opinion Analysis on the Corpora MPQA and NTCIR-6. In Proceedings of the NTCIR-6 Workshop Meeting, Tokyo, Japan.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant Supervision for Relation Extraction without Labeled Data. In Proceedings of the Joint Conference of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL/IJCNLP), Singapore.

Katharina Morik, Peter Brockhausen, and Thorsten Joachims. 1999. Combining Statistical Learning with a Knowledge-based Approach - A Case Study in Intensive Care Monitoring. In Proceedings of the International Conference on Machine Learning (ICML).

Andreas Stolcke. 2002. SRILM - An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), Denver, CO, USA.

Veselin Stoyanov and Claire Cardie. 2011. Automatically Creating General-Purpose Opinion Summaries from Text. In Proceedings of Recent Advances in Natural Language Processing (RANLP), Hissar, Bulgaria.

Veselin Stoyanov, Claire Cardie, Diane Litman, and Janyce Wiebe. 2004. Evaluating an Opinion Annotation Scheme Using a New Multi-Perspective Question and Answer Corpus. In Proceedings of the AAAI Spring Symposium on Exploring Attitude and Affect in Text, Menlo Park, CA, USA.

Cigdem Toprak, Niklas Jakob, and Iryna Gurevych. 2010. Sentence and Expression Level Annotation of Opinions in User-Generated Discourse. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Uppsala, Sweden.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word Representations: A Simple and General Method for Semi-supervised Learning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Uppsala, Sweden.

Janyce Wiebe, Theresa Wilson, Rebecca Bruce, Matthew Bell, and Melanie Martin. 2004. Learning Subjective Language. Computational Linguistics, 30(3).

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating Expressions of Opinions and Emotions in Language. Language Resources and Evaluation, 39(2/3):164–210.

Michael Wiegand and Dietrich Klakow. 2010. Convolution Kernels for Opinion Holder Extraction. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL (HLT/NAACL), Los Angeles, CA, USA.

Michael Wiegand and Dietrich Klakow. 2011a. Prototypical Opinion Holders: What We can Learn from Experts and Analysts. In Proceedings of Recent Advances in Natural Language Processing (RANLP), Hissar, Bulgaria.

Michael Wiegand and Dietrich Klakow. 2011b. The Role of Predicates in Opinion Holder Extraction. In Proceedings of the RANLP Workshop on Information Extraction and Knowledge Acquisition (IEKA), Hissar, Bulgaria.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing Contextual Polarity in Phrase-level Sentiment Analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP), Vancouver, BC, Canada.
335
Skip N-grams and Ranking Functions for Predicting Script Events
Bram Jans
KU Leuven
Leuven, Belgium
bram.jans@gmail.com

Steven Bethard
University of Colorado Boulder
Boulder, Colorado, USA
steven.bethard@colorado.edu
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 336–344,
Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics
Section 2 gives an overview of the prior work related to this task. Section 3 lists and briefly describes different approaches that try to provide answers to the three questions posed in this introduction, while Section 4 presents the results of our experiments and reports on our findings. Finally, Section 5 provides a conclusive discussion along with ideas for future work.

2 Prior Work

Our work is primarily inspired by the work of Chambers and Jurafsky, which combined a dependency parser with coreference resolution to collect event script statistics and predict script events (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009). For each document in their training corpus, they used coreference resolution to identify all the entities, and a dependency parser to identify all verbs that had an entity as either a subject or object. They defined an event as a verb plus a dependency type (either subject or object), and collected, for each entity, the chain of events that it participated in. They then calculated pointwise mutual information (PMI) statistics over all the pairs of events that occurred in the event chains in their corpus. To predict a new script event given a partial chain of events, they selected the event with the highest sum of PMIs with all the events in the partial chain.

The work of McIntyre and Lapata followed in this same paradigm (McIntyre and Lapata, 2009; McIntyre and Lapata, 2010), collecting chains of events by looking at entities and the sequence of verbs for which they were a subject or object. They also calculated statistics over the collected event chains, though they considered both event bigram and event trigram counts. Rather than predicting an event for a script, however, they used these simple counts to predict the next event that should be generated for a children's story.

Manshadi and colleagues were concerned about the scalability of running parsers and coreference over a large collection of story blogs, and so used a simplified version of event chains: just the main verb of each sentence (Manshadi et al., 2008). Rather than rely on an ad-hoc summation of PMIs, they applied language modeling techniques (specifically, a smoothed 5-gram model) over the sequence of events in the collected chains. However, they only tested these language models on sequencing tasks (e.g. is the real sequence better than a random sequence?) rather than on prediction tasks (e.g. which event should follow these events?).

In the current article, we attempt to shed some light on these previous works by comparing different ways of collecting and using event chains.

3 Methods

Models that predict script events typically have three stages. First, a large corpus is processed to find event chains in each of the documents. Next, statistics over these event chains are gathered and stored. Finally, the gathered statistics are used to create a model that takes as input a partial script and produces as output a ranked list of events for that script. The following sections give more details about each of these stages and identify the decisions that must be made in each step; an overview of the whole process with an example source text is displayed in Figure 1.

3.1 Identifying Event Chains

Event chains are typically defined as a sequence of actions performed by some actor. Formally, an event chain C for some actor a is a partially ordered set of events (v, d) where each v is a verb that has the actor a as its dependency d. Following prior work (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009; McIntyre and Lapata, 2009; McIntyre and Lapata, 2010), these event chains are identified by running a coreference system and a dependency parser. Then, for each entity identified by the coreference system, all verbs that have a mention of that entity as one of their dependencies are collected.¹ The event chain is then the sequence of (verb, dependency-type) tuples. For example, given the sentence "A Crow was sitting on a branch of a tree when a Fox observed her", the event chain for the Crow would be (sitting, SUBJECT), (observed, OBJECT).

Once event chains have been identified, the most appropriate event chains for training the model must be selected. The goal of this process is to select the subset of the event chains identified by the coreference system and the dependency parser that look to be the most reliable. Both the coreference system and the dependency parser make some errors, so not all event chains are necessarily useful for training a model. The three strategies we consider for this selection process are:

¹ Also following prior work, we consider only the dependencies subject and object.
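The chain-identification step of Section 3.1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the mention indices and entity sets below are hypothetical stand-ins for the output of a dependency parser and a coreference system.

```python
from collections import defaultdict

def build_event_chains(verb_deps, entity_mentions):
    """Group (verb, dependency-type) events by the entity filling the slot.

    verb_deps: (mention_index, verb, dep_type) triples in textual order,
        where mention_index locates the noun phrase that is the verb's
        subject or object.
    entity_mentions: entity id -> set of mention indices proposed by the
        coreference system.
    """
    chains = defaultdict(list)
    for idx, verb, dep in verb_deps:
        for entity, mentions in entity_mentions.items():
            if idx in mentions:
                chains[entity].append((verb, dep))
    return dict(chains)

# "A Crow was sitting on a branch of a tree when a Fox observed her."
# Indices are invented token positions standing in for parser output.
verb_deps = [(1, "sitting", "SUBJECT"),
             (10, "observed", "SUBJECT"),
             (13, "observed", "OBJECT")]
entity_mentions = {"Crow": {1, 13}, "Fox": {10}}
chains = build_event_chains(verb_deps, entity_mentions)
print(chains["Crow"])  # → [('sitting', 'SUBJECT'), ('observed', 'OBJECT')]
```

Because "her" corefers with the Crow, the Crow's chain collects both the subject of "sitting" and the object of "observed", matching the example in the text.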
John woke up. He opened his eyes and yawned. Then he crossed the room and walked to the door. There he saw Mary. Mary smiled and kissed him. Then they both blushed.

Figure 1: An overview of the whole linear work flow showing the three key steps: identifying event chains, collecting statistics out of the chains (regular, 1-skip and 2-skip bigrams), and predicting a missing event in a script. The figure also displays how a partial script for evaluation (Section 4.3) is constructed. We show the whole process for Mary's event chain only, but the same steps are followed for John's event chain. [Figure graphic not reproduced.]
- Select all event chains, that is, all sequences of two or more events linked by common actors. This strategy will produce the largest number of event chains to train a model from, but it may produce noisier training data, as the very short chains included by this strategy may be less likely to represent real scripts.

- Select all long event chains, consisting of 5 or more events. This strategy will produce a smaller number of event chains, but as they are longer, they may be more likely to represent scripts.

- Select only the longest event chain. This strategy will produce the smallest number of event chains from a corpus. However, they may be of higher quality, since this strategy looks for the key actor in each story, and only uses the events that are tied together by that key actor. Since this is the single actor that played the largest role in the story, its actions may be the most likely to represent a real script.

3.2 Gathering Event Chain Statistics

Once event chains have been collected from the corpus, the statistics necessary for constructing the event prediction model must be gathered. Following prior work (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009; Manshadi et al., 2008; McIntyre and Lapata, 2009; McIntyre and Lapata, 2010), we focus on gathering statistics about the n-grams of events that occur in the collected event chains. Specifically, we look at strategies for collecting bigram statistics, the most common type of statistics gathered in prior work. We consider three strategies for collecting bigram statistics:

- Regular bigrams. We find all pairs of events that are adjacent in an event chain and collect the number of times each event pair was observed. For example, given the chain of events (saw, SUBJ), (kissed, OBJ), (blushed, SUBJ), we would extract the two event bigrams: ((saw, SUBJ), (kissed, OBJ))
and ((kissed, OBJ), (blushed, SUBJ)). In addition to the event pair counts, we also collect the number of times each event was observed individually, to allow for various conditional probability calculations. This strategy follows the classic approach for most language models.

- 1-skip bigrams. We collect pairs of events that occur with 0 or 1 events intervening between them. For example, given the chain (saw, SUBJ), (kissed, OBJ), (blushed, SUBJ), we would extract three bigrams: the two regular bigrams ((saw, SUBJ), (kissed, OBJ)) and ((kissed, OBJ), (blushed, SUBJ)), plus the 1-skip-bigram ((saw, SUBJ), (blushed, SUBJ)). This approach to collecting n-gram statistics is sometimes called skip-gram modeling, and it can reduce data sparsity by extracting more event pairs per chain (Guthrie et al., 2006). It has not previously been applied in the task of predicting script events, but it may be quite appropriate to this task because in most scripts it is possible to skip some events in the sequence.

- 2-skip bigrams. We collect pairs of events that occur with 0, 1 or 2 intervening events, similar to what was done in the 1-skip bigrams strategy. This will extract even more pairs of events from each chain, but it is possible the statistics over these pairs of events will be noisier.

3.3 Predicting Script Events

Once statistics over event chains have been collected, it is possible to construct the model for predicting script events. The input of this model will be a partial script c of n events, where c = c_1 c_2 ... c_n = (v_1, d_1), (v_2, d_2), ..., (v_n, d_n), and the output of this model will be a ranked list of events where the highest ranked events are the ones most likely to belong to the event sequence in the script. Thus, the key issue for this model is to define the function f for ranking events. We consider three such ranking functions:

- Chambers & Jurafsky PMI. Chambers and Jurafsky (2008) define their event ranking function based on pointwise mutual information. Given a partial script c as defined above, they consider each event e = (v', d') collected from their corpus, and score it as the sum of the pointwise mutual informations between the event e and each of the events in the script:

    f(e, c) = \sum_{i=1}^{n} \log \frac{P(c_i, e)}{P(c_i) P(e)}

  Chambers and Jurafsky's description of this score suggests that it is unordered, such that P(a, b) = P(b, a). Thus the probabilities must be defined as:

    P(e_1, e_2) = \frac{C(e_1, e_2) + C(e_2, e_1)}{\sum_{e_i} \sum_{e_j} C(e_i, e_j)}

    P(e) = \frac{C(e)}{\sum_{e'} C(e')}

  where C(e_1, e_2) is the number of times that the ordered event pair (e_1, e_2) was counted in the training data, and C(e) is the number of times that the event e was counted.

- Ordered PMI. A variation on the approach of Chambers and Jurafsky is to have a score that takes the order of the events in the chain into account. In this scenario, we assume that in addition to the partial script of events, we are given an insertion point, m, where the new event should be added. The score is then defined as:

    f(e, c) = \sum_{k=1}^{m} \log \frac{P(c_k, e)}{P(c_k) P(e)} + \sum_{k=m+1}^{n} \log \frac{P(e, c_k)}{P(e) P(c_k)}

  where the probabilities are defined as:

    P(e_1, e_2) = \frac{C(e_1, e_2)}{\sum_{e_i} \sum_{e_j} C(e_i, e_j)}

    P(e) = \frac{C(e)}{\sum_{e'} C(e')}

  This approach uses pointwise mutual information but also models the event chain in the order it was observed.

- Bigram probabilities. Finally, a natural ranking function, which has not been applied to the script event prediction task (but has
been applied to related tasks (Manshadi et al., 2008)) is to use the bigram probabilities of language modeling rather than pointwise mutual information scores. Again, given an insertion point m for the event in the script, we define the score as:

    f(e, c) = \sum_{k=1}^{m} \log P(e \mid c_k) + \sum_{k=m+1}^{n} \log P(c_k \mid e)

  where the conditional probability is defined as²:

    P(e_1 \mid e_2) = \frac{C(e_1, e_2)}{C(e_2)}

  This approach scores an event based on the probability that it was observed following all the events before it in the chain and preceding all the events after it in the chain. This approach most directly models the event chain in the order it was observed.

4 Experiments

Our experiments aimed to answer three questions: Which event chains are worth keeping? How should event bigram counts be collected? And which ranking method is best for predicting script events? To answer these questions we use two corpora, the Reuters Corpus and the Andrew Lang Fairy Tale Corpus, to evaluate our three different chain selection methods, {all chains, long chains, the longest chain}, our three different bigram counting methods, {regular bigrams, 1-skip bigrams, 2-skip bigrams}, and our three different ranking methods, {Chambers & Jurafsky PMI, ordered PMI, bigram probabilities}.

4.1 Corpora

We consider two corpora for evaluation:

- Reuters Corpus, Volume 1³ (Lewis et al., 2004): a large collection of 806,791 news stories written in English concerning a number of different topics such as politics, economics, sports, etc., strongly varying in length, topics and narrative structure.

- Andrew Lang Fairy Tale Corpus⁴: a small collection of 437 children's stories with an average length of 125 sentences, used previously for story generation by McIntyre and Lapata (2009).

In general, the Reuters Corpus is much larger and allows us to see how well script events can be predicted when a lot of data is available, while the Andrew Lang Fairy Tale Corpus is much smaller, but has a more straightforward narrative structure that may make identifying scripts simpler.

4.2 Corpus Processing

Constructing a model for predicting script events requires a corpus that has been parsed with a dependency parser, and whose entities have been identified via a coreference system. We therefore processed our corpora by (1) filtering out non-narrative articles, (2) applying a dependency parser, (3) applying a coreference resolution system and (4) identifying event chains via entities and dependencies.

First, articles that had no narrative content were removed from the corpora. In the Reuters Corpus, we removed all files solely listing stock exchange values, interest rates, etc., as well as all articles that were simply summaries of headlines from different countries or cities. After removing these files, the Reuters corpus was reduced to 788,245 files. Removing files from the Fairy Tale corpus was not necessary: all 437 stories were retained.

We then applied the Stanford Parser (Klein and Manning, 2003) to identify the dependency structure of each sentence in each article in the corpus. This parser produces a constituent-based syntactic parse tree for each sentence, and then converts this tree to a collapsed dependency structure via a set of tree patterns.

Next we applied the OpenNLP coreference engine⁵ to identify the entities in each article, and the noun phrases that were mentions of each entity.

Finally, to identify the event chains, we took each of the entities proposed by the coreference system, walked through each of the noun phrases associated with that entity, retrieved any subject

² Note that predicted bigram probabilities are calculated in this way for both classic language modeling and skip-gram modeling. In skip-gram modeling, skips in the n-grams are only used to increase the size of the training data; prediction is performed exactly as in classic language modeling.
³ http://trec.nist.gov/data/reuters/reuters.html
⁴ http://www.mythfolklore.net/andrewlang/
⁵ http://incubator.apache.org/opennlp/
or object dependencies that linked a verb to that noun phrase, and created an event chain from the sequence of (verb, dependency-type) tuples in the order that they appeared in the text.

4.3 Evaluation Metrics

We follow the approach of Chambers and Jurafsky (2008), evaluating our models for predicting script events in a narrative cloze task. The narrative cloze task is inspired by the classic psychological cloze task, in which subjects are given a sentence with a word missing and asked to fill in the blank (Taylor, 1953). Similarly, in the narrative cloze task, the system is given a sequence of events from a script where one event is missing, and asked to predict the missing event. The difficulty of a cloze task depends a lot on the context around the missing item: in some cases it may be quite predictable, but in many cases there is no single correct answer, though some answers are more probable than others. Thus, performing well on a cloze task is more about ranking the missing event highly, and not about proposing a single correct event.

In this way, narrative cloze is like perplexity in a language model. However, where perplexity measures how good the model is at predicting a script event given the previous events in the script, narrative cloze measures how good the model is at predicting what is missing between events in the script. Thus narrative cloze is somewhat more appropriate to our task, and at the same time simplifies comparisons to prior work.

Rather than manually constructing a set of scripts on which to run the cloze test, we follow Chambers and Jurafsky in reserving a section of our parsed corpora for testing, and then using the event chains from that section as the scripts for which the system must predict events. Given an event chain of length n, we run n cloze tests, with a different one of the n events removed each time to create a partial script from the remaining n - 1 events (see Figure 1). Given a partial script as input, an accurate event prediction model should rank the missing event highly in the guess list that it generates as output.

We consider two approaches to evaluating the guess lists produced in response to narrative cloze tests. Both are defined in terms of a test collection C, consisting of |C| partial scripts, where for each partial script c with missing event e, rank_sys(c) is the rank of e in the system's guess list for c.

- Average rank. The average rank of the missing event across all of the partial scripts:

    \frac{1}{|C|} \sum_{c \in C} \mathrm{rank}_{sys}(c)

  This is the evaluation metric used by Chambers and Jurafsky (2008).

- Recall@N. The fraction of partial scripts where the missing event is ranked N or less⁶ in the guess list:

    \frac{1}{|C|} \, |\{c : c \in C \wedge \mathrm{rank}_{sys}(c) \le N\}|

  In our experiments we use N = 50, but results are roughly similar for lower and higher values of N.

Recall@N has not been used before for evaluating models that predict script events; however, we suggest that it is a more reliable metric than average rank. When calculating the average rank, the length of the guess lists will have a significant influence on results. For instance, if a small model is trained with only a small vocabulary of events, its guess lists will usually be shorter than a larger model's, but if both models predict the missing event at the bottom of the list, the larger model will get penalized more. Recall@N does not have this issue: it is not influenced by the length of the guess lists.

An alternative evaluation metric would have been mean average precision (MAP), a metric commonly used to evaluate information retrieval. Mean average precision reduces to mean reciprocal rank (MRR) when there is only a single answer, as in the case of narrative cloze, and would have scored the ranked lists as:

    \frac{1}{|C|} \sum_{c \in C} \frac{1}{\mathrm{rank}_{sys}(c)}

Note that mean reciprocal rank has the same issues with guess list length that average rank does. Thus, since it does not aid us in comparing to prior work, and it has the same deficiencies as average rank, we do not report MRR in this article.

⁶ Rank 1 is the event that the system predicts is most probable, so we want the missing event to have the smallest rank possible.
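All three scores reduce to a few lines over the list of rank_sys values; the following sketch uses invented rank values purely for illustration.

```python
def average_rank(ranks):
    """Average rank of the missing event across all partial scripts (lower is better)."""
    return sum(ranks) / len(ranks)

def recall_at_n(ranks, n=50):
    """Fraction of partial scripts whose missing event is ranked n or better."""
    return sum(1 for r in ranks if r <= n) / len(ranks)

def mean_reciprocal_rank(ranks):
    """MRR, shown only for comparison; it is not reported in this article."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# rank_sys(c) for four hypothetical cloze tests
ranks = [1, 3, 60, 120]
print(average_rank(ranks))     # → 46.0
print(recall_at_n(ranks, 50))  # → 0.5
```

Note how the two outlier ranks (60 and 120) dominate the average rank but leave Recall@50 unchanged once they fall outside the cutoff, which is the length-sensitivity argument made above.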
Table 1: Chain selection methods for the Reuters corpus (2-skip bigrams + bigram probabilities): comparison of average ranks and Recall@50.

    Chain selection     Av. rank    Recall@50
    all chains          502         0.5179
    long chains         549         0.4951
    the longest chain   546         0.4984

Table 2: Chain selection methods for the Fairy Tale corpus (2-skip bigrams + bigram probabilities): comparison of average ranks and Recall@50.

    Chain selection     Av. rank    Recall@50
    all chains          1650        0.3376
    long chains         452         0.3461
    the longest chain   1534        0.3376

Table 3: Event bigram selection methods for the Reuters corpus (all chains + bigram probabilities): comparison of average ranks and Recall@50.

    Bigram selection    Av. rank    Recall@50
    regular bigrams     789         0.4886
    1-skip bigrams      630         0.4951
    2-skip bigrams      502         0.5179

Table 4: Event bigram selection methods for the Fairy Tale corpus (all chains + bigram probabilities): comparison of average ranks and Recall@50.

    Bigram selection    Av. rank    Recall@50
    regular bigrams     2363        0.3227
    1-skip bigrams      1690        0.3418
    2-skip bigrams      1650        0.3376

4.4 Results

We considered all 27 combinations of our chain selection methods, bigram counting methods, and ranking methods: {all chains, long chains, the longest chain} × {regular bigrams, 1-skip bigrams, 2-skip bigrams} × {Chambers & Jurafsky PMI, ordered PMI, bigram probabilities}. The best among these 27 combinations for the Reuters corpus was {all chains} × {2-skip bigrams} × {bigram probabilities}, achieving an average rank of 502 and a Recall@50 of 0.5179.

Since viewing all the combinations at once would be confusing, the following sections instead investigate each decision (selection, counting, ranking) one at a time. While one decision is varied across its three choices, the other decisions are held to their values in the best model above.

4.4.1 Identifying Event Chains

We first try to answer the question: How should representative chains of events be selected from the source text? Tables 1 and 2 show performance when we vary the strategy for selecting event chains, while fixing the counting method to 2-skip bigrams, and fixing the ranking method to bigram probabilities.

For the Reuters collection, we see that using all chains gives a lower average rank and a higher Recall@50 than either of the strategies that select a subset of the event chains. The explanation is probably simple: using all chains produces more than 700,000 bigrams from the Reuters corpus, while using only the long chains produces only around 300,000. So more data is better data for predicting script events.

For the Fairy Tale collection, long chains gives the lowest average rank and highest Recall@50. In this collection, there is apparently some benefit to filtering the shorter event chains, probably because the collection is small enough that the noise introduced from dependency and coreference errors plays a larger role.

4.4.2 Gathering Event Chain Statistics

We next try to answer the question: Given an event chain, how should statistics be gathered from it? Tables 3 and 4 show performance when we vary the strategy for counting event pairs, while fixing the selection method to all chains, and fixing the ranking method to bigram probabilities.

For the Reuters corpus, 2-skip bigrams achieves the lowest average rank and the highest Recall@50. For the Fairy Tale corpus, 1-skip bigrams and 2-skip bigrams perform similarly, and both have lower average rank and higher Recall@50 than regular bigrams.

Skip-grams probably outperform regular n-grams on both of these corpora because the skip-grams provide many more event pairs over which to calculate statistics: in the Reuters corpus, regular bigrams extracts 737,103 bigrams, while 2-skip bigrams extracts 1,201,185 bigrams. Though skip-grams have not been applied to predicting script events before, it seems that they are a good fit, and better capture statistics about narrative event chains than regular n-grams do.
Table 5: Ranking methods for the Reuters corpus (all chains + 2-skip bigrams): comparison of average ranks and Recall@50.

    Ranking method      Av. rank    Recall@50
    C&J PMI             2052        0.1954
    ordered PMI         3584        0.1694
    bigram prob.        502         0.5179

Table 6: Ranking methods for the Fairy Tale corpus (all chains + 2-skip bigrams): comparison of average ranks and Recall@50.

    Ranking method      Av. rank    Recall@50
    C&J PMI             1455        0.1975
    ordered PMI         2460        0.0467
    bigram prob.        1650        0.3376

4.4.3 Predicting Script Events

Finally, we try to answer the question: Given event n-gram statistics, which ranking function best predicts the events for a script? Tables 5 and 6 show performance when we vary the strategy for ranking event predictions, while fixing the selection method to all chains, and fixing the counting method to 2-skip bigrams.

For both Reuters and the Fairy Tale corpus, Recall@50 identifies bigram probabilities as the best ranking function by far. On the Reuters corpus the Chambers & Jurafsky PMI ranking method achieves Recall@50 of only 0.1954, while the bigram probabilities ranking method achieves 0.5179. The gap is also quite large on the Fairy Tales corpus: 0.1975 vs. 0.3376.

On the Reuters corpus, average rank also identifies bigram probabilities as the best ranking function, yet for the Fairy Tales corpus, Chambers & Jurafsky PMI and bigram probabilities have similar average ranks. This inconsistency is probably due to the flaws in the average rank evaluation measure that were discussed in Section 4.3: the measure is overly sensitive to the length of the guess list, particularly when the missing event is ranked lower, as it is likely to be when training on a smaller corpus like the Fairy Tales corpus.

5 Discussion

Our experiments have led us to several important conclusions. First, we have introduced skip-grams and proved their utility for acquiring script knowledge: our models that employ skip bigrams score consistently higher on event prediction. By following the intuition that events do not have to appear strictly one after another to be closely semantically related, skip-grams decrease data sparsity and increase the size of the training data.

Second, our novel bigram probabilities ranking function outperforms the other ranking methods. In particular, it outperforms the state-of-the-art pointwise mutual information method introduced by Chambers and Jurafsky (2008), and it does so by a large margin, more than doubling the Recall@50 on the Reuters corpus. The key insight here is that, when modeling events in a script, a language-model-like approach better fits the task than a mutual information approach.

Third, we have discussed why Recall@N is a better and more consistent evaluation metric than average rank. However, both evaluation metrics suffer from the strictness of the narrative cloze test, which accepts only one event as the correct event, while it is sometimes very difficult, even for humans, to predict the missing events, and sometimes more solutions are possible and equally correct. In future research, our goal is to design a better evaluation framework which is more suitable for this task, where credit can be given for proposed script events that are appropriate but not identical to the ones observed in a text.

Fourth, we have observed some differences in results between the Reuters and the Fairy Tale corpora. The results for Reuters are consistently better (higher Recall@50, lower average rank), although fairy tales contain a plainer narrative structure, which should be more appropriate to our task. This again leads us to the conclusion that more data (even with more noise, as in Reuters) leads to a greater coverage of events, better overall models and, consequently, to more accurate predictions. Still, the Reuters corpus seems to be far from a perfect corpus for research in the automatic acquisition of scripts, since only a small portion of the corpus contains true narratives. Future work must therefore gather a large corpus of true narratives, like fairy tales and children's stories, whose simple plot structures should provide better learning material, both for models predicting script events, and for related tasks like automatic storytelling (McIntyre and Lapata, 2009).

One of the limitations of the work presented here is that it takes a fairly linear, n-gram-based approach to characterizing story structure. We think such an approach is useful because it forms a natu-
ral baseline for the task (as it does in many other tasks such as named entity tagging and language modeling). However, story structure is seldom strictly linear, and future work should consider models based on grammatical or discourse links that can capture the more complex nature of script events and story structure.

Acknowledgments

We would like to thank the anonymous reviewers for their constructive comments. This research was carried out as a master's thesis in the framework of the TERENCE European project (EU FP7-257410).

References

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 789–797.

Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 602–610.

Nathanael Chambers and Dan Jurafsky. 2011. Template-based information extraction without the templates. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 976–986.

David Guthrie, Ben Allison, W. Liu, Louise Guthrie, and Yorick Wilks. 2006. A closer look at skip-gram modelling. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), pages 1222–1225.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423–430.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397.

Mehdi Manshadi, Reid Swanson, and Andrew S. Gordon. 2008. Learning a probabilistic model of event sequences from internet weblog stories. In Proceedings of the Twenty-First International Florida Artificial Intelligence Research Society Conference.

Neil McIntyre and Mirella Lapata. 2009. Learning to tell tales: A data-driven approach to story generation. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 217–225.

Neil McIntyre and Mirella Lapata. 2010. Plot induction and evolutionary search for story generation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1562–1572.

Michaela Regneri, Alexander Koller, and Manfred Pinkal. 2010. Learning script knowledge with web experiments. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 979–988.

Roger C. Schank and Robert P. Abelson. 1977. Scripts, Plans, Goals, and Understanding: An Inquiry into Human Knowledge Structures. Lawrence Erlbaum Associates.

Wilson L. Taylor. 1953. Cloze procedure: a new tool for measuring readability. Journalism Quarterly, 30:415–433.
344
The Problem with Kappa
David M W Powers
Centre for Knowledge & Interaction Technology, CSEM
Flinders University
David.Powers@flinders.edu.au
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 345-355, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

Abstract

It is becoming clear that traditional evaluation measures used in Computational Linguistics (including Error Rates, Accuracy, Recall, Precision and F-measure) are of limited value for unbiased evaluation of systems, and are not meaningful for comparison of algorithms unless both the dataset and algorithm parameters are strictly controlled for skew (Prevalence and Bias). The use of techniques originally designed for other purposes, in particular Receiver Operating Characteristics Area Under Curve, plus variants of Kappa, have been proposed to fill the void.

This paper aims to clear up some of the confusion relating to evaluation, by demonstrating that the usefulness of each evaluation method is highly dependent on the assumptions made about the distributions of the dataset and the underlying populations. The behaviour of a number of evaluation measures is compared under common assumptions.

Deploying a system in a context which has the opposite skew from its validation set can be expected to approximately negate Fleiss Kappa and halve Cohen Kappa, but leave Powers Kappa unchanged. For most performance evaluation purposes the latter is thus most appropriate, whilst for comparison of behaviour Matthews Correlation is recommended.

Introduction

Research in Computational Linguistics usually requires some form of quantitative evaluation. A number of traditional measures borrowed from Information Retrieval (Manning & Schütze, 1999) are in common use, but there has been considerable critical evaluation of these measures themselves over the last decade or so (Entwisle & Powers, 1998; Flach, 2003; Ben-David, 2008). Receiver Operating Characteristics (ROC) analysis has been advocated as an alternative by many, and in particular has been used by Fürnkranz and Flach (2005), Ben-David (2008) and Powers (2008) to better understand both the learning algorithms and the relationships between the various measures, and the inherent biases that make many of them suspect. One of the key advantages of ROC is that it provides a clear indication of chance-level performance, as well as a less well known indication of the relative cost weighting of positive and negative cases for each possible system or parameterization represented.

ROC Area Under the Curve (Fig. 1) has also been used as a performance measure, but it averages over the false positive rate (Fallout) and is thus a function of cost that is dependent on the classifier rather than the application. For this reason it has come in for considerable criticism, and a number of variants and alternatives have been proposed (e.g. AUK, Kaymak et al., 2010, and H-measure, Hand, 2009). An AUC curve that is at least as good as a second curve at all points is said to dominate it, and indicates that the first classifier is equal to or better than the second for all plotted values of the parameters, and all cost ratios. However, AUC being greater for one classifier than another does not have such a property; indeed deconvexities within or
intersections of ROC curves are both prima facie evidence that fusion of the parameterized classifiers will be useful (cf. Provost and Fawcett, 2001; Flach and Wu, 2005).

AUK stands for Area Under Kappa, and represents a step in the advocacy of Kappa (Ben-David, 2008ab) as an alternative to the traditional measures and ROC AUC. Powers (2003, 2007) has also proposed a Kappa-like measure (Informedness) and analysed it in terms of ROC, and there are many more, with Warrens (2010) analyzing the relationships between some of the others.

Systems like RapidMiner (2011) and Weka (Witten and Frank, 2005) provide almost all of the measures we have considered, and many more besides. This encourages the use of multiple measures, and indeed it is now becoming routine to display tables of multiple results for each system; this is in particular true for the frameworks of some of the challenges and competitions brought to the communities (e.g. 2nd i2b2 Challenge in NLP for Clinical Data, 2011; 2nd Pascal Challenge on HTC, 2011).

This use of multiple statistics is no doubt in response to the criticism levelled at the evaluation mechanisms used in earlier generations of competitions and the above-mentioned critiques, but the proliferation of alternate measures in some ways merely compounds the problem. Researchers are tempted to choose those that favour their system as they face the dilemma of what to do about competing (and often disagreeing) evaluation measures that they do not completely understand. These systems and competitions also exhibit another issue: the tendency to macro-average over multiple classes, even for measures that are not denominated in class (e.g. that are proportions of predicted labels rather than real classes, as with Precision).

This paper is directed at better understanding some of these new and old measures, as well as providing recommendations as to which measures are appropriate in which circumstances.

1 What's in a Kappa?

In this paper we focus on the Kappa family of measures, as well as some closely related statistics named for other letters of the Greek alphabet, and some measures that we will show behave as Kappa measures although they were not originally defined as such. These include Informedness, Gini Coefficient and single-point ROC AUC, which are in fact all equivalent to DeltaP in the dichotomous case, which we deal with first, and to the other Kappas when the marginal prevalences (or biases) match.

1.1 Two classes and non-negative Kappa

Kappa was originally proposed (Cohen, 1960) to compare human ratings in a binary, or dichotomous, classification task. Cohen (1960) recognized that Rand Accuracy did not take chance into account, and therefore proposed to subtract off the chance level of Accuracy and then renormalize to the form of a probability:

K(Acc) = [Acc - E(Acc)] / [1 - E(Acc)]    (1)

This leaves the question of how to estimate the expected Accuracy, E(Acc). Cohen (1960) made the assumption that raters would have different distributions that could be estimated as the products of the corresponding marginal coefficients of the contingency table:

                 +ve Class   -ve Class
+ve Prediction   A=TP        B=FP        PP
-ve Prediction   C=FN        D=TN        PN
                 RP          RN          N

Table 1. Statistical and IR Contingency Notation

In order to discuss this further it is important to set out our notational conventions. In statistics, the letters A-D (upper case or lower case) are conventionally used to label the cells, and their sums may be used to label the marginal cells. However, in the literature on ROC analysis, which we follow here, it is usual to talk about true and false positives (that is, positive predictions that are correct or incorrect), and conversely true and false negatives. Often upper case is used to indicate counts in the contingency table, which sum to the number of instances, N. In this case lower case letters are used to indicate probabilities, which means that the corresponding upper case values in the contingency table are all divided by N, and n=1.

Statistics relative to (the total numbers of items in) the real classes are called Rates, and have the number (or proportion) of Real Positives (RP) or Real Negatives (RN) in the denominator. In this notation, we have Recall = TPR = TP/RP.

Conversely, statistics relative to the (number of) predictions are called Accuracies, so relative to the predictions that label instances positively, Predicted Positives (PP), we have Precision = TPA = TP/PP.
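As a concrete illustration of this notation and of Eqn (1), the following Python sketch (an illustration added here, not code from the original analysis; the function names are ours) derives the marginal statistics of Table 1 from the four cell counts and computes Cohen Kappa with the product-of-marginals expectation:

```python
def contingency_stats(tp, fp, fn, tn):
    """Marginal statistics of Table 1, in the lower-case probability form
    (rp, rn are the Prevalences; pp, pn are the Biases)."""
    n = tp + fp + fn + tn
    rp, rn = (tp + fn) / n, (fp + tn) / n    # Real Positives / Negatives
    pp, pn = (tp + fp) / n, (fn + tn) / n    # Predicted Positives / Negatives
    recall = tp / (tp + fn)                  # TPR = TP/RP
    precision = tp / (tp + fp)               # TPA = TP/PP
    return rp, rn, pp, pn, recall, precision

def cohen_kappa(tp, fp, fn, tn):
    """Eqn (1): K(Acc) = [Acc - E(Acc)] / [1 - E(Acc)], with E(Acc)
    estimated from the products of the marginals (Cohen, 1960)."""
    n = tp + fp + fn + tn
    acc = (tp + tn) / n                      # Rand Accuracy
    rp, rn, pp, pn, _, _ = contingency_stats(tp, fp, fn, tn)
    e_acc = rp * pp + rn * pn                # chance-level Accuracy
    return (acc - e_acc) / (1 - e_acc)
```

A table whose cells are exactly the products of its marginals (pure chance under Cohen's assumption) returns K = 0 regardless of skew, while perfect agreement returns K = 1.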
Figure 1. Illustration of ROC Analysis. The solid diagonal represents chance performance for different rates of guessing positive or negative labels. The dotted line represents the convex hull enclosing the results of different systems, thresholds or parameters tested. The (0,0) and (1,1) points represent guessing always negative and always positive, and are always nominal systems in a ROC curve. The points along any straight line segment of a convex hull are achievable by probabilistic interpolation of the systems at each end; the gradient represents the cost ratio, and all points along the segment, including the endpoints, have the same effective cost benefit. AUC is the area under the curve joining the systems with straight edges, and AUCH is the area under the convex hull, where points within it are ignored. The height above the chance line of any point represents DeltaP, the Gini Coefficient and also the Dichotomous Informedness of the corresponding system, and also corresponds to twice the area of the triangle between it and the chance line, and thus 2AUC-1 where AUC is calculated on the single-point curve (not shown) joining it to (0,0) and (1,1). The (1,0) point represents perfect performance with 100% True Positive Rate and 0% False Positive Rate.

The accuracy of all our predictions, positive or negative, is given by Rand Accuracy = (TP+TN)/N = tp+tn, and this is what is meant in general by the unadorned term Accuracy, or the abbreviation Acc.

Rand Accuracy is the weighted average of Precision and Inverse Precision (the probability that negative predictions are correctly labeled), where the weighting is made according to the number of predictions made for the corresponding labels. Rand Accuracy is also the weighted average of Recall and Inverse Recall (the probability that negative instances are correctly predicted), where the weighting is made according to the number of instances in the corresponding classes.

The marginal probabilities rp and pp are also known as Prevalence (the class prevalence of positive instances) and Bias (the label bias to positive predictions), and the corresponding probabilities of negative classes and labels are the Inverse Prevalence and Inverse Bias respectively. In the ROC literature, the ratio of negative to positive classes is often referred to as the class ratio or skew. We can similarly also refer to a label ratio, prediction ratio or prediction skew. Note that optimal performance can only be achieved if class skew = label skew.

The Expected True Positives and Expected True Negatives for Cohen Kappa, as well as for Chi-squared significance, are estimated as the product of Bias and Prevalence, and the product of Inverse Bias and Inverse Prevalence, respectively, where in traditional uses of Kappa for agreement of human raters the contingency table represents one rater as providing the classification to be predicted by the other rater. Cohen assumes that their distributions of ratings are independent, as reflected both by the margins and the contingencies: ETP = RP*PP/N and ETN = RN*PN/N, or in probability form etp = rp*pp and etn = rn*pn. This gives us E(Acc) = etp + etn.

By contrast, the two-rater two-class form of Fleiss (1981) Kappa, also known as Scott Pi, assumes that both raters are labeling independently using the same distribution, and that the margins reflect this potential variation. The expected number of positives is thus effectively estimated as the average of the two raters' counts, so that EP = (RP+PP)/2 and EN = (RN+PN)/2, with etp = ep² and etn = en² in probability form.

1.2 Inverting Kappa

The definition of Kappa in Eqn (1) can be seen to be applicable to arbitrary definitions of Expected Accuracy, and in order to discover how other measures relate to the family of Kappa measures it is useful to invert Kappa to discover the implicit definition of Expected Accuracy that allows a measure to be interpreted as a form of Kappa. We simply make E(Acc) the subject by multiplying out Eqn (1) to a common denominator and associating factors of E(Acc):
K(Acc) = [Acc - E(Acc)] / [1 - E(Acc)]    (1)
E(Acc) = [Acc - K(Acc)] / [1 - K(Acc)]    (2)

Note that for a given value of Acc the function connecting E(Acc) and K(Acc) is its own inverse:

E(Acc) = fAcc(K(Acc))    (3)
K(Acc) = fAcc(E(Acc))    (4)

In what follows we will tend to drop the Acc argument or subscript when it is clear, and we will also subscript E and K with the name or initial of the corresponding definition of Expectation and thus Kappa (viz. Fleiss and Cohen so far).

Note that given Acc and E(Acc) are in the range 0..1 as probabilities, Kappa is also restricted to this range, and takes the form of a probability.

1.3 Multiclass multirater Kappa

Fleiss (1981) and others sought to generalize the Cohen (1960) definition of Kappa to handle both multiple classes (not just positive/negative) and multiple raters (not just two, one of which we have called real and the other prediction). Fleiss in fact generalized Scott's (1955) Pi in both senses, not Cohen Kappa. The Fleiss Kappa is not formulated as we have done here for exposition, but in terms of pairings (agreements) amongst the raters, who are each assumed to have rated the same number of items, N, but not necessarily all of them. Krippendorff's (1970, 1978) Alpha effectively generalizes further by dealing with arbitrary numbers of raters assessing different numbers of items.

Light (1971) and Hubert (1977) successfully generalized Cohen Kappa. Another approach to estimating E(Acc) was taken by Bennett et al. (1955), which basically assumed all classes were equilikely (effectively what use of Accuracy, F-Measure etc. does, although these do not subtract off the chance component). The Bennett Kappa was generalized by Randolph (2005), but as our starting point is that we need to take the actual margins into account, we do not pursue these further. However, Warrens (2010a) shows that, under certain conditions, Fleiss Kappa is a lower bound of both the Hubert generalization of Cohen Kappa and the Randolph generalization of Bennett Kappa, which is itself correspondingly an upper bound of both the Hubert and the Light generalizations of Cohen Kappa. Unfortunately the conditions are that there is some agreement between the class and label skews (viz. the prevalence and bias of each class/label). Our focus in this paper is the behaviour of the various Kappa measures as we move from strongly matched to strongly mismatched biases.

Cohen (1968) also introduced a weighted variant of Kappa. We have also discussed cost weighting in the context of ROC, and Hand (2009) seeks to improve on ROC AUC by introducing a beta distribution as an estimated cost profile, but we will not discuss these further here, as we are more interested in the effectiveness of the classifier overall rather than matching a particular cost profile, and are skeptical about any generic cost distribution. In particular, the beta distribution gives priority to central tendency rather than boundary conditions, but boundary conditions are frequently encountered in optimization. Similarly, Kaymak et al.'s (2010) proposal to replace AUC by AUK corresponds to a Cohen Kappa reweighting of ROC that eliminates many of its useful properties, without any expectation that the measure, as an integration across a surrogate cost distribution, has any validity for system selection. Introducing alternative weights is also allowed in the definition of F-Measure, although in practice this is almost invariably employed as the equally weighted harmonic mean of Recall and Precision. Introducing additional weight or distribution parameters just multiplies the confusion as to which measure to believe.

Powers (2003) derived a further multiclass Kappa-like measure from first principles, dubbing it Informedness, based on an analogy of a Bookmaker associating costs/payoffs based on the odds. This is then proven to measure the proportion of time (or probability) a decision is informed versus random, based on the same assumptions re expectation as Cohen Kappa, and we will thus call it Powers Kappa, and derive a formulation of the corresponding expectation. Powers (2007) further identifies that the dichotomous form of Powers Kappa is equivalent to the Gini coefficient as a deskewed version of the weighted Relative Accuracy proposed by Flach (2003), based on his analysis and deskewing of common evaluation measures in the ROC paradigm. Powers (2007) also identifies that Dichotomous Informedness is equivalent to an empirically derived psychological measure called DeltaP (Perruchet et al. 2004). DeltaP (and its dual DeltaP') were derived based on analysis of human word association data; the combination of this empirical observation with the place of DeltaP as the dichotomous case of
Powers Informedness suggests that human association is in some sense optimal. Powers (2007) also introduces a dual of Informedness that he names Markedness, and shows that the geometric mean of Informedness and Markedness is Matthews Correlation, the nominal analog of Pearson Correlation.

Powers Informedness is in fact a variant of Kappa with some similarities to Cohen Kappa, but also some advantages over both Cohen and Fleiss Kappa due to its asymmetric relation with Recall; in the dichotomous form of Powers (2007),

Informedness = Recall + InverseRecall - 1
             = (Recall - Bias) / (1 - Prevalence).

If we think of Kappa as assessing the relationship between two raters, Powers' statistic is not evenhanded, and the Informedness and Markedness duals measure the two directions of prediction, normalizing Recall and Precision respectively. In fact, the relationship with Correlation allows these to be interpreted as regression coefficients for the prediction function and its inverse.

1.4 Kappa vs Correlation

It is often asked why we don't just use Correlation. In fact, Castellan (1996) uses Tetrachoric Correlation, another generalization of Pearson Correlation, which assumes that the two class variables are given by underlying normal distributions. Uebersax (1987), Hutchison (1993) and Bonett and Price (2005) each compare Kappa and Correlation, and conclude that there does not seem to be any situation where Kappa would be preferable to Correlation. However, all the Kappa and Correlation variants considered were symmetric, and it is thus interesting to consider the separate regression coefficients underlying Correlation that represent the Powers Kappa duals of Informedness and Markedness, which have the advantage of separating out the influences of Prevalence and Bias (which then allows macro-averaging, which is not admissible for any symmetric form of Correlation or Kappa, as we will discuss shortly). Powers (2007) regards Matthews Correlation as an appropriate measure for symmetric situations (like rater agreement) and generalizes the relationships between Correlation and Significance to the Markedness and Informedness measures. The differences between Informedness and Markedness, which relate to mismatches in Prevalence and Bias, mean that the pair of numbers provides further information about the nature of the relationship between the two classifications or raters, whilst the ability to take the geometric mean of (macro-averaged) Informedness and Markedness means that a single Correlation can be provided when appropriate.

Our aim now is therefore to characterize Informedness (and hence its dual Markedness) as a Kappa measure in relation to the families of Kappa measures represented by Cohen and Fleiss Kappa in the dichotomous case. Note that Warrens (2011) shows that linearly weighted versions of Cohen's (1968) Kappa are in fact weighted averages of dichotomous Kappas. Similarly, Powers (2003) shows that his Kappa (Informedness) has this property. Thus it is appropriate to consider the dichotomous case, and from this we can generalize as required.

1.5 Kappa vs Determinant

Warrens (2010c) discusses another commonly used measure, the Odds Ratio ad/bc (used in Epidemiology rather than Computer Science or Computational Linguistics). Closely related to this is the Determinant of the Contingency Matrix, dtp = ad - bc = tp - etp (in the Chi-Sqr, Cohen and Powers sense based on independent marginal probabilities). Both show whether the odds favour positives over negatives more for the first rater (real) than the second (predicted): for the ratio this holds if it is greater than one, for the difference if it is greater than 0. Note that taking logs of all coefficients would maintain the same relationship, and that the difference of the logs corresponds to the log of the ratio, mapping into the information domain.

Warrens (2010c) further shows (in cost-weighted form) that Cohen Kappa is given by the following (in the notation of this paper, but preferring the notations Prevalence and Inverse Prevalence to rp and rn for clarity):

KC = dtp / [(Prev*IBias + Bias*IPrev)/2].    (5)

Based on the previous characterization of Fleiss Kappa, we can further characterize it by

KF = dtp / [(Prev+Bias)*(IBias+IPrev)/4].    (6)

Powers (2007) also showed corresponding formulations for Bookmaker Informedness (B, or Powers Kappa = KP), Markedness and Matthews Correlation:

B = dtp / (Prev*IPrev).    (7)
M = dtp / (Bias*IBias).    (8)
C = dtp / √(Prev*IPrev*Bias*IBias).    (9)

These elegant dichotomous forms are straightforward, with the independence assumptions on Bias and Prevalence clear in
Cohen Kappa, the arithmetic means of Bias and Prevalence clear in Fleiss Kappa, and the geometric means of Bias and Prevalence in the Matthews Correlation. Further, the independence of Bias is apparent for Powers Kappa in the Informedness form, and the independence of Prevalence is clear in the Markedness direction.

Note that the names Powers uses suggest that we are measuring something about the information conveyed by the prediction about the class in the case of Informedness, and the information conveyed to the predictor by the class state in the case of Markedness. To the extent that Prevalence and Bias can be controlled independently, Informedness and Markedness are independent, and Correlation represents the joint probability of information being passed in both directions! Powers (2007) further proposes using log formulations of these measures to take them into the information domain, as well as relating them to mutual information, G-squared and chi-squared significance.

1.6 Kappa vs Concordance

The pairwise approach used by Fleiss Kappa and its relatives does not assume raters use a common distribution, but does assume they are using the same set, and number, of categories. When undertaking comparison of unconstrained ratings or unsupervised learning, this constraint is removed, and we need to use a measure of concordance to compare clusterings against each other or against a Gold Standard. Some of the concordance measures use operators in probability space and relate closely to the techniques here, whilst others operate in information space. See Pfitzner et al. (2009) for reviews of clustering comparison/concordance.

A complete coverage of evaluation would also cover significance and the multiple testing problem, but we will confine our focus in this paper to the issue of choice of Kappa or Correlation statistic, as well as addressing some issues relating to the use of macro-averaging. In this paper we are regarding the choice of Bias as under the control of the experimenter, as we have a focus on learned or hand-crafted computational linguistics systems. In fact, when we are using bootstrapping techniques or dealing with multiple real samples or different subjects or ecosystems, Prevalence may also vary. Thus the simple marginal assumptions of the Cohen or Powers statistics are the appropriate ones.

1.7 Averaging

We now consider the issue of dealing with multiple measures and results of multiple classifiers by averaging. We first consider averages of some of the individual measures we have seen. The averages need not be arithmetic means, or may represent means over the Prevalences and Biases.

We will be punctuating our theoretical discussions and explanations with empirical demonstrations, where we use 1:1 and 4:1 prevalence versus matching and mismatching bias to generate the chance-level contingency based on marginal independence. We then mix in a proportion of informed decisions, with the remaining decisions made by chance.

Table 2 compares Accuracy and F-Measure for informed decision percentages of 0, 100, 15 and -15. Note that Powers Kappa or Informedness purports to recover this proportion or probability.

F-Measure is one of the most common measures in Computational Linguistics and Information Retrieval, being the Harmonic Mean of Recall and Precision, which in the common unweighted form is also interpretable with respect to a mean of Prevalence and Bias:

F = tp / [(Prev+Bias)/2]    (10)

Note that like Recall and Precision, F-Measure totally ignores cell D, corresponding to tn. This is an issue when Prevalence and Bias are uneven or mismatched. In Information Retrieval this is often justified on the basis that the number of irrelevant documents is large and not precisely known, but in fact this is due to lack of knowledge of the number of relevant documents, which affects Recall. In fact, if tn is large with respect to both rp and pp, and thus with respect to the components tp, fp and fn, then both tn/pn and tn/rn approach 1 as tn increases without bound.

As discussed earlier, Rand Accuracy is a bias (prediction label) weighted average of Precision and Inverse Precision, as well as a prevalence (real class) weighted average of Recall and Inverse Recall. It reflects the D (tn) cell, unlike F, and while it does not remove the effect of chance, it does not have the positive bias of F:

Acc = tp + tn    (11)

We also point out that the differences between the various Kappas, shown in Determinant-normalized form in Eqns (5-9), vary only in the way prevalences and biases are averaged together in the normalizing denominator.
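The determinant-normalized forms of Eqns (5)-(10) can be checked directly with a small sketch (our own illustration; the cell probabilities are hypothetical). It verifies that the Matthews Correlation of Eqn (9) is the geometric mean of Informedness (7) and Markedness (8), that Informedness equals Recall + InverseRecall - 1, and that Eqn (5) agrees with the direct definition of Cohen Kappa in Eqn (1):

```python
import math

def kappa_family(tp, fp, fn, tn):
    """Probability-form contingency table (cells sum to 1).
    Returns (KC, B, M, C, F) per Eqns (5), (7), (8), (9), (10)."""
    prev, iprev = tp + fn, fp + tn            # Prevalence and inverse
    bias, ibias = tp + fp, fn + tn            # Bias and inverse
    dtp = tp * tn - fp * fn                   # determinant of the contingency
    kc = dtp / ((prev * ibias + bias * iprev) / 2)        # Eqn (5)
    b = dtp / (prev * iprev)                  # Bookmaker Informedness, Eqn (7)
    m = dtp / (bias * ibias)                  # Markedness, Eqn (8)
    c = dtp / math.sqrt(prev * iprev * bias * ibias)      # Matthews, Eqn (9)
    f = tp / ((prev + bias) / 2)              # F-Measure, Eqn (10)
    return kc, b, m, c, f

tp, fp, fn, tn = 0.4, 0.1, 0.2, 0.3           # hypothetical cell probabilities
kc, b, m, c, f = kappa_family(tp, fp, fn, tn)

# Eqn (9) is the geometric mean of Eqns (7) and (8)
assert math.isclose(c, math.sqrt(b * m))

# Informedness is also Recall + InverseRecall - 1
recall, inv_recall = tp / (tp + fn), tn / (fp + tn)
assert math.isclose(b, recall + inv_recall - 1)

# Eqn (5) matches Eqn (1) with the Cohen expectation
acc = tp + tn
e_acc = (tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)
assert math.isclose(kc, (acc - e_acc) / (1 - e_acc))
```

For this example KC = 0.4, while B ≈ 0.417, M = 0.4 and C ≈ 0.408: the measures differ only through the averaging of prevalences and biases in the denominator, as noted above.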
Informed   Measure   1:1/1:1   4:1/4:1   4:1/1:4
0%         Acc       50%       68%       32%
           F         50%       80%       32%
100%       Acc       100%      100%      100%
           F         100%      100%      100%
15%        Acc       57.5%     72.8%     42.2%
           F         57.5%     83%       46.97%
-15%       Acc       42.5%     57.8%     27.2%
           F         42.5%     72%       27.2%

Table 2. Accuracy and F-Measure for different mixes of prevalence and bias skew (odds ratios shown) as well as different proportions of correct (informed) answers versus guessing; negative proportions imply that the informed decisions are deliberately made incorrectly (an oracle tells me what to do and I do the opposite).

From Table 2 we note, in the first set of statistics, that the chance level varies from the 50% expected when Bias = Prevalence = 50%. This is in fact the E(Acc) used in calculating Cohen Kappa. Where Prevalences and Biases are equal and balanced, all common statistics agree (Recall = Precision = Accuracy = F), and they are interpretable with respect to this 50% chance level. All the Kappas will also agree, as the different averages of the identical prevalences and biases all come down to 50% as well. So, subtracting 50% from 57.5% and normalizing (dividing) by the average effective prevalence of 50%, we return 15% informed decisions in all cases (as seen in detail in Table 3).

However, F-measure gives an inflated estimate when it focuses on the more prevalent positive class, with corresponding bias in the chance component.

Worse still is the strength of the Acc and F scores under conditions of matched bias and prevalence when the deviation from chance is -15%, that is, making the wrong decision 15% of the time and guessing the rest of the time. In academic terms, if we bump these rates up to 25%, F-factor gives a High Distinction for guessing 75% of the time and putting the right answer for the other 25%, a Distinction for 100% guessing, and a Credit for guessing 75% of the time and putting a wrong answer for the other 25%! In fact, the Powers Kappa corresponds to the methodology of multiple-choice marking, where for questions with k+1 choices a right answer gets 1 mark and a wrong answer gets -1/k, so that guessing achieves an expected mark of 0. Cohen Kappa achieves a very similar result for unbiased guessing strategies.

We now turn to macro-averaging across multiple classifiers or raters. The Area Under the Curve measures are all of this form, whether we are talking about ROC, Kappa, Recall-Precision curves or whatever. The controversy over these averages, and macro-averaging in general, relates to one of two issues: 1. the averages are not in general over the appropriate units or denominators of the individual statistics; or 2. the averages are over a classifier-determined cost function rather than an externally or standardly defined cost function. AUK and H-Measure seek to address these issues as discussed earlier. In fact they both boil down to averaging with an inappropriate distribution of weights.

Commonly, macro-averaging averages across classes, with the statistics derived for each class weighted by the cardinality of the class (viz. prevalence). In our review above we cited four examples, but we will refer only to WEKA (Witten et al., 2005) here, as a commonly used system and associated textbook that employs and advocates macro-averaging. WEKA averages over tpr, fpr, Recall (yes, redundantly), Precision, F-Factor and ROC AUC. Only the average over tpr=Recall is actually meaningful, because only it has the number of members of the class, or its prevalence, as its denominator. Precision needs to be macro-averaged over the number of predictions for each class, in which case it is equivalent to micro-averaging.

Other micro-averaged statistics are also shown, including Kappa (with the expectation determined from ZeroR predicting the majority class, leading to a Cohen-like Kappa).

AUC will be pointwise for classifiers that don't provide any probabilistic information associated with label prediction, and thus don't allow varying a threshold for additional points on the ROC or other threshold curves. In the case where multiple threshold points are available, ROC AUC cannot be interpreted as having any relevance to any particular classifier, but is an average over a range of classifiers. Even then it is not so meaningful as AUCH, which should be used as classifiers on the convex hull are usually available. The AUCH measure will then dominate any individual classifier, as if the convex hull is not the same as the single classifier it must include points that are above the classifier curve, and thus its enclosed area totally includes the area enclosed by the individual classifier.

Macroaveraging of the curve, based on each class in turn as the Positive Class and weighted
by the size of the positive class, is not meaningful, as effectively shown by Powers (2003) for the special case of the single-point curve, given its equivalence to Powers Kappa.

In fact Markedness does admit averaging over classes, whilst Informedness requires averaging over predicted labels, as does Precision. The other Kappas and Correlations are more complex (note the denominators in Eqns 5-9), and how they might be meaningfully macro-averaged is an open question. However, microaveraging can always be done quickly and easily by simply summing all the contingency tables (the true contingency tables are tables of counts, not probabilities, as shown in Table 1). Macroaveraging should never be done except for the special cases of Recall and Markedness, when it is equivalent to the micro-average, which is only slightly more expensive/complicated to do.

Comparison of Kappas

We now turn to explore the different definitions of Kappas, using the same approach employed with Accuracy and F-Factor in Table 1: we will consider 0%, 100%, 15% and -15% informed decisions, with random decisions modelled on the basis of independent Bias and Prevalence. This clearly biases against the Fleiss family of Kappas, which is entirely appropriate. As pointed out by Entwisle & Powers (1998), the practice of deliberately skewing bias to achieve better statistics is to be deprecated; they used the real-life example of a CL researcher choosing to say water was always a noun because it was a noun more often than not. With Cohen or Powers measures, any actual power of the system to determine PoS, however weak, would be reflected in an improvement in the scores versus any random choice, whatever the distribution. Recall that choosing one answer all the time corresponds to the extreme points of the chance line in the ROC curve.

Studies like Fitzgibbon et al. (2007) and Leibbrandt and Powers (2012) show divergences amongst the conventional and debiased measures, but it is tricky to prove which is better.

Kappa in the Limit

It is however straightforward to derive limits for the various Kappas and Expectations under extreme and central conditions of bias and prevalence, including both match and mismatch. The 36 theoretical results match the mixture model results in Table 3; however, due to space constraints, formal treatment will be limited to two of the more complex cases, both of which relate to Fleiss Kappa with its mismatch to the marginal independence assumptions we prefer. These will provide informedness of probability B plus a remaining proportion 1-B of random responses exhibiting extreme bias versus both neutral and contrary prevalence. Note that we consider only |B|<1, as all Kappas give Acc=1 and thus K=1 for B=1, and only Powers Kappa is designed to work for B<0, giving K=-1 for B=-1.

Recall that the general calculation of Expected Accuracy is

    E(Acc) = etp + etn    (11)

For Fleiss Kappa we must calculate the expected values of the correct contingencies as discussed previously, with expected probabilities

    ep = (rp+pp)/2  &  en = (rn+pn)/2    (12)

    etp = ep^2  &  etn = en^2    (13)

We first consider cases where prevalence is extreme and the chance component exhibits inverse bias. We thus consider limits as rp→0, rn→1, pp→1-B, pn→B. This gives us (assuming |B|<1)

    EF(Acc) = (1/4 + B^2/4 + B/2) + (1/4 + B^2/4 - B/2) = (1+B^2)/2    (14)

    KF(Acc) = (1-B)^2/[B^2-2]    (15)

We second consider cases where the prevalence is balanced and chance extreme, with rp→0.5, rn→0.5, pp→1-B, pn→B, giving

    EF(Acc) = 1/2 + (B-1/2)^2/2 = 5/8 + B(B-1)/2    (16)

    KF(Acc) = [(B-1/2) - (B-1/2)^2/2] / [1/2 - (B-1/2)^2/2]
            = [B - (5/8 + B(B-1)/2)] / [1 - (5/8 + B(B-1)/2)]    (17)

Conclusions

The asymmetric Powers Informedness gives the clearest measure of the predictive value of a system, while the Matthews Correlation (as geometric mean with the Powers Markedness dual) is appropriate for comparing equally valid classifications or ratings into an agreed number of classes. Concordance measures should be used if the number of classes is not agreed or specified.

For the mismatch cases (15), Fleiss is always negative for |B|<1, and thus fails to adequately reward good performance under these marginal conditions. For the chance case (17), the first form we provide shows that the deviation from matching Prevalence is a driver in a Kappa-like function. Cohen, on the other hand (Table 3), tends to multiply the weight given to error in even mild prevalence-bias mismatch conditions. None of the symmetric Kappas designed for raters are suitable for classifiers.
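The limiting expectations above are easy to verify numerically. A minimal sketch (the helper name is ours) that plugs the limiting marginals into Eqns 11-13 and compares against the closed forms of Eqns 14 and 16:

```python
def fleiss_expected_acc(rp, rn, pp, pn):
    """Fleiss Expected Accuracy from real prevalences (rp, rn) and
    prediction biases (pp, pn), per Eqns 11-13."""
    ep = (rp + pp) / 2.0   # Eqn 12
    en = (rn + pn) / 2.0
    return ep ** 2 + en ** 2   # Eqn 11 with etp = ep^2, etn = en^2 (Eqn 13)

B = 0.3
# Extreme prevalence with inverse chance bias: rp->0, rn->1, pp->1-B, pn->B
e_mismatch = fleiss_expected_acc(0.0, 1.0, 1 - B, B)
assert abs(e_mismatch - (1 + B ** 2) / 2) < 1e-12          # Eqn 14
# Balanced prevalence, extreme chance bias: rp,rn->1/2, pp->1-B, pn->B
e_central = fleiss_expected_acc(0.5, 0.5, 1 - B, B)
assert abs(e_central - (5 / 8 + B * (B - 1) / 2)) < 1e-12  # Eqn 16
```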
1:1 1:1 4:1 4:1 4:1 1:4 1:1 1:1 4:1 4:1 4:1 1:4 1:1 1:1 4:1 4:1 4:1 1:4
Informedness 0% 0% 0% 0% 0% 0% 0% 0% 0%
Prevalence 50% 80% 80% 50% 80% 80% 50% 20% 20%
Iprevalence 50% 20% 20% 50% 20% 20% 50% 80% 80%
Bias 50% 80% 20% 50% 80% 20% 50% 20% 80%
Ibias 50% 20% 80% 50% 20% 80% 50% 80% 20%
SkewR 100% 25% 25% 100% 25% 25% 100% 400% 400%
SkewP 100% 25% 400% 100% 25% 400% 100% 400% 25%
OddsRatio 100% 100% 6% 100% 100% 6% 100% 100% 1600%
ePowers 50% 68% 32% 50% 68% 32% 50% 68% 32%
eCohen 50% 68% 32% 50% 68% 32% 50% 68% 32%
eFleiss 50% 68% 50% 50% 68% 50% 50% 68% 50%
kPowers 0% 0% 0% 0% 0% 0% 0% 0% 0%
kCohen 0% 0% 0% 0% 0% 0% 0% 0% 0%
kFleiss 0% 0% -36% 0% 0% -36% 0% 0% -36%
Informedness 100% 100% 100% 100% 100% 100% 100% 100% 100%
Prevalence 50% 80% 80% 50% 80% 80% 50% 20% 20%
Iprevalence 50% 20% 20% 50% 20% 20% 50% 80% 80%
Bias 50% 80% 80% 50% 80% 80% 50% 20% 20%
Ibias 50% 20% 20% 50% 20% 20% 50% 80% 80%
SkewR 100% 25% 25% 100% 25% 25% 100% 400% 400%
SkewP 100% 25% 25% 100% 25% 25% 100% 400% 400%
OddsRatio 100% 100% 100% 100% 100% 100% 100% 100% 100%
ePowers 50% 68% 68% 50% 68% 68% 50% 68% 68%
eCohen 50% 68% 68% 50% 68% 68% 50% 68% 68%
eFleiss 50% 68% 68% 50% 68% 68% 50% 68% 68%
kPowers 100% 100% 100% 100% 100% 100% 100% 100% 100%
kCohen 100% 100% 100% 100% 100% 100% 100% 100% 100%
kFleiss 100% 100% 100% 100% 100% 100% 100% 100% 100%
Informedness 15% 15% 15% 99% 99% 99% 99% 99% 99%
Prevalence 50% 80% 80% 50% 80% 80% 50% 20% 20%
Iprevalence 50% 20% 20% 50% 20% 20% 50% 80% 80%
Bias 50% 80% 29% 50% 80% 79% 50% 20% 79%
Ibias 50% 20% 71% 50% 20% 21% 50% 80% 21%
SkewR 100% 25% 25% 100% 25% 25% 100% 400% 400%
SkewP 100% 25% 245% 100% 25% 26% 100% 400% 26%
OddsRatio 100% 100% 6% 100% 100% 6% 100% 100% 1600%
ePowers 50% 68% 32% 50% 68% 32% 50% 68% 32%
eCohen 50% 68% 37% 50% 68% 68% 50% 68% 32%
eFleiss 50% 68% 50% 50% 68% 68% 50% 68% 50%
kPowers 15% 15% 15% 99% 99% 99% 1% 1% 1%
kCohen 15% 15% 8% 99% 99% 98% 1% 1% 0%
kFleiss 15% 15% -17% 99% 99% 98% 1% 1% -35%
Informedness -15% -15% -15% -99% -99% -99% -99% -99% -99%
Prevalence 50% 80% 20% 50% 80% 80% 50% 20% 20%
Iprevalence 50% 20% 80% 50% 20% 20% 50% 80% 80%
Bias 50% 71% 80% 50% 21% 20% 50% 21% 80%
Ibias 50% 29% 20% 50% 79% 80% 50% 79% 20%
SkewR 100% 25% 400% 100% 25% 25% 100% 400% 400%
SkewP 100% 41% 25% 100% 385% 400% 100% 385% 25%
OddsRatio 100% 65% 1038% 100% 25% 25% 100% 104% 1542%
ePowers 50% 63% 37% 50% 50% 50% 50% 68% 32%
eCohen 50% 63% 32% 50% 32% 32% 50% 68% 32%
eFleiss 50% 63% 50% 50% 50% 50% 50% 68% 50%
kPowers -15% -15% -15% -99% -99% -99% -1% -1% -1%
kCohen -15% -13% -7% -99% -47% -47% -1% -1% 0%
kFleiss -15% -14% -46% -99% -99% -99% -1% -1% -37%
Table 3. Empirical Results for Accuracy and Kappa for Fleiss/Scott, Cohen and Powers. Shaded
cells indicate misleading results, which occur for both Cohen and Fleiss Kappas.
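The mismatch cells of Table 3 can be reproduced directly from the contingency probabilities. A sketch (function names are ours) for the uninformed system with Prevalence 80% and Bias 20%, where Powers and Cohen correctly report chance level while Fleiss drops to -36%:

```python
def kappas(tp, fp, fn, tn):
    """Powers (Informedness), Cohen and Fleiss/Scott Kappa from a
    contingency table of probabilities (summing to 1)."""
    acc = tp + tn
    rp, rn = tp + fn, fp + tn      # real prevalences
    pp, pn = tp + fp, fn + tn      # prediction biases
    informedness = tp / rp + tn / rn - 1   # Recall + Inverse Recall - 1
    e_cohen = rp * pp + rn * pn
    ep, en = (rp + pp) / 2, (rn + pn) / 2
    e_fleiss = ep ** 2 + en ** 2
    k = lambda e: (acc - e) / (1 - e)
    return informedness, k(e_cohen), k(e_fleiss)

# Uninformed system, Prevalence 80%, Bias 20% (a mismatch cell):
# tp = 0.8*0.2, fp = 0.2*0.2, fn = 0.8*0.8, tn = 0.2*0.8
b, kc, kf = kappas(0.16, 0.04, 0.64, 0.16)
assert abs(b) < 1e-12           # Powers: chance level
assert abs(kc) < 1e-12          # Cohen: chance level
assert abs(kf + 0.36) < 1e-12   # Fleiss: misleadingly negative (-36%)
```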
References

2nd i2b2 Workshop on Challenges in Natural Language Processing for Clinical Data (2008). http://gnode1.mib.man.ac.uk/awards.html (accessed 4 November 2011).

2nd Pascal Challenge on Hierarchical Text Classification. http://lshtc.iit.demokritos.gr/node/48 (accessed 4 November 2011).

N. Ailon and M. Mohri (2010). Preference-based learning to rank. Machine Learning 80:189-211.

A. Ben-David (2008a). About the relationship between ROC curves and Cohen's kappa. Engineering Applications of AI 21:874-882.

A. Ben-David (2008b). Comparison of classification accuracy using Cohen's Weighted Kappa. Expert Systems with Applications 34:825-832.

Y. Benjamini and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological) 57(1):289-300.

D. G. Bonett and R. M. Price (2005). Inferential Methods for the Tetrachoric Correlation Coefficient. Journal of Educational and Behavioral Statistics 30(2):213-225.

J. Carletta (1996). Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics 22(2):249-254.

N. J. Castellan (1966). On the estimation of the tetrachoric correlation coefficient. Psychometrika 31(1):67-73.

J. Cohen (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960:37-46.

J. Cohen (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin 70:213-220.

B. Di Eugenio and M. Glass (2004). The Kappa Statistic: A Second Look. Computational Linguistics 30(1):95-101.

J. Entwisle and D. M. W. Powers (1998). The Present Use of Statistics in the Evaluation of NLP Parsers. pp. 215-224, NeMLaP3/CoNLL98 Joint Conference, Sydney, January 1998.

S. Fitzgibbon, D. M. W. Powers, K. Pope, and C. R. Clark (2007). Removal of EEG noise and artefact using blind source separation. Journal of Clinical Neurophysiology 24(3):232-243.

P. A. Flach (2003). The Geometry of ROC Space: Understanding Machine Learning Metrics through ROC Isometrics. Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, pp. 226-233.

J. L. Fleiss (1981). Statistical methods for rates and proportions (2nd ed.). New York: Wiley.

A. Fraser and D. Marcu (2007). Measuring Word Alignment Quality for Statistical Machine Translation. Computational Linguistics 33(3):293-303.

J. Fürnkranz and P. A. Flach (2005). ROC 'n' Rule Learning: Towards a Better Understanding of Covering Algorithms. Machine Learning 58(1):39-77.

D. J. Hand (2009). Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning 77:103-123.

T. P. Hutchinson (1993). Focus on Psychometrics. Kappa muddles together two sources of disagreement: tetrachoric correlation is preferable. Research in Nursing & Health 16(4):313-316.

U. Kaymak, A. Ben-David and R. Potharst (2010). AUK: a simple alternative to the AUC. Technical Report, Erasmus Research Institute of Management, Erasmus School of Economics, Rotterdam, NL.

K. Krippendorff (1970). Estimating the reliability, systematic error, and random error of interval data. Educational and Psychological Measurement 30(1):61-70.

K. Krippendorff (1978). Reliability of binary attribute data. Biometrics 34(1):142-144.

J. Lafferty, A. McCallum and F. Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of the 18th International Conference on Machine Learning (ICML-2001), San Francisco, CA: Morgan Kaufmann, pp. 282-289.

R. Leibbrandt and D. M. W. Powers (2012). Robust Induction of Parts-of-Speech in Child-Directed Language by Co-Clustering of Words and Contexts. EACL Joint Workshop of ROBUS (Robust Unsupervised and Semi-supervised Methods in NLP) and UNSUP (Unsupervised Learning in NLP).

P. J. G. Lisboa, A. Vellido and H. Wong (2000). Bias reduction in skewed binary classification with Bayesian neural networks. Neural Networks 13:407-410.

R. Lowry (1999). Concepts and Applications of Inferential Statistics. (Published on the web as http://faculty.vassar.edu/lowry/webtext.html.)

C. D. Manning and H. Schütze (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.

J. H. McDonald (2007). The Handbook of Biological Statistics. (Course handbook web published as http://udel.edu/~mcdonald/statpermissions.html.)

J. C. Nunnally and I. H. Bernstein (1994). Psychometric Theory (3rd ed.). McGraw-Hill.

K. Pearson and D. Heron (1912). On Theories of Association. Journal of the Royal Statistical Society LXXV:579-652.

P. Perruchet and R. Peereman (2004). The exploitation of distributional information in syllable processing. Journal of Neurolinguistics 17:97-119.

L. H. Reeker (2000). Theoretic Constructs and Measurement of Performance and Intelligence in Intelligent Systems. PerMIS 2000. (See http://www.isd.mel.nist.gov/research_areas/research_engineering/PerMIS_Workshop/, accessed 22 December 2007.)

W. A. Scott (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly 19:321-325.

T. Sellke, M. J. Bayarri and J. Berger (2001). Calibration of P-values for testing precise null hypotheses. The American Statistician 55:62-71. (See http://www.stat.duke.edu/%7Eberger/papers.html#p-value, accessed 22 December 2007.)

D. R. Shanks (1995). Is human learning rational? Quarterly Journal of Experimental Psychology 48A:257-279.
User Edits Classification Using Document Revision Histories
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 356-366,
Avignon, France, April 23-27 2012. ©2012 Association for Computational Linguistics
by defining the scope of user edits, extracting a large collection of such user edits from the English Wikipedia, constructing a manually labeled dataset, and setting up a classification baseline. A set of features is designed and integrated into a supervised machine learning framework. It is composed of language model probabilities and string similarity measured over different representations, including part-of-speech tags and named entities. Despite their relative simplicity, the features achieve high classification accuracy when applied to contiguous edit segments.

We go beyond labeled data and exploit large amounts of unlabeled data. First, we demonstrate that the trained classifier generalizes to thousands of examples identified by user comments as specific types of fluency edits. Furthermore, we introduce a new method for extracting features from an evolving set of unlabeled user edits. This method is successfully evaluated as an alternative or supplement to the initial supervised approach.

2 Related Work

The need for user edits classification is implicit in studies of Wikipedia edit histories. For example, Viegas et al. (2004) use revision size as a simplified measure for the change of content, and Kittur et al. (2007) use metadata features to predict user edit conflicts.

Classification becomes an explicit requirement when exploiting edit histories for NLP research. Yamangil and Nelken (2008) use edits as training data for sentence compression. They make the simplifying assumption that all selected edits retain the core meaning. Zanzotto and Pennacchiotti (2010) use edits as training data for textual entailment recognition. In addition to manually labeled edits, they use Wikipedia user comments and a co-training approach to leverage unlabeled edits. Woodsend and Lapata (2011) and Yatskar et al. (2010) use Wikipedia comments to identify relevant edits for learning sentence simplification.

The work by Max and Wisniewski (2010) is closely related to the approach proposed in this paper. They extract a corpus of rewritings, distinguish between weak semantic differences and strong semantic differences, and present a typology of multiple subclasses. Spelling corrections are heuristically identified but the task of automatic classification is deferred. Follow-up work by Dutrey et al. (2011) focuses on automatic paraphrase identification using a rule-based approach and manually annotated examples.

Wikipedia vandalism detection is a user edits classification problem addressed by a yearly competition (since 2010) in conjunction with the CLEF conference (Potthast et al., 2010; Potthast and Holfeld, 2011). State-of-the-art solutions involve supervised machine learning using various content and metadata features. Content features use spelling, grammar, character- and word-level attributes. Many of them are relevant for our approach. Metadata features allow detection by patterns of usage, time and place, which are generally useful for the detection of online malicious activities (West et al., 2010; West and Lee, 2011). We deliberately refrain from using such features.

A wide range of methods and approaches has been applied to the similar tasks of textual entailment and paraphrase recognition; see Androutsopoulos and Malakasiotis (2010) for a comprehensive review. These are all related because paraphrases and bidirectional entailments represent types of fluency edits.

A different line of research uses classifiers to predict sentence-level fluency (Zwarts and Dras, 2008; Chae and Nenkova, 2009). These could be useful for fluency edits detection. Alternatively, user edits could be a potential source of human-produced training data for fluency models.

3 Definition of User Edits Scope

Within our approach we distinguish between edit segments, which represent the comparison (diff) between two document revisions, and user edits, which are the input for classification.

An edit segment is a contiguous sequence of deleted, inserted or equal words. The difference between two document revisions (vi, vj) is represented by a sequence of edit segments E. Each edit segment (δ, w1m) ∈ E is a pair, where δ ∈ {deleted, inserted, equal} and w1m is an m-word substring of vi, vj or both (respectively).

A user edit is a minimal set of sentences overlapping with deleted or inserted segments. Given the two sets of revision sentences (Svi, Svj), let

    sent(δ, w1m) = { s ∈ Svi ∪ Svj | w1m ∩ s ≠ ∅ }    (1)

be the subset of sentences overlapping with a given edit segment, and let

    seg(s) = { (δ, w1m) ∈ E | w1m ∩ s ≠ ∅ }    (2)
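The edit segments described above can be approximated with a standard word-level diff. A sketch using Python's difflib (an assumption on our part: the paper itself uses Myers' O(ND) algorithm via google-diff-match-patch); difflib's replace opcode corresponds to the adjacent deleted and inserted segments the paper calls a replaced segment.

```python
import difflib

def edit_segments(pre, post):
    """Word-level diff of two revision texts as (op, words) segments,
    with op in {'equal', 'deleted', 'inserted'}."""
    a, b = pre.split(), post.split()
    segments = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
        if tag == 'equal':
            segments.append(('equal', a[i1:i2]))
        else:  # 'replace' yields an adjacent deleted + inserted pair
            if i2 > i1:
                segments.append(('deleted', a[i1:i2]))
            if j2 > j1:
                segments.append(('inserted', b[j1:j2]))
    return segments

# Example (1) of Table 1, pre-tokenized by whitespace:
segs = edit_segments(
    "By the mid 1700s , Medzhybizh was the seat of power in Podilia Province .",
    "By the mid 18th century , Medzhybizh was the seat of power in Podilia Province .")
```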
be the subset of edit segments overlapping with a given sentence.

A user edit is a pair (pre ⊆ Svi, post ⊆ Svj) where

    ∀s ∈ pre ∪ post, ∀δ ∈ {deleted, inserted}, ∀w1m :
        (δ, w1m) ∈ seg(s) ⇒ sent(δ, w1m) ⊆ pre ∪ post    (3)

    ∀s ∈ pre ∪ post, ∃δ ∈ {deleted, inserted}, ∃w1m :
        (δ, w1m) ∈ seg(s)    (4)

Table 1 illustrates different types of edit segments and user edits. The term replaced segment refers to adjacent deleted and inserted segments. Example (1) contains a replaced segment because the deleted segment (1700s) is adjacent to the inserted segment (18th century). Example (2) contains an inserted segment (and largest professional), a replaced segment ((est. → established in) and a deleted segment ()). User edits of both examples consist of a single pre sentence and a single post sentence because deleted and inserted segments do not cross any sentence boundary. Example (3) contains a replaced segment (. He → who). In this case the deleted segment (. He) overlaps with two sentences and therefore the user edit consists of two pre sentences.

(1) Revisions 368209202 & 378822230
    pre   (By the mid 1700s, Medzhybizh was the seat of power in Podilia Province.)
    post  (By the mid 18th century, Medzhybizh was the seat of power in Podilia Province.)
    diff  (equal, By the mid), (deleted, 1700s), (inserted, 18th century),
          (equal, , Medzhybizh was the seat of power in Podilia Province.)

(2) Revisions 148109085 & 149440273
    pre   (Original Society of Teachers of the Alexander Technique (est. 1958).)
    post  (Original and largest professional Society of Teachers of the Alexander Technique established in 1958.)
    diff  (equal, Original), (inserted, and largest professional),
          (equal, Society of Teachers of the Alexander Technique), (deleted, (est.),
          (inserted, established in), (equal, 1958), (deleted, )), (equal, .)

(3) Revisions 61406809 & 61746002
    pre   (Fredrik Modin is a Swedish ice hockey left winger. , He is known for having one of the hardest slap shots in the NHL.)
    post  (Fredrik Modin is a Swedish ice hockey left winger who is known for having one of the hardest slap shots in the NHL.)
    diff  (equal, Fredrik Modin is a Swedish ice hockey left winger), (deleted, . He),
          (inserted, who), (equal, is known for having one of the hardest slap shots in the NHL.)

Table 1: Examples of user edits and the corresponding edit segments (revision numbers correspond to the English Wikipedia).

4 Features for Edits Classification

We design a set of features for supervised classification of user edits. The design is guided by two main considerations: simplicity and interoperability. Simplicity is important because there are potentially hundreds of millions of user edits to be classified. This amount continues to grow at a rapid pace and a scalable solution is required. Interoperability is important because millions of user edits are available in multiple languages. Wikipedia is a flagship project, but there are other collaborative editing projects. The solution should preferably be language- and project-independent. Consequently, we refrain from deeper syntactic parsing, Wikipedia-specific features, and language resources that are limited to English.

Our basic intuition is that longer edits are likely to be factual and shorter edits are likely to be fluency edits. The baseline method is therefore character-level edit distance (Levenshtein, 1966) between pre- and post-edited text.

Six feature categories are added to the baseline. Most features take the form of threefold counts referring to deleted, inserted and equal elements of each user edit. For instance, example (1) in Table 1 has one deleted token, two inserted tokens and 14 equal tokens. Many features use string similarity calculated over alternative representations.

Character-level features include counts of deleted, inserted and equal characters of different types, such as word & non-word characters or digits & non-digits. Character types may help identify edit types. For example, the change of digits may suggest a factual edit while the change of non-word characters may suggest a fluency edit.

Word-level features count deleted, inserted and equal words using three parallel representations: original case, lower case, and lemmas. Word-level edit distance is calculated for each representation. Table 2 illustrates how edit distance may vary across different representations.
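The threefold counts described above can be read directly off a user edit's segments. A minimal sketch (function name ours) reproducing the counts quoted for example (1) of Table 1, assuming whitespace tokenization:

```python
from collections import Counter

def threefold_counts(segments):
    """Counts of deleted, inserted and equal tokens over the edit
    segments of one user edit (the basic feature form of Section 4)."""
    counts = Counter()
    for op, words in segments:
        counts[op] += len(words)
    return counts

# Edit segments of example (1) in Table 1:
segments = [
    ('equal', "By the mid".split()),
    ('deleted', "1700s".split()),
    ('inserted', "18th century".split()),
    ('equal', ", Medzhybizh was the seat of power in Podilia Province .".split()),
]
c = threefold_counts(segments)
assert (c['deleted'], c['inserted'], c['equal']) == (1, 2, 14)
```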
Rep.      User Edit                                Dist
Words     pre:  Branch lines were built in Kenya     4
          post: A branch line was built in Kenya
Lowcase   pre:  branch lines were built in kenya     3
          post: a branch line was built in kenya
Lemmas    pre:  branch line be build in Kenya        1
          post: a branch line be build in Kenya
PoS tags  pre:  NN NNS VBD VBN IN NNP                2
          post: DT NN NN VBD VBN IN NNP
NE tags   pre:  LOCATION                             0
          post: LOCATION

Table 2: Word- and tag-level edit distance measured over different representations (example from Wikipedia revisions 2678278 & 2682972).

Fluency edits may shift words, which sometimes may be slightly modified. Fluency edits may also add or remove words that already appear in context. Optimal calculation of edit distance with shifts is computationally expensive (Shapira and Storer, 2002). Translation error rate (TER) provides an approximation but it is designed for the needs of machine translation evaluation (Snover et al., 2006). To have a more sensitive estimation of the degree of edit, we compute the minimal character-level edit distance between every pair of words that belong to different edit segments. For each pair of edit segments (δ, w1m), (δ', w'1k) overlapping with a user edit, if δ ≠ δ' we compute:

    ∀w ∈ w1m :  min over w' ∈ w'1k of EditDist(w, w')    (5)

Binned counts of the number of words with a minimal edit distance of 0, 1, 2, 3 or more characters are accumulated per edit segment type (equal, deleted or inserted).

Part-of-speech (PoS) features include counts of deleted, inserted and equal PoS tags (per tag) and edit distance at the tag level between PoS tags before and after the edit. Similarly, named-entity (NE) features include counts of deleted, inserted and equal NE tags (per tag, excluding OTHER) and edit distance at the tag level between NE tags before and after the edit. Table 2 illustrates the edit distance at different levels of representation. We assume that a deleted NE tag, e.g. PERSON or LOCATION, could indicate a factual edit. It could however be a fluency edit where the NE is replaced by a co-referent like she or it. Even if we encounter an inserted PRP PoS tag, the features do not capture the explicit relation between the deleted NE tag and the inserted PoS tag. This is an inherent weakness of these features when compared to parsing-based alternatives.

An additional set of counts, NE values, describes the number of deleted, inserted and equal normalized values of numeric entities such as numbers and dates. For instance, if the word 100 is replaced by 200 and the respective numeric values 100.0 and 200.0 are normalized, the counts of deleted and inserted NE values will be incremented and suggest a factual edit. If on the other hand 100 is replaced by hundred and the latter is normalized as having the numeric value 100.0, then the count of equal NE values will be incremented, rather suggesting a fluency edit.

Acronym features count deleted, inserted and equal acronyms. Potential acronyms are extracted from word sequences that start with a capital letter and from words that contain multiple capital letters. If, for example, UN is replaced by United Nations, MicroSoft by MS or Jean Pierre by J.P, the count of equal acronyms will be incremented, suggesting a fluency edit.

The last category, language model (LM) features, takes a different approach. These features look at n-gram based sentence probabilities before and after the edit, with and without normalization with respect to sentence lengths. The ratio of the two probabilities, Pratio(pre, post), is computed as follows:

    P(w1m) = prod_{i=1..m} P(wi | w(i-n+1)..(i-1))    (6)

    Pnorm(w1m) = P(w1m)^(1/m)    (7)

    Pratio(pre, post) = Pnorm(post) / Pnorm(pre)    (8)

    log Pratio(pre, post) = log Pnorm(post) - log Pnorm(pre)    (9)
                          = (1/|post|) log P(post) - (1/|pre|) log P(pre)

where P is the sentence probability estimated as a product of n-gram conditional probabilities and Pnorm is the sentence probability normalized by the sentence length. We hypothesize that the relative change of normalized sentence probabilities is related to the edit type. As an additional feature, the number of out-of-vocabulary (OOV) words before and after the edit is computed. The intuition
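Eqns 7-9 can be sketched with log-space arithmetic. The scores below are made-up toy values for illustration, not output of the paper's language model; the sketch only checks that the direct log-space form of Eqn 9 agrees with the ratio of Eqn 8.

```python
import math

def p_norm(log_p, m):
    """Eqn 7: length-normalized sentence probability P(w)^(1/m),
    computed from the sentence log-probability."""
    return math.exp(log_p / m)

def log_p_ratio(log_p_pre, m_pre, log_p_post, m_post):
    """Eqn 9: log of the Eqn 8 ratio, computed directly in log space."""
    return log_p_post / m_post - log_p_pre / m_pre

# Toy sentence log-probabilities and lengths (hypothetical values):
log_p_pre, m_pre = -12.0, 6
log_p_post, m_post = -10.0, 5
ratio = p_norm(log_p_post, m_post) / p_norm(log_p_pre, m_pre)   # Eqn 8
assert abs(math.log(ratio)
           - log_p_ratio(log_p_pre, m_pre, log_p_post, m_post)) < 1e-12
```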
is that unknown words are more likely to be indicative of factual edits.

                               Dataset            Labeled Subset
Number of User Edits:          923,820 (100%)     2,008 (100%)
Edit Segments Distribution:
  Replaced                     535,402 (57.96%)   1,259 (62.70%)
  Inserted                     235,968 (25.54%)     471 (23.46%)
  Deleted                      152,450 (16.5%)      278 (13.84%)
Character-level Edit Distance Distribution:
  1                            202,882 (21.96%)     466 (23.21%)
  2                             81,388 (8.81%)      198 (9.86%)
  3-10                         296,841 (32.13%)     645 (32.12%)
  11-100                       342,709 (37.10%)     699 (34.81%)
Word-level Edit Distance Distribution:
  1                            493,095 (53.38%)   1,008 (54.18%)
  2                            182,770 (19.78%)     402 (20.02%)
  3                             77,603 (8.40%)      161 (8.02%)
  4-10                         170,352 (18.44%)     357 (17.78%)
Labels Distribution:
  Fluency                      -                   1,008 (50.2%)
  Factual                      -                   1,000 (49.8%)

Table 3: Dataset of nearly 1 million user edits with single deleted, inserted or replaced segments, of which 2K are labeled. The labels are almost equally distributed. The distribution over edit segment types and edit distance intervals is detailed.

5 Experiments

5.1 Experimental Setup

First, we extract a large amount of user edits from revision histories of the English Wikipedia.3 The extraction process scans pairs of subsequent revisions of article pages and ignores any revision that was reverted due to vandalism. It parses the Wikitext and filters out markup, hyperlinks, tables and templates. The process analyzes the clean text of the two revisions4 and computes the difference between them.5 The process identifies the overlap between edit segments and sentence boundaries and extracts user edits. Features are calculated and user edits are stored and indexed. LM features are calculated against a large English 4-gram language model built by SRILM (Stolcke, 2002) with modified interpolated Kneser-Ney smoothing, using the AFP and Xinhua portions of the English Gigaword corpus (LDC2003T05).

We extract a total of 4.3 million user edits, of which 2.52 million (almost 60%) are insertions and deletions of complete sentences. Although these may include fluency edits such as sentence reordering or rewriting from scratch, we assume that the large majority is factual. Of the remaining 1.78 million edits, the majority (64.5%) contains single deleted, inserted or replaced segments. We decide to focus on this subset because sentences with multiple non-contiguous edit segments are more likely to contain mixed cases of unrelated factual and fluency edits, as illustrated by example (2) in Table 1. Learning to classify contiguous edit segments seems to be a reasonable way of breaking down the problem into smaller parts. We filter out user edits with edit distance longer than 100 characters or 10 words, which we assume to be factual. The resulting dataset contains 923,820 user edits: 58% replaced segments, 25.5% inserted segments and 16.5% deleted segments.

Manual labeling of user edits is carried out by a group of annotators with near-native or native level of English. All annotators receive the same written guidelines. In short, fluency labels are assigned to edits of letter case, spelling, grammar, synonyms, paraphrases, co-referents, language and style. Factual labels are assigned to edits of dates, numbers and figures, named entities, semantic change or disambiguation, and addition or removal of content. A random set of 2,676 instances is labeled: 2,008 instances with a majority agreement of at least two annotators are selected as training set, 270 instances are held out as development set, and 164 trivial fluency corrections of a single letter's case and 234 instances with no clear agreement among annotators are excluded. The last group (8.7%) emphasizes that the task is, to a limited extent, subjective. It suggests that automated classification of certain user edits would be difficult. Nevertheless, inter-rater agreement between annotators is high to very high. Kappa values between 0.74 and 0.84 are measured between six pairs of annotators; each pair annotated a common subset of at least 100 instances. Table 3 describes the resulting dataset, which we also make available to the research community.6

3 Dump of all pages with complete edit history as of January 15, 2011 (342GB bz2), http://dumps.wikimedia.org.
4 Tokenization, sentence split, PoS & NE tags by Stanford CoreNLP, http://nlp.stanford.edu/software/corenlp.shtml.
5 Myers' O(ND) difference algorithm (Myers, 1986), http://code.google.com/p/google-diff-match-patch.
6 Available for download at http://staff.
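The character-level edit distance baseline can be sketched as follows (helper names are ours); the cut-off of 4 is the one found by the single-node decision tree of Figure 1.

```python
def edit_distance(s, t):
    """Character-level Levenshtein distance (the paper's baseline feature),
    computed with the standard one-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def classify(pre, post, cutoff=4):
    """Single-threshold baseline: short edits -> fluency, long -> factual."""
    return 'fluency' if edit_distance(pre, post) <= cutoff else 'factual'

assert classify("recieve", "receive") == 'fluency'            # spelling fix
assert classify("mid 1700s", "mid 18th century") == 'factual'
```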
Character-level Edit Distance
  <= 4 : Fluency (725), Factual (179)
  > 4  : Factual (821), Fluency (283)

Figure 1: A decision tree that uses character-level edit distance as a sole feature. The tree correctly classifies 76% of the labeled user edits.

Feature set     SVM       RF        Logit
Baseline        76.26%    76.26%    76.34%
+ Char-level    83.71%    84.45%    84.01%
+ Word-level    78.38%    81.38%    78.13%
+ PoS           76.58%    76.97%    78.35%
+ NE            82.71%    83.12%    82.38%
+ Acronyms      76.55%    76.61%    76.96%
+ LM            76.20%    77.42%    76.52%
All Features    87.14%    87.14%    85.64%

Table 4: Classification accuracy using the baseline, each feature set added to the baseline, and all features combined. Statistical significance at p < 0.05 is indicated w.r.t. the baseline (using the same classifier) and w.r.t. another classifier (using the same features). Highest accuracy per classifier is marked in bold.

Feature set     SVM          RF           Logit
                flu. / fac.  flu. / fac.  flu. / fac.
Baseline        0.85 / 0.67  0.74 / 0.79  0.85 / 0.67
+ Char-level    0.85 / 0.82  0.83 / 0.86  0.86 / 0.82
+ Word-level    0.88 / 0.69  0.81 / 0.82  0.86 / 0.70
+ PoS           0.85 / 0.68  0.78 / 0.76  0.84 / 0.72
+ NE            0.86 / 0.79  0.79 / 0.87  0.87 / 0.78
+ Acronyms      0.87 / 0.66  0.83 / 0.70  0.86 / 0.68
+ LM            0.85 / 0.67  0.79 / 0.76  0.84 / 0.69
All Features    0.88 / 0.86  0.86 / 0.88  0.87 / 0.84

Table 5: Fraction of correctly classified edits per type: fluency edits (left) and factual edits (right), using the baseline, each feature set added to the baseline, and all features combined.

5.2 Feature Analysis

We experiment with three classifiers: Support Vector Machines (SVM), Random Forests (RF) and Logistic Regression (Logit).7 SVMs (Cortes and Vapnik, 1995) and Logistic Regression (or Maximum Entropy classifiers) are two widely used machine learning techniques. SVMs have been applied to many text classification problems (Joachims, 1998). Maximum Entropy classifiers have been applied to the similar tasks of paraphrase recognition (Malakasiotis, 2009) and textual entailment (Hickl et al., 2006). Random Forests (Breiman, 2001) as well as other decision tree algorithms are successfully used for classifying Wikipedia edits for the purpose of vandalism detection (Potthast et al., 2010; Potthast and

line. Then each one of the feature groups is separately added to the baseline. Finally, all features are evaluated together. Table 4 reports the percentage of correctly classified edits (classifiers' accuracy), and Table 5 reports the fraction of correctly classified edits per type. All results are for 10-fold cross validation. Statistical significance against the baseline and between classifiers is calculated at p < 0.05 using paired t-test.

The first interesting result is the highly predictive power of the single-feature baseline. It confirms the intuition that longer edits are mainly factual. Figure 1 shows that the edit distance of 72% of the user edits labeled as fluency is between 1 and 4, while the edit distance of 82% of those labeled as factual is greater than 4. The cut-off value is found by a single-node decision tree that uses edit distance as a sole feature. The tree correctly classifies 76% of the instances. This result implies that the actual challenge is to correctly classify short factual edits and long fluency edits.

Character-level features and named-entity features lead to significant improvements over the baseline for all classifiers. Their strength lies in their ability to identify short factual edits such as changes of numeric values or proper names. Word-level features also significantly improve the baseline but their contribution is smaller. PoS and acronym features lead to small statistically-
Holfeld, 2011). insignificant improvements over the baseline.
Experiments begin with the edit-distance base- The poor contribution of LM features is sur-
prising. It might be due to the limited context
science.uva.nl/abronner/uec/data.
7
Using Weka classifiers: SMO (SVM), RandomForest &
of n-grams, but it might be that LM probabili-
Logistic (Hall et al., 2009). Classifiers parameters are tuned ties are not a good predictor for the task. Re-
using the held-out development set. moving LM features from the set of all features
361
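As an aside, the single-feature baseline is easy to reproduce. The sketch below is a minimal illustration, not the authors' Weka setup: a character-level Levenshtein distance plus the cut-off of 4 suggested by Figure 1; the function names are our own.

```python
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming character-level edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def classify_edit(pre: str, post: str, threshold: int = 4) -> str:
    # Single-node decision stump: long edits tend to be factual (Figure 1).
    return "factual" if levenshtein(pre, post) > threshold else "fluency"
```

Short fluency corrections (one or two characters changed) fall below the threshold, while content replacements typically exceed it.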
Fluency Edits Misclassified as Factual
  Equivalent or redundant in context       14
  Paraphrases                              13
  Equivalent numeric patterns               7
  Replacing first name with last name       4
  Acronyms                                  4
  Non-specific adjectives or adverbs        3
  Other                                     5

Factual Edits Misclassified as Fluency
  Short correction of content              35
  Opposites                                 3
  Similar names                             3
  Noise (unfiltered vandalism)              3
  Other                                     6

Correctly Classified Fluency Edits
  Adventure education [makes intentional use of -> intentionally uses] challenging experiences for learning.
  He served as president from October 1, 1985 [and retired -> through his retirement] on June 30, 2002.
  In 1973, he [helped organize -> assisted in organizing] his first ever visit to the West.

Correctly Classified Factual Edits
  Over the course of the next [two years -> five months], the unit completed a series of daring raids.
  Scottish born David Tennant has reportedly said he would like his Doctor to wear a kilt.
  This family joined the strip in [late 1990 -> around March 1991].
…classify because the modification of verb tense in a given context is sometimes factual and sometimes a fluency edit.

These findings agree with the feature analysis. Fluency edit misclassifications are typically longer phrases that carry the same meaning, while factual edit misclassifications are typically single words or short phrases that carry different meaning. The main conclusion is that the classifier should take explicit content and context into account. Putting aside the considerations of simplicity and interoperability, features based on co-reference resolution and paraphrase recognition are likely to improve fluency edit classification, and features from language resources that describe synonymy and antonymy relations are likely to improve factual edit classification. While this conclusion may come as no surprise, it is important to highlight the high classification accuracy that is achieved without such capabilities and resources. Table 7 presents several examples of correct classifications produced by our classifier.

6 Exploiting Unlabeled Data

We extracted a large set of user edits, but our approach has been limited to a restricted number of labeled examples. This section attempts to find out whether the classifier generalizes beyond labeled data and whether unlabeled data could be used to improve classification accuracy.

6.1 Generalizing Beyond Labeled Data

The aim of the next experiment is to test how well the supervised classifier generalizes beyond the labeled test set. The problem is the availability of test data. There is no shared task for user edit classification and no common test set to evaluate against. We resort to Wikipedia user comments. This is a problematic option because it is unreliable. Users may add a comment when submitting an edit, but it is not mandatory. The comment is free text with no predefined structure; it could be meaningful or nonsense. The comment is per revision, and it may refer to one, some, or all edits submitted for a given revision. Nevertheless, we identify several keywords that represent certain types of fluency edits: grammar, spelling, typo, and copyedit. The first three clearly indicate grammar and spelling corrections. The last indicates a correction of format and style, but also of the accuracy of the text; therefore it only represents a bias towards fluency edits.

We extract unlabeled edits whose comment is equal to one of the keywords and construct a test set per keyword. An additional test set consists of randomly selected unlabeled edits with any comment. The five test sets are classified by the SVM classifier trained using the labeled data and the set of all features. To remove any doubt, user comments are not part of any feature of the classifier.

The results in Table 8 show that most unlabeled edits whose comments are grammar, spelling, or typo are indeed classified as fluency edits. The classification of edits whose comment is copyedit is biased towards fluency edits, but as expected the result is less distinct. The classification of the random set is balanced, as expected.

Table 8: Classifying unlabeled data selected by user comments that suggest a fluency edit. The SVM classifier is trained using the labeled data. User comments are not used as features.

Comment      Test Set Size   Classified as Fluency Edits
grammar      1,122           88.9%
spelling     2,893           97.6%
typo         3,382           91.6%
copyedit     3,437           68.4%
Random set   5,000           49.4%

Table 9: User edits replacing the word "first" with another single word: most frequent 5 out of 524.

Replaced by   Frequency   Edit class
second        144         Factual
First         38          Fluency
last          31          Factual
1st           22          Fluency
third         22          Factual

Table 10: Fluency edits replacing the word "He" with a proper noun: most frequent 10 out of 1,381.

Replaced by   Frequency   Replaced by   Frequency
Adams         7           Squidward     6
Joseph        7           Alexander     5
Einstein      6           Davids        5
Galland       6           Haim          5
Lowe          6           Hickes        5
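The keyword-based test set construction of Section 6.1 might look like the following sketch; the edit-record format and the comment normalization (strip and lowercase) are illustrative assumptions, since the paper only requires the comment to equal the keyword.

```python
# Build one test set per comment keyword from unlabeled edits.
# The dict-based edit records and the "comment" field name are hypothetical.
KEYWORDS = ("grammar", "spelling", "typo", "copyedit")

def build_test_sets(unlabeled_edits):
    test_sets = {k: [] for k in KEYWORDS}
    for edit in unlabeled_edits:
        comment = edit.get("comment", "").strip().lower()
        if comment in test_sets:          # comment must equal the keyword
            test_sets[comment].append(edit)
    return test_sets
```

A fifth, randomly sampled set of edits with arbitrary comments would then serve as the balanced control.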
Table 11: Classification accuracy using features from unlabeled data. The first two rows are identical to Table 4. Statistical significance at p < 0.05 is indicated in the original typeset table w.r.t. the baseline, w.r.t. all features excluding features from unlabeled data, and w.r.t. another classifier (using the same features). The best result is marked in bold.

Feature set        SVM       RF        Logit
Baseline           76.26%    76.26%    76.34%
All Features       87.14%    87.14%    85.64%
Unlabeled only     78.11%    83.49%    78.78%
Base + unlabeled   80.86%    85.45%    81.83%
All + unlabeled    87.23%    88.35%    85.92%

6.2 Features from Unlabeled Data

The purpose of the last experiment is to exploit unlabeled data in order to extract additional features for the classifier. The underlying assumption is that reoccurring patterns may indicate whether a user edit is factual or a fluency edit.

We could assume that fluency edits would reoccur across many revisions, while factual edits would only appear in revisions of specific documents. However, this assumption does not necessarily hold. Table 9 gives a simple example of single-word replacements for which the most reoccurring edit is actually factual, and other factual and fluency edits reoccur with similar frequencies.

Finding user edit reoccurrence is not trivial. We could rely on exact matches of surface forms, but this may lead to data sparseness issues. Fluency edits that exchange co-referents and proper nouns, as illustrated by the example in Table 10, may reoccur frequently, but this fact could not be revealed by exact matching of specific proper nouns. On the other hand, using a bag-of-words approach may find too many unrelated edits.

We introduce a two-step method that measures the reoccurrence of edits in unlabeled data using exact and approximate matching over multiple representations. The method provides a set of frequencies that is fed into the classifier and allows for learning subtle patterns of reoccurrence. Staying consistent with our initial design considerations, the method is simple and interoperable. Given a user edit (pre, post), the method does not compare pre with post in any way. It only compares pre with pre-edited sentences of other unlabeled edits and post with post-edited sentences of other unlabeled edits. The first step is to select candidates using a bag-of-words approach. The second step is a comparison of the user edit with each of the candidates while incrementing counts of similarity measures. These account for exact matches between different representations (original and lower case, lemmas, PoS and NE tags) as well as for approximate matches using character- and word-level edit distance between those representations. An additional feature is the number of distinct documents in the candidate set.

We compute the set of features for the labeled dataset based on the unlabeled data. The number of candidates is set to 1,000 per user edit. We re-train the classifiers using five configurations: Baseline and All Features are identical to the first experiment. Unlabeled only uses the new feature set without any other features. Base + Unlabeled adds the new feature set to the baseline. All + Unlabeled uses all available features. All results are for 10-fold cross validation with statistical significance at p < 0.05 by paired t-test; see Table 11.

We find that features extracted from unlabeled data outperform the baseline and lead to statistically significant improvements when added to it. The combination of all features allows Random Forests to achieve the highest statistically significant accuracy level of 88.35%.

7 Conclusions

This work addresses the task of classifying user edits as factual or fluency edits. It adopts a supervised machine learning approach and uses character- and word-level features, part-of-speech tags, named entities, language model probabilities, and a set of features extracted from large amounts of unlabeled data. Our experiments with contiguous user edits extracted from revision histories of the English Wikipedia achieve high classification accuracy and demonstrate generalization to data beyond labeled edits.

Our approach shows that machine learning techniques can successfully distinguish between user edit types, making them a favorable alternative to heuristic solutions. The simple and adaptive nature of our method allows for application to large and evolving sets of user edits.

Acknowledgments. This research was funded in part by the European Commission through the CoSyne project FP7-ICT-4-248531.
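A rough sketch of the two-step reoccurrence measure described in Section 6.2, simplified to surface forms only (the paper also compares lower-cased, lemma, PoS, and NE representations, and uses edit distance for approximate matching); the overlap and proximity thresholds here are invented for illustration.

```python
def bag_of_words(sentence: str) -> set:
    return set(sentence.lower().split())

def select_candidates(sentence, pool, min_overlap=2):
    # Step 1: coarse bag-of-words filter over unlabeled sentences.
    target = bag_of_words(sentence)
    return [s for s in pool if len(target & bag_of_words(s)) >= min_overlap]

def reoccurrence_features(sentence, pool):
    # Step 2: count exact and approximate matches among the candidates;
    # symmetric word-set difference is a crude stand-in for edit distance.
    exact = approximate = 0
    for cand in select_candidates(sentence, pool):
        if cand.lower() == sentence.lower():
            exact += 1
        elif len(bag_of_words(sentence) ^ bag_of_words(cand)) <= 2:
            approximate += 1
    return {"exact": exact, "approximate": approximate}
```

The resulting counts (here two features; the paper uses one per representation and match type) are what gets fed to the classifier.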
References

A. Aji, Y. Wang, E. Agichtein, and E. Gabrilovich. 2010. Using the past to score the present: Extending term weighting models through revision history analysis. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 629-638.

I. Androutsopoulos and P. Malakasiotis. 2010. A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38(1):135-187.

L. Breiman. 2001. Random forests. Machine Learning, 45(1):5-32.

J. Chae and A. Nenkova. 2009. Predicting the fluency of text with shallow structural features: case studies of machine translation and human-written text. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 139-147.

C. Cortes and V. Vapnik. 1995. Support-vector networks. Machine Learning, 20(3):273-297.

C. Dutrey, D. Bernhard, H. Bouamor, and A. Max. 2011. Local modifications and paraphrases in Wikipedia's revision history. Procesamiento del Lenguaje Natural, 46:51-58.

M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10-18.

A. Hickl, J. Williams, J. Bensley, K. Roberts, B. Rink, and Y. Shi. 2006. Recognizing textual entailment with LCC's GROUNDHOG system. In Proceedings of the Second PASCAL Challenges Workshop.

T. Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. Machine Learning: ECML-98, pages 137-142.

A. Kittur, B. Suh, B.A. Pendleton, and E.H. Chi. 2007. He says, she says: Conflict and coordination in Wikipedia. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 453-462.

V.I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707-710.

P. Malakasiotis. 2009. Paraphrase recognition using machine learning to combine similarity measures. In Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, pages 27-35.

A. Max and G. Wisniewski. 2010. Mining naturally-occurring corrections and paraphrases from Wikipedia's revision history. In Proceedings of LREC, pages 3143-3148.

E.W. Myers. 1986. An O(ND) difference algorithm and its variations. Algorithmica, 1(1):251-266.

R. Nelken and E. Yamangil. 2008. Mining Wikipedia's article revision history for training computational linguistics algorithms. In Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, pages 31-36.

S. Nunes, C. Ribeiro, and G. David. 2011. Term weighting based on document revision history. Journal of the American Society for Information Science and Technology, 62(12):2471-2478.

M. Potthast and T. Holfeld. 2011. Overview of the 2nd international competition on Wikipedia vandalism detection. Notebook for PAN at CLEF 2011.

M. Potthast, B. Stein, and T. Holfeld. 2010. Overview of the 1st international competition on Wikipedia vandalism detection. Notebook Papers of CLEF, pages 22-23.

D. Shapira and J. Storer. 2002. Edit distance with move operations. In Combinatorial Pattern Matching, pages 85-98.

M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223-231.

A. Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, volume 2, pages 901-904.

F.B. Viegas, M. Wattenberg, and K. Dave. 2004. Studying cooperation and conflict between authors with history flow visualizations. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 575-582.

A.G. West and I. Lee. 2011. Multilingual vandalism detection using language-independent & ex post facto evidence. Notebook for PAN at CLEF 2011.

A.G. West, S. Kannan, and I. Lee. 2010. Detecting Wikipedia vandalism via spatio-temporal analysis of revision metadata. In Proceedings of the Third European Workshop on System Security, pages 22-28.

K. Woodsend and M. Lapata. 2011. Learning to simplify sentences with quasi-synchronous grammar and integer programming. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 409-420.

E. Yamangil and R. Nelken. 2008. Mining Wikipedia revision histories for improving sentence compression. In Proceedings of ACL-08: HLT, Short Papers, pages 137-140.

M. Yatskar, B. Pang, C. Danescu-Niculescu-Mizil, and L. Lee. 2010. For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 365-368.

F.M. Zanzotto and M. Pennacchiotti. 2010. Expanding textual entailment corpora from Wikipedia using co-training. In Proceedings of the 2nd Workshop on Collaboratively Constructed Semantic Resources, COLING 2010.

S. Zwarts and M. Dras. 2008. Choosing the right translation: A syntactically informed classification approach. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, pages 1153-1160.
User Participation Prediction in Online Forums

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 367-376,
Avignon, France, April 23-27 2012. (c) 2012 Association for Computational Linguistics
and latent topics, and linearly interpolate their results.

User modeling: We model users' participation inside threads as latent user groups. Each latent group is a multinomial distribution over users. LDA is then used to infer the group mixture inside each thread, based on which the probability of a user's participation can be derived.

Hybrid system: Since content-based and user-based methods rely on different information sources, we combine their results for further improvement.

We have evaluated our proposed method using three data sets collected from three representative forums. Our experimental results show that in all forums, by using latent topic information, the system can achieve better accuracy in predicting threads for recommendation. In addition, by modeling latent user groups in thread participation, further improvement is achieved in the hybrid system. Our analysis also showed that each forum has its own nature, resulting in different optimal parameters in the different forums.

2 Related Work

Recommendation systems can help make the information retrieval process more intelligent. Generally, recommendation methods are categorized into two types (Adomavicius and Tuzhilin, 2005): content-based filtering and collaborative filtering.

Systems using content-based filtering use the content information of items a user is interested in to recommend new items to the user. For example, in a news recommendation system, in order to recommend appropriate news articles to a user, the system finds the most prominent features (e.g., keywords, tags, categories) in the documents that the user likes, then suggests similar articles based on this personal profile. In Fab's system (Balabanovic and Shoham, 1997) and the Syskill & Webert system (Pazzani et al., 1997), documents are represented using a set of the most important words according to a weighting measure. The most popular measure of word importance is TF-IDF (term frequency, inverse document frequency) (Salton and Buckley, 1988), which weights words according to their informativeness. Then, based on this personal profile, a ranking machine is applied to produce a ranked recommendation list. In Fab's system, the Rocchio algorithm (Rocchio, 1971) is used to learn the average TF-IDF vector of highly rated documents. The Syskill & Webert system uses Naive Bayes classifiers to give the probability of documents being liked. The Winnow algorithm (Littlestone, 1988), which is similar to the perceptron algorithm, has been shown to perform well when there are many features. An adaptive framework using forum comments for news recommendation is introduced in (Li et al., 2010). In (Wu et al., 2010), a topic-specific topic flow model is introduced to rank the likelihood of a user participating in a thread in online forums.

Collaborative-filtering based systems, unlike content-based systems, predict the recommended items using co-occurrence information between users. For example, in a news recommendation system, in order to recommend an article to user c, the system tries to find users with a similar taste to c. Items favored by similar users would be recommended. Grundy (Rich, 1979) is known as one of the first collaborative-filtering based systems. Collaborative filtering systems can be either model based or memory based (Breese et al., 1998). Memory-based algorithms, such as (Delgado and Ishii, 1999; Nakamura and Abe, 1998; Shardanand and Maes, 1995), use a utility function to measure the similarity between users. Recommendation of an item is then made according to the sum of the utility values of the active users that participate in it. Model-based algorithms, on the other hand, try to statistically formulate the probability of an item being liked using active user information. (Ungar et al., 1998) clustered similar users into groups for recommendation. Different clustering methods have been experimented with, including K-means and Gibbs sampling. Other probabilistic models have also been used to model collaborative relationships, including a Bayesian model (Chien and George, 1999), a linear regression model (Sarwar et al., 2001), and Gaussian mixture models (Hofmann, 2003; Hofmann, 2004). In (Blei et al., 2001) a collaborative filtering application of LDA is discussed. However, in this model, re-estimation of parameters for the whole system is needed when a new item comes in. In
this paper, we formulate users' participation differently using the LDA mixture model.

Some previous work has also evaluated hybrid models with both content and collaborative features, and showed outstanding performance. For example, in (Basu et al., 1998), hybrid features are used to make recommendations using inductive learning.

3 Forum Data

We have collected data from three forums in this study.1 The Ubuntu community forum is a technical support forum; the World of Warcraft (WoW) forum is about gaming; the Fitness forum is about how to live a healthy life. These three forums are quite representative of online forums on the internet. Using three different types of forums for task evaluation helps to demonstrate the robustness of our proposed method. In addition, it can show how the same method can have substantially different performance on forums of a different nature.

1 Please contact the authors to obtain the data.

Users' behaviors in these three forums are very different. Casual forums like WoW gaming have many more posts in each thread; however, their posts are the shortest in length. This is because discussions inside these types of forums are more like casual conversation, there is not much requirement on the users' background, and thus there is more user participation. In contrast, technical forums like Ubuntu have fewer posts per thread on average, and have the longest post length. This is because a Question and Answer (QA) forum tends to be very goal oriented. If a user finds a thread unrelated, there is no motivation for participation.

Inside forums, different boards are created to categorize the topics allowed for discussion. From the data we find that users tend to participate in a few selected boards of their choice. To create a data set for user interest prediction in this study, we pick the most popular boards in each forum. Even within the same board, users tend to participate in different threads based on their interests. We use a user's participation information as an indication of whether a thread is interesting to the user or not. Hence, our task is to predict user participation in forum threads. Note that this approach could introduce some bias toward negative instances in terms of user interests. A user's absence from a thread does not necessarily mean the user is not interested in that thread; it may be a result of the user being offline at that time, or of the thread being too many pages behind. As a matter of fact, we found that most users read only the threads on the first page during their visit to a forum. This makes participation prediction an even harder task than interest prediction.

In online forums, threads are ordered by the time stamp of their last participating post. Provided with the time stamp for each post, we can calculate the position of a thread on its board during a user's participation. Figure 1 shows the distribution of post location during users' participation. We found that most of the users read only the posts on the first page. In order to minimize the false negative instances in the data set, we applied thread location filtering. That is, we want to filter out threads that may actually interest the user but do not have the user's participation because they are not on the first page. For any user, only those threads appearing in the first 10 entries on a page during the user's visit are included in the data set.

[Figure 1: Thread position during users' participation.]

In the pre-processing step of the experiment, we first use the online status filtering discussed above to remove threads that a user does not see while offline. The statistics of the boards we have used in each forum are shown in Table 1. The statistics are consistent with the full forum statistics. For example, users in technical forums tend to post less than in casual forums. We define active users as those who have participated in 10 or more threads. Column Part. @300 shows the average number
of threads the top 300 users have participated in. Filt. Threads @300 shows the average number of threads after using online filtering with a window of 10. Thread participation in the Ubuntu forum is very sparse for each user, having only 10.01% participating threads per user after filtering. The Fitness and WoW forums have denser participation, at 18.97% and 13.86% respectively.

…that normalization by document length yielded good empirical results in approximating a well-calibrated posterior probability for the Naive Bayes classifier. The normalized Naive Bayes classifier they used is as follows:

    P(C_i | f_1..k) = (1/Z) P(C_i) prod_j P(f_j | C_i)^(1/|f|)    (2)
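Equation (2) can be sketched directly; the toy priors and per-feature likelihoods below are made-up numbers, and Z is computed by summing the unnormalized scores over classes.

```python
import math

def normalized_nb_score(prior, likelihoods):
    # Unnormalized length-normalized Naive Bayes score from Eq. (2):
    # prior * (product of feature likelihoods) ** (1 / number of features),
    # computed in log space for numerical stability.
    n = len(likelihoods)
    log_score = math.log(prior) + sum(math.log(p) for p in likelihoods) / n
    return math.exp(log_score)

def classify(likelihoods_by_class, priors):
    scores = {c: normalized_nb_score(priors[c], likelihoods_by_class[c])
              for c in priors}
    z = sum(scores.values())                      # the Z in Eq. (2)
    return {c: s / z for c, s in scores.items()}  # calibrated posteriors
```

The 1/|f| exponent keeps long documents from driving the posterior to 0 or 1, which is the calibration problem the normalization addresses.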
Table 1: Statistics of the boards used in each forum.

Forum Name   Threads   Posts       Active Users   Part. @300   Filt. Threads @300
Ubuntu       185,747   940,230     1,700          464.72       4641.25
Fitness      27,250    529,201     2,808          613.15       3231.04
Wow Gaming   34,187    1,639,720   19,173         313.77       2264.46
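The thread-location filtering described in Section 3 (keeping only threads that appeared among the first 10 board entries at the time of a user's visit) could be sketched as follows; the (thread, position) record format is a hypothetical representation of the crawled data.

```python
def location_filter(visited, window=10):
    # visited: list of (thread_id, position_on_board_at_visit_time) pairs.
    # Threads beyond the window are dropped to avoid false negatives:
    # the user likely never saw them, so non-participation is uninformative.
    return [thread for thread, position in visited if position <= window]
```

Applying this filter per user yields the Filt. Threads @300 counts reported in Table 1.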
In the equation, phi_k is the multinomial distribution of users in group k, T is the number of latent user groups, and theta_j is the group composition in thread j after inference using the training data. In general, the probability of user u_i appearing in thread j is proportional to the membership probabilities of this user in the groups that compose the participating users.

Sigmoid rescore: In a ranked list, usually items on the top and bottom of the list have
higher confidence than those in the middle. That is to say, more emphasis should be put on both ends of the list. Hence we use a sigmoid function on Score_linear to capture this:

    Score_sig = 1 / (1 + e^(-l (Score_linear - 0.5)))    (8)

A sigmoid function is relatively flat on both ends while being steep in the middle. In the equation, l is a tuning parameter that decides how flat the scores at both ends of the list are going to be. Determining the best value for l is not a trivial problem. Here we empirically assign l = 10.

5 Experiment and Evaluation

In this section, we evaluate our approach empirically on the three forum data sets described in Section 3. We pick the top 300 most active users from each forum for the evaluation. Among the 300 users, 100 are randomly selected as the development set for parameter tuning, while the rest is the test set. All the data sets are filtered using an online filter as previously described, with a window size of 10 threads.

Threads are tokenized into words and filtered using a simple English stop word list. All words are then ordered by their occurrences multiplied by their inverse document frequencies (IDF):

    idf_w = log( |D| / |{d : w in d}| )    (9)

The top 4,000 words from this list are then used to form the vocabulary.

We used standard mean average precision (MAP) as the evaluation metric. This standard information retrieval evaluation metric measures the quality of the ranked lists returned by a system. Entries higher in the ranking are more accurate than lower ones. For an interesting-thread recommendation system, it is preferable to provide a short, high-quality list of recommendations; therefore, instead of reporting full-range MAP, we report MAP on the top 10 relevant threads (MAP@10). The reason we picked 10 as the number of relevant documents for MAP evaluation is that users might not have time to read too many posts, even if they are relevant.

During evaluation, 3-fold cross-validation is performed for each user in the test set. In each fold, the MAP@10 score is calculated from the ranked list generated by the system. Then the average over all folds and all users is computed as the final result.

To make a proper evaluation configuration, for each user, only posts up to the first participation of the testing user are used for the test set.

5.1 Content-based Results

Here we evaluate the performance of interesting-thread prediction using only features from text. First we use the ranking model with latent topic information only on the development set to determine an optimal number of topics. Empirically, we set the two LDA hyperparameters to 0.1 and 1/K (K is the number of topics). We use the performance of content-based recommendation directly to determine the optimal topic number K. We varied the latent topic number K from 10 to 100, and found that the best performance was achieved using 30 topics in all three forums. Hence we use K = 30 for content-based recommendation unless otherwise specified.

Next, we show how topic information can help content-based recommendation achieve better results. We tune the interpolation parameter described in Section 4.1.2 and show the corresponding performance. We compare the performance using the Naive Bayes classifier before and after normalization. The MAP@10 results on the test set are shown in Figure 3 for the three forums. When the interpolation parameter is 0, no latent topic information is used; when it is 1, latent topics are used without any word features.

When using the Naive Bayes classifier without normalization, we find a relatively larger performance gain from adding topic information for parameter values close to 0. This phenomenon is probably because of the poor posterior probabilities of the Naive Bayes classifier, which are close to either 1 or 0.

For the normalized Naive Bayes classifier, interpolating with latent-topic-based ranking yields a performance improvement over word-based results consistently for the three forums. In the Wow Gaming corpus, the optimal performance is achieved with a relatively high value (at around 0.5), and it is even higher for the Fitness forum.
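MAP@10 as used above can be computed as follows; conventions for the average-precision denominator differ across toolkits, so this sketch (dividing by min(|relevant|, 10)) is one common choice rather than necessarily the authors' exact formulation.

```python
def average_precision_at_k(ranked, relevant, k=10):
    # Sum precision at each rank where a relevant item appears, within top k.
    hits, total = 0, 0.0
    for i, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0

def map_at_k(runs, k=10):
    # runs: list of (ranked_list, relevant_set) pairs, one per user/fold.
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)
```

Averaging over all users and all three cross-validation folds, as described above, gives the reported MAP@10 figure.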
This means that the system relies more on the latent topic information. This is because in these forums, casual conversation contains more irregular words, causing a more severe data sparsity problem than in the others.

Between the two Naive Bayes classifiers, we can see that using normalized probabilities outperforms the original one in the WoW Gaming and Ubuntu forums. This observation is consistent with previous work (e.g., (Pavlov et al., 2004)). However, we found that in the Fitness forum the performance degrades with normalization. Further work is still needed to understand why this is the case.

5.2 Latent User Group Classification

In this section, collaborative filtering using latent user groups is evaluated. First, participating users from the training set are used to estimate an LDA model. Then, the users participating in a thread are used to infer the topic distribution of the thread. Candidate threads are then sorted by the probability of the target user's participation according to Equation 4. Note that all the users in the forum are used to estimate the latent user groups, but only the top 300 active users are used in evaluation. Here, we vary the number of latent user groups G from 5 to 100. The hyperparameters were set empirically to 1/G and 0.1.

Figure 4 shows the MAP@10 results using different numbers of latent groups for the three forums. We compare the performance using latent groups with a baseline using SVM ranking. In the baseline system, users' participation in a thread is used as a binary feature. LibSVM with a radial basis function (RBF) kernel is used to estimate the probability of a user's participation.

[Figure 5: Position of items with different #users and #words in a ranked list (red = 0, higher on the ranked list; green, lower); axes are #words vs. #users.]

…may be interested in a larger variety of topics, and thus the user distribution over different topics is not very distinct. In contrast, people in the gaming forum are more specific about the topics they are interested in.

It is known that LDA tends to perform poorly when there are too few words/users. To get a general idea of how much user participation is enough for decent prediction, we show a graph (Figure 5) depicting the relationships among the number of users, the number of words, and the position of the positive instances in the ranked lists. In this graph, every dot is a positive thread instance in the WoW Gaming forum. Red color shows that the positive thread is indeed getting higher ranks than others. We observe that threads with around 16 participants can already achieve decent performance.
From the results, we find that ranking using la-
5.3 Hybrid System Performance
tent groups information outperforms the baseline
in almost all non-trivial cases. In the case of In this section, we evaluate the performance of the
Ubuntu forum, the performance gain is less com- hybrid system output. Parameters used in each fo-
pared to other forums. We believe this is because rum data set are the optimal parameters found in
in this technical support forum, the average user the previous sections. Here we show the effect of
participation in threads is much less, thus making the tuning parameter (described in Section 4.3).
it hard to infer a reliable group distribution in a Also, we compare three different scoring schemes
thread. In addition, the optimal number of user used to generate the final ranked list. Performance
groups differs greatly between Fitness forum and of the hybrid system is shown in Table 3.
Wow Gaming forum. We conjecture the reason We can see that the combination of the two sys-
behind this is that in the Fitness forum, users tems always outperforms any one model alone.
374
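The MAP@10 metric used throughout this section can be sketched in a few lines. This is a minimal illustration assuming one common definition of average precision truncated at rank 10, not the authors' exact evaluation code:

```python
def average_precision_at_k(ranked, relevant, k=10):
    """Average precision over the top-k entries of one ranked list.

    ranked: thread ids in ranked order; relevant: set of threads the
    target user actually participated in.
    """
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank          # precision at this relevant rank
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

def map_at_10(runs):
    """Mean of AP@10 over all (ranked list, relevant set) test cases."""
    return sum(average_precision_at_k(r, rel) for r, rel in runs) / len(runs)
```

With this definition, a perfect ranking of the relevant threads yields a MAP@10 of 1.0, and placing relevant threads lower in the list lowers the score.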
[Figure: MAP@10 as a function of γ (Gamma) for the Ubuntu Forum, Wow Gaming, and Fitness Forum, comparing Naive Bayes against Normalized NB.]

[Figure 4: MAP@10 as a function of the number of latent groups for the three forums.]
Fab: Content-based, collaborative recommendation. Communications of the ACM, 40:66-72.

Chumki Basu, Haym Hirsh, and William Cohen. 1998. Recommendation as classification: Using social and content-based information in recommendation. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 714-720. AAAI Press.

Paul N. Bennett. 2000. Assessing the calibration of naive bayes posterior estimates.

David Blei, Andrew Y. Ng, and Michael I. Jordan. 2001. Latent dirichlet allocation. Journal of Machine Learning Research, 3:2003.

John S. Breese, David Heckerman, and Carl Kadie. 1998. Empirical analysis of predictive algorithms for collaborative filtering. Pages 43-52. Morgan Kaufmann.

Y. H. Chien and E. I. George. 1999. A bayesian model for collaborative filtering. Number 1.

Joaquin Delgado and Naohiro Ishii. 1999. Memory-based weighted-majority prediction for recommender systems.

Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1):5228-5235, April.

Thomas Hofmann. 2003. Collaborative filtering via gaussian probabilistic latent semantic analysis. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '03, pages 259-266, New York, NY, USA. ACM.

Thomas Hofmann. 2004. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22(1):89-115.

Qing Li, Jia Wang, Yuanzhu Peter Chen, and Zhangxi Lin. 2010. User comments for news recommendation in forum-based social media. Inf. Sci., 180:4929-4939, December.

Nick Littlestone. 1988. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. In Machine Learning, pages 285-318.

Atsuyoshi Nakamura and Naoki Abe. 1998. Collaborative filtering using weighted majority prediction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML '98, pages 395-403, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Dmitry Pavlov, Ramnath Balasubramanyan, Byron Dom, Shyam Kapur, and Jignashu Parikh. 2004. Document preprocessing for naive bayes classification and clustering with mixture of multinomials. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '04, pages 829-834, New York, NY, USA. ACM.

Michael Pazzani, Daniel Billsus, S. Michalski, and Janusz Wnek. 1997. Learning and revising user profiles: The identification of interesting web sites. In Machine Learning, pages 313-331.

Elaine Rich. 1979. User modeling via stereotypes. Cognitive Science, 3(4):329-354.

J. Rocchio. 1971. Relevance Feedback in Information Retrieval.

Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513-523.

Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In WWW '01: Proceedings of the 10th international conference on World Wide Web, pages 285-295, New York, NY, USA. ACM.

Upendra Shardanand and Pattie Maes. 1995. Social information filtering: Algorithms for automating word of mouth. In CHI, pages 210-217.

Lyle Ungar, Dean Foster, Ellen Andre, Star Wars, Fred Star Wars, Dean Star Wars, and Jason Hiver Whispers. 1998. Clustering methods for collaborative filtering. AAAI Press.

Hao Wu, Jiajun Bu, Chun Chen, Can Wang, Guang Qiu, Lijun Zhang, and Jianfeng Shen. 2010. Modeling dynamic multi-topic discussions in online forums. In AAAI.
Inferring Selectional Preferences from Part-Of-Speech N-grams
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 377-386, Avignon, France, April 23-27 2012.
© 2012 Association for Computational Linguistics
disambiguation task than the best previous model (Erk et al., 2010), but lower coverage.

The paper is organized as follows. Section 2 describes the relations for which we compute selectional preferences. Section 3 describes PONG. Section 4 evaluates PONG. Section 5 relates PONG to prior work. Section 6 concludes.

2 Relations Used

Selectional preferences characterize constraints on the arguments of predicates. Selectional preferences for semantic roles (such as agent and patient) are generally more informative than for grammatical dependencies (such as subject and object). For example, consider these semantically equivalent but grammatically distinct sentences:

    Pat opened the door.
    The door was opened by Pat.

In both sentences the agent of opened, namely Pat, must be capable of opening something, an informative constraint on Pat. In contrast, knowing that the grammatical subject of opened is Pat in the first sentence and the door in the second sentence tells us only that they are nouns.

Despite this limitation, selectional preferences for grammatical dependencies are still useful, for a number of reasons. First, in practice they approximate semantic role labels. For instance, typically the grammatical subject of opened is its agent. Second, grammatical dependencies can be extracted by parsers, which tend to be more accurate than current semantic role labelers. Third, the number of different grammatical dependencies is large enough to capture diverse relations, but not so large as to have sparse data for individual relations. Thus in this paper, we use grammatical dependencies as relations.

A parse tree determines the basic grammatical dependencies between the words in a sentence. For instance, in the parse of Pat opened the door, the verb opened has Pat as its subject and door as its object, and door has the as its determiner. Besides these basic dependencies, we use two additional types of dependencies.

Composing two basic dependencies yields a collapsed dependency (de Marneffe and Manning, 2008). For example, consider this sentence:

    The airplane flies in the sky.

Here sky is the prepositional object of in, which is the head of a prepositional phrase attached to flies. Composing these two dependencies yields the collapsed dependency prep_in between flies and sky, which captures an important semantic relation between these two content words: sky is the location where flies occurs. Other function words yield different collapsed dependencies. For example, consider these two sentences:

    The airplane flies over the ocean.
    The airplane flies and lands.

Collapsed dependencies for the first sentence include prep_over between flies and ocean, which characterizes their relative vertical position, and conj_and between flies and lands, which links two actions that an airplane can perform. As these examples illustrate, collapsing dependencies involving prepositions and conjunctions can yield informative dependencies between content words.

Besides collapsed dependencies, PONG infers inverse dependencies. Inverse selectional preferences are selectional preferences of arguments for their predicates, such as a preference of a subject or object for its verb. They capture semantic regularities such as the set of verbs that an agent can perform, which tend to outnumber the possible agents for a verb (Erk et al., 2010).

3 Method

To compute selectional preferences, PONG combines information from a limited corpus labeled with the grammatical dependencies described in Section 2, and a much larger unlabeled corpus. The key idea is to abstract word sequences labeled with grammatical relations into POS N-grams, in order to learn a mapping from POS N-grams to those relations. For instance, PONG abstracts the parsed sentence Pat opened the door as NN VB DT NN, with the first and last NN as the subject and object of the VB. To estimate the distribution of POS N-grams containing particular target and relative words, PONG POS-tags Google N-grams (Franz and Brants, 2006).

Section 3.1 derives PONG's probabilistic model for combining information from labeled and unlabeled corpora. Sections 3.2 and 3.3 describe how PONG estimates probabilities from each corpus. Section 3.4 discusses a sparseness problem revealed during probability estimation, and how we address it in PONG.

3.1 Probabilistic model

We quantify the selectional preference for a relative r to instantiate a relation R of a target t as the probability Pr(r | t, R), estimated as follows. By the definition of conditional probability:
    Pr(r | t, R) = Pr(r, t, R) / Pr(t, R)

We care only about the relative probability of different r for fixed t and R, so we rewrite it as:

    Pr(r, t, R)

We use the chain rule:

    Pr(r, t, R) = Pr(R | r, t) · Pr(r | t) · Pr(t)

and notice that t is held constant:

    Pr(R | r, t) · Pr(r | t)

We estimate the second factor as follows:

    Pr(r | t) = Pr(t, r) / Pr(t) ≈ freq(t, r) / freq(t)

We calculate the denominator freq(t) as the number of N-grams in the Google N-gram corpus that contain t, and the numerator freq(t, r) as the number of N-grams containing both t and r.

To estimate the factor Pr(R | r, t) directly from a corpus of text labeled with grammatical relations, it would be trivial to count how often a word r bears relation R to target word t. However, the results would be limited to the words in the corpus, and many relation frequencies would be estimated sparsely or missing altogether; t or r might not even occur.

Instead, we abstract each word in the corpus as its part-of-speech (POS) label. Thus we abstract The big boy ate meat as DT JJ NN VB NN. We call this sequence of POS tags a POS N-gram. We use POS N-grams to predict word relations. For instance, we predict that in any word sequence with this POS N-gram, the JJ will modify (amod) the first NN, and the second NN will be the direct object (dobj) of the VB.

This prediction is not 100% reliable. For example, the initial 5-gram of The big boy ate meat pie has the same POS 5-gram as before. However, the dobj of its VB (ate) is not the second NN (meat), but the subsequent NN (pie). Thus POS N-grams predict word relations only in a probabilistic sense.

To transform Pr(R | r, t) into a form we can estimate, we first apply the definition of conditional probability:

    Pr(R | t, r) = Pr(R, t, r) / Pr(t, r)

To estimate the numerator Pr(R, t, r), we first marginalize over the POS N-gram p:

    Σ_p Pr(R, t, r, p) / Pr(t, r)

We expand the numerator using the chain rule:

    Σ_p Pr(R | t, r, p) · Pr(p | t, r) · Pr(t, r) / Pr(t, r)

Cancelling the common factor Pr(t, r) yields:

    Σ_p Pr(R | p, t, r) · Pr(p | t, r)

We approximate the first term Pr(R | p, t, r) as Pr(R | p), based on the simplifying assumption that R is conditionally independent of t and r, given p. In other words, we assume that given a POS N-gram, the target and relative words t and r give no additional information about the probability of a relation. However, their respective positions i and j in the POS N-gram p do matter, so we condition the probability on them:

    Pr(R | p, t, r) ≈ Pr(R | p, i, j)

Summing over their possible positions, we get:

    Pr(R | r, t) ≈ Σ_p Σ_i Σ_j Pr(R | p, i, j) · Pr(p | t = g_i, r = g_j)

As Figure 1 shows, we estimate Pr(R | p, i, j) by abstracting the labeled corpus into POS N-grams. We estimate Pr(p | t = g_i, r = g_j) based on the frequency of partially lexicalized POS N-grams like DT JJ:red NN:hat VB NN among Google N-grams with t and r in the specified positions. Sections 3.2 and 3.3 describe how we estimate Pr(R | p, i, j) and Pr(p | t = g_i, r = g_j), respectively.

Note that PONG estimates relative rather than absolute probabilities. Therefore it cannot (and does not) compare them against a fixed threshold to make decisions about selectional preferences.

3.2 Mapping POS N-grams to relations

To estimate Pr(R | p, i, j), we use the Penn Treebank Wall Street Journal (WSJ) corpus, which is labeled with grammatical relations using the Stanford dependency parser (Klein and Manning, 2003).

To estimate the probability Pr(R | p, i, j) of a relation R between a target at position i and a relative at position j in a POS N-gram p, we compute what fraction of the word N-grams g with POS N-gram p have relation R between some target t and relative r at positions i and j:

    Pr(R | p, i, j) = freq(g s.t. POS(g) = p ∧ relation(g_i, g_j) = R)
                      / freq(g s.t. POS(g) = p ∧ relation(g_i, g_j))

3.3 Estimating POS N-gram distributions

Given a target and relative, we need to estimate their distribution of POS N-grams and positions.
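The combination formula in this derivation can be turned into a small scoring sketch. The probability tables below are hypothetical stand-ins for the quantities PONG estimates from the labeled WSJ corpus and the Google N-grams; only the way they are combined follows the model:

```python
def score_relative(t, r, R, pr_rel_given_pos, pr_pos_given_words,
                   freq_pair, freq_word):
    """Unnormalized preference score: Pr(r | t, R) ∝ Pr(R | r, t) · Pr(r | t).

    pr_rel_given_pos[(p, i, j)][R]        ≈ Pr(R | p, i, j)  (labeled corpus)
    pr_pos_given_words[(t, r)][(p, i, j)] ≈ Pr(p | t=g_i, r=g_j)  (N-grams)
    freq_pair / freq_word: N-gram counts for the pair (t, r) and for t alone.
    """
    # Pr(R | r, t) ≈ sum over POS N-grams p and positions i, j
    pr_R = sum(pr_rel_given_pos.get(pij, {}).get(R, 0.0) * w
               for pij, w in pr_pos_given_words.get((t, r), {}).items())
    # Pr(r | t) ≈ freq(t, r) / freq(t)
    pr_r = freq_pair.get((t, r), 0) / freq_word.get(t, 1)
    return pr_R * pr_r
```

Because the score is only relative, candidates for a fixed target and relation would be compared against each other rather than against a threshold, exactly as the text notes.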
Figure 1: Overview of PONG.
From the labeled corpus, PONG extracts abstract mappings from POS N-grams to relations.
From the unlabeled corpus, PONG estimates POS N-gram probability given a target and relative.
A labeled corpus is too sparse for this purpose, so we use the much larger unlabeled Google N-grams corpus (Franz and Brants, 2006).

The probability that an N-gram with target t at position i and relative r at position j will have the POS N-gram p is:

    Pr(p | t = g_i, r = g_j) = freq(g s.t. POS(g) = p, g_i = t, g_j = r)
                               / freq(g s.t. g_i = t, g_j = r)

To compute this ratio, we first use a well-indexed table to efficiently retrieve all N-grams with words t and r at the specified positions. We then obtain their POS N-grams from the Stanford POS tagger (Toutanova et al., 2003), and count how many of them have the POS N-gram p.

3.4 Reducing POS N-gram sparseness

We abstract word N-grams into POS N-grams to address the sparseness of the labeled corpus, but even the POS N-grams can be sparse. For n = 5, the rarer ones occur too sparsely (if at all) in our labeled corpus to estimate their frequency.

To address this issue, we use a coarser POS tag set than the Penn Treebank POS tag set. As Table 2 shows, we merge the tags for adjectives, nouns, adverbs, and verbs into four coarser tags.

    Coarse    Original
    ADJ       JJ, JJR, JJS
    ADVERB    RB, RBR, RBS
    NOUN      NN, NNS, NNP, NNPS
    VERB      VB, VBD, VBG, VBN, VBP, VBZ

    Table 2: Coarser POS tag set used in PONG

To gauge the impact of the coarser POS tags, we calculated Pr(r | t, R) for 76 test instances used in an earlier unpublished study by Liu Liu, a former Project LISTEN graduate student. Each instance consists of two randomly chosen words in the WSJ corpus labeled with a grammatical relation. Coarse POS tags increased the coverage of this pilot set, that is, the fraction of instances for which PONG computes a probability, from 69% to 92%.

Using the universal tag set (Petrov et al., 2011) as an even coarser tag set is an interesting future direction, especially for other languages. Its smaller size (12 tags vs. our 23) should reduce data sparseness, but increase the risk of over-generalization.

4 Evaluation

To evaluate PONG, we use a standard pseudo-disambiguation task, detailed in Section 4.1. Section 4.2 describes our test set. Section 4.3 lists the metrics we evaluate on this test set. Section 4.4 describes the baselines we compare PONG against on these metrics, and Section 4.5 describes the relations we compare them on. Section 4.6 reports our results. Section 4.7 analyzes sources of error.

4.1 Evaluation task

The pseudo-disambiguation task (Gale et al., 1992; Schütze, 1992) is as follows: given a target word t, a relation R, a relative r, and a random distracter r', prefer either r or r', whichever is likelier to have relation R to word t. This evaluation does not use a threshold: just prefer whichever word is likelier according to the model being evaluated. If the model assigns only one of the words a probability, prefer it, based on the assumption that the unknown probability of the other word is lower. If the model assigns the same probability to both words, or no probability to either word, do not prefer either word.
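The decision rule of this task can be rendered directly; a minimal sketch of the stated tie and missing-score rules, not the authors' code:

```python
def prefer(score_r, score_d):
    """Pseudo-disambiguation decision: prefer the likelier of the true
    relative r and the distracter r'. Scores are model probabilities,
    or None when the model assigns no probability to that word."""
    if score_r is None and score_d is None:
        return None                        # no probability for either word
    if score_d is None:
        return 'relative'                  # unscored word assumed less likely
    if score_r is None:
        return 'distracter'
    if score_r == score_d:
        return None                        # tie: no preference
    return 'relative' if score_r > score_d else 'distracter'
```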
4.2 Test set

As a source of evaluation data, we used the British National Corpus (BNC). As a common test corpus for all the methods we evaluated, we selected one half of BNC by sorting filenames alphabetically and using the odd-numbered files. We used the other half of BNC as a training corpus for the baseline methods we compared PONG to.

A test set for the pseudo-disambiguation task consists of tuples of the form (R, t, r, r'). To construct a test set, we adapted the process used by Rooth et al. (1999) and Erk et al. (2010).

First, we chose 100 (R, t) pairs for each relation R at random from the test corpus. Rooth et al. (1999) and Erk et al. (2010) chose such pairs from a training corpus to ensure that it contained the target t. In contrast, choosing pairs from an unseen test corpus includes target words whether or not they occur in the training corpus.

To obtain a sample stratified by frequency, rather than skewed heavily toward high-frequency pairs, Erk et al. (2010) drew (R, t) pairs from each of five frequency bands in the entire British National Corpus (BNC): 50-100 occurrences; 101-200; 201-500; 500-1000; and more than 1000. However, we use only half of BNC as our test corpus, so to obtain a comparable test set, we drew 20 (R, t) pairs from each of the corresponding frequency bands in that half: 26-50 occurrences; 51-100; 101-250; 251-500; and more than 500.

For each chosen (R, t) pair, we drew a separate (R, t, r) triple from each of six frequency bands: 1-25 occurrences; 26-50; 51-100; 101-250; 251-500; and more than 500. We necessarily omitted frequency bands that contained no such triples. We filtered out triples where r did not have the most frequent part of speech for the relation R. For example, this filter would exclude the triple (dobj, celebrate, the) because a direct object is most frequently a noun, but the is a determiner.

Then, like Erk et al. (2010), we paired the relative r in each (R, t, r) triple with a distracter r' with the same (most frequent) part of speech as the relative r, yielding the test tuple (R, t, r, r'). Rooth et al. (1999) restricted distracter candidates to words with between 30 and 3,000 occurrences in BNC; accordingly, we chose only distracters with between 15 and 1,500 occurrences in our test corpus. We selected r' from these candidates randomly, with probability proportional to their frequency in the test corpus. Like Rooth et al. (1999), we excluded as distracters any actual relatives, i.e. candidates r' where the test corpus contained the triple (R, t, r'). Table 3 shows the resulting number of (R, t, r, r') test tuples for each relation.

    Relation R   # tuples for R   # tuples for RT
    advmod       121              131
    amod         162              128
    conj_and     155              151
    dobj         145              167
    nn           173              158
    nsubj        97               124
    prep_of      144              153
    xcomp        139              140

    Table 3: Test set size for each relation

4.3 Metrics

We report four evaluation metrics: precision, coverage, recall, and F-score. Precision (called accuracy in some papers on selectional preferences) is the percentage of all covered tuples where the original relative r is preferred. Coverage is the percentage of tuples for which the model prefers r to r' or vice versa. Recall is the percentage of all tuples where the original relative is preferred, i.e., precision times coverage. F-score is the harmonic mean of precision and recall.

4.4 Baselines

We compare PONG to two baseline methods.

EPP is a state-of-the-art model for which Erk et al. (2010) reported better performance than both Resnik's (1996) WordNet model and Rooth's (1999) EM clustering model. EPP computes selectional preferences using distributional similarity, based on the assumption that relatives are likely to appear in the same contexts as relatives seen in the training corpus. EPP computes the similarity of a potential relative's vector space representation to relatives in the training corpus.

EPP has various options for its vector space representation, similarity measure, weighting scheme, generalization space, and whether to use PCA. In re-implementing EPP, we chose the options that performed best according to Erk et al. (2010), with one exception. To save work, we chose not to use PCA, which Erk et al. (2010) described as performing only slightly better in the dependency-based space.
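The four metrics of Section 4.3 follow mechanically from the per-tuple decisions. A minimal sketch, assuming decisions are recorded as 'relative', 'distracter', or None (no preference):

```python
def metrics(decisions):
    """Precision, coverage, recall, and F-score over pseudo-disambiguation
    decisions ('relative' means the original relative was preferred)."""
    n = len(decisions)
    covered = [d for d in decisions if d is not None]
    correct = sum(1 for d in covered if d == 'relative')
    precision = correct / len(covered) if covered else 0.0
    coverage = len(covered) / n if n else 0.0
    recall = precision * coverage          # equivalently correct / n
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, coverage, recall, f
```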
    Relation   Target   Relative    Description
    advmod     verb     adverb      Adverbial modifier
    amod       noun     adjective   Adjective modifier
    conj_and   noun     noun        Conjunction with "and"
    dobj       verb     noun        Direct object
    nn         noun     noun        Noun compound modifier
    nsubj      verb     noun        Nominal subject
    prep_of    noun     noun        Prepositional modifier
    xcomp      verb     verb        Open clausal complement

    Table 4: Relations tested

Table 5: Coverage, Precision, Recall, and F-score for various relations; RT is the inverse of relation R. PONG uses POS N-grams, EPP uses distributional similarity, and DEP uses dependency parses.
To score a potential relative r0, EPP uses this formula:

    Selpref_{R,t}(r0) = Σ_{r ∈ Seenargs(R,t)} sim(r0, r) · wt_{R,t}(r) / Z_{R,t}

Here sim(r0, r) is the nGCM similarity, defined below, between the vector space representations of r0 and a relative r seen in the training data:

    sim_nGCM(a, a') = exp( − Σ_{i=1}^{n} ( a_{b_i} / ‖a‖ − a'_{b_i} / ‖a'‖ )² )

    where ‖a‖ = √( Σ_{i=1}^{n} a_{b_i}² )

The weight function wt_{R,t}(a) is analogous to inverse document frequency in Information Retrieval.

DEP, our second baseline method, runs the Stanford dependency parser to label the training corpus with grammatical relations, and uses their frequencies to predict selectional preferences. To do the pseudo-disambiguation task, DEP compares the frequencies of (R, t, r) and (R, t, r').

4.5 Relations tested

To test PONG, EPP, and DEP, we chose the eight most frequent relations between content words in the WSJ corpus, which occur over 10,000 times and are described in Table 4. We also tested their inverse relations. However, EPP does not compute selectional preferences for adjectives and adverbs as relatives. For this reason, we did not test EPP on the advmod and amod relations, which have adverbs and adjectives as relatives.
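The nGCM similarity used by the EPP baseline can be sketched directly. Here plain dense vectors stand in for the actual dimension selection b_i; the length normalization and squared-difference form follow the definition in this section:

```python
import math

def sim_ngcm(a, b):
    """nGCM similarity between two vector representations: each vector is
    divided by its Euclidean length, and similarity is exp(-sum of squared
    coordinate differences), so identical directions give similarity 1."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return math.exp(-sum((x / na - y / nb) ** 2 for x, y in zip(a, b)))
```

Note that the similarity depends only on the directions of the two vectors, not their magnitudes, and decays exponentially as the normalized vectors diverge.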
4.6 Experimental results

Table 5 displays results for all 16 relations. To compute statistical significance conservatively in comparing methods, we used paired t-tests with N = 16 relations.

PONG's precision was significantly better than EPP's (p < 0.001) but worse than DEP's (p < 0.0001). Still, PONG's high precision validates its underlying assumption that POS N-grams strongly predict grammatical dependencies.

On coverage and recall, EPP beat PONG, which beat DEP (p < 0.0001). PONG's F-score was higher, but not significantly, than EPP's (p > 0.5) or DEP's (p > 0.02).

4.7 Error analysis

In the pseudo-disambiguation task of choosing which of two words is related to a target, PONG makes errors of coverage (preferring neither word) and precision (preferring the wrong word).

Coverage errors, which occurred 17.4% of the time on average, arose only when PONG failed to estimate a probability for either word. PONG fails to score a potential relative r of a target t with a specified relation R if the labeled corpus has no POS N-grams that (a) map to R, (b) contain the POS of t and r, and (c) match Google word N-grams with t and r at those positions. Every relation has at least one POS N-gram that maps to it, so condition (a) never fails. PONG uses the most frequent POS of t and r, and we believe that condition (b) never fails. However, condition (c) can and does fail when t and r do not co-occur in any Google N-grams, at least none that match a POS N-gram that can map to relation R. For example, oversee and diet do not co-occur in any Google N-grams, so PONG cannot score diet as a potential dobj of oversee.

Precision errors, which occurred 17% of the time on average, arose when (a) PONG scored the distracter but failed to score the true relative, or (b) scored them both but preferred the distracter. Case (a) accounted for 44.62% of the errors on the covered test tuples.

One likely cause of errors in case (b) is over-generalization when PONG abstracts a word N-gram labeled with a relation by mapping its POS N-gram to that relation. In particular, the coarse POS tag set may discard too much information. Another likely cause of errors is probabilities estimated poorly due to sparse data. The probability of a relation for a POS N-gram rare in the training corpus is likely to be inaccurate. So is the probability of a POS N-gram for rare co-occurrences of a target and relative in Google word N-grams. Using a smaller tag set may reduce the sparse data problem but increase the risk of over-generalization.

5 Relation to Prior Work

In predicting selectional preferences, a key issue is generalization. Our DEP baseline simply counts co-occurrences of target and relative words in a corpus to predict selectional preferences, but only for words seen in the corpus. Prior work, summarized in Table 6, has therefore tried to infer the similarity of unseen relatives to seen relatives. To illustrate, consider the problem of inducing that the direct objects of celebrate tend to be days or events.

Resnik (1996) combined WordNet with a labeled corpus to model the probability that relatives of a predicate belong to a particular conceptual class. This method could notice, for example, that the direct objects of celebrate tend to belong to the conceptual class event. Thus it could prefer anniversary or occasion as the object of celebrate even if unseen in its training corpus. However, this method depends strongly on the WordNet taxonomy.

Rather than use linguistic resources such as WordNet, Rooth et al. (1999) and Wald et al. (2008) induced semantically annotated subcategorization frames from unlabeled corpora. They modeled semantic classes as hidden variables, which they estimated using EM-based clustering. Ritter (2010) computed selectional preferences by using unsupervised topic models such as LinkLDA, which infers semantic classes of words automatically instead of requiring a pre-defined set of classes as input.

The contexts in which a linguistic unit occurs provide information about its meaning. Erk (2007) and Erk et al. (2010) modeled the contexts of a word as the distribution of words that co-occur with it. They calculated the semantic similarity of two words as the similarity of their context distributions according to various measures. Erk et al. (2010) reported the state-of-the-art method we used as our EPP baseline.

In contrast to prior work that explored various solutions to the generalization problem, we don't so much solve this problem as circumvent it. Instead of generalizing from a training corpus directly to unseen words, PONG abstracts a word N-gram to a POS N-gram and maps it to the relations that the word N-gram is labeled with.
Table 6: Summary of prior work on selectional preferences. Each entry lists: relation to target; lexical resource information used; primary (labeled) corpus and information used; generalization (unlabeled) corpus and information used; method.

Resnik, 1996: Verb-object, verb-subject, adjective-noun, modifier-head, head-modifier; senses in the WordNet noun taxonomy; target, relative, and relation in a parsed, partially sense-tagged corpus (Brown corpus); no generalization corpus; information theoretic model.

Rooth et al., 1999: Verb-object, verb-subject; no lexical resource; target, relative, and relation in a parsed corpus (parsed BNC); no generalization corpus; EM-based clustering.

Ritter, 2010: Verb-subject, verb-object; no lexical resource; subject-verb-object tuples from 500 million web pages; no generalization corpus; LDA model.

Erk, 2007: Predicate and semantic roles; no lexical resource; target, relative, and relation in a semantic role labeled corpus (FrameNet); words and their relations in a parsed corpus (BNC); similarity model based on word co-occurrence.

Erk et al., 2010: SYN option: verb-subject, verb-object, and their inverse relations; SEM option: verb and semantic roles that have nouns as their headword in a primary corpus, and their inverse relations; no lexical resource; primary corpus: a parsed corpus (parsed BNC) for the SYN option, or a semantic role labeled corpus (FrameNet) for the SEM option; generalization corpus: WORDSPACE, an unlabeled corpus (BNC), or DEPSPACE, words and their subject and object relations in a parsed corpus (parsed BNC); similarity model using vector space representation of words.

Zhou et al., 2011: Any (relations not distinguished); no lexical resource; counts of words in the Web or Google N-grams; no generalization corpus; PMI (Pointwise Mutual Information).

This paper: All grammatical dependencies in a parsed corpus, and their inverse relations; no lexical resource; POS N-gram distribution for relations in the parsed WSJ corpus; POS N-gram distribution for target and relative in Google N-grams; combines both POS N-gram distributions.
To compute selectional preferences, whether the words are in the training corpus or not, PONG applies these abstract mappings to word N-grams in the much larger Google N-grams corpus.

Some prior work on selectional preferences has used POS N-grams and a large unlabeled corpus. The most closely related work we found was by Gormley et al. (2011). They used patterns in POS N-grams to generate test data for their selectional preferences model, but not to infer preferences. Zhou et al. (2011) identified selectional preferences of one word for another
by using Pointwise Mutual Information (PMI) (Fano, 1961) to check whether they co-occur more frequently in a large corpus than predicted by their unigram frequencies. However, their method did not distinguish among different relations.

6 Conclusion

This paper describes, derives, and evaluates PONG, a novel probabilistic model of selectional preferences. PONG uses a labeled corpus to map POS N-grams to grammatical relations. It combines this mapping with probabilities estimated from a much larger POS-tagged but unlabeled Google N-grams corpus.

We tested PONG on the eight most common relations in the WSJ corpus, and their inverses, more relations than evaluated in prior work. Compared to the state-of-the-art EPP baseline (Erk et al., 2010), PONG averaged higher precision but lower coverage and recall. Compared to the DEP baseline, PONG averaged lower precision but higher coverage and recall. All these differences were substantial (p < 0.001). Compared to both baselines, PONG's average F-score was higher, though not significantly.

Some directions for future work include: First, improve PONG by incorporating models of lexical similarity explored in prior work. Second, use the universal tag set to extend PONG to other languages, or to perform better in English. Third, in place of grammatical relations, use rich, diverse semantic roles, while avoiding sparsity. Finally, use selectional preferences to teach word connotations by using various relations to generate example sentences or useful questions.

Acknowledgments

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A080157. The opinions expressed are those of the authors and do not necessarily represent the views of the Institute or the U.S. Department of Education. We thank the helpful reviewers and Katrin Erk for her generous assistance.

References

Erk, K. 2007. A Simple, Similarity-Based Model for Selectional Preferences. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic, June 2007, 216-223.

Erk, K., Padó, S. and Padó, U. 2010. A Flexible, Corpus-Driven Model of Regular and Inverse Selectional Preferences. Computational Linguistics 36(4), 723-763.

Fano, R. 1961. Transmission of Information: A Statistical Theory of Communications. MIT Press, Cambridge, MA.

Franz, A. and Brants, T. 2006. All Our N-Gram Are Belong to You.

Gale, W.A., Church, K.W. and Yarowsky, D. 1992. Work on Statistical Methods for Word Sense Disambiguation. In Proceedings of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language, Cambridge, MA, October 23-25, 1992, 54-60.

Gildea, D. and Jurafsky, D. 2002. Automatic Labeling of Semantic Roles. Computational Linguistics 28(3), 245-288.

Gormley, M.R., Dredze, M., Durme, B.V. and Eisner, J. 2011. Shared Components Topic Models with Application to Selectional Preference. NIPS Workshop on Learning Semantics, Sierra Nevada, Spain.

im Walde, S.S., Hying, C., Scheible, C. and Schmid, H. 2008. Combining EM Training and the MDL Principle for an Automatic Verb Classification Incorporating Selectional Preferences. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, Columbus, OH, 2008, 496-504.

Klein, D. and Manning, C.D. 2003. Accurate Unlexicalized Parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, July 7-12, 2003, E.W. Hinrichs and D. Roth, Eds.

Petrov, S., Das, D. and McDonald, R.T. 2011. A Universal Part-of-Speech Tagset. arXiv:1104.2086.
Tagging Text with Lexical Semantics: Why, What,
and How, Washington, DC, April 4-5, 1997, 52-57.
WebCAGe - A Web-Harvested Corpus Annotated with GermaNet Senses
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 387-396, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
Wikipedia [2].

As a proof of concept, this automatic method has been applied to German, a language for which sense-annotated corpora are still in short supply and fail to satisfy most if not all of the criteria under (3) above. While the present paper focuses on one particular language, the method as such is language-independent. In the case of German, the sense inventory is taken from the German wordnet GermaNet [3] (Henrich and Hinrichs, 2010; Kunze and Lemnitzer, 2002). The web-harvesting relies on an existing mapping of GermaNet to the German version of the web-based dictionary Wiktionary. This mapping is described in Henrich et al. (2011). The resulting resource consists of a web-harvested corpus WebCAGe (short for: Web-Harvested Corpus Annotated with GermaNet Senses), which is freely available at: http://www.sfs.uni-tuebingen.de/en/webcage.shtml

The remainder of this paper is structured as follows: Section 2 provides a brief overview of the resources GermaNet and Wiktionary. Section 3 introduces the mapping of GermaNet to Wiktionary and how this mapping can be used to automatically harvest sense-annotated materials from the web. The algorithm for identifying the target words in the harvested texts is described in Section 4. In Section 5, the approach of automatically creating a web-harvested corpus annotated with GermaNet senses is evaluated and compared to existing sense-annotated corpora for German. Related work is discussed in Section 6, together with concluding remarks and an outlook on future work.

2 Resources

2.1 GermaNet

GermaNet (Henrich and Hinrichs, 2010; Kunze and Lemnitzer, 2002) is a lexical semantic network that is modeled after the Princeton WordNet for English (Fellbaum, 1998). It partitions the lexical space into a set of concepts that are interlinked by semantic relations. A semantic concept is represented as a synset, i.e., as a set of words whose individual members (referred to as lexical units) are taken to be (near) synonyms. Thus, a synset is a set-representation of the semantic relation of synonymy.

There are two types of semantic relations in GermaNet. Conceptual relations hold between two semantic concepts, i.e. synsets. They include relations such as hypernymy, part-whole relations, entailment, or causation. Lexical relations hold between two individual lexical units. Antonymy, a pair of opposites, is an example of a lexical relation.

GermaNet covers the three word categories of adjectives, nouns, and verbs, each of which is hierarchically structured in terms of the hypernymy relation of synsets. The development of GermaNet started in 1997, and is still in progress. GermaNet's version 6.0 (release of April 2011) contains 93407 lexical units, which are grouped into 69594 synsets.

2.2 Wiktionary

Wiktionary is a web-based dictionary that is available for many languages, including German. As is the case for its sister project Wikipedia, it is written collaboratively by volunteers and is freely available [4]. The dictionary provides information such as part-of-speech, hyphenation, possible translations, inflection, etc. for each word. It includes, among others, the same three word classes of adjectives, nouns, and verbs that are also available in GermaNet. Distinct word senses are distinguished by sense descriptions and accompanied with example sentences illustrating the sense in question.

Further, Wiktionary provides relations to other words, e.g., in the form of synonyms, antonyms, hypernyms, hyponyms, holonyms, and meronyms. In contrast to GermaNet, the relations are (mostly) not disambiguated.

For the present project, a dump of the German Wiktionary as of February 2, 2011 is utilized, consisting of 46457 German words comprising 70339 word senses. The Wiktionary data was extracted by the freely available Java-based library JWKTL [5].

[2] http://www.wikipedia.org/
[3] Using a wordnet as the gold standard for the sense inventory is fully in line with standard practice for English where the Princeton WordNet (Fellbaum, 1998) is typically taken as the gold standard.
[4] Wiktionary is available under the Creative Commons Attribution/Share-Alike license http://creativecommons.org/licenses/by-sa/3.0/deed.en
[5] http://www.ukp.tu-darmstadt.de/software/jwktl

Figure 1: Sense mapping of GermaNet and Wiktionary using the example of Bogen.

3 Creation of a Web-Harvested Corpus

The starting point for creating WebCAGe is an existing mapping of GermaNet senses with Wiktionary sense definitions as described in Henrich et al. (2011). This mapping is the result of a two-stage process: i) an automatic word overlap alignment algorithm in order to match GermaNet senses with Wiktionary sense descriptions, and ii) a manual post-correction step of the automatic alignment. Manual post-correction can be kept at a reasonable level of effort due to the high accuracy (93.8%) of the automatic alignment.

The original purpose of this mapping was to automatically add Wiktionary sense descriptions to GermaNet. However, the alignment of these two resources opens up a much wider range of possibilities for data mining community-driven resources such as Wikipedia and web-generated content more generally. It is precisely this potential that is fully exploited for the creation of the WebCAGe sense-annotated corpus.

Fig. 1 illustrates the existing GermaNet-Wiktionary mapping using the example word Bogen. The polysemous word Bogen has three distinct senses in GermaNet which directly correspond to three separate senses in Wiktionary [6]. Each Wiktionary sense entry contains a definition and one or more example sentences illustrating the sense in question. The examples in turn are often linked to external references, including sentences contained in the German Gutenberg text archive [7] (see link in the topmost Wiktionary sense entry in Fig. 1), Wikipedia articles (see link for the third Wiktionary sense entry in Fig. 1), and other textual sources (see the second sense entry in Fig. 1). It is precisely this collection of

[6] Note that there are further senses in both resources not displayed here for reasons of space.
[7] http://gutenberg.spiegel.de/
Figure 2: Sense mapping of GermaNet and Wiktionary using the example of Archiv.
heterogeneous material that can be harvested for the purpose of compiling a sense-annotated corpus. Since the target word (rendered in Fig. 1 in bold face) in the example sentences for a particular Wiktionary sense is linked to a GermaNet sense via the sense mapping of GermaNet with Wiktionary, the example sentences are automatically sense-annotated and can be included as part of WebCAGe.

Additional material for WebCAGe is harvested by following the links to Wikipedia, the Gutenberg archive, and other web-based materials. The external webpages and the Gutenberg texts are obtained from the web by a web-crawler that takes some URLs as input and outputs the texts of the corresponding web sites. The Wikipedia articles are obtained by the open-source Java Wikipedia Library JWPL [8]. Since the links to Wikipedia, the Gutenberg archive, and other web-based materials also belong to particular Wiktionary sense entries that in turn are mapped to GermaNet senses, the target words contained in these materials are automatically sense-annotated.

Notice that the target word often occurs more than once in a given text. In keeping with the widely used heuristic of one sense per discourse, multiple occurrences of a target word in a given text are all assigned to the same GermaNet sense. An inspection of the annotated data shows that this heuristic has proven to be highly reliable in practice. It is correct in 99.96% of all target word occurrences in the Wiktionary example sentences, in 96.75% of all occurrences in the external webpages, and in 95.62% of the Wikipedia files.

WebCAGe is developed primarily for the purpose of the word sense disambiguation task. Therefore, only those target words that are genuinely ambiguous are included in this resource. Since WebCAGe uses GermaNet as its sense inventory, this means that each target word has at least two GermaNet senses, i.e., belongs to at least two distinct synsets.

The GermaNet-Wiktionary mapping is not always one-to-one. Sometimes one GermaNet sense is mapped to more than one sense in Wiktionary. Fig. 2 illustrates such a case. For the word Archiv each resource records three distinct senses. The first sense (data repository) in GermaNet corresponds to the first sense in Wiktionary, and the second sense in GermaNet (archive) corresponds to both the second and third senses in Wiktionary. The third sense in GermaNet (archived file) does not map onto any sense in Wiktionary at all. As a result, the word Archiv is included in the WebCAGe resource with precisely the sense mappings connected by the arrows shown in Fig. 2. The fact that the second GermaNet sense corresponds to two sense descriptions in Wiktionary simply means that the target words in the example are both annotated by the same sense. Furthermore, note that the word Archiv is still genuinely ambiguous since there is a second (one-to-one) mapping between the first senses recorded in GermaNet and Wiktionary, respectively. However, since the third GermaNet sense is not mapped onto any Wiktionary sense at all, WebCAGe will not contain any example sentences for this particular GermaNet sense.

The following section describes how the target words within these textual materials can be automatically identified.

4 Automatic Detection of Target Words

For highly inflected languages such as German, target word identification is more complex compared to languages with an impoverished inflectional morphology, such as English, and thus requires automatic lemmatization. Moreover, the target word in a text to be sense-annotated is not always a simplex word but can also appear as subpart of a complex word such as a compound. Since the constituent parts of a compound are not usually separated by blank spaces or hyphens, German compounding poses a particular challenge for target word identification. Another challenging case for automatic target word detection in German concerns particle verbs such as ankündigen 'announce'. Here, the difficulty arises when the verbal stem (e.g., kündigen) is separated from its particle (e.g., an) in German verb-initial and verb-second clause types.

As a preprocessing step for target word identification, the text is split into individual sentences, tokenized, and lemmatized. For this purpose, the sentence detector and the tokenizer of the suite of Apache OpenNLP tools [9] and the TreeTagger (Schmid, 1994) are used. Further, compounds are split by using BananaSplit [10]. Since the automatic lemmatization obtained by the tagger and the compound splitter are not 100% accurate, target word identification also utilizes the full set of inflected forms for a target word whenever such information is available. As it turns out, Wiktionary can often be used for this purpose as well since the German version of Wiktionary often contains the full set of word forms in tables [11] such as the one shown in Fig. 3 for the word Bogen.

Figure 3: Wiktionary inflection table for Bogen.

Fig. 4 shows an example of such a sense-annotated text for the target word Bogen 'violin bow'. The text is an excerpt from the Wikipedia article Violine 'violin', where the target word (rendered in bold face) appears many times. Only the second occurrence shown in the figure (marked with a 2 on the left) exactly matches the word Bogen as is. All other occurrences are either the plural form Bögen (4 and 7), the genitive form Bogens (8), part of a compound such as Bogenstange (3), or the plural form as part of a compound such as in Fernambukbögen and Schülerbögen (5 and 6). The first occurrence of the target word in Fig. 4 is also part of a compound. Here, the target word occurs in the singular as part of the adjectival compound bogengestrichenen.

For expository purposes, the data format shown in Fig. 4 is much simplified compared to the actual, XML-based format in WebCAGe. The infor-

[8] http://www.ukp.tu-darmstadt.de/software/jwpl/
[9] http://incubator.apache.org/opennlp/
[10] http://niels.drni.de/s9y/pages/bananasplit.html
[11] The inflection table cannot be extracted with the Java Wikipedia Library JWPL. It is rather extracted from the Wiktionary dump file.
Figure 4: Excerpt from the Wikipedia article Violine 'violin' tagged with the target word Bogen 'violin bow'.
mation for each occurrence of a target word consists of the GermaNet sense, i.e., the lexical unit ID, the lemma of the target word, and the GermaNet word category information, i.e., ADJ for adjectives, NN for nouns, and VB for verbs.

5 Evaluation

In order to assess the effectiveness of the approach, we examine the overall size of WebCAGe and the relative size of the different text collections (see Table 1), compare WebCAGe to other sense-annotated corpora for German (see Table 2), and present a precision- and recall-based evaluation of the algorithm that is used for automatically identifying target words in the harvested texts (see Table 3).

Table 1: Current size of WebCAGe.

                                     Wiktionary  External  Wikipedia  Gutenberg     All
                                     examples    webpages  articles   texts         texts
Number of tagged   adjectives            575         31         79        28         713
word tokens        nouns                4103        446       1643       655        6847
                   verbs                2966        112         10       102        3190
                   all word classes     7644        589       1732       785       10750
Number of tagged   adjectives            565         31         76        26         698
sentences          nouns                3965        420       1404       624        6413
                   verbs                2945        112         10       102        3169
                   all word classes     7475        563       1490       752       10280
Total number of    adjectives            623       1297        430     65030       67380
sentences          nouns                4184       9630       6851    376159      396824
                   verbs                3087       5285        263    146755      155390
                   all word classes     7894      16212       7544    587944      619594

Table 1 shows that Wiktionary (7644 tagged word tokens) and Wikipedia (1732) contribute by far the largest subsets of the total number of tagged word tokens (10750) compared with the external webpages (589) and the Gutenberg texts (785). These tokens belong to 2607 distinct polysemous words contained in GermaNet, among which there are 211 adjectives, 1499 nouns, and 897 verbs (see Table 2). On average, these words have 2.9 senses in GermaNet (2.4 for adjectives, 2.6 for nouns, and 3.6 for verbs).

Table 2 also shows that WebCAGe is considerably larger than the other two sense-annotated corpora available for German (Broscheit et al., 2010; Raileanu et al., 2002). It is important to keep in mind, though, that the other two resources were manually constructed, whereas WebCAGe is the result of an automatic harvesting method. Such an automatic method will only constitute a viable alternative to the labor-intensive manual method if the results are of sufficient quality so that the harvested data set can be used as is or can be further improved with a minimal amount of manual post-editing.

For the purpose of the present evaluation, we conducted a precision- and recall-based analysis for the text types of Wiktionary examples, external webpages, and Wikipedia articles separately for the three word classes of adjectives, nouns, and verbs. Table 3 shows that precision and recall for all three word classes that occur for Wiktionary examples, external webpages, and Wikipedia articles lie above 92%. The only sizeable deviations are the results for verbs that occur in the Gutenberg texts. Apart from this one exception, the results in Table 3 prove the viability of the proposed method for automatic harvesting of sense-annotated data. The average precision for all three word classes is of sufficient quality to be used as is if approximately 2-5% noise in the annotated data is acceptable. In order to eliminate such noise, manual post-editing is required. However, such post-editing is within acceptable limits: it took an experienced research assistant a total of 25 hours to hand-correct all the occurrences of sense-annotated target words and to manually sense-tag any missing target words for the four text types.

Table 3: Evaluation of the algorithm for identifying the target words.

                              Wiktionary  External  Wikipedia  Gutenberg
                              examples    webpages  articles   texts
Precision  adjectives           97.70%      95.83%    99.34%      100%
           nouns                98.17%      98.50%    95.87%    92.19%
           verbs                97.38%      92.26%      100%    69.87%
           all word classes     97.32%      96.19%    96.26%    87.43%
Recall     adjectives           97.70%      97.22%    98.08%    97.14%
           nouns                98.30%      96.03%    92.70%    97.38%
           verbs                97.51%      99.60%      100%    89.20%
           all word classes     97.94%      97.32%    93.36%    95.42%

6 Related Work and Future Directions

With relatively few exceptions to be discussed shortly, the construction of sense-annotated corpora has focussed on purely manual methods. This is true for SemCor, the WordNet Gloss Corpus, and for the training sets constructed for English as part of the SensEval and SemEval shared task competitions (Agirre et al., 2007; Erk and Strapparava, 2012; Mihalcea et al., 2004). Purely manual methods were also used for the German sense-annotated corpora constructed by Broscheit et al. (2010) and Raileanu et al. (2002) as well as for other languages including the Bulgarian and
the Chinese sense-tagged corpora (Koeva et al., 2006; Wu et al., 2006). The only previous attempts of harvesting corpus data for the purpose of constructing a sense-annotated corpus are the semi-supervised method developed by Yarowsky (1995), the knowledge-based approach of Leacock et al. (1998), later also used by Agirre and Lopez de Lacalle (2004), and the automatic association of Web directories (from the Open Directory Project, ODP) to WordNet senses by Santamaría et al. (2003).

The latter study (Santamaría et al., 2003) is closest in spirit to the approach presented here. It also relies on an automatic mapping between wordnet senses and a second web resource. While our approach is based on automatic mappings between GermaNet and Wiktionary, their mapping algorithm maps WordNet senses to ODP subdirectories. Since these ODP subdirectories contain natural language descriptions of websites relevant to the subdirectory in question, this textual material can be used for harvesting sense-specific examples. The ODP project also covers German so that, in principle, this harvesting method could be applied to German in order to collect additional sense-tagged data for WebCAGe.

The approach of Yarowsky (1995) first collects all example sentences that contain a polysemous word from a very large corpus. In a second step, a small number of examples that are representative for each of the senses of the polysemous target word is selected from the large corpus from step 1. These representative examples are manually sense-annotated and then fed into a decision-list supervised WSD algorithm as a seed set for iteratively disambiguating the remaining examples collected in step 1. The selection and annotation of the representative examples in Yarowsky's approach is performed completely manually and is therefore limited to the amount of data that can reasonably be annotated by hand.

Leacock et al. (1998), Agirre and Lopez de Lacalle (2004), and Mihalcea and Moldovan (1999) propose a set of methods for automatic harvesting of web data for the purposes of creating sense-annotated corpora. By focusing on web-based data, their work resembles the research described in the present paper. However, the underlying harvesting methods differ. While our approach relies on a wordnet to Wiktionary mapping, their approaches all rely on the monosemous relative heuristic. Their heuristic works as follows: In order to harvest corpus examples for a polysemous word, the WordNet relations such as synonymy and hypernymy are inspected for the presence of unambiguous words, i.e., words that only appear in exactly one synset. The examples found for these monosemous relatives can then be sense-annotated with the particular sense of its ambiguous word relative. In order to increase coverage of the monosemous relatives approach, Mihalcea and Moldovan (1999) have developed a gloss-based extension, which relies on word overlap of the gloss and the WordNet sense in question for all those cases where a monosemous relative is not contained in the WordNet dataset.

The approaches of Leacock et al., Agirre and Lopez de Lacalle, and Mihalcea and Moldovan as well as Yarowsky's approach provide interesting directions for further enhancing the WebCAGe resource. It would be worthwhile to use the automatically harvested sense-annotated examples as the seed set for Yarowsky's iterative method for creating a large sense-annotated corpus. Another fruitful direction for further automatic expansion of WebCAGe is to use the heuristic of monosemous relatives used by Leacock et al., by Agirre and Lopez de Lacalle, and by Mihalcea and Moldovan. However, we have to leave these matters for future research.

In order to validate the language independence of our approach, we plan to apply our method to sense inventories for languages other than German. A precondition for such an experiment is an existing mapping between the sense inventory in question and a web-based resource such as Wiktionary or Wikipedia. With BabelNet, Navigli and Ponzetto (2010) have created a multilingual resource that allows the testing of our approach on languages other than German. As a first step in this direction, we applied our approach to English using the mapping between the Princeton WordNet and the English version of Wiktionary provided by Meyer and Gurevych (2011). The results of these experiments, which are reported in Henrich et al. (2012), confirm the general applicability of our approach.

To conclude: This paper describes an automatic method for creating a domain-independent sense-annotated corpus harvested from the web. The data obtained by this method for German have resulted in the WebCAGe resource which currently represents the largest sense-annotated corpus available for this language. The publication of this paper is accompanied by making WebCAGe freely available.

Acknowledgements

The research reported in this paper was jointly funded by the SFB 833 grant of the DFG and by the CLARIN-D grant of the BMBF. We would like to thank Christina Hoppermann, Marie Hinrichs as well as three anonymous EACL 2012 reviewers for their helpful comments on earlier versions of this paper. We are very grateful to Reinhild Barkey, Sarah Schulz, and Johannes Wahle for their help with the evaluation reported in Section 5. Special thanks go to Yana Panchenko and Yannick Versley for their support with the web-crawler and to Emanuel Dima and Klaus Suttner for helping us to obtain the Gutenberg and Wikipedia texts.

References

Agirre, E., Lopez de Lacalle, O. 2004. Publicly available topic signatures for all WordNet nominal senses. Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal, pp. 1123-1126

Agirre, E., Marquez, L., Wicentowski, R. 2007. Proceedings of the 4th International Workshop on Semantic Evaluations. Assoc. for Computational Linguistics, Stroudsburg, PA, USA

Broscheit, S., Frank, A., Jehle, D., Ponzetto, S. P., Rehl, D., Summa, A., Suttner, K., Vola, S. 2010. Rapid bootstrapping of Word Sense Disambiguation resources for German. Proceedings of the 10. Konferenz zur Verarbeitung Natürlicher Sprache, Saarbrücken, Germany, pp. 19-27

Erk, K., Strapparava, C. 2010. Proceedings of the 5th International Workshop on Semantic Evaluation. Assoc. for Computational Linguistics, Stroudsburg, PA, USA

Fellbaum, C. (ed.). 1998. WordNet - An Electronic Lexical Database. The MIT Press.

Henrich, V., Hinrichs, E. 2010. GernEdiT - The GermaNet Editing Tool. Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, pp. 2228-2235

Henrich, V., Hinrichs, E., Vodolazova, T. 2011. Semi-Automatic Extension of GermaNet with Sense Definitions from Wiktionary. Proceedings of the 5th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC'11), Poznan, Poland, pp. 126-130

Henrich, V., Hinrichs, E., Vodolazova, T. 2012. An Automatic Method for Creating a Sense-Annotated Corpus Harvested from the Web. Poster presented at 13th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2012), New Delhi, India, March 2012

Koeva, S., Leseva, S., Todorova, M. 2006. Bulgarian Sense Tagged Corpus. Proceedings of the 5th SALTMIL Workshop on Minority Languages:
Strategies for Developing Machine Translation for Minority Languages, Genoa, Italy, pp. 79-87

Kunze, C., Lemnitzer, L. 2002. GermaNet - representation, visualization, application. Proceedings of the 3rd International Language Resources and Evaluation (LREC'02), Las Palmas, Canary Islands, pp. 1485-1491

Leacock, C., Chodorow, M., Miller, G. A. 1998. Using corpus statistics and wordnet relations for sense identification. Computational Linguistics, 24(1):147-165

Meyer, C. M., Gurevych, I. 2011. What Psycholinguists Know About Chemistry: Aligning Wiktionary and WordNet for Increased Domain Coverage. Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), Chiang Mai, Thailand, pp. 883-892

Mihalcea, R., Moldovan, D. 1999. An Automatic Method for Generating Sense Tagged Corpora. Proceedings of the American Association for Artificial Intelligence (AAAI'99), Orlando, Florida, pp. 461-466

Mihalcea, R., Chklovski, T., Kilgarriff, A. 2004. Proceedings of Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain

Navigli, R., Ponzetto, S. P. 2010. BabelNet: Building a Very Large Multilingual Semantic Network. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL'10), Uppsala, Sweden, pp. 216-225

Raileanu, D., Buitelaar, P., Vintar, S., Bay, J. 2002. Evaluation Corpora for Sense Disambiguation in the Medical Domain. Proceedings of the 3rd International Language Resources and Evaluation (LREC'02), Las Palmas, Canary Islands, pp. 609-612

Santamaría, C., Gonzalo, J., Verdejo, F. 2003. Automatic Association of Web Directories to Word Senses. Computational Linguistics 29(3), MIT Press, pp. 485-502

Schmid, H. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK

Wu, Y., Jin, P., Zhang, Y., Yu, S. 2006. A Chinese Corpus with Word Sense Annotation. Proceedings of 21st International Conference on Computer Processing of Oriental Languages (ICCPOL'06), Singapore, pp. 414-421

Yarowsky, D. 1995. Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 189-196
Learning to Behave by Reading
Regina Barzilay
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
regina@csail.mit.edu
Abstract
In this talk, I will address the problem of grounding linguistic analysis in control applications, such as game playing and robot navigation. We assume access to natural language documents that describe the desired behavior of a control algorithm (e.g., game strategy guides). Our goal is to demonstrate that knowledge automatically extracted from such documents can dramatically improve performance of the target application. First, I will present a reinforcement learning algorithm for learning to map natural language instructions to executable actions. This technique has enabled automation of tasks that until now have required human participation, for example, automatically configuring software by consulting how-to guides. Next, I will present a Monte-Carlo search algorithm for game playing that incorporates information from game strategy guides. In this framework, the task of text interpretation is formulated as a probabilistic model that is trained based on feedback from Monte-Carlo search. When applied to the Civilization strategy game, a language-empowered player outperforms its traditional counterpart by a significant margin.
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, page 397, Avignon, France, April 23-27, 2012. © 2012 Association for Computational Linguistics
Lexical surprisal as a general predictor of reading time
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 398-408, Avignon, France, April 23-27, 2012. © 2012 Association for Computational Linguistics
to constrain predictions. However, whereas such an unlexicalized (i.e., POS-based) surprisal has been shown to significantly predict RTs, success with lexical (i.e., word-based) surprisal has been limited. This can be attributed to data sparsity (larger training corpora might be needed to provide accurate lexical surprisal than for the unlexicalized counterpart), or to the noise introduced by participants' world knowledge, inaccessible to the models. The present study thus sets out to find such a lexical surprisal effect, trying to overcome possible limitations of previous research.

1.1 Surprisal theory

The concept of surprisal originated in the field of information theory, as a measure of the amount of information conveyed by a particular event. Improbable ("surprising") events carry more information than expected ones, so that surprisal is inversely related to probability, through a logarithmic function. In the context of sentence processing, if w_1, ..., w_{t-1} denotes the sentence so far, then the cognitive effort required for processing the next word, w_t, is assumed to be proportional to its surprisal:

effort(t) ∝ surprisal(w_t) = −log P(w_t | w_1, ..., w_{t-1})    (1)

Different theoretical groundings for this relationship have been proposed (Hale, 2001; Levy, 2008; Smith and Levy, 2008). Smith and Levy derive it by taking a scale-free assumption: Any linguistic unit can be subdivided into smaller entities (e.g., a sentence is comprised of words, a word of phonemes), so that time to process the whole will equal the sum of processing times for each part. Since the probability of the whole can be expressed as the product of the probabilities of the subunits, the function relating probability and effort must be logarithmic. Levy (2008), on the other hand, grounds surprisal in its information-theoretical context, describing difficulty encountered in on-line sentence processing as a result of the need to update a probability distribution over possible parses, being directly proportional to the difference between the previous and updated distributions. By expressing the difference between these in terms of relative entropy, Levy shows that difficulty at each newly encountered word should be equal to its surprisal.

1.2 Empirical evidence for surprisal

The simplest statistical language models that can be used to estimate surprisal values are n-gram models or Markov chains, which condition the probability of a given word only on its n−1 preceding ones. Although Markov models theoretically limit the amount of prior information that is relevant for prediction of the next step, they are often used in linguistic context as an approximation to the full conditional probability. The effect of bigram probability (or forward transitional probability) has been repeatedly observed (e.g. McDonald and Shillcock, 2003), and Smith and Levy (2008) report an effect of lexical surprisal as estimated by a trigram model on RTs for the Dundee corpus (a collection of newspaper texts with eye-tracking data from ten participants; Kennedy and Pynte, 2005).

Phrase structure grammars (PSGs) have also been amply used as language models (Boston et al., 2008; Brouwer et al., 2010; Demberg and Keller, 2008; Hale, 2001; Levy, 2008). PSGs can combine statistical exposure effects with explicit syntactic rules, by annotating norms with their respective probabilities, which can be estimated from occurrence counts in text corpora. Information about hierarchical sentence structure can thus be included in the models. In this way, Brouwer et al. trained a probabilistic context-free grammar (PCFG) on 204,000 sentences extracted from Dutch newspapers to estimate lexical surprisal (using an Earley-Stolcke parser; Stolcke, 1995), showing that it could account for the noun phrase coordination bias previously described and explained by Frazier (1987) in terms of a minimal-attachment preference of the human parser. In contrast, Demberg and Keller used texts from a naturalistic source (the Dundee corpus) as the experimental stimuli, thus evaluating surprisal as a wide-coverage account of processing difficulty. They also employed a PSG, trained on a one-million-word language sample from the Wall Street Journal (part of the Penn Treebank II, Marcus et al., 1993). Using Roark's (2001) incremental parser, they found significant effects of unlexicalized surprisal on RTs (see also Boston et al. for a similar approach and results for German texts). However, they failed to find an effect for lexicalized surprisal, over and above forward transitional probability. Roark et al. (2009) also looked at the
effects of syntactic and lexical surprisal, using RT data for short narrative texts. However, their estimates of these two surprisal values differ from those described above: In order to tease apart semantic and syntactic effects, they used Demberg and Keller's lexicalized surprisal as a total surprisal measure, which they decompose into syntactic and lexical components. Their results show significant effects of both syntactic and lexical surprisal, although the latter was found to hold only for closed-class words. Lack of a wider effect was attributed to data sparsity: The models were trained on the relatively small Brown corpus (over one million words from 500 samples of American English text), so that surprisal estimates for the less frequent content words would not have been accurate enough.

Using the same training and experimental language samples as Demberg and Keller (2008), and only unlexicalized surprisal estimates, Frank (2009) and Frank and Bod (2011) focused on comparing different language models, including various n-gram models, PSGs and recurrent networks (RNN). The latter were found to be the better predictors of RTs, and PSGs could not explain any variance in RT over and above the RNNs, suggesting that human processing relies on linear rather than hierarchical representations.

Summing up, the only models taking into account actual words that have been consistently shown to simulate human behaviour with naturalistic text samples are bigram models.[1] A possible limitation in previous studies can be found in the stimuli employed. In reading real newspaper texts, prior knowledge of current affairs is likely to highly influence RTs; however, this source of variability cannot be accounted for by the models. In addition, whereas the models treat each sentence as an independent unit, in the text corpora employed they make up coherent texts, and are therefore clearly dependent. Thirdly, the stimuli used by Demberg and Keller (2008) comprise a very particular linguistic style: journalistic editorials, reducing the ability to generalize conclusions to language in general. Finally, failure to find lexical surprisal effects can also be attributed to the training texts. Larger corpora are likely to be needed for training language models on actual words than on POS (both the Brown corpus and the WSJ are relatively small), and in addition, the particular journalistic style of the WSJ might not be the best alternative for modeling human behaviour. Although similarity between the training and experimental data sets (both from newspaper sources) can improve the linguistic performance of the models, their ability to simulate human behaviour might be limited: Newspaper texts probably form just a small fraction of a person's linguistic experience. This study thus aims to tackle some of the identified limitations: Rather than cohesive texts, independent sentences from a narrative style are used as experimental stimuli for which word-reading times are collected (as explained in Section 3). In addition, as discussed in the following section, language models are trained on a larger corpus, from a more representative language sample. Following Frank (2009) and Frank and Bod (2011), two contrasting types of models are employed: hierarchical PSGs and linear RNNs.

2 Models

2.1 Training data

The training texts were extracted from the written section of the British National Corpus (BNC), a collection of language samples from a variety of sources, designed to provide a comprehensive representation of current British English. A total of 702,412 sentences, containing only the 7,754 most frequent words (the open-class words used by Andrews et al., 2009, plus the 200 most frequent words in English) were selected, making up a 7.6-million-word training corpus. In addition to providing a larger amount of data than the WSJ, this training set thus provides a more representative language sample.

2.2 Experimental sentences

Three hundred and sixty-one sentences, all comprehensible out of context and containing only words included in the subset of the BNC used to train the models, were randomly selected from three freely accessible on-line novels[2] (for additional details, see Frank, 2012). The fictional narrative provides a good contrast to the previously examined newspaper editorials from the Dundee corpus, since participants did not need prior knowledge regarding the details of the stories, and a less specialised language and style were employed. In addition, the randomly selected sentences did not make up coherent texts (in contrast, Roark et al., 2009, employed short stories), so that they were independent from each other, both for the models and the readers.

[Figure: network diagram whose surviving labels read "probability distribution over 7,754 word types", "200", and "400".]

[1] Although Smith and Levy (2008) report an effect of trigrams, they did not check if it exceeded that of simpler bigrams.
[2] Obtained from www.free-online-novels.com. Having not been published elsewhere, it is unlikely participants had read the novels previously.
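The corpus-construction step described in Sections 2.1 and 2.2 (keep only sentences whose words all belong to the fixed 7,754-word vocabulary) amounts to a simple filter. A minimal sketch, with an invented toy corpus and vocabulary (the paper's actual selection ran over the written BNC):

```python
def select_sentences(sentences, vocabulary):
    """Keep only sentences in which every word is in-vocabulary,
    mirroring the corpus-construction step described above."""
    vocab = set(vocabulary)
    return [s for s in sentences if all(w in vocab for w in s.split())]

# Invented toy data; the real vocabulary held 7,754 frequent words.
sentences = ["the dog sleeps", "the quokka sleeps", "a dog runs"]
vocabulary = ["the", "a", "dog", "sleeps", "runs"]
print(select_sentences(sentences, vocabulary))
# ['the dog sleeps', 'a dog runs']
```

Sentences containing any out-of-vocabulary word ("quokka" here) are discarded whole, which keeps both the training and the experimental materials within a closed vocabulary.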
PMI(w, v) = log[ (f_{w,v} Σ_{i,j} f_{i,j}) / (Σ_i f_{i,v} Σ_j f_{w,j}) ]    (2)

Finally, the 400 columns with the highest variance were selected from the 7,754 × 15,508 matrix of row vectors, making them more computationally manageable, but not significantly less informative.

Stage 2: Learning temporal structure

Using the standard backpropagation algorithm, a simple recurrent network (SRN) learned to predict, at each point in the training corpus, the next word's vector given the sequence of word vectors corresponding to the sentence so far. The total corpus was presented five times, each time with the sentences in a different random order.

Stage 3: Decoding predicted word representations

The distributed output of the trained SRN served as training input to the feedforward "decoder" network, which learned to map the distributed representations back to localist ones. This network, too, used standard backpropagation. Its output units had softmax activation functions, so that the output vector constitutes a probability distribution over word types. These translate directly into surprisal values, which were collected over the experimental sentences at ten intervals over the course of Stage 3 training (after presenting 2K, 5K, 10K, 20K, 50K, 100K, 200K, and 350K sentences, and after presenting the full training corpus once and twice). These will be denoted by RNN-1 to RNN-10.

A much simpler RNN model suffices for obtaining unlexicalized surprisal. Here, we used the same models as described by Frank and Bod (2011), albeit trained on the POS tags of our BNC training corpus. These models employed so-called Echo State Networks (ESN; Jaeger and Haas, 2004), which are RNNs that do not develop internal representations because weights of input and recurrent connections remain fixed at random values (only the output connection weights are trained). Networks of six different sizes were used. Of each size, three networks were trained, using different random weights. The best and worst model of each size were discarded to reduce the effect of the random weights.

3 Experiment

3.1 Procedure

Text display followed a self-paced reading paradigm: Sentences were presented on a computer screen one word at a time, with onset of the next word being controlled by the subject through a key press. The time between word onset and subsequent key press was recorded as the RT (measured in milliseconds) on that word by that subject.[3] Words were presented centrally aligned on the screen, and punctuation marks appeared with the word that preceded them. A fixed-width font type (Courier New) was used, so that physical size of a word equalled number of characters. Order of presentation was randomized for each subject. The experiment was time-bounded to 40 minutes, and the number of sentences read by each participant varied between 120 and 349, with an average of 224. Yes-no comprehension questions followed 46% of the sentences.

3.2 Participants

A total of 117 first-year psychology students took part in the experiment. Subjects unable to answer correctly more than 20% of the questions and 47 participants who were non-native English speakers were excluded from the analysis, leaving a total of 54 subjects.

3.3 Design

The obtained RTs served as the dependent variable against which a mixed-effects multiple regression analysis with crossed random effects for subjects and items (Baayen et al., 2008) was performed. In order to control for low-level lexical factors that are known to influence RTs, such as word length or frequency, a baseline regression model taking them into account was built. Subsequently, the decrease in the model's deviance after the inclusion of surprisal as a fixed factor to the baseline was assessed using likelihood tests. The resulting χ² statistic indicates the extent to which each surprisal estimate accounts for RT, and can thus serve as a measure of the psychological accuracy of each model.

However, this kind of analysis assumes that RT for a word reflects processing of only that word,

[3] The collected RT data are available for download at www.stefanfrank.info/EACL2012.
but spill-over effects (in which processing difficulty at word w_t shows up in the RT on w_{t+1}) have been found in self-paced and natural reading (Just et al., 1982; Rayner, 1998; Rayner and Pollatsek, 1987). To evaluate these effects, the decrease in deviance after adding surprisal of the previous item to the baseline was also assessed.

The following control predictors were included in the baseline regression model:

Lexical factors:

- Number of characters: Both physical size and number of characters have been found to affect RTs for a word (Rayner and Pollatsek, 1987), but the fixed-width font used in the experiment assured number of characters also encoded physical word length.

- Frequency and forward transitional probability: The effects of these two factors have been repeatedly reported (e.g. Juhasz and Rayner, 2003; Rayner, 1998). Given the high correlations between surprisal and these two measures, their inclusion in the baseline assures that the results can be attributed to predictability in context, over and above frequency and bigram probability. Frequency was estimated from occurrence counts of each word in the full BNC corpus (written section). The same transformation (negative logarithm) was applied as for computing surprisal, thus obtaining "unconditional" and "bigram" surprisal values.

- Previous word lexical factors: Lexical factors for the previous word were included in the analysis to control for spill-over effects.

Temporal factors and autocorrelation: RT data over naturalistic texts violate the regression assumption of independence of observations in several ways, and important word-by-word sequential correlations exist. In order to ensure validity of the statistical analysis, as well as providing a better model fit, the following factors were also included:

- Sentence position: Fatigue and practice effects can influence RTs. Sentence position in the experiment was included both as linear and quadratic factor, allowing for the modeling of initial speed-up due to practice, followed by a slowing down due to fatigue.

- Word position: Low-level effects of word order, not related to predictability itself, were modeled by including word position in the sentence, both as a linear and quadratic factor (some of the sentences were quite long, so that the effect of word position is unlikely to be linear).

- Reading time for previous word: As suggested by Baayen and Milin (2010), including RT on the previous word can control for several autocorrelation effects.

4 Results

Data were analysed using the free statistical software package R (R Development Core Team, 2009) and the lme4 library (Bates et al., 2011). Two analyses were performed for each language model, using surprisal of either the current or the previous word as predictor. Unlikely reading times (lower than 50 ms or over 3000 ms) were removed from the analysis, as were clitics, words followed by punctuation, words following punctuation or clitics (since factors for the previous word were included in the analysis), and sentence-initial words, leaving a total of 132,298 data points (between 1,335 and 3,829 per subject).

4.1 Baseline model

Theoretical considerations guided the selection of the initial predictors presented above, but an empirical approach led actual regression model building. Initial models with the original set of fixed effects, all two-way interactions, plus random intercepts for subjects and items were evaluated, and the least significant factors were removed one at a time, until only significant predictors were left (|t| > 2). A different strategy was used to assess which by-subject and by-item random slopes to include in the model. Given the large number of predictors, starting from the saturated model with all random slopes generated non-convergence problems and excessively long running times. By-subject and by-item random slopes for each fixed effect were therefore assessed individually, using likelihood tests. The final baseline model included by-subject random intercepts, by-subject random slopes for sentence position and word position, and by-item slopes for previous RT. All factors (random slopes and fixed effects) were centred and standardized to avoid
[Figure 2 appears here: two panels, "Lexicalized models" and "Unlexicalized models", plotting psychological accuracy (χ²) against linguistic accuracy (−average surprisal) for the PSG-a, PSG-s, and RNN model series.]

Figure 2: Psychological accuracy (combined effect of current and previous surprisal) against linguistic accuracy of the different models. Numbered labels denote the maximum number of levels up in the tree from which conditional information is used (PSG); point in training when estimates were collected (word-based RNN); or network size (POS-based RNN).
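Psychological accuracy in Figure 2 is the χ² statistic obtained from the likelihood tests described in Section 3.3. For comparisons with two degrees of freedom, as in Table 2, the χ² survival function has the closed form p = exp(−χ²/2), so the reported p-values can be checked directly. A sketch (not the authors' actual R analysis):

```python
import math

def chi2_pvalue_df2(chi2):
    """p-value for a chi-squared statistic with 2 degrees of freedom:
    P(X > chi2) = exp(-chi2 / 2)."""
    return math.exp(-chi2 / 2)

# Chi-squared values as reported in Table 2 (df = 2).
print(round(chi2_pvalue_df2(10.40), 3))   # 0.006
print(round(chi2_pvalue_df2(6.89), 3))    # 0.032
print(round(chi2_pvalue_df2(5.80), 3))    # 0.055
print(chi2_pvalue_df2(47.02) < 0.001)     # True
```

For one degree of freedom (the χ²(1) comparisons quoted in the text), the survival function is instead erfc(√(χ²/2)); scipy.stats.chi2.sf covers the general case.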
to explain a significant amount of variance over and above the RNN (χ²(1) = 2.28; p = 0.13).[4] Lexicalized models achieved greater psychological accuracy than their unlexicalized counterparts, but the latter could still explain a small amount of variance over and above the former (see Table 2).[5]

Model comparison            χ²(2)    p-value
Best models overall:
  POS- over word-based      10.40    0.006
  word- over POS-based      47.02    < 0.001
PSGs:
  POS- over word-based       6.89    0.032
  word- over POS-based      25.50    < 0.001
RNNs:
  POS- over word-based       5.80    0.055
  word- over POS-based      49.74    < 0.001

Table 2: Word- vs. POS-based models: comparisons between best models overall, and best models within each category.

4.3 Differences across word classes

In order to make sure that the lexicalized surprisal effects found were not limited to closed-class words (as Roark et al., 2009, report), a further model comparison was performed by adding by-POS random slopes of surprisal to the models containing the baseline plus surprisal. If particular syntactic categories were contributing to the overall effect of surprisal more than others, including such random slopes would lead to additional variance being explained. However, this was not the case: inclusion of by-POS random slopes of surprisal did not lead to a significant improvement in model fit (PSG: χ²(1) = 0.86, p = 0.35; RNN: χ²(1) = 3.20, p = 0.07).[6]

5 Discussion

The present study aimed to find further evidence for surprisal as a wide-coverage account of language processing difficulty, and indeed, the results show the ability of lexicalized surprisal to explain a significant amount of variance in RT data for naturalistic texts, over and above that accounted for by other low-level lexical factors, such as frequency, length, and forward transitional probability. Although previous studies had presented results supporting such a probabilistic language processing account, evidence for word-based surprisal was limited: Brouwer et al. (2010) only examined a specific psycholinguistic phenomenon, rather than a random language sample; Demberg and Keller (2008) reported effects that were only significant for POS-based but not word-based surprisal; and Smith and Levy (2008) found an effect of lexicalized surprisal (according to a trigram model), but did not assess whether simpler predictability estimates (i.e., by a bigram model) could have accounted for those effects.

Demberg and Keller's (2008) failure to find lexicalized surprisal effects can be attributed both to the language corpus used to train the language models, as well as to the experimental texts used. Both were sourced from newspaper texts: As training corpora these are unrepresentative of a person's linguistic experience, and as experimental texts they are heavily dependent on participants' world knowledge. Roark et al. (2009), in contrast, used a more representative, albeit relatively small, training corpus, as well as narrative-style stimuli, thus obtaining RTs less dependent on participants' prior knowledge. With such an experimental set-up, they were able to demonstrate the effects of lexical surprisal for RT of closed-class, but not open-class, words, which they attributed to their differential frequency and to training-data sparsity: The limited Brown corpus would have been enough to produce accurate estimates of surprisal for function words, but not for the less frequent content words. A larger training corpus, constituting a broad language sample, was used in our study, and the detected surprisal effects were shown to hold across syntactic category (modeling slopes for POS separately did not improve model fit). However, direct comparison with Roark et al.'s results is not possible: They employed alternative definitions of structural and lexical surprisal, which they derived by decomposing the total surprisal as obtained with a fully lexicalized PSG model.

In the current study, a similar approach to that taken by Demberg and Keller (2008) was used to

[4] Best models in this case were PSG-a3 and RNN-7.
[5] Since the best performing lexicalized and unlexicalized models belonged to different groups (RNN and PSG, respectively), Table 2 also shows comparisons within model type.
[6] Comparison was made on the basis of previous word surprisal (best models in this case were PSG-s3 and RNN-9).
define structural (or unlexicalized) and lexicalized surprisal, but the results are strikingly different: Whereas Demberg and Keller report a significant effect for POS-based estimates, but not for word-based surprisal, our results show that lexicalized surprisal is a far better predictor of RTs than its unlexicalized counterpart. This is not surprising, given that while the unlexicalized models only have access to syntactic sources of information, the lexicalized models, like the human parser, can also take into account lexical co-occurrence trends. However, when a training corpus is not large enough to accurately capture the latter, it might still be able to model the former, given the higher frequency of occurrence of each possible item (POS vs. word) in the training data. Roark et al. (2009) also included in their analysis a POS-based surprisal estimate, which lost significance when the two components of the lexicalized surprisal were present, suggesting that such unlexicalized estimates can be interpreted only as a coarse version of the fully lexicalized surprisal, incorporating both syntactic and lexical sources of information at the same time. The results presented here do not replicate this finding: The best unlexicalized estimates were able to explain additional variance over and above the best word-based estimates. However, this comparison contrasted two different model types: a word-based RNN and a POS-based PSG, so that the observed effects could be attributed to the model representations (hierarchical vs. linear) rather than to the item of analysis (POS vs. words). Within-model comparisons showed that unlexicalized estimates were still able to account for additional variance, although only reaching significance at the 0.05 level for the PSGs.

Previous results reported by Frank (2009) and Frank and Bod (2011) regarding the higher psychological accuracy of RNNs and the inability of the PSGs to explain any additional variance in RT were not replicated. Although for the word-based estimates RNNs outperform the PSGs, we found both to have independent effects. Furthermore, in the POS-based analysis, performance of PSGs and RNNs reaches similarly high levels of psychological accuracy, with the best-performing PSG producing slightly better results than the best-performing RNN. This discrepancy in the results could reflect contrasting reading styles in the two studies: natural reading of newspaper texts, or self-paced reading of independent, narrative sentences. The absence of global context, or the unnatural reading methodology employed in the current experiment, could have led to an increased reliance on hierarchical structure for sentence comprehension. The sources and structures relied upon by the human parser to elaborate upcoming-word expectations could therefore be task-dependent. On the other hand, our results show that the independent effects of word-based PSG estimates only become apparent when investigating the effect of surprisal of the previous word. That is, considering only the current word's surprisal, as in Frank and Bod's analysis, did not reveal a significant contribution of PSGs over and above RNNs. Thus, additional effects of PSG surprisal might only be apparent when spill-over effects are investigated by taking previous word surprisal as a predictor of RT.

6 Conclusion

The results here presented show that lexicalized surprisal can indeed model RT over naturalistic texts, thus providing a wide-coverage account of language processing difficulty. Failure of previous studies to find such an effect could be attributed to the size or nature of the training corpus, suggesting that larger and more general corpora are needed to model successfully both the structural and lexical regularities used by the human parser to generate predictions. Another crucial finding presented here is the importance of spill-over effects: Surprisal of a word had a much larger influence on RT of the following item than of the word itself. Previous studies where lexicalized surprisal was only analysed in relation to current RT could have missed a significant effect only manifested on the following item. Whether spill-over effects are as important for different RT collection paradigms (e.g., eye-tracking) remains to be tested.

Acknowledgments

The research presented here was funded by the European Union Seventh Framework Programme (FP7/2007-2013) under grant number 253803. The authors acknowledge the use of the UCL Legion High Performance Computing Facility, and associated support services, in the completion of this work.
References

Gerry T. M. Altmann and Yuki Kamide. 1999. Incremental interpretation at verbs: Restricting the domain of subsequent reference. Cognition, 73:247-264.

Mark Andrews, Gabriella Vigliocco, and David P. Vinson. 2009. Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116:463-498.

R. Harald Baayen and Petar Milin. 2010. Analyzing reaction times. International Journal of Psychological Research, 3:12-28.

R. Harald Baayen, Doug J. Davidson, and Douglas M. Bates. 2008. Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59:390-412.

Moshe Bar. 2007. The proactive brain: using analogies and associations to generate predictions. Trends in Cognitive Sciences, 11:280-289.

Douglas Bates, Martin Maechler, and Ben Bolker. 2011. lme4: Linear mixed-effects models using S4 classes. Available from: http://CRAN.R-project.org/package=lme4 (R package version 0.999375-39).

Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155.

Marisa Ferrara Boston, John Hale, Reinhold Kliegl, Umesh Patil, and Shravan Vasishth. 2008. Parsing costs as predictors of reading difficulty: An evaluation using the Potsdam Sentence Corpus. Journal of Eye Movement Research, 2:1-12.

Harm Brouwer, Hartmut Fitz, and John C. J. Hoeks. 2010. Modeling the noun phrase versus sentence coordination ambiguity in Dutch: evidence from surprisal theory. In Proceedings of the 2010 Workshop on Cognitive Modeling and Computational Linguistics, pages 72-80, Stroudsburg, PA, USA.

John A. Bullinaria and Joseph P. Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39:510-526.

Vera Demberg and Frank Keller. 2008. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109:193-210.

Stefan L. Frank and Rens Bod. 2011. Insensitivity of the human sentence-processing system to hierarchical structure. Psychological Science, 22:829-834.

Stefan L. Frank. 2009. Surprisal-based comparison between a symbolic and a connectionist model of sentence processing. In Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 1139-1144, Austin, TX.

Stefan L. Frank. 2012. Uncertainty reduction as a measure of cognitive processing load in sentence comprehension. Manuscript submitted for publication.

Peter Hagoort, Lea Hald, Marcel Bastiaansen, and Karl Magnus Petersson. 2004. Integration of word meaning and world knowledge in language comprehension. Science, 304:438-441.

John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, pages 1-8, Stroudsburg, PA.

Herbert Jaeger and Harald Haas. 2004. Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science, pages 78-80.

Barbara J. Juhasz and Keith Rayner. 2003. Investigating the effects of a set of intercorrelated variables on eye fixation durations in reading. Journal of Experimental Psychology: Learning, Memory and Cognition, 29:1312-1318.

Marcel A. Just, Patricia A. Carpenter, and Jacqueline D. Woolley. 1982. Paradigms and processes in reading comprehension. Journal of Experimental Psychology: General, 111:228-238.

Yuki Kamide, Christoph Scheepers, and Gerry T. M. Altmann. 2003. Integration of syntactic and semantic information in predictive processing: cross-linguistic evidence from German and English. Journal of Psycholinguistic Research, 32:37-55.

Alan Kennedy and Joel Pynte. 2005. Parafoveal-on-foveal effects in normal reading. Vision Research, 45:153-168.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Meeting of the Association for Computational Linguistics, pages 423-430.

Kestutis Kveraga, Avniel S. Ghuman, and Moshe Bar. 2007. Top-down predictions in the cognitive brain. Brain and Cognition, 65:145-168.

Roger Levy. 2008. Expectation-based syntactic comprehension. Cognition, 106:1126-1177.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313-330.

Scott A. McDonald and Richard C. Shillcock. 2003. Low-level predictive inference in reading: the influence of transitional probabilities on eye movements. Vision Research, 43:1735-1751.

Andriy Mnih and Geoffrey Hinton. 2007. Three new graphical models for statistical language modelling. Proceedings of the 25th International Conference of Machine Learning, pages 641-648.

Keith Rayner and Alexander Pollatsek. 1987. Eye movements in reading: A tutorial review. In M. Coltheart, editor, Attention and Performance XII: The Psychology of Reading, pages 327-362. Lawrence Erlbaum Associates, London, UK.

Keith Rayner. 1998. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124:372-422.
Brian Roark, Asaf Bachrach, Carlos Cardenas, and
Christophe Pallier. 2009. Deriving lexical and syn-
tactic expectation-based measures for psycholin-
guistic modeling via incremental top-down parsing.
In Proceedings of the 2009 Conference on Empiri-
cal Methods in Natural Language Processing: Vol-
ume 1 - Volume 1, pages 324333, Stroudsburg, PA.
Brian Roark. 2001. Probabilistic top-down parsing
and language modeling. Computational Linguis-
tics, 27:249276.
Beatrice Santorini. 1991. Part-of-speech tagging
guidelines for the Penn Treebank Project. Technical
report, Philadelphia, PA.
Nathaniel J. Smith and Roger Levy. 2008. Optimal
processing times in reading: a formal model and
empirical investigation. In Proceedings of the 30th
Annual Conference of the Cognitive Science Soci-
ety, pages 595600, Austin,TX.
Andreas Stolcke. 1995. An efficient probabilistic
context-free parsing algorithm that computes prefix
probabilities. Computational linguistics, 21:165
201.
Yoshimasa Tsuruoka and Junichi Tsujii. 2005. Bidi-
rectional inference with the easiest-first strategy for
tagging sequence data. In Proceedings of the con-
ference on Human Language Technology and Em-
pirical Methods in Natural Language Processing,
pages 467474, Stroudsburg, PA.
408
Spectral Learning for Non-Deterministic Dependency Parsing

Franco M. Luque
Universidad Nacional de Cordoba and CONICET
Cordoba X5000HUA, Argentina
francolq@famaf.unc.edu.ar

Ariadna Quattoni, Borja Balle and Xavier Carreras
Universitat Politecnica de Catalunya
Barcelona E-08034
{aquattoni,bballe,carreras}@lsi.upc.edu

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 409-419, Avignon, France, April 23-27 2012. (c) 2012 Association for Computational Linguistics
malism, head-modifier sequences are generated by a collection of finite-state automata. In our case, the underlying machines are probabilistic non-deterministic finite state automata (PNFA), which we parameterize using the operator model representation. This representation allows the use of simple spectral algorithms for estimating the model parameters from data (Hsu et al., 2009; Bailly, 2011; Balle et al., 2012). In all previous work, the algorithms used to induce hidden structure require running repeated inference on training data, e.g. Expectation-Maximization (Dempster et al., 1977) or split-merge algorithms. In contrast, spectral methods are simple and very efficient: parameter estimation is reduced to computing some data statistics, performing SVD, and inverting matrices.

The main contributions of this paper are:

- We present a spectral learning algorithm for inducing PNFA with applications to head-automata dependency grammars. Our formulation is based on thinking about the distribution generated by a PNFA in terms of the forward-backward recursions.

- Spectral learning algorithms in previous work only use statistics of prefixes of sequences. In contrast, our algorithm is able to learn from substring statistics.

- We derive an inside-outside algorithm for non-deterministic SHAG that runs in cubic time, keeping the costs of CFG parsing.

- In experiments we show that adding non-determinism improves the accuracy of several baselines. When we compare our algorithm to EM we observe a reduction of two orders of magnitude in training time.

The paper is organized as follows. The next section describes the necessary background on SHAG and operator models. Section 3 introduces Operator SHAG for parsing, and presents a spectral learning algorithm. Section 4 presents a parsing algorithm. Section 5 presents experiments and analysis of results, and Section 6 concludes.

2 Preliminaries

2.1 Head-Automata Dependency Grammars

In this work we use split head-automata grammars (SHAG) (Eisner and Satta, 1999; Eisner, 2000), a context-free grammatical formalism whose derivations are projective dependency trees. We will use $x_{i:j} = x_i x_{i+1} \cdots x_j$ to denote a sequence of symbols $x_t$ with $i \leq t \leq j$. A SHAG generates sentences $s_{0:N}$, where symbols $s_t \in \mathcal{X}$ with $1 \leq t \leq N$ are regular words and $s_0 = \star \notin \mathcal{X}$ is a special root symbol. Let $\bar{\mathcal{X}} = \mathcal{X} \cup \{\star\}$. A derivation $y$, i.e. a dependency tree, is a collection of head-modifier sequences $\langle h, d, x_{1:T} \rangle$, where $h \in \bar{\mathcal{X}}$ is a word, $d \in \{\text{LEFT}, \text{RIGHT}\}$ is a direction, and $x_{1:T}$ is a sequence of $T$ words, where each $x_t \in \mathcal{X}$ is a modifier of $h$ in direction $d$. We say that $h$ is the head of each $x_t$. Modifier sequences $x_{1:T}$ are ordered head-outwards: among $x_{1:T}$, $x_1$ is the word closest to $h$ in the derived sentence, and $x_T$ is the furthest. A derivation $y$ of a sentence $s_{0:N}$ consists of a LEFT and a RIGHT head-modifier sequence for each $s_t$. As special cases, the LEFT sequence of the root symbol is always empty, while the RIGHT one consists of a single word corresponding to the head of the sentence. We denote by $\mathcal{Y}$ the set of all valid derivations.

Assume a derivation $y$ contains $\langle h, \text{LEFT}, x_{1:T} \rangle$ and $\langle h, \text{RIGHT}, x'_{1:T'} \rangle$. Let $\mathcal{L}(y, h)$ be the derived sentence headed by $h$, which can be expressed as $\mathcal{L}(y, x_T) \cdots \mathcal{L}(y, x_1) \; h \; \mathcal{L}(y, x'_1) \cdots \mathcal{L}(y, x'_{T'})$.[1] The language generated by a SHAG is the set of strings $\mathcal{L}(y, \star)$ for any $y \in \mathcal{Y}$.

In this paper we use probabilistic versions of SHAG where the probabilities of head-modifier sequences in a derivation are independent of each other:

  $P(y) = \prod_{\langle h, d, x_{1:T} \rangle \in y} P(x_{1:T} \mid h, d)$ .   (1)

In the literature, standard arc-factored models further assume that

  $P(x_{1:T} \mid h, d) = \prod_{t=1}^{T+1} P(x_t \mid h, d, \sigma_t)$ ,   (2)

where $x_{T+1}$ is always a special STOP word, and $\sigma_t$ is the state of a deterministic automaton generating $x_{1:T+1}$. For example, setting $\sigma_1 = \text{FIRST}$ and $\sigma_{t>1} = \text{REST}$ corresponds to first-order models, while setting $\sigma_1 = \text{NULL}$ and $\sigma_{t>1} = x_{t-1}$ corresponds to sibling models (Eisner, 2000; McDonald et al., 2005; McDonald and Pereira, 2006).

[1] Throughout the paper we assume we can distinguish the words in a derivation, irrespective of whether two words at different positions correspond to the same symbol.
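As a concrete illustration of Eqs. (1) and (2), the following sketch (our own illustrative code, not from the paper; all names are hypothetical) scores a derivation under a first-order deterministic model, where $\sigma_1 = \text{FIRST}$ and $\sigma_{t>1} = \text{REST}$:

```python
# Illustrative sketch of Eqs. (1)-(2): a first-order deterministic SHAG.
FIRST, REST, STOP = "FIRST", "REST", "<stop>"

def sequence_prob(mods, head, direction, probs):
    """P(x_{1:T} | head, direction) under a first-order model, Eq. (2).
    probs[(head, direction, state)] maps a modifier word to its probability."""
    p = 1.0
    for t, x in enumerate(mods + [STOP]):   # x_{T+1} is the special STOP word
        state = FIRST if t == 0 else REST   # sigma_1 = FIRST, sigma_{t>1} = REST
        p *= probs[(head, direction, state)].get(x, 0.0)
    return p

def derivation_prob(derivation, probs):
    """P(y), Eq. (1): product over head-modifier sequences <h, d, x_{1:T}>."""
    p = 1.0
    for head, direction, mods in derivation:
        p *= sequence_prob(mods, head, direction, probs)
    return p
```

For example, a derivation for the two-word sentence "N V" headed at V would contain the LEFT sequence ["N"] for V and empty modifier sequences elsewhere; its probability is the product of the four sequence probabilities.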
2.2 Operator Models

An operator model $A$ with $n$ states is a tuple $\langle \alpha_1, \alpha_\infty^\top, \{A_a\}_{a \in \mathcal{X}} \rangle$, where $A_a \in \mathbb{R}^{n \times n}$ is an operator matrix and $\alpha_1, \alpha_\infty \in \mathbb{R}^n$ are vectors. $A$ computes a function $f : \mathcal{X}^* \to \mathbb{R}$ as follows:

  $f(x_{1:T}) = \alpha_\infty^\top A_{x_T} \cdots A_{x_1} \alpha_1$ .   (3)

One intuitive way of understanding operator models is to consider the case where $f$ computes a probability distribution over strings. Such a distribution can be described in two equivalent ways: by making some independence assumptions and providing the corresponding parameters, or by explaining the process used to compute $f$. This is akin to describing the distribution defined by an HMM in terms of a factorization and its corresponding transition and emission parameters, or using the inductive equations of the forward algorithm. The operator model representation takes the latter approach.

Operator models have had numerous applications. For example, they can be used as an alternative parameterization of the function computed by an HMM (Hsu et al., 2009). Consider an HMM with $n$ hidden states and initial-state probabilities $\pi \in \mathbb{R}^n$, transition probabilities $T \in \mathbb{R}^{n \times n}$, and observation probabilities $O_a \in \mathbb{R}^{n \times n}$ for each $a \in \mathcal{X}$, with the following meaning:

- $\pi(i)$ is the probability of starting at state $i$,
- $T(i, j)$ is the probability of transitioning from state $j$ to state $i$,
- $O_a$ is a diagonal matrix, such that $O_a(i, i)$ is the probability of generating symbol $a$ from state $i$.

Given an HMM, an equivalent operator model can be defined by setting $\alpha_1 = \pi$, $A_a = T O_a$ and $\alpha_\infty = \vec{1}$. To see this, let us show that the forward algorithm computes the expression in equation (3). Let $h_t$ denote the state of the HMM at time $t$. Consider a state-distribution vector $\sigma_t \in \mathbb{R}^n$, where $\sigma_t(i) = P(x_{1:t-1}, h_t = i)$. Initially $\sigma_1 = \pi$. At each step in the chain of products (3), $\sigma_{t+1} = A_{x_t} \sigma_t$ updates the state distribution from position $t$ to $t+1$ by applying the appropriate operator, i.e. by emitting symbol $x_t$ and transitioning to the new state distribution. The probability of $x_{1:T}$ is given by $\sum_i \sigma_{T+1}(i)$. Hence, $A_a(i, j)$ is the probability of generating symbol $a$ and moving to state $i$ given that we are at state $j$.

HMM are only one example of distributions that can be parameterized by operator models. In general, operator models can parameterize any PNFA, where the parameters of the model correspond to probabilities of emitting a symbol from a state and moving to the next state.

The advantage of working with operator models is that, under certain mild assumptions on the operator parameters, there exist algorithms that can estimate the operators from observable statistics of the input sequences. These algorithms are extremely efficient and are not susceptible to local minima issues. See (Hsu et al., 2009) for theoretical proofs of the learnability of HMM under the operator model representation.

In the following, we write $x = x_{i:j} \in \mathcal{X}^*$ to denote sequences of symbols, and use $A_{x_{i:j}}$ as a shorthand for $A_{x_j} \cdots A_{x_i}$. Also, for convenience we assume $\mathcal{X} = \{1, \ldots, l\}$, so that we can index vectors and matrices by symbols in $\mathcal{X}$.

3 Learning Operator SHAG

We will define a SHAG using a collection of operator models to compute probabilities. Assume that for each possible head $h$ in the vocabulary $\bar{\mathcal{X}}$ and each direction $d \in \{\text{LEFT}, \text{RIGHT}\}$ we have an operator model that computes probabilities of modifier sequences as follows:

  $P(x_{1:T} \mid h, d) = (\alpha_\infty^{h,d})^\top A_{x_T}^{h,d} \cdots A_{x_1}^{h,d} \alpha_1^{h,d}$ .

Then, this collection of operator models defines an operator SHAG that assigns a probability to each $y \in \mathcal{Y}$ according to (1). To learn the model parameters, namely $\langle \alpha_1^{h,d}, \alpha_\infty^{h,d}, \{A_a^{h,d}\}_{a \in \mathcal{X}} \rangle$ for $h \in \bar{\mathcal{X}}$ and $d \in \{\text{LEFT}, \text{RIGHT}\}$, we use spectral learning methods based on the works of Hsu et al. (2009), Bailly (2011) and Balle et al. (2012).

The main challenge of learning an operator model is to infer a hidden-state space from observable quantities, i.e. quantities that can be computed from the distribution of sequences that we observe. As it turns out, we cannot recover the actual hidden-state space used by the operators we wish to learn. The key insight of the spectral learning method is that we can recover a hidden-state space that corresponds to a projection of the original hidden space. Such a projected space is equivalent to the original one in the sense that we
can find operators in the projected space that parameterize the same probability distribution over sequences.

In the rest of this section we describe an algorithm for learning an operator model. We will assume a fixed head word and direction, and drop $h$ and $d$ from all terms. Hence, our goal is to learn the following distribution, parameterized by operators $\langle \alpha_1, \{A_a\}_{a \in \mathcal{X}}, \alpha_\infty \rangle$:

  $P(x_{1:T}) = \alpha_\infty^\top A_{x_T} \cdots A_{x_1} \alpha_1$ .   (4)

Our algorithm shares many features with the previous spectral algorithms of Hsu et al. (2009) and Bailly (2011), though the derivation given here is based upon the general formulation of Balle et al. (2012). The main difference is that our algorithm is able to learn operator models from substring statistics, while algorithms in previous works were restricted to statistics on prefixes. In principle, our algorithm should extract much more information from a sample.

3.1 Preliminary Definitions

The spectral learning algorithm will use statistics estimated from samples of the target distribution. More specifically, consider the function that computes the expected number of occurrences of a substring $x$ in a random string $x'$ drawn from $P$:

  $f(x) = \mathbb{E}(x \sqsubseteq x') = \sum_{x' \in \mathcal{X}^*} (x \sqsubseteq x') \, P(x') = \sum_{p, s \in \mathcal{X}^*} P(pxs)$ ,   (5)

where $x \sqsubseteq x'$ denotes the number of times $x$ appears in $x'$. Here we assume that the true values of $f(x)$ for bigrams are known, though in practice the algorithm will work with empirical estimates of these.

The information about $f$ known by the algorithm is organized in matrix form as follows. Let $P \in \mathbb{R}^{l \times l}$ be a matrix containing the value of $f(x)$ for all strings of length two, i.e. bigrams.[2] That is, each entry in $P$ contains the expected number of occurrences of a given bigram:

  $P(b, a) = \mathbb{E}(ab \sqsubseteq x)$ .   (6)

Furthermore, for each $b \in \mathcal{X}$ let $P_b \in \mathbb{R}^{l \times l}$ denote the matrix whose entries are given by

  $P_b(c, a) = \mathbb{E}(abc \sqsubseteq x)$ ,   (7)

the expected number of occurrences of trigrams. Finally, we define vectors $p_1 \in \mathbb{R}^l$ and $p_\infty \in \mathbb{R}^l$ as follows: $p_1(a) = \sum_{s \in \mathcal{X}^*} P(as)$, the probability that a string begins with a particular symbol; and $p_\infty(a) = \sum_{p \in \mathcal{X}^*} P(pa)$, the probability that a string ends with a particular symbol.

Now we show a particularly useful way to express the quantities defined above in terms of the operators $\langle \alpha_1, \alpha_\infty^\top, \{A_a\}_{a \in \mathcal{X}} \rangle$ of $P$. First, note that each entry of $P$ can be written in this form:

  $P(b, a) = \sum_{p, s \in \mathcal{X}^*} P(pabs)$   (8)
  $\phantom{P(b, a)} = \sum_{p, s \in \mathcal{X}^*} \alpha_\infty^\top A_s A_b A_a A_p \alpha_1$
  $\phantom{P(b, a)} = \big( \sum_{s \in \mathcal{X}^*} \alpha_\infty^\top A_s \big) A_b A_a \big( \sum_{p \in \mathcal{X}^*} A_p \alpha_1 \big)$ .

It is not hard to see that, since $P$ is a probability distribution over $\mathcal{X}^*$, actually $\sum_{s \in \mathcal{X}^*} \alpha_\infty^\top A_s = \vec{1}^\top$. Furthermore, since $\sum_{p \in \mathcal{X}^*} A_p = \sum_{k \geq 0} (\sum_{a \in \mathcal{X}} A_a)^k = (I - \sum_{a \in \mathcal{X}} A_a)^{-1}$, we write $\tilde{\alpha}_1 = (I - \sum_{a \in \mathcal{X}} A_a)^{-1} \alpha_1$. From (8) it is natural to define a forward matrix $F \in \mathbb{R}^{n \times l}$ whose $a$th column contains the sum of all hidden-state vectors obtained after generating all prefixes ended in $a$:

  $F(:, a) = A_a \sum_{p \in \mathcal{X}^*} A_p \alpha_1 = A_a \tilde{\alpha}_1$ .   (9)

Conversely, we also define a backward matrix $B \in \mathbb{R}^{l \times n}$ whose $a$th row contains the probability of generating $a$ from any possible state:

  $B(a, :) = \sum_{s \in \mathcal{X}^*} \alpha_\infty^\top A_s A_a = \vec{1}^\top A_a$ .   (10)

By plugging the forward and backward matrices into (8) one obtains the factorization $P = BF$. With similar arguments it is easy to see that one also has $P_b = B A_b F$, $p_1 = B \alpha_1$, and $p_\infty^\top = \alpha_\infty^\top F$. Hence, if $B$ and $F$ were known, one could in principle invert these expressions in order to recover the operators of the model from empirical estimations computed from a sample. In the next section we show that in fact one does not need to know $B$ and $F$ to learn an operator model for $P$, but rather that having a good factorization of $P$ is enough.

[2] In fact, while we restrict ourselves to strings of length two, an analogous algorithm can be derived that considers longer strings to define $P$. See (Balle et al., 2012) for details.
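The statistics in Eqs. (5)-(7) can be estimated from a sample by averaging substring counts per observed string. A minimal sketch (the function name and layout are our own; symbols are assumed to be integers in $\{0, \ldots, l-1\}$):

```python
import numpy as np

def empirical_stats(sample, l):
    """Empirical estimates of p1, p_inf, P and {P_b}, following Eqs. (5)-(7).
    `sample` is a list of sequences over symbols {0, ..., l-1}.
    P[b, a] estimates the expected count of the bigram 'ab' per string;
    Pb[b][c, a] estimates the expected count of the trigram 'abc'."""
    p1, pinf = np.zeros(l), np.zeros(l)
    P = np.zeros((l, l))
    Pb = [np.zeros((l, l)) for _ in range(l)]
    for x in sample:
        if x:
            p1[x[0]] += 1          # string begins with x[0]
            pinf[x[-1]] += 1       # string ends with x[-1]
        for a, b in zip(x, x[1:]):
            P[b, a] += 1           # bigram 'ab'
        for a, b, c in zip(x, x[1:], x[2:]):
            Pb[b][c, a] += 1       # trigram 'abc'
    m = len(sample)
    return p1 / m, pinf / m, P / m, [M / m for M in Pb]
```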
3.2 Inducing a Hidden-State Space

We have shown that an operator model $A$ computing $P$ induces a factorization of the matrix $P$, namely $P = BF$. More generally, it turns out that when the rank of $P$ equals the minimal number of states of an operator model that computes $P$, then one can prove a duality relation between operators and factorizations of $P$. In particular, one can show that, for any rank factorization $P = QR$, the operators given by $\alpha_1 = Q^+ p_1$, $\alpha_\infty^\top = p_\infty^\top R^+$, and $A_a = Q^+ P_a R^+$ yield an operator model for $P$. A key fact in proving this result is that the function $P$ is invariant to the basis chosen to represent operator matrices. See (Balle et al., 2012) for further details.

Thus, we can recover an operator model for $P$ from any rank factorization of $P$, provided a rank assumption on $P$ holds (which hereafter we assume to be the case). Since we only have access to an approximation of $P$, it seems reasonable to choose a factorization which is robust to estimation errors. A natural such choice is the thin SVD decomposition of $P$ (i.e. using the top $n$ singular vectors), given by $P = U (\Sigma V^\top) = U (U^\top P)$. Intuitively, we can think of $U$ and $U^\top P$ as projected backward and forward matrices. Now that we have a factorization of $P$ we can construct an operator model for $P$ as follows:[3]

  $\alpha_1 = U^\top p_1$ ,   (11)
  $\alpha_\infty^\top = p_\infty^\top (U^\top P)^+$ ,   (12)
  $A_a = U^\top P_a (U^\top P)^+$ .   (13)

Algorithm 1 presents pseudo-code for learning the operators of a SHAG from training head-modifier sequences using this spectral method. Note that each operator model in the SHAG is learned separately. The running time of the algorithm is dominated by two computations. First, a pass over the training sequences to compute statistics over unigrams, bigrams and trigrams. Second, the SVD and matrix operations for computing the operators, which run in time cubic in the number of symbols $l$. However, note that when dealing with sparse matrices many of these operations can be performed more efficiently.

Algorithm 1 Learn Operator SHAG
inputs:
- An alphabet $\mathcal{X}$
- A training set TRAIN $= \{\langle h^i, d^i, x^i_{1:T} \rangle\}_{i=1}^{M}$
- The number of hidden states $n$
1: for each $h \in \bar{\mathcal{X}}$ and $d \in \{\text{LEFT}, \text{RIGHT}\}$ do
2:   Compute an empirical estimate from TRAIN of the statistics $\hat{p}_1$, $\hat{p}_\infty$, $\hat{P}$, and $\{\hat{P}_a\}_{a \in \mathcal{X}}$
3:   Compute the SVD of $\hat{P}$ and let $\hat{U}$ be the matrix of top $n$ left singular vectors of $\hat{P}$
4:   Compute the observable operators for $h$ and $d$:
5:     $\hat{\alpha}_1^{h,d} = \hat{U}^\top \hat{p}_1$
6:     $(\hat{\alpha}_\infty^{h,d})^\top = \hat{p}_\infty^\top (\hat{U}^\top \hat{P})^+$
7:     $\hat{A}_a^{h,d} = \hat{U}^\top \hat{P}_a (\hat{U}^\top \hat{P})^+$ for each $a \in \mathcal{X}$
8: end for
9: return operators $\langle \hat{\alpha}_1^{h,d}, \hat{\alpha}_\infty^{h,d}, \hat{A}_a^{h,d} \rangle$ for each $h \in \bar{\mathcal{X}}$, $d \in \{\text{LEFT}, \text{RIGHT}\}$, $a \in \mathcal{X}$

4 Parsing Algorithms

Given a sentence $s_{0:N}$ we would like to find its most likely derivation, $\hat{y} = \arg\max_{y \in \mathcal{Y}(s_{0:N})} P(y)$. This problem, known as MAP inference, is known to be intractable for hidden-state structure prediction models, as it involves finding the most likely tree structure while summing out over hidden states. We use a common approximation to MAP based on first computing posterior marginals of tree edges (i.e. dependencies) and then maximizing over the tree structure (see (Park and Darwiche, 2004) for the complexity of general MAP inference and approximations). For parsing, this strategy is sometimes known as MBR decoding; previous work has shown that empirically it gives good performance (Goodman, 1996; Clark and Curran, 2004; Titov and Henderson, 2006; Petrov and Klein, 2007). In our case, we use the non-deterministic SHAG to compute posterior marginals of dependencies. We first explain the general strategy of MBR decoding, and then present an algorithm to compute marginals.

[3] To see that equations (11)-(13) define a model for $P$, one must first see that the matrix $M = F (\Sigma V^\top)^+$ is invertible with inverse $M^{-1} = U^\top B$. Using this and recalling that $p_1 = B \alpha_1$, $P_a = B A_a F$ and $p_\infty^\top = \alpha_\infty^\top F$, one obtains that: $\bar{\alpha}_1 = U^\top B \alpha_1 = M^{-1} \alpha_1$; $\bar{\alpha}_\infty^\top = \alpha_\infty^\top F (U^\top B F)^+ = \alpha_\infty^\top M$; and $\bar{A}_a = U^\top B A_a F (U^\top B F)^+ = M^{-1} A_a M$. Finally: $P(x_{1:T}) = \bar{\alpha}_\infty^\top \bar{A}_{x_T} \cdots \bar{A}_{x_1} \bar{\alpha}_1 = \alpha_\infty^\top M M^{-1} A_{x_T} M \cdots M^{-1} A_{x_1} M M^{-1} \alpha_1 = \alpha_\infty^\top A_{x_T} \cdots A_{x_1} \alpha_1$.
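Lines 3-7 of Algorithm 1, i.e. Eqs. (11)-(13), reduce to one SVD and a few pseudo-inverses. The following numpy sketch (our own naming, a toy illustration rather than the authors' implementation) recovers an operator model from the exact statistics of a one-state model over two symbols and reproduces its string probabilities:

```python
import numpy as np

def spectral_operators(p1, pinf, P, Pb, n):
    """Recover operator-model parameters from statistics, per Eqs. (11)-(13):
    a1 = U^T p1, ainf^T = pinf^T (U^T P)^+, A_a = U^T P_a (U^T P)^+,
    where U holds the top-n left singular vectors of P (thin SVD)."""
    U, _, _ = np.linalg.svd(P)
    U = U[:, :n]
    UP_pinv = np.linalg.pinv(U.T @ P)
    a1 = U.T @ p1
    ainf = pinf @ UP_pinv                  # row vector ainf^T
    ops = {a: U.T @ Pa @ UP_pinv for a, Pa in Pb.items()}
    return a1, ainf, ops

def string_prob(x, a1, ainf, ops):
    """f(x_{1:T}) = ainf^T A_{x_T} ... A_{x_1} a1, as in Eq. (4)."""
    v = a1
    for a in x:
        v = ops[a] @ v
    return float(ainf @ v)
```

In the test below, the true model has one state with $A_0 = [0.3]$, $A_1 = [0.2]$, $\alpha_1 = 1$, $\alpha_\infty = 0.5$; its exact statistics factor as $P = BF$ with $B = (0.3, 0.2)^\top$ and $F = (0.6, 0.4)$, and $P_a = A_a P$, so recovery should reproduce e.g. $P(01) = 0.5 \cdot 0.2 \cdot 0.3 = 0.03$.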
Let $(s_i, s_j)$ denote a dependency between head word $i$ and modifier word $j$. The posterior or marginal probability of a dependency $(s_i, s_j)$ given a sentence $s_{0:N}$ is defined as

  $\mu_{i,j} = P((s_i, s_j) \mid s_{0:N}) = \sum_{y \in \mathcal{Y}(s_{0:N}) : (s_i, s_j) \in y} P(y)$ .

To compute marginals, the sum over derivations can be decomposed into a product of inside and outside quantities (Baker, 1979). Below we describe an inside-outside algorithm for our grammars. Given a sentence $s_{0:N}$ and marginal scores $\mu_{i,j}$, we compute the parse tree for $s_{0:N}$ as

  $\hat{y} = \arg\max_{y \in \mathcal{Y}(s_{0:N})} \sum_{(s_i, s_j) \in y} \log \mu_{i,j}$   (14)

using the standard projective parsing algorithm for arc-factored models (Eisner, 2000). Overall we use a two-pass parsing process, first to compute marginals and then to compute the best tree.

4.1 An Inside-Outside Algorithm

In this section we sketch an algorithm to compute marginal probabilities of dependencies. Our algorithm is an adaptation of the parsing algorithm for SHAG by Eisner and Satta (1999) to the case of non-deterministic head automata, and has a runtime cost of $O(n^2 N^3)$, where $n$ is the number of states of the model and $N$ is the length of the input sentence. Hence the algorithm maintains the standard cubic cost on the sentence length, while the quadratic cost on $n$ is inherent to the computations defined by our model in Eq. (3). The main insight behind our extension is that, because the computations of our model involve state-distribution vectors, we need to extend the standard inside/outside quantities to be in the form of such state-distribution quantities.[4]

Throughout this section we assume a fixed sentence $s_{0:N}$. Let $\mathcal{Y}(x_{i:j})$ be the set of derivations that yield a subsequence $x_{i:j}$. For a derivation $y$, we use $\text{root}(y)$ to indicate its root word, and use $(x_i, x_j) \in y$ to refer to a dependency in $y$ from head $x_i$ to modifier $x_j$. Following Eisner and Satta (1999), we use decoding structures related to complete half-constituents (or "triangles", denoted C) and incomplete half-constituents (or "trapezoids", denoted I), each decorated with a direction (denoted L and R). We assume familiarity with their algorithm.

We define $\theta^{I,R}_{i,j} \in \mathbb{R}^n$ as the inside score-vector of a right trapezoid dominated by dependency $(s_i, s_j)$,

  $\theta^{I,R}_{i,j} = \sum_{y \in \mathcal{Y}(s_{i:j}) : (s_i, s_j) \in y, \; y = \{\langle s_i, R, x_{1:t} \rangle\} \cup y', \; x_t = s_j} P(y') \, \alpha_{s_i,R}(x_{1:t})$ .   (15)

The term $P(y')$ is the probability of the head-modifier sequences in the range $s_{i:j}$ that do not involve $s_i$. The term $\alpha_{s_i,R}(x_{1:t})$ is a forward state-distribution vector: the $q$th coordinate of the vector is the probability that $s_i$ generates right modifiers $x_{1:t}$ and remains at state $q$. Similarly, we define $\phi^{I,R}_{i,j} \in \mathbb{R}^n$ as the outside score-vector of a right trapezoid, as

  $\phi^{I,R}_{i,j} = \sum_{y \in \mathcal{Y}(s_{0:i} \, s_{j:N}) : \text{root}(y) = s_0, \; y = \{\langle s_i, R, x_{t:T} \rangle\} \cup y', \; x_t = s_j} P(y') \, \beta_{s_i,R}(x_{t+1:T})$ ,   (16)

where $\beta_{s_i,R}(x_{t+1:T}) \in \mathbb{R}^n$ is a backward state-distribution vector: the $q$th coordinate is the probability of being at state $q$ of the right automaton of $s_i$ and generating $x_{t+1:T}$. Analogous inside-outside expressions can be defined for the rest of the structures (left/right triangles and trapezoids). With these quantities, we can compute marginals as

  $\mu_{i,j} = (\phi^{I,R}_{i,j})^\top \theta^{I,R}_{i,j} \, Z^{-1}$ if $i < j$, and $\mu_{i,j} = (\phi^{I,L}_{i,j})^\top \theta^{I,L}_{i,j} \, Z^{-1}$ if $j < i$ ,   (17)

where $Z = \sum_{y \in \mathcal{Y}(s_{0:N})} P(y) = (\alpha_\infty^{\star,R})^\top \theta^{C,R}_{0,N}$.

Finally, we sketch the equations for computing inside scores in $O(N^3)$ time. The outside equations can be derived analogously (see (Paskin, 2001)). For $0 \leq i < j \leq N$:

  $\theta^{C,R}_{i,i} = \alpha_1^{s_i,R}$   (18)
  $\theta^{C,R}_{i,j} = \sum_{k=i+1}^{j} \theta^{I,R}_{i,k} \, (\alpha_\infty^{s_k,R})^\top \theta^{C,R}_{k,j}$   (19)
  $\theta^{I,R}_{i,j} = \sum_{k=i}^{j-1} A^{s_i,R}_{s_j} \, \theta^{C,R}_{i,k} \, (\alpha_\infty^{s_j,L})^\top \theta^{C,L}_{k+1,j}$   (20)

[4] Technically, when working with the projected operators the state-distribution vectors will not be distributions in the formal sense. However, they correspond to a projection of a state distribution, for some projection that we do not recover from data (namely $M^{-1}$ in footnote 3). This projection has no effect on the computations because it cancels out.
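The forward and backward state-distribution vectors $\alpha_{h,d}$ and $\beta_{h,d}$ used in Eqs. (15) and (16) are plain operator products, and contracting them at any split point recovers the probability of the full modifier sequence. A sketch with our own hypothetical names:

```python
import numpy as np

def forward_vec(prefix, a1, ops):
    """alpha_{h,d}(x_{1:t}): the q-th coordinate is the probability that the
    automaton generates modifiers x_{1:t} and remains at state q."""
    v = a1
    for x in prefix:
        v = ops[x] @ v
    return v

def backward_vec(suffix, ainf, ops):
    """beta_{h,d}(x_{t+1:T}): the q-th coordinate is the probability of
    generating x_{t+1:T} starting from state q (and then stopping)."""
    v = ainf
    for x in reversed(suffix):
        v = v @ ops[x]
    return v

def split_prob(mods, t, a1, ainf, ops):
    """P(x_{1:T} | h, d) recovered by contracting the two vectors at split t,
    mirroring the inner products used in the marginal computation (17)."""
    return float(backward_vec(mods[t:], ainf, ops) @ forward_vec(mods[:t], a1, ops))
```

With the one-state model $A_0 = [0.3]$, $A_1 = [0.2]$, $\alpha_1 = 1$, $\alpha_\infty = 0.5$, the contraction gives $0.5 \cdot 0.2 \cdot 0.3 = 0.03$ for the sequence $01$, independently of the split point.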
5 Experiments

The goal of our experiments is to show that in-

5.1 Fully Unlexicalized Grammars

We trained fully unlexicalized dependency grammars from dependency treebanks; that is, $\mathcal{X}$ are PoS tags and we parse PoS tag sequences. In all cases, our modifier sequences include special START and STOP symbols at the boundaries.[5][6] We compare the following SHAG models:

- DET: a baseline deterministic grammar with a single state.
- DET+F: a deterministic grammar with two states, one emitting the first modifier of a sequence, and another emitting the rest (see (Eisner and Smith, 2010) for a similar deterministic baseline).
- SPECTRAL: a non-deterministic grammar with $n$ hidden states trained with the spectral algorithm. $n$ is a parameter of the model.
- EM: a non-deterministic grammar with $n$ states trained with EM. Here, we estimate operators $\langle \hat{\alpha}_1^{h,d}, \hat{\alpha}_\infty^{h,d}, \hat{A}_a \rangle$ using forward-backward for the E step. To initialize, we mimicked an HMM initialization: (1) we set $\hat{\alpha}_1$ and $\hat{\alpha}_\infty$ randomly; (2) we created a random transition matrix $T \in \mathbb{R}^{n \times n}$; (3) we

We trained SHAG models using the standard WSJ sections of the English Penn Treebank (Marcus et al., 1994). Figure 1 shows the Unlabeled Attachment Score (UAS) curve on the development set, in terms of the number of hidden states for the spectral and EM models. We can see that DET+F largely outperforms DET,[7] while the hidden-state models obtain much larger improvements. For the EM model, we show the accuracy curve after 5, 10, 25 and 100 iterations.[8]

[Figure 1: UAS on the development set as a function of the number of hidden states, for the spectral and EM models.]

In terms of peak accuracies, EM gives a slightly better result than the spectral method (80.51% for EM with 15 states versus 79.75% for the spectral method with 9 states). However, the spectral algorithm is much faster to train. With our Matlab implementation, it took about 30 seconds, while each iteration of EM took from 2 to 3 minutes, depending on the number of states. To give a concrete example, to reach an accuracy close to 80%, there is a factor of 150 between the training times of the spectral method and EM (where we compare the peak performance of the spectral method versus EM at 25 iterations with 13 states).

[5] Even though the operators $\alpha_1$ and $\alpha_\infty$ of a PNFA account for start and stop probabilities, in preliminary experiments we found that having explicit START and STOP symbols results in more accurate models.
[6] Note that, for parsing, the operators for the START and STOP symbols can be packed into $\alpha_1$ and $\alpha_\infty$ respectively. One just defines $\alpha_1' = A_{\text{START}} \alpha_1$ and $(\alpha_\infty')^\top = \alpha_\infty^\top A_{\text{STOP}}$.
[7] For parsing with deterministic SHAG we employ MBR inference, even though Viterbi inference can be performed exactly. In experiments on development data DET improved from 62.65% using Viterbi to 68.52% using MBR, and DET+F improved from 72.72% to 74.80%.
[8] We ran EM 10 times under different initial conditions and selected the run that gave the best absolute accuracy after 100 iterations. We did not observe significant differences between the runs.
       DET      DET+F    SPECTRAL  EM
WSJ    69.45%   75.91%   80.44%    81.68%
[Figure 3: DFA built for the automaton (NN, LEFT). Transition labels are PoS tags (e.g. jj, dt, nnp, cc, punctuation); I is the initial state.]

To build a DFA, we computed the forward vectors corresponding to frequent prefixes of modifier sequences of the development set. Then, we clustered these vectors using a Group Average Agglomerative algorithm with the cosine similarity measure (Manning et al., 2008). This similarity measure is appropriate because it compares the angle between vectors and is not affected by their magnitude (the magnitude of forward vectors decreases with the number of modifiers generated). Each cluster $i$ defines a state in the DFA, and we say that a sequence $x_{1:t}$ is in state $i$ if its corresponding forward vector at time $t$ is in cluster $i$. Then, transitions in the DFA are defined using a procedure that looks at how sequences traverse the states. If a sequence $x_{1:t}$ is at state $i$ at time $t-1$ and goes to state $j$ at time $t$, then we define a transition from state $i$ to state $j$ with label $x_t$. This procedure may require merging states to give a consistent DFA, because different sequences may define different transitions for the same states and modifiers. After doing a merge, new merges may be required, so the procedure must be repeated until a DFA is obtained.

For this analysis, we took the spectral model with 9 states, and built DFA from the non-deterministic automata corresponding to heads and directions where we saw the largest improvements in accuracy with respect to the baselines. A DFA for the automaton (NN, LEFT) is shown in Figure 3. The vectors were originally divided into ten clusters, but the DFA construction required two state mergings, leading to an eight-state automaton. The state named I is the initial state. Clearly, we can see that there are special states for punctuation (state 9) and coordination (states 1 and 5). States 0 and 2 are harder to interpret.

6 Conclusion

Our main contribution is a basic tool for inducing sequential hidden structure in dependency grammars. Most of the recent work in dependency parsing has explored explicit feature engineering. In part, this may be attributed to the high cost of using tools such as EM to induce representations. Our experiments have shown that adding hidden structure improves parsing accuracy, and that our spectral algorithm is highly scalable.

Our methods may be used to enrich the representational power of more sophisticated dependency models. For example, future work should consider enhancing lexicalized dependency grammars with hidden states that summarize lexical dependencies. Another line for future research should extend the learning algorithm to be able to capture vertical hidden relations in the dependency tree, in addition to sequential relations.

Acknowledgements  We are grateful to Gabriele Musillo and the anonymous reviewers for providing us with helpful comments. This work was supported by a Google Research Award and by the European Commission (PASCAL2 NoE FP7-216886, XLike STREP FP7-288342). Borja Balle was supported by an FPU fellowship (AP2008-02064) of the Spanish Ministry of Education. The Spanish Ministry of Science and Innovation supported Ariadna Quattoni (JCI-2009-04240) and Xavier Carreras (RYC-2008-02223 and KNOW2 TIN2009-14715-C04-04).
Combining Tree Structures, Flat Features and Patterns
for Biomedical Relation Extraction
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 420-429,
Avignon, France, April 23 - 27 2012.
© 2012 Association for Computational Linguistics
of new approaches that are sensitive to the variations of complex linguistic constructions.
The proposed hybrid kernel is the composition of one tree kernel and two feature based kernels (one of them is already known in the literature and the other is proposed in this paper for the first time). The novelty of the newly proposed feature based kernel is that it envisages to accommodate the advantages of pattern based approaches. More precisely:

The remainder of the paper is organized as follows. In Section 2, we briefly review previous work. Section 3 lists the datasets. Then, in Section 4, we define our proposed hybrid kernel and describe its individual component kernels. Section 5 outlines the experimental settings. Following that, empirical results are discussed in Section 6. Finally, we conclude with a summary of our study as well as suggestions for further improvement of our approach.
Corpus     Sentences   Positive pairs   Negative pairs
BioInfer   1,100       2,534            7,132
AIMed      1,955       1,000            4,834
IEPA       486         335              482
HPRD50     145         163              270
LLL        77          164              166

Table 1: Basic statistics of the 5 benchmark PPI corpora.

on single corpora.
Apart from the approaches described above, there also exist other studies that used kernels for PPI extraction (e.g. the subsequence kernel (Bunescu and Mooney, 2006)).
A notable exception is the work published by Bui et al. (2010). They proposed an approach that consists of two phases. In the first phase, their system categorizes the data into different groups (i.e. subsets) based on various properties and patterns. Later they classify candidate PPI pairs inside each of the groups using an SVM trained with features specific to the corresponding group.

3 Data

There are 5 benchmark corpora for the PPI task that are frequently used: HPRD50 (Fundel et al., 2007), IEPA (Ding et al., 2002), LLL (Nedellec, 2005), BioInfer (Pyysalo et al., 2007) and AIMed (Bunescu et al., 2005). These corpora adopt different PPI annotation formats. For a comparative evaluation, Pyysalo et al. (2008) put all of them in a common format, which has become the standard evaluation format for the PPI task. In our experiments, we use the versions of the corpora converted to that format.
Table 1 shows various statistics regarding the 5 (converted) corpora.

4 Proposed Hybrid Kernel

The hybrid kernel that we propose is as follows:

    K_Hybrid(R1, R2) = K_TPWF(R1, R2) + K_SL(R1, R2) + w * K_PET(R1, R2)

where K_TPWF stands for the new feature based kernel (henceforth, TPWF kernel) computed using flat features collected by exploiting patterns, trigger words, negative cues and walk features. K_SL and K_PET stand for the Shallow Linguistic (SL) kernel and the Path-enclosed Tree (PET) kernel respectively. w is a multiplicative constant used for the PET kernel. It allows the hybrid kernel to assign more (or less) weight to the information obtained using tree structures depending on the corpus. The proposed hybrid kernel is valid according to the closure properties of kernels.
Both the TPWF and SL kernels are linear kernels, while the PET kernel is computed using the Unlexicalized Partial Tree (uPT) kernel (Severyn and Moschitti, 2010). The following subsections explain each of the individual kernels in more detail.

4.1 Proposed TPWF Kernel

4.1.1 Reduced graph, trigger words, negative cues and dependency patterns

For each of the candidate entity pairs, we construct a type of subgraph from the dependency graph formed by the syntactic dependencies among the words of a sentence. We call it a reduced graph and define it in the following way:

A reduced graph is a subgraph of the dependency graph of a sentence which includes:
- the two candidate entities and their governor nodes up to their least common governor (if it exists).
- dependent nodes (if they exist) of all the nodes added in the previous step.
- the immediate governor(s) (if they exist) of the least common governor.

Figure 1 shows an example of a reduced graph. A reduced graph is an extension of the smallest common subgraph of the dependency graph that aims at overcoming its limitations. It is a known issue that the smallest common subgraph (or subtree) sometimes does not contain cue words. Previously, Chowdhury et al. (2011a) proposed a linguistically motivated extension of the minimal (i.e. smallest) common subtree (which includes the candidate entity pairs), known as the Mildly Extended Dependency Tree (MEDT). However, the rules used for MEDT are too constrained. Our objective in constructing the reduced graph is to include any potential modifier(s) or cue word(s) that describe the relation between the given pair of entities. Sometimes such modifiers or cue words are not directly dependent (syntactically) on any
                            BioInfer        AIMed           IEPA            HPRD50          LLL
                            P    R    F     P    R    F     P    R    F     P    R    F     P    R    F
Only walk features          51.8 71.2 60.0  48.7 63.2 55.0  61.0 75.2 67.4  60.2 65.0 62.5  64.6 87.8 74.4
Features: dep. patterns,    53.8 68.8 60.4  50.6 63.9 56.5  63.9 74.6 68.9  65.0 71.8 68.2  66.5 89.6 76.4
trigger, neg. cues, walks
Features: dep. patterns,    53.5 68.6 60.1  52.5 62.9 57.2  63.8 74.6 68.8  65.1 69.9 67.5  67.4 88.4 76.5
trigger, neg. cues, walks,
regex patterns

Table 2: Results of the proposed TPWF feature based kernel on the 5 benchmark PPI corpora before and after adding features collected using dependency patterns, regex patterns, trigger words and negative cues to the walk features. The TPWF kernel is a component of the new hybrid kernel.
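The dependency-pattern list construction described in Section 4.1.1 (each training pair contributes the set of dependency labels in its reduced graph, and patterns seen with both positive and negative pairs are discarded) can be sketched as follows. The data layout and function names here are illustrative, not from the paper:

```python
def build_pattern_list(training_pairs):
    """training_pairs: iterable of (dep_label_set, is_positive) tuples."""
    seen = {}  # frozenset of dependency labels -> set of class labels observed
    for dep_labels, is_positive in training_pairs:
        seen.setdefault(frozenset(dep_labels), set()).add(is_positive)
    # Keep only patterns observed with exactly one class label.
    return [p for p, classes in seen.items() if len(classes) == 1]

patterns = build_pattern_list([
    ({"det", "amod", "nsubj", "neg", "dobj", "prep_of"}, True),
    ({"det", "nsubj", "dobj"}, True),
    ({"det", "nsubj", "dobj"}, False),  # seen with both classes: discarded
])
assert patterns == [frozenset({"det", "amod", "nsubj", "neg", "dobj", "prep_of"})]
```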
Figure 1: Dependency graph for the sentence "A pVHL mutant containing a P154L substitution does not promote degradation of HIF1-Alpha", generated by the Stanford parser. The edges with blue dots form the smallest common subgraph for the candidate entity pair pVHL and HIF1-Alpha, while the edges with red dots form the reduced graph for the pair.
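As an illustration of the three-step reduced-graph definition, here is a toy sketch over a dependency tree loosely modeled on the Figure 1 example; the graph encoding and function names are ours, not the paper's:

```python
def reduced_graph_nodes(governor_of, dependents_of, e1, e2):
    """governor_of: child -> governor map; dependents_of: governor -> children."""
    def path_to_root(n):
        path = [n]
        while path[-1] in governor_of:
            path.append(governor_of[path[-1]])
        return path
    p1, p2 = path_to_root(e1), path_to_root(e2)
    # Least common governor: first node on e1's path that also lies on e2's path.
    lcg = next(n for n in p1 if n in p2)
    # Step 1: the two entities and their governors up to the LCG.
    nodes = set(p1[:p1.index(lcg) + 1]) | set(p2[:p2.index(lcg) + 1])
    # Step 2: dependents of every node added so far.
    nodes |= {d for n in list(nodes) for d in dependents_of.get(n, ())}
    # Step 3: the immediate governor of the LCG, if it exists.
    if lcg in governor_of:
        nodes.add(governor_of[lcg])
    return nodes

governor_of = {"pVHL": "mutant", "mutant": "promote", "does": "promote",
               "not": "promote", "degradation": "promote",
               "HIF1-Alpha": "degradation"}
dependents_of = {}
for dep, gov in governor_of.items():
    dependents_of.setdefault(gov, []).append(dep)

nodes = reduced_graph_nodes(governor_of, dependents_of, "pVHL", "HIF1-Alpha")
assert "not" in nodes  # the cue word is preserved, as intended
```

Step 2 is what pulls in cue words such as "not", which depend on the least common governor rather than on either entity.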
of the entities (of the candidate pair). Rather, they are dependent on some other word(s) which is dependent on one (or both) of the entities. The word "not" in Figure 1 is one such example. The reduced graph aims to preserve these cue words.
The following types of features are collected from the reduced graph of a candidate pair:

1. HasTriggerWord: whether the least common governor(s) of the target entity pairs inside the reduced graph matches any trigger word.

2. Trigger-X: whether the least common governor(s) of the target entity pairs inside the reduced graph matches the trigger word X.

3. HasNegWord: whether the reduced graph contains any negative word.

4. DepPattern-i: whether the reduced graph contains all the syntactic dependencies of the i-th pattern of the dependency pattern list.

The dependency pattern list is automatically constructed from the training data during the learning phase. Each pattern is a set of syntactic dependencies of the corresponding reduced graph of a (positive or negative) entity pair in the training data. For example, the dependency pattern for the reduced graph in Figure 1 is {det, amod, partmod, nsubj, aux, neg, dobj, prep_of}. The same dependency pattern might be constructed for multiple (positive or negative) entity pairs. However, if it is constructed for both positive and negative pairs, it has to be discarded from the pattern list.
The dependency patterns allow some kind of underspecification: they do not contain the lexical items (i.e. words) but rather the likely combination of syntactic dependencies that a given related pair of entities would pose inside their reduced graph.
The list of trigger words contains 144 words previously used by Bui et al. (2010) and Fundel et al. (2007). The list of negative cues contains 18 words, most of which are mentioned in Fundel et al. (2007).

4.1.2 Walk features

We extract e-walk and v-walk features from the Mildly Extended Dependency Tree (MEDT) (Chowdhury et al., 2011a) of each candidate pair. Reduced graphs sometimes include some unin-
                                  BioInfer        AIMed           IEPA            HPRD50          LLL
Pos. / Neg.                       2,534 / 7,132   1,000 / 4,834   335 / 482       163 / 270       164 / 166
                                  P    R    F     P    R    F     P    R    F     P    R    F     P    R    F
Proposed TPWF kernel              53.8 68.8 60.4  50.6 63.9 56.5  63.9 74.6 68.9  65.0 71.8 68.2  66.5 89.6 76.4
(without regex)
Proposed TPWF kernel              53.5 68.6 60.1  52.5 62.9 57.2  63.8 74.6 68.8  65.1 69.9 67.5  67.4 88.4 76.5
(with regex)
SL kernel                         60.8 65.8 63.2  56.2 64.4 60.0  73.3 71.9 72.6  62.0 65.0 63.5  74.9 85.4 79.8
PET kernel                        72.8 74.9 73.9  44.8 72.8 55.5  70.7 77.9 74.2  65.0 73.0 68.8  72.1 89.6 79.9
Proposed hybrid kernel            80.0 71.4 75.5  64.2 58.2 61.1  81.1 69.3 74.7  72.9 59.5 65.5  70.4 95.7 81.1
(PET + SL + TPWF without regex)
Proposed hybrid kernel            80.1 72.0 75.9  64.4 58.3 61.2  79.3 69.6 74.1  71.9 61.4 66.2  70.6 95.1 81.0
(PET + SL + TPWF with regex)

Table 3: Results of the proposed hybrid kernel and its individual components. Pos. and Neg. refer to the number of positive and negative relations respectively. PET refers to the path-enclosed tree kernel, SL refers to the shallow linguistic kernel, and TPWF refers to the kernel computed using trigger, pattern, negative cue and walk features.
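The hybrid combination compared in Table 3 is an entrywise sum of the component Gram matrices, K_Hybrid = K_TPWF + K_SL + w * K_PET. A minimal sketch, with made-up 2x2 matrices and a made-up w (not values from the paper):

```python
def hybrid_gram(k_tpwf, k_sl, k_pet, w):
    """Combine precomputed Gram matrices entrywise."""
    n = len(k_tpwf)
    return [[k_tpwf[i][j] + k_sl[i][j] + w * k_pet[i][j]
             for j in range(n)] for i in range(n)]

K = hybrid_gram(
    [[1.0, 0.2], [0.2, 1.0]],  # K_TPWF (linear kernel), illustrative values
    [[1.0, 0.5], [0.5, 1.0]],  # K_SL (linear kernel), illustrative values
    [[1.0, 0.1], [0.1, 1.0]],  # K_PET (tree kernel), illustrative values
    w=0.5,
)
# Sums and positive scalings of valid kernels are valid kernels
# (the closure properties the paper invokes), so K is a valid Gram matrix.
assert abs(K[0][1] - 0.75) < 1e-9
```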
                                  BioInfer        AIMed           IEPA            HPRD50          LLL
Pos. / Neg.                       2,534 / 7,132   1,000 / 4,834   335 / 482       163 / 270       164 / 166
                                  P    R    F     P    R    F     P    R    F     P    R    F     P    R    F
SL kernel                         -    -    -     60.9 57.2 59.0  -    -    -     -    -    -     -    -    -
(Giuliano et al., 2006)
APG kernel                        56.7 67.2 61.3  52.9 61.8 56.4  69.6 82.7 75.1  64.3 65.8 63.4  72.5 87.2 76.8
(Airola et al., 2008)
Hybrid kernel and multiple        65.7 71.1 68.1  55.0 68.8 60.8  67.5 78.6 71.7  68.5 76.1 70.9  77.6 86.0 80.1
parser input (Miwa et al., 2009a)
SVM-CW, multiple parser input     -    -    67.6  -    -    64.2  -    -    74.4  -    -    69.7  -    -    80.5
and graph, walk and BOW
features (Miwa et al., 2009b)
kBSPS kernel                      49.9 61.8 55.1  50.1 41.4 44.6  58.8 89.7 70.5  62.2 87.1 71.0  69.3 93.2 78.1
(Tikk et al., 2010)
Walk-weighted subsequence         61.8 54.2 57.6  61.4 53.3 56.6  73.8 71.8 72.9  66.7 69.2 67.8  76.9 91.2 82.4
kernel (Kim et al., 2010)
2-phase extraction                61.7 57.5 60.0  55.3 68.5 61.2  -    -    -     -    -    -     -    -    -
(Bui et al., 2010)
Our proposed hybrid kernel        80.0 71.4 75.5  64.2 58.2 61.1  81.1 69.3 74.7  72.9 59.5 65.5  70.4 95.7 81.1
(PET + SL + TPWF without regex)

Table 4: Comparison of the results on the 5 benchmark PPI corpora. Pos. and Neg. refer to the number of positive and negative relations respectively. The underlined numbers indicate the best results for the corresponding corpus reported by any of the existing state-of-the-art approaches. The results of Bui et al. (2010) on LLL, HPRD50, and IEPA are not reported since they did not use all the positive and negative examples during cross validation. Miwa et al. (2009b) showed that better results can be obtained using multiple corpora for training. However, we consider only those results of their experiments where they used a single training corpus, as this is the standard evaluation approach adopted by all the other studies on PPI extraction for comparing results. All the results of the previous approaches reported in this table are directly quoted from their respective original papers.
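The P (precision), R (recall) and F (F1) columns in Tables 2-4 follow the standard definitions; a minimal sketch, with illustrative counts rather than numbers from the paper:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard P/R/F1 from true-positive, false-positive, false-negative counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

p, r, f = precision_recall_f1(tp=80, fp=20, fn=40)
assert p == 0.8
assert abs(r - 2 / 3) < 1e-12
assert abs(f - 8 / 11) < 1e-12  # harmonic mean of 0.8 and 2/3
```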
where K_SL, K_GC and K_LC correspond to the SL, global context (GC) and local context (LC) kernels respectively. The GC kernel exploits contextual information of the words occurring before, between and after the pair of entities (to be investigated for RE) in the corresponding sentence, while the LC kernel exploits contextual information surrounding the individual entities.

4.3 Path-enclosed Tree (PET) Kernel

The path-enclosed tree (PET) kernel3 was first proposed by Moschitti (2004) for semantic role labeling. It was later successfully adapted by Zhang et al. (2005) and other works for relation extraction on general texts (such as the newspaper domain). A PET is the smallest common subtree of a phrase structure tree that includes the two entities involved in a relation.
A tree kernel calculates the similarity between two input trees by counting the number of common sub-structures. Different techniques have been proposed to measure such similarity. We use the Unlexicalized Partial Tree (uPT) kernel (Severyn and Moschitti, 2010) for the computation of the PET kernel, since a comparative evaluation by Chowdhury et al. (2011a) reported that uPT kernels achieve better results for PPI extraction than the other techniques used for tree kernel computation.

3 Also known as the shortest path-enclosed tree (SPT) kernel.
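To make the "common sub-structures" idea concrete, here is a toy kernel that counts shared complete subtrees. This is the simplest member of the tree-kernel family, shown only for illustration; it is not the uPT kernel the paper actually uses:

```python
from collections import Counter

def subtrees(t):
    """Enumerate complete subtrees of t, a (label, child, child, ...) tuple."""
    yield t
    for child in t[1:]:
        yield from subtrees(child)

def subtree_kernel(t1, t2):
    """Count pairs of identical complete subtrees across the two trees."""
    c1, c2 = Counter(subtrees(t1)), Counter(subtrees(t2))
    return sum(c1[s] * c2[s] for s in c1)

a = ("S", ("NP", ("N",)), ("VP", ("V",), ("NP", ("N",))))
b = ("S", ("NP", ("N",)), ("VP", ("V",)))
# Shared subtrees: ("NP", ("N",)) occurs 2x in a and 1x in b (2 pairs),
# ("N",) 2x1 (2 pairs), ("V",) 1x1 (1 pair) -> 5.
assert subtree_kernel(a, b) == 5
```

Richer kernels such as the (u)PT kernel relax "complete subtree" to partial tree fragments, which makes the count of shared structures far larger and the similarity measure more flexible.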
5 Experimental Settings

We have followed the criteria commonly used for the PPI extraction tasks, i.e. abstract-wise 10-fold cross validation on each individual corpus and the one-answer-per-occurrence criterion. In fact, we have used exactly the same (abstract-wise) fold splitting of the 5 benchmark (converted) corpora used by Tikk et al. (2010) for benchmarking various kernel methods4.
The Charniak-Johnson reranking parser (Charniak and Johnson, 2005), along with a self-trained biomedical parsing model (McClosky, 2010), has been used for tokenization, POS-tagging and parsing of the sentences. Before parsing the sentences, all the entities are blinded by assigning names EntityX, where X is the entity index. In each example, the POS tags of the two candidate entities are changed to EntityX. The parse trees produced by the Charniak-Johnson reranking parser are then processed by the Stanford parser5 (Klein and Manning, 2003) to obtain syntactic dependencies according to the Stanford Typed Dependency format.
The Stanford parser often skips some syntactic dependencies in its output. We use the following two rules to add some of these dependencies:

- If there is a conj_and or conj_or dependency between two words X and Y, then X should be dependent on any word Z on which Y is dependent, and vice versa.

- If there are two verbs X and Y such that inside the corresponding sentence they have only the word "and" or "or" between them, then any word Z dependent on X should also be dependent on Y, and vice versa.

Our system exploits SVM-LIGHT-TK6 (Moschitti, 2006; Joachims, 1999). We made minor changes in the toolkit to compute the proposed hybrid kernel. The ratio of negative and positive examples has been used as the value of the cost-ratio-factor parameter. We have done parameter tuning following the approach described by Hsu et al. (2003).

6 Results and Discussion

To measure the contribution of the features collected from the reduced graphs (using dependency patterns, trigger words and negative cues) and regex patterns, we have applied the new TPWF kernel on the 5 PPI corpora before and after using these features. The results shown in Table 2 clearly indicate that the usage of these features improves performance. The improvement is primarily due to the usage of dependency patterns, which resulted in higher precision for all the corpora.
We have also tried to measure the contribution of the regex patterns. However, a clear trend does not emerge from the empirical results (see Table 2).
Table 3 shows a comparison among the results of the proposed hybrid kernel and its individual components. As we can see, the overall results of the hybrid kernel (with and without using regex pattern features) are better than those of any of its individual component kernels. Interestingly, the precision achieved on the 4 benchmark corpora (other than the smallest corpus, LLL) is much higher for the hybrid kernel than for the individual components. This strongly indicates that these different types of information (i.e. dependency patterns, regex patterns, triggers, negative cues, syntactic dependencies among words and constituent parse trees) and their different representations (i.e. flat features, tree structures and graphs) can complement each other to learn more accurate models.
Table 4 shows a comparison of the PPI extraction results of our proposed hybrid kernel with those of other state-of-the-art approaches. Since the contribution of regex patterns to the performance of the hybrid kernel was not relevant (as Tables 2 and 3 show), we used the results of the proposed hybrid kernel without regex for the comparison. As we can see, the proposed kernel achieves significantly higher results on the BioInfer corpus, the largest benchmark PPI corpus available (2,534 positive PPI pair annotations), than any of the existing approaches. Moreover, the results of the proposed hybrid kernel are on par with the state-of-the-art results on the other smaller corpora.
Furthermore, the empirical results show that the proposed hybrid kernel attains considerably higher precision than the existing approaches.

4 Downloaded from http://informatik.hu-berlin.de/forschung/gebiete/wbi/ppi-benchmark
5 http://nlp.stanford.edu/software/lex-parser.shtml
6 http://disi.unitn.it/moschitti/Tree-Kernel.htm
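The first dependency-completion rule from Section 5 (conjuncts linked by conj_and/conj_or inherit each other's governors) can be sketched as follows; the (governor, label, dependent) triple encoding is illustrative, not the Stanford parser's native output format:

```python
def propagate_conj(deps):
    """deps: set of (governor, label, dependent) triples; returns augmented set."""
    deps = set(deps)
    for gov, label, dep in list(deps):
        if label in ("conj_and", "conj_or"):
            x, y = gov, dep
            for g, l, d in list(deps):
                # Each conjunct inherits the other's (non-conj) governors.
                if d == x and not l.startswith("conj"):
                    deps.add((g, l, y))
                if d == y and not l.startswith("conj"):
                    deps.add((g, l, x))
    return deps

deps = {("binds", "nsubj", "A"), ("A", "conj_and", "B")}
out = propagate_conj(deps)
assert ("binds", "nsubj", "B") in out  # B inherits A's nsubj governor
```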
Since a dependency pattern, by construction, contains all the syntactic dependencies inside the corresponding reduced graph, it may happen that some of the dependencies (e.g. det or determiner) are not informative for classifying the class label (i.e., positive or negative relation) of the pattern. Their presence inside a pattern might make it unnecessarily rigid and less general. So, we tried to identify and discard such non-informative dependencies by measuring the probabilities of the dependencies with respect to the class label and then removing any of them whose probability is lower than a threshold (we tried different threshold values). But doing so decreased the performance. This suggests that the syntactic dependencies of a dependency pattern are not independent of each other, even if some of them might individually have low probability (with respect to the class label). We plan to further investigate whether there could be different criteria for identifying non-informative dependencies. For the work reported in this paper, we used the dependency patterns as they are initially constructed.
We also ran experiments to see whether collecting features for trigger words from the whole reduced graph would help. But that also decreased performance. This suggests that trigger words are more likely to appear in the least common governors.

7 Conclusion

In this paper, we have proposed a new hybrid kernel for RE that combines two vector based kernels and a tree kernel. The proposed kernel outperforms any of the existing approaches by a wide margin on the BioInfer corpus, the largest PPI benchmark corpus available. On the other four smaller benchmark corpora, it performs either better than or almost as well as the existing state-of-the-art approaches.
We have also proposed a novel feature based kernel, called the TPWF kernel, using (automatically collected) dependency patterns, trigger words, negative cues, walk features and regular expression patterns. The TPWF kernel is used as a component of the new hybrid kernel.
Empirical results show that the proposed hybrid kernel achieves considerably higher precision than the existing approaches, which indicates its capability of learning more accurate models. This also demonstrates that the different types of information that we use are able to complement each other for relation extraction.
We believe there are at least three ways to further improve the proposed approach. First of all, the 22 regular expression patterns (collected from Ono et al. (2001) and Bui et al. (2010)) are applied at the level of whole sentences, and this sometimes produces unwanted matches. For example, consider the sentence "X activates Y and inhibits Z", where X, Y, and Z are entities. The pattern "Entity1. activates. Entity2" matches both the X-Y and X-Z pairs in the sentence, but only the X-Y pair should be considered. So, the patterns should be constrained to reduce the number of unwanted matches; for example, they could be applied on smaller linguistic units than full sentences. Secondly, different techniques could be used to identify less informative syntactic dependencies inside dependency patterns to make them more accurate and effective. Thirdly, the usage of automatically collected paraphrases of the regular expression patterns, instead of the patterns directly, could also be helpful. Weakly supervised collection of paraphrases for RE has already been investigated (e.g. Romano et al. (2006)) and, hence, can be tried for improving the TPWF kernel (which is a component of the proposed hybrid kernel).

Acknowledgments

This work was carried out in the context of the project eOnco - Pervasive knowledge and data management in cancer care. The authors are grateful to Alessandro Moschitti for his help in the use of SVM-LIGHT-TK. We also thank the anonymous reviewers for helpful suggestions.

References

Antti Airola, Sampo Pyysalo, Jari Björne, Tapio Pahikkala, Filip Ginter, and Tapio Salakoski. 2008. All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinformatics, 9(Suppl 11):S2.

Quoc-Chinh Bui, Sophia Katrenko, and Peter M.A. Sloot. 2010. A hybrid approach to extract protein-protein interactions. Bioinformatics.

Razvan Bunescu and Raymond J. Mooney. 2006. Subsequence kernels for relation extraction. In Proceedings of NIPS 2006, pages 171-178.
Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Edward M. Marcotte, Raymond J. Mooney, Arun Kumar Ramani, and Yuk Wah Wong. 2005. Comparative experiments on learning information extractors for proteins and their interactions. Artificial Intelligence in Medicine, 33(2):139-155.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proceedings of ACL 2005.

Md. Faisal Mahbub Chowdhury and Alberto Lavelli. 2011b. Drug-drug interaction extraction using composite kernels. In Proceedings of DDIExtraction2011: First Challenge Task: Drug-Drug Interaction Extraction, pages 27-33, Huelva, Spain, September.

Md. Faisal Mahbub Chowdhury, Alberto Lavelli, and Alessandro Moschitti. 2011a. A study on dependency tree kernels for automatic extraction of protein-protein interaction. In Proceedings of the BioNLP 2011 Workshop, pages 124-133, Portland, Oregon, USA, June.

Md. Faisal Mahbub Chowdhury, Asma Ben Abacha, Alberto Lavelli, and Pierre Zweigenbaum. 2011c. Two different machine learning techniques for drug-drug interaction extraction. In Proceedings of DDIExtraction2011: First Challenge Task: Drug-Drug Interaction Extraction, pages 19-26, Huelva, Spain, September.

J. Ding, D. Berleant, D. Nettleton, and E. Wurtele. 2002. Mining MEDLINE: abstracts, sentences, or phrases? Pacific Symposium on Biocomputing, pages 326-337.

Katrin Fundel, Robert Küffner, and Ralf Zimmer. 2007. RelEx: relation extraction using dependency parse trees. Bioinformatics, 23(3):365-371.

Claudio Giuliano, Alberto Lavelli, and Lorenza Romano. 2006. Exploiting shallow linguistic information for relation extraction from biomedical literature. In Proceedings of EACL 2006, pages 401-408.

C.-W. Hsu, C.-C. Chang, and C.-J. Lin. 2003. A practical guide to support vector classification. Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.

Thorsten Joachims. 1999. Making large-scale support vector machine learning practical. In Advances in Kernel Methods: Support Vector Learning, pages 169-184. MIT Press, Cambridge, MA, USA.

Seonho Kim, Juntae Yoon, Jihoon Yang, and Seog Park. 2010. Walk-weighted subsequence kernels for protein-protein interaction extraction. BMC Bioinformatics, 11(1).

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of ACL 2003, pages 423-430, Sapporo, Japan.

David McClosky. 2010. Any Domain Parsing: Automatic Domain Adaptation for Natural Language Parsing. Ph.D. thesis, Department of Computer Science, Brown University.

Makoto Miwa, Rune Sætre, Yusuke Miyao, and Junichi Tsujii. 2009a. Protein-protein interaction extraction by leveraging multiple kernels and parsers. International Journal of Medical Informatics, 78.

Makoto Miwa, Rune Sætre, Yusuke Miyao, and Junichi Tsujii. 2009b. A rich feature vector for protein-protein interaction extraction from multiple corpora. In Proceedings of EMNLP 2009, pages 121-130, Singapore.

Alessandro Moschitti. 2004. A study on convolution kernels for shallow semantic parsing. In Proceedings of ACL 2004, Barcelona, Spain.

Alessandro Moschitti. 2006. Making tree kernels practical for natural language learning. In Proceedings of EACL 2006, Trento, Italy.

Claire Nédellec. 2005. Learning language in logic - genic interaction extraction challenge. In Proceedings of the ICML 2005 Workshop: Learning Language in Logic (LLL05), pages 31-37.

Toshihide Ono, Haretsugu Hishigaki, Akira Tanigami, and Toshihisa Takagi. 2001. Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics, 17(2):155-161.

Sampo Pyysalo, Filip Ginter, Juho Heimonen, Jari Björne, Jorma Boberg, Jouni Järvinen, and Tapio Salakoski. 2007. BioInfer: a corpus for information extraction in the biomedical domain. BMC Bioinformatics, 8(1):50.

Sampo Pyysalo, Antti Airola, Juho Heimonen, Jari Björne, Filip Ginter, and Tapio Salakoski. 2008. Comparative analysis of five protein-protein interaction corpora. BMC Bioinformatics, 9(Suppl 3):S6.

Lorenza Romano, Milen Kouylekov, Idan Szpektor, Ido Dagan, and Alberto Lavelli. 2006. Investigating a generic paraphrase-based approach for relation extraction. In Proceedings of EACL 2006, pages 409-416.

Isabel Segura-Bedmar, Paloma Martínez, and Cesar de Pablo-Sánchez. 2011. Using a shallow linguistic kernel for drug-drug interaction extraction. Journal of Biomedical Informatics, in press.

Aliaksei Severyn and Alessandro Moschitti. 2010. Fast cutting plane training for structural kernels. In Proceedings of ECML-PKDD 2010.

Domonkos Tikk, Philippe Thomas, Peter Palaga, Jörg Hakenberg, and Ulf Leser. 2010. A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature. PLoS Computational Biology, 6(7), July.

Min Zhang, Jian Su, Danmei Wang, Guodong Zhou, and Chew Lim Tan. 2005. Discovering relations between named entities from a large raw corpus using tree similarity-based clustering. In Natural Language Processing - IJCNLP 2005, volume 3651 of Lecture Notes in Computer Science, pages 378-389. Springer, Berlin / Heidelberg.
Coordination Structure Analysis using Dual Decomposition
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 430-438,
Avignon, France, April 23 - 27 2012.
© 2012 Association for Computational Linguistics
Stanford parser/Enju They disambiguated coordination structures
I am a ( freshman advertising ) and ( based on the edit distance between two conjuncts.
marketing major ) Hara et al. (2009) extended the method, dealing
with nested coordinations as well. We used their
Correct coordination structure
method as one of the two sub-models.
I am a freshman ( ( advertising and mar-
keting ) major ) 3 Background
3.1 Coordination structure analysis with
Table 1: Output from the Stanford parser, Enju and the
correct coordination structure alignment-based local features
Coordination structure analysis with alignment-
so we can easily add other modules or features for based local features (Hara et al., 2009) is a hy-
future. brid approach to coordination disambiguation that
The structure of this paper is as follows. First, combines a simple grammar to ensure consistent
we describe three basic methods required in the global structure of coordinations in a sentence,
technique we propose: 1) coordination structure and features based on sequence alignment to cap-
analysis with alignment-based local features, 2) ture local symmetry of conjuncts. In this section,
HPSG parsing, and 3) dual decomposition. Fi- we describe the method briefly.
nally, we show experimental results that demon- A sentence is denoted by x = x1 ...xk , where xi
strate the effectiveness of our approach. We com- is the i-th word of x. A coordination boundaries
pare three methods: coordination structure anal- set is denoted by y = y1 ...yk , where
ysis with alignment-based local features, HPSG
(bl , el , br , er ) (if xi is a coordinating
parsing, and the dual-decomposition-based ap-
proach that combines both. conjunction having left
yi = conjunct xbl ...xel and
2 Related Work
right conjunct xbr ...xer )
null (otherwise)
Many previous studies for coordination disam-
biguation have focused on a particular type of NP In other words, yi has a non-null value
coordination (Hogan, 2007). Resnik (1999) dis- only when it is a coordinating conjunction.
ambiguated coordination structures by using se- For example, a sentence I bought books and
mantic similarity of the conjuncts in a taxonomy. stationary has a coordination boundaries set
He dealt with two kinds of patterns, [n0 n1 and (null, null, null, (3, 3, 5, 5), null).
n2 n3 ] and [n1 and n2 n3 ], where ni are all nouns. The score of a coordination boundaries set is
He detected coordination structures based on sim- defined as the sum of score of all coordinating
ilarity of form, meaning and conceptual associa- conjunctions in the sentence.
tion between n1 and n2 and between n1 and n3 .
Nakov and Hearst (2005) used the Web as a train-
k
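As a concrete illustration, the coordination-boundaries representation can be sketched in code. This is a toy sketch: the tuple layout follows the definition above, but the scoring function is an invented stand-in for the alignment-based local score, not the authors' model.

```python
# Sketch of the coordination-boundaries representation of Section 3.1.
# The scoring heuristic below is illustrative only.

# Sentence x = x_1 ... x_k (indices are 1-based in the paper).
x = ["I", "bought", "books", "and", "stationery"]

# y_i = (b_l, e_l, b_r, e_r) if x_i is a coordinating conjunction with
# left conjunct x_{b_l}..x_{e_l} and right conjunct x_{b_r}..x_{e_r};
# None (null) otherwise.  "and" coordinates "books" with "stationery".
y = [None, None, None, (3, 3, 5, 5), None]

def score_conjunction(x, yi):
    """Stand-in for the alignment-based local score of one conjunction."""
    bl, el, br, er = yi
    # Toy heuristic: symmetric conjuncts of equal length get a bonus.
    return 1.0 if (el - bl) == (er - br) else 0.0

def score(x, y):
    """Score of a coordination boundaries set: sum over all conjunctions."""
    return sum(score_conjunction(x, yi) for yi in y if yi is not None)

print(score(x, y))  # -> 1.0: the single conjunction has equal-length conjuncts
```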
[Figure: an HPSG sign, with HEAD, COMPS (list of synsem), SEM (semantics), and NONLOC REL/SLASH (lists of local) features]
[Figure 2: HPSG parsing; taken from Miyao et al. (2004). The figure shows the lexical entries for "Spring has come" and the application of the subject-head and head-complement schemata.]

  u(1) ← 0
  for k = 1 to K do
    x(k) ← argmax_x ( f(x) + u(k) · x )
    y(k) ← argmax_y ( g(y) − u(k) · y )
    if x(k) = y(k) then
      return u(k)
    end if
    u(k+1) ← u(k) − ak ( x(k) − y(k) )
  end for
  return u(K)
Table 4: The subgradient method

Dual decomposition decomposes a hard optimization problem into efficiently solvable sub-problems. It is becoming popular in the NLP community and has been shown to work effectively on several NLP tasks (Rush et al., 2010).

We consider an optimization problem

  argmax_x ( f(x) + g(x) )    (2)

which is difficult to solve (e.g. NP-hard), while argmax_x f(x) and argmax_x g(x) are effectively solvable. In dual decomposition, we solve

  min_u max_{x,y} ( f(x) + g(y) + u · (x − y) )

instead of the original problem. To find the minimum value, we can use a subgradient method (Rush et al., 2010); it is given in Table 4. As the algorithm shows, we can use existing algorithms for the sub-problems and do not need an exact algorithm for the combined optimization problem, which are the attractive features of dual decomposition.

If x(k) = y(k) occurs during the algorithm, then we simply take x(k) as the primal solution, which is the exact answer. If not, we simply take x(K), the answer of coordination structure analysis with alignment-based features, as an approximate answer to the primal solution. This answer does not always solve the original problem Eq (2), but previous work (e.g., Rush et al. (2010)) has shown that it is effective in practice. We use it in this paper.
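As a toy illustration, the subgradient method of Table 4 can be sketched as follows. The two scoring functions, the candidate set, and the step size are invented for the example; in the paper the two argmaxes are computed by the CSA and HPSG parsers, not by enumeration.

```python
# Sketch of the subgradient method of Table 4 on a toy problem.
from itertools import product

CANDIDATES = [tuple(v) for v in product([0, 1], repeat=2)]

def f(x):  # toy sub-problem 1: prefers (1, 0)
    return {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 1.5}[x]

def g(y):  # toy sub-problem 2: prefers (1, 1), with (1, 0) a close second
    return {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 1.8, (1, 1): 2.0}[y]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def subgradient(K=50, a=0.5):
    u = (0.0, 0.0)                                    # u(1) <- 0
    for k in range(1, K + 1):
        xk = max(CANDIDATES, key=lambda x: f(x) + dot(u, x))
        yk = max(CANDIDATES, key=lambda y: g(y) - dot(u, y))
        if xk == yk:                                  # agreement: exact answer
            return xk
        u = tuple(ui - a / k * (xi - yi)              # u(k+1) <- u(k) - a_k (x - y)
                  for ui, xi, yi in zip(u, xk, yk))
    return xk                                         # approximate answer

print(subgradient())  # -> (1, 0), the argmax of f + g, reached by agreement
```

Note that each iteration only calls the two individual maximizers; the combined objective f + g is never optimized directly, which is exactly the point of the decomposition.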
[Figure 3: Construction of a coordination in Enju. coord_right_schema combines the coordinating conjunction with the right conjunct; coord_left_schema combines the left conjunct with the resulting partial coordination.]

4 Proposed method

In this section, we describe how we apply dual decomposition to the two models.

4.1 Notation

We define some notations here. First we describe weighted CFG parsing, which is used for both coordination structure analysis with alignment-based features and HPSG parsing. We follow the formulation of Rush et al. (2010). We assume a context-free grammar in Chomsky normal form, with a set of non-terminals N. All rules of the grammar are either of the form A → BC or A → w, where A, B, C ∈ N and w ∈ V. For rules of the form A → w we refer to A as the pre-terminal for w.

Given a sentence with n words, w1 w2 ... wn, a parse tree is a set of rule productions of the form ⟨A → BC, i, k, j⟩ where A, B, C ∈ N and 1 ≤ i ≤ k ≤ j ≤ n. Each rule production represents the use of CFG rule A → BC where non-terminal A spans words wi ... wj, non-terminal B
spans words wi ... wk, and non-terminal C spans words wk+1 ... wj if k < j, and the use of CFG rule A → wi if i = k = j.

We now define the index set for the coordination structure analysis as

  Icsa = { ⟨A → BC, i, k, j⟩ : A, B, C ∈ N, 1 ≤ i ≤ k ≤ j ≤ n }

Each parse tree is a vector y = {yr : r ∈ Icsa}, with yr = 1 if rule r is in the parse tree, and yr = 0 otherwise. Therefore, each parse tree is represented as a vector in {0, 1}^m, where m = |Icsa|. We use Y to denote the set of all valid parse-tree vectors; the set Y is a subset of {0, 1}^m.

In addition, we assume a vector θcsa = {θr_csa : r ∈ Icsa} that specifies a score for each rule production. Each θr_csa can take any real value. The optimal parse tree is y* = argmax_{y ∈ Y} y · θcsa, where y · θcsa = Σr yr θr_csa is the inner product between y and θcsa.

We use similar notation for HPSG parsing. We define Ihpsg, Z and θhpsg as the index set, the set of all valid parse-tree vectors, and the weight vector for HPSG parsing, respectively.

We extend the index sets for both the coordination structure analysis with alignment-based features and HPSG parsing to impose a constraint between the two sub-problems. For the coordination structure analysis with alignment-based features we define the extended index set to be I'csa = Icsa ∪ Iuni, where

  Iuni = { (a, b, c) : a, b, c ∈ {1 ... n} }

Here each triple (a, b, c) represents that word wc is recognized as the last word of the right conjunct and the scope of the left conjunct or the coordinating conjunction is wa ... wb.(1) Thus each parse-tree vector y will have additional components ya,b,c. Note that this representation is over-complete, since a parse tree is enough to determine the unique coordination structures for a sentence: more explicitly, the value of ya,b,c is 1 if rule COORDa,c → CJTa,b CC·,· CJT·,c or COORD·,c → CJT·,· CCa,b CJT·,c is in the parse tree; otherwise it is 0.

We apply the same extension to the HPSG index set, also giving an over-complete representation. We define za,b,c analogously to ya,b,c.

(1) This definition is derived from the structure of a coordination in Enju (Figure 3). The triples show where the coordinating conjunction and right conjunct are in coord_right_schema, and where the left conjunct and partial coordination are in coord_left_schema. Thus they alone enable not only the coordination structure analysis with alignment-based features but also Enju to uniquely determine the structure of a coordination.

4.2 Proposed method

We now describe the dual decomposition approach for coordination disambiguation. First, we define the set Q as follows:

  Q = { (y, z) : y ∈ Y, z ∈ Z, ya,b,c = za,b,c for all (a, b, c) ∈ Iuni }

Therefore, Q is the set of all (y, z) pairs that agree on their coordination structures. The problem of combining coordination structure analysis with alignment-based features and HPSG parsing is then to solve

  max_{(y,z) ∈ Q} ( y · θcsa + γ z · θhpsg )    (3)

where γ > 0 is a parameter dictating the relative weight of the two models and is chosen to optimize performance on the development test set. This problem is equivalent to

  max_{z ∈ Z} ( g(z) · θcsa + γ z · θhpsg )    (4)

where g : Z → Y is a function that maps an HPSG tree z to its set of coordination structures y = g(z).

We solve this optimization problem by using dual decomposition. Figure 4 shows the resulting algorithm. The algorithm tries to optimize the combined objective by repeatedly solving the sub-problems separately. After each iteration, the algorithm updates the weights u(a, b, c); these updates modify the objective functions for the two sub-problems, encouraging them to agree on the same coordination structures. If y(k) = z(k) occurs during the iterations, then the algorithm simply returns y(k) as the exact answer. If not, the algorithm returns the answer of coordination analysis with alignment-based features as a heuristic answer.

The original sub-problems need to be modified for calculating (1) and (2) in Figure 4. We modified the sub-problems to regard the score of u(a, b, c) as a bonus/penalty of the coordination.
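The agreement loop of Figure 4 can be sketched with dual variables indexed by triples (a, b, c). The two solver stubs and their scores below are hypothetical stand-ins for the modified CSA decoder and modified Enju; each returns the set of triples (coordination structures) of its best parse under the current bonuses/penalties u.

```python
# Sketch of the Figure-4 loop with duals indexed by triples (a, b, c).
def decompose(solve_csa, solve_hpsg, triples, K=100, a0=0.5):
    u = {t: 0.0 for t in triples}                 # u(1)(a,b,c) <- 0
    for k in range(1, K + 1):
        y = solve_csa(u)                          # (1): CSA scores penalized by -u
        z = solve_hpsg(u)                         # (2): HPSG scores rewarded by +u
        if y == z:                                # full agreement on I_uni
            return y                              # exact answer
        ak = a0 / k                               # a simple diminishing step size
        for t in triples:
            u[t] += ak * ((t in y) - (t in z))    # push the two models together
    return y                                      # heuristic answer (CSA side)

t = (1, 2, 4)  # hypothetical triple: scope w1..w2, right conjunct ends at w4

def solve_csa(u):   # toy CSA decoder: base score 1.0 for proposing t
    return {t} if 1.0 - u[t] > 0 else set()

def solve_hpsg(u):  # toy HPSG decoder: base score -0.3 against t
    return {t} if u[t] - 0.3 > 0 else set()

print(decompose(solve_csa, solve_hpsg, [t]))  # -> {(1, 2, 4)}: agreement at k = 2
```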
  u(1)(a, b, c) ← 0 for all (a, b, c) ∈ Iuni
  for k = 1 to K do
    y(k) ← argmax_{y ∈ Y} ( y · θcsa − Σ_{(a,b,c) ∈ Iuni} u(k)(a, b, c) ya,b,c )   ... (1)
    z(k) ← argmax_{z ∈ Z} ( γ z · θhpsg + Σ_{(a,b,c) ∈ Iuni} u(k)(a, b, c) za,b,c )   ... (2)
    if y(k)(a, b, c) = z(k)(a, b, c) for all (a, b, c) ∈ Iuni then
      return y(k)
    end if
    for all (a, b, c) ∈ Iuni do
      u(k+1)(a, b, c) ← u(k)(a, b, c) + ak ( y(k)(a, b, c) − z(k)(a, b, c) )
    end for
  end for
  return y(K)
Figure 4: The dual decomposition algorithm

The modified coordination structure analysis with alignment-based features adds −u(k)(i, j, m) and −u(k)(j+1, l−1, m), as well as w · f(x, (i, j, l, m)), to the score of the subtree when the rule production COORDi,m → CJTi,j CCj+1,l−1 CJTl,m is applied.

The modified Enju adds u(k)(a, b, c) when coord_right_schema is applied, where wa ... wb is recognized as a coordinating conjunction and the last word of the right conjunct is wc, or when coord_left_schema is applied, where wa ... wb is recognized as the left conjunct and the last word of the right conjunct is wc.

  COORD    WSJ    Genia
  NP       63.7   66.3
  VP       13.8   11.4
  ADJP      6.8    9.6
  S        11.4    6.0
  PP        2.4    5.1
  Others    1.9    1.5
Table 6: The percentage of each conjunct type (%) of each test set
5 Experiments

5.1 Test/Training data

We trained the alignment-based coordination analysis model on both the Genia corpus (Kim et al., 2003) and the Wall Street Journal portion of the Penn Treebank (Marcus et al., 1993), and evaluated the performance of our method on (i) the Genia corpus and (ii) the Wall Street Journal portion of the Penn Treebank. More precisely, we used the HPSG treebank converted from the Penn Treebank and Genia, and further extracted the training/test data for coordination structure analysis with alignment-based features using the annotation in the Treebank. Table 5 shows the corpora used in the experiments.

The Wall Street Journal portion of the Penn Treebank in the test set has 2317 sentences from WSJ articles, and there are 1356 COOD tags in the sentences, while the Genia test set has 1764 sentences from MEDLINE abstracts, and there are 1848 COOD tags in the sentences. Coordinations are further subcategorized into phrase COOD types such as NP-COOD or VP-COOD. Table 6 shows the percentage of each phrase type in all coordinations. It indicates that the Wall Street Journal portion of the Penn Treebank has more VP-COOD and S-COOD tags, while the Genia corpus has more NP-COOD and ADJP-COOD tags.

5.2 Implementation of sub-problems

We used Enju (Miyao and Tsujii, 2004) for the implementation of HPSG parsing, which has a wide-coverage probabilistic HPSG grammar and an efficient parsing algorithm, while we re-implemented Hara et al. (2009)'s algorithm with slight modifications.

5.2.1 Step size

We used the following step size in our algorithm (Figure 4). First, we initialized a0, which is chosen to optimize performance on the development set. Then we defined ak = a0 · 2^(−ηk), where ηk is the number of times that L(u(k′)) > L(u(k′−1)) for k′ ≤ k.

5.3 Evaluation metric

We evaluated the performance of the tested methods by the accuracy of coordination-level bracketing (Shimbo and Hara, 2007); i.e., we count each of the coordination scopes as one output of the system, and the system …
              Task (i)                                Task (ii)
  Training    WSJ (sec. 2-21) + Genia (No. 1-1600)    WSJ (sec. 2-21)
  Development Genia (No. 1601-1800)                   WSJ (sec. 22)
  Test        Genia (No. 1801-1999)                   WSJ (sec. 23)
Table 5: The corpora used in the experiments
             Proposed   Enju   CSA
  Precision  72.4       66.3   65.3
  Recall     67.8       65.5   60.5
  F1         70.0       65.9   62.8
Table 7: Results of Task (i) on the test set. The precision, recall, and F1 (%) for the proposed method, Enju, and coordination structure analysis with alignment-based features (CSA).

[Figure: accuracy and certificates (%) over iterations 1-49 for Task (i)]
  COORD    #     Proposed  Enju   CSA    #     Hara et al. (2009)
  Overall  1848  67.7      63.3   61.9   3598  61.5
  NP       1213  67.5      61.4   64.1   2317  64.2
  VP       208   79.8      78.8   66.3   456   54.2
  ADJP     193   58.5      59.1   54.4   312   80.4
  S        111   51.4      52.3   34.2   188   22.9
  PP       110   64.5      59.1   57.3   167   59.9
  Others   13    78.3      73.9   65.2   140   49.3
Table 8: The number of coordinations of each type (#), and the recall (%) for the proposed method, Enju, coordination structure analysis with alignment-based features (CSA), and Hara et al. (2009) of Task (i) on the development set. Note that Hara et al. (2009) uses a different test set and different annotation rules, although its test data is also taken from the Genia corpus. Thus we cannot compare them directly.
             Proposed   Enju   CSA
  Precision  76.3       70.7   66.0
  Recall     70.6       69.0   60.1
  F1         73.3       69.9   62.9
Table 9: Results of Task (ii) on the test set. The precision, recall, and F1 (%) for the proposed method, Enju, and coordination structure analysis with alignment-based features (CSA).

[Figure: accuracy and certificates (%) over iterations 1-49 for Task (ii)]
References

Kazuo Hara, Masashi Shimbo, Hideharu Okuma, and Yuji Matsumoto. 2009. Coordinate structure analysis with global structural constraints and alignment-based local features. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 967-975, Aug.

Deirdre Hogan. 2007. Coordinate noun phrase disambiguation in a generative parsing model. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL 2007), pages 680-687.

Jin-Dong Kim, Tomoko Ohta, and Jun'ichi Tsujii. 2003. Genia corpus: a semantically annotated corpus for bio-textmining. Bioinformatics, 19.

Dan Klein and Christopher D. Manning. 2003. Fast exact inference with a factored model for natural language parsing. Advances in Neural Information Processing Systems, 15:3-10.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313-330.

Yusuke Miyao and Jun'ichi Tsujii. 2004. Deep linguistic analysis for the accurate identification of predicate-argument relations. In Proceedings of COLING 2004, pages 1392-1397.

Yusuke Miyao and Jun'ichi Tsujii. 2008. Feature forest models for probabilistic HPSG parsing. Computational Linguistics, 34(1):35-80.

Yusuke Miyao, Takashi Ninomiya, and Jun'ichi Tsujii. 2004. Corpus-oriented grammar development for acquiring a head-driven phrase structure grammar from the Penn Treebank. In Proceedings of the First International Joint Conference on Natural Language Processing (IJCNLP 2004).

Preslav Nakov and Marti Hearst. 2005. Using the web as an implicit training set: application to structural ambiguity resolution. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT-EMNLP 2005), pages 835-842.

Carl Pollard and Ivan A. Sag. 1994. Head-Driven Phrase Structure Grammar. University of Chicago Press.

Philip Resnik. 1999. Semantic similarity in a taxonomy. Journal of Artificial Intelligence Research, 11:95-130.

Alexander M. Rush, David Sontag, Michael Collins, and Tommi Jaakkola. 2010. On dual decomposition and linear programming relaxations for natural language processing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Masashi Shimbo and Kazuo Hara. 2007. A discriminative learning model for coordinate conjunctions. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 610-619, Jun.
Cutting the Long Tail: Hybrid Language Models
for Translation Style Adaptation
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 439-448, Avignon, France, April 23 - 27 2012. © 2012 Association for Computational Linguistics
Beginning of sentence ([s]):
       TED                         NEWS
  1    [s] Thank you . [/s]        1     [s] ( AP ) -
  2    [s] Thank you very much     2     [s] WASHINGTON ( ...
  3    [s] I 'm going to           3     [s] NEW YORK ( AP
  4    [s] And I said ,            4     [s] ( CNN )
  5    [s] I don 't know           5     [s] NEW YORK ( R...
  6    [s] He said ,               6     [s] He said :
  7    [s] I said ,                7     [s] I don 't
  8    [s] And of course ,         8     [s] It was last updated
  9    [s] And one of the          9     [s] At the same time
  10   [s] And I want to           ...
  11   [s] And that 's what        69    [s] I don 't know
  12   [s] We 're going to         612   [s] I 'm going to
  13   [s] And I think that        2434  [s] I said ,
  14   [s] And you can see         7034  [s] He said ,
  15   [s] And this is a           8199  [s] And I said ,
  16   [s] And this is the         8233  [s] Thank you very much
  17   [s] And he said ,           ...
  18   [s] So this is a            -     [s] Thank you . [/s]

End of sentence ([/s]):
       TED                         NEWS
  1    [s] Thank you . [/s]        1     he said . [/s]
  2    you very much . [/s]        2     she said . [/s]
  3    in the world . [/s]         3     , he said . [/s]
  4    and so on . [/s]            4     he said . [/s]
  5    , you know . [/s]           5     in a statement . [/s]
  6    of the world . [/s]         6     the United States . [/s]
  7    around the world . [/s]     7     to this report . [/s]
  8    . Thank you . [/s]          8     he added . [/s]
  9    the United States . [/s]    9     , police said . [/s]
  10   all the time . [/s]         10    , officials said . [/s]
  11   to do it . [/s]             ...
  12   and so forth . [/s]         13    in the world . [/s]
  13   don 't know . [/s]          17    around the world . [/s]
  14   to do that . [/s]           46    of the world . [/s]
  15   in the future . [/s]        129   all the time . [/s]
  16   the same time . [/s]        157   and so on . [/s]
  17   , you know ? [/s]           1652  , you know . [/s]
  18   to do this . [/s]           5509  you very much . [/s]

Table 1: Common sentence-initial and sentence-final 5-grams, as ranked by frequency, in the TED and NEWS corpora. Numbers denote the frequency rank.
monolingual … consists of a rather small collection of TED talks plus a variety of large out-of-domain corpora, such as news stories and UN proceedings.

Given the diversity of topics, the in-domain data alone cannot ensure sufficient coverage for an SMT system. The addition of background data can certainly improve the n-gram coverage and thus the fluency of our translations, but it may also move our system towards an unsuitable language style, such as that of written news.

In our study, we focus on the subproblem of target language modeling and consider two English text collections, namely the in-domain TED and the out-of-domain NEWS(3), summarized in Table 2. Because of its larger size (two orders of magnitude), the NEWS corpus can provide a better LM coverage than the TED on the test data. This is reflected both in perplexity and in the average length of the context (or history h) actually used by these two LMs to score the test reference translations. Note that the latter measure is bounded at the LM order minus one, and is inversely proportional to the number of back-offs performed by the model. Hence, we use this value to estimate how well an n-gram LM fits the test data. Indeed, despite the genre mismatch, the perplexity of a NEWS 5-gram LM on the TED-2010 test reference translations is 104 versus 112 for the in-domain LM, and the average history size is 2.5 versus 1.7 words.

       TED          NEWS
  1    ,            1     the
  ...               ...
  9    I            40    I
  12   you          64    you
  90   actually     965   actually
  268  stuff        2479  guy
  370  guy          2861  stuff
  436  amazing      4706  amazing
Table 3: Excerpts from TED and NEWS training vocabularies, as ranked by frequency. Numbers denote the frequency rank.

(3) http://www.statmt.org/wmt11/translation-task.html
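The "average history size" used above can be approximated as follows. This is a rough proxy of our own, not the papers' procedure: for each test token we take the longest preceding context whose full n-gram was seen in training, capped at order minus one; a real back-off LM such as the 5-gram models discussed here would be queried directly instead.

```python
# Rough proxy for the average history length h: for each token, the longest
# preceding context (up to order-1 words) whose full n-gram occurs in the
# training data.
def avg_history(train_tokens, test_tokens, order=5):
    seen = set()
    for n in range(1, order + 1):                      # collect training n-grams
        for i in range(len(train_tokens) - n + 1):
            seen.add(tuple(train_tokens[i:i + n]))
    total = 0
    for i in range(len(test_tokens)):
        h = 0
        for n in range(1, order):                      # history bounded by order-1
            if i - n < 0:
                break
            if tuple(test_tokens[i - n:i + 1]) in seen:
                h = n                                  # this (n+1)-gram was seen
            else:
                break                                  # back-off would occur here
        total += h
    return total / len(test_tokens)

train = "we focus on the subproblem of target language modeling".split()
test = "we focus on language modeling".split()
print(avg_history(train, test))  # -> 0.8
```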
two corpora (Table 3). The very first forms, as ranked by frequency, are quite similar in the two corpora. However, there are important exceptions: the pronouns I and you are among the top 20 frequent forms in the TED, while in the NEWS they are ranked only 40th and 64th respectively. Other interesting cases are the words actually, stuff, guy and amazing, all ranked about 10 times higher in the TED than in the NEWS corpus.

We can also analyze the most typical ways to start and end a sentence in the two text collections. As shown in Table 1, the frequency ranking of sentence-initial and sentence-final 5-grams in the in-domain corpus is notably different from the out-of-domain one. TED's most frequent sentence-initial 5-gram [s] Thank you . [/s] is not at all attested in the NEWS corpus. The 4th most common sentence start, [s] And I said ,, is only ranked 8199th in the NEWS, and so on. Notably, the top-ranked NEWS 5-grams include names of cities (Washington, New York) and of news agencies (AP, Reuters). As regards sentence endings, we observe similar contrasts: for instance, the word sequence and so on . [/s] is ranked 4th in the TED and 157th in the NEWS, while , you know . [/s] is 5th in the TED and only 1652nd in the NEWS.

These figures confirm that the talks have a specific language style, remarkably different from that of the written news genre. In summary, talks are characterized by a massive use of first and second persons, by shorter sentences, and by more colloquial lexical and syntactic constructions.

3 Related Work

The brittleness of n-gram LMs in case of a mismatch between training and task data is a well-known issue (Rosenfeld, 2000). So-called domain adaptation methods (Bellegarda, 2004) can improve the situation once a limited amount of task-specific data becomes available. Ideally, domain-adaptive LMs aim to improve model robustness under changing conditions, involving possible variations in vocabulary, syntax, content, and style. Most of the known LM adaptation techniques (Bellegarda, 2004), however, address all these variations in a holistic way. A possible reason for this is that LM adaptation methods were originally developed under the automatic speech recognition framework, which typically assumes the presence of one single LM. The progressive adoption of the log-linear modeling framework in many NLP tasks has recently introduced the use of multiple LM components (features), which make it possible to naturally factor out and integrate different aspects of language into one model. In SMT, the factored model (Koehn and Hoang, 2007), for instance, makes it possible to better tailor the LM to the task syntax, by complementing word-based n-grams with a part-of-speech (POS) LM, which can be estimated even on a limited amount of task-specific data. Besides the many works addressing holistic LM domain adaptation for SMT, e.g. Foster and Kuhn (2007), methods were recently proposed to explicitly adapt the LM to the discourse topic of a talk (Ruiz and Federico, 2011). Our work makes another step in this direction by investigating hybrid LMs that try to explicitly represent the speaking style of the talk genre. Unlike standard class-based LMs (Brown et al., 1992) or the more recent local LMs (Monz, 2011), which are used to predict sequences of classes or word-class pairs, our hybrid LM is devised to predict sequences of classes interleaved by words. While we do not claim any technical novelty in the model itself, to our knowledge a deep investigation of hybrid LMs for the sake of style adaptation is definitely new. Finally, the term hybrid LM was inspired by Yazgan and Saraclar (2004), who used this name for an LM predicting sequences of words and sub-word units, devised to let a speech recognizer detect out-of-vocabulary words.

4 Hybrid Language Model

Hybrid LMs are n-gram models trained on a mixed text representation where each word is either mapped to a class or left as is. This choice is made according to a measure of word commonness and is univocal for each word type.

The rationale is to discard topic-specific words, while preserving those words that best characterize the language style (note that word frequency is computed on the in-domain corpus only). Mapping non-frequent terms to classes naturally leads to a shorter tail in the frequency distribution, as visualized by Figure 1. A model trained on such data has a better n-gram coverage of the test set and may take advantage of a larger context when scoring translation hypotheses.

As classes, we use deterministically assigned POS tags, obtained by first tagging the data with …
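The hybrid mapping can be sketched as follows. The frequency threshold, the hard-coded POS tags, and the coverage criterion are illustrative assumptions for this sketch; the paper keeps the word types covering a fraction WP of running words and obtains POS tags from Tree Tagger.

```python
# Sketch of a hybrid text mapping: frequent words are kept, the rest are
# replaced by a POS class.  Tags here are hard-coded; a real system would
# run a POS tagger over the data first.
from collections import Counter

def build_keep_set(tokens, wp=0.25):
    """Smallest set of most frequent word types covering >= wp of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    keep, covered = set(), 0
    for word, c in counts.most_common():
        if covered / total >= wp:
            break
        keep.add(word)
        covered += c
    return keep

def hybridize(tagged_sentence, keep):
    """tagged_sentence: list of (word, pos) pairs, e.g. from a POS tagger."""
    return [w if w in keep else pos for w, pos in tagged_sentence]

corpus = "now you laugh but that quote has kind of a sting to it".split()
keep = build_keep_set(corpus * 3 + ["you", "you", "that"], wp=0.25)
sent = [("you", "PP"), ("laugh", "VB"), ("that", "DT"), ("quote", "NN")]
print(hybridize(sent, keep))
```

The choice of WP directly trades coverage against informativeness: raising it keeps more surface words (a more discriminative LM), lowering it maps more words to classes (better n-gram coverage).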
[Figure 1: the word frequency distribution of the TED corpus before and after hybrid mapping; mapping non-frequent terms to classes shortens the tail]
appear in the processed text, thus introducing a further level of abstraction from the original text. Here follows a TED sentence in its original version (first line) and after three different hybrid mappings, namely WP=.25, WP=.25 with lemma forms, and WP=.50:

  Now you laugh, but that quote has kind of a sting to it, right.
  Now you VB , but that NN has kind of a NN to it, right.
  Now you VB , but that NN have kind of a NN to it, right.
  RB you VB , CC that NN VBZ NN of a NN to it, RB .

  Hybrid 10g LM        |V|    POS-Err  h10g
  all words            51299  0.0%     1.7
  all lemmas           38486  0.0%     1.9
  .25 POS/words        475    1.9%     2.7
  .50 POS/words        93     4.1%     3.5
  .75 POS/words        50     5.7%     4.1
  all POS              43     6.6%     4.4
  .25 POS/lemmas       302    1.8%     2.8
  .25 POS/words (fdf)  301    1.9%     2.7
Table 5: Comparison of LMs obtained from different hybrid mappings of the English TED corpus: vocabulary size, POS error rate, and average word history on IWSLT tst2010's reference translations.
5 Evaluation

In this section we perform an intrinsic evaluation of the proposed LM technique, then we measure its impact on translation quality when integrated into a state-of-the-art phrase-based SMT system.

5.1 Intrinsic evaluation

We analyze here a set of hybrid LMs trained on the English TED corpus by varying the ratio of POS-mapped words and the word representation technique (word vs lemma). All models were trained with the IRSTLM toolkit (Federico et al., 2008), using a very high n-gram order (10) and Witten-Bell smoothing.

First, we estimate an upper bound of the POS tagging errors introduced by deterministic tagging. To this end, the hybridly mapped data is compared with the actual output of Tree Tagger on the TED training corpus (see Table 5). Naturally, the impact of tagging errors correlates with the ratio of POS-mapped tokens, as no error is counted on non-mapped tokens. For instance, we note that the POS error rate is only 1.9% in our primary setting, WP=.25 with word representation, whereas on a fully POS-mapped text it is 6.6%. Note that the English tag set used by Tree Tagger includes 43 classes.

Now we focus on the main goal of hybrid text representation, namely increasing the coverage of the in-domain LM on the test data. Here too, we measure coverage by the average length of the word history h used to score the test reference translations (see Section 2). We do not provide perplexity figures, since these are not directly comparable across models with different vocabularies. As shown by Table 5, n-gram coverage increases with the ratio of POS-mapped tokens, ranging from 1.7 on an all-words LM to 4.4 on an all-POS LM. Of course, the more words are mapped, the less discriminative our model will be. Thus, choosing the best hybrid mapping means finding the best trade-off between coverage and informativeness.

We also applied the hybrid LM to the French language, again using Tree Tagger to create the POS mapping. The tag set in this case comprises 34 classes and the POS error rate with WP=.25 is 1.2% (compare with 1.9% in English). As previously discussed, morphology has a notable effect on the modeling of French. In fact, the vocabulary reduction obtained by mapping all the words to their most probable lemma is -45% (57959 to 31908 types in the TED corpus), while in English it is only -25%.

5.2 SMT baseline

Our SMT experiments address the translation of TED talks from Arabic to English and from English to French. The training and test datasets were provided by the organizers of the IWSLT11 evaluation, and are summarized in Table 6. Marked in bold are the corpora used for hybrid LM training. Dev and test sets have a single reference translation.

For both language pairs, we set up competitive phrase-based systems(6) using the Moses toolkit (Koehn et al., 2007). The decoder features a statistical log-linear model including a phrase translation model and a phrase reordering model (Tillmann, 2004; Koehn et al., 2005), two word-based language models, distortion, word and phrase penalties. The translation and reordering models are obtained by combining models independently trained on the available parallel …

(6) The SMT systems used in this paper are thoroughly described in (Ruiz et al., 2011).
443
Corpus |S| |W | ` translation models, while the English-French sys-
TED 90K 1.7M 18.9 tem uses lowercased models and a standard re-
AR-EN
UN 7.9M 220M 27.8
casing post-process.
TED 124K 2.4M 19.5
EN
NEWS 30.7M 782M 25.4
Feature weights are tuned on dev2010 by
dev2010 934 19K 20.0 means of a minimum error training procedure
AR test (MERT) (Och, 2003). Following suggestions by
tst2010 1664 30K 18.1
TED 105K 2.0M 19.5 Clark et al. (2011) and Cettolo et al. (2011) on
EN-FR UN 11M 291M 26.5 controlling optimizer instability, we run MERT
NEWS 111K 3.1M 27.6 four times on the same configuration and use the
FR
TED 107K 2.2M 20.6 average of the resulting weights to evaluate trans-
NEWS 11.6M 291M 25.2 lation performance.
dev2010 934 20K 21.5
EN test
tst2010 1664 32K 19.1 5.3 Hybrid LM integration
As previously stated, hybrid LMs are trained only
Table 6: IWSLT11 training and test data statistics:
number of sentences |S|, number of tokens |W | and on in-domain data and are added to the log-linear
average sentence length `. Token numbers are com- decoder as an additional target LM. To this end,
puted on the target language, except for the test sets. we use the class-based LM implementation pro-
vided in Moses and IRSTLM, which applies the
word-to-class mapping to translation hypotheses
lel corpora: namely TED and NEWS for Arabic- before LM querying8 . The order of the additional
English; TED, NEWS and UN for English- LM is set to 10 in the Arabic-English evaluation
French. To this end we applied the fill-up method and 7 in the English-French, as these appeared to
(Nakov, 2008; Bisazza et al., 2011) in which out- be the best settings in preliminary tests.
of-domain phrase tables are merged with the in- Translation quality is measured by BLEU (Pa-
domain table by adding only new phrase pairs. pineni et al., 2002), METEOR (Banerjee and
Out-of-domain phrases are marked with a binary Lavie, 2005) and TER (Snover et al., 2006)9 . To
feature whose weight is tuned together with the test whether differences among systems are statis-
SMT system weights. tically significant we use approximate randomiza-
For each target language, two standard 5-gram tion as done in (Riezler and Maxwell, 2005)10 .
LMs are trained separately on the monolingual
TED and NEWS datasets, and log-linearly com- Model variants. The effect on MT quality of
bined at decoding time. In the Arabic-English various hybrid LM variants is shown in Table 7.
task, we use a hierarchical reordering model (Gal- Note that allPOS and allLemmas refer to deter-
ley and Manning, 2008; Hardmeier et al., 2011), ministically assigned POS tags and lemmas, re-
while in the English-French task we use a default spectively. Concerning the ratio of POS-mapped
word-based bidirectional model. The distortion tokens, the best performing values are WP =.25 in
limit is set to the default value of 6. Note that Arabic-English and WP =.50 in English-French.
the use of large n-gram LMs and of lexicalized These hybrid mappings outperform all the uni-
reordering models was shown to wipe out the im- form representations (words, lemmas and POS)
provement achievable by POS-level LM (Kirch- with statistically significant BLEU and METEOR
hoff and Yang, 2005; Birch et al., 2007). improvements.
Concerning data preprocessing we apply stan- The fdf experiment involves the use of doc-
dard tokenization to the English and French text, ument frequency for the selection of common
while for Arabic we use an in-house tokenizer that words. Its performance is very close to that of hy-
removes diacritics and normalizes special charac- 8
Detailed instructions on how to build and use hybrid
ters and digits. Arabic text is then segmented with LMs can be found at http://hlt.fbk.eu/people/bisazza.
AMIRA (Diab et al., 2004) according to the ATB 9
We use case-sensitive BLEU and TER, but case-
scheme7 . The Arabic-English system uses cased insensitive METEOR to enable the use of paraphrase tables
distributed with the tool (version 1.3).
7 10
The Arabic Treebank tokenization scheme isolates con- Translation scores and significance tests were com-
junctions w+ and f+, prepositions l+, k+, b+, future marker puted with the Multeval toolkit (Clark et al., 2011):
s+, pronominal suffixes, but not the article Al+. https://github.com/jhclark/multeval.
444
(a) Arabic to English, IWSLTtst2010

Added InDomain 10gLM        BLEU  MET   TER
.00 POS/words (all words)   26.1  30.5  55.4
.00 POS/lemmas (all lem.)   26.0  30.5  55.4
1.0 POS/words (all POS)     25.9  30.6  55.3
.25 POS/words               26.5  30.6  54.7
.50 POS/words               26.5  30.6  54.9
.75 POS/words               26.3  30.7  55.0
.25 POS/words (fdf)         26.5  30.7  54.7
.25 POS/lemmaF              26.4  30.6  54.8
.25 POS/lemmas              26.5  30.8  54.6

(b) English to French, IWSLTtst2010

Added InDomain 7gLM         BLEU  MET   TER
.00 POS/words (all words)   31.1  52.5  49.9
.00 POS/lemmas (all lem.)   31.2  52.6  49.7
1.0 POS/words (all POS)     31.4  52.8  49.8
.25 POS/lemmas              31.5  52.9  49.7
.50 POS/lemmas              31.9  53.3  49.5
.75 POS/lemmas              31.7  53.2  49.6
.50 POS/lemmas (fdf)        31.9  53.3  49.5
.50 POS/lemmaF              31.6  53.0  49.6
.50 POS/words               31.7  53.1  49.5

Table 7: Comparison of various hybrid LM variants. Translation quality is measured with BLEU, METEOR and TER (all in percentage form). The settings used for weight tuning are marked. Best models according to all metrics are highlighted in bold.
brid LMs simply based on term frequency; only METEOR gains 0.1 points in Arabic-English. A possible reason for this is that document frequency was computed on fixed-size text chunks rather than on real document boundaries (see Section 4.1). The lemmaF experiment refers to the use of canonical forms for frequency measuring: this technique does not seem to help in either language pair. Finally, we compare the use of lemmas versus surface forms to represent common words. As expected, lemmas appear to be helpful for French language modeling. Interestingly, this is also the case for English, even if by a small margin (+0.2 METEOR, -0.1 TER).

Summing up, hybrid mapping appears as a winning strategy compared to uniform mapping. Although differences among LM variants are small, the best model in Arabic-English is .25-POS/lemmas, which can be thought of as a domain-generic lemma-level LM. In English-French, instead, the highest scores are achieved by .50-POS/lemmas or .50-POS/lemmas(fdf), that is, a POS-level LM with few frequently occurring lexical anchors (vocabulary size 59). An interpretation of this result is that, for French, modeling the syntax is more helpful than modeling the style. We also suspect that the French TED corpus is more irregular and diverse with respect to the style than its English counterpart. In fact, while the English corpus includes transcripts of talks given by English speakers, the French one is mostly a collection of (human) translations. Typical features of the speech style may have been lost in this process.

Comparison with baseline. In Table 8 the best performing hybrid LM is compared against the baseline that only includes the standard LMs described in Section 5.2. To complete our evaluation, we also report the effect of an in-domain LM trained on 50 word classes induced from the corpus by maximum-likelihood based clustering (Och, 1999).

In the two language pairs, both types of LM result in consistent improvements over the baseline. However, the gains achieved by the hybrid approach are larger and all statistically significant. The hybrid approach is significantly better than the unsupervised one by TER in Arabic-English and by BLEU and METEOR in English-French (these significances are not reported in the table for clarity). The proposed method appears to better leverage the available in-domain data, achieving improvements according to all metrics: +0.5/+0.4/-1.0 BLEU/METEOR/TER in Arabic-English and +0.7/+0.6/-0.3 in English-French, without requiring any bitext annotation or decoder modification.

(a) Arabic to English, IWSLTtst2010

Added InDomain 10g LM   BLEU        MET         TER
none (baseline)         26.0        30.4        55.6
unsup. classes          26.4        30.8        55.1
hybrid                  26.5 (+.5)  30.8 (+.4)  54.6 (-1.0)

(b) English to French, IWSLTtst2010

Added InDomain 7g LM    BLEU        MET         TER
none (baseline)         31.2        52.7        49.8
unsup. classes          31.5        52.9        49.6
hybrid                  31.9 (+.7)  53.3 (+.6)  49.5 (-.3)

Table 8: Final MT results: baseline vs unsupervised word-classes-based LM and best hybrid LM. Statistically significant improvements over the baseline are marked at the p < .01 and p < .05 levels.

Talk-level analysis. To conclude the study, we analyze the effect of our best hybrid LM on Arabic-English translation quality at the single talk level. The test set used in the experiments (tst2010) consists of 11 transcripts with an average length of 151 sentences. For each talk, we compare the baseline BLEU score with that obtained by adding a .25-POS/lemmas hybrid LM. Results are presented in Figure 2. The dark and light columns denote baseline and hybrid-LM BLEU scores, respectively, and refer to the left y-axis. Additional data points, plotted on the right y-axis in reverse order, represent talk-level perplexities (PP) of a standard 5-gram LM trained on TED and those of the .25-POS/lemmas 10-gram hybrid LM, computed on reference translations.

What emerges first is a dramatic variation of performance among the speeches, with baseline BLEU scores ranging from 33.95 on talk 00 to only 12.42 on talk 02. The latter talk appears as a corner case also according to perplexities (397 by word LM and 111 by hybrid LM). Notably, the perplexities of the two LMs correlate well with each other, but the hybrid's PP is much more stable across talks: its standard deviation is only 14 points, while that of the word-based PP is 79. The BLEU improvement given by hybrid LM, however modest, is consistent across the talks, with only two outliers: a drop of -0.2 on talk 00, and a drop of -0.7 on talk 02. The largest gain (+1.1) is observed on talk 10, from 16.8 to 17.9 BLEU.

Figure 2: Talk-level evaluation on Arabic-English (IWSLT-tst2010). Left y-axis: BLEU impact of a .25-POS/lemma hybrid LM. Right y-axis: perplexities by word LM and by hybrid LM.

6 Conclusions

We have proposed a language modeling technique that leverages the in-domain data for SMT style adaptation. Trained to predict mixed sequences of POS classes and frequent words, hybrid LMs are devised to capture typical lexical and syntactic constructions that characterize the style of speech transcripts.

Compared to standard language models, hybrid LMs generalize better to the test data and partially compensate for the disproportion between in-domain and out-of-domain training data. At the same time, hybrid LMs show more discriminative power than merely POS-level LMs. The integration of hybrid LMs into a competitive phrase-based SMT system is straightforward and leads to consistent improvements on the TED task, according to three different translation quality metrics.

Target language modeling is only one aspect of the statistical translation problem. Now that the usability of the proposed method has been assessed for language modeling, future work will address the extension of the idea to the modeling of phrase translation and reordering.

Acknowledgments

This work was supported by the T4ME network of excellence (IST-249119), funded by the DG INFSO of the European Commission through the 7th Framework Programme. We thank the anonymous reviewers for their valuable suggestions.

References

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65-72, Ann Arbor, Michigan, June. Association for Computational Linguistics.
Jerome R. Bellegarda. 2004. Statistical language model adaptation: review and perspectives. Speech Communication, 42(1):93-108.

Alexandra Birch, Miles Osborne, and Philipp Koehn. 2007. CCG supertags in factored statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 9-16, Prague, Czech Republic, June. Association for Computational Linguistics.

Arianna Bisazza, Nick Ruiz, and Marcello Federico. 2011. Fill-up versus Interpolation Methods for Phrase-based SMT Adaptation. In International Workshop on Spoken Language Translation (IWSLT), San Francisco, CA.

P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479.

Mauro Cettolo, Nicola Bertoldi, and Marcello Federico. 2011. Methods for smoothing the optimizer instability in SMT. In MT Summit XIII: the Thirteenth Machine Translation Summit, pages 32-39, Xiamen, China.

Jonathan Clark, Chris Dyer, Alon Lavie, and Noah Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the Association for Computational Linguistics, ACL 2011, Portland, Oregon, USA. Association for Computational Linguistics.

Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks. In Daniel Marcu, Susan Dumais, and Salim Roukos, editors, HLT-NAACL 2004: Short Papers, pages 149-152, Boston, Massachusetts, USA, May 2 - May 7. Association for Computational Linguistics.

Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an Open Source Toolkit for Handling Large Scale Language Models. In Proceedings of Interspeech, pages 1618-1621, Melbourne, Australia.

Marcello Federico, Luisa Bentivogli, Michael Paul, and Sebastian Stuker. 2011. Overview of the IWSLT 2011 Evaluation Campaign. In International Workshop on Spoken Language Translation (IWSLT), San Francisco, CA.

George Foster and Roland Kuhn. 2007. Mixture-model adaptation for SMT. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 128-135, Prague, Czech Republic, June. Association for Computational Linguistics.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 848-856, Morristown, NJ, USA. Association for Computational Linguistics.

Christian Hardmeier, Jorg Tiedemann, Markus Saers, Marcello Federico, and Mathur Prashant. 2011. The Uppsala-FBK systems at WMT 2011. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 372-378, Edinburgh, Scotland, July. Association for Computational Linguistics.

Katrin Kirchhoff and Mei Yang. 2005. Improved language modeling for statistical machine translation. In Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 125-128, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 868-876, Prague, Czech Republic, June. Association for Computational Linguistics.

Philipp Koehn, Amittai Axelrod, Alexandra Birch Mayne, Chris Callison-Burch, Miles Osborne, and David Talbot. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proc. of the International Workshop on Spoken Language Translation, October.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic.

Christof Monz. 2011. Statistical Machine Translation with Local Language Models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 869-879, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

Preslav Nakov. 2008. Improving English-Spanish Statistical Machine Translation: Experiments in Domain Adaptation, Sentence Paraphrasing, Tokenization, and Recasing. In Workshop on Statistical Machine Translation, Association for Computational Linguistics.

Franz Josef Och. 1999. An efficient method for determining bilingual word classes. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 71-76.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Erhard Hinrichs and Dan Roth, editors, Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160-167.
Kishore Papineni, Salim Roukos, Todd Ward, and
Wei-Jing Zhu. 2002. BLEU: a method for auto-
matic evaluation of machine translation. In Pro-
ceedings of the 40th Annual Meeting of the Asso-
ciation of Computational Linguistics (ACL), pages
311-318, Philadelphia, PA.
Stefan Riezler and John T. Maxwell. 2005. On some
pitfalls in automatic evaluation and significance
testing for MT. In Proceedings of the ACL Work-
shop on Intrinsic and Extrinsic Evaluation Mea-
sures for Machine Translation and/or Summariza-
tion, pages 57-64, Ann Arbor, Michigan, June. As-
sociation for Computational Linguistics.
R. Rosenfeld. 2000. Two decades of statistical lan-
guage modeling: where do we go from here? Pro-
ceedings of the IEEE, 88(8):1270-1278.
Nick Ruiz and Marcello Federico. 2011. Topic adap-
tation for lecture translation through bilingual la-
tent semantic models. In Proceedings of the Sixth
Workshop on Statistical Machine Translation, pages
294-302, Edinburgh, Scotland, July. Association
for Computational Linguistics.
Nick Ruiz, Arianna Bisazza, Fabio Brugnara, Daniele
Falavigna, Diego Giuliani, Suhel Jaber, Roberto
Gretter, and Marcello Federico. 2011. FBK @
IWSLT 2011. In International Workshop on Spo-
ken Language Translation (IWSLT), San Francisco,
CA.
Helmut Schmid. 1994. Probabilistic part-of-speech
tagging using decision trees. In Proceedings of In-
ternational Conference on New Methods in Lan-
guage Processing.
Matthew Snover, Bonnie Dorr, Rich Schwartz, Linnea
Micciulla, and John Makhoul. 2006. A study of
translation edit rate with targeted human annotation.
In 5th Conference of the Association for Machine
Translation in the Americas (AMTA), Boston, Mas-
sachusetts, August.
Christoph Tillmann. 2004. A Unigram Orientation
Model for Statistical Machine Translation. In Pro-
ceedings of the Joint Conference on Human Lan-
guage Technologies and the Annual Meeting of the
North American Chapter of the Association of Com-
putational Linguistics (HLT-NAACL).
A. Yazgan and M. Saraclar. 2004. Hybrid language
models for out of vocabulary word detection in large
vocabulary conversational speech recognition. In
Proceedings of ICASSP, volume 1, pages I-745-748, May.
Detecting Highly Confident Word Translations from Comparable Corpora without Any Prior Knowledge

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 449-459, Avignon, France, April 23-27 2012. (c) 2012 Association for Computational Linguistics
since the construction of their lexicon is language-pair biased and cannot be completely employed on distant languages. It solely relies on unsatisfactory language-pair independent cross-language clues such as words shared across languages.

Recent work from Vulic et al. (2011) utilized the distributional hypothesis in a different direction. It attempts to abrogate the need of a seed lexicon as a prerequisite for bilingual lexicon extraction. They train a cross-language topic model on document-aligned comparable corpora and introduce different methods for identifying word translations across languages, underpinned by per-topic word distributions from the trained topic model. Due to the fact that they deal with comparable Wikipedia data, their translation model contains a lot of noise, and some words are poorly translated simply because there are not enough occurrences in the corpus. The goal of this work is to design an algorithm which will learn to harvest only the most probable translations from the per-word topic distributions. The translations learned by the algorithm then might serve as a highly accurate, precision-based initial seed lexicon, which can then be used as a tool for translating source word vectors into the target language. The key advantage of such a lexicon lies in the fact that there is no language-pair dependent prior knowledge involved in its construction (e.g., orthographic features). Hence, it is completely applicable to any language pair for which there exist sufficient comparable data for training of the topic model.

Since comparable corpora often constitute a very noisy environment, it is of the utmost importance for a precision-oriented algorithm to learn when to stop the process of matching words, and which candidate pairs are surely not translations of each other. The method described in this paper follows this intuition: while extracting a bilingual lexicon, we try to rematch words, keeping only the most confident candidate pairs and disregarding all the others. After that step, the most confident candidate pairs might be used with some of the existing context-based techniques to find translations for the words discarded in the previous step. The algorithm is based on: (1) the assumption of symmetry, and (2) the one-to-one constraint. The idea of symmetrization has been borrowed from the symmetrization heuristics introduced for word alignments in SMT (Och and Ney, 2003), where the intersection heuristic is employed for a precision-oriented algorithm. In our setting, it basically means that we keep a translation pair (wiS, wjT) if and only if, after the symmetrization process, the top translation candidate for the source word wiS is the target word wjT and vice versa. The one-to-one constraint aims at matching the most confident candidates during the early stages of the algorithm, and then excluding them from further search. The utility of the constraint for parallel corpora has already been evaluated by Melamed (2000).

The remainder of the paper is structured as follows. Section 2 gives a brief overview of the methods, relying on per-topic word distributions, which serve as the tool for computing cross-language similarity between words. In Section 3, we motivate the main assumptions of the algorithm and describe the full algorithm. Section 4 justifies the underlying assumptions of the algorithm by providing comparisons with a current state-of-the-art system for the Italian-English and Dutch-English language pairs. It also contains another set of experiments which investigates the potential of the algorithm in building a language-pair unbiased seed lexicon, and compares the lexicon with other seed lexicons. Finally, Section 5 lists conclusions and possible paths of future work.

2 Calculating Initial Cross-Language Word Similarity

This section gives a quick overview of the Cue method, the TI method, and their combination, described by Vulic et al. (2011), which proved to be the most efficient and accurate for identifying potential word translations once the cross-language BiLDA topic model is trained and the associated per-topic distributions are obtained for both source and target corpora. The BiLDA model we use is a natural extension of the standard LDA model and, along with the definition of per-topic word distributions, has been presented in (Ni et al., 2009; De Smet and Moens, 2009; Mimno et al., 2009). BiLDA takes advantage of the document alignment by using a single variable θ that contains the topic distribution. This variable is language-independent, because it is shared by each of the paired bilingual comparable documents. Topics for each document are sampled from θ, from which the words are then sampled in conjunction with the vocabulary distributions
Figure 1: The bilingual Latent Dirichlet Allocation (BiLDA) plate model.

φ (for language S) and ψ (for language T).

2.1 Cue Method

A straightforward approach to express similarity between words tries to emphasize the associative relation in a natural way, modeling the probability P(w2T|w1S), i.e. the probability that a target word w2T will be generated as a response to a cue source word w1S, where the link between the words is established via the shared topic space:

P(w2T|w1S) = Σ_{k=1}^{K} P(w2T|zk) P(zk|w1S),

where K denotes the number of cross-language topics.

2.2 TI Method

This approach constructs word vectors over a shared space of cross-language topics, where values within vectors are the TF-ITF scores (term frequency - inverse topic frequency), computed in a completely analogous manner as the TF-IDF scores for the original word-document space (Manning and Schutze, 1999). Term frequency, given a source word wiS and a topic zk, measures the importance of the word wiS within the particular topic zk, while inverse topical frequency (ITF) of the word wiS measures the general importance of the source word wiS across all topics. The final TF-ITF score for the source word wiS and the topic zk is given by TF-ITF_{i,k} = TF_{i,k} * ITF_i. The TF-ITF scores for target words associated with target topics are calculated in an analogous manner, and the standard cosine similarity is then used to find the most similar target word vectors for a given source word vector.

2.3 Combining the Methods

Topic models have the ability to build clusters of words which might not always co-occur together in the same textual units and therefore add extra information of potential relatedness. These two methods for automatic bilingual lexicon extraction interpret and exploit the underlying per-topic word distributions in different ways, so combining the two should lead to even better results. The two methods are linearly combined, with the overall score given by:

Sim_{TI+Cue}(w1S, w2T) = λ Sim_{TI}(w1S, w2T) + (1 - λ) Sim_{Cue}(w1S, w2T)   (1)

Both methods possess several desirable properties. According to Griffiths et al. (2007), the conditioning for the Cue method automatically compromises between word frequency and semantic relatedness, since higher frequency words tend to have higher probability across all topics, but the distribution over topics P(zk|w1S) ensures that semantically related topics dominate the sum. A similar phenomenon is captured by the TI method by the usage of TF, which rewards high frequency words, and ITF, which assigns a higher importance to words semantically more related to a specific topic. These properties are incorporated in the combination of the methods. As the final result, the combined method provides, for each source word, a ranked list of target words with associated scores that measure the strength of cross-language similarity. The higher the score, the more confident a translation pair is. We will use this observation in the next section during the algorithm construction.

The lexicon constructed by solely applying the combination of these methods without any additional assumptions will serve as a baseline in the results section.

3 Constructing the Algorithm

This section explains the underlying assumptions of the algorithm: the assumption of symmetry and the one-to-one assumption. Finally, it provides the complete outline of the algorithm.

3.1 Assumption of Symmetry

First, we start with the intuition that the assumption of symmetry strengthens the confidence of a translation pair. In other words, if the most probable translation candidate for a source word w1S is a target word w2T and, vice versa, the most probable translation candidate of the target word w2T
is the source word w1S, and their TI+Cue scores are above a certain threshold, we can claim that the words w1S and w2T are a translation pair. The definition of the symmetric relation can also be relaxed. Instead of observing only one top candidate from the lists, we can observe the top N candidates from both sides and include them in the search space, and then re-rank the potential candidates taking into account their associated TI+Cue scores and their respective positions in the list. We will call N the search space depth. Here is the outline of the re-ranking method if the search space consists of the top N candidates on both sides:

1. Given is a source word wsS, for which we actually want to find the most probable translation candidate. Initialize an empty list Finals = {} in which target language candidates with their recalculated associated scores will be stored.

2. Obtain TI+Cue scores for all target words. Keep only the N best scoring target candidates {ws,1T, ..., ws,NT} along with their respective scores.

3. For each target candidate from {ws,1T, ..., ws,NT}, acquire TI+Cue scores over the entire source vocabulary. Keep only the N best scoring source language candidates. Each word ws,iT in {ws,1T, ..., ws,NT} now has a list of N source language candidates associated with it: {wi,1S, wi,2S, ..., wi,NS}.

4. For each target candidate word ws,iT in {ws,1T, ..., ws,NT}, do as follows:

(a) If one of the words from the associated list is the given source word wsS, remember: (1) the position m, denoting how high in the list the word wsS was found, and (2) the associated TI+Cue score Sim_{TI+Cue}(ws,iT, wi,mS = wsS). Calculate:
(i) G1,i = Sim_{TI+Cue}(wsS, ws,iT) / i
(ii) G2,i = Sim_{TI+Cue}(ws,iT, wi,mS) / m
Following that, calculate GMi, the geometric mean of the values G1,i and G2,i: GMi = sqrt(G1,i * G2,i). [1] Add the tuple (ws,iT, GMi) to the list Finals.

(b) If we have reached the end of the list for the target candidate word ws,iT without finding the given source word wsS, and i < N, continue with the next word ws,i+1T. Do not add any tuple to Finals in this step.

5. If the list Finals is not empty, sort the tuples in the list in descending order according to their GMi scores. The first element of the sorted list contains a word ws,highT, the final translation candidate of the source word wsS. If the list Finals is not empty, the final result of this process will be the cross-language word translation pair (wsS, ws,highT).

We will call this symmetrization process the symmetrizing re-ranking. It attempts at pushing the correct cross-language synonym to the top of the candidates list, taking into account both the strength of similarities defined through the TI+Cue scores in both directions, and positions in ranked lists. A blatant example depicting how this process helps boost precision is presented in Figure 2. We can also design a thresholded variant of this procedure by imposing an extra constraint. When calculating target language candidates for the source word wsS in Step 2, we proceed further only if the first target candidate scores above a certain threshold P and, additionally, in Step 3, we keep lists of N source language candidates for only those target words for which the first source language candidate in their respective list scored above the same threshold P. We will call this procedure the thresholded symmetrizing re-ranking, and this version will be employed in the final algorithm.

3.2 One-to-one Assumption

Melamed (2000) has already established that most source words in parallel corpora tend to translate to only one target word. That tendency is modeled by the one-to-one assumption, which constrains each source word to have at most one translation on the target side. Melamed's paper reports that this bias leads to a significant positive impact on precision and recall of bilingual lexicon extraction from parallel corpora. This assumption should

[1] Scores G1,i and G2,i are structured in such a way to
balance between positions in the ranked lists and the TI+Cue
scores, since they reward candidate words which have high
also be reasonable for many types of comparable
TI+Cue scores associated with them, and penalize words if corpora such as Wikipedia or news corpora, which
they are found lower in the list of potential candidates. are topically aligned or cover similar themes. We
452
klooster
0.3049
0.1740
monastery monnik
0.1338
benedictijn
0.2237
klooster
0.2266
0.1586 0.1494
abdij monk monnik
0.1131
abdij
0.1155
abdij
0.2549
0.1496
abbey monnik
0.1288
klooster
Figure 2: An example where the assumption of symmetry and the one-to-one assumption clearly help boost
precision. If we keep top Nc = 3 candidates from both sides, the algorithm is able to detect that the correct
Dutch-English translation pair is (abdij, abbey). The TI+Cue method without any assumptions would result with
an indirect association (abdij, monastery). If only the one-to-one assumption was present, the algorithm would
greedily learn the correct direct association (monastery, klooster), remove those words from their respective
vocabularies and then again result with another indirect association (abdij, monk). By additionally employing
the assumption of symmetry with the re-ranking method from Subsection 3.1, the algorithm correctly learns
the translation pair (abdij, abbey). Correct translation pairs (klooster, monastery) and (monnik, monk) are also
obtained. Again here, the pair (monnik, monk) would not be obtained without the one-to-one assumption.
will prove that the assumption leads to better pre- cally very close, and therefore have similar distri-
cision scores even for bilingual lexicon extraction butions over cross-language topics, but island is a
from such comparable data. The intuition be- much more frequent term. The TI+Cue method
hind introducing this constraint is fairly simple. concludes that two words are potential trans-
Without the assumption, the similarity scores be- lations whenever their distributions over cross-
tween source and target words are calculated in- language topics are much more similar than ex-
dependently of each other. We will illustrate the pected by chance. Moreover, it gives a preference
problem arising from the independence assump- to more frequent candidates, so it will eventually
tion with an example. end up learning an indirect association2 between
Suppose we have an Italian word arcipelago, words arcipelago and island. The one-to-one as-
and we would like to detect its correct English sumption should mitigate the problem of such in-
translation (archipelago). However, after the direct associations if we design our algorithm in
TI+Cue method is employed, and even after the such a way that it learns the most confident direct
symmetrizing re-ranking process from the previ- associations2 first:
ous step is used, we still acquire a wrong transla- 2
A direct association, as defined in (Melamed, 2000), is
tion candidate pair (arcipelago, island). Why is an association between two words (in this setting found by
that so? The word (arcipelago) (or its translation) the TI+Cue method) where the two words are indeed mutual
and the acquired translation (island) are semanti- translations. Otherwise, it is an indirect association.
453
1. Learn the correct direct association pair (isola, island).

2. Remove the words isola and island from their respective vocabularies.

3. Since island is not in the vocabulary, the indirect association between arcipelago and island is not present any more. The algorithm learns the correct direct association (arcipelago, archipelago).

…vocabularies: V^S = V^S \ {w_s^S} and V^T = V^T \ {w_{s,high}^T}, to satisfy the one-to-one constraint. Add the pair (w_s^S, w_{s,high}^T) to the lexicon L.

We will name this procedure the one-vocabulary-pass and employ it later in an iterative algorithm with a varying threshold and a varying maximum search space depth.
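The thresholded symmetrizing re-ranking of Subsection 3.1 combined with the one-to-one vocabulary pruning can be sketched in a few lines. This is a minimal illustration under toy assumptions, not the authors' implementation: `sim_st` and `sim_ts` stand in for the directional TI+Cue scoring functions, and for simplicity the pass visits source words in alphabetical order rather than in the confidence order the full iterative algorithm would use.

```python
from math import sqrt

def top_n(sim, word, vocab, n):
    """Rank the candidates in `vocab` for `word` by similarity; keep the N best."""
    return sorted(vocab, key=lambda c: sim(word, c), reverse=True)[:n]

def rerank(sim_st, sim_ts, w_s, v_t, v_s, n, p):
    """Thresholded symmetrizing re-ranking for one source word w_s.
    Returns (best target candidate, GM score) or None."""
    targets = top_n(sim_st, w_s, v_t, n)
    # Step 2 threshold: the first target candidate must score above P.
    if not targets or sim_st(w_s, targets[0]) <= p:
        return None
    final = []                                   # the list Final_s
    for i, w_t in enumerate(targets, start=1):
        back = top_n(sim_ts, w_t, v_s, n)        # step 3: translate back
        # Step 3 threshold: the first source candidate must score above P.
        if not back or sim_ts(w_t, back[0]) <= p:
            continue
        if w_s in back:                          # step 4(a)
            m = back.index(w_s) + 1              # position of w_s in the list
            g1 = sim_st(w_s, w_t) / i
            g2 = sim_ts(w_t, w_s) / m
            final.append((w_t, sqrt(g1 * g2)))   # GM_i, the geometric mean
        # step 4(b): if w_s is not found, add nothing and move on
    return max(final, key=lambda t: t[1]) if final else None   # step 5

def one_vocabulary_pass(sim_st, sim_ts, v_s, v_t, n, p):
    """One pass over the source vocabulary; every learned pair is removed
    from both vocabularies to satisfy the one-to-one constraint."""
    lexicon = {}
    for w_s in sorted(v_s):          # simplification: alphabetical order,
        if not v_t:                  # not the confidence ordering
            break
        hit = rerank(sim_st, sim_ts, w_s, v_t, v_s, n, p)
        if hit is not None:
            w_t, _ = hit
            lexicon[w_s] = w_t
            v_s = v_s - {w_s}
            v_t = v_t - {w_t}
    return lexicon
```

With the one-to-one constraint, once a confident pair is learned both words leave the search space, so a frequent target word can no longer absorb a near-synonymous source word through an indirect association.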
…relaxes them by lowering the threshold and expanding the search space by incrementing the maximum search space depth. The algorithm may leave some of the source words unmatched, which is also dependent on the parameters of the algorithm; due to the one-to-one assumption, that scenario also occurs whenever a target vocabulary contains more words than a source vocabulary. The number of operations of the algorithm also depends on the parameters, but it mostly depends on the sizes of the given vocabularies. The complexity is O(|V^S| |V^T|), but the algorithm is computationally feasible even for large vocabularies.

4 Results and Discussion

4.1 Training Collections

The data used for training the models is collected from various sources and varies strongly in theme, style, length and comparableness. In order to reduce data sparsity, we keep only lemmatized non-proper noun forms.

For the Italian-English language pair, we use 18,898 Wikipedia article pairs to train BiLDA, covering different themes with different scopes and subtopics being addressed. Document alignment is established via interlingual links from the Wikipedia metadata. Our vocabularies consist of 7,160 Italian nouns and 9,116 English nouns.

For the Dutch-English language pair, we use 7,602 Wikipedia article pairs and 6,206 Europarl document pairs, and combine them for training.⁴ Our final vocabularies consist of 15,284 Dutch nouns and 12,715 English nouns.

Unlike, for instance, Wikipedia articles, where document alignment is established via interlingual links, in some cases it is necessary to perform document alignment as the initial step. Since our work focuses on Wikipedia data, we will not go into detail on algorithms for document alignment. An IR-based method for document alignment is given in (Utiyama and Isahara, 2003; Munteanu and Marcu, 2005), and a feature-based method can be found in (Vu et al., 2009).

⁴ In the case of Europarl, we use only the evidence of document alignment during training and do not benefit from the parallelness of the sentences in the corpus.

4.2 Experimental Setup

All our experiments rely on BiLDA training with comparable data. Corpora and software for BiLDA training are obtained from Vulić et al. (2011). We train the BiLDA model with 2000 topics using Gibbs sampling, since that number of topics displays the best performance in their paper. The linear interpolation parameter for the combined TI+Cue method is set to λ = 0.1.

The parameters of the algorithm, adjusted on a set of 500 randomly sampled Italian words, are set to the following values in all experiments, except where noted differently: P_0 = 0.20, P_f = 0.00, dec_P = 0.01, N_0 = 3, and N_f = 10.

The initial ground truth for our source vocabularies has been constructed with the freely available Google Translate tool. The final ground truth for our test sets has been established after we manually revised the list of pairs obtained by Google Translate, deleting incorrect entries and adding additional correct entries. All translation candidates are evaluated against this benchmark lexicon.

4.3 Experiment I: Do Our Assumptions Help Lexicon Extraction?

With this set of experiments, we wanted to test whether both the assumption of symmetry and the one-to-one assumption are useful in improving the precision of the initial TI+Cue lexicon extraction method. We compare three different lexicon extraction algorithms: (1) the basic TI+Cue extraction algorithm (LALG-BASIC), which serves as the baseline algorithm⁵; (2) the algorithm from Section 3, but without the one-to-one assumption (LALG-SYM), meaning that if we find a translation pair, we still keep the words from the translation pair in their respective vocabularies; and (3) the complete algorithm from Section 3 (LALG-ALL). In order to evaluate these lexicon extraction algorithms for both Italian-English and Dutch-English, we have constructed a test set of 650 Italian nouns and a test set of 1000 Dutch nouns of high and medium frequency. Precision scores for both language pairs and for all lexicon extraction algorithms are provided in Table 1.

Based on these results, it is clearly visible that both assumptions our algorithm makes are valid

⁵ We have also tested whether LALG-BASIC outperforms a method modeling direct co-occurrence that uses the cosine to detect similarity between word vectors consisting of TF-IDF scores in the shared document space (Cimiano et al., 2009). Precision using that method is significantly lower, e.g. 0.5538 vs. 0.6708 of LALG-BASIC for Italian-English.
Algorithm    Italian-English  Dutch-English
LALG-BASIC   0.6708           0.6560
LALG-SYM     0.6862           0.6780
LALG-ALL     0.7215           0.7170

Table 1: Precision scores on our test sets for the 3 different lexicon extraction algorithms.

[Figure: precision and F-score as the threshold decreases from 0.2 to 0, with curves IT-EN Precision, IT-EN F-score, NL-EN Precision, and NL-EN F-score; x-axis: Threshold, y-axis: Precision/F-score.]

…Affect Precision?

The next set of experiments aims at exploring how precision scores change while we gradually de…
                 Italian-English                  Dutch-English
Lexicon          # Correct  Precision  F0.5      # Correct  Precision  F0.5
LEX-1                  350     0.8121  0.1876          898     0.8618  0.2308
LEX-2                  766     0.8938  0.3473         1376     0.9011  0.3216
LEX-LALG               782     0.8958  0.3524         1106     0.9559  0.2778
LEX-1+LEX-LALG        1070     0.8785  0.4290         1860     0.9082  0.3961
LEX-R+LEX-LALG        1141     0.9239  0.4548         1507     0.9642  0.3500
LEX-2+LEX-LALG        1429     0.8926  0.5102         2261     0.9217  0.4505

Table 2: A comparison of different lexicons. For lexicons employing our LALG-ALL algorithm, only translation candidates that scored above the threshold P = 0.11 have been kept.
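Table 2 reports F_0.5 (van Rijsbergen, 1979) alongside precision; with beta = 0.5 the measure weighs precision more heavily than recall, which suits lexicons built for high confidence. A small sketch with illustrative numbers, not values taken from the table:

```python
def f_beta(precision, recall, beta=0.5):
    """F_beta score; beta < 1 weighs precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A high-precision but low-coverage lexicon still gets a modest F_0.5,
# because recall over the whole test vocabulary stays small.
high_precision_low_recall = f_beta(0.90, 0.05)
balanced = f_beta(0.70, 0.70)   # F_beta equals P whenever P == R
```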
…it with phy. Similar rules have been introduced for Dutch-English: the suffix -tie is replaced by -tion, -sie by -sion, and -teit by -ty.

Finally, we have compared the results of the following constructed lexicons:

- A lexicon containing only words shared across languages (LEX-1).
- A lexicon containing shared words and translation pairs found by applying the language-specific transformation rules (LEX-2).
- A lexicon containing only translation pairs obtained by the LALG-ALL algorithm that score above a certain threshold P (LEX-LALG).
- A combination of the lexicons LEX-1 and LEX-LALG (LEX-1+LEX-LALG). Non-matching duplicates are resolved by taking the translation pair from LEX-LALG as the correct one. Note that this lexicon is completely language-pair independent.
- A lexicon combining only translation pairs found by applying the language-specific transformation rules and LEX-LALG (LEX-R+LEX-LALG).
- A combination of the lexicons LEX-2 and LEX-LALG, where non-matching duplicates are resolved by taking the translation pair from LEX-LALG if it is present in LEX-1, and from LEX-2 otherwise (LEX-2+LEX-LALG).

According to the results in Table 2, we can conclude that adding translation pairs extracted by our LALG-ALL algorithm has a major positive impact on both precision and coverage. Obtaining results for two different language pairs shows that the approach is generic and applicable to other language pairs. The previous approach relying on the work of Koehn and Knight (2002) has been outperformed in terms of precision and coverage. Additionally, we have shown that adding simple translation rules for languages sharing the same roots might lead to even better scores (LEX-2+LEX-LALG). However, it is not always possible to rely on such knowledge, and the usefulness of the designed LALG-ALL algorithm really comes to the fore when the algorithm is applied to distant language pairs which do not share many words and cognates, and for which word translation rules cannot be easily established. In such cases, without any prior knowledge about the languages involved in the translation process, one is left with the linguistically unbiased LEX-1+LEX-LALG lexicon, which also displays promising performance.

5 Conclusions and Future Work

We have designed an algorithm that focuses on acquiring and keeping only highly confident translation candidates from multilingual comparable corpora. By employing the algorithm we have improved the precision scores of methods relying on per-topic word distributions from a cross-language topic model. We have shown that the algorithm is able to produce a highly reliable bilingual seed lexicon even when all other lexical clues are absent, thus making our algorithm suitable even for unrelated language pairs. In future work, we plan to further improve the algorithm and use it as a source of translational evidence for different alignment tasks in the setting of non-parallel corpora.

Acknowledgments

The research has been carried out in the framework of the TermWise Knowledge Platform (IOF-KP/09/001) funded by the Industrial Research Fund K.U. Leuven, Belgium.
References

Jaime G. Carbonell, Yiming Yang, Robert E. Frederking, Ralf D. Brown, Yibing Geng, and Danny Lee. 1997. Translingual information retrieval: A comparative evaluation. In Proceedings of the 15th International Joint Conference on Artificial Intelligence, pages 708-714.

Yun-Chuang Chiao and Pierre Zweigenbaum. 2002. Looking for candidate translational equivalents in specialized, comparable corpora. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1-5.

Philipp Cimiano, Antje Schultz, Sergej Sizov, Philipp Sorg, and Steffen Staab. 2009. Explicit versus latent concept models for cross-language information retrieval. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1513-1518.

Wim De Smet and Marie-Francine Moens. 2009. Cross-language linking of news stories on the Web using interlingual topic modeling. In Proceedings of the CIKM 2009 Workshop on Social Web Search and Mining, pages 57-64.

Hervé Déjean, Éric Gaussier, and Fatia Sadat. 2002. An approach based on multilingual thesauri and model combination for bilingual lexicon extraction. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1-7.

Mona T. Diab and Steve Finch. 2000. A statistical translation model using comparable corpora. In Proceedings of the 6th Triennial Conference on Recherche d'Information Assistée par Ordinateur (RIAO), pages 1500-1508.

Pascale Fung and Percy Cheung. 2004. Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 57-63.

Pascale Fung and Lo Yuen Yee. 1998. An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of the 17th International Conference on Computational Linguistics, pages 414-420.

Éric Gaussier, Jean-Michel Renders, Irina Matveeva, Cyril Goutte, and Hervé Déjean. 2004. A geometric view on bilingual lexicon extraction from comparable corpora. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 526-533.

Thomas L. Griffiths, Mark Steyvers, and Joshua B. Tenenbaum. 2007. Topics in semantic representation. Psychological Review, 114(2):211-244.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, pages 771-779.

Zellig S. Harris. 1954. Distributional structure. Word, 10(23):146-162.

Philipp Koehn and Kevin Knight. 2002. Learning a translation lexicon from monolingual corpora. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition, pages 9-16.

Audrey Laroche and Philippe Langlais. 2010. Revisiting context-based projection methods for term-translation spotting in comparable corpora. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 617-625.

Gina-Anne Levow, Douglas W. Oard, and Philip Resnik. 2005. Dictionary-based techniques for cross-language information retrieval. Information Processing and Management, 41:523-547.

Bo Li, Éric Gaussier, and Akiko Aizawa. 2011. Clustering comparable corpora for bilingual lexicon extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 473-478.

Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA.

I. Dan Melamed. 2000. Models of translational equivalence among words. Computational Linguistics, 26:221-249.

David Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 880-889.

Emmanuel Morin, Béatrice Daille, Koichi Takeuchi, and Kyo Kageura. 2007. Bilingual terminology mining - using brain, not brawn comparable corpora. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 664-671.

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31:477-504.

Xiaochuan Ni, Jian-Tao Sun, Jian Hu, and Zheng Chen. 2009. Mining multilingual topics from Wikipedia. In Proceedings of the 18th International World Wide Web Conference, pages 1155-1156.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Reinhard Rapp. 1995. Identifying word translations in non-parallel texts. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 320-322.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 519-526.

Daphna Shezaf and Ari Rappoport. 2010. Bilingual lexicon generation using non-aligned signatures. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 98-107.

Masao Utiyama and Hitoshi Isahara. 2003. Reliable measures for aligning Japanese-English news articles and sentences. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 72-79.

C. J. van Rijsbergen. 1979. Information Retrieval. Butterworth.

Thuy Vu, Ai Ti Aw, and Min Zhang. 2009. Feature-based method for document alignment in comparable news corpora. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 843-851.

Ivan Vulić, Wim De Smet, and Marie-Francine Moens. 2011. Identifying word translations from comparable corpora using latent topic models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 479-484.
Efficient Parsing with Linear Context-Free Rewriting Systems

Abstract

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 460-470, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
[Figure 2 shows a discontinuous tree over the sentence "Danach habe Kohlenstaub Feuer gefangen ." with the gloss "Afterwards had coal dust fire caught ."]

Figure 2: A discontinuous tree from the Negra corpus. Translation: After that coal dust had caught fire.

ROOT(ab) -> S(a) $.(b)
S(abcd) -> VAFIN(b) NN(c) VP_2(a, d)
VP_2(a, bc) -> PROAV(a) NN(b) VVPP(c)
PROAV(Danach) -> ε
VAFIN(habe) -> ε
NN(Kohlenstaub) -> ε
NN(Feuer) -> ε
VVPP(gefangen) -> ε
$.(.) -> ε

Figure 3: The productions that can be read off from the tree in figure 2. Note that lexical productions rewrite to ε, because they do not rewrite to any non-terminals.
2 Linear Context-Free Rewriting Systems

Linear Context-Free Rewriting Systems (LCFRS; Vijay-Shanker et al., 1987; Weir, 1988) subsume a wide variety of mildly context-sensitive formalisms, such as Tree-Adjoining Grammar (TAG), Combinatory Categorial Grammar (CCG), Minimalist Grammar, Multiple Context-Free Grammar (MCFG) and synchronous CFG (Vijay-Shanker and Weir, 1994; Kallmeyer, 2010). Furthermore, they can be used to parse dependency structures (Kuhlmann and Satta, 2009). Since LCFRS subsume various synchronous grammars, they are also important for machine translation. This makes it possible to use LCFRS as a syntactic backbone with which various formalisms can be parsed by compiling grammars into an LCFRS, similar to the TuLiPa system (Kallmeyer et al., 2008). As with all mildly context-sensitive formalisms, LCFRS are parsable in polynomial time, where the degree depends on the productions of the grammar. Intuitively, LCFRS can be seen as a generalization of context-free grammars to rewriting objects other than just continuous strings: productions are context-free, but instead of strings they can rewrite tuples, trees or graphs.

We focus on the use of LCFRS for parsing with discontinuous constituents. This follows up on recent work on parsing the discontinuous annotations in German corpora with LCFRS (Maier, 2010; van Cranenburgh et al., 2011) and work on parsing the Wall Street Journal corpus in which traces have been converted to discontinuous constituents (Evang and Kallmeyer, 2011). In the case of parsing with discontinuous constituents, a non-terminal may cover a tuple of discontinuous strings instead of a single, contiguous sequence of terminals. The number of components in such a tuple is called the fan-out of a rule, which is equal to the number of gaps plus one; the fan-out of the grammar is the maximum fan-out of its productions. A context-free grammar is an LCFRS with a fan-out of 1. For convenience we will use the rule notation of simple RCG (Boullier, 1998), which is a syntactic variant of LCFRS with an arguably more transparent notation.

An LCFRS is a tuple G = <N, T, V, P, S>. N is a finite set of non-terminals; a function dim: N -> ℕ specifies the unique fan-out for every non-terminal symbol. T and V are disjoint finite sets of terminals and variables. S is the distinguished start symbol with dim(S) = 1. P is a finite set of rewrite rules (productions) of the form:

A(α_1, ..., α_dim(A)) -> B_1(X_1^1, ..., X_dim(B_1)^1) ... B_m(X_1^m, ..., X_dim(B_m)^m)

for m >= 0, where A, B_1, ..., B_m ∈ N, each X_j^i ∈ V for 1 <= i <= m, 1 <= j <= dim(B_i), and α_i ∈ (T ∪ V)* for 1 <= i <= dim(A).

Productions must be linear: if a variable occurs in a rule, it occurs exactly once on the left-hand side (LHS), and exactly once on the right-hand side (RHS). A rule is ordered if for any two variables X_1 and X_2 occurring in a non-terminal on the RHS, X_1 precedes X_2 on the LHS iff X_1 precedes X_2 on the RHS.

Every production has a fan-out determined by the fan-out of the non-terminal symbol on its left-hand side. Apart from the fan-out, productions also
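The definition above can be made concrete by writing the grammar of figure 3 down as data and reading off fan-outs. This is an illustrative sketch, not code from the paper; the encoding of variables as "B.j" strings (component j of RHS non-terminal B) and the helper names are our own choices:

```python
# A production is (lhs, components, rhs): every LHS component is a tuple
# whose elements are either variables "B.j" or, for lexical productions,
# terminal words; lexical productions have an empty RHS.
PRODUCTIONS = [
    ("ROOT", (("S.0", "$..0"),), ("S", "$.")),
    ("S", (("VP2.0", "VAFIN.0", "NN.0", "VP2.1"),), ("VAFIN", "NN", "VP2")),
    ("VP2", (("PROAV.0",), ("NN.0", "VVPP.0")), ("PROAV", "NN", "VVPP")),
    ("PROAV", (("Danach",),), ()),        # lexical productions rewrite
    ("VAFIN", (("habe",),), ()),          # to epsilon: no non-terminals
    ("NN", (("Kohlenstaub",),), ()),      # on the right-hand side
    ("NN", (("Feuer",),), ()),
    ("VVPP", (("gefangen",),), ()),
    ("$.", ((".",),), ()),
]

def dims(productions):
    """dim(A) = number of LHS components of A; it must be the same
    for every production with the same LHS non-terminal."""
    d = {}
    for lhs, components, _ in productions:
        assert d.setdefault(lhs, len(components)) == len(components)
    return d

def grammar_fan_out(productions):
    """The fan-out of the grammar: the maximum fan-out of its productions."""
    return max(dims(productions).values())
```

For this grammar only VP2 covers two discontinuous spans, so the grammar fan-out is 2; a context-free grammar would give 1 for every non-terminal.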
have a rank: the number of non-terminals on the This binarization introduces a production with
right-hand side. These two variables determine a fan-out of 2, which could have been avoided.
the time complexity of parsing with a grammar. A After binarization, an LCFRS can be parsed in
production can be instantiated when its variables O(|G| |w|p ) time, where |G| is the size of the
can be bound to non-overlapping spans such that grammar, |w| is the length of the sentence. The de-
for each component i of the LHS, the concatena- gree p of the polynomial is the maximum parsing
tion of its terminals and bound variables forms a complexity of a rule, defined as:
contiguous span in the input, while the endpoints
of each span are non-contiguous. parsing complexity := + 1 + 2 (6)
As in the case of a PCFG, we can read off LCFRS where is the fan-out of the left-hand side and
productions from a treebank (Maier and Sgaard, 1 and 2 are the fan-outs of the right-hand side
2008), and the relative frequencies of productions of the rule in question (Gildea, 2010). As Gildea
form a maximum likelihood estimate, for a prob- (2010) shows, there is no one to one correspon-
abilistic LCFRS (PLCFRS), i.e., a (discontinuous) dence between fan-out and parsing complexity: it
treebank grammar. As an example, figure 3 shows is possible that parsing complexity can be reduced
the productions extracted from the tree in figure 2. by increasing the fan-out of a production. In other
words, there can be a production which can be bi-
3 Binarization
narized with a parsing complexity that is minimal
A probabilistic LCFRS can be parsed using a CKY- while its fan-out is sub-optimal. Therefore we fo-
like tabular parsing algorithm (cf. Kallmeyer and cus on parsing complexity rather than fan-out in
Maier, 2010; van Cranenburgh et al., 2011), but this work, since parsing complexity determines the
this requires a binarized grammar.1 Any LCFRS actual time complexity of parsing with a grammar.
can be binarized. Crescenzi et al. (2011) state There has been some work investigating whether
while CFGs can always be reduced to rank two the increase in complexity can be minimized ef-
(Chomsky Normal Form), this is not the case for fectively (Gomez-Rodrguez et al., 2009; Gildea,
LCFRS with any fan-out greater than one. How- 2010; Crescenzi et al., 2011).
ever, this assertion is made under the assumption of More radically, it has been suggested that the
a fixed fan-out. If this assumption is relaxed then power of LCFRS should be limited to well-nested
it is easy to binarize either deterministically or, as structures, which gives an asymptotic improve-
will be investigated in this work, optimally with ment in parsing time (Gomez-Rodrguez et al.,
a dynamic programming approach. Binarizing an 2010). However, there is linguistic evidence that
LCFRS may increase its fan-out, which results in not all language use can be described in well-
an increase in asymptotic complexity. Consider nested structures (Chen-Main and Joshi, 2010).
the following production: Therefore we will use the full power of LCFRS in
X(pqrs) A(p, r) B(q) C(s) (1) this workparsing complexity is determined by
the treebank, not by a priori constraints.
Henceforth, we assume that non-terminals on the
right-hand side are ordered by the order of their 3.1 Further binarization strategies
first variable on the left-hand side. There are two Apart from optimizing for parsing complexity, for
ways to binarize this production. The first is from linguistic reasons it can also be useful to parse
left to right: the head of a constituent first, yielding so-called
X(ps) XAB (p) C(s) (2) head-driven binarizations (Collins, 1999). Addi-
tionally, such a head-driven binarization can be
XAB (pqr) A(p, r) B(q) (3) Markovizedi.e., the resulting production can be
This binarization maintains the fan-out of 1. The constrained to apply to a limited amount of hor-
second way is from right to left: izontal context as opposed to the full context in
the original constituent (e.g., Klein and Manning,
X(pqrs) A(p, r) XBC (q, s) (4)
2003), which can have a beneficial effect on accu-
XBC (q, s) B(q) C(s) (5) racy. In the notation of Klein and Manning (2003)
1
Other algorithms exist which support n-ary productions, there are two Markovization parameters: h and
but these are less suitable for statistical treebank parsing. v. The first parameter describes the amount of
462
X X X
X XB,C,D,E XB XD
XB,C,D,E XB,C,D B XE XA
X XC,D,E XB,C XD XB
B B XD,E B B
A X C Y D E A X C Y D E A X C Y D E A X C Y D E A X C Y D E
0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5
original right branching optimal head-driven optimal head-driven
p = 4, = 2 p = 5, = 2 p = 4, = 2 p = 5, = 2 p = 4, = 2
Figure 4: The four binarization strategies. C is the head node. Underneath each tree is the maximum parsing
complexity and fan-out among its productions.
horizontal context for the artificial labels of a bi- dering and there is no probabilistic interpretation
narized production. In a normal form binarization, of Markovization in such a setting.
this parameter equals infinity, because the bina- To summarize, we have at least four binarization
rized production should only apply in the exact same context as the one in which it originally belongs, as otherwise the set of strings accepted by the grammar would be affected. An artificial label will have the form X_{A,B,C} for a binarized production of a constituent X that has covered children A, B, and C of X. The other extreme, h = 1, enables generalizations by stringing parts of binarized constituents together, as long as they share one non-terminal. In the previous example, the label would become just X_A, i.e., the presence of B and C would no longer be required, which enables switching to any binarized production that has covered A as the last node. Limiting the amount of horizontal context on which a production is conditioned is important when the treebank contains many unique constituents which can only be parsed by stringing together different binarized productions; in other words, it is a way of dealing with data sparseness in the n-ary productions of the treebank.

The second parameter describes parent annotation, which will not be investigated in this work; the default value v = 1 implies including only the immediate parent of the constituent that is being binarized; including grandparents is a way of weakening independence assumptions.

Crescenzi et al. (2011) also remark that an optimal head-driven binarization allows for Markovization. However, it is questionable whether such a binarization deserves the name Markovization, as the non-terminals are not introduced deterministically from left to right, but in an arbitrary fashion dictated by concerns of parsing complexity; as such there is no Markov process based on a meaningful (e.g., temporal) order.

We compare the following binarization strategies (cf. figure 4 for an illustration):

1. right branching: a right-to-left binarization; no regard for optimality or statistical tweaks.

2. optimal: a binarization which minimizes parsing complexity, introduced in Gildea (2010). Binarizing with this strategy is exponential in the resulting optimal fan-out (Gildea, 2010).

3. head-driven: head-outward binarization with horizontal Markovization; no regard for optimality.

4. optimal head-driven: head-outward binarization with horizontal Markovization which minimizes parsing complexity; introduced in, and proven to be NP-hard by, Crescenzi et al. (2011).

3.2 Finding optimal binarizations

An issue with the minimal binarizations is that the algorithm for finding them has a high computational complexity, and it has not been evaluated empirically on treebank data.² Empirical investigation is interesting for two reasons. First of all, the high computational complexity may not matter in practice, given that the constant factors, namely the number of children of constituents, can reasonably be expected to be relatively small. Second, it is important to establish whether an asymptotic improvement is actually obtained through optimal binarizations, and whether this translates to an improvement in practice.

Gildea (2010) presents a general algorithm to binarize an LCFRS while minimizing a given scoring function. We will use this algorithm with two different scoring functions.

² Gildea (2010) evaluates on a dependency bank, but does not report whether any improvement is obtained over a naive binarization.
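To make the role of the h parameter concrete, here is a minimal sketch of left-to-right binarization with horizontal Markovization. It is an illustration only, not the implementation used in the experiments; the function name and the label format X|<A,B> are hypothetical.

```python
def binarize(parent, children, h=None):
    """Binarize an n-ary production left-to-right.

    Artificial labels record the children covered so far; with
    h=None the full history is kept (h = infinity), while with
    h=1 only the most recently covered child is kept, which lets
    different binarized productions be strung together.
    """
    assert len(children) >= 2, "nothing to binarize"
    rules = []
    lhs = parent
    covered = []
    for child in children[:-2]:
        covered.append(child)
        # keep only the h most recent siblings in the artificial label
        history = covered if h is None else covered[-h:]
        new_lhs = "%s|<%s>" % (parent, ",".join(history))
        rules.append((lhs, (child, new_lhs)))
        lhs = new_lhs
    rules.append((lhs, (children[-2], children[-1])))
    return rules
```

With h=None the label X|<A,B> requires exactly the context A, B; with h=1 the label X|<B> can be reached from any binarized production that covered B last, which is the generalization described above.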
Figure 5: The distribution of parsing complexity among productions in binarized grammars (right branching and optimal) read off from NEGRA-25. The x-axis shows parsing complexity (3–9); the y-axis (frequency) has a logarithmic scale.

Figure 6: The distribution of parsing complexity among productions in Markovized, head-driven grammars (head-driven and optimal head-driven) read off from NEGRA-25. The x-axis shows parsing complexity (3–9); the y-axis (frequency) has a logarithmic scale.
The first directly optimizes parsing complexity. Given a (partially) binarized constituent c, the function returns a tuple of scores, for which a linear order is defined by comparing elements starting from the most significant (left-most) element. The tuples contain the parsing complexity p, and the fan-out f to break ties in parsing complexity; if there are still ties after considering the fan-out, the sum s of the parsing complexities of the subtrees of c is considered, which will give preference to a binarization where the worst-case complexity occurs once instead of twice. The formula is then:

    opt(c) = ⟨p, f, s⟩

The second function is similar, except that only head-driven strategies are accepted. A head-driven strategy is a binarization in which the head is introduced first, after which the rest of the children are introduced one at a time:

    opt-hd(c) = ⟨p, f, s⟩ if c is head-driven, ⟨∞, ∞, ∞⟩ otherwise

Given a (partial) binarization c, the score should reflect the maximum complexity and fan-out in that binarization, to optimize for the worst case, as well as the sum, to optimize the average case. This aspect appears to be glossed over by Gildea (2010): considering only the score of the last production in a binarization produces suboptimal binarizations.

3.3 Experiments

As data we use version 2 of the Negra treebank (Skut et al., 1997), with the common training, development and test splits (Dubey and Keller, 2003). Following common practice, punctuation, which is left out of the phrase structure in Negra, is re-attached to the nearest constituent.

In the course of experiments it was discovered that the heuristic method for punctuation attachment used in previous work (e.g., Maier, 2010; van Cranenburgh et al., 2011), as implemented in rparse,³ introduces additional discontinuity. We applied a slightly different heuristic: punctuation is attached to the highest constituent that contains a neighbor to its right. The result is that punctuation can be introduced into the phrase structure without any additional discontinuity, and thus without artificially inflating the fan-out and complexity of grammars read off from the treebank. This new heuristic provides a significant improvement: instead of a fan-out of 9 and a parsing complexity of 19, we obtain values of 4 and 9 respectively.

The parser is presented with the gold part-of-speech tags from the corpus. For reasons of efficiency we restrict sentences to 25 words (including punctuation) in this experiment: NEGRA-25. A grammar was read off from the training part of NEGRA-25, and sentences of up to 25 words in the development set were parsed using the resulting PLCFRS, using the different binarization schemes: first with a right-branching, right-to-left binarization, and second with the minimal binarization according to parsing complexity and fan-out.

³ Available from http://www.wolfgang-maier.net/rparse/downloads; retrieved March 25th, 2011.
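The two scoring functions of section 3.2 can be sketched as score tuples compared lexicographically; Python's tuple comparison already provides the linear order described there. This is an illustrative sketch under assumed inputs (lists of per-production complexities and fan-outs), not the actual implementation.

```python
import math

def opt(complexities, fanouts):
    """Score a (partial) binarization: worst-case parsing complexity p,
    fan-out f to break ties, and the sum s of the subtrees' complexities,
    preferring binarizations where the worst case occurs less often."""
    return (max(complexities), max(fanouts), sum(complexities))

def opt_hd(complexities, fanouts, head_driven):
    """Same score, but only head-driven binarizations are eligible;
    others receive an infinitely bad score."""
    if not head_driven:
        return (math.inf, math.inf, math.inf)
    return opt(complexities, fanouts)
```

Taking the minimum of such tuples over candidate binarizations then optimizes the worst case first and the average case (via the sum) last.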
                     right        optimal      head-        optimal
                     branching                 driven       head-driven
Markovization        v=1, h=∞     v=1, h=∞     v=1, h=2     v=1, h=2
fan-out              4            4            4            4
complexity           8            8            9            8
labels               12861        12388        4576         3187
clauses              62072        62097        53050        52966
time to binarize     1.83 s       46.37 s      2.74 s       28.9 s
time to parse        246.34 s     193.94 s     2860.26 s    716.58 s
coverage             96.08 %      96.08 %      98.99 %      98.73 %
F1 score             66.83 %      66.75 %      72.37 %      71.79 %

Table 1: The effect of binarization strategies on parsing efficiency, with sentences from the development section of NEGRA-25.
The last two binarizations are head-driven and Markovized: the first straightforwardly from left to right, the latter optimized for minimal parsing complexity. With Markovization we are forced to add a level of parent annotation to tame the increase in productivity caused by h = 1.

The distribution of parsing complexity (measured with eq. 6) in the grammars with different binarization strategies is shown in figures 5 and 6. Although the optimal binarizations do seem to have some effect on the distribution of parsing complexities, it remains to be seen whether this can be cashed out as a performance improvement in practice. To this end, we also parse using the binarized grammars.

In this work we binarize and parse with disco-dop, introduced in van Cranenburgh et al. (2011).⁴ In this experiment we report scores of the (exact) Viterbi derivations of a treebank PLCFRS; cf. table 1 for the results. Times represent CPU time (single core); accuracy is given with a generalization of PARSEVAL to discontinuous structures, described in Maier (2010).

Instead of using Maier's implementation of discontinuous F1 scores in rparse, we employ a variant that ignores (a) punctuation and (b) the root node of each tree. This makes our evaluation incomparable to previous results on discontinuous parsing, but brings it in line with common practice on the Wall Street Journal benchmark. Note that this change yields scores about 2 or 3 percentage points lower than those of rparse.

Although finding optimal binarizations is exponential (Gildea, 2010) and NP-hard (Crescenzi et al., 2011), they can be computed relatively quickly on this data set.⁵ Importantly, in the first case there is no improvement on fan-out or parsing complexity, while in the head-driven case there is a minimal improvement, because without optimal binarization there is a single production with parsing complexity 15. On the other hand, the optimal binarizations might still have a significant effect on the average-case complexity, rather than the worst-case complexity. Indeed, in both cases parsing with the optimal grammar is faster; in the first case, however, when the time for binarization is considered as well, this advantage mostly disappears.

The difference in F1 scores might relate to the efficacy of Markovization in the binarizations. It should be noted that it makes little theoretical sense to Markovize a binarization when it is not a left-to-right or right-to-left binarization, because with an optimal binarization the non-terminals of a constituent are introduced in an arbitrary order.

More importantly, in our experiments these techniques of optimal binarization did not scale to longer sentences. While it is possible to obtain an optimal binarization of the unrestricted Negra corpus, parsing long sentences with the resulting grammar remains infeasible. Therefore we need to look at other techniques for parsing longer sentences.

⁴ All code is available from: http://github.com/andreasvc/disco-dop.

⁵ The implementation exploits two important optimizations. The first is the use of bit vectors to keep track of which non-terminals are covered by a partial binarization. The second is to skip constituents without discontinuity, which are equivalent to CFG productions.
We will stick with the straightforward head-driven, head-outward binarization strategy, despite this being a computationally sub-optimal binarization.

One technique for efficient parsing of LCFRS is the use of context-summary estimates (Kallmeyer and Maier, 2010) as part of a best-first parsing algorithm. This allowed Maier (2010) to parse sentences of up to 30 words. However, the calculation of these estimates is not feasible for longer sentences and large grammars (van Cranenburgh et al., 2011).

Another strategy is to perform an online approximation of the sentence to be parsed, after which parsing with the LCFRS can be pruned effectively. This is the strategy that will be explored in the current work.

4 Context-free grammar approximation for coarse-to-fine parsing

Coarse-to-fine parsing (Charniak et al., 2006) is a technique to speed up parsing by exploiting the information that can be gained from parsing with simpler, coarser grammars, e.g., a grammar with a smaller set of labels onto which the original grammar can be projected. Constituents that do not contribute to a full parse tree with a coarse grammar can be ruled out for finer grammars as well, which greatly reduces the number of edges that need to be explored. However, changing just the labels affects only the grammar constant. With discontinuous treebank parsing, the asymptotic complexity of the grammar also plays a major role. Therefore we suggest parsing not just with a coarser grammar, but with a coarser grammar formalism, following a suggestion in van Cranenburgh et al. (2011).

This idea is inspired by the work of Barthelemy et al. (2001), who apply it in a non-probabilistic setting where the coarse grammar acts as a guide to the non-deterministic choices of the fine grammar. Within the coarse-to-fine approach the technique becomes a matter of pruning with some probabilistic threshold. Instead of using the coarse grammar only as a guide to solve non-deterministic choices, we apply it as a pruning step which also discards the most suboptimal parses.

The basic idea is to extract a grammar that defines a superset of the language we want to parse, but with a fan-out of 1. More concretely, a context-free grammar can be read off from discontinuous trees that have been transformed to context-free trees by the procedure introduced in Boyd (2007). Each discontinuous node is split into a set of new nodes, one for each component; for example a node NP2 will be split into two nodes labeled NP*1 and NP*2 (like Barthelemy et al., we mark components with an index to reduce overgeneration). Because Boyd's transformation is reversible, chart items from this grammar can be converted back to discontinuous chart items, and can guide parsing of an LCFRS.

This guiding takes the form of a white list. After parsing with the coarse grammar, the resulting chart is pruned by removing all items that fail to meet a certain criterion. In our case this criterion is whether a chart item is part of one of the k-best derivations; we use k = 50 in all experiments (as in van Cranenburgh et al., 2011). This has similar effects as removing items below a threshold of marginalized posterior probability; however, the latter strategy requires computation of outside probabilities from a parse forest, which is more involved with an LCFRS than with a PCFG. When parsing with the fine grammar, whenever a new item is derived, the white list is consulted to see whether this item is allowed to be used in further derivations; otherwise it is immediately discarded. This coarse-to-fine approach will be referred to as CFG-CTF, and the transformed, coarse grammar will be referred to as a split-PCFG.

Splitting discontinuous nodes for the coarse grammar introduces new nodes, so obviously we need to binarize after this transformation. On the other hand, the coarse-to-fine approach requires a mapping between the grammars, so after reversing the transformation of splitting nodes, the resulting discontinuous trees must be binarized (and optionally Markovized) in the same manner as those on which the fine grammar is based. To resolve this tension we elect to binarize twice. The first time is before splitting discontinuous nodes, and this is where we introduce Markovization. This same binarization will be used for the fine grammar as well, which ensures the models make the same kind of generalizations. The second binarization is after splitting nodes, this time with a binary normal form (2NF; all productions are either unary, binary, or lexical).

Parsing with this grammar proceeds as follows. After obtaining an exhaustive chart from the coarse stage, the chart is pruned so as to only contain items occurring in the k-best derivations. When parsing in the fine stage, each new item is checked against this white list.
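The white-list pruning above can be sketched as follows. The item representation (a label plus a list of component spans) and the function names are hypothetical; the split-label convention NP*1, NP*2 follows the Boyd (2007) transformation as described in the text.

```python
def coarse_items(label, components):
    """Project a discontinuous chart item onto its split-PCFG items:
    a node covering two components, e.g. an NP with fan-out 2,
    is represented by NP*1 and NP*2."""
    if len(components) == 1:
        return {(label, components[0])}
    return {("%s*%d" % (label, i + 1), span)
            for i, span in enumerate(components)}

def survives_pruning(label, components, whitelist):
    """A newly derived fine item is kept only if every one of its
    coarse counterparts occurred in one of the k-best coarse
    derivations (the white list)."""
    return coarse_items(label, components) <= whitelist
```

Building the white list itself amounts to collecting the items of the k-best (k = 50) derivations from the coarse chart into a set before the fine stage starts.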
Figure 7: Transformations for a context-free coarse grammar. From left to right: the original constituent, Markovized with v = 1, h = 1, discontinuities resolved, normal form (second binarization).
Table 2: Some statistics on the coarse and fine grammars read off from NEGRA-40.
5 Evaluation

We evaluate on Negra with the same setup as in section 3.3. We report discontinuous F1 scores as well as exact match scores. For previous results on discontinuous parsing with Negra, see table 3. For results with the CFG-CTF method, see table 4.

We first establish the viability of the CFG-CTF method on NEGRA-25, with a head-driven v = 1, h = 2 binarization, reporting again the scores of the exact Viterbi derivations from a treebank PLCFRS versus a PCFG using our transformations. Figure 8 compares the parsing times of LCFRS with and without the new CFG-CTF method. The graph shows a steep incline for parsing with LCFRS directly, which makes it infeasible to parse longer sentences, while the CFG-CTF method is faster for sentences of length > 22 despite its overhead of parsing twice.

Figure 8: Efficiency of parsing PLCFRS with and without coarse-to-fine. The latter includes time for both the coarse and fine grammar. Data points represent the average time to parse sentences of that length; each length is made up of 20–40 sentences.

The second experiment demonstrates the CFG-CTF technique on longer sentences. We restrict the length of sentences in the training, development and test corpora to 40 words: NEGRA-40. As a first step we apply the CFG-CTF technique to parse with a PLCFRS as the fine grammar, pruning away all items not occurring in the 10,000 best derivations
                                            words   PARSEVAL (F1)   Exact match
DPSG: Plaehn (2004)                         15      73.16           39.0
PLCFRS: Maier (2010)                        30      71.52           31.65
Disco-DOP: van Cranenburgh et al. (2011)    30      73.98           34.80

Table 3: Previous results on discontinuous parsing with Negra.

Table 4: Results on NEGRA-25 and NEGRA-40 with the CFG-CTF method. NB: as explained in section 3.3, these F1 scores are incomparable to the results in table 3; for comparison, the F1 score for Disco-DOP on the dev set ≤ 40 is 77.13 % using that evaluation scheme.
from the PCFG chart. The result shows that the PLCFRS gives a slight improvement over the split-PCFG, which accords with the observation that the latter makes stronger independence assumptions in the case of discontinuity. The same model from NEGRA-40 can also be used to parse the full development set, without length restrictions, establishing that the CFG-CTF method effectively eliminates any limitation of length for parsing with LCFRS.
References

François Barthélemy, Pierre Boullier, Philippe Deschamp, and Éric de la Clergerie. 2001. Guided parsing of range concatenation languages. In Proc. of ACL, pages 42–49.

Pierre Boullier. 1998. Proposal for a natural language processing syntactic backbone. Technical Report RR-3342, INRIA-Rocquencourt, Le Chesnay, France. URL http://www.inria.fr/RRRT/RR-3342.html.

Adriane Boyd. 2007. Discontinuity revisited: An improved conversion to context-free representations. In Proceedings of the Linguistic Annotation Workshop, pages 41–44.

Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The Tiger treebank. In Proceedings of the workshop on treebanks and linguistic theories, pages 24–41.

Eugene Charniak, Mark Johnson, M. Elsner, J. Austerweil, D. Ellis, I. Haxton, C. Hill, R. Shrivaths, J. Moore, M. Pozar, et al. 2006. Multilevel coarse-to-fine PCFG parsing. In Proceedings of NAACL-HLT, pages 168–175.

Joan Chen-Main and Aravind K. Joshi. 2010. Unavoidable ill-nestedness in natural language and the adequacy of tree local-MCTAG induced dependency structures. In Proceedings of TAG+. URL http://www.research.att.com/srini/TAG+10/papers/chenmainjoshi.pdf.

Michael Collins. 1999. Head-driven statistical models for natural language parsing. Ph.D. thesis, University of Pennsylvania.

Pierluigi Crescenzi, Daniel Gildea, Andrea Marino, Gianluca Rossi, and Giorgio Satta. 2011. Optimal head-driven parsing complexity for linear context-free rewriting systems. In Proc. of ACL.

Amit Dubey and Frank Keller. 2003. Parsing German with sister-head dependencies. In Proc. of ACL, pages 96–103.

Kilian Evang and Laura Kallmeyer. 2011. PLCFRS parsing of English discontinuous constituents. In Proceedings of IWPT, pages 104–116.

Daniel Gildea. 2010. Optimal parsing strategies for linear context-free rewriting systems. In Proceedings of NAACL HLT 2010, pages 769–776.

Carlos Gómez-Rodríguez, Marco Kuhlmann, and Giorgio Satta. 2010. Efficient parsing of well-nested linear context-free rewriting systems. In Proceedings of NAACL HLT 2010, pages 276–284.

Carlos Gómez-Rodríguez, Marco Kuhlmann, Giorgio Satta, and David Weir. 2009. Optimal reduction of rule length in linear context-free rewriting systems. In Proceedings of NAACL HLT 2009, pages 539–547.

Joshua Goodman. 2003. Efficient parsing of DOP with PCFG-reductions. In Rens Bod, Remko Scha, and Khalil Sima'an, editors, Data-Oriented Parsing. The University of Chicago Press.

Laura Kallmeyer. 2010. Parsing Beyond Context-Free Grammars. Cognitive Technologies. Springer Berlin Heidelberg.

Laura Kallmeyer, Timm Lichte, Wolfgang Maier, Yannick Parmentier, Johannes Dellert, and Kilian Evang. 2008. TuLiPA: Towards a multi-formalism parsing environment for grammar engineering. In Proceedings of the Workshop on Grammar Engineering Across Frameworks, pages 1–8.

Laura Kallmeyer and Wolfgang Maier. 2010. Data-driven parsing with probabilistic linear context-free rewriting systems. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 537–545.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proc. of ACL, volume 1, pages 423–430.

Marco Kuhlmann and Giorgio Satta. 2009. Treebank grammar techniques for non-projective dependency parsing. In Proceedings of EACL, pages 478–486.

Roger Levy. 2005. Probabilistic models of word order and syntactic discontinuity. Ph.D. thesis, Stanford University.

Wolfgang Maier. 2010. Direct parsing of discontinuous constituents in German. In Proceedings of the SPMRL workshop at NAACL HLT 2010, pages 58–66.

Wolfgang Maier and Timm Lichte. 2009. Characterizing discontinuity in constituent treebanks. In Proceedings of Formal Grammar 2009, pages 167–182. Springer.

Wolfgang Maier and Anders Søgaard. 2008. Treebanks and mild context-sensitivity. In Proceedings of Formal Grammar 2008, page 61.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2):313–330.

James D. McCawley. 1982. Parentheticals and discontinuous constituent structure. Linguistic Inquiry, 13(1):91–106.

Oliver Plaehn. 2004. Computing the most probable parse for a discontinuous phrase structure grammar. In Harry Bunt, John Carroll, and Giorgio Satta, editors, New developments in parsing technology, pages 91–106. Kluwer Academic Publishers, Norwell, MA, USA.

Remko Scha. 1990. Language theory and language technology; competence and performance. In Q.A.M. de Kort and G.L.J. Leerdam, editors, Computertoepassingen in de Neerlandistiek, pages 7–22. LVVN, Almere, the Netherlands. Original title: Taaltheorie en taaltechnologie; competence en performance. Translation available at http://iaaa.nl/rs/LeerdamE.html.

Stuart M. Shieber. 1985. Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8:333–343.

Wojciech Skut, Brigitte Krenn, Thorsten Brants, and Hans Uszkoreit. 1997. An annotation scheme for free word order languages. In Proceedings of ANLP, pages 88–95.

Andreas van Cranenburgh, Remko Scha, and Federico Sangati. 2011. Discontinuous data-oriented parsing: A mildly context-sensitive all-fragments grammar. In Proceedings of SPMRL, pages 34–44.

K. Vijay-Shanker and David J. Weir. 1994. The equivalence of four extensions of context-free grammars. Theory of Computing Systems, 27(6):511–546.

K. Vijay-Shanker, David J. Weir, and Aravind K. Joshi. 1987. Characterizing structural descriptions produced by various grammatical formalisms. In Proc. of ACL, pages 104–111.

David J. Weir. 1988. Characterizing mildly context-sensitive grammar formalisms. Ph.D. thesis, University of Pennsylvania. URL http://repository.upenn.edu/dissertations/AAI8908403/.
Evaluating language understanding accuracy with respect to objective outcomes in a dialogue system

Myroslava O. Dzikovska and Peter Bell and Amy Isard and Johanna D. Moore
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh, United Kingdom
{m.dzikovska,peter.bell,amy.isard,j.moore}@ed.ac.uk

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 471–481, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
which manually annotated features predict learning outcomes, to justify new features needed in the system (Forbes-Riley et al., 2007; Rotaru and Litman, 2006; Forbes-Riley and Litman, 2006).

We adapt the PARADISE methodology to evaluating individual NLP components, linking commonly used intrinsic evaluation scores with extrinsic outcome metrics. We describe an evaluation of an interpretation component of a tutorial dialogue system, with student learning gain as the target outcome measure. We first describe the evaluation setup, which uses standard classification accuracy metrics for system evaluation (Section 2). We discuss the results of the intrinsic system evaluation in Section 3. We then show that standard evaluation metrics do not serve as good predictors of system performance for the system we evaluated; however, adding confusion matrix features improves the predictive model (Section 4). We argue that in practical applications such predictive metrics should be used alongside standard metrics for component evaluations, to better predict how different components will perform in the context of a specific task. We demonstrate how this technique can help differentiate the output quality between a majority class baseline, the system's output, and the output of a new classifier we trained on our data (Section 5). Finally, we discuss some limitations and possible extensions to this approach (Section 6).

2 Evaluation Procedure

2.1 Data Collection

We collected transcripts of students interacting with BEETLE II (Dzikovska et al., 2010b), a tutorial dialogue system for teaching conceptual knowledge in the basic electricity and electronics domain. The system is a learning environment with a self-contained curriculum targeted at students with no knowledge of high school physics. When interacting with the system, students spend 3-5 hours going through pre-prepared reading material, building and observing circuits in a simulator, and talking with a dialogue-based computer tutor via a text-based chat interface.

During the interaction, students can be asked two types of questions. Factual questions require them to name a set of objects or a simple property, e.g., "Which components in circuit 1 are in a closed path?" or "Are bulbs A and B wired in series or in parallel?" Explanation and definition questions require longer answers that consist of 1-2 sentences, e.g., "Why was bulb A on when switch Z was open?" (expected answer: "Because it was still in a closed path with the battery") or "What is voltage?" (expected answer: "Voltage is the difference in states between two terminals"). We focus on the performance of the system on these long-answer questions, since reacting to them appropriately requires processing more complex input than factual questions.

We collected a corpus of 35 dialogues from paid undergraduate volunteers interacting with the system as part of a formative system evaluation. Each student completed a multiple-choice test assessing their knowledge of the material before and after the session. In addition, system logs contained information about how each student's utterance was interpreted. The resulting data set contains 3426 student answers grouped into 35 subsets, paired with test results. The answers were then manually annotated to create a gold standard evaluation corpus.

2.2 BEETLE II Interpretation Output

The interpretation component of BEETLE II uses a syntactic parser and a set of hand-authored rules to extract domain-specific semantic representations of student utterances from the text. The student answer is first classified with respect to its domain-specific speech act, as follows:

- Answer: a contentful expression to which the system responds with a tutoring action, either accepting it as correct or remediating the problems as discussed in (Dzikovska et al., 2010a).

- Help request: any expression indicating that the student does not know the answer, without domain content.

- Social: any expression such as "sorry" which appears to relate to social interaction and has no recognizable domain content.

- Uninterpretable: the system could not arrive at any interpretation of the utterance. It will respond by identifying the likely source of error, if possible (e.g., a word it does not understand), and asking the student to rephrase their utterance (Dzikovska et al., 2009).
If the student utterance was determined to be an answer, it is further diagnosed for correctness as discussed in (Dzikovska et al., 2010b), using a domain reasoner together with semantic representations of expected correct answers supplied by human tutors. The resulting diagnosis contains the following information:

- Consistency: whether the student statement correctly describes the facts mentioned in the question and the simulation environment; e.g., a student saying "Switch X is closed" is labeled inconsistent if the question stipulated that this switch is open.

- Diagnosis: an analysis of how well the student's explanation matches the expected answer. It consists of 4 parts:

  - Matched: parts of the student utterance that matched the expected answer
  - Contradictory: parts of the student utterance that contradict the expected answer
  - Extra: parts of the student utterance that do not appear in the expected answer
  - Not-mentioned: parts of the expected answer missing from the student utterance

The speech act and the diagnosis are passed to the tutorial planner, which makes decisions about feedback. They constitute the output of the interpretation component, and its quality is likely to affect the learning outcomes; therefore we need an effective way to evaluate it. In future work, performance of individual pipeline components could also be evaluated in a similar fashion.

2.3 Data Annotation

The general idea of breaking down the student answer into correct, incorrect and missing parts is common in tutorial dialogue systems (Nielsen et al., 2008; Dzikovska et al., 2010b; Jordan et al., 2006). However, representation details are highly system-specific, and difficult and time-consuming to annotate. Therefore we implemented a simplified annotation scheme which classifies whole answers as correct, partially correct but incomplete, or contradictory, without explicitly identifying the correct and incorrect parts. This makes it easier to create the gold standard and still retains useful information, because tutoring systems often choose the tutoring strategy based on the general answer class (correct, incomplete, or contradictory). In addition, this allows us to cast the problem in terms of classifier evaluation, and to use standard classifier evaluation metrics. If more detailed annotations were available, this approach could easily be extended, as discussed in Section 6.

We employed the hierarchical annotation scheme shown in Figure 1, which is a simplification of the DeMAND coding scheme (Campbell et al., 2009). Student utterances were first annotated as either related to domain content, or not containing any domain content but expressing the student's metacognitive state or attitudes. Utterances expressing domain content were then coded with respect to their correctness, as being fully correct, partially correct but incomplete, containing some errors (rather than just omissions), or irrelevant.¹ The irrelevant category was used for utterances which were correct in general but which did not directly answer the question. Inter-annotator agreement for this annotation scheme on the corpus was κ = 0.69.

The speech acts and diagnoses logged by the system can be automatically mapped onto our annotation labels. Help requests and social acts are assigned the non-content label; answers are assigned a label based on which diagnosis fields were filled: contradictory for those answers labeled as either inconsistent or containing something in the contradictory field; incomplete if there is something not mentioned, but something matched as well; and irrelevant if nothing matched (i.e., the entire expected answer is in not-mentioned). Finally, uninterpretable utterances are treated as unclassified, analogous to a situation where a statistical classifier does not output a label for an input because the classification probability is below its confidence threshold.

This mapping was then compared against the manually annotated labels to compute the intrinsic evaluation scores for the BEETLE II interpreter described in Section 3.

3 Intrinsic Evaluation Results

The interpretation component of BEETLE II was developed based on the transcripts of 8 sessions

¹ Several different subcategories of non-content utterances, and of contradictory utterances, were recorded. However, the resulting classes were too small and so were collapsed into a single category for purposes of this study.
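The mapping from logged speech acts and diagnoses to annotation labels described in section 2.3 can be sketched as follows. The field names of the diagnosis record are hypothetical, but the decision rules follow the text.

```python
def annotation_label(speech_act, diagnosis):
    """Map a logged speech act and diagnosis onto an annotation label.

    `diagnosis` is a hypothetical dict with a boolean 'consistent' flag
    and lists for the 'matched', 'contradictory' and 'not_mentioned'
    diagnosis fields.
    """
    if speech_act in ("help_request", "social"):
        return "non_content"
    if speech_act == "uninterpretable":
        # unclassified, like a classifier below its confidence threshold
        return None
    # the utterance is an answer: inspect the diagnosis fields
    if not diagnosis["consistent"] or diagnosis["contradictory"]:
        return "contradictory"
    if not diagnosis["matched"]:
        return "irrelevant"  # entire expected answer is not-mentioned
    if diagnosis["not_mentioned"]:
        return "pc_incomplete"
    return "correct"
```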
Category      Subcategory      Description
Non-content                    Metacognitive and social expressions without domain content,
                               e.g., "I don't know", "I need help", "you are stupid"
Content                        The utterance includes domain content.
              correct          The student answer is fully correct
              pc incomplete    The student said something correct, but incomplete, with some
                               parts of the expected answer missing
              contradictory    The student's answer contains something incorrect or
                               contradicting the expected answer, rather than just an omission
              irrelevant       The student's statement is correct in general, but it does not
                               answer the question.

Figure 1: The hierarchical annotation scheme.
Label            Count   Frequency
correct           1438   0.43
pc incomplete      796   0.24
contradictory      808   0.24
irrelevant         105   0.03
non content        232   0.07

Table 1: Distribution of annotated labels in the evaluation corpus

of students interacting with earlier versions of the system. These sessions were completed prior to the beginning of the experiment during which our evaluation corpus was collected, and are not included in the corpus. Thus, the corpus constitutes unseen testing data for the BEETLE II interpreter.

Table 1 shows the distribution of codes in the annotated data. The distribution is unbalanced, and therefore in our evaluation results we use two different ways to average over per-class evaluation scores. Macro-average combines per-class scores disregarding the class sizes; micro-average weighs the per-class scores by class size. The overall classification accuracy (defined as the number of correctly classified instances out of all instances) is mathematically equivalent to micro-

43%, the same as BEETLE II. However, this is obviously not a good choice for a tutoring system, since students who make mistakes will never get tutoring feedback. This is reflected in a much lower value of the F score (0.12 macro-average F score for the baseline vs. 0.44 for BEETLE II). Note also that there is a large difference in the micro- and macro-averaged scores. It is not immediately clear which of these metrics is the most important, and how they relate to actual system performance. We discuss machine learning models to help answer this question in the next section.

4 Linking Evaluation Measures to Outcome Measures

Although the intrinsic evaluation shows that the BEETLE II interpreter performs better than the baseline on the F score, ultimately system developers are not interested in improving interpretation for its own sake: they want to know whether the time spent on improvements, and the complications in system design which may accompany them, are worth the effort. Specifically, do such changes translate into improvement in overall system performance?
averaged recall; however, macro-averaging better To answer this question without running expen-
reflects performance on small classes, and is com- sive user studies we can build a model which pre-
monly used for unbalanced classification prob- dicts likely outcomes based on the data observed
lems (see, e.g., (Lewis, 1991)). so far, and then use the models predictions as an
The detailed evaluation results are presented additional evaluation metric. We chose a multiple
in Table 2. We will focus on two metrics: the linear regression model for this task, linking the
overall classification accuracy (listed as micro- classification scores with learning gain as mea-
averaged recall as discussed above), and the sured during the data collection. This approach
macro-averaged F score. follows the general PARADISE approach (Walker
The majority class baseline is to assign cor- et al., 2000), but while PARADISE is typically
rect to every instance. Its overall accuracy is used to determine which system components need
474
                 baseline                 BEETLE II
Label            prec.   recall  F1      prec.   recall  F1
correct          0.43    1.00    0.60    0.93    0.52    0.67
pc incomplete    0.00    0.00    0.00    0.42    0.53    0.47
contradictory    0.00    0.00    0.00    0.57    0.22    0.31
irrelevant       0.00    0.00    0.00    0.17    0.15    0.16
non-content      0.00    0.00    0.00    0.91    0.41    0.57
macroaverage     0.09    0.20    0.12    0.60    0.37    0.44
microaverage     0.18    0.43    0.25    0.70    0.43    0.51

Table 2: Intrinsic evaluation results for BEETLE II and a majority class baseline
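The relationship between the averaged scores in Table 2 can be made concrete with a short sketch. The labels below are made up for illustration; this is not the evaluation code used in the paper:

```python
# Macro- vs. micro-averaged scores over gold/predicted label lists.
from collections import Counter

def per_class_prf(gold, pred, label):
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def averaged_scores(gold, pred):
    labels = sorted(set(gold))
    scores = [per_class_prf(gold, pred, l) for l in labels]
    # Macro-average: unweighted mean over classes, regardless of class size.
    macro = tuple(sum(s[i] for s in scores) / len(labels) for i in range(3))
    # Micro-averaged recall weights each class by its size, which makes it
    # equal to overall accuracy when every instance receives exactly one label.
    sizes, n = Counter(gold), len(gold)
    micro_recall = sum(sizes[l] / n * s[1] for l, s in zip(labels, scores))
    return macro, micro_recall

# Invented toy data (10 answers, unbalanced classes):
gold = ["correct"] * 6 + ["contradictory"] * 3 + ["irrelevant"]
pred = ["correct"] * 5 + ["contradictory"] * 4 + ["correct"]
(macro_p, macro_r, macro_f), micro_r = averaged_scores(gold, pred)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
assert abs(micro_r - accuracy) < 1e-9  # micro-averaged recall == accuracy
```

Note how the small irrelevant class, with recall 0 here, drags the macro-average down while barely affecting the micro-average, which is the effect discussed in the text.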
the most improvement, we focus on finding a better performance metric for a single component (interpretation), using standard evaluation scores as features.

Recall from Section 2.1 that each participant in our data collection was given a pre-test and a post-test, measuring their knowledge of course material. The test score was equal to the proportion of correctly answered questions. The normalized learning gain, (post - pre) / (1 - pre), is a metric typically used to assess system quality in intelligent tutoring, and this is the metric we are trying to model. Thus, the training data for our model consists of 35 instances, each corresponding to a single dialogue and the learning gain associated with it. We can compute intrinsic evaluation scores for each dialogue, in order to build a model that predicts that student's learning gain based on these scores. If the model's predictions are sufficiently reliable, we can also use them for predicting the learning gain that a student could achieve when interacting with a new version of the interpretation component for the system, not yet tested with users. We can then use the predicted score to compare different implementations and choose the one with the highest predicted learning gain.

4.1 Features

Table 4 lists the feature sets we used. We tried two basic types of features. First, we used the evaluation scores reported in the previous section as features. Second, we hypothesized that some errors that the system makes are likely to be worse than others from a tutoring perspective. For example, if the student gives a contradictory answer, accepting it as correct may lead to student misconceptions; on the other hand, calling an irrelevant answer partially correct but incomplete may be less of a problem. Therefore, we computed separate confusion matrices for each student. We normalized each confusion matrix cell by the total number of incorrect classifications for that student. We then added features based on confusion frequencies to our feature set.²

Ideally, we should add 20 different features to our model, corresponding to every possible confusion. However, we are facing a sparse data problem, illustrated by the overall confusion matrix for the corpus in Table 3. For example, we only observed 25 instances where a contradictory utterance was miscategorized as correct (compared to 200 contradictory–pc incomplete confusions), and so for many students this misclassification was never observed, and predictions based on this feature are not likely to be reliable. Therefore, we limited our features to those misclassifications that occurred at least twice for each student (i.e., at least 70 times in the entire corpus). The list of resulting features is shown in the conf row of Table 4. Since only a small number of features was included, this limits the applicability of the model we derived from this data set to the systems which make similar types of confusions. However, it is still interesting to investigate whether confusion probabilities provide additional information compared to standard evaluation metrics. We discuss how better coverage could be obtained in Section 6.

4.2 Regression Models

Table 5 shows the regression models we obtained using different feature sets. All models were obtained using stepwise linear regression, using the Akaike information criterion (AIC) for variable

² We also experimented with using % unclassified as an additional feature, since % of rejections is known to be a problem for spoken dialogue systems. However, it did not improve the models, and we do not report it here for brevity.
                 Actual
Predicted        contradictory  correct  irrelevant  non-content  pc incomplete
contradictory    175            86       3           0            43
correct          25             752      1           4            26
irrelevant       31             12       16          4            29
non-content     1              3        2           95           3
pc incomplete    200            317      40          28           419

Table 3: Confusion matrix for BEETLE II. System-predicted values are in rows; actual values in columns.
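The per-student confusion-frequency features described in Section 4.1 can be sketched as follows. The records and the helper function are invented for illustration; only the feature naming follows the conf row of Table 4:

```python
# Per-student confusion-frequency features: count each (predicted, actual)
# misclassification per student, normalized by that student's error total.
from collections import Counter, defaultdict

def confusion_features(records, keep):
    """records: (student, predicted, actual) triples.
    keep: the (predicted, actual) confusion pairs retained as features."""
    by_student = defaultdict(Counter)
    for student, predicted, actual in records:
        if predicted != actual:                 # only misclassifications count
            by_student[student][(predicted, actual)] += 1
    features = {}
    for student, conf in by_student.items():
        total_errors = sum(conf.values())
        # Normalize each confusion cell by the student's total error count.
        features[student] = {
            "Freq.predicted.%s.actual.%s" % pair: conf[pair] / total_errors
            for pair in keep
        }
    return features

# Invented data: one student with three misclassifications.
records = [
    ("s1", "pc incomplete", "contradictory"),
    ("s1", "pc incomplete", "contradictory"),
    ("s1", "correct", "contradictory"),
    ("s1", "correct", "correct"),               # correctly classified, ignored
]
feats = confusion_features(records, keep=[("pc incomplete", "contradictory")])
assert abs(feats["s1"]["Freq.predicted.pc incomplete.actual.contradictory"] - 2/3) < 1e-9
```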
selection implemented in the R stepwise regression library. As measures of model quality, we report R², the percentage of variance accounted for by the models (a typical measure of fit in regression modeling), and mean squared error (MSE). These were estimated using leave-one-out cross-validation, since our data set is small.

We used feature ablation to evaluate the contribution of different features. First, we investigated models using precision, recall or F-score alone. As can be seen from the table, precision is not predictive of learning gain, while F-score and recall perform similarly to one another, with R² = 0.12. In comparison, the model using only confusion frequencies has substantially higher estimated R² and a lower MSE.³ In addition, out of the 3 confusion features, only one is selected as predictive. This supports our hypothesis that different types of errors may have different importance within a practical system.

The confusion frequency feature chosen by the stepwise model (predicted-pc incomplete-actual-contradictory) has a reasonable theoretical justification. Previous research shows that students who give more correct or partially correct answers, either in human-human or human-computer dialogue, exhibit higher learning gains, and this has been established for different systems and tutoring domains (Litman et al., 2009). Consequently, % of contradictory answers is negatively predictive of learning gain. It is reasonable to suppose, as predicted by our model, that systems that do not identify such answers well, and therefore do not remediate them correctly, will do worse in terms of learning outcomes.

Based on this initial finding, we investigated the models that combined either F scores or the full set of intrinsic evaluation scores with confusion frequencies. Note that if the full set of metrics (precision, recall, F score) is used, the model derives a more complex formula which covers about 33% of the variance. Our best models, however, combine the averaged scores with confusion frequencies, resulting in a higher R² and a lower MSE (22% relative decrease between the scores.f and conf+scores.f models in the table). This shows that these features have complementary information, and that combining them in an application-specific way may help to predict how the components will behave in practice.

5 Using Prediction Models in Evaluation

The models from Table 5 can be used to compare different possible implementations of the interpretation component, under the assumption that the component with a higher predicted learning gain score is more appropriate to use in an ITS. To show how our predictive models can be used in making implementation decisions, we compare three possible choices for an interpretation component: the original BEETLE II interpreter, the baseline classifier described earlier, and a new decision tree classifier trained on our data.

We built a decision tree classifier using the Weka implementation of C4.5 pruned decision trees, with default parameters. As features, we used lexical similarity scores computed by the Text::Similarity package.⁴ We computed 8 features: the similarity between the student answer and either the expected answer text or the question text, using 4 different scores: raw number of overlapping words, F1 score, lesk score and cosine score. Its intrinsic evaluation scores are shown in Table 6, estimated using 10-fold cross-validation.

We can compare BEETLE II and the baseline classifier using the scores.all model. The predicted

³ The decrease in MSE is not statistically significant, possibly because of the small data set. However, since we observe the same pattern of results across our models, it is still useful to examine.
⁴ http://search.cpan.org/dist/Text-Similarity/
Name              Variables
scores.fm         fmeasure.microaverage, fmeasure.macroaverage, fmeasure.correct, fmeasure.contradictory, fmeasure.pc incomplete, fmeasure.non-content, fmeasure.irrelevant
scores.precision  precision.microaverage, precision.macroaverage, precision.correct, precision.contradictory, precision.pc incomplete, precision.non-content, precision.irrelevant
scores.recall     recall.microaverage, recall.macroaverage, recall.correct, recall.contradictory, recall.pc incomplete, recall.non-content, recall.irrelevant
scores.all        scores.fm + scores.precision + scores.recall
conf              Freq.predicted.contradictory.actual.correct, Freq.predicted.pc incomplete.actual.correct, Freq.predicted.pc incomplete.actual.contradictory

Table 4: Feature sets used in the regression models.

Table 5: Regression models for learning gain. R² and MSE estimated with leave-one-out cross-validation. Standard deviation in parentheses.
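The model selection behind Table 5 (the authors used R's stepwise regression with the AIC) can be sketched in pure Python. The toy data, the helper names and the forward-only greedy search below are illustrative assumptions, not a reimplementation of the R library:

```python
# Forward stepwise linear regression with AIC-based variable selection.
import math

def ols_fit(X, y):
    """Ordinary least squares with intercept; returns the residual sum of squares."""
    n = len(y)
    A = [[1.0] + list(row) for row in X] if X else [[1.0] for _ in y]
    m = len(A[0])
    # Normal equations (A^T A) b = A^T y, solved by Gauss-Jordan elimination.
    ata = [[sum(A[i][r] * A[i][c] for i in range(n)) for c in range(m)] for r in range(m)]
    aty = [sum(A[i][r] * y[i] for i in range(n)) for r in range(m)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(ata[r][col]))
        ata[col], ata[piv] = ata[piv], ata[col]
        aty[col], aty[piv] = aty[piv], aty[col]
        for r in range(m):
            if r != col:
                f = ata[r][col] / ata[col][col]
                ata[r] = [a - f * b for a, b in zip(ata[r], ata[col])]
                aty[r] -= f * aty[col]
    beta = [aty[r] / ata[r][r] for r in range(m)]
    return sum((y[i] - sum(b * a for b, a in zip(beta, A[i]))) ** 2 for i in range(n))

def aic(rss, n, n_params):
    # Gaussian-error AIC up to an additive constant; RSS floored for exact fits.
    return n * math.log(max(rss, 1e-12) / n) + 2 * n_params

def forward_stepwise(features, y):
    """Greedily add the feature that lowers AIC most; stop when none does."""
    chosen = []
    best = aic(ols_fit([], y), len(y), 1)
    while True:
        candidates = []
        for name in sorted(set(features) - set(chosen)):
            cols = chosen + [name]
            X = [list(row) for row in zip(*(features[c] for c in cols))]
            candidates.append((aic(ols_fit(X, y), len(y), len(cols) + 1), name))
        if not candidates or min(candidates)[0] >= best:
            return chosen
        best, picked = min(candidates)
        chosen.append(picked)

# Toy data: y depends on x1 only, so stepwise selection should keep just x1.
x_vals = [float(v) for v in range(8)]
features = {"x1": x_vals, "x2": [1.0, 0.0] * 4}
y = [2 * v + 1 for v in x_vals]
chosen = forward_stepwise(features, y)
assert chosen == ["x1"]
```

The AIC penalty of 2 per parameter is what stops the search from adding the uninformative x2 even though it cannot hurt the fit.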
score for BEETLE II is 0.66. The predicted score for the baseline is 0.28. We cannot use the models based on confusion scores (conf, conf+scores.f or full) for evaluating the baseline, because the confusions it makes are always to predict that the answer is correct when the actual label is incomplete or contradictory. Such situations were too rare in our training data, and therefore were not included in the models (as discussed in Section 4.1). Additional data will need to be collected before this model can reasonably predict baseline behavior.

Label            prec.   recall  F1
correct          0.66    0.76    0.71
pc incomplete    0.38    0.34    0.36
contradictory    0.40    0.35    0.37
irrelevant       0.07    0.04    0.05
non-content      0.62    0.76    0.68
macroaverage     0.43    0.45    0.43
microaverage     0.51    0.53    0.52

Table 6: Intrinsic evaluation scores for our newly built classifier.

Compared to our new classifier, BEETLE II has lower overall accuracy (0.43 vs. 0.53), but performs comparably on micro- and macro-averaged scores. BEETLE II precision is higher than that of the classifier. This is not unexpected given how the system was designed: since misunderstandings caused dialogue breakdown in pilot tests, the interpreter was built to prefer rejecting utterances as uninterpretable rather than assigning them to an incorrect class, leading to high precision but lower recall.

However, we can use all our predictive models to evaluate the classifier. We checked the confusion matrix (not shown here due to space limitations), and saw that the classifier made some of the same types of confusions that the BEETLE II interpreter made. On the scores.all model, the predicted learning gain score for the classifier is 0.63, also very close to BEETLE II. But with the conf+scores.all model, the predicted score is 0.89, compared to 0.59 for BEETLE II, indicating that we should prefer the newly built classifier.

Looking at individual class performance, the classifier performs better than the BEETLE II interpreter on identifying correct and contradictory answers, but does not do as well for partially correct but incomplete, and for irrelevant answers. Using our predictive performance metric highlights the differences between the classifiers and effectively helps determine which confusion types are the most important.

One limitation of this prediction, however, is that the original system's output is considerably more complex: the BEETLE II interpreter explicitly identifies correct, incorrect and missing parts of the student answer which are then used by the system to formulate adaptive feedback. This is an important feature of the system because it allows for implementation of strategies such as acknowledging and restating correct parts of the answer. However, we could still use a classifier to double-check the interpreter's output. If the predictions made by the original interpreter and the classifier differ, and in particular when the classifier assigns the contradictory label to an answer, BEETLE II may choose to use a generic strategy for contradictory utterances, e.g. telling the student that their answer is incorrect without specifying the exact problem, or asking them to re-read portions of the material.

6 Discussion and Future Work

In this paper, we proposed an approach for cost-sensitive evaluation of language interpretation within practical applications. Our approach is based on the PARADISE methodology for dialogue system evaluation (Walker et al., 2000). We followed the typical pattern of a PARADISE study, but instead of relying on a variety of features that characterize the interaction, we used scores that reflect only the performance of the interpretation component. For BEETLE II we could build regression models that account for nearly 50% of the variance in the desired outcomes, on par with models reported in earlier PARADISE studies (Möller et al., 2007; Möller et al., 2008; Walker et al., 2000; Larsen, 2003). More importantly, we demonstrated that combining averaged scores with features based on confusion frequencies improves prediction quality and allows us to see differences between systems which are not obvious from the scores alone.

Previous work on task-based evaluation of NLP components used RTE or information extraction as target tasks (Sammons et al., 2010; Yuret et al., 2010; Miyao et al., 2008), based on standard corpora. We specifically targeted applications which involve human-computer interaction, where running task-based evaluations is particularly expensive, and building a predictive model of system performance can simplify system development.

Our evaluation data limited the set of features that we could use in our models. For most confusion features, there were not enough instances in the data to build a model that would reliably predict learning gain for those cases. One way to solve this problem would be to conduct a user study in which the system simulates random errors appearing some of the time. This could provide the data needed for more accurate models.

The general pattern we observed in our data is that a model based on F-scores alone predicts only a small proportion of the variance. If a full set of metrics (including F-score, precision and recall) is used, linear regression derives a more complex equation, with different weights for precision and recall. Instead of the linear model, we may consider using a model based on the F score, F_β = (1 + β²)PR / (β²P + R), and fitting it to the data to derive the β weight rather than using the standard F1 score. We plan to investigate this in the future.

Our method would apply to a wide range of systems. It can be used straightforwardly with many current spoken dialogue systems which rely on classifiers to support language understanding in domains such as call routing and technical support (Gupta et al., 2006; Acomb et al., 2007). We applied it to a system that outputs more complex logical forms, but we showed that we could simplify its output to a set of labels which still allowed us to make informed decisions. Similar simplifications could be derived for other systems based on domain-specific dialogue acts typically used in dialogue management. For slot-based systems, it may be useful to consider concept accuracy for recognizing individual slot values. Finally, for tutoring systems it is possible to annotate the answers on a more fine-grained level. Nielsen et al. (2008) proposed an annotation scheme based on the output of a dependency parser, and trained a classifier to identify individual dependencies as expressed, contradicted or unaddressed. Their system could be evaluated using the same approach.

The specific formulas we derived are not likely to be highly generalizable. It is a well-known limitation of PARADISE evaluations that models built based on one system often do not perform well when applied to different systems (Möller et al., 2008). But using them to compare implementation variants during system development, without re-running user evaluations, can provide important information, as we illustrated with an example of evaluating a new classifier we built for our interpretation task. Moreover, the confusion frequency feature that our models picked is consistent with earlier results from a different tutoring domain (see Section 4.2). Thus, these models could provide a starting point when making system development choices, which can then be confirmed by user evaluations in new domains.

The models we built do not fully account for the variance in the training data. This is expected, since interpretation performance is not the only factor influencing the objective outcome: other factors, such as choosing the appropriate tutoring strategy, are also important. Similar models could be built for other system components to account for their contribution to the variance. Finally, we could consider using different learning algorithms. Möller et al. (2008) examined decision trees and neural networks in addition to multiple linear regression for predicting user satisfaction in spoken dialogue. They found that neural networks had the best prediction performance for their task. We plan to explore other learning algorithms for this task as part of our future work.

7 Conclusion

In this paper, we described an evaluation of an interpretation component of a tutorial dialogue system using predictive models that link intrinsic evaluation scores with learning outcomes. We showed that adding features based on confusion frequencies for individual classes significantly improves the prediction. This approach can be used to compare different implementations of language interpretation components, and to decide which option to use, based on the predicted improvement in a task-specific target outcome metric trained on previous evaluation data.

Acknowledgments

We thank Natalie Steinhauser, Gwendolyn Campbell, Charlie Scott, Simon Caine, Leanne Taylor, Katherine Harrison and Jonathan Kilgour for help with data collection and preparation; and Christopher Brew for helpful comments and discussion. This work has been supported in part by the US ONR award N000141010085.
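The F_β fitting idea raised in Section 6 can be sketched as follows. The per-dialogue precision/recall pairs and learning gains are invented, and a simple grid search over β stands in for proper fitting:

```python
# Fit the beta weight in F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
# to outcome data instead of assuming the standard F1 (beta = 1).
import math

def f_beta(p, r, beta):
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def fit_beta(scores, gains, grid):
    """Pick the beta whose F_beta correlates best with learning gain."""
    return max(grid, key=lambda b: pearson([f_beta(p, r, b) for p, r in scores], gains))

# Invented data in which recall tracks learning gain more closely than precision:
scores = [(0.9, 0.2), (0.8, 0.4), (0.7, 0.6), (0.5, 0.8), (0.4, 0.9)]
gains = [0.10, 0.20, 0.35, 0.50, 0.60]
best = fit_beta(scores, gains, [0.25, 0.5, 1.0, 2.0, 4.0])
assert best > 1.0  # a recall-weighted F score fits this data better than F1
```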
References

Kate Acomb, Jonathan Bloom, Krishna Dayanidhi, Phillip Hunter, Peter Krogh, Esther Levin, and Roberto Pieraccini. 2007. Technical support dialog systems: Issues, problems, and solutions. In Proceedings of the Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies, pages 25–31, Rochester, NY, April.

Gwendolyn C. Campbell, Natalie B. Steinhauser, Myroslava O. Dzikovska, Johanna D. Moore, Charles B. Callaway, and Elaine Farrow. 2009. The DeMAND coding scheme: A common language for representing and analyzing student discourse. In Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED), poster session, Brighton, UK, July.

Myroslava O. Dzikovska, Charles B. Callaway, Elaine Farrow, Johanna D. Moore, Natalie B. Steinhauser, and Gwendolyn E. Campbell. 2009. Dealing with interpretation errors in tutorial dialogue. In Proceedings of the SIGDIAL 2009 Conference, pages 38–45, London, UK, September.

Myroslava Dzikovska, Diana Bental, Johanna D. Moore, Natalie B. Steinhauser, Gwendolyn E. Campbell, Elaine Farrow, and Charles B. Callaway. 2010a. Intelligent tutoring with natural language support in the Beetle II system. In Sustaining TEL: From Innovation to Learning and Practice - 5th European Conference on Technology Enhanced Learning (EC-TEL 2010), Barcelona, Spain, October.

Myroslava O. Dzikovska, Johanna D. Moore, Natalie Steinhauser, Gwendolyn Campbell, Elaine Farrow, and Charles B. Callaway. 2010b. Beetle II: a system for tutoring and computational linguistics experimentation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL-2010) demo session, Uppsala, Sweden, July.

Kate Forbes-Riley and Diane J. Litman. 2006. Modelling user satisfaction and student learning in a spoken dialogue tutoring system with generic, tutoring, and user affect parameters. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL '06), pages 264–271, Stroudsburg, PA, USA.

Kate Forbes-Riley, Diane Litman, Amruta Purandare, Mihai Rotaru, and Joel Tetreault. 2007. Comparing linguistic features for modeling learning in computer tutoring. In Proceedings of the 2007 Conference on Artificial Intelligence in Education: Building Technology Rich Learning Contexts That Work, pages 270–277, Amsterdam, The Netherlands. IOS Press.

Narendra K. Gupta, Gökhan Tür, Dilek Hakkani-Tür, Srinivas Bangalore, Giuseppe Riccardi, and Mazin Gilbert. 2006. The AT&T spoken language understanding system. IEEE Transactions on Audio, Speech & Language Processing, 14(1):213–222.

Pamela W. Jordan, Maxim Makatchev, and Umarani Pappuswamy. 2006. Understanding complex natural language explanations in tutorial applications. In Proceedings of the Third Workshop on Scalable Natural Language Understanding, ScaNaLU '06, pages 17–24.

Lars Bo Larsen. 2003. Issues in the evaluation of spoken dialogue systems using objective and subjective measures. In Proceedings of the 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 209–214.

David D. Lewis. 1991. Evaluating text categorization. In Proceedings of the Workshop on Speech and Natural Language, HLT '91, pages 312–318, Stroudsburg, PA, USA.

Diane Litman, Johanna Moore, Myroslava Dzikovska, and Elaine Farrow. 2009. Using natural language processing to analyze tutorial dialogue corpora across domains and modalities. In Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED), Brighton, UK, July.

Yusuke Miyao, Rune Sætre, Kenji Sagae, Takuya Matsuzaki, and Jun'ichi Tsujii. 2008. Task-oriented evaluation of syntactic parsers and their representations. In Proceedings of ACL-08: HLT, pages 46–54, Columbus, Ohio, June.

Sebastian Möller, Paula Smeele, Heleen Boland, and Jan Krebber. 2007. Evaluating spoken dialogue systems according to de-facto standards: A case study. Computer Speech & Language, 21(1):26–53.

Sebastian Möller, Klaus-Peter Engelbrecht, and Robert Schleicher. 2008. Predicting the quality and usability of spoken dialogue services. Speech Communication, pages 730–744.

Rodney D. Nielsen, Wayne Ward, and James H. Martin. 2008. Learning to assess low-level conceptual understanding. In Proceedings of the 21st International FLAIRS Conference, Coconut Grove, Florida, May.

Mihai Rotaru and Diane J. Litman. 2006. Exploiting discourse structure for spoken dialogue performance analysis. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP '06, pages 85–93, Stroudsburg, PA, USA.

Mark Sammons, V.G. Vinod Vydiswaran, and Dan Roth. 2010. Ask not what textual entailment can do for you... In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1199–1208, Uppsala, Sweden, July.

Marilyn A. Walker, Candace A. Kamm, and Diane J. Litman. 2000. Towards developing general models of usability with PARADISE. Natural Language Engineering, 6(3).

Deniz Yuret, Aydin Han, and Zehra Turgut. 2010. SemEval-2010 task 12: Parser evaluation using textual entailments. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 51–56, Uppsala, Sweden, July.
Experimenting with Distant Supervision for Emotion Classification

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 482–491, Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics
ually labelled data.

We show that the success of this approach depends on both the conventional markers chosen and the emotion classes themselves. Some emotions are both reliably marked by different conventions and distinguishable from other emotions; this seems particularly true for happiness, sadness and anger, indicating that this approach can provide not only the basic distinction required for sentiment analysis but some finer-grained information. Others are either less distinguishable from short text messages, or less reliably marked.

2 Related Work

2.1 Emotion and Sentiment Classification

Much research in this area has concentrated on the related tasks of subjectivity classification (distinguishing objective from subjective texts; see, e.g., Wiebe and Riloff (2005)) and sentiment classification (classifying subjective texts into those that convey positive, negative and neutral sentiment; see, e.g., Pang and Lee (2008)). We are interested in emotion detection: classifying subjective texts according to a finer-grained classification of the emotions they convey, and thus providing richer and more informative data for social media analysis than simple positive/negative sentiment. In this study we confine ourselves to the six basic emotions identified by Ekman (1972) as being common across cultures; other finer-grained classifications are of course available.

2.1.1 Emotion Classification

The task of emotion classification is by nature a multi-class problem, and classification experiments have therefore achieved lower accuracies than seen in the binary problems of sentiment and subjectivity classification. Danisman and Alpkocak (2008) used vector space models for the same six-way emotion classification we examine here, and achieved F-measures around 32%; Seol et al. (2008) used neural networks for an 8-way classification (hope, love, thank, neutral, happy, sad, fear, anger) and achieved per-class accuracies of 45% to 65%. Chuang and Wu (2004) used supervised classifiers (SVMs) and manually defined keyword features over a seven-way classification consisting of the same six-class taxonomy plus a neutral category, and achieved an average accuracy of 65.5%, varying from 56% for disgust to 74% for anger. However, they achieved significant improvements using acoustic features available in their speech data, improving accuracies up to a maximum of 81.5%.

2.2 Conventions

As we are using text data, such intonational and prosodic cues are unavailable, as are the other rich sources of emotional cues we obtain from gesture, posture and facial expression in face-to-face communication. However, the prevalence of online text-based communication has led to the emergence of textual conventions understood by the users to perform some of the same functions as these acoustic and non-verbal cues. The most familiar of these is the use of emoticons, either Western-style (e.g. :), :-( etc.) or Eastern-style (e.g. (^_^), (>_<) etc.). Other conventions have emerged more recently for particular interfaces or domains; in Twitter data, one common convention is the use of hashtags to add or emphasise emotional content; see (1).

(1) a. Best day in ages! #Happy :)
    b. Gets so #angry when tutors dont email back... Do you job idiots!

Linguistic and social research into the use of such conventions suggests that their function is generally to emphasise or strengthen the emotion or sentiment conveyed by a message, rather than to add emotional content which would not otherwise be present. Walther and D'Addario (2001) found that the contribution of emoticons towards the sentiment of a message was outweighed by the verbal content, although negative ones tended to shift interpretation towards the negative. Ip (2002) experimented with emoticons in instant messaging, with the results suggesting that emoticons do not add positivity or negativity but rather increase valence (making positive messages more positive and vice versa). Similarly, Derks et al. (2008a; 2008b) found that emoticons are used in strengthening the intensity of a verbal message (although they serve other functions such as expressing humour), and hypothesized that they serve similar functions to actual non-verbal behavior; Provine et al. (2007) also found that emoticons are used to punctuate messages rather than replace lexical content, appearing in similar grammatical locations to verbal laughter and preserving phrase structure.
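The conventional markers described above can be detected with simple pattern matching. The marker inventory and regular expressions below are illustrative assumptions, not the taxonomy used in this paper:

```python
# Detect Western-style emoticons and emotion hashtags in tweet text.
import re

# A few assumed markers per emotion class (illustrative only):
MARKERS = {
    "happiness": [":)", ":-)", "#happy", "#happiness"],
    "anger": [">:(", "#angry", "#anger"],
    "sadness": [":(", ":-(", "#sad", "#sadness"],
}

def find_markers(tweet):
    """Return the emotion classes whose markers appear in the tweet."""
    text = tweet.lower()
    found = set()
    for emotion, markers in MARKERS.items():
        for m in markers:
            if m.startswith("#"):
                # Hashtags must match as whole tags, not as prefixes
                # (so "#happy" does not fire inside "#happiness").
                if re.search(re.escape(m) + r"\b", text):
                    found.add(emotion)
            elif m in text:
                found.add(emotion)
    return found

def strip_markers(tweet, emotion):
    """Remove the class's markers so the label is not available as a feature."""
    text = tweet
    for m in MARKERS[emotion]:
        text = re.sub(re.escape(m), " ", text, flags=re.IGNORECASE)
    return " ".join(text.split())

tweet = "Best day in ages! #Happy :)"
assert find_markers(tweet) == {"happiness"}
assert strip_markers(tweet, "happiness") == "Best day in ages!"
```

Stripping the labelling marker from the message, as in `strip_markers`, mirrors the distant-supervision setup discussed below, where the marker used as a label must not leak into the features.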
2.3 Distant Supervision

These findings suggest, of course, that emoticons and related conventional markers are likely to be useful features for sentiment and emotion classification. They also suggest, though, that they might be used as surrogates for manual emotion class labels: if their function is often to complement the verbal content available in messages, they should give us a way to automatically label messages according to emotional class, while leaving us with messages with enough verbal content to achieve reasonable classification.

This approach has been exploited in several ways in recent work: Tanaka et al. (2005) used Japanese-style emoticons as classification labels, and Go et al. (2009) and Pak and Paroubek (2010) used Western-style emoticons to label and classify Twitter messages according to positive and negative sentiment, using traditional supervised classification methods. The highest accuracies appear to have been achieved by Go et al. (2009), who used various combinations of features (unigrams, bigrams, part-of-speech tags) and classifiers (Naive Bayes, maximum entropy, and SVMs), achieving their best accuracy of 83.0% with unigram and bigram features and a maximum entropy classifier; using only unigrams with an SVM classifier achieved only slightly lower accuracy at 82.2%. Ansari (2010) then provides an initial investigation into applying the same methods to six-way emotion classification, treating each emotion independently as a binary classification problem and showing that accuracy varied with emotion class as well as with dataset size. The highest accuracies achieved were up to 81%, but these were on very small datasets (e.g. 81.0% accuracy on fear, but with only around 200 positive and negative data instances).

We view this approach as having several advantages: apart from the ease of data collection it allows by avoiding manual annotation, it gives us access to the authors' own intended interpretations, as the markers are of course added by the authors themselves at time of writing. In some cases, such as the examples of (1) above, the emotion conveyed may be clear to a third-party annotator; but in others it may not be clear at all without the marker; see (2):

(2) a. Still trying to recover from seeing the #bluewaffle on my TL #disgusted #sick
    b. Leftover ToeJams with Kettle Salt and Vinegar chips. #stress #sadness #comfort #letsturnthisfrownupsidedown

3 Methodology

We used a collection of Twitter messages, all marked with emoticons or hashtags corresponding to one of Ekman (1972)'s six emotion classes. For emoticons, we used Ansari (2010)'s taxonomy, taken from the Yahoo messenger classification. For hashtags, we used the emotion names themselves together with the main related adjective; both are used commonly on Twitter in slightly different ways, as shown in (3); note that emotion names are often used as marked verbs as well as nouns. Details of the classes and markers are given in Table 1.

(3) a. Gets so #angry when tutors dont email back... Do you job idiots!
    b. Im going to say it, Paranormal Activity 2 scared me and I didnt sleep well last night because of it. #fear #demons
    c. Girls that sleep w guys without even fully getting to know them #disgust me

Messages with multiple conventions (see (4)) were collected and used in the experiments, ensuring that the marker being used as a label in a particular experiment was not available as a feature in that experiment. Messages with no markers were not collected. While this prevents us from experimenting with the classification of neutral or objective messages, it would require manual annotation to distinguish these from emotion-carrying messages which are not marked. We assume that any implementation of the techniques we investigate here would be able to use a preliminary stage of subjectivity and/or sentiment detection to identify these messages, and leave this aside here.

(4) a. just because people are celebs they dont reply to your tweets! NOT FAIR #Angry :( I wish They would reply! #Please

Data was collected from Twitter's Streaming API service.¹ This provides a 1-2% random sample of all tweets with no constraints on language

¹See http://dev.twitter.com/docs/streaming-api.
Table 1: Conventional markers used for emotion classes.

  Emoticons:
    happy     :-) :) ;-) :D :P 8) 8-| <@o
    sad       :-( :( ;-( :-< :'(
    anger     :-@ :@
    fear      :| :-o :-O
    surprise  :s :S
    disgust   :$ +o(
  Hashtags:
    happy     #happy #happiness
    sad       #sad #sadness
    anger     #angry #anger
    fear      #scared #fear
    surprise  #surprised #surprise
    disgust   #disgusted #disgust

or location. These are collected in near real time and stored in a local database. An English language selection filter was applied; scripts collecting each conventional marker set were alternated throughout different times of day and days of the week to avoid any bias associated with e.g. weekends or mornings. The numbers of messages collected varied with the popularity of the markers themselves: for emoticons, we obtained a maximum of 837,849 (for happy) and a minimum of 10,539 for anger; for hashtags, a maximum of 10,219 for happy and a minimum of 536 for disgust.²

Classification in all experiments was using support vector machines (SVMs) (Vapnik, 1995) via the LIBSVM implementation of Chang and Lin (2001) with a linear kernel and unigram features. Unigram features included all words and hashtags (other than those used as labels in relevant experiments) after removal of URLs and Twitter usernames. Some improvement in performance might be available using more advanced features (e.g. n-grams), other classification methods (e.g. maximum entropy, as lexical features are unlikely to be independent) and/or feature weightings (e.g. the variant of TF-IDF used for sentiment classification by Martineau (2009)). Here, our interest is more in the difference between the emotion and convention marker classes; we leave investigation of absolute performance for future work.

²One possible way to increase dataset sizes for the rarer markers might be to include synonyms in the hashtag names used; however, people's use and understanding of hashtags is not straightforwardly predictable from lexical form. Instead, we intend to run a longer-term data gathering exercise.

4 Experiments

Throughout, the markers (emoticons and/or hashtags) used as labels in any experiment were removed before feature extraction in that experiment; labels were not used as features.

4.1 Experiment 1: Emotion detection

To simulate the task of detecting emotion classes from a general stream of messages, we first built for each convention type C and each emotion class E a dataset D^C_E of size N containing (a) as positive instances, N/2 messages containing markers of the emotion class E and no other markers of type C, and (b) as negative instances, N/2 messages containing markers of type C of any other emotion class. For example, the positive instance set for emoticon-marked anger was based on those tweets which contained :-@ or :@, but none of the emoticons from the happy, sad, surprise, disgust or fear classes; any hashtags were allowed, including those associated with emotion classes. The negative instance set contained a representative sample of the same number of instances, with each having at least one of the happy, sad, surprise, disgust or fear emoticons but not containing :-@ or :@.

This of course excludes messages with no emotional markers; for this to act as an approximation of the general task therefore requires an assumption that unmarked messages reflect the same distribution over emotion classes as marked messages. For emotion-carrying but unmarked messages, this does seem intuitively likely, but requires investigation. For neutral objective messages it is clearly false, but as stated above we assume a preliminary stage of subjectivity detection in any practical application.

Performance was evaluated using 10-fold cross-validation. Results are shown as the bold figures in Table 2; despite the small dataset sizes in some cases, a χ² test shows all to be significantly different from chance. The best-performing classes show accuracies very similar to those achieved by Go et al. (2009) for their binary positive/negative classification, as might be expected; for emoticon markers, the best classes are happy, sad and anger; interestingly the best classes for hashtag markers are not the same
but the highest figures (between 63% and 68%) are achieved for happy, sad and anger; here perhaps we can have some confidence that not only are the markers acting as predictable labels themselves, but also seem to be labelling the same thing (and therefore perhaps are actually labelling the emotion we are hoping to label).

4.2 Experiment 2: Emotion discrimination

To investigate whether these independent classifiers can be used in multi-class classification (distinguishing emotion classes from each other rather than just distinguishing one class from a general 'other' set), we next cross-tested the classifiers between emotion classes: training models on one emotion and testing on the others; for each convention type C and each emotion class E1, train a classifier on dataset D^C_E1 and test on

Table 2: Experiment 1: Within-class results. Same-convention (bold) figures are accuracies over 10-fold cross-validation; cross-convention (italic) figures are accuracies over full sets.

  Convention  Test      Train: emoticon  Train: hashtag
  emoticon    happy     79.8%            63.5%
  emoticon    sad       79.9%            65.5%
  emoticon    anger     80.1%            62.9%
  emoticon    fear      76.2%            58.5%
  emoticon    surprise  77.4%            48.2%
  emoticon    disgust   75.2%            54.6%
  hashtag     happy     67.7%            82.5%
  hashtag     sad       67.1%            74.6%
  hashtag     anger     62.8%            74.7%
  hashtag     fear      60.6%            77.2%
  hashtag     surprise  51.9%            67.4%
  hashtag     disgust   64.6%            78.3%
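The dataset construction described in Section 4.1 can be sketched as follows. This is our own illustrative reconstruction, not the authors' code: the function names, marker sets and toy messages below are invented, and a real run would draw N/2 instances per side from the collected Twitter data.

```python
import random

# Sketch of the Experiment 1 dataset D^C_E: for one convention type C
# (here, emoticons) and emotion class E, positives carry only markers of
# E, negatives carry markers of the other classes of the same convention.
def has_marker(msg, markers):
    return any(m in msg for m in markers)

def build_dataset(messages, marker_classes, emotion, n, seed=0):
    """Return n/2 positive and n/2 negative instances for `emotion`."""
    other = [m for cls, ms in marker_classes.items()
             if cls != emotion for m in ms]
    pos = [m for m in messages
           if has_marker(m, marker_classes[emotion])
           and not has_marker(m, other)]
    neg = [m for m in messages
           if has_marker(m, other)
           and not has_marker(m, marker_classes[emotion])]
    rng = random.Random(seed)
    return ([(m, 1) for m in rng.sample(pos, n // 2)] +
            [(m, 0) for m in rng.sample(neg, n // 2)])

# Toy data for illustration only.
emoticons = {"anger": [":-@", ":@"], "happy": [":)"], "sad": [":("]}
msgs = ["grr :@", "yay :)", "boo :(", "mixed :@ :)"]
print(build_dataset(msgs, emoticons, "anger", 2))
```

Note that a message carrying markers of several classes of the same convention (like "mixed :@ :)" above) qualifies as neither a positive nor a negative instance, matching the paper's exclusion of ambiguously marked positives.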
Table 4: Experiment 2: Cross-class results. Same-class figures from 10-fold cross-validation are shown in parentheses for comparison; all other figures are accuracies over full sets.
Train
Convention Test happy sad anger fear surprise disgust
emoticon happy (78.1%) 17.3% 39.6% 26.7% 28.3% 42.8%
emoticon sad 16.5% (78.9%) 59.1% 71.9% 69.9% 55.5%
emoticon anger 29.8% 67.0% (79.7%) 74.2% 76.4% 67.5%
emoticon fear 27.0% 69.9% 64.4% (75.3%) 74.0% 61.2%
emoticon surprise 25.4% 69.9% 67.7% 76.3% (78.1%) 66.4%
emoticon disgust 42.2% 54.4% 61.1% 64.2% 64.1% (73.9%)
hashtag happy (81.1%) 10.7% 45.3% 47.8% 52.7% 43.4%
hashtag sad 13.8% (77.9%) 47.7% 49.7% 46.5% 54.2%
hashtag anger 44.6% 45.2% (74.3%) 72.0% 63.0% 62.9%
hashtag fear 45.0% 50.4% 68.6% (74.7%) 63.9% 60.7%
hashtag surprise 51.5% 45.7% 67.4% 70.7% (70.2%) 64.2%
hashtag disgust 40.4% 53.5% 74.7% 71.8% 70.8% (74.2%)
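The cross-testing behind Tables 4 and 5 amounts to training a classifier on one labelled dataset and evaluating it on another. A minimal sketch of that loop, with a toy unigram perceptron standing in for the paper's linear-kernel LIBSVM classifier; the training and test sentences are invented, and in the paper's setup the label markers have already been stripped from the text:

```python
from collections import defaultdict

# Unigram bag-of-words features, as in the paper (URLs and usernames
# would already have been removed upstream).
def features(msg):
    return msg.lower().split()

def train_perceptron(data, epochs=10):
    """Mistake-driven perceptron; labels y are +1 / -1."""
    w = defaultdict(float)
    for _ in range(epochs):
        for msg, y in data:
            score = sum(w[f] for f in features(msg))
            if y * score <= 0:          # wrong (or zero): update weights
                for f in features(msg):
                    w[f] += y
    return w

def accuracy(w, data):
    correct = sum(1 for msg, y in data
                  if y * sum(w[f] for f in features(msg)) > 0)
    return correct / len(data)

# Toy cross-test: train on one labelled set, evaluate on another.
train_data = [("so furious right now", 1), ("utterly livid", 1),
              ("lovely sunny day", -1), ("great fun today", -1)]
test_data = [("furious about this", 1), ("sunny and great", -1)]
w = train_perceptron(train_data)
print(accuracy(w, test_data))  # 1.0 on this toy data
```

The same `accuracy(w, other_set)` call gives the off-diagonal cells of Tables 4 and 5 when `other_set` is labelled by a different emotion class or a different marker convention than the training set.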
Table 5: Experiment 2: Cross-class, cross-convention results (train on hashtags, test on emoticons and vice
versa). All figures are accuracies over full sets. Accuracies over 60% are shown in bold.
Train
Convention Test happy sad anger fear surprise disgust
emoticon happy 61.2% 40.4% 44.1% 47.4% 52.0% 45.9%
emoticon sad 38.3% 60.2% 55.1% 51.5% 47.1% 53.9%
emoticon anger 47.0% 48.0% 63.7% 56.2% 50.9% 56.6%
emoticon fear 39.8% 57.7% 57.1% 55.9% 50.8% 56.1%
emoticon surprise 43.7% 55.2% 59.2% 58.4% 53.1% 54.0%
emoticon disgust 51.5% 48.0% 53.5% 55.1% 53.1% 51.5%
hashtag happy 68.7% 32.5% 43.6% 32.1% 35.4% 50.4%
hashtag sad 33.8% 65.4% 53.2% 65.0% 61.8% 48.8%
hashtag anger 43.9% 55.5% 63.9% 59.6% 60.4% 53.0%
hashtag fear 44.3% 54.6% 56.1% 58.9% 61.5% 54.3%
hashtag surprise 54.2% 45.3% 49.8% 49.9% 51.8% 52.3%
hashtag disgust 41.5% 57.6% 61.6% 62.2% 59.3% 55.4%
associated with surprise produces classifiers which perform well on data labelled with many other hashtag classes, suggesting that those emotions were present in the training data. Conversely, the more specific hashtag labels produce classifiers which perform poorly on data labelled with emoticons and which thus contains a range of actual emotions.

4.3 Experiment 3: Manual labelling

To confirm whether either (or both) set of automatic (distant) labels do in fact label the underlying emotion class intended, we used human annotators via Amazon's Mechanical Turk to label a set of 1,000 instances. These instances were all labelled with emoticons (we did not use hashtag-labelled data: as hashtags are so lexically close to the names of the emotion classes being labelled, their presence may influence labellers unduly)³ and were evenly distributed across the 6 classes, in so far as indicated by the emoticons. Labellers were asked to choose the primary emotion class (from the fixed set of six) associated with the message; they were also allowed to specify if any other classes were also present. Each data instance was labelled by three different annotators.

Agreement between labellers was poor overall. The three annotators unanimously agreed in only 47% of cases overall, although two of three agreed in 83% of cases. Agreement was worst for the three classes already seen to be problematic: surprise, fear and disgust. To create our dataset for this experiment, we therefore took only instances which were given the same primary label by all labellers, i.e. only those examples which we could take as reliably and unambiguously labelled. This gave an unbalanced dataset, with numbers varying from 266 instances for happy to only 12 instances for each of surprise and fear. Classifiers were trained using the datasets from Experiment 2. Performance is shown in Table 6; given the imbalance between class numbers in the test dataset, evaluation is given as recall, precision and F-score for the class in question rather than a simple accuracy figure (which is biased by the high proportion of happy examples).

³Although, of course, one may argue that they do the same for their intended audience of readers, in which case such an effect is legitimate.

Table 6: Experiment 3: Results on manual labels.

  Train     Class     Precision  Recall  F-score
  emoticon  happy     79.4%      75.6%   77.5%
  emoticon  sad       43.5%      73.2%   54.5%
  emoticon  anger     62.2%      37.3%   46.7%
  emoticon  fear      6.8%       63.6%   12.3%
  emoticon  surprise  15.0%      90.0%   25.7%
  emoticon  disgust   8.3%       25.0%   12.5%
  hashtag   happy     78.9%      51.9%   62.6%
  hashtag   sad       47.9%      81.7%   60.4%
  hashtag   anger     58.2%      76.0%   65.9%
  hashtag   fear      10.1%      81.8%   18.0%
  hashtag   surprise  5.9%       60.0%   10.7%
  hashtag   disgust   6.7%       66.7%   11.8%
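The per-class scores of the kind reported in Table 6 follow the standard precision/recall/F-measure definitions. A small self-contained check (the gold and predicted labels below are invented for illustration):

```python
# Per-class precision, recall and F-score: `gold` are annotator labels,
# `pred` are classifier outputs over the same instances.
def prf(gold, pred, cls):
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

gold = ["happy", "happy", "sad", "fear", "happy"]
pred = ["happy", "sad", "sad", "happy", "happy"]
print(prf(gold, pred, "happy"))
```

Computing the score per class in this way sidesteps the bias that a single accuracy figure would inherit from the high proportion of happy examples in the unbalanced test set.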
Again, results for happy are good, and correspond fairly closely to the levels of accuracy reported by Go et al. (2009) and others for the binary positive/negative sentiment detection task. Emoticons give significantly better performance than hashtags here. Results for sad and anger are reasonable, and provide a baseline for further experiments with more advanced features and classification methods once more manually annotated data is available for these classes. In contrast, hashtags give much better performance with these classes than the (perhaps vague or ambiguous) emoticons.

The remaining emotion classes, however, show poor performance for both labelling conventions. The observed low precision and high recall can be adjusted using classifier parameters, but F-scores are not improved. Note that Experiment 1 shows that both emoticon and hashtag labels are to some extent predictable, even for these classes; however, Experiment 2 shows that they may not be reliably different to each other, and Experiment 3 tells us that they do not appear to coincide well with human annotator judgements of emotions. More reliable labels may therefore be required; although we do note that the low reliability of the human annotations for these classes, and the correspondingly small amount of annotated data used in this evaluation, means we hesitate to draw strong conclusions about fear, surprise and disgust. An approach which considers multiple classes to be associated with individual messages may also be beneficial: using majority-decision labels rather than unanimous labels improves F-scores for surprise to 23-35% by including many examples also labelled as happy (although this gives no improvements for other classes).

5 Survey

To further determine whether emoticons used as emotion class labels are ambiguous or vague in meaning, we set up a web survey to examine whether people could reliably classify these emoticons.

5.1 Method

Our survey asked people to match up which of the six emotion classes (selected from a drop-down menu) best matched each emoticon. Each drop-down menu included a "Not Sure" option. To avoid any effect of ordering, the order of the emoticon list and each drop-down menu was randomised every time the survey page was loaded. The survey was distributed via Twitter, Facebook and academic mailing lists. Respondents were not given the opportunity to give their own definitions or to provide finer-grained classifications, as we wanted to establish purely whether they would reliably associate labels with the six emotions in our taxonomy.

5.2 Results

The survey was completed by 492 individuals; full results are shown in Table 7. It demonstrated agreement with the predefined emoticons for sad and most of the emoticons for happy (people were unsure what 8-| and <@o meant). For all the emoticons listed as anger, surprise and disgust, the survey showed that people are reliably unsure as to what these mean. For the emoticon :-o there was a direct contrast between the defined meaning and the survey meaning; the definition of this emoticon following Ansari (2010) was fear, but the survey reliably assigned this to surprise.

Given the small scale of the survey, we hesitate to draw strong conclusions about the emoticon meanings themselves (in fact, recent conversations with schoolchildren (see below) have indicated very different interpretations from these adult survey respondents). However, we do conclude that for most emotions outside happy and sad, emoticons may indeed be an unreliable label; as hashtags also appear more reliable in the classification experiments, we expect these to be a more promising approach for fine-grained emotion discrimination in future.

6 Conclusions

The approach shows reasonable performance at individual emotion label prediction, for both emoticons and hashtags. For some emotions (happiness, sadness and anger), performance across label conventions (training on one, and testing on the other) is encouraging; for these classes, performance on those manually labelled examples where annotators agree is also reasonable. This gives us confidence not only that the approach produces reliable classifiers which can predict the labels, but that these classifiers are actually detecting the desired underlying emotional classes,
Table 7: Survey results showing the defined emotion, the most popular emotion from the survey, the percentage of votes this emotion received, and the χ² significance test for the distribution of votes. These are indexed by emoticon.
Emoticon Defined Emotion Survey Emotion % of votes Significance of votes distribution
:-)   Happy     Happy     94.9  χ² = 3051.7 (p < 0.001)
:)    Happy     Happy     95.5  χ² = 3098.2 (p < 0.001)
;-)   Happy     Happy     87.4  χ² = 2541 (p < 0.001)
:D    Happy     Happy     85.7  χ² = 2427.2 (p < 0.001)
:P    Happy     Happy     59.1  χ² = 1225.4 (p < 0.001)
8)    Happy     Happy     61.9  χ² = 1297.4 (p < 0.001)
8-|   Happy     Not Sure  52.2  χ² = 748.6 (p < 0.001)
<@o   Happy     Not Sure  84.6  χ² = 2335.1 (p < 0.001)
:-(   Sad       Sad       91.3  χ² = 2784.2 (p < 0.001)
:(    Sad       Sad       89.0  χ² = 2632.1 (p < 0.001)
;-(   Sad       Sad       67.9  χ² = 1504.9 (p < 0.001)
:-<   Sad       Sad       56.1  χ² = 972.59 (p < 0.001)
:'(   Sad       Sad       80.7  χ² = 2116 (p < 0.001)
:-@   Anger     Not Sure  47.8  χ² = 642.47 (p < 0.001)
:@    Anger     Not Sure  50.4  χ² = 691.6 (p < 0.001)
:s    Surprise  Not Sure  52.2  χ² = 757.7 (p < 0.001)
:$    Disgust   Not Sure  62.8  χ² = 1136 (p < 0.001)
+o(   Disgust   Not Sure  64.2  χ² = 1298.1 (p < 0.001)
:|    Fear      Not Sure  55.1  χ² = 803.41 (p < 0.001)
:-o   Fear      Surprise  89.2  χ² = 2647.8 (p < 0.001)
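The χ² statistics of the kind shown in Table 7 test an emoticon's vote distribution against a uniform null over the answer options. A quick sketch; the vote counts below are invented for illustration, not the actual survey data:

```python
# Pearson chi-squared statistic against a uniform expected distribution:
# sum of (observed - expected)^2 / expected over the answer options.
def chi_squared(observed):
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

# Seven options (six emotion classes plus "Not Sure") and 492 hypothetical
# votes heavily concentrated on one emotion.
votes = [467, 5, 5, 5, 4, 3, 3]
print(round(chi_squared(votes), 1))
```

With six degrees of freedom, a statistic this large corresponds to p < 0.001, which is how a heavily skewed vote distribution registers as significant in the table.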
Daantje Derks, Arjan Bos, and Jasper von Grumbkow. 2008a. Emoticons and online message interpretation. Social Science Computer Review, 26(3):379-388.

Daantje Derks, Arjan Bos, and Jasper von Grumbkow. 2008b. Emoticons in computer-mediated communication: Social motives and social context. CyberPsychology & Behavior, 11(1):99-101, February.

Jacob Eisenstein, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. 2010. A latent variable model for geographic lexical variation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1277-1287, Cambridge, MA, October. Association for Computational Linguistics.

Paul Ekman. 1972. Universals and cultural differences in facial expressions of emotion. In J. Cole, editor, Nebraska Symposium on Motivation 1971, volume 19. University of Nebraska Press.

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. Master's thesis, Stanford University.

Pei-Yun Hsueh, Prem Melville, and Vikas Sindhwani. 2009. Data quality from crowdsourcing: A study of annotation selection criteria. In Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing, pages 27-35, Boulder, Colorado, June. Association for Computational Linguistics.

Amy Ip. 2002. The impact of emoticons on affect interpretation in instant messaging. Carnegie Mellon University.

Justin Martineau. 2009. Delta TFIDF: An improved feature space for sentiment analysis. Artificial Intelligence, 29:258-261.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL-IJCNLP 2009.

Alexander Pak and Patrick Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of the 7th Conference on International Language Resources and Evaluation.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135.

Robert Provine, Robert Spencer, and Darcy Mandell. 2007. Emotional expression online: Emoticons punctuate website text messages. Journal of Language and Social Psychology, 26(3):299-307.

M. Ptaszynski, J. Maciejewski, P. Dybala, R. Rzepka, and K. Araki. 2010. CAO: A fully automatic emoticon analysis system based on theory of kinesics. In Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI-10), pages 1026-1032, Atlanta, GA.

F. Radulovic and N. Milikic. 2009. Smiley ontology. In Proceedings of the 1st International Workshop On Social Networks Interoperability.

Jonathon Read. 2005. Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In Proceedings of the 43rd Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Young-Soo Seol, Dong-Joo Kim, and Han-Woo Kim. 2008. Emotion recognition from text using knowledge-based ANN. In Proceedings of ITC-CSCC.

Y. Tanaka, H. Takamura, and M. Okumura. 2005. Extraction and classification of facemarks with kernel methods. In Proceedings of IUI.

Vladimir N. Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.

Joseph Walther and Kyle D'Addario. 2001. The impacts of emoticons on message interpretation in computer-mediated communication. Social Science Computer Review, 19(3):324-347.

J. Wiebe and E. Riloff. 2005. Creating subjective and objective sentence classifiers from unannotated texts. In Proceedings of the 6th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-05), volume 3406 of Springer LNCS. Springer-Verlag.
Feature-Rich Part-of-speech Tagging
for Morphologically Complex Languages: Application to Bulgarian
Georgi Georgiev and Valentin Zhikov
Ontotext AD
135 Tsarigradsko Sh., Sofia, Bulgaria
{georgi.georgiev,valentin.zhikov}@ontotext.com

Petya Osenova and Kiril Simov
IICT, Bulgarian Academy of Sciences
25A Acad. G. Bonchev, Sofia, Bulgaria
{petya,kivs}@bultreebank.org

Preslav Nakov
Qatar Computing Research Institute, Qatar Foundation
Tornado Tower, floor 10, P.O. Box 5825, Doha, Qatar
pnakov@qf.org.qa
Abstract

We present experiments with part-of-speech tagging for Bulgarian, a Slavic language with rich inflectional and derivational morphology. Unlike most previous work, which has used a small number of grammatical categories, we work with 680 morpho-syntactic tags. We combine a large morphological lexicon with prior linguistic knowledge and guided learning from a POS-annotated corpus, achieving accuracy of 97.98%, which is a significant improvement over the state-of-the-art for Bulgarian.

1 Introduction

Part-of-speech (POS) tagging is the task of assigning each of the words in a given piece of text a contextually suitable grammatical category. This is not trivial since words can play different syntactic roles in different contexts, e.g., "can" is a noun in "I opened a can of coke." but a verb in "I can write." Traditionally, linguists have classified English words into the following eight basic POS categories: noun, pronoun, adjective, verb, adverb, preposition, conjunction, and interjection; this list is often extended a bit, e.g., with determiners, particles, participles, etc., but the number of categories considered is rarely more than 15.

Computational linguistics works with a larger inventory of POS tags, e.g., the Penn Treebank (Marcus et al., 1993) uses 48 tags: 36 for part-of-speech, and 12 for punctuation and currency symbols. This increase in the number of tags is partially due to finer granularity, e.g., there are special tags for determiners, particles, modal verbs, cardinal numbers, foreign words, existential "there", etc., but also to the desire to encode morphological information as part of the tags. For example, there are six tags for verbs in the Penn Treebank: VB (verb, base form; e.g., sing), VBD (verb, past tense; e.g., sang), VBG (verb, gerund or present participle; e.g., singing), VBN (verb, past participle; e.g., sung), VBP (verb, non-3rd person singular present; e.g., sing), and VBZ (verb, 3rd person singular present; e.g., sings); these tags are morpho-syntactic in nature. Other corpora have used even larger tagsets, e.g., the Brown corpus (Kucera and Francis, 1967) and the Lancaster-Oslo/Bergen (LOB) corpus (Johansson et al., 1986) use 87 and 135 tags, respectively.

POS tagging poses major challenges for morphologically complex languages, whose tagsets encode a lot of additional morpho-syntactic features (for most of the basic POS categories), e.g., gender, number, person, etc. For example, the BulTreeBank (Simov et al., 2004) for Bulgarian uses 680 tags, while the Prague Dependency Treebank (Hajic, 1998) for Czech has over 1,400 tags.

Below we present experiments with POS tagging for Bulgarian, which is an inflectional language with rich morphology. Unlike most previous work, which has used a reduced set of POS tags, we use all 680 tags in the BulTreeBank. We combine prior linguistic knowledge and statistical learning, achieving accuracy comparable to that reported for state-of-the-art systems for English.

The remainder of the paper is organized as follows: Section 2 provides an overview of related work, Section 3 describes Bulgarian morphology, Section 4 introduces our approach, Section 5 describes the datasets, Section 6 presents our experiments in detail, Section 7 discusses the results, Section 8 offers application-specific error analysis, and Section 9 concludes and points to some promising directions for future work.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 492-502, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
2 Related Work

Most research on part-of-speech tagging has focused on English, and has relied on the Penn Treebank (Marcus et al., 1993) and its tagset for training and evaluation. The task is typically addressed as a sequential tagging problem; one notable exception is the work of Brill (1995), who proposed non-sequential transformation-based learning.

A number of different sequential learning frameworks have been tried, yielding 96-97% accuracy: Lafferty et al. (2001) experimented with conditional random fields (CRFs) (95.7% accuracy), Ratnaparkhi (1996) used a maximum entropy sequence classifier (96.6% accuracy), Brants (2000) employed a hidden Markov model (96.6% accuracy), and Collins (2002) adopted an averaged perceptron discriminative sequence model (97.1% accuracy). All these models fix the order of inference from left to right.

Toutanova et al. (2003) introduced a cyclic dependency network (97.2% accuracy), where the search is bi-directional. Shen et al. (2007) have further shown that better results (97.3% accuracy) can be obtained using guided learning, a framework for bidirectional sequence classification, which integrates token classification and inference order selection into a single learning task and uses a perceptron-like (Collins and Roark, 2004) passive-aggressive classifier to make the easiest decisions first. Recently, Tsuruoka et al. (2011) proposed a simple perceptron-based classifier applied from left to right but augmented with a lookahead mechanism that searches the space of future actions, yielding 97.3% accuracy.

For morphologically complex languages, the problem of POS tagging typically includes morphological disambiguation, which yields a much larger number of tags. For example, for Arabic, Habash and Rambow (2005) used support vector machines (SVMs), achieving 97.6% accuracy with 139 tags from the Arabic Treebank (Maamouri et al., 2003). For Czech, Hajic et al. (2001) combined a hidden Markov model (HMM) with linguistic rules, which yielded 95.2% accuracy using an inventory of over 1,400 tags from the Prague Dependency Treebank (Hajic, 1998). For Icelandic, Dredze and Wallenberg (2008) reported 92.1% accuracy with 639 tags developed for the Icelandic frequency lexicon (Pind et al., 1991); they used guided learning and tag decomposition: first, a coarse POS class is assigned (e.g., noun, verb, adjective), then additional fine-grained morphological features like case, number and gender are added, and finally the proposed tags are further reconsidered using non-local features. Similarly, Smith et al. (2005) decomposed the complex tags into factors, where models for predicting part-of-speech, gender, number, case, and lemma are estimated separately, and then composed into a single CRF model; this yielded competitive results for Arabic, Korean, and Czech.

Most previous work on Bulgarian POS tagging has started with large tagsets, which were then reduced. For example, Dojchinova and Mihov (2004) mapped their initial tagset of 946 tags to just 40, which allowed them to achieve 95.5% accuracy using the transformation-based learning of Brill (1995), and 98.4% accuracy using manually crafted linguistic rules. Similarly, Georgiev et al. (2009), who used maximum entropy and the BulTreeBank (Simov et al., 2004), grouped its 680 fine-grained POS tags into 95 coarse-grained ones, and thus improved their accuracy from 90.34% to 94.4%. Simov and Osenova (2001) used a recurrent neural network to predict (a) 160 morpho-syntactic tags (92.9% accuracy) and (b) 15 POS tags (95.2% accuracy).

Some researchers did not reduce the tagset: Savkov et al. (2011) used 680 tags (94.7% accuracy), and Tanev and Mitkov (2002) used 303 tags and the BULMORPH morphological analyzer (Krushkov, 1997), achieving P=R=95%.

3 Bulgarian Morphology

Bulgarian is an Indo-European language from the Slavic language group, written with the Cyrillic alphabet and spoken by about 9-12 million people. It is also a member of the Balkan Sprachbund and thus differs from most other Slavic languages: it has no case declensions, uses a suffixed definite article (which has a short and a long form for singular masculine), and lacks verb infinitive forms. It further uses special evidential verb forms to express unwitnessed, retold, and doubtful activities.

Bulgarian is an inflective language with very rich morphology. For example, Bulgarian verbs have 52 synthetic wordforms on average, while pronouns have altogether more than ten grammatical features (not necessarily shared by all pronouns), including case, gender, person, number, definiteness, etc.
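The tag-decomposition strategy mentioned in Section 2 (predicting a coarse POS class plus separate fine-grained features, rather than one atomic tag) can be illustrated on a positional tag string. The field layout below is a simplified invention for the example, not the actual BulTreeBank encoding:

```python
# Split a positional morpho-syntactic tag into named factors that could
# be predicted by separate models. The scheme (one character per field,
# '-' for "not applicable") is a toy assumption.
FIELDS = ["pos", "type", "gender", "number", "definiteness"]

def decompose(tag):
    """Map a tag like 'Ncmsd' to named factors, padding missing fields."""
    values = list(tag) + ["-"] * (len(FIELDS) - len(tag))
    return dict(zip(FIELDS, values))

print(decompose("Ncmsd"))
# {'pos': 'N', 'type': 'c', 'gender': 'm', 'number': 's', 'definiteness': 'd'}
```

Factoring a 680-tag inventory this way shrinks each sub-problem to a handful of values per field, which is what makes the decomposed models of Dredze and Wallenberg (2008) and Smith et al. (2005) tractable.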
This rich morphology inevitably leads to ambiguity proliferation; our analysis of the BulTreeBank shows four major types of ambiguity:

1. Between the wordforms of the same lexeme, i.e., in the paradigm. For example, divana, an inflected form of divan ('sofa', masculine), can mean (a) 'the sofa' (definite, singular, short definite article) or (b) a count form, e.g., as in dva divana ('two sofas').

2. Between two or more lexemes, i.e., conversion. For example, kato can be (a) a subordinator meaning 'as, when', or (b) a preposition meaning 'like, such as'.

3. Between a lexeme and an inflected wordform of another lexeme, i.e., across-paradigms. For example, politika can mean (a) 'the politician' (masculine, singular, definite, short definite article) or (b) 'politics' (feminine, singular, indefinite).

4. Between the wordforms of two or more lexemes, i.e., across-paradigms and quasi-conversion. For example, varvi can mean (a) 'walks' (verb, 2nd or 3rd person, present tense) or (b) 'strings, laces' (feminine, plural, indefinite).

Some morpho-syntactic ambiguities in Bulgarian are occasional, but many are systematic, e.g., neuter singular adjectives have the same forms as adverbs. Overall, most ambiguities are local, and thus arguably resolvable using n-grams, e.g., compare hubavo dete ('beautiful child'), where hubavo is a neuter adjective, and Peya hubavo. ('I sing beautifully.'), where it is an adverb of manner. Other ambiguities, however, are non-local and may require discourse-level analysis, e.g., Vidyah go. can mean 'I saw him.', where go is a masculine pronoun, or 'I saw it.', where it is a neuter pronoun. Finally, there are ambiguities that are very hard or even impossible[1] to resolve, e.g., Deteto vleze veselo. can mean both 'The child came in happy.' (veselo is an adjective) and 'The child came in happily.' (it is an adverb); however, the latter is much more likely.

[1] The problem also exists for English, e.g., the annotators of the Penn Treebank were allowed to use tag combinations for inherently ambiguous cases: JJ|NN (adjective or noun as prenominal modifier), JJ|VBG (adjective or gerund/present participle), JJ|VBN (adjective or past participle), NN|VBG (noun or gerund), and RB|RP (adverb or particle).

In many cases, strong domain preferences exist about how various systematic ambiguities should be resolved. We made a study for the newswire domain, analyzing a corpus of 546,029 words, and we found that ambiguity type 2 (lexeme-lexeme) prevailed for functional parts-of-speech, while the other types were more frequent for inflecting parts-of-speech. Below we show the most frequent types of morpho-syntactic ambiguities and their frequency in our corpus:

- na: preposition ('of') vs. emphatic particle, with a ratio of 28,554 to 38;
- da: auxiliary particle ('to') vs. affirmative particle, with a ratio of 12,035 to 543;
- e: 3rd person present auxiliary verb ('to be') vs. particle ('well') vs. interjection ('wow'), with a ratio of 9,136 to 21 to 5;
- singular masculine noun with a short definite article vs. count form of a masculine noun, with a ratio of 6,437 to 1,592;
- adverb vs. neuter singular adjective, with a ratio of 3,858 to 1,753.

Overall, the following factors should be taken into account when modeling Bulgarian morpho-syntax: (1) locality vs. non-locality of grammatical features, (2) interdependence of grammatical features, and (3) domain-specific preferences.

4 Method

We used the guided learning framework described in (Shen et al., 2007), which has yielded state-of-the-art results for English and has been successfully applied to other morphologically complex languages such as Icelandic (Dredze and Wallenberg, 2008); we found it quite suitable for Bulgarian as well. We used the feature set defined in (Shen et al., 2007), which includes the following:

1. The feature set of Ratnaparkhi (1996), including prefix, suffix and lexical features, as well as some bigram and trigram context features;

2. Feature templates as in (Ratnaparkhi, 1996), which have been shown helpful in bidirectional search;

3. More bigram and trigram features and bi-lexical features as in (Shen et al., 2007).

Note that we allowed prefixes and suffixes of length up to 9, as in (Toutanova et al., 2003) and (Tsuruoka and Tsujii, 2005).
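To make the templates above concrete, here is a small sketch (our own illustrative reimplementation, not the authors' code; the feature-string format and helper names are assumptions) of Ratnaparkhi-style affix and context features with affixes up to length 9:

```python
# A small sketch (our own illustrative reimplementation, not the authors'
# code) of Ratnaparkhi-style affix and context features; the feature-string
# format is an assumption.
def affix_features(word, max_len=9):
    """Prefix and suffix features of length up to 9."""
    feats = []
    for k in range(1, min(len(word), max_len) + 1):
        feats.append(f"prefix={word[:k]}")
        feats.append(f"suffix={word[-k:]}")
    return feats

def context_features(words, i):
    """Lexical, bigram, and trigram context features for position i."""
    pad = ["<s>", "<s>"] + list(words) + ["</s>", "</s>"]
    j = i + 2
    return [
        f"w0={pad[j]}",
        f"w-1={pad[j-1]}", f"w+1={pad[j+1]}",
        f"w-2,w-1={pad[j-2]},{pad[j-1]}",              # bigram context
        f"w+1,w+2={pad[j+1]},{pad[j+2]}",
        f"w-1,w0,w+1={pad[j-1]},{pad[j]},{pad[j+1]}",  # trigram context
    ]

feats = affix_features("divana") + context_features(["dva", "divana"], 1)
assert "prefix=div" in feats and "suffix=ana" in feats and "w-1=dva" in feats
```

In a real tagger these feature strings would be fed to the learner together with the tag hypotheses explored by the bidirectional search.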
We further extended the set of features with the tags proposed for the current word token by a morphological lexicon, which maps words to possible tags; it is exhaustive, i.e., the correct tag is always among the suggested ones for each token. We also used 70 linguistically-motivated, high-precision rules in order to further reduce the number of possible tags suggested by the lexicon. The rules are similar to those proposed by Hinrichs and Trushkina (2004) for German; we implemented them as constraints in the CLaRK system (Simov et al., 2003).

Here is an example of a rule: If a wordform is ambiguous between a masculine count noun (Ncmt) and a singular short definite masculine noun (Ncmsh), the Ncmt tag should be chosen if the previous token is a numeral or a number.

The 70 rules were developed by linguists based on observations over the training dataset only. They target primarily the most frequent cases of ambiguity, and to a lesser extent some infrequent but very problematic cases. Some rules operate over classes of words, while others refer to particular wordforms. The rules were designed to be 100% accurate on our training dataset; our experiments show that they are also 100% accurate on the test and on the development datasets.

Note that some of the rules are dependent on others, and thus the order of their cascaded application is important. For example, one wordform is ambiguous between an accusative feminine singular short form of a personal pronoun ('her') and an interjection ('wow'). To handle this properly, the rule for the interjection, which targets sentence-initial positions followed by a comma, needs to be executed first. The rule for the personal pronoun is only applied afterwards.

The rules are quite efficient at reducing the POS ambiguity. On the test dataset, before the rule application, 34.2% of the tokens (excluding punctuation) had more than one tag in our morphological lexicon. This number is reduced to 18.5% after the cascaded application of the 70 linguistic rules. Table 1 illustrates the effect of the rules on a small sentence fragment. In this example, the rules have left only one tag (the correct one) for three of the ambiguous words. Since the rules in essence decrease the average number of tags per token, we calculated that the lexicon suggests 1.6 tags per token on average, and after the application of the rules this number decreases to 1.44 per token.

  Word         Tags
  Toy          Ppe-os3m
  obache       Cc; Dd
  nyama        Afsi; Vnitf-o3s; Vnitf-r3s; Vpitf-o2s; Vpitf-o3s; Vpitf-r3s
  vazmozhnost  Ncfsi
  da           Ta; Tx
  sledi        Ncfpi; Vpitf-o2s; Vpitf-o3s; Vpitf-r3s; Vpitz2s
  ...          ...

Table 1: Sample fragment (wordforms transliterated) showing the possible tags suggested by the lexicon. In the original typeset table, the tags further filtered by the rules were shown in italics, and the correct tag in bold.

5 Datasets

5.1 BulTreeBank

We used the latest version of the BulTreeBank (Simov and Osenova, 2004), which contains 20,556 sentences and 321,542 word tokens (four times fewer than the English Penn Treebank), annotated using a total of 680 unique morpho-syntactic tags. See (Simov et al., 2004) for a detailed description of the BulTreeBank tagset.

We split the data into training/development/test datasets as shown in Table 2. Note that only 552 of all 680 tag types were used in the training dataset, and the development and the test datasets combined contain a total of 128 new tag types that were not seen in the training dataset. Moreover, 32% of the word types in the development dataset and 31% of those in the testing dataset do not occur in the training dataset. Thus, data sparseness is an issue at two levels: word-level and tag-level.

  Dataset  Sentences  Tokens   Types   Tags
  Train    16,532     253,526  38,659  552
  Dev      2,007      32,995   9,635   425
  Test     2,017      35,021   9,627   435

Table 2: Statistics about our datasets.

5.2 Morphological Lexicon

In order to alleviate the data sparseness issues, we further used a large morphological lexicon for Bulgarian, which is an extended version of the dictionary described in (Popov et al., 1998) and (Popov et al., 2003). It contains over 1.5M inflected wordforms (for 110K lemmata and 40K proper names), each mapped to a set of possible morpho-syntactic tags.
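As an illustration of how such a lexicon and the cascaded contextual rules interact, here is a minimal sketch; the rule encoding, the toy lexicon, and the numeral list are our own assumptions (the actual rules were implemented as constraints in the CLaRK system):

```python
# Illustrative sketch of the cascaded, order-sensitive rule application.
# The Ncmt/Ncmsh rule is quoted from the text; the rule encoding, the toy
# lexicon, and the numeral list are our own assumptions.
NUMERALS = {"dva", "tri"}  # hypothetical list of numeral wordforms

def ncmt_rule(prev_token, tags):
    """Keep Ncmt (drop Ncmsh) when the previous token is a numeral or a number."""
    if {"Ncmt", "Ncmsh"} <= set(tags) and (prev_token.isdigit() or prev_token in NUMERALS):
        return [t for t in tags if t != "Ncmsh"]
    return tags

def apply_cascade(tokens, lexicon, rules):
    """Apply the rules in a fixed order; order matters because rules interact."""
    tag_sets = [list(lexicon[w]) for w in tokens]
    for rule in rules:                       # cascaded application
        for i in range(len(tokens)):
            prev = tokens[i - 1] if i > 0 else "<s>"
            tag_sets[i] = rule(prev, tag_sets[i])
    return tag_sets

lexicon = {"dva": ["Mc-pi"], "divana": ["Ncmt", "Ncmsh"]}  # toy lexicon
tag_sets = apply_cascade(["dva", "divana"], lexicon, [ncmt_rule])
assert tag_sets == [["Mc-pi"], ["Ncmt"]]
```

Because the rules are applied as a cascade, reordering the list passed to `apply_cascade` can change the result, which mirrors the ordering dependency discussed above.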
6 Experiments and Evaluation

State-of-the-art POS taggers for English typically build a lexicon containing all tags a word type has taken in the training dataset; this lexicon is then used to limit the set of possible tags that an input token can be assigned, i.e., it imposes a hard constraint on the possibilities explored by the POS tagger. For example, if can has only been tagged as a verb and as a noun in the training dataset, it will only be assigned those two tags at test time; other tags such as adjective, adverb and pronoun will not be considered. Out-of-vocabulary words, i.e., those that were not seen in the training dataset, are constrained as well, e.g., to a small set of frequent open-class tags.

In our experiments, we used a morphological lexicon that is much larger than what could be built from the training corpus only: building a lexicon from the training corpus alone is of limited utility since one can hardly expect to see in the training corpus all 52 synthetic forms a verb can possibly have. Moreover, we did not use the tags listed in the lexicon as hard constraints (except in one of our baselines); instead, we experimented with a different, non-restrictive approach: we used the lexicon's predictions as features or soft constraints, i.e., as suggestions only, thus allowing each token to take any possible tag. Note that for both known and out-of-vocabulary words we used all 680 tags rather than the 552 tags observed in the training dataset; we could afford to explore this huge search space thanks to the efficiency of the guided learning framework. Allowing all 680 tags during training helped the model by exposing it to a larger set of negative examples.

We combined these lexicon features with standard features extracted from the training corpus. We further experimented with the 70 contextual linguistic rules, using them (a) as soft and (b) as hard constraints. Finally, we set four baselines: three that do not use the lexicon and one that does.

6.1 Baselines

First, we experimented with the most-frequent-tag baseline, which is standard for POS tagging. This baseline ignores context altogether and assigns each word type the POS tag it was most frequently seen with in the training dataset; ties are broken randomly. We coped with word types not seen in the training dataset using three simple strategies: (a) we considered them all wrong, (b) we assigned them Ncmsi, which is the most frequent open-class tag in the training dataset, or (c) we used a very simple guesser, which assigned Ncfsi, Ncnsi, Ncfsi, and Ncmsf if the target word ended in -a, -o, -i, and -t, respectively; otherwise, it assigned Ncmsi. The results are shown in lines 1-3 of Table 3: we can see that the token-level accuracy ranges between 78% and 80% for (a)-(c), which is relatively high, given that we use a large inventory of 680 morpho-syntactic tags.

                                   Accuracy (%)
  # Baselines                      (token-level)
  1 MFT + unknowns are wrong       78.10
  2 MFT + unknowns are Ncmsi       78.52
  3 MFT + guesser for unknowns     79.49
  4 MFT + lexicon tag-classes      94.40

Table 3: Most-frequent-tag (MFT) baselines.

We further tried a baseline that uses the above-described morphological lexicon, in addition to the training dataset. We first built two frequency lists, containing respectively (1) the most frequent tag in the training dataset for each word type, as before, and (2) the most frequent tag in the training dataset for each class of tags that can be assigned to some word type, according to the lexicon. For example, the most frequent tag for politika is Ncfsi, and the most frequent tag for the tag-class {Ncmt;Ncmsi} is Ncmt.

Given a target word type, this new baseline first tries to assign it the most frequent tag from the first list. If this is not possible, which happens (i) in case of ties or (ii) when the word type was not seen in training, it extracts the tag-class from the lexicon and consults the second list. If there is a single most frequent tag in the corpus for this tag-class, it is assigned; otherwise a random tag from this tag-class is selected.

Line 4 of Table 3 shows that this latter baseline achieves a very high accuracy of 94.40%. Note, however, that this is over-optimistic: the lexicon contains a tag-class for each word type in our testing dataset, i.e., while there can be word types not seen in the training dataset, there are no word types that are not listed in the lexicon. Thus, this high accuracy is probably due to a large extent to the scale and quality of our morphological lexicon, and it might not be as strong with smaller lexicons; we plan to investigate this in future work.
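The back-off logic of the lexicon-based baseline (line 4 of Table 3) can be sketched as follows; the helper names and the toy lexicon are hypothetical, and ties are resolved deterministically here rather than randomly as in the paper:

```python
from collections import Counter

# Sketch of the lexicon-backed most-frequent-tag baseline (line 4 of
# Table 3). Names and toy data are our own; the paper picks a random tag
# on a tie, while this sketch falls back deterministically.
def most_frequent_unique(counter):
    """Return the single most frequent tag, or None on a tie or empty counter."""
    top = counter.most_common(2)
    if not top or (len(top) == 2 and top[0][1] == top[1][1]):
        return None
    return top[0][0]

def build_lists(training_pairs, lexicon):
    by_word, by_class = {}, {}
    for word, tag in training_pairs:
        by_word.setdefault(word, Counter())[tag] += 1
        by_class.setdefault(tuple(sorted(lexicon[word])), Counter())[tag] += 1
    return by_word, by_class

def predict(word, by_word, by_class, lexicon):
    # First list: most frequent tag for the word type, if unambiguous.
    if word in by_word:
        tag = most_frequent_unique(by_word[word])
        if tag is not None:
            return tag
    # Back off to the tag-class that the lexicon assigns to the word.
    cls = tuple(sorted(lexicon[word]))
    tag = most_frequent_unique(by_class.get(cls, Counter()))
    return tag if tag is not None else lexicon[word][0]

lexicon = {"politika": ["Ncfsi", "Ncmsh"],   # toy tag-classes
           "divana": ["Ncmt", "Ncmsh"],
           "stola": ["Ncmt", "Ncmsh"]}       # unseen in training
by_word, by_class = build_lists(
    [("politika", "Ncfsi"), ("politika", "Ncfsi"), ("divana", "Ncmt")], lexicon)
assert predict("politika", by_word, by_class, lexicon) == "Ncfsi"
assert predict("stola", by_word, by_class, lexicon) == "Ncmt"  # tag-class back-off
```

The second assertion shows why the baseline is so strong: even an unseen word inherits the behavior of all training words that share its lexicon tag-class.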
6.2 Lexicon Tags as Soft Constraints

We experimented with three types of features:

1. Word-related features only;

2. Word-related features + the tags suggested by the lexicon;

3. Word-related features + the tags suggested by the lexicon, but then further filtered using the 70 contextual linguistic rules.

Table 4 shows the sentence-level and the token-level accuracy on the test dataset for the three kinds of features: shown on lines 1, 3 and 4, respectively. We can see that using the tags proposed by the lexicon as features (lines 3 and 4) has a major positive impact, yielding up to 49% error reduction at the token-level and up to 37% at the sentence-level, as compared to using word-related features alone (line 1).

Interestingly, filtering the tags proposed by the lexicon using the 70 contextual linguistic rules yields a minor decrease in accuracy both at the token-level and at the sentence-level (compare line 4 to line 3). This is surprising since the linguistic rules are extremely reliable: they were designed to be 100% accurate on the training dataset, and we found them experimentally to be 100% correct on the development and on the testing datasets as well.

One possible explanation is that by limiting the set of available tags for a given token at training time, we prevent the model from observing some potentially useful negative examples. We tested this hypothesis by using the unfiltered lexicon predictions at training time but then making use of the filtered ones at testing time; the results are shown on line 5. We can observe a small increase in accuracy compared to line 4: from 97.80% to 97.84% at the token-level, and from 70.30% to 70.40% at the sentence-level. Although these differences are tiny, they suggest that having more negative examples at training time is helpful.

We can conclude that using the lexicon as a source of soft constraints has a major positive impact, e.g., because it provides access to important external knowledge that is complementary to what can be learned from the training corpus alone; the improvements when using linguistic rules as soft constraints are more limited.

6.3 Linguistic Rules as Hard Constraints

Next, we experimented with using the suggestions of the linguistic rules as hard constraints. Table 4 shows that this is a very good idea. Comparing line 1 to line 2, which do not use the morphological lexicon, we can see very significant improvements: from 95.72% to 97.20% at the token-level and from 52.95% to 64.50% at the sentence-level. The improvements are smaller but still consistent when the morphological lexicon is used: comparing lines 3 and 4 to lines 6 and 7, respectively, we see an improvement from 97.83% to 97.91% and from 97.80% to 97.93% at the token-level, and about 1% absolute at the sentence-level.

6.4 Increasing the Beam Size

Finally, we increased the beam size of guided learning from 1 to 3, as in (Shen et al., 2007). Comparing line 7 to line 8 in Table 4, we can see that this yields a further token-level improvement: from 97.93% to 97.98%.

7 Discussion

Table 5 compares our results to previously reported evaluation results for Bulgarian. The first four lines show the token-level accuracy for standard POS tagging tools trained and evaluated on the BulTreeBank[2]: TreeTagger (Schmid, 1994), which uses decision trees, TnT (Brants, 2000), which uses a hidden Markov model, SVMtool (Gimenez and Marquez, 2004), which is based on support vector machines, and ACOPOST (Schroder, 2002), implementing the memory-based model of Daelemans et al. (1996). The following lines report the token-level accuracy reported in previous work, as compared to our own experiments using guided learning.

We can see that we outperform by a very large margin (92.53% vs. 97.98%, which represents 73% error reduction) the systems from the first four lines, which are directly comparable to our experiments: they are trained and evaluated on the BulTreeBank using the full inventory of 680 tags. We further achieved a statistically significant improvement (p < 0.0001; Pearson's chi-squared test (Plackett, 1983)) over the best previous result on 680 tags: from 94.65% to 97.98%, which represents 62.24% error reduction at the token-level.

[2] We used the pre-trained TreeTagger; for the rest, we report the accuracy given on the Webpage of the BulTreeBank: www.bultreebank.org/taggers/taggers.html
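The relative error reductions quoted above follow the standard definition (fraction of the baseline's errors eliminated), which can be checked directly; this is a minimal sketch and the function name is ours:

```python
# Minimal sketch verifying the relative error reductions quoted above;
# the function name is our own.
def error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline's errors eliminated by the new system."""
    return (new_acc - baseline_acc) / (100.0 - baseline_acc)

assert round(100 * error_reduction(92.53, 97.98)) == 73        # vs. TnT
assert round(100 * error_reduction(94.65, 97.98), 2) == 62.24  # vs. best previous 680-tag result
```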
  #  Lexicon      Linguistic rules (applied to filter):        Beam  Accuracy (%)
     (source of)  (a) the lexicon features  (b) the output tags size  Sentence-level  Token-level
  1  -            -                         -                   1     52.95           95.72
  2  -            -                         yes                 1     64.50           97.20
  3  features     -                         -                   1     70.40           97.83
  4  features     yes                       -                   1     70.30           97.80
  5  features     yes, for test only        -                   1     70.40           97.84
  6  features     -                         yes                 1     71.34           97.91
  7  features     yes                       yes                 1     71.69           97.93
  8  features     yes                       yes                 3     71.94           97.98

Table 4: Evaluation results on the test dataset. Line 1 shows the evaluation results when using features derived from the text corpus only; these features are used by all systems in the table. Line 2 further uses the contextual linguistic rules to limit the set of possible POS tags that can be predicted. Note that these rules (1) consult the lexicon, and (2) always predict a single POS tag. Line 3 uses the POS tags listed in the lexicon as features, i.e., as soft suggestions only. Line 4 is like line 3, but the list of feature-tags proposed by the lexicon is filtered by the contextual linguistic rules. Line 5 is like line 4, but the linguistic rules filtering is only applied at test time; it is not done on training. Lines 6 and 7 are similar to lines 3 and 4, respectively, but here the linguistic rules are further applied to limit the set of possible POS tags that can be predicted, i.e., the rules are used as hard constraints. Finally, line 8 is like line 7, but here the beam size is increased to 3.

Overall, we improved over almost all previously published results. Our accuracy is second only to the manual rules approach of Dojchinova and Mihov (2004). Note, however, that they used 40 tags only, i.e., their inventory is 17 times smaller than ours. Moreover, they have optimized their tagset specifically to achieve very high POS tagging accuracy by choosing not to attempt to resolve some inherently hard systematic ambiguities, e.g., they do not try to choose between second and third person past singular verbs, whose inflected forms are identical in Bulgarian and hard to distinguish when the subject is not present (Bulgarian is a pro-drop language).

In order to compare our results more closely to the smaller tagsets in Table 5, we evaluated our best model with respect to (a) the first letter of the tag only (which is part-of-speech only, no morphological information; 13 tags), e.g., Ncmsf becomes N, and (b) the first two letters of the tag (POS + limited morphological information; 49 tags), e.g., Ncmsf becomes Nc. This yielded 99.30% accuracy for (a) and 98.85% for (b). The latter improves over (Dojchinova and Mihov, 2004), while using a somewhat larger number of tags.

Our best token-level accuracy of 97.98% is comparable to, and even slightly better than, the state-of-the-art results for English: 97.33% when using Penn Treebank data only (Shen et al., 2007), and 97.50% for the Penn Treebank plus some additional unlabeled data (Søgaard, 2011). Of course, our results are only indirectly comparable to English.

Still, our performance is impressive because (1) our model is trained on 253,526 tokens only, while the standard training sections 0-18 of the Penn Treebank contain a total of 912,344 tokens, i.e., almost four times more, and (2) we predict 680 rather than just 48 tags as for the Penn Treebank, which is 14 times more.

Note, however, that (1) we used a large external morphological lexicon for Bulgarian, which yielded about 50% error reduction (without it, our accuracy was only 95.72%), and (2) our train/dev/test sentences are generally shorter, and thus arguably simpler for a POS tagger to analyze: we have 17.4 words per test sentence in the BulTreeBank vs. 23.7 in the Penn Treebank.

Our results also compare favorably to the state-of-the-art results for other morphologically complex languages that use large tagsets, e.g., 95.2% for Czech with 1,400+ tags (Hajic et al., 2001), 92.1% for Icelandic with 639 tags (Dredze and Wallenberg, 2008), and 97.6% for Arabic with 139 tags (Habash and Rambow, 2005).

8 Error Analysis

In this section, we present error analysis with respect to the impact of the POS tagger's performance on other processing steps in a natural language processing pipeline, such as lemmatization and syntactic dependency parsing.

First, we explore the most frequently confused pairs of tags for our best-performing POS tagging system; these are shown in Table 6.
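A confusion table of this kind can be derived from gold and predicted tag sequences as sketched below (toy data; illustrative only, not the authors' evaluation code):

```python
from collections import Counter

# Sketch of how a table of most frequently confused tag pairs can be
# produced from gold and predicted tag sequences (toy data; illustrative).
def confusion_pairs(gold, pred):
    """Count (gold, proposed) tag pairs over the mismatches only."""
    return Counter((g, p) for g, p in zip(gold, pred) if g != p)

gold = ["Ansi", "Dm", "Ansi", "Ncfsi"]
pred = ["Dm", "Ansi", "Dm", "Ncfsi"]
pairs = confusion_pairs(gold, pred)
assert pairs.most_common(1) == [(("Ansi", "Dm"), 2)]  # adverb/adjective confusion
```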
  Tool/Authors                  Method                             # Tags  Accuracy (token-level, %)
  *TreeTagger                   Decision Trees                     680     89.21
  *ACOPOST                      Memory-based Learning              680     89.91
  *SVMtool                      Support Vector Machines            680     92.22
  *TnT                          Hidden Markov Model                680     92.53
  (Georgiev et al., 2009)       Maximum Entropy                    680     90.34
  (Simov and Osenova, 2001)     Recurrent Neural Network           160     92.87
  (Georgiev et al., 2009)       Maximum Entropy                    95      94.43
  (Savkov et al., 2011)         SVM + Lexicon + Rules              680     94.65
  (Tanev and Mitkov, 2002)      Manual Rules                       303     95.00 (=P=R)
  (Simov and Osenova, 2001)     Recurrent Neural Network           15      95.17
  (Dojchinova and Mihov, 2004)  Transformation-based Learning      40      95.50
  (Dojchinova and Mihov, 2004)  Manual Rules + Lexicon             40      98.40
  This work                     Guided Learning                    680     95.72
                                Guided Learning + Lexicon          680     97.83
                                Guided Learning + Lexicon + Rules  680     97.98
                                Guided Learning + Lexicon + Rules  49      98.85
                                Guided Learning + Lexicon + Rules  13      99.30

Table 5: Comparison to previous work for Bulgarian. The first four lines report evaluation results for various standard POS tagging tools, which were retrained and evaluated on the BulTreeBank. The following lines report token-level accuracy for previously published work, as compared to our own experiments using guided learning.

We can see that most of the wrong tags share the same part-of-speech (indicated by the initial uppercase letter), such as V for verb, N for noun, etc. This means that most errors concern the morpho-syntactic features, e.g., personal vs. impersonal verb, definite vs. indefinite feminine noun, or singular vs. plural masculine adjective. At the same time, there are also cases where the error has to do with the part-of-speech label itself, e.g., adjective vs. adverb, or numeral vs. indefinite pronoun.

We want to use the above tagger to develop (1) a rule-based lemmatizer, using the morphological lexicon, e.g., as in (Plisson et al., 2004), and (2) a dependency parser like MaltParser (Nivre et al., 2007), trained on the dependency part of the BulTreeBank. We thus study the potential impact of wrong tags on the performance of these tools.

The lemmatizer relies on the lexicon and uses string transformation functions defined via two operations, remove and concatenate:

  if tag = Tag then
    {remove OldEnd; concatenate NewEnd}

where Tag is the tag of the wordform, OldEnd is the string that has to be removed from the end of the wordform, and NewEnd is the string that has to be concatenated to the end of the wordform in order to produce the lemma.

Here is an example of such a rule:

  if tag = Vpitf-o1s then
    {remove oh; concatenate a}

The application of the above rule to the past simple verb form chetoh ('I read') would remove oh, and then concatenate a. The result would be the correct lemma cheta ('to read').

Such rules are generated for each wordform in the morphological lexicon; the above functional representation allows for a compact representation in a finite state automaton. Similar rules are applied to unknown words, where the lemmatizer tries to guess the correct lemma.

Obviously, the applicability of each rule crucially depends on the output of the POS tagger. If the tagger suggests the correct tag, then the wordform will be lemmatized correctly. Note that, in some cases of wrongly assigned POS tags in a given context, we might still get the correct lemma. This is possible in the majority of the erroneous cases in which the part-of-speech has been assigned correctly, but the wrong grammatical alternative has been selected. In such cases, the error does not influence lemmatization.

In order to calculate the proportion of such cases, we divided each tag into two parts: (a) grammatical features that are common for all wordforms of a given lemma, and (b) features that are specific to the wordform.
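The remove/concatenate rules can be sketched as follows; the rule table here is a toy illustration (the real rules are generated from the morphological lexicon, one per wordform):

```python
# Sketch of the remove/concatenate lemmatization rules described above.
# The rule table is a toy illustration; real rules are generated from the
# morphological lexicon and compiled into a finite state automaton.
RULES = {
    # tag -> (OldEnd to remove, NewEnd to concatenate)
    "Vpitf-o1s": ("oh", "a"),  # e.g., chetoh ('I read') -> cheta ('to read')
}

def lemmatize(wordform, tag):
    rule = RULES.get(tag)
    if rule is None or not wordform.endswith(rule[0]):
        return wordform  # no applicable rule: leave the form unchanged
    old_end, new_end = rule
    return wordform[: len(wordform) - len(old_end)] + new_end

assert lemmatize("chetoh", "Vpitf-o1s") == "cheta"
```

As the text notes, a wrong tag selects a wrong rule, so lemmatization accuracy is bounded by the tagger's accuracy on the lemma-determining part of the tag.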
  Freq.  Gold Tag     Proposed Tag
  43     Ansi         Dm
  23     Vpitf-r3s    Vnitf-r3s
  16     Npmsh        Npmsi
  14     Vpiif-r3s    Vniif-r3s
  13     Npfsd        Npfsi
  12     Dm           Ansi
  12     Vpitcam-smi  Vpitcao-smi
  12     Vpptf-r3p    Vpitf-r3p
  11     Vpptf-r3s    Vpptf-o3s
  10     Mcmsi        Pfe-os-mi
  10     Ppetas3n     Ppetas3m
  10     Ppetds3f     Psot3f
  9      Npnsi        Npnsd
  9      Vpptf-o3s    Vpptf-r3s
  8      Dm           A-pi
  8      Ppxts        Ppxtd
  7      Mcfsi        Pfe-os-fi
  7      Npfsi        Npfsd
  7      Ppetas3m     Ppetas3n
  7      Vnitf-r3s    Vpitf-r3s
  7      Vpitcam-p-i  Vpitcao-p-i

Table 6: Most frequently confused pairs of tags.

The part-of-speech features are always determined by the lemma. For example, Bulgarian verbs have the lemma features aspect and transitivity. If they are correct, then the lemma is also predicted correctly, regardless of whether the grammatical features are correct or wrong. For example, if the verb participle form (aorist or imperfect) has its correct aspect and transitivity, then it is also lemmatized correctly, regardless of whether the imperfect or aorist features were guessed correctly; similarly for other error types. We evaluated these cases for the 711 errors in our experiment, and we found that 206 of them (about 29%) were non-problematic for lemmatization.

For the MaltParser, we encode most of the grammatical features of the wordforms as specific features for the parser. Hence, it is much harder to evaluate the problematic cases due to the tagger. Still, we were able to make an estimation for some cases. Our strategy was to ignore the grammatical features that do not always contribute to the syntactic behavior of the wordforms. Such grammatical features for the verbs are aspect and tense. Thus, proposing perfective instead of imperfective for a verb, or present instead of past tense, would not cause problems for the MaltParser. Among our 711 errors, 190 cases (or about 27%) were not problematic for parsing.

Finally, we should note that there are two special classes of tokens for which it is generally hard to predict some of the grammatical features: (1) abbreviations and (2) numerals written with digits. In sentences, they participate in agreement relations only if they are pronounced as whole phrases; unfortunately, it is very hard for the tagger to guess such relations since it does not have at its disposal enough features, such as the inflection of the numeral form, that might help detect and use the agreement pattern.

9 Conclusion and Future Work

We have presented experiments with part-of-speech tagging for Bulgarian, a Slavic language with rich inflectional and derivational morphology. Unlike most previous work for this language, which has limited the number of possible tags, we used a very rich tagset of 680 morpho-syntactic tags as defined in the BulTreeBank. By combining a large morphological lexicon with prior linguistic knowledge and guided learning from a POS-annotated corpus, we achieved an accuracy of 97.98%, which is a significant improvement over the state-of-the-art for Bulgarian. Our token-level accuracy is also comparable to the best results reported for English.

In future work, we want to experiment with a richer set of features, e.g., derived from unlabeled data (Søgaard, 2011) or from the Web (Umansky-Pesin et al., 2010; Bansal and Klein, 2011). We further plan to explore ways to decompose the complex Bulgarian morpho-syntactic tags, e.g., as proposed in (Simov and Osenova, 2001) and (Smith et al., 2005). Modeling long-distance syntactic dependencies (Dredze and Wallenberg, 2008) is another promising direction; we believe this can be implemented efficiently using posterior regularization (Graca et al., 2009) or expectation constraints (Bellare et al., 2009).

Acknowledgments

We would like to thank the anonymous reviewers for their useful comments, which have helped us improve the paper.

The research presented above has been partially supported by the EU FP7 project 231720 EuroMatrixPlus, and by the SmartBook project, funded by the Bulgarian National Science Fund under grant D002-111/15.12.2008.
References

Mohit Bansal and Dan Klein. 2011. Web-scale features for full-scale parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT '11, pages 693-702, Portland, Oregon, USA.

Kedar Bellare, Gregory Druck, and Andrew McCallum. 2009. Alternating projections for learning with expectation constraints. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, UAI '09, pages 43-50, Montreal, Quebec, Canada.

Thorsten Brants. 2000. TnT - a statistical part-of-speech tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference, ANLP '00, pages 224-231, Seattle, Washington, USA.

Eric Brill. 1995. Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Computational Linguistics, 21:543-565.

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics, Main Volume, ACL '04, pages 111-118, Barcelona, Spain.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '02, pages 1-8, Philadelphia, PA, USA.

Walter Daelemans, Jakub Zavrel, Peter Berck, and Steven Gillis. 1996. MBT: A memory-based part of speech tagger generator. In Eva Ejerhed and Ido Dagan, editors, Fourth Workshop on Very Large Corpora, pages 14-27, Copenhagen, Denmark.

Veselka Dojchinova and Stoyan Mihov. 2004. High performance part-of-speech tagging of Bulgarian. In Christoph Bussler and Dieter Fensel, editors, AIMSA, volume 3192 of Lecture Notes in Computer Science, pages 246-255. Springer.

Mark Dredze and Joel Wallenberg. 2008. Icelandic data driven part of speech tagging. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Short Papers, ACL '08, pages 33-36, Columbus, Ohio, USA.

Jesus Gimenez and Lluis Marquez. 2004. SVMTool: A general POS tagger generator based on support vector machines. In Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC '04, Lisbon, Portugal.

Joao Graca, Kuzman Ganchev, Ben Taskar, and Fernando Pereira. 2009. Posterior vs parameter sparsity in latent variable models. In Yoshua Bengio, Dale Schuurmans, John D. Lafferty, Christopher K. I. Williams, and Aron Culotta, editors, Advances in Neural Information Processing Systems 22, NIPS '09, pages 664-672. Curran Associates, Inc., Vancouver, British Columbia, Canada.

Nizar Habash and Owen Rambow. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, ACL '05, pages 573-580, Ann Arbor, Michigan.

Jan Hajic, Pavel Krbec, Pavel Kveton, Karel Oliva, and Vladimir Petkevic. 2001. Serial combination of rules and statistics: A case study in Czech tagging. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, ACL '01, pages 268-275, Toulouse, France.

Jan Hajic. 1998. Building a syntactically annotated corpus: The Prague Dependency Treebank. In Eva Hajicova, editor, Issues of Valency and Meaning. Studies in Honor of Jarmila Panevova, pages 12-19. Karolinum, Charles University Press, Prague.

Erhard W. Hinrichs and Julia S. Trushkina. 2004. Forging agreement: Morphological disambiguation of noun phrases. Research on Language & Computation, 2:621-648.

Stig Johansson, Eric Atwell, Roger Garside, and Geoffrey Leech. 1986. The Tagged LOB Corpus: Users' Manual. ICAME, The Norwegian Computing Centre for the Humanities, Bergen University, Norway.

Hristo Krushkov. 1997. Modelling and building machine dictionaries and morphological processors (in Bulgarian). Ph.D. thesis, University of Plovdiv, Faculty of Mathematics and Informatics, Plovdiv, Bulgaria.

Henry Kucera and Winthrop Nelson Francis. 1967. Computational Analysis of Present-Day American English. Brown University Press, Providence, RI.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, ICML '01, pages 282-289, San Francisco, CA, USA.

Mohamed Maamouri, Ann Bies, Hubert Jin, and Tim Buckwalter. 2003. Arabic Treebank: Part 1 v 2.0. LDC2003T06.

Georgi Georgiev, Preslav Nakov, Petya Osenova, and Kiril Simov. 2009. Cross-lingual adaptation as a baseline: adapting maximum entropy models to Bulgarian. In Proceedings of the RANLP'09 Workshop on Adaptation of Language Resources and
Technology to New Domains, AdaptLRTtoND 09, Mitchell P. Marcus, Mary Ann Marcinkiewicz, and
pages 3538, Borovets, Bulgaria. Beatrice Santorini. 1993. Building a large anno-
Jesus Gimenez and Llus Marquez. 2004. SVMTool: tated corpus of English: the Penn Treebank. Com-
A general POS tagger generator based on support put. Linguist., 19:313330.
501
Joakim Nivre, Johan Hall, Jens Nilsson, Atanas garian texts. Technical Report BTB-TR04, Bulgar-
Chanev, Gulsen Eryigit, Sandra Kubler, Svetoslav ian Academy of Sciences.
Marinov, and Erwin Marsi. 2007. MaltParser: Kiril Ivanov Simov, Alexander Simov, Milen
A language-independent system for data-driven de- Kouylekov, Krasimira Ivanova, Ilko Grigorov, and
pendency parsing. Natural Language Engineering, Hristo Ganev. 2003. Development of corpora
13(2):95135. within the CLaRK system: The BulTreeBank
Jorgen Pind, Fridrik Magnusson, and Stefan Briem. project experience. In Proceedings of the 10th con-
1991. The Icelandic frequency dictionary. Techni- ference of the European chapter of the Association
cal report, The Institute of Lexicography, University for Computational Linguistics, EACL 03, pages
of Iceland, Reykjavik, Iceland. 243246, Budapest, Hungary.
Robin L. Plackett. 1983. Karl Pearson and the Chi- Kiril Simov, Petya Osenova, and Milena Slavcheva.
Squared Test. International Statistical Review / Re- 2004. BTB-TR03: BulTreeBank morphosyntac-
vue Internationale de Statistique, 51(1):5972. tic tagset. Technical Report BTB-TR03, Bulgarian
Joel Plisson, Nada Lavrac, and Dunja Mladenic. 2004. Academy of Sciences.
A rule based approach to word lemmatization. In Noah A. Smith, David A. Smith, and Roy W. Tromble.
Proceedings of the 7th International Multiconfer- 2005. Context-based morphological disambigua-
ence: Information Society, IS 2004, pages 8386, tion with random fields. In Proceedings of Hu-
Ljubljana, Slovenia. man Language Technology Conference and Confer-
Dimitar Popov, Kiril Simov, and Svetlomira Vidinska. ence on Empirical Methods in Natural Language
1998. Dictionary of Writing, Pronunciation and Processing, pages 475482, Vancouver, British
Punctuation of Bulgarian Language (in Bulgarian). Columbia, Canada.
Atlantis KL, Sofia, Bulgaria. Anders Sgaard. 2011. Semi-supervised condensed
nearest neighbor for part-of-speech tagging. In Pro-
Dimityr Popov, Kiril Simov, Svetlomira Vidinska, and
ceedings of the 49th Annual Meeting of the Associa-
Petya Osenova. 2003. Spelling Dictionary of Bul-
tion for Computational Linguistics, ACL-HLT 10,
garian. Nauka i izkustvo, Sofia, Bulgaria.
pages 4852, Portland, Oregon, USA.
Adwait Ratnaparkhi. 1996. A maximum entropy
Hristo Tanev and Ruslan Mitkov. 2002. Shallow
model for part-of-speech tagging. In Eva Ejerhed
language processing architecture for Bulgarian. In
and Ido Dagan, editors, Fourth Workshop on Very
Proceedings of the 19th International Conference
Large Corpora, pages 133142, Copenhagen, Den-
on Computational Linguistics, COLING 02, pages
mark.
17, Taipei, Taiwan.
Aleksandar Savkov, Laska Laskova, Petya Osenova, Kristina Toutanova, Dan Klein, Christopher D. Man-
Kiril Simov, and Stanislava Kancheva. 2011. ning, and Yoram Singer. 2003. Feature-rich
A web-based morphological tagger for Bulgarian. part-of-speech tagging with a cyclic dependency
In Daniela Majchrakova and Radovan Garabk, network. In Proceedings of the Conference of
editors, Slovko 2011. Sixth International Confer- the North American Chapter of the Association
ence. Natural Language Processing, Multilingual- for Computational Linguistics, NAACL 03, pages
ity, pages 126137, Modra/Bratislava, Slovakia. 173180, Edmonton, Canada.
Helmut Schmid. 1994. Probabilistic part-of-speech Yoshimasa Tsuruoka and Junichi Tsujii. 2005. Bidi-
tagging using decision trees. In International Con- rectional inference with the easiest-first strategy
ference on New Methods in Language Processing, for tagging sequence data. In Proceedings of the
pages 4449, Manchester, UK. Conference on Human Language Technology and
Ingo Schroder. 2002. A case study in part-of-speech- Empirical Methods in Natural Language Process-
tagging using the ICOPOST toolkit. Technical Re- ing, HLT-EMNLP 05, pages 467474, Vancouver,
port FBI-HH-M-314/02, Department of Computer British Columbia, Canada.
Science, University of Hamburg. Yoshimasa Tsuruoka, Yusuke Miyao, and Junichi
Libin Shen, Giorgio Satta, and Aravind Joshi. 2007. Kazama. 2011. Learning with lookahead: Can
Guided learning for bidirectional sequence classi- history-based models rival globally optimized mod-
fication. In Proceedings of the 45th Annual Meet- els? In Proceedings of the 49th Annual Meeting of
ing of the Association of Computational Linguistics, the Association for Computational Linguistics: Hu-
ACL 07, pages 760767, Prague, Czech Republic. man Language Technologies, ACL-HLT 10, pages
Kiril Simov and Petya Osenova. 2001. A hybrid 238246, Portland, Oregon, USA.
system for morphosyntactic disambiguation in Bul- Shulamit Umansky-Pesin, Roi Reichart, and Ari Rap-
garian. In Proceedings of the EuroConference on poport. 2010. A multi-domain web-based algo-
Recent Advances in Natural Language Processing, rithm for POS tagging of unknown words. In Pro-
RANLP 01, pages 57, Tzigov chark, Bulgaria. ceedings of the 23rd International Conference on
Kiril Simov and Petya Osenova. 2004. BTB-TR04: Computational Linguistics: Posters, COLING 10,
BulTreeBank morphosyntactic annotation of Bul- pages 12741282, Beijing, China.
502
Instance-Driven Attachment of Semantic Annotations
over Conceptual Hierarchies
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 503-513, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
[Figure 1 diagram: annotations (composed-by, lives-in, instrument-played, sung-by) over a portion of a conceptual hierarchy: People; Composers, Musicians, Actors; Composers by genre, Cellists, Singers; Baroque Composers, Jazz Composers.]

Figure 1: Hierarchical Semantic Annotations: The attachment of semantic annotations (e.g., composed-by) into a conceptual hierarchy, a portion of which is shown in the diagram, requires the identification of the correct concept at the correct level of generality (e.g., Composers rather than Jazz Composers or People, for the right argument of composed-by).

...ucts, as the right argument of the annotation composed-by), but actually identify the concepts at the correct level of generality/specificity (e.g., Composers rather than Artists or Jazz Composers) in the underlying conceptual hierarchy.

To ensure portability to new, previously unseen annotations, the proposed method avoids encoding features specific to a particular domain or annotation. In particular, the use of annotation labels (composed-by) as lexical features might be tempting, but would anchor the annotation model to that particular annotation. Instead, the method relies only on features that generalize across annotations. Over a gold standard of semantic annotations and concepts that best capture their arguments, the method substantially outperforms three baseline methods. On average, the method computes concepts that are less than one step in the hierarchy away from the corresponding gold-standard concepts of the various annotations.

2 Hierarchical Semantic Annotations

2.1 Task Description

Data Sources: The computation of hierarchical semantic annotations relies on the following data sources:
• a target annotation r (e.g., acted-in) that takes M arguments;
• N annotations I = {<i_j^1, ..., i_j^M>}, j = 1..N, of r at instance level, e.g., {<leonardo dicaprio, inception>, <milla jovovich, fifth element>} (in this example, M = 2);
• mappings {i → c} from instances to concepts to which they belong, e.g., milla jovovich → American Actors, milla jovovich → People from Kiev, milla jovovich → Models;
• mappings {c_s → c_g} from more specific concepts to more general concepts, as encoded in a hierarchy H, e.g., American Actors → Actors, People from Kiev → People from Ukraine, Actors → Entertainers.

Thus, the main inputs are the conceptual hierarchy H and the instance-level annotations I. The hierarchy contains instance-to-concept mappings, as well as specific-to-general concept mappings. Via transitivity, instances (milla jovovich) and concepts (American Actors) may be immediate children of more general concepts (Actors), or transitive descendants of more general concepts (Entertainers). The hierarchy is not required to be a tree; in particular, a concept may have multiple parent concepts. The instance-level annotations may be created collaboratively by human contributors, or extracted automatically from Web documents or some other data source.

Goal: Given the data sources, the goal is to determine to which concept c in the hierarchy H the arguments of the target concept-level annotation r should be attached. While the left argument of acted-in could attach to American Actors, People from Kiev, Entertainers or People, it is best attached to the concept Actors. The goal is to select the concept c that most appropriately generalizes across the instances. Over the set I of instance-level annotations, selecting a method for this goal can be thought of as a minimization problem. The metric to be minimized is the sum of the distances between each predicted concept c and the correct concept c_gold, where the distance is the number of edges between c and c_gold in H.

Intuitions and Challenges: Given instances such as milla jovovich that instantiate an argument of an annotation like acted-in, the conceptual hierarchy can be used to propagate the annotation upwards, from instances to their concepts, then in turn further upwards to more general concepts. The best concept would be one of the many candidate concepts reached during propagation. Intuitively, when compared to other candidate concepts, a higher proportion of the descendant instances of the best concept should instantiate (or match) the annotation. At the same time, relative to other candidate concepts, the best concept should have more descendant instances.

While the intuitions seem clear, their inclusion in a working method faces a series of practical challenges. First, the data sources may be noisy. One form of noise is missing or erroneous
instance-level annotations, which may artificially skew the distribution of matching instances towards a less than optimal region in the hierarchy. If the input annotations for acted-in are available almost exhaustively for all descendant instances of American Actors, and are available for only a few of the descendant instances of Belgian Actors, Italian Actors etc., then the distribution over the hierarchy may incorrectly suggest that the left argument of acted-in is American Actors rather than the more general Actors. In another example, if virtually all instances that instantiate the left argument of the annotation won-award are mapped to the concept Award Winning Actors, then it would be difficult to distinguish Award Winning Actors from the more general Actors or People, as best concept to be computed for the annotation. Another type of noise is missing or erroneous edges in the hierarchy, which could artificially direct propagation towards irrelevant regions of the hierarchy, or prevent propagation from even reaching relevant regions of the hierarchy. For example, if the hierarchy incorrectly maps Actors to Entertainment, then Entertainment and its ancestor concepts incorrectly become candidate concepts during propagation for the left argument of acted-in. Conversely, if missing edges caused Actors to not have any children in the hierarchy, then Actors would not even be reached and considered as a candidate concept during propagation.

Second, to apply evidence collected from some annotations to a new annotation, the evidence must generalize across annotations. However, collected evidence or statistics may vary widely across annotations. Observing that 90% of all descendant instances of the concept Actors match an annotation acted-in constitutes strong evidence that Actors is a good concept for acted-in. In contrast, observing that only 0.09% of all descendant instances of the concept Football Teams match won-super-bowl should not be as strong negative evidence as the percentage suggests.

[Figure 2 diagram: an overview of the approach, from a conceptual hierarchy (Entities; People, Locations; Actors, Singers; American Actors, English Actors), instance-level annotations (e.g., acted-in(leonardo dicaprio, inception)), instance-to-concept mappings (e.g., colin firth: English Actors) and query logs (e.g., "fifth element actors"), through raw statistics per candidate concept, training/testing data over concept pairs, classified data, and ranked data, to concept-level annotations (e.g., acted-in(Actors, ?)).]

2.2 Inferring Concept-Level Annotations

Determining Candidate Concepts: As illustrated in the left part of Figure 2, the first step towards inferring concept-level from instance-level annotations is to propagate the instances that instantiate a particular argument of the annotation upwards in the hierarchy. Starting from the left arguments of the annotation acted-in, namely leonardo dicaprio, milla jovovich etc., the propagation reaches their parent concepts American Actors, English Actors, then their parent and ancestor concepts Actors, People, Entities etc. The concepts reached during upward propagation become candidate concepts. In subsequent steps, the candidates are modeled, scored and ranked such that ideally the best concept is ranked at the top.
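The upward propagation step described above can be sketched as a breadth-first walk over the hierarchy. This is a minimal illustration, not the paper's implementation; the function name and the toy hierarchy fragment (based on the running example) are assumptions.

```python
from collections import deque

def candidate_concepts(instances, instance_to_concepts, concept_parents):
    """Propagate instances upward through the hierarchy; every concept
    reached directly or transitively becomes a candidate concept."""
    candidates = set()
    frontier = deque()
    # Seed the frontier with the concepts each instance maps to.
    for inst in instances:
        for concept in instance_to_concepts.get(inst, ()):
            if concept not in candidates:
                candidates.add(concept)
                frontier.append(concept)
    # Breadth-first walk over specific-to-general concept edges.
    # The hierarchy need not be a tree: a concept may have several parents.
    while frontier:
        concept = frontier.popleft()
        for parent in concept_parents.get(concept, ()):
            if parent not in candidates:
                candidates.add(parent)
                frontier.append(parent)
    return candidates

# Toy slice of the hierarchy from the running example.
instance_to_concepts = {
    "leonardo dicaprio": ["American Actors"],
    "milla jovovich": ["American Actors", "People from Kiev", "Models"],
}
concept_parents = {
    "American Actors": ["Actors"],
    "Actors": ["Entertainers", "People"],
    "Entertainers": ["People"],
    "People from Kiev": ["People from Ukraine"],
    "People from Ukraine": ["People"],
    "Models": ["People"],
    "People": ["Entities"],
}

cands = candidate_concepts(["leonardo dicaprio", "milla jovovich"],
                           instance_to_concepts, concept_parents)
```

The `candidates` set tracks visited concepts, so the walk terminates even when multiple parents lead to the same ancestor.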
Ranking Candidate Concepts: The identification of a ranking function is cast as a semi-supervised learning problem. Given the correct (gold) concept of an annotation, it would be tempting to employ binary classification directly, by marking the correct concept as a positive example, and all other candidate concepts as negative examples. Unfortunately, this would produce a highly imbalanced training set, with thousands of negative examples and, more importantly, with only one positive example. Another disadvantage of using binary classification directly is that it is difficult to capture the preference for concepts closer in the hierarchy to the correct concept, over concepts many edges away. Finally, the absolute values of the features that might be employed may be comparable within an annotation, but incomparable across annotations, which reduces the portability of the resulting model to new annotations.

To address the above issues, the proposed ranking function does not construct training examples from raw features collected for each individual candidate concept. Instead, it constructs training examples from pairwise comparisons of a candidate concept with another candidate concept. Concretely, a pairwise comparison is labeled as a positive example if the first concept is closer to the correct concept than the second, or as negative otherwise. The pairwise formulation has three immediate advantages. First, it accommodates the preference for concepts closer to the gold concept. Second, the pairwise formulation produces a larger, more balanced training set. Third, decisions of whether the first concept being compared is more relevant than the second are more likely to generalize across annotations than absolute decisions of whether (and how much) a particular concept is relevant for a given annotation.

Compiling Ranking Features: The features are grouped into four categories: (A) annotation co-occurrence features, (B) concept features, (C) argument co-occurrence features, and (D) combination features, as described below.

(A) Annotation Co-occurrence Features: The annotation co-occurrence features emphasize how well an annotation applies to a concept. These features include (1) MatchedInstances, the number of descendant instances of the concept that appear with the annotation, (2) InstancePercent, the percentage of matched instances in the concept, and (3) MoreThanThreeMatchingInstances and (4) MoreThanTenMatchingInstances, which indicate when the matching descendant instances might be noise.

Also in this category are features that relay information about the candidate concept's children concepts. These features include (1) MatchedChildren, the number of child concepts containing at least one matching instance, (2) ChildrenPercent, the percentage of child concepts with at least one matching instance, (3) AvgInstancePercentChildren, the average percentage of matching descendant instances of the child concepts, and (4) InstancePercentToInstancePercentChildren, the ratio between InstancePercent and AvgInstancePercentChildren. The last feature is meant to capture dramatic changes in percentages when moving in the hierarchy from child concepts to the candidate concept in question.

(B) Concept Features: Concept features approximate the generality of the concepts: (1) NumInstances, the number of descendant instances of the concept, (2) NumChildren, the number of child concepts, and (3) Depth, the distance to the concept's farthest descendant.

(C) Argument Co-occurrence Features: The argument co-occurrence features model the likelihood that an annotation applies to a concept by looking at co-occurrences with another argument of the same annotation. Intuitively, if a concept representing one argument has a high co-occurrence with an instance that is some other argument, a relationship more likely exists between members of the concept and the instance. For example, given acted-in, Actors is likely to have a higher co-occurrence with casablanca than People is. These features are generated from a set of Web queries. Therefore, the collected values are likely to be affected by different noise than that present in the original dataset. For every concept and instance pair from the arguments of a given annotation, the features record the number of times each of the tokens in the concept appears in the same query with each of the tokens in the instance, normalized to the respective number of tokens. The procedure generates, for each candidate concept, an average co-occurrence score (AvgCo-occurrence) and a total co-occurrence score (TotalCo-occurrence) over all instances the concept is paired with.

(D) Combination Features: The last group of features are combinations of the above features: (1) Depth,InstancePercent, which is Depth multiplied by InstancePercent, and
Concept         | DistanceToCorrect | MatchInst | TotalInst | MatchChild | TotalChild | AvgInstPercOfChild | Depth | AvgCooccur | TotalCooccur
People          | 4                 | 36512     | 879423    | 22         | 29         | 4%                 | 14    | 0.67       | 33506
Actors          | 0                 | 29101     | 54420     | 6          | 10         | 32%                | 6     | 2.08       | 99971
English Actors  | 2                 | 3091      | 5922      | 3          | 4          | 37%                | 3     | 2.75       | 28378

Table 1: Training/Testing Examples: The top table shows examples of raw statistics gathered for three candidate concepts for the left argument of the annotation acted-in. The second table shows the training/testing examples generated from these concepts and statistics. Each example represents a pair of concepts, which is labeled positive if the first concept is closer to the correct concept than the second concept. Features shown here are the ratio between a statistic for the first concept and a statistic for the second (e.g., Depth for Actors-English Actors is 2, as Actors has a depth of 6 and English Actors has a depth of 3). Some features are omitted due to space constraints.
(2) Depth,InstancePercent,Children, which is Depth multiplied by InstancePercent multiplied by MatchedChildren. Both these features seek to balance the perceived relevance of an annotation to a candidate concept with the generality of the candidate concept.

Generating Learning Examples: For a given annotation, the ranking features described so far are computed for each candidate concept (e.g., Movie Actors, Models, Actors). However, the actual training and testing examples are generated for pairs of candidate concepts (e.g., <Film Actors, Models>, <Film Actors, Actors>, <Models, Actors>). A training example represents a comparison between two candidate concepts, and specifies which of the two is more relevant. To create training and testing examples, the values of the features of the first concept in the pair are respectively combined with the values of the features of the second concept in the pair, to produce values corresponding to the entire pair.

Following classification of testing examples, concepts are ranked according to the number of other concepts which they are classified as more relevant than. Table 1 shows examples of training/testing data.

3 Experimental Setting

3.1 Data Sources

Conceptual Hierarchy: The experiments compute concept-level annotations relative to a conceptual hierarchy derived automatically from the Wikipedia (Remy, 2002) category network, as described in (Ponzetto and Navigli, 2009). The hierarchy filters out edges (e.g., from British Film Actors to Cinema of the United Kingdom) from the Wikipedia category network that do not correspond to IsA relations. A concept in the hierarchy is a Wikipedia category (e.g., English Film Actors) that has zero or more Wikipedia categories as child concepts, and zero or more Wikipedia categories (e.g., English People by Occupation, British Film Actors) as parent concepts. Each concept in the hierarchy has zero or more instances, which are the Wikipedia articles listed (in Wikipedia) under the respective categories (e.g., colin firth is an instance of English Actors).

Instance-Level Annotations: The experiments exploit a set of binary instance-level annotations (e.g., acted-in, composed) among Wikipedia instances, as available in Freebase (Bollacker et al., 2008). Each annotation is a Freebase property (e.g., /music/composition/composer). Internally, the left and right arguments are Freebase topic identifiers mapped to their corresponding Wikipedia articles (e.g., /m/03f4k mapped to the Wikipedia article on george gershwin). In this paper, the derived annotations and instances are displayed in a shorter, more readable form for conciseness and clarity. As features do not use the label of the annotation, labels are never used in the experiments and evaluation.
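The pairwise construction of learning examples described in Section 2.2 can be sketched as follows. This is a minimal illustration, loosely following the Table 1 statistics and the raw-ratio feature combination; the function names and the reduced feature set are assumptions, not the paper's code.

```python
def pairwise_examples(stats, dist_to_gold):
    """Build one example per ordered pair of candidate concepts.

    Label: 1 if the first concept is closer (in hierarchy edges) to the
    gold concept than the second, else 0. Features: the raw ratio of the
    first concept's statistics to the second's (0 when the denominator
    is 0), one of the three combination schemes from Section 3.2.
    """
    examples = []
    for c1 in stats:
        for c2 in stats:
            if c1 == c2:
                continue
            feats = [a / b if b != 0 else 0.0
                     for a, b in zip(stats[c1], stats[c2])]
            label = 1 if dist_to_gold[c1] < dist_to_gold[c2] else 0
            examples.append((label, (c1, c2), feats))
    return examples

def rank_by_wins(classified):
    """Rank concepts by how many pairwise comparisons they win
    (here, the gold labels stand in for classifier predictions)."""
    wins = {c1: 0 for _, (c1, _), _ in classified}
    for label, (c1, _), _ in classified:
        wins[c1] += label
    return sorted(wins, key=wins.get, reverse=True)

# Per-concept statistics (MatchedInstances, InstancePercent, Depth),
# loosely derived from the Table 1 rows for the left argument of acted-in.
stats = {
    "People": (36512, 0.04, 14),
    "Actors": (29101, 0.53, 6),
    "English Actors": (3091, 0.52, 3),
}
# Distance, in edges, from each candidate to the gold concept (Actors).
dist_to_gold = {"People": 4, "Actors": 0, "English Actors": 2}

examples = pairwise_examples(stats, dist_to_gold)
ranking = rank_by_wins(examples)  # Actors first: it beats both others
```

Note how the Depth feature for the pair <Actors, English Actors> comes out as 6/3 = 2, matching the ratio illustrated in the Table 1 caption.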
Web Search Queries: The argument co-occurrence features described above are computed over a set of around 100 million anonymized Web search queries from 2010.

3.2 Experimental Runs

The experimental runs exploit the ranking features described in the previous section, employing:
• one of three learning algorithms: naive Bayes (NaiveBayes), maximum entropy (MaxEnt), or perceptron (Perceptron) (Mitchell, 1997), chosen for their scalability to larger datasets via distributed implementations;
• one of three ways of combining the values of features collected for individual candidate concepts into values of features for pairs of candidate concepts: the raw ratio of the values of the respective features of the two concepts (0 when the denominator is 0); the ratio scaled to the interval [0, 1]; or a binary value indicating which of the values is larger.

For completeness, the experiments include three additional, baseline runs. Each baseline computes scores for all candidate concepts based on the respective metric; then candidate concepts are ranked in decreasing order of their scores. The baseline metrics are:
• InstPercent ranks candidate concepts by the percentage of matched instances that are descendants of the concept. It emphasizes concepts which are proven to belong to the annotation;
• Entropy ranks candidate concepts by the entropy (Shannon, 1948) of the proportion of matched descendant instances of the concept;
• AvgDepth ranks candidate concepts by their distances to half of the maximum hierarchy height, emphasizing a balance of generality and specificity.

3.3 Evaluation Procedure

Gold Standard of Concept-Level Annotations: A random, weighted sample of 200 annotation labels (e.g., corresponding to composed-by, play-instrument) is selected out of the set of labels of all instance-level annotations collected from Freebase. During sampling, the weights are the counts of distinct instance-level annotations (e.g., <rhapsody in blue, george gershwin>) available for the label. The arguments of the annotation labels are then manually annotated with a gold concept, which is the category from the Wikipedia hierarchy that best captures their semantics. The manual annotation is carried out independently by two human judges, who then verify each other's work and discard inconsistencies. For example, the gold concept of the left argument of composed-by is annotated to be the Wikipedia category Musical Compositions. In the process, some annotation labels are discarded when (a) it is not clear what concept captures an argument (e.g., for the right argument of function-of-building), or (b) more than 5000 candidate concepts are available via propagation for one of the arguments, which would cause too many training or testing examples to be generated via concept pairs, and slow down the experiments. The retained 139 annotation labels, whose arguments have been labeled with their respective gold concepts, form the gold standard for the experiments. More precisely, an entry in the resulting gold standard consists of an annotation label, one of its arguments being considered (left or right), and a gold concept that best captures that argument. The set of annotation labels from the gold standard is quite diverse and covers many domains of potential interest, e.g., has-company(Industries, Companies), written-by(Films, Screenwriters), member-of(Politicians, Political Parties), or part-of-movement(Artists, Art Movements).

Evaluation Metric: Following previous work on selectional preferences (Kozareva and Hovy, 2010; Ritter et al., 2010), each entry in the gold standard (i.e., each argument for a given annotation) is evaluated separately. Experimental runs compute a ranked list of candidate concepts for each entry in the gold standard. In theory, a computed candidate concept is better if it is closer semantically to the gold concept. In practice, the accuracy of a ranked list of candidate concepts, relative to the gold concept of the annotation label, is measured by two scoring metrics that correspond to the mean reciprocal rank score (MRR) (Voorhees and Tice, 2000) and a modification of it (DRR) (Pasca and Alfonseca, 2009):

    MRR = (1/N) · Σ_{i=1..N} max 1/rank_i

N is the number of annotations and rank_i is the rank of the gold concept in the returned list for MRR. An annotation a_i receives no credit for MRR if the gold concept does not appear in the corresponding ranked list.

    DRR = (1/N) · Σ_{i=1..N} max 1/(rank_i · (1 + Len))

For DRR, rank_i is the rank of a candidate concept in the returned list and Len is the length of
Annotation (Number of Candidate Concepts) Examples of Instances Top Ranked Concepts
Composers compose Musical Compositions (3038) aaron copland; black sabbath Music by Nationality; Composers; Classical
Relation (count) | Example instances | Top computed concepts
Musical Compositions composed-by Composers (1734) | we are the champions; yorckscher marsch | Musical Compositions; Compositions by Composer; Classical Music
Foods contain Nutrients (1112) | acca sellowiana; lasagna | Foods; Edible Plants; Food Ingredients
Organizations has-boardmember People (3401) | conocophillips; spence school | Companies by Stock Exchange; Companies Listed on the NYSE; Companies
Educational Organizations has-graduate Alumni (4072) | air force institute of technology; deering high school | Education by Country; Schools by Country; Universities and Colleges by Country
Television Actors guest-role Fictional Characters (4823) | melanie griffith; patti laBelle | Television Actors by Nationality; Actors; American Actors
Musical Groups has-member Musicians (2287) | steroid maximus; u2 | Musical Groups; Musical Groups by Genre; Musical Groups by Nationality
Record Labels represent Musician (920) | columbia records; vandit | Record Labels; Record Labels by Country; Record Labels by Genre
Awards awarded-to People (458) | academy award for best original song; erasmus prize | Film Awards; Awards; Grammy Awards
Foods contain Nutrients (177) | lycopene; glutamic acid | Carboxylic Acids; Acids; Essential Nutrients
Architects design Buildings and Structures (4811) | 20 times square; berkeley building | Buildings and Structures; Buildings and Structures by Architect; Houses by Country
People died-from Causes of Death (577) | malaria; skiing | Diseases; Infectious Diseases; Causes of Death
Art Directors direct Films (1265) | batman begins; the lion king | Films; Films by Director; Film
Episodes guest-star Television Actors (1067) | amy poehler; david caruso | Television Actors by Nationality; Actors; American Actors
Television Network has-tv-show Television Series (2492) | george of the jungle; great expectations | Television Series by Network; Television Series; Television Series by Genre
Musicians play Musical Instruments (423) | accordion; tubular bell | Musical Instruments; Musical Instruments by Nationality; Percussion Instruments
Politicians member-of Political Parties (938) | independent moralizing front; national coalition party | Political Parties; Political Parties by Country; Political Parties by Ideology

Table 2: Concepts Computed for Gold-Standard Annotations: Examples of entries from the gold standard and counts of candidate concepts (Wikipedia categories) reached from upward propagation of instances (Wikipedia instances). The target gold concept is shown in bold. Also shown are examples of Wikipedia instances, and the top concepts computed by the best-performing learning algorithm for the respective gold concepts.
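The candidate-concept counts in Table 2 come from propagating each annotation upward from its Wikipedia instances through the category hierarchy. A minimal sketch of such upward propagation, on an illustrative hierarchy fragment (the data structures and function name are assumptions, not the paper's implementation):

```python
from collections import deque

def propagate_upward(instances, parents):
    """Collect every category reachable upward from a set of annotated
    instances; these categories become the candidate concepts."""
    candidates = set()
    seen = set(instances)
    queue = deque(instances)
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            candidates.add(parent)
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return candidates

# Toy fragment: instance -> categories, category -> parent categories.
parents = {
    "u2": ["Musical Groups by Genre"],
    "steroid maximus": ["Musical Groups by Nationality"],
    "Musical Groups by Genre": ["Musical Groups"],
    "Musical Groups by Nationality": ["Musical Groups"],
}
cands = propagate_upward(["u2", "steroid maximus"], parents)
# cands contains the three candidate categories above the two instances
```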
the minimum path in the hierarchy between the concept and the gold concept. Len is minimum (0) if the candidate concept is the same as the gold-standard concept. A given annotation ai receives no credit for DRR if no path is found between the returned concepts and the gold concept.

As an illustration, for a single annotation, the right argument of composed-by, the ranked list of concepts returned by an experimental run may be [Symphonies by Anton Bruckner, Symphonies by Joseph Haydn, Symphonies by Gustav Mahler, Musical Compositions, ..], with the gold concept being Musical Compositions. The length of the path between Symphonies by Anton Bruckner etc. and Musical Compositions is 2 (via Symphonies). Therefore, the MRR score would be 0.25 (given by the fourth element of the ranked list), whereas the DRR score would be 0.33 (given by the first element of the ranked list).

MRR and DRR are computed in five-fold cross-validation. Concretely, the gold standard is split into five folds such that the sets of annotation labels in each fold are disjoint. Thus, none of the annotation labels in testing appears in training. This restriction makes the evaluation more rigorous and conservative, as it actually assesses the extent to which the learned models are applicable to new, previously unseen annotation labels. If this restriction were relaxed, the baselines would perform equivalently, as they do not depend on the training data, but the learned methods would likely do better.

4 Evaluation Results

4.1 Quantitative Results

Conceptual Hierarchy: The conceptual hierarchy contains 108,810 Wikipedia categories, and its maximum depth, measured as the distance from a concept to its farthest descendant, is 16.

Candidate Concepts: On average, for the gold standard, the method propagates a given annotation from instances to 1,525 candidate concepts, from which the single best concept must be determined. The left part of Table 2 illustrates the number of candidate concepts reached during propagation for a sample of annotations.
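The MRR and DRR scoring of a ranked list can be sketched as follows. The DRR formulation here (each candidate scores 1/(rank + len), and the list scores the maximum) is an assumption reverse-engineered from the worked example above (0.25 via the exact match at rank 4, 0.33 via the rank-1 concept two steps away), not necessarily the paper's exact definition:

```python
def mrr(ranked, gold):
    """Reciprocal rank of the gold concept in the (already truncated)
    ranked list; 0 if the gold concept is absent."""
    for rank, concept in enumerate(ranked, start=1):
        if concept == gold:
            return 1.0 / rank
    return 0.0

def drr(ranked, path_len, gold):
    """Distance-discounted reciprocal rank: each candidate scores
    1/(rank + len), where len is its hierarchy distance to the gold
    concept (0 for an exact match); no credit when no path exists
    (path_len returns None). The list scores the best candidate."""
    best = 0.0
    for rank, concept in enumerate(ranked, start=1):
        length = path_len(concept, gold)
        if length is not None:
            best = max(best, 1.0 / (rank + length))
    return best

# The worked example from the text:
ranked = ["Symphonies by Anton Bruckner", "Symphonies by Joseph Haydn",
          "Symphonies by Gustav Mahler", "Musical Compositions"]
lens = {"Symphonies by Anton Bruckner": 2, "Symphonies by Joseph Haydn": 2,
        "Symphonies by Gustav Mahler": 2, "Musical Compositions": 0}
path_len = lambda concept, gold: lens.get(concept)  # toy lookup; ignores gold
# mrr(...) reproduces 0.25; drr(...) reproduces 1/3 (about 0.33)
```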
Experimental Run | N=1 MRR | N=1 DRR | N=20 MRR | N=20 DRR
With raw-ratio features:
Naive Bayes | 0.021 | 0.180 | 0.054 | 0.222
MaxEnt | 0.029 | 0.168 | 0.045 | 0.208
Perceptron | 0.029 | 0.176 | 0.045 | 0.216
With scaled-ratio features:
Naive Bayes | 0.050 | 0.170 | 0.112 | 0.243
MaxEnt | 0.245 | 0.456 | 0.430 | 0.513
Perceptron | 0.245 | 0.391 | 0.367 | 0.461
With binary features:
Naive Bayes | 0.115 | 0.297 | 0.224 | 0.361
MaxEnt | 0.165 | 0.390 | 0.293 | 0.441
Perceptron | 0.180 | 0.332 | 0.330 | 0.429
Baselines:
InstPercent | 0.029 | 0.173 | 0.045 | 0.224
Entropy | 0.000 | 0.110 | 0.007 | 0.136
AvgDepth | 0.007 | 0.018 | 0.028 | 0.045

Table 3: Precision Results: Accuracy of ranked lists of concepts (Wikipedia categories) computed by various runs, as an average over the gold standard of concept-level annotations, considering the top N candidate concepts computed for each gold standard entry.

4.2 Qualitative Results

Precision: Table 3 compares the precision of the ranked lists of candidate concepts produced by the experimental runs. The MRR and DRR scores in the table consider either at most 20 of the concepts in the ranked list computed by a given experimental run, or only the first, top-ranked computed concept. Note that, in the latter case, the MRR and DRR scores are equivalent to precision@1 scores.

Several conclusions can be drawn from the results. First, as expected by definition of the scoring metrics, DRR scores are higher than the stricter MRR scores, as they give partial credit to concepts that, while not identical to the gold concepts, are still close approximations. This is particularly noticeable for the runs MaxEnt and Perceptron with raw-ratio features (4.6 and 4.8 times higher respectively). Second, among the baselines, InstPercent is the most accurate, with the computed concepts identifying the gold concept strictly at rank 22 on average (for an MRR score of 0.045), and loosely at an average of 4 steps away from the gold concept (for a DRR score of 0.224). Third, the accuracy of the learning algorithms varies with how the pairwise feature values are combined. Overall, raw-ratio feature values perform the worst, and scaled-ratio the best, with binary in-between. Fourth, the scores of the best experimental run, MaxEnt with scaled-ratio features, are 0.430 (MRR) and 0.513 (DRR) over the top 20 computed concepts, and 0.245 (MRR) and 0.456 (DRR) when considering only the first concept. These scores correspond to the ranked list being less than one step away in the hierarchy. The very first computed concept exactly matches the gold concept in about one in four cases, and is slightly more than one step away from it. In comparison, the very first concept computed by the best baseline matches the gold concept in about one in 35 cases (0.029 MRR), and is about 6 steps away (0.173 DRR). The accuracies of the various learning algorithms (not shown) were also measured and correlated roughly with the MRR and DRR scores.

Discussion: The baseline runs InstPercent and Entropy produce categories that are far too specific. For the gold annotation composed-by(Composers, Musical Compositions), InstPercent produces Scottish Flautists for the left argument and Operas by Ernest Reyer for the right. AvgDepth does not suffer from over-specification, but often produces concepts that have been reached via propagation, yet are not close to the gold concept. For composed-by, AvgDepth produces Film for the left argument and History by Region for the right.

4.3 Error Analysis

The right part of Table 2 provides a more detailed view into the best performing experimental run, showing actual ranked lists of concepts produced for a sample of the gold standard entries by MaxEnt with scaled-ratio. A separate analysis of the results indicates that the most common cause of errors is noise in the conceptual hierarchy, in the form of unbalanced instance-level annotations and missing hierarchy edges. Unbalanced annotations are annotations where certain subtrees of the hierarchy are artificially more populated than other subtrees. For the left argument of the annotation has-profession, 0.05% of New York Politicians are matched but 70% of Bushrangers are matched. Such imbalances may be inherent to how annotations are added to Freebase: different human contributors may add new annotations to particular portions of Freebase, but miss other relevant portions.

The results are also affected by missing edges in the hierarchy. Of the more than 100K concepts in the hierarchy, 3479 are roots of subhierarchies that are mutually disconnected. Examples are People by Region, Shades of Red, and
Members of the Parliament of Northern Ireland, all of which should have parents in the hierarchy. If a few edges are missing in a particular region of the hierarchy, the method can recover, but if so many edges are missing that a gold concept has very few descendants, then propagation can be substantially affected. In the worst case, the gold concept becomes disconnected, and thus will be missing from the set of candidate concepts compiled during propagation. For example, for the annotation team-color(Sports Clubs, Colors), the only descendant concept of Colors in the hierarchy is Horse Coat Colors, meaning that the gold concept Colors is not reached during propagation from instances upwards in the hierarchy.

5 Related Work

Similar to the task of attaching a semantic annotation to the concept in a hierarchy that has the best level of generality is the task of finding selectional preferences for relations. Most relevant to this paper is work that seeks to find the appropriate concept in a hierarchy for an argument of a specific relation (Ribas, 1995; McCarthy, 1997; Li and Abe, 1998). Li and Abe (1998) address this problem by attempting to identify the best tree cut in a hierarchy for an argument of a given verb. They use the minimum description length principle to select a set of concepts from a hierarchy to represent the selectional preferences. This work makes several limiting assumptions, including that the hierarchy is a tree and that every instance belongs to just one concept. Clark and Weir (2002) investigate the task of generalizing a single relation-concept pair. A relation is propagated up a hierarchy until a chi-square test determines the difference between the probability of the child and parent concepts to be significant, where the probabilities are relation-concept frequencies. This method has no direct translation to the task discussed here; it is unclear how to choose the correct concept if instances generalize to different concepts.

In other research on selectional preferences, Pantel et al. (2007), Kozareva and Hovy (2010) and Ritter et al. (2010) focus on generating admissible arguments for relations, and Erk (2007) and Bergsma et al. (2008) investigate classifying a relation-instance pair as plausible or not.

Important to this paper is the Wikipedia category network (Remy, 2002) and work on refining it. Ponzetto and Navigli (2009) disambiguate Wikipedia categories by using WordNet synsets and use this semantic information to construct a taxonomy. The resulting taxonomy is the conceptual hierarchy used in the evaluation.

Another related area of work is the discovery of relations between concepts. Nastase and Strube (2008) use Wikipedia category names and category structure to generate a set of relations between concepts. Yan et al. (2009) discover relations between Wikipedia concepts via deep linguistic information and Web frequency information. Mohamed et al. (2011) generate candidate relations by coclustering text contexts for every pair of concepts in a hierarchy. In a sense, this area of research is complementary to that discussed in this paper. These methods induce new relations, and the proposed method can be used to find appropriate levels of generalization for the arguments of any given relation.

6 Conclusions

This paper introduces a method to convert flat sets of instance-level annotations to hierarchically organized, concept-level annotations. The method determines the appropriate concept for a given semantic annotation in three stages. First, it propagates annotations upwards in the hierarchy, forming a set of candidate concepts. Second, it classifies each candidate concept as more or less appropriate than each other candidate concept within an annotation. Third, it ranks candidate concepts by the number of other concepts relative to which each is classified as more appropriate. Because the features are comparisons between concepts within a single semantic annotation, rather than considerations of individual concepts, the method is able to generalize across annotations, and can thus be applied to new, previously unseen annotations. Experiments demonstrate that, on average, the method is able to identify the concept of a given annotation's argument within one hierarchy edge of the gold concept.

The proposed method can take advantage of existing work on open-domain information extraction. The output of such work is usually instance-level annotations, although often at surface level (non-disambiguated arguments) rather than semantic level (disambiguated arguments). After argument disambiguation (e.g., Dredze et al. (2010)), the annotations can be used as input to determining concept-level annotations. Thus, the method has the potential to generalize any existing database of instance-level annotations to concept-level annotations.
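The ranking mechanics of the second and third stages can be sketched as follows. The pairwise preference function below is a toy depth heuristic standing in for the trained pairwise classifier (e.g. MaxEnt with scaled-ratio features), so only the win-counting scheme, not the model, is faithful to the method:

```python
from itertools import permutations

def rank_candidates(candidates, more_appropriate):
    """Rank candidate concepts by the number of other candidates
    relative to which `more_appropriate(a, b)` judges them the better
    level of generalization (stand-in for the learned classifier)."""
    wins = {c: 0 for c in candidates}
    for a, b in permutations(candidates, 2):
        if more_appropriate(a, b):
            wins[a] += 1
    return sorted(candidates, key=lambda c: wins[c], reverse=True)

# Toy preference: shallower (more general) concepts win; depth is a
# plausible feature but purely illustrative here.
depth = {"Musical Compositions": 2, "Symphonies": 3,
         "Symphonies by Gustav Mahler": 4}
ranking = rank_candidates(list(depth), lambda a, b: depth[a] < depth[b])
# ranking[0] is "Musical Compositions" under this toy preference
```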
References

Michele Banko, Michael Cafarella, Stephen Soderland, Matt Broadhead, and Oren Etzioni. 2007. Open information extraction from the Web. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI-07), pages 2670–2676, Hyderabad, India.

Cory Barr, Rosie Jones, and Moira Regelson. 2008. The linguistic structure of English Web-search queries. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP-08), pages 1021–1030, Honolulu, Hawaii.

Shane Bergsma, Dekang Lin, and Randy Goebel. 2008. Discriminative learning of selectional preference from unlabeled text. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP-08), pages 59–68, Honolulu, Hawaii.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 International Conference on Management of Data (SIGMOD-08), pages 1247–1250, Vancouver, Canada.

Stephen Clark and David Weir. 2002. Class-based probability estimation using a semantic hierarchy. Computational Linguistics, 28(2):187–206.

Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber, and Tim Finin. 2010. Entity disambiguation for knowledge base population. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING-10), pages 277–285, Beijing, China.

Katrin Erk. 2007. A simple, similarity-based model for selectional preferences. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL-07), pages 216–223, Prague, Czech Republic.

Zornitsa Kozareva and Eduard Hovy. 2010. Learning arguments and supertypes of semantic relations using recursive patterns. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL-10), pages 1482–1491, Uppsala, Sweden.

Hang Li and Naoki Abe. 1998. Generalizing case frames using a thesaurus and the MDL principle. In Proceedings of the ECAI-2000 Workshop on Ontology Learning, pages 217–244, Berlin, Germany.

Xiao Li. 2010. Understanding the semantic structure of noun phrase queries. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL-10), pages 1337–1345, Uppsala, Sweden.

Diana McCarthy. 1997. Word sense disambiguation for acquisition of selectional preferences. In Proceedings of the ACL/EACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 52–60, Madrid, Spain.

Tom Mitchell. 1997. Machine Learning. McGraw Hill.

Thahir Mohamed, Estevam Hruschka, and Tom Mitchell. 2011. Discovering relations between noun categories. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP-11), pages 1447–1455, Edinburgh, United Kingdom.

Vivi Nastase and Michael Strube. 2008. Decoding Wikipedia categories for knowledge acquisition. In Proceedings of the 23rd National Conference on Artificial Intelligence (AAAI-08), pages 1219–1224, Chicago, Illinois.

Patrick Pantel, Rahul Bhagat, Timothy Chklovski, and Eduard Hovy. 2007. ISP: Learning inferential selectional preferences. In Proceedings of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-07), pages 564–571, Rochester, New York.

M. Pasca and E. Alfonseca. 2009. Web-derived resources for Web Information Retrieval: From conceptual hierarchies to attribute hierarchies. In Proceedings of the 32nd International Conference on Research and Development in Information Retrieval (SIGIR-09), pages 596–603, Boston, Massachusetts.

Simone Paolo Ponzetto and Roberto Navigli. 2009. Large-scale taxonomy mapping for restructuring and integrating Wikipedia. In Proceedings of the 21st International Joint Conference on Artifical Intelligence (IJCAI-09), pages 2083–2088, Barcelona, Spain.

Melanie Remy. 2002. Wikipedia: The free encyclopedia. Online Information Review, 26(6):434.

Francesc Ribas. 1995. On learning more appropriate selectional restrictions. In Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics (EACL-97), pages 112–118, Madrid, Spain.

Alan Ritter, Mausam, and Oren Etzioni. 2010. A latent dirichlet allocation method for selectional preferences. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL-10), pages 424–434, Uppsala, Sweden.

Claude Shannon. 1948. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656.

Ellen Voorhees and Dawn Tice. 2000. Building a question-answering test collection. In Proceedings of the 23rd International Conference on Research and Development in Information Retrieval (SIGIR-00), pages 200–207, Athens, Greece.
Fei Wu and Daniel S. Weld. 2010. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL-10), pages 118–127, Uppsala, Sweden.

Yulan Yan, Naoaki Okazaki, Yutaka Matsuo, Zhenglu Yang, and Mitsuru Ishizuka. 2009. Unsupervised relation extraction by mining Wikipedia texts using information from the Web. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP-09), pages 1021–1029, Suntec, Singapore.
Joint Satisfaction of Syntactic and Pragmatic Constraints Improves Incremental Spoken Language Understanding

Andreas Peldszus
University of Potsdam
Department for Linguistics
peldszus@uni-potsdam.de

Okko Buß
University of Potsdam
Department for Linguistics
okko@ling.uni-potsdam.de
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 514–523, Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics
mentation of a continuous understanding module that uses reference information in guiding a bottom-up chart-parser, which is evaluated on a single dialogue transcript. In contrast, our model uses a probabilistic top-down parser with beam search (following Roark (2001)) and is evaluated on a large number of real-world utterances as processed by an automatic speech recogniser. Similarly, DeVault and Stone (2003) describe a system that implements interaction between a parser and higher-level modules (in this case, even more principled, trying to prove presuppositions), which however is also only tested on a small, constructed data-set.

Schuler (2003) and Schuler et al. (2009) present a model where information about reference is used directly within the speech recogniser, and hence informs not only syntactic processing but also word recognition. To this end, the processing is folded into the decoding step of the ASR, and is realised as a hierarchical HMM. While technically interesting, this approach is by design non-modular and restricted in its syntactic expressivity.

The work presented here also has connections to work in psycholinguistics. Pado et al. (2009) present a model that combines syntactic and semantic models into one plausibility judgement that is computed incrementally. However, that work is evaluated for its ability to predict reading time data and not for its accuracy in computing meaning.

3 The Model

3.1 Overview

Described abstractly, the model computes the probability of a syntactic derivation (and its accompanying logical form) as a combination of a syntactic probability (as in a typical PCFG) and a semantic or pragmatic plausibility.[1] The pragmatic plausibility here comes from the presupposition that the speaker intended her utterance to successfully refer, i.e. to have a denotation in the current situation (a unique one, in the case of definite reference). Hence, readings that do have a denotation are preferred over those that do not.

[1] Note that, as described below, in the actual implementation the weights given to particular derivations are not real probabilities anymore, as derivations fall out of the beam and normalisation is not performed after re-weighting.

The components of our model are described in the following sections: first the parser, which computes the syntactic probability in an incremental, top-down manner; the semantic construction algorithm, which associates (underspecified) logical forms to derivations; the reference resolution component, which computes the pragmatic plausibility; and the combination that incorporates the feedback from this pragmatic signal.

3.2 Parser

Roark (2001) introduces a strategy for incremental probabilistic top-down parsing and shows that it can compete with high-coverage bottom-up parsers. One of the reasons he gives for choosing a top-down approach is that it enables fully left-connected derivations, where at every processing step new increments directly find their place in the existing structure. This monotonically enriched structure can then serve as a context for incremental language understanding, as the author claims, although this part is not further developed by Roark (2001). He discusses a battery of different techniques for refining his results, mostly based on grammar transformations and on conditioning functions that manipulate a derivation probability on the basis of local linguistic and lexical information.

We implemented a basic version of his parser without considering additional conditioning or lexicalizations. However, we applied left-factorization to parts of the grammar to delay certain structural decisions as long as possible. The search space is reduced by using beam search. To match the next token, the parser tries to expand the existing derivations. These derivations are stored in a prioritised queue, which means that the most probable derivation will always be served first. Derivations resulting from rule expansions are kept in the current queue; derivations resulting from a successful lexical match are pushed into a new queue. The parser proceeds with the next most probable derivation until the current queue is empty or until a threshold is reached at which remaining analyses are pruned. This threshold is determined dynamically: if the probability of the current derivation is lower than the product of the best derivation's probability on the new queue, the number of derivations in the new queue, and a base beam factor (an initial parameter for the size of the search beam), then all further old derivations are pruned. Due to probabilistic weighting and the left-factorization of the rules, left recursion poses no direct threat in such an approach.

Additionally, we implemented three robust lexical operations: insertions consume the current token without matching it to the top stack item; deletions can consume a requested but actually non-existent token; repairs adjust unknown tokens to the requested token. These robust operations have strong penalties on the probability to make sure they will survive in the derivation only in critical situations. Additionally, only a single one of them is allowed to occur between the recognition of two adjacent input tokens.

Figure 1 illustrates this process for the first few words of the example sentence nimm den winkel in der dritten reihe (take the bracket in the third row), using the incremental unit (IU) model to represent increments and how they are linked; see (Schlangen and Skantze, 2009).[2] Here, syntactic derivations (CandidateAnalysisIUs) are represented by three features: a list of the last parser actions of the derivation (LD), with rule expansions or (robust) lexical matches; the derivation probability (P); and the remaining stack (S), where S* is the grammar's start symbol and S! an explicit end-of-input marker. (To keep the Figure small, we artificially reduced the beam size and cut off alternative paths, shown in grey.)

[Figure 1: An example network of incremental units, including the levels of words, POS-tags, syntactic derivations and logical forms. See section 3 for a more detailed description.]

[2] Very briefly: rounded boxes in the Figures represent IUs, and dashed arrows link an IU to its predecessor on the same level, where the levels correspond to processing stages. The Figure shows the levels of input words, POS-tags, syntactic derivations and logical forms. Multiple IUs sharing the same predecessor can be regarded as alternatives. Solid arrows indicate which information from a previous level an IU is grounded in (based on); here, every semantic IU is grounded in a syntactic IU, every syntactic IU in a POS-tag-IU, and so on.

3.3 Semantic Construction Using RMRS

As a novel feature, we use for the representation of meaning increments (that is, the contributions of new words and syntactic constructions), as well as for the resulting logical forms, the formalism Robust Minimal Recursion Semantics (Copestake, 2006). This is a representation formalism that was originally constructed for semantic underspecification (of scope and other phenomena) and then adapted to serve the purposes of semantics representations in heterogeneous situations where information from deep and shallow parsers must be combined. In RMRS, meaning representations of a first-order logic are underspecified in two ways: First, the scope relationships can be underspecified by splitting the formula into a list of elementary predications (EP) which receive a label l and are explicitly related by stating scope constraints to hold between them (e.g. qeq-constraints). This way, all scope readings can be compactly represented. Second, RMRS allows underspecification of the predicate-argument structure of EPs. Arguments are bound to a predicate by anchor variables a, expressed in the form of an argument relation ARGREL(a,x). This way, predicates can be introduced without fixed arity and arguments can be introduced without knowing which predicates they are arguments of. We will make use of this second form of underspecification and enrich lexical predicates with arguments incrementally.

Combining two RMRS structures involves at least joining their lists of EPs, ARGRELs and scope constraints. Additionally, equations between the variables can connect two structures, which is an essential requirement for semantic construction. A semantic algebra for the combination of RMRSs in a non-lexicalist setting is defined in (Copestake, 2007). Unsaturated semantic increments have open slots that need to be filled by what is called the hook of another structure. Hook and slot are triples [l:a:x] consisting of a label, an anchor and an index variable. Every variable of the hook is equated with the corresponding one in the slot. This way the semantic representation can grow monotonically at each combinatory step by simply adding predicates, constraints and equations.

Our approach differs from (Copestake, 2007) only in the organisation of the slots: In an incremental setting, a proper semantic representation is desired for every single state of growth of the syntactic tree. Typically, RMRS composition assumes that the order of semantic combination is parallel to a bottom-up traversal of the syntactic tree. Yet, this would require for every incremental step first to calculate an adequate underspecified semantic representation for the projected nodes on the lower right border of the tree and then to proceed with the combination not only of the new semantic increments but of the complete tree. For our purposes, it is more elegant to proceed with semantic combination in synchronisation with the syntactic expansion of the tree, i.e. in a top-down left-to-right fashion. This way, no underspecification of projected nodes and no re-interpretation of already existing parts of the tree is required. This, however, requires adjustments to the slot structure of RMRS. Left-recursive rules can introduce multiple slots of the same sort before they are filled, which is not allowed in the classic (R)MRS semantic algebra, where only one named slot of each sort can be open at a time. We thus organize the slots as a stack of unnamed slots, where multiple slots of the same sort can be stored, but only the one on top can be accessed. We then define a basic combination operation equivalent to forward function composition (as in standard lambda calculus, or in CCG (Steedman, 2000)) and combine substructures in a principled way across multiple syntactic rules without the need to represent slot names.

Each lexical item receives a generic representation derived from its lemma and the basic semantic type (individual, event, or underspecified denotations), determined by its POS tag. This makes the grammar independent of knowledge about what later (semantic) components will actually be able to process (understand).[3] Parallel to the production of syntactic derivations, as the tree is expanded top-down left-to-right, semantic macros are activated for each syntactic rule, composing the contribution of the new increment. This allows for a monotonic semantics construction process that proceeds in lockstep with the syntactic analysis.

[3] This feature is not used in the work presented here, but it could be used for enabling the system to learn the meaning of unknown words.

Figure 1 (in the FormulaIU box) illustrates the results of this process for our example derivation. Again, alternative paths have been cut to keep the size of the illustration small. Notice that, apart from the end-of-input marker, the stack of semantic slots (in curly brackets) is always synchronized with the parser's stack.

3.4 Computing Noun Phrase Denotations

Formally, the task of this module is, given a model M of the current context, to compute the set of all variable assignments g such that M satisfies φ: G = {g | M |=g φ}. If |G| > 1, we say that φ refers ambiguously; if |G| = 1, it refers uniquely; and if |G| = 0, it fails to refer. This process does not work directly on RMRS formulae, but on extracted and unscoped first-order representations of their nominal content.
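The denotation computation G = {g | M |=g φ} can be sketched by brute force over all assignments of domain entities to the free variables. The entity encoding and predicate extraction below are illustrative assumptions, not the actual situation-model format:

```python
from itertools import product

def denotations(variables, predicates, domain):
    """G = {g | M |=_g phi}: all assignments of domain entities to the
    free variables under which every extracted, unscoped predicate
    holds. Brute force over the assignment space; a sketch of the
    module's task, not its implementation."""
    G = []
    for values in product(domain, repeat=len(variables)):
        g = dict(zip(variables, values))
        if all(pred(g) for pred in predicates):
            G.append(g)
    return G

def classify(G):
    """|G| > 1: refers ambiguously; |G| = 1: uniquely; |G| = 0: fails."""
    return "ambiguous" if len(G) > 1 else ("unique" if len(G) == 1 else "fails")

# Toy situation model for 'the bracket in the third row':
# predicates winkel(x), row(y), third(y), in(x,y) over dict entities.
domain = [
    {"kind": "winkel", "id": "w1", "in": "r1"},
    {"kind": "winkel", "id": "w2", "in": "r3"},
    {"kind": "row", "id": "r1", "pos": 1},
    {"kind": "row", "id": "r3", "pos": 3},
]
preds = [
    lambda g: g["x"]["kind"] == "winkel",
    lambda g: g["y"]["kind"] == "row",
    lambda g: g["y"]["pos"] == 3,
    lambda g: g["x"]["in"] == g["y"]["id"],
]
G = denotations(["x", "y"], preds, domain)
# Exactly one assignment survives (x = w2, y = r3): the NP refers uniquely.
```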
Words Predicates Status
nimm nimm(e) -1
nimm den nimm(e,x) def(x) 0
nimm den Winkel nimm(e,x) def(x) winkel(x) 0
nimm den Winkel in nimm(e,x) def(x) winkel(x) in(x,y) 0
nimm den Winkel in der nimm(e,x) def(x) winkel(x) in(x,y) def(y) 0
nimm den Winkel in der dritten nimm(e,x) def(x) winkel(x) in(x,y) def(y) third(y) 1
nimm den Winkel in der dritten Reihe nimm(e,x) def(x) winkel(x) in(x,y) def(y) third(y) row(y) 1
Table 1: Example of logical forms (flattened into first-order base-language formulae) and reference resolution
results for incrementally parsing and resolving nimm den winkel in der dritten reihe
for a core fragment. We created 30 rules, whose weights were also set by hand (as discussed below, this is an obvious area for future improvement), sparingly and according to standard intuitions. When parsing, the first step is the assignment of a POS tag to each word. This is done by a simple lookup tagger that stores the most frequent tag for each word (as determined on a small subset of our corpus).[4]

[4] A more sophisticated approach has recently been proposed by Beuck et al. (2011); this could be used in our setup.

The situation model used in reference resolution is automatically derived from the internal representation of the current game state. (This was recorded in an XML format for each utterance in our corpus.) Variable assignments were then derived from the relevant nominal predicate structures,[5] consisting of extracted simple predications, e.g. red(x) and cross(x) for the NP in a phrase such as "take the red cross". For each unique predicate argument X in these EP structures (such as x above), the set of domain objects that satisfied all predicates of which X was an argument was determined. For example, for the phrase above, X mapped to all elements that were red and crosses.

[5] The domain model did not allow making a plausibility judgement based on verbal resolution.

Finally, the size of these sets was determined: no elements, one element, or multiple elements, as described above. Emptiness of at least one set denoted that no resolution was possible (for instance, if no red crosses were available, x's set was empty), uniqueness of all sets denoted that an exact resolution was possible, while multiple elements in at least some sets denoted ambiguity. This status was then leveraged for parse pruning, as per Section 3.5.

A more complex example using the scene depicted in Figure 2 and the sentence "nimm den winkel in der dritten reihe" (take the bracket in the third row) is shown in Table 1. The first column shows the incremental word hypothesis string, the second the set of predicates derived from the most recent RMRS representation, and the third the resolution status (-1 for no resolution, 0 for some resolution and 1 for a unique resolution).

4.3 Baselines and Evaluation Metric

4.3.1 Variants / Baselines

To be able to accurately quantify and assess the effect of our reference-feedback strategy, we implemented different variants / baselines. These all differ in how, at each step, the reading is determined that is evaluated against the gold standard, and are described in the following:

In the Just Syntax (JS) variant, we simply take the single-best derivation, as determined by syntax alone, and evaluate this.

The External Filtering (EF) variant adds information from reference resolution, but keeps it separate from the parsing process. Here, we look at the 5 highest-ranking derivations (as determined by syntax alone), and go through them beginning at the highest ranked, picking the first derivation where reference resolution can be performed uniquely; this reading is then put up for evaluation. If there is no such reading, the highest-ranking one will be put forward for evaluation (as in JS).

Syntax/Pragmatics Interaction (SPI) is the variant described in the previous section. Here, all active derivations are sent to the reference resolution module, and are re-weighted as described above; after this has been done, the highest-ranking reading is evaluated.

Finally, the Combined Interaction and Filtering (CIF) variant combines the previous two strategies, by using reference-feedback in computing the ranking for the derivations, and then
again using reference-information to identify the most promising reading within the set of the 5 highest-ranking ones.

4.3.2 Metric

When a reading has been identified according to one of these methods, a score s is computed as follows: s = 1 if the correct referent (according to the gold standard) is computed as the denotation for this reading; s = 0 if no unique referent can be computed, but the correct one is part of the set of possible referents; s = -1 if no referent can be computed at all, or the correct one is not part of the set of those that are computed.

As this is done incrementally for each word (adding the new word to the parser chart), for an utterance of length m we get a sequence of m such numbers. (In our experiments we treat the end-of-utterance signal as a pseudo-word, since knowing that an utterance has concluded allows the parser to close off derivations and remove those that are still requiring elements. Hence, we in fact have sequences of m+1 numbers.) A combined score for the whole utterance is computed according to the following formula:

    s_u = sum_{n=1}^{m} s_n * (n / m)

(where s_n is the score at position n). The factor n/m causes later decisions to count more towards the final score, reflecting the idea that it is more to be expected (and less harmful) to be wrong early on in the utterance, whereas the longer the utterance goes on, the more pressing it becomes to get a correct result (and the more damaging if mistakes are made).[6]

[6] This metric compresses into a single number some of the concerns of the incremental metrics developed in (Baumann et al., 2011), which can express more fine-grainedly the temporal development of hypotheses.

Note that this score is not normalised by the utterance length m, the maximally achievable score being (m + 1)/2. This has the additional effect of increasing the weight of long utterances when averaging over the scores of all utterances; we see this as desirable, as the analysis task becomes harder the longer the utterance is.

We use success in resolving reference to evaluate the performance of our parsing and semantic construction component, where more traditionally, metrics like parse bracketing accuracy might be used. But as we are building this module for an interactive system, ultimately, accuracy in recovering meaning is what we are interested in, and so we see this not just as a proxy, but actually as a more valuable metric. Moreover, this metric can be applied at each incremental step, which it is not clear how to do with more traditional metrics.

4.4 Experiments

Our parser, semantic construction and reference resolution modules are implemented within the InproTK toolkit for incremental spoken dialogue systems development (Schlangen et al., 2010). In this toolkit, incremental hypotheses are modified as more information becomes available over time. Our modules support all such modifications (i.e. they also allow to revert their states and output if word input is revoked).

As explained in Section 4.1, we used offline recognition results in our evaluation. However, the results would be identical if we were to use the incremental speech recognition output of InproTK directly.

The system performs several times faster than real-time on a standard workstation computer. We thus consider it ready to improve practical end-to-end incremental systems which perform within-turn actions such as those outlined in (Buß and Schlangen, 2010).

The parser was run with a base-beam factor of 0.01; this parameter may need to be adjusted if a larger grammar were used.

4.5 Results

Table 2 shows an overview of the experiment results. The table lists, separately for the manual transcriptions and the ASR transcripts, first the number of times that the final reading did not resolve at all, or to a wrong entity; did not uniquely resolve, but included the correct entity in its denotation; or did uniquely resolve to the correct entity (-1, 0, and 1, respectively). The next lines show strict accuracy (the proportion of 1 among all results) at the end of utterance, and relaxed accuracy (which allows ambiguity, i.e., the set {0, 1}). incr.scr is the incremental score as described above, which includes in the evaluation the development of references and not just the final state. (And in that sense, it is the most appropriate metric here, as it captures the incremental behaviour.) This score is shown both as an absolute
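The combined utterance score defined above can be computed directly from the per-word score sequence; a minimal sketch (illustrative, not the authors' implementation):

```python
# Combined utterance score s_u = sum_{n=1..m} s_n * (n/m), where each s_n
# is in {-1, 0, 1}. Later positions are weighted more heavily via n/m.

def utterance_score(scores):
    m = len(scores)
    return sum(s * n / m for n, s in enumerate(scores, start=1))

# A 3-word utterance that starts wrong and ends correct:
print(utterance_score([-1, 0, 1]))   # -1*(1/3) + 0*(2/3) + 1*(3/3) = 2/3
# The maximal score for m = 3 is (m + 1) / 2 = 2.0:
print(utterance_score([1, 1, 1]))    # 1/3 + 2/3 + 3/3 = 2.0
```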
                 JS       EF       SPI      CIF
  transcript
    -1           563      518      364      363
     0           197      198      267      268
     1           264      308      392      392
    str.acc.     25.7 %   30.0 %   38.2 %   38.2 %
    rel.acc.     44.9 %   49.3 %   64.2 %   64.3 %
    incr.scr     -1568    -1248    -536     -504
    avg.incr.scr -1.52    -1.22    -0.52    -0.49
  recognition
    -1           362      348      254      255
     0           122      121      173      173
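For illustration, strict and relaxed accuracy can be recomputed from the -1/0/1 counts; a small sketch using the JS transcript column (the printed figures differ very slightly, presumably due to rounding or to how utterances were counted):

```python
# Strict accuracy = share of final readings that resolve uniquely to the
# correct entity (status 1); relaxed accuracy also counts ambiguous
# resolutions that include the correct entity (status 0). The counts are
# the JS transcript column of the results table.

counts = {-1: 563, 0: 197, 1: 264}
total = sum(counts.values())
strict = counts[1] / total
relaxed = (counts[0] + counts[1]) / total
print(f"strict: {strict:.1%}, relaxed: {relaxed:.1%}")
```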
number as well as averaged for each utterance.

As these results show, the strategy of providing the parser with feedback about the real-world utility of constructed phrases (in the form of reference decisions) improves the parser, in the sense that it helps the parser to successfully retrieve the intended meaning more often compared to an approach that only uses syntactic information (JS) or that uses pragmatic information only outside of the main programme: 38.2 % strict or 64.2 % relaxed for SPI over 25.7 % / 44.9 % for JS, an absolute improvement of 12.5 % for strict or, even more, 19.3 % for the relaxed metric; the incremental metric shows that this advantage holds not only at the final word, but also consistently within the utterance, the average incremental score for an utterance being -0.49 for SPI and -1.52 for JS. The improvement is somewhat smaller against the variant that uses some reference information but does not integrate it into the parsing process (EF), yet it is still consistently present. Adding such n-best-list processing to the output of the parser+reference combination (as variant CIF does) does not further improve the performance noticeably. When processing partially defective material (the output of the speech recogniser), the difference between the variants is maintained, showing a clear advantage of SPI, although the performance of all variants is degraded somewhat.

Clearly, accuracy is rather low for the baseline condition (JS); this is due to the large number of non-standard constructions in our spontaneous material (e.g., utterances like "löschen, unten" (delete, bottom)) which we did not try to cover with syntactic rules, and which may not even contain NPs. The SPI condition can promote derivations resulting from robust rules (here, deletion) which then can refer. In general, though, state-of-the-art grammar engineering may narrow the gap between JS and SPI (this remains to be tested), but we see it as an advantage of our approach that it can improve over the (easy-to-engineer) set of core grammar rules.

5 Conclusions

We have described a model of semantic processing of natural, spontaneous speech that strives to jointly satisfy syntactic and pragmatic constraints (the latter being approximated by the assumption that referring expressions are intended to indeed successfully refer in the given context). The model is robust, accepting also input of the kind that can be expected from automatic speech recognisers, and incremental, that is, it can be fed input on a word-by-word basis, computing at each increment only exactly the contribution of the new word. Lastly, as another novel contribution, the model makes use of a principled formalism for semantic representation, RMRS (Copestake, 2006).

While the results show that our approach of combining syntactic and pragmatic information can work in a real-world setting on realistic data (previous work in this direction has so far
only been at the proof-of-concept stage), there is much room for improvement. First, we are now exploring ways of bootstrapping a grammar and derivation weights from hand-corrected parses. Secondly, we are looking at making the variable assignment / model checking function probabilistic, assigning probabilities (degrees of strength of belief) to candidate resolutions (as, for example, the model of Schlangen et al. (2009) does). Another next step, which will be very easy to take given the modular nature of the implementation framework that we have used, will be to integrate this component into an interactive end-to-end system, and to test other domains in the process.

Acknowledgements

We thank the anonymous reviewers for their helpful comments. The work reported here was supported by a DFG grant in the Emmy Noether programme to the last author and a stipend from DFG-CRC (SFB) 632 to the first author.

References

Gregory Aist, James Allen, Ellen Campana, Carlos Gómez Gallo, Scott Stoness, Mary Swift, and Michael K. Tanenhaus. 2007. Incremental understanding in human-computer dialogue and experimental evidence for advantages over nonincremental methods. In Proceedings of Decalog 2007, the 11th International Workshop on the Semantics and Pragmatics of Dialogue, Trento, Italy.

Timo Baumann, Michaela Atterer, and David Schlangen. 2009. Assessing and improving the performance of speech recognition for incremental systems. In Proceedings of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT) 2009 Conference, Boulder, Colorado, USA, May.

Timo Baumann, Okko Buß, and David Schlangen. 2011. Evaluation and optimization of incremental processors. Dialogue and Discourse, 2(1):113-141.

Niels Beuck, Arne Köhn, and Wolfgang Menzel. 2011. Decision strategies for incremental POS tagging. In Proceedings of the 18th Nordic Conference of Computational Linguistics, NODALIDA-2011, Riga, Latvia.

Okko Buß and David Schlangen. 2010. Modelling sub-utterance phenomena in spoken dialogue systems. In Proceedings of the 14th International Workshop on the Semantics and Pragmatics of Dialogue (Pozdial 2010), pages 33-41, Poznan, Poland, June.

Ann Copestake. 2006. Robust minimal recursion semantics. Technical report, Cambridge Computer Lab. Unpublished draft.

Ann Copestake. 2007. Semantic composition with (robust) minimal recursion semantics. In Proceedings of the Workshop on Deep Linguistic Processing, DeepLP '07, pages 73-80, Stroudsburg, PA, USA. Association for Computational Linguistics.

David DeVault and Matthew Stone. 2003. Domain inference in incremental interpretation. In Proceedings of ICOS 4: Workshop on Inference in Computational Semantics, Nancy, France, September. INRIA Lorraine.

David DeVault, Kenji Sagae, and David Traum. 2011. Incremental interpretation and prediction of utterance meaning for interactive dialogue. Dialogue and Discourse, 2(1):143-170.

Raquel Fernández and David Schlangen. 2007. Referring under restricted interactivity conditions. In Simon Keizer, Harry Bunt, and Tim Paek, editors, Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue, pages 136-139, Antwerp, Belgium, September.

Ulrike Padó, Matthew W. Crocker, and Frank Keller. 2009. A probabilistic model of semantic plausibility in sentence processing. Cognitive Science, 33(5):794-838.

Matthew Purver, Arash Eshghi, and Julian Hough. 2011. Incremental semantic construction in a dialogue system. In J. Bos and S. Pulman, editors, Proceedings of the 9th International Conference on Computational Semantics (IWCS), pages 365-369, Oxford, UK, January.

Brian Roark. 2001. Robust Probabilistic Predictive Syntactic Processing: Motivations, Models, and Applications. Ph.D. thesis, Department of Cognitive and Linguistic Sciences, Brown University.

David Schlangen and Gabriel Skantze. 2009. A general, abstract model of incremental dialogue processing. In EACL '09: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 710-718. Association for Computational Linguistics, March.

David Schlangen, Timo Baumann, and Michaela Atterer. 2009. Incremental reference resolution: The task, metrics for evaluation, and a Bayesian filtering model that is sensitive to disfluencies. In Proceedings of SIGdial 2009, the 10th Annual SIGDIAL Meeting on Discourse and Dialogue, London, UK, September.

David Schlangen, Timo Baumann, Hendrik Buschmeier, Okko Buß, Stefan Kopp, Gabriel Skantze, and Ramin Yaghoubzadeh. 2010. Middleware for incremental processing in conversational agents. In Proceedings of SIGdial 2010, Tokyo, Japan, September.
William Schuler, Stephen Wu, and Lane Schwartz. 2009. A framework for fast incremental interpretation during speech decoding. Computational Linguistics, 35(3).

William Schuler. 2003. Using model-theoretic semantic interpretation to guide statistical parsing and word recognition in a spoken language interface. In Proceedings of the 41st Meeting of the Association for Computational Linguistics (ACL 2003), Sapporo, Japan. Association for Computational Linguistics.

Gabriel Skantze and Anna Hjalmarsson. 2010. Towards incremental speech generation in dialogue systems. In Proceedings of the SIGdial 2010 Conference, pages 1-8, Tokyo, Japan, September.

Gabriel Skantze and David Schlangen. 2009. Incremental dialogue processing in a micro-domain. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2009), pages 745-753, Athens, Greece, March.

Mark Steedman. 2000. The Syntactic Process. MIT Press, Cambridge, Massachusetts.

Scott C. Stoness, Joel Tetreault, and James Allen. 2004. Incremental parsing with reference interaction. In Proceedings of the Workshop on Incremental Parsing at ACL 2004, pages 18-25, Barcelona, Spain, July.

Scott C. Stoness, James Allen, Greg Aist, and Mary Swift. 2005. Using real-world reference to improve spoken language understanding. In AAAI Workshop on Spoken Language Understanding, pages 38-45.
Learning How to Conjugate the Romanian Verb. Rules for Regular and Partially Irregular Verbs

Liviu P. Dinu (Faculty of Mathematics and Computer Science, University of Bucharest), ldinu@fmi.unibuc.ro
Vlad Niculae (Faculty of Mathematics and Computer Science, University of Bucharest), vlad@vene.ro
Octavia-Maria Șulea (Faculty of Foreign Languages and Literatures and Faculty of Mathematics and Computer Science, University of Bucharest), mary.octavia@gmail.com

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 524-528, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
dict the labels.

2 Approach

The problem which we are aiming to solve is to determine how to conjugate a verb, given its infinitive form. The traditional infinitive-based classification taught in school does not take one all the way to solving this problem. Many conjugational patterns exist within each of these four classes.

2.1 Labeling the dataset

Following our own observations, the alternations identified in (Papastergiou et al., 2007) and the classes of suffix patterns given in (Barbu, 2007), we developed a number of conjugational rules, which were narrowed down to the 30 most productive in relation to the dataset. Each of these 30 rules (or patterns) contains 6 regular expressions through which the rule models how a (different) type of Romanian verb conjugates in the indicative present. They each consist of 6 regular expressions because there are three persons (first, second, and third) times two numbers (singular and plural).

Rule 10, for example, models, as stated in the list that follows, how verbs of the type "a cânta" (to sing) conjugate in the indicative present, by having the first regular expression model the first person singular form (eu) cânt (in regular expression format: (.+)$), the second model the second person singular form (tu) cânți ((.+)ți$), the third model the third person singular form (el) cântă ((.+)ă$), and so forth. Thus, rule 10 catches the alternation t→ț for the 2nd person singular, while modelling a particular type of verb class with a particular set of suffixes. Note that the dot accepts any letter in the Romanian alphabet and that, for each of the six forms, the value of the capturing groups (those between brackets) remains constant, in this case "cân". These groups correspond to all the parts of the stem that remain unchanged and ensure that, given the infinitive and the regular expressions, one can work backwards and produce the correct conjugation.

For a clearer understanding of one such rule, Table 1 shows an example of how the verb "a tresălta" is modeled by rule 14.

Person         Regexp           Example
1st singular   (.+)a(.+)t$      tresalt
2nd singular   (.+)a(.+)ți$     tresalți
3rd singular   (.+)a(.+)tă$     tresaltă
1st plural     (.+)ă(.+)tăm$    tresăltăm
2nd plural     (.+)ă(.+)tați$   tresăltați
3rd plural     (.+)a(.+)tă$     tresaltă

Table 1: Rule 14, modelling "a tresălta"

Below, we list all the rules used, with the stem alternations they capture and an example of a verb that they model. Note that, when we say (no) alternation, we mean (no) alternation in the stem. So the difference between rules 1, 20, 22, and the sort lies in the suffix that is added to the stem for each verb form. They may share some suffixes, but not all, and/or not for the same person and number.

1. no alternation; a spera (to hope);

2. alternation: ă→e for the 2nd person singular; a număra (to count);

3. no alternation; a intra (to enter), stem ends in tr, pl, bl or fl, which determines the addition of u at the end of the 1st person singular form;

4. alternation: it lacks the t→ț for the 2nd person singular, which otherwise normally occurs; a mișca (to move), stem ends in șca;

5. no alternation; a tăia (to cut), ends in ia and has a vowel before;

6. no alternation; a speria (to scare), ends in ia and has a consonant before;

7. no alternation; a dansa (to dance), conjugated with the suffix ez;

8. no alternation; a copia (to copy), conjugated with a modified ez due to the stem ending in ia;

9. alternation: c→ch(e) or g→gh(e); a parca (to park), conjugated with ez, ending in ca or ga;

10. alternation: t→ț for the 2nd person singular; a cânta (to sing);

11. alternation: s→ș, which replaces the usual t→ț for the 2nd person singular; a exista (to exist);
12. alternation: e→ea for the 3rd person singular and plural, t→ț for the 2nd person singular; a deștepta (to awake/arouse);

13. alternation: e→ea for the 3rd person singular and plural, t→ț for the 2nd person singular; a deșerta (to empty);

14. alternation: ă→a for all the forms except the 1st and 2nd person plural; a tresălta (to start, to take fright);

15. alternation: ă→a in the 3rd person singular and plural, ă→e in the 2nd person singular; a desfăta (to delight);

16. alternation: ă→a for all the forms except for the 1st and 2nd person plural; a părea (to seem);

17. alternation: d→z for the 2nd person singular due to palatalization, along with ă→e; a vedea (to see), stem ends in d;

18. alternation: ă→a for all forms except the 1st and 2nd person plural, d→z for the 2nd person singular due to palatalization; a cădea (to fall);

19. no alternation; a veghea (to watch over), conjugates with another type of ez ending pattern;

20. no alternations; a merge (to walk), receives the typical ending pattern for the third conjugational class;

21. alternation: t→ț for the 2nd person singular; a promite (to promise);

22. no alternation; a scrie (to write);

23. alternation: șt→sc for the 1st person singular and 3rd person plural; a naște (to give birth), ends in ște;

24. alternation: n is deleted from the stem in the 2nd person singular; a pune (to put), ends in ne;

25. alternation: d→z in the 2nd person singular due to palatalization; a crede (to believe), stem ends in d;

26. no alternation; a sui (to climb), ends in ui, ăi, or âi;

27. no alternation; a citi (to read), conjugates with the suffix esc;

28. this type preserves the i from the infinitive; a locui (to reside), ends in ăi, oi, or ui and conjugates with esc;

29. alternation: o→oa in the 3rd person singular and plural; ends in î; a omorî (to kill);

30. no alternation; a hotărî (to decide), ends in î and conjugates with ăsc, a variant of esc.

2.2 Classifiers and features

Each infinitive in the dataset received a label corresponding to the first rule that correctly produces a conjugation for it. This was implemented in order to reduce the ambiguity of the data, which was due to some verbs having alternate conjugation patterns. The unlabeled verbs were thrown out, while the labeled ones were used to train and evaluate a classifier.

The context-sensitive nature of the alternations leads to the idea that n-gram character windows are useful. In the preprocessing step, the list of infinitives is transformed into a sparse matrix whose lines correspond to samples and whose features are the occurrence or the frequency of a specific n-gram. This feature extraction step has three free parameters: the maximum n-gram length, the optional binarization of the features (taking only binary occurrences instead of counts), and the optional appending of a terminator character. The terminator character allows the classifier to identify and assign a different weight to the n-grams that overlap with the suffix of the string.

For example, consider the English infinitive "to walk". We will assume the following illustrative values for the parameters: an n-gram size of 3 and appending the terminator character. Firstly, a terminator is appended to the end, yielding the string walk$. Subsequently, the string is broken into 1-, 2- and 3-grams: w, a, l, k, $, wa, al, lk, k$, wal, alk, lk$. Next, this list is turned into a vector using a standard process: we first build a dictionary of all the n-grams from the whole dataset; these, in order, encode the features. The verb (to) walk is therefore encoded as a row vector with ones in the columns corresponding to the features w, a, etc. and zeros in the rest. In this particular case, there is no difference between binary and count
features because all of the n-grams of this short verb occur only once. But for a verb such as (to) tantalize, the feature corresponding to the 2-gram ta would get a value of 2 in a count representation, but only a value of 1 in a binary one.

The system was put together using the scikit-learn machine learning library for Python (Pedregosa et al., 2011), which provides a fast, scalable implementation of linear support vector machines based on liblinear (Fan et al., 2008), along with n-gram extraction and grid search functionality.

3 Results

Table 2 shows how well the rules fitted the dataset. Out of 7,295 verbs in the dataset, 349 were uncaptured by our rules. As expected, the rule capturing the most verbs (3,330) is the one modelling those from the 1st conjugational class (whose infinitives end in a) which conjugate with the ez suffix and are regular, namely rule 7, created for verbs like a dansa. The second largest class, also as expected, is the one belonging to verbs from the 4th conjugational group (whose infinitives end in i), which are regular, meaning no alternation in the stem, and conjugate with the esc suffix. This class is modeled by rule number 27.

rule  no. verbs    rule  no. verbs
1     547          16    13
2     8            17    6
3     18           18    4
4     5            19    14
5     8            20    124
6     16           21    25
7     3330         22    15
8     273          23    7
9     89           24    41
10    4            25    51
11    5            26    185
12    4            27    1554
13    106          28    486
14    13           29    5
15    5            30    27

Table 2: Number of verbs captured by each of our rules

The support vector classifier was evaluated using 10-fold cross-validation. The multi-class problem is treated using the one-versus-all scheme. The parameters chosen by grid search are a maximum n-gram length of 5, with an appended terminator and with non-binarized (count) features. The estimated correct classification rate is 90.64%, with a weighted averaged precision of 80.90%, recall of 90.64% and F1 score of 89.89%. Appending the artificial terminator character $ consistently improves accuracy by around 0.7%. Because each word was represented as a bag of character n-grams instead of a continuous string, and because, by its nature, an SVM yields sparse solutions, combined with the evaluation using cross-validation, we can safely say that the model does not overfit and indeed learns useful decision boundaries.

4 Conclusions and Future Works

Our results show that the labelling system based on the verb conjugation model we developed can be learned with reasonable accuracy. In the future, we plan to develop a multiple-tiered labelling system that will allow general alternations, such as the ones occurring as a result of palatalization, to be defined only once for all verbs that have them, taking cues from the idea of letters with multiple values. This, we feel, will highly improve the accuracy of the classifier.

5 Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful comments. All authors contributed equally to this work. The research of Liviu P. Dinu was supported by the CNCS, IDEI - PCE project 311/2011, "The Structure and Interpretation of the Romanian Nominal Phrase in Discourse Representation Theory: the Determiners".

References

Ana-Maria Barbu. Conjugarea verbelor romanesti. Dictionar: 7500 de verbe romanesti grupate pe clase de conjugare. Bucharest: Coresi, 2007. 4th edition, revised. (In Romanian.) (263 pp.).

Ana-Maria Barbu. Romanian lexical databases: Inflected and syllabic forms dictionaries. In Sixth International Language Resources and Evaluation (LREC'08), 2008.

Angelo Roth Costanzo. Romance Conjugational Classes: Learning from the Peripheries. PhD thesis, Ohio State University, 2011.
Figure 1: 10-fold cross-validation scores for various combinations of parameters. Only the values corresponding to the best C regularization parameter are shown.
Liviu P. Dinu, Emil Ionescu, Vlad Niculae, and Octavia-Maria Sulea. Can alternations be learned? A machine learning approach to verb alternations. In Recent Advances in Natural Language Processing 2011, September 2011.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, June 2008. ISSN 1532-4435.

Jiri Felix. Classification des verbes roumains, volume VII. Philosophica Pragensia, 1964.

Valeria Gutu Romalo. Morfologie structurala a limbii romane. Editura Academiei Republicii Socialiste Romania, 1968.

Alf Lombard. Le verbe roumain. Etude morphologique, volume 1. Lund, C. W. K. Gleerup, 1955.

Grigore C. Moisil. Probleme puse de traducerea automata. Conjugarea verbelor in limba romana. Studii si cercetari lingvistice, XI(1):7-29, 1960.

I. Papastergiou, N. Papastergiou, and L. Mandeki. Verbul romanesc - reguli pentru inlesnirea insusirii indicativului prezent. In Romanian National Symposium "Directions in Romanian Philological Research", 7th Edition, May 2007.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, Oct 2011.
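The setup described in Sections 2.2 and 3 (character 1-5-gram counts over terminator-suffixed infinitives, linear one-versus-rest SVM) can be approximated with scikit-learn along these lines. This is a hedged sketch with toy data, written against the current scikit-learn API rather than the 2011 version the authors used; the verbs and rule labels below are illustrative, not the paper's dataset:

```python
# Approximate pipeline: character 1-5-gram counts with an appended '$'
# terminator, fed to a linear SVM (LinearSVC is one-vs-rest by default).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

verbs = ['dansa', 'parca', 'citi', 'locui', 'intra', 'copia']
labels = [7, 9, 27, 28, 3, 8]      # hypothetical rule labels

model = make_pipeline(
    CountVectorizer(analyzer='char', ngram_range=(1, 5),
                    preprocessor=lambda v: v + '$'),  # terminator character
    LinearSVC(),
)
model.fit(verbs, labels)
print(model.predict(['dansa']))
```

In the paper's actual experiments the maximum n-gram length, binarization, and the SVM's C parameter were chosen by grid search under 10-fold cross-validation; with scikit-learn this would correspond to wrapping the pipeline in a grid search utility.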
Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History

Torsten Zesch
Ubiquitous Knowledge Processing Lab (UKP-DIPF), German Institute for Educational Research and Educational Information, Frankfurt
Ubiquitous Knowledge Processing Lab (UKP-TUDA), Department of Computer Science, Technische Universität Darmstadt
http://www.ukp.tu-darmstadt.de

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 529-538, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
be very unlikely to be made by a human, and (iii) inserting artificial errors often leads to unnatural sentences that are quite easy to correct, e.g. if the word class has changed. However, even if the word class is unchanged, the original word and its replacement might still be variants of the same lemma, e.g. a noun in singular and plural, or a verb in present and past form. This usually leads to a sentence where the error can be easily detected using syntactic or statistical methods, but is almost impossible to detect for knowledge-based measures of contextual fitness, as the meaning of the word stays more or less unchanged. To estimate the impact of this issue, we randomly sampled 1,000 artificially created real-word spelling errors[1] and found 387 singular/plural pairs and 57 pairs which were in another direct relation (e.g. adjective/adverb). This means that almost half of the artificially created errors are not suited for an evaluation targeted at finding optimal measures of contextual fitness, as they over-estimate the performance of statistical measures while underestimating the potential of semantic measures. In order to investigate this issue, we present a framework for mining naturally occurring errors and their contexts from the Wikipedia revision history. We use the resulting English and German datasets to evaluate statistical and knowledge-based measures.

We make the full experimental framework publicly available,[2] which will allow reproducing our experiments as well as conducting follow-up experiments. The framework contains (i) methods to extract natural errors from Wikipedia, (ii) reference implementations of the knowledge-based and the statistical methods, and (iii) the evaluation datasets described in this paper.

2 Mining Errors from Wikipedia

Measures of contextual fitness have previously been evaluated using artificially created datasets, as there are very few sources of sentences with naturally occurring errors and their corrections. Recently, the revision history of Wikipedia has been introduced as a valuable knowledge source for NLP (Nelken and Yamangil, 2008; Yatskar et al., 2010). It is also a possible source of natural errors, as it is likely that Wikipedia editors make real-word spelling errors at some point, which are then corrected in subsequent revisions of the same article. The challenge lies in discriminating real-word spelling errors from all sorts of other changes, including non-word spelling errors, reformulations, or the correction of wrong facts. For that purpose, we apply a set of precision-oriented heuristics narrowing down the number of possible error candidates. Such an approach is feasible, as the high number of revisions in Wikipedia allows us to be extremely selective.

2.1 Accessing the Revision Data

We access the Wikipedia revision data using the freely available Wikipedia Revision Toolkit (Ferschke et al., 2011) together with the JWPL Wikipedia API (Zesch et al., 2008a).[3] The API outputs plain text converted from Wiki-Markup, but the text still contains a small portion of left-over markup and other artifacts. Thus, we perform additional cleaning steps removing (i) tokens with more than 30 characters (often URLs), (ii) sentences with less than 5 or more than 200 tokens, and (iii) sentences containing a high fraction of special characters like ':', usually indicating Wikipedia-specific artifacts like lists of language links. The remaining sentences are part-of-speech tagged and lemmatized using TreeTagger (Schmid, 2004). Using these cleaned and annotated articles, we form pairs of adjacent article revisions (r_i and r_{i+1}).

2.2 Sentence Alignment

Fully aligning all sentences of the adjacent revisions is a quite costly operation, as sentences can be split, joined, replaced, or moved in the article. However, we are only looking for sentence pairs which are almost identical except for the real-word spelling error and its correction. Thus, we form all sentence pairs and then apply an aggressive but cheap filter that rules out all sentence pairs which (i) are equal, or (ii) whose lengths differ by more than a small number of characters. For the resulting, much smaller subset of sentence pairs, we compute the Jaro distance (Jaro, 1995) between each pair. If the distance exceeds a certain threshold t_sim (0.05 in this case), we do not further consider the pair. The small set of remaining sentence pairs is passed to the sentence pair filter for in-depth inspection.

[1] The same artificial data as described in Section 3.2.
[2] http://code.google.com/p/dkpro-spelling-asl/
[3] http://code.google.com/p/jwpl/
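The alignment step of Section 2.2 lends itself to a compact sketch. The following Python fragment is an illustrative re-implementation, not the actual code of the released framework: `jaro_similarity` follows the standard Jaro definition (the Jaro distance used in the text is 1 minus this similarity), `t_sim = 0.05` follows the text, and the length-difference cutoff `max_len_diff = 10` is an assumed value, since the paper only says "a small number of characters".

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]; Jaro *distance* is 1 - similarity."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching if they occur within this window of each other.
    window = max(len1, len2) // 2 - 1
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(i + window + 1, len2)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions among the matched characters.
    transpositions, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    m = float(matches)
    return (m / len1 + m / len2 + (m - transpositions) / m) / 3.0


def keep_pair(sent_a: str, sent_b: str,
              max_len_diff: int = 10, t_sim: float = 0.05) -> bool:
    """Cheap surface filters first, then the Jaro distance threshold."""
    if sent_a == sent_b:                               # (i) sentence unchanged
        return False
    if abs(len(sent_a) - len(sent_b)) > max_len_diff:  # (ii) lengths differ too much
        return False
    return 1.0 - jaro_similarity(sent_a, sent_b) <= t_sim
```

A pair like "I bought a car." / "I bought a cars." passes this filter and is forwarded to the sentence pair filter of Section 2.3, while equal or heavily rewritten sentences are discarded early.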
2.3 Sentence Pair Filtering

The sentence pair filter further reduces the number of remaining sentence pairs by applying a set of heuristics including surface level and semantic level filters. Surface level filters include:

Replaced Token: Sentences need to consist of identical tokens, except for one replaced token.
No Numbers: The replaced token may not be a number.
UPPER CASE: The replaced token may not be in upper case.
Case Change: The change should not only involve case changes, e.g. changing "english" into "English".
Edit Distance: The edit distance between the replaced token and its correction needs to be below a certain threshold.

After applying the surface level filters, the remaining sentence pairs are well-formed and contain exactly one changed token at the same position in the sentence. However, the change does not need to characterize a real-word spelling error, but could also be a normal spelling error or a semantically motivated change. Thus, we apply a set of semantic filters:

Vocabulary: The replaced token needs to occur in the vocabulary. We found that even quite comprehensive word lists discarded too many valid errors, as Wikipedia contains articles from a very wide range of domains. Thus, we use a frequency filter based on the Google Web1T n-gram counts (Brants and Franz, 2006). We filter all sentences where the replaced token has a very low unigram count. We experimented with different values and found 25,000 for English and 10,000 for German to yield good results.
Same Lemma: The original token and the replaced token may not have the same lemma, e.g. "car" and "cars" would not pass this filter.
Stopwords: The replaced token should not be in a short list of stopwords (mostly function words).
Named Entity: The replaced token should not be part of a named entity. For this purpose, we applied the Stanford NER (Finkel et al., 2005).
Normal Spelling Error: We apply the Jazzy spelling detector[4] and rule out all cases in which it is able to detect the error.
Semantic Relation: If the original token and the replaced token are in a close lexical-semantic relation, the change is likely to be semantically motivated, e.g. if "house" was replaced with "hut". Thus, we do not consider cases where we detect a direct semantic relation between the original and the replaced term. For this purpose, we use WordNet (Fellbaum, 1998) for English and GermaNet (Lemnitzer and Kunze, 2002) for German.

[4] http://jazzy.sourceforge.net/

3 Resulting Datasets

3.1 Natural Error Datasets

Using our framework for mining real-word spelling errors in context, we extracted an English dataset[5] and a German dataset[6]. Although the output generally was of high quality, manual post-processing was necessary,[7] as (i) for some pairs the available context did not provide enough information to decide which form was correct, and (ii) we encountered a problem that might be specific to Wikipedia: vandalism. The revisions are full of cases where words are replaced with similar sounding but greasy alternatives. A relatively mild example is "In romantic comedies, there is a love story about a man and a woman who fall in love, along with silly or funny comedy farts.", where "parts" was replaced with "farts", only to be changed back shortly afterwards by a Wikipedia vandalism hunter. We removed all cases that resulted from obvious vandalism. For further experiments, a small list of offensive terms could be added to the stopword list to facilitate this process.

[5] Using a revision dump from April 5, 2011.
[6] Using a revision dump from August 13, 2010.
[7] The most efficient and precise way of finding real-word spelling errors would of course be to apply measures of contextual fitness. However, the resulting dataset would then only contain errors that are detectable by the measures we want to evaluate (a clearly unacceptable bias). Thus, a certain amount of manual validation is inevitable.

A connected problem is correct words that get falsely corrected by Wikipedia editors (without the malicious intent of the previous examples, but with similar consequences). For example, the initially correct sentence "Dung beetles roll it into a ball, sometimes being up to 50 times their own weight." was corrected by exchanging "weight" with "wait". We manually removed such obvious mistakes, but are still left with some borderline cases. In the sentence "By the 1780s the goals of England were so full that convicts were often chained up in rotting old ships." the obvious error
"goal" was changed by some Wikipedia editor to "jail". However, actually it should have been the old English form of "jail", namely "gaol", which can be deduced when looking at the full context and later versions of the article. We decided to not remove these rare cases, because "jail" is a valid correction in this context.

After manual inspection, we are left with 466 English and 200 German errors. Given that we restricted our experiment to 5 million English and German revisions, much larger datasets can be extracted if the whole revision history is taken into account. Our snapshot of the English Wikipedia contains 305·10^6 revisions. Even if not all of them correspond to article revisions, it is safe to assume that more than 10,000 real-word spelling errors can be extracted from this version of Wikipedia. Using the same amount of source revisions, we found significantly more English than German errors. This might be due to (i) English having more short nouns or verbs than German that are more likely to be confused with each other, and (ii) the English Wikipedia being known to attract a larger amount of non-native editors, which might lead to higher rates of real-word spelling errors. However, this issue needs to be further investigated, e.g. based on comparable corpora built on the basis of different language editions of Wikipedia. Further refining the identification of real-word errors in Wikipedia would allow evaluating how frequently such errors actually occur, and how long it takes the Wikipedia editors to detect them. If errors persist over a long time, using measures of contextual fitness for detection would be even more important.

Another interesting observation is that the average edit distance is around 1.4 for both datasets. This means that a substantial proportion of errors involve more than one edit operation. Given that many measures of contextual fitness allow at most one edit, many naturally occurring errors will not be detected. However, allowing a larger edit distance enormously increases the search space, resulting in increased run-time and possibly decreased detection precision due to more false positives.

3.2 Artificial Error Datasets

In contrast to the quite challenging process of mining naturally occurring errors, creating artificial errors is relatively straightforward. From a corpus that is known to be free of spelling errors, sentences are randomly sampled. For each sentence, a random word is selected and all strings with edit distance smaller than a given threshold (2 in our case) are generated. If one of those generated strings is a known word from the vocabulary, it is picked as the artificial error.

Previous work on evaluating real-word spelling correction (Hirst and Budanitsky, 2005; Wilcox-O'Hearn et al., 2008; Islam and Inkpen, 2009) used a dataset sampled from the Wall Street Journal corpus which is not freely available. Thus, we created a comparable English dataset of 1,000 artificial errors based on the easily available Brown corpus (Francis W. Nelson and Kucera, 1964).[8] Additionally, we created a German dataset with 1,000 artificial errors based on the TIGER corpus.[9]

[8] http://www.archive.org/details/BrownCorpus (CC-by-na).
[9] http://www.ims.uni-stuttgart.de/projekte/TIGER/ The corpus contains 50,000 sentences of German newspaper text, and is freely available under a non-commercial license.

4 Measuring Contextual Fitness

There are two main approaches for measuring the contextual fitness of a word in its context: the statistical (Mays et al., 1991) and the knowledge-based approach (Hirst and Budanitsky, 2005).

4.1 Statistical Approach

Mays et al. (1991) introduced an approach based on the noisy-channel model. The model assumes that the correct sentence s is transmitted through a noisy channel adding noise, which results in a word w being replaced by an error e, leading to the wrong sentence s' which we observe. The probability of the correct word w given that we observe the error e can be computed as P(w|e) = P(w) · P(e|w). The channel model P(e|w) describes how likely the typist is to make an error. This is modeled by the parameter α, which we optimize on a held-out development set of errors. The remaining probability mass (1 − α) is distributed equally among all words in the vocabulary within an edit distance of 1 (edits(w)):

    P(e|w) = α                       if e = w
    P(e|w) = (1 − α) / |edits(w)|    if e ≠ w

The source model P(w) is estimated using a trigram language model, i.e. the probability of the
intended word w_i is computed as the conditional probability P(w_i | w_{i-1} w_{i-2}). Hence, the probability of the correct sentence s = w_1 ... w_n can be estimated as

    P(s) = ∏_{i=1}^{n+2} P(w_i | w_{i-1} w_{i-2})

The set of candidate sentences S_c contains all versions of the observed sentence s' derived by replacing one word with a word from edits(w), while all other words in the sentence remain unchanged. The correct sentence s is the sentence from S_c that maximizes P(s|s'), i.e. s = arg max_{s ∈ S_c} P(s) · P(s'|s).

4.2 Knowledge Based Approach

Hirst and Budanitsky (2005) introduced a knowledge-based approach that detects real-word spelling errors by checking the semantic relations of a target word with its context. For this purpose, they apply WordNet as the source of lexical-semantic knowledge.

The algorithm flags all words as error candidates and then applies filters to remove those words from further consideration that are unlikely to be errors. First, the algorithm removes all closed-class word candidates as well as candidates which cannot be found in the vocabulary. Candidates are then tested for lexical cohesion with their context, by (i) checking whether the same surface form or lemma appears again in the context, or (ii) checking whether a semantically related concept is found in the context. In both cases, the candidate is removed from the list of candidates. For each remaining possible real-word spelling error, edits are generated by inserting, deleting, or replacing characters up to a certain edit distance (usually 1). Each edit is then tested for lexical cohesion with the context. If at least one of them fits into the context, the candidate is selected as a real-word error.

Hirst and Budanitsky (2005) use two additional filters: First, they remove candidates that are common non-topical words. It is unclear how the list of such words was compiled. Their list of examples contains words like "find" or "world", which we consider to be perfectly valid candidates. Second, they also applied a filter using a list of known multi-words, as the probability for words to accidentally form multi-words is low. It is unclear which list was used. We could use multi-words from WordNet, but coverage would be rather limited. We decided not to use both filters in order to better assess the influence of the underlying semantic relatedness measure on the overall performance.

The knowledge-based approach uses semantic relatedness measures to determine the cohesion between a candidate and its context. In the experiments by Budanitsky and Hirst (2006), the measure by Jiang and Conrath (1997) yields the best results. However, a wide range of other measures have been proposed, cf. (Zesch and Gurevych, 2010). Some measures use a wider definition of semantic relatedness (Gabrilovich and Markovitch, 2007; Zesch et al., 2008b) instead of only using taxonomic relations in a knowledge source.

As semantic relatedness measures usually return a numeric value, we need to determine a threshold in order to come up with a binary related/unrelated decision. Budanitsky and Hirst (2006) used a characteristic gap in the standard evaluation dataset by Rubenstein and Goodenough (1965) that separates unrelated from related word pairs. We do not follow this approach, but optimize the threshold on a held-out development set of real-word spelling errors.

5 Results & Discussion

In this section, we report on the results obtained in our evaluation of contextual fitness measures using artificial and natural errors in English and German.

5.1 Statistical Approach

    Dataset              P    R    F
    Artificial-English  .77  .50  .60
    Natural-English     .54  .26  .35
    Artificial-German   .90  .49  .63
    Natural-German      .77  .20  .32

Table 1: Performance of the statistical approach using a trigram model based on Google Web1T.

Table 1 summarizes the results obtained by the statistical approach using a trigram model based on the Google Web1T data (Brants and Franz, 2006). On the English artificial errors, we observe a quite high F-measure of .60 that drops to
.35 when switching to the naturally occurring errors which we extracted from Wikipedia. On the German dataset, we observe almost the same performance drop (from .63 to .32).

These observations correspond to our earlier analysis where we showed that the artificial data contains many cases that are quite easy to correct using a statistical model, e.g. where a plural form of a noun is replaced with its singular form (or vice versa), as in "I bought a car." vs. "I bought a cars.". The naturally occurring errors often contain much harder contexts, as shown in the following example: "Through the open window they heard sounds below in the street: cartwheels, a tired horses plodding step, vices.", where "vices" should be corrected to "voices". While the lemma "voice" is clearly semantically related to other words in the context like "hear" or "sound", the position at the end of the sentence is especially difficult for the trigram-based statistical approach. The only trigram that connects the error to the context is (step, ",", vices/voices), which will probably yield a low frequency count even for very large trigram models. Higher order n-gram models would help, but suffer from the usual data-sparseness problems.

Influence of the N-gram Model: For building the trigram model, we used the Google Web1T data, which has some known quality issues and is not targeted towards the Wikipedia articles from which we sampled the natural errors. Thus, we also tested a trigram model based on Wikipedia. However, it is much smaller than the Web model, which leads us to additionally testing smaller Web models. Table 2 summarizes the results.

    Dataset   N-gram model   Size      P    R    F
    Art-En    Google Web     7·10^11  .77  .50  .60
              Google Web     7·10^10  .78  .48  .59
              Google Web     7·10^9   .76  .42  .54
              Wikipedia      2·10^9   .72  .37  .49
    Nat-En    Google Web     7·10^11  .54  .26  .35
              Google Web     7·10^10  .51  .23  .31
              Google Web     7·10^9   .46  .19  .27
              Wikipedia      2·10^9   .49  .19  .27
    Art-De    Google Web     8·10^10  .90  .49  .63
              Google Web     8·10^9   .90  .47  .61
              Google Web     8·10^8   .88  .36  .51
              Wikipedia      7·10^8   .90  .37  .52
    Nat-De    Google Web     8·10^10  .77  .20  .32
              Google Web     8·10^9   .68  .14  .23
              Google Web     8·10^8   .65  .10  .17
              Wikipedia      7·10^8   .70  .13  .22

Table 2: Influence of the n-gram model on the performance of the statistical approach.

We observe that "more data is better data" still holds, as the largest Web model always outperforms the Wikipedia model in terms of recall. If we reduce the size of the Web model to the same order of magnitude as the Wikipedia model, the performance of the two models is comparable. We would have expected to see better results for the Wikipedia model in this setting, but its higher quality does not lead to a significant difference.

Even if statistical approaches quite reliably detect real-word spelling errors, the size of the required n-gram models remains a serious obstacle for use in real-world applications. The English Web1T trigram model is about 25 GB, which currently is not suited for being applied in settings with limited storage capacities, e.g. for intelligent input assistance on mobile devices. As we have seen above, using smaller models will decrease recall to a point where hardly any error will be detected anymore. Thus, we will now have a look at knowledge-based approaches, which are less demanding in terms of the required resources.

5.2 Knowledge-based Approach

Table 3 shows the results for the knowledge-based measure. In contrast to the statistical approach, the results on the artificial errors are not higher than on the natural errors, but almost equal for German and even lower for English; another piece of evidence supporting our view that the properties of artificial datasets over-estimate the performance of statistical measures.

    Dataset              P    R    F
    Artificial-English  .26  .15  .19
    Natural-English     .29  .18  .23
    Artificial-German   .47  .16  .24
    Natural-German      .40  .13  .19

Table 3: Performance of the knowledge-based approach using the Jiang-Conrath semantic relatedness measure.

Influence of the Relatedness Measure: As was pointed out before, Budanitsky and Hirst (2006)
show that the measure by Jiang and Conrath (1997) yields the best results in their experiments on malapropism detection. In addition, we test another path-based measure by Lin (1998), the gloss-based measure by Lesk (1986), and the ESA measure (Gabrilovich and Markovitch, 2007) based on concept vectors from Wikipedia, Wiktionary, and WordNet. Table 4 summarizes the results.

    Dataset   Measure          Threshold   P    R    F
    Art-En    Jiang-Conrath    0.5        .26  .15  .19
              Lin              0.5        .22  .17  .19
              Lesk             0.5        .19  .16  .17
              ESA-Wikipedia    0.05       .43  .13  .20
              ESA-Wiktionary   0.05       .35  .20  .25
              ESA-WordNet      0.05       .33  .15  .21
    Nat-En    Jiang-Conrath    0.5        .29  .18  .23
              Lin              0.5        .26  .21  .23
              Lesk             0.5        .19  .19  .19
              ESA-Wikipedia    0.05       .48  .14  .22
              ESA-Wiktionary   0.05       .39  .21  .27
              ESA-WordNet      0.05       .36  .15  .21

Table 4: Performance of the knowledge-based approach using different relatedness measures.

In contrast to the findings of Budanitsky and Hirst (2006), Jiang-Conrath is not the best path-based measure, as Lin provides equal or better performance. Even more importantly, other (non path-based) measures yield better performance than both path-based measures. Especially ESA based on Wiktionary provides a good overall performance, while ESA based on Wikipedia provides excellent precision. The advantage of ESA over the other measure types can be explained with its ability to incorporate semantic relationships beyond classical taxonomic relations (as used by path-based measures).

5.3 Combining the Approaches

The statistical and the knowledge-based approach use quite different methods to assess the contextual fitness of a word in its context. This makes it worthwhile trying to combine both approaches. We ran the statistical method (using the full Wikipedia trigram model) and the knowledge-based method (using the ESA-Wiktionary relatedness measure) in parallel and then combined the resulting detections using two strategies: (i) we merge the detections of both approaches in order to obtain higher recall (Union), and (ii) we only count an error as detected if both methods agree on a detection (Intersection).

    Dataset              Strategy       P    R    F
    Artificial-English   Best-Single   .77  .50  .60
                         Union         .52  .55  .54
                         Intersection  .91  .15  .25
    Natural-English      Best-Single   .54  .26  .35
                         Union         .40  .36  .38
                         Intersection  .82  .11  .19

Table 5: Results obtained by a combination of the best statistical and knowledge-based configuration. Best-Single is the best precision or recall obtained by a single measure. Union merges the detections of both approaches. Intersection only detects an error if both methods agree on a detection.

When comparing the combined results in Table 5 with the best precision or recall obtained by a single measure (Best-Single), we observe that precision can be significantly improved using the Intersection strategy, while recall is only moderately improved using the Union strategy. This means that (i) a large subset of errors is detected by both approaches that, due to their different sources of knowledge, mutually reinforce the detection, leading to increased precision, and (ii) a small but otherwise undetectable subset of errors requires considering detections made by one approach only.

6 Related Work

To our knowledge, we are the first to create a dataset of naturally occurring errors based on the revision history of Wikipedia. Max and Wisniewski (2010) used similar techniques to create a dataset of errors from the French Wikipedia. However, they target a wider class of errors including non-word spelling errors, and their class of real-word errors conflates malapropisms as well as other types of changes like reformulations. Thus, their dataset cannot be easily used for our purposes and is only available in French, while our framework allows creating datasets for all major languages with minimal manual effort.

Another possible source of real-word spelling errors are learner corpora (Granger, 2002), e.g. the Cambridge Learner Corpus (Nicholls, 1999). However, annotation of errors is difficult and costly (Rozovskaya and Roth, 2010), only a small fraction of observed errors will be real-word spelling errors, and learners are likely to make different mistakes than proficient language users.

Islam and Inkpen (2009) presented another statistical approach using the Google Web1T data (Brants and Franz, 2006) to create the n-gram model. It slightly outperformed the approach by Mays et al. (1991) when evaluated on a corpus of artificial errors based on the WSJ corpus. However, the results are not directly comparable, as Mays et al. (1991) used a much smaller n-gram model, and our results in Section 5.1 show that the size of the n-gram model has a large influence on the results. Eventually, we decided to use the Mays et al. (1991) approach in our study, as it is easier to adapt and augment.

In a re-evaluation of the statistical model by Mays et al. (1991), Wilcox-O'Hearn et al. (2008) found that it outperformed the knowledge-based method by Hirst and Budanitsky (2005) when evaluated on a corpus of artificial errors based on the WSJ corpus. This is consistent with our findings on the artificial errors based on the Brown corpus, but, as we have seen in the previous section, evaluation on the naturally occurring errors shows a different picture. They also tried to improve the model by permitting multiple corrections and using fixed-length context windows instead of sentences, but obtained discouraging results.

All previously discussed methods are unsupervised in the sense that they do not rely on any training data with annotated errors. However, real-word spelling correction has also been tackled by supervised approaches (Golding and Schabes, 1996; Jones and Martin, 1997; Carlson et al., 2001). Those methods rely on predefined confusion sets, i.e. sets of words that are often confounded, e.g. {peace, piece} or {weather, whether}. For each set, the methods learn a model of the context in which one or the other alternative is more probable. This yields very high precision, but only for the limited number of previously defined confusion sets. Our framework for extracting natural errors could be used to increase the number of known confusion sets.

7 Conclusions and Future Work

In this paper, we evaluated two main approaches for measuring the contextual fitness of terms, the statistical approach by Mays et al. (1991) and the knowledge-based approach by Hirst and Budanitsky (2005), on the task of detecting real-word spelling errors. For that purpose, we extracted a dataset with naturally occurring errors and their contexts from the Wikipedia revision history. We show that evaluating measures of contextual fitness on this dataset provides a more realistic picture of task performance. In particular, using artificial datasets over-estimates the performance of the statistical approach, while it under-estimates the performance of the knowledge-based approach.

We show that n-gram models targeted towards the domain from which the errors are sampled do not improve the performance of the statistical approach if larger n-gram models are available. We further show that the performance of the knowledge-based approach can be improved by using semantic relatedness measures that incorporate knowledge beyond the taxonomic relations in a classical lexical-semantic resource like WordNet. Finally, by combining both approaches, significant increases in precision or recall can be achieved.

In future work, we want to evaluate a wider range of contextual fitness measures, and learn how to combine them using more sophisticated combination strategies. Both the statistical and the knowledge-based approach will benefit from a better model of the typist, as not all edit operations are equally likely (Kernighan et al., 1990). On the side of the error extraction, we are going to further improve the extraction process by incorporating more knowledge about the revisions. For example, vandalism is often reverted very quickly, which can be detected when looking at the full set of revisions of an article. We hope that making the experimental framework publicly available will foster future research in this field, as our results on the natural errors show that the problem is still quite challenging.

Acknowledgments

This work has been supported by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806. We thank Andreas Kellner and Tristan Miller for checking the datasets, and the anonymous reviewers for their helpful feedback.
References

David Bean and Ellen Riloff. 2004. Unsupervised learning of contextual role knowledge for coreference resolution. In Proc. of HLT/NAACL, pages 297-304.

Igor A. Bolshakov and Alexander Gelbukh. 2003. On Detection of Malapropisms by Multistage Collocation Testing. In Proceedings of NLDB-2003, 8th International Workshop on Applications of Natural Language to Information Systems.

Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram Version 1.

Alexander Budanitsky and Graeme Hirst. 2006. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13-47.

Andrew J. Carlson, Jeffrey Rosen, and Dan Roth. 2001. Scaling Up Context-Sensitive Text Correction. In Proceedings of IAAI.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Oliver Ferschke, Torsten Zesch, and Iryna Gurevych. 2011. Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia's Edit History. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. System Demonstrations, pages 97-102, Portland, OR, USA.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (ACL '05), pages 363-370, Morristown, NJ, USA. Association for Computational Linguistics.

Francis W. Nelson and Henry Kucera. 1964. Manual of information to accompany a standard corpus of present-day edited American English, for use with digital computers.

Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 1606-1611.

Andrew R. Golding and Yves Schabes. 1996. Combining Trigram-based and feature-based methods for context-sensitive spelling correction. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, pages 71-78, Morristown, NJ, USA. Association for Computational Linguistics.

Graeme Hirst and Alexander Budanitsky. 2005. Correcting real-word spelling errors by restoring lexical cohesion. Natural Language Engineering, 11(1):87-111, March.

Diana Inkpen and Alain Desilets. 2005. Semantic similarity for detecting recognition errors in automatic speech transcripts. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT '05), pages 49-56, Morristown, NJ, USA. Association for Computational Linguistics.

Aminul Islam and Diana Inkpen. 2009. Real-word spelling correction using Google Web 1T 3-grams. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Volume 3 (EMNLP '09), Morristown, NJ, USA. Association for Computational Linguistics.

M. A. Jaro. 1995. Probabilistic linkage of large public health data file. Statistics in Medicine, 14:491-498.

Jay J. Jiang and David W. Conrath. 1997. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In Proceedings of the 10th International Conference on Research in Computational Linguistics, Taipei, Taiwan.

Michael P. Jones and James H. Martin. 1997. Contextual spelling correction using latent semantic analysis. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 166-173, Morristown, NJ, USA. Association for Computational Linguistics.

Mark D. Kernighan, Kenneth W. Church, and William A. Gale. 1990. A Spelling Correction Program Based on a Noisy Channel Model. In Proceedings of the 13th International Conference on Computational Linguistics, pages 205-210, Helsinki, Finland.

Lothar Lemnitzer and Claudia Kunze. 2002. GermaNet - Representation, Visualization, Application. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC), pages 1485-1491.

Michael Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference, pages 24-26.

Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. In Proceedings of the International Conference on Machine Learning, pages 296-304, Madison, Wisconsin.

Aurelien Max and Guillaume Wisniewski. 2010. Mining Naturally-occurring Corrections and Paraphrases from Wikipedia's Revision History. In Pro-
Linguistics. ceedings of the Seventh conference on International
Sylviane Granger, 2002. A birds-eye view of learner Language Resources and Evaluation (LREC10),
corpus research, pages 333. John Benjamins Pub- pages 31433148.
lishing Company. Eric Mays, Fred. J Damerau, and Robert L Mercer.
Graeme Hirst and Alexander Budanitsky. 2005. Cor- 1991. Context based spelling correction. Informa-
recting real-word spelling errors by restoring lex- tion Processing & Management, 27(5):517522.
537
Rani Nelken and Elif Yamangil. 2008. Mining Torsten Zesch, Christof Muller, and Iryna Gurevych.
Wikipedias Article Revision History for Train- 2008b. Using wiktionary for computing semantic
ing Computational Linguistics Algorithms. In relatedness. In Proceedings of the 23rd AAAI Con-
Proceedings of the AAAI Workshop on Wikipedia ference on Artificial Intelligence, pages 861867,
and Artificial Intelligence: An Evolving Synergy Chicago, IL, USA, Jul.
(WikiAI), WikiAI08.
Diane Nicholls. 1999. The Cambridge Learner Cor-
pus - Error Coding and Analysis for Lexicography
and ELT. In Summer Workshop on Learner Cor-
pora, Tokyo, Japan.
Alla Rozovskaya and Dan Roth. 2010. Annotating
ESL Errors: Challenges and Rewards. In The 5th
Workshop on Innovative Use of NLP for Building
Educational Applications (NAACL-HLT).
H Rubenstein and J B Goodenough. 1965. Contextual
Correlates of Synonymy. Communications of the
ACM, 8(10):627633.
Helmut Schmid. 2004. Efficient Parsing of Highly
Ambiguous Context-Free Grammars with Bit Vec-
tors. In Proceedings of the 20th International
Conference on Computational Linguistics (COL-
ING 2004), Geneva, Switzerland.
Daniel D. Walker, William B. Lund, and Eric K. Ring-
ger. 2010. Evaluating Models of Latent Document
Semantics in the Presence of OCR Errors. Proceed-
ings of the 2010 Conference on Empirical Methods
in Natural Language Processing, (October):240
250.
M. Wick, M. Ross, and E. Learned-Miller. 2007.
Context-sensitive error correction: Using topic
models to improve OCR. In Ninth International
Conference on Document Analysis and Recogni-
tion (ICDAR 2007) Vol 2, pages 11681172. Ieee,
September.
Amber Wilcox-OHearn, Graeme Hirst, and Alexander
Budanitsky. 2008. Real-word spelling correction
with trigrams: A reconsideration of the Mays, Dam-
erau, and Mercer model. In Proceedings of the 9th
international conference on Computational linguis-
tics and intelligent text processing (CICLing).
Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-
Mizil, and Lillian Lee. 2010. For the sake of sim-
plicity: unsupervised extraction of lexical simplifi-
cations from Wikipedia. In Human Language Tech-
nologies: The 2010 Annual Conference of the North
American Chapter of the Association for Computa-
tional Linguistics, HLT 10, pages 365368.
Torsten Zesch and Iryna Gurevych. 2010. Wisdom
of Crowds versus Wisdom of Linguists - Measur-
ing the Semantic Relatedness of Words. Journal of
Natural Language Engineering, 16(1):2559.
Torsten Zesch, Christof Muller, and Iryna Gurevych.
2008a. Extracting Lexical Semantic Knowledge
from Wikipedia and Wiktionary. In Proceedings of
the Conference on Language Resources and Evalu-
ation (LREC).
538
Perplexity Minimization for Translation Model Domain Adaptation in
Statistical Machine Translation
Rico Sennrich
Institute of Computational Linguistics
University of Zurich
Binzmühlestr. 14
CH-8050 Zürich
sennrich@cl.uzh.ch
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549,
Avignon, France, April 23–27 2012.
© 2012 Association for Computational Linguistics
domain of a text. The German word Wort (engl. word) is typically translated as floor in Europarl, a corpus of Parliamentary Proceedings (Koehn, 2005), owing to the high frequency of phrases such as you have the floor, which is translated into German as Sie haben das Wort. This translation is highly idiomatic and unlikely to occur in other contexts. Still, adding Europarl as out-of-domain training data shifts the probability distribution of p(t|Wort) in favour of p(floor|Wort), and may thus lead to improper translations.

We will refer to the two problems as the data sparseness problem and the ambiguity problem. Adding out-of-domain data typically mitigates the data sparseness problem, but exacerbates the ambiguity problem. The net gain (or loss) of adding more data changes from case to case. Because there are (to our knowledge) no tools that predict this net effect, it is a matter of empirical investigation (or, in less suave terms, trial-and-error) to determine which corpora to use.²

From this understanding of the reasons for and against out-of-domain data, we formulate the following hypotheses:

1. A weighted combination can control the contribution of the out-of-domain corpus on the probability distribution, and thus limit the ambiguity problem.

2. A weighted combination eliminates the need for data selection, offering a robust baseline for domain-specific machine translation.

We will discuss three mixture modelling techniques for translation models. Our aim is to adapt all four features of the standard Moses SMT translation model: the phrase translation probabilities p(t|s) and p(s|t), and the lexical weights lex(t|s) and lex(s|t).³

2.1 Linear Interpolation

A well-established approach in language modelling is the linear interpolation of several models, i.e. computing the weighted average of the individual model probabilities. It is defined as follows:

    p(x|y; λ) = Σ_{i=1}^{n} λ_i p_i(x|y)    (1)

with λ_i being the interpolation weight of each model i, and with Σ_i λ_i = 1.

For SMT, linear interpolation of translation models has been used in numerous systems. The approaches diverge in how they set the interpolation weights. Some authors use uniform weights (Cohn and Lapata, 2007), others empirically test different interpolation coefficients (Finch and Sumita, 2008; Yasuda et al., 2008; Nakov and Ng, 2009; Axelrod et al., 2011), others apply monolingual metrics to set the weights for TM interpolation (Foster and Kuhn, 2007; Koehn et al., 2010).

There are reasons against all these approaches. Uniform weights are easy to implement, but give little control. Empirically, it has been shown that they often do not perform optimally (Finch and Sumita, 2008; Yasuda et al., 2008). An optimization of BLEU scores on a development set is promising, but slow and impractical. There is no easy way to integrate linear interpolation into log-linear SMT frameworks and perform optimization through MERT. Monolingual optimization objectives such as language model perplexity have the advantage of being well-known and readily available, but their relation to the ambiguity problem is indirect at best.

Linear interpolation is seemingly well-defined in equation 1. Still, there are a few implementation details worth pointing out. If we directly interpolate each feature in the translation model, and define the feature values of non-occurring phrase pairs as 0, this disregards the meaning of each feature. If we estimate p(x|y) via MLE as in equation 2, and c(y) = 0, then p(x|y) is strictly speaking undefined. Alternatively to a naive algorithm, which treats unknown phrase pairs as having a probability of 0, which results in a deficient probability distribution, we propose and implement the following algorithm. For each value pair (x, y) for which we compute p(x|y), we replace λ_i with 0 for all models i with p_i(y) = 0, then renormalize the weight vector to 1. We do this for p(t|s) and lex(t|s), but not for p(s|t) and lex(s|t), the reasoning being the con-

² A frustrating side-effect is that these findings rarely generalize. For instance, we were unable to reproduce the finding by Ceausu et al. (2011) that patent translation systems are highly domain-sensitive and suffer from the inclusion of parallel training data from other patent subdomains.
³ We can ignore the fifth feature, the phrase penalty, which is a constant.
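The renormalization scheme of section 2.1 can be sketched in a few lines. This is a minimal illustration under assumed data structures, not the system's actual implementation: phrase tables are taken to be dicts mapping a source phrase to a dict of target-phrase probabilities, and all function names are hypothetical.

```python
def interpolate_naive(tables, weights, src, tgt):
    """Naive linear interpolation (equation 1): models that do not know
    the source phrase contribute probability 0, which yields a deficient
    probability distribution."""
    return sum(w * t.get(src, {}).get(tgt, 0.0)
               for t, w in zip(tables, weights))


def interpolate_renormalized(tables, weights, src, tgt):
    """Modified interpolation: set the weight of every model that does not
    know the source phrase to 0, then renormalize the remaining weights
    so that they sum to 1."""
    active = [(t, w) for t, w in zip(tables, weights) if src in t]
    total = sum(w for _, w in active)
    if total == 0.0:
        return 0.0  # source phrase unknown to all models
    return sum((w / total) * t[src].get(tgt, 0.0) for t, w in active)
```

For example, if an in-domain model (weight 0.7) does not know the phrase "Wort" and an out-of-domain model (weight 0.3) assigns p(floor|Wort) = 0.9, the naive variant returns 0.27, while the renormalized variant returns 0.9.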
sequences for perplexity minimization (see section 2.4). Namely, we do not want to penalize a small in-domain model for having a high out-of-vocabulary rate on the source side, but we do want to penalize models that know the source phrase, but not its correct translation. A second modification pertains to the lexical weights lex(s|t) and lex(t|s), which form no true probability distribution, but are derived from the individual word translation probabilities of a phrase pair (see (Koehn et al., 2003)). We propose to not interpolate the features directly, but the word translation probabilities which are the basis of the lexical weight computation. The reason for this is that word pairs are less sparse than phrase pairs, so that we can even compute lexical weights for phrase pairs which are unknown in a model.⁴

2.2 Weighted Counts

Weighting of different corpora can also be implemented through a modified Maximum Likelihood Estimation. The traditional equation for MLE is:

    p(x|y) = c(x,y) / c(y) = c(x,y) / Σ_{x'} c(x',y)    (2)

where c denotes the count of an observation, and p the model probability. If we generalize the formula to compute a probability from n corpora, and assign a weight λ_i to each, we get⁵:

    p(x|y; λ) = (Σ_{i=1}^{n} λ_i c_i(x,y)) / (Σ_{i=1}^{n} Σ_{x'} λ_i c_i(x',y))    (3)

The main difference to linear interpolation is that this equation takes into account how well-evidenced a phrase pair is. This includes the distinction between lack of evidence and negative evidence, which is missing in a naive implementation of linear interpolation.

Translation models trained with weighted counts have been discussed before, and have been shown to outperform uniform ones in some settings. However, researchers who demonstrated this fact did so with arbitrary weights (e.g. (Koehn, 2002)), or by empirically testing different weights (e.g. (Nakov and Ng, 2009)). We do not know of any research on automatically determining weights for this method, or which is not limited to two corpora.

2.3 Alternative Paths

A third method is using multiple translation models as alternative decoding paths (Birch et al., 2007), an idea which Koehn and Schroeder (2007) first used for domain adaptation. This approach has the attractive theoretical property that adding new models is guaranteed to lead to equal or better performance, given the right weights. At best, a model is beneficial with appropriate weights. At worst, we can set the feature weights so that the decoding paths of one model are never picked for the final translation. In practice, each translation model adds 5 features and thus 5 more dimensions to the weight space, which leads to longer search, search errors, and/or overfitting. The expectation is that, at least with MERT, using alternative decoding paths does not scale well to a high number of models.

A suboptimal choice of weights is not the only weakness of alternative paths, however. Let us assume that all models have the same weights. Note that, if a phrase pair occurs in several models, combining models through alternative paths means that the decoder selects the path with the highest probability, whereas with linear interpolation, the probability of the phrase pair would be the (weighted) average of all models. Selecting the highest-scoring phrase pair favours statistical outliers and hence is the less robust decision, prone to data noise and data sparseness.

2.4 Perplexity Minimization

In language modelling, perplexity is frequently used as a quality measure for language models (Chen and Goodman, 1998). Among other applications, language model perplexity has been used for domain adaptation (Foster and Kuhn, 2007). For translation models, perplexity is most closely associated with EM word alignment (Brown et al., 1993) and has been used to evaluate different alignment algorithms (Al-Onaizan et al., 1999).

We investigate translation model perplexity minimization as a method to set model weights in mixture modelling. For the purpose of optimization, the cross-entropy H(p), the perplexity 2^H(p), and other derived measures are equivalent. The cross-entropy H(p) is defined as:⁶

⁴ For instance if the word pairs (the, der) and (man, Mann) are known, but the phrase pair (the man, der Mann) is not.
⁵ Unlike equation 1, equation 3 does not require that Σ_i λ_i = 1.
⁶ See (Chen and Goodman, 1998) for a short discussion of the equation. In short, a lower cross-entropy indicates that the model is better able to predict the development set.
    H(p) = −Σ_{x,y} p̃(x,y) log₂ p(x|y)    (4)

The phrase pairs (x, y) whose probability we measure, and their empirical probability p̃, need to be extracted from a development set, whereas p is the model probability. To obtain the phrase pairs, we process the development set with the same word alignment and phrase extraction tools that we use for training, i.e. GIZA++ and heuristics for phrase extraction (Och and Ney, 2003). The objective function is the minimization of the cross-entropy, with the weight vector λ as argument:

    λ̂ = arg min_λ −Σ_{x,y} p̃(x,y) log₂ p(x|y; λ)    (5)

We can fill in equations 1 or 3 for p(x|y; λ). The optimization itself is convex and can be done with off-the-shelf software.⁷ We use L-BFGS with numerically approximated gradients (Byrd et al., 1995).

Perplexity minimization has the advantage that it is well-defined for both weighted counts and linear interpolation, and can be quickly computed. Other than in language modelling, where p(x|y) is the probability of a word given an n-gram history, conditional probabilities in translation models express the probability of a target phrase given a source phrase (or vice versa), which connects the perplexity to the ambiguity problem. The higher the probability of correct phrase pairs, the lower the perplexity, and the more likely the model is to successfully resolve the ambiguity. The question is in how far perplexity minimization coincides with empirically good mixture weights.⁸ This depends, among others, on the other model components in the SMT framework, for instance the language model. We will not evaluate perplexity minimization against empirically optimized mixture weights, but apply it in situations where the latter is infeasible, e.g. because of the number of models.

Our main technical contributions are as follows: Additionally to perplexity optimization for linear interpolation, which was first applied by Foster et al. (2010), we propose perplexity optimization for weighted counts (equation 3), and a modified implementation of linear interpolation. Also, we independently perform perplexity minimization for all four features of the standard SMT translation model: the phrase translation probabilities p(t|s) and p(s|t), and the lexical weights lex(t|s) and lex(s|t).

3 Other Domain Adaptation Techniques

So far, we discussed mixture modelling for translation models, which is only a subset of domain adaptation techniques in SMT.

Mixture modelling for language models is well established (Foster and Kuhn, 2007). Language model adaptation serves the same purpose as translation model adaptation, i.e. skewing the probability distribution in favour of in-domain translations. This means that LM adaptation may have similar effects as TM adaptation, and that the two are to some extent redundant. Foster and Kuhn (2007) find that both TM and LM adaptation are effective, but that combined LM and TM adaptation is not better than LM adaptation on its own.

A second strand of research in domain adaptation is data selection, i.e. choosing a subset of the training data that is considered more relevant for the task at hand. This has been done for language models using techniques from information retrieval (Zhao et al., 2004), or perplexity (Lin et al., 1997; Moore and Lewis, 2010). Data selection has also been proposed for translation models (Axelrod et al., 2011). Note that for translation models, data selection offers an unattractive trade-off between the data sparseness and the ambiguity problem, and that the optimal amount of data to select is hard to determine.

Our discussion of mixture modelling is relatively coarse-grained, with 2–10 models being combined. Matsoukas et al. (2009) propose an approach where each sentence is weighted according to a classifier, and Foster et al. (2010) extend this approach by weighting individual phrase pairs. These more fine-grained methods need not be seen as alternatives to coarse-grained ones. Foster et al. (2010) combine the two, applying linear interpolation to combine the instance-

⁷ A quick demonstration of convexity: equation 1 is affine; equation 3 linear-fractional. Both are convex in the domain ℝ_{>0}. Consequently, equation 4 is also convex because it is the weighted sum of convex functions.
⁸ There are tasks for which perplexity is known to be unreliable, e.g. for comparing models with different vocabularies. However, such confounding factors do not affect the optimization algorithm, which works with a fixed set of phrase pairs, and merely varies λ.
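To make equations 3–5 concrete, here is a self-contained toy sketch for the binary case. The data structures and function names are hypothetical; the actual implementation uses L-BFGS with numerically approximated gradients over n models and all four features, whereas this illustration exploits the convexity of the objective with a simple grid search over a single weight λ.

```python
import math


def weighted_counts_prob(lmbda, counts_in, counts_out, src, tgt):
    """p(tgt|src; lambda) from equation 3 for two corpora with weights
    lambda and (1 - lambda). counts_* map (src, tgt) pairs to counts."""
    num = (lmbda * counts_in.get((src, tgt), 0)
           + (1 - lmbda) * counts_out.get((src, tgt), 0))
    den = (sum(lmbda * c for (s, _), c in counts_in.items() if s == src)
           + sum((1 - lmbda) * c for (s, _), c in counts_out.items() if s == src))
    return num / den if den > 0 else 0.0


def cross_entropy(lmbda, dev_pairs, counts_in, counts_out):
    """Equation 4: dev_pairs maps each development-set phrase pair (x, y)
    to its empirical probability p~(x, y)."""
    h = 0.0
    for (src, tgt), p_emp in dev_pairs.items():
        p = weighted_counts_prob(lmbda, counts_in, counts_out, src, tgt)
        h -= p_emp * math.log2(max(p, 1e-12))  # floor avoids log(0)
    return h


def minimize_cross_entropy(dev_pairs, counts_in, counts_out, steps=1000):
    """Equation 5 for the binary case: since the objective is convex in
    lambda, a fine grid search is enough for this illustration."""
    return min(range(steps + 1),
               key=lambda i: cross_entropy(i / steps, dev_pairs,
                                           counts_in, counts_out)) / steps
```

With a development set drawn from the in-domain distribution, the minimization shifts λ toward the corpus whose counts better predict the dev-set phrase pairs; with a mixed dev set, it settles on an intermediate weight.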
weighted out-of-domain model with an in-domain model.

4 Evaluation

Apart from measuring the performance of the approaches introduced in section 2, we want to investigate the following open research questions.

1. Does an implementation of linear interpolation that is more closely tailored to translation modelling outperform a naive implementation?

2. How do the approaches perform outside a binary setting, i.e. when we do not work with one in-domain and one out-of-domain model, but with a higher number of models?

3. Can we apply perplexity minimization to other translation model features such as the lexical weights, and if yes, does a separate optimization of each translation model feature improve performance?

4.1 Data and Methods

In terms of tools and techniques used, we mostly adhere to the work flow described for the WMT 2011 baseline system⁹. The main tools are Moses (Koehn et al., 2007), SRILM (Stolcke, 2002), and GIZA++ (Och and Ney, 2003), with settings as described in the WMT 2011 guide. We report two translation measures: BLEU (Papineni et al., 2002) and METEOR 1.3 (Denkowski and Lavie, 2011). All results are lowercased and tokenized, measured with five independent runs of MERT (Och and Ney, 2003) and MultEval (Clark et al., 2011) for resampling and significance testing.

We compare three baselines and four translation model mixture techniques. The three baselines are a purely in-domain model, a purely out-of-domain model, and a model trained on the concatenation of the two, which corresponds to equation 3 with uniform weights. Additionally, we evaluate perplexity optimization with weighted counts and the two implementations of linear interpolation contrasted in section 2.1. The two linear interpolations that are contrasted are a naive one, i.e. a direct, unnormalized interpolation of all translation model features, and a modified one that normalizes λ for each phrase pair (s, t) for p(t|s) and recomputes the lexical weights based on interpolated word translation probabilities. The fourth weighted combination is using alternative decoding paths with weights set through MERT. The four weighted combinations are evaluated twice: once applied to the original four or ten parallel data sets, once in a binary setting in which all out-of-domain data sets are first concatenated.

Since we want to concentrate on translation model domain adaptation, we keep other model components, namely word alignment and the lexical reordering model, constant throughout the experiments. We contrast two language models: an unadapted, out-of-domain language model trained on data sets provided for the WMT 2011 translation task, and an adapted language model which is the linear interpolation of all data sets, optimized for minimal perplexity on the in-domain development set.

While unadapted language models are becoming more rare in domain adaptation research, they allow us to contrast different TM mixtures without the effect on performance being (partially) hidden by language model adaptation with the same effect.

The first data set is a DE–FR translation scenario in the domain of mountaineering. The in-domain corpus is a collection of Alpine Club pub-

Data set              sentences   words (fr)
Alpine (in-domain)    220k        4 700k
Europarl              1 500k      44 000k
JRC Acquis            1 100k      24 000k
OpenSubtitles v2      2 300k      18 000k
Total train           5 200k      91 000k
Dev                   1424        33 000
Test                  991         21 000

Table 1: Parallel data sets for German–French translation task.

Data set              sentences   words
Alpine (in-domain)    650k        13 000k
News-commentary       150k        4 000k
Europarl              2 000k      60 000k
News                  25 000k     610 000k
Total                 28 000k     690 000k

Table 2: Monolingual French data sets for German–French translation task.

⁹ http://www.statmt.org/wmt11/baseline.html
lications (Volk et al., 2010). As parallel out-of-domain data sets, we use Europarl, a collection of parliamentary proceedings (Koehn, 2005), JRC-Acquis, a collection of legislative texts (Steinberger et al., 2006), and OpenSubtitles v2, a parallel corpus extracted from film subtitles¹⁰ (Tiedemann, 2009). For language modelling, we use in-domain data and data from the 2011 Workshop on Statistical Machine Translation. The respective sizes of the data sets are listed in tables 1 and 2.

As the second data set, we use the Haitian Creole–English data from the WMT 2011 featured translation task. It consists of emergency SMS sent in the wake of the 2010 Haiti earthquake. Originally, Microsoft Research and CMU operated under severe time constraints to build a translation system for this language pair. This limits the ability to empirically verify how much each data set contributes to translation quality, and increases the importance of automated and quick domain adaptation methods.

Note that both data sets have a relatively high ratio of in-domain to out-of-domain parallel training data (1:20 for DE–FR and 1:5 for HT–EN). Previous research has been performed with ratios of 1:100 (Foster et al., 2010) or 1:400 (Axelrod et al., 2011). Since domain adaptation becomes more important when the ratio of IN to OUT is low, and since such low ratios are also realistic¹¹, we also include results for which the amount of in-domain parallel data has been restricted to 10% of the available data set.

We used the same development set for language/translation model adaptation and setting the global model weights with MERT. While it is theoretically possible that MERT will give too high weights to models that are optimized on the same development set, we found no empirical evidence for this in experiments with separate development sets.

4.2 Results

The results are shown in tables 5 and 6. In the DE–FR translation task, results vary between 13.5 and 18.9 BLEU points; in the HT–EN task, between 24.3 and 33.8. Unsurprisingly, an adapted LM performs better than an out-of-domain one, and using all available in-domain parallel data is better than using only part of it. The same is not true for out-of-domain data, which highlights the problem discussed in the introduction. For the DE–FR task, adding 86 million words of out-of-domain parallel data to the 5 million in-domain data set does not lead to consistent performance gains. We observe a decrease of 0.3 BLEU points with an out-of-domain LM, and an increase of 0.4 BLEU points with an adapted LM. The out-of-domain training data has a larger positive effect if less in-domain data is available, with a gain of 1.4 BLEU points. The results in the HT–EN translation task (table 6) paint a similar picture. An interesting side note is that even tiny amounts of in-domain parallel data can have strong effects on performance. A training set of 1600 emergency SMS (38 000 tokens) yields a comparable performance to an out-of-domain data set of 1.5 million tokens.

As to the domain adaptation experiments, weights optimized through perplexity minimization are significantly better in the majority of

Data set           units     words (en)
SMS (in-domain)    16 500    380 000
Medical            1 600     10 000
Newswire           13 500    330 000
Glossary           35 700    90 000
Wikipedia          8 500     110 000
Wikipedia NE       10 500    34 000
Bible              30 000    920 000
Haitisurf dict     3 700     4 000
Krengle dict       1 600     2 600
Krengle            650       4 200
Total train        120 000   1 900 000
Dev                900       22 000
Test               1274      25 000

Table 3: Parallel data sets for Haitian Creole–English translation task.

Data set           sentences   words
SMS (in-domain)    16k         380k
News               113 000k    2 650 000k

Table 4: Monolingual English data sets for Haitian Creole–English translation task.

¹⁰ http://www.opensubtitles.org
¹¹ We predict that the availability of parallel data will steadily increase, most data being out-of-domain for any given task.
                                   out-of-domain LM     adapted LM           adapted LM
System                             full IN TM           full IN TM           small IN TM
                                   BLEU   METEOR        BLEU   METEOR        BLEU   METEOR
in-domain                          16.8   35.9          17.9   37.0          15.7   33.5
out-of-domain                      13.5   31.3          14.8   32.3          14.8   32.3
counts (concatenation)             16.5   35.7          18.3   37.3          17.1   35.4
binary in/out
  weighted counts                  17.4   36.6          18.7   37.9          17.6   36.2
  linear interpolation (naive)     17.4   36.7          18.8   37.9          17.6   36.1
  linear interpolation (modified)  17.2   36.5          18.9   38.0          17.6   36.2
  alternative paths                17.2   36.5          18.6   37.8          17.4   36.0
4 models
  weighted counts                  17.3   36.6          18.8   37.8          17.4   36.0
  linear interpolation (naive)     17.1   36.5          18.5   37.7          17.3   35.9
  linear interpolation (modified)  17.2   36.5          18.7   37.9          17.3   36.0
  alternative paths                17.0   36.2          18.3   37.4          16.3   35.1

Table 5: Domain adaptation results DE–FR. Domain: Alpine texts. Full IN TM: using the full in-domain parallel corpus; small IN TM: using 10% of available in-domain parallel data.

weights.¹² However, the difference is smaller for the experiments with an adapted language model than for those with an out-of-domain one, which confirms that the benefits of language model adaptation and translation model adaptation are not fully cumulative. Performance-wise, there seems to be no clear winner between weighted counts and the two alternative implementations of linear interpolation. We can still argue for weighted counts on theoretical grounds. A weighted MLE (equation 3) returns a true probability distribution, whereas a naive implementation of linear interpolation results in a deficient model. Consequently, probabilities are typically lower in the naively interpolated model, which results in higher (worse) perplexities. While the deficiency did not affect MERT or decoding negatively, it might become problematic in other applications, for instance if we want to use an interpolated model as a component in a second perplexity-based combination of models.¹³

When moving from a binary setting with one in-domain and one out-of-domain translation model (trained on all available out-of-domain data) to 4–10 translation models, we observe a serious performance degradation for alternative paths, while performance of the perplexity optimization methods does not change significantly. This is positive for perplexity optimization because it demonstrates that it requires less a priori information, and opens up new research possibilities, i.e. experiments with different clusterings of parallel data. The performance degradation for alternative paths is partially due to optimization problems in MERT, but also due to a higher susceptibility to statistical outliers, as discussed in section 2.3.¹⁴

A pessimistic interpretation of the results would point out that performance gains compared to the best baseline system are modest or even inexistent in some settings. However, we want to stress two important points. First, we often do not know a priori whether adding an out-of-domain data set boosts or weakens translation performance. An automatic weighting of data sets reduces the need for trial-and-error experimentation and is worthwhile even if a performance increase is not guaranteed. Second, the potential impact of a weighted combination depends on the translation scenario and the available data sets. Generally, we expect non-uniform weighting to have a bigger impact when the models that are combined are more dissimilar (in terms of fitness for the task), and if the ratio of in-domain to out-of-domain data is low. Conversely, there are situa-

¹² This also applies to linear interpolation with uniform weights, which is not shown in the tables.
¹³ Specifically, a deficient model would be dispreferred by the perplexity minimization algorithm.
¹⁴ We empirically verified this weakness in a synthetic experiment with a randomly split training corpus and identical weights for each path.
                                   out-of-domain LM     adapted LM           adapted LM
System                             full IN TM           full IN TM           small IN TM
                                   BLEU   METEOR        BLEU   METEOR        BLEU   METEOR
in-domain                          30.4   30.7          33.4   31.7          29.7   28.6
out-of-domain                      24.3   28.0          28.9   30.2          28.9   30.2
counts (concatenation)             30.3   31.2          33.6   32.4          31.3   31.3
binary in/out
  weighted counts                  31.0   31.6          33.8   32.4          31.5   31.3
  linear interpolation (naive)     30.8   31.4          33.7   32.4          31.9   31.3
  linear interpolation (modified)  30.8   31.5          33.7   32.4          31.7   31.2
  alternative paths                30.8   31.3          33.2   32.4          29.8   30.7
10 models
  weighted counts                  31.0   31.5          33.5   32.3          31.8   31.5
  linear interpolation (naive)     30.9   31.4          33.8   32.4          31.9   31.3
  linear interpolation (modified)  31.0   31.6          33.8   32.5          32.1   31.5
  alternative paths                25.9   29.2          24.3   29.1          29.8   30.9

Table 6: Domain adaptation results HT–EN. Domain: emergency SMS. Full IN TM: using the full in-domain parallel corpus; small IN TM: using 10% of available in-domain parallel data.
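The outlier-sensitivity argument from section 2.3, which resurfaces in the alternative-paths rows of tables 5 and 6, can be illustrated with a toy comparison (illustrative values only, not taken from the experiments): with equal weights, alternative decoding paths effectively let the single highest-scoring model win, while linear interpolation averages across models and damps outliers.

```python
def alternative_paths_score(probs):
    """With equal feature weights, the decoder picks the best path, i.e.
    the maximum probability that any one model assigns."""
    return max(probs)


def interpolated_score(probs):
    """Linear interpolation with uniform weights averages the models."""
    return sum(probs) / len(probs)


# One noisy model that overestimates a rare phrase pair dominates under
# the max, but is damped under averaging:
probs = [0.05, 0.04, 0.9, 0.06]  # third model is a statistical outlier
```

Here the alternative-paths score is 0.9 (driven entirely by the outlier), while the interpolated score is about 0.26, closer to the consensus of the other three models.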
average of the other four. For linear interpolation, we also include a model whose weights have been optimized through language model perplexity optimization, with a 3-gram language model (modified Kneser-Ney smoothing) trained on the target side of each parallel data set.

Table 7 shows the results. In terms of BLEU score, a separate optimization of each feature is a winner in our experiment in that no other scheme is better, with 8 of the 11 alternative weighting schemes (excluding uniform weights) being significantly worse than a separate optimization. The differences in BLEU score are small, however, since the alternative weighting schemes are generally felicitous in that they yield both a lower perplexity and better BLEU scores than uniform weighting. While our general expectation is that lower perplexities correlate with higher translation performance, this relation is complicated by several facts. Since the interpolated models are deficient (i.e. their probabilities do not sum to 1), perplexities for weighted counts and our implementation of linear interpolation cannot be compared. Also, note that not all features are equally important for decoding. Their weights in the log-linear model are set through MERT and vary between optimization runs.

elling. We envision that a weighted combination could be useful to deal with noisy datasets, or applied after a clustering of training data.

Acknowledgements

This research was funded by the Swiss National Science Foundation under grant 105215_126999.

References

Yaser Al-Onaizan, Jan Curin, Michael Jahr, Kevin Knight, John Lafferty, Dan Melamed, Franz-Josef Och, David Purdy, Noah A. Smith, and David Yarowsky. 1999. Statistical machine translation. Technical report, Final Report, JHU Summer Workshop.

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the EMNLP 2011 Workshop on Statistical Machine Translation.

Alexandra Birch, Miles Osborne, and Philipp Koehn. 2007. CCG supertags in factored statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 9–16, Prague, Czech Republic, June. Association for Computational Linguistics.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation:
Parameter Estimation. Computational Linguistics,
5 Conclusion 19(2):263311.
Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and
This paper contributes to SMT domain adaptation Ciyou Zhu. 1995. A limited memory algorithm
research in several ways. We expand on work for bound constrained optimization. SIAM J. Sci.
by (Foster et al., 2010) in establishing transla- Comput., 16:11901208, September.
tion model perplexity minimization as a robust Alexandru Ceausu, John Tinsley, Jian Zhang, and
Andy Way. 2011. Experiments on domain adap-
baseline for a weighted combination of translation
tation for patent machine translation in the PLuTO
models.15 We demonstrate perplexity optimiza- project. In Proceedings of the 15th conference of
tion for weighted counts, which are a natural ex- the European Association for Machine Translation,
tension of unadapted MLE training, but are of lit- Leuven, Belgium.
tle prominence in domain adaptation research. We Stanley F. Chen and Joshua Goodman. 1998. An em-
also show that we can separately optimize the four pirical study of smoothing techniques for language
variable features in the Moses translation model modeling. Computer Speech & Language, 13:359
through perplexity optimization. 393.
Jonathan H. Clark, Chris Dyer, Alon Lavie, and
We break with prior domain adaptation re-
Noah A. Smith. 2011. Better hypothesis testing for
search in that we do not rely on a binary clustering statistical machine translation: Controlling for op-
of in-domain and out-of-domain training data. We timizer instability. In Proceedings of the 49th An-
demonstrate that perplexity minimization scales nual Meeting of the Association for Computational
well to a higher number of translation models. Linguistics: Human Language Technologies, pages
This is not only useful for domain adaptation, but 176181, Portland, Oregon, USA, June. Associa-
for various tasks that profit from mixture mod- tion for Computational Linguistics.
Trevor Cohn and Mirella Lapata. 2007. Machine
15 Translation by Triangulation: Making Effective Use
The source code is available in the Moses repository
http://github.com/moses-smt/mosesdecoder of Multi-Parallel Corpora. In Proceedings of the
547
45th Annual Meeting of the Association of Compu- Philipp Koehn, Barry Haddow, Philip Williams, and
tational Linguistics, pages 728735, Prague, Czech Hieu Hoang. 2010. More linguistic annotation
Republic, June. Association for Computational Lin- for statistical machine translation. In Proceedings
guistics. of the Joint Fifth Workshop on Statistical Machine
Michael Denkowski and Alon Lavie. 2011. Meteor Translation and MetricsMATR, pages 115120, Up-
1.3: Automatic Metric for Reliable Optimization psala, Sweden, July. Association for Computational
and Evaluation of Machine Translation Systems. In Linguistics.
Proceedings of the EMNLP 2011 Workshop on Sta- Philipp Koehn. 2002. Europarl: A Multilingual Cor-
tistical Machine Translation. pus for Evaluation of Machine Translation.
Andrew Finch and Eiichiro Sumita. 2008. Dynamic Philipp Koehn. 2005. Europarl: A parallel corpus for
model interpolation for statistical machine transla- statistical machine translation. In Machine Transla-
tion. In Proceedings of the Third Workshop on tion Summit X, pages 7986, Phuket, Thailand.
Statistical Machine Translation, StatMT 08, pages Sung-Chien Lin, Chi-Lung Tsai, Lee-Feng Chien,
208215, Stroudsburg, PA, USA. Association for Keh-Jiann Chen, and Lin-Shan Lee. 1997. Chinese
Computational Linguistics. language model adaptation based on document clas-
George Foster and Roland Kuhn. 2007. Mixture- sification and multiple domain-specific language
model adaptation for smt. In Proceedings of the models. In George Kokkinakis, Nikos Fakotakis,
Second Workshop on Statistical Machine Transla- and Evangelos Dermatas, editors, EUROSPEECH.
tion, StatMT 07, pages 128135, Stroudsburg, PA, ISCA.
USA. Association for Computational Linguistics.
Spyros Matsoukas, Antti-Veikko I. Rosti, and Bing
George Foster, Cyril Goutte, and Roland Kuhn. 2010. Zhang. 2009. Discriminative corpus weight esti-
Discriminative instance weighting for domain adap- mation for machine translation. In Proceedings of
tation in statistical machine translation. In Proceed- the 2009 Conference on Empirical Methods in Nat-
ings of the 2010 Conference on Empirical Methods ural Language Processing: Volume 2 - Volume 2,
in Natural Language Processing, pages 451459, pages 708717, Stroudsburg, PA, USA. Association
Stroudsburg, PA, USA. Association for Computa- for Computational Linguistics.
tional Linguistics.
Robert C. Moore and William Lewis. 2010. Intelli-
Philipp Koehn and Kevin Knight. 2001. Knowledge
gent selection of language model training data. In
sources for word-level translation models. In Lil-
Proceedings of the ACL 2010 Conference Short Pa-
lian Lee and Donna Harman, editors, Proceedings
pers, ACLShort 10, pages 220224, Stroudsburg,
of the 2001 Conference on Empirical Methods in
PA, USA. Association for Computational Linguis-
Natural Language Processing, pages 2735.
tics.
Philipp Koehn and Josh Schroeder. 2007. Experi-
ments in domain adaptation for statistical machine Preslav Nakov and Hwee Tou Ng. 2009. Improved
translation. In Proceedings of the Second Work- statistical machine translation for resource-poor
shop on Statistical Machine Translation, StatMT languages using related resource-rich languages. In
07, pages 224227, Stroudsburg, PA, USA. Asso- Proceedings of the 2009 Conference on Empiri-
ciation for Computational Linguistics. cal Methods in Natural Language Processing: Vol-
ume 3 - Volume 3, EMNLP 09, pages 13581367,
Philipp Koehn, Franz Josef Och, and Daniel Marcu.
Stroudsburg, PA, USA. Association for Computa-
2003. Statistical phrase-based translation. In
tional Linguistics.
NAACL 03: Proceedings of the 2003 Conference
of the North American Chapter of the Association Franz Josef Och and Hermann Ney. 2003. A sys-
for Computational Linguistics on Human Language tematic comparison of various statistical alignment
Technology, pages 4854, Morristown, NJ, USA. models. Computational Linguistics, 29(1):1951.
Association for Computational Linguistics. Kishore Papineni, Salim Roukos, Todd Ward, and
Philipp Koehn, Hieu Hoang, Alexandra Birch, Wei-Jing Zhu. 2002. Bleu: A method for automatic
Chris Callison-Burch, Marcello Federico, Nicola evaluation of machine translation. In ACL 02: Pro-
Bertoldi, Brooke Cowan, Wade Shen, Christine ceedings of the 40th Annual Meeting on Associa-
Moran, Richard Zens, Chris Dyer, Ondrej Bojar, tion for Computational Linguistics, pages 311318,
Alexandra Constantin, and Evan Herbst. 2007. Morristown, NJ, USA. Association for Computa-
Moses: Open Source Toolkit for Statistical Ma- tional Linguistics.
chine Translation. In ACL 2007, Proceedings of the Ralf Steinberger, Bruno Pouliquen, Anna Widiger,
45th Annual Meeting of the Association for Com- Camelia Ignat, Tomaz Erjavec, Dan Tufis, and
putational Linguistics Companion Volume Proceed- Daniel Varga. 2006. The JRC-Acquis: A multilin-
ings of the Demo and Poster Sessions, pages 177 gual aligned parallel corpus with 20+ languages. In
180, Prague, Czech Republic, June. Association for Proceedings of the 5th International Conference on
Computational Linguistics. Language Resources and Evaluation (LREC2006).
548
A. Stolcke. 2002. SRILM An Extensible Language
Modeling Toolkit. In Seventh International Confer-
ence on Spoken Language Processing, pages 901
904, Denver, CO, USA.
Jrg Tiedemann. 2009. News from opus - a col-
lection of multilingual parallel corpora with tools
and interfaces. In N. Nicolov, K. Bontcheva,
G. Angelova, and R. Mitkov, editors, Recent
Advances in Natural Language Processing, vol-
ume V, pages 237248. John Benjamins, Amster-
dam/Philadelphia, Borovets, Bulgaria.
Martin Volk, Noah Bubenhofer, Adrian Althaus, Maya
Bangerter, Lenz Furrer, and Beni Ruef. 2010. Chal-
lenges in building a multilingual alpine heritage
corpus. In Proceedings of the Seventh conference
on International Language Resources and Evalu-
ation (LREC10), Valletta, Malta. European Lan-
guage Resources Association (ELRA).
Keiji Yasuda, Ruiqiang Zhang, Hirofumi Yamamoto,
and Eiichiro Sumita. 2008. Method of selecting
training data to build a compact and efficient trans-
lation model. In Proceedings of the 3rd Interna-
tional Joint Conference on Natural Language Pro-
cessing (IJCNLP).
Bing Zhao, Matthias Eck, and Stephan Vogel. 2004.
Language model adaptation for statistical machine
translation with structured query models. In Pro-
ceedings of the 20th international conference on
Computational Linguistics, COLING 04, Strouds-
burg, PA, USA. Association for Computational Lin-
guistics.
549
Subcat-LMF: Fleshing out a standardized format
for subcategorization frame interoperability

Judith Eckle-Kohler and Iryna Gurevych

Ubiquitous Knowledge Processing Lab (UKP-DIPF)
German Institute for Educational Research and Educational Information

Ubiquitous Knowledge Processing Lab (UKP-TUDA)
Department of Computer Science
Technische Universität Darmstadt

http://www.ukp.tu-darmstadt.de

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 550–560, Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics
prerequisite for re-using this format in different contexts, thus contributing to the standardization and interoperability of language resources.

While LMF models exist that cover the representation of SCFs (see Quochi et al. (2008), Buitelaar et al. (2009)), their suitability for representing SCFs at a large scale remains unclear: neither of these LMF models has been used for standardizing lexicons with a large number of SCFs, such as VerbNet. Furthermore, the question of their applicability to different languages has not been investigated yet, a situation that is complicated by the fact that SCFs are highly language-specific.

The goal of this paper is to address these gaps for the two languages English and German by presenting a uniform LMF representation of SCFs for English and German which is utilized for the standardization of large-scale English and German SCF lexicons. The contributions of this paper are threefold: (1) We present the LMF model Subcat-LMF, an LMF-compliant lexicon representation format featuring a uniform and very fine-grained representation of SCFs for English and German. Subcat-LMF is a subset of Uby-LMF (Eckle-Kohler et al., 2012), the LMF model of the large integrated lexical resource Uby (Gurevych et al., 2012). (2) We convert lexicons with large-scale SCF information to Subcat-LMF: the English VerbNet and two German lexicons, i.e., GermaNet (Kunze and Lemnitzer, 2002) and a subset of IMSlex3 (Eckle-Kohler, 1999). (3) We perform a comparison of these three lexicons regarding SCF coverage and SCF overlap, based on the standardized representation.

3 http://www.ims.uni-stuttgart.de/projekte/IMSLex/

The remainder of this paper is structured as follows: Section 2 gives a detailed description of Subcat-LMF and section 3 demonstrates its usefulness for representing and cross-lingually comparing large-scale English and German lexicons. Section 4 provides a discussion including related work and section 5 concludes.

2 Subcat-LMF

2.1 ISO-LMF: a meta-model

LMF defines a meta-model of lexical resources, covering NLP lexicons and Machine Readable Dictionaries. This meta-model is based on the Unified Modeling Language (UML) and specifies a core package and a number of extensions for modeling different types of lexicons, including subcategorization lexicons.

The development of an LMF-compliant lexicon model requires two steps: in the first step, the structure of the lexicon model has to be defined by choosing a combination of the LMF core package and zero to many extensions (i.e. UML packages). While the LMF core package models a lexicon in terms of lexical entries, each of which is defined as the pairing of one to many forms and zero to many senses, the LMF extensions provide UML classes for different types of lexicon organization, e.g., covering the synset-based organization of WordNet and the class-based organization of VerbNet. The first step results in a set of UML classes that are associated according to the UML diagrams given in ISO LMF.

In the second step, these UML classes may be enriched by attributes. While neither attributes nor their values are given by the standard, the standard states that both are to be linked to Data Categories (DCs) defined in a Data Category Registry (DCR) such as ISOCat.4 DCs that are not available in ISOCat may be defined and submitted for standardization. The second step results in a so-called Data Category Selection (DCS).

4 http://www.isocat.org/, the implementation of the ISO 12620 DCR (Broeder et al., 2010).

DCs specify the linguistic vocabulary used in an LMF model. Consider as an example the linguistic term direct object, which often occurs in SCFs of verbs taking an accusative NP as argument. In ISOCat, there are two different specifications of this term, one explicitly referring to the capability of becoming the clause subject in passivization5, the other not mentioning passivization at all.6 Consequently, the use of a DCR plays a major role regarding the semantic interoperability of lexicons (Ide and Pustejovsky, 2010). Different resources that share a common definition of their linguistic vocabulary are said to be semantically interoperable.

5 http://www.isocat.org/datcat/DC-1274
6 http://www.isocat.org/datcat/DC-2263

2.2 Fleshing out ISO-LMF

Approach: We started our development of Subcat-LMF with a thorough inspection of large-scale English and German resources providing SCFs for verbs, nouns, and adjectives. For English, our analysis included VerbNet7 and FrameNet syntactically annotated example sentences from Ruppenhofer et al. (2010). For German, we inspected GermaNet, SALSA annotation guidelines (Burchardt et al., 2006) and IMSlex documentation (Eckle-Kohler, 1999). In addition, the EAGLES synopsis on morphosyntactic phenomena8 (Calzolari and Monachini, 1996), as well as the EAGLES recommendations on subcategorization9, have been used to identify DCs relevant for SCFs.

7 SCFs in VerbNet also cover SCFs in VALEX, a lexicon automatically extracted from corpora.
8 http://www.ilc.cnr.it/EAGLES96/morphsyn/
9 http://www.ilc.cnr.it/EAGLES96/synlex/

We specified Subcat-LMF by a DTD yielding an XML serialization of ISO-LMF. Thus, existing lexicons can be standardized, i.e. converted into Subcat-LMF format, based on the DTD.10

10 Available at http://www.ukp.tu-darmstadt.de/data/uby

Lexicon structure: Next, we defined the lexicon structure of Subcat-LMF. In addition to the core package, Subcat-LMF primarily makes use of the LMF Syntax and Semantics extension. Figure 1 shows the most important classes of Subcat-LMF, including SynsemCorrespondence, where the linking of syntactic and semantic arguments is encoded. It might be worth noting that both synsets from GermaNet and verb classes from VerbNet can be represented in Subcat-LMF by using the Synset and SubcategorizationFrameSet class.

Diverging linguistic properties of SCFs in English and German: For verbs (and also for predicate-like nouns and adjectives), SCFs specify the syntactic and morphosyntactic properties of their arguments that have to be present in concrete realizations of these arguments within a sentence. While some properties of syntactic arguments in English and German correspond (both English and German are Germanic languages and hence closely related), there are other properties, mainly morphosyntactic ones, that diverge. By way of examples, we illustrate some of these divergences in the following (we contrast English examples with their German equivalents):

- overt case marking in German: He helps him. vs. Er hilft ihm. (dative)

- specific verb form in verb phrase arguments: He suggested cleaning the house. (ing-form) vs. Er schlug vor, das Haus zu putzen. (to-infinitive)

- morphosyntactic marking of verb phrase arguments in the main clause: He managed to win. (no marking) vs. Er hat es geschafft zu gewinnen. (obligatory es)

- morphosyntactic marking of clausal arguments in the main clause: That depends on who did it. (preposition) vs. Das hängt davon ab, wer es getan hat. (pronominal adverb)

Uniform Data Categories for English and German: Thus, the main challenge in developing Subcat-LMF has been the specification of DCs (attributes and attribute values) in such a way that a uniform specification of SCFs in the two languages English and German can be achieved. The specification of DCs for Subcat-LMF involved fleshing out ISO-LMF, because it is a meta-standard in the sense that it provides only few linguistic terms, i.e. DCs, and these DCs are not linked to any DCR: in the Syntax extension, the standard only provides 7 class names (see Figure 1), complemented by 17 example attributes given in an informative, non-binding Annex F. These are by far not sufficient to represent the fine-grained SCFs available in such large-scale lexicons as VerbNet.

In contrast, the Syntax part of Subcat-LMF comprises 58 DCs that are properly linked to ISOCat DCs; a number of DCs were missing in ISOCat, so we entered them ourselves.11 The majority of the attributes in Subcat-LMF are attached to the SyntacticArgument class. The corresponding DCs can be divided into two main groups:

- Cross-lingually valid DCs for the specification of grammatical functions (e.g. subject, prepositionalComplement) and syntactic categories (e.g. nounPhrase, prepositionalPhrase), see Table 1.

- Partly language-specific morphosyntactic DCs that further specify the syntactic arguments (e.g. attribute case, attribute verbForm and

11 The Subcat-LMF DCS is publicly available on the ISOCat website.
Figure 1: Selected classes of Subcat-LMF.
Table 1: Cross-lingually valid (English-German) attributes and values of the SyntacticArgument class.
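The DTD-based XML serialization described in section 2.2 can be pictured with a small constructed fragment. The element and attribute names below follow the LMF class names used in the text (LexicalEntry, SubcategorizationFrame, SyntacticArgument) and the attributes from Table 1, but the exact structure is illustrative rather than the actual Subcat-LMF DTD:

```python
import xml.etree.ElementTree as ET

# Sketch of a Subcat-LMF-style entry for German "helfen" (to help),
# whose object is an overtly dative-marked noun phrase (Er hilft ihm).
entry = ET.Element("LexicalEntry", partOfSpeech="verb")
ET.SubElement(ET.SubElement(entry, "Lemma"), "FormRepresentation",
              writtenForm="helfen")
frame = ET.SubElement(entry, "SubcategorizationFrame", id="scf_nom_dat")
ET.SubElement(frame, "SyntacticArgument",
              grammaticalFunction="subject",
              syntacticCategory="nounPhrase", case="nominative")
ET.SubElement(frame, "SyntacticArgument",
              grammaticalFunction="prepositionalComplement",
              syntacticCategory="nounPhrase", case="dative")

print(ET.tostring(entry, encoding="unicode"))
```

In the real format, the attribute values would additionally be linked to ISOCat DCs, which is what makes two lexicons using this vocabulary semantically interoperable.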
Morphosyntactic attributes and values                            NP   PP   VP   C
case: nominative, genitive, dative, accusative                   x    x
determiner: possessive, indefinite                               x    x
number: singular, plural                                         x
verbForm: toInfinitive, bareInfinitive, ingForm (!), participle            x    x
tense: present, past                                                            x
complementizer: thatType, whType, yesNoType                                     x
prepositionType: external ontological type, e.g. locative             x    x    x
preposition: (string) (!)                                             x    x    x
lexeme: (string) (!)                                             x    x

Table 2: Morphosyntactic attributes of SyntacticArgument and the phrase types for which the attributes are appropriate (NP: noun phrase, PP: prepositional phrase, VP: verb phrase, C: clause). Language-specific attributes are marked by (!).
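Read as data, Table 2 amounts to a validity constraint on which attributes may decorate which phrase types. The sketch below covers only a few rows, and the exact attribute-to-phrase-type mapping is our illustrative transcription of the table, not a normative part of Subcat-LMF:

```python
# Which morphosyntactic attributes may appear on which phrase types
# (NP, PP, VP, C); a subset of Table 2, transcribed for illustration.
ALLOWED = {
    "case":           {"NP", "PP"},  # nominative, genitive, dative, accusative
    "number":         {"NP"},        # singular, plural
    "verbForm":       {"VP", "C"},   # toInfinitive, bareInfinitive, ...
    "complementizer": {"C"},         # thatType, whType, yesNoType
}

def check_argument(phrase_type, attributes):
    """Return the attributes that are inappropriate for this phrase type."""
    return [a for a in attributes if a in ALLOWED and phrase_type not in ALLOWED[a]]

print(check_argument("NP", {"case": "dative", "number": "singular"}))  # []
print(check_argument("C",  {"case": "dative"}))                        # ['case']
```

A converter could run such a check on every SyntacticArgument it emits, catching ill-formed frames before serialization.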
            # LexicalEntry   # Sense                     # Subcat. Frame   # Semantic Pred.
LMF-VN      3962            31891                       284               617
orig. VN    (3962 verbs)    (31891 groups of verb,      (568 frames)      (572 sem. pred.)
                            frame, sem. pred.)
LMF-GN      8626            12981                       147               84
orig. GN    (8626 verbs)    (12981 verb-synset pairs)   (202 GN frames)   (no sem. pred.)
LMF-ILS     784             3675                        217               10
orig. ILS   (784 verbs)     (3675 verb-frame pairs)     (220 SCFs)        (no sem. pred.)

Table 3: Evaluation of the automatic conversion. Numbers of Subcat-LMF instances in the converted lexicons compared to numbers of corresponding units in the original lexicons.

Evaluation of Automatic Conversion: Table 3 shows the mapping of the major source lexicon units (such as verb-synset pairs) to Subcat-LMF and lists the corresponding numbers of units.

For VN, groups of VN verb, frame and semantic predicate have been mapped to LMF senses. VN classes have been mapped to SubcategorizationFrameSet. Thus, the original VN sense, a pairing of verb lemma and class, can be recovered by grouping LMF senses that share the same verb class. There is a significant difference between the original VN frames and their Subcat-LMF representation: the semantic information present in VN frames (semantic roles and selectional restrictions) is mapped to semantic arguments in Subcat-LMF, i.e. the mapping splits VN frames into a purely syntactic and a purely semantic part. Consequently, the number of unique SCFs in the Subcat-LMF version of VN is much smaller than the number of frames in the original VN. The conversion tool creates for each sense (specifying a unique verb, frame, semantic predicate combination) a SynSemCorrespondence. On the other hand, the Subcat-LMF version of VN contains more semantic predicates than VN. This is due to selectional restrictions for semantic arguments that are specified in Subcat-LMF within semantic predicates, in contrast to VN.

For GN, verb-synset pairs (i.e., GN lexical units) have been mapped to LMF senses. Few GN frame codes also specify semantic role information, e.g. manner, location. These were mapped to the semantics part of Subcat-LMF, resulting in 84 semantic predicates that encode the semantic role information in their semantic arguments.

ILS specifies similar semantic role information as GN; these few cases were mapped in the same way as for GN. Therefore, the LMF version of ILS, too, specifies fewer SCFs, but additional semantic predicates not present in the original.

Discussion: Grammatical functions of arguments are specified distinctly in the three lexicons. While both GN and ILS specify grammatical functions, they are not explicitly encoded in VN. They have to be inferred on the basis of the phrase structure rules given in the SYNTAX element. We assigned subject to the noun phrase which directly precedes the verb and directObject to the noun phrase directly following the verb and having the semantic role Patient. The semantic role information has to be considered at this point, because not all noun phrase arguments are able to become the subject in a corresponding passive sentence. An example is the verb learn, which has the VN frame NP(Agent) V NP(Topic); here, the Topic-NP is not able to become the subject of a corresponding passive sentence. We assigned the grammatical function complement to all other phrase types.

Argument order constraints in SCFs are represented in LMF by a list implementation of syntactic arguments. Most SCFs from VN require the subject to be the first argument, reflecting the basic word order in English sentences. VN lists one exception to this rule for the verb appear, illustrated by the example On the horizon appears a ship.

Argument optionality in VN is expressed at the semantic level and at the syntactic level in parallel: it is explicitly specified at the semantic level and implicitly specified at the syntactic level. At the syntactic level, two SCF versions exist in VN, one with the optional argument, the other without it. In addition, the semantic predicate attached to these SCFs marks optional (semantic) arguments by a ?-sign. GN, on the other hand, expresses argument optionality at the level of syntactic arguments, i.e., within the frame code. In Subcat-LMF, optionality is represented at the syntactic level by an (optional) attribute optional for syntactic arguments, thus reflecting the explicit representation used in GN and the implicit representation present in VN.18

18 As a consequence, all semantic arguments specified in the Subcat-LMF version of VN have a corresponding syntactic argument.

GN frames specify syntactic alternations of argument realizations, e.g. adverbial complements that can alternatively be realized as adverb phrase, prepositional phrase or noun phrase. We encoded this generalization in Subcat-LMF by introducing attribute values for these aggregated syntactic categories.

3.2 Cross-lingual comparison of lexicons

Lexicons that are standardized according to Subcat-LMF can be quantitatively compared regarding SCFs. For two lexicons, such a comparison gives answers to questions such as: how many SCFs are present in both lexicons (overlapping SCFs), and how many SCFs are only listed in one of the lexicons (complementary SCFs)? Answers to these questions are important, for instance, for assessing the potential gain in SCF coverage that can be achieved by lexicon merging.

In order to validate our claim that Subcat-LMF yields a cross-lingually uniform SCF representation, we contrast the monolingual comparison of GN and ILS with the cross-lingual comparisons of VN and GN and of VN and ILS. Assuming that our claim is valid, the cross-lingual comparisons can be expected to yield similar results regarding overlapping and complementary SCFs as the monolingual comparison.

Comparison: The comparison of SCFs from two lexicons that are in Subcat-LMF format can be performed on the basis of the uniform DCs. As Subcat-LMF is implemented in XML, we compared string representations of SCFs. SCFs from VN, GN and ILS were converted to strings by concatenating attribute values of syntactic arguments and lexemeProperty. We created string representations of different granularities: First, fine-grained, language-specific string SCFs have been generated by concatenating all attribute values apart from the attribute optional, which is specific to GN (resulting in a considerably smaller number of SCFs in GN). Second, fine-grained, but cross-lingual string SCFs were considered; these omit the attributes case, lexeme, preposition and the attribute value ingForm. Finally, coarse-grained cross-lingual string SCFs were compared. These only contain the values of the attributes syntactic category, complementizer and verbForm (without the attribute value ingForm). For instance, a coarse cross-lingual string SCF for transitive verbs is nounPhrase nounPhrase.

Table 4 lists the results of our quantitative comparison. For each lexicon pair, the number of overlapping SCFs and the numbers of complementary SCFs are given. Regarding VN and the German lexicons, the overlap at the language-specific level is (close to) zero, which is due to the specification of case, e.g. dative, for German arguments. However, the numbers for cross-lingual SCFs clearly validate our claim: the numbers of overlapping SCFs for the German lexicon pair and for the two German-English pairs are comparable, ranging from 12 to 18 for the fine-grained SCFs and from 20 to 21 for the coarse SCFs.

Based on the sets of cross-lingually overlapping SCFs, we estimated how many highly frequent verbs actually have SCFs that are in the cross-lingual SCF overlap of an English-German lexicon pair. For this, we used the lemma frequency lists of the English and German WaCky corpora (Baroni et al., 2009) and extracted verbs from VN, GN and ILS that are on the 100 top-ranked positions of these lists, starting from rank 100.19

19 Since the WaCky frequency lists do not contain POS information, our lists of extracted verbs contain some noise, which we tolerated, because we aimed at an approximate estimate.

Table 5 shows the results for the cross-lingual SCF overlap between VN–GN and between VN–ILS. While only around 40% of the highly frequent verbs have an SCF in the fine-grained SCF overlap, more than 70% are in the coarse overlap between VN–GN, and even more than 80% in the coarse overlap between VN–ILS.

Analysis of results: The small numbers of overlapping cross-lingual SCFs (relative to the total number of SCFs), at both levels of granularity, indicate that the three lexicons each encode substantially different lexical-syntactic properties of
             language-specific (fine-grained)   cross-lingual (fine-grained)   cross-lingual (coarse)
GN vs. ILS   72 GN, 21 both, 196 ILS            61 GN, 23 both, 69 ILS         40 GN, 24 both, 23 ILS
VN vs. GN    284 VN, 0 both, 93 GN              96 VN, 15 both, 69 GN          29 VN, 24 both, 40 GN
VN vs. ILS   283 VN, 1 both, 216 ILS            93 VN, 18 both, 74 ILS         31 VN, 22 both, 25 ILS

Table 4: Comparison of lexicon pairs regarding SCF overlap and complementary SCFs.

Table 5: Percentage of 100 highly frequent verbs from VN, GN, ILS with an SCF in the cross-lingual SCF overlap (fine-grained vs. coarse) between VN–GN and VN–ILS.
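The string-based comparison summarized in Table 4 can be sketched as follows. The toy SCFs and the scf_string helper are invented for illustration; the real conversion concatenates attribute values taken from the Subcat-LMF XML:

```python
# Build fine-grained (language-specific) and coarse string representations
# of SCFs, then compare two lexicons by set overlap.
COARSE_ATTRS = ("syntacticCategory", "complementizer", "verbForm")

def scf_string(frame, coarse=False):
    """Concatenate attribute values of each syntactic argument in a frame."""
    parts = []
    for arg in frame:
        keys = COARSE_ATTRS if coarse else sorted(arg)
        parts.append(",".join(f"{k}={arg[k]}" for k in keys if k in arg))
    return ";".join(parts)

# Toy lexicons: a transitive frame with and without overt case marking.
german = [[{"syntacticCategory": "nounPhrase", "case": "nominative"},
           {"syntacticCategory": "nounPhrase", "case": "accusative"}]]
english = [[{"syntacticCategory": "nounPhrase"},
            {"syntacticCategory": "nounPhrase"}]]

fine_g = {scf_string(f) for f in german}
fine_e = {scf_string(f) for f in english}
coarse_g = {scf_string(f, coarse=True) for f in german}
coarse_e = {scf_string(f, coarse=True) for f in english}

print(len(fine_g & fine_e), len(coarse_g & coarse_e))  # 0 1
```

This reproduces the pattern reported above in miniature: at the language-specific level the German case attribute blocks any overlap, while at the coarse level the shared transitive pattern matches.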
verbs. This can at least partly be explained by the historic development of these lexicons in different contexts, e.g., Levin's work on verb classes (VN), Lexical Functional Grammar (ILS), as well as their use for different purposes and applications.

Another reason for the small SCF overlap is the comparison of strings derived from the XML format. A more sophisticated representation format, notably one that provides semantic typing and type hierarchies, e.g., OWL, could be employed to define hierarchies of grammatical functions (e.g., direct object would be a sub-type of complement) and other attributes. These would presumably support the identification of further overlapping SCFs.

During a subsequent qualitative analysis of the overlapping and complementary SCFs, we collected some enlightening background information. Overlapping SCFs in the cross-lingual comparison (both fine-grained and coarse) include prominent SCFs corresponding to transitive and intransitive verbs, as well as verbs with that-clause and verbs with to-infinitive.

GN and ILS are highly complementary regarding SCFs: for instance, while many SCFs with adverbial arguments are unique to GN, only ILS provides a fine-grained specification of prepositional complements including the preposition, as well as the case the preposition requires.20 VN, too, contains a large number of SCFs with a detailed specification of possible prepositions, partly specified as language-independent preposition types. A large number of complementary SCFs in VN vs. GN and GN vs. ILS are due to a diverging linguistic analysis of extraposed subject clauses with an "es" ("it") in the main clause (e.g., "It annoys him that the train is late."). In GN, such clauses are not specified as subject, whereas in VN and ILS they are.

Regarding VN and ILS, only VN lists subject control for verbs, while both VN and ILS list object control and subject raising. GN, on the other hand, does not specify control or raising at all.

4 Discussion

4.1 Previous Work

Merging SCFs: Previous work on merging SCF lexicons has only been performed in a monolingual setting and lacks the use of standards. King and Crouch (2005) describe the process of unifying several large-scale verb lexicons for English, including VN and WordNet. They perform a conversion of these lexicons into a uniform, but non-standard representation format, resulting in a lexicon which is integrated at the level of verb senses, SCFs and lexical semantics. Thus, the result of their work is not applicable to cross-lingual settings.

Necsulescu et al. (2011) and Padro et al. (2011) report on approaches to automatic merging of two Spanish SCF lexicons. As these lexicons lack sense information apart from the SCFs, their merging approach only works on a very coarse-grained sense level given by lemma-SCF pairs.

20 In German, prepositions govern the case of their noun phrase.
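The point about string-level comparison versus a type hierarchy can be pictured with a small sketch. This is hypothetical illustration code (invented SCF labels and hierarchy, not the actual lexicon data): two frames that differ only in whether an argument is labelled as a direct object or as a generic complement never match as strings, but do match once grammatical functions are generalized to their supertypes.

```python
# Hypothetical sketch: comparing SCF inventories as plain strings
# vs. with a small hierarchy of grammatical functions.

# Flat string comparison: "dobj" and "comp" never match.
lexicon_a = {"subj:NP dobj:NP", "subj:NP comp:Sfin"}
lexicon_b = {"subj:NP comp:NP", "subj:NP comp:Sfin"}

flat_overlap = lexicon_a & lexicon_b  # only the literal string match

# With a hierarchy, "dobj" counts as a sub-type of "comp".
PARENT = {"dobj": "comp", "iobj": "comp", "comp": None}

def generalize(scf):
    """Map every grammatical function in an SCF string to its supertype."""
    parts = []
    for arg in scf.split():
        fn, cat = arg.split(":")
        parts.append((PARENT.get(fn) or fn) + ":" + cat)
    return " ".join(parts)

coarse_a = {generalize(s) for s in lexicon_a}
coarse_b = {generalize(s) for s in lexicon_b}
coarse_overlap = coarse_a & coarse_b  # now both frames match
```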
The fully automatic merging approach described in (Padro et al., 2011) assumes that one of the lexicons to be integrated is already represented in the target representation format, i.e., given two lexicons, they map one lexicon to the format of the other. Moreover, their approach requires a significant overlap of SCFs and verbs in any two lexicons to be merged. The authors state that it is presently unclear how much overlap is required to obtain sufficiently precise merging results.

Standardizing SCFs: Much previous work on standardizing NLP lexicons in LMF has focused on WordNet-like resources. Soria et al. (2009) describe WordNet-LMF, an LMF model for representing wordnets which has been used in the KYOTO project.21 Later, WordNet-LMF was adapted by Henrich and Hinrichs (2010) to GermaNet and by Toral et al. (2010) to the Italian WordNet. WordNet-LMF does not provide the possibility to represent subcategorization at all. The adaptation of WordNet-LMF to GN (Henrich and Hinrichs, 2010) allows SCFs to be represented as string values. However, this extension is not sufficient, because it provides no means to model the syntax-semantics interface, which specifies correspondences between syntactic and semantic arguments of verbs and other predicates. Quochi et al. (2008) report on an LMF model that covers the syntax-semantics mapping just mentioned; it has been used for standardizing an Italian domain-specific lexicon. Buitelaar et al. (2009) describe LexInfo, an LMF model that is used for lexicalizing ontologies. LexInfo is implemented in OWL and specifies a linking of syntactic and semantic arguments. For SCFs and arguments, a type hierarchy is defined. In their paper, Buitelaar et al. (2009) show only a few SCFs and do not indicate what kinds of SCFs can be represented with LexInfo in principle. On the LexInfo website,22 the current LexInfo version 2.0 can be viewed, but no further documentation is given. We inspected LexInfo version 2.0 and found that it specifies a large number of fine-grained SCFs. However, LexInfo has not been evaluated so far on large-scale SCF lexicons, such as VerbNet.

4.2 Subcat-LMF

Subcat-LMF enables the uniform representation of fine-grained SCFs across the two languages English and German. By mapping large-scale SCF lexicons to Subcat-LMF, we have demonstrated its usability for uniformly representing a wide range of SCFs and other lexical-syntactic information types in English and German.

As our cross-lingual comparison of lexicons has revealed many complementary SCFs in VN, GN and ILS, mono- and cross-lingual alignments of these lexicons at sense level would lead to a major increase in SCF coverage. Moreover, the cross-lingually uniform representation of SCFs can be exploited for an additional alignment of the lexicons at the level of SCF arguments. Such a fine-grained alignment of SCFs can be used, for instance, to project VN semantic roles to GN, thus yielding a German resource for semantic role labeling (see Gildea and Jurafsky (2002), Swier and Stevenson (2005)).

Subcat-LMF could be used for standardizing further English and German lexicons. The automatic conversion of lexicons to Subcat-LMF requires the manual definition of a mapping, at least for syntactic arguments. Furthermore, the automatic merging approach by Padro et al. (2011) could be tested for English: given our standardized version of VN, other English SCF lexicons could be merged fully automatically with the Subcat-LMF version of VN.

5 Conclusion

Subcat-LMF contributes to fostering the standardization of language resources and their interoperability at the lexical-syntactic level across English and German. The Subcat-LMF DTD including links to ISOCat, all conversion tools, and the standardized versions of VN and ILS23 are publicly available at http://www.ukp.tu-darmstadt.de/data/uby.

Acknowledgments

This work has been supported by the Volkswagen Foundation as part of the Lichtenberg Professorship Program under grant No. I/82806. We thank the anonymous reviewers for their valuable comments. We also thank Dr. Jungi Kim and Christian M. Meyer for their contributions to this paper, and Yevgen Chebotar and Zijad Maksuti for their contributions to the conversion software.

21 http://www.kyoto-project.eu/
22 See http://lexinfo.net/
23 The converted version of GN cannot be made available due to licensing.
References

Galen Andrew, Trond Grenager, and Christopher D. Manning. 2004. Verb sense and subcategorization: using joint inference to improve performance on complementary tasks. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 150–157, Barcelona, Spain.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Daan Broeder, Marc Kemps-Snijders, Dieter Van Uytvanck, Menzo Windhouwer, Peter Withers, Peter Wittenburg, and Claus Zinn. 2010. A Data Category Registry- and Component-based Metadata Framework. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), pages 43–47, Valletta, Malta.

Susan Windisch Brown, Dmitriy Dligach, and Martha Palmer. 2011. VerbNet Class Assignment as a WSD Task. In Proceedings of the 9th International Conference on Computational Semantics (IWCS), pages 85–94, Oxford, UK.

Paul Buitelaar, Philipp Cimiano, Peter Haase, and Michael Sintek. 2009. Towards Linguistically Grounded Ontologies. In Lora Aroyo, Paolo Traverso, Fabio Ciravegna, Philipp Cimiano, Tom Heath, Eero Hyvonen, Riichiro Mizoguchi, Eyal Oren, Marta Sabou, and Elena Simperl, editors, The Semantic Web: Research and Applications, pages 111–125, Berlin Heidelberg. Springer-Verlag.

Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Pado, and Manfred Pinkal. 2006. The SALSA Corpus: a German Corpus Resource for Lexical Semantics. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), pages 969–974, Genoa, Italy.

Nicoletta Calzolari and Monica Monachini. 1996. EAGLES Proposal for Morphosyntactic Standards: in view of a ready-to-use package. In G. Perissinotto, editor, Research in Humanities Computing, volume 5, pages 48–64. Oxford University Press, Oxford, UK.

Tejaswini Deoskar. 2008. Re-estimation of lexical parameters for treebank PCFGs. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING), pages 193–200, Manchester, United Kingdom.

Judith Eckle-Kohler, Iryna Gurevych, Silvana Hartmann, Michael Matuschek, and Christian M. Meyer. 2012. UBY-LMF – A Uniform Format for Standardizing Heterogeneous Lexical-Semantic Resources in ISO-LMF. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), page (to appear), Istanbul, Turkey.

Judith Eckle-Kohler. 1999. Linguistisches Wissen zur automatischen Lexikon-Akquisition aus deutschen Textcorpora. PhD thesis. Logos-Verlag, Berlin, Germany.

Gil Francopoulo, Nuria Bel, Monte George, Nicoletta Calzolari, Monica Monachini, Mandy Pet, and Claudia Soria. 2006. Lexical Markup Framework (LMF). In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), pages 233–236, Genoa, Italy.

Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28:245–288, September.

Ralph Grishman, Catherine Macleod, and Adam Meyers. 1994. Comlex Syntax: Building a Computational Lexicon. In Proceedings of the 15th International Conference on Computational Linguistics (COLING), pages 268–272, Kyoto, Japan.

Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer, and Christian Wirth. 2012. Uby – A Large-Scale Unified Lexical-Semantic Resource. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), page (to appear), Avignon, France.

Verena Henrich and Erhard Hinrichs. 2010. Standardizing wordnets in the ISO standard LMF: Wordnet-LMF for GermaNet. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pages 456–464, Beijing, China.

Nancy Ide and James Pustejovsky. 2010. What Does Interoperability Mean, anyway? Toward an Operational Definition of Interoperability. In Proceedings of the Second International Conference on Global Interoperability for Language Resources, Hong Kong.

Tracy Holloway King and Dick Crouch. 2005. Unifying lexical resources. In Proceedings of the Interdisciplinary Workshop on the Identification and Representation of Verb Features and Verb Classes, Saarbruecken, Germany.

Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. 2008. A Large-scale Classification of English Verbs. Language Resources and Evaluation, 42:21–40.

Manfred Klenner. 2007. Shallow dependency labeling. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), Companion Volume Proceedings of the Demo and Poster Sessions, pages 201–204, Prague, Czech Republic.

Claudia Kunze and Lothar Lemnitzer. 2002. GermaNet – representation, visualization, application. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC), pages 1485–1491, Las Palmas, Canary Islands, Spain.

Beth Levin. 1993. English Verb Classes and Alternations. The University of Chicago Press, Chicago, USA.

Christian M. Meyer and Iryna Gurevych. 2011. What Psycholinguists Know About Chemistry: Aligning Wiktionary and WordNet for Increased Domain Coverage. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), pages 883–892, Chiang Mai, Thailand.

Roberto Navigli and Simone Paolo Ponzetto. 2010. BabelNet: Building a very large multilingual semantic network. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 216–225, Uppsala, Sweden.

Silvia Necsulescu, Nuria Bel, Muntsa Padro, Montserrat Marimon, and Eva Revilla. 2011. Towards the Automatic Merging of Language Resources. In Proceedings of the 2011 ESSLLI Workshop on Lexical Resources (WoLeR 2011), Ljubljana, Slovenia.

Elisabeth Niemann and Iryna Gurevych. 2011. The People's Web meets Linguistic Knowledge: Automatic Sense Alignment of Wikipedia and WordNet. In Proceedings of the 9th International Conference on Computational Semantics (IWCS), pages 205–214, Oxford, UK.

Muntsa Padro, Nuria Bel, and Silvia Necsulescu. 2011. Towards the Automatic Merging of Lexical Resources: Automatic Mapping. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 296–301, Hissar, Bulgaria.

Valeria Quochi, Monica Monachini, Riccardo Del Gratta, and Nicoletta Calzolari. 2008. A lexicon for biology and bioinformatics: the BOOTStrep experience. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), pages 2285–2292, Marrakech, Morocco, May.

Josef Ruppenhofer, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson, and Jan Scheffczyk. 2010. FrameNet II: Extended Theory and Practice, September.

Lei Shi and Rada Mihalcea. 2005. Putting pieces together: Combining FrameNet, VerbNet and WordNet for robust semantic parsing. In Proceedings of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), pages 100–111, Mexico City, Mexico.

Anthony Sigogne, Matthieu Constant, and Eric Laporte. 2011. Integration of data from a syntactic lexicon into generative and discriminative probabilistic parsers. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 363–370, Hissar, Bulgaria.

Claudia Soria, Monica Monachini, and Piek Vossen. 2009. Wordnet-LMF: fleshing out a standardized format for Wordnet interoperability. In Proceedings of the 2009 International Workshop on Intercultural Collaboration, pages 139–146, Palo Alto, California, USA.

Robert S. Swier and Suzanne Stevenson. 2005. Exploiting a verb lexicon in automatic semantic role labelling. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT '05), pages 883–890, Vancouver, British Columbia, Canada.

Antonio Toral, Stefania Bracale, Monica Monachini, and Claudia Soria. 2010. Rejuvenating the Italian WordNet: upgrading, standardising, extending. In Proceedings of the 5th Global WordNet Conference, Bombay, India.
The effect of domain and text type on text prediction quality

Suzan Verberne, Antal van den Bosch, Helmer Strik, Lou Boves
Centre for Language Studies
Radboud University Nijmegen
s.verberne@let.ru.nl

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 561–569, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
versational speech and web pages of Frequently Asked Questions (FAQ).

2. A series of across-text type experiments in which we train and test on different text types;

3. A case study using texts from a specific domain and text type: questions about neurological issues. Training data for this combination of language (Dutch), text type (FAQ) and domain (medical/neurological) is sparse. Therefore, we search for the type of training data that gives the best prediction results for this corpus. We compare the following training corpora:

   - The corpora that we compared in the text type experiments: Wikipedia, Twitter, Speech and FAQ, 1.5 million words per corpus;
   - A 1.5-million-word training corpus that is of the same domain as the target data: medical pages from Wikipedia;
   - The 359 questions from the neuro-QA data themselves, evaluated in a leave-one-out setting (359 times training on 358 questions and evaluating on the remaining question).

The prospective application of the third series of experiments is the development of a text prediction algorithm in an online care platform: an online community for patients seeking information about their illness. In this specific case the target group is patients with language disabilities due to neurological disorders.

The remainder of this paper is organized as follows: In Section 2 we give a brief overview of text prediction methods discussed in the literature. In Section 3 we present our approach to text prediction. Sections 4 and 5 describe the experiments that we carried out and the results we obtained. We phrase our conclusions in Section 6.

2 Text prediction methods

Text prediction methods have been developed for several different purposes. The older algorithms were built as communicative devices for people with disabilities, such as motor and speech impairments. More recently, text prediction has been developed for writing with reduced keyboards, specifically for writing (composing messages) on mobile devices (Garay-Vitoria and Abascal, 2006).

All modern methods share the general idea that previous context (which we will call the buffer) can be used to predict the next block of characters (the predictive unit). If the user gets correct suggestions for continuation of the text, then the number of keystrokes needed to type the text is reduced. The unit to be predicted by a text prediction algorithm can be anything ranging from a single character (which actually does not save any keystrokes) to multiple words. Single words are the most widely used prediction units because they are recognizable at a low cognitive load for the user, and word prediction gives good results in terms of keystroke savings (Garay-Vitoria and Abascal, 2006).

There is some variation among methods in the size and type of buffer used. Most methods use character n-grams as buffer, because they are powerful and can be implemented independently of the target language (Carlberger, 1997). In many algorithms the buffer is cleared at the start of each new word (making the buffer never larger than the length of the current word). In the paper by Van den Bosch and Bogers (2008), two extensions to the basic prefix model are compared. They found that an algorithm that uses the previous n characters as buffer, crossing word borders without clearing the buffer, performs better than both a prefix character model and an algorithm that includes the full previous word as a feature. In addition to using the previously typed characters and/or words in the buffer, word characteristics such as frequency and recency could also be taken into account (Garay-Vitoria and Abascal, 2006).

Possible evaluation measures for text prediction are the proportion of words that are correctly predicted, the percentage of keystrokes that could maximally be saved (if the user would always make the correct decision), and the time saved by the use of the algorithm (Garay-Vitoria and Abascal, 2006). The performance that can be obtained by text prediction algorithms depends on the language they are evaluated on. Lower results are obtained for more highly inflected languages such as German than for low-inflected languages such as English (Matiasek et al., 2002). In their overview of text prediction systems, Garay-Vitoria and Abascal (2006) report performance scores ranging from 29% to 56% of keystrokes saved.
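The idea of predicting a whole word from previously typed characters can be illustrated with a deliberately minimal sketch. This is an assumed unigram-frequency completer for illustration only, not the system evaluated in this paper:

```python
from collections import Counter

# Build a word-frequency list from some training text (toy example).
training_text = "to be or not to be that is the question"
freq = Counter(training_text.split())

def suggest(prefix):
    """Return the most frequent training word starting with `prefix`,
    or None if no training word matches."""
    candidates = [(n, w) for w, n in freq.items() if w.startswith(prefix)]
    return max(candidates)[1] if candidates else None
```

Typing "t" would already yield the suggestion "to" here; a real system, as discussed above, conditions on a larger character buffer rather than on the current prefix alone.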
An important factor that is known to influence the quality of text prediction systems is training set size (Lesher et al., 1999; Van den Bosch, 2011). Van den Bosch (2011) shows log-linear learning curves for word prediction (a constant improvement each time the training corpus size is doubled) when the training set size is increased incrementally from 10^2 to 3·10^7 words.

3 Our approach to text prediction

We implement a text prediction algorithm for Dutch, which is a productive compounding language like German, but has a somewhat simpler inflectional system. We do not focus on the effect of training set size, but on the effect of text type and topic domain differences.

Our approach to text prediction is largely inspired by Van den Bosch and Bogers (2008). We experiment with two different buffer types that are based on character n-grams:

- Prefix: the buffer contains all characters of only the word currently keyed in, where the buffer shifts by one character position with every new character.

- Buffer15: the buffer also includes any other characters keyed in belonging to previously keyed-in words.

Modeling character history beyond the current word can naturally be done with a buffer model in which the buffer shifts by one position per character, while a typical left-aligned prefix model (that never shifts and fixes letters to their positional feature) would not be able to do this.

In the buffer, all characters from the text are kept, including whitespace and punctuation. The predictive unit is one token (word or punctuation symbol). In both the buffer and the prediction label, any capitalization is kept. At each point in the typing process, our algorithm gives one suggestion: the word that is the most likely continuation of the current buffer.

We save the training data as a classification data set: each character in the buffer fills a feature slot and the word that is to be predicted is the classification label. Figures 1 and 2 give examples of each of the buffer types Prefix and Buffer15 that we created for the text fragment "tot een niveau" in the context "stelselmatig bij elke verkiezing tot een niveau van" ("structurally with each election to a level of"). We use the implementation of the IGTree decision tree algorithm in TiMBL (Daelemans et al., 1997) to train our models.

3.1 Evaluation

We evaluate our algorithms on corpus data. This means that we have to make assumptions about user behaviour. We assume that the user confirms a suggested word as soon as it is suggested correctly, not typing any additional characters before confirming. We evaluate our text prediction algorithms in terms of the percentage of keystrokes saved, K:

K = (Σ_i F_i - Σ_i W_i) / (Σ_i F_i) * 100    (1)

in which n is the number of words in the test set (the sums run over i = 0, ..., n), W_i is the number of keystrokes that have been typed before the word i is correctly suggested, and F_i is the number of keystrokes that would be needed to type the complete word i. For example, our algorithm correctly predicts the word "niveau" after the context "i n g _ t o t _ e e n _ n i v" in the test set. Assuming that the user confirms the word "niveau" at this point, three keystrokes were needed for the prefix "niv". So, W_i = 3 and F_i = 6. The numbers of keystrokes needed for whitespace and punctuation are unchanged: these have to be typed anyway, independently of the support by a text prediction algorithm.

4 Text type experiments

In this section, we describe the first and second series of experiments. The case study on questions from the neurological domain is described in Section 5.

4.1 Data

In the text type experiments, we evaluate our text prediction algorithm on four different types of Dutch text: Wikipedia, Twitter data, transcriptions of conversational speech, and web pages of Frequently Asked Questions (FAQ). The Wikipedia corpus that we use is part of the Lassy corpus (Van Noord, 2009); we obtained a version from the summer of 2010.1 The Twitter data are collected continuously and automatically filtered for language by Erik Tjong Kim Sang (Tjong Kim Sang, 2011). We used the tweets from all users that posted at least 19 tweets (excluding retweets) during one day in June 2011.

1 http://www.let.rug.nl/vannoord/trees/Treebank/Machine/NLWIKI20100826/COMPACT/
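Equation (1) from Section 3.1 is straightforward to compute once the per-word keystroke counts are known. A small sketch of the measure (with the paper's "niveau" example, W_i = 3 and F_i = 6, a single-word test set yields K = 50):

```python
def keystrokes_saved(full, typed):
    """Percentage of keystrokes saved, Eq. (1):
    K = (sum(F) - sum(W)) / sum(F) * 100,
    where full[i] (F_i) is the number of keystrokes needed to type word i
    completely, and typed[i] (W_i) is the number of keystrokes actually
    typed before word i was correctly suggested."""
    return 100 * (sum(full) - sum(typed)) / sum(full)

# "niveau" confirmed after the prefix "niv": W = 3, F = 6 -> K = 50.0
k = keystrokes_saved([6], [3])
```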
t tot
t o tot
t o t tot
e een
e e een
e e n een
n niveau
n i niveau
n i v niveau
n i v e niveau
n i v e a niveau
n i v e a u niveau
Figure 1: Example of buffer type Prefix for the text fragment "(elke verkiezing) tot een niveau". Underscores represent whitespace.
l k e v e r k i e z i n g tot
k e v e r k i e z i n g t tot
e v e r k i e z i n g t o tot
v e r k i e z i n g t o t tot
v e r k i e z i n g t o t een
e r k i e z i n g t o t e een
r k i e z i n g t o t e e een
k i e z i n g t o t e e n een
i e z i n g t o t e e n niveau
e z i n g t o t e e n n niveau
z i n g t o t e e n n i niveau
i n g t o t e e n n i v niveau
n g t o t e e n n i v e niveau
g t o t e e n n i v e a niveau
t o t e e n n i v e a u niveau
Figure 2: Example of buffer type Buffer15 for the text fragment "(elke verkiezing) tot een niveau". Underscores represent whitespace.
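The instance creation illustrated in Figures 1 and 2 can be sketched as follows. This is an illustrative reimplementation under assumed conventions (one instance per typed character, spaces shown as underscores, left-padded fixed-width buffer); the paper's exact instance layout is the one shown in the figures:

```python
def buffer15_instances(words, width=15):
    """One training instance per keystroke: the features are the last
    `width` characters typed so far (left-padded with spaces), and the
    class label is the word currently being keyed in."""
    typed = ""
    instances = []
    for word in words:
        for ch in word:
            typed += ch
            buf = typed.replace(" ", "_")[-width:].rjust(width)
            instances.append((buf, word))
        typed += " "  # whitespace is kept in the buffer, shown as "_"
    return instances
```

Setting `width=len(current_word)` and clearing the history at each word boundary would give the Prefix variant instead; Buffer15 differs precisely in carrying the previously keyed-in words across word borders.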
This is a set of 1 million Twitter messages from 30,000 different users. The transcriptions of conversational speech are from the Spoken Dutch Corpus (CGN) (Oostdijk, 2000); for our experiments, we only use the category "spontaneous speech". We obtained the FAQ data by downloading the first 1,000 pages that Google returns for the query "faq" with the language restriction Dutch. After cleaning the pages from HTML and other coding, the resulting corpus contained approximately 1.7 million words of questions and answers.

4.2 Within-text type experiments

For each of the four text types, we compare the buffer types Prefix and Buffer15. In each experiment, we use 1.5 million words from the corpus to train the algorithm and 100,000 words to test it. The results are in Table 1.

4.3 Across-text type experiments

We investigate the importance of text type differences for text prediction with a series of experiments in which we train and test our algorithm on texts of different text types. We keep the sizes of the train and test sets the same: 1.5 million words and 100,000 words respectively. The results are in Table 2.

4.4 Discussion of the results

Table 1 shows that for all text types, the buffer of 15 characters that crosses word borders gives better results than the prefix of the current word only. We get a relative improvement of 35% (for FAQ) to 62% (for Speech) for Buffer15 compared to Prefix-only.

Table 2 shows that text type differences have an influence on text prediction quality: all across-text type experiments lead to lower results than the within-text type experiments. From the results in Table 2, we can deduce that of the four text types, speech and Twitter language resemble each other more than they resemble the other two, and Wikipedia and FAQ resemble each other more. Twitter and Wikipedia data are the least similar: training on Wikipedia data makes the text prediction score for Twitter data drop from 29.2 to 16.5%.2

2 Note that the results are not symmetric. For example, training on Wikipedia and testing on Twitter gives a different result from training on Twitter and testing on Wikipedia. This is due to the size and domain of the vocabularies in both data sets and the richness of the contexts (in order for the algorithm to predict a word, it has to have seen it in the train set). If the test set has a larger vocabulary than the train set, a lower proportion of words can be predicted than when it is the other way around.
Table 1: Results from the within-text type experiments in terms of percentages of saved keystrokes. "Prefix" means: use the previous characters of the current word as features. "Buffer15" means: use a buffer of the previous 15 characters as features.

            Prefix   Buffer15
Wikipedia   22.2%    30.5%
Twitter     21.3%    29.2%
Speech      20.7%    33.4%
FAQ         20.2%    27.2%

Table 2: Results from the across-text type experiments in terms of percentages of saved keystrokes, using the best-scoring configuration from the within-text type experiments: a buffer of 15 characters.

Trained on   Tested on Wikipedia   Tested on Twitter   Tested on Speech   Tested on FAQ
Wikipedia    30.5%                 16.5%               22.3%              24.9%
Twitter      17.9%                 29.2%               27.9%              20.7%
Speech       19.7%                 22.5%               33.4%              21.0%
FAQ          22.6%                 18.2%               22.9%              27.2%
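The relative improvements quoted in the discussion of Table 1 can be recomputed directly from the published figures (a quick check; rounding makes the Speech figure come out at 61 rather than the reported 62):

```python
# Table 1 values: text type -> (Prefix score, Buffer15 score), in %.
table1 = {"Wikipedia": (22.2, 30.5), "Twitter": (21.3, 29.2),
          "Speech": (20.7, 33.4), "FAQ": (20.2, 27.2)}

def relative_improvement(prefix, buffer15):
    """Relative gain of Buffer15 over Prefix, rounded to whole percent."""
    return round(100 * (buffer15 - prefix) / prefix)

improvements = {t: relative_improvement(p, b) for t, (p, b) in table1.items()}
# Smallest gain for FAQ (~35%), largest for Speech (~62% in the paper).
```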
5 Case study: questions about neurological issues

Online care platforms aim to bring together patients and experts. Through this medium, patients can find information about their illness, and get in contact with fellow sufferers. Patients who suffer from neurological damage may have communicative disabilities because their speaking and writing skills are impaired. For these patients, existing online care platforms are often not easily accessible. Aphasia, for example, hampers the exchange of information because the patient has problems with word finding.

In the project "Communicatie en revalidatie DigiPoli" (ComPoli), language and speech technologies are implemented in the infrastructure of an existing online care platform in order to facilitate communication for patients suffering from neurological damage. Part of the online care platform is a list of frequently asked questions about neurological diseases with answers. A user can browse through the questions using a chat-by-click interface (Geuze et al., 2008). Besides reading the listed questions and answers, the user has the option to submit a question that is not yet included in the list. The newly submitted questions are sent to an expert who answers them and adds both question and answer to the chat-by-click database. In typing the question to be submitted, the user will be supported by a text prediction application.

The aim of this section is to find the best training corpus for newly formulated questions in the neurological domain. We realize that questions formulated by users of a web interface are different from questions formulated by experts for the purpose of a FAQ list. Therefore, we plan to gather real user data once we have a first version of the user interface running online. For developing the text prediction algorithm that is behind the initial version of the application, we aim to find the best training corpus using the questions from the chat-by-click data as training set.

5.1 Data

The chat-by-click data set on neurological issues consists of 639 questions with corresponding answers. A small sample of the data (translated to English) is shown in Table 3. In order to create the test data for our experiments, we removed duplicate questions from the chat-by-click data, leaving a set of 359 questions.3

3 Some questions and answers are repeated several times in the chat-by-click data because they are located at different places in the chat-by-click hierarchy.
Table 3: A sample of the neuro-QA data, translated to English.

question 0 505: Can (P)LS be cured?
answer 0 505: Unfortunately, a real cure is not possible. However, things can be done to combat the effects of the diseases, mainly relieving symptoms such as stiffness and spasticity. The physical therapist and rehabilitation specialist can play a major role in symptom relief. Moreover, there are medications that can reduce spasticity.
question 0 508: How is (P)LS diagnosed?
answer 0 508: The diagnosis PLS is difficult to establish, especially because the symptoms strongly resemble HSP symptoms (Strumpell's disease). Apart from blood and muscle research, several neurological examinations will be carried out.

Table 4: Results for the neuro-QA questions in terms of percentages of saved keystrokes, using different training sets. The text prediction configuration used in all settings is Buffer15. The test samples are 359 questions with an average length of 7.5 words. The percentages of saved keystrokes are means over the 359 questions.

Training corpus                      # words      Mean % of saved keystrokes in neuro-QA questions (stdev)   OOV-rate
Twitter                              1.5 million  13.3% (12.5)                                               28.5%
Speech                               1.5 million  14.1% (13.2)                                               26.6%
Wikipedia                            1.5 million  16.1% (13.1)                                               19.4%
FAQ                                  1.5 million  19.4% (15.6)                                               20.0%
Medical Wikipedia                    1.5 million  28.1% (16.5)                                               7.0%
Neuro-QA questions (leave-one-out)   2,672        26.5% (19.9)                                               17.8%
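The out-of-vocabulary rate reported in Table 4 is a simple vocabulary check. A sketch of the computation (assuming whitespace-tokenized word lists):

```python
def oov_rate(test_words, train_words):
    """Out-of-vocabulary rate: percentage of test tokens whose word form
    does not occur anywhere in the training corpus."""
    vocab = set(train_words)
    unseen = [w for w in test_words if w not in vocab]
    return 100 * len(unseen) / len(test_words)
```

An unseen word can never be suggested by the prediction algorithm, which is why the OOV-rate column in Table 4 tracks the ranking of the training corpora so closely.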
complete test corpus. In the reality of our case study, however, users will type only brief fragments of text: the length of the question they want to submit. This means that there is potentially a large deviation in the effectiveness of the text prediction algorithm per user, depending on the content of the small text they are typing. Therefore, we decided to evaluate our training corpora separately on each of the 359 unique questions, so that we can report both mean and standard deviation of the text prediction scores on small (realistically sized) samples. The average number of words per question is 7.5; the total size of the neuro-QA corpus is 2,672 words.

5.2 Experiments

We aim to find the training set that gives the best text prediction result for the neuro-QA questions. We compare the following training corpora:

- The corpora that we compared in the text type experiments: Wikipedia, Twitter, Speech and FAQ, 1.5 Million words per corpus;
- A 1.5 Million words training corpus that is of the same topic domain as the target data: Wikipedia articles from the medical domain;
- The 359 questions from the neuro-QA data themselves, evaluated in a leave-one-out setting (359 times training on 358 questions and evaluating on the remaining question).

In order to create the medical Wikipedia corpus, we consulted the category structure of the Wikipedia corpus. The Wikipedia category Geneeskunde (Medicine) contains 69,898 pages, and in the deeper nodes of the hierarchy we see many non-medical pages, such as trappist beers (ordered under beer, booze, alcohol, Psychoactive drug, drug, and then medicine). If we remove all pages that are more than five levels below the Geneeskunde category root, 21,071 pages are left, which contain well over the 1.5 Million words that we need. We used the first 1.5 Million words of the corpus in our experiments.

The text prediction results for the different corpora are in Table 4. For each corpus, the out-of-vocabulary rate is given: the percentage of words in the Neuro-QA questions that do not occur in the corpus.[4]

5.3 Discussion of the results

We measured the statistical significance of the mean differences between all text prediction scores using a Wilcoxon Signed Rank test on paired results for the 359 questions. We found that

[4] The OOV-rate for the Neuro-QA corpus itself is the average of the OOV-rate of each leave-one-out experiment: the proportion of words that only occur in one question.
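The evaluation metric used throughout — the percentage of saved keystrokes under word completion — can be sketched as follows. This is a simplified stand-in for the paper's actual Buffer15 configuration: it assumes a completer that proposes one candidate per typed prefix (here a hypothetical unigram-frequency predictor that ignores preceding context, and spaces are not counted as keystrokes).

```python
from collections import Counter

def percent_saved_keystrokes(question, predict):
    """Simulate typing `question` with word completion.

    After each typed prefix of the current word, `predict(prefix)` proposes
    a completion; as soon as the proposal matches the intended word, the
    remaining letters count as saved keystrokes.
    """
    total = saved = 0
    for word in question.split():
        total += len(word)
        for i in range(len(word) + 1):
            if predict(word[:i]) == word:
                saved += len(word) - i
                break
    return 100.0 * saved / total

def make_unigram_predictor(corpus_words):
    """Illustrative helper: always proposes the most frequent corpus word
    that starts with the typed prefix."""
    counts = Counter(corpus_words)
    def predict(prefix):
        candidates = [w for w in counts if w.startswith(prefix)]
        return max(candidates, key=lambda w: counts[w]) if candidates else None
    return predict
```

With a toy corpus, a question whose content words are well covered by the training data yields a high savings percentage, while out-of-vocabulary words contribute nothing — mirroring the OOV-rate column of Table 4.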
[Figure 3 plots the empirical CDFs of text prediction scores (x-axis: percentage of saved keystrokes, 0-60; y-axis: cumulative percent of test corpus, 0.0-1.0) for six training corpora: Twitter, Speech, Wikipedia, FAQ, Neuro-QA (leave-one-out), and Medical Wikipedia.]
Figure 3: Empirical CDFs for text prediction scores on Neuro-QA data. Note that the curves that are at
the bottom-right side represent the better-performing settings.
the difference between the Twitter and Speech corpora on the task is not significant (P = 0.18). The difference between Neuro-QA and Medical Wikipedia is significant with P = 0.02; all other differences are significant with P < 0.01.

The Medical Wikipedia corpus and the leave-one-out experiments on the Neuro-QA data give better text prediction scores than the other corpora. The Medical Wikipedia even scores slightly better than the Neuro-QA data itself. Twitter and Speech are the least-suited training corpora for the Neuro-QA questions, and FAQ data gives somewhat better results than a general Wikipedia corpus.

These results suggest that both text type and topic domain play a role in text prediction quality, but the high scores for the Medical Wikipedia corpus show that topic domain is even more important than text type.[5] The column OOV-rate shows that this is probably due to the high coverage of terms in the Neuro-QA data by the Medical Wikipedia corpus.

Table 4 also shows that the standard deviation among the 359 samples is relatively large. For some questions, 0% of the keystrokes are saved, while for others, scores of over 80% are obtained (by the Neuro-QA and Medical Wikipedia training corpora). We further analyzed the differences between the training sets by plotting the Empirical Cumulative Distribution Function (ECDF) for each experiment. An ECDF shows the development of text prediction scores (shown on the X-axis) by walking through the test set in 359 steps (shown on the Y-axis).

The ECDFs for our training corpora are in Figure 3. Note that the curves that are at the bottom-right side represent the better-performing settings (they get to a higher maximum after having seen a smaller portion of the samples). From Figure 3, it is again clear that the Neuro-QA and Medical Wikipedia corpora outperform the other training corpora, and that of the other four, FAQ is the best-performing corpus. Figure 3 also shows a large difference in the sizes of the starting percentiles: the proportion of samples with a text prediction

[5] We should note here that we did not control for domain differences between the four different text types. They are intended to be general domain, but Wikipedia articles will naturally be of different topics than conversational speech.
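The ECDF construction behind Figure 3 is compact; a minimal sketch (function name ours):

```python
def ecdf(scores):
    """Empirical CDF of per-question prediction scores: returns the
    sorted scores paired with the cumulative fraction of questions
    at or below each score."""
    xs = sorted(scores)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]
```

Plotting these (x, y) pairs for each training corpus yields curves like those in Figure 3: a corpus with many zero-score questions jumps up immediately at x = 0, while a better-performing corpus stays toward the bottom-right for longer.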
[Figures 4 and 5 show histograms (y-axis: frequency, 0-80; x-axis: text prediction score, 0-80%) for the two best-performing training corpora.]

Figure 4: Histogram of text prediction scores for the Neuro-QA questions trained on Medical Wikipedia. Each bin represents 36 questions.

Figure 5: Histogram of text prediction scores for leave-one-out experiments on Neuro-QA questions. Each bin represents 36 questions.
score of 0% is less than 10% for the Medical Wikipedia up to more than 30% for Speech.

We inspected the questions that get a text prediction score of 0%. We see many medical terms in these questions, and many of the utterances are not even questions, but multi-word terms representing topical headers in the chat-by-click data. Seven samples get a zero-score in the output of all six training corpora, e.g.:

    glycogenose III.
    potassium-aggrevated myotonias.

26 samples get a zero-score in the output of all training corpora except for Medical Wikipedia and Neuro-QA itself. These are mainly short headings with domain-specific terms such as:

    idiopatische neuralgische amyotrofie.
    Markesbery-Griggs distale myopathie.
    oculopharyngeale spierdystrofie.

Interestingly, the ECDFs show that the Medical Wikipedia and Neuro-QA corpora cross at around percentile 70 (around the point of 40% saved keystrokes). This indicates that although the means of the two result samples are close to each other, the distribution of the scores for the individual questions is different. The histograms of both distributions (Figures 4 and 5) confirm this: the algorithm trained on the Medical Wikipedia corpus leads to a larger number of samples with scores around the mean, while the leave-one-out experiments lead to a larger number of samples with low prediction scores and a larger number of samples with high prediction scores. This is also reflected by the higher standard deviation for Neuro-QA than for Medical Wikipedia.

Since both the leave-one-out training on the Neuro-QA questions and the Medical Wikipedia led to good results but behave differently for different portions of the test data, we also evaluated a combination of both corpora on our test set: we created training corpora consisting of the Medical Wikipedia corpus, complemented by 90% of the Neuro-QA questions, testing on the remaining 10% of the Neuro-QA questions. This led to a mean percentage of saved keystrokes of 28.6%, not significantly higher than just the Medical Wikipedia corpus.
gave percentages of saved keystrokes between 27 and 34%; training on a different text type caused the scores to drop to percentages between 16 and 28%.

In our case study, we compared a number of training corpora for a specific data set for which training data is sparse: questions about neurological issues. We found significant differences between the text prediction scores obtained with the six training corpora: the Twitter and Speech corpora were the least suited, followed by the Wikipedia and FAQ corpora. The highest scores were obtained by training the algorithm on the medical pages from Wikipedia, immediately followed by leave-one-out experiments on the 359 neurological questions. The large differences in lexical coverage of the medical domain played a central role in the scores for the different training corpora.

Because we obtained good results with both the Medical Wikipedia corpus and the neuro-QA questions themselves, we opted for a combination of both data types as training corpus in the initial version of the online text prediction application. Currently, a demonstration version of the application is running for ComPoli users. We hope to collect questions from these users to re-train our algorithm with more representative examples.

Acknowledgments

This work is part of the research programme Communicatie en revalidatie digiPoli (ComPoli), which is funded by ZonMW, the Netherlands organisation for health research and development.

References

J. Carlberger. 1997. Design and Implementation of a Probabilistic Word Prediction Program. Master thesis, Royal Institute of Technology (KTH), Sweden.

N. Garay-Vitoria and J. Abascal. 2006. Text prediction systems: a survey. Universal Access in the Information Society, 4(3):188-203.

J. Geuze, P. Desain, and J. Ringelberg. 2008. Re-phrase: chat-by-click: a fundamental new mode of human communication over the internet. In CHI '08 Extended Abstracts on Human Factors in Computing Systems, pages 3345-3350. ACM.

G.W. Lesher, B.J. Moulton, D.J. Higginbotham, et al. 1999. Effects of ngram order and training text size on word prediction. In Proceedings of the RESNA '99 Annual Conference, pages 52-54.

Johannes Matiasek, Marco Baroni, and Harald Trost. 2002. FASTY - A Multi-lingual Approach to Text Prediction. In Klaus Miesenberger, Joachim Klaus, and Wolfgang Zagler, editors, Computers Helping People with Special Needs, volume 2398 of Lecture Notes in Computer Science, pages 165-176. Springer Berlin / Heidelberg.

N. Oostdijk. 2000. The spoken Dutch corpus: overview and first evaluation. In Proceedings of LREC-2000, Athens, volume 2, pages 887-894.

Erik Tjong Kim Sang. 2011. Het gebruik van Twitter voor Taalkundig Onderzoek [The use of Twitter for linguistic research]. In TABU: Bulletin voor Taalwetenschap, volume 39, pages 62-72. In Dutch.

A. Van den Bosch and T. Bogers. 2008. Efficient context-sensitive word completion for mobile devices. In Proceedings of the 10th International Conference on Human Computer Interaction with Mobile Devices and Services, pages 465-470. ACM.

A. Van den Bosch. 2011. Effects of context and recency in scaled word completion. Computational Linguistics in the Netherlands Journal, 1:79-94, 12/2011.

G. Van Noord. 2009. Huge parsed corpora in LASSY. In Proceedings of the 7th International Workshop on Treebanks and Linguistic Theories (TLT7).

S. Westman and L. Freund. 2010. Information Interaction in 140 Characters or Less: Genres on Twitter. In Proceedings of the Third Symposium on Information Interaction in Context (IIiX), pages 323-328. ACM.
The Impact of Spelling Errors on Patent Search

Bauhaus-Universität Weimar
99421 Weimar, Germany
<first name>.<last name>@uni-weimar.de
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 570-579, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
Table 1: User groups and patent-search-related retrieval tasks in the patent domain (Hunt et al., 2007).

User groups: Analyst, Attorney, Manager, Inventor, Investor, Researcher.
Patent search tasks: Patentability, State of the art, Infringement, Opposition, Due diligence, Portfolio.
task, where q is expanded by the disjunction of assignee names Aq with Aq = {a ∈ A | a contains q}. While the trivial expansion of q by the entire set A ensures maximum recall but entails an unacceptable precision, the expansion of q by the empty set yields a reasonable baseline. The latter approach is implemented in patent search engines such as PatBase or FreePatentsOnline, which return all patents where the company name q occurs as a substring of the assignee name a. This baseline is simple but reasonable; due to trademark law, a company name q must be a unique identifier (i.e. a key), and an assignee name a that contains q can be considered as relevant. It should be noted in this regard that |q| < |a| holds for most elements in Aq, since the assignee names often contain company suffixes such as "Ltd" or "Inc".

Our hypothesis is that due to misspelled assignee names a substantial fraction of relevant patents cannot be found by the baseline approach. In this regard, the types of spelling errors in assignee names given in Table 2 should be considered.

Table 2: Types of spelling errors with increasing problem complexity according to Stein and Curatolo (2006). The first row refers to lexical errors, whereas the last two rows refer to phonological errors. For each type, an example is given, where a misspelled company name is followed by the correctly spelled variant.

Spelling error type                Example
Permutations or dropped letters    Whirpool Corporation / Whirlpool Corporation
Misremembering spelling details    Whetherford International / Weatherford International
Spelling out the pronunciation     Emulecks Corporation / Emulex Corporation

In order to raise the recall for portfolio search, an approach more sophisticated than the standard retrieval approach, which is the expansion of q by the empty set, is needed. Such an approach must strive for an expansion of q by a subset of Aq, whereby this subset should be as large as possible.

1.1 Contributions

The paper provides a new solution to the problem outlined. This solution employs machine learning on orthographic features, as well as on patent meta features, to reliably detect spelling errors. It consists of two steps: (1) the computation of A+q, the set of assignee names that are in a certain edit distance neighborhood to q; and (2) the filtering of A+q, yielding the set Âq, which contains those assignee names from A+q that are classified as misspellings of q. The power of our approach can be seen from Table 3, which also shows a key result of our research; a retrieval system that exploits our classifier will miss only 0.5% of the relevant patents, while retrieval precision is compromised by only 3.7%.

Another contribution relates to a new, manually-labeled corpus comprising spelling errors in the assignee field of patents (cf. Section 3). In this regard, we consider the over 2 million patents granted by the USPTO between 2001 and 2010. Last, we analyze indications of deliberately inserted spelling errors (cf. Section 4).

Table 3: Mean average Precision, Recall, and F-Measure (β = 2) for different expansion sets for q in a portfolio search task, which is conducted on our test corpus (cf. Section 3).

Expansion set for q      Precision   Recall   F2
∅ (baseline)             0.993       0.967    0.968
Âq (machine learning)    0.956       0.995    0.980
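Table 3 reports the F-Measure with β = 2, which weights recall four times as heavily as precision — fitting for portfolio search, where missing a relevant patent is costlier than returning a spurious one. A minimal sketch of the Fβ computation (note that Table 3's values are macro-averaged per query, so they cannot be reproduced from the averaged precision and recall alone):

```python
def f_beta(precision, recall, beta=2.0):
    """F-measure: F_beta = (1 + b^2) * P * R / (b^2 * P + R)."""
    b2 = beta * beta
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For β = 2, a recall-heavy system (P = 0.2, R = 0.8) scores higher than its precision-heavy mirror image (P = 0.8, R = 0.2), which is exactly why the machine-learning expansion wins in Table 3 despite its slightly lower precision.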
1.2 Causes for Inconsistencies in Patents

We identify the following six factors for inconsistencies in the bibliographic fields of patents, in particular for assignee names: (1) Misspellings are introduced due to the lack of knowledge, the lack of attention, and due to spelling disabilities. Intellevate Inc. (2006) reports that 98% of a sample of patents taken from the USPTO database contain errors, most of which are spelling errors. (2) Spelling errors are only removed by the USPTO upon request (U.S. Patent & Trademark Office, 2010). (3) Spelling variations of inventor names are permitted by the USPTO. The Manual of Patent Examining Procedure (MPEP) states in paragraph 605.04(b) that if the applicant's full name is "John Paul Doe", either "John P. Doe" or "J. Paul Doe" is acceptable. Thus, it is valid to introduce many different variations: with and without initials, with and without a middle name, or with and without suffixes. This convention applies to assignee names, too. (4) Companies often have branches in different countries, where each branch has its own company suffix, e.g., "Limited" (United States), "GmbH" (Germany), or "Kabushiki Kaisha" (Japan). Moreover, the usage of punctuation varies along company suffix abbreviations: "L.L.C." in contrast to "LLC", for example. (5) Indexing errors emerge from OCR processing of patent applications, because similar looking letters such as "e" versus "c" or "l" versus "I" are likely to be misinterpreted. (6) With the advent of electronic patent application filing, the number of patent reexamination steps was reduced. As a consequence, the chance of undetected spelling errors increases (Adams, 2010). All of the mentioned factors add up to a highly inconsistent USPTO corpus.

2 Related Work

Information within a corpus can only be retrieved effectively if the data is both accurate and unique (Müller and Freytag, 2003). In order to yield data that is accurate and unique, approaches to data cleansing can be utilized to identify and remove inconsistencies. Müller and Freytag (2003) classify inconsistencies, where duplicates of entities in a corpus are part of a semantic anomaly. These duplicates exist in a database if two or more different tuples refer to the same entity. With respect to the bibliographic fields of patents, the assignee names "Howlett-Packard" and "Hewett-Packard" are distinct but refer to the same company. These kinds of near-duplicates impede the identification of duplicates (Naumann and Herschel, 2010).

Near-duplicate Detection   The problem of identifying near-duplicates is also known as record linkage, or name matching; it is subject of active research (Elmagarmid et al., 2007). With respect to text documents, slightly modified passages in these documents can be identified using fingerprints (Potthast and Stein, 2008). On the other hand, for data fields which contain natural language such as the assignee name field, string similarity metrics (Cohen et al., 2003) as well as spelling correction technology are exploited (Damerau, 1964; Monge and Elkan, 1997). String similarity metrics compute a numeric value to capture the similarity of two strings. Spelling correction algorithms, by contrast, capture the likelihood of a given word being a misspelling of another word. In our analysis, the similarity metric SoftTfIdf is applied, which performs best in name matching tasks (Cohen et al., 2003), as well as the complete range of spelling correction algorithms shown in Figure 1: Soundex, which relies on similarity hashing (Knuth, 1997), the Levenshtein distance, which gives the minimum number of edits needed to transform a word into another word (Levenshtein, 1966), and SmartSpell, a phonetic production approach that computes the likelihood of a misspelling (Stein and Curatolo, 2006). In order to combine the strength of multiple metrics within a near-duplicate detection task, several authors resort to machine learning (Bilenko and Mooney, 2002; Cohen et al., 2003). Christen (2006) concludes that it is important to exploit all kinds of knowledge about the type of data in question, and that inconsistencies are domain-specific. Hence, an effective near-duplicate detection approach should employ domain-specific heuristics and algorithms (Müller and Freytag, 2003). Following this argumentation, we augment various word similarity assessments with patent-specific meta-features.

Patent Search   Commercial patent search engines, such as PatBase and FreePatentsOnline, handle near-duplicates in assignee names as follows. For queries which contain a company name followed by a wildcard operator, PatBase suggests
[Figure 1: Classification of spelling correction methods according to Stein and Curatolo (2006). Single word spelling correction comprises: near similarity hashing (collision-based, neighborhood-based); editing (trigram-based, edit-distance-based, rule-based); heuristic search; phonetic production approach; and hidden Markov models.]

a set of additional companies (near-duplicates), which can be considered alongside the company name in question. These suggestions are solely retrieved based on a trailing wildcard query. Each additional company name can then be marked individually by a user to expand the original query. In case the entire set of suggestions is considered, this strategy conforms to the expansion of a query by the empty set, which equals a reasonable baseline approach. This query expansion strategy, however, has the following drawbacks: (1) The strategy captures only inconsistencies that succeed the given company name in the original query. Thus, near-duplicates which contain spelling errors in the company name itself are not found. Even if PatBase would support left trailing wildcards, then only the full combination of wildcard expressions would cover all possible cases of misspellings. (2) Given an acronym of a company such as "IBM", it is infeasible to expand the abbreviation to "International Business Machines" without considering domain knowledge.

Query Expansion Methods for Patent Search   To date, various studies have investigated query expansion techniques in the patent domain that focus on prior-art search and invalidity search (Magdy and Jones, 2011). Since we are dealing with queries that comprise only a company name, existing methods cannot be applied. Instead, the near-duplicate task in question is more related to the text reuse detection task discussed by Hagen and Stein (2011); given a document, passages which also appear identical or slightly modified in other documents have to be retrieved by using standard keyword-based search engines. Their approach is guided by the user-over-ranking hypothesis introduced by Stein and Hagen (2011). It states that the best retrieval performance can be achieved with queries returning about as many results as can be considered at user site. If we make use of their terminology, then we can distinguish the query expansion sets (cf. Table 3) into two categories: (1) The trivial as well as the edit distance expansion sets are underspecific, i.e., users cannot cope with the large amount of irrelevant patents returned; the precision is close to zero. (2) The baseline approach, by contrast, is overspecific; it returns too few documents, i.e., the achieved recall is not optimal. As a consequence, these query expansion sets are not suitable for portfolio search. Our approach, on the other hand, excels in both precision and recall.

Query Spelling Correction   Queries which are submitted to standard web search engines differ from queries which are posed to patent search engines with respect to both length and language diversity. Hence, research in the field of web search is concerned with suggesting reasonable alternatives to misspelled queries rather than correcting single words (Li et al., 2011). Since standard spelling correction dictionaries (e.g. ASpell) are not able to capture the rich language used in web queries, large-scale knowledge sources such as Wikipedia (Li et al., 2011), query logs (Chen et al., 2007), and large n-gram corpora (Brants et al., 2007) are employed. It should be noted that the set of correctly written assignee names is unknown for the USPTO patent corpus. Moreover, spelling errors are modeled on the basis of language models (Li et al., 2011). Okuno (2011) proposes a generative model to encounter spelling errors, where the original query is expanded based on alternatives produced by a small edit distance to the original query. This strategy correlates to the trivial query expansion set (cf. Section 1). Unlike using a small edit distance, we allow a reasonably high edit distance to maximize the recall.

Trademark Search   Trademark search is about identifying registered trademarks which are similar to a new trademark application. Similarities between trademarks are assessed based on figurative and verbal criteria. In the former case, the focus is on image-based retrieval techniques. Trademarks are considered verbally similar for a variety of reasons, such as pronunciation, spelling, and conceptual closeness, e.g., swapping letters or using numbers for words. The verbal similarity of trademarks, on the other hand, can be determined by using techniques comparable to near-duplicate detection: phonological parsing,
fuzzy search, and edit distance computation (Fall and Giraud-Carrier, 2005).

3 Detection of Spelling Errors

This section presents our machine learning approach to expand a company query q; the classifier c delivers the set Âq = {a ∈ A | c(q, a) = 1}, an approximation of the ideal set of relevant assignee names Aq. As classification technology a support vector machine with linear kernel is used, which receives each pair (q, a) as a six-dimensional feature vector. For training and test purposes we identified misspellings for 100 different company names. A detailed description of the constructed test corpus and a report on the classifier's performance is given in the remainder of this section.

3.1 Feature Set

The feature set comprises six features, three of them being orthographic similarity metrics, which are computed for every pair (q, a). Each metric compares a given company name q with the first |q| words of the assignee name a:

1. SoftTfIdf. The SoftTfIdf metric is considered, since the metric is suitable for the comparison of names (Cohen et al., 2003). The metric incorporates the Jaro-Winkler metric (Winkler, 1999) with a distance threshold of 0.9. The frequency values for the similarity computation are trained on A.

2. Soundex. The Soundex spelling correction algorithm captures phonetic errors. Since the algorithm computes hash values for both q and a, the feature is 1 if these hash values are equal, 0 otherwise.

3. Levenshtein distance. The Levenshtein distance for (q, a) is normalized by the character length of q.

To obtain further evidence for a misspelling in an assignee name, meta information about the patents in D, to which the assignee name refers, is exploited. In this regard, the following three features are derived:

1. Assignee Name Frequency. The number of patents filed under an assignee name a: FFreq(a) = Freq(a, D). We assume that the probability of a misspelling to occur multiple times is low, and thus an assignee name with a misspelled company name has a low frequency.

2. IPC Overlap. The IPC codes of a patent specify the technological areas it applies to. We assume that patents filed under the same company name are likely to share the same set of IPC codes, regardless of whether the company name is misspelled or not. Hence, if we determine the IPC codes of patents which contain q in the assignee name, IPC(q), and the IPC codes of patents filed under assignee name a, IPC(a), then the intersection size of the two sets serves as an indicator for a misspelled company name in a:

    FIPC(q, a) = |IPC(q) ∩ IPC(a)| / |IPC(q) ∪ IPC(a)|

3. Company Suffix Match. The suffix match relies on the company suffixes Suffixes(q) that occur in the assignee names of A containing q. Similar to the IPC overlap feature, we argue that if the company suffix of a exists in the set Suffixes(q), a misspelling in a is likely: FSuffixes(q, a) = 1 iff Suffixes(a) ⊆ Suffixes(q).

3.2 Webis Patent Retrieval Assignee Corpus

A key contribution of our work is a new corpus called the Webis Patent Retrieval Assignee Corpus 2012 (Webis-PRA-12). We compiled the corpus in order to assess the impact of misspelled companies on patent retrieval and the effectiveness of our classifier to detect them.[3] The corpus is built on the basis of the 2,132,825 patents D granted by the USPTO between 2001 and 2010; the patent corpus is provided publicly by the USPTO in XML format. Each patent contains bibliographic fields as well as textual information such as the abstract and the claims section. Since we are interested in the assignee name a associated with each patent d ∈ D, we parse each patent and extract the assignee name. This yields the set A of 202,846 different assignee names. Each assignee name refers to a set of patents, whose size varies from 1 to 37,202 (the number of patents filed under "International Business Machines Corporation"). It should be noted that for a portfolio

[3] The Webis-PRA-12 corpus is freely available via www.webis.de/research/corpora
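Three of the six features follow directly from their definitions above; a sketch. Two caveats: the paper does not name its exact Soundex variant (the classic American Soundex is assumed here), and the union in the IPC overlap denominator is reconstructed from the formula as printed.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def f_levenshtein(q, a):
    """Feature: edit distance normalized by the character length of q."""
    return levenshtein(q, a) / len(q)

def soundex(name):
    """Classic American Soundex hash (assumed variant)."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return "0000"
    out, prev = letters[0].upper(), codes.get(letters[0], "")
    for c in letters[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            out += code
        if c not in "hw":          # h and w do not separate double codes
            prev = code
    return (out + "000")[:4]

def f_soundex(q, a):
    """Feature: 1 if the Soundex hashes of q and a agree, else 0."""
    return int(soundex(q) == soundex(a))

def f_ipc(ipc_q, ipc_a):
    """Feature: overlap of the IPC code sets (Jaccard-style)."""
    union = ipc_q | ipc_a
    return len(ipc_q & ipc_a) / len(union) if union else 0.0
```

Note that the Soundex feature fires on the paper's own phonological example: "Emulecks" and "Emulex" hash identically, while the dropped-letter pair "Whirpool"/"Whirlpool" is instead caught by a small normalized edit distance.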
Table 4: Statistics of spelling errors for the 100 companies in the Webis-PRA-12 corpus. Considered are the number of words and the number of letters in the company names, as well as the number of different company suffixes that are used together with a company name (denoted as variants of q).

                                 Total   Num. of words in q   Num. of letters in q   Num. of variants of q
                                         1     2     3-4      2-10   11-15   16-35   1-5    6-15   16-96
Number of companies in Q         100     36    53    11       30     35      35      45     32     23
Avg. num. of misspellings in A   3.79    2.13  3.75  9.36     1.16   2.94    6.88    0.91   3.81   9.39
search task the number of patents which refer to signee names A+ q \ Aq form the set of negative
an assignee name matters for the computation of examples (12 651 in total).
precision and recall. If we, however, isolate the During the manual assessment, names of as-
task of detecting misspelled company names, then signees which include the correct company name
it is also reasonable to weight each assignee name q were distinguished from misspelled ones. The
equally and independently from the number of latter holds true for 379 of the 1 538 assignee
patents it refers to. Both scenarios are addressed names. These names are not retrievable by the
in the experiments. baseline system, and thus form the main target for
Given A, the corpus construction task is to map our classifier. The second row of Table 4 reports
each assignee name a A to the company name on the distribution of the 379 misspelled assignee
q it refers to. This gives for each company name names. As expectable, the longer the company
q the set of relevant assignee names Aq . For our name, the more spelling errors occur. Compa-
corpus, we do not construct Aq for all company nies which file patents under many different as-
names but take a selection of 100 company names signee names are likelier to have patents with mis-
from the 2011 Fortune 500 ranking as our set of spellings in the company name.
company names Q. Since the Fortune 500 rank-
ing contains only large companies, the test corpus may appear to be biased towards these companies. However, rather than the company size, the structural properties of a company name are determinative; our sample includes short, medium, and long company names, as well as company names with few, medium, and many different company suffixes. Table 4 shows the distribution of company names in Q along these criteria in the first row.

For each company name q ∈ Q, we apply a semi-automated procedure to derive the set of relevant assignee names Aq. In a first step, all assignee names in A which do not refer to the company name q are filtered automatically. From a preliminary evaluation we concluded that the Levenshtein distance d(q, a) with a relative threshold of |q|/2 is a reasonable choice for this filtering step. The resulting sets A+q = {a ∈ A | d(q, a) ≤ |q|/2} contain, in total over Q, 14 189 assignee names. These assignee names are annotated by human assessors within a second step to derive the final set Aq for each q ∈ Q. Altogether we identify 1 538 assignee names that refer to the 100 companies in Q. With respect to our classification task, the assignee names in each Aq are positive examples; the remaining assignee names serve as negative examples.

3.3 Classifier Performance

For the evaluation with the Webis-PRA-12 corpus, we train a support vector machine,4 which considers the six outlined features, and compare it to the other expansion techniques. For the training phase, we use 2/3 of the positive examples to form a balanced training set of 1 025 positive and 1 025 negative examples. After 10-fold cross-validation, the achieved classification accuracy is 95.97%.

For a comparison of the expansion techniques on the test set, which contains the examples not considered in the training phase, two tasks are distinguished: finding near duplicates in assignee names (cf. Table 5, Columns 3–5), and finding all patents of a company (cf. Table 5, Columns 6–8). The latter refers to the actual task of portfolio search. It can be observed that the performance improvements on both tasks are very similar. The baseline expansion yields a recall of 0.83 in the first task. The difference of 0.17 to a perfect recall can be addressed by considering query expansion techniques. If the trivial expansion A is applied to the task, the maximum recall can be achieved, which, however,

4 We use the implementation of the WEKA toolkit with default parameters.
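The automatic filtering step described above — keeping an assignee name a for query q only when the Levenshtein distance d(q, a) stays within the relative threshold |q|/2 — can be sketched as follows. This is a minimal illustration; the helper names and example strings are ours, not from the Webis-PRA-12 corpus.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance (Levenshtein, 1966) via dynamic programming."""
    if len(s) < len(t):
        s, t = t, s
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (cs != ct),  # substitution
            ))
        previous = current
    return previous[-1]


def candidate_set(q: str, assignees: list[str]) -> list[str]:
    """A+q = {a in A | d(q, a) <= |q|/2}: automatically filtered candidates."""
    return [a for a in assignees if levenshtein(q, a) <= len(q) / 2]
```

For example, `candidate_set("Siemens", ["Semens", "Siemens AG", "Sony"])` keeps the first two names (distances 1 and 3, threshold 3.5) but drops "Sony".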
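The training-set construction in Section 3.3 — two thirds of the positive examples plus an equal number of sampled negatives, evaluated by 10-fold cross-validation — can be sketched like this. The paper itself trains WEKA's SVM with default parameters; the splitting logic below is an illustrative reconstruction, not the authors' code.

```python
import random


def balanced_training_set(positives: list, negatives: list, seed: int = 0):
    """Take 2/3 of the positives and equally many randomly sampled negatives."""
    rng = random.Random(seed)
    pos_train = positives[: (2 * len(positives)) // 3]
    neg_train = rng.sample(negatives, len(pos_train))
    data = [(x, 1) for x in pos_train] + [(x, 0) for x in neg_train]
    rng.shuffle(data)
    return data


def k_fold_splits(data: list, k: int = 10):
    """Yield (train, test) pairs for k-fold cross-validation."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

With the 1 538 positive assignee names of the corpus, the 2/3 split yields exactly the 1 025 positives (and 1 025 sampled negatives) stated above.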
Table 5: The search results (macro-averaged) for two retrieval tasks and various expansion techniques. Besides Precision and Recall, the F-Measure with β = 2 is stated.
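The macro-averaged measures of Table 5, and the matched-pair t-test used for the significance claims in this section, can be sketched with the following helper functions. They are illustrative reconstructions; any per-query numbers passed to them are invented, not taken from the corpus.

```python
import math
import statistics


def f_beta(p: float, r: float, beta: float = 2.0) -> float:
    """F-Measure; beta = 2 weights recall more strongly than precision."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


def macro_average(per_query):
    """Macro-average a list of per-query (precision, recall) pairs."""
    n = len(per_query)
    p = sum(pq for pq, _ in per_query) / n
    r = sum(rq for _, rq in per_query) / n
    return p, r, f_beta(p, r)


def matched_pair_t(xs, ys):
    """t statistic and degrees of freedom of a matched-pair t-test."""
    d = [x - y for x, y in zip(xs, ys)]
    n = len(d)
    t = statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))
    return t, n - 1
```

Macro-averaging first averages precision and recall over the queries and only then computes the F-Measure, so every company name contributes equally regardless of its number of patents.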
is bought with precision close to zero. Using the edit distance expansion A+q yields a precision of 0.274 while keeping the recall at maximum. Finally, the machine learning expansion Aq leads to a dramatic improvement (cf. Table 5, bottom lines), whereas the exploitation of patent meta-features significantly outperforms the exclusive use of orthography-related features; the increase in recall which is achieved by Aq is statistically significant (matched-pair t-test) for both tasks (assignee names task: t = 7.6856, df = 99, p = 0.00; patents task: t = 2.1113, df = 99, p = 0.037). Note that when applied as a single feature, none of the spelling metrics (Levenshtein, SoftTfIdf, Soundex) is able to achieve a recall close to 1 without significantly impairing the precision.

4 Distribution of Spelling Errors

Encouraged by the promising retrieval results achieved on the Webis-PRA-12 corpus, we extend the analysis of spelling errors in patents to the entire USPTO corpus of granted patents between 2001 and 2010. The analysis focuses on the following two research questions:

1. Are spelling errors an increasing issue in patents? According to Adams (2010), the amount of spelling errors should have increased in recent years due to the electronic patent filing process (cf. Section 1.2). We address this hypothesis by analyzing the distribution of spelling errors in company names that occur in patents granted between 2001 and 2010.

2. Are misspellings introduced deliberately in patents? We address this question by analyzing the patents with respect to the eight technological areas based on the International Patent Classification scheme IPC: A (Human necessities), B (Performing operations; transporting), C (Chemistry; metallurgy), D (Textiles; paper), E (Fixed constructions), F (Mechanical engineering; lighting; heating; weapons; blasting), G (Physics), and H (Electricity). If spelling errors are introduced accidentally, then we expect them to be uniformly distributed across all areas. A biased distribution, on the other hand, indicates that errors might be inserted deliberately.

In the following, we compile a second corpus on the basis of the entire set A of assignee names. In order to yield a uniform distribution of the companies across years, technological areas, and countries, a set of 120 assignee names is extracted for each dimension. After the removal of duplicates, we revised these assignee names manually in order to check (and correct) their spelling. Finally, trailing business suffixes are removed, which results in a set of 3 110 company names. For each company name q, we generate the set Aq as described in Section 3.

The results of our analysis are shown in Table 6. Table 6(a) refers to the first research question and shows that the amount of misspellings in companies decreased over the years from 6.67% in 2001 to 4.74% in 2010 (cf. Row 3). These results let us reject the hypothesis of Adams (2010). Nevertheless, the analysis provides evidence that spelling errors are still an issue. For example, the companies identified with the most spelling errors are Koninklijke Philips Electronics with 45 misspellings in 2008, and Centre National de la Recherche Scientifique with 28 misspellings in 2009. The results are consistent with our findings with re-
Table 6: Distribution of spelling errors for 3 110 company identifiers in the USPTO patents. The mean of spelling errors per company identifier and the standard deviation refer to companies with misspellings. The last row in each table shows the number of patents that are additionally found if the original query q is expanded by Aq.

(a) Distribution of spelling errors between the years 2001 and 2010.

Year                                   2001   2002   2003   2004   2005   2006   2007   2008   2009   2010
Number of companies                   1 028  1 066  1 115  1 151  1 219  1 261  1 274  1 210  1 224  1 268
Number of companies with misspellings    67     63     53     65     65     60     65     64     53     60
Companies with misspellings (%)        6.52   5.91   4.75   5.65   5.33   4.76   5.1    5.29   4.33   4.73
Mean                                   2.78   2.35   2.23   2.28   2.18   2.48   2.23   3.0    2.64   2.8
Standard deviation                     4.62   3.3    3.63   3.13   2.8    3.55   2.87   6.37   4.71   4.6
Maximum misspellings per company         24     12     16     12     10     18     12     45     28     22
Additional number of patents           7.1    7.21   7.43   7.68   7.91   8.48   7.83   8.84   8.92   8.92

(b) Distribution of spelling errors across the eight IPC sections.

IPC code                                  A      B      C      D      E      F      G      H
Number of companies                     954  1 231    811    277    412    771  1 232    949
Number of companies with misspellings    59     70     51      7     10     33     83     63
Companies with misspellings (%)        6.18   5.69   6.29   2.53   2.43   4.28   6.74   6.64
Mean                                   3.0    2.49   3.57   1.86   2.8    1.88   3.29   4.05
Standard deviation                     5.28   3.65   7.03   1.99   4.22   2.31   5.72   7.13
Maximum misspellings per company         32     14     40      3     12      6     24     35
Additional number of patents           9.25   9.67  11.12   4.71   4.6    4.79   8.92  12.84
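The percentage rows of Table 6 follow directly from the two count rows above them; as a quick sanity check, the row of Table 6(a) can be recomputed (counts copied from the table):

```python
# Counts from Table 6(a), years 2001-2010.
companies = [1028, 1066, 1115, 1151, 1219, 1261, 1274, 1210, 1224, 1268]
misspelled = [67, 63, 53, 65, 65, 60, 65, 64, 53, 60]

# "Companies with misspellings (%)" row, rounded to two decimals.
percentages = [round(100 * m / n, 2) for m, n in zip(misspelled, companies)]
```

This reproduces the printed row, e.g. 6.52% for 2001 and 4.73% for 2010.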
spect to the Fortune 500 sample (cf. Table 4), where company names that are longer and presumably more difficult to write contain more spelling errors.

In contrast to the uniform distribution of misspellings over the years, the situation with regard to the technological areas is different (cf. Table 6(b)). Most companies are associated with the IPC sections G and B, which both refer to technical domains (cf. Table 6(b), Row 1). The percentage of misspellings in these sections increased compared to the spelling errors grouped by year. A significant difference can be seen for the sections D and E. Here, the number of assigned companies drops below 450 and the percentage of misspellings decreases significantly from about 6% to 2.5%. These findings might support the hypothesis that spelling errors are inserted deliberately in technical domains.

5 Conclusions

While researchers in the patent domain concentrate on retrieval models and algorithms to improve the search performance, the original aspect of our paper is that it points to a different (and orthogonal) research avenue: the analysis of patent inconsistencies. With the analysis of spelling errors in assignee names we made a first yet considerable contribution in this respect; searches with assignee constraints become a more sensible operation. We showed how a special treatment of spelling errors can significantly raise the effectiveness of patent search. The identification of this untapped potential, but also the utilization of machine learning to combine patent features with typography, form our main contributions.

Our current research broadens the application of a patent spelling analysis. In order to identify errors that are introduced deliberately we investigate different types of misspellings (edit distance versus phonological). Finally, we consider the analysis of acquisition histories of companies as a promising research direction: since acquired companies often own granted patents, these patents should be considered while searching for the company in question in order to further increase the recall.

Acknowledgements

This work is supported in part by the German Science Foundation under grants STE1019/2-1 and FU205/22-1.
References

Stephen Adams. 2010. The Text, the Full Text and nothing but the Text: Part 1 – Standards for creating Textual Information in Patent Documents and General Search Implications. World Patent Information, 32(1):22–29, March.

Mikhail Bilenko and Raymond J. Mooney. 2002. Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases. Technical Report AI 02-296, Artificial Intelligence Laboratory, University of Texas at Austin, Austin, TX, USA, February.

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large Language Models in Machine Translation. In EMNLP-CoNLL ’07: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 858–867. ACL, June.

Qing Chen, Mu Li, and Ming Zhou. 2007. Improving Query Spelling Correction Using Web Search Results. In EMNLP-CoNLL ’07: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 181–189. ACL, June.

Peter Christen. 2006. A Comparison of Personal Name Matching: Techniques and Practical Issues. In ICDM ’06: Workshops Proceedings of the sixth IEEE International Conference on Data Mining, pages 290–294. IEEE Computer Society, December.

William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. 2003. A Comparison of String Distance Metrics for Name-Matching Tasks. In Subbarao Kambhampati and Craig A. Knoblock, editors, IIWeb ’03: Proceedings of the IJCAI workshop on Information Integration on the Web, pages 73–78, August.

Fred J. Damerau. 1964. A Technique for Computer Detection and Correction of Spelling Errors. Communications of the ACM, 7(3):171–176.

Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007. Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16.

Caspar J. Fall and Christophe Giraud-Carrier. 2005. Searching Trademark Databases for Verbal Similarities. World Patent Information, 27(2):135–143.

Matthias Hagen and Benno Stein. 2011. Candidate Document Retrieval for Web-Scale Text Reuse Detection. In 18th International Symposium on String Processing and Information Retrieval (SPIRE ’11), volume 7024 of Lecture Notes in Computer Science, pages 356–367. Springer.

David Hunt, Long Nguyen, and Matthew Rodgers, editors. 2007. Patent Searching: Tools & Techniques. Wiley.

Intellevate Inc. 2006. Patent Quality, a blog entry. http://www.patenthawk.com/blog/2006/01/patent_quality.html, January.

Hideo Joho, Leif A. Azzopardi, and Wim Vanderbauwhede. 2010. A Survey of Patent Users: An Analysis of Tasks, Behavior, Search Functionality and System Requirements. In IIiX ’10: Proceedings of the third symposium on Information Interaction in Context, pages 13–24, New York, NY, USA. ACM.

Donald E. Knuth. 1997. The Art of Computer Programming, Volume I: Fundamental Algorithms, 3rd Edition. Addison-Wesley.

Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710. Original in Doklady Akademii Nauk SSSR, 163(4):845–848.

Yanen Li, Huizhong Duan, and ChengXiang Zhai. 2011. CloudSpeller: Spelling Correction for Search Queries by Using a Unified Hidden Markov Model with Web-scale Resources. In Spelling Alteration for Web Search Workshop, pages 10–14, July.

Patrice Lopez and Laurent Romary. 2010. Experiments with Citation Mining and Key-Term Extraction for Prior Art Search. In Martin Braschler, Donna Harman, and Emanuele Pianta, editors, CLEF 2010 LABs and Workshops, Notebook Papers, September.

Mihai Lupu, Katja Mayer, John Tait, and Anthony J. Trippe, editors. 2011. Current Challenges in Patent Information Retrieval, volume 29 of The Information Retrieval Series. Springer.

Walid Magdy and Gareth J. F. Jones. 2010. Applying the KISS Principle for the CLEF-IP 2010 Prior Art Candidate Patent Search Task. In Martin Braschler, Donna Harman, and Emanuele Pianta, editors, CLEF 2010 LABs and Workshops, Notebook Papers, September.

Walid Magdy and Gareth J. F. Jones. 2011. A Study on Query Expansion Methods for Patent Retrieval. In PAIR ’11: Proceedings of the 4th workshop on Patent information retrieval, AAAI Workshop on Plan, Activity, and Intent Recognition, pages 19–24, New York, NY, USA. ACM.

Alvaro E. Monge and Charles Elkan. 1997. An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. In DMKD ’97: Proceedings of the 2nd workshop on Research Issues on Data Mining and Knowledge Discovery, pages 23–29, New York, NY, USA. ACM.

Heiko Müller and Johann-C. Freytag. 2003. Problems, Methods and Challenges in Comprehensive Data Cleansing. Technical Report HUB-IB-164, Humboldt-Universität zu Berlin, Institut für Informatik, Germany.

Felix Naumann and Melanie Herschel. 2010. An Introduction to Duplicate Detection. Synthesis Lectures on Data Management. Morgan & Claypool Publishers.

Yoh Okuno. 2011. Spell Generation based on Edit Distance. In Spelling Alteration for Web Search Workshop, pages 25–26, July.

Martin Potthast and Benno Stein. 2008. New Issues in Near-duplicate Detection. In Christine Preisach, Hans Burkhardt, Lars Schmidt-Thieme, and Reinhold Decker, editors, Data Analysis, Machine Learning and Applications. Selected papers from the 31st Annual Conference of the German Classification Society (GfKl ’07), Studies in Classification, Data Analysis, and Knowledge Organization, pages 601–609, Berlin Heidelberg New York. Springer.

Benno Stein and Daniel Curatolo. 2006. Phonetic Spelling and Heuristic Search. In Gerhard Brewka, Silvia Coradeschi, Anna Perini, and Paolo Traverso, editors, 17th European Conference on Artificial Intelligence (ECAI ’06), pages 829–830, Amsterdam, Berlin, August. IOS Press.

Benno Stein and Matthias Hagen. 2011. Introducing the User-over-Ranking Hypothesis. In Advances in Information Retrieval. 33rd European Conference on IR Research (ECIR ’11), volume 6611 of Lecture Notes in Computer Science, pages 503–509, Berlin Heidelberg New York, April. Springer.

U.S. Patent & Trademark Office. 2010. Manual of Patent Examining Procedure (MPEP), Eighth Edition, July.

William W. Winkler. 1999. The State of Record Linkage and Current Research Problems. Technical report, Statistical Research Division, U.S. Bureau of the Census.

Xiaobing Xue and Bruce W. Croft. 2009. Automatic Query Generation for Patent Search. In CIKM ’09: Proceedings of the eighteenth ACM conference on Information and Knowledge Management, pages 2037–2040, New York, NY, USA. ACM.
UBY – A Large-Scale Unified Lexical-Semantic Resource Based on LMF

Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer and Christian Wirth

Ubiquitous Knowledge Processing Lab (UKP-DIPF)
German Institute for Educational Research and Educational Information

Ubiquitous Knowledge Processing Lab (UKP-TUDA)
Department of Computer Science
Technische Universität Darmstadt
http://www.ukp.tu-darmstadt.de
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 580–590,
Avignon, France, April 23–27 2012.
© 2012 Association for Computational Linguistics
summarized as follows: (1) We present an LMF-based model for large-scale multilingual LSRs called UBY-LMF. We model the lexical-semantic information down to a fine-grained level of information (e.g. syntactic frames) and employ standardized definitions of linguistic information types from ISOCat. (2) We present UBY, a large-scale LSR implementing the UBY-LMF model. UBY currently contains nine resources in two languages: English WordNet (WN, Fellbaum (1998)), Wiktionary2 (WKT-en), Wikipedia3 (WP-en), FrameNet (FN, Baker et al. (1998)), and VerbNet (VN, Kipper et al. (2008)); German Wiktionary (WKT-de), Wikipedia (WP-de), and GermaNet (GN, Kunze and Lemnitzer (2002)), and the English and German entries of OmegaWiki4 (OW), referred to as OW-en and OW-de. OW, a novel CCR, is inherently multilingual – its basic structure are multilingual synsets, which are a valuable addition to our multilingual UBY. Essential to UBY are the nine pairwise sense alignments between resources, which we provide to enable resource interoperability on the sense level, e.g. by providing access to the often complementary information for a sense in different resources. (3) We present a Java API which offers unified access to the information contained in UBY.

We will make the UBY-LMF model, the resource UBY and the API freely available to the research community.5 This will make it easy for the NLP community to utilize UBY in a variety of tasks in the future.

2 Related Work

The work presented in this paper concerns standardization of LSRs, large-scale integration thereof at the representational level, and the unified access to lexical-semantic information in the integrated resources.

Standardization of resources. Previous work includes models for representing lexical information relative to ontologies (Buitelaar et al., 2009; McCrae et al., 2011), and standardized single wordnets (English, German and Italian wordnets) in the ISO standard LMF (Soria et al., 2009; Henrich and Hinrichs, 2010; Toral et al., 2010).

2 http://www.wiktionary.org/
3 http://www.wikipedia.org/
4 http://www.omegawiki.org/
5 http://www.ukp.tu-darmstadt.de/data/uby

McCrae et al. (2011) propose LEMON, a conceptual model for lexicalizing ontologies as an extension of the LexInfo model (Buitelaar et al., 2009). LEMON provides an LMF implementation in the Web Ontology Language (OWL), which is similar to UBY-LMF, as it also uses DCs from ISOCat, but diverges further from the standard (e.g. by removing structural elements such as the predicative representation class). While we focus on modeling lexical-semantic information comprehensively and at a fine-grained level, the goal of LEMON is to support the linking between ontologies and lexicons. This goal entails a task-targeted application: domain-specific lexicons are extracted from ontology specifications and merged with existing LSRs on demand. As a consequence, there is no available large-scale instance of the LEMON model.

Soria et al. (2009) define WordNet-LMF, an LMF model for representing wordnets used in the KYOTO project, and Henrich and Hinrichs (2010) do this for GN, the German wordnet. These models are similar, but they still present different implementations of the LMF metamodel, which hampers interoperability between the resources. We build upon this work, but extend it significantly: UBY goes beyond modeling a single ECR and represents a large number of both ECRs and CCRs with very heterogeneous content in the same format. Also, UBY-LMF features deeper modeling of lexical-semantic information. Henrich and Hinrichs (2010), for instance, do not explicitly model the argument structure of subcategorization frames, since each frame is represented as a string. In UBY-LMF, we represent them at a fine-grained level necessary for the transparent modeling of the syntax-semantics interface.

Large-scale integration of resources. Most previous research efforts on the integration of resources targeted at world knowledge rather than lexical-semantic knowledge. Well known examples are YAGO (Suchanek et al., 2007), or DBpedia (Bizer et al., 2009).

Atserias et al. (2004) present the Meaning Multilingual Central Repository (MCR). MCR integrates five local wordnets based on the Interlingual Index of EuroWordNet (Vossen, 1998). The overall goal of the work is to improve word sense disambiguation. This work is similar to ours, as it
aims at a large-scale multilingual resource and includes several resources. It is however restricted to a single type of resource (wordnets) and features a single type of lexical information (semantic relations) specified upon synsets. Similarly, de Melo and Weikum (2009) create a multilingual wordnet by integrating wordnets, bilingual dictionaries and information from parallel corpora. None of these resources integrate lexical-semantic information, such as syntactic subcategorization or semantic roles.

McFate and Forbus (2011) present NULEX, a syntactic lexicon automatically compiled from WN, WKT-en and VN. As their goal is to create an open-license resource to enhance syntactic parsing, they enrich verbs and nouns in WN with inflection information from WKT-en and syntactic frames from VN. Thus, they only use a small part of the lexical information present in WKT-en.

Padro et al. (2011) present their work on lexicon merging within the Panacea Project. One goal of Panacea is to create a lexical resource development platform that supports large-scale lexical acquisition and can be used to combine existing lexicons with automatically acquired ones. To this end, Padro et al. (2011) explore the automatic integration of subcategorization lexicons. Their current work only covers Spanish, and though they mention the LMF standard as a potential data model, they do not make use of it.

Shi and Mihalcea (2005) integrate FN, VN and WN, and Palmer (2009) presents a combination of Propbank, VN and FN in a resource called SEMLINK in order to enhance semantic role labeling. Similar to our work, multiple resources are integrated, but their work is restricted to a single language and does not cover CCRs, whose popularity and importance has grown tremendously over the past years. In fact, with the exception of NULEX, CCRs have only been considered in the sense alignment of individual resource pairs (Navigli and Ponzetto, 2010a; Meyer and Gurevych, 2011).

API access for resources. An important factor to the success of a large, integrated resource is a single public API, which facilitates the access to the information contained in the resource. The most important LSRs so far can be accessed using various APIs, for instance the Java WordNet API,6 or the Java-based Wikipedia API.7

6 http://sourceforge.net/projects/jwordnet/
7 http://code.google.com/p/jwpl/

With a stronger focus of the NLP community on sharing data and reproducing experimental results, these tools are becoming important as never before. Therefore, a major design objective of UBY is a single API. This is similar in spirit to the motivation of Pradhan et al. (2007), who present integrated access to corpus annotations as a main goal of their work on standardizing and integrating corpus annotations in the OntoNotes project.

To summarize, related work focuses either on the standardization of single resources (or a single type of resource), which leads to several slightly different formats constrained to these resources, or on the integration of several resources in an idiosyncratic format. CCRs have not been considered at all in previous work on resource standardization, and the level of detail of the modeling is insufficient to fully accommodate different types of lexical-semantic information. API access is rarely provided. This makes it hard for the community to exploit their results on a large scale. Thus, it diminishes the impact that these projects might achieve upon NLP beyond their original specific purpose, if their results were represented in a unified resource and could easily be accessed by the community through a single public API.

3 UBY Data model

LMF defines a metamodel of LSRs in the Unified Modeling Language (UML). It provides a number of UML packages and classes for modeling many different types of resources, e.g. wordnets and multilingual lexicons. The design of a standard-compliant lexicon model in LMF involves two steps: in the first step, the structure of the lexicon model has to be defined by choosing a combination of the LMF core package and zero to many extensions (i.e. UML packages). In the second step, these UML classes are enriched by attributes. To contribute to semantic interoperability, it is essential for the lexicon model that the attributes and their values refer to Data Categories (DCs) taken from a reference repository. DCs are standardized specifications of the terms that are used for attributes and their values, or in other words, the linguistic vocabulary occurring
in a lexicon model. Consider, for instance, the term lexeme that is defined differently in WN and FN: in FN, a lexeme refers to a word form, not including the sense aspect. In WN, on the contrary, a lexeme is an abstract pairing of meaning and form. According to LMF, the DCs are to be selected from ISOCat, the implementation of the ISO 12620 Data Category Registry (DCR, Broeder et al. (2010)), resulting in a Data Category Selection (DCS).

Design of UBY-LMF. We have designed UBY-LMF8 as a model of the union of various heterogeneous resources, namely WN, GN, FN, and VN on the one hand and CCRs on the other hand. Two design principles guided our development of UBY-LMF: first, to preserve the information available in the original resources and to uniformly represent it in UBY-LMF. Second, to be able to extend UBY in the future by further languages, resources, and types of linguistic information, in particular, alignments between different LSRs.

Wordnets, FN and VN are largely complementary regarding the information types they provide, see, e.g. Baker and Fellbaum (2009). Accordingly, they use different organizational units to represent this information. Wordnets, such as WN and GN, primarily contain information on lexical-semantic relations, such as synonymy, and use synsets (groups of lexemes that are synonymous) as organizational units. FN focuses on groups of lexemes that evoke the same prototypical situation (so-called semantic frames, Fillmore (1982)) involving semantic roles (so-called frame elements). VN, a large-scale verb lexicon, is organized in Levin-style verb classes (Levin, 1993) (groups of verbs that share the same syntactic alternations and semantic roles) and provides rich subcategorization frames including semantic roles and a specification of semantic predicates.

UBY-LMF employs several direct subclasses of Lexicon in order to account for the various organization types found in the different LSRs considered. While the LexicalEntry class reflects the traditional headword-based lexicon organization, Synset represents synsets from wordnets, SemanticPredicate models FN semantic frames, and SubcategorizationFrameSet corresponds to VN alternation classes.

8 See www.ukp.tu-darmstadt.de/data/uby

SubcategorizationFrame is composed of syntactic arguments, while SemanticPredicate is composed of semantic arguments. The linking between syntactic and semantic arguments is represented by the SynSemCorrespondence class.

The SenseAxis class is very important in UBY-LMF, as it connects the different source LSRs. Its role is twofold: first, it links the corresponding word senses from different languages, e.g. English and German. Second, it represents monolingual sense alignments, i.e. sense alignments between different lexicons in the same language. The latter is a novel interpretation of SenseAxis introduced by UBY-LMF.

The organization of lexical-semantic knowledge found in WP, WKT, and OW can be modeled with the classes in UBY-LMF as well. WP primarily provides encyclopedic information on nouns. It mainly consists of article pages which are modeled as Senses in UBY-LMF. WKT is in many ways similar to traditional dictionaries, because it enumerates senses under a given headword on an entry page. Thus, WKT entry pages can be represented by LexicalEntries and WKT senses by Senses. OW is different from WKT and WP, as it is organized in multilingual synsets. To model OW in UBY-LMF, we split the synsets per language and included them as monolingual Synsets in the corresponding Lexicon (e.g., OW-en or OW-de). The original multilingual information is preserved by adding a SenseAxis between corresponding synsets in OW-en and OW-de.

The LMF standard itself contains only few linguistic terms and specifies neither attributes nor their values. Therefore, an important task in developing UBY-LMF has been the specification of attributes and their values along with the proper attachment of attributes to LMF classes. In particular, this task involved selecting DCs from ISOCat and, if necessary, adding new DCs to ISOCat.

Extensions in UBY-LMF. Although UBY-LMF is largely compliant with LMF, the task of building a homogeneous lexicon model for many highly heterogeneous LSRs led us to extend LMF in several ways: we added two new classes and several new relationships between classes.

First, we were facing a huge variety of lexical-semantic labels for many different dimensions of
semantic classification. Examples of such dimensions include ontological type (e.g. selectional restrictions in VN and FN), domain (e.g. Biology in WN), style and register (e.g. labels in WKT, OW), or sentiment (e.g. sentiment of lexical units in FN). Since we aim at an extensible LMF-model, capable of representing further dimensions of semantic classification, we did not squeeze the information on semantic classes present in the considered LSRs into existing LMF classes. Instead, we addressed this issue by introducing a more general class, SemanticLabel, which is an optional subclass of Sense, SemanticPredicate, and SemanticArgument. This new class has three attributes, encoding the name of the label, its type (e.g. ontological, register, sentiment), and a numeric quantification (e.g. sentiment strength).

Second, we attached the subclass Frequency to most of the classes in UBY-LMF, in order to encode frequency information. This is of particular importance when using the resource in machine learning applications. This extension of the standard has already been made in WordNet-LMF (Soria et al., 2009). Currently, the Frequency class is used to keep corpus frequencies for lexical units in FN, but we plan to use it for enriching many other classes with frequency information in future work, such as Senses or SubcategorizationFrames.

Third, the representation of FN in LMF required adding two new relationships between LMF classes: we added a relationship between SemanticArgument and Definition, in order to represent the definitions available for frame elements in FN. In addition, we added a relationship between the Context class and the MonoLingualExternalRef, to represent the links to annotated corpus sentences in FN.

Finally, WKT turned out to be hard to tackle, because it contains a special kind of ambiguity in the semantic relations and translation links listed for senses: the targets of both relations and translation links are ambiguous, as they refer to lemmas (word forms), rather than to senses (Meyer and Gurevych, 2010). These ambiguous relation targets could not directly be represented in LMF, since sense and translation relations are defined between senses. To resolve this, we added a relationship between SenseRelation and FormRepresentation, in order to encode the ambiguous WKT relation target as a word form. Disambiguating the WKT relation targets to infer the target sense is left to future work.

A related issue occurred when we mapped WN to LMF. WN encodes morphologically related forms as sense relations. UBY-LMF represents these related forms not only as sense relations (as in WordNet-LMF), but also at the morphological level using the RelatedForm class from the LMF Morphology extension. In LMF, however, the RelatedForm class for morphologically related lexemes is not associated with the corresponding sense in any way. Discarding the WN information on the senses involved in a particular morphological relation would lead to information loss in some cases. Consider as an example the WN verb buy (purchase) which is derivationally related to the noun buy, while on the other hand buy (accept as true, e.g. I can't buy this story) is not derivationally related to the noun buy. We addressed this issue by adding a sense attribute to the RelatedForm class. Thus, in extension of LMF, UBY-LMF allows sense relations to refer to a form relation target and morphological relations to refer to a sense relation target.

Data Categories in UBY-LMF. We encountered large differences in the availability of DCs in ISOCat for the morpho-syntactic, lexical-syntactic, and lexical-semantic parts of UBY-LMF. Many DCs were missing in ISOCat and we had to enter them ourselves. While this was feasible at the morpho-syntactic and lexical-syntactic level, due to a large body of standardization results available, it was much harder at the lexical-semantic level where standardization is still ongoing. At the lexical-semantic level, UBY-LMF currently allows string values for a number of attribute values, e.g. for semantic roles. We can easily integrate the results of the ongoing standardization efforts into UBY-LMF in the future.

4 UBY Population with information

4.1 Representing LSRs in UBY-LMF

UBY-LMF is represented by a DTD (as suggested by the standard) which can be used to automatically convert any given resource into the corresponding XML format.9 This conversion requires a detailed analysis of the resource to be converted, followed by the definition of a mapping of the

9 Therefore, UBY-LMF can be considered as a serialization of LMF.
584
concepts and terms used in the original resource tries provide links to the corresponding WP page.
to the U BY-LMF model. There are two major Also, the German and English language editions
tasks involved in the development of an automatic of WP and OW are connected by inter-language
conversion routine: first, the basic organizational links between articles (Senses in U BY). We can
unit in the source LSR has to be identified and expect that these links have high quality, as they
mapped, e.g. synset in WN or semantic frame in were entered manually by users and are subject
FN, and second, it has to be determined, how a to community control. Therefore, we straightfor-
(LMF) sense is defined in the source LSR. wardly imported them into U BY.
A notable aspect of converting resources into
U BY-LMF is the harmonization of linguistic ter- Alignment Framework. Automatically creat-
minology used in the LSRs. For instance, a ing new alignments is difficult because of word
WN Word and a GN Lexical Unit are mapped to ambiguities, different granularities of senses,
Sense in U BY-LMF. or language specific conceptualizations (Navigli,
We developed reusable conversion routines for 2006). To support this task for a large number
the future import of updated versions of the source of resources across languages, we have designed
LSRs into U BY, provided the structure of the a flexible alignment framework based on the
source LSR remains stable. These conversion state-of-the-art method of Niemann and Gurevych
routines extract lexical data from the source LSRs (2011). The framework is generic in order to al-
by calling their native APIs (rather than process- low alignments between different kinds of entities
ing the underlying XML data). Thus, all lexical as found in different resources, e.g. WN synsets,
information which can be accessed via the APIs FN frames or WP articles. The only requirement
is converted into U BY-LMF. is that the individual entities are distinguishable
by a unique identifier in each resource.
Converting the LSRs introduced in the previ-
ous section yielded an instantiation of U BY-LMF The alignment consists of the following steps:
named U BY. The LexicalResource instance First, we extract the alignment candidates for a
U BY currently comprises 10 Lexicon instances, given resource pair, e.g. WN sense candidates for
one each for OW-de and OW-en, and one lexicon a WKT-en entry. Second, we create a gold stan-
each for the remaining eight LSRs. dard by manually annotating a subset of candi-
date pairs as valid or non-valid. Then, we
4.2 Adding Sense Alignments extract the sense representations (e.g. lemmatized
bag-of-words based on glosses) to compute the
Besides the uniform and standardized representa-
similarity of word senses (e.g. by cosine similar-
tion of the single LSRs, one major asset of U BY
ity). The gold standard with corresponding sim-
is the semantic interoperability of resources at the
ilarity values is fed into Weka (Hall et al., 2009)
sense level. In the following, we (i) describe how
to train a machine learning classifier, and in the
we converted already existing sense alignments of
final step this classifier is used to automatically
resources into LMF, and (ii) present a framework
classify the candidate sense pairs as (non-)valid
to infer alignments automatically for any pair of
alignment. Our framework also allows us to train
resources.
on a combination of different similarity measures.
Existing Alignments. Previous work on sense Using our framework, we were able to re-
alignment yielded several alignments, such as produce the results reported by Niemann and
WNWP-en (Niemann and Gurevych, 2011), Gurevych (2011) and Meyer and Gurevych
WNWKT-en (Meyer and Gurevych, 2011) and (2011) based on the publicly available evaluation
VNFN (Palmer, 2009). datasets10 and the configuration details reported
We converted these alignments into U BY-LMF in the corresponding papers.
by creating a SenseAxis instance for each pair of
Cross-Lingual Alignment. In order to align
aligned senses. This involved mapping the sense
word senses across languages, we extended the
IDs from the proprietary alignment files to the
monolingual sense alignment described above to
corresponding sense IDs in U BY.
the cross-lingual setting. Our approach utilizes
In addition, we integrated the sense alignments
10
already present in OW and WP. Some OW en- http://www.ukp.tu-darmstadt.de/data/sense-alignment/
585
Moses,11 trained on the Europarl corpus. The Translation Similarity
lemma of one of the two senses to be aligned direction measure P R F1
as well as its representations (e.g. the gloss) is EN > DE Cosine (Cos) 0.666 0.575 0.594
translated into the language of the other resource, DE > EN Cos 0.674 0.658 0.665
yielding a monolingual setting. E.g., the WN DE > EN PPR 0.721 0.712 0.716
synset {vessel, watercraft} with its gloss a craft DE > EN PPR + Cos 0.723 0.712 0.717
designed for water transportation is translated
Table 1: Cross-lingual alignment results
into {Schiff, Wasserfahrzeug} and Ein Fahrzeug
fur Wassertransport, and then the candidate ex-
traction and all downstream steps can take place into English works significantly better than into
in German. An inherent problem with this ap- German. Also, the more elaborate similarity mea-
proach is that incorrect translations also lead to sure PPR yields better results than cosine similar-
invalid alignment candidates. However, these are ity, while the best result is achieved by a combina-
most probably filtered out by the machine learn- tion of both. Niemann and Gurevych (2011) make
ing classifier as the calculated similarity between a similar observation for the monolingual setting.
the sense representations (e.g. glosses) should be Our F-measure of 0.717 in the best configuration
low if the candidates do not match. lies between the results of Meyer and Gurevych
We evaluated our approach by creating a cross- (2011) (0.66) and Niemann and Gurevych (2011)
lingual alignment between WN and OW-de, i.e. (0.78), and thus verifies the validity of the ma-
the concepts in OW with a German lexicaliza- chine translation approach. Therefore, the best
tion.12 To our knowledge, this is the first study on alignment was subsequently integrated into U BY.
aligning OW with another LSR. OW is especially
interesting for this task due to its multilingual con- 5 Evaluating U BY
cepts, as described by Matuschek and Gurevych
We performed an intrinsic evaluation of U BY by
(2011). The created gold standard could, for in-
computing a number of resource statistics. Our
stance, be re-used to evaluate alignments for other
evaluation covers two aspects: first, it addresses
languages in OW.
the question if our automatic conversion routines
To compute the similarity of word senses, we work correctly. Second, it provides indicators for
followed the approach by Niemann and Gurevych assessing U BY in terms of the gain in coverage
(2011) while covering both translation directions. compared to the single LSRs.
We used the cosine similarity for comparing the
German OW glosses with the German translations Correctness of conversion. Since we aim to
of WN glosses and cosine and personalized page preserve the maximal amount of information from
rank (PPR) similarity for comparison of the Ger- the original LSRs, we should be able to replace
man OW glosses translated into English with the any of the original LSRs and APIs by U BY and
original English WN glosses. Note that PPR sim- the U BY-API without losing information. As
ilarity is not available for German as it is based the conversion is largely performed automatically,
on WN. Thereby, we filtered out the OW con- systematic errors and information loss could be
cepts without a German gloss which left us with introduced by a faulty conversion routine. In or-
11,806 unique candidate pairs. We randomly se- der to detect such errors and to prove the correct-
lected 500 WN synsets for analysis yielding 703 ness of the automatic conversion and the result-
candidate pairs. These were manually annotated ing representation, we have compared the orig-
as being (non-)alignments. For the subsequent inal resource statistics of the classes and infor-
machine learning task we used a simple threshold- mation types in the source LSRs to the cor-
based classifier and ten-fold cross validation. responding classes in their U BY counterparts.
Table 1 summarizes the results of different sys- For instance, the number of lexical relations in
tem configurations. We observe that translation WordNet has been compared to the number of
11
SenseRelations in the U BY WordNet lexi-
http://www.statmt.org/moses/
12 con.13
OmegaWiki consists of interlinked language-
independent concepts to which lexicalizations in several
13
languages are attached. For detailed analysis results see the U BY website.
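The candidate classification at the heart of the alignment framework can be sketched in a few lines. This is a simplified illustration only, not the UBY implementation: the Weka-trained classifier and ten-fold cross validation are replaced by a toy threshold learner over the cosine similarity of whitespace-tokenised glosses, and all function names and example glosses are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine(gloss_a, gloss_b):
    """Cosine similarity of two glosses as bag-of-words vectors."""
    a, b = Counter(gloss_a.split()), Counter(gloss_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def learn_threshold(pairs, labels):
    """Pick the similarity threshold with the best accuracy on a gold
    standard of manually annotated (gloss, gloss) candidate pairs."""
    sims = [cosine(a, b) for a, b in pairs]
    best_t, best_acc = 0.0, -1.0
    for t in sorted(set(sims)):
        acc = sum((s >= t) == lab for s, lab in zip(sims, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def align(pair, threshold):
    """Classify a candidate sense pair as a valid alignment or not."""
    return cosine(*pair) >= threshold

# Toy gold standard: candidate pairs annotated as valid / non-valid.
gold = [
    (("craft designed for water transportation", "vehicle for transport on water"), True),
    (("craft designed for water transportation", "tube that carries blood"), False),
]
t = learn_threshold([p for p, _ in gold], [lab for _, lab in gold])
print(align(("craft designed for water transportation", "vehicle used for water transport"), t))
```

In the actual framework, several similarity measures (e.g. cosine and PPR) can be combined as features of a machine-learning classifier trained with Weka and evaluated with ten-fold cross validation.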
Lexicon   Lexical Entry   Sense       Sense Relation
FN        9,704           11,942      -
GN        83,091          93,407      329,213
OW-de     30,967          34,691      60,054
OW-en     51,715          57,921      85,952
WP-de     790,430         838,428     571,286
WP-en     2,712,117       2,921,455   3,364,083
WKT-de    85,575          72,752      434,358
WKT-en    335,749         421,848     716,595
WN        156,584         206,978     8,559
VN        3,962           31,891      -
UBY       4,259,894       4,691,313   5,300,941

Table 2: UBY resource statistics (selected classes).

Lexicon pair    Languages   SenseAxis
WN–WP-en        EN–EN       50,351
WN–WKT-en       EN–EN       99,662
WN–VN           EN–EN       40,716
FN–VN           EN–EN       17,529
WP-en–OW-en     EN–EN       3,960
WP-de–OW-de     DE–DE       1,097
WN–OW-de        EN–DE       23,024
WP-en–WP-de     EN–DE       463,311
OW-en–OW-de     EN–DE       58,785
UBY             All         758,435

Table 3: UBY alignment statistics.

Gain in coverage. UBY offers an increased coverage compared to the single LSRs, as reflected in the resource statistics. Tables 2 and 3 show the statistics on central classes in UBY. As UBY is organized in several Lexicons, the number of UBY lexical entries is the sum of the lexical entries in all 10 Lexicons. Thus, UBY contains more than 4.2 million lexical entries, 4.6 million senses, 5.3 million semantic relations between senses and more than 750,000 alignments. These statistics represent the total numbers of lexical entries, senses and sense relations in UBY without filtering of identical (i.e. corresponding) lexical entries, senses and relations. Listing the number of unique senses would require a full alignment between all integrated resources, which is currently not available.

We can, however, show that UBY contains over 3.08 million unique lemma-POS combinations for English and over 860,000 for German, over 3.94 million in total, see Table 4. Therefore, we assessed the coverage on lemma level. Table 4 also shows the number of lemmas with entries in one or more than one lexicon, additionally split by POS and language. Lemmas occurring only once in UBY increase the coverage at lemma level. For lemmas with parallel entries in several UBY lexicons, new information becomes available in the form of additional sense definitions and complementary information types attached to lemmas.

Finally, the increase in coverage at sense level can be estimated for senses that are aligned across at least two UBY lexicons. We gain access to all available, partly complementary information types attached to these aligned senses, e.g. semantic relations, subcategorization frames, encyclopedic or multilingual information. The number of pairwise sense alignments provided by UBY is given in Table 3. In addition, we computed how many senses simultaneously take part in at least two pairwise sense alignments. For English, this applies to 31,786 senses, for which information from 3 UBY lexicons is available.

EN Lexicons   noun        verb     adjective
5             1           699      -
4             1,630       1,888    430
3             8,439       1,948    2,271
2             53,856      4,727    12,290
1             2,900,652   50,209   41,731
(unique EN)   3,080,771

DE Lexicons   noun        verb     adjective
4             1,546       -        -
3             10,374      372      342
2             26,813      3,174    2,643
1             803,770     6,108    7,737
(unique DE)   862,879

Table 4: Number of lemmas (split by POS and language) with entries in i UBY lexicons, i = 1, ..., 5.

6 Using UBY

UBY API. For convenient access to UBY, we implemented a Java API which is built around the Hibernate[14] framework. Hibernate makes it easy to store the XML data which results from converting resources into UBY-LMF in a corresponding SQL database.

Our main design principle was to keep the access to the resource as simple as possible, despite the rich and complex structure of UBY. Another important design aspect was to ensure that the functionality of the individual, resource-specific APIs or user interfaces is mirrored in the UBY API. This enables porting legacy applications to our new resource. To facilitate the transition to UBY, we plan to provide reference tables which list the corresponding UBY-API operations for the most important operations in the WN API, some of which are shown in Table 5.

WN function                 UBY function
Dictionary                  UBY
getIndexWord(pos, lemma)    getLexicalEntries(pos, lemma)
IndexWord                   LexicalEntry
getLemma()                  getLemmaForm()
Synset                      Synset
getGloss()                  getDefinitionText()
getWords()                  getSenses()
Pointer                     SynsetRelation
getType()                   getRelName()
Word                        Sense
getPointers()               getSenseRelations()

Table 5: Some equivalent operations in the WN API and the UBY API.

While it is possible to limit access to single resources by a parameter and thus mimic the behavior of the legacy APIs (e.g. only retrieve Synsets and their relations from WN), the true power of the UBY API becomes visible when no such constraints are applied. In this case, all imported resources are queried to get one combined result, while retaining the source of the respective information. On top of this, the information about existing sense alignments across resources can be accessed via SenseAxis relations, so that the returned combined result covers not only the lexical, but also the sense level.

Community issues. One of the most important reasons for UBY is creating an easy-to-use, powerful LSR to advance NLP research and development. Therefore, community building around the resource is one of our major concerns. To this end, we will offer free downloads of the lexical data and software presented in this paper under open licenses, namely: the UBY-LMF DTD, mappings and conversion tools for existing resources and sense alignments, the Java API, and, as far as licensing allows,[15] already converted resources. If resources cannot be made available for download, the conversion tools will still allow users with access to these resources to import them into UBY easily. In this way, it will be possible for users to build their custom UBY containing selected resources. As the underlying resources are subject to continuous change, updates of the corresponding components will be made available on a regular basis.

7 Conclusions

We presented UBY, a large-scale, standardized LSR containing nine widely used resources in two languages: English WN, WKT-en, WP-en, FN and VN, German WP-de, WKT-de, and GN, and OW in English and German. As all resources are modeled in UBY-LMF, UBY enables structural interoperability across resources and languages down to a fine-grained level of information. For FN, VN and all of the CCRs in English and German, this is done for the first time. Besides, by integrating sense alignments we also enable the lexical-semantic interoperability of resources. We presented a unified framework for aligning any LSRs pairwise and reported on experiments which align OW-de and WN. We will release the UBY-LMF model, the resource and the UBY-API at the time of publication.[16] Due to the added value and the large scale of UBY, as well as its ease of use, we believe UBY will boost the performance of NLP making use of lexical-semantic knowledge.

Acknowledgments

This work has been supported by the Emmy Noether Program of the German Research Foundation (DFG) under grant No. GU 798/3-1 and by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806. We thank Richard Eckart de Castilho, Yevgen Chebotar, Zijad Maksuti and Tri Duc Nghiem for their contributions to this project.

[14] http://www.hibernate.org/
[15] Only GermaNet is subject to a restricted license and cannot be redistributed in UBY format.
[16] http://www.ukp.tu-darmstadt.de/data/uby

References

Jordi Atserias, Luís Villarejo, German Rigau, Eneko Agirre, John Carroll, Bernardo Magnini, and Piek
Vossen. 2004. The MEANING Multilingual Central Repository. In Proceedings of the Second International WordNet Conference (GWC 2004), pages 23–30, Brno, Czech Republic.

Collin F. Baker and Christiane Fellbaum. 2009. WordNet and FrameNet as complementary resources for annotation. In Proceedings of the Third Linguistic Annotation Workshop, ACL-IJCNLP '09, pages 125–129, Suntec, Singapore.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 86–90, Montreal, Canada.

Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. 2009. DBpedia – A Crystallization Point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, (7):154–165.

Daan Broeder, Marc Kemps-Snijders, Dieter Van Uytvanck, Menzo Windhouwer, Peter Withers, Peter Wittenburg, and Claus Zinn. 2010. A Data Category Registry- and Component-based Metadata Framework. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC), pages 43–47, Valletta, Malta.

Paul Buitelaar, Philipp Cimiano, Peter Haase, and Michael Sintek. 2009. Towards Linguistically Grounded Ontologies. In Lora Aroyo, Paolo Traverso, Fabio Ciravegna, Philipp Cimiano, Tom Heath, Eero Hyvönen, Riichiro Mizoguchi, Eyal Oren, Marta Sabou, and Elena Simperl, editors, The Semantic Web: Research and Applications, pages 111–125, Berlin/Heidelberg, Germany. Springer.

Gerard de Melo and Gerhard Weikum. 2009. Towards a universal wordnet by learning from combined evidence. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), pages 513–522, New York, NY, USA. ACM.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, USA.

Charles J. Fillmore. 1982. Frame Semantics. In The Linguistic Society of Korea, editor, Linguistics in the Morning Calm, pages 111–137. Hanshin Publishing Company, Seoul, Korea.

Gil Francopoulo, Núria Bel, Monte George, Nicoletta Calzolari, Monica Monachini, Mandy Pet, and Claudia Soria. 2006. Lexical Markup Framework (LMF). In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pages 233–236, Genoa, Italy.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA Data Mining Software: An Update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.

Verena Henrich and Erhard Hinrichs. 2010. Standardizing wordnets in the ISO standard LMF: Wordnet-LMF for GermaNet. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pages 456–464, Beijing, China.

Richard Johansson and Pierre Nugues. 2007. Using WordNet to extend FrameNet coverage. In Proceedings of the Workshop on Building Frame-semantic Resources for Scandinavian and Baltic Languages, at NODALIDA, pages 27–30, Tartu, Estonia.

Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. 2008. A Large-scale Classification of English Verbs. Language Resources and Evaluation, 42:21–40.

Claudia Kunze and Lothar Lemnitzer. 2002. GermaNet – representation, visualization, application. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC), pages 1485–1491, Las Palmas, Canary Islands, Spain.

Beth Levin. 1993. English Verb Classes and Alternations. The University of Chicago Press, Chicago, IL, USA.

Michael Matuschek and Iryna Gurevych. 2011. Where the journey is headed: Collaboratively constructed multilingual Wiki-based resources. In SFB 538: Mehrsprachigkeit, editor, Hamburger Arbeiten zur Mehrsprachigkeit, Hamburg, Germany.

John McCrae, Dennis Spohr, and Philipp Cimiano. 2011. Linking Lexical Resources and Ontologies on the Semantic Web with Lemon. In The Semantic Web: Research and Applications, volume 6643 of Lecture Notes in Computer Science, pages 245–259. Springer, Berlin/Heidelberg, Germany.

Clifton J. McFate and Kenneth D. Forbus. 2011. NULEX: an open-license broad coverage lexicon. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, HLT '11, pages 363–367, Portland, OR, USA.

Christian M. Meyer and Iryna Gurevych. 2010. Worth its Weight in Gold or Yet Another Resource – A Comparative Study of Wiktionary, OpenThesaurus and GermaNet. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing: 11th International Conference, volume 6008 of Lecture Notes in Computer Science, pages 38–49. Berlin/Heidelberg: Springer, Iasi, Romania.

Christian M. Meyer and Iryna Gurevych. 2011. What Psycholinguists Know About Chemistry: Aligning Wiktionary and WordNet for Increased Domain Coverage. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), pages 883–892, Chiang Mai, Thailand.

Roberto Navigli and Simone Paolo Ponzetto. 2010a. BabelNet: Building a Very Large Multilingual Semantic Network. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 216–225, Uppsala, Sweden, July.

Roberto Navigli and Simone Paolo Ponzetto. 2010b. Knowledge-rich Word Sense Disambiguation Rivaling Supervised Systems. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1522–1531, Uppsala, Sweden.

Roberto Navigli. 2006. Meaningful Clustering of Senses Helps Boost Word Sense Disambiguation Performance. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), pages 105–112, Sydney, Australia.

Elisabeth Niemann and Iryna Gurevych. 2011. The People's Web meets Linguistic Knowledge: Automatic Sense Alignment of Wikipedia and WordNet. In Proceedings of the 9th International Conference on Computational Semantics (IWCS), pages 205–214, Oxford, UK.

Muntsa Padró, Núria Bel, and Silvia Necsulescu. 2011. Towards the Automatic Merging of Lexical Resources: Automatic Mapping. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), pages 296–301, Hissar, Bulgaria.

Martha Palmer. 2009. Semlink: Linking PropBank, VerbNet and FrameNet. In Proceedings of the Generative Lexicon Conference (GenLex-09), pages 9–15, Pisa, Italy.

Sameer S. Pradhan, Eduard Hovy, Mitch Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2007. OntoNotes: A Unified Relational Semantic Representation. In Proceedings of the International Conference on Semantic Computing, pages 517–526, Washington, DC, USA.

Lei Shi and Rada Mihalcea. 2005. Putting Pieces Together: Combining FrameNet, VerbNet and WordNet for Robust Semantic Parsing. In Proceedings of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), pages 100–111, Mexico City, Mexico.

Claudia Soria, Monica Monachini, and Piek Vossen. 2009. Wordnet-LMF: fleshing out a standardized format for Wordnet interoperability. In Proceedings of the 2009 International Workshop on Intercultural Collaboration, pages 139–146, Palo Alto, CA, USA.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A Core of Semantic Knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706, Banff, Canada.

Antonio Toral, Stefania Bracale, Monica Monachini, and Claudia Soria. 2010. Rejuvenating the Italian WordNet: Upgrading, Standardising, Extending. In Proceedings of the 5th Global WordNet Conference (GWC), Bombay, India.

Piek Vossen, editor. 1998. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht, Netherlands.
Word Sense Induction for Novel Sense Detection

Jey Han Lau, Paul Cook, Diana McCarthy, David Newman, and Timothy Baldwin

NICTA Victoria Research Laboratory
Dept of Computer Science and Software Engineering, University of Melbourne
Dept of Computer Science, University of California Irvine
Lexical Computing

jhlau@csse.unimelb.edu.au, paulcook@unimelb.edu.au, diana@dianamccarthy.co.uk, newman@uci.edu, tb@ldwin.net

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 591–601, Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics
its own distribution over topics. Words are gen- In our initial experiments, we use LDA topic
erated in each document by first sampling a topic modelling, which requires us to set T , the num-
from the documents topic distribution, then sam- ber of topics to be learned by the model. The
pling a word from that topic. In this work we LDA generative process is: (1) draw a latent
use the topic modelss probabilistic assignment of topic z from a document-specific topic distribu-
topics to words for the WSI task. tion P (t = z|d) then; (2) draw a word w from
the chosen topic P (w|t = z). Thus, the probabil-
2.1 Data Representation and Pre-processing ity of producing a single copy of word w given a
In the context of WSI, topics form our sense rep- document d is given by:
resentation, and words in a sentence are gener-
ated conditioned on a particular sense of the target
T
P (w|d) = P (w|t = z)P (t = z|d).
word. The document in the WSI case is a sin- z=1
gle sentence or a short document fragment con-
taining the target word, as we would not expect In standard LDA, the user needs to specify the
to be able to generate a full document from the number of topics T . In non-parametric variants of
sense of a single target word.1 In the case of the LDA, the model dynamically learns the number of
SemEval datasets, we use the word contexts pro- topics as part of the topic modelling. The particu-
vided in the dataset, while in our novel sense de- lar implementation of non-parametric topic model
tection experiments, we use a context window of we experiment with is Hierarchical Dirichlet Pro-
three sentences, one sentence to either side of the cess (HDP: Teh et al. (2006)),3 where, for each
token occurrence of the target word. document, a distribution of mixture components
As our baseline representation, we use a bag of P (t|d) is sampled from a base distribution G0
words, where word frequency is kept but not word as follows: (1) choose a base distribution G0
order. All words are lemmatised, and stopwords DP (, H); (2) for each document d, generate dis-
and low frequency terms are removed. tribution P (t|d) DP (0 , G0 ); (3) draw a la-
We also experiment with the addition of po- tent topic z from the documents mixture compo-
sitional context word information, as commonly nent distribution P (t|d), in the same manner as
used in WSI. That is, we introduce an additional for LDA; and (4) draw a word w from the chosen
word feature for each of the three words to the left topic P (w|t = z).4
and right of the target word. For both LDA and HDP, we individually topic
Pado and Lapata (2007) demonstrated the im- model each target word, and determine the sense
portance of syntactic dependency relations in the assignment z for a given instance by aggregating
construction of semantic space models, e.g. for over the topic assignments for each word in the
WSD. Based on these findings, we include depen- instance and selecting the sense with the highest
dency relations as additional features in our topic aggregated probability, arg maxz P (t = z|d).
models,2 but just for dependency relations that in-
volve the target word. 3 SemEval Experiments
To facilitate comparison of our proposed method
2.2 Topic Modelling
for WSI with previous approaches, we use the
Topic models learn a probability distribution over dataset from the SemEval-2007 and SemEval-
topics for each document, by simply aggregating 2010 word sense induction tasks (Agirre and
the distributions over topics for each word in the 3
We use the C++ implementation of HDP
document. In WSI terms, we take this distribu- (http://www.cs.princeton.edu/blei/
tion over topics for each target word (instance topicmodeling.html) in our experiments.
4
in WSI parlance) as our distribution over senses The two HDP parameters and 0 control the variabil-
for that word. ity of senses in the documents. In particular, controls the
degree of sharing of topics across documents a high
1
Notwithstanding the one sense per discourse heuristic value leads to more topics, as topics for different documents
(Gale et al., 1992). are more dissimilar. 0 , on the other hand, controls the de-
2
We use the Stanford Parser to do part of speech tagging gree of mixing of topics within a document a high 0 gen-
and to extract the dependency relations (Klein and Manning, erates fewer topics, as topics are less homogeneous within a
2003; De Marneffe et al., 2006). document.
592
Soroa, 2007; Manandhar et al., 2010). We first experiment with the SemEval-2010 dataset, as it includes explicit training and test data for each target word and utilises a more robust evaluation methodology. We then return to experiment with the SemEval-2007 dataset, for comparison purposes with other published results for topic modelling approaches to WSI.

3.1 SemEval-2010

3.1.1 Dataset and Methodology

Our primary WSI evaluation is based on the dataset provided by the SemEval-2010 WSI shared task (Manandhar et al., 2010). The dataset contains 100 target words: 50 nouns and 50 verbs. For each target word, a fixed set of training and test instances are supplied, typically 1 to 3 sentences in length, each containing the target word.

The default approach to evaluation for the SemEval-2010 WSI task is in the form of WSD over the test data, based on the senses that have been automatically induced from the training data. Because the induced senses will likely vary in number and nature between systems, the WSD evaluation has to incorporate a sense alignment step, which it performs by splitting the test instances into two sets: a mapping set and an evaluation set. The optimal mapping from induced senses to gold-standard senses is learned from the mapping set, and the resulting sense alignment is used to map the predictions of the WSI system to pre-defined senses for the evaluation set. The particular split we use to calculate WSD effectiveness in this paper is 80%/20% (mapping/test), averaged across 5 random splits.5

The SemEval-2010 training data consists of approximately 163K training instances for the 100 target words, all taken from the web. The test data is approximately 9K instances taken from a variety of news sources. Following the standard approach used by the participating systems in the SemEval-2010 task, we induce senses only from the training instances, and use the learned model to assign senses to the test instances.

In our original experiments with LDA, we set the number of topics (T) for each target word to the number of senses represented in the test data for that word (varying T for each target word). This is based on the unreasonable assumption that we will have access to gold-standard information on sense granularity for each target word, and is done to establish an upper bound score for LDA. We then relax the assumption, and use a fixed T setting for each of the sets of nouns (T = 7) and verbs (T = 3), based on the average number of senses from the test data in each case. Finally, we introduce positional context features for LDA, once again using the fixed T values for nouns and verbs.

We next apply HDP to the WSI task, using positional features, but learning the number of senses automatically for each target word via the model. Finally, we experiment with adding dependency features to the model.

To summarise, we provide results for the following models:

1. LDA+Variable T: LDA with variable T for each target word based on the number of gold-standard senses.
2. LDA+Fixed T: LDA with fixed T for each of nouns and verbs.
3. LDA+Fixed T+Position: LDA with fixed T and extra positional word features.
4. HDP+Position: HDP (which automatically learns T), with extra positional word features.
5. HDP+Position+Dependency: HDP with both positional word and dependency features.

We compare our models with two baselines from the SemEval-2010 task: (1) Baseline Random, which randomly assigns each test instance to one of four senses; (2) Baseline MFS, the most frequent sense baseline, assigning all test instances to one sense; and also a benchmark system (UoY), in the form of the University of York system (Korkontzelos and Manandhar, 2010), which achieved the best overall WSD results in the original SemEval-2010 task.

5 A 60%/40% split is also provided as part of the task setup, but the results are almost identical to those for the 80%/20% split, and so are omitted from this paper. The original task also made use of V-measure and Paired F-score to evaluate the induced word sense clusters, but these have degenerate behaviour in correlating strongly with the number of senses induced by the method (Manandhar et al., 2010), and are hence omitted from this paper.

3.2 SemEval-2010 Results

The results of our experiments over the SemEval-2010 dataset are summarised in Table 1.
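The mapping-based WSD evaluation described in Section 3.1.1 can be sketched as follows. The function names, toy sense labels, and data are illustrative assumptions, and the averaging over 5 random 80%/20% splits used in the paper is omitted for brevity: each induced sense is aligned to the gold sense it most often co-occurs with in the mapping set, and the alignment is then scored on the evaluation set.

```python
from collections import Counter, defaultdict

def learn_mapping(mapping_pairs):
    """Map each induced sense to the gold sense it most frequently
    co-occurs with in the mapping split (the 80% portion)."""
    counts = defaultdict(Counter)
    for induced, gold in mapping_pairs:
        counts[induced][gold] += 1
    return {ind: c.most_common(1)[0][0] for ind, c in counts.items()}

def wsd_score(mapping_pairs, eval_pairs):
    """Score the evaluation split (the 20% portion) after alignment."""
    m = learn_mapping(mapping_pairs)
    correct = sum(1 for ind, gold in eval_pairs if m.get(ind) == gold)
    return correct / len(eval_pairs)

# Toy example: induced senses s0/s1 against gold senses A/B.
mapping = [("s0", "A")] * 8 + [("s0", "B")] * 2 + \
          [("s1", "B")] * 9 + [("s1", "A")]
evaluation = [("s0", "A"), ("s1", "B"), ("s0", "B"), ("s1", "B")]
print(wsd_score(mapping, evaluation))  # 3 of 4 instances score correct
```

In this toy run, s0 maps to A and s1 maps to B, so three of the four evaluation instances are scored correct (0.75).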
                          WSD (80%/20%)
System                    All    Verbs  Nouns
Baselines
  Baseline Random         0.57   0.66   0.51
  Baseline MFS            0.59   0.67   0.53
LDA
  Variable T              0.64   0.69   0.60
  Fixed T                 0.63   0.68   0.59
  Fixed T+Position        0.63   0.68   0.60
HDP
  +Position               0.68   0.72   0.65
  +Position+Dependency    0.68   0.72   0.65
Benchmark
  UoY                     0.62   0.67   0.59

Table 1: WSD F-score over the SemEval-2010 dataset

Looking first at the results for LDA, we see that the first LDA approach (variable T) is very competitive, outperforming the benchmark system. In this approach, however, we assume perfect knowledge of the number of gold senses of each target word, meaning that the method isn't truly unsupervised. When we fixed T for each of the nouns and verbs, we see a small drop in F-score, but encouragingly the method still performs above the benchmark. Adding positional word features improves the results very slightly for nouns.

When we relax the assumption on the number of word senses in moving to HDP, we observe a marked improvement in F-score over LDA. This is highly encouraging and somewhat surprising, as in hiding information about sense granularity from the model, we have actually improved our results. We return to discuss this effect below. For the final feature, we add dependency features to the HDP model (in addition to retaining the positional word features), but see no movement in the results.6 While the dependency features didn't reduce F-score, their utility is questionable as the generation of the features from the Stanford parser is computationally expensive.

To better understand these results, we present the top-10 terms for each of the senses induced for the word cheat in Table 2. These senses are learnt using HDP with both positional word features (e.g. husband #-1, indicating the lemma husband to the immediate left of the target word) and dependency features (e.g. cheat#prep_on#wife). The first observation to make is that senses 7, 8 and 9 are junk senses, in that the top-10 terms do not convey a coherent sense. These topics are an artifact of HDP: they are learnt at a much later stage of the iterative process of Gibbs sampling and are often smaller than other topics (i.e. have more zero-probability terms). We notice that they are assigned as topics to instances very rarely (although they are certainly used to assign topics to non-target words in the instances), and as such, they do not present a real issue when assigning the sense to an instance, as they are likely to be overshadowed by the dominant senses.7 This conclusion is borne out when we experimented with manually filtering out these topics when assigning instances to senses: there was no perceptible change in the results, reinforcing our suggestion that these topics do not impact on target word sense assignment.

Comparing the results for HDP back to those for LDA, HDP tends to learn almost double the number of senses per target word as are in the gold-standard (and hence are used for the Variable T version of LDA). Far from hurting our WSD F-score, however, the extra topics are dominated by junk topics, and boost WSD F-score for the genuine topics. Based on this insight, we ran LDA once again with variable T (and positional and dependency features), but this time setting T to the value learned by HDP, to give LDA the facility to use junk topics. This resulted in an F-score of 0.66 across all word classes (verbs = 0.71, nouns = 0.62), demonstrating that, surprisingly, even for the same T setting, HDP achieves superior results to LDA. That is, not only does HDP learn T automatically, but the topic model learned for a given T is superior to that for LDA.

Looking at the other senses discovered for cheat, we notice that the model has induced a myriad of senses: the relationship sense of cheat (senses 1, 3 and 4, e.g. husband cheats); the exam usage of cheat (sense 2); the competition/game usage of cheat (sense 5); and cheating in the political domain (sense 6). Although the senses are possibly split a little more than desirable (e.g. senses 1, 3 and 4 arguably describe the same sense), the overall quality of the produced senses

6 An identical result was observed for LDA.
7 In the WSD evaluation, the alignment of induced senses to the gold senses is learnt automatically based on the mapping instances. E.g. if all instances that are assigned sense a have gold sense x, then sense a is mapped to gold sense x. Therefore, if the proportion of junk senses in the mapping instances is low, their influence on WSD results will be negligible.
Sense Num  Top-10 Terms
1  cheat think want ... love feel tell guy cheat#nsubj#include find
2  cheat student cheating test game school cheat#aux#to teacher exam study
3  husband wife cheat wife #1 tiger husband #-1 cheat#prep_on#wife ... woman cheat#nsubj#husband
4  cheat woman relationship cheating partner reason cheat#nsubj#man woman #-1 cheat#aux#to spouse
5  cheat game play player cheating poker cheat#aux#to card cheated money
6  cheat exchange china chinese foreign cheat #-2 cheat #2 china #-1 cheat#aux#to team
7  tina bette kirk walk accuse mon pok symkyn nick star
8  fat jones ashley pen body taste weight expectation parent able
9  euro goal luck fair france irish single 2000 cheat#prep_at#point complain

Table 2: The top-10 terms for each of the senses induced for the verb cheat by the HDP model (with positional word and dependency features)
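Features of the kind shown in Table 2 can be generated along the following lines. This is an illustrative sketch only: the window size, the input format (a list of lemmas plus pre-extracted dependency pairs), and the function names are assumptions, and the dependency pairs are taken as given rather than produced by running the Stanford parser.

```python
def positional_features(lemmas, target_idx, window=2):
    """Emit 'lemma #offset' features for lemmas within +/-window of the
    target word, following the format seen in Table 2 (e.g. 'husband #-1'
    for the lemma immediately to the left of the target)."""
    feats = []
    for offset in range(-window, window + 1):
        j = target_idx + offset
        if offset != 0 and 0 <= j < len(lemmas):
            feats.append(f"{lemmas[j]} #{offset}")
    return feats

def dependency_features(target_lemma, deps):
    """Emit 'target#relation#dependent' features from (relation,
    dependent) pairs, e.g. ('prep_on', 'wife') -> 'cheat#prep_on#wife'."""
    return [f"{target_lemma}#{rel}#{dep}" for rel, dep in deps]

lemmas = ["the", "husband", "cheat", "on", "his", "wife"]
print(positional_features(lemmas, 2))
# ['the #-2', 'husband #-1', 'on #1', 'his #2']
print(dependency_features("cheat", [("nsubj", "husband"), ("prep_on", "wife")]))
# ['cheat#nsubj#husband', 'cheat#prep_on#wife']
```

Both feature types are simply added to the bag of words for each instance before topic modelling.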
in mind, however, that the two topic modelling-based approaches were tuned extensively to the dataset. When we use the tuned hyperparameter settings of YVD, our results rise around 2.5% to surpass both topic modelling approaches, and marginally outperform the I2R system from the original task. Recall that both BL and YVD report higher results again using in-domain training data, so we would expect to see further gains again over the I2R system in following this path.

Overall, these results agree with our findings over the SemEval-2010 dataset (Section 3.2), underlining the viability of topic modelling for automated word sense induction.

WSI, in identifying words which have taken on novel senses over time, based on analysis of diachronic data. Our topic modelling approach is particularly attractive for this task as, not only does it jointly perform type-level WSI, and token-level WSD based on the induced senses (in assigning topics to each instance), but it is possible to gist the induced senses via the contents of the topic (typically using the topic words with highest marginal probability).

The meanings of words can change over time; in particular, words can take on new senses. Contemporary examples of new word-senses include the meanings of swag and tweet as used below:
identifying new word-senses. In contrast to Bamman and Crane (2011), our token-based approach does not require parallel text to induce senses.

4.1 Method

Given two corpora (a reference corpus, which we take to represent standard usage, and a second corpus of newer texts) we identify senses that are novel to the second corpus compared to the reference corpus. For a given word w, we pool all usages of w in the reference corpus and second corpus, and run the HDP WSI method on this super-corpus to induce the senses of w. We then tag all usages of w in both corpora with their single most-likely automatically-induced sense.

Intuitively, if a word w is used in some sense s in the second corpus, and w is never used in that sense in the reference corpus, then w has acquired a new sense, namely s. We capture this intuition in a novelty score (Nov) that indicates whether a given word w has a new sense in the second corpus, s, compared to the reference corpus, r, as below:

    Nov(w) = max({ (p_s(t_i) - p_r(t_i)) / p_r(t_i) : t_i ∈ T })    (1)

where p_s(t_i) and p_r(t_i) are the probability of sense t_i in the second corpus and reference corpus, respectively, calculated using smoothed maximum likelihood estimates, and T is the set of senses induced for w. Novelty is high if there is some sense t that has much higher relative frequency in s than r and that is also relatively infrequent in r.

4.2 Data

Because we are interested in the identification of novel word-senses for applications such as lexicon maintenance, we focus on relatively newly-coined word-senses. In particular, we take the written portion of the BNC (consisting primarily of British English text from the late 20th century) as our reference corpus, and a similarly-sized random sample of documents from the ukWaC (Ferraresi et al., 2008), a Web corpus built from the .uk domain in 2007 which includes a wide range of text types, as our second corpus. Text genres are represented to different extents in these corpora with, for example, text types related to the Internet being much more common in the ukWaC. Such differences are a noted challenge for approaches to identifying lexical semantic differences between corpora (Peirsman et al., 2010), but are difficult to avoid given the corpora that are available. We use TreeTagger (Schmid, 1994) to tokenise and lemmatise both corpora.

Evaluating approaches to identifying semantic change is a challenge, particularly due to the lack of appropriate evaluation resources; indeed, most previous approaches have used very small datasets (Sagi et al., 2009; Cook and Stevenson, 2010; Bamman and Crane, 2011). Because this is a preliminary attempt at applying WSI techniques to identifying new word-senses, our evaluation will also be based on a rather small dataset.

We require a set of words that are known to have acquired a new sense between the late 20th and early 21st centuries. The Concise Oxford English Dictionary aims to document contemporary usage, and has been published in numerous editions, including Thompson (1995, COD95) and Soanes and Stevenson (2008, COD08). Although some of the entries have been substantially revised between editions, many have not, enabling us to easily identify new senses amongst the entries in COD08 relative to COD95. A manual linear search through the entries in these dictionaries would be very time consuming, but by exploiting the observation that new words often correspond to concepts that are culturally salient (Ayto, 2006), we can quickly identify some candidates for words that have taken on a new sense.

Between the time periods of our two corpora, computers and the Internet have become much more mainstream in society. We therefore extracted all entries from COD08 containing the word computing (which is often used as a topic label in this dictionary) that have a token frequency of at least 1000 in the BNC. We then read the entries for these 87 lexical items in COD95 and COD08 and identified those which have a clear computing sense in COD08 that was not present in COD95. In total we found 22 such items. This process, along with all the annotation in this section, was carried out by a native English-speaking author of this paper.

To ensure that the words identified from the dictionaries do in fact have a new sense in the ukWaC sample compared to the BNC, we examine the usage of these words in the corpora. We extract a random sample of 100 usages of each
lemma from the BNC and ukWaC sample and annotate these usages as to whether they correspond to the novel sense or not. This binary distinction is easier than fine-grained sense annotation, and since we do not use these annotations for formal evaluation (only for selecting items for our dataset) we do not carry out an inter-annotator agreement study here. We eliminate any lemma for which we find evidence of the novel sense in the BNC, or for which we do not find evidence of the novel sense in the ukWaC sample.9 We further check word sketches (Kilgarriff and Tugwell, 2002)10 for each of these lemmas in the BNC and ukWaC for collocates that likely correspond to the novel sense; we exclude any lemma for which we find evidence of the novel sense in the BNC, or fail to find evidence of the novel sense in the ukWaC sample. At the end of this process we have identified the following 5 lemmas that have the indicated novel senses in the ukWaC compared to the BNC: domain (n) "Internet domain"; export (v) "export data"; mirror (n) "mirror website"; poster (n) "one who posts online"; and worm (n) "malicious program". For each of the 5 lemmas with novel senses, a second annotator (also a native English-speaking author of this paper) annotated the sample of 100 usages from the ukWaC. The observed agreement and unweighted Kappa between the two annotators is 97.2% and 0.92, respectively, indicating that this is indeed a relatively easy annotation task. The annotators discussed the small number of disagreements to reach consensus.

9 We use the IMS Open Corpus Workbench (http://cwb.sourceforge.net/) to extract the usages of our target lemmas from the corpora. This extraction process fails in some cases, and so we also eliminate such items from our dataset.
10 http://www.sketchengine.co.uk/

For our dataset we also require items that have not acquired a novel sense in the ukWaC sample. For each of the above 5 lemmas we identified a distractor lemma of the same part-of-speech that has a similar frequency in the BNC, and that has not undergone sense change between COD95 and COD08. The 5 distractors are: cinema (n); guess (v); symptom (n); founder (n); and racism (n).

4.3 Results

We compute novelty (Nov, Equation 1) for all 10 items in our dataset, based on the output of the topic modelling. The results are shown in column Novelty in Table 4. The lemmas with a novel sense have higher novelty scores than the distractors according to a one-sided Wilcoxon rank sum test (p < .05).

Lemma         Novelty  Freq. ratio  Novel sense freq.
domain (n)*     116.2         2.60                 41
worm (n)*        68.4         1.04                 30
mirror (n)*      38.4         0.53                 10
guess (v)        16.5         0.93
export (v)*      13.8         0.88                 28
founder (n)      11.0         1.20
cinema (n)        9.7         1.30
poster (n)*       7.9         1.83                  4
racism (n)        2.4         0.98
symptom (n)       2.1         1.16

Table 4: Novelty score (Nov), ratio of frequency in the ukWaC sample and BNC, and frequency of the novel sense in the manually-annotated 100 instances from the ukWaC sample (where applicable), for all lemmas in our dataset. Lemmas marked with an asterisk have a novel sense in the ukWaC sample compared to the BNC.

When a lemma takes on a new sense, it might also increase in frequency. We therefore also consider a baseline in which we rank the lemmas by the ratio of their frequency in the second and reference corpora. These results are shown in column Freq. ratio in Table 4. The difference between the frequency ratios for the lemmas with a novel sense, and the distractors, is not significant (p > .05).

Examining the frequency of the novel senses (shown in column Novel sense freq. in Table 4) we see that the lowest-ranked lemma with a novel sense, poster, is also the lemma with the least-frequent novel sense. This result is unsurprising, as our novelty score will be higher for higher-frequency novel senses. The identification of infrequent novel senses remains a challenge.

The top-ranked topic words for the sense corresponding to the maximum in Equation 1 for the highest-ranked distractor, guess, are the following: @card@, post, ..., n't, comment, think, subject, forum, view, guess. This sense seems to correspond to usages of guess in the context of online forums, which are better represented in the ukWaC sample than the BNC. Because of the challenges posed by such differences between corpora (discussed in Section 4.2) we are unsurprised to see such an error, but this could be addressed in the future by building comparable corpora for use in this application.

Having demonstrated that our method for identifying novel senses can distinguish lemmas that have a novel sense in one corpus compared to another from those that do not, we now consider whether this method can also automatically identify the usages of the induced novel sense.

For each lemma with a gold-standard novel sense, we define the automatically-induced novel sense to be the single sense corresponding to the maximum in Equation 1. We then compute the precision, recall, and F-score of this novel sense with respect to the gold-standard novel sense, based on the 100 annotated tokens for each of the 5 lemmas with a novel sense. The results are shown in the first three numeric columns of Table 5.

                            Topic Selection Methodology
              Nov                         Oracle (single topic)       Oracle (multiple topics)
Lemma         Precision Recall F-score    Precision Recall F-score    Precision Recall F-score
domain (n)    1.00      0.29   0.45       1.00      0.56   0.72       0.97      0.88   0.92
export (v)    0.93      0.96   0.95       0.93      0.96   0.95       0.90      1.00   0.95
mirror (n)    0.67      1.00   0.80       0.67      1.00   0.80       0.67      1.00   0.80
poster (n)    0.00      0.00   0.00       0.44      1.00   0.62       0.44      1.00   0.62
worm (n)      0.93      0.90   0.92       0.93      0.90   0.92       0.86      1.00   0.92

Table 5: Results for identifying the gold-standard novel senses based on the three topic selection methodologies of: (1) Nov; (2) oracle selection of a single topic; and (3) oracle selection of multiple topics.

In the case of export and worm the results are remarkably good, with precision and recall both over 0.90. For domain, the low recall is a result of the majority of usages of the gold-standard novel sense ("Internet domain") being split across two induced senses, the top-two highest ranked induced senses according to Equation 1. The poor performance for poster is unsurprising due to the very low frequency of this lemma's gold-standard novel sense.

These results are based on our novelty ranking method (Nov), and the assumption that the novel sense will be represented in a single topic. To evaluate the theoretical upper bound for a topic-ranking method which uses our HDP-based WSI method and selects a single topic to capture the novel sense, we next evaluate an optimal topic selection approach. In the middle three numeric columns of Table 5, we present results for an experimental setup in which the single best induced sense (in terms of F-score) is selected as the novel sense by an oracle. We see big improvements in F-score for domain and poster. This encouraging result suggests that refining the sense selection heuristic could theoretically improve our method for identifying novel senses, and that the topic modelling approach proposed in this paper has considerable promise for automatic novel sense detection. Of particular note is the result for poster: although the gold-standard novel sense of poster is rare, all of its usages are grouped into a single topic.

Finally, we consider whether an oracle which can select the best subset of induced senses (in terms of F-score) as the novel sense could offer further improvements. In this case (results shown in the final three columns of Table 5) we again see an increase in F-score, to 0.92 for domain. For this lemma the gold-standard novel sense usages were split across multiple induced topics, and so we are unsurprised to find that a method which is able to select multiple topics as the novel sense performs well. Based on these findings, in future work we plan to consider alternative formulations of novelty.

5 Conclusion

We propose the application of topic modelling to the task of word sense induction (WSI), starting with a simple LDA-based methodology with a fixed number of senses, and culminating in a nonparametric method based on a Hierarchical Dirichlet Process (HDP), which automatically learns the number of senses for a given target word. Our HDP-based method outperforms all methods over the SemEval-2010 WSI dataset, and is also superior to other topic modelling-based approaches to WSI based on the SemEval-2007 dataset. We applied the proposed WSI model to the task of identifying words which have taken on new senses, including identifying the token occurrences of the new word sense. Over a small dataset developed in this research, we achieved highly encouraging results.
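As a worked illustration of the novelty score in Equation 1, the sketch below computes Nov(w) from per-corpus induced-sense counts. The add-one smoothing scheme and the toy counts here are assumptions for illustration; the paper does not specify its exact smoothing.

```python
def novelty(second_counts, ref_counts, smooth=1.0):
    """Nov(w) per Equation 1: the maximum relative increase in sense
    probability in the second corpus over the reference corpus.
    Counts are dicts mapping induced sense -> frequency; probabilities
    are smoothed maximum-likelihood estimates (the smoothing used here
    is illustrative, not necessarily the paper's)."""
    senses = set(second_counts) | set(ref_counts)
    n_s = sum(second_counts.values()) + smooth * len(senses)
    n_r = sum(ref_counts.values()) + smooth * len(senses)
    def p_s(t):
        return (second_counts.get(t, 0) + smooth) / n_s
    def p_r(t):
        return (ref_counts.get(t, 0) + smooth) / n_r
    return max((p_s(t) - p_r(t)) / p_r(t) for t in senses)

# A sense that is frequent in the newer corpus but unseen in the
# reference corpus yields a high novelty score.
print(novelty({"t0": 50, "t1": 50}, {"t0": 100}))
```

Sense t1 here is frequent in the second corpus but unattested in the reference corpus, so the relative-frequency ratio (and hence Nov) is large, matching the intuition described in Section 4.1.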
References

Eneko Agirre and Aitor Soroa. 2007. SemEval-2007 Task 02: Evaluating word sense induction and discrimination systems. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 7–12, Prague, Czech Republic.

John Ayto. 2006. Movers and Shakers: A Chronology of Words that Shaped our Age. Oxford University Press, Oxford.

David Bamman and Gregory Crane. 2011. Measuring historical word sense variation. In Proceedings of the 2011 Joint International Conference on Digital Libraries (JCDL 2011), pages 1–10, Ottawa, Canada.

D. Blei, A. Ng, and M. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

S. Brody and M. Lapata. 2009. Bayesian word sense induction. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 103–111, Athens, Greece.

Lou Burnard. 2000. The British National Corpus User's Reference Guide. Oxford University Computing Services.

Paul Cook and Suzanne Stevenson. 2010. Automatically identifying changes in the semantic orientation of words. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), pages 28–34, Valletta, Malta.

Marie-Catherine De Marneffe, Bill Maccartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Adriano Ferraresi, Eros Zanchetta, Marco Baroni, and Silvia Bernardini. 2008. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop: Can we beat Google?, pages 47–54, Marrakech, Morocco.

William A. Gale, Kenneth W. Church, and David Yarowsky. 1992. One sense per discourse. In Proceedings of the DARPA Speech and Natural Language Workshop, pages 233–237.

Kristina Gulordava and Marco Baroni. 2011. A distributional similarity approach to the detection of semantic change in the Google Books Ngram corpus. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 67–71, Edinburgh, Scotland.

Adam Kilgarriff and David Tugwell. 2002. Sketching words. In Marie-Hélène Corréard, editor, Lexicography and Natural Language Processing: A Festschrift in Honour of B. T. S. Atkins, pages 125–137. Euralex, Grenoble, France.

Dan Klein and Christopher D. Manning. 2003. Fast exact inference with a factored model for natural language parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002), pages 3–10, Whistler, Canada.

Ioannis Korkontzelos and Suresh Manandhar. 2010. UoY: Graphs of unambiguous vertices for word sense induction and disambiguation. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 355–358, Uppsala, Sweden.

Suresh Manandhar, Ioannis Klapaftis, Dmitriy Dligach, and Sameer Pradhan. 2010. SemEval-2010 Task 14: Word sense induction & disambiguation. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 63–68, Uppsala, Sweden.

Roberto Navigli and Giuseppe Crisafulli. 2010. Inducing word senses to improve web search result clustering. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 116–126, Cambridge, USA.

Zheng-Yu Niu, Dong-Hong Ji, and Chew-Lim Tan. 2007. I2R: Three systems for word sense discrimination, Chinese word sense disambiguation, and English word sense disambiguation. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 177–182, Prague, Czech Republic.

Sebastian Padó and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33:161–199.

Yves Peirsman, Dirk Geeraerts, and Dirk Speelman. 2010. The automatic identification of lexical variation between language varieties. Natural Language Engineering, 16(4):469–491.

Eyal Sagi, Stefan Kaufmann, and Brady Clark. 2009. Semantic density analysis: Comparing word meaning across time and space. In Proceedings of the EACL 2009 Workshop on GEMS: GEometrical Models of Natural Language Semantics, pages 104–111, Athens, Greece.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.

Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–123.

Catherine Soanes and Angus Stevenson, editors. 2008. The Concise Oxford English Dictionary. Oxford University Press, eleventh (revised) edition. Oxford Reference Online.

Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581.

Della Thompson, editor. 1995. The Concise Oxford Dictionary of Current English. Oxford University Press, Oxford, ninth edition.

Xuchen Yao and Benjamin Van Durme. 2011. Nonparametric Bayesian word sense induction. In Proceedings of TextGraphs-6: Graph-based Methods for Natural Language Processing, pages 10–14, Portland, Oregon.
Learning Language from Perceptual Context
Raymond Mooney
University of Texas at Austin
mooney@cs.utexas.edu
Abstract
Machine learning has become the dominant approach to building natural-language processing sys-
tems. However, current approaches generally require a great deal of laboriously constructed human-
annotated training data. Ideally, a computer would be able to acquire language like a child by being
exposed to linguistic input in the context of a relevant but ambiguous perceptual environment. As
a step in this direction, we have developed systems that learn to sportscast simulated robot soccer
games and to follow navigation instructions in virtual environments by simply observing sample hu-
man linguistic behavior in context. This work builds on our earlier work on supervised learning of
semantic parsers that map natural language into a formal meaning representation. In order to apply
such methods to learning from observation, we have developed methods that estimate the meaning of
sentences given just their ambiguous perceptual context.
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, page 602, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
Learning for Microblogs with Distant Supervision:
Political Forecasting with Twitter
Algorithms to identify these moods range from matching words in a sentiment lexicon to training classifiers with a hand-labeled corpus. Since labeling corpora is expensive, recent work on Twitter uses emoticons (i.e., ASCII smiley faces such as :-( and :-)) as noisy labels in tweets for distant supervision (Pak and Paroubek, 2010; Davidov et al., 2010; Kouloumpis et al., 2011). This paper presents new analysis of the downstream effects of topic identification on sentiment classifiers and their application to political forecasting.

Interest in measuring the political mood of a country has recently grown (O'Connor et al., 2010; Tumasjan et al., 2010; Gonzalez-Bailon et al., 2010; Carvalho et al., 2011; Tan et al., 2011). Here we compare our sentiment results to Presidential Job Approval polls and show that the sentiment scores produced by our system are positively correlated with both the Approval and Disapproval job ratings.

In this paper we present a method for coupling two distantly supervised algorithms for topic identification and sentiment classification on Twitter. In Section 4, we describe our approach to topic identification and present a new annotated corpus of political tweets for future study. In Section 5, we apply distant supervision to sentiment analysis. Finally, Section 6 discusses our system's performance on modeling Presidential Job Approval ratings from Twitter data.

2 Previous Work

The past several years have seen sentiment analysis grow into a diverse research area. The idea of sentiment applied to microblogging domains is relatively new, but there are numerous recent publications on the subject. Since this paper focuses on the microblog setting, we concentrate on these contributions here.

The most straightforward approach to sentiment analysis is using a sentiment lexicon to label tweets based on how many sentiment words appear. This approach tends to be used by applications that measure the general mood of a population. O'Connor et al. (2010) use a ratio of positive and negative word counts on Twitter, Kramer (2010) counts lexicon words on Facebook, and Thelwall (2011) uses the publicly available SentiStrength algorithm to make weighted counts of

In contrast to lexicons, many approaches instead focus on ways to train supervised classifiers. However, labeled data is expensive to create, and examples of Twitter classifiers trained on hand-labeled data are few (Jiang et al., 2011). Instead, distant supervision has grown in popularity. These algorithms use emoticons to serve as semantic indicators for sentiment. For instance, a sad face (e.g., :-() serves as a noisy label for a negative mood. Read (2005) was the first to suggest emoticons for UseNet data, followed by Go et al. (Go et al., 2009) on Twitter, and many others since (Bifet and Frank, 2010; Pak and Paroubek, 2010; Davidov et al., 2010; Kouloumpis et al., 2011). Hashtags (e.g., #cool and #happy) have also been used as noisy sentiment labels (Davidov et al., 2010; Kouloumpis et al., 2011). Finally, multiple models can be blended into a single classifier (Barbosa and Feng, 2010). Here, we adopt the emoticon algorithm for sentiment analysis, and evaluate it on a specific domain (politics).

Topic identification in Twitter has received much less attention than sentiment analysis. The majority of approaches simply select a single keyword (e.g., Obama) to represent their topic (e.g., US President) and retrieve all tweets that contain the word (O'Connor et al., 2010; Tumasjan et al., 2010; Tan et al., 2011). The underlying assumption is that the keyword is precise, and due to the vast number of tweets, the search will return a large enough dataset to measure sentiment toward that topic. In this work, we instead use a distantly supervised system similar in spirit to those recently applied to sentiment analysis.

Finally, we evaluate the approaches presented in this paper on the domain of politics. Tumasjan et al. (2010) showed that the results of a recent German election could be predicted through frequency counts with remarkable accuracy. Most similar to this paper is that of O'Connor et al. (2010), in which tweets relating to President Obama are retrieved with a keyword search and a sentiment lexicon is used to measure overall approval. This extracted approval ratio is then compared to Gallup's Presidential Job Approval polling data. We directly compare their results with various distantly supervised approaches.

3 Datasets

The experiments in this paper use seven months of
keywords based on predefined polarity strengths. tweets from Twitter (www.twitter.com) collected
604
between June 1, 2009 and December 31, 2009. The corpus contains over 476 million tweets labeled with usernames and timestamps, collected through Twitter's spritzer API without keyword filtering. Tweets are aligned with polling data in Section 6 using their timestamps.

The full system is evaluated against the publicly available daily Presidential Job Approval polling data from Gallup.1 Every day, Gallup asks 1,500 adults in the United States about whether they approve or disapprove of the job President Obama is doing as president. The results are compiled into two trend lines for Approval and Disapproval ratings, as shown in Figure 1. We compare our positive and negative sentiment scores against these two trends.

1 http://gallup.com/poll/113980/gallup-daily-obama-job-approval.aspx

Gallup Daily Obama Job Approval Ratings
Figure 1: Gallup presidential job Approval and Disapproval ratings measured between June and Dec 2009.

4 Topic Identification

This section addresses the task of Topic Identification in the context of microblogs. While the general field of topic identification is broad, its use on microblogs has been somewhat limited. Previous work on the political domain simply uses keywords to identify topic-specific tweets (e.g., O'Connor et al. (2010) use Obama to find presidential tweets). This section shows that distant supervision can use the same keywords to build a classifier that is much more robust to noise than approaches that use pure keyword search.

4.1 Distant Supervision

Distant supervision uses noisy signals to identify positive examples of a topic in the face of unlabeled data. As described in Section 2, recent sentiment analysis work has applied distant supervision using emoticons as the signals. The approach extracts tweets with ASCII smiley faces (e.g., :) and ;)) and builds classifiers trained on these positive examples. We apply distant supervision to topic identification and evaluate its effectiveness on this subtask.

As with sentiment analysis, we need to collect positive and negative examples of tweets about the target topic. Instead of emoticons, we extract positive tweets containing one or more predefined keywords. Negative tweets are randomly chosen from the corpus. Examples of positive and negative tweets that can be used to train a classifier based on the keyword Obama are given here:

  positive: LOL, obama made a bears reference in green bay. uh oh.

  negative: New blog up! It regards the new iPhone 3G S: <URL>

ID    Type        Keywords
PC-1  Obama       obama
PC-2  General     republican, democrat, senate, congress, government
PC-3  Topic       health care, economy, tax cuts, tea party, bailout, sotomayor
PC-4  Politician  obama, biden, mccain, reed, pelosi, clinton, palin
PC-5  Ideology    liberal, conservative, progressive, socialist, capitalist

Table 1: The keywords used to select positive training sets for each political classifier (a subset of all PC-3 and PC-5 keywords are shown to conserve space).

We then use these automatically extracted datasets to train a multinomial Naive Bayes classifier. Before feature collection, the text is normalized as follows: (a) all links to photos (twitpics) are replaced with a single generic token, (b) all non-twitpic URLs are replaced with a token, (c) all user references (e.g., @MyFriendBob) are collapsed, (d) all numbers are collapsed to INT, (e) tokens containing the same letter twice or more in a row are condensed to a two-letter string (e.g., the word ahhhhh becomes ahh), and (f) the text is lowercased and spaces are inserted between words and punctuation. The text of each tweet is then tokenized, and the tokens are used to collect unigram and bigram features. All features that occur fewer than 10 times in the training corpus are ignored.

Finally, after training a classifier on this dataset, every tweet in the corpus is classified as either positive (i.e., relevant to the topic) or negative (i.e., irrelevant). The positive tweets are then sent to the second sentiment analysis stage.

4.2 Keyword Selection

Keywords are the input to our proposed distantly supervised system, and of course, the input to previous work that relies on keyword search. We evaluate classifiers based on different keywords to measure the effects of keyword selection.

O'Connor et al. (2010) used the keywords Obama and McCain, and Tumasjan et al. (2010) simply extracted tweets containing Germany's political party names. Both approaches extracted matching tweets, considered them relevant (correctly, in many cases), and applied sentiment analysis. However, different keywords may result in very different extractions. We instead attempted to build a generic political topic classifier. To do this, we experimented with the five different sets of keywords shown in Table 1. For each set, we extracted all tweets matching one or more keywords, and created a balanced positive/negative training set by then selecting negative examples randomly from non-matching tweets. A couple of examples of ideology (PC-5) extractions are shown here:

  You often hear of deontologist libertarians and utilitarian liberals but are there any Aristotelian socialists?

  <url> - Then, slather on a liberal amount of plaster, sand down smooth, and paint however you want. I hope this helps!

The second tweet is an example of the noisy nature of keyword extraction. Most extractions are accurate, but different keywords retrieve very different sets of tweets. Examples for the political topics (PC-3) are shown here:

  RT @PoliticalMath: hope the presidents health care predictions <url> are better than his stimulus predictions <url>

  @adamjschmidt You mean we could have chosen health care for every man woman and child in America or the Iraq war?

Each keyword set builds a classifier using the approach described in Section 4.1.

4.3 Labeled Datasets

In order to evaluate distant supervision against keyword search, we created two new labeled datasets of political and apolitical tweets.

The Political Dataset is an amalgamation of all four keyword extractions (PC-1 is a subset of PC-4) listed in Table 1. It consists of 2,000 tweets randomly chosen from the keyword searches of PC-2, PC-3, PC-4, and PC-5, with 500 tweets from each. This combined dataset enables an evaluation of how well each classifier can identify tweets from other classifiers. The General Dataset contains 2,000 random tweets from the entire corpus. This dataset allows us to evaluate how well classifiers identify political tweets in the wild.

This paper's authors initially annotated the same 200 tweets in the General Dataset to compute inter-annotator agreement. The Kappa was 0.66, which is typically considered good agreement. Most disagreements occurred over tweets about money and the economy. We then split the remaining portions of the two datasets between the two annotators. The Political Dataset contains 1,691 political and 309 apolitical tweets, and the General Dataset contains 28 political and 1,978 apolitical tweets. These two datasets of 2,000 tweets each are publicly available for future evaluation and comparison to this work.2

2 http://www.usna.edu/cs/nchamber/data/twitter

4.4 Experiments

Our first experiment addresses the question of keyword variance. We measure performance on the Political Dataset, a combination of all of our proposed political keywords. Each keyword set contributed 25% of the dataset, so the evaluation measures the extent to which a classifier identifies tweets selected by the other keyword sets. We classified the 2,000 tweets with the five distantly supervised classifiers and the one Obama keyword extractor from O'Connor et al. (2010).

Results are shown on the left side of Figure 2. Precision and recall calculate correct identification of the political label. The five distantly supervised approaches perform similarly, and show remarkable robustness despite their different training sets. In contrast, the keyword extractor only
captures about a quarter of the political tweets. PC-1 is the distantly supervised analog to the Obama keyword extractor, and we see that distant supervision increases its F1 score dramatically from 0.39 to 0.90.

Figure 2: Five distantly supervised classifiers and the Obama keyword classifier. Left panel: the Political Dataset of political tweets. Right panel: the General Dataset, representative of Twitter as a whole.

The second evaluation addresses the question of classifier performance on Twitter as a whole, not just on a political dataset. We evaluate on the General Dataset just as on the Political Dataset. Results are shown on the right side of Figure 2. Most tweets posted to Twitter are not about politics, so the apolitical label dominates this more representative dataset. Again, the five distant supervision classifiers have similar results. The Obama keyword search has the highest precision, but drastically sacrifices recall. Four of the five classifiers outperform keyword search in F1 score.

4.5 Discussion

The Political Dataset results show that distant supervision adds robustness to a keyword search. The distantly supervised Obama classifier (PC-1) improved the basic Obama keyword search by 0.51 absolute F1 points. Furthermore, distant supervision doesn't require additional human input, but simply adds a trained classifier. Two example tweets that an Obama keyword search misses but that its distantly supervised analog captures are shown here:

  Why does Congress get to opt out of the Obummercare and we cant. A company gets fined if they dont comply. Kiss freedom goodbye.

  I agree with the lady from california, I am sixty six years old and for the first time in my life I am ashamed of our government.

These results also illustrate that distant supervision allows for flexibility in construction of the classifier. Different keywords show little change in classifier performance.

The General Dataset experiment evaluates classifier performance in the wild. The keyword approach again scores below those trained on noisy labels. It classifies most tweets as apolitical and thus achieves very low recall for tweets that are actually about politics. On the other hand, distant supervision creates classifiers that over-extract political tweets. This is a result of using balanced datasets in training; such effects can be mitigated by changing the training balance. Even so, four of the five distantly trained classifiers score higher than the raw keyword approach. The only underperformer was PC-1, which suggests that when building a classifier for a relatively broad topic like politics, a variety of keywords is important.

The next section takes the output from our classifiers (i.e., our topic-relevant tweets) and evaluates a fully automated sentiment analysis algorithm against real-world polling data.

5 Targeted Sentiment Analysis

The previous section evaluated algorithms that extract topic-relevant tweets. We now evaluate methods to distill the overall sentiment that they express. This section compares two common approaches to sentiment analysis.

We first replicated the technique used in O'Connor et al. (2010), in which a lexicon of positive and negative sentiment words called OpinionFinder (Wilson and Hoffmann, 2005) is used to evaluate the sentiment of each tweet (others have used similar lexicons (Kramer, 2010; Thelwall et al., 2010)). We compare our full distantly supervised approach to theirs. We also experimented with SentiStrength, a lexicon-based program built to identify sentiment in online comments on the social media website MySpace. Though MySpace is close in genre to Twitter, we did not observe a performance gain. All reported results thus use OpinionFinder to facilitate a more accurate comparison with previous work.

Second, we built a distantly supervised system using tweets containing emoticons, as done in previous work (Read, 2005; Go et al., 2009; Bifet and Frank, 2010; Pak and Paroubek, 2010; Davidov et al., 2010; Kouloumpis et al., 2011). Although distant supervision has previously been shown to outperform sentiment lexicons, these evaluations do not consider the extra topic identification step.

5.1 Sentiment Lexicon

The OpinionFinder lexicon is a list of 2,304 positive and 4,151 negative sentiment terms (Wilson and Hoffmann, 2005). We ignore neutral words in the lexicon and we do not differentiate between weak and strong sentiment words. A tweet is labeled positive if it contains any positive terms, and negative if it contains any negative terms. A tweet can be marked as both positive and negative, and if a tweet contains words in neither category, it is marked neutral. This procedure is the same as used by O'Connor et al. (2010). The sentiment scores Spos and Sneg for a given set of N tweets are calculated as follows:

  Spos = Σx 1{xlabel = positive} / N    (1)

  Sneg = Σx 1{xlabel = negative} / N    (2)

where 1{xlabel = positive} is 1 if the tweet x is labeled positive, and N is the number of tweets in the corpus. For the sake of comparison, we also calculate a sentiment ratio as done in O'Connor et al. (2010):

  Sratio = Σx 1{xlabel = positive} / Σx 1{xlabel = negative}    (3)

5.2 Distant Supervision

To build a trained classifier, we automatically generated a positive training set by searching for tweets that contain at least one positive emoticon and no negative emoticons. We generated a negative training set using an analogous process. The emoticon symbols used for positive sentiment were :) =) :-) :] =] :-] :} :o) :D =D :-D :P =P :-P C:. Negative emoticons were :( =( :-( :[ =[ :-[ :{ :-c :c} D: D= :S :/ =/ :-/ :( : (. Using this data, we train a multinomial Naive Bayes classifier using the same method used for the political classifiers described in Section 4.1. This classifier is then used to label topic-specific tweets as expressing positive or negative sentiment. Finally, the three overall sentiment scores Spos, Sneg, and Sratio are calculated from the results.

6 Predicting Approval Polls

This section uses the two-stage Targeted Sentiment Analysis system described above in a real-world setting. We analyze the sentiment of Twitter users toward U.S. President Barack Obama. This allows us to both evaluate distant supervision against previous work on the topic, and demonstrate a practical application of the approach.

6.1 Experiment Setup

The following experiments combine both topic identification and sentiment analysis. The previous sections described six topic identification approaches and two sentiment analysis approaches. We evaluate all combinations of these systems, and compare their final sentiment scores for each day in the nearly seven-month period that our dataset spans.

Gallup's Daily Job Approval reports two numbers: Approval and Disapproval. We calculate individual sentiment scores Spos and Sneg for each day, and compare the two sets of trends using Pearson's correlation coefficient. O'Connor et al. do not explicitly evaluate these two, but instead use the ratio Sratio. We also calculate this daily ratio from Gallup for comparison purposes by dividing the Approval by the Disapproval.

6.2 Results and Discussion

The first set of results uses the lexicon-based classifier for sentiment analysis and compares the different topic identification approaches. The first table in Table 2 reports Pearson's correlation coefficient with Gallup's Approval and Disapproval ratings. Regardless of the Topic classifier, all
systems inversely correlate with Presidential Approval. However, they correlate well with Disapproval. Figure 3 graphically shows the trend lines for the keyword and the distantly supervised system PC-1. The visualization illustrates how the keyword-based approach is highly influenced by day-by-day changes, whereas PC-1 displays a much smoother trend.

Sentiment Lexicon
Topic Classifier   Approval   Disapproval
keyword            -0.22      0.42
PC-1               -0.65      0.71
PC-2               -0.61      0.71
PC-3               -0.51      0.65
PC-4               -0.49      0.60
PC-5               -0.65      0.74

Distantly Supervised Sentiment
Topic Classifier   Approval   Disapproval
keyword            0.27       0.38
PC-1               0.71       0.73
PC-2               0.33       0.46
PC-3               0.05       0.31
PC-4               0.08       0.26
PC-5               0.54       0.62

Table 2: Correlation between Gallup polling data and the extracted sentiment with a lexicon (trends shown in Figure 3) and distant supervision (Figure 4).

The second set of results uses distant supervision for sentiment analysis and again varies the topic identification approach. The second table in Table 2 gives the correlation numbers, and Figure 4 shows the keyword and PC-1 trend lines. The results are widely better than when a lexicon is used for sentiment analysis. Approval is no longer inversely correlated, and two of the distantly supervised systems strongly correlate (PC-1, PC-5).

The best performing system (PC-1) used distant supervision for both topic identification and sentiment analysis. Pearson's correlation coefficient for this approach is 0.71 with Approval and 0.73 with Disapproval.

Finally, we compute the ratio Sratio between the positive and negative sentiment scores (Equation 3) to compare to O'Connor et al. (2010). Table 3 shows the results.

Sentiment Lexicon
keyword  PC-1  PC-2  PC-3  PC-4  PC-5
.22      .63   .46   .33   .27   .61

Distantly Supervised Sentiment
keyword  PC-1  PC-2  PC-3  PC-4  PC-5
.40      .64   .46   .30   .28   .60

Table 3: Correlation between Gallup Approval/Disapproval ratio and extracted sentiment ratio scores.

The distantly supervised topic identification algorithms show little change whether a sentiment lexicon or a classifier is used. However, O'Connor et al.'s keyword approach improves when used with a distantly supervised sentiment classifier (.22 to .40). Merging Approval and Disapproval into one ratio appears to mask the sentiment lexicon's poor correlation with Approval. The ratio may not be an ideal evaluation metric for this reason. Real-world consumers of Presidential Approval ratings want separate Approval and Disapproval scores, as Gallup reports. Our results (Table 2) show that distant supervision avoids a negative correlation with Approval, but the ratio hides this important advantage.

One reason the ratio may mask the negative Approval correlation is that tweets are often classified as both positive and negative by a lexicon (Section 5.1). This could explain the behavior seen in Figure 3, in which both the positive and negative sentiment scores rise over time. However, further experimentation did not rectify this pattern. We revised Spos and Sneg to make binary decisions for a lexicon: a tweet is labeled positive if it strictly contains more positive words than negative (and vice versa). Correlation showed little change. Approval was still negatively correlated, Disapproval positive (although less so in both), and the ratio scores actually dropped further. The sentiment ratio continued to hide the poor Approval performance by a lexicon.

Sentiment Lexicon
Figure 3: Presidential job approval and disapproval calculated using two different topic identification techniques, and using a sentiment lexicon for sentiment analysis. Gallup polling results are shown in black.

Figure 4: Presidential job approval sentiment scores calculated using two different topic identification techniques, and using the emoticon classifier for sentiment analysis. Gallup polling results are shown in black.

6.3 New Baseline: Topic-Neutral Sentiment

Distant supervision for sentiment analysis outperforms that with a sentiment lexicon (Table 2). Distant supervision for topic identification further improves the results (PC-1 v. keyword). The best system uses distant supervision in both stages (PC-1 with distantly supervised sentiment), outperforming the purely keyword-based algorithm of O'Connor et al. (2010). However, the question of how important topic identification is has not yet been addressed here or in the literature.

Both O'Connor et al. (2010) and Tumasjan et al. (2010) created joint systems with two topic identification and sentiment analysis stages. But what if the topic identification step were removed and sentiment analysis instead run on the entire Twitter corpus? To answer this question, we ran the distantly supervised emoticon classifier to classify all tweets in the 7 months of Twitter data. For each day, we computed the positive and negative sentiment scores as above. The evaluation is identical, except for the removal of topic identification. Correlation results are shown in Table 4.

Topic-Neutral Sentiment
Algorithm         Approval   Disapproval
Distant Sup.      0.69       0.74
Keyword Lexicon   -0.63      0.69

Table 4: Pearson's correlation coefficient of sentiment analysis without topic identification.

Topic-Neutral Sentiment
Figure 5: Presidential job approval sentiment scores calculated using the entire Twitter corpus, with two different techniques for sentiment analysis. Gallup polling results are shown in black for comparison.

This baseline parallels the results seen when topic identification is used: the sentiment lexicon is again inversely correlated with Approval, and distant supervision outperforms the lexicon approach in both ratings. This is not surprising given previous distantly supervised work on sentiment analysis (Go et al., 2009; Davidov et al., 2010; Kouloumpis et al., 2011). However, our distant supervision also performs as well as the best performing topic-specific system. The best performing topic classifier, PC-1, correlated with Approval with r=0.71 (0.69 here) and Disapproval with r=0.73 (0.74 here). Computing overall sentiment on Twitter performs as well as political-specific sentiment. This unintuitive result suggests a new baseline that all topic-based systems should compute.

7 Discussion

This paper introduces a new methodology for gleaning topic-specific sentiment information. We highlight four main contributions here.

First, this work is one of the first to evaluate distant supervision for topic identification. All five political classifiers outperformed the lexicon-driven keyword equivalent that has been widely used in the past. Our model achieved .90 F1 compared to the keyword .39 F1 on our political tweet dataset. On Twitter as a whole, distant supervision increased F1 by over 100%. The results also suggest that performance is relatively insensitive to the specific choice of seed keywords that are used to select the training set for the political classifier.

Second, the sentiment analysis experiments build upon what has recently been shown in the literature: distant supervision with emoticons is a valuable methodology. We also expand upon prior work by discovering drastic performance differences between positive and negative lexicon words. The OpinionFinder lexicon failed to correlate (correlating inversely, in fact) with Gallup's Approval polls, whereas a distantly trained classifier correlated strongly with both Approval and Disapproval (Pearson's .71 and .73). We only tested OpinionFinder and SentiStrength, so it is possible that another lexicon might perform better. However, our results suggest that lexicons vary in their quality across sentiment, and distant supervision may provide more robustness.

Third, our results outperform previous work on Presidential Job Approval prediction (O'Connor et al., 2010). We presented two novel approaches to the domain: a coupled distantly supervised system, and a topic-neutral baseline, both of which outperform previous results. In fact, the baseline surprisingly matches or outperforms the more sophisticated approaches that use topic identification. The baseline correlates .69 with Approval and .74 with Disapproval. This suggests a new baseline that should be used in all topic-specific sentiment applications.

Fourth, we described and made available two new annotated datasets of political tweets to facilitate future work in this area.

Finally, Twitter users are not a representative sample of the U.S. population, yet the high correlation between political sentiment on Twitter and Gallup ratings makes these results all the more intriguing for polling methodologies. Our specific 7-month period of time differs from previous work, and thus we hesitate to draw strong conclusions from our comparisons or to extend implications to non-political domains. Future work should further investigate distant supervision as a tool to assist topic detection in microblogs.

Acknowledgments

We thank Jure Leskovec for the Twitter data, Brendan O'Connor for open and frank correspondence, and the reviewers for helpful suggestions.
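To make the daily scoring and poll comparison of Sections 5 and 6 concrete, here is a small illustrative sketch. It is not the authors' code; the function names and the toy inputs are our own. It computes Spos, Sneg, and Sratio from Equations 1-3, and Pearson's correlation coefficient between two daily series:

```python
# Illustrative reconstruction of the scoring in Sections 5-6 (not the
# authors' implementation). Names and data shapes are our own choices.
from math import sqrt

def sentiment_scores(tweets):
    """Equations 1-3: Spos, Sneg, Sratio for one day's tweets.

    `tweets` is a list of label sets; under the lexicon labeling of
    Section 5.1 a single tweet may be both positive and negative.
    """
    n = len(tweets)
    pos = sum(1 for t in tweets if "positive" in t)
    neg = sum(1 for t in tweets if "negative" in t)
    s_pos, s_neg = pos / n, neg / n
    s_ratio = pos / neg if neg else float("inf")
    return s_pos, s_neg, s_ratio

def pearson(xs, ys):
    """Pearson's correlation coefficient between two daily series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A system's daily Spos series would then be compared against Gallup's Approval trend and Sneg against Disapproval, as in Table 2, while Table 3 compares the Sratio series against the Gallup Approval/Disapproval ratio.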
References

Luciano Barbosa and Junlan Feng. 2010. Robust sentiment detection on Twitter from biased and noisy data. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010).

Albert Bifet and Eibe Frank. 2010. Sentiment knowledge discovery in Twitter streaming data. In Lecture Notes in Computer Science, volume 6332, pages 1–15.

Paula Carvalho, Luis Sarmento, Jorge Teixeira, and Mario J. Silva. 2011. Liars and saviors in a sentiment annotated corpus of comments to political debates. In Proceedings of the Association for Computational Linguistics (ACL-2011), pages 564–568.

Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Enhanced sentiment learning using Twitter hashtags and smileys. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010).

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. Technical report.

Sandra Gonzalez-Bailon, Rafael E. Banchs, and Andreas Kaltenbrunner. 2010. Emotional reactions and the pulse of public opinion: Measuring the impact of political events on the sentiment of online discussions. Technical report.

Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. 2011. Target-dependent Twitter sentiment classification. In Proceedings of the Association for Computational Linguistics (ACL-2011).

Efthymios Kouloumpis, Theresa Wilson, and Johanna Moore. 2011. Twitter sentiment analysis: The good the bad and the omg! In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media.

Adam D. I. Kramer. 2010. An unobtrusive behavioral model of gross national happiness. In Proceedings of the 28th International Conference on Human Factors in Computing Systems (CHI 2010).

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL 09), pages 1003–1011.

Brendan O'Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. 2010. From tweets to polls: Linking text sentiment to public opinion time series. In Proceedings of the AAAI Conference on Weblogs and Social Media.

Alexander Pak and Patrick Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC).

Jonathon Read. 2005. Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In Proceedings of the ACL Student Research Workshop (ACL-2005).

Chenhao Tan, Lillian Lee, Jie Tang, Long Jiang, Ming Zhou, and Ping Li. 2011. User-level sentiment analysis incorporating social networks. In Proceedings of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. 2010. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12):2544–2558.

Mike Thelwall, Kevan Buckley, and Georgios Paltoglou. 2011. Sentiment in Twitter events. Journal of the American Society for Information Science and Technology, 62(2):406–418.

Andranik Tumasjan, Timm O. Sprenger, Philipp G. Sandner, and Isabell M. Welpe. 2010. Election forecasts with Twitter: How 140 characters reflect the political landscape. Social Science Computer Review.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing.
Learning from evolving data streams: online triage of bug reports

Grzegorz Chrupala
Spoken Language Systems
Saarland University
gchrupala@lsv.uni-saarland.de

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 613–622, Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics
1.3 Concept drift

Many standard supervised approaches in machine learning assume a stationary distribution from which training examples are independently drawn. The set of training examples is processed as a batch, and the resulting learned decision function (such as a classifier) is then used on test items, which are assumed to be drawn from the same stationary distribution.

If we need an automated agent which uses human labels to learn to tag objects, the batch learning approach is inadequate. Examples arrive one by one in a stream, not as a batch. Even more importantly, both the output (label) distribution and the input distribution from which the examples come are emphatically not stationary. As a software project progresses and matures, the type of issues reported is going to change. As project members and users come and go, the vocabulary they use to describe the issues will vary. As the consensus tag folksonomy emerges, the label and training example distribution will evolve. This phenomenon is sometimes referred to as concept drift (Widmer and Kubat 1996, Tsymbal 2004).

Early research on learning to triage tended to either not notice the problem (Cubranic and Murphy 2004), or acknowledge but not address it (Anvik et al. 2006): the evaluation these authors used assigned bug reports randomly to training and evaluation sets, discarding the temporal sequencing of the data stream.

Bhattacharya and Neamtiu (2010) explicitly address the issue of online training and evaluation. In their setup, the system predicts the output for an item based only on items preceding it in time. However, their approach to incremental learning is simplistic: they use a batch classifier, but retrain it from scratch after receiving each training example. A fully retrained batch classifier will adapt only slowly to a changing data stream, as more recent examples have no more influence on the decision function than less recent ones.

Tamrawi et al. (2011) propose an incremental approach to bug triage: the classes are ranked according to a fuzzy set membership function, which is based on incrementally updated feature/class co-occurrence counts. The model is efficient in online classification, but also adapts only slowly.

1.4 Online learning

This paucity of research on online learning from issue tracker streams is rather surprising, given that truly incremental learners have been well-known for many years. In fact, one of the first learning algorithms proposed was Rosenblatt's perceptron, a simple mistake-driven discriminative classification algorithm (Rosenblatt 1958). In the current paper we address this situation and show that by using simple, standard online learning methods we can improve on batch or pseudo-online learning. We also show that when using a sophisticated state-of-the-art stochastic gradient descent technique the performance gains can be quite large.

1.5 Contributions

Our main contributions are the following: Firstly, we explicitly show that concept drift is pervasive and serious in real bug report streams. We then address this problem by leveraging state-of-the-art online learning techniques which automatically track the evolving data stream and incrementally update the model after each data item. We also adopt the continuous evaluation paradigm, where the learner predicts the output for each example before using it to update the model. Secondly, we address the important issue of reproducibility in research in bug triage automation by making available the data sets which we collected and used, in both their raw and preprocessed forms.

2 Open issue-tracker data

Open source software repositories and their associated issue trackers are a naturally occurring source of large amounts of (partially) labeled data. There seems to be growing interest in exploiting this rich resource, as evidenced by existing publications as well as the appearance of a dedicated workshop (Working Conference on Mining Software Repositories).

In spite of the fact that the data is publicly available in open repositories, it is not possible to directly compare the results of the research conducted on bug triage so far: authors use non-trivial project-specific filtering, re-labeling and pre-processing heuristics; these steps are usually not specified in enough detail that they could be easily reproduced.
To help remedy this situation we decided to collect data from several open issue trackers, use a minimal amount of simple preprocessing and filter heuristics to get useful input data, and publicly share both the raw and preprocessed data.

We designed a simple record type which acts as a common denominator for several tracker formats. Thus we can use a common representation for issue reports from various trackers. The fields in our record are shown in Table 1.

Field        Meaning
Identifier   Issue ID
Title        Short description of issue
Description  Content of issue report, which may include steps to reproduce, error messages, stack traces etc.
Author       ID of report submitter
CCS          List of IDs of people CC'd on the issue report
Labels       List of tags associated with issue
Status       Label describing the current status of the issue (e.g. Invalid, Fixed, Won't Fix)
Assigned To  ID of person who has been assigned to deal with the issue
Published    Date on which issue report was submitted

Table 1: Issue report record

Below we describe the issue trackers used and the datasets we built from them. As discussed above (and in more detail in Section 4.1), we use progressive validation rather than a split into training and test set. However, in order to avoid developing on the test data, we split each data stream into two substreams, by assigning odd-numbered examples to the test stream and the even-numbered ones to the development stream. We can use the development stream for exploratory data analysis and feature and parameter tuning, and then use progressive validation to evaluate on entirely unseen test data. Below we specify the size and number of unique labels in the development sets; the test sets are very similar in size.

Chromium  Chromium is the open-source project behind Google's Chrome browser (http://code.google.com/p/chromium/). We retrieved all the bugs from the issue tracker, of which 66,704 have one of the closed statuses. We generated two data sets from the Chromium issues:

Chromium SUBCOMPONENT. Chromium uses special tags to help triage the bug reports. Tags prefixed with Area- specify which subcomponent of the project the bug should be routed to. In some cases more than one Area- tag is present. Since this affects less than 1% of reports, for simplicity we treat these as single, compound labels. The development set contains 31,953 items and 75 unique output labels.

Chromium ASSIGNED. In this dataset the output is the value of the assignedTo field. We discarded issues where the field was left empty, as well as the ones which contained the placeholder value all-bugs-test.chromium.org. The development set contains 16,154 items and 591 unique output labels.

Android  Android is a mobile operating system project (http://code.google.com/p/android/). We retrieved all the bug reports, of which 6,341 had a closed status. We generated two datasets:

Android SUBCOMPONENT. The reports which are labeled with tags prefixed with Component-. The development set contains 888 items and 12 unique output labels.

Android ASSIGNED. The output label is the value of the assignedTo field. We discarded issues with the field left empty. The development set contains 718 items and 72 unique output labels.

Firefox  Firefox is the well-known web browser project (https://bugzilla.mozilla.org). We obtained a total of 81,987 issues with a closed status.

Firefox ASSIGNED. We discarded issues where the field was left empty, as well as the ones which contained a placeholder value (nobody). The development set contains 12,733 items and 503 unique output labels.

Launchpad  Launchpad is an issue tracker run by Canonical Ltd for mostly Ubuntu-related projects (https://bugs.launchpad.net/). We obtained a total of 99,380 issues with a closed status.
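The substream split and the predict-then-update protocol described above can be sketched as follows. This is a minimal illustration; the `predict`/`update` model interface and all names are hypothetical stand-ins, not from the paper.

```python
# Odd/even substream split plus progressive validation: every example is
# predicted before it is used for training, so all predictions are on
# unseen data.

def split_stream(items):
    """Odd-numbered items (1st, 3rd, ...) go to the test stream,
    even-numbered ones (2nd, 4th, ...) to the development stream."""
    test = [x for i, x in enumerate(items, start=1) if i % 2 == 1]
    dev = [x for i, x in enumerate(items, start=1) if i % 2 == 0]
    return dev, test

def progressive_validation(model, stream):
    """Predict each item first, then update the model on it."""
    correct = total = 0
    for features, label in stream:
        if model.predict(features) == label:   # test on the incoming item
            correct += 1
        model.update(features, label)          # only then learn from it
        total += 1
    return correct / total if total else 0.0
```

This is the continuous evaluation paradigm from Section 1.5: no separate held-out set is needed, because the model never sees an item before predicting it.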
Figure 2: ASSIGNED class distribution change over time
ranked list of labels for the current item based on the relative frequencies of output labels in the window of k previous items. We tested windows of size 100 and 1000 and report the better result.

SVM Minibatch  This model uses the multiclass linear Support Vector Machine model (Crammer and Singer 2002) as implemented in SVM Light (Joachims 1999). SVM is known as a state-of-the-art batch model in classification in general and in text categorization in particular. The output classes for an input example are ranked according to the discriminant values returned by the SVM classifier. In order to adapt the model to an online setting we retrain it every n examples on the window of k previous examples. The parameters n and k can have a large influence on the prediction, but it is not clear how to set them when learning from streams. Here we chose the values (100, 1000) based on how feasible the run time was and on the performance during exploratory experiments on Chromium SUBCOMPONENT. Interestingly, keeping the window parameter relatively small helps performance: a window of 1,000 works better than a window of 5,000.

Perceptron  We implemented a single-pass online multiclass Perceptron with a constant learning rate. It maintains a weight vector for each output seen so far: the prediction function ranks outputs according to the inner product of the current example with the corresponding weight vector. The update function takes the true output and the predicted output. If they are not equal, the current input is subtracted from the weight vector corresponding to the predicted output and added to the weight vector corresponding to the true output (see Algorithm 1). We hash each feature to an integer value and use it as the feature's index in the weight vectors in order to bound memory usage in an online setting (Weinberger et al. 2009). The Perceptron is a simple but strong baseline for online learning.

Algorithm 1 Multiclass online perceptron
  function PREDICT(Y, W, x)
    return {(y, W_y^T x) | y ∈ Y}
  procedure UPDATE(W, x, y, ŷ)
    if y ≠ ŷ then
      W_ŷ ← W_ŷ − x
      W_y ← W_y + x

Bugzie  This is the model described in Tamrawi et al. (2011). The output classes are ranked according to the fuzzy set membership function defined as follows:

  μ(y, X) = 1 − ∏_{x∈X} (1 − n(y,x) / (n(y) + n(x) − n(y,x)))

where y is the output label, X the set of features in the input issue report, n(y, x) the number of examples labeled as y which contain feature x, n(y) the number of examples labeled y, and n(x) the number of examples containing feature x. The counts are updated online. Tamrawi et al. (2011) also use two so-called caches: the label cache keeps the j% most recent labels and the term cache the k most significant features for each label. Since in Tamrawi et al. (2011)'s experiments the label cache did not affect the results significantly, here we always set j to 100%. We select the optimal k parameter from {100, 1000, 5000} based on the development set.

Regression with Stochastic Gradient Descent  This model performs online multiclass learning by means of a reduction to regression. The regressor is a linear model trained using Stochastic Gradient Descent (Zhang 2004). SGD updates the current parameter vector w^(t) based on the gradient of the loss incurred by the regressor on the current example (x^(t), y^(t)):

  w^(t+1) = w^(t) − η^(t) ∇L(y^(t), w^(t)ᵀ x^(t))

The parameter η^(t) is the learning rate at time t, and L is the loss function. We use the squared loss:

  L(ŷ, y) = (ŷ − y)²

We reduce multiclass learning to regression using a one-vs-all-type scheme, by effectively transforming an example (x, y) ∈ X × Y into |Y| examples (x′, y′) ∈ X′ × {0, 1}, where Y is the set of labels seen so far. The transform T is defined as follows:

  T(x, y) = {(x′, I(y = y′)) | y′ ∈ Y, x′_{h(i,y′)} = x_i}

where h(i, y′) composes the index i with the label y′ (by hashing).

For a new input x the ranking of the outputs y ∈ Y is obtained according to the value of the
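Algorithm 1 combined with the hashing trick can be realized in a few lines. The sketch below is illustrative; the class name, bucket count, and dictionary-backed weight storage are our assumptions, not details from the paper.

```python
# Multiclass online perceptron with hashed (feature, class) weights.
# Hashing bounds the index space, so memory stays bounded as new
# features and labels arrive in the stream.
from collections import defaultdict

class HashedPerceptron:
    def __init__(self, n_buckets=2**20):
        self.n = n_buckets
        self.w = defaultdict(float)  # sparse storage over a bounded index space

    def _idx(self, feature, label):
        return hash((feature, label)) % self.n

    def score(self, features, label):
        # inner product of the example with the label's weight vector
        return sum(self.w[self._idx(f, label)] * v for f, v in features.items())

    def predict(self, features, labels):
        # rank candidate labels by score; return the top one
        return max(labels, key=lambda y: self.score(features, y))

    def update(self, features, true_label, predicted_label):
        if true_label != predicted_label:  # mistake-driven update
            for f, v in features.items():
                self.w[self._idx(f, true_label)] += v
                self.w[self._idx(f, predicted_label)] -= v
```

A single mistake immediately shifts the decision boundary, which is why this simple learner can track a drifting stream far faster than a periodically retrained batch model.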
prediction of the base regressor on the binary example corresponding to each class label.

As our basic regression learner we use an efficient implementation of regression via SGD, Vowpal Wabbit (VW) (Langford et al. 2011). VW implements adaptive individual learning rates for each feature, as proposed by Duchi et al. (2010) and McMahan and Streeter (2010). This is appropriate when there are many sparse features, and is especially useful in learning from text and from fast-evolving data. Features such as the unigram and bigram counts that we rely on are notoriously sparse, and this is exacerbated by the change over time in bug report streams.

4.5 Results

Figures 3 and 4 show the progressive validation results on all the development data streams. The horizontal lines indicate the mean MRR scores for the whole stream. The curves show a moving average of MRR in a window comprising 7% of the total number of items. In most of the plots it is evident how the prediction performance depends on the concept drift illustrated in the plots in Section 3: for example, on Chromium SUBCOMPONENT the performance of all the models drops a bit before the midpoint in the stream while the learners adapt to the change in label distribution that is happening at this time. This is especially pronounced for Bugzie, since it is not able to learn from mistakes and adapt rapidly, but simply accumulates counts.

For five out of the six datasets, Regression SGD gives the best overall performance. On Launchpad ASSIGNED, Bugzie scores higher; we investigate this anomaly below.

Another observation is that the window-based frequency baseline can be quite hard to beat: in three out of the six cases, the minibatch SVM model is no better than the baseline. Bugzie sometimes performs quite well, but for Chromium SUBCOMPONENT and Firefox ASSIGNED it scores below the baseline.

Dataset        RER
Chromium SUB   0.36
Android SUB    0.38
Chromium AS    0.21
Android AS     0.19
Firefox AS     0.16
Launchpad AS   0.49

Table 2: Best model's error relative to baseline on the development set

Task      Model       MRR     Acc
Chromium  Window      0.5747  0.3467
          SVM         0.5766  0.4535
          Perceptron  0.5793  0.4393
          Bugzie      0.4971  0.2638
          Regression  0.7271  0.5672
Android   Window      0.5209  0.3080
          SVM         0.5459  0.4255
          Perceptron  0.5892  0.4390
          Bugzie      0.6281  0.4614
          Regression  0.7012  0.5610

Table 3: SUBCOMPONENT evaluation results on test set.

Regarding the quality of the different datasets, an interesting indicator is the relative error reduction by the best model over the baseline (see Table 2). It is especially hard to extract meaningful information about the labeling from the inputs on the Firefox ASSIGNED dataset. One possible cause of this can be that the assignment labeling practices in this project are not consistent: this impression seems to be borne out by informal inspection.

On the other hand, as the scores in Table 2 indicate, Chromium SUBCOMPONENT, Android SUBCOMPONENT and Launchpad ASSIGNED contain enough high-quality signal for the best model to substantially outperform the label frequency baseline.

On Launchpad ASSIGNED, Regression SGD performs worse than Bugzie. The concept drift plot for these data suggests one reason: there is very little change in class distribution over time as compared to the other datasets. In fact, even though the issue reports in Launchpad range from year 2005 to 2011, the more recent ones are heavily overrepresented: 84% of the items in the development data are from 2011. Thus fast adaptation is less important in this case and Bugzie is able to perform well.

On the other hand, the reason for the less than stellar score achieved with Regression SGD is another special feature of this dataset: it has by far the largest number of labels, almost 2,000. This degrades the performance of the one-vs-all scheme we use with SGD Regression. Preliminary investigation indicates that the problem is mostly caused by our application of the hash-
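For concreteness, the fuzzy membership score μ(y, X) from Section 4.4 that Bugzie ranks by can be sketched as follows. The count-dictionary layout and function name are our assumptions; only the formula itself comes from the paper.

```python
# Bugzie fuzzy-membership score:
#   mu(y, X) = 1 - prod over x in X of (1 - n(y,x) / (n(y) + n(x) - n(y,x)))
# n_yx, n_y, n_x are co-occurrence and marginal counts, updated online.

def bugzie_score(label, features, n_yx, n_y, n_x):
    prod = 1.0
    for x in features:
        nyx = n_yx.get((label, x), 0)
        denom = n_y.get(label, 0) + n_x.get(x, 0) - nyx
        prod *= 1.0 - (nyx / denom if denom else 0.0)
    return 1.0 - prod
```

Because the score only ever accumulates counts, old evidence is never down-weighted, which matches the slow adaptation to drift observed above.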
Task       Model       MRR     Acc
Chromium   Window      0.0999  0.0472
           SVM         0.0908  0.0550
           Perceptron  0.1817  0.1128
           Bugzie      0.2063  0.0960
           Regression  0.3074  0.2157
Android    Window      0.3198  0.1684
           SVM         0.2541  0.1684
           Perceptron  0.3225  0.2057
           Bugzie      0.3690  0.2086
           Regression  0.4446  0.2951
Firefox    Window      0.5695  0.4426
           SVM         0.4604  0.4166
           Perceptron  0.5191  0.4306
           Bugzie      0.5402  0.4100
           Regression  0.6367  0.5245
Launchpad  Window      0.0725  0.0337
           SVM         0.1006  0.0704
           Perceptron  0.3323  0.2607
           Bugzie      0.5271  0.4339
           Regression  0.4702  0.3879

Table 4: ASSIGNED evaluation results on test set.
Figure 4: ASSIGNED evaluation results on the development set
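The MRR scores used throughout these results reward ranking the gold label near the top of the predicted list; a minimal sketch (names are ours):

```python
# Mean Reciprocal Rank: average of 1 / (rank of the gold label),
# with rank counted 1-based in each best-first ranking.

def mean_reciprocal_rank(rankings, gold):
    total = 0.0
    for ranked, y in zip(rankings, gold):
        if y in ranked:
            total += 1.0 / (ranked.index(y) + 1)   # reciprocal of 1-based rank
    return total / len(gold)
```

An MRR of 0.5, for example, corresponds to the gold label appearing on average at rank 2.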
first they predict the most likely developer to assign to a bug using a classifier. In a second step they rank candidate developers according to how likely they were to take over a bug from the developer predicted in the first step. Their approach to incremental learning simply involves fully retraining a batch classifier after each item in the data stream. They test their approach on fixed bugs in Mozilla and Eclipse, reporting accuracies of 27.5% and 38.2% respectively.

Tamrawi et al. (2011) propose the Bugzie model, where developers are ranked according to the fuzzy set membership function as defined in Section 4.4. They also use the label (developer) cache and term cache to speed up processing and make the model adapt better to the evolving data stream. They evaluate Bugzie and compare its performance to the models used in Bhattacharya and Neamtiu (2010) on seven issue trackers: Bugzie has superior performance on all of them, ranging from 29.9% to 45.7% for top-1 output. They do not use separate validation sets for system development and parameter tuning.

In comparison to Bhattacharya and Neamtiu (2010) and Tamrawi et al. (2011), here we focus much more on the analysis of concept drift in data streams and on the evaluation of learning under its constraints. We also show that for evolving issue tracker data, in a large majority of cases SGD Regression handily outperforms Bugzie.

6 Conclusion

We demonstrate that concept drift is a real, pervasive issue for learning from issue tracker streams. We show how to adapt to it by leveraging recent research in online learning algorithms. We also make our dataset collection publicly available to enable direct comparisons between different bug triage systems.1

We have identified a good learning framework for mining bug reports: in future we would like to explore smarter ways of extracting useful signals from the data by using more linguistically informed preprocessing and higher-level features such as word classes.

Acknowledgments

This work was carried out in the context of the Software-Cluster project EMERGENT and was partially funded by BMBF under grant number 01IC10S01O.

1 Available from http://goo.gl/ZquBe
References

Anvik, J., Hiew, L., and Murphy, G. (2006). Who should fix this bug? In Proceedings of the 28th International Conference on Software Engineering, pages 361–370. ACM.

Bhattacharya, P. and Neamtiu, I. (2010). Fine-grained incremental learning and multi-feature tossing graphs to improve bug triaging. In International Conference on Software Maintenance (ICSM), pages 1–10. IEEE.

Blum, A., Kalai, A., and Langford, J. (1999). Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 203–208. ACM.

Crammer, K. and Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265–292.

Cubranic, D. and Murphy, G. C. (2004). Automatic bug triage using text categorization. In SEKE 2004: Proceedings of the Sixteenth International Conference on Software Engineering & Knowledge Engineering, pages 92–97. KSI Press.

Duchi, J., Hazan, E., and Singer, Y. (2010). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.

Halpin, H., Robu, V., and Shepherd, H. (2007). The complex dynamics of collaborative tagging. In Proceedings of the 16th International Conference on World Wide Web, pages 211–220. ACM.

Joachims, T. (1999). Making large-scale SVM learning practical. In Schölkopf, B., Burges, C., and Smola, A., editors, Advances in Kernel Methods: Support Vector Learning. MIT Press.

Langford, J., Hsu, D., Karampatziakis, N., Chapelle, O., Mineiro, P., Hoffman, M., Hofman, J., Lamkhede, S., Chopra, S., Faigon, A., Li, L., Rios, G., and Strehl, A. (2011). Vowpal Wabbit. https://github.com/JohnLangford/vowpal_wabbit/wiki.

Matter, D., Kuhn, A., and Nierstrasz, O. (2009). Assigning bug reports using a vocabulary-based expertise model of developers. In Sixth IEEE Working Conference on Mining Software Repositories.

McMahan, H. and Streeter, M. (2010). Adaptive bound optimization for online convex optimization. arXiv preprint arXiv:1002.4908.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386.

Tamrawi, A., Nguyen, T., Al-Kofahi, J., and Nguyen, T. (2011). Fuzzy set and cache-based approach for bug triaging. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, pages 365–375. ACM.

Tsymbal, A. (2004). The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin.

Voorhees, E. (2000). The TREC-8 question answering track report. NIST Special Publication, pages 77–82.

Weinberger, K., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. (2009). Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113–1120. ACM.

Widmer, G. and Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69–101.

Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, page 116. ACM.
Towards a model of formal and informal address in English
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 623–633,
Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics
[Figure 1 shows the German sentence "Darf ich Sie etwas fragen?" labeled V, projected to the English sentence "Please permit me to ask you a question.", also labeled V. Step 1: the German pronoun provides the overt T/V label; Step 2: the T/V class label is copied to the English sentence.]

Figure 1: T/V label induction for English sentences in a parallel corpus with annotation projection

in one language, but remain covert in the other. Examples include morphology (Fraser, 2009) and tense (Schiehlen, 1998). A technique that is often applied in such cases is annotation projection, the use of parallel corpora to copy information from a language where it is overtly realized to one where it is not (Yarowsky and Ngai, 2001; Hwa et al., 2005; Bentivogli and Pianta, 2005).

The phenomenon of formal and informal address has been considered in the contexts of translation into (Hobbs and Kameyama, 1990; Kanayama, 2003) and generation in Japanese (Bateman, 1988). Li and Yarowsky (2008) learn pairs of formal and informal constructions in Chinese with a paraphrase mining strategy. Other relevant recent studies consider the extraction of social networks from corpora (Elson et al., 2010). A related study is (Bramsen et al., 2011), which considers another sociolinguistic distinction, classifying utterances as upspeak and downspeak based on the social relationship between speaker and addressee.

This paper extends a previous pilot study (Faruqui and Padó, 2011). It presents more annotation, investigates a larger and better motivated feature set, and discusses the findings in detail.

3 A Parallel Corpus of Literary Texts

This section discusses the construction of T/V gold standard labels for English sentences. We obtain these labels from a parallel English-German corpus using the technique of annotation projection (Yarowsky and Ngai, 2001) sketched in Figure 1: We first identify the T/V status of German pronouns, then copy this T/V information onto the corresponding English sentence.

3.1 Data Selection and Preparation

Annotation projection requires a parallel corpus. We found commonly used parallel corpora like EUROPARL (Koehn, 2005) or the JRC Acquis corpus (Steinberger et al., 2006) to be unsuitable for our study since they either contain almost no direct address at all or, if they do, just formal address (V). Fortunately, for many literary texts from the 19th and early 20th century, copyright has expired, and they are freely available in several languages.

We identified 110 stories and novels among the texts provided by Project Gutenberg (English) and Project Gutenberg-DE (German)1 that were available in both languages, with a total of 0.5M sentences per language. Examples are Dickens' David Copperfield or Tolstoy's Anna Karenina. We excluded plays and poems, as well as 19th-century adventure novels by Sir Walter Scott and James F. Cooper which use anachronistic English for stylistic reasons, including words that previously (until the 16th century) indicated T (thee, didst).

We cleaned the English and German novels manually by deleting the tables of contents, prologues, epilogues, as well as chapter numbers and titles occurring at the beginning of each chapter, to obtain properly parallel texts. The files were then formatted to contain one sentence per line using the sentence splitter and tokenizer provided with EUROPARL (Koehn, 2005). Blank lines were inserted to preserve paragraph boundaries. All novels were lemmatized and POS-tagged using TreeTagger (Schmid, 1994).2 Finally, they were sentence-aligned using Gargantuan (Braune and Fraser, 2010), an aligner that supports one-to-many alignments, and word-aligned in both directions using Giza++ (Och and Ney, 2003).

3.2 T/V Gold Labels for English Utterances

As Figure 1 shows, the automatic construction of T/V labels for English involves two steps.

Step 1: Labeling German Pronouns as T/V. German has three relevant personal pronouns for the T/V distinction: du (T), sie (V), and ihr (T/V). However, various ambiguities make their interpretation non-straightforward.

The pronoun ihr can be used both for plural T address and for a somewhat archaic singular or plural V address. In principle, these usages should be distinguished by capitalization (V pronouns are generally capitalized in German), but many T instances (informal use) in our corpora are nevertheless capitalized. Additionally, ihr can be the

1 http://www.gutenberg.org, http://gutenberg.spiegel.de/
2 It must be expected that the tagger degrades on this dataset; however we did not quantify this effect.
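The two-step projection described above can be made concrete with a sketch. The token representation, the `<unknown>` lemma marker, and the simplified rule set below are illustrative assumptions of ours, not the paper's implementation.

```python
# Precision-oriented heuristics for labeling German pronoun tokens as
# T or V (a rough simplification of the Step 1 rules). Tokens are
# (word, lemma, pos) triples from a TreeTagger-style pipeline.

def label_token(tokens, i):
    word, lemma, pos = tokens[i]
    if lemma == "du":
        # skip the French nobiliary particle "du": the surrounding
        # names are unknown lemmas to the German tagger
        neighbours = [tokens[j][1] for j in (i - 1, i + 1)
                      if 0 <= j < len(tokens)]
        if "<unknown>" not in neighbours:
            return "T"
    if word == "Sie" and i > 0 and tokens[i - 1][2] not in ("$.", "$("):
        # capitalized, non-utterance-initial Sie -> formal address
        return "V"
    return None  # leave ambiguous tokens unlabeled (precision over recall)
```

Returning `None` for everything ambiguous mirrors the precision-first design: unlabeled sentences simply contribute no training data.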
dative form of the 3rd person feminine pronoun sie (she/her). These instances are neutral with respect to T/V but were misanalysed by TreeTagger as instances of the T/V lemma ihr. Since TreeTagger does not provide person information, and we did not want to use a full parser, we decided to omit ihr/Ihr from consideration.3

Of the two remaining pronouns (du and sie), du expresses (singular) T. A minor problem is presented by novels set in France, where du is used as a nobiliary particle. These instances can be recognised reliably since the names before and after du are generally unknown to the German tagger. Thus we do not interpret du as T if the word preceding or succeeding it has unknown as its lemma.

The V pronoun, sie, doubles as the pronoun for third person (she/they) when not capitalized. We therefore interpret only capitalized instances of Sie as V. Furthermore, we ignore utterance-initial positions, where all words are capitalized. This is defined as tokens directly after a sentence boundary (POS $.) or after a bracket (POS $().

These rules concentrate on precision rather than recall. They leave many instances of German second person pronouns unlabeled; however, this is not a problem since we do not currently aim at obtaining complete coverage on the English side of our parallel corpus. From the 0.5M German sentences, about 14% were labeled as T or V (37K for V and 28K for T). In a random sample of roughly 300 German sentences which we analysed, we did not find any errors. This puts the precision of our heuristics at above 99%.

Step 2: Annotation Projection. We now copy the information over onto the English side. We originally intended to transfer T/V labels between German and English word-aligned pronouns. However, pronouns are not necessarily translated into pronouns; additionally, we found word alignment accuracy for pronouns to be far from perfect, due to the variability in function word translation. For these reasons, we decided to look at T/V labels at the level of complete sentences, ignoring word alignment. This is generally unproblematic, since address is almost always consistent within sentences: of the 65K German sentences with T or V labels, only 269 (< 0.5%) contain both T and V. Our projection on the English side results in 25K V and 18K T sentences4, of which 255 (0.6%) are labeled as both T and V. We exclude these sentences.

Note that this strategy relies on the direct correspondence assumption (Hwa et al., 2005), that is, it assumes that the T/V status of an utterance is not changed in translation. We believe that this is a reasonable assumption, given that T/V is determined by the social relation between interlocutors; but see Section 4 for discussion.

Comparison         No context   In context
A1 vs. A2          75% (.49)    79% (.58)
A1 vs. GS          60% (.20)    70% (.40)
A2 vs. GS          65% (.30)    76% (.52)
(A1 ∩ A2) vs. GS   67% (.34)    79% (.58)

Table 1: Manual annotation for T/V on a 200-sentence sample. Comparison among human annotators (A1 and A2) and to projected gold standard (GS). All cells show raw agreement and Cohen's κ (in parentheses).

3.3 Data Splitting

Finally, we divided our English data into training, development and test sets with 74 novels (26K sentences), 19 novels (9K sentences) and 13 novels (8K sentences), respectively. The corpus is available for download at http://www.nlpado.de/~sebastian/data.shtml.

4 Human Annotation of T/V for English

This section investigates how well the T/V distinction can be made in English by human raters, and on the basis of what information. Two annotators with near native-speaker competence in English were asked to label 200 random sentences from the training set as T or V. Sentences were first presented in isolation ("no context"). Subsequently, they were presented with three sentences pre- and post-context each ("in context").

Table 1 shows the results of the annotation study. The first line compares the annotations of the two annotators against each other (inter-annotator agreement). The next two lines compare the annotators' labels against the gold standard labels projected from German (GS). The last line compares the annotator-assigned labels to the GS for the instances on which the annotators agree. For all cases, we report raw accuracy and Cohen's κ (Cohen, 1960), i.e. chance-corrected agreement.

3 Instances of ihr as possessive pronoun occurred as well, but could be filtered out on the basis of the POS tag.
4 Our sentence aligner supports one-to-many alignments and often aligns single German to multiple English sentences.
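Cohen's κ, the chance-corrected agreement measure reported in Table 1, can be computed as in this minimal sketch (names are ours):

```python
# Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e), where p_o is
# the observed agreement and p_e the agreement expected by chance from
# each annotator's label distribution.
from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)
```

Because p_e is subtracted out, κ stays low when annotators mostly agree by chance, which is why the table's percentages and κ values can diverge.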
We first observe that the T/V distinction is considerably more difficult to make for individual sentences (no context) than when the discourse is available. In context, inter-annotator agreement increases from 75% to 79%, and agreement with the gold standard rises by 10%. It is notable that the two annotators agree worse with one another than with the gold standard (see below for discussion). On those instances where they agree, Cohen's κ reaches 0.58 in context, which is interpreted as approaching good agreement (Fleiss, 1981). Although far from perfect, this inter-annotator agreement is comparable to results for the annotation of fine-grained word sense or sentiment (Navigli, 2009; Bermingham and Smeaton, 2009).

An analysis of disagreements showed that many sentences can be uttered in both T and V contexts and cannot be labeled without context:

(3) And perhaps sometime you may see her.

This case (gold label: V) is disambiguated by the previous sentence, which indicates a hierarchical social relation between speaker and addressee:

(4) "And she is a sort of relation of your lordship's," said Dawson. . . .

Still, even a three-sentence window is often not sufficient, since the surrounding sentences may be just as uninformative. In these cases, more global information about the situation is necessary. Even with perfect information, however, judgments can sometimes deviate, as there are considerable grey areas in T/V usage (Kretzenbacher et al., 2006). In addition, social rules like T/V usage vary in time and between countries (Schüpbach et al., 2006). This helps to explain why annotators agree better with one another than with the gold standard: 21st-century annotators tend to be unfamiliar with 19th-century T/V usage. Consider this example from a book written in second person perspective:

(5) Finally, you acquaint Caroline with the fatal result: she begins by consoling you. "One hundred thousand francs lost! We shall have to practice the strictest economy," you imprudently add.5

Here, the author and translator use V to refer to the reader, while today's usage would almost certainly be T, as presumed by both annotators. Conversations between lovers or family members form another example, where T is modern usage, but the novels tend to use V:

(6) [...] she covered her face with the other to conceal her tears. "Corinne!", said Oswald, "Dear Corinne! My absence has then rendered you unhappy!"6

In sum, our annotation study establishes that the T/V distinction, although not realized by different pronouns in English, can be recovered manually from text, provided that discourse context is available. A substantial part of the errors is due to social changes in T/V usage.

5 Monolingual T/V Modeling

The second part of the paper explores the automatic prediction of the T/V distinction for English sentences. Given the ability to create an English training corpus with T/V labels with the annotation projection methods described in Section 3.2, we can phrase T/V prediction for English as a standard supervised learning task. Our experiments have a twin motivation: (a), on the NLP side, we are mainly interested in obtaining a robust classifier to assign the labels T and V to English sentences; (b), on the sociolinguistic side, we are interested in investigating through which features the categories T and V are expressed in English.

5.1 Classification Framework

We phrase T/V labeling as a binary classification task at the sentence level, performing the classification with L2-regularized logistic regression using the LIBLINEAR library (Fan et al., 2008). Logistic regression defines the probability that a binary response variable y takes some value as a logit-transformed linear combination of the features fi, each of which is assigned a coefficient λi.

    p(y = 1) = 1 / (1 + e^(-z))   with   z = Σ_i λi fi    (7)

Regularization incorporates the size of the coefficient vector into the objective function, subtracting it from the likelihood of the data given the model. This allows the user to trade faithfulness to the data against generalization.7

5 H. de Balzac: Petty Troubles of Married Life
6 A. L. G. de Staël: Corinne
7 We use LIBLINEAR's default parameters and set the cost (regularization) parameter to 0.01.
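Equation (7) is the standard logistic decision rule. A minimal Python sketch of prediction with given coefficients (the feature names and weights below are invented for illustration; the paper delegates training itself to LIBLINEAR):

```python
import math

def predict_v_probability(features, coefficients):
    """Eq. (7): p(y = 1) = 1 / (1 + exp(-z)) with z = sum_i lambda_i * f_i.
    `features` maps feature names to values f_i, `coefficients` to lambda_i."""
    z = sum(coefficients.get(name, 0.0) * value
            for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Invented toy coefficients: positive weights push toward V, negative toward T.
coeffs = {"monsieur": 1.2, "thee": -2.0}
```

With no active features, z = 0 and the model is maximally uncertain (p = 0.5); a positive-weight feature pushes the probability toward V, a negative-weight one toward T.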
p(C|V)/p(C|T)   Example words
4.59            Mister, sir, Monsieur, sirrah, . . .
2.36            Mlle., Mr., M., Herr, Dr., . . .
1.60            Gentlemen, patients, rascals, . . .

Table 2: 3 of the 400 clustering-based semantic classes (classes most indicative for V)

5.2 Feature Types

We experiment with three feature types that are candidates to express the T/V distinction in English.

Word Features. The intuition to use word features draws on the parallel between T/V and information retrieval tasks like document classification: some words are presumably correlated with formal address (like titles), while others should indicate informal address (like first names). In a preliminary experiment, we noticed that in the absence of further constraints, many of the most indicative features are names of persons from particular novels which are systematically addressed formally (like Phileas Fogg from J. Verne's Around the World in Eighty Days) or informally (like Mowgli, Baloo, and Bagheera from R. Kipling's Jungle Book). These features clearly do not generalize to new books. We therefore added a constraint to remove all features which did not occur in at least three novels. To reduce the number of word features to a reasonable order of magnitude, we also performed a χ²-based feature selection (Manning et al., 2008) on the training set. Preliminary experiments established that selecting the top 800 word features yielded a model with good generalization.

Semantic Class Features. Our second feature type is semantic class features. These can be seen as another strategy to counteract the sparseness at the level of word features. We cluster words into 400 semantic classes on the basis of distributional and morphological similarity features which are extracted from an unlabeled English collection of Gutenberg novels comprising more than 100M tokens, using the approach by Clark (2003). These features measure how similar tokens are to one another in terms of their occurrences in the document and are useful in Named Entity Recognition (Finkel and Manning, 2009). As features in the T/V classification of a given sentence, we simply count for each class the number of tokens in this class present in the current sentence. For illustration, Table 2 shows the three classes most indicative for V, ranked by the ratio of probabilities for T and V, estimated on the training set.

Politeness Theory Features. The third feature type is based on Politeness Theory (Brown and Levinson, 1987). Brown and Levinson's prediction is that politeness levels will be detectable in concrete utterances in a number of ways, e.g. a higher use of conjunctive or hedges in polite speech. Formal address (i.e., V as opposed to T) is one such expression. Politeness Theory therefore predicts that other politeness indicators should correlate with the T/V classification. This holds in particular for English, where pronoun choice is unavailable to indicate politeness.

We constructed 16 features on the basis of Politeness Theory predictions, that is, classes of expressions indicating either formality or informality. From a computational perspective, the problem with Politeness Theory predictions is that they are only described qualitatively and by example, without detailed lists. For each feature, we manually identified around 10 relevant words or multi-word expressions. Table 3 shows these 16 features with their intended classes and some example expressions. Similar to the semantic class features, the value of each politeness feature is the sum of the frequencies of its members in a sentence.

5.3 Context: Size and Type

As our annotation study in Section 4 found, context is crucial for human annotators, and this presumably carries over to automatic methods: if the features for a sentence are computed just on that sentence, we will face extremely sparse data. We experiment with symmetrical window contexts, varying the size between n = 0 (just the target sentence) and n = 10 (target sentence plus 10 preceding and 10 succeeding sentences).

This kind of simple sentence context makes an important oversimplification, however. It lumps together material from different speech turns as well as from narrative sentences, which may generate misleading features. For example, narrative sentences may refer to protagonists by their full names including titles (strong features for V) even when these protagonists are in T-style conversations:

(8) "You are the love of my life", said Sir Phileas Fogg.8 (T)

8 J. Verne: Around the World in 80 Days
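The χ²-based selection of word features described in Section 5.2 scores each word by the χ² statistic of a 2×2 contingency table of document frequency against the T/V label and keeps the top k. A self-contained sketch under the assumption of tokenized sentences, with k = 800 as in the paper (function names are ours):

```python
from collections import defaultdict

def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 contingency table:
    rows = feature present/absent, columns = class V/T."""
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return numerator / denominator if denominator else 0.0

def select_top_features(sentences, labels, k=800):
    """Keep the k word features most associated with the T/V label
    (chi-square selection in the style of Manning et al., 2008)."""
    doc_freq = defaultdict(lambda: [0, 0])   # word -> [freq in V, freq in T]
    n_v = sum(1 for label in labels if label == "V")
    n_t = len(labels) - n_v
    for words, label in zip(sentences, labels):
        for word in set(words):
            doc_freq[word][0 if label == "V" else 1] += 1
    scores = {word: chi_square(cv, ct, n_v - cv, n_t - ct)
              for word, (cv, ct) in doc_freq.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The same ranking, read off before truncation to k, also yields the probability-ratio lists reported in Table 6.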
Class                 Example expressions     Class                  Example expressions
Inclusion (T)         let's, shall we         Exclamations (T)       hey, yeah
Subjunctive I (T)     can, will               Subjunctive II (V)     could, would
Proximity (T)         this, here              Distance (V)           that, there
Negated question (V)  didn't I, hasn't it     Indirect question (V)  would there, is there
Indefinites (V)       someone, something      Apologizing (V)        bother, pardon
Polite adverbs (V)    marvellous, superb      Optimism (V)           I hope, would you
Why + modal (V)       why would(n't)          Impersonals (V)        necessary, have to
Polite markers (V)    please, sorry           Hedges (V)             in fact, I guess

Table 3: 16 Politeness Theory-based features with intended classes and example expressions
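Since the full expression lists behind Table 3 are not reproduced in the paper, the sketch below uses a small hypothetical lexicon containing a few of the printed examples; as described above, each feature value is the summed frequency of its member expressions in the sentence:

```python
# Hypothetical mini-lexicon with a few example expressions from Table 3;
# the paper's full lists (about 10 expressions per feature) are not published.
POLITENESS_LEXICON = {
    "polite_markers": ["please", "sorry"],   # (V)
    "hedges": ["in fact", "i guess"],        # (V)
    "apologizing": ["bother", "pardon"],     # (V)
    "exclamations": ["hey", "yeah"],         # (T)
}

def politeness_features(sentence):
    """Feature value = summed frequency of the feature's member expressions
    in the lowercased sentence (crude whitespace matching, for illustration)."""
    text = " " + sentence.lower() + " "
    return {name: sum(text.count(" " + expr + " ") for expr in exprs)
            for name, exprs in POLITENESS_LEXICON.items()}
```

A production version would want tokenization and punctuation handling rather than padded substring counts.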
[Figure: plot of classification accuracy (%); accuracy axis ranges from 61 to 67]

For these reasons, we introduce an alternative concept of context, namely direct speech context, whose purpose is to exclude narrative material. We compute direct speech context in two steps: (a), segmentation of sentences into chunks that are either completely narrative or speech, and (b), labeling of chunks with a classifier that distinguishes
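The symmetric window context of Section 5.3 can be sketched as follows; feature extraction then runs over the tokens of the whole window rather than the target sentence alone (a direct speech context would additionally filter the window down to speech chunks):

```python
def window_context(sentences, i, n):
    """Symmetric sentence window: target sentence i plus up to n preceding
    and n succeeding sentences (n = 0 keeps only the target sentence)."""
    start = max(0, i - n)
    return sentences[start : i + n + 1]

def context_tokens(sentences, i, n):
    """All tokens over which feature values for sentence i are computed,
    assuming whitespace-tokenizable sentences (an illustrative simplification)."""
    return [tok for sent in window_context(sentences, i, n) for tok in sent.split()]
```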
Model                           Accuracy
Random Baseline                 50.0
Frequency Baseline              59.1
Words                           67.0
SemClass                        57.5
PoliteClass                     59.6
Words + SemClass                66.6
Words + PoliteClass             66.4
Words + PoliteClass + SemClass  66.2
Raw human IAA (no context)      75.0
Raw human IAA (in context)      79.0

Table 4: T/V classification accuracy on the development set (direct speech context, size 8). Significant difference to frequency baseline (p<0.01)

respectively. This indicates that sparseness is indeed a major challenge, and context can become large before the effects mentioned in Section 5.3 counteract the positive effect of more data. Direct speech context outperforms sentence context throughout, with a maximum accuracy of 67.0% as compared to 65.2%, even though it shows higher variation, which we attribute to the less stable nature of the direct speech chunks and their automatically created labels. From now on, we adopt a direct speech context of size 8 unless specified differently.

Influence of Features. Table 4 shows the results for different feature types. The best model (word features only) is highly significantly better than the frequency baseline (which it beats by 8%) as determined by a bootstrap resampling test (Noreen, 1989). It gains 17% over the random baseline, but is still more than 10% below inter-annotator agreement in context, which is often seen as an upper bound for automatic models.

Disappointingly, the comparison of the feature groups yields a null result: We are not able to improve over the results for just word features with either the semantic class or the politeness features. Neither feature type outperforms the frequency baseline significantly (p>0.05). Combinations of the different feature types also do worse than just words. The differences between the best model (just words) and the combination models are all not significant (p>0.05). These negative results warrant further analysis. It follows in Section 6.3.

6.2 Results on the Test Set

Table 5 shows the results of evaluating models with the best feature set and with different context sizes on the test set, in order to verify that we did not overfit on the development set when picking the best model.

Model                    Accuracy  Δ to dev set
Frequency baseline       59.3      +0.2
Words (no context)       62.5      -0.4
Words (context size 6)   67.3      +1.0
Words (context size 8)   67.5      +0.5
Words (context size 10)  66.8      +1.0

Table 5: T/V classification accuracy on the test set and differences to dev set results (direct speech context)

The tendencies correspond well to the development set: the frequency baseline is almost identical, as are the results for the different models. The differences to the development set are all equal to or smaller than 1% accuracy, and the best result at 67.5% is 0.5% better than on the development set. This is a reassuring result, as our model appears to generalize well to unseen data.

6.3 Analysis by Feature Types

The results from Section 6.1 motivate further analysis of the individual feature types.

Analysis of Word Features. Word features are by far the most effective features. Table 6 lists the top twenty words indicating T and V (ranked by the ratio of probabilities for the two classes on the training set). The list still includes some proper names like Vrazumihin or Louis-Gaston (even though all features have to occur in at least three novels), but they are relatively infrequent. The most prominent indicators for the formal class V are titles (monsieur, ma'am) and instances of formulaic language (Permit (me), Excuse (me)). There are also some terms which are not straightforward indicators of formal address (angelic, stubbornness), but are associated with a high register.

There is a notable asymmetry between T and V. The word features for T are considerably more difficult to interpret. We find some forms of earlier period English (thee, hast, thou, wilt) that result from occasional archaic passages in the novels as well as first names (Louis-Gaston, Justine). Nevertheless, most features are not straightforward to connect to specifically informal speech.

Analysis of Semantic Class Features. We ranked the semantic classes we obtained by distributional clustering in a similar manner to the word features. Table 2 shows the top three classes indicative for V. Almost all others of the 400 clusters do not have a strong formal/informal association
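The significance figures above rely on bootstrap resampling. A rough sketch in the spirit of Noreen (1989), operating on per-sentence 0/1 correctness vectors for system and baseline (the paper does not spell out its exact test procedure, so this is an approximation with names of our own choosing):

```python
import random

def bootstrap_p_value(system_correct, baseline_correct, samples=10000, seed=0):
    """One-tailed bootstrap test: resample the test set with replacement and
    count how often the system's advantage over the baseline disappears."""
    rng = random.Random(seed)
    n = len(system_correct)
    losses = 0
    for _ in range(samples):
        indices = [rng.randrange(n) for _ in range(n)]
        diff = sum(system_correct[i] - baseline_correct[i] for i in indices)
        if diff <= 0:
            losses += 1
    return losses / samples
```

A small returned value (e.g. below 0.01) indicates that the accuracy advantage is unlikely to be an artifact of the particular test sample.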
Top 20 words for V             Top 20 words for T
Word w         P(w|V)/P(w|T)   Word w          P(w|T)/P(w|V)
Excuse         36.5            thee            94.3
Permit         35.0            amenable        94.3
ai             29.2            stuttering      94.3
am             29.2            guardian        94.3
stubbornness   29.2            hast            92.0
flights        29.2            Louis-Gaston    92.0
monsieur       28.6            lease-making    92.0
Vrazumihin     28.6            melancholic     92.0
mademoiselle   26.5            ferry-boat      92.0
angelic        26.5            Justine         92.0
Allow          24.5            Thou            66.0
madame         21.2            responsibility  63.8
delicacies     21.2            thou            63.8
entrapped      21.2            Iddibal         63.8
lack-a-day     21.2            twenty-fifth    63.8
ma             21.0            Chic            63.8
duke           18.0            allegiance      63.8
policeman      18.0            Jouy            63.8
free-will      18.0            wilt            47.0
Canon          18.0            shall           47.0

Table 6: Most indicative word features for T or V

but mix formal, informal, and neutral vocabulary. This tendency is already apparent in class 3: Gentlemen is clearly formal, while rascals is informal. patients can belong to either class. Even in class 1, we find Sirrah, a contemptuous term used in addressing a man or boy, with a low formality score (p(w|V)/p(w|T) = 0.22). From cluster 4 onward, none of the clusters is strongly associated with either V or T (p(c|V)/p(c|T) ≈ 1).

Our interpretation of these observations is that in contrast to text categorization, there is no clear-cut topical or domain difference between T and V: both categories co-occur with words from almost any domain. In consequence, semantic classes do not, in general, represent strong unambiguous indicators. Similar to the word features, the situation is worse for T than for V: there still are reasonably strong features for V, the marked case, but it is more difficult to find indicators for T.

Analysis of Politeness Features. A major reason for the ineffectiveness of the Politeness Theory-based features seems to be their low frequency: in the best model, with a direct speech context of size 8, only an average of 7 politeness features was active for any given sentence. However, frequency was not the only problem: the politeness features were generally unable to discriminate well between T and V. For all features, the values of p(f|V)/p(f|T) are between 0.9 and 1.3, that is, the features were only weakly indicative of one of the classes. Furthermore, not all features turned out to be indicative of the class we designed them for. The best indicator for V was the Indefinites feature (somehow, someone, cf. Table 3), as expected. In contrast, the best indicator for T was the Negated question feature, which was supposedly an indicator for V (didn't I, haven't we).

A majority of politeness features (13 of the 16) had p(f|V)/p(f|T) values above 1, that is, were indicative for the class V. Thus for this feature type, like for the others, it appears to be more difficult to identify T than to identify V. This negative result can be attributed at least in part to our method of hand-crafting lists of expressions for these features. The inadvertent inclusion of overly general terms for V might be responsible for the features' inability to discriminate well, while we have presumably missed specific terms, which has hurt coverage. This situation may in the future be remedied with the semi-automatic acquisition of instantiations of politeness features.

6.4 Analysis of Individual Novels

One possible hypothesis regarding the difficulty of finding indicators for the class T is that indicators for T tend to be more novel-specific than indicators for V, since formal language is more conventionalized (Brown and Levinson, 1987). If this were the case, then our strategy of building well-generalizing models by combining text from different novels would naturally result in models that have problems with picking up T features.

To investigate this hypothesis, we trained models with the best parameters as before (8-sentence direct speech context, words as features). However, this time we trained novel-specific models, splitting each novel into 50% training data and 50% testing data. We required novels to contain more than 200 labeled sentences. This ruled out most short stories, leaving us with 7 novels in the test set. The results are shown in Table 7 and show a clear improvement. The accuracy is 13% higher than in our main experiment (67% vs. 80%), even though the models were trained on considerably less data. Six of the seven novels perform above the 67.5% result from the main experiment.

Novel                                Accuracy
H. Beecher-Stowe: Uncle Tom's Cabin  90.0
J. Spyri: Cornelli                   88.3
E. Zola: Lourdes                     83.9
H. de Balzac: Cousin Pons            82.3
C. Dickens: The Pickwick Papers      77.7
C. Dickens: Nicholas Nickleby        74.8
F. Hodgson Burnett: Little Lord      61.6
All (micro average)                  80.0

Table 7: T/V prediction models for individual novels (50% of each novel for training and 50% testing)

The top-ranked features for T and V show a much higher percentage of names for both T and V than in the main experiment. This is to be expected, since this experiment does not restrict itself to features that occurred in at least three novels. The price we pay for this is worse generalization to other novels. There is also still a T/V asymmetry: more top features are shared among the V lists of individual novels and with the main experiment V list than on the T side. Like in the main experiment (cf. Section 6.3), V features indicate titles and other features of elevated speech, while T features mostly refer to novel-specific protagonists and events. In sum, these results provide evidence for a difference in status of T and V.

7 Discussion and Conclusions

In this paper, we have studied the distinction between formal and informal (T/V) address, which is not expressed overtly through pronoun choice or morphosyntactic marking in modern English. Our hypothesis was that the T/V distinction can be recovered in English nevertheless. Our manual annotation study has shown that annotators can in fact tag monolingual English sentences as T or V with reasonable accuracy, but only if they have sufficient context. We exploited the overt information from German pronouns to induce T/V labels for English and used this labeled corpus to train a monolingual T/V classifier for English. We experimented with features based on words, semantic classes, and Politeness Theory predictions.

With regard to our NLP goal of building a T/V classifier, we conclude that T/V classification is a phenomenon that can be modelled on the basis of corpus features. A major factor in classification performance is the inclusion of a wide context to counteract sparse data, and more sophisticated context definitions improve results. We currently achieve top accuracies of 67%-68%, which still leave room for improvement. We next plan to couple our T/V classifier with a machine translation system for a task-based evaluation on the translation of direct address into German and other languages with different T/V pronouns.

Considering our sociolinguistic goal of determining the ways in which English realizes the T/V distinction, we first obtained a negative result: only word features perform well, while semantic classes and politeness features do hardly better than a frequency baseline. Notably, there are no clear topical divisions between T and V, like for example in text categorization: almost all words are very weakly correlated with either class, and semantically similar words can co-occur with different classes. Consequently, distributionally determined semantic classes are not helpful for the distinction. Politeness features are difficult to operationalize with sufficiently high precision and recall.

An interesting result is the asymmetry between the linguistic features for V and T at the lexical level. V language appears to be more conventionalized; the models therefore identified formulaic expressions and titles as indicators for V. On the other hand, very few such generic features exist for the class T; consequently, the classifier has a hard time learning good discriminating and yet generic features. Those features that are indicative of T, such as first names, are highly novel-specific and were deliberately excluded from the main experiment. When we switched to individual novels, the models picked up such features, and accuracy increased, at the cost of lower generalizability between novels. A more technical solution to this problem would be the training of a single-class classifier for V, treating T as the default class (Tax and Duin, 1999).

Finally, an error analysis showed that many errors arise from sentences that are too short or unspecific to determine T or V reliably. This points to the fact that T/V should not be modelled as a sentence-level classification task in the first place: T/V is not a choice made for each sentence, but one that is determined once for each pair of interlocutors and rarely changed. In future work, we will attempt to learn social networks from novels (Elson et al., 2010), which should provide constraints on all instances of communication between a speaker and an addressee. However, the big and, as far as we know, unsolved challenge is to automatically assign turns to interlocutors, given the varied and often inconsistent presentation of direct speech turns in novels.
References

John Ardila. 2003. (Non-Deictic, Socio-Expressive) T-/V-Pronoun Distinction in Spanish/English Formal Locutionary Acts. Forum for Modern Language Studies, 39(1):74–86.

John A. Bateman. 1988. Aspects of clause politeness in Japanese: An extended inquiry semantics treatment. In Proceedings of ACL, pages 147–154, Buffalo, New York.

Luisa Bentivogli and Emanuele Pianta. 2005. Exploiting parallel texts in the creation of multilingual semantically annotated resources: the MultiSemCor Corpus. Journal of Natural Language Engineering, 11(3):247–261.

Adam Bermingham and Alan F. Smeaton. 2009. A study of inter-annotator agreement for opinion retrieval. In Proceedings of ACM SIGIR, pages 784–785.

Philip Bramsen, Martha Escobar-Molano, Ami Patel, and Rafael Alonso. 2011. Extracting social power relationships from natural language. In Proceedings of ACL/HLT, pages 773–782, Portland, OR.

Fabienne Braune and Alexander Fraser. 2010. Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora. In Coling 2010: Posters, pages 81–89, Beijing, China.

Roger Brown and Albert Gilman. 1960. The pronouns of power and solidarity. In Thomas A. Sebeok, editor, Style in Language, pages 253–277. MIT Press, Cambridge, MA.

Penelope Brown and Stephen C. Levinson. 1987. Politeness: Some Universals in Language Usage. Number 4 in Studies in Interactional Sociolinguistics. Cambridge University Press.

Alexander Clark. 2003. Combining distributional and morphological information for part of speech induction. In Proceedings of EACL, pages 59–66, Budapest, Hungary.

J. Cohen. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1):37–46.

David Elson, Nicholas Dames, and Kathleen McKeown. 2010. Extracting social networks from literary fiction. In Proceedings of ACL, pages 138–147, Uppsala, Sweden.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874.

Manaal Faruqui and Sebastian Padó. 2011. "I Thou Thee, Thou Traitor": Predicting formal vs. informal address in English literature. In Proceedings of ACL/HLT 2011, pages 467–472, Portland, OR.

Jenny Rose Finkel and Christopher D. Manning. 2009. Nested named entity recognition. In Proceedings of EMNLP, pages 141–150, Singapore.

Joseph L. Fleiss. 1981. Statistical methods for rates and proportions. John Wiley, New York, 2nd edition.

Alexander Fraser. 2009. Experiments in morphosyntactic processing for translating to and from German. In Proceedings of the EACL MT workshop, pages 115–119, Athens, Greece.

Jerry Hobbs and Megumi Kameyama. 1990. Translation by abduction. In Proceedings of COLING, pages 155–161, Helsinki, Finland.

Rebecca Hwa, Philipp Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Journal of Natural Language Engineering, 11(3):311–325.

Hiroshi Kanayama. 2003. Paraphrasing rules for automatic evaluation of translation into Japanese. In Proceedings of the Second International Workshop on Paraphrasing, pages 88–93, Sapporo, Japan.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the 10th Machine Translation Summit, pages 79–86, Phuket, Thailand.

Heinz L. Kretzenbacher, Michael Clyne, and Doris Schüpbach. 2006. Pronominal Address in German: Rules, Anarchy and Embarrassment Potential. Australian Review of Applied Linguistics, 39(2):17.1–17.18.

Alexander Künzli. 2010. Address pronouns as a problem in French-Swedish translation and translation revision. Babel, 55(4):364–380.

Zhifei Li and David Yarowsky. 2008. Mining and modeling relations between formal and informal Chinese phrases from web corpora. In Proceedings of EMNLP, pages 1031–1040, Honolulu, Hawaii.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK, 1st edition.

Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Roberto Navigli. 2009. Word Sense Disambiguation: a survey. ACM Computing Surveys, 41(2):1–69.

Eric W. Noreen. 1989. Computer-intensive Methods for Testing Hypotheses: An Introduction. John Wiley and Sons Inc.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51.

Lance Ramshaw and Mitch Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the 3rd ACL Workshop on Very Large Corpora, Cambridge, MA.

Michael Schiehlen. 1998. Learning tense translation from bilingual corpora. In Proceedings of ACL/COLING, pages 1183–1187, Montreal, Canada.
Helmut Schmid. 1994. Probabilistic Part-of-Speech
Tagging Using Decision Trees. In Proceedings of the
International Conference on New Methods in Lan-
guage Processing, pages 4449, Manchester, UK.
Doris Schüpbach, John Hajek, Jane Warren, Michael
Clyne, Heinz Kretzenbacher, and Catrin Norrby.
2006. A cross-linguistic comparison of address pro-
noun use in four European languages: Intralingual
and interlingual dimensions. In Proceedings of the
Annual Meeting of the Australian Linguistic Society,
Brisbane, Australia.
Ralf Steinberger, Bruno Pouliquen, Anna Widiger,
Camelia Ignat, Tomaž Erjavec, and Dan Tufiş. 2006.
The JRC-Acquis: A multilingual aligned parallel cor-
pus with 20+ languages. In Proceedings of LREC,
pages 21422147, Genoa, Italy.
David M. J. Tax and Robert P. W. Duin. 1999. Sup-
port vector domain description. Pattern Recognition
Letters, 20:11911199.
David Yarowsky and Grace Ngai. 2001. Inducing mul-
tilingual POS taggers and NP bracketers via robust
projection across aligned corpora. In Proceedings of
NAACL, pages 200207, Pittsburgh, PA.
Character-based Kernels for Novelistic Plot Structure
Micha Elsner
Institute for Language, Cognition and Computation (ILCC)
School of Informatics
University of Edinburgh
melsner0@gmail.com
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 634–644,
Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
of recognizing acceptable novels (section 6), but ture in terms of both characters and their emo-
recognition is usually a good first step toward tional states. However, they operate at a very de-
generationa recognition model can always be tailed level and so can be applied only to short
used as part of a generate-and-rank pipeline, and texts. Scheherazade (Elson and McKeown, 2010)
potentially its underlying representation can be allows human annotators to mark character goals
used in more sophisticated ways. We show a de- and emotional states in a narrative, and indicate
tailed analysis of the character correspondences the causal links between them. AESOP (Goyal et
discovered by our system, and discuss their po- al., 2010) attempts to learn a similar structure au-
tential relevance to summarization, in section 9. tomatically. AESOPs accuracy, however, is rel-
atively poor even on short fables, indicating that
2 Related work this fine-grained approach is unlikely to be scal-
Some recent work on story understanding has fo- able to novel-length texts; our system relies on a
cused on directly modeling the series of events much coarser analysis.
that occur in the narrative. McIntyre and Lapata Kazantseva and Szpakowicz (2010) summarize
(2010) create a story generation system that draws short stories, although unlike the other projects
on earlier work on narrative schemas (Chambers we discuss here, they explicitly try to avoid giving
and Jurafsky, 2009). Their system ensures that away plot detailstheir goal is to create spoiler-
generated stories contain plausible event-to-event free summaries focusing on characters, settings
transitions and are coherent. Since it focuses only and themes, in order to attract potential readers.
on events, however, it cannot enforce a global notion of what the characters want or how they relate to one another.

Our own work draws on representations that explicitly model emotions rather than events. Alm and Sproat (2005) were the first to describe stories in terms of an emotional trajectory. They annotate emotional states in 22 Grimms' fairy tales and discover an increase in emotion (mostly positive) toward the ends of stories. They later use this corpus to construct a reasonably accurate classifier for emotional states of sentences (Alm et al., 2005). Volkova et al. (2010) extend the human annotation approach, using a larger number of emotion categories and applying them to freely defined chunks instead of sentences. The largest-scale emotional analysis is performed by Mohammad (2011), using crowd-sourcing to construct a large emotional lexicon with which he analyzes adult texts such as plays and novels. In this work, we adopt the concept of emotional trajectory, but apply it to particular characters rather than works as a whole.

In focusing on characters, we follow Elson et al. (2010), who analyze narratives by examining their social network relationships. They use an automatic method based on quoted speech to find social links between characters in 19th-century novels. Their work, designed for computational literary criticism, does not extract any temporal or emotional structure.

A few projects attempt to represent story structure by summarizing stories (Kazantseva and Szpakowicz, 2010). They do find it useful to detect character mentions, and also use features based on verb aspect to automatically exclude plot events while retaining descriptive passages. They compare their genre-specific system with a few state-of-the-art methods for summarizing news, and find it outperforms them substantially.

We evaluate our system by comparing real novels to artificially produced surrogates, a procedure previously used to evaluate models of discourse coherence (Karamanis et al., 2004; Barzilay and Lapata, 2005) and models of syntax (Post, 2011). As in these settings, we anticipate that performance on this kind of task will be correlated with performance in applied settings, so we use it as an easier preliminary test of our capabilities.

3 Dataset

We focus on the 19th-century novel, partly following Elson et al. (2010) and partly because these texts are freely available via Project Gutenberg. Our main dataset is composed of romances (which we loosely define as novels focusing on a courtship or love affair). We select 41 texts, taking 11 as a development set and the remaining 30 as a test set; a complete list is given in Appendix A. We focus on the novels used in Elson et al. (2010), but in some cases add additional romances by an already-included author. We also selected 10 of the least romantic works as an out-of-domain set; experiments on these are in section 8.
4 Preprocessing

In order to compare two texts, we must first extract the characters in each and some features of their relationships with one another. Our first step is to split the text into chapters, and each chapter into paragraphs; if the text contains a running dialogue where each line begins with a quotation mark, we append it to the previous paragraph. We segment each paragraph with MXTerminator (Reynar and Ratnaparkhi, 1997) and parse it with the self-trained Charniak parser (McClosky et al., 2006). Next, we extract a list of characters, compute dependency tree-based unigram features for each character, and record character frequencies and relationships over time.

    left-of-[name]  reply   17
    right-of-[name] feel    14
    right-of-[name] look    10
    right-of-[name] mind     7
    right-of-[name] make     7

Table 1: Top five stemmed unigram dependency features for "Miss Elizabeth Bennet", protagonist of Pride and Prejudice, and their frequencies.

4.1 Identifying characters

We create a list of possible character references for each work by extracting all strings of proper nouns (as detected by the parser), then discarding those which occur less than 5 times. Grouping these into a useful character list is a problem of cross-document coreference.

Although cross-document coreference has been extensively studied (Bhattacharya and Getoor, 2005) and modern systems can achieve quite high accuracy on the TAC-KBP task, where the list of available entities is given in advance (Dredze et al., 2010), novelistic text poses a significant challenge for the methods normally used. The typical 19th-century novel contains many related characters, often named after one another. There are complicated social conventions determining which titles are used for whom: for instance, the eldest unmarried daughter of a family can be called "Miss Bennet", while her younger sister must be "Miss Elizabeth Bennet". And characters often use nicknames, such as "Lizzie".

Our system uses the multi-stage clustering approach outlined in Bhattacharya and Getoor (2005), but with some features specific to 19th-century European names. To begin, we merge all identical mentions which contain more than two words (leaving bare first or last names unmerged). Next, we heuristically assign each mention a gender (masculine, feminine or neuter) using a list of gendered titles, then a list of male and female first names[2]. We then merge mentions where each is longer than one word, the genders do not clash, and the first and last names are consistent (Charniak, 2001). We then merge single-word mentions with matching multiword mentions if they appear in the same paragraph, or if not, with the multiword mention that occurs in the most paragraphs. When this process ends, we have resolved each mention in the novel to some specific character. As in previous work, we discard very infrequent characters and their mentions.

For the reasons stated, this method is error-prone. Our intuition is that the simpler method described in Elson et al. (2010), which merges each mention to the most recent possible coreferent, must be even more so. However, due to the expense of annotation, we make no attempt to compare these methods directly.

4.2 Unigram character features

Once we have obtained the character list, we use the dependency relationships extracted from our parse trees to compute features for each character. Similar feature sets are used in previous work in word classification, such as Lin and Pantel (2001). A few example features are shown in Table 1.

To find the features, we take each mention in the corpus and count up all the words outside the mention which depend on the mention head, except proper nouns and stop words. We also count the mention's own head word, and mark whether it appears to the right or the left (in general, this word is a verb and the direction reflects the mention's role as subject or object). We lemmatize all feature words with the WordNet (Miller et al., 1990) stemmer. The resulting distribution over words is our set of unigram features for the character. (We do not prune rare features, although they have proportionally little influence on our measurement of similarity.)

[2] The most frequent names from the 1990 US census.
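The feature extraction described above can be sketched as follows. This is a toy reimplementation for illustration, not the authors' code: the dependency triples, stop list, and data layout are stand-ins, and lemmatization is assumed to have happened upstream.

```python
from collections import Counter

# Sketch of the unigram feature extraction of subsection 4.2: for each mention
# of a character, count the (lemmatized) words depending on the mention head,
# excluding proper nouns and stop words, plus the mention's own head word
# marked with its direction (left/right of the name).
STOP_WORDS = {"the", "a", "of", "and", "to"}  # stand-in stop list

def unigram_features(mentions):
    """mentions: list of dicts with keys
       'dependents': [(lemma, is_proper), ...]  words depending on the mention head
       'head': (lemma, side)  the mention's head word, side in {'left', 'right'}
    """
    feats = Counter()
    for m in mentions:
        for lemma, is_proper in m["dependents"]:
            if not is_proper and lemma not in STOP_WORDS:
                feats[lemma] += 1
        lemma, side = m["head"]
        feats[f"{side}-of-[name] {lemma}"] += 1
    return feats

# Example: two mentions of the same character.
mentions = [
    {"dependents": [("dear", False), ("Darcy", True)], "head": ("reply", "left")},
    {"dependents": [("the", False)], "head": ("feel", "right")},
]
feats = unigram_features(mentions)
```

The resulting counter, normalized to a distribution, would serve as the character's unigram feature set.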
[Figure: three curves over 50 basis points (y-axis 0.0 to 1.6): frequency of Miss Elizabeth Bennet, emotions of Miss Elizabeth Bennet, and cross-frequency with Mr. Darcy.]

Figure 1: Normalized frequency and emotions associated with Miss Elizabeth Bennet, protagonist of Pride and Prejudice, and frequency of paragraphs about her and Mr. Darcy, smoothed and projected onto 50 basis points.
4.3 Temporal relationships

We record two time-varying features for each character, each taking one value per chapter. The first is the character's frequency as a proportion of all character mentions in the chapter. The second is the frequency with which the character is associated with emotional language: their emotional trajectory (Alm et al., 2005). We use the strong subjectivity cues from the lexicon of Wilson et al. (2005) as a measurement of emotion. If, in a particular paragraph, only one character is mentioned, we count all emotional words in that paragraph and add them to the character's total. To render the numbers comparable across works, each paragraph subtotal is normalized by the amount of emotional language in the novel as a whole. Then the chapter score is the average over paragraphs.

For pairwise character relationships, we count the number of paragraphs in which only two characters are mentioned, and treat this number (as a proportion of the total) as a measurement of the strength of the relationship between that pair.[3] Elson et al. (2010) show that their method of finding conversations between characters is more precise in showing whether a relationship exists, but the co-occurrence technique is simpler, and we care mostly about the strength of key relationships rather than the existence of infrequent ones.

Finally, we perform some smoothing, by taking a weighted moving average of each feature value with a window of the three values on either side. Then, in order to make it easy to compare books with different numbers of chapters, we linearly interpolate each series of points into a curve and project it onto a fixed basis of 50 evenly spaced points. An example of the final output is shown in Figure 1.

[3] We also tried counting emotional language in these paragraphs, but this did not seem to help in development experiments.
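The smoothing and fixed-basis projection of the per-chapter series can be sketched as below. This is an assumed implementation, not the authors' code; in particular, the paper does not give the exact weighting scheme of the moving average, so uniform weights are used here.

```python
import numpy as np

# Sketch of subsection 4.3's resampling: a moving average with a window of
# three chapters on either side, then linear interpolation of the series
# onto a fixed basis of 50 evenly spaced points.
def smooth_and_project(values, basis_size=50, window=3):
    values = np.asarray(values, dtype=float)
    # Moving average; uniform weights are an assumption here.
    smoothed = np.array([
        values[max(0, i - window):i + window + 1].mean()
        for i in range(len(values))
    ])
    # Linearly interpolate the per-chapter points onto the fixed basis.
    old_x = np.linspace(0.0, 1.0, len(smoothed))
    new_x = np.linspace(0.0, 1.0, basis_size)
    return np.interp(new_x, old_x, smoothed)

curve = smooth_and_project([0.1, 0.4, 0.2, 0.6, 0.3])  # 5 chapters -> 50 points
```

After this step every novel's curves live on the same 50-point basis, so curves from books with different chapter counts can be compared directly.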
5 Kernels

Our plot kernel k(x, y) measures the similarity between two novels x and y in terms of the features computed above. It takes the form of a convolution kernel (Haussler, 1999) where the parts of each novel are its characters u in x, v in y and c is a kernel over characters:

    k(x, y) = sum_{u in x} sum_{v in y} c(u, v)    (1)

We begin by constructing a first-order kernel over characters, c1(u, v), which is defined in terms of a kernel d over the unigram features and a kernel e over the single-character temporal features. We represent the unigram feature counts as distributions p_u(w) and p_v(w), and compute their similarity as the amount of shared mass, times a small penalty of .1 for mismatched genders:
    d(p_u, p_v) = exp(-alpha (1 - sum_w min(p_u(w), p_v(w)))) * .1^I{gen_u != gen_v}

We compute similarity between a pair of time-varying curves (which are projected onto 50 evenly spaced points) using standard cosine distance, which approximates the normalized integral of their product:

    e(u, v) = (u . v) / (||u|| ||v||)    (2)

The weights alpha and beta are parameters of the system, which scale d and e so that they are comparable to one another, and also determine how fast the similarity scales up as the feature sets grow closer; we set them to 5 and 10 respectively.

We sum together the similarities of the character frequency and emotion curves to measure overall temporal similarity between the characters. Thus our first-order character kernel c1 is:

    c1(u, v) = d(p_u, p_v) (e(u_freq, v_freq) + e(u_emo, v_emo))

We use c1 and equation 1 to construct a first-order plot kernel (which we call k1), and also as an ingredient in a second-order character kernel c2 which takes into account the curve of pairwise frequencies f_{u,u'} between two characters u and u' in the same novel:

    c2(u, v) = c1(u, v) sum_{u' in x} sum_{v' in y} e(f_{u,u'}, f_{v,v'}) c1(u', v')

In other words, u is similar to v if, for some relationships of u with other characters u', there are similar characters v' who serve the same role for v. We use c2 and equation 1 to construct our full plot kernel k2.
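The kernels above might be sketched as follows. This is a toy reimplementation for illustration only, not the authors' released system: the data layout is invented, and the exact form of the gender penalty inside d is an assumption based on the prose.

```python
import numpy as np

# Sketch of the section 5 kernels. A character is a dict holding a unigram
# distribution, a gender, and frequency/emotion curves on the 50-point basis.
ALPHA = 5.0  # the weight scaling d; the paper sets alpha = 5

def d(p_u, p_v, gen_u, gen_v):
    # Shared probability mass between the unigram distributions, with a
    # penalty of .1 applied when the genders mismatch (assumed form).
    shared = sum(min(p_u.get(w, 0.0), p_v.get(w, 0.0)) for w in p_u)
    sim = float(np.exp(-ALPHA * (1.0 - shared)))
    return sim * (0.1 if gen_u != gen_v else 1.0)

def e(u_curve, v_curve):
    # Cosine similarity between two projected curves (equation 2).
    u, v = np.asarray(u_curve), np.asarray(v_curve)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def c1(u, v):
    # First-order character kernel: unigram similarity times the summed
    # similarities of the frequency and emotion curves.
    return d(u["unigrams"], v["unigrams"], u["gender"], v["gender"]) * (
        e(u["freq"], v["freq"]) + e(u["emo"], v["emo"]))

def k(x, y, c=c1):
    # Convolution plot kernel (equation 1): sum of c over character pairs.
    return sum(c(u, v) for u in x for v in y)
```

With c = c1 this yields the first-order plot kernel k1; the second-order kernel c2 would additionally sum e over the pairwise-frequency curves of other character pairs.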
5.1 Sentiment-only baseline

In addition to our plot kernel systems, we implement a simple baseline intended to test the effectiveness of tracking the emotional trajectory of the novel without using character identities. We give our baseline access to the same subjectivity lexicon used for our temporal features. We compute the number of emotional words used in each chapter (regardless of which characters they co-occur with), smoothed and normalized as described in subsection 4.3. This produces a single time-varying curve for each novel, representing the average emotional intensity of each chapter. We use our curve kernel e (equation 2) to measure similarity between novels.

6 Experiments

We evaluate our kernels on their ability to distinguish between real novels from our dataset and artificial surrogate novels of three types. First, we alter the order of a real novel by permuting its chapters before computing features. We construct one uniformly random permutation for each test novel. Second, we change the identities of the characters by reassigning the temporal features for the different characters uniformly at random while leaving the unigram features unaltered. (For example, we might assign the frequency, emotion and relationship curves for Mr. Collins to Miss Elizabeth Bennet instead.) Again, we produce one test instance of this type for each test novel. Third, we experiment with a more difficult ordering task by taking the chapters in reverse.

In each case, we use our kernel to perform a ranking task, deciding whether k(x, y) > k(x, y_perm). Since this is a binary forced-choice classification, a random baseline would score 50%. We evaluate performance in the case where we are given only a single training document x, and for a whole training set X, in which case we combine the decisions using a weighted nearest neighbor (WNN) strategy:

    sum_{x in X} k(x, y) > sum_{x in X} k(x, y_perm)

In each case, we perform the experiment in a leave-one-out fashion; we include the 11 development documents in X, but not in the test set. Thus there are 1200 single-document comparisons and 30 with WNN. The results of our three systems (the baseline, the first-order kernel k1 and the second-order kernel k2) are shown in Table 2. (The sentiment-only baseline has no character-specific features, and so cannot perform the character task.)

Using the full dataset and second-order kernel k2, our system's performance on these tasks is quite good; we are correct 90% of the time for order and character examples, and 67% for the
                        order   character   reverse
    sentiment only      46.2    -           51.5
    single doc k1       59.5    63.7        50.7
    single doc k2       61.8    67.7        51.6
    WNN sentiment       50      -           53
    WNN k1              77      90          63
    WNN k2              90      90          67

Table 2: Accuracy of kernels ranking 30 real novels against artificial surrogates (chance accuracy 50%).

more difficult reverse cases. Results of this quality rely heavily on the WNN strategy, which trusts close neighbors more than distant ones.

In the single training point setup, the system is much less accurate. In this setting, the system is forced to make decisions for all pairs of texts independently, including pairs it considers very dissimilar because it has failed to find any useful correspondences. Performance for these pairs is close to chance, dragging down overall scores (52% for reverse) even if the system performs well on pairs where it finds good correspondences, enabling a higher WNN score (67%).

The reverse case is significantly harder than order. This is because randomly permuting a novel actually breaks up the temporal continuity of the text: for instance, a minor character who appeared in three adjacent chapters might now appear in three separate places. Reversing the text does not cause this kind of disruption, so correctly detecting a reversal requires the system to represent patterns with a distinct temporal orientation, for instance an intensification in the main character's emotions, or in the number of paragraphs focusing on pairwise relationships, toward the end of the text.

The baseline system is ineffective at detecting either ordering or reversals.[4] The first-order kernel k1 is as good as k2 in detecting character permutations, but less effective on reorderings and reversals. As we will show in section 9, k1 places more emphasis on correspondences between minor characters and between places, while k2 is more sensitive to protagonists and their relationships, which carry the richest temporal information.

[4] The baseline detects reversals as well as the plot kernels given only a single point of comparison, but these results do not transfer to the WNN strategy. This suggests that unlike the plot kernels, the baseline is no more accurate for documents it considers similar than for those it judges are distant.

7 Significance testing

In addition to using our kernel as a classifier, we can directly test its ability to distinguish real from altered novels via a non-parametric two-sample significance test, the Maximum Mean Discrepancy (MMD) test (Gretton et al., 2007). Given samples from a pair of distributions p and q and a kernel k, this test determines whether the null hypothesis that p and q are identically distributed in the kernel's feature space can be rejected. The advantage of this test is that, since it takes all pairwise comparisons (except self-comparisons) within and across the classes into account, it uses more information than our classification experiments, and can therefore be more sensitive.

As in Gretton et al. (2007), we find an unbiased estimate of the test statistic MMD^2 for sample sets x ~ p, y ~ q, each with m samples, by pairing the two as z_i = (x_i, y_i) and computing:

    MMD^2(x, y) = 1/(m(m-1)) sum_{i != j} h(z_i, z_j)

    h(z_i, z_j) = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i)

Intuitively, MMD^2 approaches 0 if the kernel cannot distinguish x from y and is positive otherwise. The null distribution is computed by the bootstrap method; we create null-distributed samples by randomly swapping x_i and y_i in elements of z and computing the test statistic. We use 10000 test permutations. Using both k1 and k2, we can reject the null hypothesis that the distribution of novels is equal to order or characters with p < .001; for reversals, we cannot reject the null hypothesis.
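The unbiased MMD^2 estimate can be sketched directly from the two formulas above; here a simple dot-product kernel on feature vectors stands in for the plot kernel, purely for illustration.

```python
import numpy as np

# Sketch of the unbiased MMD^2 estimate of section 7: average h(z_i, z_j)
# over all ordered pairs i != j of the paired samples z_i = (x_i, y_i).
def mmd2(xs, ys, kernel):
    m = len(xs)
    assert len(ys) == m
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            total += (kernel(xs[i], xs[j]) + kernel(ys[i], ys[j])
                      - kernel(xs[i], ys[j]) - kernel(xs[j], ys[i]))
    return total / (m * (m - 1))

dot = lambda a, b: float(np.dot(a, b))
same = [np.array([1.0, 0.0])] * 3
# MMD^2 is zero when the two samples are identical, positive when the
# kernel can tell them apart.
print(mmd2(same, same, dot))  # 0.0
```

The bootstrap null distribution described in the text would then be obtained by randomly swapping x_i and y_i within each pair and recomputing this statistic.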
8 Out-of-domain data

In our main experiments, we tested our kernel only on romances; here we investigate its ability to generalize across genres. We take as our training set X the same romances as above, but as our test set Y a disjoint set of novels focusing mainly on crime, children and the supernatural.

Our results (Table 3) are not appreciably different from those of the in-domain experiments (Table 2) considering the small size of the dataset. This shows our system to be robust, but shallow;
                        order   character   reverse
    sentiment only      33.0    -           53.4
    single doc k1       59.5    61.7        52.7
    single doc k2       63.7    62.0        57.3
    WNN sentiment       20      -           70
    WNN k1              80      90          80
    WNN k2              100     80          70

Table 3: Accuracy of kernels ranking 10 non-romance novels against artificial surrogates, with 41 romances used for comparison.

the patterns it can represent generalize acceptably across domains, but this suggests it is describing broad concepts like "main character" rather than genre-specific ones like "female romantic lead".

9 Character-level analysis

To gain some insight into exactly what kinds of similarities the system picks up on when comparing two works, we sorted the characters detected by our system into categories and measured their contribution to the kernel's overall scores. We selected four Jane Austen works from the development set[5] and hand-categorized each character detected by our system. (We performed the categorization based on the most common full name mention in each cluster. This name is usually a good identifier for all the mentions in the cluster, but if our coreference system has made an error, it may not be.)

Our categorization for characters is intended to capture the stereotypical plot dynamics of literary romance, sorting the characters according to their gender and a simple notion of their plot function. The genders are female, male, plural ("the Crawfords") or not a character ("London"). The functional classes are protagonist (used for the female viewpoint character and her eventual husband), marriageable (single men and women who are seeking to marry within the story) and other (older characters, children, and characters married before the story begins).

We evaluate the pairwise kernel similarities among our four works, and add up the proportional contribution made by character pairs of each type to the eventual score. (For instance, the similarity between "Elizabeth Bennet" and "Emma Woodhouse", both labeled female protagonist, contributes 26% of the kernel similarity between the works in which they appear.) We plot these as Hinton-style diagrams in Figure 2. The size of each black rectangle indicates the magnitude of the contribution. (Since kernel functions are symmetric, we show only the lower diagonal.)

Under the kernel for unigram features, d (top), the most common character types (non-characters, which are almost always places, and non-marriageable women) contribute most to the kernel scores; this is especially true for places, since they often occur with similar descriptive terms. The diagram also shows the effect of the kernel's penalty for gender mismatches, since females pair more strongly with females and males with males. Character roles have relatively little impact.

The first-order kernel c1 (middle), which takes into account frequency and emotion as well as unigrams, is much better than d at distinguishing places from real characters, and assigns somewhat more weight to protagonists.

Finally, c2 (bottom), which takes into account second-order relationships, places much more emphasis on female protagonists and much less on places. This is presumably because the female protagonists of Jane Austen's novels are the viewpoint characters, and the novels focus on their relationships, while characters do not tend to have strong relationships with places. An increased tendency to match male marriageable characters with marriageable females, and "other" males with "other" females, suggests that c2 relies more on character function and less on unigrams than c1 when finding correspondences between characters.

As we concluded in the previous section, the frequent confusion between categories suggests that the analogies we construct are relatively nonspecific. We might hope to create role-based summaries of novels by finding their nearest neighbors and then propagating the character categories (for example, "___ is the protagonist of this novel. She lives at ___. She eventually marries ___, her other suitors are ___ and her older guardian is ___."), but the present system is probably not adequate for the purpose. We expect that detecting a fine-grained set of emotions will help to separate character functions more clearly.

[5] Pride and Prejudice, Emma, Mansfield Park and Persuasion.
[Figure: four Hinton-style affinity panels (character frequency by category; unigram features (d); first-order (c1); second-order (c2)) over the categories F Prot, M Prot, F Marr., M Marr., Pl Marr., F Other, M Other, Pl Other, Non-char.]

Figure 2: Affinity diagrams showing character types contributing to the kernel similarity between four works by Jane Austen.

10 Conclusions

This work presents a method for describing novelistic plots at an abstract level. It has three main contributions: the description of a plot in terms of analogies between characters, the use of emotional and frequency trajectories for individual characters rather than whole works, and evaluation using artificially disordered surrogate novels. In future work, we hope to sharpen the analogies we construct so that they are useful for summarization, perhaps by finding an external standard by which we can make the notion of analogous characters precise. We would also like to investigate what gains are possible with a finer-grained emotional vocabulary.

Acknowledgements

Thanks to Sharon Goldwater, Mirella Lapata, Victoria Adams and the ProbModels group for their comments on preliminary versions of this work, Kira Mourao for suggesting graph kernels, and three reviewers for their comments.

References

Amjad Abu Jbara and Dragomir Radev. 2011. Coherent citation-based summarization of scientific papers. In Proceedings of ACL 2011, Portland, Oregon.

Cecilia Ovesdotter Alm and Richard Sproat. 2005. Emotional sequencing and development in fairy tales. In ACII, pages 668-674.

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 579-586, Vancouver, British Columbia, Canada, October. Association for Computational Linguistics.

Regina Barzilay and Mirella Lapata. 2005. Modeling local coherence: an entity-based approach. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05).

Indrajit Bhattacharya and Lise Getoor. 2005. Relational clustering for multi-type entity resolution. In Proceedings of the 4th international workshop on Multi-relational mining, MRDM '05, pages 3-12, New York, NY, USA. ACM.

Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the
4th International Joint Conference on Natural Language Processing of the AFNLP, pages 602-610, Suntec, Singapore, August. Association for Computational Linguistics.

Eugene Charniak. 2001. Unsupervised learning of name structure from coreference data. In Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-01).

Harr Chen, S.R.K. Branavan, Regina Barzilay, and David R. Karger. 2009. Global models of document structure using latent permutations. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 371-379, Boulder, Colorado, June. Association for Computational Linguistics.

Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber, and Tim Finin. 2010. Entity disambiguation for knowledge base population. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 277-285, Beijing, China, August. Coling 2010 Organizing Committee.

David K. Elson and Kathleen R. McKeown. 2010. Building a bank of semantically encoded narratives. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA).

David Elson, Nicholas Dames, and Kathleen McKeown. 2010. Extracting social networks from literary fiction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 138-147, Uppsala, Sweden, July. Association for Computational Linguistics.

Amit Goyal, Ellen Riloff, and Hal Daume III. 2010. Automatically producing plot unit representations for narrative text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 77-86, Cambridge, MA, October. Association for Computational Linguistics.

Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alexander J. Smola. 2007. A kernel method for the two-sample-problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513-520. MIT Press, Cambridge, MA.

David Haussler. 1999. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, UC Santa Cruz.

Nikiforos Karamanis, Massimo Poesio, Chris Mellish, and Jon Oberlander. 2004. Evaluating centering-based metrics of coherence. In ACL, pages 391-398.

Anna Kazantseva and Stan Szpakowicz. 2010. Summarizing short stories. Computational Linguistics, pages 71-109.

Dekang Lin and Patrick Pantel. 2001. Induction of semantic classes from natural language text. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '01, pages 317-322, New York, NY, USA. ACM.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152-159.

Neil McIntyre and Mirella Lapata. 2010. Plot induction and evolutionary search for story generation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1562-1572, Uppsala, Sweden, July. Association for Computational Linguistics.

G. Miller, A.R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. 1990. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 3(4).

Saif Mohammad. 2011. From once upon a time to happily ever after: Tracking emotions in novels and fairy tales. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 105-114, Portland, OR, USA, June. Association for Computational Linguistics.

Matt Post. 2011. Judging grammaticality with tree substitution grammar derivations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 217-222, Portland, Oregon, USA, June. Association for Computational Linguistics.

Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 16-19, Washington D.C.

Ekaterina P. Volkova, Betty Mohler, Detmar Meurers, Dale Gerdemann, and Heinrich H. Bülthoff. 2010. Emotional perception of fairy tales: Achieving agreement in emotion annotation of text. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 98-106, Los Angeles, CA, June. Association for Computational Linguistics.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 347-354, Vancouver, British Columbia, Canada, October. Association for Computational Linguistics.
A List of texts

Dev set (11 works):
  Austen: Emma, Mansfield Park, Northanger Abbey, Persuasion, Pride and Prejudice, Sense and Sensibility
  Brontë, Emily: Wuthering Heights
  Burney: Cecilia (1782)
  Hardy: Tess of the D'Urbervilles
  James: The Ambassadors
  Scott: Ivanhoe

Test set (30 works):
  Braddon: Aurora Floyd
  Brontë, Anne: The Tenant of Wildfell Hall
  Brontë, Charlotte: Jane Eyre, Villette
  Bulwer-Lytton: Zanoni
  Disraeli: Coningsby, Tancred
  Edgeworth: The Absentee, Belinda, Helen
  Eliot: Adam Bede, Daniel Deronda, Middlemarch
  Gaskell: Mary Barton, North and South
  Gissing: In the Year of Jubilee, New Grub Street
  Hardy: Far From the Madding Crowd, Jude the Obscure, Return of the Native, Under the Greenwood Tree
  James: The Wings of the Dove
  Meredith: The Egoist, The Ordeal of Richard Feverel
  Scott: The Bride of Lammermoor
  Thackeray: History of Henry Esmond, History of Pendennis, Vanity Fair
  Trollope: Doctor Thorne

Out-of-domain set (10 works):
  Ainsworth: The Lancashire Witches
  Bulwer-Lytton: Paul Clifford
  Collins: The Moonstone
  Conan-Doyle: A Study in Scarlet, The Sign of the Four
  Dickens: Oliver Twist, The Pickwick Papers
  Hughes: Tom Brown's Schooldays
  Stevenson: Treasure Island
  Stoker: Dracula
Smart Paradigms and the Predictability and Complexity of Inflectional Morphology

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 645-653, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics.
allowed in the function P. In Hellberg (1978), noun paradigms only permit the concatenation of suffixes to a stem. Thus the paradigms are identified with suffix sets. For instance, the inflection patterns bil-bilar (car-cars) and nyckel-nycklar (key-keys) are traditionally both treated as instances of the second declension, with the plural ending "ar" and the contraction of the unstressed "e" in the case of "nyckel". But in Hellberg, the word "nyckel" has "nyck" as its technical stem, to which the paradigm numbered 231 adds the singular ending "el" and the plural ending "lar".

The notion of paradigm used in this paper allows multiple arguments and powerful string operations. In this way, we will be able to reduce the number of paradigms drastically: in fact, each lexical category (noun, adjective, verb) will have just one paradigm but with a variable number of arguments. Paradigms that follow this design will be called smart paradigms and are introduced in Section 2. Section 3 defines the notions of predictability and complexity of smart paradigm systems. Section 4 estimates these figures for four different languages of increasing richness in morphology: English, Swedish, French, and Finnish. We also evaluate the smart paradigms as a data compression method. Section 5 explores some uses of smart paradigms in lexicon building. Section 6 compares smart paradigms with related techniques such as morphology guessers and extraction tools. Section 7 concludes.

2 Smart paradigms

In this paper, we will assume a notion of paradigm that allows multiple arguments and arbitrary computable string operations. As argued in Kaplan and Kay (1994) and amply demonstrated in Beesley and Karttunen (2003), no generality is lost if the string operators are restricted to ones computable by finite-state transducers. Thus the examples of paradigms that we will show (only informally) can be converted to matching and replacements with regular expressions.

For example, a majority of French verbs can be covered by the following case analysis (the paradigm mkV discussed below):

    conj19finir(s),    if s ends "ir"
    conj53rendre(s),   if s ends "re"
    conj14assieger(s), if s ends "eger"
    conj11jeter(s),    if s ends "eler" or "eter"
    conj10ceder(s),    if s ends "eder"
    conj07placer(s),   if s ends "cer"
    conj08manger(s),   if s ends "ger"
    conj16payer(s),    if s ends "yer"
    conj06parler(s),   if s ends "er"

Notice that the cases must be applied in the given order; for instance, the last case applies only to those verbs ending with "er" that are not matched by the earlier cases.
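The case analysis above can be sketched as an ordered suffix dispatch. The conjNN names here are placeholders for the underlying paradigm functions, which this sketch only names rather than implements.

```python
# Sketch of a smart-paradigm case analysis: an ordered list of suffix tests,
# tried top to bottom, where each case names the "stupid" paradigm to apply.
CASES = [
    ("ir",   "conj19finir"),
    ("re",   "conj53rendre"),
    ("eger", "conj14assieger"),
    ("eler", "conj11jeter"),
    ("eter", "conj11jeter"),
    ("eder", "conj10ceder"),
    ("cer",  "conj07placer"),
    ("ger",  "conj08manger"),
    ("yer",  "conj16payer"),
    ("er",   "conj06parler"),
]

def choose_conj(s):
    # Cases are tried in order, so the generic "er" case only fires for
    # verbs not caught by a more specific suffix above it.
    for suffix, paradigm in CASES:
        if s.endswith(suffix):
            return paradigm
    raise ValueError(f"no case matches {s!r}")

print(choose_conj("parler"))  # conj06parler
print(choose_conj("manger"))  # conj08manger
print(choose_conj("partir"))  # conj19finir: a guess; partir is irregular
```

The order of the list is essential: "eger" must be tested before "ger", and both before the catch-all "er" case.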
Also notice that the above paradigm is just like the more traditional ones, in the sense that we cannot be sure if it really applies to a given verb. For instance, the verb "partir" ends with "ir" and would hence receive the same inflection as "finir"; however, its real conjugation is number 26 in Bescherelle. That mkV uses 19 rather than number 26 has a good reason: a vast majority of "ir" verbs is inflected in this conjugation, and it is also the productive one, to which new "ir" verbs are added.

Even though there is no mathematical difference between the mkV paradigm and the traditional paradigms like those in Bescherelle, there is a reason to call mkV a smart paradigm. This name implies two things. First, a smart paradigm implements some artificial intelligence to pick the underlying "stupid" paradigm. Second, a smart paradigm uses heuristics (informed guessing) if string matching doesn't decide the matter; the guess is informed by statistics of the distributions of different inflection classes.

One could thus say that smart paradigms are second-order or "meta-paradigms", compared to more traditional ones. They implement a lot of linguistic knowledge and intelligence, and thereby enable tasks such as lexicon building to be performed with less expertise than before. For instance, instead of "07" for "foncer" and "06"
be defined by the following paradigm, which for marcher, the lexicographer can simply write
analyzes a variable-size suffix of the infinitive mkV for all verbs instead of choosing from 88
form and dispatches to the Bescherelle paradigms numbers.
(identified by a number and an example verb): In fact, just V, indicating that the word is
a verb, will be enough, since the name of the
mkV : String String51 paradigm depends only on the part of speech.
mkV(s) = This follows the model of many dictionaries and
methods of language teaching, where characteristic forms are used instead of paradigm identifiers. For instance, another variant of mkV could use as its second argument the first person plural present indicative to decide whether an ir verb is in conjugation 19 or in 26:

    mkV : String^2 → String^51
    mkV(s, t) =
        conj26partir(s),  if for some x, s = x+ir and t = x+ons
        conj19finir(s),   if s ends with ir
        (all the other cases that can be recognized by this extra form)
        mkV(s)            otherwise (fall-back to the one-argument paradigm)

In this way, a series of smart paradigms is built for each part of speech, with more and more arguments. The trick is to investigate which new forms have the best discriminating power. For ease of use, the paradigms should be displayed to the user in an easy to understand format, e.g. as a table specifying the possible argument lists:

    verb parler
    verb parler, parlons
    verb parler, parlons, parlera, parla, parle
    noun chien
    noun chien, masculine
    noun chien, chiens, masculine

Notice that, for French nouns, the gender is listed as one of the pieces of information needed for lexicon building. In many cases, it can be inferred from the dictionary form just like the inflection; for instance, most nouns ending in e are feminine. A gender argument in the smart noun paradigm makes it possible to override this default behaviour.

2.1 Paradigms in GF

Smart paradigms as used in this paper have been implemented in the GF programming language (Grammatical Framework, (Ranta, 2011)). GF is a functional programming language enriched with regular expressions. For instance, the following function implements a part of the one-argument French verb paradigm shown above. It uses a case expression to pattern match on the argument s; the pattern _ matches anything, + divides a string into two pieces, and | expresses alternation. The functions conj19finir etc. are defined elsewhere in the library. Function application is expressed without parentheses, by the juxtaposition of the function and the argument.

    mkV : Str -> V
    mkV s = case s of {
      _ + "ir"            -> conj19finir s ;
      _ + ("eler"|"eter") -> conj11jeter s ;
      _ + "er"            -> conj06parler s ;
    }

The GF Resource Grammar Library¹ has comprehensive smart paradigms for 18 languages: Amharic, Catalan, Danish, Dutch, English, Finnish, French, German, Hindi, Italian, Nepalese, Norwegian, Romanian, Russian, Spanish, Swedish, Turkish, and Urdu. A few other languages have complete sets of traditional inflection paradigms but no smart paradigms.

Six languages in the library have comprehensive morphological dictionaries: Bulgarian (53k lemmas), English (42k), Finnish (42k), French (92k), Swedish (43k), and Turkish (23k). They have been extracted from other high-quality resources via conversions to GF using the paradigm systems. In Section 4, four of them will be used for estimating the strength of the smart paradigms, that is, the predictability of each language.

3 Cost, predictability, and complexity

Given a language L, a lexical category C, and a set P of smart paradigms for C, the predictability of the morphology of C in L by P depends inversely on the average number of arguments needed to generate the correct inflection table for a word. The lower the number, the more predictable the system.

Predictability can be estimated from a lexicon that contains such a set of tables. Formally, a smart paradigm is a family P_m of functions

    P_m : String^m → String^n

where m ranges over some set of integers from 1 to n, but need not contain all those integers. A lexicon L is a finite set of inflection tables,

    L = {w_i : String^n | i = 1, . . . , M_L}

¹ Source code and documentation at http://www.grammaticalframework.org/lib.
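To make the dispatch style concrete, the one-argument paradigm can be sketched in ordinary Python. This is a minimal sketch, not the paper's GF implementation: the conj* values are hypothetical stand-ins that merely name the Bescherelle class that would be chosen, instead of building the full 51-form inflection table.

```python
# Sketch of a one-argument smart paradigm in the style of mkV.
# Ordered (suffix, class) cases; order matters, so "-er" is reached
# only by verbs not matched by the more specific suffixes.
CASES = [
    ("ir",   "conj19finir"),
    ("re",   "conj53rendre"),
    ("eger", "conj14assieger"),
    ("eler", "conj11jeter"),
    ("eter", "conj11jeter"),
    ("eder", "conj10ceder"),
    ("cer",  "conj07placer"),
    ("ger",  "conj08manger"),
    ("yer",  "conj16payer"),
    ("er",   "conj06parler"),
]

def mk_v(s: str) -> str:
    # Return the class chosen by the first matching case.
    for suffix, conj in CASES:
        if s.endswith(suffix):
            return conj
    raise ValueError(f"no case matches {s!r}")
```

As in the paper's discussion, `mk_v("partir")` yields the finir class: a statistically informed guess that happens to be wrong for this particular verb.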
As the n is fixed, this is a lexicon specialized to one part of speech. A word is an element of the lexicon, that is, an inflection table of size n.

An application of a smart paradigm P_m to a word w ∈ L is an inflection table resulting from applying P_m to the appropriate subset π_m(w) of the inflection table w,

    P_m[w] = P_m(π_m(w)) : String^n

Thus we assume that all arguments are existing word forms (rather than e.g. stems), or features such as the gender.

An application is correct if

    P_m[w] = w

The cost of a word w is the minimum number of arguments needed to make the application correct:

    cost(w) = argmin_m (P_m[w] = w)

For practical applications, it is useful to require P_m to be monotonic, in the sense that increasing m preserves correctness.

The cost of a lexicon L is the average cost for its words,

    cost(L) = (1 / M_L) Σ_{i=1}^{M_L} cost(w_i)

where M_L is the number of words in the lexicon, as defined above.

The predictability of a lexicon could be defined as a quantity inversely dependent on its cost. For instance, an information-theoretic measure could be defined as

    predict(L) = 1 / (1 + log cost(L))

with the intuition that each added argument corresponds to a choice in a decision tree. However, we will not use this measure in this paper, but just the concrete cost.

The complexity of a paradigm system is defined as the size of its code in a given coding system, following the idea of Kolmogorov complexity (Solomonoff, 1964). The notion assumes a coding system, which we fix to be GF source code. As the results are relative to the coding system, they are only usable for comparing definitions in the same system. However, using GF source code size rather than e.g. finite-automaton size gives in our view a better approximation of the "cognitive load" of the paradigm system, i.e. its learnability. As a functional programming language, GF permits abstractions comparable to those available to human language learners, who don't need to learn the repetitive details of a finite automaton.

We define the code complexity as the size of the abstract syntax tree of the source code. This size is given as the number of nodes in the syntax tree; for instance,

    size(f(x_1, . . . , x_n)) = 1 + Σ_{i=1}^{n} size(x_i)
    size(s) = 1, for a string literal s

Using the abstract syntax size makes it possible to ignore programmer-specific variation such as identifier size. Measurements of the GF Resource Grammar Library show that code size measured in this way is on average 20% of the size of source files in bytes. Thus a source file of 1 kB has a code complexity of around 200 on average.

Notice that code complexity is defined in a way that makes it a straightforward generalization of the cost of a word as expressed in terms of paradigm applications in GF source code. The source code complexity of a paradigm application is

    size(P_m[w]) = 1 + m

Thus the complexity for a word w is its cost plus one; the added one comes from the application node for the function P_m and corresponds to knowing the part of speech of the word.

4 Experimental results

We conducted experiments in four languages (English, Swedish, French and Finnish²), presented here in order of morphological richness. We used trusted full-form lexica (i.e. lexica giving the complete inflection table of every word) to compute the predictability, as defined above, in terms of the smart paradigms in the GF Resource Grammar Library.

We used a simple algorithm for computing the cost c of a lexicon L with a set P_m of smart paradigms:

² This choice corresponds to the set of languages for which both comprehensive smart paradigms and morphological dictionaries were present in GF, with the exception of Turkish, which was left out because of time constraints.
    set c := 0
    for each word w_i in L:
        for each m, in growing order, for which P_m is defined:
            if P_m[w_i] = w_i, then c := c + m, else try with the next m
    return c

The average cost is c divided by the size of L.

The procedure presupposes that it is always possible to get the correct inflection table. For this to be true, the smart paradigms must have a worst-case version that is able to generate all forms. In practice, this was not always the case, but we checked that the number of problematic words is so small that it wouldn't be statistically significant. A typical problem word was the equivalent of the verb be in each language.

Another source of deviation is that a lexicon may have inflection tables whose size deviates from the number n that normally defines a lexical category. Some words may be defective, i.e. lack some forms (e.g. the singular forms in plurale tantum words), whereas some words may have several variants for a given form (e.g. learned and learnt in English). We made no effort to predict defective words, but just ignored them. With variant forms, we treated a prediction as correct if it matched any of the variants.

The above algorithm can also be used to help select the optimal sets of characteristic forms; we used it in this way to select the first form of Swedish verbs and the second form of Finnish nouns.

The results are collected in Table 1. The sections below give more details of the experiment in each language.

4.1 English

As gold standard, we used the electronic version of the Oxford Advanced Learner's Dictionary of Current English³, which contains about 40,000 root forms (about 70,000 word forms).

Nouns. We considered English nouns as having only two forms (singular and plural), excluding the genitive forms, which can be considered to be clitics and are completely predictable. About one third of the nouns of the lexicon were not included in the experiment because one of the forms was missing. The vast majority of the remaining 15,000 nouns are very regular, with predictable deviations such as kiss - kisses and fly - flies, which can be easily predicted by the smart paradigm. With an average cost of 1.05, this was the most predictable lexicon in our experiment.

Verbs. Verbs are the most interesting category in English because they present the richest morphology. Indeed, as shown by Table 1, the cost for English verbs, 1.21, is similar to what we got for morphologically richer languages.

4.2 Swedish

As gold standard, we used the SALDO lexicon (Borin et al., 2008).

Nouns. The noun inflection tables had 8 forms (singular/plural, indefinite/definite, nominative/genitive) plus a gender (uter/neuter). Swedish nouns are intrinsically very unpredictable, and there are many examples of homonyms falling under different paradigms (e.g. val - val "choice" vs. val - valar "whale"). The cost 1.70 is the highest of all the lexica considered. Of course, there may be room for improving the smart paradigm.

Verbs. The verbs had 20 forms, which included past participles. We ran two experiments, choosing either the infinitive or the present indicative as the base form. In traditional Swedish grammar, the base form of the verb is considered to be the infinitive, e.g. spela, leka ("play" in two different senses). But this form doesn't distinguish between the first and the second conjugation. However, the present indicative, here spelar, leker, does. Using it gives a predictive power of 1.13, as opposed to 1.22 with the infinitive. Some modern dictionaries such as Lexin⁴ therefore use the present indicative as the base form.

4.3 French

For French, we used the Morphalou morphological lexicon (Romary et al., 2004). As stated in the documentation⁵, the current version of the lexicon (version 2.0) is not complete, and in particular, many entries are missing some or all inflected forms. So for those experiments we only

³ Available in electronic form at http://www.eecs.qmul.ac.uk/mpurver/software.html
⁴ http://lexin.nada.kth.se/lexin/
⁵ http://www.cnrtl.fr/lexiques/morphalou/LMF-Morphalou.php#body_3.4.11, accessed 2011-11-04
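The cost-computation loop of Section 4 can be sketched directly in Python. The paradigm family below is hypothetical (two-form English-like noun tables), with a worst-case P_2 that is always correct, as the procedure requires; it is an illustration of the algorithm, not the paper's GF paradigms.

```python
# Sketch of the cost algorithm: for each inflection table, find the
# least m for which P_m reproduces the table, and average the m's.

def p1(forms):
    # Guess the plural from the singular alone (m = 1).
    sg = forms[0]
    if sg.endswith("y"):
        pl = sg[:-1] + "ies"
    elif sg.endswith(("s", "sh", "ch", "x")):
        pl = sg + "es"
    else:
        pl = sg + "s"
    return (sg, pl)

def p2(forms):
    # Worst case (m = 2): both forms are given, so always correct.
    return tuple(forms)

PARADIGMS = [(1, p1), (2, p2)]  # pairs (m, P_m), in growing order of m

def lexicon_cost(lexicon):
    c = 0
    for table in lexicon:
        for m, p_m in PARADIGMS:
            if p_m(table[:m]) == table:
                c += m
                break
    return c / len(lexicon)  # average cost, as in the paper

lex = [("kiss", "kisses"), ("fly", "flies"),
       ("cat", "cats"), ("man", "men")]
# Regular nouns cost 1; the irregular "man - men" needs both forms.
assert lexicon_cost(lex) == 1.25
```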
Table 1: Lexicon size and average cost for the nouns (N) and verbs (V) in four languages, with the percentage of words correctly inferred from one and two forms (i.e. m = 1 and m ≤ 2, respectively).

    Lexicon   Forms   Entries   Cost   m = 1   m ≤ 2
    Eng N       2     15,029    1.05    95%    100%
    Eng V       5      5,692    1.21    84%     95%
    Swe N       9     59,225    1.70    46%     92%
    Swe V      20      4,789    1.13    97%     97%
    Fre N       3     42,390    1.25    76%     99%
    Fre V      51      6,851    1.27    92%     94%
    Fin N      34     25,365    1.26    87%     97%
    Fin V     102     10,355    1.09    96%     99%

included entries where all the necessary forms were present.

Nouns: Nouns in French have two forms (singular and plural) and an intrinsic gender (masculine or feminine), which we also considered to be a part of the inflection table. Most of the unpredictability comes from the impossibility of guessing the gender.

Verbs: The paradigms generate all of the simple (as opposed to compound) tenses given in traditional grammars such as the Bescherelle. The participles are also generated. The auxiliary verb of compound tenses would be impossible to guess from morphological clues, and was left out of consideration.

4.4 Finnish

The Finnish gold standard was the KOTUS lexicon (Kotimaisten Kielten Tutkimuskeskus, 2006). It has around 90,000 entries tagged with part of speech, 50 noun paradigms, and 30 verb paradigms. Some of these paradigms are rather abstract and powerful; for instance, grade alternation would multiply many of the paradigms by a factor of 10 to 20, if it was treated in a concatenative way. For instance, singular nominative-genitive pairs show alternations such as talo-talon ("house"), katto-katon ("roof"), kanto-kannon ("stub"), rako-raon ("crack"), and sato-sadon ("harvest"). All of these are treated with one and the same paradigm, which makes the KOTUS system relatively abstract.

The total number of forms of Finnish nouns and verbs is a question of definition. Koskenniemi (1983) reports 2,000 for nouns and 12,000 for verbs, but most of these forms result from adding particles and possessive suffixes in an agglutinative way. The traditional number and case count for nouns gives 26, whereas for verbs the count is between 100 and 200, depending on how participles are counted. Notice that the definition of predictability used in this paper doesn't depend on the number of forms produced (i.e. not on n but only on m); therefore we can simply ignore this question. However, the question is interesting if we think about paradigms as a data compression method (Section 4.5).

Nouns. Compound nouns are a problem for morphology prediction in Finnish, because inflection is sensitive to the vowel harmony and the number of syllables, which depend on where the compound boundary goes. While many compounds are marked in KOTUS, we had to remove some compounds with unmarked boundaries. Another peculiarity was that adjectives were included in nouns; this is no problem, since the inflection patterns are the same, if comparison forms are ignored. The figure 1.26 is better than the one reported in (Ranta, 2008), which is 1.42; the reason is mainly that the current set of paradigms has better coverage of three-syllable nouns.

Verbs. Even though more numerous in forms than nouns, Finnish verbs are highly predictable (1.09).

4.5 Complexity and data compression

The cost of a lexicon has an effect on learnability. For instance, even though Finnish words have ten or a hundred times more forms than English words, these forms can be derived from roughly the same number of characteristic forms as in English. But this is of course just a part of the truth: it might still be that the paradigm system itself is much more complex in some languages than others.
Table 2: Paradigm complexities for nouns and verbs in the four languages, computed as the syntax tree size of GF code.

    language   noun   verb   total
    English     403    837    991
    Swedish     918   1039   1884
    French      351   2193   2541
    Finnish    4772   3343   6885

… gives, for the Finnish verb lexicon, a file of 60 kB, which implies a joint compression rate of 227. That the compression rates for the code can be higher than the numbers of forms in the full-form lexicon is explained by the fact that the generated forms are longer than the base forms. For instance, the full-form entry of the Finnish verb uida ("swim") is 850 bytes, which means that the average form size is twice the size of the basic form.
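As a sanity check, the quoted rate follows from the sizes reported in Table 3 for Finnish verbs: a 13,609 kB full-form lexicon against the 60 kB joint source file.

```python
# Joint compression rate for the Finnish verb lexicon: full-form
# size (Table 3) divided by the size of the single source file
# containing both the paradigm code and the lexicon entries.
fullform_kb = 13609
source_kb = 60
rate = fullform_kb / source_kb
assert round(rate) == 227
```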
Table 3: Comparison between using bzip2 and the paradigms+lexicon source as a compression method. Sizes in kB.

    Lexicon   Fullform   bzip2   fullform/bzip2   Source   fullform/source
    Eng N          264      99             2.7       135               2.0
    Eng V          245      78             3.2        57               4.4
    Swe N        6,243   1,380             4.5     1,207               5.3
    Swe V          840     174             4.8        58                15
    Fre N          952     277             3.4       450               2.2
    Fre V        3,888     811             4.8        98                40
    Fin N       11,295   2,165             5.2       343                34
    Fin V       13,609   2,297             5.9       123               114
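The bzip2 baseline of Table 3 can be reproduced in miniature with Python's bz2 module. The toy full-form data below is illustrative only, not taken from the lexica used in the paper; it merely shows why repetitive full-form listings compress well.

```python
import bz2

# Miniature analogue of the bzip2 column in Table 3: compress a toy,
# highly repetitive full-form list and compare byte sizes.
forms = ["", "n", "a", "ssa", "sta", "on", "lla", "lta", "lle"]
fullform = "\n".join("talo" + sfx for sfx in forms * 200).encode("utf-8")
compressed = bz2.compress(fullform)
ratio = len(fullform) / len(compressed)
# Repetitive full forms compress far better than 1:1.
assert ratio > 1
```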
… complexity. Also the paradigms for Finnish are improved here (cf. Section 4.4 above).

Even though smart paradigm-like descriptions are common in language textbooks, there is to our knowledge no computational equivalent to the smart paradigms of GF. Finite state morphology systems often have a function called a guesser, which, given a word form, tries to guess either the paradigm this form belongs to or the dictionary form (or both). A typical guesser differs from a smart paradigm in that it does not make it possible to correct the result by giving more forms. Examples of guessers include (Chanod and Tapanainen, 1995) for French, (Hlavacova, 2001) for Czech, and (Nakov et al., 2003) for German.

Another related domain is the unsupervised learning of morphology, where machine learning is used to automatically build a language's morphology from corpora (Goldsmith, 2006). The main difference is that with smart paradigms, the paradigms and the guessing heuristics are implemented manually and with high certainty; in unsupervised learning of morphology, the paradigms are induced from the input forms with much lower certainty. Of particular interest are (Chan, 2006) and (Dreyer and Eisner, 2011), which deal with the automatic extraction of paradigms from text and investigate how good these can become. The main contrast is, again, that our work deals with hand-written paradigms that are correct by design, and we try to see how much information we can drop before losing correctness.

Once given, a set of paradigms can be used in automated lexicon extraction from raw data, as in (Forsberg et al., 2006) and (Clement et al., 2004), by a method that tries to collect a sufficient number of forms to determine that a word belongs to a certain paradigm. Smart paradigms can then give the method to actually construct the full inflection tables from the characteristic forms.

7 Conclusion

We have introduced the notion of smart paradigms, which implement the linguistic knowledge involved in inferring the inflection of words. We have used the paradigms to estimate the predictability of nouns and verbs in English, Swedish, French, and Finnish. The main result is that, with the paradigms used, less than two forms on average is always enough. In half of the languages and categories, one form is enough to predict more than 90% of forms correctly. This gives promise for both manual lexicon building and automatic bootstrapping of lexica from word lists.

To estimate the overall complexity of inflection systems, we have also measured the size of the source code for the paradigm systems. Unsurprisingly, Finnish is around seven times as complex as English, and around three times as complex as Swedish and French. But this cost is amortized when big lexica are built.

Finally, we looked at smart paradigms as a data compression method. With simple morphologies, such as English nouns, bzip2 gave a better compression of the lexicon than the source code using paradigms. But with Finnish verbs, the compression rate was almost 20 times higher with paradigms than with bzip2.

The general conclusion is that smart paradigms are a good investment when building morphological lexica, as they ease the task of both human lexicographers and automatic bootstrapping
methods. They also suggest a method to assess the complexity and learnability of languages, related to Kolmogorov complexity. The results in the current paper are just preliminary in this respect, since they might still tell more about particular implementations of paradigms than about the languages themselves.

Acknowledgements

We are grateful to the anonymous referees for valuable remarks and questions. The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. FP7-ICT-247914 (the MOLTO project).

References

[Beesley and Karttunen, 2003] Kenneth R. Beesley and Lauri Karttunen. 2003. Finite State Morphology. CSLI Publications.

[Bescherelle, 1997] Bescherelle. 1997. La conjugaison pour tous. Hatier.

[Borin et al., 2008] Lars Borin, Markus Forsberg, and Lennart Lonngren. 2008. SALDO 1.0 (Svenskt associationslexikon version 2). Sprakbanken, 05.

[Chan, 2006] Erwin Chan. 2006. Learning probabilistic paradigms for morphology in a latent class model. In Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology, SIGPHON '06, pages 69-78, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Chanod and Tapanainen, 1995] Jean-Pierre Chanod and Pasi Tapanainen. 1995. Creating a tagset, lexicon and guesser for a French tagger. CoRR, cmp-lg/9503004.

[Clement et al., 2004] Lionel Clement, Benoit Sagot, and Bernard Lang. 2004. Morphology based automatic acquisition of large-coverage lexica. In Proceedings of LREC-04, Lisboa, Portugal, pages 1841-1844.

[Dreyer and Eisner, 2011] Markus Dreyer and Jason Eisner. 2011. Discovering morphological paradigms from plain text using a Dirichlet process mixture model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 616-627, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Forsberg et al., 2006] Markus Forsberg, Harald Hammarstrom, and Aarne Ranta. 2006. Morphological Lexicon Extraction from Raw Text Data. In T. Salakoski, editor, FinTAL 2006, volume 4139 of LNCS/LNAI.

[Goldsmith, 2006] John Goldsmith. 2006. An Algorithm for the Unsupervised Learning of Morphology. Nat. Lang. Eng., 12(4):353-371.

[Hellberg, 1978] Staffan Hellberg. 1978. The Morphology of Present-Day Swedish. Almqvist & Wiksell.

[Hlavacova, 2001] Jaroslava Hlavacova. 2001. Morphological guesser of Czech words. In Vaclav Matousek, Pavel Mautner, Roman Moucek, and Karel Tauser, editors, Text, Speech and Dialogue, volume 2166 of Lecture Notes in Computer Science, pages 70-75. Springer Berlin / Heidelberg.

[Kaplan and Kay, 1994] R. Kaplan and M. Kay. 1994. Regular Models of Phonological Rule Systems. Computational Linguistics, 20:331-380.

[Koskenniemi, 1983] Kimmo Koskenniemi. 1983. Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. Ph.D. thesis, University of Helsinki.

[Kotimaisten Kielten Tutkimuskeskus, 2006] Kotimaisten Kielten Tutkimuskeskus. 2006. KOTUS Wordlist. http://kaino.kotus.fi/sanat/nykysuomi.

[Nakov et al., 2003] Preslav Nakov, Yury Bonev, et al. 2003. Guessing morphological classes of unknown German nouns.

[Ranta, 2008] Aarne Ranta. 2008. How predictable is Finnish morphology? An experiment on lexicon construction. In J. Nivre, M. Dahllof, and B. Megyesi, editors, Resourceful Language Technology: Festschrift in Honor of Anna Sagvall Hein, pages 130-148. University of Uppsala. http://publications.uu.se/abstract.xsql?dbid=8933.

[Ranta, 2011] Aarne Ranta. 2011. Grammatical Framework: Programming with Multilingual Grammars. CSLI Publications, Stanford. ISBN-10: 1-57586-626-9 (Paper), 1-57586-627-7 (Cloth).

[Romary et al., 2004] Laurent Romary, Susanne Salmon-Alt, and Gil Francopoulo. 2004. Standards going concrete: from LMF to Morphalou. In The 20th International Conference on Computational Linguistics - COLING 2004, Geneva, Switzerland.

[Solomonoff, 1964] Ray J. Solomonoff. 1964. A formal theory of inductive inference: Parts 1 and 2. Information and Control, 7:1-22 and 224-254.
Probabilistic Hierarchical Clustering of Morphological Paradigms

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 654-663, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
[Figure 2: A sample binary tree, in which a node with data Dk = {walk, talk, quick} × {∅, ed, ing, ly, s} branches into subtrees with data Di = {walk, talk} × {∅, ed, ing, s} and Dj.]
method is similar to the Dirichlet Process (DP) based model of Goldwater et al. (2006). From this perspective, our method can be understood as adding a hierarchical structure learning layer on top of the DP based learning method proposed in Goldwater et al. (2006). Dreyer and Eisner (2011) propose an infinite Dirichlet mixture model for capturing paradigms. However, they do not address learning of hierarchy.

The method proposed in Chan (2006) also learns within a hierarchical structure, where Latent Dirichlet Allocation (LDA) is used to find stem-suffix matrices. However, their work is supervised, as true morphological analyses of words are provided to the system. In contrast, our proposed method is fully unsupervised.

3 Probabilistic Hierarchical Model

The hierarchical clustering proposed in this work is different from existing hierarchical clustering algorithms in two aspects:

- It is not single-pass, as the hierarchical structure changes.
- It is probabilistic and is not dependent on a distance metric.

3.1 Mathematical Definition

In this paper, a hierarchical structure is a binary tree in which each internal node represents a cluster.

Let a data set be D = {x_1, x_2, . . . , x_n} and T be the entire tree, where each data point x_i is located at one of the leaf nodes (see Figure 2). Here, D_k denotes the data points in the branch T_k. Each node defines a probabilistic model for the words that the cluster acquires. The probabilistic model can be denoted as p(x_i | θ), where θ denotes the parameters of the probabilistic model.

The marginal probability of data in any node can be calculated as:

    p(D_k) = ∫ p(D_k | θ) p(θ | β) dθ    (1)

The likelihood of data under any subtree is defined as follows:

    p(D_k | T_k) = p(D_k) p(D_l | T_l) p(D_r | T_r)    (2)

where the probability is defined in terms of the left T_l and right T_r subtrees. Equation 2 provides a recursive decomposition of the likelihood in terms of the likelihood of the left and the right subtrees, until the leaf nodes are reached. We use the marginal probability (Equation 1) as prior information, since the marginal probability bears the probability of having the data from the left and right subtrees within a single cluster.

4 Morphological Segmentation

In our model, data points are words to be clustered and each cluster represents a paradigm. In the hierarchical structure, words will be organised in such a way that morphologically similar words will be located close to each other, to be grouped in the same paradigms. Morphological similarity refers to at least one common morpheme between words. However, we do not make a distinction between morpheme types. Instead, we assume that each word is organised as a stem+suffix combination.

4.1 Model Definition

Let a dataset D consist of words to be analysed, where each word w_i has a latent variable which is
the split point that analyses the word into its stem s_i and suffix m_i:

    D = {w_1 = s_1 + m_1, . . . , w_n = s_n + m_n}

The marginal likelihood of words in the node k is defined such that:

    p(D_k) = p(S_k) p(M_k)
           = p(s_1, s_2, . . . , s_n) p(m_1, m_2, . . . , m_n)
    p(w = s + m) = p(s) p(m)    (3)

We define two Dirichlet processes to generate stems and suffixes independently:

    G_s | α_s, P_s ∼ DP(α_s, P_s)
    G_m | α_m, P_m ∼ DP(α_m, P_m)
    s | G_s ∼ G_s
    m | G_m ∼ G_m

where DP(α_s, P_s) denotes a Dirichlet process that generates stems. Here, α_s is the concentration parameter, which determines the number of stem types generated by the Dirichlet process. The smaller the value of the concentration parameter, the less likely the process is to generate new stem types. In contrast, the larger the value of the concentration parameter, the more likely it is to generate new stem types, yielding a more uniform distribution over stem types. If α_s < 1, sparse stems are supported, and it yields a more skewed distribution. To support a small number of stem types in each cluster, we chose α_s < 1.

Here, P_s is the base distribution. We use the base distribution as a prior probability distribution for morpheme lengths. We model morpheme letters such that

    P_s(s) = ∏_i p(c_i)    (4)

where c_i denotes the letters, which are distributed uniformly. Modelling morpheme letters is a way of modelling the morpheme length, since shorter morphemes are favoured in order to have fewer factors in Equation 4 (Creutz and Lagus, 2005b). The Dirichlet process DP(α_m, P_m) is defined for suffixes analogously. The graphical representation of the entire model is given in Figure 3.

[Figure 3: Graphical representation of the model: stems s_i and suffixes m_i are drawn from G_s and G_m, which come from Dirichlet processes with parameters (α_s, P_s) and (α_m, P_m); plates range over the L stem tokens and N suffix tokens.]

Once the probability distributions G = {G_s, G_m} are drawn from both Dirichlet processes, words can be generated by drawing a stem from G_s and a suffix from G_m. However, we do not attempt to estimate the probability distributions G; instead, G is integrated out. The joint probability of stems is calculated by integrating out G_s:

    p(s_1, s_2, . . . , s_L) = ∫ p(G_s) ∏_{i=1}^{L} p(s_i | G_s) dG_s    (5)

where L denotes the number of stem tokens. The joint probability distribution of stems can be tackled as a Chinese restaurant process. The Chinese restaurant process introduces dependencies between stems. Hence, the joint probability of
stems S = {s1 , . . . , sL } becomes:
p(s1 , s2 , . . . , sL )
= p(s1 )p(s2 |s1 ) . . . p(sM |s1 , . . . , sM 1 )
(s ) K K
= sK1 Ps (si ) (nsi 1)!
(L + s )
i=1 i=1
(6)
where K denotes the number of stem types. In the equation, the second and the third factors correspond to the case where novel stems are generated for the first time; the last factor corresponds to the case in which stems that have already been generated n_{s_i} times previously are generated again. The joint probability of suffixes is computed analogously:

p(m_1, m_2, \ldots, m_N) = p(m_1) p(m_2 | m_1) \cdots p(m_N | m_1, \ldots, m_{N-1})
  = ( \Gamma(\alpha_m) / \Gamma(N + \alpha_m) ) \, \alpha_m^{T} \prod_{i=1}^{T} P_m(m_i) \prod_{i=1}^{T} (n_{m_i} - 1)!    (7)

where N denotes the number of suffix tokens, T denotes the number of suffix types, and n_{m_i} is the number of suffix instances m_i which have already been generated.

Figure 4: A portion of a sample tree.
Following the joint probability distribution of stems, the conditional probability of a stem given the previously generated stems can be derived as:

p(s_i | S^{-s_i}, \alpha_s, P_s) =
  n_{s_i}^{S^{-s_i}} / (L - 1 + \alpha_s)    if s_i \in S
  \alpha_s P_s(s_i) / (L - 1 + \alpha_s)     otherwise    (8)

where n_{s_i}^{S^{-s_i}} denotes the number of stem instances s_i that have been previously generated, and S^{-s_i} denotes the stem set excluding the new instance of the stem s_i.

A portion of a tree is given in Figure 4. As can be seen in the figure, all words are located at leaf nodes. Therefore, the root node of this subtree consists of the words {plugg+ed, skew+ed, exclaim+ed, borrow+s, borrow+ed, liken+s, liken+ed, consist+s, consist+ed}.
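The predictive rule of Equation 8 can be sketched as follows. This is a minimal illustrative sketch: `length_prior` is a stand-in for the base distribution P_s over morpheme lengths (an assumption, not the paper's exact prior), and the counts are toy values rather than an inferred lexicon.

```python
from collections import Counter

def length_prior(morph, p=0.5):
    # Illustrative geometric prior over morpheme length, standing in
    # for the base distribution P_s (an assumption, not the paper's prior).
    return (1 - p) ** (len(morph) - 1) * p

def crp_predictive(stem, counts, alpha, base_prob):
    """Predictive probability of a stem under the Chinese restaurant
    process (Eq. 8): reuse a seen stem in proportion to its count, or
    generate a new one from the base distribution."""
    total = sum(counts.values())  # L - 1: previously generated stem tokens
    if stem in counts:
        return counts[stem] / (total + alpha)
    return alpha * base_prob(stem) / (total + alpha)

counts = Counter({"borrow": 2, "consist": 2, "blind": 4})
p_seen = crp_predictive("blind", counts, alpha=0.1, base_prob=length_prior)
p_new = crp_predictive("plugg", counts, alpha=0.1, base_prob=length_prior)
```

With a small concentration parameter, as chosen in the paper, reusing an existing stem type is strongly favoured over opening a "new table" for an unseen one.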
The conditional probability of a suffix given the other suffixes that have been previously generated is defined similarly:

p(m_i | M^{-m_i}, \alpha_m, P_m) =
  n_{m_i}^{M^{-m_i}} / (N - 1 + \alpha_m)    if m_i \in M
  \alpha_m P_m(m_i) / (N - 1 + \alpha_m)     otherwise    (9)

where n_{m_i}^{M^{-m_i}} is the number of instances m_i that have been generated previously, and M^{-m_i} is the set of suffixes excluding the new instance of the suffix m_i.

4.2 Inference

The initial tree is constructed by randomly choosing a word from the corpus and adding it at a randomly chosen position in the tree. When constructing the initial tree, latent variables are also assigned randomly, i.e. each word is split at a random position (see Algorithm 1).

We use the Metropolis-Hastings algorithm (Hastings, 1970), an instance of Markov chain Monte Carlo (MCMC) algorithms, to infer the optimal hierarchical structure along with the morphological segmentation of words (given in Algorithm 2). During each iteration i, a leaf node D_i = {w_i = s_i + m_i} is drawn from the current tree structure and removed from the tree. Next, a node D_k is drawn uniformly from the tree to become the sibling of D_i.
Algorithm 1 Creating the initial tree
1: input: data D = {w_1 = s_1 + m_1, \ldots, w_n = s_n + m_n}
2: initialise: root D_1 where D_1 = {w_1 = s_1 + m_1}
3: initialise: c <- n - 1
4: while c >= 1 do
5:   Draw a word w_j from the corpus.
6:   Split the word randomly such that w_j = s_j + m_j
7:   Create a new node D_j where D_j = {w_j = s_j + m_j}
8:   Choose a sibling node D_k for D_j
9:   Merge D_new <- D_j \cup D_k
10:  Remove w_j from the corpus
11:  c <- c - 1
12: end while
13: output: Initial tree

Algorithm 2 Inference algorithm
1: input: data D = {w_1 = s_1 + m_1, \ldots, w_n = s_n + m_n}, initial tree T, initial temperature of the system \gamma, target temperature of the system \gamma_target, temperature decrement \eta
2: initialise: i <- 1, w <- w_i = s_i + m_i, p_cur(D|T) <- p(D|T)
3: while \gamma > \gamma_target do
4:   Remove the leaf node D_i that has the word w_i = s_i + m_i
5:   Draw a split point for the word such that w_i = s_i + m_i
6:   Draw a sibling node D_j
7:   D_m <- D_i \cup D_j
8:   Update p_next(D|T)
9:   if p_next(D|T) >= p_cur(D|T) then
10:    Accept the new tree structure
11:    p_cur(D|T) <- p_next(D|T)
12:  else
13:    random <- Normal(0, 1)
14:    if random < (p_next(D|T) / p_cur(D|T))^{1/\gamma} then
15:      Accept the new tree structure
16:      p_cur(D|T) <- p_next(D|T)
17:    else
18:      Reject the new tree structure
19:      Re-insert the node D_i at its previous position with the previous split point
20:    end if
21:  end if
22:  w <- w_{i+1} = s_{i+1} + m_{i+1}
23:  \gamma <- \gamma - \eta
24: end while
25: output: A tree structure where each node corresponds to a paradigm.

In addition to a sibling node, a split point w_i = s_i + m_i is drawn uniformly. Next, the node D_i = {w_i = s_i + m_i} is inserted as a sibling of D_k. After updating all probabilities along the path to the root, the new tree structure is either accepted or rejected by applying the Metropolis-Hastings update rule. The likelihood of the data under the given tree structure is used as the sampling probability.

We use a simulated annealing schedule to update P_Acc:

P_Acc = ( p_next(D|T) / p_cur(D|T) )^{1/\gamma}    (10)

where \gamma denotes the current temperature, p_next(D|T) denotes the marginal likelihood of the data under the new tree structure, and p_cur(D|T) denotes the marginal likelihood of the data under the latest accepted tree structure. If p_next(D|T) > p_cur(D|T), the update is accepted (see line 9 of Algorithm 2); otherwise, the tree structure is still accepted with probability P_Acc (see line 14 of Algorithm 2). In our experiments (see Section 5) we set \gamma to 2. The system temperature is reduced in each iteration of the Metropolis-Hastings algorithm:

\gamma = \gamma - \eta    (11)

Most tree structures are accepted in the earlier stages of the algorithm; however, as the temperature decreases, only tree structures that lead to a considerable improvement in the marginal probability p(D|T) are accepted.

An illustration of sampling a new tree structure is given in Figures 5 and 6. Figure 5 shows that D_0 will be removed from the tree in order to sample a new position on the tree, along with a new split point of the word. Once the leaf node is removed from the tree, its parent node D_5 is also removed, as it would consist of only one child. Figure 6 shows that D_8 is sampled to be the sibling node of D_0. Subsequently, the two nodes are merged within a new cluster that introduces a new node, D_9.
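The annealed Metropolis-Hastings update of Equation 10, with the cooling schedule of Equation 11, can be sketched as follows. This is a toy sketch: the proposal simply perturbs the current likelihood instead of performing the tree moves of Algorithm 2, and the acceptance threshold is drawn uniformly (the standard Metropolis-Hastings choice).

```python
import math
import random

def accept(p_next, p_cur, gamma, rng):
    """Annealed Metropolis-Hastings rule (Eq. 10): always accept an
    improvement; otherwise accept with prob. (p_next/p_cur)**(1/gamma)."""
    if p_next >= p_cur:
        return True
    return rng.random() < (p_next / p_cur) ** (1.0 / gamma)

rng = random.Random(0)
gamma, gamma_target, eta = 2.0, 0.01, 0.0001  # the paper's annealing settings
log_p_cur = -100.0  # toy log marginal likelihood of the current tree

while gamma > gamma_target:
    # Toy proposal: perturb the current score instead of re-linking the tree.
    log_p_next = log_p_cur + rng.gauss(0.0, 0.01)
    if accept(math.exp(log_p_next), math.exp(log_p_cur), gamma, rng):
        log_p_cur = log_p_next
    gamma -= eta  # cooling schedule of Eq. 11
```

As the temperature falls, the exponent 1/gamma grows, so proposals that lower the likelihood are accepted ever more rarely, matching the behaviour described above.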
Figure 5: D_0 will be removed from the tree.

Figure 6: D_8 is sampled to be the sibling of D_0.

4.3 Morphological Segmentation

Once the optimal tree structure is inferred, along with the morphological segmentation of words, any novel word can be analysed. For the segmentation of novel words, the root node is used, as it contains all stems and suffixes already extracted from the training data. Morphological segmentation is performed in two ways: segmentation at a single point and segmentation at multiple points.

4.3.1 Single Split Point

In order to find a single split point for the morphological segmentation of a word, the split point yielding the maximum probability given the inferred stems and suffixes is chosen as the final analysis of the word:

arg max_j p(w_i = s_j + m_j | D_root, \alpha_m, P_m, \alpha_s, P_s)    (12)

where D_root refers to the root of the entire tree. Here, the probability of a segmentation of a given word given D_root is calculated as follows:

p(w_i = s_j + m_j | D_root, \alpha_m, P_m, \alpha_s, P_s) = p(s_j | S_root, \alpha_s, P_s) \, p(m_j | M_root, \alpha_m, P_m)    (13)

where S_root denotes all the stems in D_root and M_root denotes all the suffixes in D_root. Here p(s_j | S_root, \alpha_s, P_s) is calculated as given below:

p(s_i | S_root, \alpha_s, P_s) =
  n_{s_i}^{S_root} / (L + \alpha_s)    if s_i \in S_root
  \alpha_s P_s(s_i) / (L + \alpha_s)   otherwise    (14)

p(m_i | M_root, \alpha_m, P_m) =
  n_{m_i}^{M_root} / (N + \alpha_m)    if m_i \in M_root
  \alpha_m P_m(m_i) / (N + \alpha_m)   otherwise    (15)

4.3.2 Multiple Split Points

In order to discover words with multiple split points, we propose a hierarchical segmentation in which each segment is split further. The rules for generating multiple split points are given by the following context-free grammar:

w -> s1 m1 | s2 m2    (16)
s1 -> s m | s s       (17)
s2 -> s               (18)
m1 -> m m             (19)
m2 -> s m | m m       (20)

Here, s is a pre-terminal node that generates all the stems from the root node, and similarly, m is a pre-terminal node that generates all the suffixes from the root node. First, using Equation 16, the word (e.g. housekeeper) is split into s1 m1 (e.g. housekeep+er) or s2 m2 (e.g. house+keeper). The first segment is regarded as a stem, and the second segment is either a stem or a suffix, considering the probability of having a compound word. Equation 12 is used to decide whether the second segment is a stem or a suffix. At the second segmentation level, each segment is split once more. If the first production rule is followed at the first segmentation level, the first segment s1 can be analysed as s m (e.g. housekeep+) or s s (e.g. house+keep) (Equation 17).
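The single-split decision of Equations 12-15, and the stem-versus-suffix comparison used at the second segmentation level, can be sketched as follows. This is an illustrative sketch: `BASE` stands in for the base-distribution term P(.), and the counts are toy values rather than the inferred root-node lexicon.

```python
from collections import Counter

BASE = 1e-6  # stands in for the base-distribution probability P(.)

def root_prob(item, counts, total, alpha):
    """CRP probability of a morph relative to the root node (Eqs. 14-15)."""
    if item in counts:
        return counts[item] / (total + alpha)
    return alpha * BASE / (total + alpha)

def best_single_split(word, stems, suffixes, alpha_s=0.1, alpha_m=0.1):
    """Single split point (Eq. 12): maximise p(stem|root) * p(suffix|root)."""
    L, N = sum(stems.values()), sum(suffixes.values())
    return max(range(1, len(word) + 1),
               key=lambda j: root_prob(word[:j], stems, L, alpha_s)
                           * root_prob(word[j:], suffixes, N, alpha_m))

def label_segment(seg, stems, suffixes, alpha_s=0.1, alpha_m=0.1):
    """Second-level decision in the spirit of Eqs. 21-22: a segment is a
    stem if it is more probable under the stem distribution than under
    the suffix distribution."""
    L, N = sum(stems.values()), sum(suffixes.values())
    return ("stem" if root_prob(seg, stems, L, alpha_s)
                      > root_prob(seg, suffixes, N, alpha_m) else "suffix")

stems = Counter({"borrow": 3, "blind": 2})
suffixes = Counter({"ed": 3, "s": 2, "": 1})
j = best_single_split("borrowed", stems, suffixes)  # splits as borrow+ed
```

Note that an empty suffix is kept in the suffix lexicon, mirroring analyses such as blind+ in the sample tree nodes.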
Figure 7: An example that depicts how the word housekeeper can be analysed further to find more split points.

Figure 8: Marginal likelihood convergence for datasets of size 16K and 22K words.
The decision to choose which production rule to apply is made using:

s1 -> s s    if p(s | S, \alpha_s, P_s) > p(m | M, \alpha_m, P_m)
s1 -> s m    otherwise    (21)

where S and M denote all the stems and suffixes in the root node. Following the same production rule, the second segment m1 can only be analysed as m m (er+). We postulate that words cannot have more than two stems, and that suffixes always follow stems; we do not allow any prefixes, circumfixes, or infixes. Therefore, the first production rule can output two different analyses: s m m m and s s m m (e.g. housekeep+er and house+keep+er).

On the other hand, if the word is analysed as s2 m2 (e.g. house+keeper), then s2 cannot be analysed further (e.g. house). The second segment m2 can be analysed further, either as s m (stem+suffix, e.g. keep+er, keeper+) or as m m (suffix+suffix). The decision to choose which production rule to apply is made as follows:

m2 -> s m    if p(s | S, \alpha_s, P_s) > p(m | M, \alpha_m, P_m)
m2 -> m m    otherwise    (22)

Thus, the second production rule yields two different analyses: s s m and s m m (e.g. house+keep+er or house+keeper).

5 Experiments & Results

Two sets of experiments were performed for the evaluation of the model. In the first set of experiments, each word is split at a single point, giving a single stem and a single suffix. In the second set of experiments, potentially multiple split points are generated by splitting each stem and suffix once more, where possible.

Morpho Challenge (Kurimo et al., 2011b) provides a well-established evaluation framework that additionally allows comparing our model on a range of languages. In both sets of experiments, the Morpho Challenge 2010 dataset is used (Kurimo et al., 2011b). Experiments are performed for English, where the dataset consists of 878,034 words. Although the dataset provides word frequencies, we have not used any frequency information. However, for training our model, we only chose words with frequency greater than 200.

In our experiments, we used dataset sizes of 10K, 16K, and 22K words. For the final evaluation, we trained our models on 22K words. We were unable to complete the experiments with larger training datasets due to memory limitations; we plan to report on this in future work. Once the tree is learned by the inference algorithm, the final tree is used for the segmentation of the entire dataset. Several experiments are performed for each setting, where the settings vary in tree size and model parameters. The model parameters are the concentration parameters \alpha = {\alpha_s, \alpha_m} of the Dirichlet processes; the values used in the experiments are 0.1, 0.2, 0.02, 0.001, and 0.002.

In all experiments, the initial temperature of the system is set to \gamma = 2 and it is reduced to the target temperature \gamma = 0.01 with decrements \eta = 0.0001. Figure 8 shows how the log likelihoods of trees of size 16K and 22K converge over time (where the time axis refers to sampling iterations).

Since different training sets lead to different tree structures, each experiment is repeated three times with the same experimental setting.
Data Size | P(%) | R(%) | F(%) | \alpha_s, \alpha_m
10K | 81.48 | 33.03 | 47.01 | 0.1, 0.1
16K | 86.48 | 35.13 | 50.02 | 0.002, 0.002
22K | 89.04 | 36.01 | 51.28 | 0.002, 0.002

Table 1: Highest evaluation scores of single split point experiments obtained from the trees with 10K, 16K, and 22K words.

Data Size | P(%) | R(%) | F(%) | \alpha_s, \alpha_m
10K | 62.45 | 57.62 | 59.98 | 0.1, 0.1
16K | 67.80 | 57.72 | 62.36 | 0.002, 0.002
22K | 68.71 | 62.56 | 62.56 | 0.001, 0.001

Table 2: Evaluation scores of multiple split point experiments obtained from the trees with 10K, 16K, and 22K words.

System | P(%) | R(%) | F(%)
Allomorf (Virpioja et al., 2009) | 68.98 | 56.82 | 62.31
Morf. Base. (Creutz and Lagus, 2002) | 74.93 | 49.81 | 59.84
PM-Union (Monson et al., 2009) | 55.68 | 62.33 | 58.82
Lignos (Lignos et al., 2009) | 83.49 | 45.00 | 58.48
Prob. Clustering (multiple) | 57.08 | 57.58 | 57.33
PM-mimic (Monson et al., 2009) | 53.13 | 59.01 | 55.91
MorphoNet (Bernhard, 2009) | 65.08 | 47.82 | 55.13
Rali-cof (Lavallee and Langlais, 2009) | 68.32 | 46.45 | 55.30
CanMan (Can and Manandhar, 2009) | 58.52 | 44.82 | 50.76

Table 3: Comparison with other unsupervised systems that participated in Morpho Challenge 2009 for English.

5.1 Experiments with Single Split Points

In the first set of experiments, words are split into a single stem and suffix. During segmentation, Equation 12 is used to determine the split position of each word. The evaluation scores are given in Table 1. The highest F-measure obtained is 51.28%, with the dataset of 22K words. The scores are noticeably higher with the largest training set.

5.2 Experiments with Multiple Split Points

The evaluation scores of the experiments with multiple split points are given in Table 2. The highest F-measure obtained is 62.56%, with the dataset of 22K words. As for single split points, the scores are noticeably higher with the largest training set. For both single and multiple segmentation, the same inferred tree has been used.

5.3 Comparison with Other Systems

For all our evaluation experiments using Morpho Challenge 2010 (English and Turkish) and Morpho Challenge 2009 (English), we used 22K words for training. For each evaluation, we randomly chose 22K words for training and ran our MCMC inference procedure to learn our model. We generated 3 different models by choosing 3 different randomly generated training sets, each consisting of 22K words, and report the best results over these 3 models. We report the best of the 3 models because of the small (22K word) datasets used; use of larger datasets would have resulted in less variation and better results.

We compare our system with the other participant systems in Morpho Challenge 2010. The results are given in Table 6 (Virpioja et al., 2011). Since the model is evaluated on the official (hidden) Morpho Challenge 2010 evaluation dataset, for which we submitted our system to the organisers for evaluation, the scores differ from the ones presented in Tables 1 and 2.

We also report experiments with the Morpho Challenge 2009 English dataset, which consists of 384,904 words. Our results and the results of other participant systems in Morpho Challenge 2009 are given in Table 3 (Kurimo et al., 2009). It should be noted that we only present the top systems that participated in Morpho Challenge 2009; if all the systems are considered, our system comes 5th out of 16 systems.

The problem of morphologically rich languages is not our priority within this research. Nevertheless, we provide evaluation scores on Turkish. The Turkish dataset consists of 617,298 words. We chose words with frequency greater than 50 for Turkish, since the Turkish dataset is not large enough. The results for Turkish are given in Table 4. Our system comes 3rd out of 7 systems.

6 Discussion

The model can easily capture common suffixes such as -less, -s, -ed, -ment, etc. Some sample tree nodes obtained from trees are given in Table 5.
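The F-measures reported in the evaluation tables are the harmonic mean of the precision and recall scores, as used by the Morpho Challenge evaluation; for example, for the 22K single-split result:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Table 1, 22K row: P = 89.04, R = 36.01 gives F = 51.28.
f = round(f_measure(89.04, 36.01), 2)
```

This also makes the precision/recall trade-off visible: the multiple-split models have much lower precision but far higher recall, and the harmonic mean rewards that balance.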
System | P(%) | R(%) | F(%)
Morf. CatMAP | 79.38 | 31.88 | 45.49
Aggressive Comp. | 55.51 | 34.36 | 42.45
Prob. Clustering (multiple) | 72.36 | 25.81 | 38.04
Iterative Comp. | 68.69 | 21.44 | 32.68
Nicolas | 79.02 | 19.78 | 31.64
Morf. Base. | 89.68 | 17.78 | 29.67
Base Inference | 72.81 | 16.11 | 26.38

Table 4: Comparison with other unsupervised systems that participated in Morpho Challenge 2010 for Turkish.

System | P(%) | R(%) | F(%)
Base Inference (Lignos, 2010) | 80.77 | 53.76 | 64.55
Iterative Comp. (Lignos, 2010) | 80.27 | 52.76 | 63.67
Aggressive Comp. (Lignos, 2010) | 71.45 | 52.31 | 60.40
Nicolas (Nicolas et al., 2010) | 67.83 | 53.43 | 59.78
Prob. Clustering (multiple) | 57.08 | 57.58 | 57.33
Morf. Baseline (Creutz and Lagus, 2002) | 81.39 | 41.70 | 55.14
Prob. Clustering (single) | 70.76 | 36.51 | 48.17
Morf. CatMAP (Creutz and Lagus, 2005a) | 86.84 | 30.03 | 44.63

Table 6: Comparison of our model with other unsupervised systems that participated in Morpho Challenge 2010 for English.

regard+less, base+less, shame+less, bound+less, harm+less, regard+ed, relent+less
solve+d, high+-priced, lower+s, lower+-level, high+-level, lower+-income, histor+ians
pre+mise, pre+face, pre+sumed, pre+, pre+gnant
base+ment, ail+ment, over+looked, predica+ment, deploy+ment, compart+ment, embodi+ment
anti+-fraud, anti+-war, anti+-tank, anti+-nuclear, anti+-terrorism, switzer+, anti+gua, switzer+land
sharp+ened, strength+s, tight+ened, strength+ened, black+ened
inspir+e, inspir+ing, inspir+ed, inspir+es, earn+ing, ponder+ing
downgrade+s, crash+ed, crash+ing, lack+ing, blind+ing, blind+, crash+, compris+ing, compris+es, stifl+ing, compris+ed, lack+s, assist+ing, blind+ed, blind+er

Table 5: Sample tree nodes obtained from various trees.

As seen from the table, morphologically similar words are grouped together. Morphological similarity refers to at least one common morpheme between words. For example, the words high-priced and lower-level are grouped in the same node through the word high-level, which shares the same stem with high-priced and the same ending with lower-level.

As seen from the sample nodes, prefixes can also be identified, for example anti+fraud, anti+war, anti+tank, anti+nuclear. This illustrates the flexibility of the model in capturing similarities through stems, suffixes or prefixes. However, as mentioned above, the model does not discriminate between different types of morphological forms during training. As the prefix pre- appears at the beginning of words, it is identified as a stem. However, identifying pre- as a stem does not change the morphological analysis of the word.

Sometimes similarities may not yield a valid analysis of words. For example, the prefix pre- leads the words pre+mise, pre+sumed, pre+gnant to be analysed wrongly, whereas pre- is a valid prefix for the word pre+face. Another nice feature of the model is that compounds are easily captured through common stems: e.g. doubt+fire, bon+fire, gun+fire, clear+cut.

7 Conclusion & Future Work

In this paper, we present a novel probabilistic model for unsupervised morphology learning. The model adopts a hierarchical structure in which words are organised in a tree so that morphologically similar words are located close to each other.

In hierarchical clustering, tree-cutting would be very useful, but it is not addressed in the current paper; we used just the root node as a morpheme lexicon to apply segmentation. Clearly, adding tree-cutting would improve the accuracy of the segmentation and would help us identify paradigms with higher accuracy. However, the segmentation accuracy obtained without tree-cutting provides a useful indicator of whether this approach is promising, and the experimental results show that this is indeed the case.

In the current model, we did not use any syntactic information, only words. POS tags could be utilised to group words which are both morphologically and syntactically similar.
References

Delphine Bernhard. 2009. MorphoNet: Exploring the use of community structure for unsupervised morpheme analysis. In Working Notes for the CLEF 2009 Workshop, September.

Burcu Can and Suresh Manandhar. 2009. Clustering morphological paradigms using syntactic categories. In Working Notes for the CLEF 2009 Workshop, September.

Erwin Chan. 2006. Learning probabilistic paradigms for morphology in a latent class model. In Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology, SIGPHON '06, pages 69-78, Stroudsburg, PA, USA. Association for Computational Linguistics.

Mathias Creutz and Krista Lagus. 2002. Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning - Volume 6, MPL '02, pages 21-30, Stroudsburg, PA, USA. Association for Computational Linguistics.

Mathias Creutz and Krista Lagus. 2005a. Inducing the morphological lexicon of a natural language from unannotated text. In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR 2005), pages 106-113.

Mathias Creutz and Krista Lagus. 2005b. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Technical Report A81.

Markus Dreyer and Jason Eisner. 2011. Discovering morphological paradigms from plain text using a Dirichlet process mixture model. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 616-627, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

John Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153-198.

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2006. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems 18.

W. K. Hastings. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97-109.

Mikko Kurimo, Sami Virpioja, Ville T. Turunen, Graeme W. Blackwood, and William Byrne. 2009. Overview and results of Morpho Challenge 2009. In Proceedings of the 10th Cross-Language Evaluation Forum Conference on Multilingual Information Access Evaluation: Text Retrieval Experiments, CLEF '09, pages 578-597, Berlin, Heidelberg. Springer-Verlag.

Mikko Kurimo, Krista Lagus, Sami Virpioja, and Ville Turunen. 2011a. Morpho Challenge 2009. http://research.ics.tkk.fi/events/morphochallenge2009/, June.

Mikko Kurimo, Krista Lagus, Sami Virpioja, and Ville Turunen. 2011b. Morpho Challenge 2010. http://research.ics.tkk.fi/events/morphochallenge2010/, June.

Jean Francois Lavallee and Philippe Langlais. 2009. Morphological acquisition by formal analogy. In Working Notes for the CLEF 2009 Workshop, September.

Constantine Lignos, Erwin Chan, Mitchell P. Marcus, and Charles Yang. 2009. A rule-based unsupervised morphology learning framework. In Working Notes for the CLEF 2009 Workshop, September.

Constantine Lignos. 2010. Learning from unseen data. In Mikko Kurimo, Sami Virpioja, Ville Turunen, and Krista Lagus, editors, Proceedings of the Morpho Challenge 2010 Workshop, pages 35-38, Aalto University, Espoo, Finland.

Christian Monson, Kristy Hollingshead, and Brian Roark. 2009. Probabilistic ParaMor. In Proceedings of the 10th Cross-Language Evaluation Forum Conference on Multilingual Information Access Evaluation: Text Retrieval Experiments, CLEF '09, September.

Lionel Nicolas, Jacques Farre, and Miguel A. Molinero. 2010. Unsupervised learning of concatenative morphology based on frequency-related form occurrence. In Mikko Kurimo, Sami Virpioja, Ville Turunen, and Krista Lagus, editors, Proceedings of the Morpho Challenge 2010 Workshop, pages 39-43, Aalto University, Espoo, Finland.

Matthew G. Snover, Gaja E. Jarosz, and Michael R. Brent. 2002. Unsupervised learning of morphology using a novel directed search algorithm: Taking the first step. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pages 11-20, Morristown, NJ, USA. ACL.

Sami Virpioja, Oskar Kohonen, and Krista Lagus. 2009. Unsupervised morpheme discovery with Allomorfessor. In Working Notes for the CLEF 2009 Workshop, September.

Sami Virpioja, Ville T. Turunen, Sebastian Spiegler, Oskar Kohonen, and Mikko Kurimo. 2011. Empirical comparison of evaluation methods for unsupervised learning of morphology. Traitement Automatique des Langues.
Modeling Inflection and Word-Formation in SMT

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 664-674, Avignon, France, April 23-27 2012. (c) 2012 Association for Computational Linguistics
to model compounds, a highly productive phenomenon in German (see Section 8). The key linguistic knowledge sources that we use are morphological analysis and generation of German based on SMOR, a morphological analyzer/generator of German (Schmid et al., 2004), and the BitPar parser, a state-of-the-art parser of German (Schmid, 2004).

2.1 Issues of inflection prediction

In order to ensure coherent German NPs, we model linguistic features of each word in an NP. We model case, gender, and number agreement, and whether or not the word is in the scope of a determiner (such as a definite article), which we label in-weak-context (this linguistic feature is necessary to determine the type of inflection of adjectives and other words: strong, weak, mixed).

This is a diverse group of features. The number of a German noun can often be determined given only the English source word. The gender of a German noun is innate and often difficult to determine given only the English source word. Case is a function of the slot in the subcategorization frame of the verb (or preposition). There is agreement in all of these features in an NP: for instance, the number of an article or adjective is determined by the head noun, while the type of inflection of an adjective is determined by the choice of article.

We can have a large number of surface forms. For instance, English blue can be translated as German blau, blaue, blauer, blaues, blauen. We predict which form is correct given the context; our system can generate forms not seen in the training data. We follow a two-step process: in step 1 we translate to blau (the stem); in step 2 we predict features and generate the inflected form (e.g., case=nominative, gender=masculine, number=singular, in-weak-context=true; inflected: blaue).

2.2 Procedure

We begin building an SMT system by parsing the German training data with BitPar. We then extract morphological features from the parse. Next, we look up the surface forms in the SMOR morphological analyzer, using the morphological features in the parse to disambiguate the set of possible SMOR analyses. Finally, we output the stems of the German text, with the addition of markup taken from the parse (discussed in Section 2.3).

We then build a standard Moses system translating from English to German stems. We obtain a sequence of stems and POS from this system (we use an additional target factor to obtain the coarse POS for each stem, applying a 7-gram POS model; Koehn and Hoang (2007) showed that the use of a POS factor only results in negligible BLEU improvements, but we need access to the POS in our inflection prediction models), and then predict the correct inflection using a sequence model. Finally, we generate surface forms.

2.3 German Stem Markup

The translation process consists of two major steps. The first step is the translation of English words to German stems, which are enriched with some inflectional markup. The second step is the full inflection of these stems (plus markup) to obtain the final sequence of inflected words. The purpose of the additional German inflectional markup is to strongly improve the prediction of inflection in the second step.

In general, all features to be predicted are stripped from the stemmed representation, because they are subject to agreement restrictions within a noun or prepositional phrase (such as the case of nouns, or all features of adjectives). However, we need to keep all morphological features that are not dependent on, and thus not predictable from, the (German) context; they will serve as known input for the inflection prediction model. We now describe this markup in detail.

Nouns are marked with gender and number: we consider the gender of a noun as part of its stem, whereas number is a feature which we can obtain from English nouns.

Personal pronouns have number and gender annotation, and are additionally marked as nominative or not-nominative, because English pronouns are marked for this (except for you).

Prepositions are marked with the case their object takes: this moves some of the difficulty in predicting case from the inflection prediction step to the stem translation step. Since the choice of case in a PP is often determined by the PP's meaning (and there are often different meanings possible given different case choices), it seems reasonable to make this decision during stem translation.

Verbs are represented using their inflected surface form. Having access to inflected verb forms has a positive influence on case prediction in the second step through subject-verb agreement.
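The stem markup described in Section 2.3 can be illustrated with a small sketch. The tag format mimics the SMOR-style analyses shown in the paper's tables, but the helper function and its simplified tag inventory are hypothetical, not the authors' implementation:

```python
def stem_markup(lemma, pos, features=None):
    """Build a stemmed representation: keep features that are not
    predictable from German context, strip those governed by agreement.
    Simplified, hypothetical tag inventory."""
    f = features or {}
    if pos == "NN":    # nouns keep gender (part of the stem) and number
        return f"{lemma}<+NN><{f['gender']}><{f['number']}>"
    if pos == "APPR":  # prepositions keep the case their object takes
        return f"{lemma}<APPR><{f['case']}>"
    if pos == "ADJ":   # adjectives are stripped; inflection is predicted later
        return f"{lemma}<+ADJ><Pos>"
    return lemma       # verbs would keep their inflected surface form

noun = stem_markup("Debatte", "NN", {"gender": "Fem", "number": "Sg"})
prep = stem_markup("zu", "APPR", {"case": "Dat"})
```

The outputs match the representations shown in the examples: nouns carry gender and number, prepositions carry governed case, and adjectives carry no agreement features at all.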
input decoder output inflected merged must be inflected before making a decision about
in<APPR><Dat> in
in im whether to merge a preposition and the article into
die<+ART><Def> dem
contrast Gegensatz<+NN><Masc><Sg> Gegensatz Gegensatz a portmanteau. See Table 1 for examples.
to zu<APPR><Dat> zu
zur
the die<+ART><Def> der
animated lebhaft<+ADJ><Pos> lebhaften lebhaften
debate Debatte<+NN><Fem><Sg> Debatte Debatte 4 Models for Inflection Prediction
Table 1: Re-merging of prepositions and articles after We present 5 procedures for inflectional predic-
inflection to form portmanteaus, in dem means in the. tion using supervised sequence models. The first
two procedures use simple N-gram models over
fully inflected surface forms.
step through subject-verb agreement. 1. Surface with no features is presented with an
Articles are reduced to their stems (the stem itself underspecified input (a sequence of stems), and
makes clear the definite or indefinite distinction, returns the most likely inflected sequence.
but lemmatizing involves removing markings of
2. Surface with case, number, gender is a hybrid
case, gender and number features).
system giving the surface model access to linguis-
Other words are also represented by their stems (except for words not covered by SMOR, where surface forms are used instead).

3 Portmanteaus

Portmanteaus are a word-formation phenomenon dependent on inflection. As we have discussed, standard phrase-based systems have problems with picking a definite article with the correct case, gender and number (typically due to sparsity in the language model; e.g., a noun that was never before seen in the dative case will often not receive the correct article). In German, portmanteaus increase this sparsity further, as they are compounds of prepositions and articles which must agree with a noun.

We adopt the linguistically strict definition of the term portmanteau: the merging of two function words.3 We treat this phenomenon by splitting the component parts during training and remerging during generation. Specifically for German, this requires splitting words which have the German POS tag APPRART into an APPR (preposition) and an ART (article). Merging is restricted: the article must be definite and singular,4 and the preposition can only take accusative or dative case. Some prepositions allow merging with an article only for certain noun genders; for example, the preposition in with dative case is only merged with the following article if the following noun is of masculine or neuter gender. The definite article

tic features. In this system prepositions have additionally been labeled with the case they mark (in both the underspecified input and the fully specified output that the sequence model is built on), and gender and number markup is also available.

The rest of the procedures predict morphological features (which are input to a morphological generator) rather than surface words. We have developed a two-stage process for predicting fully inflected surface forms. The first stage takes a stem and, based on the surrounding context, predicts four morphological features for it: case, gender, number and type of inflection. We experiment with a number of models for doing this. The second stage takes the stems marked with morphological features (predicted in the first stage) and uses a morphological generator to generate the full surface form. For this second stage, a modified version of SMOR (Schmid et al., 2004) is used which, given a stem annotated with morphological features, generates exactly one surface form.

We now introduce our first linguistic feature prediction systems, which we call joint sequence models (JSMs). These are standard language models in which the word tokens are represented not as surface forms but as POS tags and features. In testing, we supply the input as a sequence in underspecified form, where some of the features are specified in the stem markup (for instance, POS=Noun, gender=masculine, number=plural), and then use Viterbi search to find the most probable fully specified form (for instance, POS=Noun, gender=masculine, number=plural,

3 Some examples are: zum (to the) = zu (to) + dem (the) [German], du (from the) = de (from) + le (the) [French] or al (to the) = a (to) + el (the) [Spanish].
4 This is the reason why the preposition + article pairs in Table 2 remain unmerged.
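As a rough illustration of how a JSM is applied at test time, the following sketch runs a Viterbi search over candidate expansions of underspecified tags. The tag inventory, bigram scores, and expansion table are invented for this example (a real JSM is an n-gram language model over the full feature markup, trained with SRILM); only the search procedure corresponds to the description above.

```python
# Hypothetical bigram log-probabilities over fully specified tags; a real JSM
# would be a trained n-gram LM, not a hand-written table.
BIGRAM_LOGPROB = {
    ("<s>", "ART-Nom"): -0.7, ("<s>", "ART-Acc"): -1.2,
    ("ART-Nom", "NN-Nom"): -0.3, ("ART-Nom", "NN-Acc"): -2.5,
    ("ART-Acc", "NN-Acc"): -0.4, ("ART-Acc", "NN-Nom"): -2.0,
}

# Each underspecified token maps to its possible fully specified expansions.
EXPANSIONS = {
    "ART": ["ART-Nom", "ART-Acc"],
    "NN-Acc": ["NN-Acc"],          # case already given in the stem markup
}

def viterbi(underspecified):
    """Return the most probable fully specified tag sequence."""
    # paths: list of (logprob, tag_sequence)
    paths = [(0.0, ["<s>"])]
    for token in underspecified:
        new_paths = []
        for logp, seq in paths:
            for tag in EXPANSIONS[token]:
                step = BIGRAM_LOGPROB.get((seq[-1], tag), -10.0)  # backoff penalty
                new_paths.append((logp + step, seq + [tag]))
        # keep only the best path per final tag (standard Viterbi pruning)
        best = {}
        for logp, seq in new_paths:
            if seq[-1] not in best or logp > best[seq[-1]][0]:
                best[seq[-1]] = (logp, seq)
        paths = list(best.values())
    return max(paths)[1][1:]

print(viterbi(["ART", "NN-Acc"]))  # the accusative article wins given NN-Acc
```

Here the stem markup fixes the noun's case, and the search propagates that choice to the underspecified article.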
decoder output         | prediction input         | prediction output                     | inflected forms | gloss
haben<VAFIN>           | haben-V                  | haben-V                               | haben           | have
Zugang<+NN><Masc><Sg>  | NN-Sg-Masc               | NN-Masc.Acc.Sg.in-weak-context=false  | Zugang          | access
zu<APPR><Dat>          | APPR-zu-Dat              | APPR-zu-Dat                           | zu              | to
die<+ART><Def>         | ART-in-weak-context=true | ART-Neut.Dat.Pl.in-weak-context=true  | den             | the
betreffend<+ADJ><Pos>  | ADJA                     | ADJA-Neut.Dat.Pl.in-weak-context=true | betreffenden    | respective
Land<+NN><Neut><Pl>    | NN-Pl-Neut               | NN-Neut.Dat.Pl.in-weak-context=true   | Ländern         | countries

Table 2: Overview: inflection prediction steps using a single joint sequence model. All words except verbs and prepositions are replaced by their POS tags in the input. Verbs are inflected in the input (haben, meaning have as in they have, in the example). Prepositions are lexicalized (zu in the example) and indicate which case value they mark (Dat, i.e., dative, in the example).
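The splitting and remerging of portmanteaus described in Section 3 can be sketched with a small lookup table. The inventory below is a tiny illustrative sample, not the system's actual resource; the real system identifies candidates via the APPRART POS tag and restricts merging to definite singular articles in accusative or dative case, which the table here encodes only implicitly.

```python
# Hypothetical lookup of a few German portmanteaus and their components.
PORTMANTEAUS = {
    "zum":  ("zu", "dem"),   # zu + dem (dative)
    "zur":  ("zu", "der"),
    "im":   ("in", "dem"),
    "ins":  ("in", "das"),   # accusative
    "beim": ("bei", "dem"),
}
# Inverse table used at generation time for remerging.
MERGE = {parts: port for port, parts in PORTMANTEAUS.items()}

def split_portmanteau(token):
    """Training-time step: split an APPRART token into APPR + ART."""
    return list(PORTMANTEAUS.get(token, (token,)))

def remerge(tokens):
    """Generation-time step: remerge adjacent APPR + ART pairs where allowed."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MERGE:
            out.append(MERGE[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(split_portmanteau("zum"))                        # ['zu', 'dem']
print(remerge(["wir", "gehen", "zu", "dem", "Haus"]))  # portmanteau restored
```

Splitting happens before training and remerging after inflection, so the translation model only ever sees the separated preposition and article.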
Common          | lemma(w_i-5 .. w_i+5), tag(w_i-7 .. w_i+7)
Case            | case(w_i-5 .. w_i+5)
Gender          | gender(w_i-5 .. w_i+5)
Number          | number(w_i-5 .. w_i+5)
in-weak-context | in-weak-context(w_i-5 .. w_i+5)

Table 3: Feature functions used in CRF models (feature functions are binary indicators of the pattern).

5 Experimental Setup

To evaluate our end-to-end system, we perform the well-studied task of news translation, using the Moses SMT package. We use the English/German data released for the 2009 ACL Workshop on Machine Translation shared task on translation.7 There are 82,740 parallel sentences from news-commentary09.de-en and 1,418,115 parallel sentences from europarl-v4.de-en. The monolingual data contains 9.8 M sentences.8

To build the baseline, the data was tokenized using the Moses tokenizer and lowercased. We use GIZA++ to generate alignments, running 5 iterations of Model 1, 5 iterations of the HMM model, and 4 iterations of Model 4. We symmetrize using the grow-diag-final-and heuristic. Our Moses systems use default settings. The LM uses the monolingual data and is trained as a five-gram9 using the SRILM Toolkit (Stolcke, 2002). We run MERT separately for each system. The recaser used is the same for all systems: the standard recaser supplied with Moses, trained on all German training data. The dev set is wmt-2009-a and the test set is wmt-2009-b, and we report end-to-end case-sensitive BLEU scores against the unmodified reference SGML file. The blind test set used is wmt-2009-blind (all lines).

In developing our inflection prediction systems (and making such decisions as the n-gram order used), we worked on the so-called clean data task: predicting the inflection of stemmed reference sentences (rather than MT output). We used the 2000-sentence dev-2006 corpus for this task.

Our contrastive systems consist of two steps. The first is a translation step using a similar Moses system (except that the German side is stemmed, with the markup indicated in Section 2.3), and the second is inflection prediction as described previously in the paper. To derive the stem+markup representation, we first parse the German training data and then produce the stemmed representation. We then build a system for translating from English words to German stems (the stem+markup representation) on the same data (so the German side of the parallel data, and the German language modeling, use the stem+markup representation). Likewise, MERT is performed using references which are in the stem+markup representation.

To train the inflection prediction systems, we use the monolingual data. The basic surface form model is trained on lowercased surface forms; the hybrid surface form model with features is trained on lowercased surface forms annotated with markup. The linguistic feature prediction systems are trained on the monolingual data processed as described previously (see Table 2).

Our JSMs are trained using the SRILM Toolkit. We use the SRILM disambig tool for predicting inflection, which takes a map specifying the set of fully specified representations that each underspecified stem can map to. For surface form models, it specifies the mapping from stems to lowercased surface forms (or surface forms with markup for the hybrid surface model).

6 Results for Inflection Prediction

We build two different kinds of translation system: the baseline and the stem translation system (where MERT is used to train the system to produce a stem+markup sequence which agrees with the stemmed reference of the dev set). In this section we present the end-to-end translation results for the different inflection prediction models defined in Section 4; see Table 4.

If we translate from English into a stemmed German representation and then apply a unigram stem-to-surface-form model to predict the surface form, we achieve a BLEU score of 9.97 (line 2). This is only presented for comparison.

The baseline10 is 14.16 (line 1). We compare this with a 5-gram sequence model11 that predicts

7 http://www.statmt.org/wmt09/translation-task.html
8 However, we reduced the monolingual data (only) by retaining only one copy of each unique line, which resulted in 7.55 M sentences.
9 Add-1 smoothing for unigrams and Kneser-Ney smoothing for higher-order n-grams, pruning defaults.
10 This is a better case-sensitive score than the baselines on wmt-2009-b in experiments by top performers Edinburgh and Karlsruhe at the shared task. We use Moses with default settings.
11 Note that we use a different set, the clean data set, to determine the choice of n-gram order; see Section 7. We use
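The SRILM disambig tool mentioned in Section 5 needs a map file listing, for each underspecified item, the fully specified forms it may take. A minimal sketch of collecting and writing such a map from training data follows; the tag names are invented, and the exact line format (item followed by whitespace-separated alternatives) is an assumption based on the SRILM documentation rather than something stated in the paper.

```python
from collections import defaultdict

def build_map(observed_pairs):
    """Collect, from training data, which fully specified tags each
    underspecified tag was observed with."""
    mapping = defaultdict(set)
    for under, full in observed_pairs:
        mapping[under].add(full)
    return mapping

def write_map(mapping, path):
    """Write one line per underspecified item: the item, then its options."""
    with open(path, "w", encoding="utf-8") as f:
        for under in sorted(mapping):
            f.write(under + " " + " ".join(sorted(mapping[under])) + "\n")

# Illustrative training observations (underspecified, fully specified).
pairs = [("NN-Sg-Masc", "NN-Masc.Nom.Sg"),
         ("NN-Sg-Masc", "NN-Masc.Acc.Sg"),
         ("APPR-zu-Dat", "APPR-zu-Dat")]
m = build_map(pairs)
print(sorted(m["NN-Sg-Masc"]))  # both case values seen in training
```

For the surface form models the same mechanism maps stems to the lowercased surface forms observed for them.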
surface forms without access to morphological features, resulting in a BLEU score of 14.26. Introducing morphological features (case on prepositions, number and gender on nouns) increases the BLEU score to 14.58, which is in the same range as the single JSM system predicting all linguistic features at once.

1 | baseline                                             | 14.16
2 | unigram surface (no features)                        |  9.97
3 | surface (no features)                                | 14.26
4 | surface (with case, number, gender features)         | 14.58
5 | 1 JSM, morphological features                        | 14.53
6 | 4 JSMs, morphological features                       | 14.29
7 | 4 CRFs, morphological features, lexical information  | 14.72

Table 4: BLEU scores (detokenized, case sensitive) on the development test set wmt-2009-b.

This result shows that the mostly unlexicalized single JSM can produce results competitive with direct surface form prediction, despite not having access to a model of inflected forms, which is the desired final output. This strongly suggests that the prediction of morphological features can be used to achieve additional generalization over direct surface form prediction. When comparing the simple direct surface form prediction (line 3) with the hybrid system enriched with number, gender and case (line 4), it becomes evident that feature markup can also aid surface form prediction.

Since the single JSM has no access to lexical information, we used a language model to score different feature predictions: for each sentence of the development set, the 100 best feature predictions were inflected and scored with a language model. We then optimized weights for the two scores LM (language model on surface forms) and FP (feature prediction, the score assigned by the JSM). This method disprefers feature predictions with a top FP score if the inflected sentence obtains a bad LM score, and likewise disfavors low-ranked feature predictions with a high LM score. The prediction of case is the most difficult given no lexical information, so scoring different prediction possibilities on inflected words is helpful. An example is when the case of a noun phrase leads to an inflected phrase which never occurs in the (inflected) language model (e.g., case=genitive vs. case=other). Applying this method to the single JSM leads to a negligible improvement (14.56 vs. 14.53). Using the n-best output of the stem translation system did not lead to any improvement.

The comparison between different feature prediction models is also illustrative. Performance decreases somewhat when using individual joint sequence models (one for each linguistic feature) compared to one single model (14.29, line 6). The framework using the individual CRFs for each linguistic feature performs best (14.72, line 7). The CRF framework combines the advantages of surface form prediction and linguistic feature prediction by using feature functions that effectively cover the feature function spaces used by both forms of prediction. The performance of the CRF models yields a statistically significant improvement12 (p < 0.05) over the baseline. We also tried CRFs with bilingual features (projected from English parses via the alignment output by Moses), but obtained only a small improvement of 0.03, probably because the required information is transferred in our stem markup (a poor improvement beyond monolingual features is also consistent with previous work; see Section 8.3). Details are omitted due to space.

We further validated our results by translating the blind test set from wmt-2009, which we had never looked at in any way. Here we also had a statistically significant difference between the baseline and the CRF-based prediction; the scores were 13.68 and 14.18.

7 Analysis of Inflection-based System

Stem Markup. The first step of translating from English to German stems (with the markup we previously discussed) is substantially easier than translating directly to inflected German (we see BLEU scores on stems+markup that are over 2.0 BLEU higher than the BLEU scores on inflected forms when running MERT). The addition of case to prepositions only lowered the BLEU score reached by MERT by about 0.2, but is very helpful for prediction of the case feature.

Inflection Prediction Task. Clean data task results13 are given in Table 5. The 4 CRFs outperform the 4 JSMs by more than 2%.

11 (cont.) a 5-gram for surface forms and a 4-gram for JSMs, and the same smoothing (Kneser-Ney, add-1 for unigrams, default pruning).
12 We used Kevin Gimpel's implementation of pairwise bootstrap resampling with 1000 samples.
13 26,061 of 55,057 tokens in our test set are ambiguous. We report % surface form matches for ambiguous tokens.
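The n-best rescoring described in Section 6 interpolates the JSM's feature prediction score with a surface language model score. A minimal sketch of that reranking step follows; the weights and log-probabilities are invented for illustration (the actual weights were optimized on the development set).

```python
# Each candidate feature prediction carries an FP score (from the JSM) and an
# LM score computed on its inflected surface form; a weighted sum reranks them.
def rerank(candidates, w_lm=1.0, w_fp=0.5):
    """candidates: list of (sentence, lm_logprob, fp_logprob) tuples.
    Returns the sentence with the best combined score."""
    return max(candidates, key=lambda c: w_lm * c[1] + w_fp * c[2])[0]

nbest = [
    ("des Hauses", -4.0, -1.0),  # top FP score, but bad LM score once inflected
    ("dem Haus",   -2.0, -1.5),  # lower FP rank, fluent inflected form
]
print(rerank(nbest))
```

This is exactly the trade-off described above: a top-ranked feature prediction is demoted when its inflected sentence gets a poor LM score.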
Model                                                | Accuracy
unigram surface (no features)                        | 55.98
surface (no features)                                | 86.65
surface (with case, number, gender features)         | 91.24
1 JSM, morphological features                        | 92.45
4 JSMs, morphological features                       | 92.01
4 CRFs, morphological features, lexical information  | 94.29

Table 5: Comparing predicting surface forms directly with predicting morphological features.

training data      | 1 model | 4 models
7.3 M sentences    | 92.41   | 91.88
1.5 M sentences    | 92.45   | 92.01
100,000 sentences  | 90.20   | 90.64
1,000 sentences    | 83.72   | 86.94

Table 6: Accuracy for different training data sizes of the single and the four separate joint sequence models.

As we mentioned in Section 4, there is a sparsity issue at small training data sizes for the single joint sequence model. This is shown in Table 6. At the largest training data sizes, modeling all 4 features together results in the best predictions of inflection. However, using 4 separate models is worth this minimal decrease in performance, since it facilitates experimentation with the CRF framework, for which the training of a single model is not currently tractable.

Overall, the inflection prediction works well for gender, number and type of inflection, which are features local to the NP that normally agree with the explicit markup output by the stem translation system (for example, the gender of a common noun, which is marked in the stem markup, is usually successfully propagated to the rest of the NP). Prediction of case does not always work well, and could perhaps be improved through hierarchical labeled-syntax stem translation.

Portmanteaus. An example of where the system is improved by the new handling of portmanteaus can be seen in the dative phrase im internationalen Rampenlicht (in the international spotlight), which does not occur in the parallel data. The accusative phrase in das internationale Rampenlicht does occur; however, in this case there is no portmanteau, but a one-to-one mapping between in the and in das. For a given context, only one of accusative or dative case is valid, and a strongly disfluent sentence results from the incorrect choice. In our system, these two cases are handled in the same way (def-article international Rampenlicht). This allows us to generalize from the accusative example with no portmanteau and take advantage of longer phrase pairs, even when translating to something that will be inflected as dative and should be realized as a portmanteau. The baseline does not have this capability. It should be noted that the portmanteau merging method described in Section 3 remerges all occurrences of APPR and ART that can technically form a portmanteau. There are a few cases where merging, despite being grammatical, does not lead to a good result. Such exceptions require semantic interpretation and are difficult to capture with a fixed set of rules.

8 Adding Compounds to the System

Compounds are highly productive in German and lead to data sparsity. We split the German compounds in the training data, so that our stem translation system can now work with the individual words in the compounds. After we have translated to a split/stemmed representation, we determine whether to merge words together to form a compound. Then we merge them to create stems in the same representation as before, and we perform inflection and portmanteau merging exactly as previously discussed.

8.1 Details of Splitting Process

We prepare the training data by splitting compounds in two steps, following the technique of Fritzinger and Fraser (2010). First, possible split points are extracted using SMOR, and second, the best split points are selected using the geometric mean of word part frequencies.

compound       | word parts     | gloss
Inflationsrate | Inflation Rate | inflation rate
auszubrechen   | aus zu brechen | out to break (to break out)

Training data is then stemmed as described in Section 2.3. The formerly modifying words of the compound (in our example, the words to the left of the rightmost word) do not have a stem markup assigned, except in two cases: i) they are nouns themselves or ii) they are particles separated from a verb. In these cases, former modifiers are represented identically to their individually occurring counterparts, which helps generalization.

8.2 Model for Compound Merging

After translation, compound parts have to be resynthesized into compounds before inflection. Two decisions have to be taken: i) where to
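The split-point selection of Section 8.1 can be sketched as follows: among candidate analyses (produced by SMOR in the real system), pick the one whose parts have the highest geometric mean of corpus frequencies. The frequency table here is invented for illustration.

```python
from math import prod

# Hypothetical lowercased corpus frequencies.
FREQ = {"inflationsrate": 5, "inflation": 300, "rate": 400}

def geo_mean(parts):
    """Geometric mean of the parts' corpus frequencies (unknowns count as 1)."""
    return prod(FREQ.get(p, 1) for p in parts) ** (1.0 / len(parts))

def best_split(candidates):
    """candidates: list of part lists, including the unsplit word itself."""
    return max(candidates, key=geo_mean)

print(best_split([["inflationsrate"], ["inflation", "rate"]]))
```

Because the unsplit word is itself a candidate, frequent lexicalized compounds are left intact while rare compounds built from frequent parts get split.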
merge and ii) how to merge. Following the work of Stymne and Cancedda (2011), we implement a linear-chain CRF merging system using the following features: stemmed (separated) surface form, part of speech,14 frequencies from the training corpus for bigrams/merging of word and word+1, word as true prefix, word+1 as true suffix, plus frequency comparisons of these. The CRF is trained on the split monolingual data. It only proposes merging decisions; merging itself uses a list extracted from the monolingual data (Popović et al., 2006).

8.3 Experiments

We evaluated the end-to-end inflection system with the addition of compounds.15 As in the inflection experiments described in Section 5, we use a 5-gram surface LM and a 7-gram POS LM, but for this experiment they are trained on stemmed, split data. The POS LM helps compound parts and heads appear in the correct order. The results are in Table 7.

1 | 1 JSM, morphological features                        | 13.94
2 | 4 CRFs, morphological features, lexical information  | 14.04

Table 7: Results with compounds on the test set.

The BLEU score of the CRF on test is 14.04, which is low. However, the system produces 19 compound types which are in the reference but not in the parallel data, and are therefore not accessible to other systems. We also observe many more compounds in general. The 100-best inflection rescoring technique previously discussed reached 14.07 on the test set. Blind test results with CRF prediction are much better: 14.08, which is a statistically significant improvement over the baseline (13.68) and approaches the result we obtained without compounds (14.18). Correctly generated compounds are single words which usually carry the same information as multiple words in English, and are hence likely underweighted by BLEU. We again see many interesting generalizations. For instance, take the case of translating English miniature cameras to the German compound Miniaturkameras. Neither miniature camera nor miniature cameras occurs in the training data, and so there is no appropriate phrase pair in any system (baseline, inflection, or inflection & compound splitting). However, our system with compound splitting has learned from split composita that English miniature can be translated as German Miniatur- and gets the correct output.

9 Related Work

There has been a large amount of work on translating from a morphologically rich language to English; we omit a literature review here due to space considerations. Our work is in the opposite direction, which primarily involves problems of generation rather than problems of analysis.

The idea of translating to stems and then inflecting is not novel. We adapted the work of Toutanova et al. (2008), which is effective but limited by the conflation of two separate issues: word formation and inflection.

Given a stem such as brother, Toutanova et al.'s system might generate the stem and inflection corresponding to and his brother. Viewing and and his as inflection is problematic, since a mapping from the English phrase and his brother to the Arabic stem for brother is required. The situation is worse if there are English words (e.g., adjectives) separating his and brother. This required mapping is a significant problem for generalization. We view this issue as a different sort of problem entirely, one of word formation (rather than inflection). We apply a split-in-preprocessing and resynthesize-in-postprocessing approach to these phenomena, combined with inflection prediction that is similar to that of Toutanova et al. The only work that we are aware of which deals with both issues is that of de Gispert and Mariño (2008), which deals with verbal morphology and attached pronouns. There has been other work on solving inflection. Koehn and Hoang (2007) introduced factored SMT; we use more complex context features. Fraser (2009) tried to solve the inflection prediction problem by simply building an SMT system for translating from stems to inflected forms. Bojar and Kos (2010) improved on this by marking prepositions with the case they mark (one of the most important markups in our system). Both efforts were ineffective on large data sets. Williams and Koehn (2011) used unification in an SMT system to model some of the

14 Compound modifiers are assigned a special tag based on the POS of their former heads; e.g., Inflation in the example is marked as a non-head of a noun.
15 We found it most effective to merge word parts during MERT (so MERT uses the same stem references as before).
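Feature extraction for the merging CRF of Section 8.2 can be sketched as below. The feature names follow the list given in the text, but the toy corpus, the frequency counting, and the exact feature values are illustrative assumptions; the real system trains a linear-chain CRF (Stymne and Cancedda, 2011) on these features, which is not reproduced here.

```python
from collections import Counter

# Tiny illustrative corpus of split, stemmed sentences.
corpus = ["inflation rate steigt", "die inflationsrate steigt"]
unigrams = Counter(w for s in corpus for w in s.split())
bigrams = Counter(tuple(s.split()[i:i + 2]) for s in corpus
                  for i in range(len(s.split()) - 1))

def merge_features(word, nxt):
    """Evidence for merging `word` with the following token `nxt`."""
    merged = word + nxt
    return {
        "surface": word,
        "bigram_freq": bigrams[(word, nxt)],   # parts seen side by side?
        "merged_freq": unigrams[merged],       # does the compound itself occur?
        "word_is_prefix": any(w.startswith(word) and w != word for w in unigrams),
        "next_is_suffix": any(w.endswith(nxt) and w != nxt for w in unigrams),
    }

feats = merge_features("inflations", "rate")
print(feats["merged_freq"], feats["next_is_suffix"])
```

The contrast between `merged_freq` and `bigram_freq` captures the core signal: parts that occur fused in the corpus are good merge candidates even when the separated bigram is unseen.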
agreement phenomena that we model. Our CRF framework allows us to use more complex context features.

We have directly addressed the question of whether inflection should be predicted using surface forms as the target of the prediction, or whether linguistic features should be predicted, with a subsequent generation step. The direct prediction of surface forms is limited to those forms observed in the training data, which is a significant limitation. However, it is reasonable to expect that the use of features (and morphological generation) could also be problematic, as it requires morphologically aware syntactic parsers to annotate the training data with such features, and additionally depends on the coverage of morphological analysis and generation. Despite this, our research clearly shows that the feature-based approach is superior for English-to-German SMT. This is a striking result considering that the state-of-the-art performance of German parsing is poor compared with the best performance on English parsing. As parsing performance improves, the performance of linguistic-feature-based approaches will increase.

Virpioja et al. (2007), Badr et al. (2008), Luong et al. (2010), Clifton and Sarkar (2011), and others are primarily concerned with using morpheme segmentation in SMT, which is a useful approach for dealing with issues of word formation. However, it does not deal directly with linguistic features marked by inflection. In German these linguistic features are marked very irregularly, and there is widespread syncretism, making it difficult to split off morphemes specifying these features. So it is questionable whether morpheme segmentation techniques are sufficient to solve the inflectional problem we are addressing.

Much previous work looks at the impact of using source-side information (i.e., feature functions on the aligned English), such as that of Avramidis and Koehn (2008), Yeniterzi and Oflazer (2010) and others. Toutanova et al.'s work showed that it is most important to model target-side coherence, and our stem markup also allows us to access source-side information. Using additional source-side information beyond the markup did not produce a gain in performance.

For compound splitting, we follow Fritzinger and Fraser (2010), using linguistic knowledge encoded in a rule-based morphological analyser and then selecting the best analysis based on the geometric mean of word part frequencies. Other approaches use less deep linguistic resources (e.g., POS tags; Stymne (2008)) or are (almost) knowledge-free (e.g., Koehn and Knight (2003)). Compound merging is less well studied. Popović et al. (2006) used a simple, list-based merging approach, merging all consecutive words included in a merging list. This approach resulted in too many compounds. We follow Stymne and Cancedda (2011) for compound merging. We trained a CRF using (nearly all) of the features they used and found their approach to be effective (when combined with inflection and portmanteau merging) on one of our two test sets.

10 Conclusion

We have shown that both the prediction of surface forms and the prediction of linguistic features are of interest for improving SMT. We have obtained the advantages of both in our CRF framework, and also integrated handling of compounds and of an inflection-dependent word-formation phenomenon, portmanteaus. We validated our work on a well-studied large-corpus translation task.

Acknowledgments

The authors wish to thank the anonymous reviewers for their comments. Aoife Cahill was partly supported by Deutsche Forschungsgemeinschaft grant SFB 732. Alexander Fraser, Marion Weller and Fabienne Cap were funded by Deutsche Forschungsgemeinschaft grant Models of Morphosyntax for Statistical Machine Translation. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement Nr. 248005. This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. This publication only reflects the authors' views. We thank Thomas Lavergne and Helmut Schmid.

References

Eleftherios Avramidis and Philipp Koehn. 2008. Enriching Morphologically Poor Languages for Statistical Machine Translation. In Proceedings of ACL-
08: HLT, pages 763-770, Columbus, Ohio, June. Association for Computational Linguistics.

Ibrahim Badr, Rabih Zbib, and James Glass. 2008. Segmentation for English-to-Arabic statistical machine translation. In Proceedings of ACL-08: HLT, Short Papers, pages 153-156, Columbus, Ohio, June. Association for Computational Linguistics.

Ondřej Bojar and Kamil Kos. 2010. 2010 Failures in English-Czech Phrase-Based MT. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 60-66, Uppsala, Sweden, July. Association for Computational Linguistics.

Ann Clifton and Anoop Sarkar. 2011. Combining morpheme-based machine translation with post-processing morpheme prediction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 32-42, Portland, Oregon, USA, June. Association for Computational Linguistics.

Adrià de Gispert and José B. Mariño. 2008. On the impact of morphology in English to Spanish statistical MT. Speech Communication, 50(11-12):1034-1046.

Alexander Fraser. 2009. Experiments in Morphosyntactic Processing for Translating to and from German. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 115-119, Athens, Greece, March. Association for Computational Linguistics.

Fabienne Fritzinger and Alexander Fraser. 2010. How to Avoid Burning Ducks: Combining Linguistic Analysis and Corpus Statistics for German Compound Processing. In Proceedings of the Fifth Workshop on Statistical Machine Translation, pages 224-234. Association for Computational Linguistics.

Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 868-876, Prague, Czech Republic, June. Association for Computational Linguistics.

Philipp Koehn and Kevin Knight. 2003. Empirical methods for compound splitting. In EACL '03: Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, pages 187-193, Morristown, NJ, USA. Association for Computational Linguistics.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning, pages 282-289. Morgan Kaufmann, San Francisco, CA.

Thomas Lavergne, Olivier Cappé, and François Yvon. 2010. Practical very large scale CRFs. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 504-513. Association for Computational Linguistics, July.

Minh-Thang Luong, Preslav Nakov, and Min-Yen Kan. 2010. A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 148-157, Cambridge, MA, October. Association for Computational Linguistics.

Maja Popović, Daniel Stein, and Hermann Ney. 2006. Statistical Machine Translation of German Compound Words. In Proceedings of FinTAL-06, pages 616-624, Turku, Finland. Springer Verlag, LNCS.

Helmut Schmid, Arne Fitschen, and Ulrich Heid. 2004. SMOR: A German Computational Morphology Covering Derivation, Composition, and Inflection. In 4th International Conference on Language Resources and Evaluation.

Helmut Schmid. 2004. Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors. In Proceedings of COLING 2004, pages 162-168, Geneva, Switzerland, Aug 23-Aug 27. COLING.

Andreas Stolcke. 2002. SRILM - An Extensible Language Modeling Toolkit. In International Conference on Spoken Language Processing.

Sara Stymne and Nicola Cancedda. 2011. Productive Generation of Compound Words in Statistical Machine Translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 250-260, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

Sara Stymne. 2008. German Compounds in Factored Statistical Machine Translation. In Proceedings of GoTAL-08, pages 464-475, Gothenburg, Sweden. Springer Verlag, LNCS/LNAI.

Kristina Toutanova, Hisami Suzuki, and Achim Ruopp. 2008. Applying Morphology Generation Models to Machine Translation. In Proceedings of ACL-08: HLT, pages 514-522, Columbus, Ohio, June. Association for Computational Linguistics.

Sami Virpioja, Jaakko J. Väyrynen, Mathias Creutz, and Markus Sadeniemi. 2007. Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner. In Proceedings of MT Summit XI, pages 491-498.

Philip Williams and Philipp Koehn. 2011. Agreement constraints for statistical machine translation into German. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 217-226, Edinburgh, Scotland, July. Association for Computational Linguistics.

Reyyan Yeniterzi and Kemal Oflazer. 2010. Syntax-to-Morphology Mapping in Factored Phrase-Based
Statistical Machine Translation from English to Turkish. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 454-464, Uppsala, Sweden, July. Association for Computational Linguistics.
Identifying Broken Plurals, Irregular Gender,
and Rationality in Arabic Text
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 675-685, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
Figure 1: An example Arabic sentence showing its dependency representation together with the form-based and functional gender and number features and rationality. The dependency tree is in the CATiB treebank representation (Habash and Roth, 2009). The POS tags shown are VRB (verb), NOM (nominal: noun/adjective), and PRT (particle). The relations are SBJ (subject), OBJ (object) and MOD (modifier). The form-based features are only for gender and number.
2 Linguistic Facts

Arabic has a rich and complex morphology. In addition to being both templatic (root/pattern) and concatenative (stems/affixes/clitics), Arabic's optional diacritics add to the degree of word ambiguity. We focus on two problems of Arabic morphology: the discrepancy between morphological form and function, and the complexity of morpho-syntactic agreement rules.

2.1 Form and Function

Arabic nominals (i.e., nouns, proper nouns and adjectives) and verbs inflect for gender: masculine (M) and feminine (F); and for number: singular (S), dual (D) and plural (P). These features are regularly expressed using a set of suffixes1 that uniquely convey gender and number combinations: +∅ (MS), +h (FS), +wn (MP), and +At (FP). For example, the adjective mAhr 'clever' has the following forms among others: mAhr (MS), mAhrh (FS), mAhrwn (MP), and mAhrAt (FP).

For a sizable minority of words, these features are expressed templatically, i.e., through pattern change, coupled with some singular suffix. A typical example of this phenomenon is the class of broken plurals, which accounts for over half of all plurals (Alkuhlani and Habash, 2011). In such cases, the form of the morphology (singular suffix) is inconsistent with the word's functional number (plural). For example, the word kAtb (MS) 'writer' has the broken plural ktAb (MS/MP).2 See the second word in the example in Figure 1, which is the word ktAb 'writers' prefixed with the definite article Al+. In addition to broken plurals, Arabic has words with irregular gender, e.g., the feminine singular adjective HmrA 'red' (MS/FS), and the nouns xlyfh 'caliph' and HAml 'pregnant', whose form-based and functional genders likewise disagree. Verbs and nominal duals do not display this discrepancy.

2.2 Morpho-syntactic Agreement

Arabic gender and number features participate in morpho-syntactic agreement within specific con-

1 Arabic transliteration is presented in the Habash-Soudi-Buckwalter (HSB) scheme (Habash et al., 2007): (in alphabetical order) AbtjHxdrzsSDTDfqklmnhwy, plus additional symbols.
2 This nomenclature denotes (Form/Function).
structions such as nouns with their adjectives Altantawy et al., 2010; Alkuhlani and Habash,
and verbs with their subjects. Arabic agreement 2011).
rules are more complex than the simple match- In terms of resources, Smr (2007b)s work
ing rules found in languages such as Spanish contrasting illusory (form) features and functional
(Holes, 2004; Habash, 2010). For instance, Ara- features inspired our distinction of morphologi-
bic adjectives agree with the nouns they mod- cal form and function. However, unlike him, we
ify in gender and number except for plural ir- do not distinguish between sub-functional (logi-
rational (non-human) nouns, which always take cal and formal) features. His ElixirFM analyzer
feminine singular adjectives. Rationality (hu- (Smr, 2007a) extends BAMA by including func-
manness A Q
/ A) is a morpho-lexical tional number and some functional gender infor-
feature that is narrower than animacy. English mation, but not rationality. This analyzer was
expresses it mainly in pronouns (he/she vs. it) used as part of the annotation of the Prague Ara-
and relativizers (men who... vs. cars/cows bic Dependency Treebank (PADT) (Smr and Ha-
which...). We follow the convention by Alkuh- jic, 2006). More recently, Alkuhlani and Habash
lani and Habash (2011) who specify rationality (2011) built on the work of Smr (2007b) and ex-
as part of the functional features of the word. tended beyond it to fully annotate functional gen-
The values of this feature are: rational (R), irra- der, number and rationality in the PATB part 3.
tional (I), and not-specified (N ). N is assigned to We use their resource to train and evaluate our
verbs, adjectives, numbers and quantifiers.3 For system.
example, in Figure 1, the plural rational noun In terms of techniques, Goweder et al. (2004)
H. AJ@ AlktAb ( MMPSR ) writers takes the plural investigated several approaches using root and
adjective JK
Ym '@ AlHdywn ( M P ) modern; pattern morphology for identifying broken plu-
MP N
while the plural irrational word A qSSA sto- rals in undiacritized Arabic text. Their effort re-
ries ( FMPSI ) takes the feminine singular adjective sulted in an improved stemming system for Ara-
YK Yg jdydh ( F S ). bic information retrieval that collapses singulars
. F SN and plurals. They report results on identifying
3 Related Work broken plurals out of context. Similar to them,
we undertake the task of identifying broken plu-
Much work has been done on Arabic morpholog- rals; however, we also target the templatic gen-
ical analysis, morphological disambiguation and der and rationality features, and we do this in-
part-of-speech (POS) tagging (Al-Sughaiyer and context. Elghamry et al. (2008) presented an auto-
Al-Kharashi, 2004; Soudi et al., 2007; Habash, matic cue-based algorithm that uses bilingual and
2010). The bulk of this work does not address monolingual cues to build a web-extracted lexi-
form-function discrepancy or morpho-syntactic con enriched with gender, number and rationality
agreement issues. This includes the most com- features. Their automatic technique achieves an
monly used resources and tools for Arabic NLP: F-score of 89.7% against a gold standard set. Un-
the Buckwalter Arabic Morphological Analyzer like them, we use a manually annotated corpus to
(BAMA) (Buckwalter, 2004) which is used in the train and test the prediction of gender, number and
Penn Arabic Tree Bank (PATB) (Maamouri et al., rationality features.
2004), and the various POS tagging and morpho- Our approach to identifying these features ex-
logical disambiguation tools trained using them plores a large set of orthographic, morphological
(Diab et al., 2004; Habash and Rambow, 2005). and syntactic learning features. This is very much
There are some important exceptions (Goweder et following several previous efforts in Arabic NLP
al., 2004; Habash, 2004; Smr, 2007b; Elghamry in which different tagsets and morphological fea-
et al., 2008; Abbs et al., 2004; Attia, 2008; tures have been studied for a variety of purposes,
3
We previously defined the rationality value N as not- e.g., base phrase chunking (Diab, 2007) and de-
applicable when we only considered nominals (Alkuhlani pendency parsing (Marton et al., 2010). In this
and Habash, 2011). In this work, we rename the rationality paper we use the parser of Marton et al. (2010)
value N as not-specified without changing its meaning. We
use the value N a (not-applicable) for parts-of-speech that
as our source of syntactic learning features. We
do not have a meaningful value for any feature, e.g., prepo- follow their splits for training, development and
sitions have gender, number and rationality values of N a. testing.
677
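The sound (suffixal) gender-number rules of Section 2.1, and the form/function mismatch exhibited by broken plurals, can be illustrated with a short sketch. This is hypothetical code, not the authors' system: the suffix rules are deliberately simplified, and the tiny `FUNCTIONAL` lexicon stands in for the treebank annotation that a real system would use.

```python
# Sketch of the sound suffix rules of Section 2.1 (HSB transliteration).
# Checked in order; longest/most specific suffixes first.
SUFFIX_RULES = [
    ("At", ("F", "P")),  # +At -> feminine plural
    ("wn", ("M", "P")),  # +wn -> masculine plural
    ("h",  ("F", "S")),  # +h (ta marbuta) -> feminine singular
]

def form_features(word):
    """Form-based gender/number read off the suffix; bare stems default to (M, S)."""
    for suffix, feats in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return feats
    return ("M", "S")

# Functional features cannot be read off the surface form; they must come from
# an annotated resource. Illustrative entry: the broken plural ktAb 'writers'
# looks singular (form M S) but functions as a plural (M P).
FUNCTIONAL = {"ktAb": ("M", "P")}

def functional_features(word):
    return FUNCTIONAL.get(word, form_features(word))
```

Here `form_features("ktAb")` yields `("M", "S")` while `functional_features("ktAb")` yields `("M", "P")`, mirroring the broken-plural discrepancy the paper targets.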
4 Problem Definition

Our goal is to predict the functional gender, number and rationality features for all words.

4.1 Corpus and Experimental Settings

5 Approach

Our approach involves using two techniques: MLE with back-off and Yamcha. For each technique, we explore the effects of different learning features and try to come up with the best technique and feature set for each target feature.
For all of these features, we train on gold values, but only experiment with predicted values in the development and test sets. For predicting morphological features, we use the MADA system (Habash and Rambow, 2005). The MADA system corrects for suboptimal orthographic choices and effectively produces a consistent and unnormalized orthography. For the syntactic features, we use Marton et al. (2010)'s system.

5.2 Techniques

We describe below the two techniques we explored.

MLE with Back-off  We implemented an MLE system with multiple back-off modes using our set of linguistic features. The order of the back-off is from specific to general. We start with an MLE system that uses only the word form, and backs off to the most common feature value across all words (excluding unknown and Na values). This simple MLE system is used as a baseline.

As we add more features to the MLE system, it tries to match all these features to predict the value for a given word. If such a combination of features is not seen in the training set, the system backs off to a more general combination of features. For example, if an MLE system is using the features W2+LMM+BW, the system tries to match this combination. If it is not seen in training, the system backs off to the following set: LMM+BW, and tries to return the most common value for this POS tag and lemma combination. If again it fails to find a match, it backs off to BW, and returns the most common value for that particular POS tag. If no word is seen with this POS tag, the system returns the most common value across all words.

Yamcha Sequence Tagger  We use Yamcha (Kudo and Matsumoto, 2003), a support-vector-machine-based sequence tagger. We perform different experiments with the different sets of features presented above. After that, we apply a consistency filter that ensures that every word-lemma-POS combination always gets the same value for gender, number and rationality features. Yamcha in its default settings tags words using a window of two words before and two words after the word being tagged. This gives Yamcha an advantage over the MLE system, which tags each word independently.

Single vs Joint Classification  In this paper, we only discuss systems trained for a single classifier (for gender, for number and for rationality). In experiments we have done, we found that training single classifiers and combining their outcomes almost always outperforms a single joint classifier for the three target features. In other words, combining the results of G and N (G+N) outperforms the results of the single classifier GN. The same is also true for G+N+R, which outperforms GNR and GN+R. Therefore, we only present the results for the single classifiers G, N, R and their combination G+N+R.

6 Results

We perform a series of experiments increasing in feature complexity. We greedily select which features to pass on to the next level of experiments. In cases of ties, we pass the top two performers to the next step. We discuss each of these experiments next for both the MLE and Yamcha techniques. Statistical significance is measured using the McNemar test of statistical significance (McNemar, 1947).

6.1 Experiment Set I: Orthographic Features

The first set of experiments uses the orthographic features. See Table 1. The MLE system with the word only feature (W1) is effectively our baseline. It does surprisingly well for seen cases. In fact it is the highest performer across all experiments in this paper for seen cases. For unseen cases, it produces a miserable and expected low score of 21.0% accuracy. The addition of the n-gram features (W2) improves statistically significantly over W1 for unseen cases, but it is indistinguishable for seen cases. The Yamcha system shows the same difference in results between W1 and W2.

Across the two sets of features, the MLE system consistently outperforms Yamcha in the case of seen words, while Yamcha does better for unseen words. This can be explained by the fact that the MLE system matches only on the word form and if the word is unseen, it backs off to the most common value across all words. Yamcha, on the other hand, uses some limited context information that allows it to generalize for unseen words.

Among the target features, number is the easiest to predict, while rationality is the hardest.
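The back-off cascade of Section 5.2 (e.g., W2+LMM+BW, then LMM+BW, then BW, then the global majority value) can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the class and feature names are hypothetical.

```python
from collections import Counter, defaultdict

class BackoffMLE:
    """MLE predictor with back-off from specific to general feature keys
    (a sketch of the technique in Section 5.2, under assumed data shapes)."""

    def __init__(self, backoff_chain):
        # backoff_chain: feature-name tuples, most specific first,
        # e.g. [("word", "pos"), ("pos",)]
        self.chain = backoff_chain
        self.tables = [defaultdict(Counter) for _ in backoff_chain]
        self.global_counts = Counter()

    def train(self, items):
        # items: iterable of (feature_dict, target_value) pairs
        for feats, value in items:
            for table, names in zip(self.tables, self.chain):
                table[tuple(feats[n] for n in names)][value] += 1
            self.global_counts[value] += 1

    def predict(self, feats):
        # Try each back-off level in turn; fall back to the global majority.
        for table, names in zip(self.tables, self.chain):
            key = tuple(feats[n] for n in names)
            if key in table:
                return table[key].most_common(1)[0][0]
        return self.global_counts.most_common(1)[0][0]

# Toy usage with invented training items:
model = BackoffMLE([("word", "pos"), ("pos",)])
model.train([
    ({"word": "ktAb",   "pos": "NOM"}, ("M", "P")),
    ({"word": "ktAbAt", "pos": "NOM"}, ("M", "P")),
    ({"word": "mAhrh",  "pos": "ADJ"}, ("F", "S")),
])
```

An unseen word with a seen POS tag falls back to the POS-level table; a word with an unseen POS tag falls back to the corpus-wide majority value.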
MLE Yamcha
G N R G+N+R G N R G+N+R
Features seen unseen seen unseen seen unseen seen unseen seen unseen seen unseen seen unseen seen unseen
W1 99.2 61.6 99.3 69.2 97.4 44.7 97.0 21.0 95.9 67.8 96.7 72.0 94.5 67.4 90.2 35.2
W2 99.2 81.7 99.3 81.6 97.4 63.4 97.0 49.1 97.1 86.6 97.7 87.1 95.6 82.0 92.8 65.5
Table 1: Experiment Set I: Baselines and simple orthographic features. W1 is the word only. W2 is the word
with additional 1-gram and 2-gram prefix and suffix features. All numbers are accuracy percentages.
MLE Yamcha
G N R G+N+R G N R G+N+R
Features seen unseen seen unseen seen unseen seen unseen seen unseen seen unseen seen unseen seen unseen
W2+F 99.2 86.9 99.3 88.9 97.4 63.4 96.9 51.9 97.7 89.8 98.1 91.7 96.0 83.5 93.8 72.0
W2+Lemma 97.4 68.3 97.6 71.5 95.6 70.3 95.2 33.8 97.4 86.8 97.7 86.4 96.1 82.2 93.3 65.4
W2+LMM 99.1 68.8 99.3 71.7 97.2 67.6 96.8 33.2 97.5 86.7 97.9 86.6 96.1 82.6 93.5 65.7
W2+CATIB 99.1 85.0 99.3 83.8 97.4 70.0 97.1 56.2 97.5 87.9 98.0 88.6 96.0 83.5 93.6 69.7
W2+CATIB-EX 99.1 85.7 99.3 84.3 97.4 70.4 97.1 56.7 97.5 88.0 97.9 88.1 96.0 83.6 93.6 69.9
W2+Kulick 99.0 86.7 99.1 85.6 97.1 78.7 96.7 65.5 97.3 88.8 97.9 89.4 95.8 83.5 93.3 70.9
W2+BW- 99.0 88.8 99.0 88.8 97.0 80.7 96.6 68.5 97.5 89.7 98.0 91.2 96.0 85.2 93.7 73.2
W2+BW 98.6 87.9 98.5 88.8 96.8 80.3 95.9 67.8 97.5 89.5 97.9 89.5 96.1 85.7 93.7 72.8
Table 2: Experiment Set II.a: Morphological features: (i) form-based gender and number, (ii) lemma and LMM
(undiacritized lemma) and (iii) a variety of POS tag sets. For each subset, the best performers are bolded.
6.2 Experiment Set II: Morphological Features

Individual Morphological Features  In this set of experiments, we use our best system from the previous set, W2, and add individual morphological features to it. We organize these features in three sub-groups: (i) form-based features (F), (ii) lemma and LMM, and (iii) the five POS tag sets. See Table 2.

The F, Lemma and LMM improve over the baseline in terms of unseen words for both MLE and Yamcha techniques. However, for seen words, these systems do worse than or equal to the baseline when the MLE technique is used. The MLE system in these cases tries to match the word and its morphological features as a single unit and if such a combination is not seen, it backs off to the morphological feature, which is more general. Since we are using predicted data, prediction errors could be the reason behind this decrease in accuracy for seen words. Among these systems, W2+F is the best for both Yamcha and MLE except for rationality, which is expected since there are no form-based features for rationality. In this set of experiments, Yamcha consistently outperforms MLE when it comes to unseen words, but for seen words, MLE does better almost always.

LMM overall does better than Lemma. This is reasonable given that LMM is easier to predict, although LMM is more ambiguous.

As for the POS tag sets, looking at the MLE results, CATIB-EX is the best performer for seen words, and BW- is the best for unseen. CATIB-6 is a general POS tag set and since the MLE technique is very strict in its matching process (an exact match or no match), using a general key to match on adds a lot of ambiguity. With Yamcha, BW and BW- are the best among all POS. Yamcha is still doing consistently better in terms of unseen words. The best two systems from both Yamcha and MLE are used as the basic systems for the next subset of experiments where we combine the morphological features.

Combined Morphological Features  Until this point, all experiments using the two techniques are similar. In this subset, MLE explores the effect of using CATIB-EX and BW- with other morphological features, and Yamcha explores the effect of using BW- and BW with other morphological features. See Table 3. Again, Yamcha is still doing consistently better in terms of unseen words, but when it comes to seen words, MLE performs better. For seen words, our best results come from MLE using CATIB-EX and LMM. For unseen words, our best results come from Yamcha with the BW- tag and the form-based features for both gender and number. For rationality, the best features to use with Yamcha are BW, LMM and form-based features. The lemma seems to actually hurt when predicting gender and number. This can be explained by the fact that gender and number features are often properties of the word form and not of the lemma. This is different for rationality, which is a property of the lemma and therefore, we expect the lemma to help.

The fact that the predicted BW set helps is not consistent with previous work by Marton et al. (2010). In that effort, BW helps parsing only in the gold condition. BW prediction accuracy is low because it includes case endings. We postulate that perhaps in our task, which is far more limited than general parsing, errors in case prediction may not matter too much. The more complex tag set may actually help establish good local agreement sequences (even if incorrect case-wise), which is relevant to the target features.

MLE systems (each pair of numbers: seen unseen accuracy %):
Features        G           N           R           G+N+R
W2 +CATIB-EX    99.1 85.7   99.3 84.3   97.4 70.4   97.0 56.7
   +F           98.7 88.6   99.1 89.4   94.9 70.4   94.3 59.7
   +LMM         99.1 78.9   99.3 80.4   97.3 69.6   96.9 44.7
   +LMM+F       98.7 89.9   99.0 89.7   94.8 69.6   94.2 58.1
W2 +BW-         99.0 88.8   99.0 88.8   97.0 80.7   96.6 68.5
   +F           99.0 88.8   99.1 89.9   97.0 80.7   96.6 69.6
   +LMM         98.9 90.0   99.0 88.0   97.0 83.6   96.6 69.8
   +LMM+F       98.9 90.0   99.0 89.1   97.0 83.6   96.6 70.8

Yamcha systems (each pair of numbers: seen unseen accuracy %):
Features        G           N           R           G+N+R
W2 +BW          97.5 89.5   97.9 89.5   96.1 85.7   93.7 72.8
   +F           97.8 90.6   98.2 92.4   96.3 85.3   94.2 75.4
   +LMM         97.6 88.9   98.1 88.9   96.5 85.7   94.1 72.3
   +LMM+F       98.1 90.4   98.4 92.5   96.7 85.8   94.8 75.9
W2 +BW-         97.5 89.7   98.0 91.2   96.0 85.2   93.7 73.2
   +F           97.7 90.7   98.2 92.5   96.1 85.6   94.0 75.3
   +LMM         97.7 89.6   98.1 90.4   96.2 85.1   94.0 72.5
   +LMM+F       98.0 90.3   98.2 92.4   96.5 85.7   94.5 75.1

Yamcha syntactic systems (each pair of numbers: seen unseen accuracy %):
Features                G           N           R           G+N+R
W2 +BW  +F+SYN          97.3 90.6   97.8 92.5   96.1 86.1   93.5 76.0
W2 +BW  +LMM+SYN        97.4 89.1   97.5 88.3   96.2 86.0   93.4 71.7
W2 +BW  +LMM+F+SYN      97.5 90.8   98.0 92.5   96.4 86.2   93.8 76.2
W2 +BW- +F+SYN          97.4 90.7   97.9 92.7   96.1 85.2   93.5 75.0
W2 +BW- +LMM+SYN        97.4 89.5   97.7 89.8   96.1 85.7   93.4 72.1
W2 +BW- +LMM+F+SYN      97.4 90.8   97.9 92.7   96.2 85.3   93.6 75.2

6.3 Experiment Set III: Syntactic Features

This set of experiments adds syntactic features to the experiments in set II. We add syntax to the systems that use Yamcha only since it is not obvious how to add syntactic information to the MLE system. Syntax improves the prediction accuracy for unseen words but not for seen words. In Yamcha, we can argue that the +/-2 word window allows some form of shallow syntax modeling, which is why Yamcha is doing better from the start. But the longer distance features are helping even more, perhaps because they capture agreement relations. The overall best system for unseen words is W2+BW+LMM+F+SYN, except for number, where W2+BW-+F+SYN is slightly better. In terms of G+N+R scores, W2+BW+LMM+F+SYN is statistically significantly better than all other systems in this set for seen and unseen words, except for unseen words with W2+BW+F+SYN. W2+BW+LMM+F+SYN is also statistically significantly better than its non-syntactic variant for both seen and unseen words. The prediction accuracy for seen words is still not as good as the MLE systems.

6.4 System Combination

The simple MLE W1 system, which happens to be the baseline, is the best predictor for seen words, and the more advanced Yamcha system using syntactic features is the best predictor for unseen words. Next, we create a new system that takes advantage of the two systems. We use the simple MLE W1 system for seen words, and Yamcha with syntax for unseen words. For unseen
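Statistical significance throughout Section 6 is assessed with the McNemar test (McNemar, 1947) over paired per-word correctness. A minimal sketch of the continuity-corrected chi-square form of the test (the function name and exact variant are assumptions; the paper does not specify which variant was used):

```python
import math

def mcnemar_test(gold, sys_a, sys_b):
    """McNemar test on paired predictions (sketch).
    b = words system A gets right and B gets wrong; c = the reverse.
    Returns the continuity-corrected chi-square statistic and its p-value."""
    b = sum(1 for g, x, y in zip(gold, sys_a, sys_b) if x == g and y != g)
    c = sum(1 for g, x, y in zip(gold, sys_a, sys_b) if x != g and y == g)
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1.0) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: p = erfc(sqrt(stat / 2)).
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p
```

Only the discordant pairs (b and c) matter: words both systems get right or both get wrong carry no information about which system is better.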
words, since each target feature has its own set of best learning features, we also build a combination system that uses the best systems for gender, number and rationality and combine their output into a single system for unseen words. For gender and rationality, we use W2+BW+LMM+F+SYN, and for number, we use W2+BW-+F+SYN. As expected, the combination system outperforms the basic systems. For comparison: the MLE W1 system gets (all, seen, unseen) scores of (89.3, 97.0, 21.0) for G+N+R, while the best single Yamcha syntactic system gets (92.0, 93.8, 76.2); the combination, on the other hand, gets (94.9, 97.0, 76.2). The overall (all) improvement over the MLE baseline or the best Yamcha system translates into 52% or 36% error reduction, respectively.

6.5 Error Analysis

We conducted an analysis of the errors in the output of the combination system as well as the two systems that contributed to it.

In the combination system, out of the total error in G+N+R (5.1%), 53% of the cases are for seen words (3.0% of all seen) and 47% for unseen words (23.8% of all unseen). Overall, rationality errors are the biggest contributor to G+N+R error at 73% relative, followed by gender (33% relative) and number (26% relative). Among error cases of seen words, rationality errors soar to 87% relative, almost four times the corresponding gender and number errors (27% and 22%, respectively). However, among error cases of unseen words, rationality errors are 57% relative, while the corresponding gender and number errors are 39% and 31%, respectively. As expected, rationality is much harder to tag than gender and number due to its higher word-form ambiguity and dependence on context.

We classified the type of errors in the MLE system for seen words, which we use in the combination system. We found that 86% of the G+N+R errors involve an ambiguity in the training data where the correct answer was present but not chosen. This is an expected limitation of the MLE approach. In the rest of the cases, the correct answer was not actually present in the training data. The proportion of ambiguity errors is almost identical for gender, number and rationality. However, rationality overall is the biggest cause of error, simply due to its higher degree of ambiguity.

Since the Yamcha system uses MADA features, we investigated the effect of the correctness of MADA features on the system prediction accuracy. The overall MADA accuracy in identifying the lemma and the Buckwalter tag together (a very harsh measure) is 77.0% (79.3% for seen and 56.8% for unseen). Our error analysis shows that when MADA is correct, the prediction accuracy for G+N+R is 95.6%, 96.5% and 84.4% for all, seen and unseen, respectively. However, this accuracy goes down to 79.2%, 82.5% and 65.5% for all, seen and unseen, respectively, when MADA is wrong. This suggests that the Yamcha system suffers when MADA makes wrong choices and improving MADA would lead to improvement in the system's performance.

                            All    seen   unseen
MLE W1                      88.5   96.8   21.2
Yamcha BW+LMM+F             91.4   94.1   70.4
Yamcha BW+LMM+F+SYN         91.0   93.3   72.2
Combination                 94.1   96.8   72.4

Table 5: Results on blind test. Scores for All/Seen/Unseen are shown for the G+N+R condition. We compare the MLE word baseline with the best Yamcha system (with and without syntactic features) and the combined system.

6.6 Blind Test

Finally, we apply our baseline, best combination model and best single Yamcha syntactic model (with and without syntax) to the blind test set. The results are in Table 5. The results in the blind test are consistent with the development set. The MLE baseline is best on seen words, Yamcha is best on unseen words, syntactic features help in handling unseen words, and overall combination improves over all specific systems.

6.7 Additional Training Data

After experimenting on a quarter of the train set to optimize for various settings, we train our combination system on the full train set and achieve (96.0, 96.8, 74.9) for G+N+R (all, seen, unseen) on the development set and (96.5, 96.8, 65.6) on the blind test set. As expected, the overall (all) scores are higher simply due to the additional training data. The results on seen and unseen words, which are redefined against the larger training set, are not higher than results for the quarter training data. Of course, these numbers
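The 52% and 36% error reductions cited in Section 6.4 are relative error reductions computed from the "all" accuracies, which can be checked directly:

```python
def error_reduction(baseline_acc, system_acc):
    """Relative error reduction (in %) between two accuracies given in %."""
    base_err, sys_err = 100.0 - baseline_acc, 100.0 - system_acc
    return 100.0 * (base_err - sys_err) / base_err

# Section 6.4 'all' G+N+R accuracies: combination 94.9 vs.
# MLE W1 baseline 89.3 and best syntactic Yamcha system 92.0.
print(round(error_reduction(89.3, 94.9)))  # 52
print(round(error_reduction(92.0, 94.9)))  # 36
```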
should not be compared directly. The number of unseen word tokens in the full train set is 3.7% compared to 10.2% in the quarter train set.

6.8 Comparison with MADA

We compare our results with the form-based features from the state-of-the-art morphological analyzer MADA (Habash and Rambow, 2005). We use the form-based gender and number features produced by MADA after we filter MADA choices by tokenization. Since MADA does not give a rationality value, we assign the value I (irrational) to nouns and proper nouns and the value N (not-specified) to verbs and adjectives. Everything else receives Na (not-applicable). The POS tags are determined by MADA.

On the development set, MADA achieves (72.6, 73.1, 58.6) for G+N+R (all, seen, unseen), where the seen/unseen distinction is based on the full training set in the previous section and is provided for comparison reasons only. The results for the test set are (71.4, 72.2, 53.7). These results are consistent with our expectation that MADA will do badly on this task since it is not designed for it (Alkuhlani and Habash, 2011). We should remind the reader that MADA-derived features are used as machine learning features in this paper, where they actually help. In the future, we plan to integrate this task inside of MADA.

6.9 Extrinsic Evaluation

We use the predicted gender, number and rationality features that we get from training on the full train set in a dependency syntactic parsing experiment. The parsing feature set we use is the best performing feature set described in Marton et al. (2011), which used an earlier unpublished version of our MLE model. The parser we use is the Easy-First Parser (Goldberg and Elhadad, 2010). More details on this parsing experiment are in Marton et al. (2012).

The functional gender and number features increase the labeled attachment score by 0.4% absolute over a comparable model that uses the form-based gender and number features. Rationality, on the other hand, does not help much. One possible reason for this is the lower quality of the predicted rationality feature compared to the other features. Another possible reason is that the rationality feature is not utilized optimally in the parser.

7 Conclusions and Future Work

We presented a series of experiments for automatic prediction of the latent features of functional gender and number, and rationality in Arabic. We compared two techniques, a simple MLE with back-off and an SVM-based sequence tagger, Yamcha, using a number of orthographic, morphological and syntactic features. Our conclusions are that for words seen in training, the MLE model does best; for unseen words, Yamcha does best; and most interestingly, we found that syntactic features help the prediction for unseen words.

In the future, we plan to explore training on predicted features instead of gold features to minimize the effect of tagger errors. Furthermore, we plan to use our tools to collect vocabulary not covered by commonly used morphological analyzers and try to assign them correct functional features. Finally, we would like to use our predictions for gender, number and rationality as learning features for relevant NLP applications such as sentiment analysis, phrase-based chunking and named entity recognition.

Acknowledgments

We would like to thank Yuval Marton for help with the parsing experiments. The first author was funded by a scholarship from the Saudi Arabian Ministry of Higher Education. The rest of the work was funded under DARPA projects number HR0011-08-C-0004 and HR0011-08-C-0110.

References

Ramzi Abbès, Joseph Dichy, and Mohamed Hassoun. 2004. The Architecture of a Standard Arabic Lexical Database: Some Figures, Ratios and Categories from the DIINAR.1 Source Program. In Ali Farghaly and Karine Megerdoomian, editors, COLING 2004 Computational Approaches to Arabic Script-based Languages, pages 15–22, Geneva, Switzerland, August. COLING.

Imad Al-Sughaiyer and Ibrahim Al-Kharashi. 2004. Arabic Morphological Analysis Techniques: A Comprehensive Survey. Journal of the American Society for Information Science and Technology, 55(3):189–213.

Sarah Alkuhlani and Nizar Habash. 2011. A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL'11), Portland, Oregon, USA.

Mohamed Altantawy, Nizar Habash, Owen Rambow, and Ibrahim Saleh. 2010. Morphological Analysis and Generation of Arabic Nouns: A Morphemic Functional Approach. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), Valletta, Malta.

Mohammed Attia. 2008. Handling Arabic Morphological and Syntactic Ambiguity within the LFG Framework with a View to Machine Translation. Ph.D. thesis, The University of Manchester, Manchester, UK.

Tim Buckwalter. 2004. Buckwalter Arabic Morphological Analyzer Version 2.0. LDC catalog number LDC2004L02, ISBN 1-58563-324-0.

Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks. In Proceedings of the 5th Meeting of the North American Chapter of the Association for Computational Linguistics/Human Language Technologies Conference (HLT-NAACL'04), pages 149–152, Boston, MA.

Mona Diab. 2007. Towards an Optimal POS Tag Set for Modern Standard Arabic Processing. In Proceedings of Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria.

Khaled Elghamry, Rania Al-Sabbagh, and Nagwa El-Zeiny. 2008. Cue-based Bootstrapping of Arabic Semantic Features. In JADT 2008: 9es Journées internationales d'Analyse statistique des Données Textuelles.

Yoav Goldberg and Michael Elhadad. 2010. An Efficient Algorithm for Easy-First Non-Directional Dependency Parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 742–750, Los Angeles, California, June. Association for Computational Linguistics.

Abduelbaset Goweder, Massimo Poesio, Anne De Roeck, and Jeff Reynolds. 2004. Identifying Broken Plurals in Unvowelised Arabic Text. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 246–253, Barcelona, Spain, July.

Nizar Habash and Owen Rambow. 2005. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 573–580, Ann Arbor, Michigan.

Nizar Habash and Ryan Roth. 2009. CATiB: The Columbia Arabic Treebank. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 221–224, Suntec, Singapore.

Nizar Habash, Abdelhadi Soudi, and Tim Buckwalter. 2007. On Arabic Transliteration. In A. van den Bosch and A. Soudi, editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods. Springer.

Nizar Habash, Reem Faraj, and Ryan Roth. 2009. Syntactic Annotation in the Columbia Arabic Treebank. In Proceedings of the MEDAR International Conference on Arabic Language Resources and Tools, Cairo, Egypt.

Nizar Habash. 2004. Large Scale Lexeme Based Arabic Morphological Generation. In Proceedings of Traitement Automatique des Langues Naturelles (TALN-04), pages 271–276, Fez, Morocco.

Nizar Habash. 2010. Introduction to Arabic Natural Language Processing. Morgan & Claypool Publishers.

Clive Holes. 2004. Modern Arabic: Structures, Functions, and Varieties. Georgetown Classics in Arabic Language and Linguistics. Georgetown University Press.

Taku Kudo and Yuji Matsumoto. 2003. Fast Methods for Kernel-Based Text Analysis. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL'03), pages 24–31, Sapporo, Japan, July.

Seth Kulick, Ryan Gabbard, and Mitch Marcus. 2006. Parsing the Arabic Treebank: Analysis and Improvements. In Proceedings of the Treebanks and Linguistic Theories Conference, pages 31–42, Prague, Czech Republic.

Mohamed Maamouri, Ann Bies, Tim Buckwalter, and Wigdan Mekki. 2004. The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus. In NEMLAR Conference on Arabic Language Resources and Tools, pages 102–109, Cairo, Egypt.

Yuval Marton, Nizar Habash, and Owen Rambow. 2010. Improving Arabic Dependency Parsing with Lexical and Inflectional Morphological Features. In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 13–21, Los Angeles, CA, USA, June.

Yuval Marton, Nizar Habash, and Owen Rambow. 2011. Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL'11), Portland, Oregon, USA.

Yuval Marton, Nizar Habash, and Owen Rambow. 2012. Dependency Parsing of Modern Standard Arabic with Lexical and Inflectional Features. Manuscript submitted for publication.

Quinn McNemar. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157.

Otakar Smrž and Jan Hajič. 2006. The Other Arabic Treebank: Prague Dependencies and Functions. In Ali Farghaly, editor, Arabic Computational Linguistics: Current Implementations. CSLI Publications.

Otakar Smrž. 2007a. ElixirFM: Implementation of Functional Arabic Morphology. In ACL 2007 Proceedings of the Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, pages 1–8, Prague, Czech Republic. ACL.

Otakar Smrž. 2007b. Functional Arabic Morphology: Formal System and Implementation. Ph.D. thesis, Charles University in Prague, Prague, Czech Republic.

Abdelhadi Soudi, Antal van den Bosch, and Günter Neumann, editors. 2007. Arabic Computational Morphology: Knowledge-based and Empirical Methods, volume 38 of Text, Speech and Language Technology. Springer, August.
Framework of Semantic Role Assignment based on Extended Lexical Conceptual Structure: Comparison with VerbNet and FrameNet
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 686-695, Avignon, France, April 23 - 27 2012. © 2012 Association for Computational Linguistics
that a single semantic role is assigned to each syntactic argument.[1] In fact, one syntactic argument can play multiple roles in the event (or events) expressed by a verb. For example, Table 1 shows a sentence containing the verb throw and the semantic roles assigned to its arguments in each framework. The table shows that each framework assigns a single role, such as Arg0 or Agent, to each syntactic argument. However, we can acquire the information from this sentence that John is an agent of the throwing event (the Affection row) as well as a source of the movement event of the ball (the Movement row). Existing frameworks that assign single roles simply ignore such information, which verbs inherently have in their semantics. We believe that giving a clear definition of multiple argument roles would be beneficial not only as a theoretical framework but also for practical applications that require detailed meanings derived from secondary roles.

[1] To be precise, FrameNet permits multiple-role assignment, but it does not perform it systematically, as we show in Table 1. It mostly defines a single role label for a corresponding syntactic argument that plays multiple roles in several sub-events of a verb.

This issue is also related to fragmentation and the unclear definition of semantic roles in these frameworks. As we exemplify in this paper, multiple semantic characteristics are conflated in a single role label in these resources due to the manner of single-role assignment. This means that the semantic roles of existing resources are not monolithic and are inherently not mutually independent; rather, they share some semantic characteristics.

The aim of this paper is more a theoretical discussion of role-labeling frameworks than the introduction of a new resource. We developed a framework of verb lexical semantics that extends lexical conceptual structure (LCS) theory, and we compare it with the existing frameworks used in VerbNet and FrameNet as annotation schemes for SRL. LCS is a decomposition-based approach to verb semantics that describes a meaning by composing a set of primitive predicates. The advantage of this approach is that the primitive predicates and their compositions are formally defined. As a result, we can give a strict definition of semantic roles by grounding them in the lexical semantic structures of verbs. In fact, we define semantic roles as argument slots in primitive predicates. With this approach, we demonstrate that some of the semantic characteristics that VerbNet and FrameNet informally or implicitly describe in their roles can be given formal definitions, and that multiple argument roles can be represented strictly and naturally by extending LCS theory.

In the first half of this paper, we define our extended LCS framework and describe how it gives a formal definition of roles and solves the problem of multiple roles. In the latter half, we discuss the analysis of the empirical data we collected for 60 Japanese verbs and also discuss theoretical relationships with the frameworks of existing resources. We discuss in detail the relationships between our role labels and VerbNet's thematic roles. We also describe the relationship between our framework and FrameNet with regard to the definitions of the relationships between semantic frames.

2 Related works

There have been several attempts in linguistics to assign multiple semantic properties to one argument. Gruber (1965) demonstrated, with some concrete examples, the dispensability of the constraint that an argument takes only one semantic role. Rozwadowska (1988) suggested an approach of feature decomposition for semantic roles using her three features of change, cause, and sentient, and defined typical thematic roles by combining these features. This approach made it possible to classify semantic properties across thematic roles. However, Levin and Rappaport Hovav (2005) argued that the number of combinations of the defined features is usually larger than the actual number of possible combinations; therefore, feature decomposition approaches should predict the possible feature combinations.

Culicover and Wilkins (1984) divided their roles into two groups, action and perceptional roles, and explained that dual assignment of roles always involves one role from each set. Jackendoff (1990) proposed an LCS framework for representing the meaning of a verb by using several primitive predicates. Jackendoff also stated that an LCS represents two tiers in its structure, the action tier and the thematic tier, which are similar to Culicover and Wilkins's two sets. Essentially, these two approaches distinguished roles related to action and change, and successfully restricted combinations of roles by taking a role from each set.

Dorr (1997) created an LCS-based lexical resource as an interlingual representation for machine translation. This framework was also used for text generation (Habash et al., 2003). However, the problem of multiple-role assignment was not completely solved in that resource. As a comparison of different semantic structures, Dorr (2001) and Hajicova and Kucerova (2002) analyzed the connection between LCS and PropBank roles, and showed that the mapping between LCS and PropBank roles is a many-to-many correspondence and that roles can be mapped only by comparing the whole argument structure of a verb. Habash and Dorr (2001) tried to map LCS structures onto thematic roles by using their thematic hierarchy.

3 Multiple role expression using lexical conceptual structure

Lexical conceptual structure is an approach to describing the generalized structure of an event or state represented by a verb. The meaning of a verb is represented as a structure composed of several primitive predicates. For example, the LCS structure for the verb throw is shown in Figure 1 and includes the predicates cause, affect, go, from, fromward, toward, locate, in, and at. The arguments of the primitive predicates are filled by core arguments of the verb. This type of decomposition approach enables us to represent the case in which one syntactic argument fills multiple slots in the structure. In Figure 1, the argument i appears twice in the structure: as the first argument of affect and as the argument embedded in from.

    cause(affect(i, j), go(j, [ from(locate(in(i))),
                                fromward(locate(at(k))),
                                toward(locate(at(l))) ]))

Figure 1: LCS of the verb throw.

The primitives are designed to represent a full or partial action-change-state chain, which consists of a state, a change in (or maintaining of) a state, or an action that changes/maintains a state. Table 2 shows the primitives that play important roles in representing that chain. Some primitives embed other primitives as their arguments, and the semantics of the entire LCS structure is calculated according to the definition of each primitive. For instance, the LCS structure in Figure 1 represents the action changing the state of j. The inner structure of the second argument of go represents the path of the change.

    Predicate     Semantic function
    state(x, y)   First argument is in the state specified by the second argument.
    cause(x, y)   Action in the first argument causes the change specified in the second argument.
    act(x)        First argument affects itself.
    affect(x, y)  First argument affects the second argument.
    react(x, y)   First argument affects itself, due to the effect from the second argument.
    go(x, y)      First argument changes according to the path described in the second argument.
    from(x)       Starting point of a certain change event.
    fromward(x)   Direction of the starting point.
    via(x)        Pass point of a certain change event.
    toward(x)     Direction of the end point.
    to(x)         End point of a certain change event.
    along(x)      Linear-shaped path of a change event.

Table 2: Major primitive predicates and their semantic functions.

The overall definition of our extended LCS framework is shown in Figure 2.[2] Basically, our definition is based on Jackendoff's LCS framework (1990), but we performed some simplifications and added extensions. The modifications were made in order to increase the strictness and generality of the representation, as well as its coverage of the various verbs appearing in a corpus. The main differences between the two LCS frameworks are as follows. In our extended LCS framework, (i) the possible combinations of cause, act, affect, react, and go are clearly restricted, (ii) multiple actions or changes in an event can be described by introducing a combination function (comb for short), (iii) GO, STAY, and INCH in Jackendoff's theory are incorporated into one function go, and (iv) most change-of-state events are represented as a metaphor using a spatial transition.

[2] Here we omitted the attributes taken by each predicate in order to simplify the explanation. We also omitted an explanation of lower-level primitives, such as the STATE and PLACE groups, which are not necessarily important for the topic of this paper.

    LCS   = [ EVENT+  comb(EVENT)* ]
    EVENT = [ { state(arg, STATE),
                go(arg, PATH),
                cause(act(arg1), go(arg1, PATH)),
                cause(affect(arg1, arg2), go(arg2, PATH)),
                cause(react(arg1, arg2), go(arg1, PATH)) }
              manner(constant)?  mean(constant)?
              instrument(constant)?  purpose(EVENT)* ]
    PATH  = [ from(STATE)?  fromward(STATE)?  via(STATE)?
              toward(STATE)?  to(STATE)?  along(arg)? ]
    STATE = { be, locate(PLACE), orient(PLACE), extent(PLACE), connect(arg) }
    PLACE = { in(arg), on(arg), cover(arg), fit(arg), inscribed(arg),
              beside(arg), around(arg), near(arg), inside(arg), at(arg) }

Figure 2: Description system of our LCS. The operators +, *, and ? follow basic regular expression syntax; {} represents a choice among the elements.

The idea of the comb function comes from a natural extension of Jackendoff's EXCH function. In our case, comb is not limited to describing a counter-transfer of the main event but can describe subordinate events occurring in relation to the main event.[3] We can also describe multiple main events if the agent performs two or more actions simultaneously and all the actions are in focus (e.g., John exchanges A with B). This extension is simple, but essential for creating LCS structures of predicates appearing in actual data. In our development of 60 Japanese predicates (verbs and verbal nouns) frequently appearing in the Kyoto University Text Corpus (KTC) (Kurohashi and Nagao, 1997), 37.6% of the frames included multiple events. By using the comb function, we can express complicated events with predicate decomposition and prevent missing (multiple) roles.

[3] In our extended LCS theory, we can describe multiple events in the semantic structure of a verb. Generally, however, a verb focuses on one of those events, and this creates semantic variation among verbs such as buy, sell, and pay, as well as differences in the syntactic behavior of the arguments. Therefore, the focused event should be distinguished from the others as lexical information. We express focused events as main formulae (formulae that are not surrounded by a comb function).

A key point for associating the LCS framework with existing frameworks of semantic roles is that each primitive predicate of LCS represents a fundamental function in semantics. The functions of the arguments of the primitive predicates can be explained using generalized semantic roles such as typical thematic roles. In order to represent the semantic functions of the arguments in the LCS primitives simply, and to make it easier to compare our extended LCS framework with other SRL frameworks, we define a semantic role set that corresponds to the semantic functions of the primitive predicates in the LCS structure (Table 3). We employed role names similar to typical thematic roles in order to compare the role sets easily, but the definitions are different. Also, due to the increased generality of the LCS representation, we obtained clearer definitions for explaining the correspondence between LCS primitives and typical thematic roles than Jackendoff's predicates provide. Note that the core semantic information of a verb represented by an LCS framework is embodied directly in its LCS structure, and this information decreases if the structure is mapped to the semantic roles. The mapping is just for contrasting thematic roles. Each role is given an obvious meaning and is designed to fit the upper-level primitives of the LCS structure, which are the arguments of the EVENT and PATH functions. In Table 4, we can see that these roles correspond almost one-to-one to the primitive arguments. One special role is Protagonist, which does not match an argument of a specific primitive. The Protagonist is assigned to the first argument in the main formula to distinguish that formula from the sub-formulae. There are 13 defined roles.

    Role         Description
    Protagonist  Entity which is the viewpoint of the verb.
    Theme        Entity whose state or change of state is mentioned.
    State        Current state of a certain entity.
    Actor        Entity which performs an action that changes/maintains its own state.
    Effector     Entity which performs an action that changes/maintains the state of another entity.
    Patient      Entity whose state is changed/maintained by another entity.
    Stimulus     Entity which is the cause of the action.
    Source       Starting point of a certain change event.
    Source dir   Direction of the starting point.
    Middle       Pass point of a certain change event.
    Goal         End point of a certain change event.
    Goal dir     Direction of the end point.
    Route        Linear-shaped path of a certain change event.

Table 3: Semantic role list for the proposed extended LCS framework.
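The multiple-slot behavior described for Figure 1 can be made concrete with a small sketch. The nested-tuple encoding and the `slots` helper below are our own illustrative assumptions, not a format defined by the framework:

```python
# The LCS of "throw" from Figure 1, encoded as nested tuples of the form
# (predicate, arg, ...). A list holds the components of a path. This
# encoding is an illustrative assumption, not a format from the paper.
throw_lcs = (
    "cause",
    ("affect", "i", "j"),
    ("go", "j", [
        ("from", ("locate", ("in", "i"))),
        ("fromward", ("locate", ("at", "k"))),
        ("toward", ("locate", ("at", "l"))),
    ]),
)

def slots(term, path=()):
    """Yield (variable, chain of embedding predicates) for every slot."""
    if isinstance(term, str):
        yield term, path
    elif isinstance(term, tuple):
        pred, *args = term
        for a in args:
            yield from slots(a, path + (pred,))
    else:  # a list of path components
        for a in term:
            yield from slots(a, path)

filled = {}
for var, pred_path in slots(throw_lcs):
    filled.setdefault(var, []).append(pred_path)

# The argument i fills two slots: the first argument of affect, and the
# argument embedded under from, as the text notes for Figure 1.
assert len(filled["i"]) == 2
assert filled["i"][0] == ("cause", "affect")
assert filled["i"][1] == ("cause", "go", "from", "locate", "in")
```

Traversing the structure recovers every slot an argument fills, which is exactly the information a single-role label discards.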
The number of defined roles (13) is comparatively smaller than that in VerbNet. We discuss this number further in the next section.

    Predicate  1st arg     2nd arg
    state      Theme       State
    act        Actor
    affect     Effector    Patient
    react      Actor       Stimulus
    go         Theme       PATH
    from       Source
    fromward   Source dir
    via        Middle
    toward     Goal dir
    to         Goal
    along      Route

Table 4: Correspondence between semantic roles and arguments of LCS primitives.

Essentially, the semantic functions of the arguments in LCS primitives are similar to those of traditional, or basic, thematic roles. However, there are two important differences. Our extended LCS framework principally guarantees that the primitive predicates do not contain any information concerning (i) selectional preference or (ii) complex structural relations of arguments. Primitives are designed to purely represent a function in an action-change-state chain; thus, information about selectional preference is annotated in a different layer. Specifically, it is annotated directly on core arguments (e.g., we can annotate i with selPref(animate organization) in Figure 1). Also, the semantic function is already decomposed, and the structural relation among the arguments is represented as a structure of primitives in the LCS representation. Therefore, each argument slot of the primitive predicates does not include complicated meanings; it represents a primitive semantic property which is highly functional. These characteristics are necessary to ensure the clarity of the semantic role meanings. We believe that even though a certain type of complex semantic role surely exists, it is reasonable to represent that role based on decomposed properties.

In order to show an instance of our extended LCS theory, we constructed a dictionary of LCS structures for 60 Japanese verbs (including event nouns) using our extended LCS framework. The 60 verbs were the most frequent verbs in KTC after excluding the 100 most frequent ones.[4] We created the dictionary by looking at the instances of the target verbs in KTC. To increase the coverage of senses and case frames, we also consulted the online Japanese dictionary Digital Daijisen[5] and the Kyoto University case frames (Kawahara and Kurohashi, 2006), a compilation of case frames automatically acquired from a huge web corpus. There were 97 constructed frames in the dictionary.

[4] We omitted the top 100 verbs since these most frequent ones contain a phonogram (Hiragana) form of a certain verb written with Kanji characters, and that phonogram form generally has huge ambiguity because many different verbs have the same pronunciation in Japanese.

[5] Available at http://dictionary.goo.ne.jp/jn/.

We then analyzed how many roles are additionally assigned by permitting multiple-role assignment (see Table 5). The numbers of assigned roles for single-role assignment are calculated by counting the role that appears first for each target argument in the structure. Table 5 shows that the total number of assigned roles is 1.77 times larger than with single-role assignment. The main reason is an increase in Theme. Under single-role assignment, Theme (in our sense) in action verbs is always duplicated with Actor/Patient. On the other hand, LCS strictly divides the functions for action and change; therefore, the duplicated Theme is correctly annotated. Moreover, we obtained a 45% increase even when we did not count the duplicated Theme. Most of the increase results from the increase in Source and Goal. For example, Effectors of transmission verbs are also annotated with Source, and Effectors of movement verbs are sometimes annotated with Source or Goal.

    Role        Single  Multiple  Growth (%)
    Theme           21       108         414
    State            1         1           0
    Actor           12        13         8.3
    Effector        73        92          26
    Patient         77        79         2.5
    Stimulus         0         0           0
    Source          11        44         300
    Source dir       4         4           0
    Middle           1         8         700
    Goal            42        81          93
    Goal dir         2         3          50
    Route            2         2           0
    w/o Theme      225       327          45
    Total          246       435          77

Table 5: Number of appearances of each role under single-role and multiple-role assignment.
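The aggregate figures quoted in the text can be re-derived from the per-role counts in Table 5 as a quick consistency check (the dictionary layout is ours):

```python
# Per-role counts from Table 5: role -> (single-role, multiple-role).
counts = {
    "Theme": (21, 108), "State": (1, 1), "Actor": (12, 13),
    "Effector": (73, 92), "Patient": (77, 79), "Stimulus": (0, 0),
    "Source": (11, 44), "Source dir": (4, 4), "Middle": (1, 8),
    "Goal": (42, 81), "Goal dir": (2, 3), "Route": (2, 2),
}

single = sum(s for s, m in counts.values())    # 246 (Total row)
multiple = sum(m for s, m in counts.values())  # 435 (Total row)

# The total number of assigned roles is 1.77 times larger, as stated.
assert single == 246 and multiple == 435
assert round(multiple / single, 2) == 1.77

# Excluding Theme still yields the 45% increase reported in the text.
s_wo = single - counts["Theme"][0]             # 225 (w/o Theme row)
m_wo = multiple - counts["Theme"][1]           # 327 (w/o Theme row)
assert round((m_wo - s_wo) / s_wo * 100) == 45
```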
4 Comparison with other resources

4.1 Number of semantic roles

The number of roles is related to the number of semantic properties represented in a framework and to the generality of those properties. Table 6 lists the number of semantic roles defined in our extended LCS framework, VerbNet, and FrameNet.

    Resource         Frame-independent  # of roles
    LCS              yes                        13
    VerbNet (v3.1)   yes                        30
    FrameNet (r1.4)  no                       8884

Table 6: Number of roles in each resource.

There are two ways to define semantic roles. One is frame-specific, where the definition of each role depends on a specific lexical entry and such a role is never used in other frames. The other is frame-independent, where roles are constructed whose semantic function is generalized across all verbs. The number of roles in FrameNet is comparatively large because it defines roles in a frame-specific way; FrameNet respects the individual meanings of arguments rather than the generality of roles.

Compared with VerbNet, the number of roles defined in our extended LCS framework is less than half. However, this fact does not mean that the representational ability of our framework is lower than VerbNet's. We manually checked and listed a corresponding representation in our extended LCS framework for each thematic role in VerbNet in Table 7. This table does not provide a perfect or complete mapping between the roles in these two frameworks because the mappings are not based on annotated data. However, we can roughly say that the VerbNet roles combine three types of information: a function of the argument in the action-change-state chain, selectional preference, and structural information of arguments, which are in different layers in the LCS representation. VerbNet has many roles whose functions in the action-change-state chain are duplicated. For example, Destination, Recipient, and Beneficiary have the same end-state property (Goal in LCS) of a changing event. The difference between such roles comes from a specific sub-type of the changing event (possession), selectional preference, and structural information among the arguments. By distinguishing such roles, VerbNet roles may take into account specific syntactic behaviors of certain semantic roles. Packing such complex information into semantic roles is useful for analyzing argument realization. However, from the viewpoint of semantic representation, the clarity of semantic properties provided by a predicate decomposition approach is beneficial. The 13 roles of the LCS approach are sufficient for obtaining a function in the action-change-state chain. In our LCS framework, selectional preference can be assigned to arguments at the level of an individual verb or verb class instead of to the role labels themselves, to maintain the generality of semantic functions. In addition, our extended LCS framework can easily separate complex structural information from role labels because LCS directly represents the structure among the arguments. We can calculate this information from the LCS structure instead of coding it into role labels. As a result, our extended LCS framework maintains the generality of roles, and the number of roles is smaller than in other frameworks.

4.2 Clarity of role meanings

We showed that the approach of predicate decomposition used in LCS theory clarifies the role meanings assigned to syntactic arguments. Moreover, LCS achieves high generality of roles by separating selectional preference and structural information from role labels. The complex meaning of one syntactic argument is represented by multiple appearances of the argument in an LCS structure. For example, we show an LCS structure and a VerbNet frame for the verb buy in Figure 3. The LCS structure consists of four formulae. The first one is the main formula and the others are sub-formulae that represent co-occurring actions. The semantic-role-like representation of the structure, obtained via Table 4, is: i = {Protagonist, Effector, Source, Goal}, j = {Patient, Theme}, k = {Effector, Source, Goal}, and l = {Patient, Theme}. Selectional preference is annotated on each argument as i: selPref(animate organization), j: selPref(any), k: selPref(animate organization), and l: selPref(valuable entity). If we want to represent information such as "Source of what?", then we can extend the notation as Source(j) to refer to the changing object.

On the other hand, VerbNet combines multiple types of information into a single role, as mentioned above.
The meaning of some such roles depends more on selectional preference or the structure of the arguments than on a primitive function in the action-change-state chain. Such VerbNet roles are used for several different functions depending on the verbs and their alternations, and it is therefore difficult to capture decomposed properties from the role label without specific lexical knowledge. Moreover, some semantic functions, such as Mary being a Goal of the money in Figure 3, are completely discarded from the representation at the level of role labels.

    Example:  John bought a book from Mary for $10.
    VerbNet:  Agent V Theme {from} Source {for} Asset.
              has_possession(start(E), Source, Theme),
              has_possession(end(E), Agent, Theme),
              transfer(during(E), Theme), cost(E, Asset)
    LCS:
      [ cause(aff(i:John, j:a book), go(j, [ to(loc(in(i))) ]))
        comb[ cause(aff(i, l:$10), go(l, [ from(loc(in(i))), to(loc(at(k:Mary))) ])) ]
        comb[ cause(aff(k, j), go(j, [ from(loc(in(k))), to(loc(at(i))) ])) ]
        comb[ cause(aff(k, l), go(l, [ to(loc(in(k))) ])) ] ]

Figure 3: Comparison between the semantic predicate representation and the LCS structure of the verb buy.

There is another representation related to the argument meanings in VerbNet. This representation is a type of predicate decomposition using its original set of predicates, which are referred to as semantic predicates. For example, the verb buy in Figure 3 has the predicates has_possession, transfer, and cost for composing the meaning of its event structure. The thematic roles are fillers of the predicates' arguments; thus, the semantic predicates may implicitly provide additional functions to the roles and possibly represent multiple roles. Unfortunately, we cannot discover what each argument of the semantic predicates exactly means, since the definition of each predicate is not publicly available. A requirement for obtaining implicit semantic functions from these semantic predicates is to clearly define how the roles (or functions) are calculated from these complex relations of semantic predicates.

    VerbNet role (# of uses)           Representation in LCS
    Actor (9), Actor1 (9), Actor2 (9)  Actor or Effector in symmetric formulae in the structure
    Agent (212)                        (Actor Effector) Protagonist
    Asset (6)                          Theme; Source of the change is (locate(in()) Protagonist); selPref(valuable entity)
    Beneficiary (9)                    (peripheral role (Goal locate(in()))); selPref(animate organization); (Actor Effector); a transferred entity is something beneficial
    Cause (21)                         ((Effector selPref(animate organization)) Stimulus peripheral role)
    Destination (32)                   Goal
    Experiencer (24)                   Actor of react()
    Instrument (25)                    ((Effector selPref(animate organization)) peripheral role)
    Location (45)                      (Theme PATH roles peripheral role); selPref(location)
    Material (6)                       Theme; Source of a change; the Goal of the change is locate(fit()); the Goal fulfills selPref(physical object)
    Patient (59), Patient1 (11)        Patient Theme
    Patient2 (11)                      (Source Goal) connect()
    Predicate (23)                     Theme (Goal locate(fit())) peripheral role
    Product (7)                        Theme (Goal locate(fit()) selPref(physical object))
    Proposition (11)                   Theme
    Recipient (33)                     Goal locate(in()) selPref(animate organization)
    Source (34)                        Source
    Theme (162)                        Theme
    Theme1 (13), Theme2 (13)           Both are Theme; Theme1 is Theme and Theme2 is State
    Topic (18)                         Theme selPref(knowledge information)

Table 7: Relationship of roles between VerbNet and our LCS framework. VerbNet roles that appear more than five times in frame definitions are analyzed. Each relationship shown here is only a partial and consistent part of the complete correspondence table; the complete mapping highly depends on each lexical entry (or verb class). Here, locate(in()) generally means possession or recognizing.

FrameNet does not use semantic roles generalized among all verbs or does not represent seman-

Figure 4: LCS of the verbs get, buy, sell, pay, and collect and their relationships calculated from the structures. (Selectional preferences: i: selPref(animate organization), j: selPref(any), k: selPref(animate organization), l: selPref(valuable entity).)
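The role sets listed in Section 4.2 for the arguments of buy can be derived mechanically from the structure in Figure 3 by applying the Table 4 correspondence plus the Protagonist rule. Below is a sketch using an illustrative nested-tuple encoding of our own (the encoding and helper names are assumptions, not an API from the paper):

```python
# The LCS of "buy" from Figure 3 as nested tuples; aff abbreviates
# affect, and a list holds the components of a path.
main = ("cause", ("aff", "i", "j"),
        ("go", "j", [("to", ("loc", ("in", "i")))]))
subs = [
    ("cause", ("aff", "i", "l"),
     ("go", "l", [("from", ("loc", ("in", "i"))),
                  ("to", ("loc", ("at", "k")))])),
    ("cause", ("aff", "k", "j"),
     ("go", "j", [("from", ("loc", ("in", "k"))),
                  ("to", ("loc", ("at", "i")))])),
    ("cause", ("aff", "k", "l"),
     ("go", "l", [("to", ("loc", ("in", "k")))])),
]

# Table 4 correspondence: argument slots of primitives -> roles.
ARG_ROLES = {("aff", 0): "Effector", ("aff", 1): "Patient",
             ("go", 0): "Theme"}
PATH_ROLES = {"from": "Source", "fromward": "Source dir", "via": "Middle",
              "toward": "Goal dir", "to": "Goal", "along": "Route"}

def variables(term):
    """All variable names occurring anywhere inside a term."""
    if isinstance(term, str):
        yield term
    else:
        for part in (term[1:] if isinstance(term, tuple) else term):
            yield from variables(part)

def collect(term, out):
    """Assign roles to variables following the Table 4 correspondence."""
    if isinstance(term, list):
        for element in term:
            collect(element, out)
        return
    if not isinstance(term, tuple):
        return
    pred, *args = term
    if pred in PATH_ROLES:  # a path component: role goes to the entity inside
        for v in variables(term):
            out.setdefault(v, set()).add(PATH_ROLES[pred])
        return
    for idx, arg in enumerate(args):
        if isinstance(arg, str):
            out.setdefault(arg, set()).add(ARG_ROLES[(pred, idx)])
        else:
            collect(arg, out)

roles = {}
for formula in [main] + subs:
    collect(formula, roles)

# Protagonist: the first argument of the main formula (Section 3).
first = main
while isinstance(first, tuple):
    first = first[1]
roles[first].add("Protagonist")

# These match the sets given in Section 4.2 for Figure 3.
assert roles["i"] == {"Protagonist", "Effector", "Source", "Goal"}
assert roles["j"] == {"Patient", "Theme"}
assert roles["k"] == {"Effector", "Source", "Goal"}
assert roles["l"] == {"Patient", "Theme"}
```

The traversal needs no per-verb knowledge: the role sets fall out of the structure and the fixed slot-to-role table, which is the separation of layers the comparison with VerbNet argues for.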
They contain exactly the same formulae, and the only difference is the main formula. The relation between buy and get is defined as inheritance: a part of the child structure exactly equals the parent structure. Interestingly, the relations surrounding buy are similar to those in FrameNet (see Figure 5). We cannot describe all types of the relations we considered due to space limitations. However, the point is that these relationships are represented as rewriting rules between the two LCS representations and thus can be calculated automatically. Moreover, the grounds for the relations maintain clarity based on concrete structural relations. Constructing semantic relations between frames based on structural relationships is another possible application of LCS approaches, one that connects traditional LCS theories with resources representing a lexical network, such as FrameNet.

4.3 Consistency of semantic structures

Constructing an LCS dictionary is generally difficult work, since LCS has high flexibility for describing structures and different people tend to write different structures for a single verb. We maintained the consistency of the dictionary by taking into account the similarity of the structures between verbs that are in paraphrasing or entailment relations. This idea was inspired by the automatic calculation of semantic relations of the lexicon mentioned above. We created an LCS structure for each lexical entry such that we can calculate semantic relations between related verbs, and thereby maintained high-level consistency among the verbs.

Using our extended LCS theory, we successfully created 97 frames for 60 predicates without any extra modification. From this result, we believe that our extended theory is stable to some extent. On the other hand, we found that an extra extension of the LCS theory is needed for some verbs to explain the different syntactic behaviors of one verb. For example, a condition for a certain syntactic behavior of verbs related to the reciprocal alternation (see class 2.5 of Levin (Levin, 1993)) such as (connect) and (integrate) cannot be explained without considering the number of entities in some arguments. Also, some verbs need to define an order of their internal events. For example, the Japanese verb (shuttle) means that going is the first action and coming back is the second action. These are not problems directly related to the semantic role annotation on which we focus in this paper, but we plan to solve them with further extensions.

5 Conclusion

We discussed two problems in current labeling approaches for argument-structure analysis: the problems in the clarity of role meanings and in multiple-role assignment. Focusing on the fact that an approach of predicate decomposition is suitable for solving these problems, we proposed a new framework for semantic role assignment by extending Jackendoff's LCS framework. The statistics of our LCS dictionary for 60 Japanese verbs showed that 37.6% of the created frames included multiple events and that the number of assigned roles for one syntactic argument increased by 77% compared with single-role assignment.

Compared to other resources such as VerbNet and FrameNet, the role definitions in our extended LCS framework are clearer, since the primitive predicates limit the meaning of each role to a function in the action-change-state chain. We also showed that LCS can separate three types of information: the functions represented by primitives, the selectional preferences, and the structural relations of arguments, which are conflated in the role labels of existing resources. As a potential of LCS, we demonstrated that several types of frame relations, similar to those in FrameNet, can be calculated automatically using the structural relations between LCSs. We still must perform a thorough investigation to enumerate the relations which can be represented in terms of rewriting rules over LCS structures. However, automatic construction of a consistent relation graph of semantic frames may be possible based on lexical structures.

We believe that this kind of decomposed analysis will accelerate both fundamental and application research on argument-structure analysis. As future work, we plan to expand the dictionary and construct a corpus based on our LCS dictionary.

Acknowledgment

This work was partially supported by JSPS Grant-in-Aid for Scientific Research #22800078.
References

P.W. Culicover and W.K. Wilkins. 1984. Locality in Linguistic Theory. Academic Press.

Bonnie J. Dorr. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):271–322.

Bonnie J. Dorr. 2001. LCS database. http://www.umiacs.umd.edu/bonnie/LCS Database Documentation.html.

Jeffrey S. Gruber. 1965. Studies in Lexical Relations. Ph.D. thesis, MIT.

N. Habash and B. Dorr. 2001. Large scale language independent generation using thematic hierarchies. In Proceedings of MT Summit VIII.

N. Habash, B. Dorr, and D. Traum. 2003. Hybrid natural language generation from lexical conceptual structures. Machine Translation, 18(2):81–128.

Eva Hajičová and Ivona Kučerová. 2002. Argument/valency structure in PropBank, LCS Database and Prague Dependency Treebank: A comparative pilot study. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), pages 846–851.

Ray Jackendoff. 1990. Semantic Structures. The MIT Press.

D. Kawahara and S. Kurohashi. 2006. Case frame compilation from the web using high-performance computing. In Proceedings of LREC-2006, pages 1344–1347.

Paul Kingsbury and Martha Palmer. 2002. From TreeBank to PropBank. In Proceedings of LREC-2002, pages 1989–1993.

Karin Kipper, Hoa Trang Dang, and Martha Palmer. 2000. Class-based construction of a verb lexicon. In Proceedings of the National Conference on Artificial Intelligence, pages 691–696. AAAI Press/MIT Press.

Sadao Kurohashi and Makoto Nagao. 1997. Kyoto University text corpus project. Proceedings of the Annual Conference of JSAI, 11:58–61.

Beth Levin and Malka Rappaport Hovav. 2005. Argument Realization. Cambridge University Press.

Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.

Lluís Màrquez, Xavier Carreras, Kenneth C. Litkowski, and Suzanne Stevenson. 2008. Semantic role labeling: an introduction to the special issue. Computational Linguistics, 34(2):145–159.

B. Rozwadowska. 1988. Thematic restrictions on derived nominals. In W. Wilkins, editor, Syntax and Semantics, volume 21, pages 147–165. Academic Press.

J. Ruppenhofer, M. Ellsworth, M.R.L. Petruck, C.R. Johnson, and J. Scheffczyk. 2006. FrameNet II: Extended Theory and Practice. Berkeley FrameNet Release 1.

Szu-ting Yi, Edward Loper, and Martha Palmer. 2007. Can semantic roles generalize across genres? In Proceedings of HLT-NAACL 2007, pages 548–555.
Unsupervised Detection of Downward-Entailing Operators By Maximizing Classification Certainty

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 696–705,
Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics
proach, as we will show. However, as noted by DL10, the performance of the distillation method is mixed across languages and in the semi-supervised bootstrapping setting, and there is no mathematical grounding of the heuristic to explain why it works and whether the approach can be refined or extended. This paper supplies the missing mathematical basis for distillation and shows that, while its intentions are fundamentally sound, the formulation of distillation neglects an important requirement that the method not be easily distracted by other word co-occurrences in NPI contexts. We call our alternative certainty, which uses an unusual posterior classification confidence score (based on the max function) to favour single, definite assignments of DEO-hood within every NPI context. DLD09 actually speculated on the use of max as an alternative, but within the context of an EM-like optimization procedure that throws away its initial parameter settings too willingly. Certainty iteratively and directly boosts the scores of the currently best-ranked DEO candidates relative to the alternatives in a Naïve Bayes model, which thus pays more respect to the initial weights, constructively building on top of what the model already knows. This method proves to perform better on two corpora than distillation, and is more amenable to the co-learning of NPIs and DEOs. In fact, the best results are obtained by co-learning the NPIs and DEOs in conjunction with our method.

2 Related work

There is a large body of literature in linguistic theory on downward entailment and polarity items[1], of which we will only mention the most relevant work here. The connection between downward-entailing contexts and negative polarity items was noticed by Ladusaw (1980), who stated the hypothesis that NPIs must be grammatically licensed by a DEO. However, DEOs are not the sole licensors of NPIs, as NPIs can also be found in the scope of questions, certain numeric expressions (i.e., non-monotone quantifiers), comparatives, and conditionals, among others. Giannakidou (2002) proposes that the property shared by these constructions and downward entailment is non-veridicality. If F is a propositional operator for proposition p, then an operator is non-veridical if Fp ⊭ p. Positive operators such as past tense adverbials are veridical (4), whereas questions, negation and other DEOs are non-veridical (5, 6).

(4) She sang yesterday. ⊨ She sang.
(5) She denied singing. ⊭ She sang.
(6) Did she sing? ⊭ She sang.

While Ladusaw's hypothesis is thus accepted to be insufficient from a linguistic perspective, it is nevertheless a useful starting point for computational methods for detecting NPIs and DEOs, and has inspired successful techniques to detect DEOs, like the work by DLD09, DL10, and also this work. In addition to this hypothesis, we further assume that there should only be one plausible DEO candidate per NPI context. While there are counterexamples, this assumption is in practice very robust, and is a useful constraint for our learning algorithm. An analogy can be drawn to the one sense per discourse assumption in word sense disambiguation (Gale et al., 1992).

The related, and as we will argue, more difficult, problem of detecting NPIs has also been studied, and in fact predates the work on DEO detection. Hoeksema (1997) performed the first corpus-based study of NPIs, predominantly for Dutch, and there has also been work on detecting NPIs in German which assumes linguistic knowledge of licensing contexts for NPIs (Lichte and Soehn, 2007). Richter et al. (2010) make this assumption as well as use syntactic structure to extract NPIs that are multi-word expressions. Parse information is an especially important consideration in freer-word-order languages like German, where a MWE may not appear as a contiguous string. In this paper, we explicitly do not assume detailed linguistic knowledge about licensing contexts for NPIs and do not assume that a parser is available, since neither of these is guaranteed when extending this technique to resource-poor languages.

[1] See van der Wouden (1997) for a comprehensive reference.

3 Distillation as EM Prior Re-estimation

Let us first review the baseline and distillation methods proposed by DLD09, then show that distillation is equivalent to one iteration of EM prior
re-estimation in a Naïve Bayes generative probabilistic model up to constant rescaling. The baseline method assigns a score to each word-type based on the ratio of its relative frequency within NPI contexts to its relative frequency within a general corpus. Suppose we are given a corpus C with extracted NPI contexts N, and they contain tokens(C) and tokens(N) tokens respectively. Let y be a candidate DEO, count_C(y) be the unigram frequency of y in the corpus, and count_N(y) be the unigram frequency of y in N. Then, we define S(y) to be the ratio between the relative frequencies of y within NPI contexts and in the entire corpus[2]:

    S(y) = (count_N(y) / tokens(N)) / (count_C(y) / tokens(C)).   (7)

The scores are then used as a ranking to determine word-types that are likely to be DEOs. This method approximately captures Ladusaw's hypothesis by highly ranking words that appear in NPI contexts more often than would be expected by chance. However, the problem with this approach is that DEOs are not the only words that co-occur with NPIs. In particular, there exist many piggybackers, which, as defined by DLD09, collocate with DEOs due to semantic relatedness or chance, and would thus incorrectly receive a high S(y) score.

Examples of piggybackers found by DLD09 include the proper noun Milken, and the adverb vigorously, which collocate with DEOs like deny in the corpus they used. DLD09's solution to the piggybacker problem is a method that they term distillation. Let N_y be the NPI contexts that contain word y; i.e., N_y = {c ∈ N | c ∋ y}. In distillation, each word-type is given a distilled score according to the following equation:

    S_d(y) = (1/|N_y|) Σ_{p ∈ N_y} S(y) / Σ_{y′ ∈ p} S(y′),   (8)

where p indexes the set of NPI contexts which contain y[3], and the denominator is the number of NPI contexts which contain y.

[2] DLD09 actually use the number of NPI contexts containing y rather than count_N(y), but we find that using the raw count works better in our experiments.
[3] In DLD09, the corresponding equation does not indicate that p should be the contexts that include y, but it is clear from the surrounding text that our version is the intended meaning. If all the NPI contexts were included in the summation, S_d(y) would reduce to inverse relative frequency.

Figure 1: Naïve Bayes formulation of DEO detection. A latent DEO variable Y generates the observed context-word variables X_1 ... X_L.

DLD09 find that distillation seems to improve the performance of DEO detection in BLLIP. Later work by DL10, however, shows that distillation does not seem to improve performance over the baseline method in Romanian, and the authors also note that distillation does not improve performance in their experiments on co-learning NPIs and DEOs via bootstrapping.

A better mathematical grounding of the distillation method's apparent heuristic in terms of existing probabilistic models sheds light on the mixed performance of distillation across languages and experimental settings. In particular, it turns out that the distillation method of DLD09 is equivalent to one iteration of EM prior re-estimation in a Naïve Bayes model. Given a lexicon L of L words, let each NPI context be one sample generated by the model. One sample consists of a latent categorical (i.e., a multinomial with one trial) variable Y whose values range over L, corresponding to the DEO that licenses the context, and observed Bernoulli variables X⃗ = X_{i=1...L} which indicate whether a word appears in the NPI context (Figure 1). This method does not attempt to model the order of the observed words, nor the number of times each word appears. Formally, a Naïve Bayes model is given by the following expression:

    P(X⃗, Y) = P(Y) ∏_{i=1}^{L} P(X_i | Y).   (9)

The probability of a DEO given a particular NPI context is

    P(Y | X⃗) ∝ P(Y) ∏_{i=1}^{L} P(X_i | Y).   (10)
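Before turning to the estimation details, the two scoring functions (7) and (8) can be made concrete with a short sketch. This is an illustrative toy reimplementation, not the authors' code; treating each NPI context as a set of word-types (ignoring repeated tokens within a context) is our simplifying assumption, and the example words are hypothetical.

```python
from collections import Counter

def baseline_scores(corpus_tokens, npi_context_tokens):
    """Baseline score S(y), eq. (7): relative frequency of y within NPI
    contexts divided by its relative frequency in the entire corpus."""
    count_C = Counter(corpus_tokens)
    count_N = Counter(npi_context_tokens)
    tokens_C, tokens_N = len(corpus_tokens), len(npi_context_tokens)
    return {y: (count_N[y] / tokens_N) / (count_C[y] / tokens_C)
            for y in count_N}

def distilled_scores(npi_contexts, S):
    """Distilled score S_d(y), eq. (8): the share of the total baseline
    score that y claims in each NPI context containing it, averaged over
    those contexts, so that piggybackers which always co-occur with a
    stronger DEO candidate are discounted."""
    S_d = {}
    for y in S:
        N_y = [p for p in npi_contexts if y in p]  # contexts containing y
        if N_y:
            S_d[y] = sum(S[y] / sum(S[w] for w in set(p))
                         for p in N_y) / len(N_y)
    return S_d

# Hypothetical example: "vigorously" only ever co-occurs with "denied",
# so distillation halves its share while "denied" keeps a high score.
S = {"denied": 4.0, "vigorously": 4.0, "doubt": 2.0}
contexts = [["denied", "vigorously"], ["denied"], ["doubt"]]
print(distilled_scores(contexts, S))
# → {'denied': 0.75, 'vigorously': 0.5, 'doubt': 1.0}
```

In a real run, S would itself come from baseline_scores over the corpus; the point of the sketch is only that the distilled ranking demotes the piggybacker without any linguistic knowledge.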
The probability of a set of observed NPI contexts N is the product of the probabilities for each sample:

    P(N) = ∏_{X⃗ ∈ N} P(X⃗)   (11)
    P(X⃗) = Σ_{y ∈ L} P(X⃗, y).   (12)

We first instantiate the baseline method of DLD09 by initializing the parameters to the model, P(X_i = 1 | y) and P(Y = y), such that P(Y = y) is proportional to S(y). Recall that this initialization utilizes domain knowledge about the correlation between NPIs and DEOs, inspired by Ladusaw's hypothesis:

    P(Y = y) = S(y) / Σ_{y′} S(y′)   (13)
    P(X_i = 1 | y) = 1 if X_i corresponds to y; 0.5 otherwise.   (14)

This initialization of P(X_i = 1 | y) ensures that the value of y corresponds to one of the words in the NPI context, and the initialization of P(Y) is simply a normalization of S(y).

Since we are working in an unsupervised setting, there are no labels for Y available. A common and reasonable assumption about learning the parameter settings in this case is to find the parameters that maximize the likelihood of the observed training data; i.e., the NPI contexts:

    θ̂ = argmax_θ P(N; θ).   (15)

The EM algorithm is a well-known iterative algorithm for performing this optimization. Assuming that the prior P(Y = y) is a categorical distribution, the M-step estimate of these parameters after one iteration through the corpus is as follows:

    P^{t+1}(Y = y) = Σ_{X⃗ ∈ N} P^t(y | X⃗) / Σ_{y′} Σ_{X⃗ ∈ N} P^t(y′ | X⃗)   (16)

We do not re-estimate P(X_i = 1 | y) because their role is simply to ensure that the DEO responsible for an NPI context exists in the context. Estimating these parameters would exacerbate the problems with EM for this task which we will discuss shortly.

P(Y) gives a prior probability that a certain word-type y is a DEO in an NPI context, without normalizing for the frequency of y in NPI contexts. Since we are interested in estimating the context-independent probability that y is a DEO, we must calculate the probability that a word is a DEO given that it appears in an NPI context. Let X_y be the observed variable corresponding to y. Then, the expression we are interested in is P(y | X_y = 1). We now show that P(y | X_y = 1) = P(y)/P(X_y = 1), and that this expression is equivalent to (8).

    P(y | X_y = 1) = P(y, X_y = 1) / P(X_y = 1)   (17)

Recall that P(y, X_y = 0) = 0 because of the assumption that a DEO appears in the NPI context that it generates. Thus,

    P(y, X_y = 1) = P(y, X_y = 1) + P(y, X_y = 0) = P(y).   (18)

One iteration of EM to calculate this probability is equivalent to the distillation method of DLD09. In particular, the numerator of (17), which we just showed to be equal to the estimate of P(Y) given by (16), is exactly the sum of the responsibilities for a particular y, and is proportional to the summation in (8) modulo normalization, because P(X⃗ | y) is constant for all y in the context. The denominator P(X_y = 1) is simply the proportion of contexts containing y, which is proportional to |N_y|. Since both the numerator and denominator are equivalent up to a constant factor, an identical ranking is produced by distillation and EM prior re-estimation.

Unfortunately, the EM algorithm does not provide good results on this task. In fact, as more iterations of EM are run, the performance drops drastically, even though the corpus likelihood is increasing. The reason is that unsupervised EM learning is not constrained or biased towards learning a good set of DEOs. Rather, a higher data likelihood can be achieved simply by assigning high prior probabilities to frequent word-types.

This can be seen qualitatively by considering the top-ranking DEOs after several iterations of EM/distillation (Figure 2). The top-ranking words are simply function words or other words common in the corpus, which have nothing to do with downward entailment. In effect,
EM/distillation overrides the initialization based on Ladusaw's hypothesis and finds another solution with a higher data likelihood. We will also provide a quantitative analysis of the effects of EM/distillation in Section 5.

    1 iteration    2 iterations    3 iterations
    denies         the             the
    denied         to              to
    unaware        denied          that
    longest        than            than
    hardly         that            and
    lacking        if              has
    deny           has             if
    nobody         denies          of
    opposes        and             denied
    highest        but             denies

Figure 2: Top 10 DEOs after iterations of EM on BLLIP.

4 Alternative to EM: Maximizing the Posterior Classification Certainty

We have seen that in trying to solve the piggybacker problem, EM/distillation too readily abandons the initialization based on Ladusaw's hypothesis, leading to an incorrect solution. Instead of optimizing the data likelihood, what we need is a measure of the number of plausible DEO candidates there are in an NPI context, and a method that refines the scores towards having only one such plausible candidate per context. To this end, we define the classification certainty to be the product of the maximum posterior classification probabilities over the DEO candidates. For a set of hidden variables y^N for NPI contexts N, this is the expression:

    Certainty(y^N | N) = ∏_{X⃗ ∈ N} max_y P(y | X⃗).   (19)

To increase this certainty score, we propose a novel iterative heuristic method for refining the baseline initializations of P(Y). Unlike EM/distillation, our method biases learning towards trusting the initialization, but refines the scores towards having only one plausible DEO per context in the training corpus. This is accomplished by treating the problem as a DEO classification problem, and then maximizing an objective ratio that favours one DEO per context. Our method is not guaranteed to increase classification certainty between iterations, but we will show that it does increase certainty very quickly in practice.

The key observation that allows us to resolve the tension between trusting the initialization and enforcing one DEO per NPI context is that the distributions of words that co-occur with DEOs and piggybackers are different, and that this difference follows from Ladusaw's hypothesis. In particular, while DEOs may appear with or without piggybackers in NPI contexts, piggybackers do not appear without DEOs in NPI contexts, because Ladusaw's hypothesis stipulates that a DEO is required to license the NPI in the first place. Thus, the presence of a high-scoring DEO candidate among otherwise low-scoring words is strong evidence that the high-scoring word is not a piggybacker and its high score from the initialization is deserved. Conversely, a DEO candidate which always appears in the presence of other strong DEO candidates is likely a piggybacker whose initial high score should be discounted.

We now describe our heuristic method that is based on this intuition. For clarity, we use scores rather than probabilities in the following explanation, though it is equally applicable to either. As in EM/distillation, the method is initialized with the baseline S(y) scores. One iteration of the method proceeds as follows. Let the score of the strongest DEO candidate in an NPI context p be:

    M(p) = max_{y ∈ p} S_h^t(y),   (20)

where S_h^t(y) is the score of candidate y at the t-th iteration according to this heuristic method.

Then, for each word-type y in each context p, we compare the current score of y to the scores of the other words in p. If y is currently the strongest DEO candidate in p, then we give y credit equal to the proportional change to M(p) if y were removed (context p without y is denoted p \ y). A large change means that y is the only plausible DEO candidate in p, while a small change means that there are other plausible DEO candidates. If y is not currently the strongest DEO candidate, it receives no credit:

    cred(p, y) = (M(p) − M(p \ y)) / M(p)  if S_h^t(y) = M(p);  0 otherwise.   (21)
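One iteration of this credit assignment, combined with the multiplicative score update described below (each score is multiplied by its average credit over the contexts containing it), can be sketched as follows. This is an illustrative toy reimplementation, not the authors' code, and the word-types A–D with their initial scores are hypothetical.

```python
def one_iteration(npi_contexts, scores):
    """One iteration of the certainty-based heuristic, eqs. (20)-(21) plus
    the multiplicative update: the strongest candidate in each context is
    credited with the proportional drop in M(p) that its removal would
    cause; every candidate's score is then scaled by its average credit."""
    def M(p):
        # Score of the strongest DEO candidate in context p, eq. (20).
        return max(scores[w] for w in p)

    def cred(p, y):
        # Credit for y in context p, eq. (21).
        if scores[y] != M(p):
            return 0.0
        rest = [w for w in p if w != y]
        m_without = max(scores[w] for w in rest) if rest else 0.0
        return (M(p) - m_without) / M(p)

    updated = {}
    for y in scores:
        N_y = [p for p in npi_contexts if y in p]  # contexts containing y
        if N_y:
            updated[y] = scores[y] * sum(cred(p, y) for p in N_y) / len(N_y)
    return updated

# Toy setting: A is a piggybacker that only occurs alongside B, C is a
# frequent stop word, and B and D are the true DEOs.
contexts = [["A", "B", "C"], ["B", "C"], ["B", "C"], ["D", "C"]]
scores = {"A": 5.0, "B": 4.0, "C": 1.0, "D": 2.0}
print(one_iteration(contexts, scores))
# → {'A': 1.0, 'B': 2.0, 'C': 0.0, 'D': 1.0}
```

After one iteration the piggybacker A falls below the true DEO B and the stop word C is eliminated; repeating the update and renormalizing as in (23) yields the refined prior over DEO candidates.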
Then, the average credit received by each y is a measure of how much we should trust the current score for y. The updated score for each DEO candidate is the original score multiplied by this average:

    S_h^{t+1}(y) = (S_h^t(y) / |N_y|) Σ_{p ∈ N_y} cred(p, y).   (22)

The probability P^{t+1}(Y = y) is then simply S_h^{t+1}(y) normalized:

    P^{t+1}(Y = y) = S_h^{t+1}(y) / Σ_{y′ ∈ L} S_h^{t+1}(y′).   (23)

We iteratively reduce the scores in this fashion to get better estimates of the relative suitability of word-types as DEOs.

An example of this method and how it solves the piggybacker problem is given in Figure 3. In this example, we would like to learn that B and D are DEOs, A is a piggybacker, and C is a frequent word-type, such as a stop word. Using the original scores, piggybacker A would appear to be the most likely word to be a DEO. However, by noticing that it never occurs on its own with words that are unlikely to be DEOs (in the example, word C), our heuristic penalizes A more than B, and ranks B higher after one iteration. EM prior re-estimation would not correctly solve this example, as it would converge on a solution where C receives all of the probability mass because it appears in all of the contexts, even though it is unlikely to be a DEO according to the initialization.

    NPI contexts:    A B C,  B C,  B C,  D C
    Original scores: S(A) = 5, S(B) = 4, S(C) = 1, S(D) = 2
    Updated scores:
      S_h(A) = 5 · (5 − 4)/5             = 1
      S_h(B) = 4 · (0 + 2 · (4 − 1)/4)/3 = 2
      S_h(C) = 1 · (0 + 0 + 0)           = 0
      S_h(D) = 2 · (2 − 1)/2             = 1

Figure 3: Example of one iteration of the certainty-based heuristic on four NPI contexts with four words in the lexicon.

5 Experiments

We evaluate the performance of these methods on the BLLIP corpus (30M words) and the AFP portion of the Gigaword corpus (338M words). Following DLD09, we define an NPI context to be all the words to the left of an NPI, up to the closest comma or semi-colon, and removed NPI contexts which contain the most common DEOs like not. We further removed all empty NPI contexts or those which only contain other punctuation. After this filtering, there were 26696 NPI contexts in BLLIP and 211041 NPI contexts in AFP, using the same list of 26 NPIs defined by DLD09.

We first define an automatic measure of performance that is common in information retrieval. We use average precision to quantify how well a system separates DEOs from non-DEOs. Given a list of known DEOs, G, and non-DEOs, the average precision of a ranked list of items, X, is defined by the following equation:

    AP(X) = (Σ_{k=1}^{n} P(X_{1...k}) · 1(x_k ∈ G)) / |G|,   (24)

where P(X_{1...k}) is the precision of the first k items and 1(x_k ∈ G) is an indicator function which is 1 if x_k is in the gold standard list of DEOs and 0 otherwise.

DLD09 simply evaluated the top 150 output DEO candidates by their systems, and qualitatively judged the precision of the top-k candidates at various values of k up to 150. Average precision can be seen as a generalization of this evaluation procedure that is sensitive to the ranking of DEOs and non-DEOs. For development purposes, we use the list of 150 annotations by DLD09. Of these, 90 were DEOs, 30 were not, and 30 were classified as other (they were either difficult to classify, or were other types of non-veridical operators like comparatives or conditionals). We discarded the 30 other items and ignored all items not in the remaining 120 items when evaluating a ranked list of DEO candidates. We call this measure AP120.

In addition, we annotated DEO candidates from the top-150 rankings produced by our certainty-
based heuristic on BLLIP and also by the distillation and heuristic methods on AFP, in order to better evaluate the final output of the methods. This produced an additional 68 DEOs (narrowly defined) (Figure 4), 58 non-DEOs, and 31 other items[4]. Adding the DEOs and non-DEOs we found to the 120 items from above, we have an expanded list of 246 items to rank, and a corresponding average precision which we call AP246.

    absolve, abstain, banish, bereft, boycott, caution, clear, coy, delay,
    denial, desist, devoid, disavow, discount, dispel, disqualify, downplay,
    exempt, exonerate, foil, forbid, forego, impossible, inconceivable,
    irrespective, limit, mitigate, nip, noone, omit, outweigh, precondition,
    pre-empt, prerequisite, refute, remove[5], repel, repulse, scarcely,
    scotch, scuttle, seldom, sensitive, shy, sidestep, snuff, thwart, waive,
    zero-tolerance

Figure 4: Lemmata of DEOs identified in this work not found by DLD09.

[4] The complete list will be made publicly available.
[5] We disagree with DLD09 that remove is not downward-entailing; e.g., The detergent removed stains from his clothing. ⊨ The detergent removed stains from his shirts.

We employ the frequency cut-offs used by DLD09 for sparsity reasons. A word-type must appear at least 10 times in an NPI context and 150 times in the corpus overall to be considered. We treat BLLIP as a development corpus and use AP120 on AFP to determine the number of iterations to run our heuristic (5 iterations for BLLIP and 13 iterations for AFP). We run EM/distillation for one iteration in development and testing, because more iterations hurt performance, as explained in Section 3.

We first report the AP120 results of our experiments on the BLLIP corpus (Table 1, second column). Our method outperforms both EM/distillation and the baseline method. These results are replicated on the final test set from AFP using the full set of annotations AP246 (Table 1, third column). Note that the scores are lower when using all the annotations because there are more non-DEOs relative to DEOs in this list, making the ranking task more challenging.

    Method        BLLIP AP120    AFP AP246
    Baseline      .879           .734
    Distillation  .946           .785
    This work     .955           .809

Table 1: Average precision results on the BLLIP and AFP corpora.

A better understanding of the algorithms can be obtained by examining the data likelihood and the classification certainty at each iteration of the algorithms (Figure 5). Whereas EM/distillation maximizes the former expression, the certainty-based heuristic method actually decreases data likelihood for the first couple of iterations before increasing it again. In terms of classification certainty, EM/distillation converges to a lower classification certainty score compared to our heuristic method. Thus, our method better captures the assumption of one DEO per NPI context.

6 Bootstrapping to Co-Learn NPIs and DEOs

The above experiments show that the heuristic method outperforms the EM/distillation method given a list of NPIs. We would like to extend this result to novel domains, corpora, and languages. DLD09 and DL10 proposed the following bootstrapping algorithm for co-learning NPIs and DEOs given a much smaller list of NPIs as a seed set.

1. Begin with a small set of seed NPIs
2. Iterate:
   (a) Use the current list of NPIs to learn a list of DEOs
   (b) Use the current list of DEOs to learn a list of NPIs

Interestingly, DL10 report that while this method works in Romanian data, it does not work in the English BLLIP corpus. They speculate that the reason might be due to the nature of the English DEO any, which can occur in all classes of DE contexts according to an analysis by Haspelmath (1997). Further, they find that in Romanian, distillation does not perform better than the baseline method during Step (2a). While this linguistic explanation may certainly be a factor, we raise
a second possibility that the distillation algorithm itself may be responsible for these results. As evidence, we show that the heuristic algorithm is able to work in English with just the single seed NPI any, and in fact the bootstrapping approach in conjunction with our heuristic even outperforms the above approaches when using a static list of NPIs.

Figure 5: Log likelihood and classification certainty probabilities of NPI contexts in two corpora. (a) Data log likelihood. (b) Log classification certainty probabilities. Thinner lines near the top are for BLLIP; thicker lines for AFP. Blue dotted: baseline; red dashed: distillation; green solid: our certainty-based heuristic method. P(X⃗ | y) probabilities are not included since they would only result in a constant offset in the log domain.

In particular, we use the methods described in the previous sections for Step (2a), and the following ratio to rank NPI candidates in Step (2b), corresponding to the baseline method to detect DEOs in reverse:

    T(x) = (count_D(x) / tokens(D)) / (count_C(x) / tokens(C)).   (25)

Here, count_D(x) refers to the number of occurrences of NPI candidate x in DEO contexts D, defined to be the words to the right of a DEO operator up to a comma or semi-colon. We do not use the EM/distillation or heuristic methods in Step (2b). Learning NPIs from DEOs is a much harder problem than learning DEOs from NPIs. Because DEOs (and other non-veridical operators) license NPIs, the majority of occurrences of NPIs will be in the context of a DEO, modulo ambiguity of DEOs such as the free-choice any and other spurious correlations such as piggybackers as discussed earlier. In the other direction, it is not the case that DEOs always or nearly always appear in the context of an NPI. Rather, the most common collocations of DEOs are the selectional preferences of the DEO, such as common arguments to verbal DEOs, prepositions that are part of the subcategorization of the DEO, and words that together with the surface form of the DEO comprise an idiomatic expression or multi-word expression. Further, NPIs are more likely to be composed of multiple words, while many DEOs are single words, possibly with PP subcategorization requirements which can be filled in post hoc. Because of these issues, we cannot trust the initialization to learn NPIs nearly as much as with DEOs, and cannot use the distillation or certainty methods for this step. Rather, the hope is that learning a noisy list of pseudo-NPIs, which often occur in negative contexts but may not actually be NPIs, can still improve the performance of DEO detection.

There are a number of parameters to the method which we tuned to the BLLIP corpus using AP120. At the end of Step (2a), we use the current top 25 DEOs plus 5 per iteration as the DEO list for the next step. To the initial seed NPI of
any, we add the top 5 ranking NPI candidates at the end of Step (2b) in each subsequent iteration. We ran the bootstrapping algorithm for 11 iterations for all three algorithms. The final evaluation was done on AFP using AP246.

The results show that bootstrapping can indeed improve performance, even in English (Table 2). Using bootstrapping to co-learn NPIs and DEOs actually results in better performance than specifying a static list of NPIs. The certainty-based heuristic in particular achieves gains with bootstrapping in both corpora, in contrast to the baseline and distillation methods. Another factor that we found to be important is to add a sufficient number of NPIs to the NPI list each iteration, as adding too few NPIs results in only a small change in the NPI contexts available for DEO detection. DL10 only added one NPI per iteration, which may explain why they did not find any improvement with bootstrapping in English. It also appears that learning the pseudo-NPIs does not hurt performance in detecting DEOs, and further, that a number of true NPIs are learned by our method (Figure 6).

    Method        BLLIP AP120      AFP AP246
    Baseline      .889 (+.010)     .739 (+.005)
    Distillation  .930 (−.016)     .804 (+.019)
    This work     .962 (+.007)     .821 (+.012)

Table 2: Average precision results with bootstrapping on the BLLIP and AFP corpora. Absolute gain in average precision compared to using a fixed list of NPIs given in brackets.

    anymore, anything, anytime, avail, bother, bothered, budge, budged,
    countenance, faze, fazed, inkling, iota, jibe, mince, nor, whatsoever,
    whit

Figure 6: Probable NPIs found by bootstrapping using the certainty-based heuristic method.

7 Conclusion

We have proposed a novel unsupervised method for discovering downward-entailing operators from raw text based on their co-occurrence with negative polarity items. Unlike the distillation method of DLD09, which we show to be an instance of EM prior re-estimation, our method directly addresses the issue of piggybackers which spuriously correlate with NPIs but are not downward-entailing. This is achieved by maximizing the posterior classification certainty of the corpus in a way that respects the initialization, rather than maximizing the data likelihood as in EM/distillation. Our method outperforms distillation and a baseline method on two corpora as well as in a bootstrapping setting where NPIs and DEOs are jointly learned. It achieves the best performance in the bootstrapping setting, rather than when using a fixed list of NPIs. The performance of our algorithm suggests that it is suitable for other corpora and languages.

Interesting future research directions include detecting DEOs of more than one word as well as distinguishing the particular word sense and subcategorization that is downward-entailing. Another problem that should be addressed is the scope of the downward entailment, generalizing work being done in detecting the scope of negation (Councill et al., 2010, for example).

Acknowledgments

We would like to thank Cristian Danescu-Niculescu-Mizil for his help with replicating his results on the BLLIP corpus. This project was supported by the Natural Sciences and Engineering Research Council of Canada.

References

Luisa Bentivogli, Peter Clark, Ido Dagan, Hoa T. Dang, and Danilo Giampiccolo. 2010. The Sixth PASCAL Recognizing Textual Entailment Challenge. In The Text Analysis Conference (TAC 2010).

Isaac G. Councill, Ryan McDonald, and Leonid Velikovich. 2010. What's great and what's not: Learning to classify the scope of negation for improved sentiment analysis. In Proceedings of the Workshop on Negation and Speculation in Natural Language Processing, pages 51–59. Association for Computational Linguistics.

Cristian Danescu-Niculescu-Mizil and Lillian Lee. 2010. Don't have a clue?: Unsupervised co-learning of downward-entailing operators. In Proceedings of the ACL 2010 Conference Short Papers, pages 247–252. Association for Computational Linguistics.

Cristian Danescu-Niculescu-Mizil, Lillian Lee, and Richard Ducott. 2009. Without a doubt?: Unsupervised discovery of downward-entailing oper-
ators. In Proceedings of Human Language Tech-
nologies: The 2009 Annual Conference of the North
American Chapter of the Association for Computa-
tional Linguistics.
William A. Gale, Kenneth W. Church, and David Yarowsky. 1992. One sense per discourse. In Proceedings of the Workshop on Speech and Natural Language, pages 233–237. Association for Computational Linguistics.

Anastasia Giannakidou. 2002. Licensing and sensitivity in polarity items: from downward entailment to nonveridicality. CLS, 38:29–53.

Martin Haspelmath. 1997. Indefinite pronouns. Oxford University Press.

Jack Hoeksema. 1997. Corpus study of negative polarity items. IV–V Jornades de corpus linguistics 1996–1997.

William A. Ladusaw. 1980. On the notion affective in the analysis of negative-polarity items. Journal of Linguistic Research, 1(2):1–16.

Timm Lichte and Jan-Philipp Soehn. 2007. The retrieval and classification of negative polarity items using statistical profiles. Roots: Linguistics in Search of Its Evidential Base, pages 249–266.

Bill MacCartney and Christopher D. Manning. 2008. Modeling semantic containment and exclusion in natural language inference. In Proceedings of the 22nd International Conference on Computational Linguistics.

Frank Richter, Fabienne Fritzinger, and Marion Weller. 2010. Who can see the forest for the trees? Extracting multiword negative polarity items from dependency-parsed text. Journal for Language Technology and Computational Linguistics, 25:83–110.

Ton van der Wouden. 1997. Negative Contexts: Collocation, Polarity and Multiple Negation. Routledge.
Elliphant: Improved Automatic Detection of
Zero Subjects and Impersonal Constructions in Spanish
Subject ellipsis is the omission of the subject in a sentence. We consider not only missing referential subjects (zero subjects) as manifestations of ellipsis, but also non-referential impersonal constructions.

Various natural language processing (NLP) tasks benefit from the identification of elliptical subjects, primarily anaphora resolution (Mitkov, 2002) and co-reference resolution (Ng and Cardie, 2002). The difficulty in detecting missing subjects and non-referential pronouns has been acknowledged since the first studies on

- The first ML based approach to this problem in Spanish, and a thorough analysis regarding features, learnability, genre and errors.

- The best performing algorithms to automatically detect explicit subjects and impersonal constructions in Spanish.

This work was partially funded by a La Caixa grant for master students.

The remainder of the paper is organized as follows. Section 2 describes the classes of Spanish subjects, while Section 3 provides a literature review. Section 4 describes the creation and annotation of the corpus, and in Section 5 the machine learning (ML) method is presented. The analysis of the features, the learning curves, the
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 706–715, Avignon, France, April 23-27, 2012. © 2012 Association for Computational Linguistics
genre impact and the error analysis are all detailed in Section 6. Finally, in Section 7, conclusions are drawn and plans for future work are discussed. This work is an extension of the first author's master's thesis (Rello, 2010), and a preliminary version of the algorithm was presented in Rello et al. (2010).

2 Classes of Spanish Subjects

Literature related to ellipsis in NLP (Ferrandez and Peral, 2000; Rello and Illisei, 2009a; Mitkov, 2010) and linguistic theory (Bosque, 1989; Brucart, 1999; Real Academia Espanola, 2009) has served as a basis for establishing the classes of this work.

Explicit subjects are phonetically realized and their syntactic position can be pre-verbal or post-verbal. In the case of post-verbal subjects (a), the syntactic position is restricted by some conditions (Real Academia Espanola, 2009).

(a) Careceran de validez las disposiciones que contradigan otra de rango superior.1
The dispositions which contradict higher range ones will not be valid.

1 All the examples provided are taken from our corpus. In the examples, explicit subjects are presented in italics. Zero subjects are represented by the symbol ∅, and in the English translations the subjects which are elided in Spanish are marked with parentheses. Impersonal constructions are not explicitly indicated.

Zero subjects (b) appear as the result of a nominal ellipsis. That is, a lexical element (the elliptic subject), which is needed for the interpretation of the meaning and the structure of the sentence, is elided; therefore, it can be retrieved from its context. The elision of the subject can affect the entire noun phrase and not just the noun head when a definite article occurs (Brucart, 1999).

(b) Fue refrendada por el pueblo espanol.
(It) was countersigned by the people of Spain.

The class of impersonal constructions is formed by impersonal clauses (c) and reflexive impersonal clauses with particle se (d) (Real Academia Espanola, 2009).

(c) No hay matrimonio sin consentimiento.
(There is) no marriage without consent.

(d) Se estara a lo que establece el apartado siguiente.
(It) will be what is established in the next section.

3 Related Work

Identification of non-referential pronouns, although a crucial step in co-reference and anaphora resolution systems (Mitkov, 2010),2 has been applied only to the pleonastic it in English (Evans, 2001; Boyd et al., 2005; Bergsma et al., 2008) and expletive pronouns in French (Danlos, 2005). Machine learning methods are known to perform better than rule-based techniques for identifying non-referential expressions (Boyd et al., 2005). However, there is some debate as to which approach may be optimal in anaphora resolution systems (Mitkov and Hallett, 2007).

2 In zero anaphora resolution, the identification of zero anaphors first requires that they be distinguished from non-referential impersonal constructions (Mitkov, 2010).

Both English and French texts use an explicit word, with some grammatical information (a third person pronoun), which is non-referential (Mitkov, 2010). By contrast, in Spanish, non-referential expressions are not realized by expletive or pleonastic pronouns but rather by a certain kind of ellipsis. For this reason, it is easy to mistake them for zero pronouns, which are, in fact, referential.

Previous work on detecting Spanish subject ellipsis focused on distinguishing verbs with explicit subjects from verbs with zero subjects (zero pronouns), using rule-based methods (Ferrandez and Peral, 2000; Rello and Illisei, 2009b). The Ferrandez and Peral (2000) algorithm outperforms the Rello and Illisei (2009b) approach, with 57% accuracy in identifying zero subjects. In Ferrandez and Peral (2000), the implementation of a zero subject identification and resolution module forms part of an anaphora resolution system.

ML based studies on the identification of explicit non-referential constructions in English present accuracies of 71% (Evans, 2001), 87.5% (Bergsma et al., 2008) and 88% (Boyd et al., 2005), while 97.5% is achieved for French (Danlos, 2005). However, in these languages, non-referential constructions are explicit and not omitted, which makes this task more challenging for Spanish.

4 Corpus

We created and annotated a corpus composed of legal texts (law) and health texts (psychiatric
papers) originally written in peninsular Spanish. The corpus is named after its annotated content: Explicit Subjects, Zero Subjects and Impersonal Constructions (ESZIC es Corpus).

To the best of our knowledge, the existing corpora annotated with elliptical subjects belong to other genres. The Blue Book (handbook) and Lexesp (journalistic texts) used in Ferrandez and Peral (2000) contain zero subjects but not impersonal constructions. On the other hand, the Spanish AnCora corpus, based on journalistic texts, includes zero pronouns and impersonal constructions (Recasens and Marti, 2010), while the Z-corpus (Rello and Illisei, 2009b) comprises legal, instructional and encyclopedic texts but has no annotated impersonal constructions.

The ESZIC corpus contains a total of 6,827 verbs including 1,793 zero subjects. Except for AnCora-ES, with 10,791 elliptic pronouns, our corpus is larger than the ones used in previous approaches: about 1,830 verbs including zero and explicit subjects in Ferrandez and Peral (2000) (the exact number is not mentioned in the paper) and 1,202 zero subjects in Rello and Illisei (2009b).

The corpus was parsed by Connexor's Machinese Syntax (Connexor Oy, 2006), which returns lexical and morphological information as well as the dependency relations between words by employing a functional dependency grammar (Tapanainen and Jarvinen, 1997).

To annotate our corpus we created an annotation tool that extracts the finite clauses, and the annotators assign to each example one of the defined annotation tags. Two volunteer graduate students of linguistics annotated the verbs after one training session. The annotations of a third volunteer with the same profile were used to compute the inter-annotator agreement. During the annotation phase, we evaluated the adequacy and clarity of the annotation guidelines and established a typology of the arising borderline cases, which is included in the annotation guidelines.

Table 1 shows the linguistic and formal criteria used to identify the chosen categories that served as the basis for the corpus annotation. For each tag, in addition to the two criteria that are crucial for identifying subject ellipsis ([± elliptic] and [± referential]), a combination of syntactic, semantic and discourse knowledge is also encoded during the annotation. The linguistic motivation for each of the three categories is shown against the thirteen annotation tags to which they belong (Table 1). Afterwards, each of the tags is grouped into one of the three main classes.

- Explicit subjects: [- elliptic, + referential].

- Zero subjects: [+ elliptic, + referential].

- Impersonal constructions: [+ elliptic, - referential].

Of these annotated verbs, 71% have an explicit subject, 26% have a zero subject and 3% belong to an impersonal construction (see Table 2).

Number of instances   Legal   Health   All
Explicit subjects     2,739   2,116    4,855
Zero subjects           619   1,174    1,793
Impersonals              71     108      179
Total                 3,429   3,398    6,827

Table 2: Instances per class in ESZIC Corpus.

To measure inter-annotator reliability we use the Fleiss Kappa statistical measure (Fleiss, 1971). We extracted 10% of the instances of each of the texts of the corpus, covering the two genres.

Fleiss Kappa       Legal   Health   All
Two Annotators     0.934   0.870    0.902
Three Annotators   0.925   0.857    0.891

Table 3: Inter-annotator Agreement.

In Table 3 we present the Fleiss kappa inter-annotator agreement for two and three annotators. These results suggest that the annotation is reliable, since it is common practice among researchers in computational linguistics to consider 0.8 as a minimum value of acceptance (Artstein and Poesio, 2008).

5 Machine Learning Approach

We opted for an ML approach given that our previous rule-based methodology improved only 0.02 over the 0.55 F-measure of a simple baseline (Rello and Illisei, 2009b). Besides, ML based methods for the identification of explicit non-referential constructions in English appear to perform better than rule-based ones (Boyd et al., 2005).
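Fleiss' kappa (Fleiss, 1971), the agreement measure reported in Table 3, compares the observed inter-annotator agreement with the agreement expected by chance under the empirical category distribution. A minimal sketch of the computation follows; the toy count matrix is hypothetical and is not the ESZIC annotation data.

```python
# Fleiss' kappa for an N-items x k-categories matrix of rating counts.
# Toy counts for illustration -- NOT the ESZIC annotation data.

def fleiss_kappa(counts):
    """counts[i][j] = number of annotators who put item i in category j.
    Every item must be rated by the same number of annotators n."""
    N = len(counts)
    n = sum(counts[0])                      # ratings per item
    k = len(counts[0])
    # proportion of all assignments falling into each category
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    # per-item agreement
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N                    # observed agreement
    P_e = sum(p * p for p in p_j)           # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Three annotators, three classes (explicit, zero, impersonal); five toy items.
toy = [
    [3, 0, 0],
    [0, 3, 0],
    [2, 1, 0],
    [3, 0, 0],
    [0, 0, 3],
]
print(round(fleiss_kappa(toy), 3))  # -> 0.779
```

Values above 0.8, as obtained in Table 3, are conventionally taken to indicate reliable annotation.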
Table 1: Linguistic criteria per annotation tag. Criteria columns, grouped by linguistic information: Phonetic realization (elliptic noun phrase; elliptic noun phrase head), Syntactic category (nominal subject), Verbal diathesis (active), Semantic (active participant), Discourse interpretation (referential subject).

Annotation category: Explicit subject
  Explicit subject: + + + +
  Reflex passive subject: + + +
  Passive subject: + +

Annotation category: Zero subject
  Omitted subject: + + + + +
  Omitted subject head: + + + + +
  Non-nominal subject: + + +
  Reflex passive omitted subject: + + + +
  Reflex pass. omitted subject head: + + + +
  Reflex pass. non-nominal subject: + +
  Passive omitted subject: + + +
  Pass. non-nominal subject: +

Annotation category: Impersonal construction
  Reflex imp. clause (with se): n/a n/a
  Imp. construction (without se): n/a + n/a
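The three top-level classes are fully determined by the two crucial criteria, [± elliptic] and [± referential], so the grouping of tags into classes can be stated as a simple lookup. This sketch mirrors the class definitions given in the text; the fourth combination, [- elliptic, - referential], corresponds to no class and is deliberately absent.

```python
# The three classes as a function of the two crucial criteria from Table 1.
# Sketch of the paper's class definitions, not the annotation tool itself.

CLASS_BY_CRITERIA = {
    # (elliptic, referential) -> class
    (False, True): "explicit subject",
    (True, True): "zero subject",
    (True, False): "impersonal construction",
}

def classify(elliptic: bool, referential: bool) -> str:
    """Map the two crucial annotation criteria to the top-level class."""
    return CLASS_BY_CRITERIA[(elliptic, referential)]

print(classify(True, False))  # -> impersonal construction
```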
Feature       Definition                                   Value
1  PARSER     Parsed subject                               True, False
2  CLAUSE     Clause type                                  Main, Rel, Imp, Prop, Punct
3  LEMMA      Verb lemma                                   Parser's lemma tag
4  NUMBER     Verb morphological number                    SG, PL
5  PERSON     Verb morphological person                    P1, P2, P3
6  AGREE      Agreement in person, number, tense           FTFF, TTTT, FFFF, TFTF, TTFF, FTFT, FTTF, TFTT,
              and mood                                     FFFT, TTTF, FFTF, TFFT, FFTT, FTTT, TFFF, TTFT
7  NHPREV     Previous noun phrases                        Number of noun phrases previous to the verb
8  NHTOT      Total noun phrases                           Number of noun phrases in the clause
9  INF        Infinitive                                   Number of infinitives in the clause
10 SE         Spanish particle se                          True, False
11 A          Spanish preposition a                        True, False
12 POSpre     Four parts of speech previous to the verb    292 different values combining the parser's POS tags
13 POSpos     Four parts of speech following the verb      280 different values combining the parser's POS tags
14 VERBtype   Type of verb: copulative, impersonal,        CIPX, XIXX, XXXT, XXPX, XXXI, CIXX, XXPT, XIPX,
              pronominal, transitive and intransitive      XIPT, XXXX, XIXI, CXPI, XXPI, XIPI, CXPX
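One way some of the features in the table above could be assembled per verb instance is sketched below. This is an illustrative simplification: flat token dictionaries and precomputed clause-level counts stand in for Connexor's Machinese Syntax output, and the AGREE, CLAUSE, NUMBER, PERSON and VERBtype features are omitted.

```python
# Sketch of encoding a few of the features for one verb instance.
# Token fields are a simplification -- the paper derives them from the
# Connexor parse, which is not reproduced here.

def encode_instance(tokens, verb_idx, clause):
    """tokens: list of dicts with 'pos' and 'lemma'; verb_idx: index of the
    finite verb; clause: dict with precomputed clause-level counts."""
    verb = tokens[verb_idx]
    return {
        "LEMMA": verb["lemma"],
        # POSpre / POSpos: POS 4-grams before and after the verb
        "POSpre": "-".join(t["pos"] for t in tokens[max(0, verb_idx - 4):verb_idx]),
        "POSpos": "-".join(t["pos"] for t in tokens[verb_idx + 1:verb_idx + 5]),
        # SE: particle 'se' with at most one token between it and the verb
        "SE": any(t["lemma"] == "se"
                  for t in tokens[max(0, verb_idx - 2):verb_idx + 3]
                  if t is not verb),
        # A: preposition 'a' anywhere in the clause
        "A": any(t["lemma"] == "a" for t in tokens),
        "NHPREV": clause["np_before_verb"],
        "NHTOT": clause["np_total"],
        "INF": clause["infinitives"],
    }

toks = [{"pos": "PRON", "lemma": "se"},
        {"pos": "V", "lemma": "admitir"},
        {"pos": "PREP", "lemma": "a"},
        {"pos": "DET", "lemma": "el"},
        {"pos": "N", "lemma": "alumno"}]
feats = encode_instance(toks, 1, {"np_before_verb": 0, "np_total": 1, "infinitives": 0})
print(feats["SE"], feats["A"])  # -> True True
```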
tense, and mood with the preceding verb in the sentence and also with the main verb of the sentence.3

7-9 NHPREV, NHTOT, INF: the candidates for the subject of the clause are represented by the number of noun phrases in the clause that precede the verb, the total number of noun phrases in the clause, and the number of infinitive verbs in the clause.

10 SE: a binary feature encoding the presence or absence of the Spanish particle se when it occurs immediately before or after the verb or with a maximum of one token lying between the verb and itself. Particle se occurs in passive reflex clauses with zero subjects and in some impersonal constructions.

11 A: a binary feature encoding the presence or absence of the Spanish preposition a in the clause, since the distinction between passive reflex clauses with zero subjects and impersonal constructions sometimes relies on the appearance of the preposition a (to, for, etc.). For instance, example (e) is a passive reflex clause containing a zero subject while example (f) is an impersonal construction.

(e) Se admiten los alumnos que reunan los requisitos.
(They) accept the students who fulfill the requirements.

(f) Se admite a los alumnos que reunan los requisitos.
(It) is accepted for the students who fulfill the requirements.

12-13 POSpre, POSpos: the part of speech (POS) of eight tokens, that is, the 4-grams preceding and the 4-grams following the instance.

14 VERBtype: the verb is classified as copulative, pronominal, transitive, or with an impersonal use.4 Verbs belonging to more than one class are also accommodated with different feature values for each of the possible combinations of verb type.

3 In Spanish, when a finite verb appears in a subordinate clause, its tense and mood can assist in the recognition of these features in the verb of the main clause and help to enforce some restrictions required by this verb, especially when both verbs share the same referent as subject.
4 We used four lists provided by Molino de Ideas s.a. containing 11,060 different verb lemmas belonging to the Royal Spanish Academy Dictionary (Real Academia Espanola, 2001).

5.2 Evaluation

To determine the most accurate algorithm for our classification task, two comparisons of learning algorithms implemented in WEKA (Witten and Frank, 2005) were carried out. Firstly, the classification was performed using 20% of the training instances. Secondly, the seven highest performing classifiers were compared using 100% of the
training data and ten-fold cross-validation. The corpus was partitioned into training and test sets using ten-fold cross-validation for randomly ordered instances in both cases. The lazy learning classifier K* (Cleary and Trigg, 1995), using a blending parameter of 40%, was the best performing one, with an accuracy of 87.6% for ten-fold cross-validation. K* differs from other instance-based learners in that it computes the distance between two instances using a method motivated by information theory, where a maximum entropy-based distance function is used (Cleary and Trigg, 1995). Table 5 shows the results for each class using ten-fold cross-validation.

Class            P       R       F       Acc.
Explicit subj.   90.1%   92.3%   91.2%   87.3%
Zero subj.       77.2%   74.0%   75.5%   87.4%
Impersonals      85.6%   63.1%   72.7%   98.8%

Table 5: K* performance (87.6% accuracy for ten-fold cross validation).

In contrast to previous work, the K* algorithm (Cleary and Trigg, 1995) was found to provide the most accurate classification in the current study. Other approaches have employed various classification algorithms, including JRip in WEKA (Muller, 2006), with precision of 74% and recall of 60%, and K-nearest neighbors in TiMBL, both in (Evans, 2001), with precision of 73% and recall of 69%, and in (Boyd et al., 2005), with precision of 82% and recall of 71%.

Since there is no previous ML approach for this task in Spanish, our baselines for the explicit subjects and the zero subjects are the parser output and the previous rule-based work with the highest performance (Ferrandez and Peral, 2000). For the impersonal constructions, the baseline is a simple greedy algorithm that classifies as an impersonal construction every verb whose lemma is categorized as a verb with impersonal use according to the RAE dictionary (Real Academia Espanola, 2001).

Our method outperforms the Connexor parser, which identifies the explicit subjects but makes no distinction between zero subjects and impersonal constructions. Connexor yields 74.9% overall accuracy, and 80.2% and 65.6% F-measure for explicit and elliptic subjects, respectively.

To compare with Ferrandez and Peral (2000), we consider our results without impersonal constructions. We achieve a precision of 87% for explicit subjects compared to their 80%, and a precision of 87% for zero subjects compared to their 98%. The overall accuracy is the same for both techniques, 87.5%, but our results are more balanced. Nevertheless, the approaches and corpora used in both studies are different, and hence it is not possible to do a fair comparison. For example, their corpus has 46% of zero subjects while ours has only 26%.

For impersonal constructions, our method outperforms the RAE baseline (precision 6.5%, recall 77.7%, F-measure 12.0% and accuracy 70.4%). Table 6 summarizes the comparison. The low performance of the RAE baseline is due to the fact that verbs with impersonal use are often ambiguous. For these cases, we first tagged them as ambiguous and then defined additional criteria after analyzing them manually. The resulting annotation criteria are stated in Table 1.

Algorithm     Explicit subjects   Zero subjects   Impersonals
RAE           -                   -               70.4%
Connexor      71.7%               83.0%           -
Ferr./Peral   79.7%               98.4%           -
Elliphant     87.3%               87.4%           98.8%

Table 6: Summary of accuracy comparison with previous work.

6 Analysis

Through these analyses we aim to extract the most effective features and the information that would complement the output of a standard parser to achieve this task. We also examine the learning process of the algorithm to find out how many instances are needed to train it efficiently, and determine how much Elliphant is genre dependent. The analyses indicate that our approach is robust: it performs nearly as well with just six features, has a steep learning curve, and seems to generalize well to other text collections.

6.1 Best Features

We carried out three different experiments to evaluate the most effective group of features, and the features themselves, considering the individual predictive ability of each one along with their degree of redundancy.

Based on the following three feature selection
methods, we can state that there is a complex and balanced interaction between the features.

6.1.1 Grouping Features

In the first experiment we considered the 11 groups of relevant ordered features from the training data, which were selected using each WEKA attribute selection algorithm, and performed the classifications over the complete training data using only the different groups of features selected. The most effective group of six features (NHPREV, PARSER, NHTOT, POSpos, PERSON, LEMMA) was the one selected by WEKA's SymmetricalUncertAttribute technique, which gives an accuracy of 83.5%. The most frequently selected features by all methods are PARSER, POSpos, and NHTOT, and they alone reach an accuracy of 83.6% together. As expected, the two pairs of features that perform best (both 74.8% accuracy) are PARSER with either POSpos or NHTOT.

Based on how frequently each feature is selected by WEKA's attribute selection algorithms, we can rank the features as follows: (1) PARSER, (2) NHTOT, (3) POSpos, (4) NHPREV and (5) LEMMA.

6.1.2 Complex vs. Simple Features

Second, a set of experiments was conducted in which features were selected on the basis of the degree of computational effort needed to generate them. We propose two sets of features. One group corresponds to simple features, whose values can be obtained by trivial exploitation of the tags produced in the parser's output (PARSER, LEMMA, PERSON, POSpos, POSpre). The second group of features, complex features (CLAUSE, AGREE, NHPREV, NHTOT, VERBtype), have values that required the implementation of more sophisticated modules to identify the boundaries of syntactic constituents such as clauses and noun phrases. The accuracy obtained when the classifier exclusively exploits complex features is 82.6%, while for simple features it is 79.9%. No impersonal constructions are identified when only complex features are used.

6.1.3 One-left-out Feature

In the third experiment, to estimate the weight of each feature, classifications were made in which each feature was omitted from the training instances that were presented to the classifier. Omission of all but one of the simple features led to a reduction in accuracy, justifying their inclusion in the training instances. Nevertheless, the majority of features present low informativeness, except for feature A, which does not make any meaningful contribution to the classification. The feature PARSER presents the greatest difference in performance (86.3% total accuracy); however, this is no big loss, considering it is the main feature. Hence, as most features do not bring a significant loss in accuracy, the features need to be combined to improve the performance.

6.2 Learning Analysis

The learning curve of Figure 1 (left) presents the increase in the performance obtained by Elliphant using the training data randomly ordered. The performance reaches its plateau using 90% of the training instances. Using different orderings of the training set, we obtain the same result.

[Figure 1: two line plots; y-axis Precision (%); series: Explicit subjects, Overall, Zero subjects, Impersonal constructions.]

Figure 1: Learning curve for precision, recall and F-measure (left) and with respect to the number of instances of each class (right) for a given percentage of training data.

Figure 1 (right) presents the precision for each class, and overall, in relation to the number of training instances for each one of them. Recall grows similarly to precision. Under all conditions, subjects are classified with a high precision, since the information given by the parser (collected in the features) achieves an accuracy of 74.9% for the identification of explicit subjects.

The impersonal construction class has the fastest learning curve. When utilizing a training set of only 163 instances (90% of the training data), it reaches a precision of 63.2%. The unstable behaviour for impersonal constructions can be attributed to not having enough training data for that class, since impersonals are not frequent in Spanish. On the other hand, the zero subject class is learned more gradually.

The learning curve for the explicit subject class is almost flat due to the great variety of subjects occurring in the training data. In addition, reaching a precision of 92.0% for explicit subjects using just 20% of the training data is far more expensive in terms of the number of training instances (978), as seen in Figure 1 (right). Actually, with just 20% of the training data we can already achieve a precision of 85.9%.

This demonstrates that Elliphant does not need very large sets of expensive training data and is able to reach adequate levels of performance when exploiting far fewer training instances. In fact, we see that we only need a modest set of
annotated instances (fewer than 1,500) to achieve good results.

6.3 Impact of Genre

To examine the influence of the different text genres on this method, we divided our training data into two subgroups belonging to different genres (legal and health) and analyzed the differences. A comparative evaluation using ten-fold cross-validation over the two subgroups shows that Elliphant is more successful when classifying instances of explicit subjects in legal texts (89.8% accuracy) than health texts (85.4% accuracy). This may be explained by the greater uniformity of the sentences in the legal genre compared to ones from the health genre, as well as the fact that there are a larger number of explicit subjects in the legal training data (2,739 compared with 2,116 in the health texts). Further, texts from the health genre present the additional complication of specialized named entities and acronyms, which are used quite frequently. Similarly, better performance in the detection of zero subjects and impersonal sentences in the health texts may be due to their more frequent occurrence and hence greater learnability.

We have also studied the effect of training the classifier on data derived from one genre and testing on instances derived from a different genre. Table 7 shows that instances from legal texts are more homogeneous, as the classifier obtains higher accuracy when testing and training only on legal instances (90.0%). In addition, legal texts are also more informative, because when both legal and health genres are combined as training data, only instances from the health genre show a significantly increased accuracy (93.7%). These results reveal that the health texts are the most heterogeneous ones. In fact, we also found subsets of the legal documents where our method achieves an accuracy of 94.6%, implying more homogeneous texts.

Training/Testing   Legal   Health   All
Legal              90.0%   86.8%    89.3%
Health             86.8%   85.9%    88.7%
All                92.5%   93.7%    87.6%

Table 7: Accuracy of cross-genre training and testing evaluation (ten-fold evaluation).

6.4 Error Analysis

Since the features of the system are linguistically motivated, we performed a linguistic analysis of the erroneously classified instances to find out which patterns are more difficult to classify and which type of information would improve the method (Rello et al., 2011).

We extract the erroneously classified instances of our training data and classify the errors. According to the distribution of the errors per class (Table 8), we take into account the following four classes of errors for the analysis: (a) impersonal constructions classified as zero subjects, (b) impersonal constructions classified as explicit subjects, (c) zero subjects classified as explicit subjects, and (d) explicit subjects classified as zero subjects. The diagonal numbers are the true predicted cases. The classification of impersonal constructions is less balanced than the ones for explicit subjects and zero subjects. Most of the wrongly identified instances are classified as explicit subject, given that this class is the largest one. On the other hand, 25% of the zero subjects are classified as explicit subject, while only 8% of
the explicit subjects are identified as zero subjects.

Class            Zero subj.   Explicit subj.   Impers.
Zero subj.       1327         453 (c)          13
Explicit subj.   368 (d)      4481             6
Impersonals      25 (a)       41 (b)           113

Table 8: Confusion Matrix (ten-fold validation).

For the analysis, we first performed an exploration of the feature values, which allows us to generate smaller samples of the groups of errors for the further linguistic analyses. Then, we explore the linguistic characteristics of the instances by examining the clause in which the instance appears in our corpus. A great variety of different patterns are found. We mention only the linguistic characteristics in the errors which at least double the corpus general trends.

In all groups (a-d) there is a tendency of using the following elements: post-verbal prepositions, auxiliary verbs, future verbal tenses, subjunctive verbal mode, negation, punctuation marks appearing before the verb and the preceding noun phrases, and concessive and adverbial subordinate clauses. In groups (a) and (b) the lemma of the verb may play a relevant role; for instance, the verb haber (there is/are) appears in the errors seven times more than in the training data, while the verb tratar (to be about, to deal with) appears 12 times more. Finally, in groups (c) and (d) we notice the frequent occurrence of idioms which include verbs with impersonal uses, such as es decir (that is to say), and words which can be subjects on their own, i.e. ambos (both) or todo (all).

7 Conclusions and Future Work

In this study we learn which is the most accurate approach for identifying explicit subjects and impersonal constructions in Spanish, and which are the linguistic characteristics and features that help to perform this task. The corpus created is freely available online.5 Our method complements previous work on Spanish anaphora resolution by addressing the identification of non-referential constructions. It outperforms current approaches in explicit subject detection and impersonal constructions, doing better than the parser for every class.

5 ESZIC es Corpus is available at: http:

A possible future avenue to explore could be to combine our approach with Ferrandez and Peral (2000) by employing both algorithms in sequence: first Ferrandez and Peral's algorithm to detect all zero subjects, and then ours to identify explicit subjects and impersonals. Assuming that the same accuracy could be maintained, on our data set the combined performance could potentially be in the range of 95%.

Future research goals are the extrinsic evaluation of our system by integrating it in NLP tasks, and its adaptation to other Romance pro-drop languages. Finally, we believe that our ML approach could be improved, as it is the first attempt of this kind.

Acknowledgements

We thank Richard Evans, Julio Gonzalo and the anonymous reviewers for their wise comments.

References

R. Artstein and M. Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596.

S. Bergsma, D. Lin, and R. Goebel. 2008. Distributional identification of non-referential pronouns. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL/HLT-08), pages 10–18.

I. Bosque. 1989. Clases de sujetos tacitos. In Julio Borrego Nieto, editor, Philologica: homenaje a Antonio Llorente, volume 2, pages 91–112. Servicio de Publicaciones, Universidad Pontificia de Salamanca, Salamanca.

A. Boyd, W. Gegg-Harrison, and D. Byron. 2005. Identifying non-referential it: a machine learning approach incorporating linguistically motivated patterns. In Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing, 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), pages 40–47.

J. M. Brucart. 1999. La elipsis. In I. Bosque and V. Demonte, editors, Gramatica descriptiva de la lengua espanola, volume 2, pages 2787–2863. Espasa-Calpe, Madrid.

N. Chomsky. 1981. Lectures on Government and Binding. Mouton de Gruyter, Berlin, New York.

J.G. Cleary and L.E. Trigg. 1995. K*: an instance-based learner using an entropic distance measure. In Proceedings of the 12th International Conference
//luzrello.com/Projects.html. on Machine Learning (ICML-95), pages 108114.
714
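As a worked check on the figures in Table 8, the overall ten-fold accuracy can be read off the confusion matrix; a minimal sketch (class order follows the table rows, and accuracy is the diagonal mass over the total, which holds whichever axis carries the gold labels):

```python
# Overall accuracy derived from the confusion matrix in Table 8
# (classes in row order: Zero subj., Explicit subj., Impersonals).
confusion = [
    [1327, 453, 13],   # Zero subj.
    [368, 4481, 6],    # Explicit subj.
    [25, 41, 113],     # Impersonals
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total  # 5921 / 6827, about 0.867
```

The potential 95% mentioned above refers to the combined pipeline, not to this matrix alone.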
Connexor Oy. 2006. Machinese language model.

L. Danlos. 2005. Automatic recognition of French expletive pronoun occurrences. In Robert Dale, Kam-Fai Wong, Jiang Su, and Oi Yee Kwong, editors, Natural Language Processing: Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP-05), pages 73-78, Berlin, Heidelberg, New York. Springer. Lecture Notes in Computer Science, Vol. 3651.

R. Evans. 2001. Applying machine learning: toward an automatic classification of it. Literary and Linguistic Computing, 16(1):45-57.

A. Ferrandez and J. Peral. 2000. A computational approach to zero-pronouns in Spanish. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000), pages 166-172.

J. L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378-382.

G. Hirst. 1981. Anaphora in Natural Language Understanding: A Survey. Springer-Verlag.

J. Hobbs. 1977. Resolving pronoun references. Lingua, 44:311-338.

R. Mitkov and C. Hallett. 2007. Comparing pronoun resolution algorithms. Computational Intelligence, 23(2):262-297.

R. Mitkov. 2002. Anaphora Resolution. Longman, London.

R. Mitkov. 2010. Discourse processing. In Alexander Clark, Chris Fox, and Shalom Lappin, editors, The Handbook of Computational Linguistics and Natural Language Processing, pages 599-629. Wiley Blackwell, Oxford.

C. Muller. 2006. Automatic detection of nonreferential it in spoken multi-party dialog. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), pages 49-56.

V. Ng and C. Cardie. 2002. Identifying anaphoric and non-anaphoric noun phrases to improve coreference resolution. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-02), pages 1-7.

Real Academia Espanola. 2001. Diccionario de la lengua espanola. Espasa-Calpe, Madrid, 22nd edition.

Real Academia Espanola. 2009. Nueva gramatica de la lengua espanola. Espasa-Calpe, Madrid.

M. Recasens and E. Hovy. 2009. A deeper look into features for coreference resolution. In Lalitha Devi Sobha, Antonio Branco, and Ruslan Mitkov, editors, Anaphora Processing and Applications: Proceedings of the 7th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-09), pages 29-42. Springer, Berlin, Heidelberg, New York. Lecture Notes in Computer Science, Vol. 5847.

M. Recasens and M.A. Marti. 2010. AnCora-CO: Coreferentially annotated corpora for Spanish and Catalan. Language Resources and Evaluation, 44(4):315-345.

L. Rello and I. Illisei. 2009a. A comparative study of Spanish zero pronoun distribution. In Proceedings of the International Symposium on Data and Sense Mining, Machine Translation and Controlled Languages, and their Application to Emergencies and Safety Critical Domains (ISMTCL-09), pages 209-214. Presses Universitaires de Franche-Comte, Besancon.

L. Rello and I. Illisei. 2009b. A rule-based approach to the identification of Spanish zero pronouns. In Student Research Workshop, International Conference on Recent Advances in Natural Language Processing (RANLP-09), pages 209-214.

L. Rello, P. Suarez, and R. Mitkov. 2010. A machine learning method for identifying non-referential impersonal sentences and zero pronouns in Spanish. Procesamiento del Lenguaje Natural, 45:281-287.

L. Rello, G. Ferraro, and A. Burga. 2011. Error analysis for the improvement of subject ellipsis detection. Procesamiento del Lenguaje Natural, 47:223-230.

L. Rello. 2010. Elliphant: A machine learning method for identifying subject ellipsis and impersonal constructions in Spanish. Master's thesis, Erasmus Mundus, University of Wolverhampton & Universitat Autonoma de Barcelona.

P. Tapanainen and T. Jarvinen. 1997. A non-projective dependency parser. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP-97), pages 64-71.

I. H. Witten and E. Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, London, 2nd edition.

S. Zhao and H.T. Ng. 2007. Identification and resolution of Chinese zero pronouns: a machine learning approach. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL-07), pages 541-550.
Validation of sub-sentential paraphrases acquired
from parallel monolingual corpora

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 716-725,
Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
phrases can be validated using information that characterizes paraphrases, in complement to the set of techniques that proposed them. We propose to implement this as a bi-class classification problem (i.e. paraphrase vs. not paraphrase), allowing any paraphrase acquisition technique to be easily integrated into the combination system. In this article, we report experiments on two languages, English and French, with 5 individual techniques based on a) statistical word alignment models, b) translational equivalence, c) hand-coded rules of term variation, d) syntactic similarity, and e) edit distance on word sequences. We used parallel monolingual corpora obtained via multiple translation from a single language as our sources of related sentences, and a large set of features ranging from surface to contextual similarity measures. Relative improvements in F-measure close to 18% are obtained on both languages over the best performing techniques.

The remainder of this article is organized as follows. We first briefly review previous work on sub-sentential paraphrase acquisition in section 2. We then describe our experimental setting in section 3 and the individual techniques that we have studied in section 4. Section 5 is devoted to our approach for validating paraphrases proposed by individual techniques. Finally, section 6 concludes the article and presents some of our future work in the area of paraphrase acquisition.

2 Related work

The hypothesis that if two words or, by extension, two phrases, occur in similar contexts then they may be interchangeable has been extensively tested. The distributional hypothesis, attributed to Zellig Harris, was for example applied to syntactic dependency paths in the work of Lin and Pantel (2001). Their results take the form of equivalence patterns with two arguments, such as {X asks for Y, X requests Y, X's request for Y, X wants Y, Y is requested by X, ...}.

Using comparable corpora, where the same information probably exists under various linguistic forms, increases the likelihood of finding very close contexts for sub-sentential units. Barzilay and Lee (2003) proposed a multi-sequence alignment algorithm that takes structurally similar sentences and builds a compact lattice representation that encodes local variations. The work by Bhagat and Ravichandran (2008) describes an application of a similar technique on a very large scale.

The hypothesis that two words or phrases are interchangeable if they share a common translation into one or more other languages has also been extensively studied in works on sub-sentential paraphrase acquisition. Bannard and Callison-Burch (2005) described a pivoting approach that can exploit bilingual parallel corpora in several languages. The same technique has been applied to the acquisition of local paraphrasing patterns in Zhao et al. (2008). The work of Callison-Burch (2008) has shown how the monolingual context of a sentence to paraphrase can be used to improve the quality of the acquired paraphrases.

Another approach consists in modelling local paraphrase identification rules. The work of Jacquemin (1999) on the identification of term variants, which exploits rewriting morphosyntactic rules and descriptions of morphological and semantic lexical families, can be extended to extract the various forms corresponding to input patterns from large monolingual corpora.

When parallel monolingual corpora aligned at the sentence level are available (e.g. multiple translations into the same language), the task of sub-sentential paraphrase acquisition can be cast as one of word alignment between two aligned sentences (Cohn et al., 2008). Barzilay and McKeown (2001) applied the distributionality hypothesis on such parallel sentences, and Pang et al. (2003) proposed an algorithm to align sentences by recursive fusion of their common syntactic constituents.

Finally, there has been a recent interest in automatic evaluation of paraphrases (Callison-Burch et al., 2008; Liu et al., 2010; Chen and Dolan, 2011; Metzler et al., 2011).

3 Experimental setting

We used the main aspects of the methodology described by Cohn et al. (2008) for constructing evaluation corpora and assessing the performance of techniques on the task of sub-sentential paraphrase acquisition. Pairs of related sentences are hand-aligned to define a set of reference atomic paraphrase pairs at the level of words or phrases, denoted as Ratom.1

1 Note that in this study we do not distinguish between Sure and Possible alignments, and when reusing annotated corpora we considered all alignments as being correct.
                                           single language  multiple language  video          multiply-translated  news
                                           translation      translation        descriptions   subtitles            headlines
  # tokens                                 4,476            4,630              1,452          2,721                1,908
  # unique tokens                          656              795                357            830                  716
  % aligned tokens (excluding identities)  60.58            48.80              23.82          29.76                14.46
  lexical overlap (tokens)                 77.21            61.03              59.50          32.51                39.63
  lexical overlap (lemmas, content words)  83.77            71.04              64.83          39.54                45.31
  translation edit rate (TER)              0.32             0.55               0.76           0.68                 0.62
  penalized n-gram precision (BLEU)        0.33             0.15               0.13           0.14                 0.39

Table 1: Various indicators of sentence pair comparability for different corpus types. Statistics are reported for French on sets of 100 sentence pairs.
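The exact formulas behind the indicators in Table 1 are not spelled out in the text; as an illustration, a token-level lexical overlap in their spirit could be computed as below. The normalization by the smaller set of token types is our assumption, and the sentence pair is invented:

```python
# Token-level lexical overlap between two related sentences, as a percentage.
# Normalizing by the smaller set of token types is an assumption; Table 1
# does not give the exact formula used.
def lexical_overlap(sent_a: str, sent_b: str) -> float:
    types_a = set(sent_a.lower().split())
    types_b = set(sent_b.lower().split())
    shared = types_a & types_b
    return 100.0 * len(shared) / min(len(types_a), len(types_b))

# Invented pair in the spirit of the multiply-translated corpora.
print(lexical_overlap(
    "the amount of foreign capital actually utilized reached 260 million us dollars",
    "the actually used foreign investment amounted to us$ 0.26 billion"))  # prints 30.0
```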
We conducted a small-scale study to assess different types of corpora of related sentences:

1. single language translation: corpora obtained by several independent human translations of the same sentences (e.g. (Barzilay and McKeown, 2001)).

2. multiple language translation: same as above, but where a sentence is translated from 4 different languages into the same language (Bouamor et al., 2010).

3. video descriptions: descriptions of short YouTube videos obtained via Mechanical Turk (Chen and Dolan, 2011).

4. multiply-translated subtitles: aligned multiple translations of contributed movie subtitles (Tiedemann, 2007).

5. comparable news headlines: news headlines collected from Google News clusters (e.g. (Dolan et al., 2004)).

We collected 100 sentence pairs of each type in French, for which various comparability measures are reported in Table 1. In particular, the "% aligned tokens" row indicates the proportion of tokens from the sentence pairs that could be manually aligned by a native-speaker annotator.2 Obviously, the more common tokens two sentences from a pair contain, the fewer sub-sentential paraphrases may be extracted from that pair. However, high lexical overlap increases the probability that two sentences are indeed paraphrases, and in turn the probability that some of their phrases are paraphrases. Furthermore, the presence of common tokens may serve as a useful clue to guide paraphrase extraction.

For our experiments, we chose to use parallel monolingual corpora obtained by single language translation, the most direct resource type for acquiring sub-sentential paraphrase pairs. This allows us to define acceptable references for the task and to resort to the most consensual evaluation technique for paraphrase acquisition to date. Using such corpora, we expect to be able to extract precise paraphrases (see Table 1), which will be natural candidates for the further validation addressed in section 5.3.

Figure 1 illustrates a reference alignment obtained on a pair of English sentential paraphrases and the list of atomic paraphrase pairs that can be extracted from it, against which acquisition techniques will be evaluated. Note that we do not consider pairs of identical units during evaluation, so we filter them out from the list of reference paraphrase pairs.

The example in Figure 1 shows different cases that point to the inherent complexity of this task, even for human annotators: it could be argued, for instance, that a correct atomic paraphrase pair should be "reached ↔ amounted to" rather than "reached ↔ amounted". Also, aligning independently "260 ↔ 0.26" and "million ↔ billion" is assuredly an error, while the pair "260 million ↔ 0.26 billion" would have been appropriate. A case of alignment that seems non-trivial can be observed in the provided example ("during the entire year ↔ annual"). The above-mentioned reasons explain in part the difficulties in reaching high performance values using such gold standards.

Reference composite paraphrase pairs (denoted as R), obtained by joining adjacent atomic paraphrase pairs from Ratom up to 6 tokens3, will

2 The same annotator hand-aligned the 5*100=500 paraphrase pairs using the YAWAT (Germann, 2008) manual alignment tool.
3 We used standard biphrase extraction heuristics (Koehn et al., 2007): all words from a phrase must be aligned to at least one word from the other and not to words outside, but unaligned words at phrase boundaries are not used.
also be considered when measuring performance. Evaluated techniques have to output atomic candidate paraphrase pairs (denoted as Hatom) from which composite paraphrase pairs (denoted as H) are computed. The usual measures of precision (P), recall (R) and F-measure (F1) can then be defined in the following way (Cohn et al., 2008):

    P = |Hatom ∩ R| / |Hatom|        R = |H ∩ Ratom| / |Ratom|        F1 = 2PR / (P + R)

We conducted experiments using two different corpora in English and French. In each case, a held-out development corpus of 150 sentential paraphrase pairs was used for development and tuning, and all techniques were evaluated on the same test set consisting of 375 sentential paraphrase pairs. For English, we used the MTC corpus described in (Cohn et al., 2008), consisting of multiply-translated Chinese sentences into English, and used as our gold standard both the alignments marked as Sure and Possible. For French, we used the CESTA corpus of news articles4 obtained by translating into French from English.

We used the YAWAT (Germann, 2008) manual alignment tool. Inter-annotator agreement values (averaging with each annotation set as the gold standard) are 66.1 for English and 64.6 for French, which we interpret as acceptable values. Manual inspection of the two corpora reveals that the French corpus tends to contain more literal translations, possibly due to the original languages of the sentences, which are closer to the target language than Chinese is to English.

[Figure 1: alignment grid not reproducible in plain text. Atomic paraphrase pairs extracted from the reference alignments: capital ↔ investment; utilized ↔ used; during the entire year ↔ annual; reached ↔ amounted; 260 ↔ 0.26; million ↔ billion; us dollars ↔ us$.]

Figure 1: Reference alignments for a pair of English sentential paraphrases from the annotation corpus of Cohn et al. (2008) (note that Possible and Sure alignments are not distinguished here) and the list of atomic paraphrase pairs extracted from these alignments.

4 Individual techniques for paraphrase acquisition

As discussed in section 2, the acquisition of sub-sentential paraphrases is a challenging task that has previously attracted a lot of work. In this work, we consider the scenario where sentential paraphrases are available and words and phrases from one sentence can be aligned to words and phrases from the other sentence to form atomic paraphrase pairs. We now describe several techniques that perform the task of sub-sentential unit alignment. We have selected and implemented five techniques which we believe are representative of the type of knowledge that such techniques use, and have reused existing tools, initially developed for other tasks, when possible.

4.1 Statistical learning of word alignments (Giza)

The GIZA++ tool (Och and Ney, 2004) computes statistical word alignment models of increasing complexity from parallel corpora. While originally developed in the bilingual context of Statistical Machine Translation, nothing prevents building such models on monolingual corpora. However, in order to build reliable models, it is necessary to use enough training material, including minimal redundancy of words. To this end, we provided GIZA++ with all possible sentence pairs from our multiply-translated corpus to improve the quality of its word alignments.

4 http://www.elda.org/article125.html
(Note that we used symmetrized alignments from the alignments in both directions.) This constitutes a significant advantage for this technique, which techniques working on each sentence pair independently do not have.

4.2 Translational equivalence (Pivot)

Translational equivalence can be exploited to determine that two phrases may be paraphrases. Bannard and Callison-Burch (2005) defined a paraphrasing probability between two phrases based on their translation probability through all possible pivot phrases:

    Ppara(p1, p2) = Σ_piv Pt(piv|p1) Pt(p2|piv)

where Pt denotes translation probabilities. We used the Europarl corpus5 of parliamentary debates in English and French, consisting of approximately 1.7 million parallel sentences: this allowed us to use the same resource to build paraphrases for English, using French as the pivot language, and for French, using English as the pivot language. The GIZA++ tool was used for word alignment, and the MOSES Statistical Machine Translation toolkit (Koehn et al., 2007) was used to compute phrase translation probabilities from these word alignments. For each sentential paraphrase pair, we applied the following algorithm: for each phrase, we build the entire set of paraphrases using the previous definition; we then extract its best paraphrase as the one exactly appearing in the other sentence with maximum paraphrase probability, using a minimal threshold value of 10^-4.

4.3 Linguistic knowledge on term variation (Fastr)

The FASTR tool (Jacquemin, 1999) was designed to spot term/phrase variants in large corpora. Variants are described through metarules expressing how the morphosyntactic structure of a term variant can be derived from a given term by means of regular expressions on word morphosyntactic categories. Paradigmatic variation can also be expressed through constraints between words, imposing that they be of the same morphological or semantic family. Both kinds of constraints rely on preexisting repertoires available for English and French. To compute candidate paraphrase pairs using FASTR, we first consider all phrases from the first sentence and search for variants in the other sentence, then do the reverse process, and finally take the intersection of the two sets.

4.4 Syntactic similarity (Synt)

The algorithm introduced by Pang et al. (2003) takes two sentences as input and merges them by top-down syntactic fusion guided by compatible syntactic substructure. A lexical blocking mechanism prevents constituents from fusing when there is evidence of the presence of a word in another constituent of one of the sentences. We use the Berkeley probabilistic parser (Klein and Manning, 2003) to obtain syntactic trees for English, and its adapted version for French (Candito et al., 2010). Because this process is highly sensitive to syntactic parse errors, we use in our implementation k-best parses and retain the most compact fusion from any pair of candidate parses.

4.5 Edit rate on word sequences (TERp)

TERp (Translation Edit Rate Plus) (Snover et al., 2010) is a score designed for the evaluation of Machine Translation output. Its typical use takes a system hypothesis and computes an optimal set of word edits that can transform it into some existing reference translation. Edit types include exact word matching, word insertion and deletion, block movement of contiguous words (computed as an approximation), as well as, optionally, variant substitution through stemming, synonym or paraphrase matching.6 Each edit type is parameterized by at least one weight, which can be optimized using e.g. hill climbing. TERp being a tunable metric, our experiments include tuning TERp systems towards either precision (TERp-P), recall (TERp-R), or F-measure (TERp-F1).7

4.6 Evaluation of individual techniques

Results for the 5 individual techniques are given in the left part of Table 2. It is first apparent that all techniques but TERp fared better on the French corpus than on the English corpus. This can certainly be explained by the fact that the former results from more literal translations (from

5 http://statmt.org/europarl
6 Note that for these experiments we did not use the stemming module, the interface to WordNet for synonym matching, or the provided paraphrase table, due to the fact that these resources were available for English only.
7 Hill climbing was used for all tunings, as done by Snover et al. (2010), and we used one iteration starting with uniform weights and 100 random restarts.
        Individual techniques                                                Combinations
        GIZA    PIVOT   FASTR   SYNT    TERp-P  TERp-R  TERp-F1      union   validation
English
  P     31.01   31.78   37.38   52.17   50.00   29.15   33.37        21.44   50.51
  R     38.30   18.50    6.71    2.53    5.83   45.19   45.37        60.87   41.19
  F1    34.27   23.39   11.38    4.83   10.44   35.44   38.46        31.71   45.37
French
  P     28.99   29.53   52.48   62.50   31.35   30.26   31.43        17.58   40.77
  R     45.98   26.66    8.59    8.65   44.22   44.60   44.10        63.36   45.85
  F1    35.56   28.02   14.77   15.20   36.69   36.05   36.70        27.53   43.16

Table 2: Results on the test set on English and French for the 5 individual paraphrase acquisition techniques (left part) and for the 2 combination techniques (right part).
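The measures P, R and F1 reported in Table 2 follow the definitions of section 3; a minimal sketch, with paraphrase pairs as plain string tuples and invented toy sets (in the paper, the composite sets H and R extend the atomic ones by joining adjacent pairs up to 6 tokens):

```python
# Precision, recall and F-measure as defined in section 3 (Cohn et al., 2008):
# precision compares atomic hypotheses to the composite reference, and recall
# compares composite hypotheses to the atomic reference.
def prf(h_atom, h, r_atom, r):
    p = len(h_atom & r) / len(h_atom)
    rec = len(h & r_atom) / len(r_atom)
    f1 = 2 * p * rec / (p + rec) if p + rec else 0.0
    return p, rec, f1

# Toy sets, invented for illustration; here composite sets equal atomic ones.
r_atom = {("utilized", "used"), ("reached", "amounted"), ("million", "billion")}
h_atom = {("utilized", "used"), ("260", "0.26")}
p, rec, f1 = prf(h_atom, h_atom, r_atom, r_atom)
# p = 1/2, rec = 1/3, f1 = 0.4
```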
English to French, compared with from Chinese to English), which should consequently be easier to word-align. This is for example clearly shown by the results of the statistical aligner GIZA, which obtains a 7.68 advantage on recall for French over English.

The two linguistically-aware techniques, FASTR and SYNT, have a very strong precision on the more parallel French corpus, but fail to achieve an acceptable recall on their own. This is not surprising: FASTR metarules are focussed on term variant extraction, and SYNT requires two syntactic trees to be highly comparable to extract sub-sentential paraphrases. When these constrained conditions are met, the two techniques appear to perform quite well in terms of precision.

GIZA and TERp perform roughly in the same range on French, with acceptable precision and recall, TERp performing overall better, with e.g. a 1.14 advantage on F-measure on French and 4.19 on English. The fact that TERp performs comparatively better on English than on French8, with a 1.76 advantage on F-measure, is not contradictory: the implemented edit distance makes it possible to align reasonably distant words and phrases independently from syntax, and to find alignments for close remaining words, so the differences in performance between the two languages are not necessarily expected to be comparable with the results of a statistical alignment technique. English being a poorly-inflected language, alignment clues between two sentential paraphrases are expected to be more numerous than for highly-inflected French.

PIVOT is on par with GIZA as regards precision, but obtains a comparatively much lower recall (differences of 19.32 and 19.80 on recall on French and English, respectively). This may first be due in part to the paraphrasing score threshold used for PIVOT, but most certainly to the use of a bilingual corpus from the domain of parliamentary debates to extract paraphrases when our test sets are from the news domain: we may be observing differences inherent to the domain, and possibly facing the issue of numerous out-of-vocabulary phrases, in particular for named entities, which frequently occur in the news domain.

Importantly, we can note that we obtain at best a recall of 45.98 on French (GIZA) and of 45.37 on English (TERp). This may come as a disappointment but, given the broad set of techniques evaluated, it should rather underline the inherent complexity of the task. Also, recall that the metrics used do not consider identity paraphrases (e.g. at the same time ↔ at the same time), and that gold standard alignment is a very difficult process, as shown by the interjudge agreement values and our example from section 3. This, again, confirms that the task addressed is indeed a difficult one, and provides further justification for initially focussing on parallel monolingual corpora, albeit scarce, for conducting fine-grained studies on sub-sentential paraphrasing.

Lastly, we can also note that precision is not very high, with (at best, using TERp-P) average values for all techniques of 40.97 and 40.46 on French and English, respectively. Several facts may provide explanations for this observation. First, it should be noted that none of those techniques, except SYNT, was originally developed

8 Recall that all English-only linguistic modules of TERp had been disabled, so the better performance on English cannot be explained by a difference in terms of resources used.
for the task of sub-sentential paraphrase acquisition from monolingual parallel corpora. This results in definitions that are at best closely related to this task.9 Designing new techniques was not one of the objectives of our study, so we have reused existing techniques originally developed with different aims (bilingual parallel corpus word alignment (GIZA), term variant recognition (FASTR), Machine Translation evaluation (TERp)). Also, techniques such as GIZA and TERp attempt to align as many words as possible in a sentence pair, when gold standard alignments sometimes contain gaps.10 Finally, the metrics used will count as false small variations of gold standard paraphrases (e.g. a missing function word): the acceptability or not of such candidates could be evaluated in a scenario where such acceptable variants would be taken into account, and could be considered in the context of some actual use of the acquired paraphrases in some application. Nonetheless, on average the techniques in our study produce more candidates that are not in the gold standard: this will be an important fact to keep in mind when tackling the

Results on the test set for the two languages are given in Table 3. A number of pairs of techniques have strong complementarity values, the strongest one being for GIZA and TERp for both languages. According to these figures, PIVOT identifies paraphrases which are slightly more similar to those of TERp than to those of GIZA. Interestingly, FASTR and SYNT exhibit a strong complementarity, where in French, for instance, they only have a very small proportion of paraphrases in common. Considering the set of all other techniques, GIZA provides the most new paraphrases on French and TERp on English.

            GIZA    PIVOT   FASTR   SYNT    TERp-R  all others
English
  GIZA       -      4.65    2.83    0.59    10.31    8.31
  PIVOT     4.65     -      2.30    1.88     3.12    3.72
  FASTR     2.83    2.30     -      2.42     1.71    0.53
  SYNT      0.59    1.88    2.42     -       0.59    0.00
  TERp-R   10.31    3.12    1.71    0.59      -     12.20
French
  GIZA       -      9.79    3.64    2.20    10.73    8.91
  PIVOT     9.79     -      2.26    5.22     7.84    3.39
  FASTR     3.64    2.26     -      7.28     3.01    0.19
  SYNT      2.20    5.22    7.28     -       1.76    0.44
  TERp-R   10.73    7.84    3.01    1.76      -      5.65

Table 3: Complementarity values between pairs of techniques on English and French.
the next section, we will show how the results of the union can be validated using machine learning to improve these figures.

5.3 Paraphrase validation via automatic classification

A natural improvement to the naive combination of paraphrase candidates from all techniques can consist in validating candidate paraphrases by using several models that may be good indicators of their paraphrasing status. We can therefore cast our problem as one of biclass classification (i.e. paraphrase vs. not paraphrase).

We have used a maximum entropy classifier11 with the following features, aiming at capturing information on the paraphrase status of a candidate pair:

Morphosyntactic equivalence (POS) It may be the case that some sequences of part-of-speech tags can be rewritten as different sequences, e.g. as a result of verb nominalization. We therefore use features to indicate the sequences of part-of-speech tags for a pair of candidate paraphrases. We used the preterminal symbols of the syntactic trees of the parser used for SYNT.

Character-based distance (CAR) Morphological variants often have close word forms, and more generally close word forms in sentential paraphrase pairs may indicate related words. We used features for discretized values of the edit distance between the two phrases of a candidate paraphrase pair, as measured by the Levenshtein distance.

Stem similarity (STEM) Inflectional morphology, which is quite productive in languages such as French, can increase vocabulary size significantly, while in sentential paraphrases common stems may indicate related words. We used a binary feature indicating whether the stemmed phrases of a candidate paraphrase pair match.12

Token set identity (BOW) Syntactic rearrangements may involve the same sets of words in various orders. We used discretized features indicating the proportion of common tokens in the set of tokens for the two phrases of a candidate paraphrase pair.

Context similarity (CTXT) It can be derived from the distributionality hypothesis that the more two phrases are seen in similar contexts, the more they are likely to be paraphrases. We used discretized features indicating how similar the contexts of occurrences of two paraphrases are. For this, we used the full set of bilingual English-French data available for the translation task of the Workshop on Statistical Machine Translation13, totalling roughly 30 million parallel sentences: this again ensures that the same resources are used for experiments in the two languages. We collect all occurrences of the phrases in a pair, and build a vector of content words cooccurring within a distance of 10 words from each phrase. We finally compute the cosine between the vectors of the two phrases of a candidate paraphrase pair.

Relative position in a sentence (REL) Depending on the language in which parallel sentences are analyzed, it may be the case that sub-sentential paraphrases occur at close locations in their respective sentences. We used a discretized feature indicating the relative position of the two phrases in their original sentence.

Identity check (COOC) We used a binary feature indicating whether one of the two phrases from a candidate pair, or both, occurred at some other location in the other sentence.

Phrase length ratio (LEN) We used a discretized feature indicating the phrase length ratio.

Source techniques (SRC) Finally, as our setting validates paraphrase candidates produced by a set of techniques, we used features indicating which combination of techniques predicted a paraphrase candidate. This can allow learning that paraphrases in the intersection of the predicted sets of some techniques may produce good results.

We used a held-out training set consisting of 150 sentential paraphrase pairs from the same corpora as our previous development and test sets for both languages. Positive examples were taken from the candidate paraphrase pairs from any of

11 We used the implementation available at: http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html
12 We use the implementations of the Snowball stem-
13
mer from English and French available from: http:// http://www.statmt.org/wmt11/
snowball.tartarus.org translation-task.html
723
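Concretely, the surface features above lend themselves to a compact implementation. The sketch below is a hypothetical re-implementation, not the authors' code: the bucket boundaries, the 0-10 scaling and the function names are our own choices, and the resulting feature dictionaries would then be fed to a maximum entropy learner (e.g. the toolkit cited in footnote [11], or scikit-learn's LogisticRegression as a stand-in).

```python
# Hedged sketch of the CAR, STEM, BOW and CTXT features of section 5.3.
# Bucket boundaries and scaling factors are illustrative assumptions.
from math import sqrt
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance between two strings (CAR feature)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def discretize(value, buckets=(0, 1, 2, 3, 5, 10)):
    """Map a numeric value to the index of the first bucket it fits in."""
    for i, b in enumerate(buckets):
        if value <= b:
            return i
    return len(buckets)

def cosine(v1, v2):
    """Cosine between two context-word count vectors (CTXT feature)."""
    dot = sum(c * v2[w] for w, c in v1.items())
    n1 = sqrt(sum(c * c for c in v1.values()))
    n2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def features(p1, p2, ctx1, ctx2, stem):
    """Feature dictionary for one candidate paraphrase pair.

    ctx1/ctx2 are Counters of content words seen within 10 words of each
    phrase; stem is a stemming function (e.g. a Snowball stemmer)."""
    t1, t2 = p1.split(), p2.split()
    return {
        "CAR": discretize(edit_distance(p1, p2)),
        "STEM": int([stem(t) for t in t1] == [stem(t) for t in t2]),
        "BOW": discretize(10 * len(set(t1) & set(t2)) / len(set(t1) | set(t2))),
        "CTXT": discretize(10 * cosine(ctx1, ctx2)),
    }
```

An identical pair lands in the top STEM, BOW and CTXT buckets, while an unrelated pair with disjoint contexts scores zero on CTXT.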
the 5 techniques in our study which belong to

[Figure: F-measure of the validation experiments of the union set for all previous techniques; graphic not recoverable here.]
tailment Methods. Journal of Artificial Intelligence Research, 38:135–187.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with Bilingual Parallel Corpora. In Proceedings of ACL, Ann Arbor, USA.

Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In Proceedings of NAACL-HLT, Edmonton, Canada.

Regina Barzilay and Kathleen R. McKeown. 2001. Extracting paraphrases from a parallel corpus. In Proceedings of ACL, Toulouse, France.

Rahul Bhagat and Deepak Ravichandran. 2008. Large scale acquisition of paraphrases for learning surface patterns. In Proceedings of ACL-HLT, Columbus, USA.

Rahul Bhagat. 2009. Learning Paraphrases from Text. Ph.D. thesis, University of Southern California.

Houda Bouamor, Aurélien Max, and Anne Vilnat. 2010. Comparison of Paraphrase Acquisition Techniques on Sentential Paraphrases. In Proceedings of IceTAL, Reykjavik, Iceland.

Chris Callison-Burch, Trevor Cohn, and Mirella Lapata. 2008. ParaMetric: An automatic evaluation metric for paraphrasing. In Proceedings of COLING, Manchester, UK.

Chris Callison-Burch. 2007. Paraphrasing and Translation. Ph.D. thesis, University of Edinburgh.

Chris Callison-Burch. 2008. Syntactic Constraints on Paraphrases Extracted from Parallel Corpora. In Proceedings of EMNLP, Hawaii, USA.

Marie Candito, Benoît Crabbé, and Pascal Denis. 2010. Statistical French dependency parsing: treebank conversion and first results. In Proceedings of LREC, Valletta, Malta.

David Chen and William Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of ACL, Portland, USA.

Trevor Cohn, Chris Callison-Burch, and Mirella Lapata. 2008. Constructing corpora for the development and evaluation of paraphrase systems. Computational Linguistics, 34(4).

Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of COLING, Geneva, Switzerland.

Ulrich Germann. 2008. Yawat: Yet Another Word Alignment Tool. In Proceedings of ACL-HLT, demo session, Columbus, USA.

Christian Jacquemin. 1999. Syntagmatic and paradigmatic representations of term variation. In Proceedings of ACL, College Park, USA.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of ACL, Sapporo, Japan.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of ACL, demo session, Prague, Czech Republic.

Dekang Lin and Patrick Pantel. 2001. Discovery of inference rules for question answering. Natural Language Engineering, 7(4):343–360.

Chang Liu, Daniel Dahlmeier, and Hwee Tou Ng. 2010. PEM: A paraphrase evaluation metric exploiting parallel texts. In Proceedings of EMNLP, Cambridge, USA.

Nitin Madnani and Bonnie J. Dorr. 2010. Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods. Computational Linguistics, 36(3).

Nitin Madnani. 2010. The Circle of Meaning: From Translation to Paraphrasing and Back. Ph.D. thesis, University of Maryland, College Park.

Donald Metzler, Eduard Hovy, and Chunliang Zhang. 2011. An empirical evaluation of data-driven paraphrase generation techniques. In Proceedings of ACL-HLT, Portland, USA.

Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4).

Bo Pang, Kevin Knight, and Daniel Marcu. 2003. Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. In Proceedings of NAACL-HLT, Edmonton, Canada.

Matthew Snover, Nitin Madnani, Bonnie J. Dorr, and Richard Schwartz. 2010. TER-Plus: paraphrase, semantic, and alignment enhancements to Translation Edit Rate. Machine Translation, 23(2–3).

Jörg Tiedemann. 2007. Building a Multilingual Parallel Subtitle Corpus. In Proceedings of the Conference on Computational Linguistics in the Netherlands, Leuven, Belgium.

Shiqi Zhao, Haifeng Wang, Ting Liu, and Sheng Li. 2008. Pivot Approach for Extracting Paraphrase Patterns from Bilingual Corpora. In Proceedings of ACL-HLT, Columbus, USA.
Determining the placement of German verbs in English-to-German SMT
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 726–735, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
before translation. Our approach is related to the work of Collins et al. (2005). They reordered German sentences as a preprocessing step for German-to-English SMT. Hand-crafted reordering rules are applied on German parse trees in order to move the German verbs into the positions corresponding to the positions of the English verbs. Subsequently, the reordered German sentences are translated into English, leading to better translation performance when compared with the translation of the original German sentences.

We apply this method in the opposite translation direction, thus having English as a source language and German as a target language. However, we cannot simply invert the reordering rules which are applied on German as a source language in order to reorder the English input. While the reordering of German implies movement of the German verbs into a single position, when reordering English we need to split the English verbal complexes and, where required, move their parts into different positions. Therefore, we need to identify exactly which parts of a verbal complex must be moved and their possible positions in a German sentence.

Reordering rules can also be extracted automatically. For example, Niehues and Kolss (2009) automatically extracted discontiguous reordering rules (allowing gaps between POS tags which can include an arbitrary number of words) from a word-aligned parallel corpus with a POS-tagged source side. Since many different rules can be applied on a given sentence, a number of reordered sentence alternatives are created which are encoded as a word lattice (Dyer et al., 2008). They dealt with the translation directions German-to-English and English-to-German, but translation improvement was obtained only for the German-to-English direction. This may be due to missing information about clause boundaries, since English verbs often have to be moved to the clause end. Our reordering has access to this kind of knowledge since we are working with a full syntactic parser of English.

Genzel (2010) proposed a language-independent method for learning reordering rules where the rules are extracted from parsed source language sentences. For each node, all possible reorderings (permutations) of a limited number of the child nodes are considered. The candidate reordering rules are applied on the dev set, which is then translated and evaluated. Only those rule sequences are extracted which maximize the translation performance of the reordered dev set.

For the extraction of reordering rules, Genzel (2010) uses shallow constituent parse trees which are obtained from dependency parse trees. The trees are annotated using both Penn Treebank POS tags and Stanford dependency types. However, the constraints on possible reorderings are too restrictive to model all word movements required for English-to-German translation. In particular, the reordering rules involve only the permutation of direct child nodes and do not allow changing of child-parent relationships (deleting a child or attaching a node to a new father node). In our implementation, a verb can be moved to any position in a parse tree (according to the reordering rules): the reordering can be a simple permutation of child nodes, or attachment of these nodes to a new father node (cf. movement of bought and read in figure 1 [1]).

Thus, in contrast to Genzel (2010), our approach does not have any constraints with respect to the position of nodes marking a verb within the tree. Only the syntactic structure of the sentence restricts the distance of the linguistically motivated verb movements.

[1] The verb movements shown in figure 1 will be explained in detail in section 4.

3 Verb positions in English and German

3.1 Syntax of German sentences

Since in this work we concentrate on verbs, we use the notion verbal complex for a sequence consisting of verbs, verbal particles and negation. The verb positions in German sentences depend on the clause type and the tense, as shown in table 1. Verbs can be placed in 1st, 2nd or clause-final position. Additionally, if a composed tense is given, the parts of a verbal complex can be interrupted by the middle field (MF), which contains arbitrary sentence constituents, e.g., subjects and objects (noun phrases), adjuncts (prepositional phrases), adverbs, etc. We assume that the German sentences are SVO (analogously to English); topicalization is beyond the scope of our work.

In this work, we consider two possible positions of the negation in German: (1) directly in
front of the main verb, and (2) directly after the finite verb. The two negation positions are illustrated in the following examples:

(1) Ich behaupte, dass ich es nicht gesagt habe.
    I claim that I it not say did.

(2) Ich denke nicht, dass er das gesagt hat.
    I think not that he that said has.

It should, however, be noted that in German the negative particle nicht can have several positions in a sentence depending on the context (verb arguments, emphasis). Thus, more analysis is ideally needed (e.g., discourse, etc.).

            1st      2nd      MF    clause-final
decl        subject  finV     any
            subject  finV     any   mainV
int/perif   finV     subject  any
            finV     subject  any   mainV
sub/inf     relCon   subject  any   finV
            relCon   subject  any   VC

Table 1: Position of the German subjects and verbs in declarative clauses (decl), interrogative clauses and clauses with a peripheral clause (int/perif), and subordinate/infinitival (sub/inf) clauses. mainV = main verb, finV = finite verb, VC = verbal complex, any = arbitrary words, relCon = relative pronoun or conjunction. We consider extraposed constituents in perif, as well as optional interrogatives in int, to be in position 0.

3.2 Comparison of verb positions

English and German verbal complexes differ both in their construction and their position. The German verbal complex can be discontiguous, i.e., its parts can be placed in different positions, which implies that a (large) number of other words can be placed between the verbs (situated in the MF). In English, the verbal complex can only be interrupted by adverbials and subjects (in interrogative clauses). Furthermore, in German, the finite verb can sometimes be the last element of the verbal complex, while in English, the finite verb is always the first verb in the verbal complex.

In terms of positions, the verbs in English and German can differ significantly. As previously noted, the German verbal complex can be discontiguous, simultaneously occupying 1st/2nd and clause-final position (cf. rows decl and int/perif in table 1), which is not the case in English. While in English the verbal complex is placed in the 2nd position in declarative, or in the 1st position in interrogative clauses, in German the entire verbal complex can additionally be placed at the clause end in subordinate or infinitival clauses (cf. row sub/inf in table 1).

Because of these differences, for nearly all types of English clauses, reordering is needed in order to place the English verbs in the positions which correspond to the correct verb positions in German. Only English declarative clauses with simple present and simple past tense have the same verb position as their German counterparts. We give statistics on clause types and their relevance for the verb reordering in section 5.1.

4 Reordering of the English input

The reordering is carried out on English parse trees. We first enrich the parse trees with clause type labels, as described below. Then, for each node marking a clause (S nodes), the corresponding sequence of reordering rules is carried out. The appropriate reordering is derived from the clause type label and the composition of the given verbal complex. The reordering rules are deterministic. Only one rule can be applied in a given context, and for each verb to be reordered there is a unique reordered position.

The reordering procedure is the same for the training and the testing data. It is carried out on English parse trees, resulting in modified parse trees which are read out in order to generate the reordered English sentences. These are input for training a PSMT system or input to the decoder. The processing steps are shown in figure 1.

For the development of the reordering rules, we used a small sample of the training data. In particular, by observing English parse trees extracted randomly from the training data, we developed a set of rules which transform the original trees in such a way that the English verbs are moved to the positions which correspond to the placement of verbs in German.

4.1 Labeling clauses with their type

As shown in section 3.1, the verb positions in German depend on the clause type. Since we use English parse trees produced by the generative parser of Charniak and Johnson (2005) which do not have any function labels, we implemented a simple rule-based clause type labeling script which
enriches every clause starting node with the corresponding clause type label. The label depends on the context (father, child nodes) of a given clause node. If, for example, the first child node of a given S node is WH* (wh-word) or IN (subordinating conjunction), then the clause type label is SUB (subordinate clause, cf. figure 1).

We defined five clause type labels which indicate main clauses (MAIN), main clauses with a peripheral clause in the prefield (EXTR), subordinate (SUB), infinitival (XCOMP) and interrogative clauses (INT).

[Figure 1: graphic omitted. Processing steps: clause type labeling annotates the given original tree with clause type labels (in the figure, S-EXTR and S-SUB). Subsequently, the reordering is performed (cf. movement of the verbs read and bought). The reordered sentence is finally read out and given to the decoder.]

4.2 Clause boundary identification

The German verbs are often placed at the clause end (cf. rows decl, int/perif and sub/inf in table 1), making it necessary to move their English counterparts into the corresponding positions within an English tree. For this reason, we identify the clause ends (the right boundaries). The search for the clause end is implemented as a breadth-first search for the next S node or sentence end. The starting node is the node which marks the verbal phrase in which the verbs are enclosed. When the next node marking a clause is identified, the search stops and returns the position in front of the identified clause-marking node.

When, for example, searching for the clause boundary of S-EXTR in figure 1, we search recursively for the first clause-marking node within VP1, which is S-SUB. The position in front of S-SUB is marked as the clause-final position of S-EXTR.

4.3 Basic verb reordering rules

The reordering procedure takes into account the following word categories: verbs, verb particles, the infinitival particle to and the negative particle not, as well as its abbreviated form 't. The reordering rules are based on POS labels in the parse tree.

The reordering procedure is a sequence of applications of the reordering rules. For each element of an English verbal complex, its properties are derived (tense, main verb/auxiliary, finiteness). The reordering is then carried out corresponding to the clause type and the verbal properties of the verb to be processed.

In the following, the reordering rules are presented. Examples of reordered sentences are given in table 2 and are discussed further here.

Main clause (S-MAIN)

(i) simple tense: no reordering required (cf. appears_finV in input 1);

(ii) composed tense: the main verb is moved to the clause end. If a negative particle exists, it is moved in front of the reordered main verb, while the optional verb particle is moved after the reordered main verb (cf. [has]_finV [been developing]_mainV in input 2).

Main clause with peripheral clause (S-EXTR)

(i) simple tense: the finite verb is moved together with an optional particle to the 1st position (i.e. in front of the subject);

(ii) composed tense: the main verb, as well as optional negative and verb particles, are moved to the clause end. The finite verb is moved to the 1st position, i.e. in front of the subject (cf. have_finV [gone up]_mainV in input 3).
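The labeling and boundary-identification steps of sections 4.1 and 4.2 can be sketched over a toy constituency tree. This is an illustrative reconstruction, not the authors' script: the Node class is an assumption, and only the WH*/IN trigger quoted in the text is implemented (the real script also assigns EXTR, XCOMP and INT from further context).

```python
# Hedged sketch: clause type labeling (4.1) and breadth-first search for the
# clause-final position (4.2).  The tree representation is an assumption.
from collections import deque

class Node:
    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)

def label_clause(s_node):
    """Label an S node from its first child: WH* or IN triggers SUB;
    everything else defaults to MAIN in this simplified version."""
    first = s_node.children[0].tag if s_node.children else ""
    if first.startswith("WH") or first == "IN":
        return "S-SUB"
    return "S-MAIN"

def clause_final_position(vp_node):
    """Breadth-first search from the verbal phrase for the next
    clause-marking (S*) node; the clause-final position is directly in
    front of it.  Without an embedded clause, the clause end is returned."""
    queue = deque([vp_node])
    while queue:
        node = queue.popleft()
        for i, child in enumerate(node.children):
            if child.tag.startswith("S"):
                return node, i      # reordered verbs go before child i
            queue.append(child)
    return vp_node, len(vp_node.children)
```

For the relative clause which I bought last week, the S node's first child is WHNP, so it is labeled S-SUB; a VP that embeds this S-SUB yields the position directly in front of it, as in the paper's S-EXTR example.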
Subordinate clause (S-SUB)

(i) simple tense: the finite verb is moved to the clause end (cf. boasts_finV in input 3);

(ii) composed tense: the main verb, as well as optional negative and verb particles, are moved to the clause end; the finite verb is placed after the reordered main verb (cf. have_finV [been executed]_mainV in input 5).

Infinitival clause (S-XCOMP)

The entire English verbal complex is moved from the 2nd position to the clause-final position (cf. [to discuss]_VC in input 4).

Interrogative clause (S-INT)

(i) simple tense: no reordering required;

(ii) composed tense: the main verb, as well as optional negative and verb particles, are moved to the clause end (cf. [did]_finV know_mainV in input 5).

4.4 Reordering rules for other phenomena

4.4.1 Multiple auxiliaries in English

Some English tenses require a sequence of auxiliaries, not all of which have a German counterpart. In the reordering process, non-finite auxiliaries are considered to be a part of the main verb complex and are moved together with the main verb (cf. movement of has_finV [been developing]_mainV in input 2).

4.4.2 Simple vs. composed tenses

In English, there are some tenses composed of an auxiliary and a main verb which correspond to a German tense composed of only one verb, e.g., am reading → lese and does John read? → liest John? Splitting such English verbal complexes and only moving the main verbs would lead to constructions which do not exist in German. Therefore, in the reordering process, the English verbal complex in present continuous, as well as interrogative phrases composed of do and a main verb, are not split. They are handled as one main verb complex and reordered as a single unit using the rules for main verbs (e.g. [because I am reading a book]_SUB → because I a book am reading → weil ich ein Buch lese). [2]

[2] We only consider present continuous and verbs in combination with do for this kind of reordering. There are also other tenses which could (or should) be treated in the same way (cf. has been developing in input 2, table 2). We do not do this to keep the reordering rules simple and general.

4.4.3 Flexible position of German verbs

We stated that the English verbs are never moved outside the subclause they were originally in. In German there are, however, some constructions (infinitival and relative clauses) in which the main verb can be placed after a subsequent clause. Consider two German translations of the English sentence He has promised to come:

(3a) Er hat [zu kommen]_S versprochen.
     he has to come promised.

(3b) Er hat versprochen, [zu kommen]_S.
     he has promised, to come.

In (3a), the German main verb versprochen is placed after the infinitival clause zu kommen (to come), while in (3b), the same verb is placed in front of it. Both alternatives are grammatically correct.

Whether a German verb should come after an embedded clause as in example (3a) or precede it (cf. example (3b)) depends not only on syntactic but also on stylistic factors. Regarding the verb reordering problem, we would therefore have to examine the given sentence in order to derive the correct (or more probable) new verb position, which is beyond the scope of this work. Therefore, we allow only for reorderings which do not cross clause boundaries, as shown in example (3b).

5 Experiments

In order to evaluate the translation of the reordered English sentences, we built two SMT systems with Moses (Koehn et al., 2007). As training data, we used the Europarl corpus, which consists of 1,204,062 English/German sentence pairs. The baseline system was trained on the original English training data while the contrastive system was trained on the reordered English training data. In both systems, the same original German sentences were used. We used the WMT 2009 dev and test sets to tune and test the systems. The baseline system was tuned and tested on the original data, while for the contrastive system we used the reordered English side of the dev and test sets. The German 5-gram language model used in both systems was trained on the WMT 2009 German language modeling data, a large German newspaper corpus consisting of 10,193,376 sentences.
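At token level, the S-SUB rules above amount to moving the main verb to the clause end and placing the finite verb after it. The following sketch is a deliberate simplification of the tree-based rules: it assumes the finV/mainV annotations are already given (in the paper they come from the parse tree), and the flat token representation is ours, not the authors'.

```python
# Hedged token-level sketch of the S-SUB reordering rules of section 4.3.
# tags marks each token as "finV", "mainV" or anything else; the tagging
# itself (derived from the parse tree in the paper) is assumed given.
def reorder_sub_clause(tokens, tags):
    fin = [t for t, g in zip(tokens, tags) if g == "finV"]
    main = [t for t, g in zip(tokens, tags) if g == "mainV"]
    rest = [t for t, g in zip(tokens, tags) if g not in ("finV", "mainV")]
    if main:
        # composed tense: main verb to the clause end, finite verb after it
        return rest + main + fin
    # simple tense: only the finite verb moves to the clause end
    return rest + fin
```

Applied to the clause because they have said it to me yesterday, this reproduces the reordered form because they it to me yesterday said have.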
Table 2: Example inputs and their reordered versions.

Input 1: The programme appears to be successful for published data shows that MRSA is on the decline in the UK.
Reordered: The programme appears successful to be for published data shows that MRSA on the decline in the UK is.

Input 2: The real estate market in Bulgaria has been developing at an unbelievable rate - all of Europe has its eyes on this heretofore rarely heard-of Balkan nation.
Reordered: The real estate market in Bulgaria has at an unbelievable rate been developing - all of Europe has its eyes on this heretofore rarely heard-of Balkan nation.

Input 3: While Bulgaria boasts the European Union's lowest real estate prices, they have still gone up by 21 percent in the past five years.
Reordered: While Bulgaria the European Union's lowest real estate prices boasts, have they still by 21 percent in the past five years gone up.

Input 4: Professionals and politicians from 192 countries are slated to discuss the Bali Roadmap that focuses on efforts to cut greenhouse gas emissions after 2012, when the Kyoto Protocol expires.
Reordered: Professionals and politicians from 192 countries are slated the Bali Roadmap to discuss that on efforts focuses greenhouse gas emissions after 2012 to cut, when the Kyoto Protocol expires.

Input 5: Did you know that in that same country, since 1976, 34 mentally-retarded offenders have been executed?
Reordered: Did you know that in that same country, since 1976, 34 mentally-retarded offenders been executed have?
a case. There is no translation of the English infinitival verbal complex to have. In the translation generated by the contrastive system, the verbal complex does get translated (zu haben) and is also placed correctly. We think this is because the reordering model is not able to identify the position for the verb which is licensed by the language model, causing a hypothesis with no verb to be scored higher than the hypotheses with incorrectly placed verbs.

6 Error analysis

6.1 Erroneous reordering in our system

In some cases, the reordering of the English parse trees fails. Most erroneous reorderings are due to a number of different parsing and tagging errors. Coordinated verbs are also problematic due to their complexity. Their composition can vary, and thus it would require a large number of different reordering rules to fully capture this. In our reordering script, the movement of complex structures such as verbal phrases consisting of a sequence of child nodes is not implemented (only nodes with one child, namely the verb, verbal particle or negative particle, are moved).

6.2 Splitting of the English verbal complex

Since in many cases the German verbal complex is discontiguous, we need to split the English verbal complex and move its parts into different positions. This ensures the correct placement of German verbs. However, it does not ensure that the German verb forms are correct, because of highly ambiguous English verbs. In some cases, we can lose contextual information which would be useful for disambiguating ambiguous verbs and generating the appropriate German verb forms.

6.2.1 Subject-verb agreement

Let us consider the English clause in (4a) and its reordered version in (4b):

(4a) ... because they have said it to me yesterday.
(4b) ... because they it to me yesterday said have.

In (4b), the English verbs said have are separated from the subject they. The English said have can be translated in several ways into German. Without any information about the subject (the distance between the verbs and the subject can be very large), it is relatively likely that an erroneous German translation is generated.

On the other hand, in the baseline SMT system, the subject they is likely to be a part of a translation phrase with the correct German equivalent (they have said → sie haben gesagt). They is then used as a disambiguating context which is missing in the reordered sentence (but the order is wrong).

6.2.2 Verb dependency

A similar problem occurs in a verbal complex:

(5a) They have said it to me yesterday.
(5b) They have it to me yesterday said.

In sentence (5a), the English consecutive verbs have said are a sequence consisting of a finite auxiliary have and the past participle said. They should be translated into the corresponding German verbal complex haben gesagt. But if the verbs are split, we will probably get translations which are completely independent. Even if the German auxiliary is correctly inflected, it is hard to predict how said is going to be translated. If the distance between the auxiliary habe and the hypothesized translation of said is large, the language model will not be able to help select the correct translation. Here, the baseline SMT system again has an advantage as the verbs are consecutive. It is likely they will be found in the training data and extracted with the correct German phrase (but the German order is again incorrect).

6.3 Collocations

Collocations (verb-object pairs) are another case which can lead to a problem:

(6a) I think that the discussion would take place later this evening.
(6b) I think that the discussion place later this evening take would.

The English collocation in (6a) consisting of the verb take and the object place corresponds to the German verb stattfinden. Without this specific object, the verb take is likely to be translated literally. In the reordered sentence, the verbal complex take would is indeed separated from the object place, which would probably lead to the literal translation of both parts of the mentioned collocation. So, as already described in the preceding paragraphs, an important source of contextual information is lost which could ensure the correct translation of the given phrase.

This problem is not specific to English-to-German. For instance, the same problem occurs when translating German into English. If, for example, the object Kauf (buying) of the collocation nehmen + in Kauf (accept) is separated from the verb nehmen (take), they are very likely to be translated literally (rather than as the idiom meaning to accept), thus leading to an erroneous English translation.

Table 5: Example translations; the baseline has problems with verbal elements, the reordered system is correct.

Input 1: An MRSA - an antibiotic resistant staphylococcus - infection was recently diagnosed in the traumatology ward of Janos hospital.
Reordered input: An MRSA - an antibiotic resistant staphylococcus - infection was recently in the traumatology ward of Janos hospital diagnosed.
Baseline translation: Ein MRSA - ein Antibiotikum resistenter Staphylococcus - war vor kurzem in der festgestellt traumatology Ward von Janos Krankenhaus.
  (gloss: A MRSA - an antibiotic resistant Staphylococcus - was before recent in the diagnosed traumatology ward of Janos hospital.)
Reordered translation: Ein MRSA - ein Antibiotikum resistenter Staphylococcus - Infektion wurde vor kurzem in den traumatology Station der Janos Krankenhaus diagnostiziert.
  (gloss: A MRSA - an antibiotic resistant Staphylococcus - infection was before recent in the traumatology ward of Janos hospital diagnosed.)

Input 2: The ECB predicts that 2008 inflation will climb to 2.5 percent from the earlier 2.1, but will drop back to 1.9 percent in 2009.
Reordered input: The ECB predicts that 2008 inflation to 2.5 percent from the earlier 2.1 will climb, but back to 1.9 percent in 2009 will drop.
Baseline translation: Die EZB sagt, dass 2008 die Inflationsrate wird auf 2,5 Prozent aus der früheren 2,1, sondern fallen zurück auf 1,9 Prozent im Jahr 2009.
  (gloss: The ECB says, that 2008 the inflation rate will to 2.5 percent from the earlier 2.1, but fall back to 1.9 percent in the year 2009.)
Reordered translation: Die EZB prophezeit, dass 2008 die Inflation zu 2,5 Prozent aus der früheren 2,1 ansteigen wird, aber auf 1,9 Prozent in 2009 sinken wird.
  (gloss: The ECB predicts, that 2008 the inflation rate to 2.5 percent from the earlier 2.1 climb will, but to 1.9 percent in 2009 fall will.)

Input 3: Labour Minister Monika Lamperth appears not to have a sensitive side.
Reordered input: Labour Minister Monika Lamperth appears a sensitive side not to have.
Baseline translation: Arbeitsminister Monika Lamperth scheint nicht eine sensible Seite.
  (gloss: Labour Minister Monika Lamperth appears not a sensitive side.)
Reordered translation: Arbeitsminister Monika Lamperth scheint eine sensible Seite nicht zu haben.
  (gloss: Labour Minister Monika Lamperth appears a sensitive side not to have.)

6.4 Error statistics

We manually checked 100 randomly chosen English sentences to see how often the problems described in the previous sections occur. From a total of 276 clauses, 29 were not reordered correctly. 20 errors were caused by incorrect parsing and/or POS tags, while the remaining 9 are mostly due to different kinds of coordination. Table 6 shows correctly reordered clauses which might pose a problem for translation (see sections 6.2-6.3). Although the positions of the verbs in the translations are now correct, the distance between subjects and verbs, or between verbs in a single VP, might lead to the generation of erroneously inflected verbs. The separate generation of German verbal morphology is an interesting area of future work, see (de Gispert and Mariño, 2008). We also found 2 problematic collocations, but note that this only gives a rough idea of the problem; further study is needed.

6.5 POS-based disambiguation of the English verbs

With respect to the problems described in 6.2.1 and 6.2.2, we carried out an experiment in which
total d 5 tokens the main verb of a verbal complex can occupy
subjectverb 40 19 different positions in a clause, we had to define
verb dependency 32 14 the English counterparts of the two components
collocations 8 2
of the German verbal complex. We defined non-
Table 6: total is the number of clauses found for the finite English verbal elements as a part of the main
respective phenomenon. d 5 tokens is the number of verb complex which are then moved together with
clauses where the distance between relevant tokens is the main verb. This rigid definition could be re-
at least 5, which is problematic. laxed by considering multiple different splittings
and movements of the English verbs.
Baseline + POS Reordered + POS
BLEU 13.11 13.68 Furthermore, the reordering rules are applied
on a clause not allowing for movements across the
Table 7: BLEU scores of the baseline and the con- clause boundaries. However, we also showed that
trastive SMT system using verbal POS tags in some cases, the main verbs may be moved after
the succeeding subclause. Stochastic rules could
we used POS tags in order to disambiguate the allow for both placements or carry out the more
English verbs. For example, the English verb said probable reordering given a specific context. We
corresponds to the German participle gesagt, as will address these issues in future work.
well as to the finite verb in simple past, e.g. sagte. Unfortunately, some important contextual in-
We attached the POS tags to the English verbs in formation is lost when splitting and moving En-
order to simulate a disambiguating suffix of a verb glish verbs. When English verbs are highly am-
(e.g. said said VBN, said VBD). The idea be- biguous, erroneous German verbs can be gener-
hind this was to extract the correct verbal trans- ated. The experiment described in section 6.5
lation phrases and score them with appropriate shows that more effort should be made in order to
translation probabilities (e.g. p(said VBN, gesagt) overcome this problem. The incorporation of sep-
> p(said VBN, sagte). arate morphological generation of inflected Ger-
We built and tested two PSMT systems using man verbs would improve translation.
the data enriched with verbal POS tags. The
first system is trained and tested on the original 8 Conclusion
English sentences, while the contrastive one was
trained and tested on the reordered English sen- We presented a method for reordering English as a
tences. Evaluation results are shown in table 7. preprocessing step for EnglishtoGerman SMT.
The baseline obtains a gain of 0.09 and the con- To our knowledge, this is one of the first papers
trastive system of 0.05 BLEU points over the cor- which reports on experiments regarding the re-
responding PSMT system without POS tags. Al- ordering problem for EnglishtoGerman SMT.
though there are verbs which are now generated We showed that the reordering rules specified in
correctly, the overall translation improvement lies this work lead to improved translation quality. We
under our expectation. We will directly model the observed that verbs are placed correctly more of-
inflection of German verbs in future work. ten than in the baseline, and that verbs which were
omitted in the baseline are now often generated.
7 Discussion and future work We carried out a thorough analysis of the rules
We implemented reordering rules for English ver- applied and discussed problems which are related
bal complexes because their placement differs to highly ambiguous English verbs. Finally we
significantly from German placement. The imple- presented ideas for future work.
mentation required dealing with three important
problems: (i) definition of the clause boundaries, Acknowledgments
(ii) identification of the new verb positions and
(iii) correct splitting of the verbal complexes. This work was funded by Deutsche Forschungs-
We showed some phenomena for which a gemeinschaft grant Models of Morphosyntax for
stochastic reordering would be more appropriate. Statistical Machine Translation.
For example, since in German, the auxiliary and
734
References
Eugene Charniak and Mark Johnson. 2005. Coarse-
to-fine n-best parsing and MaxEnt discriminative
reranking. In ACL.
Michael Collins, Philipp Koehn, and Ivona Kučerová. 2005. Clause restructuring for statistical machine translation. In ACL.
Adrià de Gispert and José B. Mariño. 2008. On the impact of morphology in English to Spanish statistical MT. Speech Communication, 50(11-12).
Chris Dyer, Smaranda Muresan, and Philip Resnik.
2008. Generalizing word lattice translation. In
ACL-HLT.
Dmitriy Genzel. 2010. Automatically learning
source-side reordering rules for large scale machine
translation. In COLING.
Deepa Gupta, Mauro Cettolo, and Marcello Federico.
2007. POS-based reordering models for statistical
machine translation. In Proceedings of the Machine
Translation Summit (MT-Summit).
Nizar Habash. 2007. Syntactic preprocessing for sta-
tistical machine translation. In Proceedings of the
Machine Translation Summit (MT-Summit).
Jason Katz-Brown, Slav Petrov, Ryan McDon-
ald, Franz Och, David Talbot, Hiroshi Ichikawa,
Masakazu Seno, and Hideto Kazawa. 2011. Train-
ing a parser for machine translation reordering. In
EMNLP.
Philipp Koehn, Hieu Hoang, Alexandra Birch,
Chris Callison-Burch, Marcello Federico, Nicola
Bertoldi, Brooke Cowan, Wade Shen, Christine
Moran, Richard Zens, Chris Dyer, Ondřej Bojar,
Alexandra Constantin, and Evan Herbst. 2007.
Moses: Open source toolkit for statistical machine
translation. In ACL, Demonstration Program.
Philipp Koehn. 2004. Statistical significance tests for
machine translation evaluation. In EMNLP.
Jan Niehues and Muntsin Kolss. 2009. A POS-based
model for long-range reorderings in SMT. In EACL
Workshop on Statistical Machine Translation.
Kishore Papineni, Salim Roukos, Todd Ward, and
Wei-Jing Zhu. 2002. BLEU: a method for auto-
matic evaluation of machine translation. In ACL.
Peng Xu, Jaecho Kang, Michael Ringgaard, and Franz
Och. 2009. Using a dependency parser to improve
SMT for subject-object-verb languages. In NAACL.
Syntax-Based Word Ordering Incorporating a Large-Scale Language Model
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 736-746, Avignon, France, April 23-27 2012.
© 2012 Association for Computational Linguistics
the system. N-gram models have been used as a standard component in statistical machine translation, but have not been applied to the syntactic model of Z&C. Intuitively, an N-gram model can improve local fluency when added to a syntax model. Our experiments show that a four-gram model trained using the English GigaWord corpus gave improvements when added to the syntax-based baseline system.

The contributions of this paper are as follows. First, we improve on the performance of the Z&C system for the challenging task of the general word ordering problem. Second, we develop a novel method for incorporating a large-scale language model into a syntax-based generation system. Finally, we analyse large-margin training in the context of learning-guided best-first search, offering a novel solution to this computationally hard problem.

2 The statistical model and decoding algorithm

We take Z&C as our baseline system. Given a multi-set of input words, the baseline system builds a CCG derivation by choosing and ordering words from the input set. The scoring model is trained using CCGBank (Hockenmaier and Steedman, 2007), and best-first decoding is applied. We apply the same decoding framework in this paper, but apply an improved training process, and incorporate an N-gram language model into the syntax model. In this section, we describe and discuss the baseline statistical model and decoding framework, motivating our extensions.

2.1 Combinatory Categorial Grammar

CCG, and parsing with CCG, has been described elsewhere (Clark and Curran, 2007; Hockenmaier and Steedman, 2002); here we provide only a short description.

CCG (Steedman, 2000) is a lexicalized grammar formalism, which associates each word in a sentence with a lexical category. There is a small number of basic lexical categories, such as noun (N), noun phrase (NP), and prepositional phrase (PP). Complex lexical categories are formed recursively from basic categories and slashes, which indicate the directions of arguments. The CCG grammar used by our system is read off the derivations in CCGbank, following Hockenmaier and Steedman (2002), meaning that the CCG combinatory rules are encoded as rule instances, together with a number of additional rules which deal with punctuation and type-changing. Given a sentence, its CCG derivation can be produced by first assigning a lexical category to each word, and then recursively applying CCG rules bottom-up.

2.2 The decoding algorithm

In the decoding algorithm, a hypothesis is an edge, which corresponds to a sub-tree in a CCG derivation. Edges are built bottom-up, starting from leaf edges, which are generated by assigning all possible lexical categories to each input word. Each leaf edge corresponds to an input word with a particular lexical category. Two existing edges can be combined if there exists a CCG rule which combines their category labels, and if they do not contain the same input word more times than its total count in the input. The resulting edge is assigned a category label according to the combinatory rule, and covers the concatenated surface strings of the two sub-edges in their order of combination. New edges can also be generated by applying unary rules to a single existing edge. Starting from the leaf edges, the bottom-up process is repeated until a goal edge is found, and its surface string is taken as the output.

This derivation-building process is reminiscent of a bottom-up CCG parser in the edge combination mechanism. However, it is fundamentally different from a bottom-up parser. Since, for the generation problem, the order of two edges in their combination is flexible, the search problem is much harder than that of a parser. With no input order specified, no efficient dynamic-programming algorithm is available, and less contextual information is available for disambiguation due to the lack of an input string.

In order to combat the large search space, best-first search is applied, where candidate hypotheses are ordered by their scores, and kept in an agenda, and a limited number of accepted hypotheses are recorded in a chart. Here the chart is essentially a set of beams, each of which contains the highest scored edges covering a particular number of words. Initially, all leaf edges are generated and scored, before they are put onto the agenda. During each step in the decoding process, the top edge from the agenda is expanded. If it is a goal edge, it is returned as the output, and the
decoding finishes. Otherwise it is extended with unary rules, and combined with existing edges in the chart using binary rules to produce new edges. The resulting edges are scored and put onto the agenda, while the original edge is put onto the chart. The process repeats until a goal edge is found, or a timeout limit is reached. In the latter case, a default output is produced using existing edges in the chart.

Pseudocode for the decoder is shown as Algorithm 1. Again it is reminiscent of a best-first parser (Caraballo and Charniak, 1998) in the use of an agenda and a chart, but is fundamentally different due to the fact that there is no input order.

Algorithm 1 The decoding algorithm.
  a ← InitAgenda()
  c ← InitChart()
  while not TimeOut() do
    new ← []
    e ← PopBest(a)
    if GoalTest(e) then
      return e
    end if
    for e′ ∈ Unary(e, grammar) do
      Append(new, e′)
    end for
    for ē ∈ c do
      if CanCombine(e, ē) then
        e′ ← Binary(e, ē, grammar)
        Append(new, e′)
      end if
      if CanCombine(ē, e) then
        e′ ← Binary(ē, e, grammar)
        Append(new, e′)
      end if
    end for
    for e′ ∈ new do
      Add(a, e′)
    end for
    Add(c, e)
  end while

2.3 Statistical model and feature templates

The baseline system uses a linear model to score hypotheses. For an edge e, its score is defined as:

    f(e) = Φ(e) · θ,

where Φ(e) represents the feature vector of e and θ is the parameter vector of the model.

During decoding, feature vectors are computed incrementally. When an edge is constructed, its score is computed from the scores of its sub-edges and the incrementally added structure:

    f(e) = Φ(e) · θ
         = ( Σ_{es ∈ e} Φ(es) + φ(e) ) · θ
         = Σ_{es ∈ e} Φ(es) · θ + φ(e) · θ
         = Σ_{es ∈ e} f(es) + φ(e) · θ

In the equation, es ∈ e represents a sub-edge of e. Leaf edges do not have any sub-edges. Unary-branching edges have one sub-edge, and binary-branching edges have two sub-edges. The feature vector φ(e) represents the incremental structure when e is constructed over its sub-edges. It is called the constituent-level feature vector by Z&C. For leaf edges, φ(e) includes information about the lexical category label; for unary-branching edges, φ(e) includes information from the unary rule; for binary-branching edges, φ(e) includes information from the binary rule, and additionally the token, POS and lexical category bigrams and trigrams that result from the surface string concatenation of its sub-edges. The score f(e) is therefore the sum of f(es) (for all es ∈ e) plus φ(e) · θ. The feature templates we use are the same as those in the baseline system.

An important aspect of the scoring model is that edges with different sizes are compared with each other during decoding. Edges with different sizes can have different numbers of features, which can make the training of a discriminative model more difficult. For example, a leaf edge with one word can be compared with an edge over the entire input. One way of reducing the effect of the size difference is to include the size of the edge as part of feature definitions, which can improve the comparability of edges of different sizes by reducing the number of features they have in common. Such features are applied by Z&C, and we make use of them here. Even with such features, the question of whether edges with different sizes are linearly separable is an empirical one.

3 Training

The efficiency of the decoding algorithm is dependent on the statistical model, since the best-
first search is guided to a solution by the model, and a good model will lead to a solution being found more quickly. In the ideal situation for the best-first decoding algorithm, the model is perfect and the score of any gold-standard edge is higher than the score of any non-gold-standard edge. As a result, the top edge on the agenda is always a gold-standard edge, and therefore all edges on the chart are gold-standard before the gold-standard goal edge is found. In this oracle procedure, the minimum number of edges is expanded, and the output is correct. The best-first decoder is perfect in not only accuracy, but also speed. In practice this ideal situation is rarely met, but it determines the goal of the training algorithm: to produce the perfect model and hence decoder.

If we take gold-standard edges as positive examples, and non-gold-standard edges as negative examples, the goal of the training problem can be viewed as finding a large separating margin between the scores of positive and negative examples. However, it is infeasible to generate the full space of negative examples, which is factorial in the size of the input. Like Z&C, we apply online learning, and generate negative examples based on the decoding algorithm.

Our training algorithm is shown as Algorithm 2. The algorithm is based on the decoder, where an agenda is used as a priority queue of edges to be expanded, and a set of accepted edges is kept in a chart. Similar to the decoding algorithm, the agenda is initialized using all possible leaf edges. During each step, the top of the agenda e is popped. If it is a gold-standard edge, it is expanded in exactly the same way as the decoder, with the newly generated edges being put onto the agenda, and e being inserted into the chart. If e is not a gold-standard edge, we take it as a negative example e−, and take the lowest scored gold-standard edge on the agenda e+ as a positive example, in order to make an update to the model parameter vector θ. Our parameter update algorithm is different from the baseline perceptron algorithm, as will be discussed later. After updating the parameters, the scores of agenda edges above and including e−, together with all chart edges, are updated, and e− is discarded before the start of the next processing step. By not putting any non-gold-standard edges onto the chart, the training speed is much faster; on the other hand a wide range of negative examples is pruned. We leave for further work possible alternative methods to generate more negative examples during training.

Algorithm 2 The training algorithm.
  a ← InitAgenda()
  c ← InitChart()
  while not TimeOut() do
    new ← []
    e ← PopBest(a)
    if GoldStandard(e) and GoalTest(e) then
      return e
    end if
    if not GoldStandard(e) then
      e− ← e
      e+ ← MinGold(a)
      UpdateParameters(e+, e−)
      RecomputeScores(a, c)
      continue
    end if
    for e′ ∈ Unary(e, grammar) do
      Append(new, e′)
    end for
    for ē ∈ c do
      if CanCombine(e, ē) then
        e′ ← Binary(e, ē, grammar)
        Append(new, e′)
      end if
      if CanCombine(ē, e) then
        e′ ← Binary(ē, e, grammar)
        Append(new, e′)
      end if
    end for
    for e′ ∈ new do
      Add(a, e′)
    end for
    Add(c, e)
  end while

Another way of viewing the training process is that it pushes gold-standard edges towards the top of the agenda, and crucially pushes them above non-gold-standard edges. This is the view described by Z&C. Given a positive example e+ and a negative example e−, they use the perceptron algorithm to penalize the score of φ(e−) and reward the score of φ(e+), but do not update parameters for the sub-edges of e+ and e−. An argument for not penalizing the sub-edge scores for e− is that the sub-edges must be gold-standard edges (since the training process is constructed so that only gold-standard edges are expanded). From
the perspective of correctness, it is unnecessary to find a margin between the sub-edges of e+ and those of e−, since both are gold-standard edges.

However, since the score of an edge not only represents its correctness, but also affects its priority on the agenda, promoting the sub-edges of e+ can lead to easier edges being constructed before harder ones (i.e. those that are less likely to be correct), and therefore improve the output accuracy. This perspective has been observed by other works on learning-guided search (Shen et al., 2007; Shen and Joshi, 2008; Goldberg and Elhadad, 2010). Intuitively, the score difference between easy gold-standard and harder gold-standard edges should not be as great as the difference between gold-standard and non-gold-standard edges. The perceptron update cannot provide such control of separation, because the amount of update is fixed to 1.

As described earlier, we treat parameter update as finding a separation between correct and incorrect edges, in which the global feature vectors Φ, rather than φ, are considered. Given a positive example e+ and a negative example e−, we make a minimum update so that the score of e+ is higher than that of e− with some margin:

[…] have been used as a standard component in statistical machine translation systems to control output fluency. For the syntax-based generation system, the incorporation of an N-gram language model can potentially improve the local fluency of output sequences. In addition, the N-gram language model can be trained separately using a large amount of data, while the syntax-based model requires manual annotation for training.

The standard method for the combination of a syntax model and an N-gram model is linear interpolation. We incorporate fourgram, trigram and bigram scores into our syntax model, so that the score of an edge e becomes:

    F(e) = f(e) + g(e)
         = f(e) + α g_four(e) + β g_tri(e) + γ g_bi(e),

where f is the syntax model score, and g is the N-gram model score. g consists of three components, g_four, g_tri and g_bi, representing the log-probabilities of fourgrams, trigrams and bigrams from the language model, respectively. α, β and γ are the corresponding weights.

During decoding, F(e) is computed incrementally. Again, denoting the sub-edges of e as es,
[…] negative examples; the training algorithm finds a value of θ that best suits the precomputed α, β and γ values, together with the N-gram language model. We call this method g-precomputed interpolation. Yet another method is to initialize θ, α, β, and γ as all zeroes, and run the training algorithm taking into account the N-gram language model. We call this method g-free interpolation.

The incorporation of an N-gram language model into the syntax-based generation system is weakly analogous to N-gram model insertion for syntax-based statistical machine translation systems, both of which apply a score from the N-gram model component in a derivation-building process. As discussed earlier, polynomial-time decoding is typically feasible for syntax-based machine translation systems without an N-gram language model, due to constraints from the grammar. In these cases, incorporation of N-gram language models can significantly increase the complexity of a dynamic-programming decoder (Bar-Hillel et al., 1961). Efficient search has been achieved using chart pruning (Chiang, 2007) and iterative numerical approaches to constrained optimization (Rush and Collins, 2011). In contrast, the incorporation of an N-gram language model into our decoder is more straightforward, and does not add to its asymptotic complexity, due to the heuristic nature of the decoder.

5 Experiments

We use sections 2-21 of CCGBank to train our syntax model, section 00 for development and section 23 for the final test. Derivations from CCGBank are transformed into inputs by turning their surface strings into multi-sets of words. Following Z&C, we treat base noun phrases (i.e. NPs that do not recursively contain other NPs) as atomic units for the input. Output sequences are compared with the original sentences to evaluate their quality. We follow previous work and use the BLEU metric (Papineni et al., 2002) to compare outputs with references.

                  Sentences     Tokens
CCGBank
  training           39,604     929,552
  development         1,913      45,422
GigaWord v4
  AFP            30,363,052 684,910,697
  XIN            15,982,098 340,666,976

Table 1: Number of sentences and tokens by language model source.

Z&C use two methods to construct leaf edges. The first is to assign lexical categories according to a dictionary. There are 26.8 lexical categories for each word on average using this method, corresponding to 26.8 leaf edges. The other method is to use a pre-processing step, a CCG supertagger (Clark and Curran, 2007), to prune candidate lexical categories according to the gold-standard sequence, assuming that for some problems the ambiguities can be reduced (e.g. when the input is already partly correctly ordered). Z&C use different probability cutoff levels (the β parameter in the supertagger) to control the pruning. Here we focus mainly on the dictionary method, which leaves lexical category disambiguation entirely to the generation system. For comparison, we also perform experiments with lexical category pruning. We chose β = 0.0001, which leaves 5.4 leaf edges per word on average.

We used the SRILM Toolkit (Stolcke, 2002) to build a true-case 4-gram language model estimated over the CCGBank training and development data and a large additional collection of fluent sentences in the Agence France-Presse (AFP) and Xinhua News Agency (XIN) subsets of the English GigaWord Fourth Edition (Parker et al., 2009), a total of over 1 billion tokens. The GigaWord data was first pre-processed to replicate the CCGBank tokenization. The total number of sentences and tokens in each LM component is shown in Table 1. The language model vocabulary consists of the 46,574 words that occur in the concatenation of the CCGBank training, development, and test sets. The LM probabilities are estimated using modified Kneser-Ney smoothing (Kneser and Ney, 1995) with interpolation of lower n-gram orders.

5.1 Development experiments

A set of development test results without lexical category pruning (i.e. using the full dictionary) is shown in Table 2. We train the baseline system and our systems under various settings for 10 iterations, and measure the output BLEU scores after each iteration. The timeout value for each sentence is set to 5 seconds. The highest score (max BLEU) and averaged score (avg. BLEU) of each system over the 10 training iterations are shown in the table.
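The interpolated score F(e) = f(e) + α·g_four(e) + β·g_tri(e) + γ·g_bi(e) described above is a plain weighted sum of the syntax-model score and N-gram log-probabilities. A minimal sketch (the function and its inputs are stand-ins for illustration, not the system's actual models; the default weights mirror one setting tried in the development experiments):

```python
import math

def interpolated_score(f_syntax, logp_four, logp_tri, logp_bi,
                       alpha=0.8, beta=0.16, gamma=0.04):
    """F(e) = f(e) + alpha*g_four(e) + beta*g_tri(e) + gamma*g_bi(e).
    The g terms are language-model log-probabilities; weights illustrative."""
    return f_syntax + alpha * logp_four + beta * logp_tri + gamma * logp_bi

# toy call with stand-in scores for one candidate edge
F = interpolated_score(f_syntax=2.5,
                       logp_four=math.log(0.01),
                       logp_tri=math.log(0.05),
                       logp_bi=math.log(0.2))
```

Because each g term is a log-probability (always negative for probabilities below 1), larger weights penalize LM-implausible surface strings more heavily, which is why the weight setting interacts with the syntax-model score scale.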
Method                                        max BLEU   avg. BLEU
baseline                                      38.47      37.36
margin                                        41.20      39.70
margin +LM (g-precomputed)                    41.50      40.84
margin +LM (α = 0, β = 0, γ = 0)              40.83
margin +LM (α = 0.08, β = 0.016, γ = 0.004)   38.99
margin +LM (α = 0.4, β = 0.08, γ = 0.02)      36.17
margin +LM (α = 0.8, β = 0.16, γ = 0.04)      34.74

Table 2: Development test results without lexical category pruning.

[…]terpolation we manually chose α = 0.8, β = 0.16 […]
Example 1
  baseline:    as a nonexecutive director Pierre Vinken , 61 years old , will join the board . 29 Nov.
  margin:      61 years old , the board will join as a nonexecutive director Nov. 29 , Pierre Vinken .
  margin +LM:  as a nonexecutive director Pierre Vinken , 61 years old , will join the board Nov. 29 .

Example 2
  baseline:    Lorillard nor smokers were aware of the Kent cigarettes of any research on the workers who studied the researchers .
  margin:      of any research who studied Neither the workers were aware of smokers on the Kent cigarettes nor the researchers
  margin +LM:  Neither Lorillard nor any research on the workers who studied the Kent cigarettes were aware of smokers of the researchers

Example 3
  baseline:    you But 35 years ago have to recognize that these events took place .
  margin:      recognize But you took place that these events have to 35 years ago .
  margin +LM:  But you have to recognize that these events took place 35 years ago .

Example 4
  baseline:    investors to pour cash into money funds continue in Despite yields recent declines cash .
  margin:      Despite investors , yields continue to pour into money funds recent declines in
  margin +LM:  Despite investors , recent declines in yields continue to pour cash into money funds .

Example 5
  baseline:    yielding The top money funds are currently well over 9 % .
  margin:      The top money funds currently are yielding well over 9 % .
  margin +LM:  The top money funds are yielding well over 9 % currently .

Example 6
  baseline:    where A buffet breakfast , held in the museum was food and drinks to . everyday visitors banned
  margin:      everyday visitors are banned to where A buffet breakfast was held , food and drinks in the museum .
  margin +LM:  A buffet breakfast , everyday visitors are banned to where food and drinks was held in the museum .

Example 7
  baseline:    A Commonwealth Edison spokesman said an administrative nightmare would be tracking down the past 3 1/2 years that the two million customers have . whose changed
  margin:      tracking A Commonwealth Edison spokesman said that the two million customers whose addresses have changed down during the past 3 1/2 years would be an administrative nightmare .
  margin +LM:  an administrative nightmare whose addresses would be tracking down A Commonwealth Edison spokesman said that the two million customers have changed during the past 3 1/2 years .

Example 8
  baseline:    The $ 2.5 billion Byron 1 plant , Ill. , was completed . near Rockford in 1985
  margin:      The $ 2.5 billion Byron 1 plant was near completed in Rockford , Ill. , 1985 .
  margin +LM:  The $ 2.5 billion Byron 1 plant near Rockford , Ill. , was completed in 1985 .

Example 9
  baseline:    will ( During its centennial year , The Wall Street Journal report events of the past century that stand as milestones of American business history . )
  margin:      as The Wall Street Journal ( During its centennial year , milestones stand of American business history that will report events of the past century . )
  margin +LM:  During its centennial year events will report , The Wall Street Journal that stand as milestones of American business history ( of the past century ) .

Table 3: Some chosen examples with significant improvements (supertagger parameter β = 0.0001).
[…] method, the examples are chosen from the development output with lexical category pruning, after the optimal number of training iterations, with the timeout set to 5s. We also tried manually selecting examples without lexical category pruning, but the improvements were not as obvious, partly because the overall fluency was lower for all the three systems.

Table 4 shows a set of examples chosen randomly from the development test outputs of our system with the N-gram model. The optimal number of training iterations is used, and a timeout of 1 minute is used in addition to the 5s timeout for comparison. With more time to decode each input, the system gave a BLEU score of 44.61, higher than 41.50 with the 5s timeout.

While some of the outputs we examined are reasonably fluent, most are to some extent fragmentary.[2] In general, the system outputs are still far below human fluency. Some samples are syntactically grammatical, but are semantically anomalous. For example, person names are often confused with company names, and verbs often take unrelated subjects and objects. The problem is much more severe for long sentences, which have more ambiguities. For specific tasks, extra information (such as the source text for machine translation) can be available to reduce ambiguities.

[2] Part of the reason for some fragmentary outputs is the default output mechanism: partial derivations from the chart are greedily put together when timeout occurs before a goal hypothesis is found.

6 Final results

The final results of our system without lexical category pruning are shown in Table 5. Rows W09 CLE and W09 AB show the results of the maximum spanning tree and assignment-based algorithms of Wan et al. (2009); rows margin and margin +LM show the results of our large-margin training system and our system with the N-gram model. All these results are directly comparable since we do not use any lexical category pruning for this set of results. For each of our systems, we fix the number of training iterations according to development test scores. Consistent with the development experiments, our sys-
Example 1
  timeout = 5s: drooled the cars and drivers , like Fortune 500 executives . over the race
  timeout = 1m: After schoolboys drooled over the cars and drivers , the race like Fortune 500 executives .

Example 2
  timeout = 5s: One big reason : thin margins .
  timeout = 1m: One big reason : thin margins .

Example 3
  timeout = 5s: You or accountants look around ... and at an eye blinks . professional ballplayers .
  timeout = 1m: blinks nobody You or accountants look around ... and at an eye professional ballplayers

Example 4
  timeout = 5s: most disturbing And of it , are educators , not students , for the wrongdoing is who .
  timeout = 1m: And blamed for the wrongdoing , educators , not students who are disturbing , much of it is most .

Example 5
  timeout = 5s: defeat coaching aids the purpose of which is , He and other critics say can to . standardized tests learning progress
  timeout = 1m: gauge coaching aids learning progress can and other critics say the purpose of which is to defeat , standardized tests .

Example 6
  timeout = 5s: The federal government of government debt because Congress has lifted the ceiling on U.S. savings bonds suspended sales
  timeout = 1m: The federal government suspended sales of government debt because Congress has n't lifted the ceiling on U.S. savings bonds .

Table 4: Some examples chosen at random from development test outputs without lexical category pruning.
[...] (2011). Unlike our system and Wan et al. (2009), input dependencies provide additional information to these systems. Although the search space can be constrained by the assumption of projectivity, permutation of modifiers of the same head word makes exact inference for tree linearization intractable. The above systems typically apply approximate inference, such as beam search. While syntax-based features are commonly used by these systems for linearization, Filippova and Strube (2009) apply a trigram model to control local fluency within constituents. A dependency-based N-gram model has also been shown effective for the linearization task (Guo et al., 2011). The best-first inference and timeout mechanism of our system is similar to that of White (2004), a surface realizer from logical forms using CCG.

System | BLEU
W09 CLE | 26.8
W09 AB | 33.7
Z&C11 | 40.1
margin | 42.5
margin +LM | 43.8

Table 5: Test results without lexical category pruning.

System | BLEU
Z&C11 | 43.2
margin | 44.7
margin +LM | 46.1

Table 6: Test results with lexical category pruning (supertagger parameter = 0.0001).
References

Yehoshua Bar-Hillel, M. Perles, and E. Shamir. 1961. On formal properties of simple phrase structure grammars. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, 14:143-172. Reprinted in Y. Bar-Hillel (1964), Language and Information: Selected Essays on their Theory and Application, Addison-Wesley, 116-150.

Regina Barzilay and Kathleen McKeown. 2005. Sentence fusion for multidocument news summarization. Computational Linguistics, 31(3):297-328.

Graeme Blackwood, Adrià de Gispert, and William Byrne. 2010. Fluency constraints for minimum Bayes-risk decoding of statistical machine translation lattices. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 71-79, Beijing, China, August.

Bernd Bohnet, Leo Wanner, Simon Mill, and Alicia Burga. 2010. Broad coverage multilingual deep sentence generation with a stochastic multi-level realizer. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 98-106, Beijing, China, August.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.

Sharon A. Caraballo and Eugene Charniak. 1998. New figures of merit for best-first probabilistic chart parsing. Computational Linguistics, 24:275-298, June.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201-228.

Stephen Clark and James R. Curran. 2007. Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics, 33(4):493-552.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551-585.

Katja Filippova and Michael Strube. 2007. Generating constituent order in German clauses. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 320-327, Prague, Czech Republic, June.

Katja Filippova and Michael Strube. 2009. Tree linearization in English: Improving language model based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 225-228, Boulder, Colorado, June.

Yoav Goldberg and Michael Elhadad. 2010. An efficient algorithm for easy-first non-directional dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 742-750, Los Angeles, California, June.

Yuqing Guo, Deirdre Hogan, and Josef van Genabith. 2011. DCU at Generation Challenges 2011 surface realisation track. In Proceedings of the Generation Challenges Session at the 13th European Workshop on Natural Language Generation, pages 227-229, Nancy, France, September.

Julia Hockenmaier and Mark Steedman. 2002. Generative models for statistical parsing with Combinatory Categorial Grammar. In Proceedings of the 40th Meeting of the ACL, pages 335-342, Philadelphia, PA.

Julia Hockenmaier and Mark Steedman. 2007. CCGbank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3):355-396.

R. Kneser and H. Ney. 1995. Improved backing-off for m-gram language modeling. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP-95), volume 1, pages 181-184.

Philipp Koehn, Franz Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL/HLT, Edmonton, Canada, May.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic, June.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, Philadelphia, Pennsylvania, USA, July.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2009. English Gigaword Fourth Edition. Linguistic Data Consortium.

Alexander M. Rush and Michael Collins. 2011. Exact decoding of syntactic translation models through Lagrangian relaxation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 72-82, Portland, Oregon, USA, June.

Libin Shen and Aravind Joshi. 2008. LTAG dependency parsing with bidirectional incremental construction. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 495-504, Honolulu, Hawaii, October.

Libin Shen, Giorgio Satta, and Aravind Joshi. 2007. Guided learning for bidirectional sequence classification. In Proceedings of ACL, pages 760-767, Prague, Czech Republic, June.

Mark Steedman. 2000. The Syntactic Process. The MIT Press, Cambridge, Mass.

Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, pages 901-904.

Stephen Wan, Mark Dras, Robert Dale, and Cécile Paris. 2009. Improving grammaticality in statistical sentence generation: Introducing a dependency spanning tree algorithm with an argument satisfaction model. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 852-860, Athens, Greece, March.

Michael White. 2004. Reining in CCG chart realization. In Proceedings of INLG-04, pages 182-191.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3).

Yue Zhang and Stephen Clark. 2011. Syntax-based grammaticality improvement using CCG and guided search. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1147-1157, Edinburgh, Scotland, UK, July.
Midge: Generating Image Descriptions From Computer Vision Detections

Margaret Mitchell, Jesse Dodge, Amit Goyal, Kota Yamaguchi, Karl Stratos, Xufeng Han, Alyssa Mensch, Alex Berg, Tamara Berg, Hal Daumé III

U. of Aberdeen and Oregon Health and Science University, m.mitchell@abdn.ac.uk
Stony Brook University, {aberg,tlberg,xufhan,kyamagu}@cs.stonybrook.edu
U. of Maryland, {hal,amit}@umiacs.umd.edu
Columbia University, stratos@cs.columbia.edu
U. of Washington, dodgejesse@gmail.com
MIT, acmensch@mit.edu
Abstract

This paper introduces a novel generation system that composes humanlike descriptions of images from computer vision detections. By leveraging syntactically informed word co-occurrence statistics, the generator filters and constrains the noisy detections output from a vision system to generate syntactic trees that detail what the computer vision system sees. Results show that the generation system outperforms state-of-the-art systems, automatically generating some of the most natural image descriptions to date.

The bus by the road with a clear blue sky
Figure 1: Example image with generated description.

1 Introduction

It is becoming a real possibility for intelligent systems to talk about the visual world. New ways of mapping computer vision to generated language have emerged in the past few years, with a focus on pairing detections in an image to words (Farhadi et al., 2010; Li et al., 2011; Kulkarni et al., 2011; Yang et al., 2011). The goal in connecting vision to language has varied: systems have started producing language that is descriptive and poetic (Li et al., 2011), summaries that add content where the computer vision system does not (Yang et al., 2011), and captions copied directly from other images that are globally (Farhadi et al., 2010) and locally similar (Ordonez et al., 2011).

A commonality between all of these approaches is that they aim to produce natural-sounding descriptions from computer vision detections. This commonality is our starting point: We aim to design a system capable of producing natural-sounding descriptions from computer vision detections that is flexible enough to become more descriptive and poetic, or include likely information from a language model, or to be short and simple, but as true to the image as possible.

Rather than using a fixed template capable of generating one kind of utterance, our approach therefore lies in generating syntactic trees. We use a tree-generating process (Section 4.3) similar to a Tree Substitution Grammar, but preserving some of the idiosyncrasies of the Penn Treebank syntax (Marcus et al., 1995) on which most statistical parsers are developed. This allows us to automatically parse and train on an unlimited amount of text, creating data-driven models that flesh out descriptions around detected objects in a principled way, based on what is both likely and syntactically well-formed.

An example generated description is given in Figure 1, and example vision output/natural language generation (NLG) input is given in Figure 2. The system (Midge) generates descriptions in present-tense, declarative phrases, as a naïve viewer without prior knowledge of the photograph's content.1

Midge is built using the following approach: An image processed by computer vision algorithms can be characterized as a triple <Ai, Bi, Ci>, where:

1 Midge is available to try online at: http://recognition.cs.stonybrook.edu:8080/mitchema/midge/
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 747-756, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
stuff: sky .999
  id: 1
  atts: clear:0.432, blue:0.945, grey:0.853, white:0.501 ...
  b. box: (1,1 440,141)
stuff: road .908
  id: 2
  atts: wooden:0.722, clear:0.020 ...
  b. box: (1,236 188,94)
object: bus .307
  id: 3
  atts: black:0.872, red:0.244 ...
  b. box: (38,38 366,293)
preps: id 1, id 2: by; id 1, id 3: by; id 2, id 3: below

Figure 2: Example computer vision output and natural language generation input. Values correspond to scores from the vision detections.

- Ai is the set of object/stuff detections with bounding boxes and associated attribute detections within those bounding boxes.
- Bi is the set of action or pose detections associated to each ai in Ai.
- Ci is the set of spatial relationships that hold between the bounding boxes of each pair ai, aj in Ai.

Similarly, a description of an image can be characterized as a triple <Ad, Bd, Cd>, where:

- Ad is the set of nouns in the description with associated modifiers.
- Bd is the set of verbs associated to each ad in Ad.
- Cd is the set of prepositions that hold between each pair of ad, ae in Ad.

With this representation, mapping <Ai, Bi, Ci> to <Ad, Bd, Cd> is trivial. The problem then becomes: (1) how to filter out detections that are wrong; (2) how to order the objects so that they are mentioned in a natural way; (3) how to connect these ordered objects within a syntactically/semantically well-formed tree; and (4) how to add further descriptive information from language modeling alone, if required.

Our solution lies in using Ai and Ad as description anchors. In computer vision, object detections form the basis of action/pose, attribute, and spatial relationship detections; therefore, in our approach to language generation, nouns for the object detections are used as the basis for the description. Likelihood estimates of syntactic structure and word co-occurrence are conditioned on object nouns, and this enables each noun head in a description to select for the kinds of structures it tends to appear in (syntactic constraints) and the other words it tends to occur with (semantic constraints). This is a data-driven way to generate likely adjectives, prepositions, determiners, etc., taking the intersection of what the vision system predicts and how the object noun tends to be described.

2 Background

Our approach to describing images starts with a system from Kulkarni et al. (2011) that composes novel captions for images in the PASCAL sentence data set,2 introduced in Rashtchian et al. (2010). This provides multiple object detections based on Felzenszwalb's mixtures of multiscale deformable parts models (Felzenszwalb et al., 2008), and stuff detections (roughly, mass nouns: things like sky and grass) based on linear SVMs for low-level region features.

Appearance characteristics are predicted using trained detectors for colors, shapes, textures, and materials, an idea originally introduced in Farhadi et al. (2009). Local texture, Histograms of Oriented Gradients (HOG) (Dalal and Triggs, 2005), edge, and color descriptors inside the bounding box of a recognized object are binned into histograms for a vision system to learn to recognize when an object is rectangular, wooden, metal, etc. Finally, simple preposition functions are used to compute the spatial relations between objects based on their bounding boxes.

The original Kulkarni et al. (2011) system generates descriptions with a template, filling in slots by combining computer vision outputs with text-based statistics in a conditional random field to predict the most likely image labeling. Template-based generation is also used in the recent Yang et al. (2011) system, which fills in likely verbs and prepositions by dependency parsing the human-written UIUC Pascal-VOC dataset (Farhadi et al., 2010) and selecting the dependent/head relation with the highest log likelihood ratio.

Template-based generation is useful for automatically generating consistent sentences; however, if the goal is to vary or add to the text produced, it may be suboptimal (cf. Reiter and Dale (1997)). Work that does not use template-based generation includes Yao et al. (2010), who generate syntactic trees, similar to the approach in this paper.

2 http://vision.cs.uiuc.edu/pascal-sentences/
However, their system is not automatic, requiring extensive hand-coded semantic and syntactic details. Another approach is provided in Li et al. (2011), who use image detections to select and combine web-scale n-grams (Brants and Franz, 2006). This automatically generates descriptions that are either poetic or strange (e.g., "tree snowing black train").

A different line of work transfers captions of similar images directly to a query image. Farhadi et al. (2010) use <object, action, scene> triples predicted from the visual characteristics of the image to find potential captions. Ordonez et al. (2011) use global image matching with local reordering from a much larger set of captioned photographs. These transfer-based approaches result in natural captions (they are written by humans) that may not actually be true of the image.

This work learns and builds from these approaches. Following Kulkarni et al. and Li et al., the system uses large-scale text corpora to estimate likely words around object detections. Following Yang et al., the system can hallucinate likely words using word co-occurrence statistics alone. And following Yao et al., the system aims for naturally varied but well-formed text, generating syntactic trees rather than filling in a template. In addition to these tasks, Midge automatically decides what the subject and objects of the description will be, leverages the collected word co-occurrence statistics to filter possible incorrect detections, and offers the flexibility to be as descriptive or as terse as possible, specified by the user at run-time. The end result is a fully automatic vision-to-language system that is beginning to generate syntactically and semantically well-formed descriptions with naturalistic variation. Example descriptions are given in Figures 4 and 5, and descriptions from other recent systems are given in Figure 3.

black, blue, brown, colorful, golden, gray, green, orange, pink, red, silver, white, yellow, bare, clear, cute, dirty, feathered, flying, furry, pine, plastic, rectangular, rusty, shiny, spotted, striped, wooden

Table 1: Modifiers used to extract training corpus.

Kulkarni et al.: This is a picture of three persons, one bottle and one diningtable. The first rusty person is beside the second person. The rusty bottle is near the first rusty person, and within the colorful diningtable. The second person is by the third rusty person. The colorful diningtable is near the first rusty person, and near the second person, and near the third rusty person.
Yang et al.: Three people are showing the bottle on the street
Midge: people with a bottle at the table

Kulkarni et al.: This is a picture of two pottedplants, one dog and one person. The black dog is by the black person, and near the second feathered pottedplant.
Yang et al.: The person is sitting in the chair in the room
Midge: a person in black with a black dog by potted plants

Figure 3: Descriptions generated by Midge, Kulkarni et al. (2011) and Yang et al. (2011) on the same images. Midge uses the Kulkarni et al. (2011) front-end, and so outputs are directly comparable.

The results are promising, but it is important to note that Midge is a first-pass system through the steps necessary to connect vision to language at a deep syntactic/semantic level. As such, it uses basic solutions at each stage of the process, which may be improved: Midge serves as an illustration of the types of issues that should be handled to automatically generate syntactic trees from vision detections, and offers some possible solutions. It is evaluated against the Kulkarni et al. system, the Yang et al. system, and human-written descriptions on the same set of images in Section 5, and is found to significantly outperform the automatic systems.

3 Learning from Descriptive Text

To train our system on how people describe images, we use 700,000 Flickr (Flickr, 2011) images with associated descriptions from the dataset in Ordonez et al. (2011). This is separate from our evaluation image set, consisting of 840 PASCAL images. The Flickr data is messier than datasets created specifically for vision training, but provides the largest corpus of natural descriptions of images to date.

We normalize the text by removing emoticons and mark-up language, and parse each caption using the Berkeley parser (Petrov, 2010). Once parsed, we can extract syntactic information for individual (word, tag) pairs.
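The extraction of (word, tag) pairs and syntactically informed co-occurrence counts from parsed captions can be sketched as follows, assuming bracketed Penn Treebank-style parses. This is a simplified illustration rather than the actual pipeline; real preprocessing (emoticon and mark-up removal, full tree traversal) is omitted.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\(([A-Z$.,:]+) ([^()\s]+)\)")

def word_tag_pairs(parse):
    """Extract (word, tag) pairs from a bracketed parse string such as
    '(NP (DT a) (JJ clear) (JJ blue) (NN sky))'."""
    return [(word, tag) for tag, word in TOKEN_RE.findall(parse)]

def adjective_noun_counts(parses):
    """Count JJ-NN co-occurrences within each parsed caption; these are
    the kind of syntactically informed statistics the models below use."""
    counts = Counter()
    for parse in parses:
        pairs = word_tag_pairs(parse)
        nouns = [w for w, t in pairs if t.startswith("NN")]
        adjs = [w for w, t in pairs if t == "JJ"]
        for n in nouns:
            for a in adjs:
                counts[(a, n)] += 1
    return counts
```

Running this over the parsed Flickr captions would yield counts such as how often "clear" modifies "sky", which later stages condition on.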
a cow with green grass | sheep with a gray sky by the road | people with boats | a brown cow | people at a wooden table

Figure 4: Example generated outputs.

Awkward prepositions: a person on the dog | boats under the sky | a black bicycle at the sky
Incorrect detections: a yellow bus | a green potted plant with people | cows by black sheep by the road

Figure 5: Example generated outputs: Not quite right.
(NP (NP (NP (DT -) (NN people)) (PP (IN with) (NP (DT a) (NN bottle)))) (PP (IN at) (NP (DT the) (NN table))))

Figure 6: Tree generated from tree growth process.

Midge was developed using detections run on Flickr images, incorporating action/pose detections for verbs as well as object detections for nouns. In testing, we generate descriptions for the PASCAL images, which have been used in earlier work on the vision-to-language connection (Kulkarni et al., 2011; Yang et al., 2011), allowing us to compare systems directly. Action and pose detection for this data set still does not work well, and so the system does not receive these detections from the vision front-end. However, the system can still generate verbs when action and pose detectors have been run, and this framework allows the system to hallucinate likely verbal constructions between objects if specified at run-time. A similar approach was taken in Yang et al. (2011). Some examples are given in Figure 7.

We follow a three-tiered generation process (Reiter and Dale, 2000), utilizing content determination to first cluster and order the object nouns, create their local subtrees, and filter incorrect detections; microplanning to construct full syntactic trees around the noun clusters; and surface realization to order selected modifiers, realize them as postnominal or prenominal, and select final outputs. The system follows an overgenerate-and-select approach (Langkilde and Knight, 1998), which allows different final trees to be selected with different settings.

4.1 Knowledge Base

Midge uses a knowledge base that stores models for different tasks during generation. These models are primarily data-driven, but we also include a hand-built component to handle a small set of rules. The data-driven component provides the syntactically informed word co-occurrence statistics learned from the Flickr data, a model for ordering the selected nouns in a sentence, and a model to change computer vision attributes to attribute:value pairs. Below, we discuss the three main data-driven models within the generation pipeline. The hand-built component contains plural forms of singular nouns, the list of possible spatial relations shown in Table 3, and a mapping between attribute values and modifier surface forms (e.g., a green detection for person is to be realized as the postnominal modifier in green).

4.2 Content Determination

4.2.1 Step 1: Group the Nouns

An initial set of object detections must first be split into clusters that give rise to different sentences. If more than 3 objects are detected in the image, the system begins splitting these into different noun groups. In future work, we aim to compare principled approaches to this task, e.g., using mutual information to cluster similar nouns together. The current system randomizes which nouns appear in the same group.

4.2.2 Step 2: Order the Nouns

Each group of nouns is then ordered to determine when the nouns are mentioned in a sentence. Because the system generates declarative sentences, this automatically determines the subject and objects. This is a novel contribution for a general problem in NLG, and initial evaluation (Section 5) suggests it works reasonably well.

To build the nominal ordering model, we use WordNet to associate all head nouns in the Flickr data to all of their hypernyms. A description is represented as an ordered set [a1...an] where each ap is a noun with position p in the set of head nouns in the sentence. For the position pi of each hypernym ha in each sentence with n head nouns, we estimate p(pi | n, ha).

During generation, the system greedily maximizes p(pi | n, ha) until all nouns have been ordered. Example orderings are shown in Figure 8. This model automatically places animate objects near the beginning of a sentence, which follows psycholinguistic work in object naming (Branigan et al., 2007).

Unordered | Ordered
bottle, table, person | person, bottle, table
road, sky, cow | cow, road, sky

Figure 8: Example nominal orderings.
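The ordering model can be sketched as follows: position counts are gathered from observed descriptions, and generation greedily fills each position in turn. The class and method names are assumptions, and WordNet hypernym lookup is abstracted away (head strings are ordered directly).

```python
from collections import defaultdict

class NominalOrderer:
    """Greedy nominal ordering by p(position | n, hypernym)."""

    def __init__(self):
        # counts[(hypernym, n)][position] = frequency
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, heads_in_order):
        """Record the positions of head nouns in one observed description."""
        n = len(heads_in_order)
        for pos, h in enumerate(heads_in_order):
            self.counts[(h, n)][pos] += 1

    def prob(self, pos, n, h):
        c = self.counts[(h, n)]
        total = sum(c.values())
        return c[pos] / total if total else 1.0 / n  # uniform back-off

    def order(self, heads):
        n = len(heads)
        remaining, ordered = list(heads), []
        for pos in range(n):
            # Greedily pick the noun most likely at this position.
            best = max(remaining, key=lambda h: self.prob(pos, n, h))
            remaining.remove(best)
            ordered.append(best)
        return ordered
```

Trained on orderings like those in Figure 8, `order(["bottle", "table", "person"])` places the animate noun first.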
4.2.3 Step 3: Filter Incorrect Attributes

For the system to be able to extend coverage as new computer vision attribute detections become available, we develop a method to automatically
A person sitting on a sofa | Cows grazing | Airplanes flying | A person walking a dog

Figure 7: Hallucinating: Creating likely actions. Straightforward to do, but can often be wrong.

group adjectives into broader attribute classes,3 and the generation system uses these classes when deciding how to describe objects. To group adjectives, we use a bootstrapping technique (Kozareva et al., 2008) that learns which adjectives tend to co-occur, and groups these together to form an attribute class. Co-occurrence is computed using cosine (distributional) similarity between adjectives, considering adjacent nouns as context (i.e., JJ NN constructions). Contexts (nouns) for adjectives are weighted using Pointwise Mutual Information and only the top 1000 nouns are selected for every adjective. Some of the learned attribute classes are given in Table 2.

COLOR: purple, blue, green, red, white ...
MATERIAL: plastic, wooden, silver ...
SURFACE: furry, fluffy, hard, soft ...
QUALITY: shiny, rust, dirty, broken ...

Table 2: Example attribute classes and values.

In the Flickr corpus, we find that each attribute (COLOR, SIZE, etc.) rarely has more than a single value in the final description, with the most common (COLOR) co-occurring less than 2% of the time. Midge enforces this idea to select the most likely word v for each attribute from the detections. In a noun phrase headed by an object noun, NP{NN noun}, the prenominal adjective (JJ v) for each attribute is selected using maximum likelihood.
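The maximum-likelihood selection of at most one value per attribute class can be sketched as below. The data structures and cutoff handling are simplified assumptions, not the system's actual implementation.

```python
def select_attribute_values(noun, detected, cooccurrence, cutoff=0.01):
    """Pick at most one likely value per attribute class for a head noun.

    `detected` maps attribute class -> candidate (value, vision score)
    pairs; `cooccurrence` holds corpus counts of (adjective, noun) pairs
    used for the maximum-likelihood choice.
    """
    noun_total = sum(c for (adj, n), c in cooccurrence.items() if n == noun)
    chosen = {}
    for attr, candidates in detected.items():
        best, best_p = None, 0.0
        for value, _score in candidates:
            # Maximum-likelihood estimate of p(adjective | noun).
            p = cooccurrence.get((value, noun), 0) / max(1, noun_total)
            if p > best_p and p > cutoff:
                best, best_p = value, p
        if best is not None:
            chosen[attr] = best
    return chosen
```

Intersecting the vision candidates with corpus likelihoods in this way both filters unlikely values (e.g., a wooden sky) and enforces a single value per attribute class.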
4.2.4 Step 4: Group Plurals

How to generate natural-sounding spatial relations and modifiers for a set of objects, as opposed to a single object, is still an open problem (Funakoshi et al., 2004; Gatt, 2006). In this work, we use a simple method to group all same-type objects together, associate them to the plural form listed in the KB, discard the modifiers, and return spatial relations based on the first recognized member of the group.

4.2.5 Step 5: Gather Local Subtrees Around Object Nouns

Tree 1: NP -> DT{0,1} JJ* NN
Tree 2: S -> NP{NN n} VP{VBZ}
Tree 3: NP -> NP{NN n} VP{VB(G|N)}
Tree 4: NP -> NP{NN n} PP{IN}
Tree 5: PP -> IN NP{NN n}
Tree 6: VP -> VB(G|N|Z) PP{IN}
Tree 7: VP -> VB(G|N|Z) NP{NN n}

Figure 9: Initial subtree frames for generation, present-tense declarative phrases. Marked nodes are substitution sites, * marks that 0 or more sister nodes of this type are permitted, {0,1} marks that this node can be included or excluded. Input: set of ordered nouns; Output: trees preserving nominal ordering.

Possible actions/poses and spatial relationships between object nouns, represented by verbs and prepositions, are selected using the subtree frames listed in Figure 9. Each head noun selects for its likely local subtrees, some of which are not fully formed until the Microplanning stage. As an example of how this process works, see Figure 10, which illustrates the combination of Trees 4 and 5. For simplicity, we do not include the selection of further subtrees. The subject noun duck selects for prepositional phrases headed by different prepositions, and the object noun grass selects for prepositions that head the prepositional phrase in which it is embedded. Full PP subtrees are created during Microplanning by taking the intersection of both.

The leftmost noun in the sequence is given a rightward directionality constraint, placing it as the subject of the sentence, and so it will only se-

3 What in computer vision are called attributes are called values in NLG. A value like red belongs to a COLOR attribute, and we use this distinction in the system.
a over b: a above b | b below a | b beneath a | a by b | b by a | a on b | b under a | b underneath a | a upon b | a over b
a by b: a against b | b against a | b around a | a around b | a at b | b at a | a beside b | b beside a | a by b | b by a | a near b | b near a | b with a | a with b
a in b: a in b | b outside a | a within b | a by b | b by a

Table 3: Possible prepositions from bounding boxes.
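A simple preposition function of the kind Table 3 summarizes might look like the following. The box format and geometry tests here are illustrative assumptions, not the paper's actual functions.

```python
def preposition_candidates(box_a, box_b):
    """Map two bounding boxes to a coarse spatial relation class.

    Boxes are (x, y, w, h) with y increasing downward, matching the
    image coordinates in Figure 2.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    def inside(inner, outer):
        ix, iy, iw, ih = inner
        ox, oy, ow, oh = outer
        return ix >= ox and iy >= oy and ix + iw <= ox + ow and iy + ih <= oy + oh

    if inside(box_a, box_b):
        return ["a in b"]
    if ay + ah <= by:          # a ends above where b starts
        return ["a over b"]
    if by + bh <= ay:
        return ["b over a"]
    return ["a by b"]          # overlapping or side by side
```

Each returned class would then expand to the full set of surface prepositions in the corresponding row of Table 3, from which the corpus statistics pick a likely one.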
Tree 1: Collect all NP -> (DT det) (JJ adj)* (NN noun) and NP -> (JJ adj)* (NN noun) subtrees, where p((JJ adj) | (NN noun)) exceeds the likelihood cutoff for each adj, p((DT det) | JJ, (NN noun)) exceeds the likelihood cutoff, and the probability of a determiner for the head noun is higher than the probability of no determiner.

Tree 7: Collect VP subtrees headed by (VBX verb) with embedded NP objects, where p(VP{VBX verb} | NP{NN noun} = OBJ) exceeds the likelihood cutoff.

Any number of adjectives (including none) may be generated, and we include the presence or absence of an adjective when calculating which determiner to include.

The reasoning behind the generation of these subtrees is to automatically learn whether to treat [...]

4.3 Microplanning

4.3.1 Step 6: Create Full Trees

In Microplanning, full trees are created by taking the intersection of the subtrees created in Content Determination. Because the nouns are ordered, it is straightforward to combine the subtrees surrounding a noun in position 1 with subtrees surrounding a noun in position 2.
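The intersection-based combination of a Tree 4 frame (subject NP plus PP slot) with a Tree 5 frame (PP over the object NP), as in the duck/grass example, can be sketched as below. Trees are nested tuples, the (CC and) backoff is included, and all names are assumptions rather than the system's implementation.

```python
def combine_pp(subject_np, object_np, subj_preps, obj_preps):
    """Combine two Tree-1 noun phrases through a PP, keeping only
    prepositions licensed by both the subject's Tree-4 frames and the
    object's Tree-5 frames (their intersection)."""
    shared = [p for p in subj_preps if p in obj_preps]
    if not shared:
        # Backoff: conjoin the nouns with (CC and).
        return ("NP", subject_np, ("CC", "and"), object_np)
    prep = shared[0]  # in the full system, ranked by likelihood
    return ("NP", subject_np, ("PP", ("IN", prep), object_np))
```

For example, if the subject licenses {on, in} and the object licenses {in, by}, only "in" survives the intersection; with no shared preposition, the two NPs are conjoined instead.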
Two further trees are necessary to allow the subtrees gathered to combine within the Penn Treebank syntax. These are given in Figure 11. If two nouns in a proposed sentence cannot be combined with prepositions or verbs, we back off to combine them using (CC and).

NP -> NP (CC and) NP    VP -> VP VP*

Figure 11: Auxiliary trees for generation.

Stepping through this process, all nouns will have a set of subtrees selected by Tree 1. Prepositional relationships between nouns are created by substituting Tree 1 subtrees into the NP nodes of Trees 4 and 5, as shown in Figure 10. Verbal relationships between nouns are created by substituting Tree 1 subtrees into Trees 2, 3, and 7. Verb-with-preposition relationships are created between nouns by substituting the VBX node in Tree 6 with the corresponding node in Trees 2 and 3 to grow the tree to the right, and the PP node in Tree 6 with the corresponding node in Tree 5 to grow the tree to the left. Generation of a full tree stops when all nouns in a group are dominated by the same node, either an S or NP.

4.4 Surface Realization

In the surface realization stage, the system selects a single tree from the generated set of possible trees and removes mark-up to produce a final string. This is also the stage where punctuation may be added. Different strings may be generated depending on different specifications from the user, as discussed at the beginning of Section 4 and shown in the online demo. To evaluate the system against other systems, we specify that the system should (1) not hallucinate likely verbs; and (2) return the longest string possible.

4.4.1 Step 7: Get Final Tree, Clear Mark-Up

We explored two methods for selecting a final string. In one method, a trigram language model [...] words. We find that the second method produces descriptions that seem more natural and varied than the n-gram ranking method for our development set, and so use the longest string method in evaluation.

4.4.2 Step 8: Prenominal Modifier Ordering

To order sets of selected adjectives, we use the top-scoring prenominal modifier ordering model discussed in Mitchell et al. (2011). This is an n-gram model constructed over noun phrases that were extracted from an automatically parsed version of the New York Times portion of the Gigaword corpus (Graff and Cieri, 2003). With this in place, "blue clear sky" becomes "clear blue sky", "wooden brown table" becomes "brown wooden table", etc.

5 Evaluation

Each set of sentences is generated with the likelihood cutoff set to .01 and the observation count cutoff set to 3. We compare the system against human-written descriptions and two state-of-the-art vision-to-language systems, the Kulkarni et al. (2011) and Yang et al. (2011) systems.

Human judgments were collected using Amazon's Mechanical Turk (Amazon, 2011). We follow recommended practices for evaluating an NLG system (Reiter and Belz, 2009) and for running a study on Mechanical Turk (Callison-Burch and Dredze, 2010), using a balanced design with each subject rating 3 descriptions from each system. Subjects rated their level of agreement on a 5-point Likert scale including a neutral middle position, and since quality ratings are ordinal (points are not necessarily equidistant), we evaluate responses using a non-parametric test. Participants that took less than 3 minutes to answer all 60 questions and did not include a humanlike rating for at least 1 of the 3 human-written descriptions were removed and replaced. It is important to note that this evaluation compares full generation systems; many factors are at play in each system that
built using the Europarl (Koehn, 2005) data with may also influence participants perception, e.g.,
start/end symbols returns the highest-scoring de- sentence length (Napoles et al., 2011) and punc-
scription (normalizing for length). In the second tuation decisions.
method, we limit the generation system to select The systems are evaluated on a set of 840
the most likely closed-class words (determiners, images evaluated in the original Kulkarni et al.
prepositions) while building the subtrees, over- (2011) system. Participants were asked to judge
generating all possible adjective combinations. the statements given in Figure 12, from Strongly
The final string is then the one with the most Disagree to Strongly Agree.
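The effect of the Step 8 reordering above can be illustrated with a toy version of such an n-gram model. The bigram counts and the exhaustive-permutation strategy below are illustrative simplifications, not the Gigaword-trained model the paper uses:

```python
from itertools import permutations

# Invented bigram counts over adjacent prenominal modifiers; the real
# model estimates such statistics from noun phrases in parsed news text.
BIGRAM_COUNTS = {
    ("clear", "blue"): 9, ("blue", "clear"): 1,
    ("brown", "wooden"): 8, ("wooden", "brown"): 1,
}

def score(order):
    """Sum the bigram counts of all adjacent pairs in a modifier order."""
    return sum(BIGRAM_COUNTS.get(pair, 0) for pair in zip(order, order[1:]))

def order_modifiers(modifiers):
    """Return the highest-scoring permutation of the given modifiers."""
    return max(permutations(modifiers), key=score)

print(order_modifiers(["blue", "clear"]))    # → ('clear', 'blue')
print(order_modifiers(["wooden", "brown"]))  # → ('brown', 'wooden')
```

This reproduces the paper's examples ("blue clear" → "clear blue", "wooden brown" → "brown wooden") once the counts favor the natural order.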
754
                      Grammaticality   Main Aspects     Correctness      Order            Humanlikeness
Human                 4 (3.77, 1.19)   4 (4.09, 0.97)   4 (3.81, 1.11)   4 (3.88, 1.05)   4 (3.88, 0.96)
Midge                 3 (2.95, 1.42)   3 (2.86, 1.35)   3 (2.95, 1.34)   3 (2.92, 1.25)   3 (3.16, 1.17)
Kulkarni et al. 2011  3 (2.83, 1.37)   3 (2.84, 1.33)   3 (2.76, 1.34)   3 (2.78, 1.23)   3 (3.13, 1.23)
Yang et al. 2011      3 (2.95, 1.49)   2 (2.31, 1.30)   2 (2.46, 1.36)   2 (2.53, 1.26)   3 (2.97, 1.23)

Table 4: Median scores for systems, with mean and standard deviation in parentheses. Points on the rating scale cannot be assumed to be equidistant, so we analyze results using a non-parametric test.
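As a sketch of the kind of non-parametric comparison summarized in Table 4, the Wilcoxon signed-rank T statistic for paired ratings can be computed as follows. The ratings in the example are invented, and a full analysis would also derive a p-value from T, which this sketch omits:

```python
def wilcoxon_t(xs, ys):
    """Paired Wilcoxon signed-rank T statistic: rank the non-zero
    differences by absolute value (tied values share an average rank)
    and return the smaller of the positive- and negative-rank sums."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    ordered = sorted(abs(d) for d in diffs)
    rank = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        rank[ordered[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    w_pos = sum(rank[abs(d)] for d in diffs if d > 0)
    w_neg = sum(rank[abs(d)] for d in diffs if d < 0)
    return min(w_pos, w_neg)

# Invented ratings: six items scored against a paired system's scores.
print(wilcoxon_t([4, 5, 4, 4, 3, 5], [3, 3, 4, 2, 3, 4]))  # → 0
```

A T of 0 means every non-tied pair favored the same side, the strongest possible signed-rank result for this sample size.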
GRAMMATICALITY: This description is grammatically correct.
MAIN ASPECTS: This description describes the main aspects of this image.
CORRECTNESS: This description does not include extraneous or incorrect information.
ORDER: The objects described are mentioned in a reasonable order.
HUMANLIKENESS: It sounds like a person wrote this description.

Figure 12: Mechanical Turk prompts.

We report the scores for the systems in Table 4. Results are analyzed using the non-parametric Wilcoxon Signed-Rank test, which uses median values to compare the different systems. Midge outperforms all recent automatic approaches on CORRECTNESS and ORDER, and outperforms Yang et al. additionally on HUMANLIKENESS and MAIN ASPECTS. Differences between Midge and Kulkarni et al. are significant at p < .01; between Midge and Yang et al. at p < .001. For all metrics, human-written descriptions still outperform automatic approaches (p < .001).

These findings are striking, particularly because Midge uses the same input as the Kulkarni et al. system. Using syntactically informed word co-occurrence statistics from a large corpus of descriptive text improves over the state of the art, allowing syntactic trees to be generated that capture the variation of natural language.

6 Discussion

Midge automatically generates language that is as good as or better than that of template-based systems, tying vision to language at a syntactic/semantic level to produce natural language descriptions. Results are promising, but there is more work to be done: evaluators can still tell the difference between human-written descriptions and automatically generated descriptions.

Improvements to the generated language are possible on both the vision side and the language side. On the computer vision side, incorrect objects are often detected and salient objects are often missed. Midge does not yet screen out unlikely objects or add likely objects, and so provides no filter for this. On the language side, likelihood is estimated directly, and the system primarily uses simple maximum likelihood estimations to combine subtrees. The descriptive corpus that informs the system is not parsed with a domain-adapted parser; with this in place, the syntactic constructions that Midge learns will better reflect the constructions that people use.

In future work, we hope to address these issues as well as advance the syntactic derivation process, providing an adjunction operation (for example, to add likely adjectives or adverbs based on language alone). We would also like to incorporate meta-data: even when no vision detection fires for an image, the system may be able to generate descriptions of the time and place where an image was taken based on the image file alone.

7 Conclusion

We have introduced a generation system that uses a new approach to generating language, tying a syntactic model to computer vision detections. Midge generates a well-formed description of an image by filtering attribute detections that are unlikely and placing objects into an ordered syntactic structure. Humans judge Midge's output to be the most natural descriptions of images generated thus far. The methods described here are promising for generating natural language descriptions of the visual world, and we hope to expand and refine the system to capture further linguistic phenomena.

8 Acknowledgements

Thanks to the Johns Hopkins CLSP summer workshop 2011 for making this system possible, and to reviewers for helpful comments. This work is supported in part by Michael Collins and by NSF Faculty Early Career Development (CAREER) Award #1054133.
References

Amazon. 2011. Amazon Mechanical Turk: Artificial artificial intelligence.

Holly P. Branigan, Martin J. Pickering, and Mikihiro Tanaka. 2007. Contributions of animacy to grammatical function assignment and word order during production. Lingua, 118(2):172–189.

Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram version 1.

Chris Callison-Burch and Mark Dredze. 2010. Creating speech and language data with Amazon's Mechanical Turk. NAACL 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.

Navneet Dalal and Bill Triggs. 2005. Histograms of oriented gradients for human detection. Proceedings of CVPR 2005.

Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. 2009. Describing objects by their attributes. Proceedings of CVPR 2009.

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences for images. Proceedings of ECCV 2010.

Pedro Felzenszwalb, David McAllester, and Deva Ramanan. 2008. A discriminatively trained, multiscale, deformable part model. Proceedings of CVPR 2008.

Flickr. 2011. http://www.flickr.com. Accessed 1 Sep. 2011.

Kotaro Funakoshi, Satoru Watanabe, Naoko Kuriyama, and Takenobu Tokunaga. 2004. Generating referring expressions using perceptual groups. Proceedings of the 3rd INLG.

Albert Gatt. 2006. Generating collective spatial references. Proceedings of the 28th CogSci.

David Graff and Christopher Cieri. 2003. English Gigaword. Linguistic Data Consortium, Philadelphia, PA. LDC Catalog No. LDC2003T05.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. MT Summit. http://www.statmt.org/europarl/.

Zornitsa Kozareva, Ellen Riloff, and Eduard Hovy. 2008. Semantic class learning from the web with hyponym pattern linkage graphs. Proceedings of ACL-08: HLT.

Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara Berg. 2011. Baby talk: Understanding and generating image descriptions. Proceedings of the 24th CVPR.

Irene Langkilde and Kevin Knight. 1998. Generation that exploits corpus-based statistical knowledge. Proceedings of the 36th ACL.

Siming Li, Girish Kulkarni, Tamara L. Berg, Alexander C. Berg, and Yejin Choi. 2011. Composing simple image descriptions using web-scale n-grams. Proceedings of CoNLL 2011.

Mitchell Marcus, Ann Bies, Constance Cooper, Mark Ferguson, and Alyson Littman. 1995. Treebank II bracketing guide.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Margaret Mitchell, Aaron Dunlop, and Brian Roark. 2011. Semi-supervised modeling for prenominal modifier ordering. Proceedings of the 49th ACL:HLT.

Courtney Napoles, Benjamin Van Durme, and Chris Callison-Burch. 2011. Evaluating sentence compression: Pitfalls and suggested remedies. ACL-HLT Workshop on Monolingual Text-To-Text Generation.

Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2Text: Describing images using 1 million captioned photographs. Proceedings of NIPS 2011.

Slav Petrov. 2010. Berkeley parser. GNU General Public License v.2.

Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting image annotations using Amazon's Mechanical Turk. Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.

Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.

Ehud Reiter and Robert Dale. 1997. Building applied natural language generation systems. Journal of Natural Language Engineering, pages 57–87.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press.

Yezhou Yang, Ching Lik Teo, Hal Daumé III, and Yiannis Aloimonos. 2011. Corpus-guided sentence generation of natural images. Proceedings of EMNLP 2011.

Benjamin Z. Yao, Xiong Yang, Liang Lin, Mun Wai Lee, and Song-Chun Zhu. 2010. I2T: Image parsing to text description. Proceedings of the IEEE, 98(8):1485–1508.
Generation of landmark-based navigation instructions
from open-source data
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 757–766,
Avignon, France, April 23–27 2012.
© 2012 Association for Computational Linguistics
on the number of driving errors and on user satisfaction, and outperforms it significantly on the time female users spend looking away from the road. To our knowledge, this is the first time that the generation of landmarks has been shown to significantly improve the instructions of a wide-coverage navigation system.

Plan of the paper. We start by reviewing earlier literature on landmarks, route instructions, and the use of NLG for route instructions in Section 2. We then present the way in which we extract information on potential landmarks from OpenStreetMap in Section 3. Section 4 shows how we generate route instructions, and Section 5 presents the evaluation. Section 6 concludes.

2 Related Work

What makes an object in the environment a good landmark has been the topic of research in various disciplines, including cognitive science, computer science, and urban planning. Lynch (1960) defines landmarks as physical entities that serve as external points of reference that stand out from their surroundings. Kaplan (1976) specified a landmark as a known place for which the individual has a well-formed representation. Although there are different definitions of landmarks, a common theme is that objects are considered landmarks if they have some kind of cognitive salience (both in terms of visual distinctiveness and frequency of interaction).

The usefulness of landmarks in route instructions has been shown in a number of different human-human studies. Experimental results from Lovelace et al. (1999) show that people not only use landmarks intuitively when giving directions, but they also perceive instructions that are given to them to be of higher quality when those instructions contain landmark information. Similar findings have also been reported by Michon and Denis (2001) and Tom and Denis (2003).

Regarding car navigation systems specifically, Burnett (2000) reports on a road-based user study which compared a landmark-based navigation system to a conventional car navigation system. Here the provision of landmark information in route directions led to a decrease in navigational errors. Furthermore, glances at the navigation display were shorter and fewer, which indicates less driver distraction in this particular experimental condition. Minimizing driver distraction is a crucial goal of improved navigation systems, as driver inattention of various kinds is a leading cause of traffic accidents (25% of all police-reported car crashes in the US in 2000, according to Stutts et al. (2001)). Another road-based study, conducted by May and Ross (2006), yielded similar results.

One recurring finding in studies on landmarks in navigation is that some user groups are able to benefit more from their inclusion than others. This is particularly the case for female users. While men tend to outperform women in wayfinding tasks, completing them faster and with fewer navigation errors (cf. Allen (2000)), women are likely to show improved wayfinding performance when landmark information is given (e.g. Saucier et al. (2002)).

Despite all of this evidence from human-human studies, there has been remarkably little research on implemented navigation systems that use landmarks. Commercial systems make virtually no use of landmark information when giving directions, relying on metric representations instead (e.g. "Turn right in one hundred meters"). In academic research, there have only been a handful of relevant systems. A notable example is the DEEP MAP system, which was created in the SmartKom project as a mobile tourist information system for the city of Heidelberg (Malaka and Zipf, 2000; Malaka et al., 2004). DEEP MAP uses landmarks as waypoints for the planning of touristic routes for car drivers and pedestrians, while also making use of landmark information in the generation of route directions. Raubal and Winter (2002) combine data from digital city maps, facade images, cultural heritage information, and other sources to compute landmark descriptions that could be used in a pedestrian navigation system for the city of Vienna.

The key to the richness of these systems is a set of extensive, manually curated geographic and landmark databases. However, creation and maintenance of such databases is expensive, which makes it impractical to use these systems outside of the limited environments for which they were created. There have been a number of suggestions for automatically acquiring landmark data from existing electronic databases, for instance cadastral data (Elias, 2003) and airborne laser scans (Brenner and Elias, 2003). But the raw data for these approaches is still hard to obtain; information about landmarks is mostly limited to geometric data and does not specify the semantic type of a landmark (such as "church"); and updating the landmark database frequently when the real world changes (e.g., a shop closes down) remains an open issue.
The closest system in the literature to the research we present here is the CORAL system (Dale et al., 2003). CORAL generates a text of driving instructions with landmarks out of the output of a commercial web-based route planner. Unlike CORAL, our system relies purely on open-source map data. Also, our system generates driving instructions in real time (as opposed to a single discourse before the user starts driving) and reacts in real time to driving errors. Finally, we evaluate our system thoroughly for driving errors, user satisfaction, and driver distraction on an actual driving task, and find a significant improvement over the baseline.

Figure 1: A graphical representation of some nodes and ways in OpenStreetMap.

Landmark            Type
Street Furniture    stop sign
                    traffic lights
                    pedestrian crossing
Visual Landmarks    church
                    certain video stores
                    certain supermarkets
                    gas station
                    pubs and bars
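The two landmark classes in the table above can be recovered from OpenStreetMap tags. The sketch below is ours: the specific key=value pairs are assumptions about common OSM tagging conventions, not a mapping given in the paper:

```python
# Hypothetical mapping from OpenStreetMap key=value tags to the two
# landmark classes used by the system (tag choices are our assumptions).
STREET_FURNITURE = {
    ("highway", "stop"),             # stop sign
    ("highway", "traffic_signals"),  # traffic lights
    ("highway", "crossing"),         # pedestrian crossing
}
VISUAL_LANDMARKS = {
    ("amenity", "place_of_worship"),  # church
    ("amenity", "fuel"),              # gas station
    ("amenity", "pub"),
    ("amenity", "bar"),
    ("shop", "supermarket"),          # certain chains only, in the paper
    ("shop", "video"),                # certain chains only, in the paper
}

def classify(tags):
    """Classify an OSM node, given its tag dict, into a landmark class."""
    items = set(tags.items())
    if items & STREET_FURNITURE:
        return "street furniture"
    if items & VISUAL_LANDMARKS:
        return "visual landmark"
    return None

print(classify({"highway": "traffic_signals"}))        # → street furniture
print(classify({"amenity": "fuel", "name": "Esso"}))   # → visual landmark
```

In practice the paper additionally restricts supermarkets and video stores to specific recognizable chains, which a lookup like this would filter by the node's name tag.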
cable not just for one particular city, but for any place for which OpenStreetMap data is available.

We end up with two classes of landmark types: street furniture and visual landmarks. Street furniture is a generic term for objects that are installed on streets. In this subset, we include stop signs, traffic lights, and pedestrian crossings. Our assumption is that these objects inherently possess a high salience, since they already require particular attention from the driver. Visual landmarks encompass roadside buildings that are not directly connected to the road infrastructure, but draw the driver's attention due to visual salience. Churches are an obvious member of this group; in addition, we include gas stations, pubs, and bars, as well as certain supermarket and video store chains (selected for wide distribution over different cities and recognizable, colorful signs).

Given a certain location at which the Virtual Co-Pilot is to be used, we automatically extract suitable landmarks along with their types and locations from OpenStreetMap. We also gather the road network information that is required for route planning, and collect information on streets, such as their names, from the tags. We then transform this information into a directed street graph. The nodes of this graph are the OpenStreetMap nodes that are part of streets; two adjacent nodes are connected by a single directed edge for segments of one-way streets and a directed edge in each direction for ordinary street segments. Each edge is weighted with the Euclidean distance between the two nodes.

4 Generation of route directions

We will now describe how the Virtual Co-Pilot generates route directions from OpenStreetMap data. The system generates three types of messages (see Fig. 3). First, at every decision point, i.e. at the intersection where a driving maneuver such as turning left or right is required, the user is told to turn immediately in the given direction ("now turn right"). Second, if the driver has followed an instruction correctly, we generate a confirmation message after the driver has made the turn, letting them know they are still on the right track. Finally, we generate preview messages on the street leading up to the decision point. These preview messages describe the location of the next driving maneuver.

Figure 3: Schematic representation of an episode (dashed red line), with sample trigger positions of preview, turn instruction, and confirmation messages.

Of the three types, preview messages are the most interesting. Our system avoids the generation of metric distance indicators, as in "turn left in 100 meters". Instead, it tries to find landmarks that describe the position of the decision point: "Prepare to turn left after the church". When no landmark is available, the system tries to use street intersections as secondary landmarks, as in "Turn right at the next/second/third intersection". Metric distances are only used when both of these strategies fail.

In-car NLG takes place in a heavily real-time setting, in which an utterance becomes uninterpretable or even misleading if it is given too late. This problem is exacerbated for NLG of speech because simply speaking the utterance takes time as well. One consequence that our system addresses is the problem of planning preview messages in such a way that they can be spoken before the decision point without overlapping each other. We handle this problem in the sentence planner, which may aggregate utterances to fit into the available time. A second problem is that the user's reactions to the generated utterances are unpredictable; if the driver takes a wrong turn, the system must generate updated instructions in real time.

Below, we describe the individual components of the system. We mostly follow a standard NLG pipeline (Reiter and Dale, 2000), with a focus on the sentence planner and an extension to interactive real-time NLG.
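The directed street graph described in Section 3, together with the shortest-path search it supports for route planning, can be sketched as follows. Node IDs, coordinates, and helper names are invented for illustration:

```python
import heapq
import math

def build_graph(nodes, ways):
    """nodes: id -> (x, y) position; ways: list of (node_id_list, oneway).
    Returns an adjacency map: id -> list of (neighbor, edge_length)."""
    graph = {n: [] for n in nodes}
    for node_ids, oneway in ways:
        for a, b in zip(node_ids, node_ids[1:]):
            dist = math.dist(nodes[a], nodes[b])  # Euclidean edge weight
            graph[a].append((b, dist))
            if not oneway:  # ordinary segments get an edge in each direction
                graph[b].append((a, dist))
    return graph

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm over the directed street graph; returns the
    node sequence of the cheapest route, or None if goal is unreachable."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nbr, w in graph[node]:
            if nbr not in visited:
                heapq.heappush(queue, (cost + w, nbr, path + [nbr]))
    return None

# Invented toy network: two ways connecting the same pair of endpoints.
nodes = {1: (0, 0), 2: (60, 0), 3: (60, 80), 4: (0, 100)}
ways = [([1, 2, 3], False), ([1, 4, 3], False)]
graph = build_graph(nodes, ways)
print(shortest_path(graph, 1, 3))  # → [1, 2, 3]  (140m vs. ~163m via 4)
```

One-way segments contribute a single directed edge while ordinary segments contribute one in each direction, matching the graph construction described above.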
Segment123  From: Node1  To: Node2  On: Main Street
Segment124  From: Node2  To: Node3  On: Main Street
Segment125  From: Node3  To: Node4  On: Park Street
Segment126  From: Node4  To: Node5  On: Park Street

Figure 4: A simple example of a route plan consisting of four street segments.

4.1 Content determination and text planning

The first step in our system is to obtain a plan for reaching the destination. To this end, we compute a shortest path on the directed street graph described in Section 3. The result is an ordered list of street segments that need to be traversed in the given order to successfully reach the destination; see Fig. 4 for an example.

To be suitable as the input for an NLG system, this flat list of OpenStreetMap nodes needs to be subdivided into smaller message chunks. In turn-by-turn navigation, the general delimiters between such chunks are the driving maneuvers that the driver must execute at each decision point. We call each span between two decision points an episode. Episodes are not explicitly represented in the original route plan: although every segment has a street name associated with it, the name of a street sometimes changes as we go along, and because chains of segments are used to model curved streets in OpenStreetMap, even segments that are joined at an angle may be parts of the same street. Thus, in Fig. 4 it is not apparent which segment traversals require any navigational maneuvers.

We identify episode boundaries with the following heuristic. We first assume that episode boundaries occur when the street name changes from one segment to the next. However, staying on the road may involve a driving maneuver (and therefore a decision point) as well, e.g. when the road makes a sharp turn where a minor street forks off. To handle this case, we introduce decision points at nodes with multiple adjacent segments if the angle between the incoming and outgoing segments of the street exceeds a certain threshold. Conversely, our heuristic will sometimes end an episode where no driving maneuver is necessary, e.g. when an ongoing street changes its name. This is unproblematic in practice; the system will simply generate an instruction to keep driving straight ahead. Fig. 3 shows a graphical representation of an episode, with the street segments belonging to it drawn as red dashed lines.

4.2 Aggregation

Because we generate spoken instructions that are given to the user while they are driving, the timing of the instructions becomes a crucial issue, especially because a driver moves faster than the user of a pedestrian navigation system. It is undesirable for a second instruction to interrupt an earlier one. On the other hand, the second instruction cannot be delayed, because this might make the user miss a turn or interpret the instruction incorrectly.

We must therefore control at which points instructions are given and make sure that they do not overlap. We do this by always presenting preview messages at trigger positions at certain fixed distances from the decision point. The sentence planner calculates where these trigger positions are located for each episode. In this way, we create time frames during which there is enough time for instructions to be presented.

However, some episodes are too short to accommodate the three trigger positions for the confirmation message and the two preview messages. In such episodes, we aggregate different messages. We remove the trigger positions for the two preview messages from the episode, and instead add the first preview message to the turn instruction message of the previous episode. This allows our system to generate instructions like "Now turn right, and then turn left after the church."

4.3 Generation of landmark descriptions

The Virtual Co-Pilot computes referring expressions to decision points by selecting appropriate landmarks. To this end, it first looks up landmark candidates within a given range of the decision point from the database created in Section 3. This
yields an initial list of landmark candidates.

Some of these landmark candidates may be unsuitable for the given situation because of a lack of uniqueness. If there are several visual landmarks of the same type along the course of an episode, all of these landmark candidates are removed. For episodes which contain multiple street furniture landmarks of the same type, the first three in each episode are retained; a referring expression for the decision point might then be "at the second traffic light". If the decision point is no more than three intersections away, we also add a landmark description of the form "at the third intersection". Furthermore, a landmark must be visible from the last segment of the current episode; we only retain a candidate if it is either adjacent to a segment of the current episode or if it is close to the end point of the very last segment of the episode. Among the landmarks that are left over, the system prefers visual landmarks over street furniture, and street furniture over intersections. If no landmark candidates are left over, the system falls back to metric distances.

Second, the Virtual Co-Pilot determines the spatial relationship between the landmark and the decision point so that an appropriate preposition can be used in the referring expression. If the decision point occurs before the landmark along the course of the episode, we use the preposition "in front of"; otherwise, we use "after". Intersections are always used with "at" and metric distances with "in".

Finally, the system decides how to refer to the landmark objects themselves. Although it has access to the names of all objects from the OpenStreetMap data, the user may not know these names. We therefore refer to churches, gas stations, and any street furniture simply as "the church", "the gas station", etc. For supermarkets and bars, we assume that these buildings are more saliently referred to by their names, which are used in everyday language, and therefore use the names to refer to them.

The result of the sentence planning stage is a list of semantic representations, specifying the individual instructions that are to be uttered in each episode; an example is shown in Fig. 5. For each type of instruction, we then use a sentence template to generate linguistic surface forms by inserting the information contained in those plans into the slots provided by the templates (e.g. "Turn [direction] [preposition] [landmark]").

Preview message p1:
  Trigger position: Node3 − 50m
  Turn direction: right
  Landmark: church
  Preposition: after

Preview message p2 = p1, except:
  Trigger position: Node3 − 100m

Turn instruction t1:
  Trigger position: Node3
  Turn direction: right

Confirmation message c1:
  Trigger position: Node3 + 50m

Figure 5: Semantic representations of the different types of instructions in one episode.

4.4 Interactive generation

As a final point, the NLG process of a car navigation system takes place in an interactive setting: as the system generates and utters instructions, the user may either follow them correctly, or they may miss a turn or turn incorrectly because they misunderstood the instruction or were forced to disregard it by the traffic situation. The system must be able to detect such problems, recover from them, and generate new instructions in real time.

Our system receives a continuous stream of information about the position and direction of the user. It performs execution monitoring to check whether the user is still following the intended route. If a trigger position is reached, we present the instruction that we have generated for this position. If the user has left the route, the system reacts by planning a new route starting from the user's current position and generating a new set of instructions. We check whether the user is following the intended route in the following way. The system keeps track of the current episode of the route plan, and monitors the distance of the car to the final node of the episode. While the user is following the route correctly, the distance between the car and the final node should decrease or at least stay the same between two measurements. To accommodate occasional deviations from the middle of the road, we allow five subsequent measurements to increase the distance; the sixth increase of the distance triggers a recomputation of the route plan and a freshly generated instruction. On the other hand, when the distance
of the car to the final node falls below a certain threshold, we assume that the end of the episode has been reached, and activate the next episode. By monitoring whether the user is now approaching the final node of this new episode, we can in particular detect wrong turns at intersections.

Because each instruction carries the risk that it may not be followed correctly, there is a question as to whether it is worth planning out all remaining instructions for the complete route plan. After all, if the user does not follow the first instruction, the computation of all remaining instructions was a waste of time. We decided to compute all future instructions anyway, because the aggregation procedure described above requires them. In practice, the NLG process is so efficient that all instructions can be generated in real time, but this decision would have to be revisited for a slower system.

5 Evaluation

We will now report on an experiment in which we evaluated the performance of the Virtual Co-Pilot.

5.1 Experimental Method

5.1.1 Subjects

In total, 12 participants were recruited through printed ads and mailing lists. All of them were university students aged between 21 and 27 years. Our experiment was balanced for gender, hence we recruited 6 male and 6 female participants. All participants were compensated for their effort.

5.1.2 Design

The driving simulator used in the experiment replicates a real-world city center using a 3D model that contains buildings and streets as they can be perceived in reality. The street layout of the 3D model used by the driving simulator is based on OpenStreetMap data, and buildings were added to the virtual environment based on cadastral data. To increase the perceived realism of the model, some buildings were manually enhanced with photographic images of their real-world counterparts (see Fig. 7).

Figure 6: Experiment setup. A) Main screen B) Navigation screen C) Steering wheel D) Eye tracker

Figure 6 shows the set-up of the evaluation experiment. The virtual driving simulator environment (main picture in Fig. 7) was presented to the participants on a 20" computer screen (A). In addition, graphical navigation instructions (shown in the lower right of Fig. 7) were displayed on a separate 7" monitor (B). The driving simulator was controlled by means of a steering wheel (C), along with a pair of brake and acceleration pedals. We recorded user eye movements using a Tobii IS-Z1 table-mounted eye tracker (D). The generated instructions were converted to speech using MARY, an open-source text-to-speech system (Schröder and Trouvain, 2003), and played back on loudspeakers.

The task of the user was to drive the car in the virtual environment towards a given destination; spoken instructions were presented to them as they were driving, in real time. Using the steering wheel and the pedals, users had full control over steering angles, acceleration and braking. The driving speed was limited to 30 km/h, but there were no restrictions otherwise. The driving simulator sent the NLG system a message with the current position of the car (as GPS coordinates) once per second.

Each user was asked to drive three short routes in the driving simulator. Each route took about four minutes to complete, and the travelled distance was about 1 km. The number of episodes per route ranged from three to five. Landmark candidates were sufficiently dense that the Virtual Co-Pilot used landmarks to refer to all decision points and never had to fall back to the metric distance strategy.

There were three experimental conditions, which differed with respect to the spoken route instructions and the use of the navigation screen. In the baseline condition, designed to replicate the behavior of an off-the-shelf commercial car nav-
All Users Males Females
B VCP B VCP B VCP
Total Fixation Duration (seconds) 4.9 3.5 2.7 4.1 7.0 2.9*
Total Fixation Count (N) 21.8 15.4 13.5 16.5 30.0 14.3*
The system provided the right amount 3.9 2.9 4.2* 3.3 3.5 2.5
of information at any time
I was insecure at times about still be- 2.3 3.2 1.9* 2.8 2.6 3.5
ing on the right track.
It was important to have a visual rep- 4.3 4.0 4.2 4.2 4.3 3.7
resentation of route directions
I could trust the navigation system 3.6 3.7 4.1 3.7 3.0 3.7
Figure 8: Mean values for gaze behavior and subjective evaluation, separated by user group and condition (B =
baseline, VCP = our system). Significant differences are indicated by *; better values are printed in boldface.
5.2 Results
Figure 7: Screenshot of a scene in the driving simula- There were no significant differences between the
tor. Lower right corner: matching screenshot of navi- Virtual Co-Pilot and the baseline system on task
gation display.
completion time, rate of driving errors, or any of
the questions of the DALI questionnaire. Driv-
ing errors in particular were very rare: there were
igation system, participants were provided with only four driving errors in total, two of which
spoken metric distance-to-turn navigation instruc- were due to problems with left/right coordination.
tions. The navigation screen showed arrows de- We then analyzed the gaze data collected by the
picting the direction of the next turn, along with table-mounted eye tracker, which we set up such
the distance to the decision point (cf. Fig. 7). The that it recognized glances at the navigation screen.
second condition replaced the spoken route in- In particular, we looked at the total fixation dura-
structions by those generated by the Virtual Co- tion (TFD), i.e. the total amount of time that a user
Pilot. In a third condition, the output of the nav- spent looking at the navigation screen during a
igation screen was further changed to display an given trial run. We also looked at the total fixation
icon for the next landmark along with the arrow count (TFC), i.e. the total number of times that a
and distance indicator. The three routes were pre- user looked at the navigation screen in each run.
sented to the users in different orders, and com- Mean values for both metrics are given in Fig. 8,
bined with the conditions in a Latin Squares de- averaged over all subjects and only male and fe-
sign. In this paper, we focus on the first and sec- male subjects, respectively; the VCP column is
ond condition, in order to contrast the two styles for the Virtual Co-Pilot, whereas B stands for
of spoken instruction. the baseline. We found that male users tended
Participants were asked to answer two ques- to look more at the navigation screen in the VCP
tionnaires after each trial run. The first was the condition than in B, although the difference is not
DALI questionnaire (Pauzie, 2008), which asks statistically significant. However, female users
subjects to report how they perceived different looked at the navigation screen significantly fewer
764
times (t(5) = 3.2, p < 0.05, t-test for dependent samples) and for significantly shorter amounts of time (t(5) = 3.2, p < 0.05) in the VCP condition than in B.

On the subjective questionnaire, most questions yielded no significant differences (and are not reported here). However, we found that female users tended to rate the Virtual Co-Pilot more positively than the baseline on questions concerning trust in the system and the need for the navigation screen (but not significantly). Male users found that the baseline significantly outperformed the Virtual Co-Pilot on presenting instructions at the right time (t(5) = 2.7, p < 0.05) and on giving them a sense of security in still being on the right track (t(5) = 2.7, p < 0.05).

5.3 Discussion

The most striking result of the evaluation is that there was a significant reduction of looks to the navigation display, even if only for one group of users. Female users looked at the navigation screen less and more rarely with the Virtual Co-Pilot compared to the baseline system. In a real car navigation system, this translates into a driver who spends less time looking away from the road, i.e. a reduction in driver distraction and an increase in traffic safety. This suggests that female users learned to trust the landmark-based instructions, an interpretation that is further supported by the trends we found in the subjective questionnaire.

We did not find these differences in the male user group. Part of the reason may be the known gender differences in landmark use we mentioned in Section 2. But interestingly, the two significantly worse ratings by male users concerned the correct timing of instructions and the feedback for driving errors, i.e. issues regarding the system's real-time capabilities. Although our system does not yet perform ideally on these measures, this confirms our initial hypothesis that the NLG system must track the user's behavior and schedule its utterances appropriately. This means that earlier systems such as CORAL, which only compute a one-shot discourse of route instructions without regard to the timing of the presentation, miss a crucial part of the problem.

Apart from the exceptions we just discussed, the landmark-based system tended to score comparably or a bit worse than the baseline on the other subjective questions. This may partly be due to the fact that the subjects were familiar with existing commercial car navigation systems and not used to landmark-based instructions. On the other hand, this finding is also consistent with results of other evaluations of NLG systems, in which an improvement in the objective task usefulness of the system does not necessarily correlate with improved scores from subjective questionnaires (Gatt et al., 2009).

6 Conclusion

In this paper, we have described a system for generating real-time car navigation instructions with landmarks. Our system is distinguished from earlier work in its reliance on open-source map data from OpenStreetMap, from which we extract both the street graph and the potential landmarks. This demonstrates that open resources are now informative enough for use in wide-coverage navigation NLG systems. The system then chooses appropriate landmarks at decision points, and continuously monitors the driver's behavior to provide modified instructions in real time when driving errors occur.

We evaluated our system using a driving simulator with respect to driving errors, user satisfaction, and driver distraction. To our knowledge, we have shown for the first time that a landmark-based car navigation system outperforms a baseline significantly; namely, in the amount of time female users spend looking away from the road.

In many ways, the Virtual Co-Pilot is a very simple system, which we see primarily as a starting point for future research. The evaluation confirmed the importance of interactive real-time NLG for navigation, and we therefore see this as a key direction of future work. On the other hand, it would be desirable to generate more complex referring expressions ("the tall church"). This would require more informative map data, as well as a formal model of visual salience (Kelleher and van Genabith, 2004; Raubal and Winter, 2002).

Acknowledgments. We would like to thank the DFKI CARMINA group for providing the driving simulator, as well as their support. We would furthermore like to thank the DFKI Agents and Simulated Reality group for providing the 3D city model.
References

G. L. Allen. 2000. Principles and practices for communicating route knowledge. Applied Cognitive Psychology, 14(4):333–359.

C. Brenner and B. Elias. 2003. Extracting landmarks for car navigation systems using existing GIS databases and laser scanning. International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 34(3/W8):131–138.

G. Burnett. 2000. "Turn right at the Traffic Lights": The Requirement for Landmarks in Vehicle Navigation Systems. The Journal of Navigation, 53(3):499–510.

R. Dale, S. Geldof, and J. P. Prost. 2003. Using natural language generation for navigational assistance. In ACSC, pages 35–44.

B. Elias. 2003. Extracting landmarks with data mining methods. Spatial Information Theory, pages 375–389.

A. Gatt, F. Portet, E. Reiter, J. Hunter, S. Mahamood, W. Moncur, and S. Sripada. 2009. From data to text in the neonatal intensive care unit: Using NLG technology for decision support and information management. AI Communications, 22:153–186.

S. Kaplan. 1976. Adaption, structure and knowledge. In G. Moore and R. Golledge, editors, Environmental Knowing: Theories, Research and Methods, pages 32–45. Dowden, Hutchinson and Ross.

J. D. Kelleher and J. van Genabith. 2004. Visual salience and reference resolution in simulated 3-D environments. Artificial Intelligence Review, 21(3).

A. Koller, K. Striegnitz, D. Byron, J. Cassell, R. Dale, J. Moore, and J. Oberlander. 2010. The First Challenge on Generating Instructions in Virtual Environments. In E. Krahmer and M. Theune, editors, Empirical Methods in Natural Language Generation. Springer.

N. Lessmann, S. Kopp, and I. Wachsmuth. 2006. Situated interaction with a virtual human: perception, action, and cognition. In G. Rickheit and I. Wachsmuth, editors, Situated Communication, pages 287–323. Mouton de Gruyter.

K. Lovelace, M. Hegarty, and D. Montello. 1999. Elements of good route directions in familiar and unfamiliar environments. Spatial Information Theory. Cognitive and Computational Foundations of Geographic Information Science, pages 751–751.

K. Lynch. 1960. The Image of the City. MIT Press.

R. Malaka and A. Zipf. 2000. DEEP MAP: Challenging IT research in the framework of a tourist information system. Information and Communication Technologies in Tourism, 7:15–27.

R. Malaka, J. Haeussler, and H. Aras. 2004. SmartKom mobile: intelligent ubiquitous user interaction. In Proceedings of the 9th International Conference on Intelligent User Interfaces.

A. J. May and T. Ross. 2006. Presence and quality of navigational landmarks: effect on driver performance and implications for design. Human Factors: The Journal of the Human Factors and Ergonomics Society, 48(2):346.

P. E. Michon and M. Denis. 2001. When and why are visual landmarks used in giving directions? Spatial Information Theory, pages 292–305.

A. Pauzie. 2008. Evaluating driver mental workload using the driving activity load index (DALI). In Proc. of the European Conference on Human Interface Design for Intelligent Transport Systems, pages 67–77.

M. Raubal and S. Winter. 2002. Enriching wayfinding instructions with local landmarks. Geographic Information Science, pages 243–259.

E. Reiter and R. Dale. 2000. Building Natural Language Generation Systems. Studies in Natural Language Processing. Cambridge University Press.

D. M. Saucier, S. M. Green, J. Leason, A. MacFadden, S. Bell, and L. J. Elias. 2002. Are sex differences in navigation caused by sexually dimorphic strategies or by differences in the ability to use the strategies? Behavioral Neuroscience, 116(3):403.

M. Schröder and J. Trouvain. 2003. The German text-to-speech synthesis system MARY: A tool for research, development and teaching. International Journal of Speech Technology, 6(4):365–377.

K. Striegnitz and F. Majda. 2009. Landmarks in navigation instructions for a virtual environment. Online Proceedings of the First NLG Challenge on Generating Instructions in Virtual Environments (GIVE-1).

J. C. Stutts, D. W. Reinfurt, L. Staplin, and E. A. Rodgman. 2001. The role of driver distraction in traffic crashes. Washington, DC: AAA Foundation for Traffic Safety.

A. Tom and M. Denis. 2003. Referring to landmark or street information in route directions: What difference does it make? Spatial Information Theory, pages 362–374.
To what extent does sentence-internal realisation reflect discourse context? A study on word order

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 767–776, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
these reflexes. This explains in part the fairly high baseline performance of n-gram language models in the surface realization task. And the effect can indeed be taken much further: the discriminative training experiments of Cahill and Riester (2009) show how effective it is to systematically take advantage of asymmetry patterns in the morphosyntactic reflexes of the discourse notion of information status (i.e., using a feature set with well-chosen, purely sentence-bound features).

These observations give rise to the question: in the light of the difficulty of obtaining reliable discourse information on the one hand, and the effectiveness of exploiting the reflexes of discourse in the sentence-internal material on the other, can we nevertheless expect to gain something from adding sentence-external feature information?

We propose two scenarios for addressing this question: first, we choose an approximative access to context information and relations between discourse referents: lexical reiteration of head words, combined with information about their grammatical relation and topological positioning in prior sentences. We apply these features in a rich sentence-internal surface realisation ranking model for German. Secondly, we choose a more controlled scenario: we train a constituent ordering classifier based on a feature model that captures properties of discourse referents in terms of manually annotated coreference relations. As we get the same effect in both setups (the sentence-external features do not improve over a baseline that captures basic morphosyntactic properties of the constituents), we conclude that sentence-internal realisation is actually a relatively accurate predictor of discourse context, even more accurate than information that can be obtained from coreference and lexical chain relations.

Lapata (2010) have improved a sentence compression system by capturing prominence of phrases or referents in terms of lexical chain information inspired by Morris and Hirst (1991) and Centering (Grosz et al., 1995). In their system, discourse context is represented in terms of hard constraints modelling whether a certain constituent can be deleted or not.

In the linearisation or surface realisation domain, there is a considerable body of work approximating information structure in terms of sentence-internal realisation (Ringger et al., 2004; Filippova and Strube, 2009; Velldal and Oepen, 2005; Cahill et al., 2007). Cahill and Riester (2009) improve realisation ranking for German, which mainly deals with word order variation, by representing precedence patterns of constituents in terms of asymmetries in their morphosyntactic properties. As a simple example, a pattern exploited by Cahill and Riester (2009) is the tendency of definite elements to precede indefinites, which, on a discourse level, reflects that given entities in a sentence tend to precede new entities.

Other work on German surface realisation has highlighted the role of the initial position in the German sentence, the so-called Vorfeld (or prefield). Filippova and Strube (2007) show that once the Vorfeld (i.e. the constituent that precedes the finite verb) is correctly determined, the prediction of the order in the Mittelfeld (i.e. the constituents that follow the finite verb) is very easy. Cheung and Penn (2010) extend the approach of Filippova and Strube (2007) and augment a sentence-internal constituent ordering model with sentence-external features inspired by the entity grid model proposed by Barzilay and Lapata (2008).
(1) a. Kurze Zeit später erklärte ein Anrufer bei Nachrichtenagenturen in Pakistan, die Gruppe Gamaa bekenne sich.
    Shortly after, a caller declared at news agencies in Pakistan that the group Gamaa avows itself.
b. Diese Gruppe wird für einen Großteil der Gewalttaten verantwortlich gemacht, die seit dreieinhalb Jahren in Ägypten verübt worden sind.
    This group is held responsible for most of the violent acts that have been committed in Egypt in the last three and a half years.
(2) a. Belgien wünscht, dass sich WEU und NATO darüber einigen.
    Belgium wants WEU and NATO to agree on that.
b. Belgien sieht in der NATO die beste militärische Struktur in Europa.
    Belgium sees the best military structure of Europe in the NATO.
(3) a. Frauen vom Land kämpften aktiv darum, ein Staudammprojekt zu verhindern.
    Women from the countryside fought actively to prevent a dam project.
b. Auch in den Städten fanden sich immer mehr Frauen in Selbsthilfeorganisationen zusammen.
    Also in the cities, more and more women teamed up in self-help organisations.
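Lexical reiteration of the kind shown in (1)-(3) can be detected with a crude overlap check on the nouns of adjacent sentences. The sketch below is only illustrative (a real pipeline would tag and lemmatise first), and the lowercased suffix test for German compound heads is our own simplification, not part of the paper's method:

```python
def noun_overlaps(prev_nouns, cur_nouns):
    """Return the nouns of the current sentence that fully or partially
    repeat a noun of the previous sentence. The suffix check crudely
    matches German compounds that share a head noun (e.g.
    "Staudammprojekt" and "Projekt")."""
    hits = set()
    for cur in cur_nouns:
        for prev in prev_nouns:
            a, b = cur.lower(), prev.lower()
            # full overlap, or partial overlap where one noun ends in
            # the other (compound-head sharing)
            if a == b or a.endswith(b) or b.endswith(a):
                hits.add(cur)
    return hits
```

On Example (1), a call such as `noun_overlaps(["Zeit", "Anrufer", "Gruppe"], ["Gruppe", "Gewalttaten"])` would flag the reiterated "Gruppe".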
information about the prior mentioning of a referent would be helpful for predicting the position of this referent in a sentence.

The idea that the occurrence of discourse referents in a text is a central aspect of discourse structure has been systematically pursued by Centering Theory (Grosz et al., 1995). Its most important notions are related to the realisation of discourse referents (i.e. described as centers) and the way the centers are arranged in a sequence of utterances to make this sequence a coherent discourse. Another important concept is the ranking of discourse referents, which basically determines the prominence of a referent in a certain sentence and is driven by several factors (e.g. their grammatical function). For free word order languages like German, word order has been proposed as one of the factors that account for the ranking (Poesio et al., 2004). In a similar spirit, Morris and Hirst (1991) have proposed that chains of (related) lexical items in a text are an important indicator of text structure.

Our main hypothesis was that it is possible to exploit these intuitions from Centering Theory and the idea of lexical chains for word order prediction. Thus, we expected that it would be easier to predict the position of a referent in a sentence if we are given not only its realisation in the current utterance but also its prominence in the previous discourse. In particular, we expected this intuition to hold for cases where the morpho-syntactic realisation of a constituent does not provide many clues. This is illustrated in Examples (1) and (2), which both exemplify the reiteration of a lexical item in two subsequent sentences (reiteration is one type of lexical chain discussed in Morris and Hirst (1991)). In Example (1), the second instance of the noun group is modified by a demonstrative pronoun such that its known and prominent discourse status is overt in the morpho-syntactic realisation. In Example (2), both instances of Belgium are realised as bare proper nouns without an overt morphosyntactic clue indicating their discourse status.

Beyond the simple presence of reiterated items in sequences of sentences, we expected that it would be useful to look at the position and syntactic function of the previous mentions of a discourse referent. In Example (1), the reiterated item is first introduced in an embedded sentence and realised in the Vorfeld in the second utterance. In terms of centering, this transition would correspond to a topic shift. In Example (2), both instances are realised in the Vorfeld, such that the topic of the first sentence is carried over to the next.

In Example (3), we illustrate a further type of lexical reiteration. In this case, two identical head nouns are realised in subsequent sentences, even though they refer to two different discourse referents. While this type of lexical chain is described as reiteration without identity of referents by Morris and Hirst (1991), it would not be captured in Centering since this is not a case of strict coreference. On the other hand, lexical chains do not capture types of reiterated discourse referents that have distinct morpho-syntactic realisations, e.g. nouns and pronouns.

Originally, we had the hypothesis that strict coreference information is more useful and accurate for word order prediction than rather loose lexical chains, which conflate several types of referential and lexical relations. However, the advantage of chains, especially chains of reiteration, is that they can be easily detected in any corpus text and
that they might capture topics of sentences beyond the identity of referents. Thus, we started out from the idea of lexical chains and added corresponding features in a statistical ranking model for surface realisation of German (Section 4). As this strategy did not work out, we wanted to assess whether an ideal coreference annotation would be helpful at all for predicting word order. In a second experiment, we use a corpus which is manually annotated for coreference (Section 5).

4 Experiment 1: Realisation Ranking with Lexical Chains

In this section, we present an experiment that investigates sentence-external context in a surface realisation task. The sentence-external context is represented in terms of lexical chain features and compared to sentence-internal models which are based on morphosyntactic features. The experiment thus targets a generation scenario where no coreference information is available and aims at assessing whether relatively naive context information is also useful.

4.1 System Description

We carry out our first experiment in a regeneration set-up with two components: a) a large-scale hand-crafted Lexical Functional Grammar (LFG) for German (Rohrer and Forst, 2006), used to parse and regenerate a corpus sentence, b) a stochastic ranker that selects the most appropriate regenerated sentence in context according to an underlying, linguistically motivated feature model. In contrast to fully statistical linearisation methods, our system first generates the full set of sentences that correspond to the grammatically well-formed realisations of the intermediate syntactic representation.(1) This representation is an f-structure, which underspecifies the order of constituents and, to some extent, their morphological realisation, such that the output sentences contain all possible combinations of word order permutations and morphological variants. Depending on the length and structure of the original corpus sentence, the set of regenerated sentences can be huge (see Cahill et al. (2007) for details on regenerating the German treebank TIGER).

The realisation ranking component is an SVM ranking model implemented with SVMrank, a Support Vector Machine-based learning tool (Joachims, 2006). During training, each sentence is annotated with a rank and a set of features extracted from the f-structure, its surface string and external resources (e.g. a language model). If the sentence matches the original corpus string, its rank will be highest, the assumption being that the original sentence corresponds to the optimal realisation in context. The output of generation, the top-ranked sentence, is evaluated against the original corpus sentence.

4.2 The Feature Models

As the aim of this experiment is to better understand the nature of sentence-internal features reflecting discourse context and compare them to sentence-external ones, we build several feature models which capture different aspects of the constituents in a given sentence. The sentence-internal features describe the morphosyntactic realisation of constituents, for instance their function (subject, object), and can be straightforwardly extracted from the f-structure. These features are then combined into discriminative precedence features, for instance subject-precedes-object. We implement the following types of morphosyntactic features:

- syntactic function (arguments and adjuncts)
- modification (e.g. nouns modified by relative clauses, genitives etc.)
- syntactic category (e.g. adverbs, proper nouns, phrasal arguments)
- definiteness for nouns
- number and person for nominal elements
- types of pronouns (e.g. demonstrative, reflexive)
- constituent span and number of embedded nodes in the tree

In addition, we also include language model scores in our ranking model. In Section 4.4, we report on results for several subsets of these features, where BaseSyn refers to a model that only includes the syntactic function features and FullMorphSyn includes all features mentioned above.

For extracting the lexical chains, we check for any overlapping nouns in the n sentences previous to the current one being generated. We check

(1) There are occasional mistakes in the grammar which sometimes lead to ungrammatical strings being generated, but this is rare.
Rank  Sentence and Features
%     Diese Gruppe wird für einen Großteil der Gewalttaten verantwortlich gemacht.
%     This group is for a major part of the violent acts responsible made.
1     subject-<-pp-object, demonstrative-<-indefinite, overlap-<-no-overlap, overlap-in-vorfeld, lm:-7.89
%     Für einen Großteil der Gewalttaten wird diese Gruppe verantwortlich gemacht.
%     For a major part of the violent acts is this group responsible made.
3     pp-object-<-subject, indefinite-<-demonstrative, no-overlap-<-overlap, no-overlap-in-vorfeld, lm:-10.33
%     Verantwortlich gemacht wird diese Gruppe für einen Großteil der Gewalttaten.
%     Responsible made is this group for a major part of the violent acts.
3     subject-<-pp-object, demonstrative-<-indefinite, overlap-<-no-overlap, lm:-9.41

Figure 1: Made-up training example for realisation ranking with precedence features
proper and common nouns, considering full and partial overlaps as shown in Examples (1) and (2), where the (a) example is the previous sentence in the corpus. For each overlap, we record the following properties: (i) function in the previous sentence, (ii) position in the previous sentence (e.g. Vorfeld), (iii) distance between sentences, (iv) total number of overlaps.

These overlap features are then also combined in terms of precedence, e.g. has subject overlap:3-precedes-no overlap, meaning that in the current sentence a noun that was previously mentioned in a subject 3 sentences ago precedes a noun that was not mentioned before.

In Figure 1, we give an example of a set of generation alternatives and their (partial) feature representation for the sentence (1-b). Precedence is indicated by <.

Basically, our sentence-external feature model is built on the intuition that lexical chains or overlaps approximate discourse status in a way which is similar to sentence-internal morphosyntactic properties. Thus, we would expect that overlaps indicate givenness, salience or prominence and that asymmetries between overlapping and non-overlapping entities are helpful in the ranking.

4.3 Data

All our models are trained on 7,039 sentences (subdivided into 1,259 texts) from the TIGER Treebank of German newspaper text (Brants et al., 2002). We tune the parameters of our SVM model on a development set of 55 sentences and report the final results for our unseen test set of 240 sentences. Table 1 shows how many sentences in our training, development and test sets have at least one textually overlapping phrase in the previous 1-10 sentences.

# Sentences    % Sentences with overlap
in context     Training   Dev      Test
1              20.96      23.64    20.42
2              35.42      40.74    35.00
3              45.58      50.00    53.33
4              52.66      53.70    58.75
5              57.45      58.18    64.58
6              61.42      57.41    68.75
7              64.58      61.11    70.83
8              67.05      62.96    72.08
9              69.20      64.81    74.17
10             71.16      70.37    75.83

Table 1: The percentage of sentences that have at least one overlapping entity in the previous n sentences

We choose the TIGER treebank, which has no coreference annotation, since we already have a number of resources available to match the syntactic analyses produced by our grammar against the analyses in the treebank. Thus, in our regeneration system, we parse the sentences with the grammar, and choose the parsed f-structures that are compatible with the manual annotation in the TIGER treebank, as is done in Cahill et al. (2007). This compatibility check eliminates noise which would be introduced by generating from incorrect parses (e.g. incorrect PP-attachments typically result in unnatural and non-equivalent surface realisations).

For comparing the string chosen by the models against the original corpus sentence, we use BLEU, NIST and exact match. Exact match is a strict measure that only credits the system if it chooses the exact same string as the original corpus string. BLEU and NIST are more relaxed measures that compare the strings on the n-gram level. Finally, we report accuracy scores for the Vorfeld position (VF), corresponding to the percentage of sentences generated with a correct Vorfeld.
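Exact match and the n-gram overlap idea behind BLEU can be stated compactly. The following sketch is for illustration only; it omits BLEU's brevity penalty and geometric averaging over n-gram orders, and is not the standard BLEU/NIST implementation used for the reported scores:

```python
from collections import Counter

def exact_match(candidate, reference):
    """1 if the generated string equals the corpus string, else 0."""
    return int(candidate == reference)

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single
    reference, the core quantity behind BLEU (no brevity penalty,
    no smoothing)."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand_ngrams = ngrams(candidate.split(), n)
    ref_ngrams = ngrams(reference.split(), n)
    # each candidate n-gram is credited at most as often as it occurs
    # in the reference ("clipping")
    overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return overlap / total if total else 0.0
```

Because exact match requires identity of the whole string, it penalises every word order permutation equally, whereas the n-gram measures give partial credit to near misses.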
Sc    BLEU   NIST    Exact  VF
0     0.766  11.885  50.19  64.0
1     0.765  11.756  49.78  64.0
2     0.765  11.886  50.01  64.1
3     0.765  11.885  50.08  63.8
4     0.761  11.723  49.43  63.2
5     0.765  11.884  49.71  64.2
6     0.768  11.892  50.42  64.6
7     0.765  11.885  50.01  64.5
8     0.764  11.884  49.78  64.3
9     0.765  11.888  49.82  63.6
10    0.764  11.889  49.7   63.5

Table 2: Tenfold cross-validation for feature model FullMorphSyn and different context windows (Sc)

Model                              BLEU   VF
Language Model                     0.702  51.2
Language Model + Context Sc = 5    0.715  54.3
BaseSyn                            0.757  62.0
BaseSyn + Context Sc = 5           0.760  63.0
FullMorphSyn                       0.766  64.0
FullMorphSyn + Context Sc = 5      0.763  64.2

Table 3: Evaluation for different feature models; Language Model: ranking based on language model scores, BaseSyn: precedence between constituent functions, FullMorphSyn: entire set of sentence-internal features.

4.4 Results

In Table 2, we report the performance of the full sentence-internal feature model combined with context windows from zero to ten. The scores have been obtained from tenfold cross-validation. For none of the context windows does the model outperform the zero-context baseline, which has no sentence-external features. In Table 3, we compare the performance of several feature models corresponding to subsets of the features used so far, each combined with sentence-external features. We note that the function precedence features (i.e. the BaseSyn model) are very powerful, leading to a major improvement compared to a language model. The sentence-external features lead to an improvement when combined with the language-model based ranking. However, this improvement is leveled out in the BaseSyn model.

On the one hand, the fact that the lexical chain features improve a language-model based ranking suggests these features are, to some extent, predictive for certain patterns of German word order. On the other hand, the fact that they don't improve over an informed sentence-internal baseline suggests that these patterns are equally well captured by morphosyntactic features. However, we cannot exclude the possibility that the chain features are too noisy, as they conflate several types of lexical and coreferential relations. This will be addressed in the following experiment.

5 Experiment 2: Constituent Ordering with Centering-inspired Features

We now look at a simpler generation setup where we concentrate on the ordering of constituents in the German Vorfeld and Mittelfeld. This strategy has also been adopted in previous investigations of German word order: Filippova and Strube (2007) show that once the German Vorfeld is correctly chosen, the prediction accuracy for the Mittelfeld (the constituents following the finite verb) is in the 90s.

In order to eliminate noise introduced by potentially heterogeneous chain features, we look at coreference features and, again, compare them to sentence-internal morphosyntactic features. We target a generation scenario where coreference information is available. The aim is to establish an upper bound concerning the quality improvement for word order prediction obtainable by resorting to manual coreference annotation.

5.1 Data and Setup

We carry out the constituent ordering experiment on the Tüba-D/Z treebank (v5) of German newspaper articles (Telljohann et al., 2006). It comprises about 800k tokens in 45k sentences. We choose this corpus because it is not only annotated with syntactic analyses but also with coreference relations (Naumann, 2006). The syntactic annotation format differs from the TIGER treebank used in the previous experiment; for instance, it explicitly represents the Vorfeld and Mittelfeld as phrasal nodes in the tree. This format is very convenient for the extraction of constituents in the respective positions.

The Tüba-D/Z coreference annotation distinguishes several relations between discourse referents, most importantly the coreferential relation and the anaphoric relation, where the first denotes a relation between noun phrases that refer to the same entity, and the latter refers to a link between a pronoun and a contextual antecedent; see Naumann (2006) for further detail. We expected the coreferential relation to be particularly useful, as
it cannot always be read off the morphosyntactic realisation of a noun phrase, whereas pronouns are almost always used in an anaphoric relation.

The constituent ordering model is implemented as a classifier that is given a set of constituents and predicts the constituent that is most likely to be realised in the Vorfeld. The set of candidate constituents is determined from the tree of the original corpus sentence. We will assume that all constituents under a Vorfeld and Mittelfeld node can be freely reordered. Thus, we do not check whether the word order variants we look at are actually grammatical, assuming that most of them are. In this sense, this experiment is close to fully statistical generation approaches. As a further simplification, we do not look at morphological generation variants of the constituents or their head verb.

The classifier is implemented with SVMrank again. In contrast to the previous experiment, where we learned to rank sentences, the classifier now learns to rank constituents. The constituents have been extracted using the tool described in Bouma (2010). The final data set comprises 48,513 candidate sets of freely orderable constituents.

5.2 Centering-inspired Feature Model

To compare the discourse context model against a sentence-based model, we implemented a number of sentence-internal features that are very similar to the features used in the previous experiment. Since we extract them from the syntactic annotation instead of f-structures, some labels and feature names will be different; however, the design of the sentence-internal model is identical to the previous one in Section 4.

The sentence-external features differ in some aspects from Section 4, since we extract coreference relations of several types (see Naumann (2006) for the anaphoric relations annotated in the Tüba-D/Z). For each type of coreference link, we extract the following properties: (i) function of the antecedent, (ii) position of the antecedent, (iii) distance between sentences, (iv) type of relation. We also distinguish coreference links annotated for the whole phrase (head link) and links that are annotated for an element embedded by the constituent (contained link). The two types are illustrated in Examples (4) and (5). Note that both cases would not have been captured in the lexical chain model, since there is no lexical overlap between the realisations of the discourse referents.

                   # VF    # MF
Backward Center    3.5%    5.1%
Forward Center     6.8%    6.8%
Coref Link         30.5%   23.4%

Table 4: Backward and forward centers and their positions

These types of coreference features implicitly carry the information that would also be considered in a Centering formalisation of discourse context. In addition to these, we designed features that explicitly describe centers, as these might have a higher weight. In line with Clarke and Lapata (2010), we compute backward (CB) and forward centers (CF) in the following way:

1. Extract all entities from the current sentence and the previous sentence.
2. Rank the entities of the previous sentence according to their function (subject < direct object < indirect object ...).
3. Find the highest ranked entity in the previous sentence that has a link to an entity in the current sentence; this entity is the CB of the sentence.

In the same way, we mark entities as forward centers that are ranked highest in the current sentence and have a link to an entity in the following sentence.2 In Table 4, we report the percentage of sentences that have backward and forward centers in the Vorfeld or Mittelfeld. While the percentage of sentences that realise a backward center is quite low, the overall proportion of sentences containing some type of coreference link is large enough that the learner could pick up some predictive patterns. Going by the relative frequencies, coreferential constituents have a bias towards appearing in the Vorfeld rather than in the Mittelfeld.

5.3 Results

First, we build three coreference-based constituent classifiers on their entire training set and compare them to their sentence-internal baseline. The most simple baseline records the category of [...]

2 In Centering, all entities in a given utterance can be seen as forward centers; however, we thought that this implementation would be more useful.
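The three steps for determining the backward center can be sketched as follows (a simplified reading of the procedure; the entity representation and the `FUNCTION_RANK` ordering are illustrative assumptions, and we take the CB to be the linked entity realised in the current sentence):

```python
# Illustrative function ranking: subject < direct object < indirect object < other
FUNCTION_RANK = {"subj": 0, "dobj": 1, "iobj": 2, "other": 3}

def backward_center(prev_entities, curr_entities, linked):
    """CB of the current sentence: found via the highest-ranked entity of the
    previous sentence that has a coreference link into the current sentence.
    linked(a, b) is a stand-in for the corpus coreference annotation."""
    # Steps 1/2: collect and rank the entities of the previous sentence.
    ranked = sorted(prev_entities, key=lambda e: FUNCTION_RANK.get(e["func"], 99))
    # Step 3: take the first (highest-ranked) entity with a link into
    # the current sentence and return its target there.
    for entity in ranked:
        for target in curr_entities:
            if linked(entity, target):
                return target
    return None
```

Forward centers would be computed symmetrically, by ranking the entities of the current sentence and checking for links into the following sentence.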
(4) a. Die Rechnung geht an die AWO.
       'The bill goes to the AWO.'
    b. [Hintergrund der gegenseitigen Vorwürfe in der Arbeiterwohlfahrt] sind offenbar scharfe Konkurrenzen zwischen Bremern und Bremerhavenern.
       'Apparently, [the background of the mutual accusations at the labour welfare] are rivalries between people from Bremen and Bremerhaven.'

(5) a. Dies ist die Behauptung, mit der Bremens Hafensenator die Skeptiker davon überzeugt hat, [...].
       'This is the claim which Bremen's harbour senator used to convince doubters, [...].'
    b. Für diese Behauptung hat Beckmeyer bisher keinen Nachweis geliefert.
       'So far, Beckmeyer has not given proof of this claim.'

Model                                  VF
ConstituentLength + HeadPos            47.48%
ConstituentLength + HeadPos + Coref    51.30%
BaseSyn                                54.82%
BaseSyn + Coref                        56.21%
FullMorphSyn                           57.24%
FullMorphSyn + Coref                   57.40%

Table 5: Results from Vorfeld classification, training and evaluation on the entire treebank

Model                                  VF
ConstituentLength + HeadPos            46.61%
ConstituentLength + HeadPos + Coref    52.23%
BaseSyn                                54.63%
BaseSyn + Coref                        56.67%
FullMorphSyn                           55.36%
FullMorphSyn + Coref                   57.93%

Table 6: Results from Vorfeld classification, training and evaluation on sentences that contain a coreference link
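The VF accuracy reported in Tables 5 and 6 amounts to picking, per candidate set, the constituent the model prefers and checking it against the constituent realised in the Vorfeld of the corpus sentence. A minimal sketch (the `score` function stands in for the trained SVMrank model):

```python
def vorfeld_accuracy(candidate_sets, score):
    """candidate_sets: list of (candidates, gold) pairs, where gold is the
    constituent realised in the Vorfeld of the original corpus sentence.
    The prediction is the highest-scoring candidate of each set."""
    correct = sum(1 for cands, gold in candidate_sets
                  if max(cands, key=score) == gold)
    return 100.0 * correct / len(candidate_sets)
```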
positive impact of coreference features can be strengthened if the coreference annotation scheme is more exhaustive, including, e.g., bridging and event anaphora.

6 Conclusion

We have carried out a number of experiments that show that sentence-internal models for word order are hardly improved by features which explicitly represent the preceding context of a sentence in terms of lexical and referential relations between discourse entities. This suggests that sentence-internal realisation implicitly carries a lot of information about discourse context. On average, the morphosyntactic properties of constituents in a text are better approximations of their discourse status than actual coreference relations.

This result feeds into a number of research questions concerning the representation of discourse and its application in generation systems. Although we should certainly not expect a computational model to achieve perfect accuracy in the constituent ordering task, since even humans only agree to a certain extent in rating word order variants (Belz and Reiter, 2006; Cahill, 2009), the average accuracy in the 60s for the prediction of Vorfeld occupancy is still moderate. An obvious direction would be to further investigate more complex representations of discourse that take into account the relations between utterances, such as topic shifts. Moreover, it is not clear whether the effects we find for linearisation in this paper carry over to other levels of generation, such as tactical generation, where syntactic functions are not fully specified. In a broader perspective, our results underline the need for better formalisations of discourse that can be translated into features for large-scale applications such as generation.

Acknowledgments

This work was funded by the Collaborative Research Centre (SFB 732) at the University of Stuttgart.

References

Regina Barzilay and Mirella Lapata. 2008. Modeling local coherence: An entity-based approach. Computational Linguistics, 34:1–34.

Regina Barzilay and Lillian Lee. 2004. Catching the drift: Probabilistic content models with applications to generation and summarization. In Proceedings of HLT-NAACL 2004, Boston, MA.

Anja Belz and Ehud Reiter. 2006. Comparing automatic and human evaluation of NLG systems. In Proceedings of EACL 2006, pages 313–320, Trento, Italy.

Gerlof Bouma. 2010. Syntactic tree queries in Prolog. In Proceedings of the Fourth Linguistic Annotation Workshop, ACL 2010.

Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The TIGER Treebank. In Proceedings of the Workshop on Treebanks and Linguistic Theories.

Aoife Cahill and Arndt Riester. 2009. Incorporating information status into generation ranking. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 817–825, Suntec, Singapore, August. Association for Computational Linguistics.

Aoife Cahill, Martin Forst, and Christian Rohrer. 2007. Stochastic Realisation Ranking for a Free Word Order Language. In Proceedings of the Eleventh European Workshop on Natural Language Generation, pages 17–24, Saarbrücken, Germany. DFKI GmbH.

Aoife Cahill. 2009. Correlating human and automatic evaluation of a German surface realiser. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 97–100, Suntec, Singapore, August. Association for Computational Linguistics.

Jackie C.K. Cheung and Gerald Penn. 2010. Entity-based local coherence modelling using topological fields. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010). Association for Computational Linguistics.

James Clarke and Mirella Lapata. 2010. Discourse constraints for document compression. Computational Linguistics, 36(3):411–441.

Stefanie Dipper and Heike Zinsmeister. 2009. The role of the German Vorfeld for local coherence. In Christian Chiarcos, Richard Eckart de Castilho, and Manfred Stede, editors, Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically, pages 69–79. Narr, Tübingen.

Katja Filippova and Michael Strube. 2007. The German Vorfeld and local coherence. Journal of Logic, Language and Information, 16:465–485.

Katja Filippova and Michael Strube. 2009. Tree Linearization in English: Improving Language Model Based Approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 225–228, Boulder, Colorado, June. Association for Computational Linguistics.
Barbara J. Grosz, Aravind Joshi, and Scott Weinstein. 1995. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):203–225.

Thorsten Joachims. 2006. Training linear SVMs in linear time. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), pages 217–226.

Nikiforos Karamanis, Massimo Poesio, Chris Mellish, and Jon Oberlander. 2009. Evaluating centering for information ordering using corpora. Computational Linguistics, 35(1).

Jane Morris and Graeme Hirst. 1991. Lexical cohesion, the thesaurus, and the structure of text. Computational Linguistics, 17(1):21–48.

Karin Naumann. 2006. Manual for the annotation of in-document referential relations. Technical report, Seminar für Sprachwissenschaft, Abt. Computerlinguistik, Universität Tübingen.

Massimo Poesio and Ron Artstein. 2005. The reliability of anaphoric annotation, reconsidered: Taking ambiguity into account. In Proceedings of the ACL Workshop on Frontiers in Corpus Annotation.

Massimo Poesio, Rosemary Stevenson, Barbara di Eugenio, and Janet Hitzeman. 2004. Centering: A parametric theory and its instantiations. Computational Linguistics, 30(3):309–363.

Eric K. Ringger, Michael Gamon, Robert C. Moore, David Rojas, Martine Smets, and Simon Corston-Oliver. 2004. Linguistically Informed Statistical Models of Constituent Structure for Ordering in Sentence Realization. In Proceedings of the 2004 International Conference on Computational Linguistics, Geneva, Switzerland.

Julia Ritz, Stefanie Dipper, and Michael Götze. 2008. Annotation of information structure: An evaluation across different types of texts. In Proceedings of the 6th LREC conference.

Christian Rohrer and Martin Forst. 2006. Improving Coverage and Parsing Quality of a Large-Scale LFG for German. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), Genoa, Italy.

Augustin Speyer. 2005. Competing constraints on Vorfeldbesetzung in German. In Proceedings of the Constraints in Discourse Workshop, pages 79–87.

Heike Telljohann, Erhard Hinrichs, Sandra Kübler, and Heike Zinsmeister. 2006. Stylebook for the Tübingen treebank of written German (TüBa-D/Z). Revised version. Technical report, Seminar für Sprachwissenschaft, Universität Tübingen.

Erik Velldal and Stephan Oepen. 2005. Maximum entropy models for realization ranking. In Proceedings of the 10th Machine Translation Summit, pages 109–116, Thailand.
Behind the Article: Recognizing Dialog Acts in Wikipedia Talk Pages
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 777–786, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
by covering almost 15% of all relevant Talk pages, as opposed to the much smaller fraction we could achieve for the English Wikipedia. The long-term goal of this work is to identify relations between contributions on the Talk pages and particular article edits. We plan to analyze the relation between article discussions and article content and to identify the edits in the article revision history that react to the problems discussed on the Talk page. In combination with article quality assessment (Yaari et al., 2011), this opens up the possibility to identify successful patterns of collaboration which increase the article quality. Furthermore, our work will enable practical applications. By augmenting Wikipedia articles with the information derived from automatically labeled discussions, article readers can be made aware of particular problems that are being discussed on the Talk page behind the article.

Our primary contributions in this paper are: (1) an annotation schema for dialog acts reflecting the efforts for coordinating the article improvement; (2) the Simple English Wikipedia Discussion (SEWD) corpus, consisting of 100 segmented and annotated Talk pages, which we make freely available for download; and (3) a dialog act classification pipeline that incorporates several state-of-the-art machine learning algorithms and feature selection techniques and achieves an average F1-score of .82 on our corpus.

2 Related Work

The analysis of speech and dialog acts has its roots in the linguistic field of pragmatics. In 1962, John Austin shifted the focus from the mere declarative use of language as a means for making factual statements towards its non-declarative use as a tool for performing actions. Speech act theory was further systematized by Searle (1969), whose classification of illocutionary acts (Searle, 1976) is still used as a starting point for creating dialog act classification schemata for natural language processing.

A well-known, domain- and task-independent annotation schema is DAMSL (Core and Allen, 1997). It was created as the standard annotation schema for dialog tagging on the utterance level by the Discourse Resource Initiative. It uses a four-dimensional tagset that allows arbitrary label combinations for each utterance. Jurafsky et al. (1997) augmented the DAMSL schema to fit the peculiarities of the Switchboard corpus. The resulting SWDB-DAMSL schema contained more than 220 distinct labels, which have been clustered into 42 coarse-grained labels. Both schemata have often been adapted for special-purpose annotation tasks.

With the rise of the social web, the amount of research analyzing user-generated discourse has substantially increased. In addition to analyzing web forums (Kim et al., 2010a), chats (Carpenter and Fujioka, 2011) and emails (Cohen et al., 2004), Wikipedia Talk pages have recently moved into the center of attention of the research community.

Viegas et al. (2007) manually annotate 25 Wikipedia article discussion pages with a set of 11 labels in order to analyze how Talk pages are used for planning the work on articles and resolving disputes among the editors. Schneider et al. (2011) extend this schema and manually annotate 100 Talk pages with 15 labels. They confirm the findings of Viegas et al. that coordination requests occur most frequently in the discussions.

Bender et al. (2011) describe a corpus of 47 Talk pages which have been annotated for authority claims and alignment moves. With this corpus, the authors analyze how the participants in Wikipedia discussions establish their credibility and how they express agreement and disagreement towards other participants or topics.

From a different perspective, Stvilia et al. (2008) analyze 60 discussion pages with regard to how information quality (IQ) in Wikipedia articles is assessed on the Talk pages and which types of IQ problems are identified by the community. They describe a Wikipedia IQ assessment model and map it to established frameworks. Furthermore, they provide a list of IQ problems along with related causal factors and necessary actions, which has also inspired the design of our annotation schema.

Finally, Laniado et al. (2011) examine Wikipedia discussion networks in order to capture structural patterns of interaction. They extract the thread structure from all Talk pages in the English Wikipedia and create tree structures of the discussions. The analysis of the graphs reveals patterns that are unique to Wikipedia discussions and might be used as a means to characterize different types of Talk pages.

To the best of our knowledge, there is no work yet that uses machine learning to automatically classify user contributions in Wikipedia Talk pages. Furthermore, there is no corpus available that reflects the efforts of article improvement in Wikipedia discussions. This is the subject of our work.

3 Annotation Schema

The main purpose of Wikipedia Talk pages is the coordination of the editing process with the goal of improving and sustaining the quality of the respective article. The criteria for article quality in Wikipedia are loosely defined in the guidelines for good articles2 and very good articles3. According to these guidelines, distinguished articles must be well-written in simple English, comprehensive, neutral, stable, accurate, verifiable, and follow the Wikipedia style guidelines4. These criteria are the main points of reference in the discussions on the Talk pages.

2 http://simple.wikipedia.org/wiki/WP:RGA
3 http://simple.wikipedia.org/wiki/WP:RVGA
4 http://simple.wikipedia.org/wiki/WP:STYLE

Discourse analysis, as it is performed in this paper, can be carried out on various levels, depending on what is regarded as the smallest unit of the discourse. In this work, we focus on turns, not on individual utterances, as we are interested in a coarse-grained analysis of the discourse structure as a first step towards a finer-grained discourse analysis. We define a turn (or contribution) as the body of text that is added by an individual contributor in one or more revisions to a single discussion topic until another contributor edits the page. Furthermore, a topic (or discussion) is the body of turns that revolve around a single matter. Topics are usually headed by a topic title. Finally, the thread structure designates the sequence of turns and their indentation levels on the Talk page. A structural overview of a Talk page and its constituents can be seen in Figure 1.

Figure 1: Structure of a Talk page: a) Talk page title, b) untitled discussion topic, c) titled discussion topic, d) unsigned turns, e) signed turns, f) topic title

We composed an annotation schema that reflects the coordination efforts for article improvement. To this end, we manually analyzed a set of thirty Talk pages from the Simple English Wikipedia to identify the types of article deficiencies that are discussed and the way article improvement is coordinated. We furthermore incorporated the findings from an information-scientific analysis of information quality in Wikipedia (Stvilia et al., 2008), which identifies twelve types of quality problems, e.g. Accuracy, Completeness or Relevance. Our resulting tagset consists of 17 labels (cf. Table 1), which can be subdivided into four higher-level categories:

Article Criticism: Denote comments that identify deficiencies in the article. The criticism can refer to the article as a whole or to individual parts of the article.

Explicit Performative: Announce, report or suggest editing activities.

Information Content: Describe the direction of the communication. A contribution can be used to communicate new information to others (IP), to request information (IS), or to suggest changes to established facts (IC). The IP label applies to most of the contributions, as most comments provide a certain amount of new information.

Interpersonal: Describe the attitude that is expressed towards other participants in the discussion and/or their comments.

Since a single turn may consist of several utterances, it can consequently comprise multiple dialog acts. Therefore, we designed the annotation study as a multi-label classification task, i.e. the annotators can assign one or more labels to each annotation unit. Each label is chosen independently. Table 1 shows the labels, their respective definitions and an example from our corpus.

4 Corpus Creation and Analysis

The SEWD corpus consists of 100 annotated Talk pages extracted from a snapshot of the Simple English Wikipedia from Apr 4th 2011.5 Technically speaking, a Talk page is a normal Wiki page located in one of the Talk namespaces. In this work, we focus on article Talk pages and do not regard User Talk pages. We selected the discussion pages according to the number of turns they contain. First, we discarded all discussion pages with less than four contributions. We then analyzed the distribution of turn counts per discussion page in the remaining set of pages and defined three classes: (i) discussion pages with 4-10 turns, (ii) pages with 11-20 turns, and (iii) pages with more than 20 turns. We then randomly extracted 50 discussion pages from class (i), 40 pages from class (ii) and 10 pages from class (iii). This decision is grounded in the restricted resources for the human annotation task.

Article Criticism
  CM    Content incomplete or lacking detail
        Example: "It should be added (1) that voters may skip preferences, but (2) that skipping preferences has no impact on the result of the elections."
  CW    Lack of accuracy or correctness
        Example: "Kris Kringle is NOT a Germanic god, but an English mispronunciation of Christkind, a German word that means the baby Jesus."
  CU    Unsuitable or unnecessary content
        Example: "The references should be removed. The reason: The references are too complicated for the typical reader of simple Wikipedia."
  CS    Structural problems
        Example: "Also use sectioning, and interlinking"
  CL    Deficiencies in language or style
        Example: "This section needs to be simplified further; there are a lot of words that are too complex for this wiki."
  COBJ  Objectivity issues
        Example: "This article seems to take a clear pro-Christian, anti-commercial view."
  CO    Other kind of criticism
        Example: "I have started an article on Google. It needs improvement though."

Explicit Performative
  PSR   Explicit suggestion, recommendation or request
        Example: "This section needs to be simplified further"
  PREF  Explicit reference or pointer
        Example: "Got it. The URL is http://www.dmbeatles.com/history.php?year=1968"
  PFC   Commitment to an action in the future
        Example: "Okay, I forgot to add that, I'll do so later tonight."
  PPC   Report of a performed action
        Example: "I took and hopefully simplified the [[en:Prehistoric music|Prehistoric music]] article from EnWP"

Information Content
  IP    Information providing
        Example: "Depression is the most basic term there is."
  IS    Information seeking
        Example: "So what kind of theory would you use for your music composing?"
  IC    Information correcting
        Example: "In linguistics and generally speaking, when talking about the lexicon in a language, words are usually categorized as nouns, verbs, adjectives and so on. The term doing word does not exist."

Interpersonal
  ATT+  Positive attitude towards other contributor or acceptance
        Example: "Thank you."
  ATTP  Partial acceptance or partial rejection
        Example: "Okay, I can understand that, but some citations are going to have to be included for [[WP:V]]."
  ATT-  Negative attitude towards other contributor or rejection
        Example: "Now what? You think you know so much about everything, and you are not even helping?!"

Table 1: Annotation schema for the dialog act classification in Wikipedia discussion pages with examples from the SEWD corpus. Some examples have been shortened to fit the table.

Data Preprocessing: Due to a lack of discussion structure, extracting the discussion threads from the Talk pages requires a substantial amount of preprocessing. Laniado et al. (2011) tackle the thread extraction by using text indentation and inserted user signatures as clues. We found these

5 The snapshot contains 69900 articles and 5783 Talk pages, of which 683 contained more than 3 contributions.
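The page selection procedure described above can be sketched as follows (an illustrative reconstruction; the function and variable names are ours, and `rng.sample` merely stands in for however the random extraction was actually performed):

```python
import random

def select_talk_pages(turn_counts, seed=0):
    """turn_counts: mapping page title -> number of turns.
    Discard pages with fewer than four turns, bucket the rest into the
    three turn-count classes, then randomly draw 50/40/10 pages."""
    rng = random.Random(seed)
    buckets = {"i": [], "ii": [], "iii": []}
    for page, n in turn_counts.items():
        if n < 4:
            continue                      # discarded outright
        elif n <= 10:
            buckets["i"].append(page)     # class (i): 4-10 turns
        elif n <= 20:
            buckets["ii"].append(page)    # class (ii): 11-20 turns
        else:
            buckets["iii"].append(page)   # class (iii): more than 20 turns
    quota = {"i": 50, "ii": 40, "iii": 10}
    return {c: rng.sample(pages, min(quota[c], len(pages)))
            for c, pages in buckets.items()}
```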
attributes to be insufficient for a reliable reconstruction of the thread structure.6

Our preprocessing approach consists of three steps: data retrieval, topic segmentation and turn segmentation. For retrieving the discussion pages, we use the Java Wikipedia Library (JWPL) (Zesch et al., 2008), which offers efficient, database-driven access to the contents of Wikipedia. We segment the individual Talk pages into discussion topics using the MediaWiki parser that comes with JWPL. In our corpus, the parser managed to identify all topic boundaries without any errors. The most complex preprocessing step is the turn segmentation.

First, we use the revision history of the Talk page to identify the author and the creation time of each paragraph. We use the Wikipedia Revision Toolkit (Ferschke et al., 2011) to examine the changes between adjacent revisions of the Talk page in order to identify the exact time a piece of text was added as well as the author of the contribution. We have to filter out malicious edits from the history, as they would negatively affect the segmentation process. We therefore disregard all edits that are reverted in later revisions. In contrast to vandalism on article pages, this approach has proven to be sufficient to detect vandalism in the Talk page history.

Within each discussion topic, we aggregate all adjacent paragraphs with the same author and the same time stamp to one turn. In order to account for turns that were written in multiple revisions, we regard all time stamps within a window of 10 minutes7 as belonging to the same turn, unless the page was edited by another user in the meantime. Finally, the turn is marked with the indentation level of its least indented paragraph. This information is used to identify the relationship between the turns, since indentation is used to indicate a reply to an existing comment in the discussion.

A co-author of this paper evaluated the acceptability of the boundaries of each turn in the SEWD corpus and found that 94% of the 1450 turns were correctly segmented. Turns with segmentation errors were not included in the gold standard.

Annotation Process: For our annotation study, we used the freely available MMAX2 annotation tool8. Two annotators were introduced to the annotation schema by an instructor and trained on an extra set of ten discussion pages. During the annotation of the corpus, the annotators were allowed to discuss difficult cases and could consult the instructor if in doubt. They had access to the segmented discussion pages within the MMAX2 tool as well as to the original Wikipedia articles and discussion pages on the web.

The reconciliation of the annotations was carried out by an expert annotator. In order to obtain a consolidated gold standard, the expert decided all cases in which the annotations of the two annotators did not match. Descriptive statistics for the label assignments of each annotator and for the gold standard can be seen in Table 2 and will be further discussed in Section 4.2.

Corpus Format: We publish our SEWD corpus in two formats9: the original MMAX format, and as XMI files for further processing with the Apache Unstructured Information Management Architecture10. For the latter format, we also provide the type system which defines all necessary corpus-specific types needed for using the data in an NLP pipeline.

4.1 Inter-Annotator Agreement

To evaluate the reliability of our dataset, we perform a detailed inter-rater agreement study. For measuring the agreement on the individual labels, we report the observed agreement, Kappa statistics (Carletta, 1996), and F1-scores. The latter are computed by treating one annotator as the gold standard and the other one as predictions (Hripcsak and Rothschild, 2005). The scores can be seen in Table 2.

The average observed agreement across all labels is PO = .94. The individual Kappa scores largely fall into the range that Landis and Koch (1977) regard as substantial agreement, while three labels are above the more strict .8 threshold for reliable annotations (Artstein and Poesio, 2008). Furthermore, we obtain an overall pooled Kappa (De Vries et al., 2008) of Kpool = .67,

6 Viegas et al. (2007) reported that only 67% of the contributions on Wikipedia Talk pages are signed, which makes signatures an unreliable predictor for turn boundaries.
7 We experimentally tested values between 1 and 60 minutes.
8 http://www.mmax2.net
9 http://www.ukp.tu-darmstadt.de/data/wikidiscourse
10 http://uima.apache.org
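The turn-aggregation heuristic described above (grouping adjacent paragraphs by the same author within a 10-minute window, with an intervening edit by another user breaking the run) can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation; the `Paragraph` representation and its field names are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Paragraph:
    author: str     # hypothetical fields, not the authors' actual data model
    timestamp: int  # edit time in seconds
    indent: int     # indentation level in the wiki markup
    text: str

@dataclass
class Turn:
    paragraphs: list = field(default_factory=list)

    @property
    def indent(self):
        # a turn is marked with the indentation level of its least indented paragraph
        return min(p.indent for p in self.paragraphs)

WINDOW = 10 * 60  # 10-minute window for turns written over multiple revisions

def aggregate_turns(paragraphs):
    """Group adjacent paragraphs into turns: same author and time stamps
    within WINDOW.  Because paragraphs are processed in temporal order,
    an intervening paragraph by another user breaks the run."""
    turns = []
    for p in sorted(paragraphs, key=lambda p: p.timestamp):
        if (turns
                and turns[-1].paragraphs[-1].author == p.author
                and p.timestamp - turns[-1].paragraphs[-1].timestamp <= WINDOW):
            turns[-1].paragraphs.append(p)
        else:
            turns.append(Turn([p]))
    return turns
```

The sorting step is what makes the "unless the page was edited by another user in the meantime" condition fall out of simple adjacency.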
        Annotator 1     Annotator 2     Inter-Annotator Agreement       Gold Standard
Label   N      %        N      %        N_{A1∪A2}  P_O   κ     F1       N      %
Article Criticism
CM      183    13.4%    105    7.7%     193        .93   .63   .66      116    8.5%
CW      106    7.8%     57     4.2%     120        .95   .52   .55      70     5.1%
CU      69     5.0%     35     2.6%     83         .95   .38   .40      42     3.1%
CS      164    12.0%    101    7.4%     174        .94   .66   .69      136    9.9%
CL      195    14.3%    199    14.6%    244        .93   .73   .77      219    16.0%
COBJ    27     2.0%     23     1.7%     29         .99   .84   .84      27     2.0%
CO      20     1.5%     59     4.3%     71         .95   .18   .20      48     3.5%
Explicit Performative
PSR     458    33.5%    351    25.7%    503        .86   .66   .76      406    29.7%
PREF    43     3.1%     31     2.3%     51         .98   .61   .62      45     3.3%
PFC     73     5.3%     65     4.8%     86         .98   .76   .77      77     5.6%
PPC     357    26.1%    340    24.9%    371        .97   .92   .94      358    26.2%
Information Content
IP      1084   79.3%    1027   75.1%    1135       .89   .69   .93      1070   78.3%
IS      228    16.7%    208    15.2%    256        .95   .80   .83      220    16.1%
IC      187    13.7%    109    8.0%     221        .89   .46   .51      130    9.5%
Interpersonal
ATT+    71     5.2%     140    10.2%    151        .94   .55   .58      144    10.5%
ATTP    71     5.2%     30     2.2%     79         .96   .42   .44      33     2.4%
ATT-    67     4.9%     74     5.4%     100        .96   .56   .58      87     6.4%

Table 2: Label frequencies and inter-annotator agreement. N_{A1∪A2} denotes the number of turns that have been labeled with the given label by at least one annotator. P_O denotes the observed agreement.
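The per-label columns of Table 2 report three views of the same binary decisions. A minimal sketch of how observed agreement, Cohen's Kappa, and F1 (treating annotator 1 as the gold standard and annotator 2 as predictions, following Hripcsak and Rothschild, 2005) can be computed for a single label; variable names are ours, not the authors':

```python
def label_agreement(a1, a2):
    """a1, a2: parallel lists of booleans, one entry per turn, saying
    whether each annotator assigned the label to that turn."""
    n = len(a1)
    p_o = sum(x == y for x, y in zip(a1, a2)) / n          # observed agreement
    # chance agreement from each annotator's marginal label frequency
    p1, p2 = sum(a1) / n, sum(a2) / n
    p_e = p1 * p2 + (1 - p1) * (1 - p2)
    kappa = (p_o - p_e) / (1 - p_e)
    # F1: annotator 1 as gold standard, annotator 2 as predictions
    tp = sum(x and y for x, y in zip(a1, a2))
    prec = tp / sum(a2) if sum(a2) else 0.0
    rec = tp / sum(a1) if sum(a1) else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return p_o, kappa, f1
```

Note how a rare label can combine very high observed agreement with low Kappa and F1, which is the pattern visible for CU and CO in the table.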
which is defined as

\kappa_{pool} = \frac{\bar{P}_O - \bar{P}_E}{1 - \bar{P}_E}    (1)

with

\bar{P}_O = \frac{1}{L} \sum_{l=1}^{L} P_O^l, \quad \bar{P}_E = \frac{1}{L} \sum_{l=1}^{L} P_E^l    (2)

where L denotes the number of labels, P_E^l the expected agreement and P_O^l the observed agreement of the l-th label. κ_pool is regarded as more accurate than an averaged Kappa.

For assessing the overall inter-rater reliability of the label set assignments per turn, we chose Krippendorff's Alpha (Krippendorff, 1980) using MASI, a measure of agreement on set-valued items, as the distance function (Passonneau, 2006). MASI accounts for partial agreement if the label sets of both annotators overlap in at least one label. We achieved an Alpha score of α = .75. According to Krippendorff, datasets with this score are considered reliable and allow tentative conclusions to be drawn.

The CO label showed the lowest agreement of only κ = .18. The label was supposed to cover any criticism that is not covered by a dedicated label. However, the annotators reported that they chose this label when they were unsure whether a particular criticism label would fit a certain turn or not.

Labels in the interpersonal category all show agreement scores below 0.6. It turned out that the annotators had a different understanding of these labels. While one annotator assigned the labels for any kind of positive or negative sentiment, the other used the labels to express agreement and disagreement between the participants of a discussion.

A common problem for all labels was contributions with a high degree of indirectness and implicitness. Indirect contributions have to be interpreted in the light of conversational implicature theory (Grice, 1975), which requires contextual knowledge for decoding the intentions of a speaker. For example, the message

    "Is population density allowed to be n/a?"

has the surface form of a question. However, the context of the discussion revealed that the author tried to draw attention to the missing figure in the article and requested it to be filled or removed. The annotators rarely made use of the context, which was a major source of disagreement in the study.
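Equations (1) and (2) average the observed and expected agreements across labels before forming the Kappa, rather than averaging per-label Kappa values. A small sketch (the per-label numbers below are invented for illustration):

```python
def pooled_kappa(per_label):
    """per_label: list of (P_O^l, P_E^l) pairs, one per label.
    Implements kappa_pool = (mean P_O - mean P_E) / (1 - mean P_E)."""
    L = len(per_label)
    p_o = sum(po for po, _ in per_label) / L
    p_e = sum(pe for _, pe in per_label) / L
    return (p_o - p_e) / (1 - p_e)
```

Because the expected agreements are pooled before the ratio is taken, a label with chance agreement near 1 cannot single-handedly drag the score down the way it can when per-label Kappas are averaged.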
Another difficulty for the annotators was long discussion turns. While the average turn consists of 42 tokens, the largest contribution in the corpus is 658 tokens long. Turns of this size can cover multiple aspects and potentially comprise many different dialog acts, which increases the probability of disagreement. This issue can be addressed by going from the turn level to the utterance level in future work.

A comparison of our results with the agreement reported for other datasets shows that the reliability of our annotations lies well within the range of the related work. Bender et al. (2011) carried out an annotation study of social acts in 365 discussions from 47 Wikipedia Talk pages. They report Kappa scores for thirteen labels in two categories ranging from .13 to .66 per label. The overall agreement for each category was .50 and .59, respectively, which is considerably lower than our κ_pool = .67. Kim et al. (2010b) annotate pairs of posts taken from an online forum. They use a dialog act tagset with twelve labels customized for modeling troubleshooting-oriented forum discussions. For their corpus of 1334 posts, they report an overall Kappa of .59. Kim et al. (2010a) identify unresolved discussions in student online forums by annotating 1135 posts with five different speech acts. They report Kappa scores per speech act between .72 and .94. Their better results might be due to a coarser-grained label set.

4.2 Corpus Analysis

The SEWD corpus contains 313 discussions consisting of 1367 turns by 337 users. The average length of a turn is 42 words. 208 of the 337 contributors are registered Wikipedia users, 129 wrote anonymously. On average, each contributor wrote 168 words in 4 turns. However, there was a cluster of 16 people with 20 contributions.

Table 2 shows the frequencies of all labels in the SEWD corpus. The most frequent labels are information providing (IP), requests (PSR) and reports of performed edits (PPC). The IP label was assigned to more than 78% of all 1367 turns, because almost every contribution provides a certain amount of information. The label was only omitted if a turn merely consisted of a discussion template but did not contain any text, or if it exclusively contained questions.

More than a quarter of the turns are labeled with PSR and PPC, respectively. This indicates that edit requests and reports of performed edits are the main subject of discussion. Generally, it is more common that edits are reported after they have been made than announced before they are carried out, as can be seen in the ratio of PPC to PFC labels. The number of turns labeled with PSR is almost the same as the number of contributions labeled with either PPC or PFC. This allows the tentative conclusion that nearly all requests potentially lead to an edit action. As a matter of fact, the most common label adjacency pair¹¹ in the corpus is PSR→PPC, which substantiates this assumption.

¹¹ A label transition A→B is recorded if two adjacent turns are labeled with A and B, respectively.

Article criticism labels have been assigned to 39.4% of all turns. Almost half (241) of the labels from this class are assigned to the first turn of a discussion. This shows that it is common to open a discussion in reference to a particular deficiency of the article. The large number of CL labels compared to other labels from the same category is due to the fact that the Simple English Wikipedia requires authors to write articles in a way that is understandable for non-native speakers of English. Therefore, the use of adequate language is one of the major concerns of the Simple English Wikipedia community.

5 Automatic Dialog Act Classification

For the automatic classification of dialog acts in Wikipedia Talk pages, we transform the multi-label classification problem into a binary classification task (Tsoumakas et al., 2010). We train a binary classifier for each label using the WEKA data-mining software (Hall et al., 2009). We use three learners for the classification task: a Naive Bayes classifier; J48, an implementation of the C4.5 decision tree algorithm (Quinlan, 1992); and SMO, an optimization algorithm for training support vector machines (Platt, 1998). Finally, we combine the best performing learners for each label in a UIMA-based classification pipeline (Ferrucci and Lally, 2004).

Features for Dialog Act Classification  As features, we use all uni-, bi- and trigrams that occurred in at least three different turns. Furthermore, we include the time distance to the previous and the next turn (in seconds), the length of the current, previous and next turn (in tokens), the
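The transformation of the multi-label problem into one binary task per label (binary relevance, in the terminology of Tsoumakas et al.) can be sketched as below. The paper's pipeline uses WEKA learners; this plain-Python sketch shows only the decomposition itself, and the `make_learner` factory and `MajorityLearner` stand-in are illustrative assumptions.

```python
class BinaryRelevance:
    """Train one independent binary classifier per dialog act label."""
    def __init__(self, labels, make_learner):
        self.labels = labels
        self.make_learner = make_learner  # factory: () -> object with fit/predict
        self.classifiers = {}

    def fit(self, instances, label_sets):
        for label in self.labels:
            # a turn is a positive instance iff its label set contains the label
            y = [label in s for s in label_sets]
            clf = self.make_learner()
            clf.fit(instances, y)
            self.classifiers[label] = clf

    def predict(self, instance):
        # a turn may receive several labels: one yes/no decision per classifier
        return {l for l, clf in self.classifiers.items() if clf.predict(instance)}

class MajorityLearner:
    """Trivial stand-in learner for demonstration purposes only."""
    def fit(self, X, y):
        self.out = 2 * sum(y) >= len(y)
    def predict(self, x):
        return self.out
```

The point of the decomposition is that each label's classifier, feature selection, and learner choice can be tuned independently, which is exactly what the per-label "Best" column in Table 3 exploits.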
position of the turn within the discussion, the indentation level of the turn and two binary features indicating whether a turn references or is referenced by another turn.¹² In order to capture the sequential nature of the discussions, we use the n-grams of the previous and the next turn as additional features.

¹² A turn Y references a preceding turn X if the indentation level of Y is one level deeper than that of X.

Balancing Positive and Negative Instances  Since the number of positive instances for each label is small compared to the number of negative instances, we create a balanced dataset which contains an equal amount of positive and negative instances. To this end, we randomly select the appropriate number of negative instances and discard the rest. This improves the classification performance on every label for all three learners.

Feature Selection  Using the full set of features, we achieve the following macro/micro averaged F1-scores: 0.29/0.57 for Naive Bayes, 0.42/0.66 for J48 and 0.43/0.72 for SMO. To further improve the classification performance, we reduce the feature space using two feature selection techniques, the χ² metric (Yang and Pedersen, 1997) and the Information Gain approach (Mitchell, 1997). For each label, we train separate classifiers using the top 100, 200 and 300 features obtained by each feature selection technique and choose the best performing set for our final classification pipeline.

Indentation and temporal distance to the preceding turn proved to be the best-ranked non-lexical features overall. Additionally, the turn position within the topic was a crucial feature for most labels in the criticism class and for PSR and IS labels. This is not surprising, because article criticism, suggestions and questions tend to occur in the beginning of a discussion. The two reference features have not proven to be useful; the relational information was better covered by the indentation feature. The subjective quality of the lexical features seems to be correlated with the inter-annotator agreement of the respective labels. Features for labels with low agreement contain many n-grams without any recognizable semantic connection to the label. For labels with good agreement, the feature lists almost exclusively contain meaningful lexical cues.

Label   Human   Base   Naive Bayes   J48   SMO   Best
CM      .66     .07    .68           .48   .66   .68
CW      .55     .01    .70           .20   .56   .70
CU      .40     .07    .66           .35   .59   .66
CS      .69     .09    .67           .67   .75   .75
CL      .77     .11    .70           .66   .73   .73
COBJ    .84     .04    .78           .51   .63   .78
CO      .20     .02    .61           .06   .39   .61
PSR     .76     .30    .72           .70   .76   .76
PREF    .62     .00    .76           .41   .64   .76
PFC     .77     .04    .70           .62   .73   .73
PPC     .94     .25    .74           .82   .85   .85
IP      .93     .74    .83           .93   .93   .93
IS      .83     .16    .79           .86   .85   .86
IC      .51     .06    .67           .32   .59   .67
ATT+    .58     .10    .61           .65   .72   .72
ATTP    .44     .03    .72           .25   .62   .72
ATT-    .58     .07    .52           .30   .52   .52
Macro   .65     .13    .70           .52   .68   .73
Micro   .79     .35    .74           .75   .80   .82

Table 3: F1-Scores for the balanced set with feature selection on 10-fold cross-validation. Base refers to the baseline performance, Best to our classification pipeline.

Classification Results  Table 3 shows the performance of all classifiers and our final classification pipeline evaluated on 10-fold cross-validation. Naive Bayes performed surprisingly well and showed the best macro averaged scores among the three learners, while SMO showed the best micro averaged performance. We compare our results to a random baseline and to the performance of the human annotators (cf. Table 3 and Figure 2). The baseline assigns the dialog act labels at random according to their frequency distribution in the gold standard. Our classifier outperformed the baseline significantly on all labels.

The comparison with the human performance shows that our system is able to reach human performance. In most cases, the annotation agreement is reliable, and so are the results of the automatic classification. For the labels CU and CO, the inter-annotator agreement is not high. The comparably good performance of the classifiers on these labels shows that the instances do have shared characteristics. Human raters, however, have difficulties recognizing these labels consistently. Thus, their definitions need to be refined in future work.

To our knowledge, none of the related work on discourse analysis of Wikipedia Talk pages per-
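The undersampling and per-label feature selection steps can be sketched together. The χ² score below is the standard one-degree-of-freedom statistic over a 2×2 feature/label contingency table, computed from scratch so the sketch stays self-contained; the paper itself uses WEKA's implementations, and all function and variable names here are ours.

```python
import random

def balance(instances, labels):
    """Keep all positive instances; randomly sample an equal number of negatives."""
    pos = [x for x, y in zip(instances, labels) if y]
    neg = [x for x, y in zip(instances, labels) if not y]
    neg = random.sample(neg, min(len(pos), len(neg)))
    return pos + neg, [True] * len(pos) + [False] * len(neg)

def chi2_top_k(feature_matrix, labels, k):
    """Rank binary features by their chi-square association with the binary
    label and return the column indices of the top k features."""
    n = len(labels)
    n_pos = sum(labels)
    scores = []
    for j in range(len(feature_matrix[0])):
        a = sum(1 for row, y in zip(feature_matrix, labels) if row[j] and y)      # feature present, positive
        b = sum(1 for row, y in zip(feature_matrix, labels) if row[j] and not y)  # feature present, negative
        c = n_pos - a                                                             # feature absent, positive
        d = n - n_pos - b                                                         # feature absent, negative
        num = n * (a * d - b * c) ** 2
        den = (a + b) * (c + d) * (a + c) * (b + d)
        scores.append((num / den if den else 0.0, j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]
```

In the paper's setting, `balance` and `chi2_top_k` would be run once per label, mirroring the per-label top-100/200/300 selection described above.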
[Bar chart omitted: per-label F1-scores for Best, Human, and Baseline.]
Figure 2: F1-Scores for our classification pipeline (Best), the human performance and baseline performance.
formed automatic dialog act classification. However, there has been previous work on classifying speech acts in other discourse types. Kim et al. (2010a) use Support Vector Machines (SVM) and Transformation Based Learning (TBL) for the automatic assignment of five speech acts to posts taken from student online forums. They report individual F1-scores per label which result in a macro average of 0.59 for SVM and 0.66 for TBL. Cohen et al. (2004) classify speech acts in emails. They train five binary classifiers using several learners on 1375 emails and report F1-scores per speech act between .44 and .85. Despite the larger tagset, our classification approach achieves an average F1-score of .82 and therefore lies in the top ranks of the related work.

6 Conclusions

In this paper, we proposed an annotation schema for the discourse analysis of Wikipedia discussions aimed at the coordination efforts for article improvement. We applied the annotation schema to a corpus of 100 Wikipedia Talk pages, which we make freely available for download. A thorough analysis of the inter-annotator agreement showed that the dataset is reliable. Finally, we performed automatic dialog act classification on Wikipedia Talk pages. To this end, we combined three machine learning algorithms and two feature selection techniques into a classification pipeline, which we trained on our SEWD corpus. We achieve an average F1-score of .82, which is comparable to the human performance of .79. The ability to automatically classify discussion pages will help to investigate the relations between article discussions and article edits, which is an important step towards understanding the processes of collaboration in large-scale Wikis. Furthermore, it will be the basis for practical applications that bring the hidden content of Talk pages to the attention of article readers.

Acknowledgments

This work has been supported by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806, and by the Hessian research excellence program Landes-Offensive zur Entwicklung Wissenschaftlich-ökonomischer Exzellenz (LOEWE) as part of the research center Digital Humanities.

References

Ron Artstein and Massimo Poesio. 2008. Inter-Coder Agreement for Computational Linguistics. Computational Linguistics, 34(4):555–596, December.
John L. Austin. 1962. How to Do Things with Words. Clarendon Press, Cambridge, UK.
Emily M. Bender, Jonathan T. Morgan, Meghan Oxley, Mark Zachry, Brian Hutchinson, Alex Marin, Bin Zhang, and Mari Ostendorf. 2011. Annotating Social Acts: Authority Claims and Alignment Moves in Wikipedia Talk Pages. In Proceedings of the Workshop on Language in Social Media, pages 48–57, Portland, Oregon, USA.
Jean Carletta. 1996. Assessing Agreement on Classification Tasks: The Kappa Statistic. Computational Linguistics, 22(2):249–254.
Tamitha Carpenter and Emi Fujioka. 2011. The Role and Identification of Dialog Acts in Online Chat. In Proceedings of the Workshop on Analyzing Microtext at the 25th AAAI Conference on Artificial Intelligence, San Francisco, CA, USA.
William W. Cohen, Vitor R. Carvalho, and Tom M. Mitchell. 2004. Learning to Classify Email into Speech Acts. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 309–316, Barcelona, ES.
Mark G. Core and James F. Allen. 1997. Coding dialogs with the DAMSL annotation scheme. In Proceedings of the Working Notes of the AAAI Fall Symposium on Communicative Action in Humans and Machines, pages 28–35, Cambridge, MA, USA.
Han De Vries, Marc N. Elliott, David E. Kanouse, and Stephanie S. Teleki. 2008. Using Pooled Kappa to Summarize Interrater Agreement across Many Items. Field Methods, 20(3):272–282.
David Ferrucci and Adam Lally. 2004. UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment. Natural Language Engineering, 10:327–348.
Oliver Ferschke, Torsten Zesch, and Iryna Gurevych. 2011. Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia's Edit History. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. System Demonstrations, pages 97–102, Portland, OR, USA.
Paul Grice. 1975. Logic and Conversation. In Peter Cole and Jerry L. Morgan, editors, Syntax and Semantics, volume 3. New York: Academic Press.
Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11:10–18.
George Hripcsak and Adam S. Rothschild. 2005. Agreement, the f-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association, 12(3):296–298.
Dan Jurafsky, Liz Shriberg, and Debbra Biasca. 1997. Switchboard SWBD-DAMSL Shallow-Discourse-Function Annotation Coders Manual. Technical Report Draft 13, University of Colorado, Institute of Cognitive Science.
Jihie Kim, Jia Li, and Taehwan Kim. 2010a. Towards Identifying Unresolved Discussions in Student Online Forums. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pages 84–91, Los Angeles, CA, USA.
Su Nam Kim, Li Wang, and Timothy Baldwin. 2010b. Tagging and linking web forum posts. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, CoNLL '10, pages 192–202, Stroudsburg, PA, USA.
Klaus Krippendorff. 1980. Content Analysis: An Introduction to Its Methodology. Thousand Oaks, CA: Sage Publications.
J. Richard Landis and Gary G. Koch. 1977. An Application of Hierarchical Kappa-type Statistics in the Assessment of Majority Agreement among Multiple Observers. Biometrics, 33(2):363–374, June.
David Laniado, Riccardo Tasso, Yana Volkovich, and Andreas Kaltenbrunner. 2011. When the Wikipedians Talk: Network and Tree Structure of Wikipedia Discussion Pages. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media, Dublin, IE.
Tom Mitchell. 1997. Machine Learning. McGraw-Hill Education (ISE Editions), 1st edition.
Rebecca Passonneau. 2006. Measuring Agreement on Set-valued Items (MASI) for Semantic and Pragmatic Annotation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, Genoa, IT.
John C. Platt. 1998. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Learning, pages 185–208, Cambridge, MA, USA.
Ilona R. Posner and Ronald M. Baecker. 1992. How People Write Together. In Proceedings of the 25th Hawaii International Conference on System Sciences, pages 127–138, Wailea, Maui, HI, USA.
Ross Quinlan. 1992. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1st edition.
Jodi Schneider, Alexandre Passant, and John G. Breslin. 2011. Understanding and Improving Wikipedia Article Discussion Spaces. In Proceedings of the 26th Symposium on Applied Computing, Taichung, TW.
John R. Searle. 1969. Speech Acts. Cambridge University Press, Cambridge, UK.
John R. Searle. 1976. A classification of illocutionary acts. Language in Society, 5:1–23.
Besiki Stvilia, Michael B. Twidale, Linda C. Smith, and Les Gasser. 2008. Information Quality Work Organization in Wikipedia. Journal of the American Society for Information Science, 59:983–1001.
Grigorios Tsoumakas, Ioannis Katakis, and Ioannis P. Vlahavas. 2010. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667–685. Springer.
Fernanda Viégas, Martin Wattenberg, Jesse Kriss, and Frank van Ham. 2007. Talk Before You Type: Coordination in Wikipedia. In Proceedings of the 40th Annual Hawaii International Conference on System Sciences, Waikoloa, Big Island, HI, USA.
Eti Yaari, Shifra Baruchson-Arbib, and Judit Bar-Ilan. 2011. Information quality assessment of community generated content: A user study of Wikipedia. Journal of Information Science, 37:487–498.
Yiming Yang and Jan O. Pedersen. 1997. A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 412–420, San Francisco, CA, USA.
Torsten Zesch, Christof Müller, and Iryna Gurevych. 2008. Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, MA.
An Unsupervised Dynamic Bayesian Network Approach to Measuring
Speech Style Accommodation
Mahaveer Jain¹, John McDonough¹, Gahgene Gweon², Bhiksha Raj¹, Carolyn Penstein Rosé¹,²
1. Language Technologies Institute; 2. Human Computer Interaction Institute
Carnegie Mellon University
Pittsburgh, PA 15213
{mmahavee,johnmcd,ggweon,bhiksha,cprose}@cs.cmu.edu
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 787–797,
Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
particular stylistic elements we are focusing on. Our evaluation provides support for this hypothesis.

When stylistic shifts are focused on specific linguistic features, then measuring the extent of the stylistic accommodation is simple, since a speaker's style may be represented in a one- or two-dimensional space, and movement can then be measured precisely within this space using simple linear functions. However, the rich sociolinguistic literature on speech style accommodation highlights a much greater variety of speech style characteristics that may be associated with social status within an interaction and may thus be beneficial to monitor for stylistic shifts. Unfortunately, within any given context, the linguistic features that have these status associations, which we refer to as indexicality, are only a small subset of the linguistic features that are being used in some way. Furthermore, which features carry this indexicality is specific to a context. Thus, separating the socially meaningful variation from variation in linguistic features occurring for other reasons is akin to searching for the proverbial needle in a haystack. It is this technical challenge that we address in this paper.

In the remainder of the paper we review the literature on speech style accommodation both from a sociolinguistic perspective and from a technological perspective in order to motivate our hypothesis and proposed model. We then describe the technical details of our model. Next, we present an experiment in which we test our hypothesis about the nature of speech style accommodation and find statistically significant confirming evidence. We conclude with a discussion of the limitations of our model and directions for ongoing research.

2 Theoretical Framework

Our research goal is to model the structure of speech in a way that allows us to monitor social processes through speech. One common goal of prior work on modeling speech dynamics has been to inform the design of more natural spoken dialogue systems (Levitan et al., 2011). The practical goal of our work is to measure the social processes themselves, for example in order to estimate the extent to which group discussions show signs of productive consensus building processes (Gweon, 2011). Much prior work on modeling emotional speech has sought to identify features that themselves have a social interpretation, such as features that predict emotional states like uncertainty (Liscombe et al., 2005) or surprise (Ang et al., 2002), or social strategies like flirting (Ranganath et al., 2009). However, our goal is to monitor social processes that evolve over time and are reflected in the change in speech dynamics. Examples include fostering trust, forming attachments, or building solidarity.

2.1 Defining Speech Style Accommodation

The concept of what we refer to as Speech Style Accommodation has its roots in the field of the Social Psychology of Language, where the many ways in which social processes are reflected through language, and conversely, how language influences social processes, are the objects of investigation (Giles & Coupland, 1991). As a first step towards leveraging this broad range of language processes, we refer to one very specific topic, which has been referred to as entrainment, priming, accommodation, or adaptation in other computational work (Levitan & Hirschberg, 2011). Specifically, we refer to the finding that conversational partners may shift their speaking style within the interaction, either becoming more similar or less similar to one another.

Our usage of the term accommodation specifically refers to the process of speech style convergence within an interaction. Stylistic shifts may occur at a variety of levels of speech or language representation. For example, much of the early work on speech style accommodation focused on regional dialect variation, and specifically on aspects of pronunciation, such as the occurrence of post-vocalic r in New York City, that reflected differences in age, regional identification, and socioeconomic status (Labov, 2010a,b). The distribution of backchannels and pauses has also been the target of prior work on accommodation (Levitan & Hirschberg, 2011). These effects may be moderated by other social factors. For example, Bilous & Krauss (1988) found that females accommodated to their male partners in conversation in terms of the average number of words uttered per turn. Similarly, Hecht et al. (1989) reported that extroverts are more listener-adaptive than introverts and hence converged more in their data.
Accommodation can be measured from either the textual or the speech content of a conversation. The former relates to what people say, whereas the latter relates to how they say it. We are only interested in measuring accommodation from speech in this work. There has been work on convergence in text, such as syntactic adaptation (Reitter et al., 2006) and language similarity in online communities (Huffaker et al., 2006).

2.2 Social Interpretation of Speech Style Accommodation

It has long been established that while some speech style shifts are subconscious, speakers may also choose to adapt their way of speaking in order to achieve social effects within an interaction (Sanders, 1987). One of the main motives for accommodation is to decrease social distance. On a variety of levels, speech style accommodation has been found to affect the impression that speakers give within an interaction. For example, Welkowitz & Feldstein (1970) found that when speakers become more similar to their partners, they are liked more by their partners. Another study by Putman & Street Jr (1984) demonstrated that interviewees who converge to the speaking rate and response latency of their interviewers are rated more favorably by the interviewers. Giles et al. (1987) found that more accommodating speakers were rated as more intelligent and supportive by their partners. Conversely, social factors in an interaction affect the extent to which speakers engage in, and sometimes choose not to engage in, accommodation. For example, Purcell (1984) found that Hawaiian children exhibit more convergence in interactions with peer groups that they like more. Bourhis & Giles (1977) found that Welsh speakers, when answering an English surveyor, broadened their Welsh accent when their ethnic identity was challenged. Scotton (1985) found that few people hesitated to repeat lexical patterns of their partners to maintain integrity. Nenkova et al. (2008) found that accommodation on high-frequency words correlates with naturalness, task success, and coordinated turn-taking behavior.

ity of speech and lexical features either over full conversations or by comparing the similarity in the first half and the second half of the conversation. For example, Edlund et al. (2009) measure accommodation in pause and gap length using measures such as synchrony and convergence. Levitan & Hirschberg (2011) found that accommodation is also found in special social behaviors within conversation such as backchannels. They show that speakers in conversation tend to use similar kinds of speech cues, such as high pitch at the end of an utterance, to invite a backchannel from their partner. In order to measure accommodation on these cues, they compute the correlation between the numerical values of these cues used by the partners.

In our work we measure accommodation using Dynamic Bayesian Networks (DBNs). Our models are learnt in an unsupervised fashion. What we are specifically interested in is the manner in which the influence of one partner on the other is modeled. What is novel in our approach is the introduction of the concept of an accommodation state, or relational gestalt variable, which essentially models the momentum of the influence that one partner is having on the other partner's speaking style. It allows us to represent structurally the insight that accommodation occurs over time as a reflection of a social process, and thus has some consistency in the nature of the accommodation within some span of time. The prior work described in this section can be thought of as taking the influence of the partner's style directly on the speaker's style within an instant as the floor shifts from one speaker to the next. Thus, no consistency in the manner in which the accommodation is occurring is explicitly encouraged by the model. The major advantage of consistency of motion within the style shift over time is that it provides a signpost for identifying which style variation within the speech is salient with respect to social interpretation within a specific interaction, so that the model may remain agnostic and may thus be applied to a variety of interactions that differ with respect to which stylistic features are salient in this respect.
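The correlation-based measurement attributed above to Levitan & Hirschberg (2011) reduces, for a given cue, to correlating the paired numerical values produced by the two partners. A minimal Pearson-correlation sketch (the pairing of cue values by position in the dialog is an illustrative assumption):

```python
import math

def pearson(xs, ys):
    """Correlation between partners' values of one speech cue (e.g. pitch
    at potential backchannel points), paired by position in the dialog."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near +1 indicates that the partners' cue values rise and fall together (convergence on that cue), while a value near -1 indicates divergence; this is exactly the kind of instantaneous, cue-by-cue measure that the DBN approach introduced below tries to go beyond.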
ing rate etc. In this work, we leverage several of these speech features to quantify accommodation. We propose a series of models that can be trained unsupervised from speech features and can be used for predicting accommodation. The models attempt to capture the dependence of speech features on speaking style, as well as the effect of persistence and accommodation on style. We use a dynamic Bayesian network (DBN) formalism to capture these relationships. Below we briefly review DBNs, and subsequently describe the speech features used and the proposed models.

3.1 Dynamic Bayesian Networks

The theory of Bayesian networks is well documented and understood (Jensen, 1996; Pearl, 1988). A Bayesian network is a probabilistic model that represents statistical relationships between random variables via a directed acyclic graph (DAG). Formally, it is a directed acyclic graph whose nodes represent random variables (which may be observable quantities, latent unobservable variables, or hypotheses to be estimated). Edges represent conditional dependencies; nodes which are connected by an edge represent random variables that have a direct influence on one another. The entire network represents the joint probability of all the variables represented by the nodes, with appropriate factoring of the conditional dependencies between variables.

Consider, for instance, a joint distribution over a set of random variables x_1, x_2, ..., x_n, modeled by a Bayesian network. Let V = {v_1, v_2, ..., v_n} represent the set of n nodes in the network, representing the random variables x_1, x_2, ..., x_n respectively. Let π(v_i) represent the set of parent nodes of v_i, i.e. nodes in V that have a directed edge into the node v_i. Then, by the dependencies specified by the network, P(x_i | x_1, x_2, ..., x_n) = P(x_i | {x_j : v_j ∈ π(v_i)}). In other words, any variable x_i is directly dependent only on its parent variables, i.e. the random variables represented by the nodes in π(v_i), and is independent of all other variables given these variables. The joint probability of x_1, x_2, ..., x_n is hence given by

    p(x_1, x_2, ..., x_n) = ∏_i p(x_i | π(x_i))    (1)

where π(x_i) represents {x_j : v_j ∈ π(v_i)}, i.e. the parents of x_i in the network. We note that not all of these variables need to be observable; often in such models several of the variables are unobservable, i.e. they are latent. In order to obtain the joint distribution of the observable variables, the latent variables must be marginalized out, i.e. if x_1, ..., x_m are observable and x_{m+1}, ..., x_n are latent,

    P(x_1, ..., x_m) = Σ_{x_{m+1}, ..., x_n} P(x_1, x_2, ..., x_n).

Dynamic Bayesian networks (DBNs) further represent time-series data through a recurrent formulation of a basic Bayesian network that represents the relationship between variables. Within a DBN, a set of random variables at each time instance t is represented as a static Bayesian network with temporal dependencies to variables at other instants. Namely, the distribution of a variable x_{i,t} at time t may depend on variables x_{j,t'} at earlier times t' < t through conditional probabilities of the form P(x_{i,t} | x_{j,t'}). An example DBN, consisting of three variables (A, B and C), two of which have temporal dependencies, is shown in Figure 1.

Figure 1: An example Dynamic Bayesian Network (DBN) showing the temporal relationship between three random variables (A, B and C). A is observed and dependent on two hidden variables B and C. Directed edges across time (t-1 → t) indicate temporal relationships between variables. In this example, the variables A_t and B_t are both dependent on B_{t-1}, with the relationship defined through conditional distributions P(A_t | B_{t-1}) and P(B_t | B_{t-1}).

One benefit of the DBN formalism is that in addition to providing a compact graphical way of representing statistical relationships between variables in a process, the constrained, directed network structure also allows for simplified inference. Moreover, the conditional distributions associated with the network are often assumed not to vary over time, i.e. P(x_{i,t} | x_{j,t-τ}) = P(x_{i,t'} | x_{j,t'-τ}) for all t, t'. This allows for a very compact representation of DBNs and allows for efficient Expectation-Maximization (EM) learning algorithms to be applied.
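Equation (1) and the marginalization of latent variables can be made concrete with a minimal sketch of the A/B fragment of Figure 1 (C is omitted for brevity): B is a latent chain in which B_t depends on B_{t-1}, and the observed A_t also depends on B_{t-1}. All state spaces and probability tables below are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of Eq. (1): the joint factorizes over parents, and the
# observable marginal is obtained by summing the latent chain out.
# The conditional probability tables below are made-up numbers.
from itertools import product

states = (0, 1)                                  # binary style states
p_b0 = {0: 0.6, 1: 0.4}                          # P(B_0)
p_b = {(0, 0): 0.7, (0, 1): 0.3,                 # P(B_t | B_{t-1})
       (1, 0): 0.2, (1, 1): 0.8}
p_a = {(0, 0): 0.9, (0, 1): 0.1,                 # P(A_t | B_{t-1})
       (1, 0): 0.3, (1, 1): 0.7}

def joint(b_seq, a_seq):
    """p(b_0..b_T, a_1..a_T) = P(b_0) * prod_t P(b_t|b_{t-1}) P(a_t|b_{t-1})."""
    p = p_b0[b_seq[0]]
    for t in range(1, len(b_seq)):
        p *= p_b[(b_seq[t - 1], b_seq[t])] * p_a[(b_seq[t - 1], a_seq[t - 1])]
    return p

def likelihood(a_seq):
    """P(a_1..a_T): marginalize out the latent chain B_0..B_T."""
    return sum(joint(b_seq, a_seq)
               for b_seq in product(states, repeat=len(a_seq) + 1))

print(likelihood((1, 0, 1)))   # probability of one observation sequence
```

Because the conditional tables are shared across t (the time-invariance assumption above), the model is specified by three small tables regardless of the sequence length.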
In the discussion that follows we do not explic-

Figure 4: CSDM: A speaker's style depends on their partner's style at the previous turn.

Figure 6: AASM: Accommodation state associated with every speaker turn.
Figure 7: SASDM: A speaker's style depends both on mutual accommodation and the partner's style in the previous turn.

dicate that a speaker's style in any turn depends both on accommodation and on their partner's style in the previous turn. Figure 7 shows the DBN for this model.

Asymmetric Accommodated Style Dependence Model

The Asymmetric Accommodated Style Dependence Model (AASDM) extends the AASM by adding a direct dependence between a speaker's style and their partner's style in their most recent turn. The DBN for this is shown in Figure 8.

Figure 8: AASDM: The accommodation state associated with every speaker, and a speaker's style depends on the partner's style.

3.5 Interpreting the states

We note that we have referred to the states in the models above as style states. In reality, in all cases, we learn the parameters of the model in an unsupervised manner, since the data we use to train it do not have either speaking style or accommodation indicated (although, if they were labeled, the labels could be employed within our models). Consequently, we have no assurance that the states learned will actually correspond to speaking styles. They can only be considered a proxy for speaking style. Nevertheless, if both speakers are in the same state, they can both be expected to be producing similar prosodic features, as represented in the observation vectors. It is hence reasonable to assume that they are both speaking in similar style. Similarly, the accommodation state cannot be expected to actually depict accommodation; nevertheless, it can capture the dependencies that govern when the two speakers are likely to be in the same state.

4 Evaluation

The model we have just described allows us to investigate two separate aspects of our concept of speech style accommodation. The first aspect is that style accommodation occurs as a local influence of one speaker's style on the other speaker's style, as depicted by direct links between style states. The second aspect is that although this is a local phenomenon, because it is a reflection of a social process that extends over a period of time, there will be some persistence of accommodation over longer periods of time, as characterized by the accommodation state. We presented two different operationalizations of the accommodation state above, namely Asymmetric and Symmetric.

Accommodation is a phenomenon that occurs within interactions between speakers; we can expect not to observe accommodation occurring between individuals that have never met and are not interacting. On average, then, we expect to see more evidence of speech style accommodation in pairs of individuals who are interacting (i.e., Real Pairs) than in pairs of individuals who are not interacting and have never met (i.e., Constructed Pairs). Thus, we may evaluate the extent to which our model is sensitive to social dynamics within pairs by the extent to which it is able to distinguish between true conversation between Real Pairs of speakers and synthetic conversation between Constructed Pairs. A similar experimental paradigm has been adopted in prior work on speech style accommodation (Levitan et al., 2011).

Hypothesis: Our hypothesis is that models that explicitly represent the notion that accommodation occurs over a span of time with consistency of momentum will achieve better success at distinguishing between Real Pairs and Constructed Pairs than models that do not.

Experimental Manipulation: Thus, using the model we have just described, we are able to test our hypothesis using a 2×3 factorial design in which one factor is the inclusion of direct links from the style of one speaker to the style
of the other speaker, which we refer to as the DirectInfluence (DI) factor, with values True (T) and False (F), and the second factor is the inclusion of links from style states to and from Accommodation states, which we refer to as the IndirectInfluence (II) factor, with values False (F), Asymmetric (A), and Symmetric (S). The result of this 2×3 factorial design is the 6 different models described in Section 3, namely ISM (DI=False, II=False), CSDM (DI=True, II=False), SASM (DI=False, II=Symmetric), AASM (DI=False, II=Asymmetric), SASDM (DI=True, II=Symmetric), and AASDM (DI=True, II=Asymmetric).

Corpus: The success criterion in our experiment is the extent to which models of speech style accommodation are able to distinguish between Real Pairs and Constructed Pairs. In order to set up this comparison, we began with a corpus of debates between students about the reasons for the fall of the Ottoman Empire. We obtained this corpus from researchers who originally collected it to investigate issues related to learning from conversational interactions (Nokes et al., 2010). The full corpus contains interactions between 76 pairs of students who interacted for 8 minutes. Within each pair, one student was assigned the role of arguing that the fall of the Ottoman Empire was due to internal causes, whereas the other student was assigned the role of arguing that the fall of the Ottoman Empire was due to external causes. Each student was given a 4-page packet of supporting information for their side of the debate to draw from in the interaction.

The speech from each participant was recorded on a separate channel. As a first step, we aligned the speech recordings automatically to their transcriptions at the word and turn level. After aligning the corpus at the word level, we identify the turn interval of each partner in the conversation. We use 66 of the debates out of the complete set of 76 for the experiments discussed in this paper. We had to eliminate 10 dialogues where the segmentation and alignment failed. For each of our models, we used the same 3-fold cross-validation.

Participants: Participants were all male undergraduate students between the ages of 18 and 25. In prior studies, it has been shown that accommodation varies based on gender, age and familiarity between partners. This corpus is particularly appropriate because it controls for most of these factors. Furthermore, because the participants did not know each other before the debate, we can assume that if accommodation happened, it was only during the conversation.

Real versus Constructed Pairs: In our analysis below, we compare measured accommodation between pairs of humans who had a real conversation and a constructed pair in which one person from that conversation is paired with a constructed partner, where the partner's side of the conversation was constructed from turns that occurred in other conversations. We set up this comparison in order to isolate speech style convergence from lexical convergence when we evaluate the performance of our model. The difference between the measured accommodation between real and constructed pairs is treated as a weak operationalization of model accuracy at measuring speech style accommodation.

For each of the 20 Real Pairs in the test corpus we composed one Constructed Pair. Each Constructed Pair comprised one student from the corresponding Real Pair (i.e., the Real Student) and a Constructed Partner that resembled the real partner in content but not necessarily style. We did this by iterating through the real partner's turns, replacing each with a turn that matched as well as possible in terms of lexical content but came from a different conversation. Lexical content match was measured in terms of cosine similarity. Turns were selected from the other Real Pairs. Thus, the Constructed Partner had similar content to the corresponding real partner on a turn by turn basis, but the style of expression could not be influenced by the Real Student. Thus, ideally we should not see evidence of speech style accommodation within the Constructed Pairs.

Experimental Procedure: For each of the four models we computed an Accommodation Score for each of the Real Pairs and Constructed Pairs. In order to obtain a measure that can be used to compute accommodation for all the models considered, we compute the accommodation value as the fraction of turns in a session where partners exhibited the same speaking style.

Results: In order to test our hypothesis we constructed an ANOVA model with Accommodation Score as the dependent variable and DirectInfluence, IndirectInfluence, and RealVsConstructed as independent variables. Additionally we included the interaction terms between all pairs of independent variables.

Table 1: Accommodation measured using different models. Legend: μ = mean, σ = standard deviation, DI = Direct Influence, II = Indirect Influence.

            DI   II   Real μ (σ)    Constructed μ (σ)
    SASDM   T    S    .54 (.23)     .44 (.29)
    SASM    F    S    .54 (.23)     .44 (.29)
    CSDM    T    F    .60 (.26)     .52 (.30)
    ISM     F    F    .56 (.25)     .51 (.32)
    AASM    F    A    .60 (.24)     .51 (.30)
    AASDM   T    A    .61 (.24)     .48 (.30)

Using this ANOVA model, we find a highly significant main effect of the RealVsConstructed factor that demonstrates the general ability of the models to achieve separation between Real Pairs and Constructed Pairs; on average F(1,780) = 18.22, p < .0001.

However, when we look more closely, we find that although the trend is consistently to find more evidence of speech style accommodation in Real Pairs than in Constructed Pairs, we see differentiation among the models in terms of their ability to achieve this separation. When we examine the two-way interactions between DirectInfluence and RealVsConstructed as well as between IndirectInfluence and RealVsConstructed, although we do not find significant interactions, we do find some suggestive patterns when we do the Student's t post-hoc analysis. In particular, when we explore just the interaction between IndirectInfluence links, we find a significant separation between Real vs Constructed pairs for models with Accommodation states, but not for the cases where no Accommodation states are included. However, when we do the same for the interaction between DirectInfluence links and RealVsConstructed, we find significant separation with or without those links. This suggests that IndirectInfluence links are more important than DirectInfluence links. At a finer-grained level, when we examine the models individually, we only find a significant separation between Real and Constructed pairs with the model that includes both DirectInfluence and Symmetric IndirectInfluence links. These results suggest that Symmetric IndirectInfluence links may be slightly better than Asymmetric ones, and that combining DirectInfluence links and Symmetric IndirectInfluence links may be the best combination.

Based on this analysis, we find support for our hypothesis. We find that the model that includes Symmetric IndirectInfluence links and DirectInfluence links is the best balance between representational power and simplicity. The support for the inclusion of DirectInfluence links in the model is weaker than that of IndirectInfluence links, however. On a larger dataset, we may have observed stronger effects of both factors. Even on this small dataset, we find evidence that adding that structure improves the performance of the model without leading to overfitting.

5 Conclusions and Current Directions

In this paper we presented an unsupervised dynamic Bayesian modeling approach to modeling speech style accommodation in face-to-face interactions. Our model was motivated by the idea that because accommodation reflects social processes that extend over time within an interaction, one may expect a certain consistency of motion within the stylistic shift. Our evaluation demonstrated a statistically significant advantage for the models that embodied this idea.

An important motivation for our modeling approach was that it allows us to avoid targeting specific linguistic style features in our measure of accommodation. However, in our evaluation, we only tested our approach on conversations between male undergraduate students discussing the fall of the Ottoman Empire. Thus, while our evaluation provides evidence that we have taken a first important step towards our ultimate goal, we cannot yet claim that we have a model that performs equally effectively across contexts. In our future work, we plan to formally test the extent to which this allows us to accurately measure accommodation within contexts in which very different stylistic elements carry strategic social value.

Another important direction of our current research is to explore how measures of speech style accommodation may predict other important measures such as how positively partners view one another, how successfully partners perform tasks together, or how well students learn together.

6 Acknowledgments

We gratefully acknowledge John Levine and Timothy Nokes for sharing their data with us. This work was funded by NSF SBE 0836012.
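The Accommodation Score used in the evaluation above, the fraction of turns in a session at which the two partners occupy the same style state, can be sketched as follows. The state sequences here are made-up stand-ins for the per-turn states an EM-trained model would infer, and `accommodation_score` and `mean` are hypothetical helper names, not code from the paper.

```python
# Sketch of the evaluation's Accommodation Score: the fraction of turns in
# a session where the two partners' inferred style states coincide.
# The hard-coded sequences below are illustrative, not real model output.

def accommodation_score(states_a, states_b):
    """Fraction of turn indices at which both speakers share a style state."""
    assert len(states_a) == len(states_b) and states_a
    same = sum(1 for sa, sb in zip(states_a, states_b) if sa == sb)
    return same / len(states_a)

def mean(xs):
    return sum(xs) / len(xs)

# One score per session, separately for Real and Constructed Pairs,
# then compare the group means (cf. Table 1).
real_scores = [accommodation_score([0, 1, 1, 0], [0, 1, 0, 0]),
               accommodation_score([1, 1, 0, 1], [1, 1, 0, 0])]
constructed_scores = [accommodation_score([0, 1, 1, 0], [1, 0, 0, 1])]

print(mean(real_scores), mean(constructed_scores))   # 0.75 0.0
```

In the paper the resulting per-pair scores feed the ANOVA described in the Results paragraph; the comparison of group means here only mirrors the descriptive statistics reported in Table 1.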
References

Ang, J., Dhillon, R., Krupski, A., Shriberg, E., & Stolcke, A. (2002). Prosody-based automatic detection of annoyance and frustration in human-computer dialog. In Proc. ICSLP, volume 3, pages 2037-2040. Citeseer.

Bilous, F. & Krauss, R. (1988). Dominance and accommodation in the conversational behaviours of same- and mixed-gender dyads. Language and Communication, 8(3), 4.

Bourhis, R. & Giles, H. (1977). The language of intergroup distinctiveness. Language, ethnicity and intergroup relations, 13, 119.

Coupland, N. (2007). Style: Language variation and identity. Cambridge University Press.

DiMicco, J., Pandolfo, A., & Bender, W. (2004). Influencing group participation with a shared display. In Proceedings of the 2004 ACM conference on Computer supported cooperative work, pages 614-623. ACM.

Eckert, P. & Rickford, J. (2001). Style and sociolinguistic variation. Cambridge University Press.

Edlund, J., Heldner, M., & Hirschberg, J. (2009). Pause and gap length in face-to-face interaction. In Proc. Interspeech.

Giles, H. & Coupland, N. (1991). Language: Contexts and consequences. Thomson Brooks/Cole Publishing Co.

Giles, H., Mulac, A., Bradac, J., & Johnson, P. (1987). Speech accommodation theory: The next decade and beyond. Communication yearbook, 10, 13-48.

Gweon, G. A. P. U. M. R. B. R. C. P. (2011). The automatic assessment of knowledge integration processes in project teams. In Proceedings of Computer Supported Collaborative Learning.

Hecht, M., Boster, F., & LaMer, S. (1989). The effect of extroversion and differentiation on listener-adapted communication. Communication Reports, 2(1), 18.

Huffaker, D., Jorgensen, J., Iacobelli, F., Tepper, P., & Cassell, J. (2006). Computational measures for language similarity across time in online communities. In ACTS: Proceedings of the HLT-NAACL 2006 Workshop on Analyzing Conversations in Text and Speech, pages 15-22.

Jensen, F. V. (1996). An introduction to Bayesian networks. UCL Press.

Labov, W. (2010a). Principles of linguistic change: Internal factors, volume 1. Wiley-Blackwell.

Labov, W. (2010b). Principles of linguistic change: Social factors, volume 2. Wiley-Blackwell.

Lauritzen, S. L. & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, 50, 157-224.

Levitan, R. & Hirschberg, J. (2011). Measuring acoustic-prosodic entrainment with respect to multiple levels and dimensions. In Proceedings of Interspeech.

Levitan, R., Gravano, A., & Hirschberg, J. (2011). Entrainment in speech preceding backchannels. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, pages 113-117. Association for Computational Linguistics.

Liscombe, J., Hirschberg, J., & Venditti, J. (2005). Detecting certainness in spoken tutorial dialogues. In Proceedings of INTERSPEECH, pages 1837-1840. Citeseer.

Nenkova, A., Gravano, A., & Hirschberg, J. (2008). High frequency word entrainment in spoken dialogue. In Proceedings of ACL-08: HLT. Association for Computational Linguistics.

opensmile (2011). http://opensmile.sourceforge.net/.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.

Purcell, A. (1984). Code shifting Hawaiian style: children's accommodation along a decreolizing continuum. International Journal of the Sociology of Language, 1984(46), 71-86.

Putman, W. & Street Jr, R. (1984). The conception and perception of noncontent speech performance: Implications for speech-accommodation theory. International Journal of the Sociology of Language, 1984(46), 97-114.

Ranganath, R., Jurafsky, D., & McFarland, D. (2009). It's not you, it's me: detecting flirting and its misperception in speed-dates. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pages 334-342. Association for Computational Linguistics.

Reitter, D., Keller, F., & Moore, J. D. (2006). Computational modelling of structural priming in dialogue. In Proc. Human Language Technology conference - North American chapter of the Association for Computational Linguistics annual meeting, pages 121-124.

Sanders, R. (1987). Cognitive foundations of calculated speech. State University of New York Press.

Scotton, C. (1985). What the heck, sir: Style shifting and lexical colouring as features of powerful language. Sequence and pattern in communicative behaviour, pages 103-119.

Wang, Y., Kraut, R., & Levine, J. (2011). To stay or leave? The relationship of emotional and informational support to commitment in online health support groups. In Proceedings of the ACM conference on computer-supported cooperative work. ACM.

Ward, A. & Litman, D. (2007). Automatically measuring lexical and acoustic/prosodic convergence in tutorial dialog corpora. In Proceedings of the SLaTE Workshop on Speech and Language Technology in Education. Citeseer.

Welkowitz, J. & Feldstein, S. (1970). Relation of experimentally manipulated interpersonal perception and psychological differentiation to the temporal patterning of conversation. In Proceedings of the 78th Annual Convention of the American Psychological Association, volume 5, pages 387-388.
Learning the Fine-Grained Information Status of Discourse Entities
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 798-807, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

Abstract

While information status (IS) plays a crucial role in discourse processing, there have only been a handful of attempts to automatically determine the IS of discourse entities. We examine a related but more challenging task, fine-grained IS determination, which involves classifying a discourse entity as one of 16 IS subtypes. We investigate the use of rich knowledge sources for this task in combination with a rule-based approach and a learning-based approach. In experiments with a set of Switchboard dialogues, the learning-based approach achieves an accuracy of 78.7%, outperforming the rule-based approach by 21.3%.

1 Introduction

A linguistic notion central to discourse processing is information status (IS). It describes the extent to which a discourse entity, which is typically referred to by noun phrases (NPs) in a dialogue, is available to the hearer. Different definitions of IS have been proposed over the years. In this paper, we adopt Nissim et al.'s (2004) proposal, since it is primarily built upon Prince's (1992) and Eckert and Strube's (2001) well-known definitions, and is empirically shown by Nissim et al. to yield an annotation scheme for IS in dialogue that has good reproducibility.[1]

Specifically, Nissim et al. (2004) adopt a three-way classification scheme for IS, defining a discourse entity as (1) old to the hearer if it is known to the hearer and has previously been referred to in the dialogue; (2) new if it is unknown to her and has not been previously referred to; and (3) mediated (henceforth med) if it is newly mentioned in the dialogue but she can infer its identity from a previously-mentioned entity. To capture finer-grained distinctions for IS, Nissim et al. allow an old or med entity to have a subtype, which subcategorizes an old or med entity. For instance, a med entity has the subtype set if the NP that refers to it is in a set-subset relation with its antecedent.

IS plays a crucial role in discourse processing: it provides an indication of how a discourse model should be updated as a dialogue is processed incrementally. Its importance can be reflected in part in the amount of attention it has received in theoretical linguistics over the years (e.g., Halliday (1976), Prince (1981), Hajicova (1984), Vallduví (1992), Steedman (2000)), and in part in the benefits it can potentially bring to NLP applications. One task that could benefit from knowledge of IS is identity coreference: since new entities by definition have not been previously referred to, an NP marked as new does not need to be resolved, thereby improving the precision of a coreference resolver. Knowledge of fine-grained or subcategorized IS is valuable for other NLP tasks. For instance, an NP marked as set signifies that it is in a set-subset relation with its antecedent, thereby providing important clues for bridging anaphora resolution (e.g., Gasperin and Briscoe (2008)).

[1] It is worth noting that several IS annotation schemes have been proposed more recently. See Götze et al. (2007) and Riester et al. (2010) for details.
[2] These and other linguistic annotations on the Switchboard dialogues were later released by the LDC as part of the NXT corpus, which is described in Calhoun et al. (2010).

Despite the potential usefulness of IS in NLP tasks, there has been little work on learning the IS of discourse entities. To investigate the plausibility of learning IS, Nissim et al. (2004) annotate a set of Switchboard dialogues with such information,[2] and subsequently present a
rule-based approach and a learning-based approach to acquiring such knowledge (Nissim, 2006). More recently, we have improved Nissim's learning-based approach by augmenting her feature set, which comprises seven string-matching and grammatical features, with lexical and syntactic features (Rahman and Ng, 2011; henceforth R&N). Despite the improvements, the performance on new entities remains poor: an F-score of 46.5% was achieved.

Our goal in this paper is to investigate fine-grained IS determination, the task of classifying a discourse entity as one of the 16 IS subtypes defined by Nissim et al. (2004).[3] Owing in part to the increase in the number of categories, fine-grained IS determination is arguably a more challenging task than the 3-class IS determination task that Nissim and R&N investigated. To our knowledge, this is the first empirical investigation of automated fine-grained IS determination.

We propose a knowledge-rich approach to fine-grained IS determination. Our proposal is motivated in part by Nissim's and R&N's poor performance on new entities, which we hypothesize can be attributed to their sole reliance on shallow knowledge sources. In light of this hypothesis, our approach employs semantic and world knowledge extracted from manually and automatically constructed knowledge bases, as well as coreference information. The relevance of coreference to IS determination can be seen from the definition of IS: a new entity is not coreferential with any previously-mentioned entity, whereas an old entity may be. While our use of coreference information for IS determination and our earlier claim that IS annotation would be useful for coreference resolution may seem to have created a chicken-and-egg problem, they do not: since coreference resolution and IS determination can benefit from each other, it may be possible to formulate an approach where the two tasks can mutually bootstrap.

We investigate rule-based and learning-based approaches to fine-grained IS determination. In the rule-based approach, we manually compose rules to combine the aforementioned knowledge sources. While we could employ the same knowledge sources in the learning-based approach, we chose to encode, among other knowledge sources, the hand-written rules and their predictions directly as features for the learner. In an evaluation on 147 Switchboard dialogues, our learning-based approach to fine-grained IS determination achieves an accuracy of 78.7%, substantially outperforming the rule-based approach by 21.3%. Equally importantly, when employing these linguistically rich features to learn Nissim's 3-class IS determination task, the resulting classifier achieves an accuracy of 91.7%, surpassing the classifier trained on R&N's state-of-the-art feature set by 8.8% in absolute accuracy. Improvements on the new class are particularly substantial: its F-score rises from 46.7% to 87.2%.

[3] One of these 16 classes is the new type, for which no subtype is defined. For ease of exposition, we will refer to the new type as one of the 16 subtypes to be predicted.

2 IS Types and Subtypes: An Overview

In Nissim et al.'s (2004) IS classification scheme, an NP can be assigned one of three main types (old, med, new) and one of 16 subtypes. Below we will illustrate their definitions with examples, most of which are taken from Nissim (2003) or Nissim et al.'s (2004) dataset (see Section 3).

Old. An NP is marked as old if (i) it is coreferential with an entity introduced earlier, (ii) it is a generic pronoun, or (iii) it is a personal pronoun referring to the dialogue participants. Six subtypes are defined for old entities: identity, event, general, generic, ident generic, and relative. In Example 1, my is marked as old with subtype identity, since it is coreferent with I.

(1) I was angry that he destroyed my tent.

However, if the markable has a verb phrase (VP) rather than an NP as its antecedent, it will be marked as old/event, as can be seen in Example 2, where the antecedent of That is the VP put my phone number on the form.

(2) They ask me to put my phone number on the form. That I think is not needed.

Other NPs marked as old include (i) relative pronouns, which have the subtype relative; (ii) personal pronouns referring to the dialogue participants, which have the subtype general; and (iii) generic pronouns, which have the subtype generic. The pronoun you in Example 3 is an instance of a generic pronoun.

(3) I think to correct the judicial system, you have to get the lawyer out of it.

Note, however, that in a coreference chain of generic pronouns, every element of the chain is
assigned the subtype ident generic instead.

Mediated. An NP is marked as med if the entity it refers to has not been previously introduced in the dialogue, but can be inferred from already-mentioned entities or is generally known to the hearer. Nine subtypes are available for med entities: general, bound, part, situation, event, set, poss, func value, and aggregation.

General is assigned to med entities that are generally known, such as the Earth, China, and most proper names. Bound is reserved for bound pronouns, an instance of which is shown in Example 4, where its is bound to the variable of the universally quantified NP, Every cat.

(4) Every cat ate its dinner.

Poss is assigned to NPs involved in intra-phrasal possessive relations, including prenominal genitives (i.e., X's Y) and postnominal genitives (i.e., Y of X). Specifically, Y will be marked as poss if X is old or med; otherwise, Y will be new. For example, in cases like a friend's boat where a friend is new, boat is marked as new.

Four subtypes, namely part, situation, event, and set, are used to identify instances of bridging (i.e., entities that are inferrable from a related entity mentioned earlier in the dialogue). As an example, consider the following sentences:

(5a) He passed by the door of Jan's house and saw that the door was painted red.
(5b) He passed by Jan's house and saw that the door was painted red.

In Example 5a, by the time the hearer processes the second occurrence of the door, she has already had a mental entity corresponding to the door (after processing the first occurrence). As a result, the second occurrence of the door refers to an old entity. In Example 5b, on the other hand, the hearer is not assumed to have any mental representation of the door in question, but she can infer that the door she saw was part of Jan's house. Hence, this occurrence of the door should be marked as med with subtype part, as it is involved in a part-whole relation with its antecedent.

If an NP is involved in a set-subset relation with its antecedent, it inherits the med subtype set. This applies to the NP the house payment in Example 6, whose antecedent is our monthly budget.

(6) What we try to do to stick to our monthly budget is we pretty much have the house payment.

If an NP is part of a situation set up by a previously-mentioned entity, it is assigned the subtype situation, as exemplified by the NP a few horses in the sentence below, which is involved in the situation set up by John's ranch.

(7) Mary went to John's ranch and saw that there were only a few horses.

Similar to old entities, an NP marked as med may be related to a previously mentioned VP. In this case, the NP will receive the subtype event, as exemplified by the NP the bus in the sentence below, which is triggered by the VP traveling in Miami.

(8) We were traveling in Miami, and the bus was very full.

If an NP refers to a value of a previously mentioned function, such as the NP 30 degrees in Example 9, which is related to the temperature, then it is assigned the subtype func value.

(9) The temperature rose to 30 degrees.

Finally, the subtype aggregation is assigned to coordinated NPs if at least one of the NPs involved is not new. However, if all NPs in the coordinated phrase are new, the phrase should be marked as new. For instance, the NP My son and I in Example 10 should be marked as med/aggregation.

(10) I have a son ... My son and I like to play chess after dinner.

New. An entity is new if it has not been introduced in the dialogue and the hearer cannot infer it from previously mentioned entities. No subtype is defined for new entities.

There are cases where more than one IS value is appropriate for a given NP. For instance, given two occurrences of China in a dialogue, the second occurrence can be labeled as old/identity (because it is coreferential with an earlier NP) or med/general (because it is a generally known entity). To break ties, Nissim (2003) defines a precedence relation on the IS subtypes, which yields a total ordering on the subtypes. Since all the old subtypes are ordered before their med counterparts in this relation, the second occurrence of China in our example will be labeled as old/identity. Owing to space limitations, we refer the reader to Nissim (2003) for details.

3 Dataset

We employ Nissim et al.'s (2004) dataset, which comprises 147 Switchboard dialogues. We partition them into a training set (117 dialogues) and a test set (30 dialogues). A total of 58,835 NPs are annotated with IS types and subtypes.4 The distributions of NPs over the IS subtypes in the training set and the test set are shown in Table 1.

[Footnote 4: Not all NPs have an IS type/subtype. For instance, a pleonastic it does not refer to any real-world entity and therefore does not have any IS; the same holds for nouns such as course in of course, accident in by accident, etc.]

Table 1: Distributions of NPs over IS subtypes. The corresponding percentages are parenthesized.

IS subtype           Train (%)       Test (%)
old/identity         10236 (20.1)    1258 (15.8)
old/event             1943  (3.8)     290  (3.6)
old/general           8216 (16.2)    1129 (14.2)
old/generic           2432  (4.8)     427  (5.4)
old/ident generic     1730  (3.4)     404  (5.1)
old/relative          1241  (2.4)     193  (2.4)
med/general           2640  (5.2)     325  (4.1)
med/bound              529  (1.0)      74  (0.9)
med/part               885  (1.7)     120  (1.5)
med/situation         1109  (2.2)     244  (3.1)
med/event              351  (0.7)      67  (0.8)
med/set              10282 (20.2)    1771 (22.3)
med/poss              1318  (2.6)     220  (2.8)
med/func value         224  (0.4)      31  (0.4)
med/aggregation        580  (1.1)     117  (1.5)
new                   7158 (14.1)    1293 (16.2)
total                50874 (100)     7961 (100)

4 Rule-Based Approach

In this section, we describe our rule-based approach to fine-grained IS determination, where we manually design rules for assigning IS subtypes to NPs based on the subtype definitions in Section 2, Nissim's (2003) IS annotation guidelines, and our inspection of the IS annotations in the training set. The motivations behind having a rule-based approach are two-fold. First, it can serve as a baseline for fine-grained IS determination. Second, it can provide insight into how the available knowledge sources can be combined into prediction rules, which can potentially serve as sophisticated features for a learning-based approach.

As shown in Table 2, our ruleset is composed of 18 rules, which should be applied to an NP in the order in which they are listed. Rules 1-7 handle the assignment of old subtypes to NPs. For instance, Rule 1 identifies instances of old/general, which comprises the personal pronouns referring to the dialogue participants. Note that this and several other rules rely on coreference information, which we obtain from two sources: (1) chains generated automatically using the Stanford Deterministic Coreference Resolution System (Lee et al., 2011),5 and (2) manually identified coreference chains taken directly from the annotated Switchboard dialogues. Reporting results using these two ways of obtaining chains facilitates the comparison of the IS determination results that we can realistically obtain using existing coreference technologies against those that we could obtain if we further improved existing coreference resolvers. Note that both sources provide identity coreference chains. Specifically, the gold chains were annotated for NPs belonging to old/identity and old/ident generic. Hence, these chains can be used to distinguish between old/general NPs and old/ident generic NPs, because the former are not part of a chain whereas the latter are. However, they cannot be used to distinguish between old/general entities and old/generic entities, since neither of them belongs to any chains. As a result, when gold chains are used, Rule 1 will classify all occurrences of you that are not part of a chain as old/general, regardless of whether the pronoun is generic. While the gold chains alone can distinguish old/general and old/ident generic NPs, the Stanford chains cannot distinguish any of the old subtypes in the absence of other knowledge sources, since the resolver generates chains for all old NPs regardless of their subtypes. This implies that Rule 1 and several other rules are only a very crude approximation of the definition of the corresponding IS subtypes.

[Footnote 5: The Stanford resolver is available from http://nlp.stanford.edu/software/corenlp.shtml.]

The rules for the remaining old subtypes can be interpreted similarly. A few points deserve mention. First, many rules depend on the string of the NP under consideration (e.g., they in Rule 2 and whatever in Rule 4). The decision of which strings are chosen is based primarily on our inspection of the training data. Hence, these rules are partly data-driven. Second, these rules should be applied in the order in which they are shown. For instance, though not explicitly stated, Rule 3 is only applicable to the non-anaphoric you and they pronouns, since Rule 2 has already covered their anaphoric counterparts. Finally, Rule 7 uses non-anaphoricity as a test of old/event NPs.
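The tie-breaking by a total order on IS subtypes (Nissim, 2003), mentioned at the end of Section 2, can be sketched as below. The concrete ordering in the list is hypothetical; the paper only guarantees that each old subtype precedes its med counterpart.

```python
# Sketch of precedence-based tie-breaking over IS subtypes.
# The exact order below is a hypothetical illustration; the only property
# taken from the text is that old subtypes precede their med counterparts.
PRECEDENCE = [
    "old/identity", "old/ident_generic", "old/general", "old/generic",
    "med/general", "med/set", "new",
]

def resolve(candidate_labels):
    """Pick the highest-precedence (earliest-listed) applicable label."""
    return min(candidate_labels, key=PRECEDENCE.index)

# Second occurrence of "China": old/identity outranks med/general.
# resolve({"old/identity", "med/general"}) -> "old/identity"
```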
1. if the NP is I or you and it is not part of a coreference chain, then
   subtype := old/general
2. if the NP is you or they and it is anaphoric, then
   subtype := old/ident generic
3. if the NP is you or they, then
   subtype := old/generic
4. if the NP is whatever or an indefinite pronoun prefixed by some or any (e.g., somebody), then
   subtype := old/generic
5. if the NP is an anaphoric pronoun other than that, or its string is identical to that of a preceding NP, then
   subtype := old/ident
6. if the NP is that and it is coreferential with the immediately preceding word, then
   subtype := old/relative
7. if the NP is it, this or that, and it is not anaphoric, then
   subtype := old/event
8. if the NP is pronominal and is not anaphoric, then
   subtype := med/bound
9. if the NP contains and or or, then
   subtype := med/aggregation
10. if the NP is a multi-word phrase that (1) begins with so much, something, somebody, someone, anything, one, or different, or (2) has another, anyone, other, such, that, of or type as neither its first nor last word, or (3) its head noun is also the head noun of a preceding NP, then
    subtype := med/set
11. if the NP contains a word that is a hyponym of the word value in WordNet, then
    subtype := med/func value
12. if the NP is involved in a part-whole relation with a preceding NP based on information extracted from ReVerb's output, then
    subtype := med/part
13. if the NP is of the form X's Y or poss-pro Y, where X and Y are NPs and poss-pro is a possessive pronoun, then
    subtype := med/poss
14. if the NP fills an argument of a FrameNet frame set up by a preceding NP or verb, then
    subtype := med/situation
15. if the head of the NP and one of the preceding verbs in the same sentence share the same WordNet hypernym which is not in synsets that appear in one of the top five levels of the noun/verb hierarchy, then
    subtype := med/event
16. if the NP is a named entity (NE) or starts with the, then
    subtype := med/general
17. if the NP appears in the training set, then
    subtype := its most frequent IS subtype in the training set
18. subtype := new

Table 2: Hand-crafted rules for IS subtype determination, applied in the order listed.
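The ordered, first-match-wins application of the ruleset can be sketched as follows. Only Rules 1 and 2 and the default Rule 18 are shown, and the NP encoding is a hypothetical stand-in for the real system's representation.

```python
# Sketch of first-match-wins application of the ordered ruleset.
# The NP representation is hypothetical; only Rules 1, 2, and 18 are shown.
from dataclasses import dataclass

@dataclass
class NP:
    string: str
    in_coref_chain: bool = False
    anaphoric: bool = False

def rule1(np):  # Rule 1: non-anaphoric "I"/"you" -> old/general
    return np.string.lower() in {"i", "you"} and not np.in_coref_chain

def rule2(np):  # Rule 2: anaphoric "you"/"they" -> old/ident generic
    return np.string.lower() in {"you", "they"} and np.anaphoric

RULES = [
    (rule1, "old/general"),
    (rule2, "old/ident_generic"),
    # ... Rules 3-17 would follow in order ...
]

def assign_subtype(np):
    """Apply the rules in order; the first condition that fires wins."""
    for condition, subtype in RULES:
        if condition(np):
            return subtype
    return "new"  # Rule 18: default class
```

The loop makes the ordering constraint explicit: a rule is reached only if every earlier condition failed, which is the same property the rule-condition features of Section 5 encode.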
The reason is that these NPs have VP antecedents, but both the gold chains and the Stanford chains are computed over NPs only.

Rules 8-16 concern med subtypes. Apart from Rule 8 (med/bound), Rule 9 (med/aggregation), and Rule 11 (med/func value), which are arguably crude approximations of the definitions of the corresponding subtypes, the med rules are more complicated than their old counterparts, in part because of their reliance on the extraction of sophisticated knowledge. Below we describe the extraction process and the motivation behind them.

Rule 10 concerns med/set. The words and phrases listed in the rule, which are derived manually from the training data, provide suggestive evidence that the NP under consideration is a subset or a specific portion of an entity or concept mentioned earlier in the dialogue. Examples include another bedroom, different color, somebody else, any place, one of them, and most other cities.
Condition 3 of the rule, which checks whether the head noun of the NP has been mentioned previously, is a good test for identity coreference, but since all the old entities have supposedly been identified by the preceding rules, it becomes a reasonable test for set-subset relations.

For convenience, we identify part-whole relations in Rule 12 based on the output produced by ReVerb (Fader et al., 2011), an open information extraction system.6 The output contains, among other things, relation instances, each of which is represented as a triple, <A, rel, B>, where rel is a relation, and A and B are its arguments. To preprocess the output, we first identify all the triples that are instances of the part-whole relation using regular expressions. Next, we create clusters of relation arguments, such that each pair of arguments in a cluster has a part-whole relation. This is easy: since part-whole is a transitive relation (i.e., <A, part, B> and <B, part, C> imply <A, part, C>), we cluster the arguments by taking the transitive closure of these relation instances. Then, given an NP NPi in the test set, we assign med/part to it if there is a preceding NP NPj such that the two NPs are in the same argument cluster.

[Footnote 6: We use ReVerb ClueWeb09 Extractions 1.1, which is available from http://reverb.cs.washington.edu/reverb_clueweb_tuples-1.1.txt.gz.]

In Rule 14, we use FrameNet (Baker et al., 1998) to determine whether med/situation should be assigned to an NP, NPi. Specifically, we check whether it fills an argument of a frame set up by a preceding NP, NPj, or verb. To exemplify, let us assume that NPj is capital punishment. We search for punishment in FrameNet to access the appropriate frame, which in this case is rewards and punishments. This frame contains a list of arguments together with examples. If NPi is one of these arguments, we assign med/situation to NPi, since it is involved in a situation (described by a frame) that is set up by a preceding NP/verb.

In Rule 15, we use WordNet (Fellbaum, 1998) to determine whether med/event should be assigned to an NP, NPi, by checking whether NPi is related to an event, which is typically described by a verb. Specifically, we use WordNet to check whether there exists a verb, v, preceding NPi such that v and NPi have the same hypernym. If so, we assign NPi the subtype med/event. Note that we ensure that the hypernym they share does not appear in the top five levels of the WordNet noun and verb hierarchies, since we want them to be related via a concept that is not overly general.

Rule 16 identifies instances of med/general. The majority of its members are generally-known entities, whose identification is difficult as it requires world knowledge. Consequently, we apply this rule only after all other med rules are applied. As we can see, the rule assigns med/general to NPs that are named entities (NEs) and definite descriptions (specifically those NPs that start with the). The reason is simple. Most NEs are generally known. Definite descriptions are typically not new, so it seems reasonable to assign med/general to them given that the remaining (i.e., unlabeled) NPs are presumably either new or med/general.

Before Rule 18, which assigns an NP to the new class by default, we have a memorization rule that checks whether the NP under consideration appears in the training set (Rule 17). If so, we assign to it its most frequent subtype based on its occurrences in the training set. In essence, this heuristic rule can help classify some of the NPs that are somehow missed by the first 16 rules.

The ordering of these rules has a direct impact on the performance of the ruleset, so a natural question is: what criteria did we use to order the rules? We order them in such a way that they respect the total ordering on the subtypes imposed by Nissim's (2003) preference relation (see Section 3), except that we give med/general a lower priority than Nissim does, due to the difficulty involved in identifying generally known entities, as noted above.

5 Learning-Based Approach

In this section, we describe our learning-based approach to fine-grained IS determination. Since we aim to automatically label an NP with its IS subtype, we create one training/test instance from each hand-annotated NP in the training/test set. Each instance is represented using five types of features, as described below.

Unigrams (119704). We create one binary feature for each unigram appearing in the training set. Its value indicates the presence or absence of the unigram in the NP under consideration.

Markables (209751). We create one binary feature for each markable (i.e., an NP having an IS subtype) appearing in the training set. Its value is 1 if and only if the markable has the same string as the NP under consideration.
Markable predictions (17). We create 17 binary features, 16 of which correspond to the 16 IS subtypes and the remaining one corresponds to a dummy subtype. Specifically, if the NP under consideration appears in the training set, we use Rule 17 in our hand-crafted ruleset to determine the IS subtype it is most frequently associated with in the training set, and then set the value of the feature corresponding to this IS subtype to 1. If the NP does not appear in the training set, we set the value of the dummy subtype feature to 1.

Rule conditions (17). As mentioned before, we can create features based on the hand-crafted rules in Section 4. To describe these features, let us introduce some notation. Let Rule i be denoted by Ai → Bi, where Ai is the condition that must be satisfied before the rule can be applied and Bi is the IS subtype predicted by the rule. We could create one binary feature from each Ai, and set its value to 1 if Ai is satisfied by the NP under consideration. These features, however, fail to capture a crucial aspect of the ruleset: the ordering of the rules. For instance, Rule i should be applied only if the conditions of the first i−1 rules are not satisfied by the NP, but such ordering is not encoded in these features. To address this problem, we capture rule ordering information by defining binary feature fi as ¬A1 ∧ ¬A2 ∧ ... ∧ ¬Ai−1 ∧ Ai, where 1 ≤ i ≤ 16. In addition, we define a feature, f18, for the default rule (Rule 18) in a similar fashion, but since it does not have any condition, we simply define f18 as ¬A1 ∧ ... ∧ ¬A16. The value of a feature in this feature group is 1 if and only if the NP under consideration satisfies the condition defined by the feature. Note that we did not create any features from Rule 17 here, since we have already generated markables and markable prediction features for it.

Rule predictions (17). None of the features fi defined above makes use of the predictions of our hand-crafted rules (i.e., the Bi's). To make use of these predictions, we define 17 binary features, one for each Bi, where i = 1, ..., 16, 18. Specifically, the value of the feature corresponding to Bi is 1 if and only if fi is 1, where fi is a rule condition feature as defined above.

Since IS subtype determination is a 16-class classification problem, we train a multi-class SVM classifier on the training instances using SVMmulticlass (Tsochantaridis et al., 2004), and use it to make predictions on the test instances.7

[Footnote 7: For all the experiments involving SVMmulticlass, we set C, the regularization parameter, to 500,000, since preliminary experiments indicate that preferring generalization to overfitting (by setting C to a small value) tends to yield poorer classification performance. The remaining learning parameters are set to their default values.]

6 Evaluation

Next, we evaluate the rule-based approach and the learning-based approach to determining the IS subtype of each hand-annotated NP in the test set.

Classification results. Table 3 shows the results of the two approaches. Specifically, row 1 shows their accuracy, which is defined as the percentage of correctly classified instances. For each approach, we present results that are generated based on gold coreference chains as well as automatic chains computed by the Stanford resolver.

As we can see, the rule-based approach achieves accuracies of 66.0% (gold coreference) and 57.4% (Stanford coreference), whereas the learning-based approach achieves accuracies of 86.4% (gold) and 78.7% (Stanford). In other words, the gold coreference results are better than the Stanford coreference results, and the learning-based results are better than the rule-based results. While perhaps neither of these results is surprising, we are pleasantly surprised by the extent to which the learned classifier outperforms the hand-crafted rules: accuracies increase by 20.4% and 21.3% when gold coreference and Stanford coreference are used, respectively. In other words, machine learning has transformed a ruleset that achieves mediocre performance into a system that achieves relatively high performance.

These results also suggest that coreference plays a crucial role in IS subtype determination: accuracies could increase by up to 7.7-8.6% if we solely improved coreference resolution performance. This is perhaps not surprising: IS and coreference can mutually benefit from each other.

To gain additional insight into the task, we also show in rows 2-17 of Table 3 the performance on each of the 16 subtypes, expressed in terms of recall (R), precision (P), and F-score (F). A few points deserve mention. First, in comparison to the rule-based approach, the learning-based approach achieves considerably better performance on almost all classes. One that is of particular interest is the new class. As we can see in row 17, its F-score rises by about 30 points. These gains are accompanied by a simultaneous rise in recall and precision. In particular, recall increases by about 40 points.
Rule-Based Approach Learning-Based Approach
Gold Coreference Stanford Coreference Gold Coreference Stanford Coreference
1 Accuracy 66.0 57.4 86.4 78.7
IS Subtype R P F R P F R P F R P F
2 old/ident 77.5 78.2 77.8 66.1 52.7 58.7 82.8 85.2 84.0 75.8 64.2 69.5
3 old/event 98.6 50.4 66.7 71.3 43.2 53.8 98.3 87.9 92.8 2.4 31.8 4.5
4 old/general 81.9 82.7 82.3 72.3 83.6 77.6 97.7 93.7 95.6 87.8 92.7 90.2
5 old/generic 55.9 55.2 55.5 39.2 39.8 39.5 76.1 87.3 81.3 39.9 85.9 54.5
6 old/ident generic 48.7 77.7 59.9 27.2 51.8 35.7 57.1 87.5 69.1 47.2 44.8 46.0
7 old/relative 55.0 69.2 61.3 55.1 63.4 59.0 98.0 63.0 76.7 99.0 37.5 54.4
8 med/general 29.9 19.8 23.8 29.5 19.6 23.6 91.2 87.7 89.4 84.0 72.2 77.7
9 med/bound 56.4 20.5 30.1 56.4 20.5 30.1 25.7 65.5 36.9 2.7 40.0 5.1
10 med/part 19.5 100.0 32.7 19.5 100.0 32.7 73.2 96.8 83.3 73.2 96.8 83.3
11 med/situation 28.7 100.0 44.6 28.7 100.0 44.6 68.4 95.4 79.7 68.0 97.7 80.2
12 med/event 10.5 100.0 18.9 10.5 100.0 18.9 46.3 100.0 63.3 46.3 100.0 63.3
13 med/set 82.9 61.8 70.8 78.0 59.4 67.4 90.4 87.8 89.1 88.4 86.0 87.2
14 med/poss 52.9 86.0 65.6 52.9 86.0 65.6 93.2 92.4 92.8 90.5 97.6 93.9
15 med/func value 81.3 74.3 77.6 81.3 74.3 77.6 88.1 85.9 87.0 88.1 85.9 87.0
16 med/aggregation 57.4 44.0 49.9 57.4 43.6 49.6 85.2 72.9 78.6 83.8 93.9 88.6
17 new 50.4 65.7 57.0 50.3 65.1 56.7 90.3 84.6 87.4 90.4 83.6 86.9
Table 3: IS subtype accuracies and F-scores. In each row, the strongest result, as well as those that are statistically
indistinguishable from it according to the paired t-test (p < 0.05), are boldfaced.
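The evaluation measures reported in Table 3 can be computed as follows; the label strings and the toy gold/predicted sequences are illustrative only.

```python
# Sketch of the metrics in Table 3: overall accuracy, plus per-class
# recall, precision, and balanced F-score. Toy labels for illustration.
def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def prf(gold, pred, cls):
    """Recall, precision, and F-score for one IS subtype."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    r = tp / (tp + fn) if tp + fn else 0.0
    p = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return r, p, f

gold = ["old/ident", "new", "new", "med/set"]
pred = ["old/ident", "new", "med/set", "med/set"]
# accuracy(gold, pred) -> 0.75
```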
Now, recall from the introduction that previous attempts on 3-class IS determination by Nissim and R&N have achieved poor performance on the new class. We hypothesize that the use of shallow features in their approaches was responsible for the poor performance they observed, and that using our knowledge-rich feature set could improve its performance. We will test this hypothesis at the end of this section.

Other subtypes that are worth discussing are med/aggregation, med/func value, and med/poss. Recall that the rules we designed for these classes were only crude approximations, or, perhaps more precisely, simplified versions of the definitions of the corresponding subtypes. For instance, to determine whether an NP belongs to med/aggregation, we simply look for occurrences of and and or (Rule 9), whereas its definition requires that not all of the NPs in the coordinated phrase are new. Despite the over-simplicity of these rules, machine learning has enabled the available features to be combined in such a way that high performance is achieved for these classes (see rows 14-16).

Also worth examining are those classes for which the hand-crafted rules rely on sophisticated knowledge sources. They include med/part, which relies on ReVerb; med/situation, which relies on FrameNet; and med/event, which relies on WordNet. As we can see from the rule-based results (rows 10-12), these knowledge sources have yielded rules that achieved perfect precision but low recall: 19.5% for part, 28.7% for situation, and 10.5% for event. Nevertheless, the learning algorithm has again discovered a profitable way to combine the available features, enabling the F-scores of these classes to increase by 35.1-50.6%.

While most classes are improved by machine learning, the same is not true for old/event and med/bound, whose F-scores are 4.5% (row 3) and 5.1% (row 9), respectively, when Stanford coreference is employed. This is perhaps not surprising. Recall that the multi-class SVM classifier was trained to maximize classification accuracy. Hence, if it encounters a class that is both difficult to learn and under-represented, it may as well aim to achieve good performance on the easier-to-learn, well-represented classes at the expense of these hard-to-learn, under-represented classes.

Feature analysis. In an attempt to gain additional insight into the performance contribution of each of the five types of features used in the learning-based approach, we conduct feature ablation experiments. Results are shown in Table 4, where each row shows the accuracy of the classifier trained on all types of features except for the one shown in that row. For easy reference, the accuracy of the classifier trained on all types of features is shown in row 1 of the table. According to the paired t-test (p < 0.05), performance drops significantly whichever feature type is removed. This suggests that all five feature types are contributing positively to overall accuracy. Also, the markables features are the least important in the presence of other feature groups, whereas mark-
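The ablation protocol just described can be sketched as follows. The five feature-group names are the paper's, but the toy data and the nearest-centroid classifier standing in for the SVM are hypothetical.

```python
# Sketch of feature ablation: retrain with one feature group removed at a
# time and compare test accuracy. Toy data; a nearest-centroid classifier
# stands in for the SVM used in the paper.
GROUPS = ["unigrams", "markables", "markable_preds", "rule_conds", "rule_preds"]

def train_and_eval(train, test, active):
    """Train a toy nearest-centroid classifier on the active feature groups."""
    centroids, counts = {}, {}
    for feats, label in train:
        vec = centroids.setdefault(label, [0.0] * len(active))
        counts[label] = counts.get(label, 0) + 1
        for i, g in enumerate(active):
            vec[i] += feats.get(g, 0.0)
    for label, vec in centroids.items():
        centroids[label] = [v / counts[label] for v in vec]

    def predict(feats):
        def dist(c):
            return sum((feats.get(g, 0.0) - c[i]) ** 2
                       for i, g in enumerate(active))
        return min(centroids, key=lambda lab: dist(centroids[lab]))

    correct = sum(predict(f) == y for f, y in test)
    return correct / len(test)

train = [({"unigrams": 1.0, "rule_conds": 1.0}, "old"),
         ({"markables": 1.0}, "new"),
         ({"unigrams": 1.0, "rule_preds": 1.0}, "old"),
         ({"markables": 1.0}, "new")]
test = [({"unigrams": 1.0}, "old"), ({"markables": 1.0}, "new")]

full = train_and_eval(train, test, GROUPS)
ablation = {g: train_and_eval(train, test, [h for h in GROUPS if h != g])
            for g in GROUPS}
```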
Table 4: Feature ablation results. Each row shows test accuracy with the listed feature type removed.

Feature Type              Gold Coref   Stanford Coref
All features              86.4         78.7
- rule predictions        77.5         70.0
- markable predictions    72.4         64.7
- rule conditions         81.1         71.0
- unigrams                74.4         58.6
- markables               83.2         75.5

Table 6: Accuracies of the simplified ruleset.

Feature Type              Gold Coref   Stanford Coref
All rules                 66.0         57.4
- memorization            62.6         52.0
- ReVerb                  64.2         56.6
- cue words               63.8         54.0
Acknowledgments

We thank the three anonymous reviewers for their detailed and insightful comments on an earlier draft of the paper. This work was supported in part by NSF Grants IIS-0812261 and IIS-1147644.

References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics, Volume 1, pages 86-90.

Sasha Calhoun, Jean Carletta, Jason Brenier, Neil Mayo, Dan Jurafsky, Mark Steedman, and David Beaver. 2010. The NXT-format Switchboard corpus: A rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue. Language Resources and Evaluation, 44(4):387-419.

Miriam Eckert and Michael Strube. 2001. Dialogue acts, synchronising units and anaphora resolution. Journal of Semantics, 17(1):51-89.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1535-1545.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Caroline Gasperin and Ted Briscoe. 2008. Statistical anaphora resolution in biomedical texts. In Proceedings of the 22nd International Conference on Computational Linguistics, pages 257-264.

Michael Götze, Thomas Weskott, Cornelia Endriss, Ines Fiedler, Stefan Hinterwimmer, Svetlana Petrova, Anne Schwarz, Stavros Skopeteas, and Ruben Stoel. 2007. Information structure. In Working Papers of the SFB632, Interdisciplinary Studies on Information Structure (ISIS). Universitätsverlag Potsdam, Potsdam.

Eva Hajičová. 1984. Topic and focus. In Contributions to Functional Syntax, Semantics, and Language Comprehension (LLSEE 16), pages 189-202. John Benjamins, Amsterdam.

Michael A. K. Halliday. 1976. Notes on transitivity and theme in English. Journal of Linguistics, 3(2):199-244.

Heeyoung Lee, Yves Peirsman, Angel Chang, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2011. Stanford's multi-pass sieve coreference resolution system at the CoNLL-2011 shared task. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pages 28-34.

Malvina Nissim, Shipra Dingare, Jean Carletta, and Mark Steedman. 2004. An annotation scheme for information status in dialogue. In Proceedings of the 4th International Conference on Language Resources and Evaluation, pages 1023-1026.

Malvina Nissim. 2003. Annotation scheme for information status in dialogue. Available from http://www.stanford.edu/class/cs224u/guidelines-infostatus.pdf.

Malvina Nissim. 2006. Learning information status of discourse entities. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 94-102.

Ellen F. Prince. 1981. Toward a taxonomy of given-new information. In P. Cole, editor, Radical Pragmatics, pages 223-255. Academic Press, New York.

Ellen F. Prince. 1992. The ZPG letter: Subjects, definiteness, and information-status. In Discourse Description: Diverse Analysis of a Fund Raising Text, pages 295-325. John Benjamins, Philadelphia/Amsterdam.

Altaf Rahman and Vincent Ng. 2011. Learning the information status of noun phrases in spoken dialogues. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1069-1080.

Arndt Riester, David Lorenz, and Nina Seemann. 2010. A recursive annotation scheme for referential information status. In Proceedings of the Seventh International Conference on Language Resources and Evaluation, pages 717-722.

Mark Steedman. 2000. The Syntactic Process. The MIT Press, Cambridge, MA.

Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. 2004. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the 21st International Conference on Machine Learning, pages 104-112.

Enric Vallduví. 1992. The Informational Component. Garland, New York.
Composing extended top-down tree transducers

Aurélie Lagoutte
École normale supérieure de Cachan, Département Informatique
alagoutt@dptinfo.ens-cachan.fr
Abstract

A composition procedure for linear and nondeleting extended top-down tree transducers is presented. It is demonstrated that the new procedure is more widely applicable than the existing methods. In general, the result of the composition is an extended top-down tree transducer that is no longer linear or nondeleting, but in a number of cases these properties can easily be recovered by a post-processing step.

[Figure 1: Word drop [top] and reordering [bottom]. The tree diagrams could not be reproduced here.]
1 Introduction

Tree-based translation models such as synchronous tree substitution grammars (Eisner, 2003; Shieber, 2004) or multi bottom-up tree transducers (Lilin, 1978; Engelfriet et al., 2009; Maletti, 2010; Maletti, 2011) are used for several aspects of syntax-based machine translation (Knight and Graehl, 2005). Here we consider the extended top-down tree transducer (XTOP), which was studied in (Arnold and Dauchet, 1982; Knight, 2007; Graehl et al., 2008; Graehl et al., 2009) and implemented in the toolkit TIBURON (May and Knight, 2006; May, 2010). Specifically, we investigate compositions of linear and nondeleting XTOPs (ln-XTOP). Arnold and Dauchet (1982) showed that ln-XTOPs compute a class of transformations that is not closed under composition, so we cannot compose two arbitrary ln-XTOPs into a single ln-XTOP. However, we will show that ln-XTOPs can be composed into a (not necessarily linear or nondeleting) XTOP. To illustrate the use of ln-XTOPs in machine translation, we consider the following English sentence together with a German reference translation:

[Footnote: All authors were financially supported by the EMMY NOETHER project MA/4959/1-1 of the German Research Foundation (DFG).]

The newswire reported yesterday that the Serbs have completed the negotiations.

Gestern [Yesterday] berichtete [reported] die [the] Nachrichtenagentur [newswire] die [the] Serben [Serbs] hätten [would have] die [the] Verhandlungen [negotiations] beendet [completed].

The relation between them can be described (Yamada and Knight, 2001) by three operations: drop of the relative pronoun, movement of the participle to the end of the clause, and word-to-word translation. Figure 1 shows the first two operations, and Figure 2 shows ln-XTOP rules performing them. Let us now informally describe the execution of an ln-XTOP on the top rule of Figure 2. In general, ln-XTOPs process an input tree from the root towards the leaves using a set of rules and states. The state p in the left-hand side controls the particular operation of Figure 1 [top]. Once the operation has been performed, control is passed to states pNP and pVP, which use their own rules to process the remaining input subtree governed by the variable below them (see Figure 2). In the same fashion, an ln-XTOP containing the bottom rule of Figure 2 reorders the English verbal complex.

In this way we model the word drop by an ln-XTOP M and reordering by an ln-XTOP N. The syntactic properties of linearity and nondeletion yield nice algorithmic properties, and the mod-

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 808-817, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
[Figure 2: XTOP rules for the operations of Figure 1.]

[Figure 3: Linear normalized tree t ∈ T(Q(X)) [left] and its substitution instance [right] with var(t) = {x1, x2, x3}. The positions are indicated in t as superscripts; the subtree t|2 is shown.]
[Figure 6: Rule [left] and reversed rule [right].]

[Figure 7: Top rule of Figure 2 reversed.]
Knight, 2007; Graehl et al., 2008; Graehl et al., 2009). Formally, an extended top-down tree transducer with finite look-ahead (XTOP^F) is a system M = (Q, Σ, Δ, I, R, c) where

- Q is a finite set of states,
- Σ and Δ are alphabets of input and output symbols, respectively,
- I ⊆ Q is a set of initial states,
- R is a finite set of (rewrite) rules of the form ℓ → r where ℓ ∈ Q(T_Σ(X)) is linear and r ∈ T_Δ(Q(var(ℓ))), and
- c: R × X → T_Σ(X) assigns a look-ahead restriction to each rule and variable such that c(ρ, x) is linear for each ρ ∈ R and x ∈ X.

The XTOP^F M is linear (respectively, nondeleting) if r is linear (respectively, var(r) = var(ℓ)) for every rule ℓ → r ∈ R. It has no look-ahead (or it is an XTOP) if c(ρ, x) ∈ X for all rules ρ ∈ R and x ∈ X. In this case, we drop the look-ahead component c from the description. A rule ℓ → r ∈ R is consuming (respectively, producing) if pos_Σ(ℓ) ≠ ∅ (respectively, pos_Δ(r) ≠ ∅). We let Lhs(M) = {l | ∃q, r: q(l) → r ∈ R}.

Let M = (Q, Σ, Δ, I, R, c) be an XTOP^F. In order to facilitate composition, we define sentential forms more generally than immediately necessary. Let Σ₀ and Δ₀ be such that Σ ⊆ Σ₀ and Δ ⊆ Δ₀. To keep the presentation simple, we assume that Q ∩ (Σ₀ ∪ Δ₀) = ∅. A sentential form of M (using Σ₀ and Δ₀) is a tree of SF(M) = T_Δ₀(Q(T_Σ₀)). For every ξ, ζ ∈ SF(M), we write ξ ⇒_M ζ if there exist a position w ∈ pos_Q(ξ), a rule ρ = ℓ → r ∈ R, and a substitution θ: X → T_Σ₀ such that θ(x) is an instance of c(ρ, x) for every x ∈ X and ξ = ξ[ℓθ]_w and ζ = ξ[rθ]_w. If the applicable rules are restricted to a certain subset R′ ⊆ R, then we also write ⇒_{R′}. Figure 5 illustrates a derivation step. The tree transformation computed by M is τ_M = {(t, u) | ∃q ∈ I: q(t) ⇒*_M u}. Note that the semantics of M is independent of the choice of Σ₀ and Δ₀. Moreover, it is known (Graehl et al., 2009) that each XTOP^F can be transformed into an equivalent XTOP preserving both linearity and nondeletion. However, the notion of XTOP^F will be convenient in our composition construction. A detailed exposition to XTOPs is presented by Arnold and Dauchet (1982) and Graehl et al. (2009).

A linear and nondeleting XTOP M with rules R can easily be reversed to obtain a linear and nondeleting XTOP M⁻¹ with rules R⁻¹, which computes the inverse transformation τ_{M⁻¹} = (τ_M)⁻¹, by reversing all its rules. A (suitable) rule is reversed by exchanging the locations of the states. More precisely, given a rule q(l) → r ∈ R, we obtain the rule q(r′) → l′ of R⁻¹, where l′ = lθ and r′ is the unique tree such that there exists a substitution θ: X → Q(X) with θ(x) ∈ Q({x}) for every x ∈ X and r = r′θ. Figure 6 displays a rule and its corresponding reversed rule. The reversed form of the XTOP rule modeling the insertion operation in Figure 2 is displayed in Figure 7.

Finally, let us formally define composition. The XTOP M computes the tree transformation τ_M ⊆ T_Σ × T_Δ. Given another XTOP N that computes a tree transformation τ_N ⊆ T_Δ × T_Γ, we might be interested in the tree transformation computed by the composition of M and N (i.e., running M first and then N). Formally, the composition τ_M ; τ_N of the tree transformations τ_M and τ_N is defined by

τ_M ; τ_N = {(s, u) | ∃t: (s, t) ∈ τ_M, (t, u) ∈ τ_N}

and we often also use the notion composition for XTOPs with the expectation that the composition of M and N computes exactly τ_M ; τ_N.
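This relational composition can be sketched concretely. The following Python fragment is illustrative only: the finite relations tau_M and tau_N are hypothetical toy stand-ins for the transformations computed by the XTOPs M and N, whose domains are in general infinite.

```python
# Illustrative sketch: composition of two tree transformations given as
# finite relations of (input, output) pairs. Real XTOP transformations
# are generally infinite; the toy relations below are hypothetical.

def compose(tau_m, tau_n):
    """Return tau_m ; tau_n = {(s, u) | exists t: (s, t) in tau_m, (t, u) in tau_n}."""
    return {(s, u) for (s, t) in tau_m for (t2, u) in tau_n if t == t2}

tau_M = {("s1", "t1"), ("s2", "t2")}
tau_N = {("t1", "u1"), ("t2", "u2"), ("t3", "u3")}

print(sorted(compose(tau_M, tau_N)))  # [('s1', 'u1'), ('s2', 'u2')]
```

The definition is purely relational: it is agnostic to whether a single device (here, a single XTOP) can compute the composed relation, which is exactly the question the paper addresses.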
[Figure 8: Incompatible left-hand sides of Example 3 — a left-hand side of M⁻¹ and a left-hand side of N.]

[Figure 9: Rules used in Example 5 — a rule of M⁻¹ and a rule of N.]
[Figure 10: Additional rule used in Example 5.]

[Figure 11: Rules replacing the rule in Figure 7.]

var(r2|w′) ⊆ var(l2|v) and var(r2[ε]w′) ∩ V = ∅ (where the choice is arbitrary). Let z ∈ X \ var(l2) be a fresh variable, q′ be a new state of N, and V′ = var(l2|v) \ V. We replace the rule ρ = q(l2) → r2 of R_N by

ρ1 = q(l2[z]v) → trans(r2)[q′(z)]w′
ρ2 = q′(l2|v) → r2|w′.

The look-ahead for z is trivial and otherwise we simply copy the old look-ahead, so c_N(ρ1, z) = z and c_N(ρ1, x) = c_N(ρ, x) for all x ∈ X \ {z}. Moreover, c_N(ρ2, x) = c_N(ρ, x) for all x ∈ X. The mapping trans is given for t = δ(t1, ..., tk) and q″(z″) ∈ Q(Z) by

trans(t) = δ(trans(t1), ..., trans(tk))
trans(q″(z″)) = ⟨l2|v, q″, v′⟩(z) if z″ ∈ V′, and q″(z″) otherwise,

where v′ = pos_{z″}(l2|v).

Finally, we collect all newly generated states of the form ⟨l, q, v⟩ in Q_l, and for every such state with l = δ(l1, ..., lk) and v = iw, let l′ = δ(z1, ..., zk) and

⟨l, q, v⟩(l′) → q(zi) if w = ε, and ⟨l, q, v⟩(l′) → ⟨li, q, w⟩(zi) otherwise,

be a new rule of N′ without look-ahead.

Overall, we run the procedure until N′ is compatible with M. The procedure eventually terminates since the left-hand sides of the newly added rules are always smaller than the replaced rules. Moreover, each step preserves the semantics of N′, which completes the proof.

We note that the look-ahead of N′ after the construction used in the proof of Theorem 4 is either trivial (i.e., a variable) or a ground tree (i.e., a tree without variables). Let us illustrate the construction used in the proof of Theorem 4.

Example 5. Let us consider the rules illustrated in Figure 9. We might first note that y1 has to be unified with α. Since α does not contain any variables and the right-hand side of the rule of N does not contain any non-variable leaves, we are in case (i) in the proof of Theorem 4. Consequently, the displayed rule of N is replaced by a variant, in which α is replaced by a new variable z with look-ahead α.

Secondly, with this new rule there is an mgu, in which y2 is mapped to σ(z1, z2). Clearly, we are now in case (ii). Furthermore, we can select the set V = {z1, z2} and position w′ = ε. Correspondingly, the following two new rules for N replace the old rule:

q(σ(z, z′)) → q′(z′)
q′(σ(z1, z2)) → δ(q1(z1), q2(z2)),

where the look-ahead for z remains α.

Figure 10 displays another rule of N. There is an mgu, in which y2 is mapped to σ(z2, z3). Thus, we end up in case (ii) again and we can select the set V = {z2} and position w′ = 2. Thus, we replace the rule of Figure 10 by the new rules

q(σ(z1, z)) → δ(q1(z1), q′(z), q̂3(z))  (⋆)
q′(σ(z2, z3)) → q2(z2)
q̂3(σ(z1, z2)) → q3(z2),

where q̂3 = ⟨σ(z2, z3), q3, 2⟩.
4.2 Local determinism

After the first pre-processing step, we have the original linear and nondeleting XTOP M and an XTOP^F N′ = (Q′, Δ, Γ, I_N, R_{N′}, c_N) that is equivalent to N and compatible with M. However, in the first pre-processing step we might have introduced some non-linear (copying) rules in N′ (see rule (⋆) in Example 5), and it is known that nondeterminism [in M] followed by copying [in N′] is a feature that prevents composition to work (Engelfriet, 1975; Baker, 1979). However, our copying is very local and the copies are only used to project to different subtrees. Nevertheless, during those projection steps, we need to make sure that the processing in M proceeds deterministically. We immediately note that all but one copy are processed by states of the form ⟨l, q, v⟩ ∈ Q_l. These states basically process (part of) the tree l and project (with state q) to the subtree at position v. It is guaranteed that each such subtree (indicated by v) is reached only once. Thus, the copying is resolved once the states of the form ⟨l, q, v⟩ are left. To keep the presentation simple, we just add expanded rules to M such that any rule that can produce a part of a tree l immediately produces the whole tree. A similar strategy is used to handle the look-ahead of N′. Any right-hand side of a rule of M that produces part of a left-hand side of a rule of N′ with look-ahead is expanded to produce the required look-ahead immediately.

[Figure 12: Useful rules for the composition M′; N′ of Example 8, where s, s′ ∈ {γ, λ} and ρ ∈ P_{σ(z2,z3)}.]

Let L ⊆ T_Δ(Z) be the set of trees l such that ⟨l, q, v⟩ appears as a state of Q_l, or l = l2θ for some ρ2 = q(l2) → r2 ∈ R_{N′} of N′ with non-trivial look-ahead (i.e., c_N(ρ2, z) ∉ X for some z ∈ X), where θ(x) = c_N(ρ2, x) for every x ∈ X.

To keep the presentation uniform, we assume that for every l ∈ L, there exists a state of the form ⟨l, q, v⟩ ∈ Q′. If this is not already the case, then we can simply add useless states without rules for them. In other words, we assume that the first case applies to each l ∈ L.

Next, we add two sets of rules to R_M, which will not change the semantics but prove to be useful in the composition construction. First, for every tree t ∈ L, let R_t contain all the rules p_ρ(l) → r, where p_ρ = ⟨p(l) → r⟩ is a new state with p ∈ P, minimal normalized tree l ∈ T_Σ(X), and an instance r ∈ T_Δ(P(X)) of t such that p(l) ⇒*_{M′} ξ ⇒_{M′} r for some ξ that is not an instance of t. In other words, we construct each rule of R_t by applying existing rules of R_M in sequence to generate a (minimal) right-hand side that is an instance of t. We thus potentially make the right-hand sides of M bigger by joining several existing rules into a single rule. Note that this affects neither compatibility nor the semantics. In the second step, we add pure ε-rules that allow us to change the state to one that we constructed in the previous step. For every new state p_ρ with ρ = p(l) → r, let base(p_ρ) = p. Then R_{M′} = R_M ∪ R_L ∪ R_E and P′ = P ∪ ⋃_{t∈L} P_t, where

R_L = ⋃_{t∈L} R_t and P_t = {ℓ(ε) | ℓ → r ∈ R_t}
R_E = ⋃_{t∈L} {base(p)(x1) → p(x1) | p ∈ P_t}.

Clearly, this does not change the semantics because each rule of R_{M′} can be simulated by a chain of rules of R_M. Let us now do a full example for the pre-processing step. We consider a nondeterministic variant of the classical example by Arnold and Dauchet (1982).

Example 6. Let M = (P, Σ, Σ, {p}, R_M) be the linear and nondeleting XTOP such that P = {p, p_γ, p_λ}, Σ = {σ, δ, γ, λ, α}, and R_M contains the following rules

p(σ(y1, y2)) → σ(p_s(y1), p(y2))  (‡)
p(δ(y1, y2, y3)) → σ(p_s(y1), σ(p_{s′}(y2), p(y3)))
p(δ(y1, y2, y3)) → σ(p_s(y1), σ(p_{s′}(y2), p_{s′}(y3)))
p_s(s′(y1)) → s(p_s(y1))
p_s(α) → α

for every s, s′ ∈ {γ, λ}. Similarly, we let N = (Q, Σ, Σ, {q}, R_N) be the linear and nondeleting XTOP such that Q = {q, i} and R_N contains the following rules

q(σ(z1, z2)) → σ(i(z1), i(z2))
q(σ(z1, σ(z2, z3))) → δ(i(z1), i(z2), q(z3))  (†)
i(s(z1)) → s(i(z1))
i(α) → α

for all s ∈ {γ, λ}. It can easily be verified that M and N meet our requirements. However, N is not yet compatible with M because an mgu between rules (‡) of M and (†) of N might map y2 to σ(z2, z3). Thus, we decompose (†) into

q(σ(z1, z)) → δ(i(z1), q̂(z), q′(z))
q′(σ(z2, z3)) → q(z3)
q̂(σ(z1, z2)) → i(z1)

where q̂ = ⟨σ(z2, z3), i, 1⟩. This newly obtained XTOP N′ is compatible with M. In addition, we only have one special tree σ(z2, z3) that occurs in states of the form ⟨l, q, v⟩. Thus, we need to compute all minimal derivations whose output trees are instances of σ(z2, z3). This is again simple

[Figure 13: Composed rule created from the rule of Figure 7 and the rules of N′ displayed in Figure 11.]

5 Composition

Now we are ready for the actual composition. For space efficiency reasons we reuse the notations used in Section 4. Moreover, we identify trees of T(Q′(P′(X))) with trees of T((Q′ × P′)(X)). In other words, when meeting a subtree q(p(x)) with q ∈ Q′, p ∈ P′, and x ∈ X, then we also view this equivalently as the tree ⟨q, p⟩(x), which could be part of a rule of our composed XTOP. However, not all combinations of states will be allowed in our composed XTOP, so some combinations will never yield valid rules.

Generally, we construct a rule of M′; N′ by applying a single rule of M′ followed by any number of pure ε-rules of R_E, which can turn states base(p) into p. Then we apply any number of rules of N′ and try to obtain a sentential form that has the required shape of a rule of M′; N′.

Definition 7. Let M′ = (P′, Σ, Δ, I_M, R_{M′}) and
[Figure 14: Successfully expanded rule from Example 9.]

[Figure 15: Expanded rule that remains copying (see Example 9).]

q(ρ_{s′,s″}(δ(x1, x2, x3)))
⇒_{M′} q(σ(p_{s′}(x1), σ(p_{s″}(x2), p(x3))))
⇒_{N′} q′(p_{s′}(x1))

Finally, let us construct a rule for the state combination ⟨q″, ρ_{s′,s″}⟩.

q″(ρ_{s′,s″}(δ(x1, x2, x3)))
⇒_{M′} q″(σ(p_{s′}(x1), σ(p_{s″}(x2), p(x3))))
⇒_{R_E} q″(σ(p_{s′}(x1), σ(p_{s″}(x2), ρ_s(x3))))
⇒_{N′} q(σ(p_{s″}(x2), ρ_s(x3)))
⇒_{N′} δ(q′(p_{s″}(x1)), q(ρ_s(x2)), q″(ρ_s(x2)))

for every s ∈ {γ, λ}.

Example 9. The first (top row, left-most) rule of Figure 12 is non-linear in the variable y2. Thus, we expand the calls ⟨q, p⟩(y2) and ⟨q′, p⟩(y2). If p = ρ_s for some s ∈ {γ, λ}, then the next rules are uniquely determined and we obtain the rule displayed in Figure 14. Here the expansion was successful and we could delete the original rule for p = ρ_s and replace it by the displayed expanded rule. However, if p = ρ′_{s′,s″}, then we can also expand the rule to obtain the rule displayed in Figure 15. It is still copying and we could repeat the process of expansion here, but we cannot get rid of all copying rules using this approach (as expected since there is no linear XTOP computing the same tree transformation).

After having pre-processed the XTOPs in our introductory example, the devices M and N′ can be composed into M; N′. One rule of the composed XTOP is illustrated in Figure 13.
References

André Arnold and Max Dauchet. 1982. Morphismes et bimorphismes d'arbres. Theoretical Computer Science, 20(1):33–93.

Brenda S. Baker. 1979. Composition of top-down and bottom-up tree transductions. Information and Control, 41(2):186–213.

Jason Eisner. 2003. Learning non-isomorphic tree mappings for machine translation. In Proc. ACL, pages 205–208. Association for Computational Linguistics.

Joost Engelfriet, Eric Lilin, and Andreas Maletti. 2009. Composition and decomposition of extended multi bottom-up tree transducers. Acta Informatica, 46(8):561–590.

Joost Engelfriet. 1975. Bottom-up and top-down tree transformations: a comparison. Mathematical Systems Theory, 9(3):198–231.

Joost Engelfriet. 1977. Top-down tree transducers with regular look-ahead. Mathematical Systems Theory, 10(1):289–303.

Jonathan Graehl, Kevin Knight, and Jonathan May. 2008. Training tree transducers. Computational Linguistics, 34(3):391–427.

Jonathan Graehl, Mark Hopkins, Kevin Knight, and Andreas Maletti. 2009. The power of extended top-down tree transducers. SIAM Journal on Computing, 39(2):410–430.

Kevin Knight and Jonathan Graehl. 2005. An overview of probabilistic tree transducers for natural language processing. In Proc. CICLing, volume 3406 of LNCS, pages 1–24. Springer.

Kevin Knight. 2007. Capturing practical natural language transformations. Machine Translation, 21(2):121–133.

Eric Lilin. 1978. Une généralisation des transducteurs d'états finis d'arbres: les S-transducteurs. Thèse 3ème cycle, Université de Lille.

Andreas Maletti and Heiko Vogler. 2010. Compositions of top-down tree transducers with ε-rules. In Proc. FSMNLP, volume 6062 of LNAI, pages 69–80. Springer.

Andreas Maletti. 2010. Why synchronous tree substitution grammars? In Proc. HLT-NAACL, pages 876–884. Association for Computational Linguistics.

Andreas Maletti. 2011. An alternative to synchronous tree substitution grammars. Natural Language Engineering, 17(2):221–242.

Alberto Martelli and Ugo Montanari. 1982. An efficient unification algorithm. ACM Transactions on Programming Languages and Systems, 4(2):258–282.

Jonathan May and Kevin Knight. 2006. Tiburon: A weighted tree automata toolkit. In Proc. CIAA, volume 4094 of LNCS, pages 102–113. Springer.

Jonathan May, Kevin Knight, and Heiko Vogler. 2010. Efficient inference through cascades of weighted tree transducers. In Proc. ACL, pages 1058–1066. Association for Computational Linguistics.

Jonathan May. 2010. Weighted Tree Automata and Transducers for Syntactic Natural Language Processing. Ph.D. thesis, University of Southern California, Los Angeles.

John Alan Robinson. 1965. A machine-oriented logic based on the resolution principle. Journal of the ACM, 12(1):23–41.

William C. Rounds. 1970. Mappings and grammars on trees. Mathematical Systems Theory, 4(3):257–287.

Stuart M. Shieber. 2004. Synchronous grammars as tree transducers. In Proc. TAG+7, pages 88–95.

James W. Thatcher. 1970. Generalized² sequential machine maps. Journal of Computer and System Sciences, 4(4):339–367.

Kenji Yamada and Kevin Knight. 2001. A syntax-based statistical translation model. In Proc. ACL, pages 523–530. Association for Computational Linguistics.
Structural and Topical Dimensions in Multi-Task Patent Translation
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 818–828, Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics
(b) a description of the invention;
(c) one or more claims;
(d) any drawings referred to in the description or the claims;
(e) an abstract,

and satisfy the requirements laid down in the Implementing Regulations.

The request for grant contains the patent title; thus a patent document comprises the textual elements of title, description, claim, and abstract.

We investigate whether it is worthwhile to treat different values along the structural and topical dimensions as different tasks that are not completely independent of each other but share some commonalities, yet differ enough to counter a simple pooling of data. For example, we consider different tasks such as patents from different IPC classes, or, along an orthogonal dimension, patent documents of all IPC classes but consisting only of titles or only of claims. We ask whether such tasks should be addressed as separate translation tasks, or whether translation performance can be improved by learning several tasks simultaneously through shared models that are more sophisticated than simple data pooling. Our goal is to learn a patent translation system that performs well across several different tasks, thus benefits from shared information, but is yet able to address the specifics of each task.

One contribution of this paper is a thorough analysis of the differences and similarities of multilingual patent data along the dimensions of textual structure and topic. The second contribution is the experimental investigation of the influence of various such tasks on patent translation performance. Starting from baseline models that are trained on individual tasks or on data pooled from all tasks, we apply mixtures of translation models and multi-task minimum error rate training to multiple patent translation tasks. A by-product of our research is a parallel patent corpus of over 23 million sentence pairs.

2 Related work

Multi-task learning has mostly been discussed under the name of multi-domain adaptation in the area of statistical machine translation (SMT). If we consider domains as tasks, domain adaptation is a special two-task case of multi-task learning. Most previous work has concentrated on adapting unsupervised generative modules such as translation models or language models to new tasks. For example, transductive approaches have used automatic translations of monolingual corpora for self-training modules of the generative SMT pipeline (Ueffing et al., 2007; Schwenk, 2008; Bertoldi and Federico, 2009). Other approaches have extracted parallel data from similar or comparable corpora (Zhao et al., 2004; Snover et al., 2008). Several approaches have been presented that train separate translation and language models on task-specific subsets of the data and combine them in different mixture models (Foster and Kuhn, 2007; Koehn and Schroeder, 2007; Foster et al., 2010). The latter kind of approach is applied in our work to multiple patent tasks.

Multi-task learning efforts in patent translation have so far been restricted to experimental combinations of translation and language models from different sets of IPC sections. For example, Utiyama and Isahara (2007) and Tinsley et al. (2010) investigate translation and language models trained on different sets of patent sections, with larger pools of parallel data improving results. Ceausu et al. (2011) find that language models always and translation models mostly benefit from larger pools of data from different sections. Models trained on pooled patent data are used as baselines in our approach.

The machine learning community has developed several different formalizations of the central idea of trading off optimality of parameter vectors for each task-specific model and closeness of these model parameters to the average parameter vector across models. For example, starting from a separate SVM for each task, Evgeniou and Pontil (2004) present a regularization method that trades off optimization of the task-specific parameter vectors and the distance of each SVM to the average SVM. Equivalent formalizations replace parameter regularization by Bayesian prior distributions on the parameters (Finkel and Manning, 2009) or by augmentation of the feature space with domain-independent features (Daumé, 2007). Besides SVMs, several learning algorithms have been extended to the multi-task scenario in a parameter regularization setting, e.g., perceptron-type algorithms (Dredze et al., 2010) or boosting (Chapelle et al., 2011). Further variants include different formalizations of norms for parameter regularization, e.g., ℓ1,2 regularization (Obozinski et al., 2010) or ℓ1,∞ regularization (Quattoni et al., 2009), where only the features that are most important across all tasks are kept in the model. In our experiments, we apply parameter regularization for multi-task learning to minimum error rate training for patent translation.

3 Extraction of a parallel patent corpus from comparable data

pass alignment. This yields the parallel corpus listed in table 2 with high input-output ratios for claims, and much lower ratios for abstracts and descriptions, showing that claims exhibit a natural parallelism due to their structure, while abstracts and descriptions are considerably less parallel. Removing duplicates and adding parallel titles results in a corpus of over 23 million parallel sentence pairs.
around 300,000 parallel sentences. In order to obtain similar amounts of training data for each task along the topical dimension, we sampled 300,000 sentences from each IPC class for training, and 2,000 sentences for each IPC class for development and testing.

Table 4: Distribution of IPC sections on claims.
A 1,947,542 | B 2,522,995 | C 2,263,375 | D 299,742 | E 353,910 | F 1,012,808 | G 2,066,132 | H 1,754,573

ison to the task-specific MAREC model, although the former has been learned on more than three times the amount of data. An analysis of the output of both systems shows that the Europarl model suffers from two problems: Firstly, there is an obvious out-of-vocabulary (OOV) problem of the Europarl model compared to the MAREC model. Secondly, the Europarl model suffers from incorrect word sense disambiguation, as illustrated by the samples in table 6.

Table 6: Output of Europarl model on MAREC data.
source:    steuerbar    | leitet
Europarl:  taxable      | is in charge of
MAREC:     controllable | guiding
reference: controllable | guides
is best translated with a model trained on data from the same section. Note that best section scores vary considerably, ranging from 0.5719 on C to 0.4714 on H, indicating that higher-scoring classes, such as C and A, are more homogeneous and therefore easier to translate. C, the Chemistry section, presumably benefits from the fact that the data contain chemical formulae, which are language-independent and do not have to be translated. Again, for determining the relationship between the classes, we examine the best runner-up on each section, considering the BLEU score, although asymmetrical, as a kind of measure of similarity between classes. We can establish symmetric relationships between sections A and C, B and F, as well as G and H, which means that the models are mutual runners-up on each other's test section.

The similarities of translation tasks established in the previous section can be confirmed by information-theoretic similarity measures that perform a pairwise comparison of the vocabulary probability distribution of each task-specific corpus. This distribution is calculated on the basis of the 500 most frequent words in the union of two corpora, normalized by vocabulary size. As metric we use the A-distance measure of Kifer et al. (2004). If A is the set of events on which the word distributions of two corpora are defined, then the A-distance is the supremum of the difference of probabilities assigned to the same event. Low distance means higher similarity.

Table 9 shows the A-distance of corpora specific to IPC classes. The most similar section or sections apart from the section itself on the diagonal is indicated in bold face. The pairwise similarity of A and C, B and F, G and H obtained by BLEU score is confirmed. Furthermore, a close similarity between E and F is indicated. G and H (electricity and physics, respectively) are very similar to each other but not close to any other section apart from B.

4.2 Task pooling and mixture

One straightforward technique to exploit commonalities between tasks is pooling data from separate tasks into a single training set. Instead of a trivial enlargement of training data by pooling, we train the pooled models on the same amount of sentences as the individual models. For instance, the pooled model for the pairing of IPC sections B and C is trained on a data set composed of 150,000 sentences from each IPC section. The pooled model for pairing data from abstracts and claims is trained on data composed of 250,000 sentences from each text section.

Another approach to exploit commonalities between tasks is to train separate language and translation models⁹ on the sentences from each task and combine the models in the global log-linear model of the SMT framework, following Foster and Kuhn (2007) and Koehn and Schroeder (2007). Model combination is accomplished by adding additional language model and translation model features to the log-linear model and tuning the additional meta-parameters by standard minimum error rate training (Bertoldi et al., 2009).

We try out mixture and pooling for all pairwise combinations of the three structural sections, for which we have high-quality data, i.e. abstract, claims and title. Due to the large number of possible combinations of IPC sections, we limit the experiments to pairs of similar sections, based on the A-distance measure.

Table 10 lists the results for two combinations of data from different sections: a log-linear mixture of separately trained models and simple pooling, i.e. concatenation, of the training data. Overall, the mixture models perform slightly better than the pooled models on the text sections, although the difference is significant only in two cases. This is indicated by highlighting best results in bold face (with more than one result highlighted if the difference is not significant).¹⁰

We investigate the same mixture and pooling techniques on the IPC sections we considered pairwise similar (see table 11). Somewhat contradicting the former results, the mixture models perform significantly worse than the pooled model on three sections. This might be the result of inadequate tuning, since most of the time the MERT algorithm did not converge after the maximum number of iterations, due to the larger number of features when using several models.

⁹ Following Duh et al. (2010), we use the alignment model trained on the pooled data set in the phrase extraction phase of the separate models. Similarly, we use a globally trained lexical reordering model.

¹⁰ For assessing significance, we apply the approximate randomization method described in Riezler and Maxwell (2005). We consider pairwise differing results scoring a p-value smaller than 0.05 as significant; the assessment is repeated three times and the average value is taken.
BLEU scores of section-specific models across IPC test sections (rows: training section; columns: test section):

train  A      B      C      D      E      F      G      H
A      0.5349 0.4475 0.5472 0.4746 0.4438 0.4523 0.4318 0.4109
B      0.4846 0.4736 0.5161 0.4847 0.4578 0.4734 0.4396 0.4248
C      0.5047 0.4257 0.5719 0.462  0.4134 0.4249 0.409  0.3845
D      0.47   0.4387 0.5106 0.5167 0.4344 0.4435 0.407  0.3917
E      0.4486 0.4458 0.4681 0.4531 0.4771 0.4591 0.4073 0.4028
F      0.4595 0.4588 0.4761 0.4655 0.4517 0.4909 0.422  0.4188
G      0.4935 0.4489 0.5239 0.4629 0.4414 0.4565 0.4748 0.4532
H      0.4628 0.4484 0.4914 0.4621 0.4421 0.4616 0.4588 0.4714

Table 9: A-distance between IPC-section-specific corpora:

       A      B      C      D      E      F      G      H
A      0      0.1303 0.1317 0.1311 0.188  0.186  0.164  0.1906
B      0.1302 0      0.2388 0.1242 0.0974 0.0875 0.1417 0.1514
C      0.1317 0.2388 0      0.1992 0.311  0.3068 0.2506 0.2825
D      0.1311 0.1242 0.1992 0      0.1811 0.1808 0.1876 0.201
E      0.188  0.0974 0.311  0.1811 0      0.0921 0.2058 0.2025
F      0.186  0.0875 0.3068 0.1808 0.0921 0      0.1824 0.1743
G      0.164  0.1417 0.2506 0.1876 0.2056 0.1824 0      0.064
H      0.1906 0.1514 0.2825 0.201  0.2025 0.1743 0.064  0

[Table 10: Mixture and pooling on text sections.]

[Table 11: Mixture and pooling on IPC sections.]
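The A-distance computation described above can be sketched as follows. This is an illustrative approximation: it assumes whitespace tokenization and uses plain relative token frequencies over the 500 most frequent words of the pooled corpus, whereas the paper additionally normalizes by vocabulary size.

```python
from collections import Counter

def a_distance(corpus_a, corpus_b, n=500):
    """Supremum, over the n most frequent words in the union of both
    corpora, of the absolute difference of the words' relative
    frequencies in the two corpora (low distance = high similarity)."""
    tok_a = [w for line in corpus_a for w in line.split()]
    tok_b = [w for line in corpus_b for w in line.split()]
    freq_a, freq_b = Counter(tok_a), Counter(tok_b)
    # the event set: the n most frequent words of the pooled corpus
    events = [w for w, _ in (freq_a + freq_b).most_common(n)]
    p_a = {w: freq_a[w] / len(tok_a) for w in events}
    p_b = {w: freq_b[w] / len(tok_b) for w in events}
    return max(abs(p_a[w] - p_b[w]) for w in events)
```

Identical corpora yield distance 0, and corpora with disjoint vocabularies yield the maximum distance, matching the intended use as a task-similarity measure.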
A comparison of the results for pooling and mixture with the respective results for individual models (tables 7 and 8) shows that replacing data from the same task by data from related tasks decreases translation performance in almost all cases. The exception is the title model, which benefits from pooling and mixing with both abstracts and claims due to their richer data structure.

4.3 Multi-task minimum error rate training

In contrast to task pooling and task mixtures, the specific setting addressed by multi-task minimum error rate training is one in which the generative SMT pipeline is not adaptable. Such situations arise if there are not enough data to train translation models or language models on the new tasks. However, we assume that there are enough parallel data available to perform meta-parameter tuning by minimum error rate training (MERT) (Och, 2003; Bertoldi et al., 2009) for each task.

A generic algorithm for multi-task learning can be motivated as follows: multi-task learning aims to take advantage of commonalities shared among tasks by learning several independent but related tasks together. Information is shared between tasks through a joint representation and introduces an inductive bias. Evgeniou and Pontil (2004) propose a regularization method that balances task-specific parameter vectors and their distance to the average. The learning objective is to minimize task-specific loss functions l_d across all tasks d with weight vectors w_d, while keeping each parameter vector close to the average w_avg = (1/D) Σ_{d=1}^D w_d. This is enforced by minimizing the norm (here the ℓ1-norm) of the difference of each task-specific weight vector to the average weight vector:

    min_{w_1,...,w_D}  Σ_{d=1}^D l_d(w_d) + λ Σ_{d=1}^D ||w_d − w_avg||_1    (1)

The MMERT algorithm is given in figure 1. The algorithm starts with initial weights w^(0). At each iteration step, the average of the parameter vectors from the previous iteration is computed. For each task d ∈ D, one iteration of standard MERT is called, continuing from weight vector w_d^(t−1) and minimizing translation loss function l_d on the data from task d. The individually tuned weight vectors returned by MERT are then moved towards the previously calculated average by adding or subtracting a penalty term λ for each weight component w_d^(t)[k]. If a weight moves beyond the average, it is clipped to the average value. The process is iterated until a stopping criterion is met, e.g. a threshold on the maximum change in the average weight vector. The parameter λ controls the influence of the regularization: a larger λ pulls the weights closer to the average, a smaller λ leaves more freedom to the individual tasks.

    MMERT(w^(0), D, {l_d}_{d=1}^D):
    for t = 1, ..., T do
        w_avg^(t) = (1/D) Σ_{d=1}^D w_d^(t−1)
        for d = 1, ..., D parallel do
            w_d^(t) = MERT(w_d^(t−1), l_d)
            for k = 1, ..., K do
                if w_d^(t)[k] − w_avg^(t)[k] > 0 then
                    w_d^(t)[k] = max(w_avg^(t)[k], w_d^(t)[k] − λ)
                else if w_d^(t)[k] − w_avg^(t)[k] < 0 then
                    w_d^(t)[k] = min(w_avg^(t)[k], w_d^(t)[k] + λ)
                end if
            end for
        end for
    end for
    return w_1^(T), ..., w_D^(T), w_avg^(T)

    Figure 1: Multi-task MERT.

                       tuning
    test      individual  pooled   average  MMERT    MMERT-average
    abstract  0.3721      0.3620   0.3657+  0.3719+  0.3685+
    claim     0.4711      0.4681   0.4749+  0.4750+  0.4734+
    title     0.3228      0.3152   0.3326+  0.3268+  0.3325+

    Table 12: Multi-task learning on text sections.

                  tuning
    test  individual  pooled   average  MMERT    MMERT-average
    A     0.5187      0.5199   0.5213+  0.5195   0.5196
    B     0.4877      0.4885   0.4908+  0.4911+  0.4921+
    C     0.5214      0.5175   0.5199+  0.5218+  0.5162+
    D     0.4724      0.4730   0.4733   0.4736   0.4734
    E     0.4666      0.4661   0.4679+  0.4669+  0.4685+
    F     0.4794      0.4801   0.4811   0.4821+  0.4830+
    G     0.4596      0.4576   0.4607+  0.4606+  0.4610+
    H     0.4573      0.4560   0.4578   0.4581+  0.4581+

    Table 13: Multi-task learning on IPC sections.
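As a minimal sketch (not the authors' implementation), the multi-task MERT procedure, per-task MERT steps followed by a clipped ℓ1-style pull towards the task average, can be written as follows; `mert_step`, standing in for one iteration of a real MERT optimizer, is an assumption.

```python
def mmert(w0, tasks, mert_step, lam=0.001, iterations=10):
    """Multi-task MERT: each task runs one MERT iteration, then every
    weight is moved towards the across-task average by lam, clipped so
    that it never overshoots the average."""
    D, K = len(tasks), len(w0)
    w = [list(w0) for _ in range(D)]  # one weight vector per task
    for _ in range(iterations):
        # average of the previous iteration's weight vectors
        w_avg = [sum(w[d][k] for d in range(D)) / D for k in range(K)]
        for d in range(D):
            w[d] = mert_step(w[d], tasks[d])  # one MERT iteration on task d
            for k in range(K):
                if w[d][k] > w_avg[k]:
                    w[d][k] = max(w_avg[k], w[d][k] - lam)
                elif w[d][k] < w_avg[k]:
                    w[d][k] = min(w_avg[k], w[d][k] + lam)
    w_avg = [sum(w[d][k] for d in range(D)) / D for k in range(K)]
    return w, w_avg  # per-task vectors and the final average
```

The returned final average corresponds to the last average vector used below as MMERT-average; a real implementation would also test a stopping criterion on the change of `w_avg`.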
The weight updates and the clipping strategy can be motivated in a framework of gradient descent optimization under ℓ1-regularization (Tsuruoka et al., 2009). Assuming MERT as algorithmic minimizer^11 of the loss function l_d in equation 1, the weight update towards the average follows from the subgradient of the ℓ1 regularizer. Since w_avg^(t) is taken as average over weights w_d^(t−1) from the step before, the term w_avg^(t) is constant with respect to w_d^(t), leading to the following subgradient (where sgn(x) = 1 if x > 0, sgn(x) = −1 if x < 0, and sgn(x) = 0 if x = 0):

    ∂/∂w_r^(t)[k]  Σ_{d=1}^D || w_d^(t) − (1/D) Σ_{s=1}^D w_s^(t−1) ||_1
        = sgn( w_r^(t)[k] − (1/D) Σ_{s=1}^D w_s^(t−1)[k] ).

Gradient descent minimization tells us to move in the opposite direction of the subgradient, thus motivating the addition or subtraction of the regularization penalty. Clipping is motivated by the desire to avoid oscillating parameter weights and in order to enforce parameter sharing.

Experimental results for multi-task MERT (MMERT) are reported for both dimensions of patent tasks. For the IPC sections we trained a pooled model on 1,000,000 sentences sampled from abstracts and claims from all sections. We did not balance the sections but kept their original distribution, reflecting a real-life task where the distribution of sections is unknown. We then extend this experiment to the structural dimension. Since we do not have an intuitive notion of a natural distribution for the text sections, we train a balanced pooled model on a corpus composed of 170,000 sentences each from abstracts, claims and titles, i.e. 510,000 sentences in total. For both dimensions, for each task, we sampled 2,000 parallel sentences for development, development-testing, and testing from patents that were published in different years than the training data.

We compare the multi-task experiments with two baselines. The first baseline is individual task learning, corresponding to standard separate MERT tuning on each section (individual). This results in three separately learned weight vectors for each task, where no information has been shared between the tasks. The second baseline simulates the setting where the sections are not differentiated at all. We tune the model on a pooled development set of 2,000 sentences that combines the same amount of data from all sections (pooled). This yields a single joint weight vector for all tasks optimized to perform well across all sections. Furthermore, we compare multi-task MERT tuning with two parameter averaging methods. The first method computes the arithmetic mean of the weight vectors returned by the individual baseline for each weight component, yielding a joint average vector for all tasks (average). The second method takes the last average vector computed during multi-task MERT tuning (MMERT-average).^12

Tables 12 and 13 give the results for multi-task learning on text and IPC sections. The latter results have been presented earlier in Simianer et al. (2011). The former table extends the technique of multi-task MERT to the structural dimension of patent SMT tasks. In all experiments, the parameter λ was adjusted to 0.001 after evaluating different settings on a development set. The best result on each section is indicated in bold face; * indicates significance with respect to the individual baseline, + the same for the pooled baseline. We observe statistically significant improvements of 0.5 to 1% BLEU over the individual baseline for claims and titles; for abstracts, the multi-task variant yields the same result as the baseline, while the averaging methods perform worse. Multi-task MERT yields the best result for claims; on titles, the simple average and the last MMERT average dominate. Pooled tuning always performs significantly worse than any other method, confirming that it is beneficial to differentiate between the text sections.

Similarly for IPC sections, small but statistically significant improvements over the individual and pooled baselines are achieved by multi-task tuning and averaging over IPC sections, excepting C and D. However, an advantage of multi-task tuning over averaging is hard to establish.

Note that the averaging techniques implicitly benefit from a larger tuning set. In order to ascertain that the improvements by averaging are not

^11 MERT as presented in Och (2003) is not a gradient-based optimization technique, thus MMERT is strictly speaking only inspired by gradient descent optimization.
^12 The aspect of averaging found in all of our multi-task learning techniques effectively controls for optimizer instability as mentioned in Clark et al. (2011).
simply due to increasing the size of the tuning set, we ran a control experiment where we tuned the model on a pooled development set of 3 × 2,000 sentences for text sections and on a development set of 8 × 2,000 sentences for IPC sections. The results given in table 14 show that tuning on a pooled set of 6,000 text sections yields only minimal differences to tuning on 2,000 sentence pairs, such that the BLEU scores for the new pooled models are still significantly lower than the best results in table 12 (indicated by <). However, increasing the tuning set to 16,000 sentence pairs for IPC sections makes the pooled baseline perform as well as the best results in table 13, except for two cases (indicated by <) (see table 15). This is due to the smaller differences between best and worst results for tuning on IPC sections compared to tuning on text sections, indicating that IPC sections are less well suited for multi-task tuning than the textual domains.

    test      pooled-6k  significance
    abstract  0.3628     <
    claim     0.4696     <
    title     0.3174     <

    Table 14: Multi-task tuning on 6,000 sentences pooled from text sections. < denotes a statistically significant difference to the best result.

    test  pooled-16k  significance
    A     0.5177      <
    B     0.4920
    C     0.5133      <
    D     0.4737
    E     0.4685
    F     0.4832
    G     0.4608
    H     0.4579

    Table 15: Multi-task tuning on 16,000 sentences pooled from IPC sections. < denotes a statistically significant difference to the best result.

5 Conclusion

The most straightforward approach to improve machine translation performance on patents is to enlarge the training set to include all available data. This question has been investigated by Tinsley et al. (2010) and Utiyama and Isahara (2007). A caveat in this situation is that data need to be from the general patent domain, as shown by the inferior performance of a large Europarl-trained model compared to a small patent-trained model.

The goal of this paper is to analyze patent data along the topical dimension of IPC classes and along the structural dimension of textual sections. Instead of trying to beat a pooling baseline that simply increases the data size, our research goal is to investigate whether different subtasks along these dimensions share commonalities that can fruitfully be exploited by multi-task learning in machine translation. We thus aim to investigate the benefits of multi-task learning in realistic situations where a simple enlargement of training data is not possible.

Starting from baseline models that are trained on individual tasks or on data pooled from all tasks, we apply mixtures of translation models and multi-task MERT tuning to multiple patent translation tasks. We find small, but statistically significant improvements for multi-task MERT tuning and parameter averaging techniques. Improvements are more pronounced for multi-task learning on textual domains than on IPC domains. This might indicate that the IPC sections are less well delimited than the structural domains. Furthermore, this is owing to the limited expressiveness of a standard linear model including 14-20 features in tuning. The available features are very coarse and more likely to capture structural differences, such as sentence length, than the lexical differences that differentiate the semantic domains. We expect to see larger gains due to multi-task learning for discriminatively trained SMT models that involve very large numbers of features, especially when multi-task learning is done in a framework that combines parameter regularization with feature selection (Obozinski et al., 2010). In future work, we will explore a combination of large-scale discriminative training (Liang et al., 2006) with multi-task learning for SMT.

Acknowledgments

This work was supported in part by DFG grant "Cross-language Learning-to-Rank for Patent Retrieval".
References

Nicola Bertoldi and Marcello Federico. 2009. Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the 4th EACL Workshop on Statistical Machine Translation, Athens, Greece.

Nicola Bertoldi, Barry Haddow, and Jean-Baptiste Fouet. 2009. Improved minimum error rate training in Moses. The Prague Bulletin of Mathematical Linguistics, 91:7-16.

Fabienne Braune and Alexander Fraser. 2010. Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING'10), Beijing, China.

Alexandru Ceausu, John Tinsley, Jian Zhang, and Andy Way. 2011. Experiments on domain adaptation for patent machine translation in the PLuTO project. In Proceedings of the 15th Conference of the European Association for Machine Translation (EAMT 2011), Leuven, Belgium.

Olivier Chapelle, Pannagadatta Shivaswamy, Srinivas Vadrevu, Kilian Weinberger, Ya Zhang, and Belle Tseng. 2011. Boosted multi-task learning. Machine Learning.

Jonathan Clark, Chris Dyer, Alon Lavie, and Noah Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL'11), Portland, OR.

Hal Daume. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL'07), Prague, Czech Republic.

Mark Dredze, Alex Kulesza, and Koby Crammer. 2010. Multi-domain learning by confidence-weighted parameter combination. Machine Learning, 79:123-149.

Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Analysis of translation model adaptation in statistical machine translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT'10), Paris, France.

Theodoros Evgeniou and Massimiliano Pontil. 2004. Regularized multi-task learning. In Proceedings of the 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'04), Seattle, WA.

Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In Proceedings of Interspeech, Brisbane, Australia.

Jenny Rose Finkel and Christopher D. Manning. 2009. Hierarchical Bayesian domain adaptation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT'09), Boulder, CO.

George Foster and Roland Kuhn. 2007. Mixture-model adaptation for SMT. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic.

George Foster, Pierre Isabelle, and Roland Kuhn. 2010. Translating structured documents. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas (AMTA 2010), Denver, CO.

Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation (WMT'11), Edinburgh, UK.

Daniel Kifer, Shai Ben-David, and Johannes Gehrke. 2004. Detecting change in data streams. In Proceedings of the 30th International Conference on Very Large Data Bases, Toronto, Ontario, Canada.

Philipp Koehn and Josh Schroeder. 2007. Experiments in domain adaptation for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the ACL 2007 Demo and Poster Sessions, Prague, Czech Republic.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X, Phuket, Thailand.

Percy Liang, Alexandre Bouchard-Cote, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proceedings of the Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING-ACL'06), Sydney, Australia.

Guillaume Obozinski, Ben Taskar, and Michael I. Jordan. 2010. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20:231-252.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the Human Language Technology Conference and the 3rd Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL'03), Edmonton, Canada.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. Bleu: a method for automatic evaluation of machine translation. Technical Report RC22176 (W0190-022), IBM Research Division, Yorktown Heights, NY.
Ariadna Quattoni, Xavier Carreras, Michael Collins, and Trevor Darrell. 2009. An efficient projection for ℓ1,∞ regularization. In Proceedings of the 26th International Conference on Machine Learning (ICML'09), Montreal, Canada.
Stefan Riezler and John Maxwell. 2005. On some pit-
falls in automatic evaluation and significance testing
for MT. In Proceedings of the ACL-05 Workshop on
Intrinsic and Extrinsic Evaluation Measures for MT
and/or Summarization, Ann Arbor, MI.
Holger Schwenk. 2008. Investigations on large-
scale lightly-supervised training for statistical ma-
chine translation. In Proceedings of the Interna-
tional Workshop on Spoken Language Translation
(IWSLT08), Hawaii.
Patrick Simianer, Katharina Wäschle, and Stefan Riezler. 2011. Multi-task minimum error rate training for SMT. The Prague Bulletin of Mathematical Linguistics, 96:99-108.
Matthew Snover, Bonnie Dorr, and Richard Schwartz.
2008. Language and translation model adaptation
using comparable corpora. In Proceedings of the
Conference on Empirical Methods in Natural Lan-
guage Processing (EMNLP08), Honolulu, Hawaii.
John Tinsley, Andy Way, and Paraic Sheridan. 2010.
PLuTO: MT for online patent translation. In Pro-
ceedings of the 9th Conference of the Association
for Machine Translation in the Americas (AMTA
2010), Denver, CO.
Yoshimasa Tsuruoka, Jun'ichi Tsujii, and Sophia Ananiadou. 2009. Stochastic gradient descent training for ℓ1-regularized log-linear models with cumulative penalty. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP'09), Singapore.
Nicola Ueffing, Gholamreza Haffari, and Anoop
Sarkar. 2007. Transductive learning for statistical
machine translation. In Proceedings of the 45th An-
nual Meeting of the Association of Computational
Linguistics (ACL07), Prague, Czech Republic.
Masao Utiyama and Hitoshi Isahara. 2007. A
Japanese-English patent parallel corpus. In Pro-
ceedings of MT Summit XI, Copenhagen, Denmark.
Bing Zhao, Matthias Eck, and Stephan Vogel. 2004.
Language model adaptation for statistical machine
translation with structured query models. In Pro-
ceedings of the 20th International Conference on
Computational Linguistics (COLING04), Geneva,
Switzerland.
Not as Awful as it Seems: Explaining German Case through
Computational Experiments in Fluid Construction Grammar
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 829-839, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
Another problem for the syncretism-by-accident hypothesis is the fact that the collapsing of case forms is not randomly distributed over the whole paradigm, as would be expected. Hawkins (2004, p. 78) observes that instead there is a systematic tendency for lower cells in the paradigm (e.g. genitive; Table 1) to collapse before cells in higher positions (e.g. nominative) do so.

2.2 Formal Linguistics

Many hidden effects of verbal linguistic theories can be uncovered through explicit formalizations. Unfortunately, formal linguists also typically distinguish between systematic and non-systematic syncretism when analyzing German case. For instance, in his review of a number of studies on German (a.o. Bierwisch, 1967; Blevins, 1995; Wiese, 1996; Wunderlich, 1997), Müller (2002) concludes that none of these approaches is able to rule out accidental syncretism.

There is however one major stone that has been left unturned by formal linguists: processing. Most formal theories, such as HPSG (Ginzburg and Sag, 2000), assume a strict division between competence and performance and therefore represent linguistic knowledge in a purely declarative, process-independent way (Sag and Wasow, 2011). While such an approach may be desirable from a mathematical point of view, it puts the burden of efficient processing on the shoulders of computational linguists, who have to develop more intelligent interpreters.

One example of the gap between description and computational implementation is disjunctive feature representation, which became popular in feature-based grammar formalisms in the 1980s (Karttunen, 1984). Disjunctions allow an elegant notation for multiple feature values, as illustrated in example 1 for the German definite article die, which is either assigned nominative or accusative case, and which is either feminine-singular or plural. The feature structure (adopted from Karttunen, 1984, p. 30) represents disjunctions by enclosing the alternatives in curly brackets ({ }).

(1) die:
    AGREEMENT { [GENDER f, NUM sg], [NUM pl] }
    CASE      { nom, acc }

However, it is a well-established fact that disjunctions are computationally expensive, which is illustrated in the top of Figure 1. This Figure shows the search tree of a small grammar when parsing the utterance Die Kinder gaben der Lehrerin die Zeichnung ('the children gave the drawing to the (female) teacher'), which is unambiguous to German speakers. As can be seen in the Figure, the search tree has to explore several branches before arriving at a valid solution. Most of the splits are caused by disjunctions. For example, when a determiner-noun construction specifies that the case features of the definite article die (nominative or accusative) and the noun Kinder ('children'; nominative, accusative or genitive) have to unify, the search tree splits into two hypotheses (a nominative and an accusative reading) even though for native speakers of German, the syntactic context unambiguously points to a nominative reading (because it is the only noun phrase that agrees with the main verb).

It should be no surprise, then, that a lot of work has focused on processing disjunctions more efficiently (e.g. Carter, 1990; Ramsay, 1990). As observed by Flickinger (2000), however, most of these studies implicitly assume that the grammar representation has to remain unchanged. He then demonstrates through computational experiments how a different representation can directly impact efficiency, and argues that revisions of the grammar for efficiency should be discussed more thoroughly in the literature.

The impact of representation on processing is illustrated at the bottom of Figure 1, which shows the performance of a grammar that uses the same processing technique for handling the same utterance, but a different representation than the disjunctive grammar. As can be seen, the alternative grammar (whose technical details are disclosed further below) is able to parse the German definite articles without tears, and the resulting search tree arguably better reflects the actual processing performed by native speakers of German.

2.3 Alternative Hypothesis

The effect of processing-friendly representations on search suggests that answers for the unsolved problems concerning case syncretism have to be sought in performance. This paper therefore rejects the processing-independent approach and explores the alternative hypothesis, following
[Figure 1: Search trees for parsing "Die Kinder gaben der Lehrerin die Zeichnung.": (a) search with disjunctive feature representation, which branches repeatedly before finding a solution; (b) search with feature matrices, which applies the constructions without branching.]
ing data. All experiments reported in this paper have therefore been implemented in Fluid Construction Grammar (FCG; Steels, 2011, 2012a), a unification-based grammar formalism that comes equipped with an interactive web interface and monitoring tools (Loetzsch, 2012). A second advantage of FCG is that it features strong bidirectionality: the FCG-interpreter can achieve both parsing and production using the same linguistic inventory. Other feature structure platforms, such as the LKB system (Copestake, 2002), require a separate parser and generator for formalizing bidirectional grammars, which makes them less suited for substantiating the claims of this paper.

3.1 Distinctive Feature Matrix

German case has become the litmus test for demonstrating how well a feature-based grammar formalism copes with multifunctionality, especially since Ingria (1990) provocatively stated that unification is not the best technique for handling it. People have gone to great lengths to counter Ingria's claim, especially within the HPSG framework (e.g. Müller, 1999; Daniels, 2001; Sag, 2003), and various formalizations have been offered for German case (Heinz and Matiasek, 1994; Müller, 2001; Crysmann, 2005). However, these proposals either do not succeed in avoiding inefficient disjunctions or they require a complex double type hierarchy (Crysmann, 2005).

The experiments in this paper use a more straightforward solution, called a distinctive feature matrix, which is based on an idea that was first explored by Ingria (1990) and of which a variation has recently also been proposed for Lexical Functional Grammar (Dalrymple et al., 2009). Instead of treating case as a single-valued feature, it can be represented as an array of features, as shown for the definite article die (ignoring the genitive case for the time being):

(2) die:     CASE [ nom ?nom, acc ?acc, dat − ]

The case feature includes a paradigm of three cases (nom, acc and dat), whose values can either be + or −, or left unspecified through a variable (indicated by a question mark). The two variables ?nom and ?acc indicate that die can potentially be assigned nominative or accusative case; the − value for dative means that die cannot be assigned dative case. We can do the same for Kinder ('children'), which can be nominative or accusative, but not dative:

(3) Kinder:  CASE [ nom ?nom, acc ?acc, dat − ]

As demonstrated in Figure 1, disjunctive feature representation would cause a split in the search tree when unifying die and Kinder. Using a feature matrix, however, the choice between a nominative and accusative reading can simply be postponed until enough information from the rest of the utterance is available. Unifying die and Kinder yields the following feature structure:

(4) die Kinder:  CASE [ nom ?nom, acc ?acc, dat − ]

3.2 A Three-Dimensional Matrix

The German case paradigm is obviously more complex than the examples shown so far. Let's consider Table 1 again, but this time we replace every cell in the table by a variable. This leads to the following feature matrix for the German definite articles:

    Case    SG-M    SG-F    SG-N    PL
    ?NOM    ?n-s-m  ?n-s-f  ?n-s-n  ?n-pl
    ?ACC    ?a-s-m  ?a-s-f  ?a-s-n  ?a-pl
    ?DAT    ?d-s-m  ?d-s-f  ?d-s-n  ?d-pl
    ?GEN    ?g-s-m  ?g-s-f  ?g-s-n  ?g-pl

    Table 2: A distinctive feature matrix for German case.

Each cell in this matrix represents a specific feature bundle that collects the features case, number, and gender. For example, the variable ?n-s-m stands for nominative singular masculine. Note that also the cases themselves have their own variable (?nom, ?acc, ?dat and ?gen). This allows us to single out a specific dimension of the matrix for constructions that only care about case distinctions, but abstract away from gender or number. Each linguistic item fills in as much information as possible in this case matrix. For example, Table 3 shows how the definite article die underspecifies its potential values and rules out all other options through −.
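The behaviour of distinctive feature matrices under unification can be illustrated with a small sketch. This is a toy model, not the FCG implementation; the encoding of cells as '+', '−'-style constants, or '?'-prefixed variables is a simplified assumption.

```python
def unify_value(a, b, bindings):
    """Unify two cell values ('+', '-', or a variable like '?nom').
    Variables unify with anything and record a binding; constants must
    match.  Returns the unified value, or None on a clash."""
    a = bindings.get(a, a)
    b = bindings.get(b, b)
    if a == b:
        return a
    if a.startswith('?'):
        bindings[a] = b
        return b
    if b.startswith('?'):
        bindings[b] = a
        return a
    return None  # e.g. '+' against '-'

def unify_case(matrix_a, matrix_b):
    """Unify two case matrices cell by cell under shared bindings, so
    that a '-' in one item rules out the same reading in the other."""
    bindings, result = {}, {}
    for cell in matrix_a:
        value = unify_value(matrix_a[cell], matrix_b[cell], bindings)
        if value is None:
            return None
        result[cell] = value
    return result

# die and Kinder both leave nominative and accusative open, dative ruled out:
die = {'nom': '?nom1', 'acc': '?acc1', 'dat': '-'}
kinder = {'nom': '?nom2', 'acc': '?acc2', 'dat': '-'}
```

Unifying `die` and `kinder` succeeds without branching the search: the nominative and accusative cells remain variable (the decision is postponed), while an actual clash such as '+' against '-' fails and would prune that branch.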
    Case    SG-M    SG-F    SG-N    PL
    ?NOM    −       ?n-s-f  −       ?n-pl
    ?ACC    −       ?a-s-f  −       ?a-pl

    Table 3: The feature matrix of die.

The feature matrix of Kinder ('children'), which underspecifies for nominative, accusative and genitive, is shown in Table 4. Notice, however, that the same variable names are used for both the column that singles out the case dimension as for the column of the plural feature bundles.

    Case    SG-M    SG-F    SG-N    PL
    ?n-pl   −       −       −       ?n-pl
    ?a-pl   −       −       −       ?a-pl
    ?g-pl   −       −       −       ?g-pl

    Table 4: The feature matrix of Kinder ('children').

Unification of die and Kinder can exploit these variable equalities for ruling out a singular value of the definite article. Likewise, the matrix of die rules out the genitive reading of Kinder, as illustrated in Table 5.

    Case    SG-M    SG-F    SG-N    PL
    ?n-pl   −       −       −       ?n-pl
    ?a-pl   −       −       −       ?a-pl

    Table 5: The feature matrix of die Kinder.

Argument structure constructions (Goldberg, 2006), such as the ditransitive, can then later assign either nominative or accusative case. The main advantage of feature matrices is that linguistic search only has to commit to specific feature values once sufficient information is available, so the search tree only splits when there is an actual ambiguity. Moreover, they can be handled using standard unification. Interested readers can consult van Trijp (2011) for a thorough description of the approach, as well as a discussion on how the FCG implementation differs from Ingria (1990) and Dalrymple et al. (2009).

4 Experiments

This section describes the experimental set-up and discusses the experimental results.

4.1 Three Paradigms

The experiments compare three different variants of the German definite article paradigm.

Standard German. The Standard German paradigm has been illustrated in Table 1 and its operationalization has been shown in section 3.2. The paradigm has been inherited without significant changes from Middle High German (1050-1350; Walshe, 1974) and features six different forms.

Old High German. The Old High German paradigm is the direct predecessor of the current paradigm of definite articles. It contained at least twelve distinct forms (depending on which variation is taken) that included gender distinctions in plural (Wright, 1906, p. 67). It also included one definite article that marked the now extinct instrumental case, which is ignored in this paper. The variant of the Old High German paradigm that has been implemented in the experiments is summarized in Table 6.

    Case    Singular              Plural
            M      F      N       M      F      N
    NOM     der    diu    daz     die    deo    diu
    ACC     den    die    daz     die    deo    diu
    DAT     demu   deru   demu    dem    dem    dem
    GEN     des    dera   des     dero   dero   dero

    Table 6: The Old High German definite article system.

Texas German. The third variant is an American-German dialect called Texas German (Boas, 2009a,b), which evolved a two-way case distinction between nominative and oblique. This type of case system, in which the accusative and dative case have collapsed, is also a common evolution in the Low German dialects (Shrier, 1965). The implemented paradigm of Texas German is shown in Table 7.
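The ruling-out behaviour of the feature matrices in Tables 4 and 5 can be sketched in a few lines. This is a hedged illustration, not the actual Fluid Construction Grammar implementation: each form's agreement potential is modelled as a plain set of readings, and unification is reduced to set intersection.

```python
# Illustrative sketch (not the actual FCG feature matrices): each form's
# case potential is a set of (case, number, gender) readings; combining
# two agreeing forms keeps only the readings both allow.
# None marks gender as not distinguished in the plural.

DIE = {("nom", "sg", "f"), ("acc", "sg", "f"),
       ("nom", "pl", None), ("acc", "pl", None)}   # die: nom/acc, sg-f or pl
KINDER = {("nom", "pl", None), ("acc", "pl", None),
          ("gen", "pl", None)}                     # Kinder: nom/acc/gen plural

def unify(a, b):
    """Intersect the reading sets of two agreeing words."""
    return a & b

print(sorted(unify(DIE, KINDER)))
# die Kinder: only the nominative and accusative plural readings survive;
# the singular readings of 'die' and the genitive of 'Kinder' are ruled out.
```

As in the paper, commitment to a specific case value is deferred: the remaining readings can later be narrowed by an argument structure construction assigning nominative or accusative.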
Case     SG-M   SG-F   SG-N   PL
NOM      der    die    das    die
ACC/DAT  den    die    den    die

Table 7: The Texas German definite article system.

4.2 Production and Parsing Tasks

Each grammar is tested as to how efficiently it can produce and parse utterances in terms of cognitive effort and search (see section 4.3). There are three basic types of utterances:

1. Ditransitive: NOM Verb DAT ACC
2. Transitive (a): NOM Verb ACC
3. Transitive (b): NOM Verb DAT

The argument roles are filled by noun phrases whose head nouns always have a distinct form for singular and plural (e.g. Mann vs. Männer; man vs. men), but that are unmarked for case. The combination of arguments is always unique along the dimensions of number and gender, which yields 216 unique utterance types for the ditransitive, as follows:

(5) NOM.S.M V DAT.S.M ACC.S.M
    NOM.S.M V DAT.S.F ACC.S.M
    NOM.S.M V DAT.S.N ACC.S.M
    NOM.S.M V DAT.PL.M ACC.S.M
    etc.

In transitive utterances, there is an additional distinction based on animacy for noun phrases in the Object position of the utterance, which yields 72 types in the NOM-ACC configuration and 72 in the NOM-DAT configuration. Together, there are 360 unique utterance types. As can be gleaned from the utterance types, the genitive case is not considered by the experiments, as the genitive is not part of basic German argument structures and it has almost disappeared in most dialects of German (Shrier, 1965).

In production, the grammar is presented with a meaning that needs to be verbalized into an utterance. In parsing, the produced utterance has to be analyzed back into a meaning. Every utterance is processed using a full search, that is, all branches and solutions are calculated.

The experiments exploit types because there are three different language systems, hence it is impossible to use a single, real corpus and its token frequencies. It would also be unwarranted to use different corpora because corpus-specific biases would distort the comparative results. Secondly, as the experiments involve models of deep language processing (as opposed to stochastic models), the use of types instead of tokens is justified in this phase of the research: the first concern of precision-grammars is descriptive adequacy, for which types are a more reliable source. Obviously, the effect of token frequency needs to be examined in future research.

4.3 Measuring Cognitive Effort

The experiments measure two kinds of cognitive effort: syntactic search and semantic ambiguity.

Search. The search measure counts the number of branches in the search process that reach an end node, which can either be a possible solution or a dead end (i.e. no constructions can be applied anymore). Duplicate nodes (for instance, nodes that use the same rules but in a different order) are not counted. The search measure is then used as a sanity check to verify whether the three different paradigms can be processed with the same efficiency in terms of search tree length, as hypothesized by this paper. More specifically, the following conditions have to be met:

1. In production, there should only be one branch.
2. In parsing, search has to be equal to the semantic effort.

The single-branch constraint in production checks whether the definite articles are sufficiently distinct from one another. Since there is no ambiguity about which argument plays which role in the utterance, the grammar should only come up with one solution. In parsing, the number of branches has to correspond to real semantic ambiguities and not create additional search, as argued in section 2.2.

Semantic Ambiguity. Semantic ambiguity equals the number of possible interpretations of an utterance. For instance, the utterance Der Hund beißt den Mann 'the dog bites the man' is unambiguous in Modern High German,
since der Hund can only be nominative singular-masculine, and den Mann can only be accusative masculine-singular. There is thus only one possible interpretation in which the dog is the biter and the man is being bitten, illustrated as follows using a logic-based meaning representation (also see Steels, 2004, for this operationalization of cognitive effort):

(6) Interpretation 1:
    Der Hund beißt den Mann.
    dog(?a), bite(?ev), man(?b),
    biter(?ev, ?x), ?a=?x,
    bitten(?ev, ?y), ?b=?y

However, an utterance such as die Katze beißt die Frau 'the cat bites the woman' is ambiguous because die has both a nominative and an accusative singular-feminine reading:

(7) a. Interpretation 1:
       Die Katze beißt die Frau.
       cat(?a), bite(?ev), woman(?b),
       biter(?ev, ?x), ?a=?x,
       bitten(?ev, ?y), ?b=?y

    b. Interpretation 2:
       Die Katze beißt die Frau.
       cat(?a), bite(?ev), woman(?b),
       biter(?ev, ?x), ?b=?x,
       bitten(?ev, ?y), ?a=?y

Cue                     E1   E2   E3   E4
SV-agreement                 +         +
Selection restrictions            +    +

SV-agreement restricts the subject to singular or plural nouns, and semantic selection restrictions can disambiguate utterances in which for example the Agent-role has to be animate (e.g. in perception verbs such as sehen 'to see'). All other possible cues, such as word order, are ignored.

5 Results

5.1 Search

In all experiments, the constraints of the search measure were satisfied: every grammar only required one branch per utterance in production, and the number of branches in parsing never exceeded the number of possible interpretations. In terms of search length, more syncretism therefore does not automatically harm efficiency, provided that the grammar uses an adequate representation. Arguably, the smaller paradigms are even more efficient because they require fewer unifications to be performed.
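The utterance-type arithmetic of section 4.2 (216 ditransitive types, 72 + 72 transitive types, 360 in total) can be verified with a short enumeration. The encoding below is illustrative only; the names and tuple format are not taken from the experiment code.

```python
# Sketch of the utterance-type counts from section 4.2 (hypothetical
# encoding, not the actual experiment code).
from itertools import product

# 6 number-gender bundles per noun phrase: S/PL crossed with M/F/N
NP = [f"{num}.{gen}" for num in ("S", "PL") for gen in ("M", "F", "N")]

# ditransitive: NOM Verb DAT ACC, each slot varies over the 6 bundles
ditransitive = [("NOM." + s, "V", "DAT." + d, "ACC." + a)
                for s, d, a in product(NP, NP, NP)]

# transitive (a) and (b): subject x object x animate/inanimate object
transitive = [(cfg, s, o, anim)
              for cfg in ("NOM-ACC", "NOM-DAT")
              for s, o, anim in product(NP, NP, ("animate", "inanimate"))]

print(len(ditransitive))                     # 216
print(len(transitive))                       # 144 (72 per configuration)
print(len(ditransitive) + len(transitive))   # 360
```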
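The semantic-ambiguity measure illustrated by examples (6) and (7) can likewise be sketched. The reading sets below are a minimal stand-in for the Modern High German feature matrices and cover only the forms needed for the two example utterances.

```python
# Minimal sketch of the semantic-ambiguity measure: the number of
# interpretations of a transitive utterance equals the number of ways
# case can be assigned so that one NP is nominative and the other
# accusative. Reading sets are illustrative, not the paper's matrices.
READINGS = {
    "der Hund":  {"nom"},          # der + masculine noun: nominative only
    "den Mann":  {"acc"},          # den: accusative masculine only
    "die Katze": {"nom", "acc"},   # die + feminine singular: syncretic
    "die Frau":  {"nom", "acc"},
}

def interpretations(np1, np2):
    """Count readings where one NP is nominative and the other accusative."""
    r1, r2 = READINGS[np1], READINGS[np2]
    return sum(1 for c1 in r1 for c2 in r2 if {c1, c2} == {"nom", "acc"})

print(interpretations("der Hund", "den Mann"))   # 1: unambiguous
print(interpretations("die Katze", "die Frau"))  # 2: agent and patient can swap
```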
E3). Here, the difference between Old and Modern High German becomes trivial, with 4.44% and 6.94% of ambiguous utterances respectively. The difference with Texas German remains apparent, even though its ambiguity is cut by half.

In set-up E4 (case, SV-agreement and selection restrictions), the Old and Modern High German paradigms resolve almost all ambiguities, leaving little difference between them. Using the Texas German dialect, one utterance out of five remains ambiguous and requires additional grammatical cues or inferencing for semantic interpretation.

Number of possible interpretations. Semantic ambiguity can also be measured by counting the number of possible interpretations per utterance. A non-ambiguous language would thus have 1 possible interpretation per utterance. The average number of interpretations per utterance (per paradigm and per set-up) is shown in Table 8.

Paradigm             E1     E2     E3     E4
Old High German      1.56   1.22   1.04   1.03
Modern High German   1.56   1.28   1.07   1.04
Texas German         2.84   2.39   1.36   1.22

Table 8: Average number of interpretations per utterance type.

The Old High German paradigm has the least semantic ambiguity throughout, except in Experiment 1 (E1). Here, Modern High German has the same average effort despite having more ambiguous utterances. This means that the Old High German paradigm provides a better coverage in terms of construction types, but when ambiguity occurs, more possible interpretations exist.

6 Discussion

The experiments compare how well three different paradigms of definite articles perform if they are inserted in the grammar of Modern High German. The results show that, in isolation, Old High German offers the best cue-reliability for retrieving who's doing what to whom in events. However, when other grammatical cues are taken into account, it turns out that Modern High German achieves similar results with respect to syntactic search and semantic ambiguity, with a reduced paradigm (using only six instead of twelve forms).

As for the Texas German dialect, which has collapsed the accusative-dative distinction, the amount of ambiguity remains more than 20% using all available cues. One verifiable prediction of the experiments is therefore that this dialect should show an increase in alternative syntactic restrictions (such as word order) in order to make up for the lost case distinctions. Interestingly, such alternatives have been attested in Low German dialects that have evolved a similar two-way case system (Shrier, 1965). Modern High German, on the other hand, has already recruited word order for other purposes (such as information structure; Lenerz, 1977; Micelli, 2012), which may explain why the current paradigm has been able to survive since the Middle Ages.

Instead of an accidental by-product of phonological and morphological changes, then, a new picture emerges for explaining syncretism in Modern High German definite articles: German speakers have been able to reduce their case paradigm without loss in processing and interpretation efficiency. With cognitive effort as a selection criterion, subsequent generations of speakers found no linguistic pressures for maintaining particular distinctions such as gender in plural articles. Especially forms whose acoustic distinctions are harder to perceive are candidates for collapse if they are no longer functional for processing or interpretation. Other factors, such as frequency, may accelerate this evolution, as also argued by Barðdal (2009). For instance, there may be fewer benefits for upholding a case distinction for infrequent than for frequent forms.

If case syncretism is not randomly distributed over a grammatical paradigm, but rather functionally motivated, a new explanatory model is needed. One candidate is evolutionary linguistics (Steels, 2012b), a framework of cultural evolution in which populations of language users constantly shape and reshape their language in response to their communicative needs. The experiments reported here suggest that this dynamic shaping process is guided by the linguistic landscape of a language. For instance, the presence of grammatical cues such as gender, number and SV-agreement may encourage paradigm reduction. However, reduction may be the start of a self-enforcing loop in which the decreasing cue-reliability of a paradigm may pressure language users into enforcing the alternatives to take on even more of the cognitive load of processing.

Figure 2: This chart shows the number of ambiguous utterances per paradigm per E(xperimental set-up) in %.

The intricate interactions between grammatical systems also require more sophisticated measures. A promising extension of this paper could lie in an information-theoretic approach to language (Hale, 2003; Jaeger and Tily, 2011), which has recently explored a set of tools for assessing linguistic complexity, processing effort and uncertainty. Unfortunately, only little work has been done on morphological paradigms so far (see e.g. Ackerman et al., 2011), and the approach is typically applied in stochastic or Probabilistic Context-Free Grammars, hence it remains unclear how the assumptions of this field fit into models of deep language processing.

7 Conclusions

More than 130 years after Mark Twain's complaints, it seems that the German language is not that awful after all. Through a series of computational experiments, this paper has proposed a different explanation for German case syncretism that answers some of the unsolved riddles of previous studies. First, the experiments have shown that an increase in syncretism does not necessarily lead to an increase in the cognitive effort required for syntactic search, provided that the representation of the grammar is processing-friendly. Secondly, by comparing cue-reliability of different paradigms for semantic disambiguation, the experiments have demonstrated that Modern High German achieves a similar performance as its Old High German predecessor using only half of the forms in its definite article paradigm.

Instead of a series of historical accidents, the German case system thus underwent a systematic and "performance-driven [...] morphological restructuring" (Hawkins, 2004, p. 79), in which linguistic pressures such as cognitive effort decided on the maintenance or loss of certain distinctions. The case study makes clear that formal and computational models of deep language understanding have to reconsider their strict division between competence and performance if the goal is to explain individual language development. This paper proposed that new tools and methodologies should be sought in evolutionary linguistics.

Acknowledgements

This research has been conducted at the Sony Computer Science Laboratory Paris. I would like to thank Luc Steels, director of Sony CSL Paris and the VUB AI-Lab of the University of Brussels, for his support and feedback. I also thank Hans Boas, Jóhanna Barðdal, Peter Hanappe, Manfred Hild and the anonymous reviewers for helping to improve this article. All errors remain of course my own.
References

Farrell Ackerman, James P. Blevins, and Robert Malouf. Parts and wholes: Implicative patterns in inflectional paradigms. In J.P. Blevins and J. Blevins, editors, Analogy in Grammar: Form and Acquisition, pages 54–81. Oxford University Press, Oxford, 2011.

Matthew Baerman. Case syncretism. In Andrej Malchukov and Andrew Spencer, editors, The Oxford Handbook of Case, chapter 14, pages 219–230. Oxford University Press, Oxford, 2009.

J. Barðdal. The development of case in Germanic. In J. Barðdal and S. Chelliah, editors, The Role of Semantics and Pragmatics in the Development of Case, pages 123–159. John Benjamins, Amsterdam, 2009.

Manfred Bierwisch. Syntactic features in morphology: General problems of so-called pronominal inflection in German. In To Honour Roman Jakobson, pages 239–270. Mouton De Gruyter, Berlin, 1967.

James Blevins. Syncretism and paradigmatic opposition. Linguistics and Philosophy, 18:113–152, 1995.

Joris Bleys, Kevin Stadler, and Joachim De Beule. Search in linguistic processing. In Luc Steels, editor, Design Patterns in Fluid Construction Grammar. John Benjamins, Amsterdam, 2011.

Hans C. Boas. Case loss in Texas German: The influence of semantic and pragmatic factors. In J. Barðdal and S. Chelliah, editors, The Role of Semantics and Pragmatics in the Development of Case, pages 347–373. John Benjamins, Amsterdam, 2009a.

Hans C. Boas. The Life and Death of Texas German, volume 93 of Publication of the American Dialect Society. Duke University Press, Durham, 2009b.

David Carter. Efficient disjunctive unification for bottom-up parsing. In Proceedings of the 13th Conference on Computational Linguistics, pages 70–75. ACL, 1990.

Ann Copestake. Implementing Typed Feature Structure Grammars. CSLI Publications, Stanford, 2002.

Berthold Crysmann. Syncretism in German: A unified approach to underspecification, indeterminacy, and likeness of case. In Stefan Müller, editor, Proceedings of the 12th International Conference on Head-Driven Phrase Structure Grammar, pages 91–107, Stanford, 2005. CSLI Publications.

Mary Dalrymple, Tracy Holloway King, and Louisa Sadler. Indeterminacy by underspecification. Journal of Linguistics, 45:31–68, 2009.

Michael Daniels. On a type-based analysis of feature neutrality and the coordination of unlikes. In Proceedings of the 8th International Conference on HPSG, pages 137–147, Stanford, 2001. CSLI.

Joachim De Beule. A formal deconstruction of Fluid Construction Grammar. In Luc Steels, editor, Computational Issues in Fluid Construction Grammar. Springer Verlag, Berlin, 2012.

Daniel P. Flickinger. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15–28, 2000.

Jonathan Ginzburg and Ivan A. Sag. Interrogative Investigations: the Form, the Meaning, and Use of English Interrogatives. CSLI Publications, Stanford, 2000.

Adele E. Goldberg. Constructions At Work: The Nature of Generalization in Language. Oxford University Press, Oxford, 2006.

John T. Hale. The information conveyed by words in sentences. Journal of Psycholinguistic Research, 32(2):101–123, 2003.

John A. Hawkins. Efficiency and Complexity in Grammars. Oxford University Press, Oxford, 2004.

Bernd Heine and Tania Kuteva. Language Contact and Grammatical Change. Cambridge University Press, Cambridge, 2005.

Wolfgang Heinz and Johannes Matiasek. Argument structure and case assignment in German. In John Nerbonne, Klaus Netter, and Carl Pollard, editors, German in Head-Driven Phrase Structure Grammar, volume 46 of CSLI Lecture Notes, pages 199–236. CSLI Publications, Stanford, 1994.

R.J.P. Ingria. The limits of unification. In Proceedings of the 28th Annual Meeting of the ACL, pages 194–204, 1990.

T. Florian Jaeger and Harry Tily. On language utility: Processing complexity and communicative efficiency. WIREs: Cognitive Science, 2(3):323–335, 2011.

L. Karttunen. Features and values. In Proceedings of the 10th International Conference on Computational Linguistics, Stanford, 1984.

Jürgen Lenerz. Zur Abfolge nominaler Satzglieder im Deutschen. Narr, Tübingen, 1977.

Martin Loetzsch. Tools for grammar engineering. In Luc Steels, editor, Computational Issues in Fluid Construction Grammar. Springer Verlag, Berlin, 2012.

Vanessa Micelli. Field topology and information structure: A case study for German constituent order. In Luc Steels, editor, Computational Issues in Fluid Construction Grammar. Springer Verlag, Berlin, 2012.

Gereon Müller. Remarks on nominal inflection in German. In Ingrid Kaufmann and Barbara Stiebels, editors, More than Words: A Festschrift for Dieter Wunderlich, pages 113–145. Akademie Verlag, Berlin, 2002.

Stefan Müller. An HPSG-analysis for free relative clauses in German. Grammars, 2(1):53–105, 1999.

Stefan Müller. Case in German – towards an HPSG analysis. In Tibor Kiss and Detmar Meurers, editors, Constraint-Based Approaches to Germanic Syntax. CSLI, Stanford, 2001.

Allan Ramsay. Disjunction without tears. Computational Linguistics, 16(3):171–174, 1990.

Ivan A. Sag. Coordination and underspecification. In Jongbok Kim and Stephen Wechsler, editors, Proceedings of the Ninth International Conference on HPSG, Stanford, 2003. CSLI.

Ivan A. Sag and Thomas Wasow. Performance-compatible competence grammar. In Robert D. Borsley and Kersti Börjars, editors, Non-Transformational Syntax: Formal and Explicit Models of Grammar. Wiley-Blackwell, Oxford, 2011.

Martha Shrier. Case systems in German dialects. Language, 41(3):420–438, 1965.

Luc Steels. Constructivist development of grounded construction grammars. In Walter Daelemans, editor, Proceedings 42nd Annual Meeting of the Association for Computational Linguistics, pages 9–19, Barcelona, 2004.

Luc Steels, editor. Design Patterns in Fluid Construction Grammar. John Benjamins, Amsterdam, 2011.

Luc Steels, editor. Computational Issues in Fluid Construction Grammar. Springer, Berlin, 2012a.

Luc Steels. Self-organization and selection in cultural language evolution. In Luc Steels, editor, Experiments in Cultural Language Evolution. John Benjamins, Amsterdam, 2012b.

Luc Steels and Joachim De Beule. Unify and merge in Fluid Construction Grammar. In P. Vogt, Y. Sugita, E. Tuci, and C. Nehaniv, editors, Symbol Grounding and Beyond, LNAI 4211, pages 197–223, Berlin, 2006. Springer.

Remi van Trijp. Feature matrices and agreement: A case study for German case. In Luc Steels, editor, Design Patterns in Fluid Construction Grammar. John Benjamins, Amsterdam, 2011.

M. Walshe. A Middle High German Reader: With Grammar, Notes and Glossary. Oxford University Press, Oxford, 1974.

Bernd Wiese. Iconicity and syncretism. On pronominal inflection in Modern German. In Robin Sackmann, editor, Theoretical Linguistics and Grammatical Description, pages 323–344. John Benjamins, Amsterdam, 1996.

Joseph Wright. An Old High German Primer. Clarendon Press, Oxford, 2nd edition, 1906.

Dieter Wunderlich. Der unterspezifizierte Artikel. In Christa Dürscheid, Karl Heinz Ramers, and Monika Schwarz, editors, Sprache im Fokus, pages 47–55. Niemeyer, Tübingen, 1997.
Managing Uncertainty in Semantic Tagging
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 840–850,
Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics
compared, the experiment showed that the mental representations of a word's semantics differ for each group (Fellbaum et al., 1997), and cf. (Jorgensen, 1990). Lexicographers are trained in considering subtle differences among various uses of a word, which ordinary language users do not reflect. Identifying a semantic difference between uses of a word and deciding whether a difference is important enough to constitute a separate sense means presenting a word with a certain degree of semantic granularity. Intuitively, the finer the granularity of a word entry is, the more opportunities for interannotator disagreement there are and the lower IAA can be expected. Brown et al. proved this hypothesis experimentally (Brown et al., 2010). Also, the annotators are less confident in their decisions when they have many options to choose from (Fellbaum et al. (1998) reported a drop in subjective annotators' confidence in words with 8+ senses).

Despite all the known issues in semantic tagging, the major lexical resources (WordNet (Fellbaum, 1998), FrameNet (Ruppenhofer et al., 2010), PropBank (Palmer et al., 2005) and the word-sense part of OntoNotes (Weischedel et al., 2011)) are still maintained and their annotation schemes are adopted for creating new manually annotated data (e.g. MASC, the Manually Annotated Subcorpus (Ide et al., 2008)). Moreover, these resources are not only used in WSD and semantic labeling, but also in research directions that in their turn no longer rely on the idea of an inventory of discrete senses, e.g. in distributional semantics (Erk, 2010) and recognizing textual entailment (e.g. (Zanzotto et al., 2009) and (Aharon et al., 2010)).

It is a remarkable fact that, to the best of our knowledge, there is no measure that would relate granularity, reliability of the annotation (derived from IAA) and the resulting information gain. Therefore it is impossible to say where the optimum for granularity and IAA lies.

2 Approaches to semantic tagging

2.1 Semantic tagging vs. morphological or syntactic analysis

Manual semantic tagging is in many respects similar to morphological tagging and syntactic analysis: human annotators are trained to sort certain elements occurring in a running text according to a reference source. There is, nevertheless, a substantial difference: whereas morphologically or syntactically annotated data exist separately from the reference (tagset, annotation guide, annotation scheme), a semantically tagged resource can be regarded both as a corpus of texts disambiguated according to an attached inventory of semantic categories and as a lexicon with links to example concordances for each semantic category. So, in semantically tagged resources, the data and the reference are intertwined. Such double-faced semantic resources have also been called semantic concordances (Miller et al., 1993a). For instance, one of the earlier versions of WordNet, the largest lexical resource for English, was used in the semantic concordance SemCor (Miller et al., 1993b). More recent lexical resources have been built as semantic concordances from the very beginning (PropBank (Palmer et al., 2005), OntoNotes word senses (Weischedel et al., 2011)).

In morphological or syntactic annotation, the tagset or inventory of constituents is given beforehand and is supposed to hold for all tokens/sentences contained in the corpus. Problematic and theory-dependent issues are few and mostly well-known in advance. Therefore they can be reflected by a few additional conventions in the annotation manual (e.g. where to draw the line between particles and prepositions or between adjectives and verbs in past participles (Santorini, 1990), or where to attach a prepositional phrase following a noun phrase and how to treat specific financialspeak structures (Bies et al., 1995)). Even in difficult cases, there are hardly more than two options of interpretation. Data manually annotated for morphology or surface syntax are reliable enough to train syntactic parsers with an accuracy above 80% (e.g. (Zhang and Clark, 2011; McDonald et al., 2006)).

On the other hand, semantic tagging actually employs a different tagset for each word lemma. Even within the same part of speech, individual words require individual descriptions. Possible similarities among them come into relief ex post rather than being imposable on the lexicographers from the beginning. When assigning senses to concordances, the annotator often has to select among more than two relevant options. These two aspects make achieving good IAA much harder than in morphology and syntax tasks. In addition, while a linguistically educated annotator can have roughly the same idea of parts of speech as the author of the tagset, there is no chance that two humans (not even two professional lexicographers) would create identical entries for e.g. a polysemous verb. Any human evaluation of complete entries would be subjective. The maximum to be achieved is that the entry reflects the corpus data in a reasonably granular way on which annotators still can reach reasonable IAA.

2.2 Major existing semantic resources

The granularity vs. IAA equilibrium is of great concern in creating lexical resources as well as in applications dealing with semantic tasks. When WordNet (Fellbaum, 1998) was created, both IAA and subjective confidence measurements served as informal feedback to lexicographers (Fellbaum et al. (1998), p. 200). In general, WordNet has been considered a resource too fine-grained for most annotations (and applications). Navigli (2006) developed a method of reducing the granularity of WordNet by mapping the synsets to senses in a more coarse-grained dictionary. A manual, more coarse-grained grouping of WordNet senses has been performed in OntoNotes (Weischedel et al., 2011). The OntoNotes "90% solution" (Hovy et al., 2006) actually means such a degree of granularity that enables a 90% IAA. OntoNotes is a reaction to the traditionally poor IAA in WordNet-annotated corpora, caused by the high granularity of senses. The quality of semantic concordances is maintained by numerous iterations between lexicographers and annotators. The categories right/wrong have been, for the purpose of the annotated linguistic resource, defined by the IAA score, which is, in OntoNotes, calculated as the percentage of agreements between two annotators.

Two other, somewhat different, lexical resources have to be mentioned to complete the picture: FrameNet (Ruppenhofer et al., 2010) and PropBank (Palmer et al., 2005). While WordNet and OntoNotes pair words and word senses in a way comparable to printed lexicons, FrameNet is primarily an inventory of semantic frames, and PropBank focuses on the argument structure of verbs and nouns (NomBank (Meyers et al., 2008), a related project capturing the argument structure of nouns, was later integrated in OntoNotes).

In FrameNet corpora, content words are associated with particular semantic frames that they evoke (e.g. charm would relate to the Aesthetics frame) and their collocates in relevant syntactic positions (arguments of verbs, head nouns of adjectives, etc.) would be assigned the corresponding frame-element labels (e.g. in "their dazzling charm", their would be the Entity for which a particular gradable Attribute is appropriate and under consideration, and dazzling would be Degree). Neither IAA nor granularity seem to be an issue in FrameNet. We have not succeeded in finding a report on IAA in the original FrameNet annotation, except one measurement in progress in the annotation of the Manually Annotated Subcorpus of English (Ide et al., 2008).¹

PropBank is a valency (argument structure) lexicon. The current resource lists and labels arguments and obligatory modifiers typical of each (very coarse) word sense (called frameset). Two core criteria for distinguishing among framesets are the semantic roles of the arguments along with the syntactic alternations that the verb can undergo with that particular argument set. To keep granularity low, this lexicon, among other things, does usually not make special framesets for metaphoric uses. The overall IAA measured on verbs was 94% (Palmer et al., 2005).

2.3 Semantic Pattern Recognition

From corpus-based lexicography to semantic patterns

The modern, corpus-based lexicology of the 1990s (Sinclair, 1991; Fillmore and Atkins, 1994) has had a great impact on lexicography. There is a general consensus that dictionary definitions need to be supported by corpus examples. Cf. Fellbaum (2001):

"For polysemous words, dictionaries [...] do not say enough about the range of possible contexts that differentiate the senses. [...] On the other hand, texts or corpora [...] are not explicit about the word's meaning. When we first encounter a new word in a text, we can usually form only a vague idea of its meaning; checking a dictionary will clarify the meaning. But the more contexts we encounter for a word, the harder it is to match them against only one dictionary sense."

¹ Checked on the project web www.anc.org/MASC/Home 2011-10-29.
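The IAA figure used in OntoNotes, as described in section 2.2, is plain observed agreement, i.e. the percentage of instances on which two annotators chose the same sense, with no chance correction such as Cohen's kappa. A minimal sketch:

```python
# Observed agreement between two annotators, as used for OntoNotes-style
# IAA: the percentage of items on which both chose the same label.
# (Illustrative sketch; sense labels below are made up.)
def observed_agreement(ann1, ann2):
    """Return the percentage of items with identical labels."""
    assert len(ann1) == len(ann2)
    matches = sum(a == b for a, b in zip(ann1, ann2))
    return 100.0 * matches / len(ann1)

a1 = ["s1", "s1", "s2", "s3", "s1"]
a2 = ["s1", "s2", "s2", "s3", "s1"]
print(observed_agreement(a1, a2))   # 80.0
```

Note that observed agreement makes no correction for chance, which is one reason it is sensitive to the granularity of the sense inventory: the fewer the senses, the more agreement can arise by chance alone.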
The lexical description in modern English monolingual dictionaries (Sinclair et al., 1987; Rundell, 2002) explicitly emphasizes contextual clues, such as typical collocates and the syntactic surroundings of the given lexical item, rather than relying on very detailed definitions. In other words, in modern corpus-based lexicography the sense definitions are obtained as syntactico-semantic abstractions of manually clustered corpus concordances: in classical dictionaries as well as in semantic concordances.

Nevertheless, the word senses, even when obtained by a collective mind of lexicographers and annotators, are naturally hard-wired and tailored to the annotated corpus. They may be too fine-grained or too coarse-grained for automatic processing of different corpora (e.g. a restricted-domain corpus). Kilgarriff (1997, p. 115) shows (the handbag example) that there is no reason to expect the same set of word senses to be relevant for different tasks and that the corpus dictates the word senses; therefore word sense was not found "to be sufficiently well-defined to be a workable basic unit of meaning" (p. 116). On the other hand, even non-experts seem to agree reasonably well when judging the similarity of use of a word in different contexts (Rumshisky et al., 2009). Erk et al. (2009) showed promising annotation results with a scheme that allowed the annotators graded judgments of similarity between two words or between a word and its definition.

Verbs are the most challenging part of speech. We see two major causes: vagueness and coercion. We neglect ambiguity, since it has proved to be rare in our experience.

CPA and PDEV

Our current work focuses on English verbs. It has been inspired by the manual Corpus Pattern Analysis method (CPA) (Hanks, forthcoming) and its implementation, the Pattern Dictionary of English Verbs (PDEV) (Hanks and Pustejovsky, 2005). PDEV is a semantic concordance built on yet a different principle than FrameNet, WordNet, PropBank or OntoNotes. The manually extracted patterns of frequent and normal verb uses are, roughly speaking, intuitively similar uses, which can be semantically so tightly related that they could appear together under one sense in a traditional dictionary. The patterns are not senses but syntactico-semantically characterized prototypes (see the example verb submit in Table 1). Concordances that match these prototypes well are called norms in Hanks (forthcoming). Concordances that match with a reservation (metaphorical uses, argument mismatch, etc.) are called exploitations. The PDEV corpus annotation indicates the norm-exploitation status for each concordance.

Compared to other semantic concordances, the granularity of PDEV is high and thus discouraging in terms of expected IAA. However, selecting among patterns does not really mean disambiguating a concordance but rather determining to which pattern it is most similar, a task easier for humans than WSD is. This principle seems particularly promising for verbs as words expressing events, which resist the traditional word sense disambiguation the most.

A novel approach to semantic tagging

We present semantic pattern recognition as a novel approach to semantic tagging, which is different from the traditional word-sense assignment tasks. We adopt the central idea of CPA that words do not have fixed senses but that regular patterns can be identified in the corpus that activate different conversational implicatures from the meaning potential of the given verb. Our method draws on a hard-wired, fine-grained inventory of semantic categories manually extracted from corpus data. This inventory represents the maximum semantic granularity that humans are able to recognize in normal and frequent uses of a verb in a balanced corpus. We thoroughly analyze the interannotator agreement to find out which of the highly semantic categories are useful in the sense of information gain. Our goal is a dynamic optimization of semantic granularity with respect to given data and target application.

Like Passonneau et al. (2010), we are convinced that IAA is specific to each respective word and reflects its inherent semantic properties as well as the specificity of contexts the given
ilar uses of a verb that expressin a syntacti- word occurs in, even within the same balanced
cally similar forma similar event in which sim- corpus. We accept as a matter of fact that inter-
ilar participants (e.g. humans, artifacts, institu- annotator confusion is inevitable in semantic tag-
tions, other events) are involved. Two patterns ging. However, the amount of uncertainty of the
No. Pattern / Implicature

1. Pattern: [[Human 1 | Institution 1] [Human 1 | Institution 1 = Competitor]] submit [[Plan | Document | Speech Act | Proposition | {complaint | demand | request | claim | application | proposal | report | resignation | information | plea | petition | memorandum | budget | amendment | programme | ...}] [Artifact | Artwork | Service | Activity | {design | tender | bid | entry | dance | ...}]] (({to} Human 2 | Institution 2 = authority) ({to} Human 2 | Institution 2 = referee)) ({for} {approval | discussion | arbitration | inspection | designation | assessment | funding | taxation | ...})
   Implicature: [[Human 1 | Institution 1]] presents [[Plan | Document]] to [[Human 2 | Institution 2]] for {approval | discussion | arbitration | inspection | designation | assessment | taxation | ...}

2. Pattern: [[Human | Institution]] submit [THAT-CL | QUOTE]
   Implicature: [[Human | Institution]] respectfully expresses {that [CLAUSE]} and invites listeners or readers to accept that {that [CLAUSE]} is true

4. Pattern: [[Human 1 | Institution 1]] submit (Self) ({to} Human 2 | Institution 2)
   Implicature: [[Human 1 | Institution 1]] acknowledges the superior force of [[Human 2 | Institution 2]] and puts [[Self]] in the power of [[Human 2 | Institution 2]]

5. Pattern: [[Human 1]] submit (Self) [[{to} Eventuality = Unpleasant] [{to} Rule]]
   Implicature: [[Human 1]] accepts [[Rule | Eventuality = Unpleasant]] without complaining

6. Pattern: [passive] [[Human | Institution]] submit [Anything] [{to} Eventuality]
   Implicature: [[Human 1 | Institution 1]] exposes [[Anything]] to [[Eventuality]]

Table 1: Patterns and implicatures for the verb submit.
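To make "determining which pattern a concordance is most similar to" concrete, here is a toy sketch. The slot representation, the overlap scoring, and the mini-inventory below are our own simplifications for illustration, not CPA/PDEV machinery; real pattern assignment is done by human annotators over full syntactico-semantic analyses.

```python
# Toy sketch (our simplification, not CPA/PDEV): a pattern maps syntactic
# slots to sets of admissible semantic types; a concordance, already analyzed
# into slot fillers, is assigned the pattern whose slots it satisfies best.
def best_pattern(concordance_slots, patterns):
    """Return the id of the pattern most similar to the analyzed concordance."""
    def score(pattern):
        hits = sum(1 for slot, semtype in concordance_slots.items()
                   if semtype in pattern.get(slot, set()))
        return hits / len(pattern)        # fraction of pattern slots satisfied
    return max(patterns, key=lambda pid: score(patterns[pid]))

# Hypothetical mini-inventory loosely modeled on patterns 1 and 4 of Table 1.
PATTERNS = {
    "1": {"subject": {"Human", "Institution"}, "object": {"Plan", "Document"}},
    "4": {"subject": {"Human", "Institution"}, "object": {"Self"}},
}
```

A concordance whose object slot was analyzed as Document would be mapped to pattern 1, one with Self to pattern 4; crucially, the annotator (or classifier) always picks the closest prototype rather than deciding on a discrete "sense".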
right tag differs a lot, and should be quantified. For that purpose we developed the reliable information gain measure presented in Section 3.2.

CPA Verb Validation Sample

The original PDEV had never been tested with respect to IAA. Each entry had been based on concordances annotated solely by the author of that particular entry. The annotation instructions had been transmitted only orally. The data had been evolving along with the method, which implied inconsistencies. We put down an annotation manual (a momentary snapshot of the theory) and trained three annotators accordingly. For practical annotation we use the infrastructure developed at Masaryk University in Brno (Horak et al., 2008), which was also used for the original PDEV development. After initial IAA experiments with the original PDEV, we decided to select 30 verb entries from PDEV along with the annotated concordances. We made a new semantic concordance sample (Cinkova et al., 2012) for the validation of the annotation scheme. We refer to this new collection² as VPS-30-En (Verb Pattern Sample, 30 English verbs).

We slightly revised some entries and updated the reference samples (usually 250 concordances per verb). The annotators were given the entries as well as the reference sample annotated by the lexicographer and a test sample of 50 concordances for annotation. We measured IAA, using Fleiss's kappa,³ and analyzed the interannotator confusion manually. IAA varied from verb to verb, mostly reaching safely above 0.6. When the IAA was low and the type of confusion indicated a problem in the entry, the entry was revised. Then the lexicographer revised the original reference sample along with the first 50-concordance sample. The annotators got back the revised entry, the newly revised reference sample and an entirely new 50-concordance annotation batch. The final multiple 50-concordance sample went through one more additional procedure, the adjudication: first, the lexicographer compared the three annotations and eliminated evident errors. Then the lexicographer selected one value for each concordance to remain in the resulting one-value-per-concordance gold standard data and recorded it into the gold standard set. The adjudication protocol has been kept for further experiments. All values except the marked errors are regarded as equally acceptable for this type of experiment. In the end, we get for each verb:

- an entry, which is an inventory of semantic categories (patterns)
- 300+ manually annotated concordances (single values)

² This new lexical resource, including the complete documentation, is publicly available at http://ufal.mff.cuni.cz/spr.

³ Fleiss's kappa (Fleiss, 1971) is a generalization of Scott's pi statistic (Scott, 1955). In contrast to Cohen's kappa (Cohen, 1960), Fleiss's kappa evaluates agreement between multiple raters. However, Fleiss's kappa is not a generalization of Cohen's kappa, which is a different, yet related, statistical measure. Sometimes, the terminology about kappas is confusing in the literature. For a detailed explanation refer e.g. to (Artstein and Poesio, 2008).

[...]

Properties: ACM is symmetric, and for any $i \neq j$ the number $C^*_{ij}$ says how many times a pair of annotators disagreed on two tags $t_i$ and $t_j$, while $C^*_{ii}$ is the frequency of agreements on $t_i$; the sum in the $i$-th row, $\sum_j C^*_{ij}$, is the total frequency of assigned sets $\{t, t'\}$ that contain $t_i$.

An example of ACM is given in Table 2. The corresponding confusion matrices are shown in Table 3.
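The pairwise confusion matrices and the aggregated confusion matrix (ACM) described above can be built mechanically from raw annotations. A minimal sketch (our own code, not the authors' tooling), assuming one tag per annotator per concordance:

```python
import itertools

def confusion_matrix(ann_a, ann_b, tags):
    """Pairwise matrix: M[i][j] = number of instances tagged t_i by annotator
    A and t_j by annotator B."""
    idx = {t: k for k, t in enumerate(tags)}
    M = [[0] * len(tags) for _ in tags]
    for a, b in zip(ann_a, ann_b):
        M[idx[a]][idx[b]] += 1
    return M

def aggregated_confusion_matrix(annotations, tags):
    """ACM C*: for i != j, C*[i][j] counts annotator pairs that chose the
    unordered tag pair {t_i, t_j}; C*[i][i] counts agreeing pairs on t_i.
    `annotations` holds one tag list per instance, one tag per annotator."""
    idx = {t: k for k, t in enumerate(tags)}
    C = [[0] * len(tags) for _ in tags]
    for instance in annotations:
        for a, b in itertools.combinations(instance, 2):
            i, j = idx[a], idx[b]
            C[i][j] += 1
            if i != j:
                C[j][i] += 1          # keep C* symmetric
    return C
```

The row-sum property stated above can be checked directly on the output: row $i$ of the ACM sums to the number of assigned annotator pairs that contain $t_i$.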
A1 vs. A2:
      1   1.a  2   4   5
1     29  1    1   0   0
1.a   0   1    0   0   0
2     0   1    11  0   0
4     0   0    0   2   0
5     0   0    0   3   1

A1 vs. A3:
      1   1.a  2   4   5
1     29  2    0   0   0
1.a   1   0    0   0   0
2     0   0    12  0   0
4     0   0    0   1   1
5     0   0    0   0   4

A2 vs. A3:
      1   1.a  2   4   5
1     27  2    0   0   0
1.a   2   0    1   0   0
2     1   0    11  0   0
4     0   0    0   1   4
5     0   0    0   0   1

Table 3: Example of all confusion matrices for the target word submit and three annotators.
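Given raw multi-annotator assignments like those behind Table 3, the per-verb IAA figures reported above (Fleiss's kappa, see footnote 3) can be reproduced with a short reference implementation. This is a generic sketch of the standard formula, not the authors' code:

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """Fleiss's kappa (Fleiss, 1971) for nominal tags.

    `ratings`: one list of category labels per item, one label per annotator
    (every item must be rated by the same number of annotators)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # overall proportion of assignments falling into each category
    totals = Counter(label for item in ratings for label in item)
    p = {c: totals[c] / (n_items * n_raters) for c in categories}
    # per-item agreement: fraction of annotator pairs that agree on the item
    def item_agreement(item):
        counts = Counter(item)
        return (sum(v * v for v in counts.values()) - n_raters) / \
               (n_raters * (n_raters - 1))
    p_bar = sum(item_agreement(item) for item in ratings) / n_items
    p_e = sum(v * v for v in p.values())      # chance agreement
    if p_e == 1.0:                            # degenerate single-category case
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)
```

Perfect three-way agreement yields kappa 1.0, and systematic disagreement drives the value below zero, which matches the "safely above 0.6" reading of the per-verb results.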
Obviously, it can be computed as

$$p_2(t_i \mid t_j) = \frac{\Pr(T_2 = \{t_i, t_j\})}{\Pr(T_2 \supseteq \{t_j\})} = \frac{C^*_{ij}}{\binom{m}{2}\, r \; p_2(t_j)} = \frac{C^*_{ij}}{\sum_k C^*_{jk}}.$$

Definition: Confusion Probability Matrix (CPM)

$$C^p_{ji} = p_2(t_i \mid t_j) = \frac{C^*_{ij}}{\sum_k C^*_{jk}}.$$

Properties: The sum in any row is 1. The $j$-th row of CPM contains probabilities of assigning $t_i$ given that another annotator has chosen $t_j$ for the same instance. Thus, the $j$-th row of CPM describes expected tagging confusion related to the tag $t_j$.

An example is given in Table 3 (all confusion matrices for three annotators), in Table 2 (the corresponding ACM), and in Table 4 (the corresponding CPM).

      1      1.a    2      4      5
1     0.895  0.084  0.021  0.000  0.000
1.a   0.727  0.091  0.182  0.000  0.000
2     0.053  0.053  0.895  0.000  0.000
4     0.000  0.000  0.000  0.333  0.667
5     0.000  0.000  0.000  0.571  0.429

Table 4: Example of Confusion Probability Matrix.

3.2 Semantic granularity optimization

Now, having a detailed analysis of expected tagging confusion described in CPM, we are able to compare the usefulness of different semantic tags using a measure of the information content associated with them (in the information theory sense). Traditionally, the amount of self-information contained in a tag (as a probabilistic event) depends only on the probability of that tag, and would be defined as $I(t_j) = -\log p_1(t_j)$. However, intuitively one can say that a good measure of usefulness of a particular tag should also take into consideration the expected tagging confusion related to the tag. Therefore, to exactly measure usefulness of the tag $t_j$ we propose to compare and measure similarity of the distribution $p_1(t_i)$ and the distribution $p_2(t_i \mid t_j)$, $i = 1, \dots, n$.

How much information do we gain when an annotator assigns the tag $t_j$ to an instance? When the tag $t_j$ has once been assigned to an instance by an annotator, one would naturally expect that another annotator will probably tend to assign the same tag $t_j$ to the same instance. Formally, things make good sense if $p_2(t_j \mid t_j) > p_1(t_j)$ and if $p_2(t_i \mid t_j) < p_1(t_i)$ for any $i$ different from $j$. If $p_2(t_j \mid t_j) = 100\,\%$, then there is full consensus about assigning $t_j$ among annotators; then and only then the measure of usefulness of the tag $t_j$ should be maximal and should have the value of $-\log p_1(t_j)$. Otherwise, the value of usefulness should be smaller. This is our motivation to define a quantity of reliable information gain obtained from semantic tags as follows:

Definition: Reliable Gain (RG) from the tag $t_j$ is

$$RG(t_j) = \sum_k (-1)^{[k \neq j]}\, p_2(t_k \mid t_j) \log \frac{p_2(t_k \mid t_j)}{p_1(t_k)} \qquad (1)$$

(with the Iverson bracket $[k \neq j]$: the $k = j$ term enters with a positive sign, all other terms with a negative sign).

Properties: RG is similar to the well known Kullback-Leibler divergence (or information gain). If $p_2(t_i \mid t_j) = p_1(t_i)$ for all $i = 1, \dots, n$, then $RG(t_j) = 0$. If $p_2(t_j \mid t_j) = 100\,\%$, then and only then $RG(t_j) = -\log p_1(t_j)$, which is the maximum. If $p_2(t_i \mid t_j) < p_1(t_i)$ for all $i$ different from $j$, the greater the difference in probabilities, the bigger (and positive) $RG(t_j)$. And vice versa, the inequality $p_2(t_i \mid t_j) > p_1(t_i)$ for all $i$ different from $j$ implies a negative value of $RG(t_j)$.
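A minimal prototype of the CPM and RG computations defined above (and of the ARG-driven tag merging introduced in the following section) can be run directly on an ACM. This is our own sketch, not the authors' code: it approximates $p_1$ by the ACM row marginals and treats $0 \cdot \log 0$ as 0.

```python
import math

def cpm_from_acm(C):
    """Row-normalize the ACM: row j of the result holds p2(t_i | t_j)."""
    out = []
    for row in C:
        s = sum(row)
        out.append([v / s for v in row] if s else list(row))
    return out

def reliable_gain(j, p1, cpm):
    """RG(t_j): the k == j term is positive, all k != j terms negative (Eq. 1)."""
    rg = 0.0
    for k, q in enumerate(cpm[j]):
        if q == 0.0:
            continue                      # x log x -> 0 as x -> 0
        sign = 1.0 if k == j else -1.0
        rg += sign * q * math.log(q / p1[k])
    return rg

def average_reliable_gain(C):
    """ARG = sum_j p1(t_j) RG(t_j); p1 approximated by ACM row marginals."""
    total = sum(sum(row) for row in C)
    p1 = [sum(row) / total for row in C]
    cpm = cpm_from_acm(C)
    return sum(p1[j] * reliable_gain(j, p1, cpm) for j in range(len(C)))

def merge_tags(C, i, j):
    """Merge tag j into tag i: add row j to row i, column j to column i, drop j."""
    D = [list(row) for row in C]
    for k in range(len(C)):
        D[i][k] += D[j][k]
    for k in range(len(C)):
        D[k][i] += D[k][j]
    keep = [k for k in range(len(C)) if k != j]
    return [[D[a][b] for b in keep] for a in keep]

def greedy_merge_step(C):
    """One semi-greedy optimization step: best single pairwise merge by ARG."""
    best_arg, best_C = average_reliable_gain(C), C
    for i in range(len(C)):
        for j in range(i + 1, len(C)):
            M = merge_tags(C, i, j)
            arg = average_reliable_gain(M)
            if arg > best_arg:
                best_arg, best_C = arg, M
    return best_arg, best_C
```

On a diagonal ACM (full agreement), ARG equals the entropy of $p_1$, as the Properties paragraph below states; and row-normalizing the merged ACM of Table 6 reproduces the CPM rows of Table 7 up to rounding.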
Definition: Average Reliable Gain (ARG) from the tagset $\{t_1, \dots, t_n\}$ is computed as an expected value of $RG(t_j)$:

$$ARG = \sum_j p_1(t_j)\, RG(t_j)$$

Properties: ARG has its maximum value if the CPM is a unit matrix, which is the case of the absolute agreement among all annotators. Then ARG has the value of the entropy of the $p_1$ distribution: $ARG_{max} = H(p_1(t_1), \dots, p_1(t_n))$.

Merging tags with poor RG score

The main motivation for developing the ARG value was the optimization of the tagset granularity. We use a semi-greedy algorithm that searches for an optimal tagset. The optimization process starts with the fine-grained list of CPA semantic categories and then the algorithm merges some tags in order to maximize the ARG value. An example is given in Table 5. Tables 6 and 7 show the ACM and the CPM after merging. The examples relate to the verb submit already shown in Tables 1, 2, 3 and 4.

      1    2    4
1     94   4    0
2     4    34   0
4     0    0    18

Table 6: Aggregated Confusion Matrix after merging.

      1      2      4
1     0.959  0.041  0.000
2     0.105  0.895  0.000
4     0.000  0.000  1.000

Table 7: Confusion Probability Matrix after merging.

3.3 Classifier evaluation with respect to expected tagging confusion

An automatic classifier is considered to be a function $c$ that, the same way as annotators, assigns tags to instances $s \in S$, so that $c(s) = \{t\}$, $t \in T$. The traditional way to evaluate the accuracy of an automatic classifier means to compare its output with the correct semantic tags on a Gold Standard (GS) dataset. Within our formal framework, we can imagine that we have a gold annotator $A_g$, so that the GS dataset is represented by $A_g(s_1), \dots, A_g(s_r)$. Then the classic accuracy can be computed as $\frac{1}{r} \sum_{i=1}^{r} |A_g(s_i) \cap c(s_i)|$.

However, that approach does not take into consideration the fact that some semantic tags are quite confusing even for human annotators. In our opinion, an automatic classifier should not be penalized for mistakes that would be made even by humans. So we propose a more complex evaluation score using the knowledge of the expected tagging confusion stored in CPM.

Definition: Classifier evaluation Score with respect to tagging confusion is defined as the proportion $Score(c) = S(c)/S_{max}$, where [...]

          value = 1     value = 0.5   value = 0
Verb      Rank  Score   Rank  Score   Rank  Score
halt      1     0.84    2     0.90    4     0.81
submit    2     0.83    1     0.90    1     0.84
ally      3     0.82    3     0.89    5     0.76
cry       4     0.79    4     0.88    2     0.82
arrive    5     0.74    5     0.85    3     0.81
plough    6     0.70    6     0.81    6     0.72
deny      7     0.62    7     0.74    7     0.66
cool      8     0.58    8     0.69    8     0.53
yield     9     0.55    9     0.67    9     0.52

Table 8: Evaluation with different values.

Table 8 gives an illustration of the fact that using different values one can get different results when comparing tagging accuracy for different words (a classifier based on a bag-of-words approach was used). The same holds true for comparison of different classifiers.

3.4 Related work

In their extensive survey article, Artstein and Poesio (2008) state that word sense tagging is one of the hardest annotation tasks. They assume that making distinctions between semantic categories must rely on a dictionary. The problem is that annotators often cannot consistently make the fine-grained distinctions proposed by trained lexicographers, which is particularly serious for verbs, because verbs generally tend to be polysemous rather than homonymous.

A few approaches have been suggested in the literature that address the problem of the fine-grained semantic distinctions by (automatically) measuring sense distinguishability. Diab (2004) computes sense perplexity using the entropy function as a characteristic of training data. She also compares the sense distributions to obtain sense distributional correlation, which can serve as a very good direct indicator of performance ratio, especially together with sense context confusability (another indicator observed in the training data). Resnik and Yarowsky (1999) introduced the communicative/semantic distance between the predicted sense and the correct sense. Then they use it for an evaluation metric that provides partial credit for incorrectly classified instances. Cohn (2003) introduces the concept of (non-uniform) misclassification costs. He makes use of the communicative/semantic distance and proposes a metric for evaluating word sense disambiguation performance using the Receiver Operating Characteristics curve that takes the misclassification costs into account. Bruce and Wiebe (1998) analyze the agreement among human judges for the purpose of formulating a refined and more reliable set of sense tags. Their method is based on statistical analysis of interannotator confusion matrices. An extended study is given in (Bruce and Wiebe, 1999).

4 Conclusion

The usefulness of a semantic resource depends on two aspects:

- reliability of the annotation
- information gain from the annotation.

In practice, each semantic resource emphasizes one aspect: OntoNotes, e.g., guarantees reliability, whereas the WordNet-annotated corpora seek to convey as much semantic nuance as possible. To the best of our knowledge, there has been no exact measure for the optimization, and the usefulness of a given resource can only be assessed when it is finished and used in applications. We propose the reliable information gain, a measure based on information theory and on the analysis of interannotator confusion matrices for each word entry, that can be continually applied during the creation of a semantic resource, and that provides automatic feedback about the granularity of the used tagset. Moreover, the computed information about the amount of expected tagging confusion is also used in evaluation of automatic classifiers.

Acknowledgments

This work has been supported by the Czech Science Foundation projects GK103/12/G084 and P406/2010/0875 and partly by the project EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003+7E11051 of the Ministry of Education, Youth and Sports of the Czech Republic). We thank our friends from Masaryk University in Brno for providing the annotation infrastructure and for their permanent technical support. We thank Patrick Hanks for his CPA method, for the original PDEV development, and for numerous discussions about the semantics of English verbs. We also thank three anonymous reviewers for their valuable comments.
References

Roni Ben Aharon, Idan Szpektor, and Ido Dagan. 2010. Generating entailment rules from FrameNet. In Proceedings of the ACL 2010 Conference Short Papers, pages 241-246, Uppsala, Sweden.

Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555-596, December.

Ann Bies, Mark Ferguson, Karen Katz, Robert MacIntyre, Victoria Tredinnick, Grace Kim, Mary Ann Marcinkiewicz, and Britta Schasberger. 1995. Bracketing guidelines for Treebank II style. Technical report, University of Pennsylvania.

Susan Windisch Brown, Travis Rood, and Martha Palmer. 2010. Number or nuance: Which factors restrict reliable word sense annotation? In LREC, pages 3237-3243. European Language Resources Association (ELRA).

Rebecca F. Bruce and Janyce M. Wiebe. 1998. Word-sense distinguishability and inter-coder agreement. In Proceedings of the Third Conference on Empirical Methods in Natural Language Processing (EMNLP 98), pages 53-60, Granada, Spain, June.

Rebecca F. Bruce and Janyce M. Wiebe. 1999. Recognizing subjectivity: A case study of manual tagging. Natural Language Engineering, 5(2):187-205.

Silvie Cinkova, Martin Holub, Adam Rambousek, and Lenka Smejkalova. 2012. A database of semantic clusters of verb usages. In Proceedings of the LREC 2012 International Conference on Language Resources and Evaluation. To appear.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37-46.

Trevor Cohn. 2003. Performance metrics for word sense disambiguation. In Proceedings of the Australasian Language Technology Workshop 2003, pages 86-93, Melbourne, Australia, December.

Mona T. Diab. 2004. Relieving the data acquisition bottleneck in word sense disambiguation. In Proceedings of the 42nd Annual Meeting of the ACL, pages 303-310, Barcelona, Spain. Association for Computational Linguistics.

Katrin Erk, Diana McCarthy, and Nicholas Gaylord. 2009. Investigations on word senses and word usages. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 10-18, Suntec, Singapore, August. Association for Computational Linguistics.

Katrin Erk. 2010. What is word meaning, really? (And how can distributional models help us describe it?). In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics, pages 17-26, Uppsala, Sweden, July. Association for Computational Linguistics.

Christiane Fellbaum, Joachim Grabowski, and Shari Landes. 1997. Analysis of a hand-tagging task. In Proceedings of the ACL/Siglex Workshop, Somerset, NJ.

Christiane Fellbaum, J. Grabowski, and S. Landes. 1998. Performance and confidence in a semantic annotation task. In WordNet: An Electronic Lexical Database, pages 217-238. The MIT Press, Cambridge, MA.

Christiane Fellbaum, Martha Palmer, Hoa Trang Dang, Lauren Delfs, and Susanne Wolf. 2001. Manual and automatic semantic annotation with WordNet.

Christiane Fellbaum. 1998. WordNet. An Electronic Lexical Database. MIT Press, Cambridge, MA.

Charles J. Fillmore and B. T. S. Atkins. 1994. Starting where the dictionaries stop: The challenge for computational lexicography. In Computational Approaches to the Lexicon, pages 349-393. Oxford University Press.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76:378-382.

Patrick Hanks and James Pustejovsky. 2005. A pattern dictionary for natural language processing. Revue Française de linguistique appliquée, 10(2).

Patrick Hanks. Forthcoming. Lexical Analysis: Norms and Exploitations. MIT Press.

Ales Horak, Adam Rambousek, and Piek Vossen. 2008. A distributed database system for developing ontological and lexical resources in harmony. In 9th International Conference on Intelligent Text Processing and Computational Linguistics, pages 1-15. Springer, Berlin.

Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, NAACL-Short '06, pages 57-60, Stroudsburg, PA, USA. Association for Computational Linguistics.

Nancy Ide, Collin Baker, Christiane Fellbaum, Charles Fillmore, and Rebecca Passoneau. 2008. MASC: The Manually Annotated Sub-Corpus of American English. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), pages 28-30. European Language Resources Association (ELRA).

Julia Jorgensen. 1990. The psycholinguistic reality of word senses. Journal of Psycholinguistic Research, 19:167-190.

Adam Kilgarriff. 1997. I don't believe in word senses. Computers and the Humanities, 31(2):91-113.

Ramesh Krishnamurthy and Diane Nicholls. 2000. Peeling an onion: The lexicographer's experience of manual sense tagging. Computers and the Humanities, 34:85-97.
Ryan McDonald, Kevin Lerman, and Fernando Pereira. 2006. Multilingual dependency analysis with a two-stage discriminative parser. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X '06), pages 216-220. Association for Computational Linguistics.

Adam Meyers, Ruth Reeves, and Catherine Macleod. 2008. NomBank v 1.0.

G. A. Miller, C. Leacock, R. Tengi, and R. T. Bunker. 1993a. A semantic concordance. In Proceedings of ARPA Workshop on Human Language Technology.

G. A. Miller, C. Leacock, R. Tengi, and R. T. Bunker. 1993b. A semantic concordance. In Proceedings of ARPA Workshop on Human Language Technology.

Roberto Navigli. 2006. Meaningful clustering of senses helps boost word sense disambiguation performance. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 105-112, Sydney, Australia.

Martha Palmer, Dan Gildea, and Paul Kingsbury. 2005. The Proposition Bank: A corpus annotated with semantic roles. Computational Linguistics Journal, 31(1).

Rebecca J. Passonneau, Ansaf Salleb-Aoussi, Vikas Bhardwaj, and Nancy Ide. 2010. Word sense annotation of polysemous words by multiple annotators. In LREC Proceedings, pages 3244-3249, Valetta, Malta.

Philip Resnik and David Yarowsky. 1999. Distinguishing systems and distinguishing senses: New evaluation methods for word sense disambiguation. Natural Language Engineering, 5(2):113-133.

Anna Rumshisky, M. Verhagen, and J. Moszkowicz. 2009. The holy grail of sense definition: Creating a sense-disambiguated corpus from scratch. Pisa, Italy.

Michael Rundell. 2002. Macmillan English Dictionary for Advanced Learners. Macmillan Education.

Josef Ruppenhofer, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson, and Jan Scheffczyk. 2010. FrameNet II: Extended Theory and Practice. ICSI, University of Berkeley, September.

Beatrice Santorini. 1990. Part-of-speech tagging guidelines for the Penn Treebank project. University of Pennsylvania, 3rd Revision, 2nd Printing, (MS-CIS-90-47):33.

William A. Scott. 1955. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19(3):321-325.

John Sinclair, Patrick Hanks, et al. 1987. Collins Cobuild English Dictionary for Advanced Learners, 4th edition published in 2003. HarperCollins Publishers 1987, 1995, 2001, 2003; and Collins A-Z Thesaurus, 1st edition first published in 1995. HarperCollins Publishers 1995.

John Sinclair. 1991. Corpus, Concordance, Collocation. Describing English Language. Oxford University Press.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2011. OntoNotes release 4.0.

Fabio Massimo Zanzotto, Marco Pennacchiotti, and Alessandro Moschitti. 2009. A machine learning approach to textual entailment recognition. Natural Language Engineering, 15(4):551-582.

Yue Zhang and Stephen Clark. 2011. Syntactic processing using the generalized perceptron and beam search. Computational Linguistics, 37:105-151.
Parallel and Nested Decomposition for Factoid Questions

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 851-860, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
The particular relationship between independent facts in any given question leads us to categorize decomposable questions broadly into two types: parallel and nested. Examples (2) above and (3) below are parallel decomposable: sub-questions can be evaluated independently of one another. In contrast, nested questions require their decompositions to be processed in sequence, with the answer to an inner sub-question plugged into the outer. In Example (4), the inner sub-question is marked in brackets; its answer, cirrhosis, then leads to an outer question "In the treatment of cirrhosis, which drug reduces portal venous blood inflow", the answer to which is also the answer to the original question.

(3) Which 2011 tax form do I fill if I need to do itemized deductions and I have an IRA rollover from 2010?

(4) In the treatment of [a condition that causes bleeding esophageal varices], which drug reduces portal venous blood inflow?

Questions like these are found in domains such as medical, legal, etc., as they tend to arise in more dynamic QA system settings. Independently of domain and type, however, they share a common characteristic: if a search query is constructed from all the facts collectively describing the answer, it is likely to flood the system with noise, and confuse the identification of potential answer-bearing passages. The notion of decomposition thus goes hand in hand with that of recursively applying a QA system to the individual facts (sub-questions), followed by suitable re-composition of the candidate answer lists for the sub-questions.

This paper presents a novel decomposition approach for such questions. We discuss the particular strategies for recognizing and typing decomposable questions, and the subsequent processing of sub-questions, and their candidate answer lists, in ways which can improve the performance of an existing state-of-the-art QA system.

2 Related work

A variety of approaches to QA cite decomposition, in the context of addressing question complexity. In most work to date, however, "complex" refers to questions requiring non-factoid answers: e.g. multiple sentences or summaries of answers (Lacatusu et al., 2006), connected paragraphs (Soricut and Brill, 2004), explanations and/or justification of an answer (Katz et al., 2005), lists (Hartrumpf, 2008) or lists of sets (Lin and Liu, 2008), and so forth.

In the literature, we find descriptions of processes like local decomposition and meronymy decomposition (Hartrumpf, 2008), semantic decomposition using knowledge templates (Katz et al., 2005), question refocusing (Hartrumpf, 2008; Katz et al., 2005), and textual entailment (Lacatusu et al., 2006) to connect, through semantics and discourse, the original question with its numerous decompositions. In general, such processes are not limited to using only lexical material explicitly present in the question: a constraint we place upon our decomposition algorithms in order to retain the ability to do open-domain QA.

Closer to our strategy are notions like the syntactic decomposition of Katz et al. (2005), and the temporal/spatial analysis of Saquete et al. (2004) and Hartrumpf (2008). Still, our approach differs in at least two significant ways. We offer a principled solution to the problem of the final combination and ranking of candidate answers returned from multiple decompositions, by means of training a model to weigh the effects of decomposition recognition rules. We also note that spatial and temporal decomposition are just special cases of solving nested decomposable questions.

The closest similarity our fact-based decomposition has with an established approach is with the notion of asking additional questions in order to derive constraints on candidate answers (Prager et al., 2004). However, the additional questions there are generated through knowledge of the domain, making that technique hard to apply in an open domain setting. In contrast, we developed a domain-independent approach to question decomposition, in which we use the question context alone in generating queriable constraints.

3 Fact-based Decomposition

Enhancing a single-shot QA system with a capability for incremental solving of decomposable questions requires recognizing that a question is decomposable, and engaging in a staged processing of its sub-question parts. Whether parallel or nested, the system needs to identify the multiple facts, and configure itself as appropriate. Figure 1 shows our fact-based decomposition "meta-framework" (meta, as it builds on top of an existing QA system). It comprises four main components as illustrated in the figure.
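The staged processing just described can be caricatured in a few lines of control flow. This is our own reading of the framework, a hypothetical sketch: the function names, the singleton-answer convention, and the "[X]" slot marker are illustrative assumptions, not the paper's API; a real deployment would plug a full QA engine in as `base_qa`.

```python
# Hypothetical sketch of the fact-based decomposition meta-framework.
# recognize() -> (kind, sub_questions); base_qa() -> ranked (answer, conf) list;
# rewrite() restores key question context to a sub-question; merge() combines
# the ranked lists from all runs into one final answer list.
def answer(question, recognize, rewrite, base_qa, merge):
    kind, subqs = recognize(question)
    runs = [base_qa(question)]                  # base system on the full question
    if kind == "parallel":                      # sub-questions solved independently
        runs += [base_qa(rewrite(q, question)) for q in subqs]
    elif kind == "nested":                      # inner answer fed back into outer
        inner, outer = subqs
        inner_best = base_qa(rewrite(inner, question))[0][0]
        runs.append(base_qa(outer.replace("[X]", inner_best)))
    return merge(runs)
```

With stub components, the nested pathway reproduces the Example (4) behavior: the inner sub-question is answered first, its answer is substituted into the outer sub-question, and the merger picks the highest-confidence candidate across all runs.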
Figure 1: Fact-based decomposition framework

Decomposition Recognizers analyze the input question and identify decomposable parts using a set of predominantly lexico-syntactic cues (Section 4). Question Rewriters re-write the sub-questions found by the recognizer, retaining key contextual information (Section 5.1). Underlying QA System generates, for any factoid question, a ranked list of answer candidates, each with a confidence corresponding to the probability of the answer being correct. Answer Synthesis and Re-ranking is a placeholder for the particular process which tries to combine ranked candidate answers obtained to the original question with solutions for the decomposed facts into a uniform ranked answer list. In general, different combination functions may be appropriate for different types of decomposable questions. Thus, for the classes of parallel and nested questions, our decomposition strategies (described in Sections 5.2 and 5.3) defer to an Answer Merger. Other combination functions may be required for e.g. selecting from or aggregating over lists; cf. Hartrumpf's operational decomposition (2008), or Lin and Liu's multi-focus questions (2008); see also the special questions solving techniques of Prager et al.

We use a particular QA system (Ferrucci et al., 2010) as base. However, any system can be plugged into our meta-framework, as long as: it can solve factoid questions by providing answers with confidences reflecting correctness probability; and it maintains context/topic information for the question separately from its main content.

Parallel and nested processing are distinct: note the two different pathways in the figure, multiple parallel facts submitted to the base QA system vs. inner-outer sub-question pairs, processed via a feedback loop. The base system is invoked on the full question, and on its decompositions.

4 Decomposition Recognizers

The primary goal in decomposing questions is to identify facts involving the entity being asked for (henceforth the focus), simpler than the full question and solvable independently (Section 1). Most question decomposition work (Section 2) tends to defer to semantic, discourse, and other domain-specific information; in contrast, we recognize decomposable questions primarily on the basis of their syntactic shape. This is important for our claim that the decomposition framework outlined in Section 3 is generally applicable to multiple QA tasks and system configurations.

In our work, we use a dataset of factoid question/answer pairs from Jeopardy!,[1] a popular TV quiz show in the US. The data is particularly challenging, not least for the broad domain it covers and the complex language used. In addition to making for an excellent test-bed for open-domain QA, the data offers a wide choice of questions which require decomposing.

4.1 Decomposition Patterns

Our analysis of complex decomposable questions highlights numerous syntactic cues that are reliable indicators for decomposition, and it is predominantly such cues we exploit for driving the recognition and typing of decomposable questions. A set of recognition patterns can be formulated in terms of fine-grained lexico-syntactic information, expressed over the predicate-argument structure (PAS) for the syntactic parse of the question. We identify three major categories of configurationally-based patterns: independent subtrees, composable units and segments with qualifiers. These are general, in the sense that they capture relationships between configurational properties of a question and its status with respect to decomposability. The specific rules implementing the patterns may, or may not, have to be modified as, for instance, there may be a style change, or a shift in the syntactic analysis framework of the base QA system, to a different parser;

[1] http://www.jeopardy.com
Table 1: Example decompositions within pattern categories

Independent Subtrees

(1.P) Parallel
  clause:         "Its original name meant bitter water and it was made palatable
                  to Europeans after the Spaniards added sugar"
                  Fact #1: Its original name meant bitter water
                  Fact #2: It was made palatable to Europeans after the Spaniards added sugar
  complementary:  "American Prometheus is a biography of this physicist who died in 1967"
                  Fact #1: this physicist who died in 1967
                  Fact #2: American Prometheus is a biography of this physicist

(1.N) Nested
  coincidental:   "When 60 Minutes premiered, this man was U.S. President"
                  Inner Fact: When 60 Minutes premiered
                  Outer Fact: When this man was president
  based-on:       "A controversial 1979 war film was based on a 1902 work by this author"
                  Inner Fact: A controversial 1979 war film
                  Outer Fact: film was based on a work by this author
  named-for:      "Article of clothing named for an old character who dressed in loose
                  trousers in commedia dell'arte"
                  Inner Fact: an old character who dressed in loose trousers in commedia dell'arte
                  Outer Fact: Article of clothing named for character

Composable Units

(2.P) Parallel
  verb-args:      "He launched his lecturing career in 1866 with a talk later titled
                  Our fellow savages of the Sandwich Islands"
                  Fact #1: He launched his lecturing career in 1866
  focus-mod:      "The Mute was the working title of this 1940 novel by a female author"
                  Fact #1: this 1940 novel by a female author
  triple:         "His rise began when he upset Robert M. La Follette, Jr. in a 1946
                  Senate primary"
                  Fact #1: he upset Robert M. La Follette, Jr.

(2.N) Nested
  explicit-link:  "To honor his work, this man's daughter took the name Maria Celeste
                  when she became a nun in 1616"
                  Inner Fact: To honor his work, [this] daughter took the name Maria Celeste, when ...
                  Outer Fact: this man's daughter
  descriptive-np: "The word for this congressional job comes from a fox-hunting term for
                  someone who keeps the hunting dogs from straying"
                  Inner Fact: a fox-hunting term for someone who keeps the hunting dogs from straying
                  Outer Fact: The word for this congressional job comes from term

Segments with Qualifiers

(3.P) Parallel
  qualifier:      "Winning in 1965 and 1966, he was the first man to win the Masters golf
                  tournament in 2 consecutive years"
                  Fact #1: he was the first man to win the Masters golf tournament in
                  2 consecutive years
such implementations do not affect our analysis of syntactically-cued decomposition recognition. Table 1 shows example decompositions within pattern categories; note that within a category, typically there are rule sets for parallel and nested decomposition types.

Independent Subtrees [...] subtree from the question as a decomposable fact.
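As a toy illustration of the first pattern category (a deliberately simplified, surface-string stand-in; the actual rules operate over the ESG predicate-argument structure, not over raw strings):

```python
import re

def clause_rule(question):
    """Toy 'clause'-style rule: top-level coordinated clauses become
    parallel facts. A surface-string stand-in for a parse-based rule."""
    parts = [p.strip() for p in re.split(r"\band\b", question) if p.strip()]
    # Fire only when the question actually splits into two or more conjuncts.
    return parts if len(parts) >= 2 else []
```

On the first example of Table 1, this toy rule already yields the two parallel facts of row (1.P); a question with no coordination makes the rule decline to fire.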
sub-questions in the original question. Examples in Table 1/Row (1.P) illustrate this distinction.

For nested decomposition, we have three rule sets: coincidental, based-on and named-for. These use lexical cues to detect specific semantic relations within the question that could indicate nestedness. For instance, the coincidental rules identify sub-questions resolving a temporal link with the focus of the original question. The based-on and named-for rules detect sub-questions where the answer to the original question is based on or named for the answer to the inner sub-question (Table 1/row (1.N)). Note that in different domains, different relations may correlate with nestedness, for instance, disease-causes-symptom in a medical setting; cf. Example (4) in Section 1. The general pattern would still apply, even if we need different rule(s) to implement it.

Configurational information is used to determine whether the question exhibits a parallel or nested decomposition profile. Thus the syntactic contour of Example (3) shows that two clauses characterize the same entity (the focus): a clear indicator that the sub-questions are parallel. Conversely, "A controversial 1979 war film was based on a 1902 work by this author" exhibits a very different set of configurational properties. There are two underspecified entities (including the focus), both characterized as head-plus-modifiers syntactic units; however, there is no sharing of the separate characterizations (facts) via a common head. This indicates nestedness: the inner sub-question is the one around the underspecified, but non-focus, element ("a controversial 1979 war film"); the outer is "[film] was based on a 1902 work by this author".

Another cue for nested questions is a sub-tree labeled by a temporal subordinate conjunction, or a subordinate clause, away from the focus-enclosing top level of the question and itself underspecified. Such analysis will motivate the question "When 60 Minutes premiered, this man was U.S. president" to be solved first for the temporal expression, "When did 60 Minutes premiere?", followed by "Who was U.S. President in 1968?".

Composable Units An alternate strategy for identifying sub-questions is to compose a fact by combining elements from the question. In contrast to the previous category, the Composable Units rules combine separate parts of the PAS into a fact. For instance, a sub-question can be created by associating the focus head with its premodifiers and postmodifiers. If the premodifiers and postmodifiers are sufficiently specific, we obtain reasonably independent sub-questions, with parallel-decomposable behavior.

Three parallel decomposition rule sets are defined in this category: verb-args, focus-mod and triple (see Table 1/row (2.P)). The rules in verb-args compose a fact from the verb and its arguments (subject, object, PP complements). The focus-mod rules combine the head of the focus NP with its modifiers to generate a sub-question. Similar to verb-args are triple rules, which create less constrained sub-questions (in that the composition always links only two of the arguments to the underlying predicate, e.g. subject-verb-object or subject-verb-complement).

Here also, a particular configuration around the focus may indicate a question requiring nested processing. For nested, the Composable Units category has two rule sets: explicit-link and descriptive-np (Table 1/row (2.N)).

In contrast to questions where modifiers of the focus can be cues for parallel decomposition (i.e. the focus-mod rules above), the explicit-link rules detect nested decomposition, signaled by the focus itself being a modifier. For example, in "To honor his work, this man's daughter took the name Maria Celeste when she became a nun in 1616", the focus ("this man") is a determiner to an underspecified node ("daughter"). Traversing the tree without descending to the level of the focus would carve out an inner sub-question itself focused on that underspecified node ("daughter"): see Table 1/row (2.N).

The descriptive-np rule set finds parenthetical descriptions of underspecified nouns in the primary question, as in e.g. "This arboreally named area was made famous by [a prince in the region noted for impaling enemies on stakes]": the nested-decomposable nature of this question is captured in the descriptive phrase (in square brackets) functioning as an inner sub-question.
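The verb-args and triple compositions can be sketched over a toy predicate-argument record (our own simplified stand-in for the ESG PAS; field names are illustrative):

```python
def verb_args_fact(pas):
    """verb-args-style composition: the verb plus all of its arguments."""
    pieces = [pas.get("subject"), pas["verb"], pas.get("object")]
    pieces += pas.get("complements", [])
    return " ".join(p for p in pieces if p)

def triple_fact(pas, arg="object"):
    """triple-style composition: link only two arguments to the predicate,
    e.g. subject-verb-object, yielding a less constrained fact."""
    return " ".join(p for p in (pas.get("subject"), pas["verb"], pas.get(arg)) if p)

# Toy PAS record for the verb-args example of Table 1/row (2.P):
pas = {
    "subject": "He",
    "verb": "launched",
    "object": "his lecturing career",
    "complements": ["in 1866"],
}
```

On this record, verb_args_fact produces the fact of Table 1/row (2.P), "He launched his lecturing career in 1866", while triple_fact drops the complement and yields the less constrained "He launched his lecturing career".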
Segments with Qualifiers This category of rules covers cases where the modifier of the focus is a relative qualifier, such as "the first", "only", "the westernmost". In such cases, information from another clause is usually required to complete the relative qualifier: consider e.g. the incomplete "the third man" vs. the fact "the third man ... to climb Mt. Everest". To deal with these cases, rules in this category combine the characteristics of Composable Units with those of Independent Subtrees rules. We compose the relative qualifier, the focus (along with its modifiers) and the attached supporting clause subtree to generate this type of rules. As illustrated in row (3.P) of Table 1, for parallel decomposition our rule set covers sub-questions expressed as superlatives. We do not have any rules of this type for the nested case.

4.2 Decomposition Filters

All three pattern categories above rely only on a syntactic analysis of the question; this is delivered by the English Slot Grammar (ESG) parser (McCord, 1989). When rules fire, they also identify question segments proposed as sub-questions. Not surprisingly, the rules over-generate; to mitigate against that, we apply several heuristic filters to the proposed sub-questions. The filters discard sub-questions that do not contain either a named entity, a quoted string, or a time or date expression (these are detected by the ESG parser). Additionally, we discard sub-questions that almost completely overlap the entire question or a sub-question from a prior rule. A partial priority order is imposed on rule application, based on intuitions of how informative the facts generated by a rule are; this order is reflected on a per-type basis in Table 1: e.g. within type (2.P) we prefer verb-args to triple, since the latter tends to produce less constrained facts than the former.

5 Using Decomposition

In essence, decomposition recognition informs two processes. According to the question type, parallel or nested, the appropriate pathway in the framework (Figure 1) needs to get instantiated; before sub-questions are submitted to the base QA system, they may need augmentation to facilitate the recursive system invocation. The answer sets obtained from sub-question processing need then to be analyzed and rationalized, to determine the final answer to the original question.

5.1 Question Re-Writing

For parallel decomposition, the goal is to solve the original question Q by solving sub-questions independently and combining results appropriately. For example, consider the Jeopardy! question

(5) HISTORIC PEOPLE: The life story of this man who died in 1801 was chronicled in an A&E Biography DVD titled "Triumph and treason"

We get two decompositions:[3]

Q1: This man who died in 1801
Q2: The life story of this man was chronicled in an A&E Biography DVD titled "Triumph and treason"

Submitting sub-questions, unmodified, to the base QA system raises at least two problems. Sub-questions are often much shorter than the original question, and in many cases no longer have a unique answer. Moreover, some of the information from the original question that was dropped in a sub-question may be relevant contextual cues that the QA system needs to come up with the correct answer. Q1 above illustrates these problems: it does not have a unique answer, and suffers from a recall problem (the correct answer is not in the candidate answer list of the base system when it considers this sub-question alone).

Our solution is to insert contextual information into the sub-questions. In a two-step process for a sub-question Qi, we obtain the set of all named entities and nouns (ignoring stopwords) in the original question text outside of Qi, and we insert these keywords into the original question category. In Jeopardy! questions, the category field is the context/topic information which the underlying QA system needs in order to use the decomposition framework, as stated in Section 3. In general, a QA system may derive such information in a variety of ways, e.g. by exploiting the problem description in a technical assistance QA setting, or a patient's medical history,

[3] Jeopardy! questions also contain category information, which further contextualizes the search for the answer.
in a medical QA setting. What is important here is that the base system treat such information differently from the question itself. Rewriting takes advantage of this differential weighting to ensure that the larger context of the original question is still taken into account when evaluating a sub-question, albeit with less weight.

The re-written Q1/Q2 for Example (5) are:

(5-1) HISTORIC PEOPLE (A&E BIOGRAPHY DVD TRIUMPH AND TREASON): This man who died in 1801

(5-2) HISTORIC PEOPLE (1801): The life story of this man was chronicled in an A&E Biography DVD titled "Triumph and treason"

The keywords are inserted in parentheses, to ensure a clear separation between the original category terms and the context terms added. Other systems may need a different re-writing tactic.

The above re-writing technique is used for both parallel and nested decomposable questions. For the nested case, there is an additional re-writing step that needs to be done after solving the inner question: we need to substitute its answer into the outer when solving for it. Thus the first example in Table 1/row (1.N) would have its inner focus "When 60 Minutes premiered" replaced with "In 1968", creating the outer question "In 1968, this man was U.S. President", whose solution is the answer to the original question.

5.2 Answer Re-Ranking: Parallel

The base QA system will process the re-written category/sub-question pairs, and will produce a set of ranked candidate lists with confidences. These need to be combined into a final answer list for the original question, accounting for information across all sub-question candidate lists.

One way to produce a final score for each candidate answer is simply to take the product of the scores returned by the QA system for each of the sub-questions. This assumes that the sub-questions are typically independent and that the QA system produces a confidence which corresponds to the probability of the answer being correct. However, even if the sub-questions are independent, question re-writing breaks this assumption as it brings information from the remainder of the question into the sub-question context. Also, the sub-questions are generated by decomposition rules that have varying precision and recall, and thus should not be weighted equally.

Finally, if the sub-questions are not of a good quality (e.g. due to a bad parse), we need a fallback to the original question, which implies that the confidence for the candidate answer for the entire question should also be considered when making a final decision. Consequently, we use a machine-learning model to combine information across sub-question answer confidences, with features capturing the above information (Table 2).

Table 2: Features in Parallel Re-ranking Model

  Orig. Top Answer: binary feature signaling whether the candidate was the top answer to the non-decomposed question
  Orig. Confidence: confidence for the candidate answer to the non-decomposed question
  # Facts Matched: number of sub-questions which have the candidate answer in their top 10
  Rule-verb-args, Rule-clause, Rule-qualifier, Rule-focus-mod, Rule-complementary, Rule-triple: features corresponding to the rule sets used in parallel decomposition; each feature takes a numeric value, which is the confidence of the QA system on a fact identified by the corresponding rule set

In case a candidate answer is not in the answer list of the full question or any of the decomposed sub-questions, the corresponding feature value is set to missing. If a rule generates multiple sub-questions, its corresponding feature value for the candidate answer is set to the sum of the confidences obtained for that answer across all sub-questions. The model is trained using Weka's (Witten and Frank, 2000) logistic regression algorithm with instance weighting.

5.3 Answer Re-Ranking: Nested

Nested questions decompose into inner/outer question pairs. The task is to solve the inner question first, substitute the answer obtained, based on its confidence, into the outer, and solve that for the final answer. This is contingent upon selecting answers to the inner question which might profitably be plugged into the outer; substituting incorrect answers will only lead to noisy final answers, with negative impact on overall accuracy.

We rely on the ability of the underlying QA system to produce meaningful confidences for its answers, and only consider the top answer to the inner question for substitution into the outer, if its confidence exceeds some threshold.

Finally, the answers to the outer question need to be related to the full question answer list, to
produce the final ranked answers. For answer re-ranking, we use the following heuristic selection strategy: we compute the aggregate confidence of the answer obtained through decomposition as the product of the inner-question answer confidence and the outer-question answer confidence, and compare this value with that of the top answer confidence to the entire question, selecting the higher confidence one as our final answer. Note that this re-ranking is different from the one used in parallel decomposition, where we combine results from multiple sub-questions into a single confidence.

6 Evaluation

6.1 Evaluation Data

As we discuss question decomposition in the context of Jeopardy! data (Section 4), our test set contains only Final Jeopardy! (FJ) questions. They are often long and complex, with multiple facts or constraints that need to be satisfied. Also, they are typically much harder to answer than regular Jeopardy! questions, both for humans and for our base QA system. The test set comprises close to 3000 FJ questions, broken into 1138 for training, 517 for development and 1269 questions for testing (as blind data).

Table 3: Evaluating Decomposition

  QA System   End-to-End Accuracy   Decomposable Q Accuracy
  PB          635/1269 (50.05%)     339/598 (56.68%)
  PD-QR       634/1269 (49.96%)     338/598 (56.52%)
  PD+QR       643/1269 (50.66%)     347/598 (58.02%)
  NB          635/1269 (50.05%)     129/255 (50.58%)
  ND+QR       640/1269 (50.43%)     134/255 (52.54%)

PB refers to Parallel Baseline and NB to Nested Baseline; both are results from running the underlying QA system without any decomposition capabilities. PD and ND refer to Parallel and Nested Decomposition systems respectively, and QR refers to question re-writing. Separate experiments determined end-to-end accuracy for the different system configurations, with respect to the entire test set, and accuracy over the decomposable questions subsets of the test set.

We do not offer separate analysis of decomposition recognition. Manual creation of a decomposition standard is highly non-trivial, largely due to the numerous alternative ways to decompose a question, and synthesize unique facts from the segments. Indeed, this is precisely the motivation for weighting the decomposition rules in a trained re-ranking model (Section 5.2). Given this, we are interested only in measuring the impact of decomposition on end-to-end QA performance.
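The significance testing used in the results discussion is McNemar's test (McNemar, 1947), which compares paired per-question outcomes of two systems; its statistic is simple to compute (a minimal sketch; the discordant counts below are hypothetical, for illustration only):

```python
def mcnemar_chi2(b, c):
    """McNemar's test statistic with continuity correction.

    b: questions the baseline answered correctly but the new system missed;
    c: the reverse. Under the null hypothesis the statistic is chi-squared
    distributed with 1 degree of freedom (critical value 6.635 at the
    99% level).
    """
    return (abs(b - c) - 1) ** 2 / (b + c)

# Hypothetical discordant counts, for illustration only:
stat = mcnemar_chi2(40, 80)        # 12.675
significant_at_99 = stat > 6.635   # True
```

Only the discordant pairs (questions on which the two systems disagree) enter the statistic; questions both systems answer correctly, or both miss, carry no evidence either way.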
our parallel decomposition algorithm was able to achieve a gain of 1.4% on the parallel decomposable question set, which translated to an end-to-end gain of 0.6%.

Separately, the table shows that roughly a fifth (255 out of 1269 questions) of the entire test set were recognized as nested decomposable. Again, interestingly, the performance of the baseline QA system on the nested decomposable set was roughly the same as the overall performance (and much lower than on the parallel decomposable cases). The likely explanation here is that nested questions require solving for an inner fact first, and it is the answer to this which often provides the necessary missing information required to find the correct answer: this makes nested questions much harder to solve than parallel decomposable ones, with their multiple independent facts. Our nested decomposition algorithm, using the heuristic re-ranking approach (Section 5.3), was able to achieve a gain of 2% on the nested decomposable question set, which translated to an end-to-end gain of 0.4%.

The aggregate impact of parallel and nested decomposition was a 1.5% gain in accuracy on the decomposable set, and a 1% gain in end-to-end system accuracy (in our case the questions that are classified as parallel or nested form disjoint sets).

To put these results in perspective, we emphasize that the baseline QA system represents the state of the art in solving Jeopardy! questions. The FJ questions, which exclusively comprise our test data, are known to be harder than regular Jeopardy!: qualified Jeopardy! players' accuracy on this kind of question is 48%,[4] and the underlying QA system has an accuracy close to 51% on previously unseen FJ questions. A gain of 1% end-to-end on such questions, therefore, represents a strong improvement. Also, using the statistical McNemar's test (McNemar, 1947), we found the net end-to-end impact to be statistically significant at a 99% confidence interval.

Finally, we note that our error analysis of the test questions shows a wide variety of reasons for their failures beyond question decomposition. To further improve the system on this test set would require advances beyond deciding whether to take a single-shot or decomposable approach to questions, which is beyond the scope of this paper.

[4] Calculated over historical games data, from J-archive (http://www.j-archive.com).

7 Conclusion

In this paper, we presented a general-purpose decomposition framework for answering complex factoid questions, which consists of three components: 1) a decomposition recognizer, which identifies the subparts of a decomposable question, 2) a question re-writer, which composes new sub-questions from the identified subparts, taking into account context from the original question, and 3) an answer synthesis and re-ranking component, which synthesizes and ranks final answers based on candidate answers to the sub-questions. Additionally, this framework leverages an underlying factoid QA system for producing answers to the sub-questions. Any QA system that can associate confidence scores with its answers and can make distinctions between the question and the context in which the question should be interpreted can be adopted in this decomposition framework.

We applied our decomposition framework to address two broad classes of complex factoid questions: parallel and nested decomposition questions. These are distinguished by how the identified sub-questions relate to each other, which in turn affects how the candidate answers to the sub-questions are combined to form the final answers. In order to maintain generality and facilitate domain adaptation, the rule-based patterns for decomposition recognition leverage syntactic characteristics of the question that are indicative of sub-question boundaries. To optimally leverage these patterns, a machine-learning model was trained to properly weigh the possibly overlapping, and occasionally conflicting, patterns.

We demonstrated the impact of our question decomposition approach on a state-of-the-art factoid QA system. On a test set of 1269 Final Jeopardy! questions, 47% of the questions were found to be parallel decomposable and 20% were nested decomposable. Overall, the system achieved a statistically significant gain of 1.5% in accuracy on these questions, further increasing the system's lead over human Jeopardy! players' performance on these questions.

Given that factoid (and, often, complex) questions are typically found in several real-world domains (e.g. medical, legal, technical support), we expect our decomposition framework to have broad impact, both in open- and specialized-domain QA.
References

D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine, 31(3):59–79, Fall.

S. Hartrumpf. 2008. Semantic Decomposition for Question Answering. In Proceedings of the 18th European Conference on Artificial Intelligence, pages 313–317, Patras, Greece, July.

B. Katz, G. Borchardt, and S. Felshin. 2005. Syntactic and Semantic Decomposition Strategies for Question Answering from Multiple Sources. In Proceedings of the AAAI Workshop on Inference for Textual Question Answering, pages 35–41, Pittsburgh, PA, July.

F. Lacatusu, A. Hickl, and S. Harabagiu. 2006. The Impact of Question Decomposition on the Quality of Answer Summaries. In Proceedings of the Fifth Language Resources and Evaluation Conference, pages 1147–1152, Genoa, Italy, May.

C.J. Lin and R.R. Liu. 2008. An Analysis of Multi-Focus Questions. In Proceedings of the SIGIR 2008 Workshop on Focused Retrieval, pages 30–36, Singapore, July.

M. McCord. 1989. Slot Grammar: A System for Simpler Construction of Practical Natural Language Grammars. In Proceedings of the International Symposium on Natural Language and Logic, pages 118–145, Hamburg, Germany, May.

Q. McNemar. 1947. Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages. Psychometrika, 12(2):153–157.

J. Prager, E. Brown, and J. Chu-Carroll. Special Questions and Techniques. Submitted to IBM Journal of Research and Development, Special Issue on DeepQA.

J. Prager, J. Chu-Carroll, and K. Czuba. 2004. Question Answering by Constraint Satisfaction: QA-by-Dossier with Constraints. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 574–581, Barcelona, Spain, July.

E. Saquete, P. Martínez-Barco, R. Muñoz, and J. Vicedo. 2004. Splitting Complex Temporal Questions for Question Answering Systems. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 566–573, Barcelona, Spain, July.

R. Soricut and E. Brill. 2004. Automatic Question Answering: Beyond the Factoid. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 57–64, Boston, MA, May.

E. Voorhees. 2002. Overview of the TREC 2002 Question Answering Track. In NIST Special Publication 500-251: The Eleventh Text REtrieval Conference (TREC 2002), Gaithersburg, MD, November.

I. Witten and E. Frank. 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA.
Author Index
Habash, Nizar, 675
Han, Xufeng, 747
Hanamoto, Atsushi, 430
Hartmann, Silvana, 580
Henrich, Verena, 387
Hinrichs, Erhard, 387
Hirst, Graeme, 315
Holub, Martin, 840
Hoppe, Dennis, 570
Hovy, Dirk, 185
Huang, Ruihong, 286

Irvine, Ann, 130
Isard, Amy, 471

Jagarlamudi, Jagadeesh, 204
Jain, Mahaveer, 787
Jang, Hyeju, 377
Jans, Bram, 336
Joachims, Thorsten, 224

Kaisser, Michael, 88
Kalyanpur, Aditya, 851
Klakow, Dietrich, 325
Klementiev, Alexandre, 12, 130
Koller, Alexander, 757
Kovachev, Bogomil, 109
Kríž, Vincent, 840
Kuhn, Jonas, 77, 767
Kwiatkowski, Tom, 234

Lagos, Nikolaos, 109
Lagoutte, Aurélie, 808
Lally, Adam, 851
Lau, Jey Han, 591
Lavelli, Alberto, 420
Lembersky, Gennadi, 255
Liu, Ting, 296
Liu, Yang, 367
Luque, Franco M., 409

Maletti, Andreas, 808
Manandhar, Suresh, 654
Marchetti-Bowick, Micol, 603
Martzoukos, Spyros, 2
Matsubayashi, Yuichiroh, 686
Matsuzaki, Takuya, 430
Matuschek, Michael, 580
Max, Aurélien, 716
McCarthy, Diana, 591
McDonough, John, 787
Mensch, Alyssa, 747
Meyer, Christian M., 580
Min, Bonan, 194
Mitchell, Margaret, 747
Mitkov, Ruslan, 706
Miyao, Yusuke, 686
Moens, Marie-Francine, 336, 449
Mohit, Behrang, 162
Monz, Christof, 2, 109, 356
Mooney, Raymond, 602
Moore, Johanna D., 471
Mostow, Jack, 377

Nakov, Preslav, 492
Newman, David, 591
Ng, Vincent, 798
Niculae, Vlad, 524
Nikoulina, Vassilina, 109
Nivre, Joakim, 44

Oflazer, Kemal, 162
Ordan, Noam, 255
Ortiz-Martínez, Daniel, 245
Osenova, Petya, 492

Padó, Sebastian, 623
Paşca, Marius, 503
Patwardhan, Siddharth, 185, 851
Peldszus, Andreas, 514
Penn, Gerald, 33, 696
Penstein Rosé, Carolyn, 787
Powers, David Martin Ward, 345
Purver, Matthew, 482

Qu, Zhonghua, 367
Quattoni, Ariadna, 409
Quernheim, Daniel, 808

Rahman, Altaf, 798
Raj, Bhiksha, 787
Ranta, Aarne, 645
Rello, Luz, 706
Riezler, Stefan, 818
Riloff, Ellen, 286
Rocha, Martha-Alicia, 152
Rosset, Sophie, 174

Sanchis-Trilles, Germán, 152
Schlangen, David, 514
Schmid, Helmut, 55
Schneider, Nathan, 162
Schütze, Hinrich, 276
Sennrich, Rico, 539
Shan, Chung-chieh, 23
Shivaswamy, Pannaga, 224
Simov, Kiril, 492
Sipos, Ruben, 224
Smith, Noah A., 162
Sokolov, Artem, 120
Steedman, Mark, 234
Stein, Benno, 570
Stratos, Karl, 747
Strik, Helmer, 561
Strzalkowski, Tomek, 296
Șulea, Octavia-Maria, 524